The language of social media Dr. Diana Maynard University of Sheffield PROFESSIONAL STANDARDS – cipr.co.uk
Twitter Fun Facts: 500 million tweets sent per day; 24% of all male internet users use Twitter (vs 21% of women); 37% of Twitter users are 18-29; 25% of Twitter users are 30-49
Which country has the most Twitter users?
Twitter users per country: US 67 million; Brazil 27.7 million; Japan 25.9 million; Mexico 23.5 million; ....; UK 13 million
Which country has the highest penetration of Twitter users?
1/3 of all internet users there are on Twitter
Who do we follow on Twitter?
Top 10 most followed Twitter users
Rank | 2017 | 2015 | 2013
1 | Katy Perry | Katy Perry | Katy Perry
2 | Justin Bieber | Justin Bieber | Justin Bieber
3 | Barack Obama | Taylor Swift | Lady Gaga
4 | Taylor Swift | Barack Obama | Barack Obama
5 | Rihanna | YouTube | Taylor Swift
6 | Ellen DeGeneres | Lady Gaga | YouTube
7 | Lady Gaga | Rihanna | Britney Spears
8 | YouTube | Ellen DeGeneres | Rihanna
9 | Justin Timberlake | Twitter | Instagram
10 | Twitter | Justin Timberlake | Justin Timberlake
Social media: a valuable source of information (not just stupid stuff about pop stars): business insights; sharing and receiving news; campaigns; sharing information during disasters; all kinds of collective intelligence; an alternative to traditional polls; and much more
Why is social media interesting to study? Fast-growing, highly dynamic and high-volume source of data: big data! Reflects language used in today's society. Reflects current views of society. Challenging research area for Text Analysis due to specialised use of language.
Gartner 3V definition of Big Data: Volume and Velocity (high volume and velocity of messages: 500 million tweets per day); Variety (stock markets, earthquakes, social arrangements); + Veracity
Big Data is not new! Staff sorting 4M used tickets from #London Underground to analyse line use in 1939
Linguistic challenges of social media
Language. Problem: typically exhibits very different language style. Solution: train specific language processing components.
Relevance. Problem: topics and comments can rapidly diverge. Solution: train a classifier.
Lack of context. Problem: hard to disambiguate entities. Solution: data aggregation, metadata, entity linking.
People don’t write “properly”
Grundman:politics makes #climatechange scientific issue,people don’t like knowitall rational voice tellin em wat 2do
Want to solve the problem of #ClimateChange? Just #vote for a #politician! Poof! Problem gone! #sarcasm #TVP #99%
Human Caused #ClimateChange is a Monumental Scam! http://www.youtube.com/watch?v=LiX792kNQeE … F**k yes!! Lying to us like MOFO's Tax The Air We Breath! F**k Them!
The last people I will listen2 about guns r those that know nothing about them&politicians who live in states w/strictest gun laws #cali #ny
Ecuador, 7.8 earthquake, April 2017, ~700 people died
Droughts, affecting 60 million in 34 countries
Maxwell, California, Feb 2017
Portugal, forest fires, 64 confirmed deaths, Jun 2017
Manchester, May 2017, 22 dead
Haiti, Hurricane Matthew, Oct 2016, ~500 people died, farming devastated
How is social media relevant to disasters?
Uses of social media during disasters: broadcasting info about the disaster; requesting info from local people and eyewitnesses; requesting and offering help and support; disaster mapping; mobilising the crowd to support initiatives
Some (big) numbers: In the US, 1.1 million tweets were sent in the first day of Hurricane Sandy, and over 20 million in total. Over 800K photos with the #Sandy hashtag on Instagram. 2.3M tweets were sent with the words “Haiti” or “Red Cross” in 2010. More than 23 million tweets were posted about the haze in Singapore. In Nepal, more than half a million posts were shared about the devastating earthquake in 2015.
How can social media help? Harnessing the crowd: using citizen reporters and digital responders for mapping crises. Ushahidi: deployed over 50k times; free and open source; working with us on the COMRADES project.
Tools to help disaster victims get aid quickly: find mentions of locations in the text, match them to a knowledge base, and plot them on a map
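The location step can be illustrated with a minimal sketch. It assumes a tiny hand-made gazetteer standing in for the knowledge base; the `GAZETTEER` table, the function name `locate_mentions` and the approximate coordinates are illustrative assumptions, not the project's actual tooling. A real system would use a much larger knowledge base and disambiguate between places that share a name.

```python
import re

# Hypothetical mini knowledge base: place name -> (latitude, longitude).
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "kathmandu": (27.72, 85.32),
    "port-au-prince": (18.59, -72.31),
    "manchester": (53.48, -2.24),
}


def locate_mentions(text):
    """Return (place, lat, lon) for each gazetteer entry mentioned in the text."""
    found = []
    lowered = text.lower()
    for place, (lat, lon) in GAZETTEER.items():
        # Lookarounds stop a place name matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(place) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.append((place, lat, lon))
    return found


# The resulting points could then be handed to any mapping tool.
points = locate_mentions("Need water and tents near Kathmandu, roads blocked")
```

Exact string matching is the simplest possible choice here; it misses misspellings and cannot tell Manchester, UK from Manchester, New Hampshire, which is where matching against a richer knowledge base becomes necessary.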
How important and urgent is the message? What actions need to be taken?
Understanding climate change: sex will save the planet!
Behaviour Analysis
Based on the assumption that users in different behavioural stages communicate differently (different emotions, directives, etc.)
Desirability: negative sentiment (expressing personal frustration: anger/sadness). Pajarito @lindopajarito, 2h: "Our building needs 40% of all energy consumed in Switzerland!"
Buzz: positive sentiment (happiness/joy), I/we + present tense. DJPajarito @DJPajaritoGenial, 12h: "I'm so proud when I remember to save energy and I know however small it's helping."
Invitation: positive sentiment (happy) + use of vocatives. HotelPajarito @HotelPajarito, 18h: "Join us today today to switch of a light for EH!"
What matters most to people around the world? Exploring opinions on Twitter of people around the world about societal issues – priorities used to re-rank topics for well-being index http://www.oecdbetterlifeindex.org/
Social media and politics: How do people talk about elections and political events? How do the MPs talk about different topics? How does the public respond to them?
Real-time Opinion Monitoring vs replies
Climate change, ISIS and Trump
Parties, topics and location
Twits, twats and twaddle: analysis of hate speech towards politicians
Online abuse: puts people off debating online; puts people off becoming politicians; seems to be getting worse; might be particularly bad for particular groups (females, ethnic minorities, LGBT etc.)
"I am seriously considering whether or not to stand next time"
"My staff try not to let me go out alone"
"Misogynist comments, sexual abuse … My children saw this"
"death threats"
Swear words are a sign of abuse, right?
Well, maybe not always
What about if we specifically mention someone with a nasty word? That has to be bad, right? You *$!%*&”!
Well, not always ….
Hashtags can be misleading. These are all perfectly innocent: #powergenitalia #lesbocages #molestationnursery #teacherstalking #therapist
And what about foreign words? #slagroom
But we still need to analyse hashtags
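One reason hashtags are hard to analyse is that they have to be split into words, and the split is often ambiguous. The following is a minimal sketch that lists every possible segmentation of a hashtag against a word list; the function name `segmentations` and the tiny `VOCAB` set are assumptions made for illustration, and a real tool would need a full dictionary plus some way of ranking the candidate readings.

```python
def segmentations(tag, vocab):
    """Return every way to split `tag` into a sequence of words from `vocab`."""
    if not tag:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for i in range(1, len(tag) + 1):
        head = tag[:i]
        if head in vocab:
            # Keep this prefix and segment the remainder recursively.
            for rest in segmentations(tag[i:], vocab):
                results.append([head] + rest)
    return results


# Hypothetical toy vocabulary, just large enough to show the ambiguity.
VOCAB = {"the", "rapist", "therapist", "power", "gen", "italia", "genitalia"}

therapist_readings = segmentations("therapist", VOCAB)
powergen_readings = segmentations("powergenitalia", VOCAB)
```

With this vocabulary, `therapist_readings` is `[["the", "rapist"], ["therapist"]]` and `powergen_readings` is `[["power", "gen", "italia"], ["power", "genitalia"]]`: both the innocent and the misleading reading are valid segmentations, so the choice has to come from context.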
Aims of the Analysis: Who is being abused? Who is abusing them? What is the abuse about? Is it really getting worse?
Plan: collect tweets to and from politicians in the run-up to the 2015 and 2017 UK elections; annotate all the interesting information (who, what, when, where) with the social media toolkit; run an abuse classifier; analyse the results
Tweet Collection: tweets are tracked in real time using the streaming API
Tokenisation: individual tokens are extracted
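As a rough illustration of why tweets need their own tokeniser, here is a minimal regex-based sketch that keeps URLs, @mentions and #hashtags as single tokens. It is an assumption-laden stand-in, not the tokeniser used in the toolkit; the names `TOKEN_RE` and `tokenise` are hypothetical.

```python
import re

# Alternatives are tried left to right, so URLs and @/# tokens win over plain words.
TOKEN_RE = re.compile(
    r"""
    https?://\S+          # URLs, kept whole
    | [@#]\w+             # @mentions and #hashtags
    | \w+(?:'\w+)*        # words, including contractions with an apostrophe
    | [^\w\s]             # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenise(tweet):
    """Split a tweet into tokens, preserving Twitter-specific items."""
    return TOKEN_RE.findall(tweet)


tokens = tokenise("Just #vote for a #politician! Poof!")
```

Here `tokens` is `["Just", "#vote", "for", "a", "#politician", "!", "Poof", "!"]`. A general-purpose tokeniser would typically split `#vote` into `#` and `vote`, losing the information that it was a hashtag.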
Normalisation: spelling and abbreviations are normalised
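The simplest form of normalisation is a lookup table from non-standard forms to standard ones. The sketch below assumes such a table; the `NORMALISATION` entries and the function name `normalise` are illustrative assumptions, and a real component would also need spelling correction and context, since a token such as "r" is not always "are".

```python
# Hypothetical lookup table of non-standard forms seen in tweets.
NORMALISATION = {
    "wat": "what",
    "2do": "to do",
    "r": "are",
    "em": "them",
    "tellin": "telling",
}


def normalise(tokens):
    """Replace known non-standard tokens; one token may expand to several words."""
    out = []
    for tok in tokens:
        replacement = NORMALISATION.get(tok.lower(), tok)
        out.extend(replacement.split())
    return out


normalised = normalise("tellin em wat 2do".split())
```

Here `normalised` is `["telling", "them", "what", "to", "do"]`. Tokens that are not in the table, such as hashtags, pass through unchanged.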
POS tagging: parts of speech are identified
Named Entities: we discover mentions of entities such as people, locations, organisations and products
Politician Recognition: find mentions of MPs and link to information from YourNextMP + DBpedia
Topic Detection: tweets are matched against a detailed topic ontology
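Matching against an ontology can be approximated, very crudely, by matching against the terms attached to each topic. The sketch below assumes a flat term list per topic; the `TOPIC_TERMS` fragment and the function name `detect_topics` are invented for illustration and are far simpler than a detailed ontology with hierarchy and synonyms.

```python
import re

# Hypothetical fragment of a topic vocabulary: topic -> associated terms.
TOPIC_TERMS = {
    "environment": {"climate change", "climatechange", "emissions"},
    "economy": {"tax", "jobs", "deficit"},
    "health": {"nhs", "hospital"},
}


def detect_topics(text):
    """Return the sorted list of topics whose terms occur in the text."""
    lowered = text.lower()
    words = set(re.findall(r"\w+", lowered))  # "#ClimateChange" yields "climatechange"
    topics = set()
    for topic, terms in TOPIC_TERMS.items():
        for term in terms:
            # Multi-word terms are matched as phrases, single words as whole tokens.
            hit = term in lowered if " " in term else term in words
            if hit:
                topics.add(topic)
                break
    return sorted(topics)


topics = detect_topics("Human Caused #ClimateChange is a scam, tax the air")
```

Here `topics` is `["economy", "environment"]`. Matching single words as whole tokens avoids false hits such as "tax" inside "taxi".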
Geolocation: tweets are linked to NUTS regions based on place tags and user home locations
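A minimal sketch of the linking step follows. It assumes that the tweet's place tag is preferred and the profile home location is used as a fallback; that priority, the small `PLACE_TO_NUTS1` table and the function name `nuts_region` are all assumptions for illustration. The codes shown are UK NUTS level 1 regions.

```python
# Small illustrative lookup from place names to UK NUTS level 1 codes.
PLACE_TO_NUTS1 = {
    "sheffield": "UKE",   # Yorkshire and the Humber
    "manchester": "UKD",  # North West
    "london": "UKI",      # London
    "cardiff": "UKL",     # Wales
}


def nuts_region(place_tag, home_location):
    """Return a NUTS1 code from the place tag, else from the home location, else None."""
    for candidate in (place_tag, home_location):
        if candidate:
            # Profile locations often look like "Sheffield, UK": keep the first part.
            key = candidate.split(",")[0].strip().lower()
            if key in PLACE_TO_NUTS1:
                return PLACE_TO_NUTS1[key]
    return None


region = nuts_region(None, "Sheffield, UK")
```

Here `region` is `"UKE"`. Free-text home locations are noisy, so many users will resolve to `None` under a scheme this strict.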
User Classification: we classify users into e.g. journalist, charity, member of the public
Semantic Search: tweet text and annotations are indexed in semantic search engine Mímir for search and visualisation
Finding abusive terms: 404 abusive terms collected, but only annotated when used in specific situations. Categories: uncivil language; threats; obscene nouns; racist and bigoted language. Example terms: n*gger, witch, homo, God botherer, shut up, f**k you, idiot, kill, die, c*nt, tw*t, rape
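The idea of annotating a term only in specific situations can be sketched with a simple rule: some terms always count, while others count only when aimed at someone. Which terms fall into which group, the "within a few words after you or an @mention" rule, and the names `CONTEXT_FREE`, `NEEDS_TARGET` and `annotate_abuse` are all assumptions for illustration, not the rules actually used.

```python
import re

# Illustrative subsets only; the full list of 404 terms is not reproduced here.
CONTEXT_FREE = {"idiot"}        # assumed to count wherever it appears
NEEDS_TARGET = {"die", "kill"}  # assumed to count only when directed at someone


def annotate_abuse(text):
    """Return the abusive terms found, applying a target check where required."""
    lowered = text.lower()
    hits = []
    for term in sorted(CONTEXT_FREE):
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            hits.append(term)
    for term in sorted(NEEDS_TARGET):
        # The term must follow "you" or an @mention, with at most three words between.
        pattern = (
            r"(?:\byou\b|@\w+)(?:\W+\w+){0,3}?\W+" + re.escape(term) + r"\b"
        )
        if re.search(pattern, lowered):
            hits.append(term)
    return hits


threat = annotate_abuse("@someone I hope you die")
news = annotate_abuse("Thousands die in the earthquake")
```

Here `threat` is `["die"]` and `news` is `[]`, so the same word is annotated in one tweet and ignored in the other. The rule only looks for a target before the term, so phrasing with the target afterwards would be missed.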
Analysing the data
Did the abuse get worse? There was more abuse in 2017 than in 2015.
Who got the abuse in 2015? Men got more abuse than women. Conservatives got more abuse than Labour.
Who got the abuse in 2015? A small number of prominent MPs
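Comparisons between groups can be made on raw counts or on the share of replies that are abusive, and the two can disagree when groups receive very different volumes of replies. The sketch below computes the share per group; the slides do not say which measure was used, and the function name `abuse_rates` and the toy records are made up purely to show the calculation.

```python
from collections import Counter


def abuse_rates(replies):
    """Map each group to the fraction of its replies flagged as abusive.

    `replies` is an iterable of (group, is_abusive) pairs.
    """
    totals, abusive = Counter(), Counter()
    for group, is_abusive in replies:
        totals[group] += 1
        if is_abusive:
            abusive[group] += 1
    return {group: abusive[group] / totals[group] for group in totals}


# Made-up toy records, NOT real data: both groups receive one abusive reply.
toy = [
    ("A", True), ("A", False), ("A", False), ("A", False),
    ("B", True), ("B", False),
]
rates = abuse_rates(toy)
```

Here `rates` is `{"A": 0.25, "B": 0.5}`: the raw counts are equal, but group B receives abuse in a larger share of its replies.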
What about in 2017? The same thing happened (but to different people)
Check out the interactive version! http://demos.gate.ac.uk/politics/buzzfeed/sunburst.html
Take-away message: Social media contains an awful lot of interesting information. The way people talk on social media is critical, and messages framed in the right way can lead to real behavioural change. If we can understand this properly, it can give us incredibly valuable insights. It’s worth spending the time to do this properly. More about all this on our blog: https://gate4ugc.blogspot.com/
Acknowledgements: Work partially funded by the European Union under the Information and Communication Technologies (ICT) theme of the 7th Framework and H2020 Programmes for R&D: SoBigData (654024) http://www.sobigdata.eu; COMRADES (687847) http://www.comrades-project.eu