Web Archives for Verifying Attribution in Twitter Screenshots
TARANNUMZAKI1
About This Presentation
Users rarely think about verifying screenshots of social media posts before sharing them on social media. This eventually leads to the spread of misinformation and disinformation. We are developing an automated tool to estimate the probability that a screenshot of a social media post is fake. In many cases, web archives can be used to validate the attribution of such screenshots.
Size: 15.61 MB
Language: en
Added: May 14, 2024
Slides: 22 pages
Slide Content
Web Archives for Verifying Attribution in Twitter Screenshots
Presented by: Tarannum Zaki, PhD Student
Advisors: Dr. Michael L. Nelson & Dr. Michele C. Weigle
Department of Computer Science, Old Dominion University, Norfolk, Virginia
2024 Web Science and Digital Libraries Research Group Expo, April 26, 2024
@tarannum_zaki @WebSciDL
Screenshots are commonly used to annotate the social media posts of others
https://twitter.com/BetteMidler/status/1541472225341198338
https://twitter.com/MahyarTousi/status/1534307163073658881
https://twitter.com/urbanachievr/status/1505944201208516612
Tarannum Zaki WSDL Research Group Expo 2024 Web Archives for Verifying Attribution in Twitter Screenshots @tarannum_zaki @WebSciDL
Why screenshots? To use as evidence of deleted posts
Controversial posts may be deleted.
https://twitter.com/DanielDefense/status/1526237750277681154
https://web.archive.org/web/20220525125749/https://twitter.com/DanielDefense/status/1526237750277681154
https://twitter.com/ashtonpittman/status/1530243294868930560
Other reasons: to deny cross-platform engagement, to aggregate, to mark up, etc.
Did they really post that?
Screenshots can also be used for humor, satire, and disinformation.
https://twitter.com/Shayan86/status/1515753937139388418
https://twitter.com/paulthacker11/status/1495436489492090881
https://twitter.com/elonmusk/status/1544051155562598401
Creating fake tweets using Tweetgen
https://www.tweetgen.com/
https://www.tweetgen.com/create/tweet.html
Using the live web and web archives to validate attribution of screenshots
https://www.google.com/search
https://archive.org/web/
https://www.reuters.com/
https://www.snopes.com/
Motivation
- Fake tweets can be responsible for the spread of misinformation and disinformation.
- Fake tweets are easy to create using online tools.
- There are no tools currently available to evaluate the authenticity of attribution of screenshots.
Aim
To develop a tool that automatically estimates the probability that a screenshot of a social media post was actually posted by the alleged author, using the live web and web archives.
To search for a tweet in the Wayback Machine, you must first know its URL
Wayback Machine: https://web.archive.org/
URL of the tweet: https://twitter.com/annaturley/status/1506706947239817224
Archived copy: https://web.archive.org/web/20220323185843/https://twitter.com/annaturley/status/1506706947239817224
But the URL of a tweet is not present in most screenshots
https://twitter.com/AaronBastani/status/1507391218854117377
Example annotations: @annaturley; March 23, 2022; March 25, 2022
Tweet URL pattern: https://twitter.com/TWITTER_HANDLE/status/TWEET_ID
Archived copy: https://web.archive.org/web/20220323185843/https://twitter.com/annaturley/status/1506706947239817224
The tweet ID encodes the timestamp of when the tweet was created.
Construction of a tweet URL
- Use the Twitter handle and approximate a time window based on the timestamp.
- Construct the URL for the tweet.
- Search for the tweet in the Wayback Machine using the URL.
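The two URL patterns on this slide can be sketched in Python as follows. The function names are illustrative assumptions and are not taken from the presented tool; only the patterns and the example values come from the slide.

```python
def tweet_url(handle: str, tweet_id: int) -> str:
    """Build a URL of the form https://twitter.com/TWITTER_HANDLE/status/TWEET_ID."""
    # A leading "@" (as it appears in screenshots) is not part of the URL.
    return f"https://twitter.com/{handle.lstrip('@')}/status/{tweet_id}"


def wayback_url(timestamp: str, url: str) -> str:
    """Build a Wayback Machine URL for a capture at a 14-digit timestamp."""
    return f"https://web.archive.org/web/{timestamp}/{url}"


original = tweet_url("@annaturley", 1506706947239817224)
print(original)
print(wayback_url("20220323185843", original))
```

In practice the tweet ID is the unknown part, which is why the later slides search by handle prefix and time window rather than by a complete URL.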
Verifying whether the content of a screenshot exists in the Wayback Machine
Creating a dataset of screenshots collected from Twitter
- Screenshot images shared on Twitter.
- 200 examples.
- Examples include both real and fake screenshots.
https://ws-dl.blogspot.com/2022/12/2022-12-12-disinformation-spread-on.html
Fields: Shared post’s URL, Original post’s URL, Category, Reason, Content category, Structural features, Post type, Social media, Search strategy, Annotated images, Screenshot, Remarks
Example entry (the figure labels the original post’s URL, the shared post’s URL, and the screenshot):
https://twitter.com/rvawonk/status/1503227687917305863
https://twitter.com/RealCandaceO/status/1501576352587292673
- Category: Real
- Reason: Found in the live web
- Content category: Politics
- Post type: Tweet
- Structural features: Single author, single post
- Search strategy: Searched on Twitter interface
- Social media: Twitter
OCRing screenshots: Single tweet images
Optical Character Recognition (OCR) extracts information as text from a digital image.
Figure: example screenshot image and OCR extracted output (Twitter handle, timestamp, tweet text).
Zaki, T., Nelson, M.L., and Weigle, M.C. (2023, Jun 14). Extracting Information from Twitter Screenshots. Tech Report arXiv:2306.08236. https://doi.org/10.48550/arXiv.2306.08236
Computing a time window based on the screenshot timestamp
The maximum difference between two time zones on Earth is 26 hours (UTC-12:00 to UTC+14:00).
Figure: example screenshot image, OCR extracted output, and the Twitter handle with computed timestamps.
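A minimal sketch of the window computation, assuming the window extends 26 hours on each side of the timestamp read from the screenshot (52 hours in total). The function names and the input timestamp are hypothetical; the input was chosen so that the left-hand boundary equals the `from` value used in the CDX example.

```python
from datetime import datetime, timedelta

# Largest possible offset between two time zones on Earth.
MAX_TZ_DIFFERENCE = timedelta(hours=26)


def time_window(screenshot_time: datetime) -> tuple[datetime, datetime]:
    """Return (left, right) boundaries 26 hours before and after the timestamp."""
    return screenshot_time - MAX_TZ_DIFFERENCE, screenshot_time + MAX_TZ_DIFFERENCE


def to_wayback_timestamp(moment: datetime) -> str:
    """Format a datetime as the 14-digit YYYYMMDDhhmmss string used by web archives."""
    return moment.strftime("%Y%m%d%H%M%S")


# Hypothetical timestamp as it might be read from a screenshot.
left, right = time_window(datetime(2022, 2, 19, 17, 41))
print(to_wayback_timestamp(left), to_wayback_timestamp(right))
```

The window is wide because a screenshot shows the time in the viewer's local time zone, which is unknown.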
Using the CDX API to retrieve archived tweets from the left-hand boundary of the time window
https://archive.org/help/wayback_api.php
CDX API prefix search process, using the Twitter handle and computed timestamps:
    urir = "https://twitter.com/" + "randyhillier" + "/status"
    params = "&matchType=prefix&from=" + "20220218154100"
    request = "http://web.archive.org/cdx/search/cdx?url=" + urir + params
Output: retrieved archived tweets starting at the left-hand boundary (cropped).
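The request above could be built and parsed along these lines. This is a sketch, not the presented tool's code: the function names are assumptions, and the parser assumes the CDX server's default space-separated output, where the second and third fields are the capture timestamp and the original URL.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_request(handle: str, from_timestamp: str) -> str:
    """Build a prefix-search request for all captures under a handle's /status path."""
    params = {
        "url": f"https://twitter.com/{handle}/status",
        "matchType": "prefix",
        "from": from_timestamp,  # left-hand boundary of the time window
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_line(line: str) -> tuple[str, str]:
    """Return (capture timestamp, original URL) from one default-format CDX line."""
    fields = line.split()
    return fields[1], fields[2]


def fetch_captures(handle: str, from_timestamp: str) -> list[tuple[str, str]]:
    """Query the CDX API over the network and parse every returned line."""
    with urlopen(build_cdx_request(handle, from_timestamp)) as response:
        text = response.read().decode("utf-8")
    return [parse_cdx_line(line) for line in text.splitlines() if line.strip()]


print(build_cdx_request("randyhillier", "20220218154100"))
```

`urlencode` percent-encodes the `url` value, which avoids ambiguity when the target URL itself contains reserved characters.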
Extracting tweet IDs and determining tweet creation timestamps using TweetedAt
https://ws-dl.blogspot.com/2019/08/2019-08-03-tweetedat-finding-tweet.html
Each tweet ID encodes its creation timestamp.
An archived tweet’s URL: https://web.archive.org/web/20220222163926/https://twitter.com/randyhillier/status/1006984708109099008
https://oduwsdl.github.io/tweetedat/#1006984708109099008
Mapping between all the tweet IDs and tweet creation timestamps:
Tweet ID              Tweet Creation Date
1006984708109099008   20180613194037
...                   ...
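For Snowflake-style tweet IDs, the creation time can be recovered with a bit shift. The sketch below is a standalone illustration of that decoding, not the TweetedAt code itself, and it does not cover the older pre-Snowflake IDs that TweetedAt also handles. The function name is an assumption.

```python
from datetime import datetime, timezone

# Snowflake epoch in milliseconds: 2010-11-04 01:42:54.657 UTC.
TWITTER_EPOCH_MS = 1288834974657


def tweet_creation_time(tweet_id: int) -> datetime:
    """Decode the UTC creation time stored in the upper bits of a Snowflake ID."""
    # The lower 22 bits hold machine and sequence numbers; the rest is
    # milliseconds elapsed since the Snowflake epoch.
    milliseconds = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(milliseconds / 1000, tz=timezone.utc)


created = tweet_creation_time(1006984708109099008)
print(created.strftime("%Y%m%d%H%M%S"))  # 20180613194037, as in the mapping above
```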
Determining the final set of archived tweets by filtering the tweet creation timestamps within the time window
- Input: 917 archived tweets retrieved from the left-hand boundary (cropped), with the mapping between tweet ID and tweet creation timestamp.
- Tweets whose creation timestamps do not fall within the 52-hour time window are filtered out: 449 archived tweets remain.
- Multiple mementos are filtered out: 29 archived tweets remain.
Output: 29 archived tweets within the 52-hour time window (cropped).
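The two filtering steps could look like the following sketch. It assumes Snowflake-style tweet IDs, takes the original URLs returned by the CDX search as input, and treats captures that share a tweet ID as multiple mementos of one tweet. All names are illustrative.

```python
import re
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # Snowflake epoch in milliseconds
STATUS_ID = re.compile(r"/status/(\d+)")


def creation_time(tweet_id: int) -> datetime:
    """Decode the UTC creation time from a Snowflake-style tweet ID."""
    return datetime.fromtimestamp(((tweet_id >> 22) + TWITTER_EPOCH_MS) / 1000, tz=timezone.utc)


def filter_captures(original_urls, window_start, window_end):
    """Map each distinct tweet ID created inside the window to its creation time."""
    kept = {}
    for url in original_urls:
        match = STATUS_ID.search(url)
        if match is None:
            continue  # not a tweet URL (for example, a profile page)
        tweet_id = int(match.group(1))
        if tweet_id in kept:
            continue  # another memento of a tweet that was already kept
        created = creation_time(tweet_id)
        if window_start <= created <= window_end:
            kept[tweet_id] = created
    return kept
```

Because the check uses the creation time decoded from the ID rather than the capture time, a tweet archived long after it was posted is still matched to the correct window.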
Extracting tweet text from archived tweets using BeautifulSoup and Selenium
https://www.selenium.dev/
https://pypi.org/project/beautifulsoup4/
Selenium automates web scraping and BeautifulSoup parses text from HTML.
An archived tweet’s URL: https://web.archive.org/web/20220220024223/https://twitter.com/randyhillier/status/1495226962058649603
HTML tag containing the tweet text has the classes: TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text
Figure: extracted text from the archived tweet.
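To illustrate the extraction step without third-party packages, the sketch below swaps in Python's standard-library `html.parser` for BeautifulSoup and skips the Selenium page load entirely. It assumes the HTML has already been fetched and that the tweet text sits in the first element carrying all four classes listed above. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

TWEET_TEXT_CLASSES = {"TweetTextSize", "TweetTextSize--jumbo", "js-tweet-text", "tweet-text"}


class TweetTextParser(HTMLParser):
    """Collect the text of the first element that has all tweet-text classes."""

    def __init__(self):
        super().__init__()
        self.container = None  # tag name of the element currently being captured
        self.finished = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.finished or self.container is not None:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if TWEET_TEXT_CLASSES <= classes:
            self.container = tag

    def handle_endtag(self, tag):
        # Assumes the container is not nested inside itself (true for a <p>).
        if tag == self.container:
            self.container = None
            self.finished = True

    def handle_data(self, data):
        if self.container is not None:
            self.chunks.append(data)


def extract_tweet_text(html: str) -> str:
    """Return the tweet text with whitespace normalized, or "" if none is found."""
    parser = TweetTextParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

Text inside nested links and hashtags is kept, since only the closing tag of the container ends the capture.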
Computing a text similarity score between the tweet text from the screenshot and the archived tweets using Python’s difflib library
https://docs.python.org/3/library/difflib.html
The text similarity score is computed from the longest contiguous matching subsequences of the two texts.
match_score(Archived_Tweet_Text1, Screenshot_Tweet_Text) = 81.40%
match_score(Archived_Tweet_Text2, Screenshot_Tweet_Text) = 30.78%
match_score(Archived_Tweet_Text3, Screenshot_Tweet_Text) = 5.67%
...
A match score of 81.40% is evidence that the tweet in the screenshot was posted by the alleged author.
Figure: example screenshot image, extracted tweet text from the screenshot, and extracted text from the archived tweet.
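A sketch of how such a score can be computed with `difflib.SequenceMatcher`. The slide names only the difflib library, so the choice of `SequenceMatcher.ratio()` and the helper `best_match` are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def match_score(archived_text: str, screenshot_text: str) -> float:
    """Return the difflib similarity of two texts as a percentage (0 to 100)."""
    return SequenceMatcher(None, archived_text, screenshot_text).ratio() * 100


def best_match(archived_texts: dict, screenshot_text: str):
    """Return (url, score) of the most similar archived tweet, or None if empty."""
    scored = ((url, match_score(text, screenshot_text)) for url, text in archived_texts.items())
    return max(scored, key=lambda pair: pair[1], default=None)


print(match_score("abcd", "bcde"))  # 75.0: the shared block "bcd" gives 2 * 3 / 8
```

Comparing the screenshot text against every archived tweet in the filtered set and keeping the highest score gives one number per screenshot, which can then be compared against a threshold.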
A threshold of 60% produced the highest F1 (0.69)
Performance of the overlap between the tweet text from the screenshot and the archived tweets. Experimented on 108 single tweet images from the collected dataset.
Threshold   Precision   Recall   F1 Score
90%         1.00        0.42     0.59
80%         1.00        0.49     0.66
70%         1.00        0.51     0.67
60%         1.00        0.53     0.69
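The metrics in the table follow the usual definitions, which can be sketched as below. The input format is an assumption: each sample pairs a screenshot's best match score with a ground-truth label saying whether the tweet is real, and a screenshot counts as a predicted match when its score reaches the threshold.

```python
def evaluate(samples, threshold):
    """Return (precision, recall, F1) for a list of (best_score, is_real) pairs."""
    tp = sum(1 for score, is_real in samples if score >= threshold and is_real)
    fp = sum(1 for score, is_real in samples if score >= threshold and not is_real)
    fn = sum(1 for score, is_real in samples if score < threshold and is_real)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up scores and labels, for illustration only.
samples = [(95.0, True), (65.0, True), (40.0, True), (20.0, True), (10.0, False)]
for threshold in (90, 80, 70, 60):
    print(threshold, evaluate(samples, threshold))
```

Recomputing F1 from the rounded precision and recall shown in the table can differ from the reported value in the last digit.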
Limitations & Future Work
OCR on complex screenshot images: the extracted output is mostly garbage.
Summary
- Screenshots are an easy way to share content on social media.
- Since screenshots can be easily faked, it is a critical task to detect a fabricated post.
- Services of web archives could be useful to verify attribution of a screenshot by finding an archived version of the screenshot content.
- Our research aims to mitigate the spread of misinformation and disinformation on social media.