Uncertainty in Replaying Archived Twitter Pages
2021-03-30 @phonedude_mln, @WebSciDL
Michael L. Nelson
@phonedude_mln
with: Sawood Alam, Kritika Garg, Himarsha Jayanetti,
Shawn M. Jones, Nauman Siddique, Michele C. Weigle
@WebSciDL
Ethics and Archiving the Web: How to ethically collect and use web archives
2021-03-30
Please don’t use screenshots to capture the past
Did they really tweet that?
No. Social media is filled with fake screenshots:
sometimes it’s disinformation (left), and sometimes it’s humor (right).
Full disclosure: I realize
screenshots serve a purpose,
and won’t go away
A screenshot would prevent context
being lost when the original tweet is deleted
https://twitter.com/AOC/status/1364623055658635268
Screenshots extend platform capability:
Annotation, Aggregation, Non-notification
https://twitter.com/IlhanMN/status/1374503025289523207
Two different tweets (can only QT
one), dates circled, @NRA &
@NRAILA not notified.
Screenshots also provide platform interoperability
https://www.facebook.com/watchclassinsession
with a screenshot of https://twitter.com/CoriBush/status/1369800497654284289
After 50+ years of R&D in hypertext & networks,
recursive screenshots of text are anticlimactic
https://twitter.com/visakanv/status/1376165640361271303
Web archives are preferable to screenshots --
“If You See Something, Save Something”
http://blog.archive.org/2017/01/25/see-something-save-something/
Even better: save to more than one web archive: http://archive.is/, https://perma.cc/
“The past is never dead. It's not even past.”
Web archives are
live web simulations of the past.
Uncertainty in the simulation
is an opportunity for mis/disinformation.
https://www.cni.org/events/membership-meetings/past-meetings/spring-2019/plenary-sessions-s19#closing
https://ws-dl.blogspot.com/2020/03/2020-03-07-at-nexus-of-cni-keynote-and.html
Twitter template text isn’t
always in the expected language
https://ws-dl.blogspot.com/2018/03/2018-03-21-cookies-are-why-your.html
https://blog.dshr.org/2018/04/all-your-tweets-are-belong-to-kannada.html
This page is not wrong -- this content did
exist on the live web at this URL for some
people -- but for us it is unexpected.
In fact, template text can be a mix of languages
that would never have occurred on the live web
https://ws-dl.blogspot.com/2019/03/2019-03-18-cookie-violations-cause.html
https://blog.dshr.org/2019/03/the-47-links-mystery.html
This page is wrong: no one on
the live web ever saw an
English/Portuguese/Urdu
combo page.
Twitter completed its UI change in Summer 2020;
regular users no longer see the old UI on the live web
Archives (mostly*) have difficulty archiving the new UI, so they pretend to be
“googlebot” so they can archive the old UI.
Result: view a page on the live web, archive it & replay it, and they don’t match.
https://twitter.com/AnnaPerricci/status/1375437873898516483 (New UI, left),
http://web.archive.org/web/20210328232353/https://twitter.com/AnnaPerricci/status/1375437873898516483 (Old UI, right)
https://ws-dl.blogspot.com/2020/07/2020-07-15-twitter-was-already.html
*Webrecorder/Conifer will correctly archive the new UI, but for existing collections (IA, archive.today, perma.cc, etc.)
you can’t influence whether the collection will archive the old or new UI.
Left: live, new UI. Right: archived, old UI.
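The user-agent switch described on this slide can be sketched in Python. This is only an illustration, not how any particular archive implements its crawler: the helper name `build_request` and the browser user-agent string are made up for the example, and whether Twitter still serves the old UI to a client announcing itself as Googlebot is an assumption. No request is sent; the sketch only shows that the two fetches differ in a single header.

```python
from urllib.request import Request

# User-agent string that Google documents for its crawler.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
# Hypothetical desktop browser user-agent, for contrast only.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0"


def build_request(url, user_agent):
    """Return an unsent HTTP request for `url` that announces `user_agent`."""
    return Request(url, headers={"User-Agent": user_agent})


tweet = "https://twitter.com/AnnaPerricci/status/1375437873898516483"
as_browser = build_request(tweet, BROWSER_UA)    # what a regular user sends
as_crawler = build_request(tweet, GOOGLEBOT_UA)  # what an archive crawler may send

# Same URL, different User-Agent: the server is free to answer each with a different UI.
print("same URL:", as_browser.full_url == as_crawler.full_url)
print("same User-Agent:",
      as_browser.get_header("User-agent") == as_crawler.get_header("User-agent"))
```

Because the archived copy is fetched under a different identity than the one a person uses, the live page and its replay can legitimately differ even when both are captured at the same moment.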
If the new UI is archived,
replay can inject weird artifacts
https://ws-dl.blogspot.com/2021/01/2020-01-22-twitter-rewrites-your-urls.html
https://blog.dshr.org/2021/02/more-on-archiving-twitter.html
not an encoded message
to/from Q, the deep state,
or the Easter Bunny
The old UI does not support “fact-check” labels
(and did not support “Violated Twitter rules” labels for ~3 months)
You can cherry pick from the TimeMap
to prove or disprove that Twitter labeled this tweet.
http://web.archive.org/web/*/https://twitter.com/realDonaldTrump/status/1265255835124539392
https://ws-dl.blogspot.com/2020/12/2020-12-08-twitter-added-labels-on-its.html
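One hedge against cherry picking is to enumerate every memento in the TimeMap instead of citing a single capture. The sketch below parses a link-format TimeMap into a sorted list of (datetime, URI-M) pairs. The sample TimeMap text and its capture times are invented for illustration, the function names are hypothetical, and the regular expression assumes the attribute order `rel` then `datetime`, which a given archive may not follow.

```python
import re
from email.utils import parsedate_to_datetime

# Internet Archive endpoint that returns a link-format TimeMap for a URL.
IA_TIMEMAP = "http://web.archive.org/web/timemap/link/"

# Matches entries whose rel value contains "memento" (e.g. "first memento").
MEMENTO_LINK = re.compile(
    r'<([^>]+)>;\s*rel="([^"]*\bmemento\b[^"]*)";\s*datetime="([^"]+)"'
)


def timemap_url(original_url):
    """Build the TimeMap URL for an original (live web) URL."""
    return IA_TIMEMAP + original_url


def parse_mementos(timemap_text):
    """Return (capture datetime, memento URI) pairs, oldest first."""
    pairs = [
        (parsedate_to_datetime(dt), uri)
        for uri, _rel, dt in MEMENTO_LINK.findall(timemap_text)
    ]
    return sorted(pairs)


TWEET = "https://twitter.com/realDonaldTrump/status/1265255835124539392"

# Invented sample; a real TimeMap would be fetched from timemap_url(TWEET).
SAMPLE = f'''<{TWEET}>; rel="original",
<http://web.archive.org/web/20200526123456/{TWEET}>; rel="first memento"; datetime="Tue, 26 May 2020 12:34:56 GMT",
<http://web.archive.org/web/20200527010203/{TWEET}>; rel="memento"; datetime="Wed, 27 May 2020 01:02:03 GMT",
<http://web.archive.org/web/20201208000000/{TWEET}>; rel="last memento"; datetime="Tue, 08 Dec 2020 00:00:00 GMT"'''

for captured, uri in parse_mementos(SAMPLE):
    print(captured.isoformat(), uri)
```

Reporting how many of the listed mementos show the label, rather than one that does or does not, makes a selective citation easier to spot.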
Replaying the new UI can result in temporal
violations: pages that never existed on the live web
https://ws-dl.blogspot.com/2020/11/2020-11-04-new-twitter-ui-replaying.html
https://blog.dshr.org/2020/12/michael-nelsons-group-on-archiving.html
@realDonaldTrump has multiple,
independent tweet archives, so we
can compute what is missing -- not
true for less popular suspended
accounts.
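A temporal violation of this kind can be surfaced by comparing the capture time of the root page with the capture times of the resources the replay actually loaded. The following is a minimal sketch under stated assumptions: timestamps are 14-digit Wayback-style strings, the resource names and times are invented, and the one-hour tolerance is an arbitrary choice for illustration, not a threshold taken from the talk.

```python
from datetime import datetime, timedelta


def parse_ts(ts):
    """Parse a 14-digit archive timestamp such as 20201104120000."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def temporal_outliers(root_ts, embedded, tolerance=timedelta(hours=1)):
    """Return (resource, drift) pairs captured further than `tolerance` from the root page."""
    root = parse_ts(root_ts)
    flagged = []
    for name, ts in embedded.items():
        drift = abs(parse_ts(ts) - root)
        if drift > tolerance:
            flagged.append((name, drift))
    return flagged


# Invented example: the page shell was captured with the root page,
# but the JSON that fills in the tweets comes from months earlier.
ROOT = "20201104120000"
EMBEDDED = {
    "page_shell.js": "20201104120005",
    "tweet_data.json": "20200820093000",
}

for name, drift in temporal_outliers(ROOT, EMBEDDED):
    print(f"{name} is {drift.days} days away from the root capture")
```

A large drift does not prove the composite page is wrong, but it marks a replay whose parts may never have coexisted on the live web.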
Good news: Embedded tweets don’t disappear
https://twitter.com/phonedude_mln/status/1347948053152616459
Bad news: without live web Twitter for verification,
orphaned embedded tweets can be faked
https://www.youtube.com/watch?v=iWWL-mblxF4
Archive-aware embeds can verify the tweets,
point to copies in different archives
https://ws-dl.blogspot.com/2021/01/2021-01-09-embedded-tweets-from.html
But tweet URL lookup in archives can produce false
negatives; 100+ URL variations for a single tweet
% curl -s "http://web.archive.org/cdx/search/cdx?url=https://twitter.com/realDonaldTrump/status/1343328283824447488&matchType=prefix" | awk '{print $3}' | sort -fu | head -10
https://twitter.com/realdonaldtrump/status/1343328283824447488
https://twitter.com/realDonaldTrump/status/1343328283824447488?ftag=MSF0951a18
https://twitter.com/realdonaldtrump/status/1343328283824447488?lang=ar
https://twitter.com/realdonaldtrump/status/1343328283824447488?lang=bg
https://twitter.com/realdonaldtrump/status/1343328283824447488?lang=bn
https://twitter.com/realdonaldtrump/status/1343328283824447488?lang=ca
https://twitter.com/realdonaldtrump/status/1343328283824447488?lang=cs
https://twitter.com/realdonaldtrump/status/1343328283824447488?lang=da
https://twitter.com/realdonaldtrump/status/1343328283824447488?lang=de
https://twitter.com/realdonaldtrump/status/1343328283824447488?lang=el
% curl -s "http://web.archive.org/cdx/search/cdx?url=https://twitter.com/realDonaldTrump/status/1343328283824447488&matchType=prefix" | awk '{print $3}' | sort -fu | wc -l
102
https://ws-dl.blogspot.com/2019/08/2019-08-03-searching-web-archives-for.html
The top level tweet won’t always be archived.
Sometimes URLs with parameters are the only
URLs archived, and they are hard to discover if not
known in advance (only the first 10 are shown).
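To reduce false negatives, a lookup can use a prefix query and then fold the returned URL variations back onto a single tweet. The sketch below groups the third CDX field (the original URL, the same field the `awk` command above prints) by lower-cased handle and tweet ID, ignoring query strings. The function names, the list of accepted hostnames, and the sample CDX lines (digest and length values in particular) are illustrative assumptions, not output from a real query.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"
# Hostnames treated as Twitter for this sketch (an assumption, not exhaustive).
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}
TWEET_PATH = re.compile(r"^/([^/]+)/status/(\d+)/?$")


def cdx_prefix_query(tweet_url):
    """Build the same prefix-match CDX query used in the curl command."""
    return f"{CDX_ENDPOINT}?url={tweet_url}&matchType=prefix"


def tweet_key(url):
    """Return (handle, tweet_id) with the handle lower-cased, or None if not a tweet URL."""
    parts = urlsplit(url)
    match = TWEET_PATH.match(parts.path)
    if parts.hostname not in TWITTER_HOSTS or not match:
        return None
    return (match.group(1).lower(), match.group(2))


def group_variations(cdx_lines):
    """Map each tweet to the set of distinct archived URL variations for it."""
    groups = defaultdict(set)
    for line in cdx_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        key = tweet_key(fields[2])
        if key is not None:
            groups[key].add(fields[2])
    return groups


# Invented CDX lines; only the original-URL field mirrors the slide's output.
KEY = "com,twitter)/realdonaldtrump/status/1343328283824447488"
SAMPLE_CDX = [
    f"{KEY} 20201228010203 https://twitter.com/realdonaldtrump/status/1343328283824447488 text/html 200 AAAA 1000",
    f"{KEY}?ftag=msf0951a18 20201228020304 https://twitter.com/realDonaldTrump/status/1343328283824447488?ftag=MSF0951a18 text/html 200 BBBB 1000",
    f"{KEY}?lang=ar 20201229030405 https://twitter.com/realdonaldtrump/status/1343328283824447488?lang=ar text/html 200 CCCC 1000",
]

for (handle, tweet_id), urls in group_variations(SAMPLE_CDX).items():
    print(handle, tweet_id, len(urls), "variations")
```

Grouping by tweet ID means a tweet counts as archived even when only a `?lang=` or tracking-parameter variant was ever captured.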
Screenshots are bad, mkay, and web archives are good;
just remember: replay uncertainty allows for mis/disinformation
● Replay does not always match live web expectations
  ○ template text can be in a foreign language
  ○ old UI vs. new UI
  ○ competing Javascripts can leave artifacts on page
● Old UI does not have “fact-check” label
  ○ can cherry pick old UI archived pages to “prove” that a tweet was not labeled
● Archived new UI pages can hide significant temporal violations
  ○ page you are viewing might never have existed as presented
● Embedded tweets do not disappear if an account is suspended or deleted
  ○ unlike live web tweets, orphaned embeds can be faked
● False negatives are possible when verifying if a Twitter page is archived
  ○ more likely to occur with less popular accounts
