Impact of HTTP Cookie Violations in Web Archives

Certain HTTP cookies on certain sites can be a source of content bias in archival crawls. Accommodating cookies at crawl time but not utilizing them at replay time may cause cookie violations, resulting in defaced composite mementos that never existed on the live web. To address these issues, we propose that crawlers store cookies with short expiration times and that archival replay systems account for values in the Vary header along with URIs.
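
The replay-side proposal can be illustrated with a minimal sketch. It assumes a hypothetical in-memory index in which each capture remembers the request headers the crawler sent and the header names that influenced the response (taken from `Vary`, or identified by the archive when the site does not advertise them). The names `Capture` and `VaryAwareIndex` and the example.com URI are illustrative only and do not belong to any existing replay system.

```python
from dataclasses import dataclass


@dataclass
class Capture:
    """One archived response plus the request context it was negotiated on."""
    uri: str
    request_headers: dict  # headers sent at crawl time (lower-cased names, an assumption)
    vary: tuple            # header names that influenced the response
    body: str


class VaryAwareIndex:
    """Hypothetical replay index that matches on URI and varying header values."""

    def __init__(self):
        self._by_uri = {}

    def add(self, capture):
        self._by_uri.setdefault(capture.uri, []).append(capture)

    def lookup(self, uri, request_headers):
        # Return the first capture whose varying headers match the replay request.
        for capture in self._by_uri.get(uri, []):
            if all(
                request_headers.get(name) == capture.request_headers.get(name)
                for name in capture.vary
            ):
                return capture
        return None  # no capture matches this request context


index = VaryAwareIndex()
uri = "https://example.com/profile"
index.add(Capture(uri, {"cookie": "lang=ar"}, ("cookie",), "Arabic page"))
index.add(Capture(uri, {"cookie": "lang=en"}, ("cookie",), "English page"))

english = index.lookup(uri, {"cookie": "lang=en"})
missing = index.lookup(uri, {"cookie": "lang=kn"})
```

Matching on exact header values is the strictest choice. A real system would likely need a looser fallback, since matching cookies too strictly can hide archived resources that would otherwise be found.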

  1. Impact of HTTP Cookie Violations in Web Archives. Sawood Alam, Michele C. Weigle, and Michael L. Nelson, Old Dominion University, Norfolk, VA, USA. @ibnesayeed @WebSciDL. Supported by NSF Grant IIS-1526700. WADL '19, June 6, 2019, Urbana-Champaign, Illinois.
  2. Cookies Are Why Your Archived Twitter Page Is Not in English. https://ws-dl.blogspot.com/2018/03/2018-03-21-cookies-are-why-your.html
  3. All Your Tweets Are Belong To Kannada. 9,000+ mementos of @BarackObama. English: 53%; Kannada: 22%; Other 45 languages: 25%. https://blog.dshr.org/2018/04/all-your-tweets-are-belong-to-kannada.html
  4. Is JavaScript Causing This? Twitter seems to be rendering translated phrases on the server, so JavaScript cannot be responsible.
  5. Is Cache Conflicting at a Shared Proxy? Twitter goes to great lengths (sometimes in wrong ways) to ensure that its pages are not cached.
  6. Is On-demand Archiving Bringing User Preferences In? IA replays users’ headers in Save Page Now, but other archives do not have on-demand archiving. Archive.is sends a custom Accept-Language header, not the one the user’s browser sends to it.
  7. Is Geo-location Affecting It? Most archival crawlers run in the USA or Europe, which does not explain why Kannada (a regional Indian language) is so popular.
  8. Is Heritrix Sending Wrong Accept-Language Headers? Heritrix-generated WARC files do not contain any Accept-Language header.
  9. Language Content Negotiation in Twitter. The “?lang=<lang-code>” query parameter has the highest precedence. Twitter honors the Accept-Language header for content negotiation, but does not advertise it in a Vary header.
  10. Alternate Language Links Pollute Crawler’s Frontier Queue. Because Kannada (kn) is at the end of the list, its “lang” cookie sticks around for a long time, affecting many subsequent Twitter URLs.
  11. Experiment With Heritrix On Two Seed URIs
      ● https://twitter.com/?lang=ar
        ○ First request has an explicit lang query parameter
        ○ First response has a “Set-Cookie: lang=ar” header
      ● https://twitter.com/phonedude_mln/
        ○ Second request has no lang query parameter, but sends a “Cookie: lang=ar” header
        ○ Second response returns the page in Arabic
  12. Replaying Captured WARC With PyWB. Captures shown: https://twitter.com/?lang=ar and https://twitter.com/phonedude_mln/
  13. Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay in Multiple Languages. https://ws-dl.blogspot.com/2019/03/2019-03-18-cookie-violations-cause.html
  14. Defaced Composite Mementos That Never Existed on the Live Web
      ● Live leakage (Zombies): https://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
      ● Temporal Violations: https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
      ● Origin Violations: https://ws-dl.blogspot.com/2017/01/2017-01-20-cnncom-has-been-unarchivable.html
      ● And now, Cookie Violations!
  15. Anatomy of a Twitter Timeline
      ● Page is loaded with the initial set of tweets
      ● Navigation bar is in the current language
      ● Some sidebar blocks are loaded lazily
      ● New tweets are polled every 30 seconds
      ● Global trends are polled every 5 minutes
  16. Twitter Returns Server-side Rendered Markup. Cookies set by prior responses may impact subsequent XHR responses.
  17. Pages With Explicit lang Parameter Are Consistent. Examples shown: ?lang=pt, ?lang=en, ?lang=ur. Mementos with an explicit “lang” parameter are language consistent.
  18. Replicate Heritrix Behavior on the Live Web (diagram labels, in extraction order)
      ● Load https://twitter.com/ in a browser tab B
      ● Retweet a tweet in the tab A
      ● Load https://twitter.com/?lang=en in a browser tab A
      ● Expand notification in the tab B
      ● Change lang param in the tab A
  19. What Can We Do About These Cookie Violations?
      ● Crawling
        ○ Sandbox short crawl sessions
        ○ Explicitly enforce short cookie expiration times and garbage collect frequently
        ○ Identify such sources of cookie violations and filter them out
      ● Replay
        ○ Respect content negotiation headers (advertised in the “Vary” header)
        ○ Identify non-advertised cookies that affect the content and incorporate them in replay
        ○ Classify cookies into categories such as session, tracking, and configuration
      Ignoring cookies in replay causes cookie violations and raises privacy concerns in personal archiving. Blindly utilizing cookies causes false positives (hurts discovery of archived resources).
  20. Conclusions
      ● Cookies Are Why Your Archived Twitter Page Is Not in English
        ○ https://ws-dl.blogspot.com/2018/03/2018-03-21-cookies-are-why-your.html
      ● Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay in Multiple Languages
        ○ https://ws-dl.blogspot.com/2019/03/2019-03-18-cookie-violations-cause.html
      ● Identified yet another source of bias in archives (over-represented languages)
      ● Described behavior of cookies in crawling and replay (cookie violations)
      ● Proposed some potential solutions like keeping cookies short-lived
      ● Described open problems that need more in-depth research
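
The two-seed Heritrix experiment and the sandboxing remedy from the slides can be approximated with a toy simulation. This is only a sketch under stated assumptions: `toy_server` is a hypothetical stand-in for the behavior described on the slides (an explicit `lang` query parameter wins and sets a cookie; otherwise a previously set `lang` cookie decides), not Twitter's actual logic, and the English default is assumed. `crawl` is likewise a simplified stand-in for a crawler with one cookie jar.

```python
from urllib.parse import urlsplit, parse_qs

DEFAULT_LANG = "en"  # assumption: fallback language of the toy server


def toy_server(uri, cookies):
    """Return (response language, cookies set by the response) for a request."""
    query = parse_qs(urlsplit(uri).query)
    if "lang" in query:
        lang = query["lang"][0]
        return lang, {"lang": lang}  # explicit parameter wins and sets the cookie
    if "lang" in cookies:
        return cookies["lang"], {}   # otherwise an earlier cookie decides
    return DEFAULT_LANG, {}


def crawl(seeds, sandbox=False):
    """Fetch seeds in order; with sandbox=True every fetch starts with an empty jar."""
    jar = {}
    languages = []
    for uri in seeds:
        if sandbox:
            jar = {}  # drop cookies carried over from earlier responses
        lang, set_cookies = toy_server(uri, jar)
        jar.update(set_cookies)
        languages.append(lang)
    return languages


seeds = ["https://twitter.com/?lang=ar", "https://twitter.com/phonedude_mln/"]
shared = crawl(seeds)                   # cookie from the first seed leaks into the second
sandboxed = crawl(seeds, sandbox=True)  # second seed falls back to the default
```

With a shared jar, the second seed is served in Arabic, as in the experiment. Clearing the jar between fetches is the bluntest form of the short-lived cookie idea, and a real crawler would more plausibly cap cookie lifetime per session than discard cookies on every request.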
