Telling Stories with Web Archives

Keynote presentation from the Southeast Women in Computing Conference
November 16, 2013
Lake Guntersville State Park, Alabama

Published in: Technology

  1. Telling Stories with Web Archives. Dr. Michele C. Weigle, Web Sciences and Digital Libraries (WS-DL) Lab, Department of Computer Science, Old Dominion University, Norfolk, VA. Includes joint work with Dr. Michael L. Nelson and our PhD students Scott Ainsworth, Yasmin AlNoamany, Ahmed AlSum, Justin Brunelle, Mat Kelly, and Hany SalahEldeen. Southeast Women in Computing Conference, November 16, 2013
  2. Outline • What is a web archive? • Why are archives important? • What's my story? • How can we help others tell their stories? • Related WS-DL Projects Southeast Women in Computing Conference - Nov 16, 2013 #SEWIC2013
  3. What is a web archive?
  4. What are some web archives?
  5. How can I access the archives? MementoFox, Memento for Chrome http://www.mementoweb.org/ http://ws-dl.blogspot.com/2010/03/2010-03-19-mementofox-add-on-released.html http://ws-dl.blogspot.com/2013/10/2013-10-14-right-click-to-past-memento.html
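Besides the browser add-ons, an archived copy can be requested by URL. The sketch below is illustrative only and is not taken from the slides: it builds a Wayback Machine address using the Internet Archive's 14-digit timestamp convention and formats the date string that Memento clients send in an `Accept-Datetime` header. The function names and the example URI and date are assumptions chosen for the illustration.

```python
from datetime import datetime, timezone

# Replay prefix of the Internet Archive's Wayback Machine.
WAYBACK_PREFIX = "http://web.archive.org/web/"


def wayback_url(uri: str, when: datetime) -> str:
    """Build a Wayback Machine URL asking for the capture closest to `when`."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}{stamp}/{uri}"


def accept_datetime(when: datetime) -> str:
    """Format `when` as an HTTP date in GMT, the form used by Accept-Datetime."""
    return when.astimezone(timezone.utc).strftime("%a, %d %b %Y %H:%M:%S GMT")


# Hypothetical example: the ODU CS homepage as it looked around September 2006.
target = datetime(2006, 9, 1, 12, 0, 0, tzinfo=timezone.utc)
print(wayback_url("http://www.cs.odu.edu/", target))
print("Accept-Datetime:", accept_datetime(target))
```

Nothing here touches the network; the resulting URL can be pasted into a browser, and the archive will redirect to the nearest capture it holds.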
  6. Outline • What is a web archive? • Why are archives important? • What's my story? • How can we help others tell their stories? • Related WS-DL Projects
  7. The Web holds our stories
  8. But webpages can disappear • Average lifespan of a webpage: 50-100 days • A year after publication, about 11% of content shared on social media will be gone. SalahEldeen and Nelson, "Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?", TPDL 2012 http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
  9. But maybe it's archived Ainsworth, AlSum, SalahEldeen, Weigle, and Nelson, "How Much of the Web is Archived?", JCDL 2011 http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html
  10. But social media is hard to archive
  11. Our Research Group Goals • We believe that web archives are valuable cultural resources, and we want everyone to know about them. • We want to make it easy for people to bridge the gap between the live web and the archives. • We believe that replaying the past is more compelling than reading a summary.
  12. vs.
  13. Replaying the past can be more compelling than just a summary
  14. Outline • What is a web archive? • Why are archives important? • What's my story? • How can we help others tell their stories? • Related WS-DL Projects
  15. What's My Story? • As another illustration, I'll tell you a little bit more about myself ... • ... using the Internet Archive
  16. NLU - 1997
  17. UNC-CS - 1997
  18. My CS Homepage - 1997
  19. CS Student Assoc Pres - 1999
  20. Teaching - 2000
  21. Finding gems in the archive
  22. My Research - 2002
  23. Married, Graduated, and Teaching - 2003
  24. Faculty Position at Clemson - 2004
  25. Clemson missing captures
  26. Proof I was there - 2006
  27. Faculty Position at ODU - 2006
  28. Vehicular Networks - 2006
  29. 1st PhD Student Graduated - 2010
  30. InfoVis, Work with WS-DL - 2011
  31. Telling My Story • Going through the archive was a lot of fun. • But, it wasn't always easy. • Today, I might want to incorporate Facebook and Twitter posts in my story. Not saved at Internet Archive. =( • Let's make this easy to do for everyone.
  32. Outline • What is a web archive? • Why are archives important? • What's my story? • How can we help others tell their stories? • Related WS-DL Projects
  33. Project Overview • Project forms the PhD work of Yasmin AlNoamany; ideas are in early stages • Joins my interests in measurement, web science, and information visualization. – measurement - how do people use web archives? – web science - how can we analyze web archives to find pages related to live web pages? – info vis - how can we present the stories that we have harvested from the archive?
  34. How do people use web archives? • We obtained a year's worth (2012) of requests to the Internet Archive's Wayback Machine – client IPs anonymized
  35. How do people use web archives? • First, there are a lot of robots (aka bots) that access the archive – 10 bot sessions for every 1 human session – maybe people don't know about the archive? • Typical human sessions are pretty short – people aren't spending lots of time in the archive – it took me over an hour of walking through the archive to build my story – maybe people who do know about the archive aren't using it to build stories? AlNoamany, Weigle, and Nelson, "Access Patterns for Robots and Humans in Web Archives", JCDL 2013
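Counting bot and human sessions requires first cutting a request log into sessions. The following is a minimal sketch of one common way to do that, not the method from the cited paper: requests from the same anonymized client are grouped, and a new session starts after a period of inactivity. The 10-minute gap, the tuple layout, and the `looks_like_robot` heuristic are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=10)  # assumed inactivity threshold


def sessionize(requests):
    """Group (client_id, timestamp, user_agent, path) tuples into sessions."""
    by_client = defaultdict(list)
    for req in requests:
        by_client[req[0]].append(req)

    sessions = []
    for reqs in by_client.values():
        reqs.sort(key=lambda r: r[1])
        current = [reqs[0]]
        for req in reqs[1:]:
            # A long pause closes the current session and opens a new one.
            if req[1] - current[-1][1] > SESSION_GAP:
                sessions.append(current)
                current = [req]
            else:
                current.append(req)
        sessions.append(current)
    return sessions


def looks_like_robot(session):
    """Hypothetical heuristic: asked for robots.txt or has 'bot' in the user agent."""
    return any(r[3] == "/robots.txt" or "bot" in r[2].lower() for r in session)


# Made-up log entries for illustration only.
t0 = datetime(2012, 1, 1, 12, 0)
log = [
    ("a", t0, "Mozilla/5.0", "/web/1997/http://example.org/"),
    ("a", t0 + timedelta(minutes=3), "Mozilla/5.0", "/web/1999/http://example.org/"),
    ("a", t0 + timedelta(hours=1), "Mozilla/5.0", "/web/2003/http://example.org/"),
    ("b", t0, "ExampleBot/1.0", "/robots.txt"),
]
sessions = sessionize(log)
robots = [s for s in sessions if looks_like_robot(s)]
print(len(sessions), "sessions,", len(robots), "flagged as robot")
```

With sessions in hand, the ratio of bot to human sessions and the typical session length follow from simple counts.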
  36. How do people use web archives? • 65% of the requested archived pages no longer exist on the live web • People use the archive because the pages they are interested in no longer exist – like most of my examples from my story AlNoamany, AlSum, Weigle, and Nelson, "Who and What Links to the Internet Archive", IJDL, to appear, 2013
  37. Helping Others Tell Stories • How can we use this information to help people tell stories? • How do people tell stories? • What tools do they use today?
  38. Egyptian Revolution on Storify
  39. Bookmarking is not preserving
  40. How do people tell stories? • There are three levels of information: – overview – recent events – story definition and replay
  41. Overview
  42. Overview
  43. Recent Events
  44. Recent Events
  45. Story Replay
  46. Story Replay: not yet addressed
  47. Research Questions How do we • define the time frame of a story? • define the individual events that make up a story? • identify, evaluate, and select candidate archived web pages to support the events of the story? • visualize the resulting story?
  48. Define the Time Frame of a Story • People remember the name of the story, but not the date – Hurricane Katrina - Aug 29, 2005 – 2011 Egyptian Revolution - Jan 25, 2011 – Boston Marathon Bombing - April 15, 2013 • Some stories have no definitive beginning/ending – BP Gulf Oil Spill - April 20 - September? 2010 (effects and court cases still ongoing) – Egyptian Revolution - which one? (1952, 2011, 2013)
  49. Define the Time Frame of a Story • Propose candidate times based on user query
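One simple way to picture "propose candidate times based on user query" is a lookup from story names to known dates. This is only a sketch under that assumption, not the project's design: the table holds the dates listed on the previous slide (years alone where the slide gives only a year), and the matching rule and the names `STORY_DATES` and `candidate_times` are made up for illustration.

```python
# Dates as given on the slide; a bare year means no exact day was listed.
STORY_DATES = {
    "Hurricane Katrina": ["2005-08-29"],
    "Egyptian Revolution": ["1952", "2011-01-25", "2013"],
    "Boston Marathon Bombing": ["2013-04-15"],
}


def candidate_times(query):
    """Return (story, date) candidates for a free-text query.

    Four-digit numbers in the query are treated as a year filter; the other
    words must all appear in the story name.
    """
    words = query.lower().split()
    years = [w for w in words if w.isdigit() and len(w) == 4]
    terms = set(w for w in words if w not in years)
    if not terms:
        return []

    results = []
    for story, dates in STORY_DATES.items():
        if terms <= set(story.lower().split()):
            for d in dates:
                if not years or d[:4] in years:
                    results.append((story, d))
    return results


print(candidate_times("Egyptian Revolution"))       # ambiguous: three candidates
print(candidate_times("2011 Egyptian Revolution"))  # the year narrows it to one
```

An ambiguous query returns several candidates for the user to choose from, which matches the "which one?" problem raised above.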
  50. Define a Story's Events • Consult hand-crafted timelines • User-provided timelines • Detect themes in relevant archived web pages
  51. Identify Relevant Archived Web Pages • Identify "seed URIs" and query the archive for their existence during the appropriate time – also query for URIs linked from the seed URIs • How to identify seed URIs? – Wikipedia – news sites – social media (tweets, Facebook shares) – Storify
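The seed-URI step can be sketched as two small operations: expand the seeds by one hop of links, then keep only URIs with at least one capture inside the story's time frame. The code below is an illustration under stated assumptions, not the project's implementation: `index` and `outlinks` are hypothetical in-memory stand-ins for what a real archive query and link extraction would return, and all URIs and dates are invented.

```python
from datetime import datetime


def expand_seeds(seed_uris, outlinks):
    """Add URIs linked from the seeds (one hop), keeping order, dropping repeats."""
    expanded = list(dict.fromkeys(seed_uris))
    seen = set(expanded)
    for uri in list(expanded):
        for link in outlinks.get(uri, []):
            if link not in seen:
                seen.add(link)
                expanded.append(link)
    return expanded


def captures_in_window(index, uris, start, end):
    """Return {uri: sorted capture times within [start, end]} for URIs with any."""
    found = {}
    for uri in uris:
        hits = sorted(t for t in index.get(uri, []) if start <= t <= end)
        if hits:
            found[uri] = hits
    return found


# Invented data: which URIs link where, and when each was captured.
outlinks = {"http://example.org/story": ["http://example.org/photos"]}
index = {
    "http://example.org/story": [datetime(2011, 1, 26), datetime(2012, 5, 1)],
    "http://example.org/photos": [datetime(2010, 3, 3)],
}

uris = expand_seeds(["http://example.org/story"], outlinks)
hits = captures_in_window(index, uris, datetime(2011, 1, 25), datetime(2011, 2, 28))
print(hits)
```

Here the linked photo page is dropped because its only capture falls outside the window, while the seed page keeps its single in-window capture.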
  52. Different sources will provide different seed URIs
  53. What about social media pages?
  54. Create your own Facebook archive • May need to allow for user-contributed content Kelly, Nelson, and Weigle, "WARCreate and WAIL: WARC, Wayback, and Heritrix Made Easy," Demo at Digital Preservation 2013. http://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html
  55. Suppose we found 100 relevant pages for each event in the story (I'll add here many copies from bbc, nytimes, foxnews)
  56. Evaluate Relevant Archived Web Pages • Are there duplicate accounts? • What is the reputation, bias, or point of view of the source? • How well was the page archived?
  57. Duplication
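Duplicate accounts of the same event could be detected in many ways; the slides do not say which one the project uses. As one possible illustration, the sketch below compares pages by Jaccard similarity over three-word shingles and keeps the first page of each near-duplicate group. The threshold, function names, and sample texts are assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences in `text` (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of the shingle sets of two texts, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def drop_near_duplicates(pages, threshold=0.8):
    """Keep the first (uri, text) of each group of near-identical texts."""
    kept = []
    for uri, text in pages:
        if all(jaccard(text, other) < threshold for _, other in kept):
            kept.append((uri, text))
    return kept


# Invented page texts: the first two stand for the same wire story on two sites.
pages = [
    ("http://example.org/a", "the president resigns after weeks of protests in the capital"),
    ("http://example.com/b", "the president resigns after weeks of protests in the capital"),
    ("http://example.net/c", "a powerful storm makes landfall along the coast this morning"),
]
print([uri for uri, _ in drop_near_duplicates(pages)])
```

A set-based measure like this ignores word order beyond the shingle length, so it catches syndicated copies but not paraphrased accounts of the same event.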
  58. Reputation of Source
  59. Quality of Archived Page
  60. Select Relevant Archived Web Pages • User will select pages to use in the final story • But user needs to be presented with some choices
  61. Selecting Relevant Pages: Mubarak's Resignation
  62. Visualize the Story • Provide different interactive visualizations that enable exploring the story easily • Provide the user with the ability to modify the story and specify the start and end dates
  63. Using Storify
  64. Interactive Timeline: Replaying Story of Egyptian Revolution
  65. Slideshow • Different View
  66. Research Questions How do we • define the time frame of a story? • define the individual events that make up a story? • identify, evaluate, and select candidate archived web pages to support the events of the story? • visualize the resulting story?
  67. Outline • What is a web archive? • Why are archives important? • What's my story? • How can we help others tell their stories? • Related WS-DL Projects
  68. User Access Patterns AlNoamany, Weigle, and Nelson, "Access Patterns for Robots and Humans in Web Archives", JCDL 2013
  69. Everybody Dips, Humans Dive, Robots Skim. Robots (34,203 sessions), Humans (3,431 sessions). AlNoamany, Weigle, and Nelson, "Access Patterns for Robots and Humans in Web Archives", JCDL 2013
  70. What domains does each archive hold? AlSum, Weigle, Nelson and Van de Sompel, "Profiling Web Archive Coverage for Top-Level Domain and Content Language," TPDL 2013.
  71. What domains does each archive hold? AlSum, Weigle, Nelson and Van de Sompel, "Profiling Web Archive Coverage for Top-Level Domain and Content Language," TPDL 2013.
  72. Sometimes the live web "leaks" into the archive Sept 3, 2008 2012 http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
  73. ODU's WS-DL Group ODU You are here
  74. ODU's WS-DL Group • Our recent work has been featured in the popular press • We're always looking for more great students! Dr. Michele C. Weigle, Old Dominion University, Norfolk, VA mweigle@cs.odu.edu @weiglemc http://www.cs.odu.edu/~mweigle/ http://ws-dl.blogspot.com/
