@WebSciDL PhD Student Project Reviews August 5&6, 2015

Herbert Van de Sompel (LANL) visited the Web Science & Digital Libraries Group @ ODU on August 5--7, 2015. The seven PhD students who were in town at that time reviewed their current status for him.

Published in: Internet

  1. 1. Web Science and Digital Libraries Research Group @WebSciDL Review of Projects for Herbert Van de Sompel, LANL August 5&6, 2015
  2. 2. Corren McCoy Disambiguation of Alumni from Publicly Available Social Media Profiles Presentation for Herbert Van de Sompel 08/05/2015
  3. 3. Let’s  be  Social! Directory Search Name: Michael Nelson College: Old Dominion Degree: Computer Science Year: 1997 2
  4. 4. Motivation Maintain relationships with alumni Interact and re-engage 3 Pew Research Survey, Sept. 2014 LinkedIn is used by 28% of online adults. 23% are between 18-29* Twitter is used by 23% of online adults. 37% are between 18-29 *Pew Research Center noted a significant change in this percentage from 2013
  5. 5. Research Goals • Given discrete set of attributes • Leverage public information • Collect structured/unstructured metadata • Develop a probabilistic matching scheme • Analyze and discover new profile attributes • Connect the networks 4
  6. 6. Seminal Works • Mislove, A., Viswanath, B., Gummadi, K. P., & Druschel, P. (2010, February). You are who you know: inferring user profiles in online social networks. In Proceedings of the third ACM international conference on Web search and data mining (pp. 251-260). ACM. • Northern, C. T., & Nelson, M. L. (2011). An unsupervised approach to discovering and disambiguating social media profiles. In Proceedings of Mining Data Semantics Workshop. • Powell, J., Shankar, H., Rodriguez, M., & Van de Sompel, H. (2014). EgoSystem: Where are our Alumni?. Code4Lib Journal, (24). 5
  7. 7. Our Work is Informed • Mislove: Attribute inference based on a Facebook crawl of a known friends network with matching to a Student or Alumni Directory. • Northern: Examination of digital preservation strategies across social media sites using feature data to score and disambiguate the discovered profiles. • Powell: Aggregation of discovered social and institutional artifacts to a public identity which are linked in a property graph to facilitate searching.
  8. 8. Similarity Metrics
  9. 9. Does it help to know a name? Census Surnames Social Security Administration Name Ranking as of 2014 Michael 7 Nelson 40 Michele ----- Weigle 13,604 First names 19,584 Surnames 150,436 8
  10. 10. Are Vanity Screen Names Re-used? LinkedIn: michaellloydnelson Twitter: phonedude_mln 9
  11. 11. Is the Affiliation Repeated? LinkedIn: Old Dominion University Twitter: Old Dominion University mentioned in bio but could be a false positive 10
  12. 12. How Far Apart in Space? LinkedIn: Norfolk, Virginia Area Twitter: Norfolk, VA 11
  13. 13. Do People Re-use Profile Photos? TinEye Reverse Image Search 12
  14. 14. Do Web Links Point to the Same Page? LinkedIn: http://www.cs.odu.edu/~mln/ http://ws-dl.blogspot.com/ http://f-measure.blogspot.com/ Twitter: cs.odu.edu/~mln/ 13
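The link comparison on this slide can be sketched in a few lines. This is a minimal illustration, not the presenter's implementation: the helper names `normalize_link` and `shared_links` are hypothetical, and the normalization rules (drop the scheme, a leading `www.`, and a trailing slash) are assumptions chosen so that the LinkedIn and Twitter forms of the same page compare equal.

```python
from urllib.parse import urlsplit


def normalize_link(link: str) -> str:
    """Reduce a profile link to a comparable host + path key (assumed rules)."""
    # Twitter often displays links without a scheme, so add one before parsing.
    if "://" not in link:
        link = "http://" + link
    parts = urlsplit(link)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Ignore a trailing slash so "/~mln/" and "/~mln" match.
    return host + parts.path.rstrip("/")


def shared_links(links_a, links_b):
    """Return the normalized links that appear on both profiles."""
    return {normalize_link(a) for a in links_a} & {normalize_link(b) for b in links_b}


linkedin_links = [
    "http://www.cs.odu.edu/~mln/",
    "http://ws-dl.blogspot.com/",
    "http://f-measure.blogspot.com/",
]
twitter_links = ["cs.odu.edu/~mln/"]
print(shared_links(linkedin_links, twitter_links))
```

A non-empty intersection is only one signal among the similarity metrics listed in this talk; it would normally be combined with the name, location, and affiliation evidence.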
  15. 15. Community Analysis Surrogate Connections - People Also Viewed One step from Dr. Nelson One step from Brittany Johnson 14
  16. 16. Community Analysis Disclosed – (Followers?) and Following 15
  17. 17. Property Graph Analysis https://twitter.com/phonedude_mln https://www.linkedin.com/in/michaellloydnelson
  18. 18. Property Graph Analysis https://twitter.com/phonedude_mln https://www.linkedin.com/in/michaellloydnelson Location Norfolk, Virginia area Norfolk, VA
  19. 19. Property Graph Analysis https://twitter.com/phonedude_mln https://www.linkedin.com/in/michaellloydnelson Location Norfolk, Virginia area Norfolk, VA Affiliation Value: Old Dominion Attended
  20. 20. Property Graph Analysis https://twitter.com/phonedude_mln https://www.linkedin.com/in/michaellloydnelson Geo-Location Norfolk, Virginia area Norfolk, VA Affiliation Value: Old Dominion Attended Twitter @ODUNow hasOfficialAccount
  21. 21. Property Graph Analysis https://twitter.com/phonedude_mln https://www.linkedin.com/in/michaellloydnelson Geo-Location Norfolk, Virginia area Norfolk, VA Affiliation Value: Old Dominion Attended Twitter @ODUNow hasOfficialAccount follows
  22. 22. Example Searches
  23. 23. LinkedIn Candidate Search • Leverage Google’s advanced search operators to improve precision. • Trusted information from the Registrar’s Office.
  24. 24. LinkedIn Metadata: How Prevalent are Nicknames? (one value per candidate, separated by |) • Name: Michael Nelson | Mike Nelson | Mike Nelson • Headline: Professor at Old Dominion University | Orthotist / Certified Athletic Trainer | Driver at Old Dominion Freight Line • Location: Norfolk, Virginia Area | Providence, Rhode Island Area | Phoenix, Arizona • URL: https://www.linkedin.com/in/michaellloydnelson | https://www.linkedin.com/in/mikenelson64 | https://www.linkedin.com/pub/mike-nelson/6b/50b/879 • Profile Photo: https://media.licdn.com/mpr/mpr/shrinknp_400_400/p/1/000/019/1d1/39275de.jpg | https://media.licdn.com/mpr/mpr/shrinknp_400_400/p/2/000/02f/11d/3f17849.jpg | ----- • Vanity Screen Name: michaellloydnelson | mikenelson64 • Industry: Research | Hospital & Health Care | Transportation/Trucking/Railroad • Websites: http://www.cs.odu.edu/~mln/; http://ws-dl.blogspot.com/; http://f-measure.blogspot.com/ | ----- | ----- • Affiliation(s): Old Dominion University, 1997-2000; Old Dominion University, 1996-1997; Virginia Polytechnic Institute and State University, 1987-1991 | Old Dominion University, 1999-2001 | -----
  25. 25. Twitter Candidate Search 24
  26. 26. Twitter Metadata: Given and Nickname Search (one value per candidate, separated by |) • User Name: Michael L. Nelson | Mike Nelson | Mike Nelson • Bio: Head of @WebSciDL, Computer Science, Old Dominion University; Formerly: @NASA_Langley (1991-2002), @UNCSILS (2000-2001); OAI-PMH OAI-ORE Memento ResourceSync | ----- | ----- • Location: Norfolk, VA | ----- | ----- • URL: https://twitter.com/phonedude_mln | https://twitter.com/mikenelson64 | ----- • Profile Photo: https://pbs.twimg.com/profile_images/959295176/mln-ad-100x130_400x400.jpg | ----- | ----- • Screen Name: Phonedude_mln | mikenelson64 • Industry: ----- | ----- | ----- • Websites: cs.odu.edu/~mln/ | ----- • Affiliation(s): Old Dominion University in bio. Following @ODUNow official account | ----- | -----
  27. 27. Known Issues • Reliability of Name Searches – Nicknames list from the Northern (2011) study is incomplete. Ignores ethnic given names. – Given and surname data from US census and SSA must exist at a certain threshold to protect privacy. – Naïve calculation of name probabilities. Some name combinations do not occur frequently. • Uncovering social data is difficult – LinkedIn limits use of API to get real connections. – Rate limits on the Twitter API constrain the depth of the followers/following search. 26
  28. 28. Known Issues • Each network takes a different approach to the visibility of metadata – Exploit the structure of LinkedIn – Twitter data is noisy, limited space with no controlled vocabulary 27
  29. 29. By: Alexander Nwala August 5, 2015 Progress Report Presented To: Dr. Herbert Van de Sompel, Dr. Michael Nelson
  30. 30. Progress Report Outline • Past projects • Refactoring Hany’s Carbon date • What Did It Look Like? • I Can Haz Memento • Present research • Exploration of Distributed Information Retrieval • Problem • Goal • Research paths; possible contributions
  31. 31. Carbon date • Estimates the creation date of a URI • The current implementation features a: • Threaded server • Concurrent API requests • Cached responses • This is achieved by picking the least date from these sources: • Last modified date • Bitly • Topsy • Backlinks • Archives Website: http://cd.cs.odu.edu Blog post: http://ws-dl.blogspot.com/2014/11/2014-11-14-carbon-dating-web-version-20.html
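The "least date" rule described on this slide can be sketched as follows. This is an illustration only: the function name `estimate_creation_date`, the dictionary keys, and the example dates are hypothetical, and the sketch assumes each source has already been queried and returned either a date or nothing.

```python
from datetime import datetime
from typing import Dict, Optional


def estimate_creation_date(candidates: Dict[str, Optional[datetime]]) -> Optional[datetime]:
    """Return the earliest date reported by any source, or None if no source answered."""
    found = [date for date in candidates.values() if date is not None]
    return min(found) if found else None


# Made-up dates for one URI; None means the source had no evidence.
example_sources = {
    "last_modified": datetime(2014, 3, 2),
    "bitly": None,
    "topsy": datetime(2013, 11, 20),
    "backlinks": datetime(2014, 1, 5),
    "archives": datetime(2013, 12, 1),
}
print(estimate_creation_date(example_sources))
```

Taking the minimum treats every source as a lower-bound witness: a URI cannot have been tweeted, shortened, or archived before it existed, so the earliest sighting is the best available estimate.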
  32. 32. What Did It Look Like? • Tumblr blog which • Uses the Memento framework to poll various public web archives • Creates an animated image for each year that shows the progression of the site through the years • Everyone is free to nominate web sites to What Did It Look Like? by tweeting: “#whatdiditlooklike URL” Website: http://whatdiditlooklike.mementoweb.org/ Blog post: http://ws-dl.blogspot.com/2015/01/2015-02-05-what-did-it-look-like.html
  33. 33. I Can Haz Memento • Inspired by the “#icanhazpdf” movement and also built upon the Memento framework • For tweets with links containing “#icanhazmemento” • I Can Haz Memento service replies the tweet with a link pointing to: Website: https://twitter.com/icanhazmemento/ Blog post: http://ws-dl.blogspot.com/2015/07/2015-07-22-i-can-haz-memento.html Archived version of the page closest to the time of the tweet
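In the Memento framework (RFC 7089), a client asks a TimeGate for the version closest to a given time by sending an `Accept-Datetime` request header. The sketch below only shows how a tweet's timestamp could be turned into that header; how the I Can Haz Memento service actually builds its requests is not stated on the slide, and the function name `accept_datetime_header` and the example timestamp are assumptions.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(tweet_time: datetime) -> dict:
    """Build the Accept-Datetime header for a timezone-aware tweet timestamp."""
    # The header value uses the HTTP date format and must be expressed in GMT.
    utc_time = tweet_time.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc_time, usegmt=True)}


# Hypothetical tweet time.
example_headers = accept_datetime_header(
    datetime(2015, 7, 22, 12, 0, 0, tzinfo=timezone.utc)
)
print(example_headers)
```

These headers would then accompany a request to a TimeGate for the linked URI, which responds with a redirect to the closest memento.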
  34. 34. Progress Report Outline • Past projects • Refactoring Hany’s Carbon date • What Did It Look Like? • I Can Haz Memento • Present research • Exploration of Distributed Information Retrieval • Problem • Goal • Research paths; possible contributions
  35. 35. Problem :: Undiscoverable resources are not included in SERPs • SERP does not have intended resource: “A kinetic theory for age- structured stochastic birth-death processes” • But resource is available in a special collection (arXiv.org) Case 1, SERP for Query: “stochastic birth-death processes” Google Search arXiv.org Search
  36. 36. Problem :: Information not discoverable from Google does not exist to many web users • 1st page of SERP does not have intended resource: “EPIDEMIOLOGY THROUGH CELLULAR…” Case 2, SERP for Query: “influenza indonesia”
  37. 37. Case 2, SERP for Query: “influenza indonesia” Google Search arXiv.org Search Relevant resource on 7th page Relevant resource on 1st page Problem :: Inconsistent views between SERP and special collections
  38. 38. Problem :: When to stop? • A user potentially misses relevant information because it is NOT presented with search results OR presented too far (e.g. last 7th page) • In other words, if relevant content is not presented in the first n pages (e.g. n < 3), it does not exist ? ? ?
  39. 39. Goal :: Present resources from multiple unindexed sources with Google SERP • This can be achieved through middleware such as a browser plugin 10 more relevant resources1. 2. Click Relevant resource on 1st page
  40. 40. Exploration of DIR :: Problem summary and Goal • Problem • Inconsistent views between SERP and special collections leads to absence of relevant resources in SERPs (Case 1) • If relevant content is not presented in the first n pages (e.g. n < 3), it does not exist (Case 2) • Goal • Present resources from multiple unindexed sources with Google SERP
  41. 41. Exploration of DIR :: Possible research paths • Research Pathway 1: Understanding the search results • Research Pathway 2: Understanding the query • Research Pathway 3: Understanding the data source
  42. 42. Research Pathway 1 vs Research Pathway 2 Research Pathway 2: Understanding the query • Blindly routing every query to every data source is unacceptable • Query understanding • Domain classification of query • Intent recognition of query • Semantic labelling of query • Route only queries that are relevant to the data source, to the data source: e.g. a News related query to a News source, academic queries to academic sources • State of the art targets building statistical machine learning methods to solve the query understanding problem • Include results from data source with SERP Research Pathway 1: Understanding search results • Blindly routing every query to every data source is unacceptable • Understand the search results for clues to unravel nature of query • Are Advertisements present • Are Images present • Are pdfs types present • Route only queries that are relevant to the data source, to the data source: e.g. a News related query to a News source, academic queries to academic sources • State of the art doesn’t focus on search results • Include results from data source with SERP
  43. 43. Research Pathway 1: Find discriminative features for “non-scholarly materials domain” Query length Permutation of Pages Result count Title match Images present HTML resource News present Google knowledge entity present
  44. 44. Research Pathway 1: Find discriminative features for “scholarly materials domain” Query length Permutation of Pages Result count Title subset match PDF resources Notable Absences • Google Knowledge Entity • News • Ads Notable Presence • Non HTML resources (PDF)
  45. 45. Research Pathway 1: What next after finding discriminative features? • Find a dataset (Done) • NASA NTRS query log for scholarly materials domain (400,000+) • AOL 2006 query logs for non-scholarly materials domain (400,000+) • Train a classifier (Not done) • Given a query and a list of search results, classify the query as belonging to one of multiple classes, e.g. (Scholarly material)
  46. 46. Research Pathway 2: Heuristic for unsupervised domain classification Original algorithm 1: • Idea: Given a query and a list of search results, the important terms which co- occur across multiple search results are indicative of the domain of the query. Query 1: Search Engine URIs List doc2 <a, a, a, b, b.., c> doc1 2: Generate unigram vectors, remove redundant terms <a, c, x, y, d, d> <a, p, w, s> docn <a, b, c> <a, c, x, y, d> <a, p, w, s> <a, a, a, b, c, c, d, p, s, w, x, y> 3: Sort <a, a, a> <b> <c, c> <d> <p> <s> <w> <x> <y> 4: Find clusters Domain Set: P
  47. 47. Original algorithm 1 Example: Possible domains for query “Lionel messi” • (terms), 10 of 11 pages • (barcelona"., barcellona-granada, barcelon,, barcelon, barcelona), 9 of 11 pages • (best"., best), 9 of 11 pages • (championship, champion, championship,, champions..., champions:, championships., champions', championships, championship:, champions.", championship-winning, champions, champions".), 9 of 11 pages • (city, city)), 9 of 11 pages • (club, club's, club's...), 9 of 11 pages • (consented, considerably, consecutively)., consecutively,, considered, consent, consistent, conscious, consecutively"., consecutive, considers, consider), 9 of 11 pages • (everybody, every), 9 of 11 pages • (fc, fc.), 9 of 11 pages • (football, football".), 9 of 11 pages • (game"., game".[370], game), 9 of 11 pages Relevant domains based on human judgement
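The four steps of algorithm 1 (fetch results, build de-duplicated unigram vectors, sort, find clusters) can be approximated with a document-frequency count. This is a simplified sketch, not the presenter's code: the function name `cooccurring_terms` and the `min_docs` threshold are assumptions, and it clusters only exact term matches, whereas the Lionel Messi example above also groups near-identical spellings.

```python
import re
from collections import Counter


def cooccurring_terms(documents, min_docs=2):
    """Return (term, number_of_documents) pairs for terms shared across results."""
    doc_freq = Counter()
    for text in documents:
        # Step 2: unigram vector with redundant terms removed (a set per document).
        terms = set(re.findall(r"[a-z0-9]+", text.lower()))
        doc_freq.update(terms)
    # Steps 3 and 4: sort, then keep the clusters that span several documents.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [(term, count) for term, count in ranked if count >= min_docs]


# The toy documents from the algorithm slide.
toy_results = ["a a a b b c", "a c x y d d", "a p w s"]
print(cooccurring_terms(toy_results))
```

On the toy input this keeps `a` (three documents) and `c` (two documents), matching the `<a, a, a>` and `<c, c>` clusters drawn on the slide.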
  48. 48. Original algorithm 2: Heuristic for supervised domain classification • Given a set of predefined domains D: <a, a, a> <b> <c, c> <d> <p> <s> <w> <x> <y> 4: Find clusters Domain set: P … max( similarity (Pi, Di) ) • Similarity • Naive hybrid similarity (Jaccard/Overlap coefficient) • Word net • Explicit Semantic Analysis
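The two set measures named on this slide are standard and easy to state in code. How the "naive hybrid" combines them is not given, so the plain average in `hybrid_similarity` is an assumption, as are the function names and the toy domain vocabularies.

```python
def jaccard(a, b):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_coefficient(a, b):
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def hybrid_similarity(a, b):
    """Assumed combination: the mean of the two coefficients."""
    return (jaccard(a, b) + overlap_coefficient(a, b)) / 2


def best_domain(domain_set_p, predefined_domains):
    """Pick the predefined domain D_i that maximizes similarity(P, D_i)."""
    return max(
        predefined_domains,
        key=lambda name: hybrid_similarity(domain_set_p, predefined_domains[name]),
    )


# Hypothetical vocabularies for two predefined domains.
toy_domains = {
    "sports": {"football", "club", "champion", "game"},
    "news": {"election", "report", "game"},
}
toy_p = {"football", "club", "barcelona", "game"}
print(best_domain(toy_p, toy_domains))
```

The overlap coefficient is more forgiving than Jaccard when one set is much smaller than the other, which is one plausible reason to blend the two.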
  49. 49. Exploration of DIR :: Summary • Problem • There exists an inconsistency between SERPs and special collections, thus many relevant resources are not included in SERPs or • Included too late (e.g. last page) • Goal • Present resources from multiple unindexed sources with Google SERP which can be done through a browser plugin • Research Pathways • Understand the search result and train a model to learn when a query should be forwarded to a special collection • Understand the query, for example the domain, then forward only relevant queries to their respective special collections • Include results from special collection with SERP
  50. 50. TEMPORAL COHERENCE OF COMPOSITE MEMENTOS IN WEB ARCHIVES SCOTT G. AINSWORTH OLD DOMINION UNIVERSITY AUGUST 5, 2015 OLD DOMINION UNIVERSITY
  51. 51. CONTENTS ■ Motivation (Appearances can be deceiving) ■ Background ■ Temporal Coherence ■ Research ■ What’s next? 8/5/2015 Scott G. Ainsworth • Status for Herbert Van de Sompel Visit 2
  52. 52. MOTIVATION TEMPORAL COHERENCE OF COMPOSITE MEMENTOS IN WEB ARCHIVES
  53. 53. APPEARANCES … http://web.archive.org/web/20041209190926/http://www.wunderground.org/cgi-bin/findWeather/getForecast?query=50593 (now 404, but that's a different story…)
  54. 54. … CAN BE DECEIVING Root Memento-Datetime: 2004-12-09T19:09:26
  55. 55. CLEAR OR CLOUDY?
  56. 56. QUESTIONS ■ How prevalent is temporal incoherence? ■ Can Temporal Coherence be improved using ■ Multiple archives? ■ Additional memento selection heuristics? ■ How can Temporal Coherence be conveyed?
  57. 57. BACKGROUND COMPOSITE MEMENTOS COHERENCE STATES COHERENCE PATTERNS TEMPORAL COHERENCE OF COMPOSITE MEMENTOS IN WEB ARCHIVES
  58. 58. COMPOSITE MEMENTO PRESENTATION STRUCTURE URI-M0 URI-M1 URI-M2 URI-Mi-1 ... URI-Mi URI-Mi+1 URI-Mn ...
  59. 59. COHERENCE STATES ■ Prima Facie Coherent Evidence that the memento existed in its archived state when the root was acquired. ■ Prima Facie Violative Evidence … did not exist ... ■ Possibly Coherent Evidence … might have existed ... ■ Probably Violative Evidence … probably did not exist ...
  60. 60. CONSIDER THIS HTML… <html> <img src="foo.jpeg"> </html>
  61. 61. AND THESE RESPONSE HEADERS HTTP/1.1 200 OK Server: Tengine/2.0.3 Date: Mon, 27 Apr 2015 22:03:32 GMT Content-Type: image/jpeg Content-Length: 15632 Connection: keep-alive Memento-Datetime: Tue, 07 Feb 2006 00:58:23 GMT Link: <Memento links deleted...> X-Archive-Orig-server: Apache/1.3.26 (Unix) ApacheJServ/1.1.2 PHP/4.3.4 X-Archive-Orig-etag: "4978-3d10-3e4d822e" X-Archive-Orig-content-length: 15632 X-Archive-Orig-accept-ranges: bytes X-Archive-Orig-date: Tue, 07 Feb 2006 00:58:20 GMT X-Archive-Orig-content-type: image/jpeg X-Archive-Orig-last-modified: Fri, 14 Feb 2003 23:56:30 GMT X-Archive-Orig-connection: close <other headers deleted>
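The two header values that matter for coherence are the capture time (`Memento-Datetime`) and the original server's `Last-Modified`, which the Internet Archive replays as `X-Archive-Orig-last-modified`. A minimal sketch of reading them with the Python standard library follows; the function name `capture_window` is hypothetical, and the header values are copied from the slide above.

```python
from email.utils import parsedate_to_datetime

slide_headers = {
    "Memento-Datetime": "Tue, 07 Feb 2006 00:58:23 GMT",
    "X-Archive-Orig-last-modified": "Fri, 14 Feb 2003 23:56:30 GMT",
}


def capture_window(headers):
    """Return (last_modified_or_None, captured) as timezone-aware datetimes."""
    captured = parsedate_to_datetime(headers["Memento-Datetime"])
    raw_last_modified = headers.get("X-Archive-Orig-last-modified")
    last_modified = parsedate_to_datetime(raw_last_modified) if raw_last_modified else None
    return last_modified, captured


last_modified, captured = capture_window(slide_headers)
# The image was unchanged on the origin server from 2003 until its 2006 capture.
print(last_modified.isoformat(), captured.isoformat())
```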
  62. 62. PRIMA FACIE COHERENT Bracket Pattern: Memento-Datetime + Last-Modified (yes, Last-Modified is sometimes wrong, but many of those cases can be detected)
  63. 63. PRIMA FACIE COHERENT Equal Pattern: simultaneous capture (with an optionally tunable “bubble of simultaneity”)
  64. 64. PRIMA FACIE VIOLATIVE
  65. 65. POSSIBLY COHERENT Closest (or only) memento captured before the root
  66. 66. PROBABLY VIOLATIVE Closest (or only) memento captured after the root but no Last-Modified (possibly indicating a dynamically generated representation)
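The four states and their patterns can be collected into one decision function. This is a sketch of one reading of the slides, not the author's code: the function name `coherence_state` and the `bubble` parameter name are hypothetical, and the prima facie violative rule (captured after the root with a Last-Modified that is also after the root) is inferred, because that slide shows only a figure.

```python
from datetime import datetime, timedelta
from typing import Optional


def coherence_state(root: datetime, captured: datetime,
                    last_modified: Optional[datetime] = None,
                    bubble: timedelta = timedelta(0)) -> str:
    """Classify an embedded memento relative to the root's Memento-Datetime."""
    # Equal pattern: captured together with the root, within a tunable bubble.
    if abs(captured - root) <= bubble:
        return "prima facie coherent"
    # Captured before the root: it might have changed before the root was acquired.
    if captured < root:
        return "possibly coherent"
    # Captured after the root with no Last-Modified evidence.
    if last_modified is None:
        return "probably violative"
    # Bracket pattern: unchanged from before the root until its later capture.
    if last_modified <= root:
        return "prima facie coherent"
    # Assumed rule: modified after the root, so this state did not exist yet.
    return "prima facie violative"


example_root = datetime(2004, 12, 9, 19, 9, 26)
print(coherence_state(example_root, datetime(2006, 2, 7, 0, 58, 23),
                      datetime(2003, 2, 14, 23, 56, 30)))
```

The example pairs the root datetime from the weather slide with the image headers shown earlier purely for illustration; those two slides are not stated to describe the same page.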
  67. 67. TEMPORAL COHERENCE EMBEDDED RESOURCES REPRESENTING COHERENCE TEMPORAL COHERENCE OF COMPOSITE MEMENTOS IN WEB ARCHIVES
  68. 68. TEMPORAL COHERENCE
  69. 69. TEMPORAL COHERENCE 2005-05-14 01:36:08 +9 days +18 days +18 days +7 months +2.1 years
  70. 70. EMBEDDED RESOURCES (Resource | Memento-Datetime | Delta) • http://www.cs.odu.edu | 2005-05-14 01:36:08 • mm_menu.js | 2005-05-23 02:39:12 | 9.0 d • style.css | 2005-05-23 02:39:39 | 9.0 d • gfx-logo-odu-crown.gif | 2005-05-23 02:39:39 | 9.0 d • ddmenu_ddown.js | 2005-05-23 02:39:43 | 9.0 d • university.js | 2005-05-23 02:39:56 | 9.0 d • rmenu_1st_about.png | 2005-06-01 13:40:25 | 18.5 d • rmenu_bottom_229.gif | 2005-06-01 14:07:29 | 18.5 d • shadow-bl.gif | 2005-06-01 14:55:53 | 18.6 d • ecsbdg.jpg | 2005-06-01 14:56:17 | 18.6 d • shadow-br.gif | 2005-06-01 15:18:18 | 18.6 d • gfx-btn-go-dblue.gif | 2005-06-01 15:34:19 | 18.6 d • shadow-tr.gif | 2005-06-01 15:55:57 | 18.6 d • header-right1.gif | 2005-06-01 16:06:16 | 18.6 d • spacer.gif | 2005-06-01 16:23:10 | 18.6 d • jimcheng.gif | 2005-06-01 16:37:39 | 18.6 d • jsmith.gif | 2005-06-01 16:58:50 | 18.6 d • rmenu_1st_featured_alumni.png | 2005-06-01 21:21:45 | 18.8 d • hmenu_college_...-new.png | 2005-12-21 20:14:25 | 7.3 mo • rmenu_1st_upcoming_news.png | 2005-12-21 20:15:14 | 7.3 mo • rmenu_1st_upcoming_events.png | 2005-12-21 21:01:12 | 7.3 mo • lmenu_1st_resources.png | 2005-12-28 17:47:41 | 7.5 mo • bullet_blue_triangle.gif | 2005-12-28 19:43:48 | 7.5 mo • logo-cs.gif | 2005-12-28 19:54:29 | 7.5 mo • rmenu_1st_featured_student.png | 2007-06-12 02:36:07 | 2.1 years • shadow-b.gif | 2007-06-21 02:35:17 | 2.1 years • shadow-r.gif | 404 Not Found • Summary: Embedded Resources 26; Mean Delta 125.9 days; Standard Deviation 207.7 days; Minimum Delta 9.0 days; Maximum Delta 2.1 years
  71. 71. REPRESENTING COHERENCE
  72. 72. REPRESENTING COHERENCE
  73. 73. REPRESENTING COHERENCE
  74. 74. REPRESENTING COHERENCE
  75. 75. REPRESENTING COHERENCE
  76. 76. THE FULL CHART [Figure: mementos by delta for rURI-M; delta axis from -3y to 6y, root Memento-Datetime axis from 2001 to 2013 (marked date 2005-03-10); legend: Probably Coherent, Probably Violative, Prima Facie Coherent, Prima Facie Violative]
  77. 77. RESEARCH DATA SET SAMPLING STATISTICS TEMPORAL COHERENCE OF COMPOSITE MEMENTOS IN WEB ARCHIVES
  78. 78. DATA SET ■ 4,000 sample URI-Rs (JCDL’11 data set) ■ Single and Multiple Archives ■ Two Heuristics: ■ Minimum distance (current default Wayback behavior) ■ choose closest Memento-Datetime ■ Bracket (proposed here) ■ use combination of Memento-Datetime + Last-Modified (when available)
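The difference between the two heuristics can be shown on a toy TimeMap. The minimum-distance rule is as stated on the slide; the bracket rule below is only one plausible reading of "combination of Memento-Datetime + Last-Modified": prefer a memento whose Last-Modified and capture time bracket the root, and fall back to minimum distance otherwise. The function names and the sample dates are hypothetical.

```python
from datetime import datetime


def min_distance(root, mementos):
    """Pick the memento whose Memento-Datetime is closest to the root's.

    mementos is a list of (captured, last_modified_or_None) tuples.
    """
    return min(mementos, key=lambda m: abs(m[0] - root))


def bracket(root, mementos):
    """Assumed bracket rule: prefer mementos with Last-Modified <= root <= captured."""
    bracketing = [m for m in mementos if m[1] is not None and m[1] <= root <= m[0]]
    if bracketing:
        # Among bracketing mementos, take the one captured soonest after the root.
        return min(bracketing, key=lambda m: m[0] - root)
    return min_distance(root, mementos)


sample_root = datetime(2005, 5, 14, 1, 36, 8)
sample_mementos = [
    (datetime(2005, 5, 10), None),                  # 4 days before the root
    (datetime(2005, 6, 1), datetime(2005, 1, 1)),   # 18 days after, unchanged since January
]
print(min_distance(sample_root, sample_mementos))
print(bracket(sample_root, sample_mementos))
```

Here minimum distance picks the earlier capture (only possibly coherent), while the bracket rule picks the later capture whose Last-Modified shows it already existed in that state when the root was acquired.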
  79. 79. SAMPLING & RECOMPOSITION ■ For each sample URI-R (rURI-R): ■ Download available TimeMaps ■ Download a single root Memento per month ■ For each monthly Memento ■ Extract embedded URI-Rs (eURI-Rs) ■ Download TimeMaps for eURI-Rs ■ Download heuristically-best eURI-Ms ■ Repeat recursively ■ Run each heuristic and single-/multi-archive combination
  80. 80. ROOT URI-R STATISTICS • Archival Data: Root URI-Rs archived 2,756 (68.9%); In multiple archives 1,180 (29.5%); Mean archives per URI-R 1.58; Mean mementos per URI-R 124.57 • URI-M Status: 200 OK 82,425 (93.6%); 503 Service Unavailable 4,444 (5.0%); 404 Not found 583 (0.7%); 403 Forbidden 388 (0.4%); Others 214 (0.3%)
  81. 81. EMBEDDED URI-R STATISTICS • Archival Data: Embedded URI-Rs 1,623,127 (19.7 per root URI-M); Embedded URI-Ms available 1,332,993 (93.6%; 15.1 per root URI-M) • URI-M Failure Reasons: Not archived 312,641 (83.9%); 404 Not found 44,852 (12.0%); 403 Forbidden 6,116 (1.6%); 503 Service Unavailable 5,442 (1.5%); Others 3,508 (0.9%)
  82. 82. COMPOSITE MEMENTO (ROOT) COMPLETENESS & COHERENCE (columns: MinDist Single | MinDist Multi | Bracket Single | Bracket Multi) • Completeness (and Missing): Mean Complete 76.1% | 80.2% | 76.2% | 80.3%; Mean Missing 23.9% | 19.8% | 23.8% | 19.7% • Coherence: Mean Prima Facie Coherent 41.0% | 40.9% | 54.7% | 54.6%; Mean Possibly Coherent 27.3% | 28.7% | 12.8% | 14.2%; Mean Probably Violative 2.5% | 5.3% | 2.5% | 5.3%; Mean Prima Facie Violative 5.3% | 5.3% | 6.2% | 6.2% • At least 5% of pages can be shown to have temporal violations! • Multiple archives: +completeness, -coherence?
  83. 83. EMBEDDED MEMENTO COHERENCE (columns: MinDist Single | MinDist Multi | Bracket Single | Bracket Multi) • Counts: Prima Facie Coherent 622,565 | 621,447 | 864,736 | 859,625; Possibly Coherent 497,405 | 466,046 | 244,104 | 215,585; Probably Violative 104,376 | 53,734 | 104,339 | 53,694; Prima Facie Violative 100,760 | 103,662 | 114,062 | 117,469; Totals 1,325,106 | 1,244,889 | 1,327,241 | 1,246,373 • Percentages: Prima Facie Coherent 47.0% | 49.9% | 65.2% | 69.0%; Possibly Coherent 37.5% | 37.4% | 18.4% | 17.3%; Probably Violative 7.9% | 4.3% | 7.9% | 4.3%; Prima Facie Violative 7.6% | 8.3% | 8.6% | 9.4% • At least 7% of embedded resources are used violatively!
  84. 84. WHAT’S NEXT? EQUALITY & SIMILARITY MINOR & MAJOR VIOLATIONS POLICIES & HEURISTICS CONVEYING COHERENCE TEMPORAL COHERENCE OF COMPOSITE MEMENTOS IN WEB ARCHIVES
  85. 85. EQUALITY & SIMILARITY Equality and similarity allow prima facie coherence without Last-Modified Early results: equality yields < 2% improvement
  86. 86. MINOR OR MAJOR VIOLATIONS? ■ This is a temporal violation. But is it meaningful? ■ How to judge? ■ Most archives transform HTML ■ Few support export of original file ■ How to measure similarity on binary files?
  87. 87. POLICY & HEURISTIC TRADEOFFS ■ Speed: minimize distance ■ Completeness: query all archives (not just top k) ■ Accuracy: maximize coherence
  88. 88. CONVEYING COHERENCE How to scale to > 100 embedded mementos? How to convey coherence & contributing archive?
  89. 89. WHAT’S NEXT SUMMARY ■ Equality & Similarity ■ Significance of violation (major? minor?) ■ Policies & Heuristics ■ Conveying Coherence
  90. 90. Progress Report Lulwah Alkwai Presented to: Dr. Herbert Van de Sompel 1
  91. 91. Previous Work JCDL 2015 Paper: “How Well Are Arabic Websites Archived?” Lulwah M. Alkwai, Michael L. Nelson, and Michele C. Weigle We won “Best Student Paper Award” 2 2
  92. 92. English sports websites are more archived than Arabic www.espn.go.com www.kooora.com 3
  93. 93. How do we classify Arabic websites? • Neither: News: alarabiya.net; ccTLD: Not Arabic (.net); GeoIP: Not Arabic country (US) • ccTLD only: E-Marketing: haraj.com.sa; ccTLD: Arabic (.sa); GeoIP: Not an Arabic country (Ireland) • GeoIP only: News: al-watan.com; ccTLD: Not Arabic (.com); GeoIP: Arabic country (Qatar) • Both: Educational: uoh.edu.sa; ccTLD: Arabic (.sa); GeoIP: Arabic country (SA)
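The four-way classification on this slide reduces to two boolean tests. In the sketch below the function name `classify_site` is hypothetical, the ccTLD and country sets are a small illustrative subset rather than the study's full list, and the GeoIP country code is assumed to come from a separate lookup that is not shown.

```python
# Illustrative subset only: Saudi Arabia, Qatar, Egypt, United Arab Emirates.
ARABIC_CCTLDS = {"sa", "qa", "eg", "ae"}
ARABIC_COUNTRY_CODES = {"SA", "QA", "EG", "AE"}


def classify_site(hostname: str, geoip_country: str) -> str:
    """Return 'Both', 'ccTLD only', 'GeoIP only', or 'Neither'."""
    tld = hostname.lower().rstrip(".").rsplit(".", 1)[-1]
    has_cctld = tld in ARABIC_CCTLDS
    has_geoip = geoip_country.upper() in ARABIC_COUNTRY_CODES
    if has_cctld and has_geoip:
        return "Both"
    if has_cctld:
        return "ccTLD only"
    if has_geoip:
        return "GeoIP only"
    return "Neither"


# The four examples from the slide, with ISO country codes for the GeoIP result.
slide_examples = [
    ("alarabiya.net", "US"),
    ("haraj.com.sa", "IE"),
    ("al-watan.com", "QA"),
    ("uoh.edu.sa", "SA"),
]
for host, country in slide_examples:
    print(host, classify_site(host, country))
```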
  94. 94. Selecting seed URIs (Name | Registered | Year | URI | count) • DMOZ | US | 1999 | Dmoz.org/world/arabic | 4,086 • Raddadi | Saudi Arabia | 2000 | Raddadi.com | 3,271 • Star28 | Lebanon | 2004 | Star28.com | 8,386 • Total 15,743 • 15,092 unique seed URIs • 11,014 URIs that existed in the live web
  95. 95. ~41% ~38% ~36% ~39% 872 ~8% Language test intersection testing for Arabic language 6
  96. 96. Total Arabic URIs Dataset = (7,976+292,670) = 300,646 Crawling Arabic seed URIs 7
  97. 97. Findings: Our Arabic language dataset was not largely located in Arabic countries • Only 14.84% had an Arabic ccTLD • Only 10.53% had a GeoIP in an Arabic country • Popular Western domains (e.g., cnn.com, wikipedia.org) appeared in the top 10. Arabic webpages are not particularly well archived or indexed • 46% were not archived • 31% were not indexed by Google. An Arabic webpage is more likely to be... • indexed if it is present in a directory • archived if it is present in DMOZ • archived if it has neither Arabic GeoIP nor Arabic ccTLD. For right now, if you want your Arabic language webpage to be archived, host it outside of an Arabic country and get it listed in DMOZ
  98. 98. Youssef Eldakar, Bibliotheca Alexandrina • Since 2011, the BA crawls have focused on Egyptian content • Seeds are manually selected • Future plans are to cover content related to the Arab world
  99. 99. Current Work Replacements for missing images Goal: Make contribution by finding missing images through context and discover the replacement for the image Example: 10
  100. 100. Motivation • D-Lib Magazine, Jan 2005: “Transparent Format Migration of Preserved Web Content” David S. H. Rosenthal, Thomas Lipkis, Thomas S. Robertson, and Seth Morabito • The main idea was to change a file format that is no longer understandable to a new format without changing the URI • Can this be done for images with 404 responses? • We can define a new response code and Location header, e.g. “210 Not Quite OK, But Close”
  101. 101. Sample log query 0.36.125.141 web.archive.org - [01/Jan/2011:01:30:58 +0000] "GET http://web.archive.org/web/20110101013058/http://www.slaverymuseum.org/IraAtTable.jpeg HTTP/1.1" 404 2135 "http://web.archive.org/web/20030413174118/www.slaverymuseum.org/home.htm" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10" TCP_MISS:SOURCEHASH_PARENT/207.241.227.95 205
  102. 102. Check full URI in the IA > curl -I http://web.archive.org/web/20110101013058/http://www.slaverymuseum.org/IraAtTable.jpeg HTTP/1.1 404 Not Found Server: Tengine/2.1.0 Date: Tue, 04 Aug 2015 18:17:46 GMT Content-Type: text/html;charset=utf-8 Connection: keep-alive set-cookie: wayback_server=73; Domain=archive.org; Path=/; Expires=Thu, 03-Sep-15 18:17:45 GMT; X-Archive-Wayback-Runtime-Error: ResourceNotInArchiveException: http://www.slaverymuseum.org/IraAtTable.jpeg was not found X-Archive-Wayback-Perf: {"IndexLoad":144,"IndexQueryTotal":144,"RobotsFetchTotal":2,"RobotsRedis":1,"RobotsTotal":2,"Total":390} X-Archive-Playback: 0
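Each later check (live web, Time Travel, domain lookup) needs the original URI-R and the requested datetime, both of which are embedded in the Wayback URI-M from the log entry. The sketch below splits them apart; the function name `split_uri_m` and the regular expression are assumptions that cover only the plain 14-digit timestamp form shown on these slides, not every Wayback URI variant.

```python
import re
from datetime import datetime

# Matches http(s)://web.archive.org/web/<14-digit timestamp>/<URI-R>.
WAYBACK_URI_M = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def split_uri_m(uri_m: str):
    """Return (requested datetime, URI-R) for a Wayback URI-M."""
    match = WAYBACK_URI_M.match(uri_m)
    if match is None:
        raise ValueError("not a recognized Wayback URI-M: " + uri_m)
    timestamp, uri_r = match.groups()
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S"), uri_r


requested_at, missing_uri_r = split_uri_m(
    "http://web.archive.org/web/20110101013058/http://www.slaverymuseum.org/IraAtTable.jpeg"
)
print(requested_at, missing_uri_r)
```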
  103. 103. 14 URI requested
  104. 104. 15 Referring URI
  105. 105. Check full URI in the live web > curl -I http://www.slaverymuseum.org/IraAtTable.jpeg HTTP/1.1 404 Not Found Date: Tue, 04 Aug 2015 18:15:34 GMT Server: Apache Content-Type: text/html; charset=iso-8859-1
  106. 106. Check Timetravel 17
  107. 107. Check domain in the live web > curl -I http://www.slaverymuseum.org HTTP/1.1 301 Moved Permanently Date: Tue, 04 Aug 2015 18:26:41 GMT Server: Apache Location: https://vimeo.com/search?q=slaverymuseum.org Content-Type: text/plain; charset=UTF-8
  108. 108. Check image name in new page • Not found
  109. 109. Check leaf page for image name • Not found
  110. 110. Check domain in the IA 21
  111. 111. Check search engine for image surrounding text • Using the “src” and saving the “alt” in HTML (alternative information) as a backup, e.g. • Image src="IraAtTable.jpeg" • alt="Ira Hunter, Jr. and Oni Lasana" <img border="0" src="IraAtTable.jpeg" width="120" height="97" align="top" alt="Ira Hunter, Jr. and Oni Lasana ">
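Pulling the `src` and `alt` values out of the archived referring page can be done with Python's standard `html.parser`. This is a minimal sketch with hypothetical names (`ImageCollector`, `image_search_terms`); it only collects the values that would be used as search terms and does not perform the search itself.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect (src, alt) pairs from every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            # alt is kept as backup search text when the file name alone fails.
            alt_text = (attr_map.get("alt") or "").strip()
            self.images.append((attr_map.get("src"), alt_text))


def image_search_terms(html: str):
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return collector.images


slide_img = ('<img border="0" src="IraAtTable.jpeg" width="120" height="97" '
             'align="top" alt="Ira Hunter, Jr. and Oni Lasana ">')
print(image_search_terms(slide_img))
```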
  112. 112. Searching Google for (IraAtTable.jpeg) 23
  113. 113. 24 Found same src name and parts of the surrounding text
  114. 114. http://signhom.net/professionalshub/wp-content/uploads/ sites/3/2013/11/IraAtTable.jpg 25
  115. 115. >"curl"–I"http://web.archive.org/web/20110101013058/ http://www.slaverymuseum.org/IraAtTable.jpeg"" 210"Not"Quite"OK,"But"Close" Date:"Wed,"05"Aug"2015"12:56:03"GMT" Location:"http://signhom.net/professionalshub/wp' content/uploads/sites/3/2013/11/IraAtTable.jpg" 26 New response code
  116. 116. Summary of approaches • Check full URI in the live web • Check full URI in the IA • Check full URI in Timetravel • Check domain in the live web • Check domain in the IA • Check images in the redirected webpage • Check leaf pages • Check surrounding text in search engines • Compare results of different search engines using image duplication, such as Google's large-scale analysis of images: http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html
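The first few checks in the summary amount to probing an ordered list of candidate URIs and stopping at the first one that answers. The sketch below covers only the live-web and IA checks (the Timetravel lookup is left out because its request format is not shown in the slides). The `probe` callable is a stand-in for an HTTP HEAD request such as the curl -I calls above, and all names are hypothetical.

```python
from urllib.parse import urlsplit

WAYBACK_PREFIX = "http://web.archive.org/web/"


def candidate_uris(image_uri, timestamp):
    """Ordered (label, URI) pairs: full URI first, then the bare domain."""
    parts = urlsplit(image_uri)
    domain = f"{parts.scheme}://{parts.netloc}/"
    return [
        ("live web, full URI", image_uri),
        ("IA, full URI", f"{WAYBACK_PREFIX}{timestamp}/{image_uri}"),
        ("live web, domain", domain),
        ("IA, domain", f"{WAYBACK_PREFIX}{timestamp}/{domain}"),
    ]


def first_hit(candidates, probe):
    """Return the first (label, URI) whose probe reports HTTP 200, else None."""
    for label, uri in candidates:
        if probe(uri) == 200:
            return label, uri
    return None
```

Keeping the probe injectable means the ordering logic can be exercised offline, and the later steps (leaf pages, search engines) could be appended as further candidates.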
  117. 117. Other ideas: Image de-duplication • JCDL 2015: “Identifying Duplicate and Contradictory Information in Wikipedia”, by Sarah Weissman, Samet Ayhan, Joshua Bradley, Jimmy Lin • Can we do the same for the archives by detecting and removing duplicate images? • How many duplicate images? • Which version should be kept?
  118. 118. What has Justin been up to, lately? Justin F. Brunelle Presentation for Herbert Van de Sompel 08/06/2015
  119. 119. A simpler time...
  120. 120. Mass hysteria. Human sacrifices. Dogs and cats living together. <iframe><script>...</script></iframe>
  121. 121. Missing resources (bad) and Temporal violations (worse) http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
  122. 122. http://en.wikipedia.org/wiki/Main_Page January 18th, 2012
  123. 123. http://web.archive.org/web/20120118110520/http://en.wikipedia.org/wiki/Main_Page: January 18th, 2012
  124. 124. Not all tools can crawl equally Live Resource PhantomJS Crawled Heritrix Crawled, Wayback replayed
  125. 125. Current Workflow • Dereference URI-Rs • Archive representation • Extract embedded URI-Rs • Repeat
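The four steps of the current workflow describe a conventional crawl loop over a frontier of URI-Rs. A minimal sketch, assuming the fetching and link-extraction steps are supplied as functions (both parameters are hypothetical placeholders, not Heritrix APIs):

```python
from collections import deque


def crawl(seeds, dereference, extract_embedded, limit=1000):
    """Dereference, archive, extract embedded URI-Rs, repeat."""
    frontier = deque(seeds)
    seen = set(seeds)
    archived = {}
    while frontier and len(archived) < limit:
        uri_r = frontier.popleft()
        representation = dereference(uri_r)
        if representation is None:
            # Nothing came back for this URI-R, so there is nothing to archive.
            continue
        archived[uri_r] = representation
        for embedded in extract_embedded(representation):
            if embedded not in seen:
                seen.add(embedded)
                frontier.append(embedded)
    return archived
```

The loop only follows URI-Rs that appear in the fetched representation, which is why resources loaded later by JavaScript never reach the frontier in this workflow.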
  126. 126. Proposed Workflow
  127. 127. • <script> tags alone are not indicative of a deferred representation. JavaScript can be played back in the archives! • Current workflow not suitable for deferred representations • Use PhantomJS to run JavaScript, interact with the representation • Two-tiered crawling approach to optimize performance
  128. 128. • <script> tags alone are not indicative of a deferred representation. JavaScript can be played back in the archives! • Current workflow not suitable for deferred representations • Use PhantomJS to run JavaScript, interact with the representation • Two-tiered crawling approach to optimize performance • More URI-Rs in the crawl frontier • Runs more slowly but more deeply
  129. 129. Run-time & Frontier size PhantomJS vs. Heritrix To appear: iPres2015
  130. 130. Constructed a classifier for Deferred Representations
  131. 131. Performance metrics of a two-tiered crawling approach
  132. 132. The classifier helps crawl deferred representations most efficiently
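One way to read the two-tiered approach is as a dispatcher: every URI-R gets the fast crawl, and only those the classifier flags as deferred get the slower browser-based crawl. The sketch below is a guess at that control flow under stated assumptions. The classifier and both crawlers are passed in as plain functions because the slides do not describe their features or interfaces.

```python
def two_tier_crawl(uri_rs, is_deferred, fast_crawler, browser_crawler):
    """Run the fast tier on everything; escalate only deferred representations."""
    results = {}
    for uri_r in uri_rs:
        first_pass = fast_crawler(uri_r)
        if is_deferred(first_pass):
            # Second tier: slower, but executes JavaScript (PhantomJS in the talk).
            results[uri_r] = ("browser", browser_crawler(uri_r))
        else:
            results[uri_r] = ("fast", first_pass)
    return results
```

The point of the split is cost: the browser tier runs more slowly, so it is reserved for the representations that need it.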
  133. 133. Current & Future Work • Using PhantomJS to execute actions on the client – Pushing buttons – Selecting drop-downs – Archiving resulting representation changes • Represent representation state in WARCs – Graph structure of embedded resources – Replay in the Wayback Machine http://ws-dl.blogspot.com/2015/06/2015-06-26-phantomjsvisualevent-or.html
  134. 134. Presented by Mat Kelly for Herbert Van de Sompel. Web Science and Digital Libraries Research Lab, Old Dominion University, Norfolk, VA. August 6, 2015
  135. 135. • Software as a support vehicle • Issues under investigation for PhD research topic • Sample access patterns mitigated by new Memento-related entities HVDS Presentation
  136. 136. Building Software as a PhD Researcher: Software as a Support Vehicle
  137. 137. • Purpose: capture what the user sees into a WARC – instead of delegation-by-URI • Barriers: – Restrictive browser extension API (evolved over time) – Wheel inventing (nothing for WARCs in JS) • Perks: – Seeded private web archiving research – Exposed hard-to-archive content Website: http://warcreate.com Blog: http://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html
  138. 138. • “Glue” between institutional tools – hard to configure and use • Native binaries – difficult to maintain but novel • Further facilitated private web archiving interest Website: http://matkelly.com/wail Blog: http://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html
  139. 139. • Integrates live + archived web experience • Become familiar with Memento dynamics & usage patterns • Provide eventual hook into new entities Website: http://matkelly.com/mink Blog: http://ws-dl.blogspot.com/2014/10/2014-10-03-integrating-live-and.html
  140. 140. • Given same input (URI), tools produce varying output • Experiment to measure variance • Identified hard-to-archive resources • Highlighted cutting edge browser-crawler Website: http://acid.matkelly.com Blog: http://ws-dl.blogspot.com/2014/07/2014-07-14-archival-acid-test.html
  141. 141. Current Research
  142. 142. Diagram labels: private archive, private archive, other, private archive, other, private archive
  143. 143. Diagram labels: private archive, private archive, other, private archive, TimeMap, other, private archive
  144. 144. t = k ≠ t = k-1
  146. 146. 90 DAYS AT A TIME, ONLY BACK TO ONE YEAR!
  147. 147. Diagram labels: 1 year ago, 2 years ago, 10 years ago, …, 180 days ago, TimeMap
  148. 148. Diagram label: private archive
  149. 149. Facebook.com replay: what is expected vs. what the tools captured
  150. 150. Archives capturing my homepage and changes to my homepage (time →): Internet Archive (public, aggregated); Archive.today (public, aggregated); Foo Archives (public, non-aggregated); My web archive (private, non-aggregated)
  151. 151. Archives capturing my homepage and changes to my homepage (time →): Internet Archive (public, aggregated); Archive.today (public, aggregated); Foo Archives (public, non-aggregated); My web archive (private, non-aggregated)
  152. 152. Sample Access Patterns
  153. 153. Diagram labels: OR, TimeMap
  154. 154. • More mementos from a superset of sources (TimeMap)
  155. 155. • Patterns 1 and 2 are status quo – provided by framework • Querying web archives currently only considers public web content – URI for lookup • Framework introduces 2 new entities – Memento Meta Aggregator (MMA) – Private Web Archive Adapter (PWAA)
  156. 156. • Functional superset of a Memento aggregator (MA) • Can act as intermediary client to relay MA results to ultimate user • Allows just-in-time (JIT) inclusion of archives – as specified at query time • Set of archives aggregated can be dynamic – e.g., results must not include IA
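The just-in-time and dynamic behaviour described for the meta aggregator can be illustrated by choosing the set of archives per query and merging what each one returns. This is a sketch under assumptions: each archive is modelled as a function returning (timestamp, URI-M) pairs with 14-digit sortable timestamps, and the `include` and `exclude` parameters are hypothetical, not part of any published MMA interface.

```python
def aggregate_timemap(uri_r, archives, include=None, exclude=()):
    """Merge mementos for uri_r from the archives selected at query time.

    archives maps an archive name to a callable that takes a URI-R and
    returns a list of (timestamp, uri_m) pairs.
    """
    selected = [
        name for name in archives
        if (include is None or name in include) and name not in exclude
    ]
    mementos = []
    for name in selected:
        for timestamp, uri_m in archives[name](uri_r):
            mementos.append((timestamp, uri_m, name))
    # Order the merged list chronologically, as a TimeMap reader would expect.
    mementos.sort(key=lambda memento: memento[0])
    return mementos
```

Passing `exclude={"ia"}` (assuming an archive registered under that name) corresponds to the "results must not include IA" case on the slide.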
  157. 157. Diagram labels: MY CAPTURES, MY BANK CAPTURES, Various public web archives, My web archives
  158. 158. Diagram labels: MY CAPTURES, MY BANK CAPTURES; counts: 100, 30, 10
  159. 159. Diagram labels: MY CAPTURES, MY BANK CAPTURES; counts: 100, 30, 10
  160. 160. Diagram labels: MY CAPTURES, MY BANK CAPTURES, NOT AGGREGATED, NOT AGGREGATED; counts: 100, 30, 10, 140
  163. 163. Access via the Meta Aggregator. Diagram labels: MY CAPTURES, MY BANK CAPTURES; counts: 100, 30, 10, 140, 140
  164. 164. Access via the Meta Aggregator …allows our archives to be included. Diagram labels: MY CAPTURES, MY BANK CAPTURES; counts: 100, 30, 10, 15, 140, 155
  165. 165. Diagram labels: MY CAPTURES, MY BANK CAPTURES; counts: 100, 30, 10, 15, 140, 155, 155, 155
  166. 166. Diagram labels: MY CAPTURES, MY BANK CAPTURES, …, Bob’s public CAPTURES, The organization’s public CAPTURES 1, The organization’s public CAPTURES 2; contains A B C D, contains B C D, contains C D; A, B, C, D; counts: 10, 5, 15, 15, 20, 35, 35, 15, 50, 50
  167. 167. • Allow dynamic and JIT set of archives • Superset can be recursively constructed • Sets can be shared. My public captures can be integrated with public web archives’
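Recursive construction of a superset can be pictured as named sets whose members are either archives or other named sets. The sketch below flattens such a definition into the archives that would actually be queried. The set names and the dictionary layout are invented for illustration, since the slides do not give a concrete format.

```python
def resolve(name, sets, _seen=None):
    """Flatten a named set of archives, following nested sets recursively.

    sets maps a set name to a list of members. A member with no entry in
    sets is treated as a leaf archive.
    """
    seen = set() if _seen is None else _seen
    if name in seen:
        # Already expanded on this walk; also guards against cyclic definitions.
        return set()
    seen.add(name)
    members = sets.get(name)
    if members is None:
        return {name}
    archives = set()
    for member in members:
        archives |= resolve(member, sets, seen)
    return archives
```

Because a shared set is just another name, one person's published set can be listed as a member of someone else's without copying its contents.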
  169. 169. • Regulates access to Private Web Archives (PWAs) • Acts as token authorizer • With correct credentials, relays results as if querying the PWA directly
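The adapter's role as token authorizer can be sketched as a two-step exchange: a key is traded for a token, and the token is then presented with each memento request. Everything below is a hypothetical illustration. The class name, method names and the choice to answer an invalid token with an empty list are assumptions, and the token is generated with Python's standard secrets module rather than any scheme named in the talk.

```python
import secrets


class PrivateWebArchiveAdapter:
    """Gatekeeper in front of one private web archive (PWA)."""

    def __init__(self, authorized_keys, timemap_lookup):
        self._keys = set(authorized_keys)
        self._lookup = timemap_lookup  # callable: URI-R -> iterable of mementos
        self._tokens = set()

    def get_token(self, key):
        """Exchange a valid key for a fresh token; return None otherwise."""
        if key not in self._keys:
            return None
        token = secrets.token_hex(8)
        self._tokens.add(token)
        return token

    def get_mementos(self, uri_r, token):
        """Relay the archive's mementos only when the token is recognised."""
        if token not in self._tokens:
            return []
        return list(self._lookup(uri_r))
```

Returning an empty list for an unrecognised token corresponds to the "0 mementos" outcome; raising an error instead would correspond to "access denied".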
  170. 170. Diagram labels: MY CAPTURES, MY BANK CAPTURES; GET TOKEN for PWA, Key: abcd1234; counts: 100, 30, 10; 3 captures, 10,000 captures
  171. 171. Diagram labels: MY CAPTURES, MY BANK CAPTURES; GET TOKEN for PWA, Key: abcd1234; counts: 100, 30, 10; 3 captures, 10,000 captures
  172. 172. Diagram labels: MY CAPTURES, MY BANK CAPTURES; ACCESS OK, Token: 4f33c64; counts: 100, 30, 10; 3 captures, 10,000 captures
  173. 173. Diagram labels: MY CAPTURES, MY BANK CAPTURES; GET mementos for URI, Token: 4f33c64; counts: 100, 30, 10; 3 captures, 10,000 captures
  174. 174. Diagram labels: MY CAPTURES, MY BANK CAPTURES; GET mementos for URI, Token: 4f33c64; counts: 100, 30, 10; 3 captures, 10,000 captures
  175. 175. Diagram labels: MY CAPTURES, MY BANK CAPTURES; Token: 4f33c64 OK; GET mementos for URI; GET mementos for URI; counts: 100, 30, 10; 3 captures, 10,000 captures
  176. 176. Diagram labels: MY CAPTURES, MY BANK CAPTURES; Token: 4f33c64 OK; Returning mementos; Return mementos for URI; counts: 100, 30, 10; 3 captures, 10,000 captures
  177. 177. Diagram labels: MY CAPTURES, MY BANK CAPTURES; TimeMap, TimeMap, TimeMap; counts: 100, 30, 10; 3 captures, 10,000 captures; 140, 10,000, 10,000, 10,143; 140 captures
  178. 178. Diagram labels: MY CAPTURES, MY BANK CAPTURES; TimeMap, TimeMap, TimeMap; counts: 100, 30, 10; 3 captures, 10,000 captures; 10,143; 140 captures, 3 captures, 10,000 captures
  179. 179. Diagram labels: MY CAPTURES, MY BANK CAPTURES; TimeMap; counts: 100, 30, 10; 3 captures, 10,000 captures; 10,143 captures
  180. 180. ... , <http://web.archive.org/web/20150228155703/https://facebook.com/>;rel="memento"; datetime="Sat, 28 Feb 2015 15:57:03 GMT" , <http://web.archive.org/web/20150228163939/http://www.facebook.com/>;rel="memento"; datetime="Sat, 28 Feb 2015 16:39:39 GMT" , <http://web.archive.org/web/20150303162841/https://www.facebook.com/>;rel="memento"; datetime="Tue, 03 Mar 2015 16:28:41 GMT" , <http://users2machine.local/web/20150305000101/https://www.facebook.com/>;rel="memento"; datetime="Thu, 05 Mar 2015 00:01:00 GMT"; key="e395935019ee467c797034ee410cc91e" , <//wayback.archive-it.org/all/20150305215922/https://facebook.com/>;rel="memento"; datetime="Thu, 05 Mar 2015 21:59:22 GMT" , <http://previouslyUnaggregated.org/web/20150306123457/https://www.facebook.com/>;rel="memento"; datetime="Fri, 06 Mar 2015 12:34:57 GMT" , <http://web.archive.org/web/20150310140721/https://www.facebook.com/>;rel="memento"; datetime="Tue, 10 Mar 2015 14:07:21 GMT" ... TimeMap
  181. 181. ... , <http://web.archive.org/web/20150228155703/https://facebook.com/>;rel="memento"; datetime="Sat, 28 Feb 2015 15:57:03 GMT" , <http://web.archive.org/web/20150228163939/http://www.facebook.com/>;rel="memento"; datetime="Sat, 28 Feb 2015 16:39:39 GMT" , <http://web.archive.org/web/20150303162841/https://www.facebook.com/>;rel="memento"; datetime="Tue, 03 Mar 2015 16:28:41 GMT" , <http://users2machine.local/web/20150305000101/https://www.facebook.com/>;rel="memento"; datetime="Thu, 05 Mar 2015 00:01:00 GMT"; key="e395935019ee467c797034ee410cc91e" , <//wayback.archive-it.org/all/20150305215922/https://facebook.com/>;rel="memento"; datetime="Thu, 05 Mar 2015 21:59:22 GMT" , <http://previouslyUnaggregated.org/web/20150306123457/https://www.facebook.com/>;rel="memento"; datetime="Fri, 06 Mar 2015 12:34:57 GMT" , <http://web.archive.org/web/20150310140721/https://www.facebook.com/>;rel="memento"; datetime="Tue, 10 Mar 2015 14:07:21 GMT" ... MY PRIVATE FACEBOOK CAPTURES
  182. 182. ... , <http://web.archive.org/web/20150228155703/https://facebook.com/>;rel="memento"; datetime="Sat, 28 Feb 2015 15:57:03 GMT" , <http://web.archive.org/web/20150228163939/http://www.facebook.com/>;rel="memento"; datetime="Sat, 28 Feb 2015 16:39:39 GMT" , <http://web.archive.org/web/20150303162841/https://www.facebook.com/>;rel="memento"; datetime="Tue, 03 Mar 2015 16:28:41 GMT" , <http://users2machine.local/web/20150305000101/https://www.facebook.com/>;rel="memento"; datetime="Thu, 05 Mar 2015 00:01:00 GMT"; key="e395935019ee467c797034ee410cc91e" , <//wayback.archive-it.org/all/20150305215922/https://facebook.com/>;rel="memento"; datetime="Thu, 05 Mar 2015 21:59:22 GMT" , <http://previouslyUnaggregated.org/web/20150306123457/https://www.facebook.com/>;rel="memento"; datetime="Fri, 06 Mar 2015 12:34:57 GMT" , <http://web.archive.org/web/20150310140721/https://www.facebook.com/>;rel="memento"; datetime="Tue, 10 Mar 2015 14:07:21 GMT" ... MY PRIVATE FACEBOOK CAPTURES: NOT RFC 5988 COMPLIANT!
  183. 183. ... , <http://web.archive.org/web/20150228155703/https://facebook.com/>;rel="memento"; datetime="Sat, 28 Feb 2015 15:57:03 GMT" , <http://web.archive.org/web/20150228163939/http://www.facebook.com/>;rel="memento"; datetime="Sat, 28 Feb 2015 16:39:39 GMT" , <http://web.archive.org/web/20150303162841/https://www.facebook.com/>;rel="memento"; datetime="Tue, 03 Mar 2015 16:28:41 GMT" , <http://users2machine.local/web/20150305000101/https://www.facebook.com/>;rel="memento"; datetime="Thu, 05 Mar 2015 00:01:00 GMT"; key="e395935019ee467c797034ee410cc91e" , <//wayback.archive-it.org/all/20150305215922/https://facebook.com/>;rel="memento"; datetime="Thu, 05 Mar 2015 21:59:22 GMT" , <http://previouslyUnaggregated.org/web/20150306123457/https://www.facebook.com/>;rel="memento"; datetime="Fri, 06 Mar 2015 12:34:57 GMT" , <http://web.archive.org/web/20150310140721/https://www.facebook.com/>;rel="memento"; datetime="Tue, 10 Mar 2015 14:07:21 GMT" ... MY PUBLIC FACEBOOK CAPTURES
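A client that receives a TimeMap like the one above has to split it into entries and read each entry's attributes, including any extension attribute such as the key on the private capture. The following is a small sketch using regular expressions rather than a full link-format parser. It assumes every attribute value is double-quoted, as in the slide, and the function name is hypothetical.

```python
import re
from email.utils import parsedate_to_datetime

# One entry: <URI> followed by any number of ;name="value" attributes.
ENTRY = re.compile(r'<\s*([^>]*?)\s*>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
PARAM = re.compile(r';\s*([\w-]+)\s*=\s*"([^"]*)"')


def parse_timemap(text):
    """Return a list of dicts with 'uri', parsed 'datetime', and other attributes."""
    entries = []
    for match in ENTRY.finditer(text):
        entry = {"uri": match.group(1)}
        entry.update(dict(PARAM.findall(match.group(2))))
        if "datetime" in entry:
            entry["datetime"] = parsedate_to_datetime(entry["datetime"])
        entries.append(entry)
    return entries
```

Matching on the angle brackets instead of splitting on commas matters here, because the datetime values themselves contain commas.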
  184. 184. Diagram labels: MY CAPTURES, MY BANK CAPTURES; GET mementos for URI, Token: 4f33c64; GET mementos for URI, Token: c5463b4; GET TOKEN for PWA, Key: 2265eef3; No/invalid token returned; Access denied or 0 mementos; 3 captures, 10,000 captures
  185. 185. Diagram labels: MY BANK CAPTURES, Linda’s Private Captures, Bob’s Private Captures; GET TOKENs for PWAs: Key: abcd1234, Archive: My; Key: cab45cbf, Archive: Linda; Key: b0b01b, Archive: Bob; 3 captures, 5 captures, 10 captures; counts: 5, 3, 10
  186. 186. Diagram labels: MY BANK CAPTURES, Linda’s Private Captures, Bob’s Private Captures; Access OK, Token: 7790ca; Access OK, Token: b0b01b; ACCESS DENIED; 3 captures, 5 captures, 10 captures; counts: 5, 3, 10
  187. 187. Diagram labels: MY BANK CAPTURES, Linda’s Private Captures, Bob’s Private Captures; GET mementos for URI: Token: 7790ca, Archive: My; Token: null, Archive: Linda; Token: b0b01b, Archive: Bob; 3 captures, 5 captures, 10 captures; counts: 5, 3, 10, 3, 10, ø, 13
  188. 188. New Software: • Preserve Private Web Content • Simulate & Quickly Deploy Private Web Archives • Interface with New Entities Using Memento
  189. 189. • Background research on state-of-the-art • Exploring use cases – existing, anticipated, and fabricated • Resisting desire to code
  190. 190. • Why? – No means exists to integrate private and public web archives. • How to Evaluate? – Does this framework fit real world needs? Scalable? • When will I know I am done? – Any public/private web archive* can be integrated. * …-compliant
