Big Data: Indexing ~50Tb of URIs

  • 422 views
Uploaded on

Presentation at the Big Data interest group at Los Alamos National Laboratory, regarding with with the IIPC to merge the indexes of multiple web archives.

Presentation at the Big Data interest group at Los Alamos National Laboratory, regarding with with the IIPC to merge the indexes of multiple web archives.

More in: Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
    Be the first to like this
No Downloads

Views

Total Views
422
On Slideshare
0
From Embeds
0
Number of Embeds
0

Actions

Shares
Downloads
2
Comments
0
Likes
0

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Using DISC to Reindex ~50Tb of URIs Robert Sanderson rsanderson@lanl.gov @azaroth42 Herbert Van de Sompel herbertv@lanl.gov @hvdsomp http://www.mementoweb.org/Towards Seamless Navigation of the Web of the Past Big Data Interest Group, LANL, Feb 21st 2013 1 [Unclassified]
  • 2. SummaryMementoProject OverviewParticipantsThe Plan …… and the RealityTechnical DetailsNext Run Big Data Interest Group, LANL, Feb 21st 2013 2 [Unclassified]
  • 3. Memento Project “ Provide web citizens seamless access to the web of the past ”Some accolades:•  $1M funding from the Library of Congress•  Winner of 2010 International Digital Preservation Award•  Accepted by International Internet Preservation Consortium as the way forwards for access to web archives•  Nominated for Best Paper at JCDL 2010•  Best Poster at JCDL 2012•  Tim Berners-Lee @ LDOW10: “We absolutely need this”•  Internet Draft 6 (final call) by May Big Data Interest Group, LANL, Feb 21st 2013 3 [Unclassified]
  • 4. Old Copies have New NamesEveryone knows the URL for CNN is: http://www.cnn.com/Everyone knows the URL for CNN was: http://www.cnn.com/ Big Data Interest Group, LANL, Feb 21st 2013 4 [Unclassified]
  • 5. Old Copies have New NamesCopies of the past versions of to http://www.cnn.com/ exist… but you don’t go there to get them, they have new URLs… and there’s no way to automatically discover those new URLs http://web.archive.org/web/20031013111028/http://www2.cnn.com/ Big Data Interest Group, LANL, Feb 21st 2013 5 [Unclassified]
  • 6. Discovery is HardPeople want to find, not search… and especially not searching by hand! http://web.archive.org/web/*/http://www2.cnn.com/ Big Data Interest Group, LANL, Feb 21st 2013 6 [Unclassified]
  • 7. Navigating in the PastLinks to the real resource bring you back out of the past… or trap you in a single incomplete archive. Pentagon Dec 20 2001, 4:51:00 UTC current http://web.archive.org/web/*/http://www2.cnn.com/ Big Data Interest Group, LANL, Feb 21st 2013 7 [Unclassified]
  • 8. Memento: Time Travel for the Web TimeGate: Resource that knows the locations of archived copies Memento: An archived copy of previous state of a web resourceLink header to TimeGate 302 redirect to Memento How does the TimeGate know where the appropriate Memento is? Big Data Interest Group, LANL, Feb 21st 2013 8 [Unclassified]
  • 9. Project OverviewGoal:•  To aggregate the metadata of the distributed archives of the IIPC, and •  To provide Memento based access to the holdings of open archives •  To provide knowledge of the holdings of restricted archives •  To provide knowledge to IIPC members of the holdings of totally closed archives Big Data Interest Group, LANL, Feb 21st 2013 9 [Unclassified]
  • 10. Experiment Participants•  Austrian National Library•  Bibliothèque Nationale de France• British Library• Institut National de lAudiovisuel• Internet Archive• Koninklijke Bibliotheek• Library of Congress•  Netarchive.dk•  Swiss National Library•  University of North Texas• Los Alamos National Laboratory• Old Dominion University Big Data Interest Group, LANL, Feb 21st 2013 10 [Unclassified]
  • 11. DataCanonical URL kosovakosovo.com/photo.php?id=5785 Datestamp 20090608161553 Request URL http://www.kosovakosovo.com/photo.php?id=5785 MIME type text/html HTTP Status 200 Checksum M36MRHSBVPLKMUN6PFOIEV3AH5ADITAN Redirect? - Bytes 563096 Storage File AIT-1068-20090608161511.warc.gz Multiplied by 6Tb compressed, ~50Tb uncompressed Big Data Interest Group, LANL, Feb 21st 2013 11 [Unclassified]
  • 12. The Plan…•  To provide fast access to distributed archives, LANL would merge the indexes of the holdings of multiple archives and provide Memento based access•  Step 1: Library of Congress gathers CDX files Step 2: LANL indexes (a.k.a. “…”) Step 3: Profit•  Data: 6T of gzipped CDX files (mostly from IA) •  Shipped on hard drives•  Computing: 210 node DISC cluster at LANL •  2x 2ghz processors, 2x 2T HDD, 8G RAM Big Data Interest Group, LANL, Feb 21st 2013 12 [Unclassified]
  • 13. … and the Reality•  Hardware failure killed one of the drives en route •  Transferred remaining files via BagIt from LoC•  DISC has restricted access: •  Had to transfer data over intranet •  2 weeks to sync (5Mb/sec) •  And then 2 weeks to get the processed results off•  Compute cluster has faulty switch, unreliable nodes: •  Ran original processing 15 times without success due to hardware failures Big Data Interest Group, LANL, Feb 21st 2013 13 [Unclassified]
  • 14. Processing Design•  For each CDX file, •  For each URI + timestamps, •  Map URI to an appropriate database slice •  Merge timestamps with those of previous CDXs•  Possible because: •  No need to do truncated search •  No need to walk through URIs in order •  No need for time based access, only URI•  Problem is “Embarrassingly Parallel” Big Data Interest Group, LANL, Feb 21st 2013 14 [Unclassified]
  • 15. Approach 1: Online Messaging•  25 read nodes, 1 control node, 150 write nodes•  Messages (1000 URIs) sent via control node to write•  Failed 15 times due to hardware issues Big Data Interest Group, LANL, Feb 21st 2013 15 [Unclassified]
  • 16. Approach 1: Implementation•  Python•  MPI L•  PyMPI LL•  No failure detection, no auto restart, no zombie killing, no … LLL•  Several months of experiments to find issues, fine tune parameters most likely to complete… Big Data Interest Group, LANL, Feb 21st 2013 16 [Unclassified]
  • 17. Approach 2: No Interaction•  43 read/split nodes•  Phase 1: Read nodes split CDX files to 3000 slices Big Data Interest Group, LANL, Feb 21st 2013 17 [Unclassified]
  • 18. Approach 2: No Interaction•  Phase 2a: Transfer CDX slices to Control node•  Phase 2b: Transfer CDX slices to Write nodes Big Data Interest Group, LANL, Feb 21st 2013 18 [Unclassified]
  • 19. Approach 2: No Interaction•  50 write nodes (* 60 slices each = 3000 slices)•  Phase 3: Merge slices from nodes to BerkeleyDBs Big Data Interest Group, LANL, Feb 21st 2013 19 [Unclassified]
  • 20. Next Steps•  Re-index, using non-interactive approach •  New data: 14 Tb (~120Tb uncompressed?) •  Use FileMap for better automation?•  Two eSata 8T LaCie cubes, 2 3T internal drives to install locally•  More intelligent data partitioning•  Some sort of error detection and handling! Big Data Interest Group, LANL, Feb 21st 2013 20 [Unclassified]
  • 21. Memento http://mementoweb.org/ Robert Sanderson rsanderson@lanl.gov @azaroth42 Herbert Van de Sompel herbertv@lanl.gov @hvdsompBig Data Interest Group, LANL, Feb 21st 2013 21 [Unclassified]