The Elephant in the Library

  1. SCAPE. The Elephant in the Library: Integrating Hadoop. Clemens Neudecker (@cneudecker), Sven Schlarb (@SvenSchlarb)
  2. Contents 1. Background: Digitization of cultural heritage 2. Numbers: Scaling up! 3. Challenges: Use cases and scenarios 4. Outlook
  3. 1. Background “The digital revolution is far more significant than the invention of writing or even of printing” Douglas Engelbart
  4. Then
  5. Our libraries • National Library of the Netherlands, The Hague (www.kb.nl): founded in 1798, 120,000 visitors per year, 6 million documents, 260 FTE • Austrian National Library, Vienna (www.onb.ac.at): founded in the 14th century, 300,000 visitors per year, 8 million documents, 300 FTE
  6. Digitization Libraries are rapidly transforming from physical… to digital…
  7. Transformation Curation Lifecycle Model from Digital Curation Centre www.dcc.ac.uk
  8. Now
  9. Digital Preservation
  10. Our data – cultural heritage • Traditionally • Bibliographic and other metadata • Images (Portraits/Pictures, Maps, Posters, etc.) • Text (Books, Articles, Newspapers, etc.) • More recently • Audio/Video • Websites, Blogs, Twitter, Social Networks • Research Data/Raw Data • Software? Apps?
  11. 2. Numbers “A good decision is based on knowledge and not on numbers” Plato, 400 BC
  12. Numbers (I) National Library of the Netherlands • Digital objects • > 500 million files • 18 million digital publications (+2M/year) • 8 million newspaper pages (+4M/year) • 152,000 books (+100k/year) • 730,000 websites (+170k/year) • Storage • 1.3 PB (currently 458 TB used) • Growing approx. 150 TB a year
  13. Numbers (II) Austrian National Library • Digital objects • 600,000 volumes being digitised during the next years (currently 120,000 volumes, 40 million pages) • 10 million newspapers and legal texts • 1.16 billion files in the web archive from > 1 million domains • Several hundred thousand images and portraits • Storage • 84 TB • Growing approx. 15 TB a year
  14. Numbers (III) • Google Books Project • 2012: 20 million books scanned (approx. 7,000,000,000 pages) • www.books.google.com • Europeana • 2012: 25 million digital objects • All metadata licensed CC-0 • www.europeana.eu/portal
  15. Numbers (IV) • Hathi Trust • 3,721,702,950 scanned pages • 477 TBytes • www.hathitrust.org • Internet Archive • 245 billion web pages archived • 10 PBytes • www.archive.org
  16. Numbers (V) • What can we expect? • Enumerate 2012: only about 4% digitised so far • Strong growth of born-digital information (Sources: www.idc.com, security.networksasia.net)
  17. 3. Challenges “What do you do with a million books?” Gregory Crane, 2006
  18. Making it scale Scalability in terms of … • size • number • complexity • heterogeneity
  19. SCAPE • SCAPE = SCAlable Preservation Environments • €8.6M EU funding, Feb 2011 – July 2014 • 20 partners from public sector, academia, industry • Main objectives: • Scalability • Automation • Planning www.scape-project.eu
  20. Use cases (I) • Document recognition: From image to XML • Business case: • Better presentation options • Creation of eBooks • Full-text indexing
  21. Use cases (II) • File type migration: JP2k → TIFF • Business case: • Original migration to JP2k was done to reduce storage costs • Reverse process used in case JP2k becomes obsolete
  22. Use cases (III) • Web archiving: Characterization of web content • Business case: • What is in a Top Level Domain? • What is the distribution of file formats? • http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits • Image: xkcd.com/688
  23. Use cases (IV) • Digital Humanities: Making sense of the millions • Business case: • Text mining & NLP • Statistical analysis • Semantic enrichment • Visualizations Source: www.open.ac.uk/
  24. Enter the Elephants… Source: Biopics
  25. Experimental Cluster
  26. Execution environment (architecture diagram showing: the Hadoop cluster with its JobTracker, a Taverna Server with REST API, a web application running on Apache Tomcat, and a file server)
  27. Scenarios (I) Log file analysis • Metadata log files generated by the web crawler during the harvesting process (no mime type identification – just the mime types returned by the web server). Sample records:
      20110830130705 9684 46 16 3 image/jpeg http://URL at IP 17311 200
      20110830130709 9684 46 16 3 image/jpeg http://URL at IP 22123 200
      20110830130710 9684 46 16 3 image/gif http://URL at IP 9794 200
      20110830130707 9684 46 16 3 image/jpeg http://URL at IP 40056 200
      20110830130704 9684 46 16 3 text/html http://URL at IP 13149 200
      20110830130712 9684 46 16 3 image/gif http://URL at IP 2285 200
      20110830130712 9684 46 16 3 text/html http://URL at IP 415 301
      20110830130710 9684 46 16 3 text/html http://URL at IP 7873 200
      20110830130712 9684 46 16 3 text/html http://URL at IP 632 302
      20110830130712 9684 46 16 3 image/png http://URL at IP 679 200
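      A minimal sketch of how such records could be aggregated with Hadoop MapReduce, counting records per server-reported mime type in word-count style. It assumes the mime type is the sixth whitespace-separated field of each line; the class and job names are illustrative, not the SCAPE implementation.

      import java.io.IOException;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

      public class CrawlLogMimeCount {

        // Emits (mime type, 1) for every crawl log record.
        public static class MimeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text mime = new Text();

          @Override
          protected void map(LongWritable key, Text value, Context ctx)
              throws IOException, InterruptedException {
            String[] fields = value.toString().trim().split("\\s+");
            if (fields.length > 5) {          // sixth field: mime type returned by the web server
              mime.set(fields[5]);
              ctx.write(mime, ONE);
            }
          }
        }

        public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "crawl-log-mime-count");
          job.setJarByClass(CrawlLogMimeCount.class);
          job.setMapperClass(MimeMapper.class);
          job.setCombinerClass(IntSumReducer.class);   // sums the 1s, word-count style
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }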
  28. Scenarios (II) Web archiving: File format identification → Run file type identification on archived web content. (Diagram: the Heritrix web crawler writes (W)ARC containers; a (W)ARC RecordReader feeds the archived records (JPG, GIF, HTM, MID, ...) into a MapReduce job; the map phase uses Apache Tika to detect the MIME type of each record from its content; the reduce phase aggregates counts per MIME type, e.g. image/jpg 1, image/gif 1, text/html 2, audio/midi 1.)
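      At its core the map step of this pipeline is one Apache Tika detect() call per archived record. The simplified mapper sketch below assumes a custom (W)ARC RecordReader (not shown) that hands each record to the mapper as (record URL, payload bytes); the class name is illustrative.

      import java.io.IOException;
      import org.apache.hadoop.io.BytesWritable;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.tika.Tika;

      // Detection is done on the content itself, not on the MIME type reported by the server.
      public class TikaDetectMapper extends Mapper<Text, BytesWritable, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Tika tika = new Tika();
        private final Text mime = new Text();

        @Override
        protected void map(Text url, BytesWritable payload, Context ctx)
            throws IOException, InterruptedException {
          // Tika sniffs the magic bytes (and optionally the resource name) to detect the MIME type.
          mime.set(tika.detect(payload.copyBytes(), url.toString()));
          ctx.write(mime, ONE);   // the reduce phase sums these into per-MIME-type counts
        }
      }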
  29. Scenarios (II) Web archiving: File format identification → Using MapReduce to calculate statistics (chart comparing results obtained with DROID 6.01 and Tika 1.0)
  30. Scenarios (III) File format migration • Risk of format obsolescence • Quality assurance • File format validation • Original/target image comparison • Imagine runtime of 1 minute per image for 200 million pages ...
  31. Parallel execution of file format validation using a Mapper • Jpylyzer (Python) • JHOVE2 (Java)
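      One way to parallelise such command-line validators is a map-only job whose mapper shells out to the tool for every image path. Below is a rough sketch for Jpylyzer, assuming jpylyzer is installed on every worker node and the input is a text file of JP2 paths on a share mounted on all nodes; the class name and the crude check on the XML report are illustrative, not the SCAPE implementation.

      import java.io.BufferedReader;
      import java.io.IOException;
      import java.io.InputStreamReader;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      // Map-only job: each mapper runs jpylyzer on its share of paths and emits (path, verdict) pairs.
      public class JpylyzerValidationMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          String path = value.toString().trim();
          if (path.isEmpty()) return;

          Process p = new ProcessBuilder("jpylyzer", path).redirectErrorStream(true).start();
          StringBuilder xml = new StringBuilder();
          try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
              xml.append(line);
            }
          }
          p.waitFor();

          // Crude check on the jpylyzer XML report; a real job would parse the XML properly.
          String verdict = xml.indexOf(">True</isValid>") >= 0 ? "valid" : "invalid";
          ctx.write(new Text(path), new Text(verdict));
        }
      }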
  32. • Feature extraction requires sharing resources between processing steps • It is a challenge to model more complex image comparison scenarios, e.g. detection of duplicate book pages or comparison of digitised books
  33. Scenarios (IV) Book page analysis
  34. Create a text file containing the JPEG2000 input file paths and read image metadata using ExifTool via the Hadoop Streaming API
  35. Reading image metadata (Jp2PathCreator → HadoopStreamingExiftoolRead): a find over the NAS produces a text file with one JP2 path per line (e.g. /NAS/Z119585409/00000001.jp2), about 1.4 GB of paths; Hadoop Streaming then runs ExifTool on each file and writes a page identifier plus image width per line (e.g. Z119585409/00000001 2345), about 1.2 GB of output. For 60,000 books (24 million pages): ~5 h + ~38 h = ~43 h.
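      The slides use the Hadoop Streaming API for this step. Purely as an illustration of what each task does, here is the same idea written as a plain Java mapper that shells out to ExifTool; the "-s -s -s -ImageWidth" invocation is standard ExifTool usage, while the class name and the key derivation from the NAS path are assumptions.

      import java.io.BufferedReader;
      import java.io.IOException;
      import java.io.InputStreamReader;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      // Input: the Jp2PathCreator text file with one JP2 path per line, e.g. /NAS/Z119585409/00000001.jp2
      // Output: "<bookId>/<pageId>  <image width>", matching the example on the slide.
      public class ExiftoolWidthMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          String path = value.toString().trim();
          if (path.isEmpty()) return;

          // "-s -s -s" prints only the tag value, e.g. "2345"
          Process p = new ProcessBuilder("exiftool", "-s", "-s", "-s", "-ImageWidth", path).start();
          String width;
          try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            width = r.readLine();
          }
          p.waitFor();

          // Derive the "Z119585409/00000001" style key from the NAS path.
          String[] parts = path.split("/");
          String id = parts[parts.length - 2] + "/" + parts[parts.length - 1].replace(".jp2", "");
          ctx.write(new Text(id), new Text(width == null ? "" : width));
        }
      }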
  36. Create a text file containing the HTML input file paths and create one SequenceFile in HDFS holding the complete file content
  37. SequenceFile creation (HtmlPathCreator → SequenceFileCreator): a find over the NAS produces a text file with one HTML path per line (e.g. /NAS/Z119585409/00000707.html), about 1.4 GB of paths; the files are then packed into a SequenceFile keyed by page identifier (e.g. Z119585409/00000707), about 997 GB uncompressed. For 60,000 books (24 million pages): ~5 h + ~24 h = ~29 h.
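      A minimal sketch of the SequenceFileCreator idea: read the path list produced by HtmlPathCreator and append each HTML file as a (page identifier, content) record to one block-compressed SequenceFile on HDFS. The class name, key derivation and arguments are illustrative.

      import java.io.IOException;
      import java.nio.file.Files;
      import java.nio.file.Paths;
      import java.util.List;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.BytesWritable;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.io.compress.DefaultCodec;

      public class HtmlSequenceFileCreator {
        public static void main(String[] args) throws IOException {
          // args[0]: local text file with one HTML path per line (from HtmlPathCreator)
          // args[1]: target SequenceFile on HDFS
          List<String> htmlPaths = Files.readAllLines(Paths.get(args[0]));
          Configuration conf = new Configuration();

          try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
              SequenceFile.Writer.file(new Path(args[1])),
              SequenceFile.Writer.keyClass(Text.class),
              SequenceFile.Writer.valueClass(BytesWritable.class),
              SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new DefaultCodec()))) {
            for (String p : htmlPaths) {
              String path = p.trim();
              if (path.isEmpty()) continue;
              // Key: "Z119585409/00000707" style identifier derived from the NAS path.
              String[] parts = path.split("/");
              String id = parts[parts.length - 2] + "/" + parts[parts.length - 1].replace(".html", "");
              writer.append(new Text(id), new BytesWritable(Files.readAllBytes(Paths.get(path))));
            }
          }
        }
      }

      Packing 24 million small HTML files into a few large SequenceFiles is what makes the subsequent MapReduce step practical, since HDFS handles a small number of large files far better than millions of tiny ones (see the lessons learned below).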
  38. Execute a Hadoop MapReduce job on the SequenceFile created before in order to calculate the average paragraph block width
  39. HTML parsing (HadoopAvBlockWidthMapReduce): the map phase reads the pages from the SequenceFile and emits one block width per paragraph, keyed by page identifier (e.g. Z119585409/00000001 2100, 2200, 2250, 2300, 2400); the reduce phase aggregates the widths per page and writes the result to a text file. For 60,000 books (24 million pages): ~6 h.
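      A simplified sketch of the HadoopAvBlockWidthMapReduce idea. It assumes the SequenceFile values are hOCR-style HTML in which paragraph blocks carry "bbox x0 y0 x1 y1" coordinates, which may differ from the markup actually used in the project; the regex and class names are illustrative.

      import java.io.IOException;
      import java.nio.charset.StandardCharsets;
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;
      import org.apache.hadoop.io.BytesWritable;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;

      public class AvBlockWidth {

        // Map: emit one width per paragraph block, keyed by the "Z119585409/00000001" page identifier.
        public static class BlockWidthMapper extends Mapper<Text, BytesWritable, Text, IntWritable> {
          private static final Pattern BBOX = Pattern.compile("bbox (\\d+) (\\d+) (\\d+) (\\d+)");

          @Override
          protected void map(Text pageId, BytesWritable html, Context ctx)
              throws IOException, InterruptedException {
            Matcher m = BBOX.matcher(new String(html.copyBytes(), StandardCharsets.UTF_8));
            while (m.find()) {
              int width = Integer.parseInt(m.group(3)) - Integer.parseInt(m.group(1));
              ctx.write(pageId, new IntWritable(width));
            }
          }
        }

        // Reduce: average the block widths per page, producing one row per page
        // (the text file later loaded into the htmlwidth Hive table).
        public static class AverageReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          protected void reduce(Text pageId, Iterable<IntWritable> widths, Context ctx)
              throws IOException, InterruptedException {
            long sum = 0, n = 0;
            for (IntWritable w : widths) { sum += w.get(); n++; }
            if (n > 0) ctx.write(pageId, new IntWritable((int) (sum / n)));
          }
        }
      }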
  40. Create the Hive tables and load the generated data into the Hive database
  41. Analytic queries (HiveLoadExifData & HiveLoadHocrData): the two result files are loaded into Hive tables. CREATE TABLE htmlwidth (hid STRING, hwidth INT) holds the average HTML block width per page (e.g. Z119585409/00000001 1870, Z119585409/00000002 2100, Z119585409/00000003 2015, Z119585409/00000004 1350, Z119585409/00000005 1700); CREATE TABLE jp2width (jid STRING, jwidth INT) holds the JPEG2000 image width per page (e.g. Z119585409/00000001 2250, Z119585409/00000002 2150, Z119585409/00000003 2125, Z119585409/00000004 2125, Z119585409/00000005 2250). For 60,000 books (24 million pages): ~6 h.
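      Spelled out as HiveQL, the loading step could look roughly like this; the table definitions follow the slide, while the field delimiter and the HDFS paths are assumptions.

      CREATE TABLE htmlwidth (hid STRING, hwidth INT)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
      CREATE TABLE jp2width (jid STRING, jwidth INT)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

      -- Load the MapReduce / ExifTool output files (hypothetical HDFS paths)
      LOAD DATA INPATH '/user/scape/output/htmlwidth' INTO TABLE htmlwidth;
      LOAD DATA INPATH '/user/scape/output/jp2width' INTO TABLE jp2width;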
  42. Analytic queries (HiveSelect): the two tables are joined on the page identifier: select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid. Example result: Z119585409/00000001 2250 1870, Z119585409/00000002 2150 2100, Z119585409/00000003 2125 2015, Z119585409/00000004 2125 1350, Z119585409/00000005 2250 1700. For 60,000 books (24 million pages): ~6 h.
  43. Perform a simple Hive query to test if the database has been created successfully
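      Such a sanity check could be as simple as counting rows and peeking at a few of them (example queries, not taken from the slides):

      SELECT COUNT(*) FROM jp2width;
      SELECT COUNT(*) FROM htmlwidth;
      SELECT * FROM htmlwidth LIMIT 5;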
  44. Outlook “Progress generally appears much greater than it really is” Johann Nestroy, 1847
  45. What have WE learned? • We need to carefully assess the effort for data preparation vs. the actual processing load • HDFS prefers large files over many small ones and is basically append-only • There is still much more the Hadoop ecosystem has to offer, e.g. YARN, Pig, Mahout
  46. What can YOU do? • Come join our “Hadoop in cultural heritage” hackathon on 2-4 December 2013, Vienna (See http://www.scape-project.eu/events ) • Check out some tools from our github at https://github.com/openplanets/ and help us make them better and more scalable • Follow us at @SCAPEProject and spread the word!
  47. What’s in it for US? • Digital (free) access to centuries of cultural heritage data, 24x7 and from anywhere • Ensuring our cultural history is not lost • New innovative applications using cultural heritage data (education, creative industries)
  48. Thank you! Questions? (btw, we’re hiring) www.kb.nl www.onb.ac.at www.scape-project.eu www.openplanetsfoundation.org