SCAPE

The Elephant in the Library
Integrating Hadoop


Clemens Neudecker             Sven Schlarb
@cneudecker                   @SvenSchlarb
Contents


1. Background: Digitization of cultural heritage


2. Numbers: Scaling up!


3. Challenges: Use cases and scenarios


4. Outlook
1. Background



“The digital revolution is far more
 significant than the invention of
    writing or even of printing”
         Douglas Engelbart
Then
Our libraries




•   The Hague, Netherlands       •   Vienna, Austria
•   Founded in 1798              •   Founded in 14th century
•   120,000 visitors per year    •   300,000 visitors per year
•   6 million documents          •   8 million documents
•   260 FTE                      •   300 FTE
    www.kb.nl                        www.onb.ac.at
Digitization

Libraries are rapidly transforming from physical…




to digital…
Transformation




Curation Lifecycle Model from Digital Curation Centre www.dcc.ac.uk
Now
Digital Preservation
Our data – cultural heritage

• Traditionally
     • Bibliographic and other metadata
     • Images (Portraits/Pictures, Maps, Posters, etc.)
     • Text (Books, Articles, Newspapers, etc.)
• More recently
     • Audio/Video
     • Websites, Blogs, Twitter, Social Networks
     • Research Data/Raw Data
     • Software? Apps?
2. Numbers



“A good decision is based on knowledge
         and not on numbers”
             Plato, 400 BC
Numbers (I)
        National Library of the Netherlands

• Digital objects
        • > 500 million files
        •    18 million digital publications (+ 2M/year)
        •    8 million newspaper pages (+ 4M/year)
        •    152,000 books (+ 100k/year)
        •    730,000 websites (+ 170k/year)
• Storage
        •    1.3 PB (currently 458 TB used)
        •    Growing approx. 150 TB a year
Numbers (II)
              Austrian National Library

• Digital objects
        • 600,000 volumes to be digitised over the coming
             years (currently 120,000 volumes, 40 million pages)
        •    10 million newspapers and legal texts
        •    1.16 billion files in the web archive from
             > 1 million domains
        •    Several hundred thousand images and portraits
• Storage
        •    84 TB
        •    Growing approx. 15 TB a year
Numbers (III)

• Google Books Project
   • 2012: 20 million books scanned
     (approx. 7,000,000,000 pages)
   • www.books.google.com


• Europeana
   • 2012: 25 million digital objects
   • All metadata licensed CC-0
   • www.europeana.eu/portal
Numbers (IV)

• Hathi Trust
   • 3,721,702,950 scanned pages
   • 477 TBytes
   • www.hathitrust.org


• Internet Archive
   • 245 billion web pages archived
   • 10 PBytes
   • www.archive.org
Numbers (V)

• What can we expect?
  • ENUMERATE 2012: only about 4% of collections digitised so far
  • Strong growth of born digital information




     Source: www.idc.com                 Source: security.networksasia.net
3. Challenges



“What do you do with a million books?”
        Gregory Crane, 2006
Making it scale


Scalability in terms of …
   • size
   • number
   • complexity
   • heterogeneity
SCAPE

• SCAPE = SCAlable Preservation Environments
  • €8.6M EU funding, Feb 2011 – July 2014
  • 20 partners from public sector, academia, industry
  • Main objectives:
     • Scalability
     • Automation
     • Planning

              www.scape-project.eu
Use cases (I)

• Document recognition: From image to XML

• Business case:
   • Better presentation options
   • Creation of eBooks
   • Full-text indexing
Use cases (II)

• File type migration: JP2k → TIFF

• Business case:
   • Originally migration
     to JP2k to reduce
     storage costs
   • Reverse process
     used in case JP2k
     becomes obsolete
Use cases (III)

• Web archiving: Characterization of web content

• Business case:
   • What is in a Top Level Domain?
   • What is the distribution of file formats?
   • http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits




                                 xkcd.com/688
Use cases (IV)

• Digital Humanities: Making sense of the millions

• Business case:
  •   Text mining & NLP
  •   Statistical analysis
  •   Semantic enrichment
  •   Visualizations                    Source: www.open.ac.uk/
Enter the Elephants…




                       Source: Biopics
Experimental Cluster
Execution environment

• Cluster: file server, Hadoop JobTracker
• Taverna Server (REST API), running as a web application in Apache Tomcat
Scenarios (I)
                             Log file analysis

 • Metadata log files generated by the web crawler
   during the harvesting process
   (no MIME type identification – just the MIME types
   returned by the web server)
20110830130705   9684   46   16   3   image/jpeg   http://URL   at   IP   17311   200
20110830130709   9684   46   16   3   image/jpeg   http://URL   at   IP   22123   200
20110830130710   9684   46   16   3   image/gif    http://URL   at   IP   9794    200
20110830130707   9684   46   16   3   image/jpeg   http://URL   at   IP   40056   200
20110830130704   9684   46   16   3   text/html    http://URL   at   IP   13149   200
20110830130712   9684   46   16   3   image/gif    http://URL   at   IP   2285    200
20110830130712   9684   46   16   3   text/html    http://URL   at   IP   415     301
20110830130710   9684   46   16   3   text/html    http://URL   at   IP   7873    200
20110830130712   9684   46   16   3   text/html    http://URL   at   IP   632     302
20110830130712   9684   46   16   3   image/png    http://URL   at   IP   679     200
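The per-MIME-type statistics for this scenario are a word-count-style aggregation over the crawl log. A minimal sketch in Python (the field layout is assumed from the sample lines above, where the MIME type is the sixth whitespace-separated column; `count_mime_types` is an illustrative name, not a SCAPE tool):

```python
from collections import Counter

def count_mime_types(log_lines):
    """Tally MIME types from crawler log lines.

    Assumes the whitespace-separated layout shown above:
    timestamp, job id, three numeric fields, MIME type, URL, ...
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) >= 6:
            counts[fields[5]] += 1
    return counts
```

On the ten sample lines above this gives image/jpeg 3, text/html 4, image/gif 2, image/png 1.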
Scenarios (II)
             Web archiving: File format identification
           → Run file type identification on archived web content


(W)ARC container (sample records: JPG, GIF, HTM, HTM, MID)
      ↓
(W)ARC RecordReader, based on the Heritrix web crawler's (W)ARC read/write support
      ↓
MapReduce with Apache Tika: Map detects the MIME type of each record
(e.g. JPG → image/jpg), Reduce aggregates the counts:

     image/jpg    1
     image/gif    1
     text/html    2
     audio/midi   1
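The MapReduce structure of the identification job can be illustrated without Hadoop: Map turns each archive record into a (MIME type, 1) pair, the framework groups pairs by key, and Reduce sums each group. A toy simulation of that flow (the `detect` callable is a placeholder standing in for Apache Tika, and the records are just names, not real (W)ARC payloads):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, detect):
    # detect() stands in for Apache Tika's MIME detection
    return [(detect(record), 1) for record in records]

def reduce_phase(pairs):
    # Hadoop sorts map output by key before reducing; emulate the shuffle
    pairs = sorted(pairs, key=itemgetter(0))
    return {mime: sum(n for _, n in group)
            for mime, group in groupby(pairs, key=itemgetter(0))}
```

With a trivial extension-based `detect`, the five example records (JPG, GIF, HTM, HTM, MID) reduce to exactly the counts shown above.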
Scenarios (II)
Web archiving: File format identification
→ Using MapReduce to calculate statistics



   (Identification results from DROID 6.01 and Tika 1.0)
Scenarios (III)
                 File format migration

• Risk of format obsolescence
• Quality assurance
  • File format validation
  • Original/target image
    comparison
• Imagine runtime of 1 minute
  per image for 200 million
  pages ...
• Parallel execution of file format validation using a Mapper
   • Jpylyzer (Python)
   • JHOVE2 (Java)

• Feature extraction requires sharing resources between processing steps

• Challenge to model more complex image comparison scenarios,
  e.g. book page duplicate detection or digital book comparison
Scenarios (IV)
Book page analysis

Create a text file containing the JPEG2000 input file paths and read
image metadata using ExifTool via the Hadoop Streaming API.

Reading image metadata: Jp2PathCreator → HadoopStreamingExiftoolRead
(reading files from NAS)

A find over the NAS yields the input path list, e.g.
   /NAS/Z119585409/00000001.jp2
   /NAS/Z119585409/00000002.jp2
   ...
and the streaming job emits one (page ID, width) pair per image, e.g.
   Z119585409/00000001   2345
   Z119585409/00000002   2340
   ...

Path list: 1.4 GB; extracted metadata: 1.2 GB
60,000 books / 24 million pages: ~5 h (path creation) + ~38 h (extraction) = ~43 h
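A Hadoop Streaming mapper is simply a program that reads input lines from stdin and writes tab-separated key/value pairs to stdout. A sketch of the ExifTool step (the path-to-ID convention is inferred from the example paths; `read_width` is injected so the logic can be shown without the actual ExifTool invocation, which in the real job would be a subprocess call):

```python
import os

def path_to_id(path):
    """/NAS/Z119585409/00000001.jp2 -> Z119585409/00000001 (assumed layout)."""
    book = os.path.basename(os.path.dirname(path))
    page = os.path.splitext(os.path.basename(path))[0]
    return f"{book}/{page}"

def mapper(lines, read_width):
    """Emit 'id<TAB>width' per input path; read_width wraps the ExifTool call.

    In the streaming job, lines come from sys.stdin and the results are
    printed to sys.stdout instead of being returned.
    """
    out = []
    for line in lines:
        path = line.strip()
        if path:
            out.append(f"{path_to_id(path)}\t{read_width(path)}")
    return out
```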
Create a text file containing the HTML input file paths and create
one sequence file with the complete file content in HDFS.

SequenceFile creation: HtmlPathCreator → SequenceFileCreator
(reading files from NAS)

A find over the NAS yields the HTML path list, e.g.
   /NAS/Z119585409/00000707.html
   /NAS/Z119585409/00000708.html
   ...
which the SequenceFileCreator packs into one sequence file keyed by
page ID (Z119585409/00000707, Z119585409/00000708, ...).

Path list: 1.4 GB; sequence file: 997 GB (uncompressed)
60,000 books / 24 million pages: ~5 h + ~24 h = ~29 h
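The point of this step is to turn millions of small HTML files, which HDFS handles poorly, into one large keyed container. The real job uses Hadoop's SequenceFile writer; the toy functions below are not the SequenceFile binary format, just an illustration of the same idea: length-prefixed key/value records concatenated into a single blob.

```python
def write_container(records):
    """Serialize (key, content) pairs into one byte string,
    mimicking SequenceFile's keyed records (illustrative format only)."""
    buf = bytearray()
    for key, blob in records:
        kb = key.encode("utf-8")
        buf += len(kb).to_bytes(4, "big") + kb      # key length + key
        buf += len(blob).to_bytes(4, "big") + blob  # value length + value
    return bytes(buf)

def read_container(data):
    """Recover the (key, content) pairs from the packed blob."""
    pos, out = 0, []
    while pos < len(data):
        klen = int.from_bytes(data[pos:pos + 4], "big"); pos += 4
        key = data[pos:pos + klen].decode("utf-8"); pos += klen
        vlen = int.from_bytes(data[pos:pos + 4], "big"); pos += 4
        out.append((key, data[pos:pos + vlen])); pos += vlen
    return out
```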
Execute a Hadoop MapReduce job using the sequence file created
before in order to calculate the average paragraph block width.

HTML parsing: HadoopAvBlockWidthMapReduce

Map emits one block width per paragraph, e.g. for page
Z119585409/00000001 the values 2100, 2200, 2300, 2400;
Reduce averages them per page:
   Z119585409/00000001   2250
   Z119585409/00000002   2250
   ...

Input: SequenceFile; output: text file
60,000 books / 24 million pages: ~6 h
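The Reduce step is a plain per-key average. A sketch of the reducer logic, fed with (page ID, block width) pairs as produced by the map phase (`average_block_width` is an illustrative name, not the project's class):

```python
def average_block_width(pairs):
    """Average the values per key, e.g. paragraph block widths per page."""
    sums, counts = {}, {}
    for key, width in pairs:
        sums[key] = sums.get(key, 0) + width
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] // counts[key] for key in sums}
```

For the example page with map output 2100, 2200, 2300, 2400 this yields 2250, matching the slide.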
Create Hive tables and load the generated data into the Hive database.
Analytic Queries
HiveLoadExifData & HiveLoadHocrData

CREATE TABLE htmlwidth (hid STRING, hwidth INT)

htmlwidth
   hid                   hwidth
   Z119585409/00000001   1870
   Z119585409/00000002   2100
   Z119585409/00000003   2015
   Z119585409/00000004   1350
   Z119585409/00000005   1700

CREATE TABLE jp2width (jid STRING, jwidth INT)

jp2width
   jid                   jwidth
   Z119585409/00000001   2250
   Z119585409/00000002   2150
   Z119585409/00000003   2125
   Z119585409/00000004   2125
   Z119585409/00000005   2250

60,000 books / 24 million pages: ~6 h
Analytic Queries
HiveSelect

jp2width                               htmlwidth
   jid                   jwidth           hid                   hwidth
   Z119585409/00000001   2250             Z119585409/00000001   1870
   Z119585409/00000002   2150             Z119585409/00000002   2100
   Z119585409/00000003   2125             Z119585409/00000003   2015
   Z119585409/00000004   2125             Z119585409/00000004   1350
   Z119585409/00000005   2250             Z119585409/00000005   1700

select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid

   jid                   jwidth   hwidth
   Z119585409/00000001   2250     1870
   Z119585409/00000002   2150     2100
   Z119585409/00000003   2125     2015
   Z119585409/00000004   2125     1350
   Z119585409/00000005   2250     1700

60,000 books / 24 million pages: ~6 h

Perform a simple Hive query to test whether the database has been
created successfully.
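The Hive query above is an inner join of the two width tables on the page ID. The same operation sketched in plain Python (a dict lookup per key), which is handy for sanity-checking expected results on a small sample before running the full query:

```python
def inner_join(jp2width, htmlwidth):
    """Emulate: select jid, jwidth, hwidth
       from jp2width inner join htmlwidth on jid = hid."""
    html = dict(htmlwidth)
    return [(jid, jw, html[jid]) for jid, jw in jp2width if jid in html]
```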
Outlook



“Progress generally appears much
     greater than it really is”
       Johann Nestroy, 1847
What have WE learned?

• We need to carefully assess the efforts for data
  preparation vs. the actual processing load

• HDFS prefers large files over many small ones,
  is basically “append-only”

• There is still much more the Hadoop ecosystem
  has to offer, e.g. YARN, Pig, Mahout
What can YOU do?

• Come join our “Hadoop in cultural heritage”
  hackathon on 2-4 December 2013, Vienna
  (See http://www.scape-project.eu/events )

• Check out some tools from our github at
  https://github.com/openplanets/ and help
  us make them better and more scalable

• Follow us at @SCAPEProject and spread the word!
What’s in it for US?

• Digital (free) access to centuries of cultural
  heritage data, 24x7 and from anywhere

• Ensuring our cultural history is not lost

• New innovative applications using cultural
  heritage data (education, creative industries)
Thank you! Questions?
       (btw, we’re hiring)



        www.kb.nl
      www.onb.ac.at
   www.scape-project.eu
www.openplanetsfoundation.org

Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
Project Management Semester Long Project - Acuity
Project Management Semester Long Project - AcuityProject Management Semester Long Project - Acuity
Project Management Semester Long Project - Acuity
jpupo2018
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
fredae14
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 

Recently uploaded (20)

HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
Project Management Semester Long Project - Acuity
Project Management Semester Long Project - AcuityProject Management Semester Long Project - Acuity
Project Management Semester Long Project - Acuity
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 

The Elephant in the Library

  • 1. SCAPE: The Elephant in the Library. Integrating Hadoop. Clemens Neudecker (@cneudecker), Sven Schlarb (@SvenSchlarb)
  • 2. Contents 1. Background: Digitization of cultural heritage 2. Numbers: Scaling up! 3. Challenges: Use cases and scenarios 4. Outlook
  • 3. 1. Background “The digital revolution is far more significant than the invention of writing or even of printing” Douglas Engelbart
  • 5. Our libraries. KB, The Hague, Netherlands: founded in 1798; 120.000 visitors per year; 6 million documents; 260 FTE; www.kb.nl. Austrian National Library, Vienna, Austria: founded in the 14th century; 300.000 visitors per year; 8 million documents; 300 FTE; www.onb.ac.at
  • 6. Digitization Libraries are rapidly transforming from physical… to digital…
  • 7. Transformation Curation Lifecycle Model from Digital Curation Centre www.dcc.ac.uk
  • 8. Now
  • 10. Our data – cultural heritage • Traditionally • Bibliographic and other metadata • Images (Portraits/Pictures, Maps, Posters, etc.) • Text (Books, Articles, Newspapers, etc.) • More recently • Audio/Video • Websites, Blogs, Twitter, Social Networks • Research Data/Raw Data • Software? Apps?
  • 11. 2. Numbers “A good decision is based on knowledge and not on numbers” Plato, 400 BC
  • 12. Numbers (I) National Library of the Netherlands • Digital objects • > 500 million files • 18 million digital publications (+ 2M/year) • 8 million newspaper pages (+ 4M/year) • 152.000 books (+ 100k/year) • 730.000 websites (+ 170k/year) • Storage • 1.3 PB (currently 458 TB used) • Growing approx. 150 TB a year
  • 13. Numbers (II) Austrian National Library • Digital objects • 600.000 volumes being digitised during the next years (currently 120.000 volumes, 40 million pages) • 10 million newspapers and legal texts • 1.16 billion files in web archive from > 1 million domains • Several 100.000 images and portraits • Storage • 84 TB • Growing approx. 15 TB a year
  • 14. Numbers (III) • Google Books Project • 2012: 20 million books scanned (approx. 7,000,000,000 pages) • www.books.google.com • Europeana • 2012: 25 million digital objects • All metadata licensed CC-0 • www.europeana.eu/portal
  • 15. Numbers (IV) • Hathi Trust • 3,721,702,950 scanned pages • 477 TBytes • www.hathitrust.org • Internet Archive • 245 billion web pages archived • 10 PBytes • www.archive.org
  • 16. Numbers (V) • What can we expect? • Enumerate 2012: only about 4% digitised so far • Strong growth of born digital information Source: www.idc.com Source: security.networksasia.net
  • 17. 3. Challenges “What do you do with a million books?” Gregory Crane, 2006
  • 18. Making it scale Scalability in terms of … • size • number • complexity • heterogeneity
  • 19. SCAPE • SCAPE = SCAlable Preservation Environments • €8.6M EU funding, Feb 2011 – July 2014 • 20 partners from public sector, academia, industry • Main objectives: • Scalability • Automation • Planning www.scape-project.eu
  • 20. Use cases (I) • Document recognition: From image to XML • Business case: • Better presentation options • Creation of eBooks • Full-text indexing
  • 21. Use cases (II) • File type migration: JP2k → TIFF • Business case: • Originally migration to JP2k to reduce storage costs • Reverse process used in case JP2k becomes obsolete
  • 22. Use cases (III) • Web archiving: Characterization of web content • Business case: • What is in a Top Level Domain? • What is the distribution of file formats? • http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits • xkcd.com/688
  • 23. Use cases (IV) • Digital Humanities: Making sense of the millions • Business case: • Text mining & NLP • Statistical analysis • Semantic enrichment • Visualizations Source: www.open.ac.uk/
  • 24. Enter the Elephants… Source: Biopics
  • 26. Execution environment (architecture diagram): Web Application, Taverna Server (REST API) on Apache Tomcat, file server, Hadoop Jobtracker, cluster
  • 27. Scenarios (I) Log file analysis • Metadata log files generated by the web crawler during the harvesting process (no mime type identification – just the mime types returned by the web server) 20110830130705 9684 46 16 3 image/jpeg http://URL at IP 17311 200 20110830130709 9684 46 16 3 image/jpeg http://URL at IP 22123 200 20110830130710 9684 46 16 3 image/gif http://URL at IP 9794 200 20110830130707 9684 46 16 3 image/jpeg http://URL at IP 40056 200 20110830130704 9684 46 16 3 text/html http://URL at IP 13149 200 20110830130712 9684 46 16 3 image/gif http://URL at IP 2285 200 20110830130712 9684 46 16 3 text/html http://URL at IP 415 301 20110830130710 9684 46 16 3 text/html http://URL at IP 7873 200 20110830130712 9684 46 16 3 text/html http://URL at IP 632 302 20110830130712 9684 46 16 3 image/png http://URL at IP 679 200
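The crawler log lines above are plain whitespace-separated records, so tallying the server-reported MIME types takes only a few lines. A minimal sketch (the field positions are inferred from the sample lines on the slide, not taken from the project's actual code):

```python
from collections import Counter

def mime_of(line):
    """Return the server-reported MIME type from one crawl-log line.

    Inferred layout: timestamp, four numeric fields, MIME type,
    URL, ..., size, HTTP status.
    """
    return line.split()[5]

def tally(lines):
    """Count how often each MIME type occurs."""
    return Counter(mime_of(line) for line in lines if line.strip())

log = [
    "20110830130705 9684 46 16 3 image/jpeg http://URL at IP 17311 200",
    "20110830130710 9684 46 16 3 image/gif http://URL at IP 9794 200",
    "20110830130704 9684 46 16 3 text/html http://URL at IP 13149 200",
    "20110830130712 9684 46 16 3 text/html http://URL at IP 415 301",
]
print(tally(log))  # text/html twice, image/jpeg and image/gif once each
```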
  • 28. Scenarios (II) Web archiving: File format identification → Run file type identification on archived web content. (Pipeline diagram: the HERITRIX web crawler writes (W)ARC containers; a (W)ARC RecordReader feeds the records (JPG, GIF, HTM, MID, …) into a MapReduce job; the Map step uses Apache Tika to detect the MIME type from the content; the Reduce step aggregates the counts, e.g. image/jpg 1, image/gif 1, text/html 2, audio/midi 1.)
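In Hadoop Streaming terms, the identification job is a word-count pattern over MIME types. A simplified Python stand-in for the Tika-based Java job on the slide (note that `mimetypes.guess_type` only looks at file names, whereas Tika inspects the content itself):

```python
import mimetypes
from itertools import groupby

def map_records(names):
    """Map step: emit (mime, 1) for each archived record.

    Name-based stand-in for Apache Tika's content-based detection.
    """
    for name in names:
        mime, _ = mimetypes.guess_type(name)
        yield (mime or "application/octet-stream", 1)

def reduce_counts(pairs):
    """Reduce step: sum the counts per MIME type (keys sorted first,
    as the Hadoop shuffle would do)."""
    for mime, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (mime, sum(n for _, n in group))

records = ["a.jpg", "b.gif", "c.html", "d.html", "e.mid"]
for mime, count in reduce_counts(map_records(records)):
    print(f"{mime}\t{count}")
```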
  • 29. Scenarios (II) Web archiving: File format identification → Using MapReduce to calculate statistics (DROID 6.01, TIKA 1.0)
  • 30. Scenarios (III) File format migration • Risk of format obsolescence • Quality assurance • File format validation • Original/target image comparison • Imagine runtime of 1 minute per image for 200 million pages ...
  • 31. Parallel execution of file format validation using Mapper ●Jpylyzer (Python) ●Jhove2 (Java)
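Wrapped for Hadoop Streaming, the validation can run as a mapper that shells out to the tool once per input path. A sketch assuming the jpylyzer command-line tool is installed on every worker node; its XML report is reduced here to a crude substring check rather than real XML parsing:

```python
import subprocess
import sys

def is_valid(report_xml):
    """Crude check of a jpylyzer XML report: the isValid element
    holds 'True' for well-formed, valid JP2 files."""
    return "isValid" in report_xml and ">True<" in report_xml

def validate(path):
    """Run jpylyzer on one file; return (path, 'valid' or 'invalid')."""
    report = subprocess.run(["jpylyzer", path],
                            capture_output=True, text=True).stdout
    return path, "valid" if is_valid(report) else "invalid"

def mapper(stdin=sys.stdin):
    """Streaming mapper: one JP2 path per input line, tab-separated
    output, e.g. invoked as: cat paths.txt | python mapper.py"""
    for line in stdin:
        path, verdict = validate(line.strip())
        print(f"{path}\t{verdict}")
```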
  • 32. ●Feature extraction requires sharing resources between processing steps ●Challenge to model more complex image comparison scenarios, e.g. book page duplicates detection or digital book comparison
  • 34. Create text file containing JPEG2000 input file paths and read image metadata using Exiftool via the Hadoop Streaming API
  • 35. Reading image metadata: Jp2PathCreator → HadoopStreamingExiftoolRead. A find job reading files from NAS produces the list of JPEG2000 paths (e.g. /NAS/Z119585409/00000001.jp2), about 1,4 GB of path data; the streaming job turns each path into a record like “Z119585409/00000001 2345”, about 1,2 GB of output. For 60.000 books (24 million pages): ~5 h path listing + ~38 h metadata extraction = ~43 h.
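The streaming step boils down to deriving a barcode/page key from each NAS path and asking ExifTool for one metadata value. A hedged sketch (it assumes exiftool is installed on every worker node; `-s3` prints the bare tag value):

```python
import os
import subprocess

def record_key(path):
    """Map a NAS path like /NAS/Z119585409/00000001.jp2 to the
    barcode/page key Z119585409/00000001."""
    book = os.path.basename(os.path.dirname(path))
    page = os.path.splitext(os.path.basename(path))[0]
    return f"{book}/{page}"

def read_width(path):
    """Read the image width via the ExifTool CLI."""
    out = subprocess.run(["exiftool", "-s3", "-ImageWidth", path],
                         capture_output=True, text=True).stdout
    return int(out.strip())

# Each Hadoop Streaming map task would emit, per input path:
#   print(f"{record_key(path)}\t{read_width(path)}")
```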
  • 36. Create text file containing HTML input file paths and create one sequence file with the complete file content in HDFS
  • 37. SequenceFile creation: HtmlPathCreator → SequenceFileCreator. A find job reading files from NAS produces the list of HTML paths (e.g. /NAS/Z119585409/00000707.html), about 1,4 GB of path data; the SequenceFileCreator packs the complete file contents into sequence files keyed like Z119585409/00000707 (997 GB uncompressed). For 60.000 books (24 million pages): ~5 h + ~24 h = ~29 h.
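SequenceFiles are Hadoop's binary key/value container; the point here is simply that HDFS handles one 997 GB file far better than 24 million small HTML files. A toy illustration of the packing idea in plain Python (length-prefixed records, not the actual SequenceFile wire format):

```python
import io
import struct

def pack(records, fh):
    """Write (key, payload) pairs as length-prefixed binary records."""
    for key, payload in records:
        k = key.encode()
        fh.write(struct.pack(">II", len(k), len(payload)))
        fh.write(k)
        fh.write(payload)

def unpack(fh):
    """Yield (key, payload) pairs back from a packed stream."""
    while True:
        header = fh.read(8)
        if not header:
            return
        klen, plen = struct.unpack(">II", header)
        yield fh.read(klen).decode(), fh.read(plen)

buf = io.BytesIO()
pack([("Z119585409/00000707", b"<html>page</html>"),
      ("Z119585409/00000708", b"<html>page 2</html>")], buf)
buf.seek(0)
print([key for key, _ in unpack(buf)])
# ['Z119585409/00000707', 'Z119585409/00000708']
```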
  • 38. Execute Hadoop MapReduce job using the sequence file created before in order to calculate the average paragraph block width
  • 39. HTML Parsing: HadoopAvBlockWidthMapReduce. The Map step reads the SequenceFile and emits (page id, paragraph block width) pairs, e.g. Z119585409/00000001 2100; the Reduce step collapses the widths per page id into one output line in a text file. For 60.000 books (24 million pages): ~6 h.
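The reduce side of the average-block-width job is a straightforward group-and-average. A sketch over sorted "pageid&lt;TAB&gt;width" lines, using the sample values from the slide:

```python
from itertools import groupby

def average_widths(lines):
    """Reduce step: average the paragraph block widths per page id.

    Input: sorted 'pageid<TAB>width' lines, as emitted by the map step.
    """
    pairs = (line.split("\t") for line in lines if line.strip())
    for pid, group in groupby(pairs, key=lambda kv: kv[0]):
        widths = [int(w) for _, w in group]
        yield pid, sum(widths) / len(widths)

lines = [
    "Z119585409/00000001\t2100",
    "Z119585409/00000001\t2200",
    "Z119585409/00000001\t2250",
    "Z119585409/00000001\t2300",
    "Z119585409/00000001\t2400",
]
print(dict(average_widths(lines)))  # {'Z119585409/00000001': 2250.0}
```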
  • 40. Create Hive table and load generated data into the Hive database
  • 41. Analytic Queries: HiveLoadExifData & HiveLoadHocrData. CREATE TABLE htmlwidth (hid STRING, hwidth INT) and CREATE TABLE jp2width (jid STRING, jwidth INT), then load the generated data, e.g. rows like Z119585409/00000001 1870 (htmlwidth) and Z119585409/00000001 2250 (jp2width). For 60.000 books (24 million pages): ~6 h.
  • 42. Analytic Queries: HiveSelect. select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid; the join yields rows like Z119585409/00000001 2250 1870. For 60.000 books (24 million pages): ~6 h.
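The join can be tried locally before running it on the cluster; below, sqlite3 serves as a small stand-in for Hive (sample rows taken from the slide, table layout as in the CREATE TABLE statements):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE jp2width (jid TEXT, jwidth INT)")
con.execute("CREATE TABLE htmlwidth (hid TEXT, hwidth INT)")
con.executemany("INSERT INTO jp2width VALUES (?, ?)", [
    ("Z119585409/00000001", 2250), ("Z119585409/00000002", 2150)])
con.executemany("INSERT INTO htmlwidth VALUES (?, ?)", [
    ("Z119585409/00000001", 1870), ("Z119585409/00000002", 2100)])

# Same shape as the Hive query on the slide.
rows = sorted(con.execute(
    "SELECT jid, jwidth, hwidth FROM jp2width "
    "INNER JOIN htmlwidth ON jid = hid").fetchall())
print(rows)
# [('Z119585409/00000001', 2250, 1870), ('Z119585409/00000002', 2150, 2100)]
```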
  • 43. Perform a simple Hive query to test if the database has been created successfully
  • 44. Outlook “Progress generally appears much greater than it really is” Johann Nestroy, 1847
  • 45. What have WE learned? • We need to carefully assess the efforts for data preparation vs. the actual processing load • HDFS prefers large files over many small ones and is basically “append-only” • There is still much more the Hadoop ecosystem has to offer, e.g. YARN, Pig, Mahout
  • 46. What can YOU do? • Come join our “Hadoop in cultural heritage” hackathon on 2-4 December 2013, Vienna (See http://www.scape-project.eu/events ) • Check out some tools from our github at https://github.com/openplanets/ and help us make them better and more scalable • Follow us at @SCAPEProject and spread the word!
  • 47. What’s in it for US? • Digital (free) access to centuries of cultural heritage data, 24x7 and from anywhere • Ensuring our cultural history is not lost • New innovative applications using cultural heritage data (education, creative industries)
  • 48. Thank you! Questions? (btw, we’re hiring) www.kb.nl www.onb.ac.at www.scape-project.eu www.openplanetsfoundation.org