SCAPE
The Elephant in the Library 
Integrating Hadoop 
Clemens Neudecker Sven Schlarb 
@cneudecker @SvenSchlarb
Contents 
1. Background: Digitization of cultural heritage 
2. Numbers: Scaling up! 
3. Challenges: Use cases and scenarios 
4. Outlook
1. Background 
“The digital revolution is far more 
significant than the invention of 
writing or even of printing” 
Douglas Engelbart
Then
Our libraries 
• The Hague, Netherlands 
• Founded in 1798 
• 120,000 visitors per year 
• 6 million documents 
• 260 FTE 
www.kb.nl 
• Vienna, Austria 
• Founded in 14th century 
• 300,000 visitors per year 
• 8 million documents 
• 300 FTE 
www.onb.ac.at
Digitization 
Libraries are rapidly transforming from physical… 
to digital…
Transformation 
Curation Lifecycle Model from Digital Curation Centre www.dcc.ac.uk
Now
Digital Preservation
Our data – cultural heritage 
• Traditionally 
• Bibliographic and other metadata 
• Images (Portraits/Pictures, Maps, Posters, etc.) 
• Text (Books, Articles, Newspapers, etc.) 
• More recently 
• Audio/Video 
• Websites, Blogs, Twitter, Social Networks 
• Research Data/Raw Data 
• Software? Apps?
2. Numbers 
“A good decision is based on knowledge 
and not on numbers” 
Plato, 400 BC
Numbers (I) 
National Library of the Netherlands 
• Digital objects 
• > 500 million files 
• 18 million digital publications (+ 2M/year) 
• 8 million newspaper pages (+ 4M/year) 
• 152,000 books (+ 100k/year) 
• 730,000 websites (+ 170k/year) 
• Storage 
• 1.3 PB (currently 458 TB used) 
• Growing approx. 150 TB a year
Numbers (II) 
Austrian National Library 
• Digital objects 
• 600,000 volumes to be digitised over the coming years (currently 120,000 volumes, 40 million pages) 
• 10 million newspapers and legal texts 
• 1.16 billion files in web archive from > 1 million domains 
• Several hundred thousand images and portraits 
• Storage 
• 84 TB 
• Growing approx. 15 TB a year
Numbers (III) 
• Google Books Project 
• 2012: 20 million books scanned 
(approx. 7,000,000,000 pages) 
• www.books.google.com 
• Europeana 
• 2012: 25 million digital objects 
• All metadata licensed CC-0 
• www.europeana.eu/portal
Numbers (IV) 
• Hathi Trust 
• 3,721,702,950 scanned pages 
• 477 TBytes 
• www.hathitrust.org 
• Internet Archive 
• 245 billion web pages archived 
• 10 PBytes 
• www.archive.org
Numbers (V) 
• What can we expect? 
• Enumerate 2012: only about 4% digitised so far 
• Strong growth of born digital information 
Source: www.idc.com Source: security.networksasia.net
3. Challenges 
“What do you do with a million books?” 
Gregory Crane, 2006
Making it scale 
Scalability in terms of … 
• size 
• number 
• complexity 
• heterogeneity
SCAPE 
• SCAPE = SCAlable Preservation Environments 
• €8.6M EU funding, Feb 2011 – July 2014 
• 20 partners from public sector, academia, industry 
• Main objectives: 
• Scalability 
• Automation 
• Planning 
www.scape-project.eu
Use cases (I) 
• Document recognition: From image to XML 
• Business case: 
• Better presentation options 
• Creation of eBooks 
• Full-text indexing
Use cases (II) 
• File type migration between JP2k and TIFF 
• Business case: 
• Originally migrated to JP2k to reduce storage costs 
• Reverse process used in case JP2k becomes obsolete
Use cases (III) 
• Web archiving: Characterization of web content 
• Business case: 
• What is in a Top Level Domain? 
• What is the distribution of file formats? 
• http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits 
xkcd.com/688
Use cases (IV) 
• Digital Humanities: Making sense of the millions 
• Business case: 
• Text mining & NLP 
• Statistical analysis 
• Semantic enrichment 
• Visualizations 
Source: www.open.ac.uk/
Enter the Elephants… 
Source: Biopics
Experimental Cluster
Apache Tomcat 
Web Application 
Execution environment 
Taverna Server 
(REST API) 
Hadoop 
Jobtracker 
File server 
Cluster
Scenarios (I) 
Log file analysis 
• Metadata log files generated by the web crawler 
during the harvesting process 
(no mime type identification – just the mime types 
returned by the web server) 
20110830130705 9684 46 16 3 image/jpeg http://URL at IP 17311 200 
20110830130709 9684 46 16 3 image/jpeg http://URL at IP 22123 200 
20110830130710 9684 46 16 3 image/gif http://URL at IP 9794 200 
20110830130707 9684 46 16 3 image/jpeg http://URL at IP 40056 200 
20110830130704 9684 46 16 3 text/html http://URL at IP 13149 200 
20110830130712 9684 46 16 3 image/gif http://URL at IP 2285 200 
20110830130712 9684 46 16 3 text/html http://URL at IP 415 301 
20110830130710 9684 46 16 3 text/html http://URL at IP 7873 200 
20110830130712 9684 46 16 3 text/html http://URL at IP 632 302 
20110830130712 9684 46 16 3 image/png http://URL at IP 679 200
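A minimal sketch of how such a crawler log could be summarised by MIME type, shown here on the ten sample lines only. The field positions are assumptions read off the sample: the sixth column is taken to be the MIME type reported by the web server, and the last two columns are taken to be the response size and the HTTP status code. The helper names `parse_line` and `mime_histogram` are hypothetical.

```python
from collections import Counter

# The ten sample lines from the slide, copied verbatim.
SAMPLE_LOG = """\
20110830130705 9684 46 16 3 image/jpeg http://URL at IP 17311 200
20110830130709 9684 46 16 3 image/jpeg http://URL at IP 22123 200
20110830130710 9684 46 16 3 image/gif http://URL at IP 9794 200
20110830130707 9684 46 16 3 image/jpeg http://URL at IP 40056 200
20110830130704 9684 46 16 3 text/html http://URL at IP 13149 200
20110830130712 9684 46 16 3 image/gif http://URL at IP 2285 200
20110830130712 9684 46 16 3 text/html http://URL at IP 415 301
20110830130710 9684 46 16 3 text/html http://URL at IP 7873 200
20110830130712 9684 46 16 3 text/html http://URL at IP 632 302
20110830130712 9684 46 16 3 image/png http://URL at IP 679 200
"""


def parse_line(line):
    """Split one log line; column meanings are assumptions (see lead-in)."""
    fields = line.split()
    return {
        "timestamp": fields[0],
        "mime": fields[5],          # MIME type as returned by the web server
        "size": int(fields[-2]),    # assumed: response size
        "status": int(fields[-1]),  # assumed: HTTP status code
    }


def mime_histogram(lines):
    """Count how often each server-reported MIME type occurs."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts[parse_line(line)["mime"]] += 1
    return counts


histogram = mime_histogram(SAMPLE_LOG.splitlines())
for mime, count in histogram.most_common():
    print(mime, count)
```

On a cluster the same counting would be split into a map step (emit one MIME type per line) and a reduce step (sum per type); the single-process version above only illustrates the logic.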
Scenarios (II) 
Web archiving: File format identification 
→ Run file type identification on archived web content 
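• Input: (W)ARC container holding the archived records (e.g. JPG, GIF, HTM, HTM, MID) 
• (W)ARC RecordReader, based on the read/write (W)ARC code of the Heritrix web crawler 
• Map: Apache Tika detects the MIME type of each record (e.g. JPG → image/jpeg) 
• Reduce: count the records per MIME type 
image/jpeg 1 
image/gif 1 
text/html 2 
audio/midi 1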
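The Map and Reduce steps above can be sketched in a single process as follows. This is an illustration under stated assumptions, not the project's implementation: Apache Tika is a Java library, so a tiny magic-number table (`MAGIC_NUMBERS`, hypothetical and far from complete) stands in for it here, and the five byte strings are made-up stand-ins for records read from a (W)ARC container. The registered type `image/jpeg` is used for JPEG records.

```python
from collections import defaultdict

# Stand-in for Apache Tika: a handful of well-known file signatures.
# Illustrative only; real detection covers far more formats.
MAGIC_NUMBERS = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"MThd", "audio/midi"),
]


def detect_mime(payload):
    """Guess a MIME type from the first bytes of a record payload."""
    for magic, mime in MAGIC_NUMBERS:
        if payload.startswith(magic):
            return mime
    if payload.lstrip().lower().startswith((b"<!doctype html", b"<html")):
        return "text/html"
    return "application/octet-stream"


def map_record(payload):
    """Map step: emit (MIME type, 1) for one archived record."""
    yield detect_mime(payload), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts per MIME type."""
    totals = defaultdict(int)
    for mime, count in pairs:
        totals[mime] += count
    return dict(totals)


# Made-up payloads mirroring the JPG, GIF, HTM, HTM, MID records on the slide.
records = [
    b"\xff\xd8\xff\xe0 jpeg bytes",
    b"GIF89a gif bytes",
    b"<html><body>page one</body></html>",
    b"<!DOCTYPE html><html></html>",
    b"MThd midi bytes",
]
pairs = [pair for payload in records for pair in map_record(payload)]
mime_counts = reduce_counts(pairs)
print(mime_counts)
```

Because each record is classified independently, the map step parallelises trivially; only the per-type sums need to be brought together in the reduce step.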
Scenarios (II) 
Web archiving: File format identification 
→ Using MapReduce to calculate statistics 
DROID 6.01 TIKA 1.0
Scenarios (III) 
File format migration 
• Risk of format obsolescence 
• Quality assurance 
• File format validation 
• Original/target image 
comparison 
• Imagine runtime of 1 minute 
per image for 200 million 
pages ...
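To make the scale concrete, the figures on this slide can be turned into a back-of-the-envelope estimate. The worker counts below are hypothetical, and the calculation assumes perfect linear scaling with no scheduling or I/O overhead, which a real cluster will not achieve.

```python
PAGES = 200_000_000        # pages to process (from the slide)
MINUTES_PER_IMAGE = 1      # assumed runtime per image (from the slide)

total_minutes = PAGES * MINUTES_PER_IMAGE

# Sequential processing on a single worker, in years of 365 days.
sequential_years = total_minutes / (60 * 24 * 365)


def wall_clock_days(workers):
    """Idealised wall-clock time in days with the given number of parallel workers."""
    return total_minutes / workers / (60 * 24)


print(f"1 worker: about {sequential_years:.0f} years")
for workers in (100, 1000):  # hypothetical cluster sizes
    print(f"{workers} workers: about {wall_clock_days(workers):.0f} days")
```

One worker would need roughly 380 years, which is why the validation step has to be distributed.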
Parallel execution of 
file format validation 
using Mapper 
● Jpylyzer (Python) 
● Jhove2 (Java)
● Feature extraction 
requires sharing 
resources between 
processing steps 
● Challenge to model 
more complex image 
comparison scenarios, 
e.g. book page 
duplicates detection 
or digital book 
comparison
Scenarios (IV) 
Book page analysis
Create text file containing JPEG2000 input file paths and read 
image metadata using Exiftool via the Hadoop Streaming API
Jp2PathCreator HadoopStreamingExiftoolRead 
find 
/NAS/Z119585409/00000001.jp2 
/NAS/Z119585409/00000002.jp2 
/NAS/Z119585409/00000003.jp2 
… 
/NAS/Z117655409/00000001.jp2 
/NAS/Z117655409/00000002.jp2 
/NAS/Z117655409/00000003.jp2 
… 
/NAS/Z119585987/00000001.jp2 
/NAS/Z119585987/00000002.jp2 
/NAS/Z119585987/00000003.jp2 
… 
/NAS/Z119584539/00000001.jp2 
/NAS/Z119584539/00000002.jp2 
/NAS/Z119584539/00000003.jp2 
… 
/NAS/Z119599879/00000001.jp2 
/NAS/Z119589879/00000002.jp2 
/NAS/Z119589879/00000003.jp2 
... 
... 
reading files from NAS 
• Jp2PathCreator: ~ 5 h, 1.4 GB path list 
• HadoopStreamingExiftoolRead: ~ 38 h, 1.2 GB output 
• Total: ~ 43 h 
• 60,000 books 
• 24 million pages 
Z119585409/00000001 2345 
Z119585409/00000002 2340 
Z119585409/00000003 2543 
… 
Z117655409/00000001 2300 
Z117655409/00000002 2300 
Z117655409/00000003 2345 
… 
Z119585987/00000001 2300 
Z119585987/00000002 2340 
Z119585987/00000003 2432 
… 
Z119584539/00000001 5205 
Z119584539/00000002 2310 
Z119584539/00000003 2134 
… 
Z119599879/00000001 2312 
Z119589879/00000002 2300 
Z119589879/00000003 2300 
... 
Reading image metadata
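A Hadoop Streaming mapper for this step could look roughly like the sketch below. It is not the HadoopStreamingExiftoolRead code itself: the function names are hypothetical, the output is assumed to be tab-separated key and width, and the Exiftool call (`-s3` for the bare value, `-ImageWidth` for the tag) is one plausible way to obtain the width. The dry run at the end uses a stub instead of Exiftool so that it works without the tool or the NAS.

```python
import io
import posixpath
import subprocess
import sys


def page_key(path):
    """Turn '/NAS/Z119585409/00000001.jp2' into 'Z119585409/00000001'."""
    book_dir, file_name = posixpath.split(path.strip())
    page = posixpath.splitext(file_name)[0]
    return f"{posixpath.basename(book_dir)}/{page}"


def exiftool_width(path):
    """Ask Exiftool for the image width (requires exiftool on the node)."""
    result = subprocess.run(
        ["exiftool", "-s3", "-ImageWidth", path],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def run_mapper(lines, read_width=exiftool_width, out=sys.stdout):
    """Read one file path per line and write 'key<TAB>width' per image."""
    for line in lines:
        path = line.strip()
        if path:
            out.write(f"{page_key(path)}\t{read_width(path)}\n")


# Local dry run with stubbed widths (values taken from the slide).
stub_widths = {
    "/NAS/Z119585409/00000001.jp2": 2345,
    "/NAS/Z119585409/00000002.jp2": 2340,
    "/NAS/Z119585409/00000003.jp2": 2543,
}
buffer = io.StringIO()
run_mapper(list(stub_widths), read_width=stub_widths.get, out=buffer)
print(buffer.getvalue(), end="")
```

In an actual streaming job the script would end with `run_mapper(sys.stdin)`, since Hadoop Streaming feeds the input lines to the mapper on standard input and collects its standard output.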
Create text file containing HTML input file paths and create 
one sequence file with the complete file content in HDFS
HtmlPathCreator SequenceFileCreator 
find 
/NAS/Z119585409/00000707.html 
/NAS/Z119585409/00000708.html 
/NAS/Z119585409/00000709.html 
… 
/NAS/Z138682341/00000707.html 
/NAS/Z138682341/00000708.html 
/NAS/Z138682341/00000709.html 
… 
/NAS/Z178791257/00000707.html 
/NAS/Z178791257/00000708.html 
/NAS/Z178791257/00000709.html 
… 
/NAS/Z967985409/00000707.html 
/NAS/Z967985409/00000708.html 
/NAS/Z967985409/00000709.html 
… 
/NAS/Z196545409/00000707.html 
/NAS/Z196545409/00000708.html 
/NAS/Z196545409/00000709.html 
... 
Z119585409/00000707 
Z119585409/00000708 
Z119585409/00000709 
Z119585409/00000710 
Z119585409/00000711 
Z119585409/00000712 
reading files from NAS 
• HtmlPathCreator: ~ 5 h, 1.4 GB path list 
• SequenceFileCreator: ~ 24 h, 997 GB sequence file (uncompressed) 
• Total: ~ 29 h 
• 60,000 books 
• 24 million pages 
SequenceFile creation
Execute Hadoop MapReduce job using the sequence file created 
before in order to calculate the average paragraph block width
Z119585409/00000001 
Z119585409/00000002 
Z119585409/00000003 
Z119585409/00000004 
Z119585409/00000005 
HTML Parsing 
... 
Map Reduce 
Runtime: ~ 6 h 
60,000 books 
24 million pages 
Z119585409/00000001 2100 
Z119585409/00000001 2200 
Z119585409/00000001 2300 
Z119585409/00000001 2400 
Z119585409/00000002 2100 
Z119585409/00000002 2200 
Z119585409/00000002 2300 
Z119585409/00000002 2400 
Z119585409/00000003 2100 
Z119585409/00000003 2200 
Z119585409/00000003 2300 
Z119585409/00000003 2400 
Z119585409/00000004 2100 
Z119585409/00000004 2200 
Z119585409/00000004 2300 
Z119585409/00000004 2400 
Z119585409/00000005 2100 
Z119585409/00000005 2200 
Z119585409/00000005 2300 
Z119585409/00000005 2400 
Z119585409/00000001 2250 
Z119585409/00000002 2250 
Z119585409/00000003 2250 
Z119585409/00000004 2250 
Z119585409/00000005 2250 
HadoopAvBlockWidthMapReduce 
SequenceFile Textfile
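The averaging logic can be sketched in plain Python as below. This is a single-process illustration, not HadoopAvBlockWidthMapReduce: it assumes the HTML pages are hOCR files in which each paragraph block is an element with class `ocr_par` and a `title` attribute containing `bbox x0 y0 x1 y1`, and it assumes the average is reported as an integer. All function and class names are hypothetical.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class ParagraphWidthParser(HTMLParser):
    """Collect the pixel width of every hOCR paragraph block on a page."""

    def __init__(self):
        super().__init__()
        self.widths = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "ocr_par" not in (attributes.get("class") or "").split():
            return
        match = BBOX.search(attributes.get("title") or "")
        if match:
            x0, _, x1, _ = map(int, match.groups())
            self.widths.append(x1 - x0)


def map_page(page_id, html):
    """Map step: emit (page id, block width) for each paragraph block."""
    parser = ParagraphWidthParser()
    parser.feed(html)
    for width in parser.widths:
        yield page_id, width


def reduce_average(pairs):
    """Reduce step: integer average of the block widths per page."""
    grouped = defaultdict(list)
    for page_id, width in pairs:
        grouped[page_id].append(width)
    return {page: sum(widths) // len(widths) for page, widths in grouped.items()}


# Made-up page with four blocks of width 2100, 2200, 2300 and 2400.
sample_page = "".join(
    f'<p class="ocr_par" title="bbox 100 0 {100 + width} 50">text</p>'
    for width in (2100, 2200, 2300, 2400)
)
averages = reduce_average(map_page("Z119585409/00000001", sample_page))
print(averages)
```

With the four widths shown on the slide for one page, the reducer yields 2250, matching the slide's output column.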
Create Hive table and load generated data into the Hive database
Analytic Queries 
CREATE TABLE htmlwidth 
(hid STRING, hwidth INT) 
Runtime: ~ 6 h 
60,000 books 
24 million pages 
HiveLoadExifData & HiveLoadHocrData 
htmlwidth 
hid hwidth 
Z119585409/00000001 1870 
Z119585409/00000002 2100 
Z119585409/00000003 2015 
Z119585409/00000004 1350 
Z119585409/00000005 1700 
jp2width 
jid jwidth 
Z119585409/00000001 2250 
Z119585409/00000002 2150 
Z119585409/00000003 2125 
Z119585409/00000004 2125 
Z119585409/00000005 2250 
CREATE TABLE jp2width 
(jid STRING, jwidth INT)
Analytic Queries 
jp2width htmlwidth 
select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid 
Runtime: ~ 6 h 
60,000 books 
24 million pages 
HiveSelect 
jid jwidth 
Z119585409/00000001 2250 
Z119585409/00000002 2150 
Z119585409/00000003 2125 
Z119585409/00000004 2125 
Z119585409/00000005 2250 
hid hwidth 
Z119585409/00000001 1870 
Z119585409/00000002 2100 
Z119585409/00000003 2015 
Z119585409/00000004 1350 
Z119585409/00000005 1700 
jid jwidth hwidth 
Z119585409/00000001 2250 1870 
Z119585409/00000002 2150 2100 
Z119585409/00000003 2125 2015 
Z119585409/00000004 2125 1350 
Z119585409/00000005 2250 1700
Perform a simple Hive query to test if the 
database has been created successfully
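The join can be tried out locally on the five sample rows. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for Hive, so the column types differ (`TEXT`/`INTEGER` instead of `STRING`/`INT`) and nothing here reflects Hive's execution model; only the table layout and the query shape follow the slides. An `ORDER BY` is added so that the result order is deterministic.

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# Same two tables as on the slides, with SQLite column types.
connection.execute("CREATE TABLE htmlwidth (hid TEXT, hwidth INTEGER)")
connection.execute("CREATE TABLE jp2width (jid TEXT, jwidth INTEGER)")

# Sample rows copied from the slides.
html_rows = [
    ("Z119585409/00000001", 1870),
    ("Z119585409/00000002", 2100),
    ("Z119585409/00000003", 2015),
    ("Z119585409/00000004", 1350),
    ("Z119585409/00000005", 1700),
]
jp2_rows = [
    ("Z119585409/00000001", 2250),
    ("Z119585409/00000002", 2150),
    ("Z119585409/00000003", 2125),
    ("Z119585409/00000004", 2125),
    ("Z119585409/00000005", 2250),
]
connection.executemany("INSERT INTO htmlwidth VALUES (?, ?)", html_rows)
connection.executemany("INSERT INTO jp2width VALUES (?, ?)", jp2_rows)

# Join image width and average block width on the shared page identifier.
joined = connection.execute(
    "SELECT jid, jwidth, hwidth "
    "FROM jp2width INNER JOIN htmlwidth ON jid = hid "
    "ORDER BY jid"
).fetchall()

for row in joined:
    print(*row)
connection.close()
```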
Outlook 
“Progress generally appears much 
greater than it really is” 
Johann Nestroy, 1847
What have WE learned? 
• We need to carefully assess the efforts for data 
preparation vs. the actual processing load 
• HDFS prefers large files over many small ones, 
is basically “append-only” 
• There is still much more the Hadoop ecosystem 
has to offer, e.g. YARN, Pig, Mahout
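The first lesson can be illustrated with the runtimes reported for the book page analysis. The sketch below simply adds up the approximate hours from the slides for the HTML branch; treating path listing and SequenceFile creation as "preparation" and the MapReduce job as "processing" is an interpretation, not a classification given in the talk.

```python
# Approximate runtimes in hours, taken from the book page analysis slides.
PATH_LISTING_H = 5      # HtmlPathCreator: list the HTML files on the NAS
SEQUENCE_FILE_H = 24    # SequenceFileCreator: pack the files into HDFS
MAPREDUCE_H = 6         # HadoopAvBlockWidthMapReduce: the actual analysis

preparation_h = PATH_LISTING_H + SEQUENCE_FILE_H
total_h = preparation_h + MAPREDUCE_H
preparation_share = preparation_h / total_h

print(f"preparation: {preparation_h} h, processing: {MAPREDUCE_H} h")
print(f"share of time spent on preparation: {preparation_share:.0%}")
```

On these numbers, more than four fifths of the elapsed time goes into getting the data into a shape Hadoop can process efficiently.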
What can YOU do? 
• Come join our “Hadoop in cultural heritage” 
hackathon on 2-4 December 2013, Vienna 
(See http://www.scape-project.eu/events ) 
• Check out some tools from our github at 
https://github.com/openplanets/ and help 
us make them better and more scalable 
• Follow us at @SCAPEProject and spread the word!
What’s in it for US? 
• Digital (free) access to centuries of cultural 
heritage data, 24x7 and from anywhere 
• Ensuring our cultural history is not lost 
• New innovative applications using cultural 
heritage data (education, creative industries)
Thank you! Questions? 
(btw, we’re hiring) 
www.kb.nl 
www.onb.ac.at 
www.scape-project.eu 
www.openplanetsfoundation.org

More Related Content

Viewers also liked

The Chubby lock service for loosely- coupled distributed systems
The Chubby lock service for loosely- coupled distributed systems The Chubby lock service for loosely- coupled distributed systems
The Chubby lock service for loosely- coupled distributed systems
Ioanna Tsalouchidou
 
Introduzione al cloud computing e microsoft azure
Introduzione al cloud computing e microsoft azureIntroduzione al cloud computing e microsoft azure
Introduzione al cloud computing e microsoft azure
Angelo Gino Varrati
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
Anshul Bhatnagar
 
Amazon S3 Overview
Amazon S3 OverviewAmazon S3 Overview
Amazon S3 Overview
Emilio Trussardi
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
elliando dias
 
Microsoft Azure - O poder da nuvem
Microsoft Azure - O poder da nuvemMicrosoft Azure - O poder da nuvem
Microsoft Azure - O poder da nuvem
Lucas Chies
 
Amazon's Simple Storage Service (S3)
Amazon's Simple Storage Service (S3)Amazon's Simple Storage Service (S3)
Amazon's Simple Storage Service (S3)
James Gray
 
An Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed DatabaseAn Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed Database
Benjamin Bengfort
 
Middleware and Middleware in distributed application
Middleware and Middleware in distributed applicationMiddleware and Middleware in distributed application
Middleware and Middleware in distributed application
Rishikese MR
 
Intro to Amazon S3
Intro to Amazon S3Intro to Amazon S3
Intro to Amazon S3
Yu Lun Teo
 
Overview of Amazon Web Services
Overview of Amazon Web ServicesOverview of Amazon Web Services
Overview of Amazon Web Services
Amazon Web Services
 
What is Cloud Computing with Amazon Web Services?
What is Cloud Computing with Amazon Web Services?What is Cloud Computing with Amazon Web Services?
What is Cloud Computing with Amazon Web Services?
Amazon Web Services
 
What is AWS?
What is AWS?What is AWS?
What is AWS?
Martin Yan
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
Lynn Langit
 
Heart beat monitor system PPT
Heart beat monitor system PPT Heart beat monitor system PPT
Heart beat monitor system PPT
Anand Dwivedi
 
Introduction to OpenStack Architecture
Introduction to OpenStack ArchitectureIntroduction to OpenStack Architecture
Introduction to OpenStack Architecture
OpenStack Foundation
 
Middleware Basics
Middleware BasicsMiddleware Basics
Middleware Basics
Varun Arora
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
sudhakara st
 
The Google Chubby lock service for loosely-coupled distributed systems
The Google Chubby lock service for loosely-coupled distributed systemsThe Google Chubby lock service for loosely-coupled distributed systems
The Google Chubby lock service for loosely-coupled distributed systems
Romain Jacotin
 
Introducing OpenStack for Beginners
Introducing OpenStack for Beginners Introducing OpenStack for Beginners
Introducing OpenStack for Beginners
openstackindia
 

Viewers also liked (20)

The Chubby lock service for loosely- coupled distributed systems
The Chubby lock service for loosely- coupled distributed systems The Chubby lock service for loosely- coupled distributed systems
The Chubby lock service for loosely- coupled distributed systems
 
Introduzione al cloud computing e microsoft azure
Introduzione al cloud computing e microsoft azureIntroduzione al cloud computing e microsoft azure
Introduzione al cloud computing e microsoft azure
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
 
Amazon S3 Overview
Amazon S3 OverviewAmazon S3 Overview
Amazon S3 Overview
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Microsoft Azure - O poder da nuvem
Microsoft Azure - O poder da nuvemMicrosoft Azure - O poder da nuvem
Microsoft Azure - O poder da nuvem
 
Amazon's Simple Storage Service (S3)
Amazon's Simple Storage Service (S3)Amazon's Simple Storage Service (S3)
Amazon's Simple Storage Service (S3)
 
An Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed DatabaseAn Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed Database
 
Middleware and Middleware in distributed application
Middleware and Middleware in distributed applicationMiddleware and Middleware in distributed application
Middleware and Middleware in distributed application
 
Intro to Amazon S3
Intro to Amazon S3Intro to Amazon S3
Intro to Amazon S3
 
Overview of Amazon Web Services
Overview of Amazon Web ServicesOverview of Amazon Web Services
Overview of Amazon Web Services
 
What is Cloud Computing with Amazon Web Services?
What is Cloud Computing with Amazon Web Services?What is Cloud Computing with Amazon Web Services?
What is Cloud Computing with Amazon Web Services?
 
What is AWS?
What is AWS?What is AWS?
What is AWS?
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Heart beat monitor system PPT
Heart beat monitor system PPT Heart beat monitor system PPT
Heart beat monitor system PPT
 
Introduction to OpenStack Architecture
Introduction to OpenStack ArchitectureIntroduction to OpenStack Architecture
Introduction to OpenStack Architecture
 
Middleware Basics
Middleware BasicsMiddleware Basics
Middleware Basics
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
The Google Chubby lock service for loosely-coupled distributed systems
The Google Chubby lock service for loosely-coupled distributed systemsThe Google Chubby lock service for loosely-coupled distributed systems
The Google Chubby lock service for loosely-coupled distributed systems
 
Introducing OpenStack for Beginners
Introducing OpenStack for Beginners Introducing OpenStack for Beginners
Introducing OpenStack for Beginners
 

Similar to The Elephant in the Library - Integrating Hadoop

The Elephant in the Library
The Elephant in the LibraryThe Elephant in the Library
The Elephant in the Library
DataWorks Summit
 
SCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/BelgiumSCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/Belgium
Sven Schlarb
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack
Srinath Perera
 
Kings fund - implementing Hyku
Kings fund - implementing HykuKings fund - implementing Hyku
Kings fund - implementing Hyku
PTFS Europe Limited
 
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara AubryArchiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
Biblioteca Nacional de España
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation Workflows
SCAPE Project
 
State of Image Annotations - I Annotate 2016
State of Image Annotations - I Annotate 2016State of Image Annotations - I Annotate 2016
State of Image Annotations - I Annotate 2016
r0bcas7
 
Optimising Workflows for Digital Archives: UCD Digital Library
Optimising Workflows for Digital Archives: UCD Digital LibraryOptimising Workflows for Digital Archives: UCD Digital Library
Optimising Workflows for Digital Archives: UCD Digital Library
UCD Library
 
container crash course
container crash coursecontainer crash course
container crash course
Andrew Shafer
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and Roadmap
Srinath Perera
 
IIIF Introduction given in South Africa - 2019
IIIF Introduction given in South Africa - 2019IIIF Introduction given in South Africa - 2019
IIIF Introduction given in South Africa - 2019
Glen Robson
 
Addressing dm-cloud
Addressing dm-cloudAddressing dm-cloud
Addressing dm-cloud
Genoveva Vargas-Solar
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Open Analytics
 
Open Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenOpen Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe Olsen
Christopher Whitaker
 
Edinburgh OldMapsOnline Workshop
Edinburgh OldMapsOnline WorkshopEdinburgh OldMapsOnline Workshop
Edinburgh OldMapsOnline Workshop
Petr Pridal
 
Mechanical curator - Technical notes
Mechanical curator - Technical notesMechanical curator - Technical notes
Mechanical curator - Technical notes
benosteen
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunities
Ahmed AlSum
 
Big data
Big dataBig data
Big data
roysonli
 
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Artefactual Systems - AtoM
 
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web ArchivingHBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving
HBaseCon
 

Similar to The Elephant in the Library - Integrating Hadoop (20)

The Elephant in the Library
The Elephant in the LibraryThe Elephant in the Library
The Elephant in the Library
 
SCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/BelgiumSCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/Belgium
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack
 
Kings fund - implementing Hyku
Kings fund - implementing HykuKings fund - implementing Hyku
Kings fund - implementing Hyku
 
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara AubryArchiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation Workflows
 
State of Image Annotations - I Annotate 2016
State of Image Annotations - I Annotate 2016State of Image Annotations - I Annotate 2016
State of Image Annotations - I Annotate 2016
 
Optimising Workflows for Digital Archives: UCD Digital Library
Optimising Workflows for Digital Archives: UCD Digital LibraryOptimising Workflows for Digital Archives: UCD Digital Library
Optimising Workflows for Digital Archives: UCD Digital Library
 
container crash course
container crash coursecontainer crash course
container crash course
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and Roadmap
 
IIIF Introduction given in South Africa - 2019
IIIF Introduction given in South Africa - 2019IIIF Introduction given in South Africa - 2019
IIIF Introduction given in South Africa - 2019
 
Addressing dm-cloud
Addressing dm-cloudAddressing dm-cloud
Addressing dm-cloud
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
 
Open Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenOpen Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe Olsen
 
Edinburgh OldMapsOnline Workshop
Edinburgh OldMapsOnline WorkshopEdinburgh OldMapsOnline Workshop
Edinburgh OldMapsOnline Workshop
 
Mechanical curator - Technical notes
Mechanical curator - Technical notesMechanical curator - Technical notes
Mechanical curator - Technical notes
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunities
 
Big data
Big dataBig data
Big data
 
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
 
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web ArchivingHBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving
 

More from cneudecker

EuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State LibraryEuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State Library
cneudecker
 
ALTO, PAGE & Co. Formate für Volltexte
ALTO, PAGE & Co. Formate für VolltexteALTO, PAGE & Co. Formate für Volltexte
ALTO, PAGE & Co. Formate für Volltexte
cneudecker
 
OCR und Strukturerkennung für Zeitungen
OCR und Strukturerkennung für ZeitungenOCR und Strukturerkennung für Zeitungen
OCR und Strukturerkennung für Zeitungen
cneudecker
 
Digitisation and Digital Humanities - what is the role of Libraries?
Digitisation and Digital Humanities - what is the role of Libraries?Digitisation and Digital Humanities - what is the role of Libraries?
Digitisation and Digital Humanities - what is the role of Libraries?
cneudecker
 
Multimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical NewspapersMultimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical Newspapers
cneudecker
 
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
OCR und Strukturerkennung: Herausforderungen und Ansätze für die Zeitungsdigi...
cneudecker
 
AI for digitized cultural heritage
AI for digitized cultural heritageAI for digitized cultural heritage
AI for digitized cultural heritage
cneudecker
 
Kuratieren mit künstlicher Intelligenz
Kuratieren mit künstlicher IntelligenzKuratieren mit künstlicher Intelligenz
Kuratieren mit künstlicher Intelligenz
cneudecker
 
Überblick zum DFG-Projekt OCR-D
Überblick zum DFG-Projekt OCR-DÜberblick zum DFG-Projekt OCR-D
Überblick zum DFG-Projekt OCR-D
cneudecker
 
The many uses of digitized newspapers
The many uses of digitized newspapersThe many uses of digitized newspapers
The many uses of digitized newspapers
cneudecker
 
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
Digitalisate kuratieren mit KI - von unstrukturierten Daten zu strukturierten...
cneudecker
 
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
Von der Zeitungsdigitalisierung zu historischen Netzwerken - Methoden und Her...
cneudecker
 
OCR-D: An end-to-end open source OCR framework for historical printed documents
OCR-D: An end-to-end open source OCR framework for historical printed documentsOCR-D: An end-to-end open source OCR framework for historical printed documents
OCR-D: An end-to-end open source OCR framework for historical printed documents
cneudecker
 
Text and Data Mining
Text and Data MiningText and Data Mining
Text and Data Mining
cneudecker
 
Formate für Volltexte
Formate für VolltexteFormate für Volltexte
Formate für Volltexte
cneudecker
 
Extrablatt: The Latest News on Newspaper Digitisation in Europe
Extrablatt: The Latest News on Newspaper Digitisation in EuropeExtrablatt: The Latest News on Newspaper Digitisation in Europe
Extrablatt: The Latest News on Newspaper Digitisation in Europe
cneudecker
 
Reise durch Europeana Collections in 11 Minuten
Reise durch Europeana Collections in 11 MinutenReise durch Europeana Collections in 11 Minuten
Reise durch Europeana Collections in 11 Minuten
cneudecker
 
Europeana Newspapers in a Nutshell
Europeana Newspapers in a NutshellEuropeana Newspapers in a Nutshell
Europeana Newspapers in a Nutshell
cneudecker
 
lab.sbb.berlin
lab.sbb.berlinlab.sbb.berlin
lab.sbb.berlin
cneudecker
 
Named Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana NewspapersNamed Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana Newspapers
cneudecker
 

More from cneudecker (20)

EuropeanaTech x AI: Qurator.ai @ Berlin State Library
EuropeanaTech x AI: Qurator.ai @ Berlin State LibraryEuropeanaTech x AI: Qurator.ai @ Berlin State Library
The Elephant in the Library - Integrating Hadoop

  • 1. SCAPE The Elephant in the Library Integrating Hadoop Clemens Neudecker Sven Schlarb @cneudecker @SvenSchlarb
  • 2. Contents 1. Background: Digitization of cultural heritage 2. Numbers: Scaling up! 3. Challenges: Use cases and scenarios 4. Outlook
  • 3. 1. Background “The digital revolution is far more significant than the invention of writing or even of printing” Douglas Engelbart
  • 5. Our libraries • The Hague, Netherlands • Founded in 1798 • 120.000 visitors per year • 6 million documents • 260 FTE www.kb.nl • Vienna, Austria • Founded in 14th century • 300.000 visitors per year • 8 million documents • 300 FTE www.onb.ac.at
  • 6. Digitization Libraries are rapidly transforming from physical… to digital…
  • 7. Transformation Curation Lifecycle Model from Digital Curation Centre www.dcc.ac.uk
  • 8. Now
  • 10. Our data – cultural heritage • Traditionally • Bibliographic and other metadata • Images (Portraits/Pictures, Maps, Posters, etc.) • Text (Books, Articles, Newspapers, etc.) • More recently • Audio/Video • Websites, Blogs, Twitter, Social Networks • Research Data/Raw Data • Software? Apps?
  • 11. 2. Numbers “A good decision is based on knowledge and not on numbers” Plato, 400 BC
  • 12. Numbers (I) National Library of the Netherlands • Digital objects • > 500 million files • 18 million digital publications (+ 2M/year) • 8 million newspaper pages (+ 4M/year) • 152.000 books (+ 100k/year) • 730.000 websites (+ 170k/year) • Storage • 1.3 PB (currently 458 TB used) • Growing approx. 150 TB a year
  • 13. Numbers (II) Austrian National Library • Digital objects • 600.000 volumes being digitised during the next years (currently 120.000 volumes, 40 million pages) • 10 million newspapers and legal texts • 1.16 billion files in web archive from > 1 million domains • Several 100.000 images and portraits • Storage • 84 TB • Growing approx. 15 TB a year
  • 14. Numbers (III) • Google Books Project • 2012: 20 million books scanned (approx. 7,000,000,000 pages) • www.books.google.com • Europeana • 2012: 25 million digital objects • All metadata licensed CC-0 • www.europeana.eu/portal
  • 15. Numbers (IV) • Hathi Trust • 3,721,702,950 scanned pages • 477 TBytes • www.hathitrust.org • Internet Archive • 245 billion web pages archived • 10 PBytes • www.archive.org
  • 16. Numbers (V) • What can we expect? • Enumerate 2012: only about 4% digitised so far • Strong growth of born digital information Source: www.idc.com Source: security.networksasia.net
  • 17. 3. Challenges “What do you do with a million books?” Gregory Crane, 2006
  • 18. Making it scale Scalability in terms of … • size • number • complexity • heterogeneity
  • 19. SCAPE • SCAPE = SCAlable Preservation Environments • €8.6M EU funding, Feb 2011 – July 2014 • 20 partners from public sector, academia, industry • Main objectives: • Scalability • Automation • Planning www.scape-project.eu
  • 20. Use cases (I) • Document recognition: From image to XML • Business case: • Better presentation options • Creation of eBooks • Full-text indexing
  • 21. Use cases (II) • File type migration: JP2k → TIFF • Business case: • Originally migration to JP2k to reduce storage costs • Reverse process used in case JP2k becomes obsolete
  • 22. Use cases (III) • Web archiving: Characterization of web content • Business case: • What is in a Top Level Domain? • What is the distribution of file formats? • http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits xkcd.com/688
  • 23. Use cases (IV) • Digital Humanities: Making sense of the millions • Business case: • Text mining & NLP • Statistical analysis • Semantic enrichment • Visualizations Source: www.open.ac.uk/
  • 24. Enter the Elephants… Source: Biopics
  • 26. [Architecture diagram labels: Apache Tomcat; Web Application; Execution environment; Taverna Server (REST API); Hadoop Jobtracker; File server; Cluster]
  • 27. Scenarios (I) Log file analysis • Metadata log files generated by the web crawler during the harvesting process (no MIME type identification – just the MIME types returned by the web server)
      20110830130705 9684 46 16 3 image/jpeg http://URL at IP 17311 200
      20110830130709 9684 46 16 3 image/jpeg http://URL at IP 22123 200
      20110830130710 9684 46 16 3 image/gif http://URL at IP 9794 200
      20110830130707 9684 46 16 3 image/jpeg http://URL at IP 40056 200
      20110830130704 9684 46 16 3 text/html http://URL at IP 13149 200
      20110830130712 9684 46 16 3 image/gif http://URL at IP 2285 200
      20110830130712 9684 46 16 3 text/html http://URL at IP 415 301
      20110830130710 9684 46 16 3 text/html http://URL at IP 7873 200
      20110830130712 9684 46 16 3 text/html http://URL at IP 632 302
      20110830130712 9684 46 16 3 image/png http://URL at IP 679 200
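The log file analysis on slide 27 can be sketched in a few lines of Python. This is only an illustration, not the SCAPE implementation: it assumes that the MIME type reported by the web server is always the sixth whitespace-separated field, as in the sample lines, and the function names are made up for this example.

```python
from collections import Counter

# Sample lines copied from slide 27 ("http://URL at IP" is the slide's placeholder).
SAMPLE_LOG = """\
20110830130705 9684 46 16 3 image/jpeg http://URL at IP 17311 200
20110830130709 9684 46 16 3 image/jpeg http://URL at IP 22123 200
20110830130710 9684 46 16 3 image/gif http://URL at IP 9794 200
20110830130707 9684 46 16 3 image/jpeg http://URL at IP 40056 200
20110830130704 9684 46 16 3 text/html http://URL at IP 13149 200
20110830130712 9684 46 16 3 image/gif http://URL at IP 2285 200
20110830130712 9684 46 16 3 text/html http://URL at IP 415 301
20110830130710 9684 46 16 3 text/html http://URL at IP 7873 200
20110830130712 9684 46 16 3 text/html http://URL at IP 632 302
20110830130712 9684 46 16 3 image/png http://URL at IP 679 200
"""


def mime_type(line):
    """Return the server-reported MIME type (assumed to be field 6), or None."""
    fields = line.split()
    return fields[5] if len(fields) > 5 else None


def count_mime_types(lines):
    """Count how often each MIME type occurs in the given log lines."""
    counts = Counter()
    for line in lines:
        mime = mime_type(line)
        if mime is not None:
            counts[mime] += 1
    return counts


counts = count_mime_types(SAMPLE_LOG.splitlines())
for mime, number in counts.most_common():
    print(mime, number)
```

On a cluster the same counting would be split into a map step (emit the MIME type per line) and a reduce step (sum per type); the single-process version above only shows the logic.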
  • 28. Scenarios (II) Web archiving: File format identification → Run file type identification on archived web content
      [Diagram: (W)ARC container (JPG, GIF, HTM, HTM, MID) → (W)ARC RecordReader, based on the HERITRIX web crawler (read/write (W)ARC) → MapReduce: Map uses Apache Tika to detect the MIME type (JPG → image/jpg); Reduce counts per type: image/jpg 1, image/gif 1, text/html 2, audio/midi 1]
  • 29. Scenarios (II) Web archiving: File format identification → Using MapReduce to calculate statistics [Chart labels: DROID 6.01, TIKA 1.0]
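The map and reduce steps of the file format identification scenario can be simulated locally as follows. This is a sketch under stated assumptions: `detect_mime` is a hypothetical stand-in that looks only at the file extension, whereas the slides use Apache Tika on the record content, and `run_job` replaces the Hadoop shuffle with an in-memory dictionary. The record names are invented.

```python
from collections import defaultdict

# Hypothetical stand-in for Apache Tika: maps a file extension to a MIME type.
EXTENSION_TO_MIME = {
    "jpg": "image/jpeg",
    "gif": "image/gif",
    "htm": "text/html",
    "mid": "audio/midi",
}


def detect_mime(record_name):
    """Guess a MIME type from the extension (Tika would inspect the bytes)."""
    extension = record_name.rsplit(".", 1)[-1].lower()
    return EXTENSION_TO_MIME.get(extension, "application/octet-stream")


def map_record(record_name):
    """Map step: emit (MIME type, 1) for one archived record."""
    yield detect_mime(record_name), 1


def reduce_counts(mime, ones):
    """Reduce step: sum the ones emitted for a single MIME type."""
    return mime, sum(ones)


def run_job(record_names):
    """Group the map output by key in memory, then reduce each group."""
    grouped = defaultdict(list)
    for name in record_names:
        for key, value in map_record(name):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Five records, mirroring the JPG, GIF, HTM, HTM, MID container on slide 28.
records = ["a.jpg", "b.gif", "c.htm", "d.htm", "e.mid"]
result = run_job(records)
print(result)
```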
  • 30. Scenarios (III) File format migration • Risk of format obsolescence • Quality assurance • File format validation • Original/target image comparison • Imagine runtime of 1 minute per image for 200 million pages ...
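The "1 minute per image for 200 million pages" remark is easy to quantify. The sketch below does the arithmetic; the `parallel_days` helper assumes ideal linear speed-up across workers, which is an optimistic simplification and not a figure from the slides.

```python
MINUTES_PER_DAY = 60 * 24


def sequential_days(num_images, minutes_per_image=1.0):
    """Days needed when the images are processed one after another."""
    return num_images * minutes_per_image / MINUTES_PER_DAY


def parallel_days(num_images, workers, minutes_per_image=1.0):
    """Days needed with `workers` parallel tasks, assuming ideal linear scaling."""
    return sequential_days(num_images, minutes_per_image) / workers


days = sequential_days(200_000_000)
years = days / 365
print(f"sequential: {days:,.0f} days (about {years:.0f} years)")
print(f"100 ideal workers: {parallel_days(200_000_000, 100):,.0f} days")
```

Processed sequentially, the job would take roughly 380 years, which is why the validation step is parallelised.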
  • 31. Parallel execution of file format validation using Mapper ● Jpylyzer (Python) ● Jhove2 (Java)
  • 32. ● Feature extraction requires sharing resources between processing steps ● Challenge to model more complex image comparison scenarios, e.g. book page duplicates detection or digital book comparison
  • 33. Scenarios (IV) Book page analysis
  • 34. Create text file containing JPEG2000 input file paths and read image metadata using Exiftool via the Hadoop Streaming API
  • 35. Jp2PathCreator → HadoopStreamingExiftoolRead (60.000 books, 24 Million pages)
      Jp2PathCreator (find, reading files from NAS): 1,4 GB, ~ 5 h
        /NAS/Z119585409/00000001.jp2
        /NAS/Z119585409/00000002.jp2
        /NAS/Z119585409/00000003.jp2
        …
        /NAS/Z117655409/00000001.jp2
        /NAS/Z117655409/00000002.jp2
        /NAS/Z117655409/00000003.jp2
        …
        /NAS/Z119585987/00000001.jp2
        /NAS/Z119585987/00000002.jp2
        /NAS/Z119585987/00000003.jp2
        …
        /NAS/Z119584539/00000001.jp2
        /NAS/Z119584539/00000002.jp2
        /NAS/Z119584539/00000003.jp2
        …
        /NAS/Z119599879/00000001.jp2
        /NAS/Z119589879/00000002.jp2
        /NAS/Z119589879/00000003.jp2
        ...
      HadoopStreamingExiftoolRead (reading image metadata): 1,2 GB, ~ 38 h
        Z119585409/00000001 2345
        Z119585409/00000002 2340
        Z119585409/00000003 2543
        …
        Z117655409/00000001 2300
        Z117655409/00000002 2300
        Z117655409/00000003 2345
        …
        Z119585987/00000001 2300
        Z119585987/00000002 2340
        Z119585987/00000003 2432
        …
        Z119584539/00000001 5205
        Z119584539/00000002 2310
        Z119584539/00000003 2134
        …
        Z119599879/00000001 2312
        Z119589879/00000002 2300
        Z119589879/00000003 2300
        ...
      Total: ~ 5 h + ~ 38 h = ~ 43 h
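A Hadoop Streaming mapper for this step could look roughly like the sketch below. It is not the authors' script. It assumes that the metadata value being read is the image width, that `exiftool` is installed on every worker node, and that the output key is the book directory plus the page number, as in the slide's sample output. The function names are hypothetical.

```python
import posixpath
import subprocess
import sys


def page_key(path):
    """Turn '/NAS/Z119585409/00000001.jp2' into 'Z119585409/00000001'."""
    book_dir, file_name = posixpath.split(path.strip())
    page = posixpath.splitext(file_name)[0]
    return posixpath.basename(book_dir) + "/" + page


def exiftool_width(path):
    """Ask exiftool for the image width (assumes exiftool is on the PATH)."""
    completed = subprocess.run(
        ["exiftool", "-s3", "-ImageWidth", path],
        capture_output=True, text=True, check=True,
    )
    return int(completed.stdout.strip())


def map_lines(lines, read_width=exiftool_width):
    """Map step: one input path per line in, one 'key<TAB>width' record out."""
    for line in lines:
        path = line.strip()
        if path:
            yield f"{page_key(path)}\t{read_width(path)}"


def main():
    """Entry point for a streaming mapper: read paths from stdin, write to stdout."""
    for record in map_lines(sys.stdin):
        print(record)
```

Passing `read_width` as a parameter keeps the mapper testable without real JPEG2000 files, for example `list(map_lines(["/NAS/Z119585409/00000001.jp2"], read_width=lambda p: 2345))`.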
  • 36. Create text file containing HTML input file paths and create one sequence file with the complete file content in HDFS
  • 37. HtmlPathCreator → SequenceFileCreator (60.000 books, 24 Million pages)
      HtmlPathCreator (find, reading files from NAS): 1,4 GB, ~ 5 h
        /NAS/Z119585409/00000707.html
        /NAS/Z119585409/00000708.html
        /NAS/Z119585409/00000709.html
        …
        /NAS/Z138682341/00000707.html
        /NAS/Z138682341/00000708.html
        /NAS/Z138682341/00000709.html
        …
        /NAS/Z178791257/00000707.html
        /NAS/Z178791257/00000708.html
        /NAS/Z178791257/00000709.html
        …
        /NAS/Z967985409/00000707.html
        /NAS/Z967985409/00000708.html
        /NAS/Z967985409/00000709.html
        …
        /NAS/Z196545409/00000707.html
        /NAS/Z196545409/00000708.html
        /NAS/Z196545409/00000709.html
        ...
      SequenceFileCreator (SequenceFile creation): 997 GB (uncompressed), ~ 24 h
        Z119585409/00000707
        Z119585409/00000708
        Z119585409/00000709
        Z119585409/00000710
        Z119585409/00000711
        Z119585409/00000712
      Total: ~ 5 h + ~ 24 h = ~ 29 h
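The idea behind this step, packing many small files into one large file of key/value records, can be illustrated with a toy container. The length-prefixed layout below is an invented stand-in and is not the Hadoop SequenceFile format; the keys follow the slide, while the page contents are placeholders.

```python
import io
import struct


def write_records(stream, records):
    """Append (key, content) pairs as length-prefixed records to one stream."""
    for key, value in records:
        key_bytes = key.encode("utf-8")
        # Header: two big-endian unsigned 32-bit lengths (key, value).
        stream.write(struct.pack(">II", len(key_bytes), len(value)))
        stream.write(key_bytes)
        stream.write(value)


def read_records(stream):
    """Read the records back in the order in which they were written."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return
        key_len, value_len = struct.unpack(">II", header)
        key = stream.read(key_len).decode("utf-8")
        value = stream.read(value_len)
        yield key, value


# Placeholder page contents; the real input would be the HTML files on the NAS.
pages = [
    ("Z119585409/00000707", b"<html><body>page 707</body></html>"),
    ("Z119585409/00000708", b"<html><body>page 708</body></html>"),
    ("Z119585409/00000709", b""),
]

container = io.BytesIO()
write_records(container, pages)
container.seek(0)
restored = list(read_records(container))
print(len(restored), "records in", container.getbuffer().nbytes, "bytes")
```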
  • 38. Execute Hadoop MapReduce job using the sequence file created before in order to calculate the average paragraph block width
  • 39. HadoopAvBlockWidthMapReduce (SequenceFile → Textfile), ~ 6 h, 60.000 books, 24 Million pages
      Input keys (SequenceFile):
        Z119585409/00000001
        Z119585409/00000002
        Z119585409/00000003
        Z119585409/00000004
        Z119585409/00000005
        ...
      Map (HTML parsing):
        Z119585409/00000001 2100
        Z119585409/00000001 2200
        Z119585409/00000001 2300
        Z119585409/00000001 2400
        Z119585409/00000002 2100
        Z119585409/00000002 2200
        Z119585409/00000002 2300
        Z119585409/00000002 2400
        Z119585409/00000003 2100
        Z119585409/00000003 2200
        Z119585409/00000003 2300
        Z119585409/00000003 2400
        Z119585409/00000004 2100
        Z119585409/00000004 2200
        Z119585409/00000004 2300
        Z119585409/00000004 2400
        Z119585409/00000005 2100
        Z119585409/00000005 2200
        Z119585409/00000005 2300
        Z119585409/00000005 2400
      Reduce (Textfile):
        Z119585409/00000001 2250
        Z119585409/00000002 2250
        Z119585409/00000003 2250
        Z119585409/00000004 2250
        Z119585409/00000005 2250
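The average block width calculation can be sketched as a local map/reduce in Python. Several things here are assumptions rather than facts from the slides: that the HTML files are hOCR, that paragraphs are elements with class `ocr_par`, and that their bounding box sits in the `title` attribute as `bbox x0 y0 x1 y1`. The sample page is constructed so that its four widths match the slide's values (2100, 2200, 2300, 2400).

```python
import re
from collections import defaultdict
from html.parser import HTMLParser
from statistics import mean

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class ParagraphWidths(HTMLParser):
    """Collect the width (x1 - x0) of every element with class 'ocr_par'."""

    def __init__(self):
        super().__init__()
        self.widths = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "ocr_par" in (attributes.get("class") or "").split():
            match = BBOX.search(attributes.get("title") or "")
            if match:
                x0, _, x1, _ = map(int, match.groups())
                self.widths.append(x1 - x0)


def map_page(page_id, html):
    """Map step: emit (page id, paragraph width) for each paragraph block."""
    parser = ParagraphWidths()
    parser.feed(html)
    for width in parser.widths:
        yield page_id, width


def reduce_page(page_id, widths):
    """Reduce step: average the widths that belong to one page."""
    return page_id, round(mean(widths))


def run_job(pages):
    """Group map output by page id in memory, then reduce each group."""
    grouped = defaultdict(list)
    for page_id, html in pages:
        for key, width in map_page(page_id, html):
            grouped[key].append(width)
    return dict(reduce_page(key, widths) for key, widths in grouped.items())


SAMPLE_PAGE = """
<p class="ocr_par" title="bbox 100 0 2200 50">a</p>
<p class="ocr_par" title="bbox 100 60 2300 120">b</p>
<p class="ocr_par" title="bbox 50 130 2350 200">c</p>
<p class="ocr_par" title="bbox 0 210 2400 300">d</p>
"""

averages = run_job([("Z119585409/00000001", SAMPLE_PAGE)])
print(averages)
```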
  • 40. Create Hive table and load generated data into the Hive database
  • 41. Analytic Queries: HiveLoadExifData & HiveLoadHocrData, ~ 6 h, 60.000 books, 24 Million pages
      CREATE TABLE htmlwidth (hid STRING, hwidth INT)
      CREATE TABLE jp2width (jid STRING, jwidth INT)
      htmlwidth (loaded from the generated text file):
        hid hwidth
        Z119585409/00000001 1870
        Z119585409/00000002 2100
        Z119585409/00000003 2015
        Z119585409/00000004 1350
        Z119585409/00000005 1700
      jp2width (loaded from the generated text file):
        jid jwidth
        Z119585409/00000001 2250
        Z119585409/00000002 2150
        Z119585409/00000003 2125
        Z119585409/00000004 2125
        Z119585409/00000005 2250
  • 42. Analytic Queries: HiveSelect, ~ 6 h, 60.000 books, 24 Million pages
      select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid
      jp2width:
        jid jwidth
        Z119585409/00000001 2250
        Z119585409/00000002 2150
        Z119585409/00000003 2125
        Z119585409/00000004 2125
        Z119585409/00000005 2250
      htmlwidth:
        hid hwidth
        Z119585409/00000001 1870
        Z119585409/00000002 2100
        Z119585409/00000003 2015
        Z119585409/00000004 1350
        Z119585409/00000005 1700
      Result:
        jid jwidth hwidth
        Z119585409/00000001 2250 1870
        Z119585409/00000002 2150 2100
        Z119585409/00000003 2125 2015
        Z119585409/00000004 2125 1350
        Z119585409/00000005 2250 1700
  • 43. Perform a simple Hive query to test if the database has been created successfully
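The join from the analytic queries can be tried out locally without a Hive installation. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for Hive, so the column types are SQLite's `TEXT` and `INTEGER` rather than Hive's `STRING` and `INT`. The rows are the sample values from the slides.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for the Hive database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jp2width (jid TEXT, jwidth INTEGER)")
conn.execute("CREATE TABLE htmlwidth (hid TEXT, hwidth INTEGER)")

# Sample rows taken from the slides.
conn.executemany("INSERT INTO jp2width VALUES (?, ?)", [
    ("Z119585409/00000001", 2250),
    ("Z119585409/00000002", 2150),
    ("Z119585409/00000003", 2125),
    ("Z119585409/00000004", 2125),
    ("Z119585409/00000005", 2250),
])
conn.executemany("INSERT INTO htmlwidth VALUES (?, ?)", [
    ("Z119585409/00000001", 1870),
    ("Z119585409/00000002", 2100),
    ("Z119585409/00000003", 2015),
    ("Z119585409/00000004", 1350),
    ("Z119585409/00000005", 1700),
])

# Same join as on the slide; ORDER BY is added only to make the output stable.
rows = conn.execute(
    "SELECT jid, jwidth, hwidth "
    "FROM jp2width INNER JOIN htmlwidth ON jid = hid "
    "ORDER BY jid"
).fetchall()

for jid, jwidth, hwidth in rows:
    print(jid, jwidth, hwidth)
```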
  • 44. Outlook “Progress generally appears much greater than it really is” Johann Nestroy, 1847
  • 45. What have WE learned? • We need to carefully assess the efforts for data preparation vs. the actual processing load • HDFS prefers large files over many small ones, is basically “append-only” • There is still much more the Hadoop ecosystem has to offer, e.g. YARN, Pig, Mahout
  • 46. What can YOU do? • Come join our “Hadoop in cultural heritage” hackathon on 2-4 December 2013, Vienna (See http://www.scape-project.eu/events ) • Check out some tools from our github at https://github.com/openplanets/ and help us make them better and more scalable • Follow us at @SCAPEProject and spread the word!
  • 47. What’s in it for US? • Digital (free) access to centuries of cultural heritage data, 24x7 and from anywhere • Ensuring our cultural history is not lost • New innovative applications using cultural heritage data (education, creative industries)
  • 48. Thank you! Questions? (btw, we’re hiring) www.kb.nl www.onb.ac.at www.scape-project.eu www.openplanetsfoundation.org