ArchiveSpark:
Efficient Access, Extraction and
Derivation for Archival Collections
https://github.com/helgeho/ArchiveSpark
Helge Holzmann (holzmann@L3S.de)
https://github.com/helgeho/MHLonArchiveSpark
in cooperation with
What is ArchiveSpark?
• Expressive and efficient data access / processing framework
• Originally built for Web Archives, later extended to any collection
• Joint work with the Internet Archive
• Open source
• Fork us on GitHub: https://github.com/helgeho/ArchiveSpark
• Star, contribute, fix, spread, get involved!
• Modular, easily extensible
• More details in: (please cite)
• Helge Holzmann, Vinay Goel, Avishek Anand. ArchiveSpark: Efficient Web
Archive Access, Extraction and Derivation. JCDL 2016 (Best Paper Nominee)
• Helge Holzmann, Emily Novak Gustainis, Vinay Goel. Universal Distant
Reading through Metadata Proxies with ArchiveSpark. IEEE BigData 2017
The ArchiveSpark Approach
• Metadata first, content second, your corpus third
• Two-step loading approach for improved efficiency
• Filter as much as possible on metadata before touching the archive
• Enrich metadata instead of mapping / transforming the full records
(Diagram: the two-step approach, shown for Web archives and generally, e.g. for MHL)
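A minimal sketch of this idea in ArchiveSpark's Scala interface, using the MHL names introduced later in this deck; the filter condition is purely hypothetical:
// Step 1: load lightweight metadata records only
val records = ArchiveSpark.load(MhlSearchSpec(MhlSearchOptions(query = "polio")))
// Step 2: filter as much as possible on the metadata
val relevant = records.filter(r => r.title.toLowerCase.contains("journal"))
// Step 3: enrich only the remaining records with (derived) content
val enriched = relevant.enrich(Entities)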
ArchiveSpark and MHL
• MHL-specific Data Specifications are available on GitHub
• https://github.com/helgeho/MHLonArchiveSpark
• The metadata source is MHL's advanced full-text search
• http://mhl.countway.harvard.edu/search
• The basic features are replicated by our tool
• more advanced filtering can be done on the retrieved metadata
• Full-text contents are fetched from the Internet Archive
• Seamlessly, abstracted away from the user
• Metadata records are enriched with requested contents
Simple and Expressive Interface
• Based on Spark, powered by Scala
• This does not mean you have to learn a new programming language!
• The interface is rather declarative, and writing scripts for ArchiveSpark does not require deep knowledge of Spark / Scala
• Simple data accessors are included
• Provide simplified access to the underlying data model
• Easy extraction / enrichment mechanisms
• Customizable and extensible by advanced users
// Specify the MHL search query: term and collection
val query = MhlSearchOptions(query = "polio", collections = MhlCollections.Statemedicalsocietyjournals)
// Load the matching metadata records as an RDD
val rdd = ArchiveSpark.load(MhlSearchSpec(query))
// Enrich the records with named entities extracted from the content
val enriched = rdd.enrich(Entities)
// Save the enriched records as gzipped JSON
enriched.saveAsJson("enriched.json.gz")
Implicit Lineage Documentation
• Nested JSON output encodes lineage of applied enrichments
(Figure: nested JSON output with fields such as title and text, where entities and, within them, persons are nested to encode the enrichment lineage)
Getting Started
• We recommend interactive use with Jupyter
• http://jupyter.org with Apache Toree: https://toree.apache.org
• Commands can be run live, results are returned immediately
• Documentation with examples is available on GitHub
• https://github.com/helgeho/ArchiveSpark/blob/master/docs
• Everything set up in a Docker container to get you started
• https://github.com/helgeho/ArchiveSpark-docker
• ArchiveSpark with Jupyter and Toree, pre-configured
• It’s just one command away, instructions are on GitHub
Run ArchiveSpark with Jupyter
• Start Docker (see https://github.com/helgeho/ArchiveSpark-docker)
• Add additional JAR files for MHL
• Download from https://github.com/helgeho/MHLonArchiveSpark/releases
• Copy to config/lib (the config path you specified with Docker)
Your First ArchiveSpark Jupyter Notebook
• Open Jupyter in your browser:
• Create a new Jupyter notebook with ArchiveSpark:
• You are now ready to write your first job:
• Press ctrl+enter to run a cell and you will immediately see the output:
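(The slide illustrates these steps with screenshots.) As a hedged stand-in, a minimal first cell could simply verify that the Spark context provided by Toree is available:
// Toree exposes the SparkContext as `sc`; this just checks that the kernel works
println(sc.version)
sc.parallelize(1 to 100).sum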
Example: Polio Symptoms in MHL (1)
• What are the most frequently occurring symptoms and affected
body parts of Polio in journals of the Medical Heritage Library?
• The full example is available on GitHub
• https://github.com/helgeho/MHLonArchiveSpark/blob/master/examples/MhlPolioSymptomsSearch.ipynb
• Details can be found in:
• Helge Holzmann, Emily Novak Gustainis, Vinay Goel. Universal Distant
Reading through Metadata Proxies with ArchiveSpark. IEEE BigData 2017
• http://www.helgeholzmann.de/papers/BIGDATA_2017.pdf
• More about the Spark operations used: Spark Programming Guide
• https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html
Example: Polio Symptoms in MHL (2)
• Import the required modules / methods
• Specify the query and load the dataset
• Available options to specify the query can be found in the code
• https://github.com/helgeho/MHLonArchiveSpark/blob/master/src/main/scala/edu/harvard/countway/mhl/archivespark/search/MhlSearchOptions.scala
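A hedged sketch of such a cell; the import paths below are assumptions based on the repository layout linked above, so check the example notebook for the exact ones:
// ArchiveSpark core and implicit RDD extensions (assumed package names)
import de.l3s.archivespark._
import de.l3s.archivespark.implicits._
// MHL-specific data specification and search options (package inferred from the source path above)
import edu.harvard.countway.mhl.archivespark.search._

// Specify the query and load the matching metadata records
val query = MhlSearchOptions(query = "polio", collections = MhlCollections.Statemedicalsocietyjournals)
val rdd = ArchiveSpark.load(MhlSearchSpec(query))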
Example: Polio Symptoms in MHL (3)
• Now the dataset can be filtered based on the available metadata
• At any time, peekJson lets you look at the data of the first record as JSON
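A hedged sketch of such a filter step; the title check is only an example, and which metadata fields are available depends on the MHL record model:
// Keep only records whose title mentions a journal, based on metadata alone
val filtered = rdd.filter(r => r.title.toLowerCase.contains("journal"))
// Inspect the first remaining record as JSON
filtered.peekJson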
Example: Polio Symptoms in MHL (4)
• We define a new Enrich Function that extracts the symptoms
• It is based on the content in lower case (the LowerCase Enrich Function)
• We specify a set of interesting symptoms and affected body parts
• For each record, this set is filtered down to the terms that actually occur in the content
• This new Enrich Function is assigned to a variable called symptoms
• Finally, the dataset is enriched with the set of contained symptoms
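A hedged sketch of this Enrich Function, following the Text.map("length"){...} pattern shown on a later slide; the term list is purely illustrative:
// Illustrative set of symptoms and affected body parts
val interesting = Set("fever", "headache", "fatigue", "paralysis", "spine", "limbs", "muscles")
// Derive, from the lower-cased content, the subset of terms it actually contains
val symptoms = LowerCase.map("symptoms") { text: String =>
  interesting.filter(term => text.contains(term))
}
// Enrich the filtered dataset with the contained symptoms
val withSymptoms = filtered.enrich(symptoms)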
Example: Polio Symptoms in MHL (5)
• We again print out the first record to check the result
• println is used to see the full output; Jupyter would otherwise truncate it
…
The JSON structure nicely reflects the
lineage. We can immediately see that
symptoms in this case were extracted
from the text, which was first
converted to lower case.
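A hedged one-liner for this step, assuming peekJson returns the first record's JSON so that println can show it in full:
println(withSymptoms.peekJson)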
Example: Polio Symptoms in MHL (6)
• Finally, we can count the contained symptoms
• Our symptoms Enrich Function can be used as a pointer to the values here
• We flat-map these values, i.e., we create a flat list of all symptoms in the
dataset, each occurring once per record it is contained in
• To see the results, we print each line of the computed counts
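A hedged sketch of this counting step, reusing valueOrElse as shown on the next slide:
// One entry per record a symptom occurs in, then a simple word count
val counts = withSymptoms
  .flatMap(r => r.valueOrElse(symptoms, Seq.empty))
  .map(symptom => (symptom, 1))
  .reduceByKey(_ + _)
// Print the symptom counts, most frequent first
counts.collect.sortBy(-_._2).foreach(println)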
Example: Polio Symptoms in MHL (7)
• Alternatively, we could save all filtered and enriched records as JSON
• The JSON format is widely supported by many third-party tools, so
that the resulting dataset can easily be post-processed
• To post-process it with Spark, records can be mapped to their raw values
• value can be used to access an enriched value of a record, e.g.:
val mapped = enriched.map(r => (r.title, r.valueOrElse(symptoms, Seq.empty)))
• Hint: to access the raw text of an MHL document, use Text:
E.g., Text.map("length"){txt: String => txt.length} or Entities.on(Text)
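A hedged sketch of the save step; the output name is hypothetical:
// Persist all filtered and enriched records as gzipped JSON for downstream tools
withSymptoms.saveAsJson("polio_symptoms.json.gz")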
For Advanced Users
• Some Enrich Functions, such as Entities, need additional JAR files
• Entities requires the Stanford CoreNLP library and models
• http://central.maven.org/maven2/edu/stanford/nlp/stanford-corenlp/3.4.1/
• These need to be added to the classpath (your config/lib directory if you use Docker)
• ArchiveSpark is also available on Maven Central
• https://mvnrepository.com/artifact/com.github.helgeho/archivespark
• To be used as a library / API to access archival collections programmatically (see the sbt sketch after this list)
• New Enrich Functions and DataSpecs are easy to create
• All required base classes are provided with the core project
• https://github.com/helgeho/ArchiveSpark/blob/master/docs/Contribute.md
• Please share yours!
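For library use, a hedged sbt sketch; take the exact artifact name and version from Maven Central, and use % instead of %% if the artifact is not cross-published per Scala version:
// Add ArchiveSpark as a dependency (version placeholder intentionally left open)
libraryDependencies += "com.github.helgeho" %% "archivespark" % "<version>"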
That’s all Folks!
• Happy coding and please share your insights
• Fork us on GitHub: https://github.com/helgeho/ArchiveSpark
• Star, contribute, fix, spread, get involved!
• Feedback is very welcome…
• Visit us
• https://www.L3S.de
• https://archive.org
• http://alexandria-project.eu
• http://www.medicalheritage.org
