Medical Heritage Library (MHL) on ArchiveSpark

ArchiveSpark:
Efficient Access, Extraction and
Derivation for Archival Collections
https://github.com/helgeho/ArchiveSpark
Helge Holzmann (holzmann@L3S.de)
https://github.com/helgeho/MHLonArchiveSpark
in cooperation with

What is ArchiveSpark?
• Expressive and efficient data access / processing framework
• Originally built for Web Archives, later extended to any collection
• Joint work with the Internet Archive
• Open source
• Fork us on GitHub: https://github.com/helgeho/ArchiveSpark
• Star, contribute, fix, spread, get involved!
• Modular, easily extensible
• More details in: (please cite)
• Helge Holzmann, Vinay Goel, Avishek Anand. ArchiveSpark: Efficient Web
Archive Access, Extraction and Derivation. JCDL 2016 (Best Paper Nominee)
• Helge Holzmann, Emily Novak Gustainis, Vinay Goel. Universal Distant
Reading through Metadata Proxies with ArchiveSpark. IEEE BigData 2017
20/02/2018 Helge Holzmann (holzmann@L3S.de)

The ArchiveSpark Approach
• Metadata first, content second, your corpus third
• Two-step loading approach for improved efficiency
• Filter as much as possible on metadata before touching the archive
• Enrich metadata instead of mapping / transforming the full records
[Figure: the two-step loading approach, shown for Web archives and, more generally, for MHL]
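
The two-step idea above can be sketched with plain Scala collections. This is only an illustration under stated assumptions, not the ArchiveSpark API: `Meta`, `load` and the `fetch` function are hypothetical stand-ins for metadata records and for the expensive access to the archive.

```scala
// Illustrative stand-in only: plain Scala collections, not the ArchiveSpark API.
object TwoStepLoadingSketch {
  // Lightweight metadata record; all field names are hypothetical.
  final case class Meta(id: String, title: String, year: Int)

  // Step 1: filter on metadata only.
  // Step 2: call the (expensive) fetch for the remaining records and attach
  // the content to the metadata record instead of replacing the record.
  def load(metadata: Seq[Meta], minYear: Int)(fetch: String => String): Seq[(Meta, String)] =
    metadata
      .filter(_.year >= minYear)
      .map(m => (m, fetch(m.id)))

  def main(args: Array[String]): Unit = {
    val metadata = Seq(
      Meta("a", "Journal A", 1890),
      Meta("b", "Journal B", 1920),
      Meta("c", "Journal C", 1935)
    )
    // Hypothetical content store standing in for the archive.
    val store = Map("a" -> "text a", "b" -> "text b", "c" -> "text c")

    var fetches = 0
    val enriched = load(metadata, minYear = 1900) { id =>
      fetches += 1
      store(id)
    }
    println(s"${enriched.size} of ${metadata.size} records enriched, $fetches content fetches")
  }
}
```

Because the filter runs before the fetch, the content store is only touched for records that survive the metadata filter.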

ArchiveSpark and MHL
• MHL-specific Data Specifications are available on GitHub
• https://github.com/helgeho/MHLonArchiveSpark
• The metadata source is MHL's advanced full-text search
• http://mhl.countway.harvard.edu/search
• The basic features are replicated by our tool
• more advanced filtering can be done on the retrieved metadata
• Full-text contents are fetched from the Internet Archive
• Seamlessly, abstracted away from the user
• Metadata records are enriched with requested contents

Simple and Expressive Interface
• Based on Spark, powered by Scala
• This does not mean you have to learn a new programming language!
• The interface is rather declarative and writing scripts for
ArchiveSpark does not require deep knowledge about Spark / Scala
• Simple data accessors are included
• Provide simplified access to the underlying data model
• Easy extraction / enrichment mechanisms
• Customizable and extensible by advanced users
val query = MhlSearchOptions(query = "polio", collections = MhlCollections.Statemedicalsocietyjournals)
val rdd = ArchiveSpark.load(MhlSearchSpec(query))
val enriched = rdd.enrich(Entities)
enriched.saveAsJson("enriched.json.gz")

Implicit Lineage Documentation
• Nested JSON output encodes lineage of applied enrichments
[Figure: nested JSON output with the fields title, text, entities, persons]
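
How a nested structure can carry lineage may be easier to see in a small sketch. The code below is an assumption-laden illustration in plain Scala, not ArchiveSpark's data model: `Node`, `enrich` and `paths` are hypothetical names. Each derived value is stored below the value it was derived from, so its path documents how it was computed.

```scala
// Illustrative stand-in only; not ArchiveSpark's record model.
object LineageSketch {
  // A value together with the values that were derived from it.
  final case class Node(value: String, children: Map[String, Node] = Map.empty) {
    // Returns a copy in which `derive` has been applied to the value at `path`
    // and the result is stored as a child called `name` below that value.
    // Throws NoSuchElementException if `path` does not exist.
    def enrich(path: List[String], name: String)(derive: String => String): Node = path match {
      case Nil => copy(children = children + (name -> Node(derive(value))))
      case head :: tail =>
        copy(children = children + (head -> children(head).enrich(tail, name)(derive)))
    }

    // All derivation paths below this node, sorted by name at each level.
    def paths(prefix: String): Seq[String] =
      children.toSeq.sortBy(_._1).flatMap { case (name, child) =>
        val path = if (prefix.isEmpty) name else s"$prefix.$name"
        path +: child.paths(path)
      }
  }

  def main(args: Array[String]): Unit = {
    val record = Node("record", Map("text" -> Node("Fever and HEADACHE were reported.")))
    val enriched = record
      .enrich(List("text"), "lowercase")(_.toLowerCase)
      .enrich(List("text", "lowercase"), "length")(_.length.toString)
    enriched.paths("").foreach(println)
  }
}
```

A path such as `text.lowercase.length` reads as "the length of the lower-cased text", which is the kind of information the nested output makes visible.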

Getting Started
• We recommend the interactive use with Jupyter
• http://jupyter.org with Apache Toree: https://toree.apache.org
• Commands can be run live, results are returned immediately
• Documentation with examples is available on GitHub
• https://github.com/helgeho/ArchiveSpark/blob/master/docs
• Everything set up in a Docker container to get you started
• https://github.com/helgeho/ArchiveSpark-docker
• ArchiveSpark with Jupyter and Toree, pre-configured
• It’s just one command away, instructions are on GitHub

Run ArchiveSpark with Jupyter
• Start Docker (see https://github.com/helgeho/ArchiveSpark-docker)
• Add additional JAR files for MHL
• Download from https://github.com/helgeho/MHLonArchiveSpark/releases
• Copy to config/lib (the config path you specified with Docker)

Your First ArchiveSpark Jupyter Notebook
• Open Jupyter in your browser:
• Create a new Jupyter notebook with ArchiveSpark:
• You are now ready to write your first job:
• Press Ctrl+Enter to run a cell and you will immediately see the output:

Example: Polio Symptoms in MHL (1)
• Which symptoms and affected body parts are mentioned most frequently
for polio in the journals of the Medical Heritage Library?
• The full example is available on GitHub
• https://github.com/helgeho/MHLonArchiveSpark/blob/master/examples/MhlPolioSymptomsSearch.ipynb
• Details can be found in:
• Helge Holzmann, Emily Novak Gustainis, Vinay Goel. Universal Distant
Reading through Metadata Proxies with ArchiveSpark. IEEE BigData 2017
• http://www.helgeholzmann.de/papers/BIGDATA_2017.pdf
• More about the Spark operations used here: Spark Programming Guide
• https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html

• Import the required modules / methods
• Specify the query and load the dataset
• Available options to specify the query can be found in the code
• https://github.com/helgeho/MHLonArchiveSpark/blob/master/src/main/scala/edu/harvard/countway/mhl/archivespark/search/MhlSearchOptions.scala

• Now the dataset can be filtered based on the available metadata
• At any time, peekJson lets you look at the data of the first record as JSON

• We define a new enrich function, which extracts the symptoms
• This is based on the content in lower case (LowerCase Enrich Function)
• We specify a set of interesting symptoms and affected body parts
• For each record, this set is reduced to the terms that occur in the content
• This new Enrich Function is assigned to a variable called symptoms
• Finally, the dataset is enriched with the set of contained symptoms
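
The extraction logic described above can be sketched as a plain Scala function. This is a minimal illustration, not the Enrich Function from the notebook: the object name, the function signature and the example terms are assumptions made for this sketch.

```scala
// Illustrative stand-in only; the notebook defines this as an ArchiveSpark Enrich Function.
object SymptomExtractionSketch {
  // Hypothetical example terms; the notebook specifies its own set.
  val exampleTerms: Set[String] = Set("fever", "headache", "paralysis")

  // Lower-case the content, then keep only the terms that occur in it.
  // This is plain substring matching, so a short term also matches inside longer words.
  def symptoms(content: String, terms: Set[String]): Set[String] = {
    val lower = content.toLowerCase
    terms.filter(term => lower.contains(term))
  }

  def main(args: Array[String]): Unit = {
    val found = symptoms("High FEVER and Headache were observed.", exampleTerms)
    println(found.toSeq.sorted.mkString(", "))
  }
}
```

Working on the lower-cased content means the terms only have to be listed once, in lower case.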

• We again print out the first record to check the result
• println is used to see the full output; Jupyter would truncate it otherwise
…
The JSON structure nicely reflects the
lineage. We can immediately see that
symptoms in this case were extracted
from the text, which was first
converted to lower case.

• Finally, we can count the contained symptoms
• Our symptoms Enrich Function can be used as a pointer to the values here
• We flat-map these values, i.e., we create a flat list of all symptoms in the
dataset, each occurring once per record it is contained in
• To see the results, we print each line of the computed counts
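
The counting step can be sketched on local Scala collections. This is an illustration under assumptions, not the code from the notebook: `countSymptoms` is a hypothetical helper, and on a Spark RDD the same idea is typically expressed with `flatMap`, `map` and `reduceByKey`.

```scala
// Illustrative stand-in only: local collections instead of a Spark RDD.
object SymptomCountSketch {
  // Each inner set holds the symptoms found in one record, so every symptom
  // is counted at most once per record.
  def countSymptoms(perRecord: Seq[Set[String]]): Seq[(String, Int)] =
    perRecord
      .flatMap(_.toSeq)                                // flat list of all symptoms
      .groupBy(identity)                               // symptom -> its occurrences
      .map { case (symptom, occurrences) => (symptom, occurrences.size) }
      .toSeq
      .sortBy { case (symptom, count) => (-count, symptom) } // most frequent first

  def main(args: Array[String]): Unit = {
    val perRecord = Seq(Set("fever", "headache"), Set("fever"), Set("paralysis", "fever"))
    countSymptoms(perRecord).foreach { case (symptom, count) => println(s"$symptom\t$count") }
  }
}
```

The resulting counts are document frequencies: they state in how many records a symptom occurs, not how often it is mentioned overall.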

• Alternatively, we could save all filtered and enriched records as JSON
• The JSON format is widely supported by many third-party tools, so
that the resulting dataset can easily be post-processed
• To post-process it with Spark, records can be mapped to their raw values
• value can be used to access an enriched value of a record, e.g.:
val mapped = enriched.map(r => (r.title, r.valueOrElse(symptoms, Seq.empty)))
• Hint: to access the raw text of an MHL document use Text:
E.g., Text.map("length"){txt: String => txt.length} or Entities.on(Text)

For Advanced Users
• Some Enrich Functions, such as Entities, need additional JAR files
• Entities requires the Stanford CoreNLP library and models
• http://central.maven.org/maven2/edu/stanford/nlp/stanford-corenlp/3.4.1/
• These need to be added to the classpath (your config/lib directory if you use Docker)
• ArchiveSpark is also available on Maven Central
• https://mvnrepository.com/artifact/com.github.helgeho/archivespark
• To be used as library / API to access archival collections programmatically
• New Enrich Functions and DataSpecs are easy to create
• All required base classes are provided with the core project
• https://github.com/helgeho/ArchiveSpark/blob/master/docs/Contribute.md
• Please share yours!

That’s all Folks!
• Happy coding and please share your insights
• Fork us on GitHub: https://github.com/helgeho/ArchiveSpark
• Star, contribute, fix, spread, get involved!
• Feedback is very welcome…
• Visit us
• https://www.L3S.de
• https://archive.org
• http://alexandria-project.eu
• http://www.medicalheritage.org
