
Medical Heritage Library (MHL) on ArchiveSpark


This presentation gives an introduction to ArchiveSpark and the recent extension to use it with any archival collection. The slides demonstrate how to set it up and use it for analyzing data from medical journals of the Medical Heritage Library (MHL).


  1. ArchiveSpark: Efficient Access, Extraction and Derivation for Archival Collections
     Helge Holzmann ( in cooperation with
  2. What is ArchiveSpark?
     • Expressive and efficient data access / processing framework
       • Originally built for web archives, later extended to any collection
       • Joint work with the Internet Archive
     • Open source
       • Fork us on GitHub: star, contribute, fix, spread, get involved!
     • Modular, easily extensible
     • More details in (please cite):
       • Helge Holzmann, Vinay Goel, Avishek Anand. ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. JCDL 2016 (Best Paper Nominee)
       • Helge Holzmann, Emily Novak Gustainis, Vinay Goel. Universal Distant Reading through Metadata Proxies with ArchiveSpark. IEEE BigData 2017
     20/02/2018 Helge Holzmann (
  3. The ArchiveSpark Approach
     • Metadata first, content second, your corpus third
     • Two-step loading approach for improved efficiency
       • Filter as much as possible on metadata before touching the archive
       • Enrich metadata instead of mapping / transforming the full records
     (Figure: metadata sources, in web archives generally and for MHL)
  4. ArchiveSpark and MHL
     • MHL-specific Data Specifications are available on GitHub
     • The metadata source is MHL's advanced full-text search
       • The basic features are replicated by our tool
       • More advanced filtering can be done on the retrieved metadata
     • Full-text contents are fetched from the Internet Archive
       • Seamlessly, abstracted away from the user
       • Metadata records are enriched with the requested contents
  5. Simple and Expressive Interface
     • Based on Spark, powered by Scala
       • This does not mean you have to learn a new programming language!
       • The interface is rather declarative, and writing scripts for ArchiveSpark does not require deep knowledge of Spark / Scala
     • Simple data accessors are included
       • They provide simplified access to the underlying data model
     • Easy extraction / enrichment mechanisms
       • Customizable and extensible by advanced users

     val query = MhlSearchOptions(query = "polio", collections = MhlCollections.Statemedicalsocietyjournals)
     val rdd = ArchiveSpark.load(MhlSearchSpec(query))
     val enriched = rdd.enrich(Entities)
     enriched.saveAsJson("enriched.json.gz")
  6. Implicit Lineage Documentation
     • Nested JSON output encodes the lineage of applied enrichments
     (Figure: nested JSON record with title, text, entities, persons)
  7. Getting Started
     • We recommend interactive use with Jupyter
       • with Apache Toree: commands can be run live, and results are returned immediately
     • Documentation with examples is available on GitHub
     • Everything is set up in a Docker container to get you started
       • ArchiveSpark with Jupyter and Toree, pre-configured
       • It's just one command away; instructions are on GitHub
  8. Run ArchiveSpark with Jupyter
     • Start Docker (see
     • Add the additional JAR files for MHL
       • Download from
       • Copy them to config/lib (the config path you specified with Docker)
  9. Your First ArchiveSpark Jupyter Notebook
     • Open Jupyter in your browser
     • Create a new Jupyter notebook with ArchiveSpark
     • You are now ready to write your first job
     • Press Ctrl+Enter to run a cell, and you will immediately see the output
  10. Example: Polio Symptoms in MHL (1)
     • What are the most frequently occurring symptoms and affected body parts of polio in journals of the Medical Heritage Library?
     • The full example is available on GitHub: PolioSymptomsSearch.ipynb
     • Details can be found in:
       • Helge Holzmann, Emily Novak Gustainis, Vinay Goel. Universal Distant Reading through Metadata Proxies with ArchiveSpark. IEEE BigData 2017
     • More about the used Spark operations: Spark Programming Guide
  11. Example: Polio Symptoms in MHL (2)
     • Import the required modules / methods
     • Specify the query and load the dataset
     • The available query options can be found in the code:
       a/edu/harvard/countway/mhl/archivespark/search/MhlSearchOptions.scala
  12. Example: Polio Symptoms in MHL (3)
     • Now the dataset can be filtered based on the available metadata
     • At any time, peekJson lets you look at the first record of the dataset as JSON
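The metadata-first filtering step can be sketched in plain Scala, without Spark or ArchiveSpark; the record fields and values here are hypothetical, but on a real ArchiveSpark RDD the same predicate would simply be passed to rdd.filter { ... } before any full text is fetched.

```scala
// Minimal sketch of metadata filtering (hypothetical fields and sample data).
case class MhlRecord(title: String, year: Int, language: String)

val records = Seq(
  MhlRecord("Journal of the State Medical Society", 1948, "eng"),
  MhlRecord("Boletin medico", 1952, "spa"),
  MhlRecord("Polio quarterly report", 1939, "eng")
)

// Keep only English records from 1940 onwards; no document content is touched.
val filtered = records.filter(r => r.language == "eng" && r.year >= 1940)

println(  // prints: List(Journal of the State Medical Society)
```

Because filtering happens on the lightweight metadata records, the expensive fetch of full-text contents is deferred until the corpus is already narrowed down.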
  13. Example: Polio Symptoms in MHL (4)
     • We define a new Enrich Function, which extracts the symptoms
       • It is based on the content in lower case (the LowerCase Enrich Function)
     • We specify a set of interesting symptoms and affected body parts
       • For each record, this set is filtered down to the terms contained in the content
     • This new Enrich Function is assigned to a variable called symptoms
     • Finally, the dataset is enriched with the set of contained symptoms
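The core of this Enrich Function can be illustrated as a plain Scala function on a string; the symptom terms below are examples chosen for illustration, not the list used in the notebook, and the real implementation would wrap this logic in an ArchiveSpark Enrich Function applied on the lower-cased content.

```scala
// Sketch of the extraction logic: lower-case the content, then keep only
// the terms of interest that actually occur in it.
val symptomTerms = Set("fever", "headache", "paralysis", "spine", "muscle")

def extractSymptoms(text: String): Set[String] = {
  val lower = text.toLowerCase  // corresponds to the LowerCase Enrich Function
  symptomTerms.filter(lower.contains)
}

println(extractSymptoms("Patients reported Fever and progressive Paralysis."))
// contains "fever" and "paralysis"
```

Note that this is a simple substring check; matching whole words or stemmed variants would require a slightly more careful predicate.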
  14. Example: Polio Symptoms in MHL (5)
     • We again print out the first record to check the result
     • println is used to see the full output; Jupyter would cut it off otherwise
     • The JSON structure nicely reflects the lineage: we can immediately see that the symptoms in this case were extracted from the text, which was first converted to lower case
  15. Example: Polio Symptoms in MHL (6)
     • Eventually, we can count the contained symptoms
     • Our symptoms Enrich Function can be used as a pointer to the values here
     • We flat-map these values, i.e., we create a flat list of all symptoms in the dataset, with each symptom occurring once per record that contains it
     • To see the results, we print each line of the computed counts
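The flat-map-and-count step can be sketched with plain Scala collections in place of a Spark RDD; the per-record symptom sets below are made-up sample data, and with ArchiveSpark the equivalent would be roughly rdd.flatMap(r => r.valueOrElse(symptoms, Seq.empty)) followed by a count per value.

```scala
// Hypothetical per-record results of the symptoms Enrich Function.
val symptomsPerRecord = Seq(
  Set("fever", "paralysis"),
  Set("fever"),
  Set("fever", "spine")
)

// Flatten to one entry per (record, symptom) occurrence, then count.
val counts: Map[String, Int] =
  symptomsPerRecord.flatten.groupBy(identity).map { case (s, xs) => s -> xs.size }

// Print the counts, most frequent first (here "fever" appears in 3 records).
counts.toSeq.sortBy(-_._2).foreach(println)
```

Because each record contributes a set, a symptom is counted at most once per record, which matches the "once per record it is contained in" semantics of the slide.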
  16. Example: Polio Symptoms in MHL (7)
     • Alternatively, we could save all filtered and enriched records as JSON
       • The JSON format is widely supported by third-party tools, so the resulting dataset can easily be post-processed
     • To post-process it with Spark, records can be mapped to their raw values
       • value can be used to access an enriched value of a record, e.g.:
         val mapped = => (r.title, r.valueOrElse(symptoms, Seq.empty)))
     • Hint: to access the raw text of an MHL document, use Text, e.g.:"length") { txt: String => txt.length } or Entities.on(Text)
  17. For Advanced Users
     • Some Enrich Functions, such as Entities, need additional JAR files
       • Entities requires the Stanford CoreNLP library and models
       • These need to be added to the classpath (your config/lib directory if you use Docker)
     • ArchiveSpark is also available on Maven Central
       • It can be used as a library / API to access archival collections programmatically
     • New Enrich Functions and DataSpecs are easy to create
       • All required base classes are provided with the core project
       • Please share yours!
  18. That's all, Folks!
     • Happy coding, and please share your insights
     • Fork us on GitHub: star, contribute, fix, spread, get involved!
     • Feedback is very welcome
     • Visit us