BYOI: Build your own index
Build your own discovery index of scholarly e-resources
40th European Library Automation Group (ELAG) Conference 2016,
2016–06–06, Copenhagen, Den Sorte Diamant, is.gd/nDh4TY
Martin Czygan, David Aumüller, Leander Seige
Leipzig University Library
https://ub.uni-leipzig.de
https://finc.info
https://amsl.technology
itprojekte@ub.uni-leipzig.de
Welcome
During the next few hours, we will create a small aggregated index
from scratch.
You can code along if you like. Code, data and slides are distributed
in a VM (on a USB stick).
Why
At Leipzig University Library we built a version that serves as a
successor to a commercial product.
Index includes data from Crossref, DOAJ, JSTOR, Elsevier, Genios,
Thieme, DeGruyter among others.
About 55% of our holdings covered. Potentially growable in breadth
and depth.
Format
We will use a combination of
slides to motivate concepts and
live coding and experimentation
–
We will not use a product, we will build it.
Goals
a running VuFind 3 with a small aggregated index
learn about a batch processing framework
First Steps
Figure 1: First steps
Prerequisites
Virtualbox: https://www.virtualbox.org/wiki/Downloads
Import Appliance
On the USB-Stick you can find an OVA file that you can import
into Virtualbox (or try to download it from https://goo.gl/J7hcYC).
This VM contains:
a VuFind 3 installation – /usr/local/vufind
raw metadata (around 3M records) – ~/Bootcamp/input
scripts and stubs for processing – ~/Bootcamp/code
these slides – ~/Bootcamp/slides.pdf
Forwarded ports
Guest (VM) >> Host
80 >> 8085 (HTTP, VuFind)
8080 >> 8086 (SOLR)
8082 >> 8087 (luigi)
22 >> 2200 (SSH)
3306 >> 13306 (MySQL)
SSH tip:
$ curl -sL https://git.io/vrxoC > vm.sh
$ chmod +x vm.sh
$ ./vm.sh
Outline
Bootcamp play book:
intro: problem setting (heterogeneous data, batch processing)
VM setup - during intro
Then we will write some code:
a basic pipeline with luigi python library
combine various sources into a common format
apply licensing information
index into solr
Outline DAG
Figure 2: Tour
Intro: Problem setting
batch processing, not small data (but not too big, either)
regular processing required
varying requirements
multiple small steps to apply on all data
iterative development
Intro: Rise of the DAG
DAG = directed acyclic graph, partial ordering
many things use DAGs: make, Excel, scheduling problems
model tasks in a DAG, then run topological sort to determine
order
Examples:
http://goo.gl/FCpxiK (history is a DAG)
http://i.stack.imgur.com/iVNcu.png (airflow)
https://git.io/vw9rW (luigi)
https://goo.gl/vMEezR (Azkaban)
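A minimal sketch of the "model tasks in a DAG, then sort" idea, using only the Python standard library (graphlib, Python 3.9+). The task names are illustrative and not the names used later in the bootcamp code.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (illustrative names).
deps = {
    "index": {"license"},
    "license": {"combine"},
    "combine": {"crossref", "doaj"},
    "crossref": set(),
    "doaj": set(),
}

# static_order() yields every task after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Because the ordering is only partial, "crossref" and "doaj" may come out in either order; both are guaranteed to precede "combine".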
Intro: Immutability
immutability = data is not modified after it is created
immutable data has some advantages, e.g.
“human fault tolerance”
performance
our use case: recompute everything from raw data
tradeoff: more computation, but less to think about
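One way to picture immutable outputs is a write-once helper: a file is produced completely or not at all, and is never touched again once it exists. This is a sketch under that assumption; the helper name write_once is made up for illustration and is not part of any library.

```python
import os
import tempfile


def write_once(path, produce):
    """Create path from produce() unless it already exists; never modify it."""
    if os.path.exists(path):
        return False  # already computed, leave it untouched
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(produce())
        os.replace(tmp, path)  # atomic rename within one filesystem
    except BaseException:
        os.remove(tmp)  # do not leave half-written files behind
        raise
    return True
```

To recompute, delete the output and run again; nothing is patched in place.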
Intro: Frameworks
many libraries and frameworks for batch processing and
scheduling, e.g. Oozie, Azkaban, Airflow, luigi, . . .
even more tools when working with streams: Kafka, various
queues, . . .
luigi is nice, because it has only a few prerequisites
Intro: Luigi in one slide
import luigi


class MyTask(luigi.Task):
    param = luigi.Parameter(default='ABC')

    def requires(self):
        return SomeOtherTask()

    def run(self):
        with self.output().open('w') as output:
            output.write('%s business' % self.param)

    def output(self):
        return luigi.LocalTarget(path='output.txt')


if __name__ == '__main__':
    luigi.run()
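To make the requires/output/run contract concrete without installing anything, here is a tiny stand-in written with the standard library only. It is not luigi and leaves out parameters, targets, scheduling and error handling; the class and function names (Task, build, Input, Upper) are invented for this sketch.

```python
import os
import tempfile


class Task:
    """Tiny stand-in for the requires/output/run contract; not luigi itself."""

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError


def build(task, log):
    """Run dependencies first; skip any task whose output already exists."""
    if os.path.exists(task.output()):
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)


WORKDIR = tempfile.mkdtemp()


class Input(Task):
    def output(self):
        return os.path.join(WORKDIR, "input.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("abc\n")


class Upper(Task):
    def requires(self):
        return [Input()]

    def output(self):
        return os.path.join(WORKDIR, "upper.txt")

    def run(self):
        source = self.requires()[0].output()
        with open(source) as src, open(self.output(), "w") as dst:
            dst.write(src.read().upper())


log = []
build(Upper(), log)
build(Upper(), log)  # second call does nothing: the output exists
print(log)
```

The point of the sketch: a task is considered done when its output exists, so re-running a pipeline only computes what is missing.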
Intro: luigi
many integrations, e.g. MySQL, Postgres, elasticsearch, . . .
support for Hadoop and HDFS, S3, Redshift, . . .
200+ contributors, 350+ ML
hackable, extendable – e.g.
https://github.com/ubleipzig/gluish
Intro: Decomposing our goal
clean and rearrange input data files
convert data into a common (intermediate) format
apply licensing information (from kbart)
index into solr
Intro: Incremental Development
when we work with unknown data sources, we have to
gradually move forward
Intro: Wrap up
many approaches to data processing
we will focus on one library only here
concepts like immutability, recomputation and incremental
development are more general
–
now back to the code
Test VuFind installation
We can SSH into the VM and start VuFind:
$ ./vm.sh
(vm) $ cd /usr/local/vufind
(vm) $ ./solr.sh start
Starting VuFind ...
...
http://localhost:8085/vufind
http://localhost:8085/vufind/Install/Home
Hello World
Test a Python script on guest. Go to the Bootcamp directory:
$ cd $HOME/Bootcamp
$ python hello.py
...
Note: Files follow PEP 8, so indent with spaces (here: 4).
Setup wrap-up
You can now edit Python files on your guest (or host) and run them
inside the VM. You can start and stop VuFind inside the VM and
access it through a browser on your host.
We are all set to start exploring the data and to write some code.
Bootcamp outline
parts 0 to 6: intro, crossref, doaj, combination, licensing, export
each part is self contained, although we will reuse some
artifacts
Bootcamp outline
you can use scaffoldP_. . . , if you want to code along
the partP_. . . files contain the target code
code/part{0-6}_....py
code/scaffold{0-6}_....py
Coding: Part 0
Hello World from luigi
$ cd code
$ python part0_helloworld.py
Coding: Part 0 Recap
simple things should be simple
basic notion of a task
command line integration
Coding: Part 1
An input and a task
$ python part1_require.py
Coding: Part 1 Recap
it is easy to start with static data
business logic in python, can reuse any existing python library
Coding: Part 2
a first look at Crossref data
harvest via API, the files contain batch responses
custom format
Coding: Part 2
Three things to do:
find all relevant files (we will use just one for now)
extract the records from the batch
convert to an intermediate format
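The last two steps could look roughly like the sketch below. It assumes each batch is one JSON API response with the records under message.items and with DOI, title, container-title and ISSN keys; the record id scheme and the "doi" output key are placeholders chosen for this example, not the ones used in part2_crossref.py.

```python
import json


def extract_items(batch_line):
    """Yield the records inside one batch response (assumed layout)."""
    response = json.loads(batch_line)
    for item in response.get("message", {}).get("items", []):
        yield item


def first(values, default=""):
    """Return the first list element, or a default for empty lists."""
    return values[0] if values else default


def to_intermediate(item):
    """Map one harvested record to a few intermediate format fields."""
    doi = item.get("DOI", "")
    return {
        "finc.record_id": "ai-crossref-%s" % doi,  # hypothetical id scheme
        "finc.mega_collection": "Crossref",
        "rft.atitle": first(item.get("title", [])),
        "rft.jtitle": first(item.get("container-title", [])),
        "rft.issn": item.get("ISSN", []),
        "doi": doi,  # assumed field name
    }


# A made-up batch with a single record.
batch = json.dumps({"status": "ok", "message": {"items": [{
    "DOI": "10.1000/example",
    "title": ["An example"],
    "container-title": ["Journal X"],
    "ISSN": ["1234-5678"],
}]}})

records = [to_intermediate(item) for item in extract_items(batch)]
print(json.dumps(records[0], sort_keys=True))
```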
Coding: Part 2
Now on to the code
$ python part2_crossref.py
Coding: Part 2 Recap
used command line tools (fast, simple interface)
chained three tasks together
Excursion: Normalization
suggested and designed by system librarian
internal name: intermediate schema –
https://github.com/ubleipzig/intermediateschema
enough fields to accommodate various inputs
can be extended carefully, if necessary
tooling (licensing, export, quality checks) only for a single
format
Excursion: Normalization
{
"finc.format": "ElectronicArticle",
"finc.mega_collection": "DOAJ",
"finc.record_id": "ai-28-00001...",
"finc.source_id": "28",
"rft.atitle": "Importância da vitamina B12 na ...",
"rft.epage": "78",
"rft.issn": [
"1806-5562",
"1980-6108"
],
"rft.jtitle": "Scientia Medica",
...
}
Coding: Part 3
DOAJ index data
a complete elasticsearch dump
Coding: Part 3
This source is not batched and comes in a single file, so it is a bit
simpler:
locate file
convert to intermediate schema
$ python part3_require.py
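A rough sketch of the conversion step, assuming the dump holds one JSON document per line with the article metadata under _source.bibjson (title, journal.title, and an identifier list with type and id). That layout and the record id scheme are assumptions for illustration; check them against the actual file in ~/Bootcamp/input.

```python
import json


def doaj_to_intermediate(line):
    """Convert one dump line into a few intermediate schema fields."""
    doc = json.loads(line)
    bibjson = doc.get("_source", {}).get("bibjson", {})
    issns = [
        ident["id"]
        for ident in bibjson.get("identifier", [])
        if ident.get("type") in ("pissn", "eissn") and "id" in ident
    ]
    return {
        "finc.format": "ElectronicArticle",
        "finc.mega_collection": "DOAJ",
        "finc.source_id": "28",
        "finc.record_id": "ai-28-%s" % doc.get("_id", ""),  # hypothetical scheme
        "rft.atitle": bibjson.get("title", ""),
        "rft.jtitle": bibjson.get("journal", {}).get("title", ""),
        "rft.issn": issns,
    }


# A made-up dump line, shaped after the assumed layout.
sample = json.dumps({"_id": "abc", "_source": {"bibjson": {
    "title": "A title",
    "journal": {"title": "Scientia Medica"},
    "identifier": [
        {"type": "pissn", "id": "1806-5562"},
        {"type": "eissn", "id": "1980-6108"},
        {"type": "doi", "id": "10.1000/example"},
    ],
}}})

record = doaj_to_intermediate(sample)
print(json.dumps(record, sort_keys=True))
```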
Coding: Part 3 Recap
it is easy to start with static data
business logic in python, can reuse any existing python library
Coding: Part 4
after normalization, we can merge the two data sources
$ python part4_combine.py
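Once both sources are line-delimited JSON in the same schema, merging can be plain concatenation. The bootcamp code uses the shell for this; the sketch below is a pure Python stand-in and assumes every input file ends with a newline.

```python
import os
import shutil
import tempfile


def combine(paths, target):
    """Concatenate line-delimited files into one target file."""
    with open(target, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Small demonstration with two made-up inputs.
workdir = tempfile.mkdtemp()
a = os.path.join(workdir, "crossref.ldj")
b = os.path.join(workdir, "doaj.ldj")
with open(a, "w") as f:
    f.write('{"finc.mega_collection": "Crossref"}\n')
with open(b, "w") as f:
    f.write('{"finc.mega_collection": "DOAJ"}\n')

combined = os.path.join(workdir, "combined.ldj")
combine([a, b], combined)
with open(combined) as f:
    lines = f.read().splitlines()
print(len(lines))
```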
Coding: Part 4 Recap
a list of dependencies
python helps with modularization
using the shell for performance and to reuse existing tools
Coding: Part 5
licensing turned out to be an important issue
a complex topic
we need to look at every record, so it is performance critical
we use AMSL for ERM, and are on the way to a self-service
interface
AMSL has great APIs
we convert collection information to an expression-tree-ish
format – https://is.gd/Fxx0IU, https://is.gd/ZTqLqB
Coding: Part 5
$ python part5_licensing.py
Coding: Part 5
boolean expression trees allow us to specify complex licensing
rules
the result is a file, where each record is annotated with an ISIL
at Leipzig University Library we currently do this for about 20
ISILs
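A small sketch of how a boolean expression tree can be evaluated per record. The node types (and, or, not, issn, collection), the ISIL values DE-A and DE-B, and the x.labels output field are all invented for this example; the real rule format is described at the links on the earlier slide.

```python
def evaluate(node, record):
    """Recursively evaluate one rule tree against one record."""
    if "and" in node:
        return all(evaluate(child, record) for child in node["and"])
    if "or" in node:
        return any(evaluate(child, record) for child in node["or"])
    if "not" in node:
        return not evaluate(node["not"], record)
    if "issn" in node:
        return bool(set(node["issn"]) & set(record.get("rft.issn", [])))
    if "collection" in node:
        return record.get("finc.mega_collection") in node["collection"]
    raise ValueError("unknown node: %r" % (node,))


def annotate(record, rules):
    """Return a copy of the record labeled with every matching ISIL."""
    labeled = dict(record)
    labeled["x.labels"] = sorted(
        isil for isil, tree in rules.items() if evaluate(tree, record)
    )
    return labeled


# Hypothetical rules for two placeholder ISILs.
rules = {
    "DE-A": {"or": [{"collection": ["DOAJ"]}, {"issn": ["1234-5678"]}]},
    "DE-B": {"and": [{"collection": ["Crossref"]},
                     {"not": {"issn": ["1234-5678"]}}]},
}

doaj_record = {"finc.mega_collection": "DOAJ",
               "rft.issn": ["1806-5562", "1980-6108"]}
crossref_record = {"finc.mega_collection": "Crossref",
                   "rft.issn": ["9999-0000"]}

print(annotate(doaj_record, rules)["x.labels"])
print(annotate(crossref_record, rules)["x.labels"])
```

Since every record passes through evaluate once per ISIL, this is the hot loop of the pipeline, which is why the slides move it to a command line tool.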
Coding: Part 5 Recap
dependencies as dictionary
flexibility in modeling workflows
again: use command line tools for performance critical parts
Coding: Part 6
a final conversion to a SOLR-importable format
Coding: Part 6
$ python part6_export.py
Coding: Part 6
slightly different from SOLRMARC style processing
keep things (conversion, indexing) a bit separate
standalone tool: solrbulk
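Keeping conversion separate from indexing means the export step only has to write one JSON document per line. A sketch, assuming hypothetical target field names (id, title, container_title, issn, institution) and a hypothetical x.labels field holding the ISILs attached during licensing; the actual mapping lives in part6_export.py.

```python
import json


def to_solr(record):
    """Map an intermediate schema record to a flat index document."""
    return {
        "id": record["finc.record_id"],
        "title": record.get("rft.atitle", ""),
        "container_title": record.get("rft.jtitle", ""),
        "issn": record.get("rft.issn", []),
        "institution": record.get("x.labels", []),  # assumed field names
    }


def export_lines(records):
    """Yield one JSON document per line, ready for bulk indexing."""
    for record in records:
        yield json.dumps(to_solr(record)) + "\n"


sample = {
    "finc.record_id": "ai-28-example",
    "rft.atitle": "A title",
    "rft.jtitle": "Scientia Medica",
    "rft.issn": ["1806-5562", "1980-6108"],
    "x.labels": ["DE-A"],
}
exported = list(export_lines([sample]))
print(exported[0], end="")
```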
Coding: Part 6 Recap
flexibility in modeling workflows
again: use command line tools for performance critical parts
Indexing
finally, we can index the data into SOLR
make sure SOLR is running on your VM
Indexing
$ solrbulk -host localhost -port 8080 \
    -w 2 -z -verbose -commit 100000 \
    -collection biblio \
    output/6/Export/output.ldj.gz
might want to increase SOLR_HEAP (defaults to 512M)
Indexing
go to http://localhost:8085
index should be slowly growing
Code recap
$ python deps.py
 \_ Export()
    \_ ApplyLicensing()
       \_ CombinedIntermediateSchema()
          \_ DOAJIntermediateSchema()
             \_ DOAJInput()
          \_ CrossrefIntermediateSchema()
             \_ CrossrefItems()
                \_ CrossrefInput()
       \_ CreateConfiguration()
          \_ HoldingFile()
Code recap
Figure 3: Deps
Follow up with workflow changes
https://git.io/vOZFQ
Indexing
Production data points:
sustained indexing rates between 2000-4000 docs/s
a full reindex of about 100M docs currently takes about 10h
with SOLR
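The two data points can be checked against each other with a little arithmetic; the numbers below are the ones from the slide.

```python
docs = 100_000_000  # about 100M documents
hours = 10          # reported duration of a full reindex

# Average rate implied by the full reindex.
rate = docs / (hours * 3600)
print("average: %.0f docs/s" % rate)

# Reindex duration implied by the sustained rate range.
fastest = docs / 4000 / 3600
slowest = docs / 2000 / 3600
print("between %.1f and %.1f hours" % (fastest, slowest))
```

The implied average of roughly 2800 docs/s sits inside the reported 2000-4000 docs/s range, so the two figures are consistent.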
Discussion
what we left out:
more data sets
larger data sets
XML
errors
parameters
collaboration and deployment
Discussion
what are your experiences with batch systems?
how do you manage large heterogeneous data?
what could we add to the pipeline?
Q & A
Thanks for your attention.
For any questions, please get in touch during the conference or via
e-mail:
{czygan,aumueller,seige}@ub.uni-leipzig.de

 
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
 
On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024
 
10.pdfMature Call girls in Dubai +971563133746 Dubai Call girls
10.pdfMature Call girls in Dubai +971563133746 Dubai Call girls10.pdfMature Call girls in Dubai +971563133746 Dubai Call girls
10.pdfMature Call girls in Dubai +971563133746 Dubai Call girls
 
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
 

Build your own discovery index of scholarly e-resources

  • 1. BYOI: Build your own index Build your own discovery index of scholarly e-resources 40th European Library Automation Group (ELAG) Conference 2016, 2016-06-06, Copenhagen, Den Sorte Diamant, is.gd/nDh4TY Martin Czygan, David Aumüller, Leander Seige Leipzig University Library https://ub.uni-leipzig.de https://finc.info https://amsl.technology itprojekte@ub.uni-leipzig.de
  • 2. Welcome During the next few hours, we will create a small aggregated index from scratch. You can code along if you like. Code, data and slides are distributed in a VM (on a USB stick).
  • 3. Why At Leipzig University Library we built a version that serves as a successor to a commercial product. Index includes data from Crossref, DOAJ, JSTOR, Elsevier, Genios, Thieme, DeGruyter among others. About 55% of our holdings covered. Potentially growable in breadth and depth.
  • 4. Format We will use a combination of slides to motivate concepts and live coding and experimentation – We will not use a product, we will build it. Goals a running VuFind 3 with a small aggregated index learn about a batch processing framework
  • 5. First Steps Figure 1: First steps
  • 7. Import Appliance On the USB-Stick you can find an OVA file that you can import into Virtualbox (or try to download it from https://goo.gl/J7hcYC). This VM contains: a VuFind 3 installation – /usr/local/vufind raw metadata (around 3M records) – ~/Bootcamp/input scripts and stubs for processing – ~/Bootcamp/code these slides – ~/Bootcamp/slides.pdf
  • 8. Forwarded ports
      Guest (VM) >> Host
      80 >> 8085 (HTTP, VuFind)
      8080 >> 8086 (SOLR)
      8082 >> 8087 (luigi)
      22 >> 2200 (SSH)
      3306 >> 13306 (MySQL)
      SSH tip:
      $ curl -sL https://git.io/vrxoC > vm.sh
      $ chmod +x vm.sh
      $ ./vm.sh
  • 9. Outline Bootcamp play book: intro: problem setting (heterogeneous data, batch processing) VM setup - during intro Then we will write some code: a basic pipeline with the luigi Python library combine various sources into a common format apply licensing information index into solr
  • 11. Intro: Problem setting batch processing, not small data (but not too big, either) regular processing required varying requirements multiple small steps to apply on all data iterative development
  • 12. Intro: Rise of the DAG DAG = directed acyclic graph, partial ordering many things use DAGs, make, Excel, scheduling problems model tasks in a DAG, then run topological sort to determine order Examples: http://goo.gl/FCpxiK (history is a DAG) http://i.stack.imgur.com/iVNcu.png (airflow) https://git.io/vw9rW (luigi) https://goo.gl/vMEezR (Azkaban)
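The "model tasks in a DAG, then run topological sort" idea can be sketched with Python's standard library. This is a minimal illustration, not how luigi schedules work internally; the task names and edges below are hypothetical and chosen only to resemble a small pipeline.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task graph: each key depends on the tasks in its set.
deps = {
    "index": {"combine"},
    "combine": {"convert_a", "convert_b"},
    "convert_a": {"fetch_a"},
    "convert_b": {"fetch_b"},
}


def run_order(graph):
    """Return one valid execution order: every task after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


def has_cycle(graph):
    """A graph with a cycle is not a DAG and has no valid order."""
    try:
        run_order(graph)
    except CycleError:
        return True
    return False


order = run_order(deps)
print(order)
```

Several orders can be valid for the same graph (the two fetch tasks are independent), which is why the ordering is only partial.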
  • 13. Intro: Immutability immutability = data is not modified after it is created immutable data has some advantages, e.g. “human fault tolerance” and performance our use case: recompute everything from raw data tradeoff: more computation, but less to think about
  • 14. Intro: Frameworks many libraries and frameworks for batch processing and scheduling, e.g. Oozie, Azkaban, Airflow, luigi, . . . even more tools when working with streams: Kafka, various queues, . . . luigi is nice, because it has only a few prerequisites
  • 15. Intro: Luigi in one slide
      import luigi

      class MyTask(luigi.Task):
          param = luigi.Parameter(default='ABC')

          def requires(self):
              return SomeOtherTask()

          def run(self):
              with self.output().open('w') as output:
                  output.write('%s business' % self.param)

          def output(self):
              return luigi.LocalTarget(path='output.txt')

      if __name__ == '__main__':
          luigi.run()
  • 16. Intro: luigi many integrations, e.g. MySQL, Postgres, elasticsearch, . . . support for Hadoop and HDFS, S3, Redshift, . . . 200+ contributors, 350+ ML hackable, extendable – e.g. https://github.com/ubleipzig/gluish
  • 17. Intro: Decomposing our goal clean and rearrange input data files convert data into a common (intermediate) format apply licensing information (from kbart) index into solr
  • 18. Intro: Incremental Development when we work with unknown data sources, we have to gradually move forward
  • 19. Intro: Wrap up many approaches to data processing we will focus on one library only here concepts like immutability, recomputation and incremental development are more general – now back to the code
  • 20. Test VuFind installation We can SSH into the VM and start VuFind: $ ./vm.sh (vm) $ cd /usr/local/vufind (vm) $ ./solr.sh start Starting VuFind ... ... http://localhost:8085/vufind http://localhost:8085/vufind/Install/Home
  • 21. Hello World Test a Python script on guest. Go to the Bootcamp directory: $ cd $HOME/Bootcamp $ python hello.py ... Note: Files follow PEP 8, so indent with spaces (here: 4).
  • 22. Setup wrap-up You can now edit Python files on your guest (or host) and run them inside the VM. You can start and stop VuFind inside the VM and access it through a browser on your host. We are all set to start exploring the data and to write some code.
  • 23. Bootcamp outline parts 0 to 6: intro, crossref, doaj, combination, licensing, export each part is self contained, although we will reuse some artifacts
  • 24. Bootcamp outline you can use scaffoldP_. . . , if you want to code along the partP_. . . files contain the target code code/part{0-6}_....py code/scaffold{0-6}_....py
  • 25. Coding: Part 0 Hello World from luigi $ cd code $ python part0_helloworld.py
  • 26. Coding: Part 0 Recap simple things should be simple basic notion of a task command line integration
  • 27. Coding: Part 1 An input and a task $ python part1_require.py
  • 28. Coding: Part 1 Recap it is easy to start with static data business logic in python, can reuse any existing python library
  • 29. Coding: Part 2 a first look at Crossref data harvest via API, the files contain batch responses custom format
  • 30. Coding: Part 2 Three things to do: find all relevant files (we will use just one for now) extract the records from the batch convert to an intermediate format
  • 31. Coding: Part 2 Now on to the code $ python part2_crossref.py
  • 32. Coding: Part 2 Recap used command line tools (fast, simple interface) chained three tasks together
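The "extract the records from the batch" and "convert to an intermediate format" steps might look roughly like the sketch below. It assumes a batch is one Crossref REST API JSON response with the records under message.items; how the bootcamp files are actually laid out is not stated in the slides. The function names are hypothetical, the sample values are made up, and only a few intermediate schema fields are filled in.

```python
import json


def extract_items(batch_text):
    """Yield the records contained in one API batch response (assumed layout)."""
    response = json.loads(batch_text)
    for item in response.get("message", {}).get("items", []):
        yield item


def first(values):
    """Crossref returns titles as lists; take the first entry or an empty string."""
    return values[0] if values else ""


def to_intermediate(item):
    """Map a Crossref item to a small subset of intermediate schema fields."""
    return {
        "finc.format": "ElectronicArticle",
        "rft.atitle": first(item.get("title", [])),
        "rft.jtitle": first(item.get("container-title", [])),
        "rft.issn": item.get("ISSN", []),
    }


# Made-up sample batch, for illustration only.
sample_batch = json.dumps({
    "status": "ok",
    "message": {"items": [
        {"title": ["An example title"],
         "container-title": ["Example Journal"],
         "ISSN": ["0000-0000"]},
        {"title": []},
    ]},
})

records = [to_intermediate(item) for item in extract_items(sample_batch)]
print(json.dumps(records, indent=2))
```

Keeping extraction and conversion as separate functions mirrors the separate tasks in the pipeline: each step can be rerun or replaced on its own.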
  • 33. Excursion: Normalization suggested and designed by system librarian internal name: intermediate schema – https://github.com/ubleipzig/intermediateschema enough fields to accommodate various inputs can be extended carefully, if necessary tooling (licensing, export, quality checks) only for a single format
  • 34. Excursion: Normalization
      {
        "finc.format": "ElectronicArticle",
        "finc.mega_collection": "DOAJ",
        "finc.record_id": "ai-28-00001...",
        "finc.source_id": "28",
        "rft.atitle": "Importância da vitamina B12 na ...",
        "rft.epage": "78",
        "rft.issn": [
          "1806-5562",
          "1980-6108"
        ],
        "rft.jtitle": "Scientia Medica",
        ...
      }
  • 35. Coding: Part 3 DOAJ index data a complete elasticsearch dump
  • 36. Coding: Part 3 This source is not batched and comes in a single file, so it is a bit simpler: locate file convert to intermediate schema $ python part3_require.py
  • 37. Coding: Part 3 Recap it is easy to start with static data business logic in python, can reuse any existing python library
  • 38. Coding: Part 4 after normalization, we can merge the two data sources $ python part4_combine.py
  • 39. Coding: Part 4 Recap a list of dependencies python helps with modularization using the shell for performance and to reuse existing tools
  • 40. Coding: Part 5 licensing turned out to be an important issue a complex topic we need to look at every record, so it is performance critical we use AMSL for ERM, and are on the way to a self-service interface AMSL has great APIs we convert collection information to an expression-tree-ish format – https://is.gd/Fxx0IU, https://is.gd/ZTqLqB
  • 41. Coding: Part 5 $ python part5_licensing.py
  • 42. Coding: Part 5 boolean expression trees allow us to specify complex licensing rules the result is a file, where each record is annotated with an ISIL at Leipzig University Library we currently do this for about 20 ISILs
  • 43. Coding: Part 5 Recap dependencies as dictionary flexibility in modeling workflows again: use command line tools for performance critical parts
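To make the idea of boolean expression trees concrete, here is a minimal sketch of evaluating such a tree against a record and annotating the record with the matching ISILs. The tree layout (op, args, arg, values), the x.labels field name, the rules and the DE-X placeholder are all assumptions for illustration; they are not the format used in AMSL or in part5_licensing.py.

```python
def evaluate(node, record):
    """Recursively evaluate a boolean expression tree against one record."""
    op = node["op"]
    if op == "and":
        return all(evaluate(child, record) for child in node["args"])
    if op == "or":
        return any(evaluate(child, record) for child in node["args"])
    if op == "not":
        return not evaluate(node["arg"], record)
    if op == "issn":  # leaf: record shares at least one ISSN with the rule
        return bool(set(node["values"]) & set(record.get("rft.issn", [])))
    if op == "collection":  # leaf: record belongs to one of the collections
        return record.get("finc.mega_collection") in node["values"]
    raise ValueError("unknown operator: %s" % op)


def annotate(record, rules):
    """Return a new record carrying the ISILs whose rules match."""
    labels = sorted(isil for isil, tree in rules.items() if evaluate(tree, record))
    return {**record, "x.labels": labels}  # "x.labels" is a hypothetical field name


# Hypothetical rules: one expression tree per ISIL.
rules = {
    "DE-15": {"op": "or", "args": [
        {"op": "collection", "values": ["DOAJ"]},
        {"op": "issn", "values": ["1806-5562"]},
    ]},
    "DE-X": {"op": "and", "args": [
        {"op": "collection", "values": ["Crossref"]},
        {"op": "not", "arg": {"op": "issn", "values": ["1980-6108"]}},
    ]},
}

record = {"finc.mega_collection": "DOAJ", "rft.issn": ["1806-5562", "1980-6108"]}
annotated = annotate(record, rules)
print(annotated["x.labels"])
```

Because annotate returns a new dictionary instead of changing its input, the step stays in line with the immutability idea from the intro.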
  • 44. Coding: Part 6 a final conversion to a SOLR-importable format
  • 45. Coding: Part 6 $ python part6_export.py
  • 46. Coding: Part 6 slightly different from SOLRMARC style processing keep things (conversion, indexing) a bit separate standalone tool: solrbulk
  • 47. Coding: Part 6 Recap flexibility in modeling workflows again: use command line tools for performance critical parts
  • 48. Indexing finally, we can index the data into SOLR make sure SOLR is running on your VM
  • 49. Indexing
      $ solrbulk -host localhost -port 8080 -w 2 -z -verbose -commit 100000 -collection biblio output/6/Export/output.ldj.gz
      might want to increase SOLR_HEAP (defaults to 512M)
  • 50. Indexing go to http://localhost:8085 index should be slowly growing
  • 51. Code recap _ Export() _ ApplyLicensing() _ CombinedIntermediateSchema() _ DOAJIntermediateSchema() _ DOAJInput() _ CrossrefIntermediateSchema() _ CrossrefItems() _ CrossrefInput() _ CreateConfiguration() _ HoldingFile()
  • 52. Code recap $ python deps.py _ Export() _ ApplyLicensing() _ CombinedIntermediateSchema() _ DOAJIntermediateSchema() _ DOAJInput() _ CrossrefIntermediateSchema() _ CrossrefItems() _ CrossrefInput() _ CreateConfiguration() _ HoldingFile()
  • 54. Follow up with workflow changes https://git.io/vOZFQ
  • 55. Indexing Production data points: sustained indexing rates between 2000-4000 docs/s a full reindex of about 100M docs currently takes about 10h with SOLR
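The two production figures can be checked against each other with a little arithmetic. The sketch below uses only the numbers from the slide (100M docs, 10 hours, 2000 to 4000 docs/s); the helper name is made up.

```python
DOCS = 100_000_000  # size of a full reindex, from the slide
HOURS = 10          # reported duration, from the slide

# Average rate implied by the reported duration.
average_rate = DOCS / (HOURS * 3600)


def reindex_hours(docs, docs_per_second):
    """Hours needed to index `docs` documents at a constant rate."""
    return docs / docs_per_second / 3600


print("implied average: %.0f docs/s" % average_rate)
print("at 2000 docs/s: %.1f h" % reindex_hours(DOCS, 2000))
print("at 4000 docs/s: %.1f h" % reindex_hours(DOCS, 4000))
```

The implied average of roughly 2800 docs/s falls inside the stated 2000 to 4000 docs/s range, so the two data points are consistent.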
  • 56. Discussion what we left out: more data sets larger data sets XML errors parameters collaboration and deployment
  • 57. Discussion what are your experiences with batch systems? how do you manage large heterogeneous data? what could we add to the pipeline?
  • 58. Q & A Thanks for your attention. For any questions, please get in touch during the conference or via e-mail: {czygan,aumueller,seige}@ub.uni-leipzig.de