1) Dr. Schapranow presents a federated in-memory database computing platform called AnalyzeGenomes.com to enable real-time analysis of big medical data.
2) The platform aims to incorporate all available patient data, reference latest lab results and medical knowledge, and support interactive analysis to help clinicians make treatment decisions.
3) It uses a distributed in-memory database across nodes to combine and link heterogeneous medical data sources while addressing challenges of data privacy, locality, and volume.
Federated In-Memory Platform for Analyzing Big Medical Data
1. A Federated In-Memory Database Computing Platform Enabling Real-time Analysis of Big Medical Data
Dr.-Ing. Matthieu-P. Schapranow
Hasso Plattner Institute, Potsdam, Germany
May 17, 2017
2. ■ Can we enable clinicians to take their therapy decisions:
□ Incorporating all available patient specifics,
□ Referencing latest lab results and worldwide medical knowledge, and
□ In an interactive manner during their ward round?
Our Motivation
Turn Precision Medicine Into Clinical Routine
Analyze Genomes: A Federated In-Memory Database Computing Platform
Dr. Schapranow, Intel Tech Talk at SAPPHIRE, May 17, 2017
4. Our Vision
Medical Board Incorporating Latest Medical Knowledge
5. Project Time Line
[Timeline figure, 2009 to 2017. Milestones shown: IMDB Research, Enterprise Software, SAP HANA launched, Oncolyzer, Drug Response Analysis, Medical Knowledge Cockpit, Analyze Genomes Platform, SORMAS]
6. The Challenge
Distributed Heterogeneous Data Sources
■ Human genome/biological data
□ 600GB per full genome
□ 15PB+ in databases of leading institutes
■ Prescription data
□ 1.5B records from 10,000 doctors and 10M patients (100GB)
■ Clinical trials
□ Currently more than 30k recruiting on ClinicalTrials.gov
■ Human proteome
□ 160M data points (2.4GB) per sample
□ >3TB raw proteome data in ProteomicsDB
■ PubMed database
□ >23M articles
■ Hospital information systems
□ Often more than 50GB
■ Medical sensor data
□ Scan of a single organ in 1s creates 10GB of raw data
■ Cancer patient records
□ >160k records at NCT
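To put the volumes on this slide into proportion, the short calculation below derives a few ratios from the quoted figures. It is only a back-of-the-envelope sketch and assumes decimal units (1 GB = 10**9 bytes, 1 PB = 10**15 bytes), which the slide does not specify.

```python
# Assumption: decimal (SI) units; the slide does not say which units it uses.
GB = 10**9
PB = 10**15

# How many 600GB genomes correspond to 15PB of stored data?
genomes_in_15_pb = (15 * PB) // (600 * GB)

# Average size of one prescription record (100GB for 1.5B records).
bytes_per_prescription = (100 * GB) / 1.5e9

# Average size of one proteome data point (2.4GB for 160M points).
bytes_per_proteome_point = (2.4 * GB) / 160e6

print(genomes_in_15_pb)                  # 25000
print(round(bytes_per_prescription, 1))  # 66.7
print(round(bytes_per_proteome_point))   # 15
```

The sources therefore differ by orders of magnitude both in total size and in size per item, which is part of what makes them heterogeneous.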
7. ■ Requirements
□ Managed services
□ Reproducibility
□ Real-time data analysis
■ Restrictions
□ Data privacy
□ Data locality
□ Volume of big medical data
Software Requirements in Life Sciences
http://stevedempsen.blogspot.de/2013/08/agile-software-requirements-comic.html
8. Our Approach: AnalyzeGenomes.com
In-Memory Computing Platform for Big Medical Data
■ In-Memory Database
9. Our Approach: AnalyzeGenomes.com
In-Memory Computing Platform for Big Medical Data
■ In-Memory Database
■ Combined and Linked Data
□ Genome Data
□ Cellular Pathways
□ Genome Metadata
□ Research Publications
□ Pipeline and Analysis Models
□ Drugs and Interactions
■ Indexed Sources
10. Our Approach: AnalyzeGenomes.com
In-Memory Computing Platform for Big Medical Data
■ In-Memory Database
■ Extensions for Life Sciences
□ Data Exchange, App Store
□ Access Control, Data Protection
□ Fair Use
□ Statistical Tools
□ Real-time Analysis
□ App-spanning User Profiles
■ Combined and Linked Data
□ Genome Data
□ Cellular Pathways
□ Genome Metadata
□ Research Publications
□ Pipeline and Analysis Models
□ Drugs and Interactions
■ Indexed Sources
11. Our Approach: AnalyzeGenomes.com
In-Memory Computing Platform for Big Medical Data
■ In-Memory Database
■ Extensions for Life Sciences
□ Data Exchange, App Store
□ Access Control, Data Protection
□ Fair Use
□ Statistical Tools
□ Real-time Analysis
□ App-spanning User Profiles
■ Combined and Linked Data
□ Genome Data
□ Cellular Pathways
□ Genome Metadata
□ Research Publications
□ Pipeline and Analysis Models
□ Drugs and Interactions
■ Apps
□ Drug Response Analysis
□ Pathway Topology Analysis
□ Medical Knowledge Cockpit
□ Oncolyzer
□ Clinical Trial Recruitment
□ Cohort Analysis
□ ...
■ Indexed Sources
12. ■ Combined column and row store
■ Map/Reduce
■ Single and multi-tenancy
■ Lightweight compression
■ Insert only for time travel
■ Real-time replication
■ Working on integers
■ SQL interface on columns and rows
■ Active/passive data store
■ Minimal projections
■ Group key
■ Reduction of software layers
■ Dynamic multi-threading
■ Bulk load of data
■ Object-relational mapping
■ Text retrieval and extraction engine
■ No aggregate tables
■ Data partitioning
■ Any attribute as index
■ No disk
■ On-the-fly extensibility
■ Analytics on historical data
■ Multi-core/parallelization
Our Technology
In-Memory Database Technology
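Two of the building blocks listed on this slide, "Lightweight compression" and "Working on integers", are commonly realized through dictionary encoding of columns. The sketch below illustrates that general idea only; the class name, the method name, and the sample gene symbols are assumptions made for illustration and do not describe how the platform or SAP HANA implement it.

```python
from typing import Dict, List


class DictionaryColumn:
    """A column stored as a sorted dictionary plus integer value IDs."""

    def __init__(self, values: List[str]) -> None:
        # Sorted distinct values form the dictionary; list position = value ID.
        self.dictionary: List[str] = sorted(set(values))
        lookup: Dict[str, int] = {v: i for i, v in enumerate(self.dictionary)}
        # The attribute vector holds one small integer per row.
        self.value_ids: List[int] = [lookup[v] for v in values]

    def scan_equals(self, value: str) -> List[int]:
        """Return matching row positions, comparing integers only."""
        try:
            target = self.dictionary.index(value)
        except ValueError:
            return []  # value never occurs in this column
        return [row for row, vid in enumerate(self.value_ids) if vid == target]


# Hypothetical sample data.
genes = DictionaryColumn(["BRAF", "KRAS", "BRAF", "TP53", "KRAS", "BRAF"])
print(genes.dictionary)           # ['BRAF', 'KRAS', 'TP53']
print(genes.value_ids)            # [0, 1, 0, 2, 1, 0]
print(genes.scan_equals("BRAF"))  # [0, 2, 5]
```

Each distinct string is stored once, and the scan touches only the integer vector, which is why repetitive columns compress well and scan quickly.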
13. Scheduling and Execution of Genome Data Processing Pipelines
[Architecture figure; components shown: Webservice, Scheduler, Workers, In-Memory Database]
Tasks
ID  Pipeline  Params
12  BWA       xyz.fastq
13  Stanford  A_1.fastq
14  Bowtie    xyz.fastq
Subtasks
Task  ID  Job     Status  Params
12    97  Split   done    xyz.fastq
12    98  Import  todo    abc.vcf
12    98  Import  done    abc.vcf
. . .
1. Trigger task execution
2. Schedule subtasks
3. Execute subtasks
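The three steps on this slide (trigger a task, schedule its subtasks, execute them) can be sketched as below. This is a minimal in-process illustration, not the platform's scheduler: the class and method names, the job list for the "BWA" pipeline (including the "Align" job), and the synchronous worker callback are all assumptions. The starting subtask ID 97 merely mirrors the table on the slide.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Subtask:
    task_id: int
    subtask_id: int
    job: str
    params: str
    status: str = "todo"


@dataclass
class Scheduler:
    # Hypothetical mapping: pipeline name -> ordered job names.
    pipelines: Dict[str, List[str]]
    subtasks: List[Subtask] = field(default_factory=list)
    next_id: int = 97

    def trigger(self, task_id: int, pipeline: str, params: str) -> None:
        """Steps 1 and 2: accept a task and schedule one subtask per job."""
        for job in self.pipelines[pipeline]:
            self.subtasks.append(Subtask(task_id, self.next_id, job, params))
            self.next_id += 1

    def run(self, worker: Callable[[Subtask], None]) -> None:
        """Step 3: hand every open subtask to a worker and mark it done."""
        for sub in self.subtasks:
            if sub.status == "todo":
                worker(sub)
                sub.status = "done"


log: List[str] = []
scheduler = Scheduler(pipelines={"BWA": ["Split", "Align", "Import"]})
scheduler.trigger(task_id=12, pipeline="BWA", params="xyz.fastq")
scheduler.run(lambda sub: log.append(f"{sub.subtask_id}:{sub.job}"))
print(log)  # ['97:Split', '98:Align', '99:Import']
```

Keeping the status on every subtask row is what lets a scheduler resume after a failure: only rows still marked "todo" are picked up again.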
14. Managed Services provided by Federated In-Memory Database System (FIMDB)
[Architecture figure; recoverable labels:]
■ Cloud Service Provider (Shared Algorithms and Public Reference Data)
■ Hospital or Research Department (Sensitive/Patient Data)
■ Nodes i, j, k, m, n, ...: each with several Workers and an IMDB
■ Scheduler
■ Relay
■ Connections: VPN, UDP, TCP
■ Shared File System (Pool), shown twice
■ Shared File System (Global)
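The slide separates a cloud site holding shared algorithms and public reference data from a hospital or research site holding sensitive patient data. One way to respect that data locality is to run each job on a node at the site where its input already resides. The sketch below illustrates this idea only; the types, the `assign` function, and the mapping of nodes to sites are assumptions and are not taken from the platform.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    name: str
    site: str  # hypothetical site label, e.g. "cloud" or "hospital"


@dataclass(frozen=True)
class Job:
    name: str
    data_site: str  # site where the job's input data is stored


def assign(job: Job, nodes: List[Node]) -> Node:
    """Pick a node at the site that holds the job's input data.

    The data stays where it is; only the job description travels.
    """
    local = [n for n in nodes if n.site == job.data_site]
    if not local:
        raise LookupError(f"no node available at site {job.data_site!r}")
    return local[0]


# Assumed placement of nodes; the slide does not state it explicitly.
nodes = [Node("i", "cloud"), Node("j", "cloud"), Node("m", "hospital")]
print(assign(Job("align patient reads", "hospital"), nodes).name)  # m
print(assign(Job("index reference genome", "cloud"), nodes).name)  # i
```

Raising an error when no local node exists, rather than falling back to a remote one, keeps sensitive data from leaving its site by accident.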
15. ■ Not standardized
■ Not exchangeable
■ Concatenation of bash scripts reading from and writing to files
■ Requires IT expertise for
□ Setup
□ Error handling, and
□ Efficient processing and parallelization
■ Objective: Model, configure, and execute pipelines without involving IT experts
Genome Data Processing Pipelines
State of the Art
bwa aln ref.fa sample.fastq | bwa samse ref.fa - sample.fastq | samtools view -Su - | samtools sort …
16. ■ Graphical modeling notation
■ Compliant with BPMN 2.0 extended by
□ Modular structure
□ Degree of parallelization
□ Parameters and variables
■ Model descriptions (XPDL) are stored in IMDB
■ Model instances are transformed into a graph structure executed by our worker framework
Genome Data Processing Pipelines
Standardized Modeling
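The statement that model instances are transformed into a graph structure, together with the "degree of parallelization" extension, can be illustrated with a small dependency graph. The sketch below is an assumption-laden example: the job names, the dictionary representation, and the `expand` function are hypothetical, and the real platform stores XPDL model descriptions rather than Python dictionaries. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical pipeline model: each job maps to the jobs it depends on.
model: Dict[str, Set[str]] = {
    "split": set(),
    "align": {"split"},
    "sort": {"align"},
    "call_variants": {"sort"},
    "import_results": {"call_variants"},
}
# Hypothetical degree of parallelization per job (default 1).
parallelism: Dict[str, int] = {"align": 4}


def expand(model: Dict[str, Set[str]], parallelism: Dict[str, int]) -> List[str]:
    """Return subtasks in an order that respects all dependencies."""
    order = TopologicalSorter(model).static_order()
    subtasks: List[str] = []
    for job in order:
        degree = parallelism.get(job, 1)
        if degree == 1:
            subtasks.append(job)
        else:
            # One subtask per parallel slice of the job.
            subtasks.extend(f"{job}[{i}]" for i in range(degree))
    return subtasks


print(expand(model, parallelism))
# ['split', 'align[0]', 'align[1]', 'align[2]', 'align[3]', 'sort',
#  'call_variants', 'import_results']
```

Because the dependencies are explicit data rather than pipes in a shell script, the same model can be validated, stored, and executed with a different degree of parallelization without editing any code.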
17. Genome Data Processing Pipelines
XML Process Definition Language
19. ■ Results are imported into IMDB
■ Optimization reduced execution time by >50%
Genome Data Processing Pipelines
Traditional vs. Optimized Approach
20. Reproducibility
Modeling of Data Analysis Pipelines
1. Design time (researcher, process expert)
□ Definition of parameterized process model
□ Uses graphical editor and jobs from repository
2. Configuration time (researcher, lab assistant)
□ Select model and specify parameters, e.g. aln opts
□ Results in model instance stored in repository
3. Execution time (researcher)
□ Select model instance
□ Specify execution parameters, e.g. input files
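The separation into design time, configuration time, and execution time can be sketched as three small steps. This is an illustrative sketch under assumptions: the class names, the `configure` and `execute` functions, and the option string "example-options" are hypothetical, and a real run would invoke tools rather than return a record.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ProcessModel:
    """Design time: a named pipeline with declared parameters."""
    name: str
    parameters: Tuple[str, ...]


@dataclass(frozen=True)
class ModelInstance:
    """Configuration time: a model with all parameters fixed."""
    model: ProcessModel
    values: Tuple[Tuple[str, str], ...]


def configure(model: ProcessModel, values: Dict[str, str]) -> ModelInstance:
    """Reject incomplete or unexpected parameters before storing an instance."""
    missing = set(model.parameters) - set(values)
    unknown = set(values) - set(model.parameters)
    if missing or unknown:
        raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")
    return ModelInstance(model, tuple(sorted(values.items())))


def execute(instance: ModelInstance, input_file: str) -> Dict[str, object]:
    """Execution time: only the input varies; the record keeps everything."""
    return {
        "model": instance.model.name,
        "parameters": dict(instance.values),
        "input": input_file,
    }


model = ProcessModel("alignment", parameters=("aln_opts",))
instance = configure(model, {"aln_opts": "example-options"})
record = execute(instance, "xyz.fastq")
print(record)
# {'model': 'alignment', 'parameters': {'aln_opts': 'example-options'},
#  'input': 'xyz.fastq'}
```

Making the instance immutable is the point of the exercise: two runs of the same instance on the same input differ in nothing that the record does not show, which is what reproducibility requires.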
21. ■ Query-oriented search interface
■ Seamless integration of patient specifics, e.g. from EMR
■ Parallel search in international knowledge bases, e.g. for biomarkers, literature, cellular pathways, and clinical trials
App Example:
Medical Knowledge Cockpit for Patients and Clinicians
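The parallel search across several knowledge bases can be sketched with a thread pool that queries every source at once and collects the results per source. The sources below are hypothetical in-memory stand-ins with made-up entries; the real app queries external knowledge bases, and its interfaces are not described on the slide.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical in-memory stand-ins for remote knowledge bases.
SOURCES: Dict[str, List[str]] = {
    "literature": ["BRAF inhibitors in melanoma", "KRAS signalling review"],
    "pathways": ["MAPK signalling (contains BRAF, KRAS)"],
    "clinical_trials": ["Trial A: BRAF V600E positive melanoma"],
}


def make_search(entries: List[str]) -> Callable[[str], List[str]]:
    """Build a search function doing a case-insensitive substring match."""
    def search(term: str) -> List[str]:
        return [e for e in entries if term.lower() in e.lower()]
    return search


def search_all(term: str) -> Dict[str, List[str]]:
    """Query every source concurrently and collect results per source."""
    searchers = {name: make_search(entries) for name, entries in SOURCES.items()}
    with ThreadPoolExecutor(max_workers=len(searchers)) as pool:
        futures = {name: pool.submit(fn, term) for name, fn in searchers.items()}
        return {name: fut.result() for name, fut in futures.items()}


results = search_all("kras")
print(results)
# {'literature': ['KRAS signalling review'],
#  'pathways': ['MAPK signalling (contains BRAF, KRAS)'],
#  'clinical_trials': []}
```

With remote sources, the total waiting time of such a search is close to that of the slowest source instead of the sum over all sources.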
22. Medical Knowledge Cockpit for Patients and Clinicians
Pathway Topology Analysis
■ Search in pathways is limited to “is a certain element contained” today
■ Integrated >1.5k pathways from international sources, e.g. KEGG, HumanCyc, and WikiPathways, into HANA
■ Implemented graph-based topology exploration and ranking based on patient specifics
■ Enables interactive identification of possible dysfunctions affecting the course of a therapy before its start
Unified access to multiple formerly disjoint data sources
Pathway analysis of genetic variants with graph engine
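The difference between a containment search and a topology-aware search can be shown on toy graphs: instead of asking only whether a pathway contains a mutated element, the sketch below also counts the elements downstream of it and ranks pathways by that count. The pathway graphs, the scoring rule, and the function names are assumptions for illustration; they are not real pathway data and not the ranking used by the app.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy pathways as directed graphs: element -> downstream elements.
PATHWAYS: Dict[str, Dict[str, List[str]]] = {
    "pathway_A": {"EGFR": ["KRAS"], "KRAS": ["BRAF"], "BRAF": ["MEK"], "MEK": []},
    "pathway_B": {"TP53": ["CDKN1A"], "CDKN1A": []},
}


def downstream(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Collect all elements reachable from start by following edges."""
    seen: Set[str] = set()
    stack = list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def rank(variants: Set[str]) -> List[Tuple[str, int]]:
    """Rank pathways by the number of elements hit or affected downstream."""
    scores = []
    for name, graph in PATHWAYS.items():
        affected: Set[str] = set()
        for gene in variants & set(graph):
            affected |= {gene} | downstream(graph, gene)
        scores.append((name, len(affected)))
    # Highest score first; ties broken by pathway name.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


print(rank({"KRAS", "TP53"}))  # [('pathway_A', 3), ('pathway_B', 2)]
```

A pure containment search would report both pathways as equal hits here; following the edges is what distinguishes a variant near the top of a cascade from one at its end.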
23. ■ Interactively explore relevant publications, e.g. PDFs
■ Improved ease of exploration, e.g. by highlighted medical terms and relevant concepts
Medical Knowledge Cockpit for Patients and Clinicians
Publications
24. App Example:
Real-time Assessment of Clinical Trial Candidates
■ Supports trial design and recruitment process through statistical data analysis
■ Real-time matching and clustering of patients and clinical trial inclusion/exclusion criteria
■ Reassessment of already screened or participating citizens to reduce recruitment costs
■ Integrates smoothly with the
Real-time assessment of
clinical trial candidates
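Matching patients against inclusion and exclusion criteria can be sketched as evaluating a list of predicates per trial. The trial, the criteria, and the patient records below are made up for illustration, and the attribute names are assumptions; the app's actual data model and its clustering step are not described on the slide.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Patient = Dict[str, object]
Criterion = Callable[[Patient], bool]


@dataclass
class Trial:
    name: str
    inclusion: List[Criterion]
    exclusion: List[Criterion]

    def eligible(self, patient: Patient) -> bool:
        """All inclusion criteria must hold and no exclusion criterion may."""
        return (all(c(patient) for c in self.inclusion)
                and not any(c(patient) for c in self.exclusion))


# Hypothetical trial and patient records for illustration only.
trial = Trial(
    name="Trial A",
    inclusion=[lambda p: p["age"] >= 18, lambda p: p["diagnosis"] == "melanoma"],
    exclusion=[lambda p: p["prior_chemo"]],
)
patients: List[Patient] = [
    {"id": 1, "age": 54, "diagnosis": "melanoma", "prior_chemo": False},
    {"id": 2, "age": 61, "diagnosis": "melanoma", "prior_chemo": True},
    {"id": 3, "age": 16, "diagnosis": "melanoma", "prior_chemo": False},
]

candidates = [p["id"] for p in patients if trial.eligible(p)]
print(candidates)  # [1]
```

Because the criteria are plain data attached to the trial, already screened patients can be reassessed simply by running the same check again whenever a record or a criterion changes.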
25. ■ Online: Visit we.analyzegenomes.com for latest research results, slides, videos, tools, and publications
■ Offline: High-Performance In-Memory Genome Data Analysis: In-Memory Data Management Research, Springer, ISBN: 978-3-319-03034-0, 2014
■ In Person: Visit us at the HPI booth 200!
■ Join us for Intel Tech Talks at SAPPHIRE booth 669!
□ May 17, 1:00 pm: A Federated In-Memory Database Computing Platform Enabling Real-time Analysis of Big Medical Data
□ May 18, 3:00 pm: In-Memory Apps For Precision Medicine
Where to find additional information?
26. Keep in contact with us!
Dr. Matthieu-P. Schapranow
Program Manager E-Health & Life Sciences
Hasso Plattner Institute
August-Bebel-Str. 88
14482 Potsdam, Germany
schapranow@hpi.de
http://we.analyzegenomes.com/