© 2016 Mohammad Sadoghi (Purdue University)
ExpoDB: An Exploratory Data Science Platform
(A New Frontier: From Data Processing to Knowledge Exploration)
Mohammad Sadoghi
Assistant Professor
Department of Computer Science
Purdue University
IBM Cognitive Systems Institute Speaker Series
September 29, 2016
Insight is Lost in Islands of Data
http://www.cpsresearch.eu/clinical-trials/
http://news.mit.edu/2015/mnookin-vaccination-public-health-0227
http://www.healthcarepackaging.com/trends-and-issues/clinical-trials
http://stormercellularloo.gq/evolve-ii-clinical-trial.html
https://www.geneticliteracyproject.org
Data is spread across many islands of disconnected sources
(lacking a holistic view)
Sadly, adverse drug reactions (ADRs) are the 4th leading cause of
death in the United States, resulting in 100,000 deaths annually
Adverse drug reactions cost over $136 billion in the US annually
Real-time Fusion and Exploration of Data
Real-time Fusion and Exploration of Enriched Data
Real-time Fusion and Exploration of Enriched Data at Web Scale
Drug Safety: Challenges of Real-time Fusion & Exploration of Open Data
[Figure: knowledge graph connecting drugs and chemicals (Naproxen (Aleve), Methotrexate, Warfarin, Nicotine), genes and enzymes (PTGS2, TP53, DHFR, VKORC1, CYP2C9), and diseases (Rheumatoid Arthritis, Arthritis, Osteosarcoma (Bone Cancer), Embolism (Blood Clot)). Class labels: Disease, Immune System, Autoimmune, Joint Diseases, Sarcoma, Neoplasms, Chemical, Carboxylic Acids, Heterocyclic, Aminopterin, Phenylpropionates, Approved Drugs. Edge labels: inhibits, increased degradation, limits cell growth, tumor suppressor.]
Why capture the semantics/context?
Semantics is essential to connect the dots.
How to capture the context?
(1) Instance Layer: Capturing raw data instances, including both structured & semi-structured data
(2) Relation Layer: Capturing the interconnectedness of data instances across data sources
(3) Semantic Layer: Capturing conceptual relationships among data instances and their types
Enriched Data Model: Semantics is essential to connect the dots

[Figure: graph over PTGS2 (Gene), TP53 (Gene), DHFR (Gene), Acetaminophen (Tylenol), Ibuprofen (Advil), Methotrexate, Warfarin, Rheumatoid Arthritis, Arthritis, Osteosarcoma (Bone Cancer), Relief Fever, and Embolism (Blood Clot), with the class labels Immune System, Autoimmune, Joint Diseases, Sarcoma, and Neoplasms. Layer labels: Instance layer, Relation layer, Semantic layer. Axis labels: Data, Information, Knowledge.]

Sources: DrugBank (Bioinformatics & Cheminformatics Resource), CTD (Comparative Toxicogenomics Database), Uniprot (Universal Protein Resource)

Drug Name        Drug Targets (Genes)    Symptomatic Treatment
Ibuprofen        PTGS2                   Rheumatoid Arthritis
Acetaminophen    PTGS2                   Relief Fever
Methotrexate     DHFR                    Antineoplastic Anti-metabolite
Warfarin         TP53                    Embolism (Blood Clot)

Gene     Interaction
PTGS2    TP53 (Gene)

Gene     Function
TP53     Tumor Suppressor
DHFR     Limits Cell Growth

Gene     Disease
TP53     Osteosarcoma

Warfarin has a narrow therapeutic range (fatal outcomes)
Dosage for Asian population: 3.4 mg
Dosage for White population: 5.1 mg
Dosage for African-American population: 6.1 mg
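The three layers can be illustrated with a minimal Python sketch. It loads the slide's example tables as the instance layer, derives cross-source links as the relation layer, and annotates a discovered path with a gene function standing in for the semantic layer. The function names (`build_relations`, `indirect_diseases`), the edge labels, and the tuple-based encoding are assumptions made for illustration, not ExpoDB's actual API.

```python
from collections import defaultdict

# Instance layer: raw rows copied from the slide's example tables.
drug_targets = [            # (drug, target gene, symptomatic treatment)
    ("Ibuprofen", "PTGS2", "Rheumatoid Arthritis"),
    ("Acetaminophen", "PTGS2", "Relief Fever"),
    ("Methotrexate", "DHFR", "Antineoplastic Anti-metabolite"),
    ("Warfarin", "TP53", "Embolism (Blood Clot)"),
]
gene_interactions = [("PTGS2", "TP53")]       # (gene, interacting gene)
gene_functions = {"TP53": "Tumor Suppressor", "DHFR": "Limits Cell Growth"}
gene_diseases = [("TP53", "Osteosarcoma")]    # (gene, disease)


def build_relations():
    """Relation layer (hypothetical encoding): link instances across sources."""
    edges = defaultdict(set)
    for drug, gene, _ in drug_targets:
        edges[("drug", drug)].add(("targets", gene))
    for gene, other in gene_interactions:
        edges[("gene", gene)].add(("interacts_with", other))
    for gene, disease in gene_diseases:
        edges[("gene", gene)].add(("associated_with", disease))
    return edges


def indirect_diseases(drug, edges):
    """Follow drug -> target gene -> interacting gene -> disease."""
    found = set()
    for _, gene in edges[("drug", drug)]:
        for label, other in edges[("gene", gene)]:
            if label != "interacts_with":
                continue
            for label2, disease in edges[("gene", other)]:
                if label2 == "associated_with":
                    # Semantic layer stand-in: annotate the hop with the gene's function.
                    found.add((disease, other, gene_functions.get(other)))
    return found


edges = build_relations()
print(indirect_diseases("Ibuprofen", edges))
```

In this toy data, no single table relates Ibuprofen to Osteosarcoma; the path only appears once the drug-target, gene-interaction, and gene-disease tables are linked, which is the sense in which the slide says semantics connects the dots.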
Context-aware Query Model

“Is 5.0 mg an effective dosage of Warfarin for preventing blood clots?” Yes/No

Ranking is applied at every stage of answering the query:
• Query Representation Ranking
• Query Refinement Ranking
• Data Source Discovery Ranking
• Query Composition Ranking
• Query Answer Ranking
• Evidence Ranking
• Answer Representation Ranking
Related questions raised by the query:
• “Is Warfarin sensitive to ethnic background?”
• “Does Warfarin have a narrow therapeutic range?”
• “What are the disjoint classes of population with respect to Warfarin?”
• “What are the adverse reactions of Warfarin?”
• “What is an effective dosage of Warfarin for preventing blood clots?”
“Is 5.0 mg an effective dosage of Warfarin for preventing blood clots?”
• “What are the disjoint classes of population with respect to Warfarin?”
• “What is an effective dosage of Warfarin for preventing blood clots?”
• “Does Warfarin have a narrow therapeutic range?”

• Dosage for African-American population: 6.1 mg
• Dosage for White population: 5.1 mg
• Dosage for Asian population: 3.4 mg

Querying different sources returns 6.1 mg, 5.1 mg, & 3.4 mg, so is the data inconsistent?
(revisiting the consistent-answers formalism & possible-world semantics)
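One way to read this question is through consistent query answering, where an answer is certain if it holds in every repair (possible world) of the data and merely possible if it holds in at least one. The slide names the formalisms but does not say how ExpoDB applies them, so the sketch below is only an illustration: the key constraint, the record layout, and the helper names (`repairs`, `certain_and_possible`) are all assumptions.

```python
from itertools import product

# Dosage facts as returned by three hypothetical sources (values from the slide).
facts = [
    {"drug": "Warfarin", "population": "African-American", "dose_mg": 6.1},
    {"drug": "Warfarin", "population": "White", "dose_mg": 5.1},
    {"drug": "Warfarin", "population": "Asian", "dose_mg": 3.4},
]


def repairs(facts, key):
    """Enumerate possible worlds: keep exactly one fact per key value."""
    groups = {}
    for fact in facts:
        groups.setdefault(tuple(fact[k] for k in key), []).append(fact)
    return [list(world) for world in product(*groups.values())]


def answers(world):
    """Query: which dosages are recorded for Warfarin in this world?"""
    return {f["dose_mg"] for f in world if f["drug"] == "Warfarin"}


def certain_and_possible(facts, key):
    worlds = [answers(w) for w in repairs(facts, key)]
    certain = set.intersection(*worlds)   # holds in every possible world
    possible = set.union(*worlds)         # holds in at least one world
    return certain, possible


# Without context, "drug" alone is the key, so the three facts conflict.
print(certain_and_possible(facts, key=("drug",)))
# With the population class as context, the facts no longer conflict.
print(certain_and_possible(facts, key=("drug", "population")))
```

With the drug as the only key, no dosage is certain and all three are merely possible. Once the population class is part of the key, there is a single possible world and all three dosages are certain, which suggests the apparent inconsistency comes from missing context rather than from bad data.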
Given the known narrow therapeutic range, is 5.1 mg close enough to 5.0 mg?
(fuzzy-answers formalism in the presence of enriched data)
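A fuzzy answer replaces the Yes/No outcome with a degree of match. The sketch below uses a triangular membership function, which is a common textbook choice and not necessarily the formalism the talk has in mind. The `tolerance_mg` parameter and the function names are hypothetical, and the tolerance value is made up for illustration rather than taken from any clinical source.

```python
# Dosages per population class (values from the slide).
recommended_mg = {"African-American": 6.1, "White": 5.1, "Asian": 3.4}


def membership(query_mg, target_mg, tolerance_mg):
    """Triangular fuzzy membership: 1.0 at the target, 0.0 beyond the tolerance."""
    if tolerance_mg <= 0:
        raise ValueError("tolerance_mg must be positive")
    return max(0.0, 1.0 - abs(query_mg - target_mg) / tolerance_mg)


def fuzzy_answer(query_mg, tolerance_mg):
    """Degree to which query_mg matches the dosage of each population class."""
    return {
        population: round(membership(query_mg, dose, tolerance_mg), 2)
        for population, dose in recommended_mg.items()
    }


# tolerance_mg is an illustration value only, not a clinical threshold.
print(fuzzy_answer(5.0, tolerance_mg=0.5))
```

Under this assumed tolerance, 5.0 mg scores 0.8 for the White population and 0.0 for the other two classes. A narrower therapeutic range would correspond to a smaller tolerance, so the same 0.1 mg gap would score lower.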
Spark Architecture: Knowledge Oblivious
Applications: Personalized Medicine (Drug Discovery/Safety); Computational Finance; Compliance Informatics
APIs/Services (Access/Interfaces): Spark Streaming; SparkSQL; BlinkDB; GraphX; SparkR; MLlib
Processing Engine: Apache Spark (General Data Processing on Distributed Memory)
Data Model (Immutable Collection of Objects): Spark Data Model (Resilient Distributed Datasets, RDDs)
Storage: Distributed File Systems (e.g., HDFS, S3, Ceph); Distributed Memory (Tachyon); Compression (Succinct)
Resource Virtualization: Resource Abstractions (Apache Mesos); Resource Management (Hadoop YARN)
ExpoDB Architecture: From Data to Knowledge
Applications: Personalized Medicine (Drug Discovery/Safety); Computational Finance; Compliance Informatics
APIs/Services (Access/Interfaces): Spark Streaming; SparkSQL; BlinkDB; GraphX; SparkR; MLlib
Processing Engine: Reasoning; Refinement; Curation; Fusion; Discovery; Online Transactional Processing (OLTP) + Online Analytical Processing (OLAP)
Data Model (Enriching Raw Data Towards Knowledge):
  Semantic Layer: Ontology; Rules; Stochastic Models; Tensor Embedding
  Relation Layer: Intra- & Inter-domain Linkage (fine-grained & instance-level)
  Instance Layer: Relational; Graph/RDF; Dense/Sparse Matrices; JSON
  Spark Data Model (RDDs); Generic Data Model (Key-Value Store)
Storage: Distributed File Systems (e.g., HDFS, S3, Ceph); Distributed Memory (Tachyon); Compression (Succinct)
Resource Virtualization: Resource Abstractions (Apache Mesos); Resource Management (Hadoop YARN)
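The architecture places a generic key-value data model beneath the instance, relation, and semantic layers but does not detail the encoding. As a sketch of one possibility, the following stores records from all three layers in a single key-value map. The `layer/kind/id` key scheme, the JSON value encoding, and the `put` and `scan` helpers are assumptions for illustration and do not describe ExpoDB's implementation.

```python
import json

store = {}  # stand-in for a key-value store; a plain dict for illustration


def put(layer, kind, ident, value):
    """Write one record under a hypothetical 'layer/kind/id' key scheme."""
    store[f"{layer}/{kind}/{ident}"] = json.dumps(value, sort_keys=True)


def scan(prefix):
    """Return decoded records whose key starts with the given prefix."""
    return {k: json.loads(v) for k, v in sorted(store.items()) if k.startswith(prefix)}


# Instance layer: raw records, here taken from the slides' example tables.
put("instance", "drug", "Warfarin", {"symptomatic_treatment": "Embolism (Blood Clot)"})
put("instance", "gene", "TP53", {"function": "Tumor Suppressor"})
# Relation layer: an instance-level link between a drug and a gene.
put("relation", "targets", "Methotrexate->DHFR", {"drug": "Methotrexate", "gene": "DHFR"})
# Semantic layer: a conceptual fact about types (Osteosarcoma is a kind of Sarcoma).
put("semantic", "subclass_of", "Osteosarcoma", {"parent": "Sarcoma"})

print(scan("instance/"))
```

Prefixing keys by layer keeps each layer scannable on its own while letting heterogeneous formats (rows, edges, ontology facts) share one store; other key layouts would serve equally well.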
ExpoDB Architecture: Active Data Path
The ExpoDB stack, extended with Virtualized Hardware Acceleration (GPU & FPGA).
The First Step!
Systems and projects mapped onto the ExpoDB stack:
• L-Store (Real-time OLTP+OLAP)
• FQP (Flexible Query Processor)
• EmbedS (Ontology)
• Phenomenological Features (Deep-Learning-as-Oracle)
• PADRES (Event Processing)
• IBM DB2 BLU (Column Store)
• SPIDER (Declarative Data Cleansing)
• Vraph (Vectorized Graph Processing)
• Tiresias (Predicting Adverse Drug Reaction)
• fpga-ToPSS (Algorithmic Trading)
Thank You
Q&A
Exploratory Systems Lab (ExpoLab)
website: https://msadoghi.github.io/
References:
Data/Knowledge Exploration:
• Mohammad Sadoghi, Kavitha Srinivas, Oktie Hassanzadeh, Yuan-Chi Chang, Mustafa Canim, Achille Fokoue, Yishai A. Feldman: Self-Curating Databases. EDBT 2016
• Amit Chandel, Oktie Hassanzadeh, Nick Koudas, Mohammad Sadoghi, Divesh Srivastava: Benchmarking declarative approximate selection predicates. SIGMOD Conference 2007: 353-364
• Oktie Hassanzadeh, Mohammad Sadoghi, Renée J. Miller: Accuracy of Approximate String Joins Using Grams. QDB 2007
Drug Safety:
• Achille Fokoue, Mohammad Sadoghi, Oktie Hassanzadeh, Ping Zhang: Predicting Drug-Drug Interactions Through Large-Scale Similarity-Based Link Prediction. ESWC 2016
• Achille Fokoue, Oktie Hassanzadeh, Mohammad Sadoghi, Ping Zhang: Predicting Drug-Drug Interactions Through Similarity-Based Link Prediction Over Web Data. WWW 2016
OLTP & OLAP:
• Mohammad Sadoghi, Souvik Bhattacherjee, Bishwaranjan Bhattacharjee, Mustafa Canim: L-Store: A Real-time OLTP and OLAP System. CoRR abs/1601.04084 (2016)
• Kaiwen Zhang, Mohammad Sadoghi, Hans-Arno Jacobsen: DL-Store: A Distributed Hybrid OLTP and OLAP Data Processing Engine. ICDCS 2016
• Mohammad Sadoghi, Kenneth A. Ross, Mustafa Canim, Bishwaranjan Bhattacharjee: Exploiting SSDs in operational multiversion databases. VLDB J. 25(5): 651-672 (2016)
• Mohammad Sadoghi, Mustafa Canim, Bishwaranjan Bhattacharjee, Fabian Nagel, Kenneth A. Ross: Reducing Database Locking Contention Through Multi-version Concurrency. PVLDB 7(13): 1331-1342 (2014)
• Prashanth Menon, Tilmann Rabl, Mohammad Sadoghi, Hans-Arno Jacobsen: CaSSanDra: An SSD boosted key-value store. ICDE 2014: 1162-1167
• Prashanth Menon, Tilmann Rabl, Mohammad Sadoghi, Hans-Arno Jacobsen: Optimizing key-value stores for hybrid storage architectures. CASCON 2014: 355-358
• Mohammad Sadoghi, Kenneth A. Ross, Mustafa Canim, Bishwaranjan Bhattacharjee: Making Updates Disk-I/O Friendly Using SSDs. PVLDB 6(11): 997-1008 (2013)
Hardware Acceleration:
• Rajesh R. Bordawekar, Mohammad Sadoghi: Accelerating database workloads by software-hardware-system co-design. ICDE 2016
• Mohammadreza Najafi, Mohammad Sadoghi, Hans-Arno Jacobsen: SplitJoin: A Scalable, Low-latency Stream Join Architecture with Adjustable Ordering Precision. USENIX Annual Technical Conference 2016
• Mohammadreza Najafi, Mohammad Sadoghi, Hans-Arno Jacobsen: The FQP Vision: Flexible Query Processing on a Reconfigurable Computing Fabric. SIGMOD Record 44(2): 5-10 (2015)
• Mohammadreza Najafi, Mohammad Sadoghi, Hans-Arno Jacobsen: Configurable hardware-based streaming architecture using Online Programmable-Blocks. ICDE 2015
• Mohammadreza Najafi, Mohammad Sadoghi, Hans-Arno Jacobsen: Flexible Query Processor on FPGAs. PVLDB 6(12): 1310-1313 (2013)
• Mohammad Sadoghi, Rija Javed, Naif Tarafdar, Harsh Singh, Rohan Palaniappan, Hans-Arno Jacobsen: Multi-query Stream Processing on FPGAs. ICDE 2012: 1229-1232
• Mohammad Sadoghi, Harsh Singh, Hans-Arno Jacobsen: Towards highly parallel event processing through reconfigurable hardware. DaMoN 2011: 27-32
• Mohammad Sadoghi, Harsh Singh, Hans-Arno Jacobsen: fpga-ToPSS: line-speed event processing on fpgas. DEBS 2011: 373-374
• Mohammad Sadoghi, Hans-Arno Jacobsen, Martin Labrecque, Warren Shum, Harsh Singh: Efficient Event Processing through Reconfigurable Hardware for Algorithmic Trading. PVLDB 3(2): 1525-1528 (2010)
  • 1.
    © 2016 MohammadSadoghi (Purdue University) ExpoDB:An Exploratory Data Science Platform (A New Frontier: From Data Processing to Knowledge Exploration) Mohammad Sadoghi Assistant Professor Department of Computer Science Purdue University IBM Cognitive Systems Institute Speaker Series September 29, 2016
  • 2.
    © 2016 MohammadSadoghi (Purdue University) Insight is Lost in Islands of Data 2 http://www.cpsresearch.eu/clinical-trials/ http://news.mit.edu/2015/mnookin-vaccination-public-health-0227 http://www.healthcarepackaging.com/trends-and-issues/clinical-trials http://stormercellularloo.gq/evolve-ii-clinical-trial.html https://www.geneticliteracyproject.org Data is spread across many islands of disconnected sources (a lack of holistic view)
  • 3.
    © 2016 MohammadSadoghi (Purdue University) Insight is Lost in Islands of Data 3 http://www.cpsresearch.eu/clinical-trials/ http://news.mit.edu/2015/mnookin-vaccination-public-health-0227 http://www.healthcarepackaging.com/trends-and-issues/clinical-trials http://stormercellularloo.gq/evolve-ii-clinical-trial.html https://www.geneticliteracyproject.org Sadly, adverse drug reactions (ADRs) is the 4th leading cause of deaths in United States, resulting in100,000 loss of life annually
  • 4.
    © 2016 MohammadSadoghi (Purdue University) Insight is Lost in Islands of Data 4 http://www.cpsresearch.eu/clinical-trials/ http://news.mit.edu/2015/mnookin-vaccination-public-health-0227 http://www.healthcarepackaging.com/trends-and-issues/clinical-trials http://stormercellularloo.gq/evolve-ii-clinical-trial.html https://www.geneticliteracyproject.org Adverse drug reaction costs over $136 billion dollars in US annually
  • 5.
    © 2016 MohammadSadoghi (Purdue University) Real-time Fusion and Exploration of Data
  • 6.
    © 2016 MohammadSadoghi (Purdue University) Real-time Fusion and Exploration of Enriched Data
  • 7.
    © 2016 MohammadSadoghi (Purdue University) Real-time Fusion and Exploration of Enriched Data at Web Scale
  • 8.
    © 2016 MohammadSadoghi (Purdue University) Drug Safety: Challenges of Real-time Fusion & Exploration of Open Data 8 PTGS2 (Gene) inhibits TP53 (Gene) Rheumatoid Arthritis Osteosarcoma (Bone Cancer) Naproxen (Aleve) Disease Immune System Autoimmune Joint Diseases Sarcoma Neoplasms Methotrexate DHFR (Gene) inhibits Arthritis Warfarin Embolism (Blood Clot) Nicotine VKORC1 (Gene)CYP2C9 (Enzyme) Chemical Carboxylic Acids Heterocyclic Aminopterin Phenylpro- pionates Approved Drugs increased degradation inhibits Inhibits Inhibits Inhibits limit cells growth tum or suppressor Why capture the semantic/context? Semantic is essential to connect the dots.
  • 9.
    © 2016 MohammadSadoghi (Purdue University) Drug Safety: Challenges of Real-time Fusion & Exploration of Open Data 9 PTGS2 (Gene) inhibits TP53 (Gene) Rheumatoid Arthritis Osteosarcoma (Bone Cancer) Naproxen (Aleve) Disease Immune System Autoimmune Joint Diseases Sarcoma Neoplasms Methotrexate DHFR (Gene) inhibits Arthritis Warfarin Embolism (Blood Clot) Nicotine VKORC1 (Gene)CYP2C9 (Enzyme) Chemical Carboxylic Acids Heterocyclic Aminopterin Phenylpro- pionates Approved Drugs increased degradation inhibits Inhibits Inhibits Inhibits limit cells growth tum or suppressor Why capture the semantic/context? Semantic is essential to connect the dots.
  • 10.
    © 2016 MohammadSadoghi (Purdue University) Drug Safety: Challenges of Real-time Fusion & Exploration of Open Data 10 PTGS2 (Gene) inhibits TP53 (Gene) Rheumatoid Arthritis Osteosarcoma (Bone Cancer) Naproxen (Aleve) Disease Immune System Autoimmune Joint Diseases Sarcoma Neoplasms Methotrexate DHFR (Gene) inhibits limit cells growth Arthritis Warfarin Embolism (Blood Clot) Nicotine VKORC1 (Gene)CYP2C9 (Enzyme) Chemical Carboxylic Acids Heterocyclic Aminopterin Phenylpro- pionates Approved Drugs increased degradation inhibits Inhibits Inhibits Inhibits tum or suppressor Why capture the semantic/context? Semantic is essential to connect the dots.
  • 11.
    © 2016 MohammadSadoghi (Purdue University) Drug Safety: Challenges of Real-time Fusion & Exploration of Open Data 11 PTGS2 (Gene) inhibits TP53 (Gene) Rheumatoid Arthritis Osteosarcoma (Bone Cancer) Naproxen (Aleve) Disease Immune System Autoimmune Joint Diseases Sarcoma Neoplasms Methotrexate DHFR (Gene) inhibits limit cells growth Arthritis Warfarin Embolism (Blood Clot) Nicotine VKORC1 (Gene)CYP2C9 (Enzyme) Chemical Carboxylic Acids Heterocyclic Aminopterin Phenylpro- pionates Approved Drugs increased degradation inhibits Inhibits Inhibits Inhibits tum or suppressor ? Why capture the semantic/context? Semantic is essential to connect the dots.
  • 12.
    © 2016 MohammadSadoghi (Purdue University) Drug Safety: Challenges of Real-time Fusion & Exploration of Open Data 12 PTGS2 (Gene) inhibits TP53 (Gene) Rheumatoid Arthritis Osteosarcoma (Bone Cancer) tum or suppressor Naproxen (Aleve) Disease Immune System Autoimmune Joint Diseases Sarcoma Neoplasms Methotrexate DHFR (Gene) inhibits Arthritis Warfarin Embolism (Blood Clot) Nicotine VKORC1 (Gene)CYP2C9 (Enzyme) Chemical Carboxylic Acids Heterocyclic Aminopterin Phenylpro- pionates Approved Drugs increased degradation inhibits Inhibits Inhibits Inhibits limit cells growth ? ? ? Why capture the semantic/context? Semantic is essential to connect the dots.
  • 13.
    © 2016 MohammadSadoghi (Purdue University) Drug Safety: Challenges of Real-time Fusion & Exploration of Open Data 13 PTGS2 (Gene) inhibits TP53 (Gene) Rheumatoid Arthritis Osteosarcoma (Bone Cancer) Naproxen (Aleve) Disease Immune System Autoimmune Joint Diseases Sarcoma Neoplasms Methotrexate DHFR (Gene) inhibits Arthritis Warfarin Embolism (Blood Clot) Nicotine VKORC1 (Gene)CYP2C9 (Enzyme) Chemical Carboxylic Acids Heterocyclic Aminopterin Phenylpro- pionates Approved Drugs increased degradation inhibits Inhibits Inhibits Inhibits (1) Instance Layer: Capturing raw data instances including both structured & semi-structured data How to capture the context? limit cells growth tum or suppressor
  • 14.
    © 2016 MohammadSadoghi (Purdue University) Drug Safety: Challenges of Real-time Fusion & Exploration of Open Data 14 PTGS2 (Gene) inhibits TP53 (Gene) Rheumatoid Arthritis Osteosarcoma (Bone Cancer) Naproxen (Aleve) Disease Immune System Autoimmune Joint Diseases Sarcoma Neoplasms Methotrexate DHFR (Gene) inhibits Arthritis Warfarin Embolism (Blood Clot) Nicotine VKORC1 (Gene)CYP2C9 (Enzyme) Chemical Carboxylic Acids Heterocyclic Aminopterin Phenylpro- pionates Approved Drugs increased degradation inhibits Inhibits Inhibits Inhibits How to capture the context? limit cells growth tum or suppressor (2) Relation Layer: Capturing the interconnectedness of data instances across data sources
  • 15.
    Drug Safety: Challenges of Real-time Fusion & Exploration of Open Data
    How to capture the context?
    (3) Semantic Layer: Capturing conceptual relationships among data instances and their types
    [Figure: the drug/gene/disease graph from slide 12]
  • 16.
    Enriched Data Model: Semantics are essential to connect the dots
    Layers: Semantic layer, Relation layer, Instance layer
    Side labels: Data, Information, Knowledge
    Graph nodes: PTGS2 (Gene), TP53 (Gene), DHFR (Gene), Acetaminophen (Tylenol), Ibuprofen (Advil), Methotrexate, Warfarin, Rheumatoid Arthritis, Arthritis, Osteosarcoma (Bone Cancer), Relief Fever, Embolism (Blood Clot)
    Taxonomy nodes: Immune System, Autoimmune, Joint Diseases, Sarcoma, Neoplasms
    Sources named on the slide: DrugBank (Bioinformatics & Cheminformatics Resource), CTD (Comparative Toxicogenomics Database), UniProt (Universal Protein Resource)
    Table:
      Drug Name | Drug Targets (Genes) | Symptomatic Treatment
      Ibuprofen | PTGS2 | Rheumatoid Arthritis
      Acetaminophen | PTGS2 | Relief Fever
      Methotrexate | DHFR | Antineoplastic Anti-metabolite
      Warfarin | TP53 | Embolism (Blood Clot)
    Table:
      Gene | Interaction
      PTGS2 | TP53 (Gene)
    Table:
      Gene | Function
      TP53 | Tumor Suppressor
      DHFR | Limits Cell Growth
    Table:
      Gene | Disease
      TP53 | Osteosarcoma
    Warfarin has a narrow therapeutic range (fatal outcomes):
      Dosage for Asian population: 3.4 mg
      Dosage for White population: 5.1 mg
      Dosage for African-American population: 6.1 mg
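The three layers of the enriched data model can be illustrated with a small Python sketch. This is only a toy reading of the slide, not ExpoDB code: the rows are copied from the slide's tables, while the edge labels (`targets`, `interacts_with`, `associated_with`), the `is_a` hierarchy, and the helper names are assumptions introduced for illustration.

```python
from collections import deque

# Instance layer: raw rows, copied from the tables on the slide.
drug_targets = {
    "Ibuprofen": "PTGS2",
    "Acetaminophen": "PTGS2",
    "Methotrexate": "DHFR",
    "Warfarin": "TP53",
}
gene_interactions = [("PTGS2", "TP53")]
gene_diseases = {"TP53": "Osteosarcoma"}

# Semantic layer: a tiny concept hierarchy (assumed arrangement of the
# taxonomy nodes Sarcoma and Neoplasms shown on the slide).
is_a = {"Osteosarcoma": "Sarcoma", "Sarcoma": "Neoplasms"}


def build_edges():
    """Relation layer: link instances from different tables into one graph."""
    edges = {}

    def add(src, label, dst):
        edges.setdefault(src, []).append((label, dst))

    for drug, gene in drug_targets.items():
        add(drug, "targets", gene)            # hypothetical edge label
    for left, right in gene_interactions:
        add(left, "interacts_with", right)    # hypothetical edge label
    for gene, disease in gene_diseases.items():
        add(gene, "associated_with", disease)  # hypothetical edge label
    return edges


def find_path(edges, start, goal):
    """Breadth-first search: 'connect the dots' between two instances."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for _label, nxt in edges.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def ancestors(concept):
    """Walk the semantic layer upward from a concept."""
    chain = []
    while concept in is_a:
        concept = is_a[concept]
        chain.append(concept)
    return chain


edges = build_edges()
path = find_path(edges, "Ibuprofen", "Osteosarcoma")
```

No single table relates Ibuprofen to Osteosarcoma; the path only appears once the relation layer joins the three tables, and the semantic layer then lets the result be generalized (Osteosarcoma is a Sarcoma, which is a Neoplasm).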
  • 17.
    Context-aware Query Model
    Ranking at each stage: Query Representation, Query Refinement, Data Source Discovery, Query Composition, Query Answers, Answer Evidence, Answer Representation
    “Is 5.0 mg an effective dosage of Warfarin for preventing blood clot?” (Yes/No)
  • 18.
    Context-aware Query Model
    Ranking stages as on slide 17.
    “Is 5.0 mg an effective dosage of Warfarin for preventing blood clot?” (Yes/No)
    “Is Warfarin sensitive to ethnic background?”
  • 19.
    Context-aware Query Model
    Ranking stages as on slide 17.
    “Is 5.0 mg an effective dosage of Warfarin for preventing blood clot?” (Yes/No)
    “Is Warfarin sensitive to ethnic background?”
    “Does Warfarin have a narrow therapeutic range?”
  • 20.
    Context-aware Query Model
    Ranking stages as on slide 17.
    “Is 5.0 mg an effective dosage of Warfarin for preventing blood clot?” (Yes/No)
    “Is Warfarin sensitive to ethnic background?”
    “Does Warfarin have a narrow therapeutic range?”
    “What are the disjoint classes of population with respect to Warfarin?”
  • 21.
    Context-aware Query Model
    Ranking stages as on slide 17.
    “Is 5.0 mg an effective dosage of Warfarin for preventing blood clot?” (Yes/No)
    “Is Warfarin sensitive to ethnic background?”
    “Does Warfarin have a narrow therapeutic range?”
    “What are the disjoint classes of population with respect to Warfarin?”
    “What are the adverse reactions of Warfarin?”
  • 22.
    Context-aware Query Model
    Ranking stages as on slide 17.
    “Is 5.0 mg an effective dosage of Warfarin for preventing blood clot?” (Yes/No)
    “Is Warfarin sensitive to ethnic background?”
    “Does Warfarin have a narrow therapeutic range?”
    “What are the disjoint classes of population with respect to Warfarin?”
    “What are the adverse reactions of Warfarin?”
    “What is an effective dosage of Warfarin for preventing blood clot?”
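The slide attaches a ranking step to every stage of query processing. As a rough sketch of what one such step could look like, the Python below ranks candidate data sources for the Warfarin dosage question. Everything here is an assumption made for illustration: the keyword sets attached to each source, the overlap score, and the `rank` helper are hypothetical and are not taken from the talk.

```python
def rank(candidates, score, k=2):
    """Return the k highest-scoring candidates, best first."""
    return sorted(candidates, key=score, reverse=True)[:k]


# Hypothetical keyword descriptions of the sources named earlier in the deck.
source_keywords = {
    "DrugBank": {"drug", "dosage", "target", "warfarin"},
    "CTD": {"gene", "disease", "interaction"},
    "UniProt": {"gene", "function", "protein"},
}

# Terms extracted (by hand) from the question about Warfarin dosage.
query_terms = {"warfarin", "dosage", "blood", "clot"}


def overlap(source_name):
    """Score a source by how many query terms its description covers."""
    return len(source_keywords[source_name] & query_terms)


top_sources = rank(source_keywords, overlap, k=2)
```

The same `rank` helper could in principle be reused at the other stages (refinements, compositions, answers, evidence) with a different scoring function per stage.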
  • 23.
    Context-aware Query Model
    “Is 5.0 mg an effective dosage of Warfarin for preventing blood clot?”
    “What are the disjoint classes of population with respect to Warfarin?”
    “What is an effective dosage of Warfarin for preventing blood clot?”
    “Does Warfarin have a narrow therapeutic range?”
  • 24.
    Context-aware Query Model
    “Is 5.0 mg an effective dosage of Warfarin for preventing blood clot?”
    “What are the disjoint classes of population with respect to Warfarin?”
    “What is an effective dosage of Warfarin for preventing blood clot?”
    “Does Warfarin have a narrow therapeutic range?”
    Dosage for African-American population: 6.1 mg
    Dosage for White population: 5.1 mg
    Dosage for Asian population: 3.4 mg
  • 25.
    Context-aware Query Model
    “Is 5.0 mg an effective dosage of Warfarin for preventing blood clot?”
    “What are the disjoint classes of population with respect to Warfarin?”
    “What is an effective dosage of Warfarin for preventing blood clot?”
    “Does Warfarin have a narrow therapeutic range?”
    Dosage for African-American population: 6.1 mg
    Dosage for White population: 5.1 mg
    Dosage for Asian population: 3.4 mg
    Querying different sources returns 6.1 mg, 5.1 mg, & 3.4 mg, so is the data inconsistent? (revisiting the consistent answers formalism & possible world semantics)
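One way to read the inconsistency question is through repairs and certain answers. The sketch below is a simplified illustration under stated assumptions, not the formalism used in ExpoDB: a "repair" keeps exactly one row per key, and a consistent answer is one returned in every repair. The function names and the choice of keys are hypothetical; the three dosage rows come from the slide.

```python
from itertools import product

# (drug, population, dosage in mg), as listed on the slide.
facts = [
    ("Warfarin", "Asian", 3.4),
    ("Warfarin", "White", 5.1),
    ("Warfarin", "African-American", 6.1),
]


def repairs(rows, key):
    """Enumerate repairs: every way of keeping exactly one row per key value."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    return [list(choice) for choice in product(*groups.values())]


def consistent_answers(rows, key, query):
    """Answers that the query returns in every repair (certain answers)."""
    answer_sets = [set(query(repair)) for repair in repairs(rows, key)]
    return set.intersection(*answer_sets)


def dosages(rows):
    """Query: which (drug, dosage) pairs are recorded?"""
    return [(drug, dose) for drug, _population, dose in rows]


# Context-free view: the drug alone is the key, so the three rows conflict
# and no dosage survives in every repair.
without_context = consistent_answers(facts, lambda r: r[0], dosages)

# Context-aware view: (drug, population) is the key, so nothing conflicts
# and all three dosages are consistent answers.
with_context = consistent_answers(facts, lambda r: (r[0], r[1]), dosages)
```

Under the context-free key the data looks inconsistent and yields no certain dosage; once population is treated as part of the key, the same rows are consistent.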
  • 26.
    Context-aware Query Model
    “Is 5.0 mg an effective dosage of Warfarin for preventing blood clot?”
    “What are the disjoint classes of population with respect to Warfarin?”
    “What is an effective dosage of Warfarin for preventing blood clot?”
    “Does Warfarin have a narrow therapeutic range?”
    Dosage for African-American population: 6.1 mg
    Dosage for White population: 5.1 mg
    Dosage for Asian population: 3.4 mg
    Querying different sources returns 6.1 mg, 5.1 mg, & 3.4 mg, so is the data inconsistent? (revisiting the consistent answers formalism & possible world semantics)
    Given the known narrow therapeutic range, is 5.1 mg close enough to 5.0 mg? (fuzzy answers formalism in the presence of enriched data)
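The "close enough" question can be sketched as a graded (fuzzy) answer instead of a strict Yes/No. The Python below is a minimal illustration, assuming a triangular membership function; the `width_mg` parameter is an arbitrary illustrative value, not a clinical threshold, and the function names are hypothetical. Only the three dosage figures come from the slide.

```python
# Dosage per population in mg, as listed on the slide.
dosage_mg = {"Asian": 3.4, "White": 5.1, "African-American": 6.1}


def membership(dose_mg, population, width_mg=0.5):
    """Degree in [0, 1] to which dose_mg matches the population's dosage.

    Triangular membership: 1.0 at the listed dosage, falling linearly to 0.0
    at a distance of width_mg. width_mg is an assumed illustrative value.
    """
    distance = abs(dose_mg - dosage_mg[population])
    return max(0.0, 1.0 - distance / width_mg)


def fuzzy_answer(dose_mg, width_mg=0.5):
    """Answer 'is dose_mg effective?' per population, with a degree for each."""
    return {pop: membership(dose_mg, pop, width_mg) for pop in dosage_mg}


answer = fuzzy_answer(5.0)
```

With these assumptions, 5.0 mg matches the White population's dosage to a high degree and the other two populations not at all, so the answer depends on both the population context and how narrow the tolerated range is taken to be.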
  • 27.
    Spark Architecture: Knowledge Oblivious
    Applications: Personalized Medicine (Drug Discovery/Safety), Computational Finance, Compliance Informatics
    APIs/Services (Access/Interfaces): Spark Streaming, SparkSQL, BlinkDB, GraphX, SparkR, MLlib
    Processing Engine: Apache Spark (General Data Processing on Distributed Memory)
    Data Model (Immutable Collection of Objects): Spark Data Model (Resilient Distributed Datasets, RDDs)
    Storage: Distributed File Systems (e.g., HDFS, S3, Ceph), Distributed Memory (Tachyon), Compression (Succinct)
    Resource Virtualization: Resource Abstractions (Apache Mesos), Resource Management (Hadoop YARN)
  • 28.
    Spark Architecture: Knowledge Oblivious
    Same layered stack as slide 27: Applications, APIs/Services (Access/Interfaces), Processing Engine, Data Model (Immutable Collection of Objects), Storage, Resource Virtualization.
  • 29.
    ExpoDB Architecture: From Data to Knowledge
    Applications: Personalized Medicine (Drug Discovery/Safety), Computational Finance, Compliance Informatics
    APIs/Services (Access/Interfaces): Spark Streaming, SparkSQL, BlinkDB, GraphX, SparkR, MLlib
    Processing Engine: Apache Spark (General Data Processing on Distributed Memory)
    Data Model (Enriching Raw Data Towards Knowledge):
      Instance Layer: Relational, Graph/RDF, Dense/Sparse Matrices, JSON
    Storage: Distributed File Systems (e.g., HDFS, S3, Ceph), Distributed Memory (Tachyon), Compression (Succinct)
    Resource Virtualization: Resource Abstractions (Apache Mesos), Resource Management (Hadoop YARN)
  • 30.
    ExpoDB Architecture: From Data to Knowledge
    Applications, APIs/Services, Processing Engine (Apache Spark), Storage, and Resource Virtualization as on slide 29.
    Data Model (Enriching Raw Data Towards Knowledge):
      Relation Layer: Intra- & Inter-domain Linkage (fine-grained & instance-level)
      Instance Layer: Relational, Graph/RDF, Dense/Sparse Matrices, JSON
  • 31.
    ExpoDB Architecture: From Data to Knowledge
    Applications, APIs/Services, Processing Engine (Apache Spark), Storage, and Resource Virtualization as on slide 29.
    Data Model (Enriching Raw Data Towards Knowledge):
      Semantic Layer: Ontology, Rules, Stochastic Models, Tensor Embedding
      Relation Layer: Intra- & Inter-domain Linkage (fine-grained & instance-level)
      Instance Layer: Relational, Graph/RDF, Dense/Sparse Matrices, JSON
  • 32.
    ExpoDB Architecture: From Data to Knowledge
    Applications, APIs/Services, Processing Engine (Apache Spark), Storage, and Resource Virtualization as on slide 29.
    Data Model (Enriching Raw Data Towards Knowledge):
      Semantic Layer: Ontology, Rules, Stochastic Models, Tensor Embedding
      Relation Layer: Intra- & Inter-domain Linkage (fine-grained & instance-level)
      Instance Layer: Relational, Graph/RDF, Dense/Sparse Matrices, JSON
      Also shown in the data model: Spark Data Model (RDDs), Generic Data Model (Key-Value Store)
  • 33.
    ExpoDB Architecture: From Data to Knowledge
    Applications, APIs/Services, Storage, and Resource Virtualization as on slide 29.
    Processing Engine: Online Transactional Processing (OLTP) + Online Analytical Processing (OLAP)
    Operations: Reasoning, Refinement, Curation, Fusion, Discovery
    Data Model (Enriching Raw Data Towards Knowledge):
      Semantic Layer: Ontology, Rules, Stochastic Models, Tensor Embedding
      Relation Layer: Intra- & Inter-domain Linkage (fine-grained & instance-level)
      Instance Layer: Relational, Graph/RDF, Dense/Sparse Matrices, JSON
      Also shown in the data model: Spark Data Model (RDDs), Generic Data Model (Key-Value Store)
  • 34.
    ExpoDB Architecture: Active Data Path
    Applications, APIs/Services, Storage, and Resource Virtualization as on slide 29.
    Processing Engine: Online Transactional Processing (OLTP) + Online Analytical Processing (OLAP)
    Operations: Reasoning, Refinement, Curation, Fusion, Discovery
    Data Model (Enriching Raw Data Towards Knowledge):
      Semantic Layer: Ontology, Rules, Stochastic Models, Tensor Embedding
      Relation Layer: Intra- & Inter-domain Linkage (fine-grained & instance-level)
      Instance Layer: Relational, Graph/RDF, Dense/Sparse Matrices, JSON
      Also shown in the data model: Spark Data Model (RDDs), Generic Data Model (Key-Value Store)
    Also shown: Virtualized Hardware Acceleration (GPU & FPGA)
  • 35.
    The First Step!
    ExpoDB stack as on slide 34 (Applications, APIs/Services, Processing Engine, Data Model, Storage, Resource Virtualization, Virtualized Hardware Acceleration), annotated with the following systems:
    • L-Store (Real-time OLTP+OLAP)
    • FQP (Flexible Query Processor)
    • EmbedS (Ontology)
    • Phenomenological Features (Deep-Learning-as-Oracle)
    • PADRES (Event Processing)
    • IBM DB2 BLU (Column Store)
    • SPIDER (Declarative Data Cleansing)
    • Vraph (Vectorized Graph Processing)
    • Tiresias (Predicting Adverse Drug Reaction)
    • fpga-ToPSS (Algorithmic Trading)
  • 36.
    Thank You
    Q&A
    Exploratory Systems Lab (ExpoLab) website: https://msadoghi.github.io/
  • 37.
    References:
    Data/Knowledge Exploration:
    • Mohammad Sadoghi, Kavitha Srinivas, Oktie Hassanzadeh, Yuan-Chi Chang, Mustafa Canim, Achille Fokoue, Yishai A. Feldman: Self-Curating Databases. EDBT 2016
    • Amit Chandel, Oktie Hassanzadeh, Nick Koudas, Mohammad Sadoghi, Divesh Srivastava: Benchmarking declarative approximate selection predicates. SIGMOD Conference 2007: 353-364
    • Oktie Hassanzadeh, Mohammad Sadoghi, Renée J. Miller: Accuracy of Approximate String Joins Using Grams. QDB 2007
    Drug Safety:
    • Achille Fokoue, Mohammad Sadoghi, Oktie Hassanzadeh, Ping Zhang: Predicting Drug-Drug Interactions Through Large-Scale Similarity-Based Link Prediction. ESWC 2016
    • Achille Fokoue, Oktie Hassanzadeh, Mohammad Sadoghi, Ping Zhang: Predicting Drug-Drug Interactions Through Similarity-Based Link Prediction Over Web Data. WWW 2016
    OLTP & OLAP:
    • Mohammad Sadoghi, Souvik Bhattacherjee, Bishwaranjan Bhattacharjee, Mustafa Canim: L-Store: A Real-time OLTP and OLAP System. CoRR abs/1601.04084 (2016)
    • Kaiwen Zhang, Mohammad Sadoghi, Hans-Arno Jacobsen: DL-Store: A Distributed Hybrid OLTP and OLAP Data Processing Engine. ICDCS 2016
    • Mohammad Sadoghi, Kenneth A. Ross, Mustafa Canim, Bishwaranjan Bhattacharjee: Exploiting SSDs in operational multiversion databases. VLDB J. 25(5): 651-672 (2016)
    • Mohammad Sadoghi, Mustafa Canim, Bishwaranjan Bhattacharjee, Fabian Nagel, Kenneth A. Ross: Reducing Database Locking Contention Through Multi-version Concurrency. PVLDB 7(13): 1331-1342 (2014)
    • Prashanth Menon, Tilmann Rabl, Mohammad Sadoghi, Hans-Arno Jacobsen: CaSSanDra: An SSD boosted key-value store. ICDE 2014: 1162-1167
    • Prashanth Menon, Tilmann Rabl, Mohammad Sadoghi, Hans-Arno Jacobsen: Optimizing key-value stores for hybrid storage architectures. CASCON 2014: 355-358
    • Mohammad Sadoghi, Kenneth A. Ross, Mustafa Canim, Bishwaranjan Bhattacharjee: Making Updates Disk-I/O Friendly Using SSDs. PVLDB 6(11): 997-1008 (2013)
    Hardware Acceleration:
    • Rajesh R. Bordawekar, Mohammad Sadoghi: Accelerating database workloads by software-hardware-system co-design. ICDE 2016
    • Mohammadreza Najafi, Mohammad Sadoghi, Hans-Arno Jacobsen: SplitJoin: A Scalable, Low-latency Stream Join Architecture with Adjustable Ordering Precision. USENIX Annual Technical Conference 2016
    • Mohammadreza Najafi, Mohammad Sadoghi, Hans-Arno Jacobsen: The FQP Vision: Flexible Query Processing on a Reconfigurable Computing Fabric. SIGMOD Record 44(2): 5-10 (2015)
    • Mohammadreza Najafi, Mohammad Sadoghi, Hans-Arno Jacobsen: Configurable hardware-based streaming architecture using Online Programmable-Blocks. ICDE 2015
    • Mohammadreza Najafi, Mohammad Sadoghi, Hans-Arno Jacobsen: Flexible Query Processor on FPGAs. PVLDB 6(12): 1310-1313 (2013)
    • Mohammad Sadoghi, Rija Javed, Naif Tarafdar, Harsh Singh, Rohan Palaniappan, Hans-Arno Jacobsen: Multi-query Stream Processing on FPGAs. ICDE 2012: 1229-1232
    • Mohammad Sadoghi, Harsh Singh, Hans-Arno Jacobsen: Towards highly parallel event processing through reconfigurable hardware. DaMoN 2011: 27-32
    • Mohammad Sadoghi, Harsh Singh, Hans-Arno Jacobsen: fpga-ToPSS: line-speed event processing on fpgas. DEBS 2011: 373-374
    • Mohammad Sadoghi, Hans-Arno Jacobsen, Martin Labrecque, Warren Shum, Harsh Singh: Efficient Event Processing through Reconfigurable Hardware for Algorithmic Trading. PVLDB 3(2): 1525-1528 (2010)