Fostering Serendipity through Big Linked Data

Muhammad Saleem, Maulik R. Kamdar, Aftab Iqbal, Shanmukha Sampath, Helena F. Deus, and Axel-Cyrille Ngonga Ngomo
Semantic Web Challenge at ISWC 2013, October 21-25, 2013, Sydney, Australia

Agenda
• Motivation
• Datasets
• Architecture
• Evaluation
• Requirements
• Demo
• Conclusion and Future Work

Motivation
Fostering Serendipity through Big Data
Triplification, Continuous Integration, and Visualization

Triplification: Linked TCGA
• TCGA is a publicly accessible atlas of cancer-related data from the National Cancer Institute (NCI)
– 9000 patients
– 33 cancer types
– 147,645 raw data files
– 12.7 TB
• This is only 46% of the total expected data; new data is submitted every day
• The goal is to enable cancer researchers to make and validate important discoveries
• Total Linked TCGA: more than 30 billion triples (largest dataset in LOD)

Triplification: PubMed
• Collection of publications from the biomedical domain
• Large amount of metadata (MeSH terms)
• 23+ million publications
• 10,000 new publications/month

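Big Data Continuous Integration
[Figure: architecture diagram. TopFed (Parser, Federator Optimizer, Integrator, Index) accepts a SPARQL query, sends sub-queries to three SPARQL endpoints over RDF data, and returns the integrated query results. Data sources: PubMed (Entrez Utilities) and the TCGA Data Portal, converted and loaded by an RDFizer and an Auto Loader.]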

[Figure: TopFed source-selection tree. Each query triple is routed by category colour to one of three groups of SPARQL endpoints and then, by tumour, to a leaf. Highly scalable.]
• SPARQL endpoints (leaves) and the tumours they hold
– Blue (CNV, SNP, E-Gene, miRNA, E-Protein, Clinical): b1 = 1-16, b2 = 17-33
– Pink (Exon-Expression): p1 = 1-5, p2 = 6-11, p3 = 12-16, p4 = 17-22, p5 = 23-27, p6 = 28-33
– Green (Methylation): g1 = 1-4, g2 = 5-8, g3 = 9-12, g4 = 13-16, g5 = 17-20, g6 = 21-24, g7 = 25-27, g8 = 28-30, g9 = 31-33
• Sets
– C = {CNV, SNP, E-Gene, E-Protein, miRNA, Clinical}
– F = {Expression-Exon}
– B = {DNA-Methylation}
– D = {seg_mean, rpmmm, scaled_est, p_exp_val}
– E = {RPKM}
– M = {beta_value, position}
– A = {chromosome, result, bcr_patient_barcode}
– G = {start, stop}
• Conditions, evaluated for each query triple t(s, p, o) ∈ T
– C-1 = {{p ∈ {D ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ C}} ∧ {{S-Join(p, D ∪ C) ∨ P-Join(p, D ∪ C)} ∨ {!S-Join(p, M ∪ B ∪ E ∪ F) ∧ !P-Join(p, M ∪ B ∪ E ∪ F)}}}
– C-2 = {{p ∈ {E ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ F}} ∧ {{S-Join(p, E ∪ F) ∨ P-Join(p, E ∪ F)} ∨ {!S-Join(p, M ∪ B ∪ D ∪ C) ∧ !P-Join(p, M ∪ B ∪ D ∪ C)}}}
– C-3 = {{p ∈ {M ∪ A} ∨ {p = rdf:type ∧ o ∈ B}} ∧ {{S-Join(p, M ∪ B) ∨ P-Join(p, M ∪ B)} ∨ {!S-Join(p, E ∪ F ∪ D ∪ C) ∧ !P-Join(p, E ∪ F ∪ D ∪ C)}}}
• Routing
– C-1 ∨ Category Colour = blue: blue branch
– C-2 ∨ Category Colour = pink: pink branch
– C-3 ∨ Category Colour = green: green branch
– If the tumour lookup is successful, forward to the corresponding leaf; else broadcast to every one
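
The routing on this slide can be sketched in a few lines of Python. This is a simplified illustration, not the TopFed implementation: it checks only the first half of each condition (predicate membership or rdf:type with a matching class) and ignores the S-Join/P-Join tests. It also assumes that the tumour ranges are listed in the same order as the leaves b1-b2, p1-p6, g1-g9, and that a failed tumour lookup broadcasts within the matching colour only. The function names are hypothetical.

```python
RDF_TYPE = "rdf:type"

# Class and predicate sets as listed on the slide.
C = {"CNV", "SNP", "E-Gene", "E-Protein", "miRNA", "Clinical"}
F = {"Expression-Exon"}
B = {"DNA-Methylation"}
D = {"seg_mean", "rpmmm", "scaled_est", "p_exp_val"}
E = {"RPKM"}
M = {"beta_value", "position"}
A = {"chromosome", "result", "bcr_patient_barcode"}
G = {"start", "stop"}

# (predicates, classes) per colour: the first half of C-1, C-2, C-3 only.
COLOURS = {
    "blue": (D | A | G, C),
    "pink": (E | A | G, F),
    "green": (M | A, B),
}

# Leaf name with the inclusive tumour range it holds (assumed ordering).
LEAVES = {
    "blue": [("b1", 1, 16), ("b2", 17, 33)],
    "pink": [("p1", 1, 5), ("p2", 6, 11), ("p3", 12, 16),
             ("p4", 17, 22), ("p5", 23, 27), ("p6", 28, 33)],
    "green": [("g1", 1, 4), ("g2", 5, 8), ("g3", 9, 12),
              ("g4", 13, 16), ("g5", 17, 20), ("g6", 21, 24),
              ("g7", 25, 27), ("g8", 28, 30), ("g9", 31, 33)],
}


def candidate_colours(p, o=None):
    """Colours whose predicate or class set accepts the triple pattern."""
    return {
        colour
        for colour, (predicates, classes) in COLOURS.items()
        if p in predicates or (p == RDF_TYPE and o in classes)
    }


def select_leaves(colour, tumour=None):
    """Forward to the matching leaf, or broadcast if the lookup fails."""
    leaves = LEAVES[colour]
    if tumour is not None:
        hits = [name for name, low, high in leaves if low <= tumour <= high]
        if hits:
            return hits
    return [name for name, _, _ in leaves]


def route(p, o=None, tumour=None):
    """Endpoints (leaves) a triple pattern would be sent to."""
    selected = []
    for colour in sorted(candidate_colours(p, o)):
        selected.extend(select_leaves(colour, tumour))
    return selected


print(route("seg_mean", tumour=20))
print(route(RDF_TYPE, "DNA-Methylation", tumour=26))
print(route("chromosome", tumour=3))
```

A predicate such as seg_mean belongs to one colour only, so a known tumour narrows the query to a single endpoint. A shared predicate such as chromosome matches all three colours here, which is where the join conditions in C-1 to C-3 would prune further.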

Evaluation: Number of Sub-Query Submissions
[Figure: number of sub-query submissions per query (queries 1-10 and average, y-axis 0-60), FedX vs. TopFed.]
• TopFed submits about 1/3 as many sub-queries as FedX
• Number of ASK requests
– FedX: 480
– TopFed: 10

Evaluation: Query Runtime
[Figure: query execution time (msec, log scale from 1 to 100000) per query (queries 1-10 and average), FedX vs. TopFed.]
• TopFed outperforms FedX significantly on 90% of the queries
• On average, the query runtime of TopFed is about 1/3 of that of FedX
• TopFed's best runtimes (query 2, query 3) are more than 75 times smaller than those of FedX

Big Data Track Requirements
• Data Volume
– 7.36 billion triples from Linked TCGA
– 23 million publications from PubMed
• Data Variety
– The Linked TCGA data was extracted from raw text files of different structures
– The metadata associated with PubMed publications was processed and transformed into RDF
– Unstructured data (publication abstracts) was processed to extract mentions of gene names and cancers
• Data Velocity
– TCGA data doubles every 2 months
– PubMed grows by 10,000 publications per month
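
To make the Data Variety point concrete, here is a minimal sketch of how one tab-delimited text file could be turned into N-Triples. It is an illustration under stated assumptions, not the project's RDFizer: the namespace http://example.org/tcga/ is hypothetical, the sample row (including the barcode) is made up, and every value is emitted as a plain string literal. Only the column names are taken from the predicate sets shown earlier in the deck.

```python
import csv
import io

# Hypothetical namespace, NOT the real Linked TCGA vocabulary.
BASE = "http://example.org/tcga/"


def literal(value):
    """Render a value as an N-Triples string literal with basic escaping."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def triplify(tsv_text, record_prefix="record"):
    """Yield one N-Triples line per non-empty cell of a tab-delimited file."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        subject = f"<{BASE}{record_prefix}/{row_number}>"
        for column, value in row.items():
            # Skip empty or missing cells and cells without a header.
            if column and value:
                yield f"{subject} <{BASE}{column}> {literal(value)} ."


# Made-up sample row for illustration only.
sample = (
    "bcr_patient_barcode\tchromosome\tstart\tstop\tseg_mean\n"
    "TCGA-XX-0001\t1\t100\t200\t0.25\n"
)

triples = list(triplify(sample))
for line in triples:
    print(line)
```

Files with a different structure would need their own column mapping, which is why a per-format refiner step precedes the conversion.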

PubMed Paper-wise Visualization

Genome-wise Patients Results Visualization

Everything is Public
• Demo: http://srvgal78.deri.ie/tcga-pubmed/
• TopFed: https://code.google.com/p/topfed/
• TCGA Data Refiner, RDFizer: http://goo.gl/vSnBEJ
• Utilities: http://goo.gl/kNrFdI
• Linked TCGA: http://tcga.deri.ie/
saleem@informatik.uni-leipzig.de
AKSW, University of Leipzig, Germany
