Fostering Serendipity through Big 
Linked Data 
Muhammad Saleem, Maulik R. Kamdar, Aftab Iqbal,
Shanmukha Sampath, Helena F. Deus, and Axel-Cyrille
Ngonga Ngomo
Semantic Web Challenge at ISWC2013, October 21-25, 2013, Sydney, Australia
Agenda 
• Motivation 
• Datasets 
• Architecture 
• Evaluation 
• Requirements 
• Demo 
• Conclusion and Future Work
Motivation 
Fostering Serendipity through Big Data 
Triplification, Continuous Integration, 
and Visualization
Triplification: Linked TCGA 
• TCGA (The Cancer Genome Atlas) is a publicly accessible
atlas of cancer-related data from the National Cancer
Institute (NCI)
– 9000 patients 
– 33 cancer types 
– 147,645 raw data files 
– 12.7 TB 
• This is only 46% of the total expected data;
new data is submitted every day
• Goal is to enable cancer researchers to 
make and validate important discoveries 
• Total Linked TCGA > 30 billion triples
(largest dataset in the LOD cloud)
Triplification: PubMed 
• Collection of publications from the biomedical 
domain 
• Large amount of metadata (MeSH terms) 
• 23+ million publications 
• 10,000 new publications/month
Big Data Continuous Integration 
[Figure: architecture diagram. Labelled components: TopFed (Parser, Federator, Optimizer, Integrator, Index); PubMed accessed through Entrez Utilities; TCGA Data Portal; RDFizer; Auto Loader; three SPARQL endpoints, each over RDF. Labelled flows: SPARQL Query, Sub-query, Results.]
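The architecture lists Entrez Utilities as the access path to PubMed. As a minimal sketch of what a request to that service could look like, the helper below builds an E-utilities `efetch` URL for a batch of PubMed IDs. The function name `build_efetch_url` and the example IDs are illustrative assumptions; the slides do not show how the authors' RDFizer actually calls the service.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities endpoint for fetching records.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pubmed_ids, retmode="xml"):
    """Return an efetch URL for the given PubMed IDs (hypothetical helper).

    Only the URL is built here; no network request is made.
    """
    params = {
        "db": "pubmed",              # query the PubMed database
        "id": ",".join(pubmed_ids),  # efetch accepts a comma-separated ID list
        "retmode": retmode,          # XML carries the MeSH metadata
    }
    return f"{EFETCH}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder IDs, for illustration only.
    print(build_efetch_url(["123", "456"]))
```

The returned XML would then need to be parsed and mapped to RDF, a step not sketched here.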
• For each query triple t(s, p, o) ∈ T
• Highly scalable
• Predicate and class sets
– A = {chromosome, result, bcr_patient_barcode}
– B = {DNA-Methylation}
– C = {CNV, SNP, E-Gene, E-Protein, miRNA, Clinical}
– D = {seg_mean, rpmmm, scaled_est, p_exp_val}
– E = {RPKM}
– F = {Expression-Exon}
– G = {start, stop}
– M = {beta_value, position}
• Category conditions
– C-1 = {{p ∈ {D ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ C}} ∧ {{S-Join(p, D ∪ C) ∨ P-Join(p, D ∪ C)} ∨ {!S-Join(p, M ∪ B ∪ E ∪ F) ∧ !P-Join(p, M ∪ B ∪ E ∪ F)}}}
– C-2 = {{p ∈ {E ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ F}} ∧ {{S-Join(p, E ∪ F) ∨ P-Join(p, E ∪ F)} ∨ {!S-Join(p, M ∪ B ∪ D ∪ C) ∧ !P-Join(p, M ∪ B ∪ D ∪ C)}}}
– C-3 = {{p ∈ {M ∪ A} ∨ {p = rdf:type ∧ o ∈ B}} ∧ {{S-Join(p, M ∪ B) ∨ P-Join(p, M ∪ B)} ∨ {!S-Join(p, E ∪ F ∪ D ∪ C) ∧ !P-Join(p, E ∪ F ∪ D ∪ C)}}}
• Routing by category
– C-1 ∨ Category Colour = blue: (CNV, SNP, E-Gene, miRNA, E-Protein, Clinical)
– C-2 ∨ Category Colour = pink: Exon-Expression
– C-3 ∨ Category Colour = green: Methylation
• IF tumour lookup is successful, forward to the corresponding leaf; ELSE broadcast to every leaf
• Leaves (SPARQL endpoints) and the tumours they hold
– Blue: b1 (1-16), b2 (17-33)
– Pink: p1 (1-5), p2 (6-11), p3 (12-16), p4 (17-22), p5 (23-27), p6 (28-33)
– Green: g1 (1-4), g2 (5-8), g3 (9-12), g4 (13-16), g5 (17-20), g6 (21-24), g7 (25-27), g8 (28-30), g9 (31-33)
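The routing on this slide can be sketched roughly as follows. This is a simplified illustration, not TopFed's implementation: it evaluates only the first conjunct of C-1, C-2, and C-3 (the membership test on p and o) and omits the S-Join/P-Join checks, which the slide does not define. The function names, the `tumour` parameter (a tumour number from 1 to 33, or `None` when the lookup fails), and the choice to broadcast only within the matched category are all assumptions.

```python
RDF_TYPE = "rdf:type"

# Predicate and class sets as given on the slide.
A = {"chromosome", "result", "bcr_patient_barcode"}
B = {"DNA-Methylation"}
C = {"CNV", "SNP", "E-Gene", "E-Protein", "miRNA", "Clinical"}
D = {"seg_mean", "rpmmm", "scaled_est", "p_exp_val"}
E = {"RPKM"}
F = {"Expression-Exon"}
G = {"start", "stop"}
M = {"beta_value", "position"}

# Leaves per category colour with inclusive tumour ranges, as on the slide.
LEAVES = {
    "blue": {"b1": (1, 16), "b2": (17, 33)},
    "pink": {"p1": (1, 5), "p2": (6, 11), "p3": (12, 16),
             "p4": (17, 22), "p5": (23, 27), "p6": (28, 33)},
    "green": {"g1": (1, 4), "g2": (5, 8), "g3": (9, 12),
              "g4": (13, 16), "g5": (17, 20), "g6": (21, 24),
              "g7": (25, 27), "g8": (28, 30), "g9": (31, 33)},
}


def candidate_colours(p, o):
    """Membership part of C-1..C-3 only; the join checks are omitted."""
    colours = []
    if p in D | A | G or (p == RDF_TYPE and o in C):
        colours.append("blue")
    if p in E | A | G or (p == RDF_TYPE and o in F):
        colours.append("pink")
    if p in M | A or (p == RDF_TYPE and o in B):
        colours.append("green")
    return colours


def select_leaves(p, o=None, tumour=None):
    """Return the leaf endpoints a triple pattern would be sent to."""
    selected = []
    for colour in candidate_colours(p, o):
        leaves = LEAVES[colour]
        # Tumour lookup: keep only the leaf whose range holds the tumour.
        hits = [name for name, (lo, hi) in leaves.items()
                if tumour is not None and lo <= tumour <= hi]
        # Lookup failed: broadcast to every leaf of this category (assumed).
        selected.extend(hits if hits else list(leaves))
    return selected


if __name__ == "__main__":
    print(select_leaves("seg_mean", tumour=20))
    print(select_leaves("beta_value"))
```

Because predicates in A (and G) appear in more than one condition, this simplified version sends such patterns to several categories; in the slide's full conditions the join checks are what narrow that choice.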
Evaluation: Number of Sub-Query Submissions 
[Chart: number of sub-queries submitted by FedX and TopFed for queries 1-10 and on average] 
• TopFed submits about 1/3 as many sub-queries as FedX 
• Number of ASK requests 
– FedX 480 
– TopFed 10 
Evaluation: Query Runtime 
[Chart: query execution time (msec, log scale) of FedX and TopFed for queries 1-10 and on average] 
• TopFed significantly outperforms FedX on 90% of the queries 
• On average, the query runtime of TopFed is about 1/3 of that 
of FedX 
• TopFed's best runtimes (query 2, query 3) are more than 75 times 
shorter than those of FedX
Big Data Track Requirements 
• Data Volume 
– 7.36 billion triples from Linked TCGA 
– 23 million publications from PubMed 
• Data Variety 
– The Linked TCGA data was extracted from raw text files of different 
structures 
– The metadata associated with PubMed publications was processed and 
transformed into RDF 
– Unstructured data (publication abstracts) is processed to extract 
mentions of gene names and cancers 
• Data Velocity 
– TCGA data doubles every 2 months 
– PubMed: 10k new publications/month
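The Data Variety point about turning raw text files into RDF can be illustrated with a minimal sketch. It assumes a tab-separated file whose header row names the properties, and it emits N-Triples with plain string literals. The namespace `http://example.org/tcga/`, the function name `triplify`, the record naming scheme, and the sample values are hypothetical; they are not the vocabulary or logic of the authors' RDFizer.

```python
import csv
import io

BASE = "http://example.org/tcga/"  # hypothetical namespace, not the real one


def triplify(tsv_text, record_prefix="rec"):
    """Turn each row of a tab-separated text into N-Triples lines."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    triples = []
    for i, row in enumerate(reader, start=1):
        subject = f"<{BASE}{record_prefix}{i}>"  # one subject per data row
        for column, value in row.items():
            if column is None or value is None:
                continue  # skip ragged rows with missing or extra cells
            predicate = f"<{BASE}{column}>"  # column header becomes predicate
            # Escape backslashes and quotes for an N-Triples string literal.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'{subject} {predicate} "{escaped}" .')
    return triples


if __name__ == "__main__":
    # Made-up sample row using predicate names that appear in the slides.
    sample = "bcr_patient_barcode\tchromosome\tseg_mean\nP-0001\t1\t0.12\n"
    for line in triplify(sample):
        print(line)
```

A real pipeline would also need typed literals and one parser per file structure, since the slides note that the raw files differ in structure.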
Big Data Visualization
Tumor-wise Visualization
PubMed Paper-wise Visualization
Genome-wise Patients Results Visualization
Everything is Public 
• Demo: http://srvgal78.deri.ie/tcga-pubmed/ 
• TopFed: https://code.google.com/p/topfed/ 
• TCGA Data Refiner, RDFizer: http://goo.gl/vSnBEJ 
• Utilities: http://goo.gl/kNrFdI 
• Linked TCGA: http://tcga.deri.ie/ 
saleem@informatik.uni-leipzig.de 
AKSW, University of Leipzig, Germany
