Fostering Serendipity through Big Linked Data

Semantic Web Challenge - Big Data track winner at ISWC2013

  1. Fostering Serendipity through Big Linked Data
     Muhammad Saleem, Maulik R. Kamdar, Aftab Iqbal, Shanmukha Sampath, Helena F. Deus, and Axel-Cyrille Ngonga Ngomo
     Semantic Web Challenge at ISWC2013, October 21-25, 2013, Sydney, Australia
  2. Agenda
     • Motivation
     • Datasets
     • Architecture
     • Evaluation
     • Requirements
     • Demo
     • Conclusion and Future Work
  3. Motivation
     Fostering Serendipity through Big Data: Triplification, Continuous Integration, and Visualization
  4. Triplification: Linked TCGA
     • TCGA is a publicly accessible atlas of cancer-related data from the National Cancer Institute (NCI)
       – 9000 patients
       – 33 cancer types
       – 147,645 raw data files
       – 12.7 TB
     • This is only 46% of the total expected data, and new data is submitted every day
     • The goal is to enable cancer researchers to make and validate important discoveries
     • Total Linked TCGA > 30 billion triples (largest dataset of LOD)
  5. Triplification: PubMed
     • Collection of publications from the biomedical domain
     • Large amount of metadata (MeSH terms)
     • 23+ million publications
     • 10,000 new publications/month
  6. Big Data Continuous Integration
     [Architecture diagram. Labels: TopFed (Parser, Federator, Optimizer, Integrator); SPARQL Query, Sub-query, Results; Index; PubMed, Entrez Utilities, RDFizer; TCGA Data Portal, Auto Loader; three RDF stores, each behind a SPARQL endpoint]
  7. [Diagram: tree of tumours SPARQL endpoints in three colour-coded categories, captioned "Highly Scalable"]
     Predicate and class sets:
       C = {CNV, SNP, E-Gene, E-Protein, miRNA, Clinical}
       D = {seg_mean, rpmmm, scaled_est, p_exp_val}
       F = {Expression-Exon}
       E = {RPKM}
       B = {DNA-Methylation}
       M = {beta_value, position}
       A = {chromosome, result, bcr_patient_barcode}
       G = {start, stop}
     For each query triple t(s, p, o) ∈ T:
       C-1 = {{p ∈ {D ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ C}} ∧ {{S-Join(p, D ∪ C) ∨ P-Join(p, D ∪ C)} ∨ {!S-Join(p, M ∪ B ∪ E ∪ F) ∧ !P-Join(p, M ∪ B ∪ E ∪ F)}}}
       C-2 = {{p ∈ {E ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ F}} ∧ {{S-Join(p, E ∪ F) ∨ P-Join(p, E ∪ F)} ∨ {!S-Join(p, M ∪ B ∪ D ∪ C) ∧ !P-Join(p, M ∪ B ∪ D ∪ C)}}}
       C-3 = {{p ∈ {M ∪ A} ∨ {p = rdf:type ∧ o ∈ B}} ∧ {{S-Join(p, M ∪ B) ∨ P-Join(p, M ∪ B)} ∨ {!S-Join(p, E ∪ F ∪ D ∪ C) ∧ !P-Join(p, E ∪ F ∪ D ∪ C)}}}
     Routing by category:
       • C-1 ∨ Category Colour = blue: (CNV, SNP, E-Gene, miRNA, E-Protein, Clinical)
       • C-2 ∨ Category Colour = pink: Exon-Expression
       • C-3 ∨ Category Colour = green: Methylation
     Routing by tumour:
       • If the tumour lookup is successful, forward to the corresponding leaf
       • Else broadcast to every one
     Endpoint labels in the tree:
       b1 b2
       p1 p2 p3 p4 p5 p6
       g1 g2 g3 g4 g5 g6 g7 g8 g9
     Tumour ranges in the tree:
       1-16, 17-33
       1-5, 6-11, 12-16, 17-22, 23-27, 28-33
       1-4, 5-8, 9-12, 13-16, 17-20, 21-24, 25-27, 28-30, 31-33
  8. Evaluation: Number of Sub-Query Submissions
     [Chart: number of sub-query submissions (axis 0 to 60) for FedX and TopFed on queries 1 to 10 and their average]
     • TopFed submits about one third as many sub-queries as FedX
     • Number of ASK requests
       – FedX 480
       – TopFed 10
  9. Evaluation: Query Runtime
     [Chart: query execution time in msec (log scale, 1 to 100000) for FedX and TopFed on queries 1 to 10 and their average]
     • TopFed outperforms FedX significantly on 90% of the queries
     • On average, the query runtime of TopFed is about one third of that of FedX
     • TopFed's best runtimes (query 2, query 3) are more than 75 times shorter than those of FedX
  10. Big Data Track Requirements
      • Data Volume
        – 7.36 billion triples from Linked TCGA
        – 23 million publications from PubMed
      • Data Variety
        – The Linked TCGA data was extracted from raw text files of different structures
        – The metadata associated with PubMed publications was processed and transformed into RDF
        – Unstructured data (publication abstracts) is processed to extract mentions of gene names and cancers
      • Data Velocity
        – TCGA data doubles every 2 months
        – PubMed grows by 10k publications/month
  11. Big Data Visualization
  12. Tumor-wise Visualization
  13. PubMed Paper-wise Visualization
  14. Genome-wise Patients Results Visualization
  15. Everything is Public
      • Demo: http://srvgal78.deri.ie/tcga-pubmed/
      • TopFed: https://code.google.com/p/topfed/
      • TCGA Data Refiner, RDFizer: http://goo.gl/vSnBEJ
      • Utilities: http://goo.gl/kNrFdI
      • Linked TCGA: http://tcga.deri.ie/
      saleem@informatik.uni-leipzig.de
      AKSW, University of Leipzig, Germany
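
The category rules C-1 to C-3 and the tumour lookup on slide 7 can be read as a two-step source selection. The Python sketch below is one possible reading, not TopFed's actual code. The slide does not define S-Join and P-Join, so the sketch assumes they mean "shares a subject with" and "is chained through a subject/object with" another triple pattern of the query. It also assumes that the labels b1-b2, p1-p6 and g1-g9 pair in order with the tumour ranges of the blue, pink and green categories, and that "broadcast to every one" means every leaf of the selected category. All function and variable names are hypothetical.

```python
from typing import NamedTuple, Optional

RDF_TYPE = "rdf:type"

# Predicate and class sets copied from slide 7.
C = {"CNV", "SNP", "E-Gene", "E-Protein", "miRNA", "Clinical"}
D = {"seg_mean", "rpmmm", "scaled_est", "p_exp_val"}
F = {"Expression-Exon"}
E = {"RPKM"}
B = {"DNA-Methylation"}
M = {"beta_value", "position"}
A = {"chromosome", "result", "bcr_patient_barcode"}
G = {"start", "stop"}


class Triple(NamedTuple):
    s: str
    p: str
    o: str


def mentions(t: Triple, terms: set) -> bool:
    """True if t's predicate, or the class in its rdf:type pattern, is in terms."""
    return t.p in terms or (t.p == RDF_TYPE and t.o in terms)


def s_join(t: Triple, bgp: list, terms: set) -> bool:
    # ASSUMPTION: S-Join = subject-subject join with a pattern mentioning terms.
    return any(u is not t and u.s == t.s and mentions(u, terms) for u in bgp)


def p_join(t: Triple, bgp: list, terms: set) -> bool:
    # ASSUMPTION: P-Join = subject-object (path) join with such a pattern.
    return any(u is not t and (u.o == t.s or u.s == t.o) and mentions(u, terms)
               for u in bgp)


# Per category: (own predicates, own classes, own join terms, foreign join terms).
RULES = {
    "C-1": (D | A | G, C, D | C, M | B | E | F),
    "C-2": (E | A | G, F, E | F, M | B | D | C),
    "C-3": (M | A, B, M | B, E | F | D | C),
}

# ASSUMPTION: leaf labels paired in order with the tumour ranges on the slide.
LEAVES = {
    "C-1": [("b1", 1, 16), ("b2", 17, 33)],
    "C-2": [("p1", 1, 5), ("p2", 6, 11), ("p3", 12, 16),
            ("p4", 17, 22), ("p5", 23, 27), ("p6", 28, 33)],
    "C-3": [("g1", 1, 4), ("g2", 5, 8), ("g3", 9, 12), ("g4", 13, 16),
            ("g5", 17, 20), ("g6", 21, 24), ("g7", 25, 27),
            ("g8", 28, 30), ("g9", 31, 33)],
}


def categories(t: Triple, bgp: list) -> set:
    """Categories whose rule (C-1, C-2 or C-3) the triple pattern t satisfies."""
    selected = set()
    for name, (preds, classes, own, foreign) in RULES.items():
        first = t.p in preds or (t.p == RDF_TYPE and t.o in classes)
        joined_own = s_join(t, bgp, own) or p_join(t, bgp, own)
        no_foreign = not s_join(t, bgp, foreign) and not p_join(t, bgp, foreign)
        if first and (joined_own or no_foreign):
            selected.add(name)
    return selected


def route(category: str, tumour: Optional[int] = None) -> list:
    """Forward to the leaf covering the tumour, else broadcast to all leaves."""
    leaves = LEAVES[category]
    hits = [name for name, low, high in leaves
            if tumour is not None and low <= tumour <= high]
    return hits or [name for name, _, _ in leaves]
```

Under these assumptions, a pattern on `chromosome` (set A) qualifies for all three categories when it stands alone, but only for C-3 once it shares its subject with a `beta_value` pattern. A known tumour number then narrows the category's leaves to a single endpoint.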
