Fostering Serendipity through Big Linked Data


Semantic Web Challenge - Big Data track winner at ISWC2013


  1. Fostering Serendipity through Big Linked Data
     Muhammad Saleem, Maulik R. Kamdar, Aftab Iqbal, Shanmukha Sampath, Helena F. Deus, and Axel-Cyrille Ngonga Ngomo
     Semantic Web Challenge at ISWC2013, October 21-25, 2013, Sydney, Australia
  2. Agenda
     • Motivation
     • Datasets
     • Architecture
     • Evaluation
     • Requirements
     • Demo
     • Conclusion and Future Work
  3. Motivation
     Fostering Serendipity through Big Data: Triplification, Continuous Integration, and Visualization
  4. Triplification: Linked TCGA
     • TCGA is a publicly accessible atlas of cancer-related data from the National Cancer Institute (NCI)
       – 9000 patients
       – 33 cancer types
       – 147,645 raw data files
       – 12.7 TB
     • This is only 46% of the total expected data, with new data being submitted every day
     • The goal is to enable cancer researchers to make and validate important discoveries
     • Total Linked TCGA > 30 billion triples (largest dataset of LOD)
  5. Triplification: PubMed
     • Collection of publications from the bio-medical domain
     • Large amount of metadata (MeSH terms)
     • 23+ million publications
     • 10,000 new publications/month
  6. Big Data Continuous Integration
     [Architecture diagram. Labels recovered from the slide:]
     • TopFed: Parser, Federator, Optimizer, Integrator, Index
     • Query flow: SPARQL Query, Sub-query, Results
     • Data sources: PubMed (Entrez Utilities), TCGA Data Portal
     • Loading: RDFizer, Auto Loader
     • Storage: SPARQL endpoints (RDF)
  7. TopFed index and source selection (Highly Scalable)
     [Tree diagram of tumour SPARQL endpoints in three colour categories; category labels on the slide: Exon-Expression, Methylation, (CNV, SNP, E-Gene, miRNA, E-Protein, Clinical)]
     • Sets
       – A = {chromosome, result, bcr_patient_barcode}
       – B = {DNA-Methylation}
       – C = {CNV, SNP, E-Gene, E-Protein, miRNA, Clinical}
       – D = {seg_mean, rpmmm, scaled_est, p_exp_val}
       – E = {RPKM}
       – F = {Expression-Exon}
       – G = {start, stop}
       – M = {beta_value, position}
     • For each query triple t(s, p, o) ∈ T
       – C-1 ∨ Category Colour = blue
       – C-2 ∨ Category Colour = pink
       – C-3 ∨ Category Colour = green
     • Conditions
       – C-1 = {{p ∈ {D ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ C}} ∧ {{S-Join(p, D ∪ C) ∨ P-Join(p, D ∪ C)} ∨ {!S-Join(p, M ∪ B ∪ E ∪ F) ∧ !P-Join(p, M ∪ B ∪ E ∪ F)}}}
       – C-2 = {{p ∈ {E ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ F}} ∧ {{S-Join(p, E ∪ F) ∨ P-Join(p, E ∪ F)} ∨ {!S-Join(p, M ∪ B ∪ D ∪ C) ∧ !P-Join(p, M ∪ B ∪ D ∪ C)}}}
       – C-3 = {{p ∈ {M ∪ A} ∨ {p = rdf:type ∧ o ∈ B}} ∧ {{S-Join(p, M ∪ B) ∨ P-Join(p, M ∪ B)} ∨ {!S-Join(p, E ∪ F ∪ D ∪ C) ∧ !P-Join(p, E ∪ F ∪ D ∪ C)}}}
     • If the tumour lookup is successful, forward to the corresponding leaf; else broadcast to every leaf
     • Tumour SPARQL endpoints (leaf labels and tumour ranges as listed on the slide)
       – b1, b2: 1-16, 17-33
       – p1 to p6: 1-5, 6-11, 12-16, 17-22, 23-27, 28-33
       – g1 to g9: 1-4, 5-8, 9-12, 13-16, 17-20, 21-24, 25-27, 28-30, 31-33
  8. Evaluation: Number of Sub-Query Submissions
     [Bar chart: number of sub-queries submitted by FedX and TopFed for queries 1-10 and their average; y-axis from 0 to 60]
     • TopFed submits about 1/3 as many sub-queries as FedX
     • Number of ASK requests
       – FedX: 480
       – TopFed: 10
  9. Evaluation: Query Runtime
     [Chart: query execution time (msec, log scale from 1 to 100000) of FedX and TopFed for queries 1-10 and their average]
     • TopFed outperforms FedX significantly on 90% of the queries
     • On average, the query runtime of TopFed is about 1/3 of that of FedX
     • TopFed's best runtimes (query 2, query 3) are more than 75 times smaller than those of FedX
  10. Big Data Track Requirements
     • Data Volume
       – 7.36 billion triples from Linked TCGA
       – 23 million publications from PubMed
     • Data Variety
       – The Linked TCGA data was extracted from raw text files of different structures
       – The metadata associated with PubMed publications was processed and transformed into RDF
       – Unstructured data (publication abstracts) is processed to extract mentions of gene names and cancers
     • Data Velocity
       – TCGA data doubles every 2 months
       – PubMed grows by 10k publications/month
  11. Big Data Visualization
  12. Tumor-wise Visualization
  13. PubMed Paper-wise Visualization
  14. Genome-wise Patients Results Visualization
  15. Everything is Public
     • Demo: http://srvgal78.deri.ie/tcga-pubmed/
     • TopFed: https://code.google.com/p/topfed/
     • TCGA Data Refiner, RDFizer: http://goo.gl/vSnBEJ
     • Utilities: http://goo.gl/kNrFdI
     • Linked TCGA: http://tcga.deri.ie/
     saleem@informatik.uni-leipzig.de, AKSW, University of Leipzig, Germany
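
The source-selection rules on slide 7 can be illustrated with a minimal Python sketch. It is not the TopFed implementation. It checks only the first clause of each condition (predicate membership, or `rdf:type` with a matching class) and leaves out the S-Join and P-Join clauses. The function names `candidate_categories` and `select_leaves` are hypothetical, and the pairing of leaf labels (b, p, g) with the colour categories and tumour ranges is an assumption read off the order of the labels on the slide.

```python
RDF_TYPE = "rdf:type"

# Predicate and class sets copied from slide 7.
A = {"chromosome", "result", "bcr_patient_barcode"}
B = {"DNA-Methylation"}
C = {"CNV", "SNP", "E-Gene", "E-Protein", "miRNA", "Clinical"}
D = {"seg_mean", "rpmmm", "scaled_est", "p_exp_val"}
E = {"RPKM"}
F = {"Expression-Exon"}
G = {"start", "stop"}
M = {"beta_value", "position"}

# Simplified rules: (allowed predicates, allowed rdf:type objects) per colour.
# Assumption: the join clauses of C-1, C-2 and C-3 are omitted here.
RULES = {
    "blue": (D | A | G, C),   # first clause of C-1
    "pink": (E | A | G, F),   # first clause of C-2
    "green": (M | A, B),      # first clause of C-3
}

# Assumption: leaf labels paired with tumour ranges by their order on the slide.
LEAVES = {
    "blue": [("b1", 1, 16), ("b2", 17, 33)],
    "pink": [("p1", 1, 5), ("p2", 6, 11), ("p3", 12, 16),
             ("p4", 17, 22), ("p5", 23, 27), ("p6", 28, 33)],
    "green": [("g1", 1, 4), ("g2", 5, 8), ("g3", 9, 12),
              ("g4", 13, 16), ("g5", 17, 20), ("g6", 21, 24),
              ("g7", 25, 27), ("g8", 28, 30), ("g9", 31, 33)],
}


def candidate_categories(p, o=None):
    """Return the colour categories whose simplified rule matches predicate p."""
    return {
        colour
        for colour, (predicates, classes) in RULES.items()
        if p in predicates or (p == RDF_TYPE and o in classes)
    }


def select_leaves(colour, tumour=None):
    """Forward to the matching leaf if the tumour lookup succeeds, else broadcast."""
    leaves = LEAVES[colour]
    hits = [name for name, lo, hi in leaves
            if tumour is not None and lo <= tumour <= hi]
    return hits or [name for name, _, _ in leaves]
```

In this sketch a predicate from set A, such as `chromosome`, matches all three categories, which is the case the join clauses on the slide are there to narrow down.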
