Linked Cancer Genome Atlas Database

Linked Cancer Genome Atlas Database, Linked Data Cup Award Winner at I-Semantics 2013. http://tcga.deri.ie/

Published in: Technology, Health & Medicine

  1. Linked Cancer Genome Atlas Database
     Muhammad Saleem, Shanmukha Sampath Padmanabhuni, Axel-Cyrille Ngonga Ngomo, Jonas S. Almeida, Stefan Decker, Helena F. Deus.
     Linked Data Cup, I-Semantics 2013, September 04 - 06 2013, Graz, Austria
  2. Agenda
     • Cancer Genome Atlas (TCGA) introduction
     • Problem statement
     • Linked TCGA a scalable solution
     • Cancer treatment using Linked TCGA
     • Demo of the use cases
     • Conclusion
  3. TCGA Introduction
     • A publicly accessible atlas of cancer-related data from the National Cancer Institute (NCI)
       – 9000 patients
       – 33 cancer types
       – 147,645 raw data files
       – a total of 12.7 terabytes of data
     • This is only 46% of the total expected data, with new data being submitted every day
     • The goal is to enable cancer researchers to make and validate important discoveries
  4. Problem Statement
     • Data in the TCGA is organized as text archives with no remote querying interface, so researchers must:
       – Download very large archives and wait in queues
       – Parse the relevant text
       – Collect the critical co-variates necessary for analysis
     • Various types of experimental results are not connected biologically
     • TCGA data should be made publicly available for remote querying and virtual integration
  5. Linked TCGA a Scalable Solution: RDFization
  6. Text to RDF Conversion (Raw -> Data Refiner -> Refined)

     Raw:
     composite element REF  gene_symbol  chromosome  position   beta_value
     cg00000292             ATP2A1       16          28890100   0.439271303584937
     cg00002426             SLMAP        3           57743543   0.245147665381461
     cg00003994             MEOX2        7           15725862   0.0440161061196347
     cg00005847             HOXD3        2           177029073  0.741342927038953
     cg00006414             ZNF425       7           148822837  NA
     cg00007981             PANX1        11          93862594   0.0290713821114479
     cg00008493             COX8C        14          93813777   0.985555436681019
     cg00008713             IMPA2        18          11980953   0.0109832005732912
     cg00009407             TTC8         14          89290921   0.0104525957219692

     Refined:
     chromosome  position   beta_value
     16          28890100   0.439271303584937
     3           57743543   0.245147665381461
     7           15725862   0.0440161061196347
     2           177029073  0.741342927038953
     11          93862594   0.0290713821114479
     14          93813777   0.985555436681019
     18          11980953   0.0109832005732912
     14          89290921   0.0104525957219692
  9. Text to RDF Conversion (Raw -> Data Refiner -> Refined -> RDFizer -> RDFized)

     RDFized:
     @prefix b: <http://tcga.deri.ie/> .
     @prefix d: <http://tcga.deri.ie/schema/bcr_patient_barcode> .
     @prefix r: <http://tcga.deri.ie/schema/result> .
     @prefix c: <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> .
     @prefix w: <http://tcga.deri.ie/schema/dna_methylation_result> .
     @prefix m: <http://tcga.deri.ie/schema/chromosome> .
     @prefix v: <http://tcga.deri.ie/schema/position> .
     @prefix u: <http://tcga.deri.ie/schema/beta_value> .

     b:TCGA-A2-A0CX d: "TCGA-A2-A0CX" .
     b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d1 .
     b:TCGA-A2-A0CX-d1 c: w: ; m: "16" ; v: "28890100" ; u: "0.439271303584937" .
     b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d2 .
     b:TCGA-A2-A0CX-d2 c: w: ; m: "3" ; v: "57743543" ; u: "0.245147665381461" .
     b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d3 .
     b:TCGA-A2-A0CX-d3 c: w: ; m: "7" ; v: "15725862" ; u: "0.0440161061196347" .
     b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d4 .
     b:TCGA-A2-A0CX-d4 c: w: ; m: "2" ; v: "177029073" ; u: "0.741342927038953" .
     b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d5 .
     b:TCGA-A2-A0CX-d5 c: w: ; m: "11" ; v: "93862594" ; u: "0.0290713821114479" .
     b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d6 .
     b:TCGA-A2-A0CX-d6 c: w: ; m: "14" ; v: "93813777" ; u: "0.985555436681019" .
     b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d7 .
     b:TCGA-A2-A0CX-d7 c: w: ; m: "18" ; v: "11980953" ; u: "0.0109832005732912" .
     b:TCGA-A2-A0CX r: b:TCGA-A2-A0CX-d8 .
     b:TCGA-A2-A0CX-d8 c: w: ; m: "14" ; v: "89290921" ; u: "0.0104525957219692" .
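The refine-then-RDFize steps shown on these slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' tool: the raw file is assumed to be tab-separated, the function names `refine` and `rdfize` are hypothetical, and the output is written as N-Triples with full URIs rather than the prefixed Turtle on the slide. The URIs themselves are the ones given in the slide's prefix declarations.

```python
import csv
import io

# URIs taken from the prefix declarations on the slide.
BASE = "http://tcga.deri.ie/"
SCHEMA = BASE + "schema/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Three rows of the raw table; tab separation is an assumption.
SAMPLE_RAW = (
    "composite element REF\tgene_symbol\tchromosome\tposition\tbeta_value\n"
    "cg00000292\tATP2A1\t16\t28890100\t0.439271303584937\n"
    "cg00006414\tZNF425\t7\t148822837\tNA\n"
    "cg00002426\tSLMAP\t3\t57743543\t0.245147665381461\n"
)


def refine(raw_text):
    """Keep chromosome, position and beta_value; drop rows whose beta_value is NA."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter="\t")
    return [
        (row["chromosome"], row["position"], row["beta_value"])
        for row in reader
        if row["beta_value"] != "NA"
    ]


def rdfize(patient_barcode, rows):
    """Return N-Triples lines for one patient and its refined methylation rows."""
    patient = f"<{BASE}{patient_barcode}>"
    lines = [f'{patient} <{SCHEMA}bcr_patient_barcode> "{patient_barcode}" .']
    for index, (chromosome, position, beta_value) in enumerate(rows, start=1):
        # One result node per refined row, numbered d1, d2, ... as on the slide.
        result = f"<{BASE}{patient_barcode}-d{index}>"
        lines.append(f"{patient} <{SCHEMA}result> {result} .")
        lines.append(f"{result} <{RDF_TYPE}> <{SCHEMA}dna_methylation_result> .")
        lines.append(f'{result} <{SCHEMA}chromosome> "{chromosome}" .')
        lines.append(f'{result} <{SCHEMA}position> "{position}" .')
        lines.append(f'{result} <{SCHEMA}beta_value> "{beta_value}" .')
    return lines


refined_rows = refine(SAMPLE_RAW)
triples = rdfize("TCGA-A2-A0CX", refined_rows)
for line in triples:
    print(line)
```

Each refined row yields five triples (one link from the patient plus four about the result node), and numbering the result nodes after refinement matches the slide, where the NA row does not receive a `-dN` identifier.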
  10. Linked TCGA Data Workflow
  11. Linked TCGA Tumors Statistics

      Tumor Type                          Original   Refined    RDFized    Triples
                                          Size (GB)  Size (GB)  Size (GB)  (Million)
      Cervical (CESC)                     8.75       2.44       8.86       400.19
      Rectal adenocarcinoma (READ)        8.07       2.25       9.04       413.31
      Papillary Kidney (KIRP)             10.40      2.90       10.4       469.65
      Bladder cancer (BLCA)               12.16      3.39       12.3       556.38
      Acute Myeloid Leukemia (LAML)       14.85      4.14       15.1       684.05
      Lower Grade Glioma (LGG)            17.08      4.76       17.1       778.82
      Prostate adenocarcinoma (PRAD)      18.05      5.03       18.1       821.01
      Lung squamous carcinoma (LUSC)      20.63      5.75       20.5       927.08
      Cutaneous melanoma (SKCM)           23.22      6.47       23.2       1050.94
      Head and neck squamous cell (HNSC)  27.6       7.69       27.5       1245.37

      • A total of 7.36 billion triples for 10 small tumors
      • Total Linked TCGA > 30 billion triples (largest dataset of LOD)
  12. Linking to Linked Open Data

      Source           Target      Class       #Links
      DNA27            HGNC        Gene        23181
      DNA27            Homologene  Gene        27654
      DNA27            HGNC        Gene        15171
      DNA450           Homologene  Gene        489643
      DNA450           OMIM        Gene        212284
      DNA27            HGNC        Chromosome  108662
      DNA27            OMIM        Chromosome  16039535
      Methylation      HGNC        Chromosome  97530
      Methylation      OMIM        Chromosome  14407269
      Gene Expression  HGNC        Chromosome  86052
      Gene Expression  OMIM        Chromosome  12535829

      • Links are generated using LIMES: http://aksw.org/Projects/LIMES.html
  13. Cancer Treatment using Linked TCGA
  14. Linked TCGA Use Cases
      1. Targeted cancer treatment
         – Whether a specific drug can be used to treat a tumor, using the genomic data of patients with the same tumor
      2. Mechanism-based treatment
         – Whether a combination of drugs can be applied to treat a specific tumor, using similar patients' data
      3. Survival outcome
         – Using a mathematical model to predict future signs, such as survival outcome, for a new patient
  15. Use Case 1,2 SPARQL Query

      SELECT ?patient ?mean
      WHERE
      {
        ?uri tcga:tumour_type "BRCA".
        ?uri tcga:bcr_patient_barcode ?patient.
        ?patient rdf:type tcga:expression_gene_results.
        ?patient tcga:gene_symbol "HER2","ER".
        ?patient tcga:scaled_estimate ?mean
      }
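Once the data is exposed through a SPARQL endpoint, a query like the one above can be sent over plain HTTP. The sketch below only builds the request and does not send it. The endpoint URL is a hypothetical placeholder, the helper name `build_request` is illustrative, and the `tcga:` namespace is an assumption inferred from the schema URIs in the RDFization slide; the slides do not state either the endpoint address or the prefix declarations.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint address; the slides do not give one.
ENDPOINT = "http://example.org/sparql"

# The tcga: namespace is an assumption based on the schema URIs in the deck.
QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX tcga: <http://tcga.deri.ie/schema/>
SELECT ?patient ?mean
WHERE
{
  ?uri tcga:tumour_type "BRCA".
  ?uri tcga:bcr_patient_barcode ?patient.
  ?patient rdf:type tcga:expression_gene_results.
  ?patient tcga:gene_symbol "HER2","ER".
  ?patient tcga:scaled_estimate ?mean
}
"""


def build_request(endpoint, query):
    """Build a SPARQL Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)
print(request.get_method(), request.full_url[:40])
```

Passing the request to `urllib.request.urlopen` would execute it against a live endpoint; that step is left out here because no endpoint address is given in the slides.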
  16. Use Case 1,2 Querying LOD DrugBank

      SELECT ?drugname
      WHERE
      {
        ?patient rdf:type tcga:expression_gene_results.
        ?patient tcga:gene_symbol ?targetname .
        ?patient tcga:scaled_estimate ?mean.
        FILTER (?mean > Threshold)
        ?drug drugbank:target ?target.
        ?drug drugbank:genericName ?drugname .
        ?target drugbank:synonym ?targetname .
        FILTER REGEX (?targetname, "HER2|estrogenreceptor|ERBB2", "i")
      }
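The FILTER in this query selects targets with a regular-expression alternation. One detail worth checking in such a pattern is the separator: a single `|` separates alternatives, whereas a doubled `||` leaves an empty alternative between them, and an empty alternative matches any string. The sketch below illustrates this with Python's `re` module, which is an assumption for demonstration purposes; SPARQL's REGEX follows XPath regular-expression syntax, but alternation behaves the same way for this pattern. The test names other than those in the query are made up.

```python
import re


def matches(pattern, name):
    """Case-insensitive search, mirroring the "i" flag used in the query."""
    return re.search(pattern, name, flags=re.IGNORECASE) is not None


SINGLE_BAR = "HER2|estrogenreceptor|ERBB2"
DOUBLE_BAR = "HER2||estrogenreceptor||ERBB2"

names = ["ERBB2", "her2", "EGFR"]

# With a single bar, only the intended target names pass the filter.
selected_single = [name for name in names if matches(SINGLE_BAR, name)]

# With a doubled bar, the empty alternative lets every name through.
selected_double = [name for name in names if matches(DOUBLE_BAR, name)]

print(selected_single)
print(selected_double)
```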
  17. Use Case 3 Query

      SELECT ?patient ?mean
      WHERE
      {
        ?uri tcga:tumour_type "BRCA".
        ?uri tcga:bcr_patient_barcode ?patient.
        ?patient rdf:type tcga:clinical.
        ?patient tcga:tumour_stage ?tumour_stage.
        ?patient tcga:age_at_initial_patalogical_diagnosis ?age.
        ?patient tcga:relevant_biomarker "BRCA1","CDKN2A", "CDH1".
        ?patient tcga:beta_value ?mean
      }
  18. Demo 1, Demo 2
  19. Everything is Public
      • TopFed: https://code.google.com/p/topfed/
      • Linked TCGA: http://tcga.deri.ie/
      saleem@informatik.uni-leipzig.de
      AKSW, University of Leipzig, Germany
  20. Thanks
      Muhammad Saleem
      saleem.muhammd@gmail.com