SlideShare a Scribd company logo
VCF and RDF
              Jeremy J Carroll
               April 3rd, 2013
Prepared for W3C Clinical Genomics Task Force
                  www.Syapse.com
                  jjc@Syapse.com



              Copyright 2013 Syapse Inc.
Contents
• Background
   – Business Background
   – Goal
   – VCF and RDF
• This specific work (exploration)
   – Stats
   – VCF example
   – Lots of example transforms: VCF and RDF
• Discussion: might the examples meet the
  business goals?
Background
• Syapse Discovery provides an end-to-end
  solution for […] laboratories deploying next-
  generation sequencing-based diagnostics, […]
  configurable semantic data structure enables
  users to bring omics data together with
  traditional medical information […]
• This work: initial exploration of VCF
  import, what useful information is in a VCF
  etc.
Success = Useful Advanced Query
        Over Genomic Information




Patient         Report           Tissue
Disease:        Gene:            Biopsy Site:
Breast Cancer   V-ERB-B2 AVIAN   Lobe of Liver
VCF, a Web Based Knowledge Format
• Variant Call Format
• Used for exchange from one system to
  another
• Used for exchange from one lab to another
• Used for exchange between universities and
  the wider community
RDF, a Web Based Knowledge Format
• Resource Description Framework
• Used for exchange from one system to
  another
• Used for exchange from one lab to another
• Used for exchange between universities and
  the wider community
VCF vs RDF
• VCF: efficient representation of huge
  quantities of data

• RDF: flexible representation with excellent
  interoperability and query
Materialized or Virtual?
• Materialized = Transform data:
  Extract Transform Load into RDF

• Virtual = Transform query:
  Runtime mapping

• Or some combination: transform VCF into
  some internal format, transform SPARQL
  queries into a query over internal format
This Work
• Initial mapping from VCF to RDF
• Materialized for simplicity

• Key questions:
  – Is this useful?
     • Would query answer interesting questions?
  – Scale?
     • Can we materialize? Should we go virtual?
Known limitations
• Lots not addressed ….
• Too many literals not enough URIs
• Slow

• Hoped to be fit for purpose, i.e. answering
  questions, not production code. (scale and
  utility, see previous slide)
Some Stats
• Working with Chromosome 7 from 1000G
    ftp://ftp-
    trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr7.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz



• 704,243 Variants, 1,092 samples
     – 689,034,445 ref calls, 79,998,911 variant calls
•   H/W: new mac book pro 2.3 GHz Intel Core i7 8 GB
•   Gunzip: 1m37s
•   PyVCF: 2h41m
•   VCF2TTL: 37h57m
•   serdi TTL 2 Ntriples: 13h57m
•   triples: 14,780,932,912 (1.2 trillion bytes)
VCF Overview
                                                       Metadata
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36                                                                      T-Box:
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
                                                                                                         Properties, Classe
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
                                                                                                         s, Domains, Rang
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">                                       es
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS           ID        REF        ALT      QUAL       FILTER INFO                                FORMAT        NA00001          NA00002
20         14370     rs6054257 G          A        29         PASS   NS=3;DP=14;AF=0.5;DB;H2             GT:GQ:DP:HQ   0|0:48:1:51,51   1|0:48:8:51,51
20         17330     .         T          A        3.0        q10    NS=3;DP=11;AF=0.017                 GT:GQ:DP:HQ   0|0:49:3:58,50   0|1:3:5:65,3
20         1110696 rs6040355 A            G,T      1e+03      PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB   GT:GQ:DP:HQ   1|2:21:6:23,27   2|1:2:0:18,2
20         1230237 .           T          .        47         PASS   NS=3;DP=13;AA=T                     GT:GQ:DP:HQ   0|0:54:7:56,60   0|0:48:4:51,51
20         1234567 microsat1 GTCT         G,GTACT .           PASS   NS=3;DP=9;AA=G                      GT:GQ:DP      ./.:35:4         0/2:17:2



                                                        Per
     Per Variant                                    Alternative                                                   Per
     Information                                       Allele                                                     Sample, Variant
                                                   Information                                                    Call
Key Classes
• Variant A position in the reference
  genome, where mutation may happen, and
  related information.
• Allele A possible sequence of bases at some
  variant, and related information.
• Sample A related set of Variant Calls
• Variant Call A selection of two Alleles (diploid)
  at a Variant (can be phased or unphased)
A Variant




#CHROM   POS     ID        REF   ALT   QUAL
20       1110696 rs6040355 A     G,T   1e+03
INFO about a Variant

                                Same node as on previous
                                slide, selecting different triples for
                                display.

                                DB property should probably be a
                                class (dbSNP membership); AF (allele
                                frequency) is not a property of the
                                variant but the allele.




#CHROM   POS     INFO
20       1110696 NS=2;DP=10;AF=0.333,0.667;AA=T;DB
INFO about Alleles


                                               Same node as on
                                               previous slide, selecting
                                               different triples for
                                               display.




#CHROM   POS     REF ALT  INFO
20       1110696 A G,TAF=0.333,0.667;AA=T;DB
Detail about a Variant Call


                                               Same variant node as on
                                               previous slide, HQ is
                                               mismodeled.




#CHROM    POS     REF ALT   FORMAT         NA00001
20        1110696 A G,T GT:GQ:DP:HQ   1|2:21:6:23,27
A Phased Diploid Genotype Call: rdf:Seq
                                         The | indicates
                                         phased, consistent
                                         ordering between VCF
                                         rows. rdf:Seq indicates
                                         order is significant.
                                         Heterozygous

                                         VCF concept of phase set
                                         not addressed.




#CHROM   POS     REF ALT  FORMAT         NA00001
20       1110696 A G,TGT:GQ:DP:HQ   1|2:21:6:23,27
Unphased Diploid Genotype Call: rdf:Bag
                                       The / indicates unphased
                                       genotype. rdf:Bag
                                       indicates order is not
                                       significant. rdf:Bag also
                                       used for homozygous
                                       genotypes. The 0/2
                                       indicates the use of the
                                       reference and the 2nd
                                       alternative.

                                       This is a different variant




#CHROM POS     REF ALT      FORMAT     NA00002
20     1234567 GTCT G,GTACT GT:GQ:DP   0/2:17:2
Genotype Likelihoods 1000G




#CHROM POS REF ALT FORMAT NA11829
7      16161 A G GT:DS:GL 0|1:1.300:-0.36,-0.43,-0.71
Notes: _1 and _2 hide one another, id_2397 is variant of id_23830
Discussion Points
• How much of this is actually useful?
  – Maybe just artifacts of the history of the call?
  – Maybe just joins that we could/should recompute on
    the fly
  – Maybe much of the data is fundamentally
    uninteresting

• Do we know the queries? If so can we represent
  in a more compact way and optimize for those
  queries?
Further observations
• Metadata is a mess, not actually reusable by
  machine, e.g. why file date but not sample date …
  provokes the desire to show rationale for file, but
  not the performance
• Extensibility of Tbox allows English text to
  introduce new syntax (e.g. enumerations, per alt
  allele and reference). This is not a good idea.
• The per genotype call info is only for unphased
  data, yet the call can be phased

More Related Content

Viewers also liked

(HLS305) Transforming Cancer Treatment: Integrating Data to Deliver on the Pr...
(HLS305) Transforming Cancer Treatment: Integrating Data to Deliver on the Pr...(HLS305) Transforming Cancer Treatment: Integrating Data to Deliver on the Pr...
(HLS305) Transforming Cancer Treatment: Integrating Data to Deliver on the Pr...
Amazon Web Services
 
Electronic Medical Records: From Clinical Decision Support to Precision Medicine
Electronic Medical Records: From Clinical Decision Support to Precision MedicineElectronic Medical Records: From Clinical Decision Support to Precision Medicine
Electronic Medical Records: From Clinical Decision Support to Precision Medicine
Kent State University
 
Precision Medicine World Conference 2017
Precision Medicine World Conference 2017Precision Medicine World Conference 2017
Precision Medicine World Conference 2017
University of California, San Francisco
 
Potential partner pitch
Potential partner pitchPotential partner pitch
Potential partner pitch
Jan Bendtsen
 
Partnership Proposal
Partnership ProposalPartnership Proposal
Partnership Proposal
Arvind Jha
 
Partnership presentation
Partnership presentationPartnership presentation
Partnership presentation
email2neev
 
Dwolla Startup Pitch Deck
Dwolla Startup Pitch DeckDwolla Startup Pitch Deck
Dwolla Startup Pitch Deck
Joseph Hsieh
 
Linkedin Series B Pitch Deck
Linkedin Series B Pitch DeckLinkedin Series B Pitch Deck
Linkedin Series B Pitch Deck
Joseph Hsieh
 
Mixpanel - Our pitch deck that we used to raise $65M
Mixpanel - Our pitch deck that we used to raise $65MMixpanel - Our pitch deck that we used to raise $65M
Mixpanel - Our pitch deck that we used to raise $65M
Suhail Doshi
 
The slide deck we used to raise half a million dollars
The slide deck we used to raise half a million dollarsThe slide deck we used to raise half a million dollars
The slide deck we used to raise half a million dollars
Buffer
 

Viewers also liked (10)

(HLS305) Transforming Cancer Treatment: Integrating Data to Deliver on the Pr...
(HLS305) Transforming Cancer Treatment: Integrating Data to Deliver on the Pr...(HLS305) Transforming Cancer Treatment: Integrating Data to Deliver on the Pr...
(HLS305) Transforming Cancer Treatment: Integrating Data to Deliver on the Pr...
 
Electronic Medical Records: From Clinical Decision Support to Precision Medicine
Electronic Medical Records: From Clinical Decision Support to Precision MedicineElectronic Medical Records: From Clinical Decision Support to Precision Medicine
Electronic Medical Records: From Clinical Decision Support to Precision Medicine
 
Precision Medicine World Conference 2017
Precision Medicine World Conference 2017Precision Medicine World Conference 2017
Precision Medicine World Conference 2017
 
Potential partner pitch
Potential partner pitchPotential partner pitch
Potential partner pitch
 
Partnership Proposal
Partnership ProposalPartnership Proposal
Partnership Proposal
 
Partnership presentation
Partnership presentationPartnership presentation
Partnership presentation
 
Dwolla Startup Pitch Deck
Dwolla Startup Pitch DeckDwolla Startup Pitch Deck
Dwolla Startup Pitch Deck
 
Linkedin Series B Pitch Deck
Linkedin Series B Pitch DeckLinkedin Series B Pitch Deck
Linkedin Series B Pitch Deck
 
Mixpanel - Our pitch deck that we used to raise $65M
Mixpanel - Our pitch deck that we used to raise $65MMixpanel - Our pitch deck that we used to raise $65M
Mixpanel - Our pitch deck that we used to raise $65M
 
The slide deck we used to raise half a million dollars
The slide deck we used to raise half a million dollarsThe slide deck we used to raise half a million dollars
The slide deck we used to raise half a million dollars
 

Similar to VCF and RDF

Compilers Are Databases
Compilers Are DatabasesCompilers Are Databases
Compilers Are Databases
Martin Odersky
 
The Little Register Allocator
The Little Register AllocatorThe Little Register Allocator
The Little Register Allocator
Ian Wang
 
Elastic Search
Elastic SearchElastic Search
Elastic Search
Lukas Vlcek
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
Li Shen
 
Rnaseq forgenefinding
Rnaseq forgenefindingRnaseq forgenefinding
Rnaseq forgenefinding
Sucheta Tripathy
 
QuadIron An open source library for number theoretic transform-based erasure ...
QuadIron An open source library for number theoretic transform-based erasure ...QuadIron An open source library for number theoretic transform-based erasure ...
QuadIron An open source library for number theoretic transform-based erasure ...
Scality
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plot
Li Shen
 
RNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGSRNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGS
HAMNAHAMNA8
 
8086 instruction set
8086 instruction set8086 instruction set
8086 instruction set
Sazzad Hossain
 
Analytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsAnalytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table Functions
DataWorks Summit
 
Incremental pattern matching in the VIATRA2 model transformation system
Incremental pattern matching in the VIATRA2 model transformation systemIncremental pattern matching in the VIATRA2 model transformation system
Incremental pattern matching in the VIATRA2 model transformation system
Istvan Rath
 
Oracle plsql and d2 k interview questions
Oracle plsql and d2 k interview questionsOracle plsql and d2 k interview questions
Oracle plsql and d2 k interview questions
Arunkumar Gurav
 
Compiler unit 5
Compiler  unit 5Compiler  unit 5
Compiler unit 5
BBDITM LUCKNOW
 
Deep Packet Inspection with Regular Expression Matching
Deep Packet Inspection with Regular Expression MatchingDeep Packet Inspection with Regular Expression Matching
Deep Packet Inspection with Regular Expression Matching
Editor IJCATR
 
Perl at SkyCon'12
Perl at SkyCon'12Perl at SkyCon'12
Perl at SkyCon'12
Tim Bunce
 
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
QIAGEN
 
XtremeDistil: Multi-stage Distillation for Massive Multilingual Models
XtremeDistil: Multi-stage Distillation for Massive Multilingual ModelsXtremeDistil: Multi-stage Distillation for Massive Multilingual Models
XtremeDistil: Multi-stage Distillation for Massive Multilingual Models
Subhabrata Mukherjee
 
Sept2016 sv nist_intro
Sept2016 sv nist_introSept2016 sv nist_intro
Sept2016 sv nist_intro
GenomeInABottle
 
Dpdk applications
Dpdk applicationsDpdk applications
Dpdk applications
Vipin Varghese
 
Data Presentations Cassandra Sigmod
Data  Presentations  Cassandra SigmodData  Presentations  Cassandra Sigmod
Data Presentations Cassandra Sigmod
Jeff Hammerbacher
 

Similar to VCF and RDF (20)

Compilers Are Databases
Compilers Are DatabasesCompilers Are Databases
Compilers Are Databases
 
The Little Register Allocator
The Little Register AllocatorThe Little Register Allocator
The Little Register Allocator
 
Elastic Search
Elastic SearchElastic Search
Elastic Search
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
Rnaseq forgenefinding
Rnaseq forgenefindingRnaseq forgenefinding
Rnaseq forgenefinding
 
QuadIron An open source library for number theoretic transform-based erasure ...
QuadIron An open source library for number theoretic transform-based erasure ...QuadIron An open source library for number theoretic transform-based erasure ...
QuadIron An open source library for number theoretic transform-based erasure ...
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plot
 
RNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGSRNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGS
 
8086 instruction set
8086 instruction set8086 instruction set
8086 instruction set
 
Analytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsAnalytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table Functions
 
Incremental pattern matching in the VIATRA2 model transformation system
Incremental pattern matching in the VIATRA2 model transformation systemIncremental pattern matching in the VIATRA2 model transformation system
Incremental pattern matching in the VIATRA2 model transformation system
 
Oracle plsql and d2 k interview questions
Oracle plsql and d2 k interview questionsOracle plsql and d2 k interview questions
Oracle plsql and d2 k interview questions
 
Compiler unit 5
Compiler  unit 5Compiler  unit 5
Compiler unit 5
 
Deep Packet Inspection with Regular Expression Matching
Deep Packet Inspection with Regular Expression MatchingDeep Packet Inspection with Regular Expression Matching
Deep Packet Inspection with Regular Expression Matching
 
Perl at SkyCon'12
Perl at SkyCon'12Perl at SkyCon'12
Perl at SkyCon'12
 
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
 
XtremeDistil: Multi-stage Distillation for Massive Multilingual Models
XtremeDistil: Multi-stage Distillation for Massive Multilingual ModelsXtremeDistil: Multi-stage Distillation for Massive Multilingual Models
XtremeDistil: Multi-stage Distillation for Massive Multilingual Models
 
Sept2016 sv nist_intro
Sept2016 sv nist_introSept2016 sv nist_intro
Sept2016 sv nist_intro
 
Dpdk applications
Dpdk applicationsDpdk applications
Dpdk applications
 
Data Presentations Cassandra Sigmod
Data  Presentations  Cassandra SigmodData  Presentations  Cassandra Sigmod
Data Presentations Cassandra Sigmod
 

Recently uploaded

GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 

Recently uploaded (20)

GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 

VCF and RDF

  • 1. VCF and RDF Jeremy J Carroll April 3rd, 2013 Prepared for W3C Clinical Genomics Task Force www.Syapse.com jjc@Syapse.com Copyright 2013 Syapse Inc.
  • 2. Contents • Background – Business Background – Goal – VCF and RDF • This specific work (exploration) – Stats – VCF example – Lots of example transforms: VCF and RDF • Discussion: might the examples meet the business goals?
  • 3. Background • Syapse Discovery provides an end-to-end solution for […] laboratories deploying next- generation sequencing-based diagnostics, […] configurable semantic data structure enables users to bring omics data together with traditional medical information […] • This work: initial exploration of VCF import, what useful information is in a VCF etc.
  • 4. Success = Useful Advanced Query Over Genomic Information Patient Report Tissue Disease: Gene: Biopsy Site: Breast Cancer V-ERB-B2 AVIAN Lobe of Liver
  • 5. VCF, a Web Based Knowledge Format • Variant Call Format • Used for exchange from one system to another • Used for exchange from one lab to another • Used for exchange between universities and the wider community
  • 6. RDF, a Web Based Knowledge Format • Resource Description Framework • Used for exchange from one system to another • Used for exchange from one lab to another • Used for exchange between universities and the wider community
  • 7. VCF vs RDF • VCF: efficient representation of huge quantities of data • RDF: flexible representation with excellent interoperability and query
  • 8. Materialized or Virtual? • Materialized = Transform data: Extract Transform Load into RDF • Virtual = Transform query: Runtime mapping • Or some combination: transform VCF into some internal format, transform SPARQL queries into a query over internal format
  • 9. This Work • Initial mapping from VCF to RDF • Materialized for simplicity • Key questions: – Is this useful? • Would query answer interesting questions? – Scale? • Can we materialize? Should we go virtual?
  • 10. Known limitations • Lots not addressed …. • Too many literals not enough URIs • Slow • Hoped to be fit for purpose, i.e. answering questions, not production code. (scale and utility, see previous slide)
  • 11. Some Stats • Working with Chromosome 7 from 1000G ftp://ftp- trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr7.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz • 704,243 Variants, 1,092 samples – 689,034,445 ref calls, 79,998,911 variant calls • H/W: new mac book pro 2.3 GHz Intel Core i7 8 GB • Gunzip: 1m37s • PyVCF: 2h41m • VCF2TTL: 37h57m • serdi TTL 2 Ntriples: 13h57m • triples: 14,780,932,912 (1.2 trillion bytes)
  • 12. VCF Overview Metadata ##fileformat=VCFv4.0 ##fileDate=20090805 ##source=myImputationProgramV3.1 ##reference=1000GenomesPilot-NCBI36 T-Box: ##phasing=partial ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> Properties, Classe ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele"> s, Domains, Rang ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129"> ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership"> es ##FILTER=<ID=q10,Description="Quality below 10"> ##FILTER=<ID=s50,Description="Less than 50% of samples have data"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 20 17330 . T A 3.0 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 20 1110696 rs6040355 A G,T 1e+03 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 20 1234567 microsat1 GTCT G,GTACT . PASS NS=3;DP=9;AA=G GT:GQ:DP ./.:35:4 0/2:17:2 Per Per Variant Alternative Per Information Allele Sample, Variant Information Call
  • 13. Key Classes • Variant A position in the reference genome, where mutation may happen, and related information. • Allele A possible sequence of bases at some variant, and related information. • Sample A related set of Variant Calls • Variant Call A selection of two Alleles (diploid) at a Variant (can be phased or unphased)
  • 14. A Variant #CHROM POS ID REF ALT QUAL 20 1110696 rs6040355 A G,T 1e+03
  • 15. INFO about a Variant Same node as on previous slide, selecting different triples for display. DB property should probably be a class (dbSNP membership); AF (allele frequency) is not a property of the variant but the allele. #CHROM POS INFO 20 1110696 NS=2;DP=10;AF=0.333,0.667;AA=T;DB
  • 16. INFO about Alleles Same node as on previous slide, selecting different triples for display. #CHROM POS REF ALT INFO 20 1110696 A G,TAF=0.333,0.667;AA=T;DB
  • 17. Detail about a Variant Call Same variant node as on previous slide, HQ is mismodeled. #CHROM POS REF ALT FORMAT NA00001 20 1110696 A G,T GT:GQ:DP:HQ 1|2:21:6:23,27
  • 18. A Phased Diploid Genotype Call: rdf:Seq The | indicates phased, consistent ordering between VCF rows. rdf:Seq indicates order is significant. Heterozygous VCF concept of phase set not addressed. #CHROM POS REF ALT FORMAT NA00001 20 1110696 A G,TGT:GQ:DP:HQ 1|2:21:6:23,27
  • 19. Unphased Diploid Genotype Call: rdf:Bag The / indicates unphased genotype. rdf:Bag indicates order is not significant. rdf:Bag also used for homozygous genotypes. The 0/2 indicates the use of the reference and the 2nd alternative. This is a different variant #CHROM POS REF ALT FORMAT NA00002 20 1234567 GTCT G,GTACT GT:GQ:DP 0/2:17:2
  • 20. Genotype Likelihoods 1000G #CHROM POS REF ALT FORMAT NA11829 7 16161 A G GT:DS:GL 0|1:1.300:-0.36,-0.43,-0.71 Notes: _1 and _2 hide one another, id_2397 is variant of id_23830
  • 21. Discussion Points • How much of this is actually useful? – Maybe just artifacts of the history of the call? – Maybe just joins that we could/should recompute on the fly – Maybe much of the data is fundamentally uninteresting • Do we know the queries? If so can we represent in a more compact way and optimize for those queries?
  • 22. Further observations • Metadata is a mess, not actually reusable by machine, e.g. why file date but not sample date … provokes the desire to show rationale for file, but not the performance • Extensibility of Tbox allows English text to introduce new syntax (e.g. enumerations, per alt allele and reference). This is not a good idea. • The per genotype call info is only for unphased data, yet the call can be phased

Editor's Notes

  1. VCF2TTLreal 2278m26.935suser 2269m23.479ssys 8m2.282stime serdi -i turtle -o ntriples /tmp/pipe | time wc 14780932912 59123731783 1,211334,403999 136699.18 real 4809.46 user 628.36 sysreal 2278m19.206suser 807m12.920ssys 29m30.467s