SlideShare a Scribd company logo
1 of 22
VCF and RDF
              Jeremy J Carroll
               April 3rd, 2013
Prepared for W3C Clinical Genomics Task Force
                  www.Syapse.com
                  jjc@Syapse.com



              Copyright 2013 Syapse Inc.
Contents
• Background
   – Business Background
   – Goal
   – VCF and RDF
• This specific work (exploration)
   – Stats
   – VCF example
   – Lots of example transforms: VCF and RDF
• Discussion: might the examples meet the
  business goals?
Background
• Syapse Discovery provides an end-to-end
  solution for […] laboratories deploying next-
  generation sequencing-based diagnostics, […]
  configurable semantic data structure enables
  users to bring omics data together with
  traditional medical information […]
• This work: initial exploration of VCF
  import, what useful information is in a VCF
  etc.
Success = Useful Advanced Query
        Over Genomic Information




Patient         Report           Tissue
Disease:        Gene:            Biopsy Site:
Breast Cancer   V-ERB-B2 AVIAN   Lobe of Liver
VCF, a Web Based Knowledge Format
• Variant Call Format
• Used for exchange from one system to
  another
• Used for exchange from one lab to another
• Used for exchange between universities and
  the wider community
RDF, a Web Based Knowledge Format
• Resource Description Framework
• Used for exchange from one system to
  another
• Used for exchange from one lab to another
• Used for exchange between universities and
  the wider community
VCF vs RDF
• VCF: efficient representation of huge
  quantities of data

• RDF: flexible representation with excellent
  interoperability and query
Materialized or Virtual?
• Materialized = Transform data:
  Extract Transform Load into RDF

• Virtual = Transform query:
  Runtime mapping

• Or some combination: transform VCF into
  some internal format, transform SPARQL
  queries into a query over internal format
This Work
• Initial mapping from VCF to RDF
• Materialized for simplicity

• Key questions:
  – Is this useful?
     • Would query answer interesting questions?
  – Scale?
     • Can we materialize? Should we go virtual?
Known limitations
• Lots not addressed ….
• Too many literals not enough URIs
• Slow

• Hoped to be fit for purpose, i.e. answering
  questions, not production code. (scale and
  utility, see previous slide)
Some Stats
• Working with Chromosome 7 from 1000G
    ftp://ftp-
    trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr7.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz



• 704,243 Variants, 1,092 samples
     – 689,034,445 ref calls, 79,998,911 variant calls
•   H/W: new mac book pro 2.3 GHz Intel Core i7 8 GB
•   Gunzip: 1m37s
•   PyVCF: 2h41m
•   VCF2TTL: 37h57m
•   serdi TTL 2 Ntriples: 13h57m
•   triples: 14,780,932,912 (1.2 trillion bytes)
VCF Overview
                                                       Metadata
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36                                                                      T-Box:
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
                                                                                                         Properties, Classe
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
                                                                                                         s, Domains, Rang
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">                                       es
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS           ID        REF        ALT      QUAL       FILTER INFO                                FORMAT        NA00001          NA00002
20         14370     rs6054257 G          A        29         PASS   NS=3;DP=14;AF=0.5;DB;H2             GT:GQ:DP:HQ   0|0:48:1:51,51   1|0:48:8:51,51
20         17330     .         T          A        3.0        q10    NS=3;DP=11;AF=0.017                 GT:GQ:DP:HQ   0|0:49:3:58,50   0|1:3:5:65,3
20         1110696 rs6040355 A            G,T      1e+03      PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB   GT:GQ:DP:HQ   1|2:21:6:23,27   2|1:2:0:18,2
20         1230237 .           T          .        47         PASS   NS=3;DP=13;AA=T                     GT:GQ:DP:HQ   0|0:54:7:56,60   0|0:48:4:51,51
20         1234567 microsat1 GTCT         G,GTACT .           PASS   NS=3;DP=9;AA=G                      GT:GQ:DP      ./.:35:4         0/2:17:2



                                                        Per
     Per Variant                                    Alternative                                                   Per
     Information                                       Allele                                                     Sample, Variant
                                                   Information                                                    Call
Key Classes
• Variant A position in the reference
  genome, where mutation may happen, and
  related information.
• Allele A possible sequence of bases at some
  variant, and related information.
• Sample A related set of Variant Calls
• Variant Call A selection of two Alleles (diploid)
  at a Variant (can be phased or unphased)
A Variant




#CHROM   POS     ID        REF   ALT   QUAL
20       1110696 rs6040355 A     G,T   1e+03
INFO about a Variant

                                Same node as on previous
                                slide, selecting different triples for
                                display.

                                DB property should probably be a
                                class (dbSNP membership); AF (allele
                                frequency) is not a property of the
                                variant but the allele.




#CHROM   POS     INFO
20       1110696 NS=2;DP=10;AF=0.333,0.667;AA=T;DB
INFO about Alleles


                                               Same node as on
                                               previous slide, selecting
                                               different triples for
                                               display.




#CHROM   POS     REF ALT  INFO
20       1110696 A G,TAF=0.333,0.667;AA=T;DB
Detail about a Variant Call


                                               Same variant node as on
                                               previous slide, HQ is
                                               mismodeled.




#CHROM    POS     REF ALT   FORMAT         NA00001
20        1110696 A G,T GT:GQ:DP:HQ   1|2:21:6:23,27
A Phased Diploid Genotype Call: rdf:Seq
                                         The | indicates
                                         phased, consistent
                                         ordering between VCF
                                         rows. rdf:Seq indicates
                                         order is significant.
                                         Heterozygous

                                         VCF concept of phase set
                                         not addressed.




#CHROM   POS     REF ALT  FORMAT         NA00001
20       1110696 A G,TGT:GQ:DP:HQ   1|2:21:6:23,27
Unphased Diploid Genotype Call: rdf:Bag
                                       The / indicates unphased
                                       genotype. rdf:Bag
                                       indicates order is not
                                       significant. rdf:Bag also
                                       used for homozygous
                                       genotypes. The 0/2
                                       indicates the use of the
                                       reference and the 2nd
                                       alternative.

                                       This is a different variant




#CHROM POS     REF ALT      FORMAT     NA00002
20     1234567 GTCT G,GTACT GT:GQ:DP   0/2:17:2
Genotype Likelihoods 1000G




#CHROM POS REF ALT FORMAT NA11829
7      16161 A G GT:DS:GL 0|1:1.300:-0.36,-0.43,-0.71
Notes: _1 and _2 hide one another, id_2397 is variant of id_23830
Discussion Points
• How much of this is actually useful?
  – Maybe just artifacts of the history of the call?
  – Maybe just joins that we could/should recompute on
    the fly
  – Maybe much of the data is fundamentally
    uninteresting

• Do we know the queries? If so can we represent
  in a more compact way and optimize for those
  queries?
Further observations
• Metadata is a mess, not actually reusable by
  machine, e.g. why file date but not sample date …
  provokes the desire to show rationale for file, but
  not the performance
• Extensibility of Tbox allows English text to
  introduce new syntax (e.g. enumerations, per alt
  allele and reference). This is not a good idea.
• The per genotype call info is only for unphased
  data, yet the call can be phased

More Related Content

Viewers also liked

Electronic Medical Records: From Clinical Decision Support to Precision Medicine
Electronic Medical Records: From Clinical Decision Support to Precision MedicineElectronic Medical Records: From Clinical Decision Support to Precision Medicine
Electronic Medical Records: From Clinical Decision Support to Precision Medicine
Kent State University
 
Potential partner pitch
Potential partner pitchPotential partner pitch
Potential partner pitch
Jan Bendtsen
 
Partnership presentation
Partnership presentationPartnership presentation
Partnership presentation
email2neev
 

Viewers also liked (10)

(HLS305) Transforming Cancer Treatment: Integrating Data to Deliver on the Pr...
(HLS305) Transforming Cancer Treatment: Integrating Data to Deliver on the Pr...(HLS305) Transforming Cancer Treatment: Integrating Data to Deliver on the Pr...
(HLS305) Transforming Cancer Treatment: Integrating Data to Deliver on the Pr...
 
Electronic Medical Records: From Clinical Decision Support to Precision Medicine
Electronic Medical Records: From Clinical Decision Support to Precision MedicineElectronic Medical Records: From Clinical Decision Support to Precision Medicine
Electronic Medical Records: From Clinical Decision Support to Precision Medicine
 
Precision Medicine World Conference 2017
Precision Medicine World Conference 2017Precision Medicine World Conference 2017
Precision Medicine World Conference 2017
 
Potential partner pitch
Potential partner pitchPotential partner pitch
Potential partner pitch
 
Partnership Proposal
Partnership ProposalPartnership Proposal
Partnership Proposal
 
Partnership presentation
Partnership presentationPartnership presentation
Partnership presentation
 
Dwolla Startup Pitch Deck
Dwolla Startup Pitch DeckDwolla Startup Pitch Deck
Dwolla Startup Pitch Deck
 
Linkedin Series B Pitch Deck
Linkedin Series B Pitch DeckLinkedin Series B Pitch Deck
Linkedin Series B Pitch Deck
 
Mixpanel - Our pitch deck that we used to raise $65M
Mixpanel - Our pitch deck that we used to raise $65MMixpanel - Our pitch deck that we used to raise $65M
Mixpanel - Our pitch deck that we used to raise $65M
 
The slide deck we used to raise half a million dollars
The slide deck we used to raise half a million dollarsThe slide deck we used to raise half a million dollars
The slide deck we used to raise half a million dollars
 

Similar to VCF and RDF

Analytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsAnalytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table Functions
DataWorks Summit
 

Similar to VCF and RDF (20)

Compilers Are Databases
Compilers Are DatabasesCompilers Are Databases
Compilers Are Databases
 
The Little Register Allocator
The Little Register AllocatorThe Little Register Allocator
The Little Register Allocator
 
Elastic Search
Elastic SearchElastic Search
Elastic Search
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
Rnaseq forgenefinding
Rnaseq forgenefindingRnaseq forgenefinding
Rnaseq forgenefinding
 
QuadIron An open source library for number theoretic transform-based erasure ...
QuadIron An open source library for number theoretic transform-based erasure ...QuadIron An open source library for number theoretic transform-based erasure ...
QuadIron An open source library for number theoretic transform-based erasure ...
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plot
 
RNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGSRNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGS
 
8086 instruction set
8086 instruction set8086 instruction set
8086 instruction set
 
Analytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsAnalytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table Functions
 
Incremental pattern matching in the VIATRA2 model transformation system
Incremental pattern matching in the VIATRA2 model transformation systemIncremental pattern matching in the VIATRA2 model transformation system
Incremental pattern matching in the VIATRA2 model transformation system
 
Oracle plsql and d2 k interview questions
Oracle plsql and d2 k interview questionsOracle plsql and d2 k interview questions
Oracle plsql and d2 k interview questions
 
Compiler unit 5
Compiler  unit 5Compiler  unit 5
Compiler unit 5
 
Deep Packet Inspection with Regular Expression Matching
Deep Packet Inspection with Regular Expression MatchingDeep Packet Inspection with Regular Expression Matching
Deep Packet Inspection with Regular Expression Matching
 
Perl at SkyCon'12
Perl at SkyCon'12Perl at SkyCon'12
Perl at SkyCon'12
 
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
 
XtremeDistil: Multi-stage Distillation for Massive Multilingual Models
XtremeDistil: Multi-stage Distillation for Massive Multilingual ModelsXtremeDistil: Multi-stage Distillation for Massive Multilingual Models
XtremeDistil: Multi-stage Distillation for Massive Multilingual Models
 
Sept2016 sv nist_intro
Sept2016 sv nist_introSept2016 sv nist_intro
Sept2016 sv nist_intro
 
Dpdk applications
Dpdk applicationsDpdk applications
Dpdk applications
 
Data Presentations Cassandra Sigmod
Data  Presentations  Cassandra SigmodData  Presentations  Cassandra Sigmod
Data Presentations Cassandra Sigmod
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Recently uploaded (20)

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 

VCF and RDF

  • 1. VCF and RDF Jeremy J Carroll April 3rd, 2013 Prepared for W3C Clinical Genomics Task Force www.Syapse.com jjc@Syapse.com Copyright 2013 Syapse Inc.
  • 2. Contents • Background – Business Background – Goal – VCF and RDF • This specific work (exploration) – Stats – VCF example – Lots of example transforms: VCF and RDF • Discussion: might the examples meet the business goals?
  • 3. Background • Syapse Discovery provides an end-to-end solution for […] laboratories deploying next- generation sequencing-based diagnostics, […] configurable semantic data structure enables users to bring omics data together with traditional medical information […] • This work: initial exploration of VCF import, what useful information is in a VCF etc.
  • 4. Success = Useful Advanced Query Over Genomic Information Patient Report Tissue Disease: Gene: Biopsy Site: Breast Cancer V-ERB-B2 AVIAN Lobe of Liver
  • 5. VCF, a Web Based Knowledge Format • Variant Call Format • Used for exchange from one system to another • Used for exchange from one lab to another • Used for exchange between universities and the wider community
  • 6. RDF, a Web Based Knowledge Format • Resource Description Framework • Used for exchange from one system to another • Used for exchange from one lab to another • Used for exchange between universities and the wider community
  • 7. VCF vs RDF • VCF: efficient representation of huge quantities of data • RDF: flexible representation with excellent interoperability and query
  • 8. Materialized or Virtual? • Materialized = Transform data: Extract Transform Load into RDF • Virtual = Transform query: Runtime mapping • Or some combination: transform VCF into some internal format, transform SPARQL queries into a query over internal format
  • 9. This Work • Initial mapping from VCF to RDF • Materialized for simplicity • Key questions: – Is this useful? • Would query answer interesting questions? – Scale? • Can we materialize? Should we go virtual?
  • 10. Known limitations • Lots not addressed …. • Too many literals not enough URIs • Slow • Hoped to be fit for purpose, i.e. answering questions, not production code. (scale and utility, see previous slide)
  • 11. Some Stats • Working with Chromosome 7 from 1000G ftp://ftp- trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr7.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz • 704,243 Variants, 1,092 samples – 689,034,445 ref calls, 79,998,911 variant calls • H/W: new mac book pro 2.3 GHz Intel Core i7 8 GB • Gunzip: 1m37s • PyVCF: 2h41m • VCF2TTL: 37h57m • serdi TTL 2 Ntriples: 13h57m • triples: 14,780,932,912 (1.2 trillion bytes)
  • 12. VCF Overview Metadata ##fileformat=VCFv4.0 ##fileDate=20090805 ##source=myImputationProgramV3.1 ##reference=1000GenomesPilot-NCBI36 T-Box: ##phasing=partial ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> Properties, Classe ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele"> s, Domains, Rang ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129"> ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership"> es ##FILTER=<ID=q10,Description="Quality below 10"> ##FILTER=<ID=s50,Description="Less than 50% of samples have data"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 20 17330 . T A 3.0 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 20 1110696 rs6040355 A G,T 1e+03 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 20 1234567 microsat1 GTCT G,GTACT . PASS NS=3;DP=9;AA=G GT:GQ:DP ./.:35:4 0/2:17:2 Per Per Variant Alternative Per Information Allele Sample, Variant Information Call
  • 13. Key Classes • Variant A position in the reference genome, where mutation may happen, and related information. • Allele A possible sequence of bases at some variant, and related information. • Sample A related set of Variant Calls • Variant Call A selection of two Alleles (diploid) at a Variant (can be phased or unphased)
  • 14. A Variant #CHROM POS ID REF ALT QUAL 20 1110696 rs6040355 A G,T 1e+03
  • 15. INFO about a Variant Same node as on previous slide, selecting different triples for display. DB property should probably be a class (dbSNP membership); AF (allele frequency) is not a property of the variant but the allele. #CHROM POS INFO 20 1110696 NS=2;DP=10;AF=0.333,0.667;AA=T;DB
  • 16. INFO about Alleles Same node as on previous slide, selecting different triples for display. #CHROM POS REF ALT INFO 20 1110696 A G,TAF=0.333,0.667;AA=T;DB
  • 17. Detail about a Variant Call Same variant node as on previous slide, HQ is mismodeled. #CHROM POS REF ALT FORMAT NA00001 20 1110696 A G,T GT:GQ:DP:HQ 1|2:21:6:23,27
  • 18. A Phased Diploid Genotype Call: rdf:Seq The | indicates phased, consistent ordering between VCF rows. rdf:Seq indicates order is significant. Heterozygous VCF concept of phase set not addressed. #CHROM POS REF ALT FORMAT NA00001 20 1110696 A G,TGT:GQ:DP:HQ 1|2:21:6:23,27
  • 19. Unphased Diploid Genotype Call: rdf:Bag The / indicates unphased genotype. rdf:Bag indicates order is not significant. rdf:Bag also used for homozygous genotypes. The 0/2 indicates the use of the reference and the 2nd alternative. This is a different variant #CHROM POS REF ALT FORMAT NA00002 20 1234567 GTCT G,GTACT GT:GQ:DP 0/2:17:2
  • 20. Genotype Likelihoods 1000G #CHROM POS REF ALT FORMAT NA11829 7 16161 A G GT:DS:GL 0|1:1.300:-0.36,-0.43,-0.71 Notes: _1 and _2 hide one another, id_2397 is variant of id_23830
  • 21. Discussion Points • How much of this is actually useful? – Maybe just artifacts of the history of the call? – Maybe just joins that we could/should recompute on the fly – Maybe much of the data is fundamentally uninteresting • Do we know the queries? If so can we represent in a more compact way and optimize for those queries?
  • 22. Further observations • Metadata is a mess, not actually reusable by machine, e.g. why file date but not sample date … provokes the desire to show rationale for file, but not the performance • Extensibility of Tbox allows English text to introduce new syntax (e.g. enumerations, per alt allele and reference). This is not a good idea. • The per genotype call info is only for unphased data, yet the call can be phased

Editor's Notes

  1. VCF2TTLreal 2278m26.935suser 2269m23.479ssys 8m2.282stime serdi -i turtle -o ntriples /tmp/pipe | time wc 14780932912 59123731783 1,211334,403999 136699.18 real 4809.46 user 628.36 sysreal 2278m19.206suser 807m12.920ssys 29m30.467s