An Open-Source Format for Personal Genome Representation
Enabling Fast Queries and Analyses of Human Genomes
Compact Genome Format
Sally Guthrie, Research Scientist, Curoverse
(sguthrie@curoverse.com)
ACKNOWLEDGEMENTS
Alexander Wait Zaranek, Chief Scientist, Curoverse
Abram Connelly, Research Scientist, Curoverse
CURRENT USES OF GENOMIC DATA
Patient Care
• Analyze one genome for rare and pathogenic variants
Population Analysis
• Examine a population for rare variants
• Separate a population into subgroups
• Case/Control Studies and GWA Studies
• Can require merging multiple data sets
• Can require using supervised and unsupervised machine learning
VARIANT CALL FORMAT (VCF) SNAPSHOT
Advantages
• Very flexible
• Easily annotated with canonical or in-house annotation pipelines
• Can be small (with compression)
Disadvantages
• Difficult to merge VCFs between studies
• Can be slow to query and to run machine learning algorithms on (requires pre-processing)
WHAT IS COMPACT GENOME FORMAT (CGF)?
Compact Genome Format (CGF) is a compressed representation of a genomic sequence
Allows analysis to be run directly on the compressed data
Represents a sequence using a series of vectors
• Each position in a vector is termed a “tile”
• The value at each position points to a sequence in a “Tile Library,” a pan-genome
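The vector-plus-library idea above can be sketched in a few lines. This is a minimal illustration, not the CGF encoding itself: the toy library, its sequences, and the `decode` helper are all hypothetical, and the sketch assumes tiles simply concatenate (it ignores the tags that delimit real tiles).

```python
# Hypothetical toy tile library: tile position -> known variant sequences.
# Index 0 at each position stands for the reference variant (an assumption
# made for this sketch; sequences are invented for illustration).
TILE_LIBRARY = {
    0: ["ACGTAC", "ACGAAC"],
    1: ["TTGCA", "TTGGA", "TTCCA"],
    2: ["GGATC"],
}


def decode(tile_vector):
    """Return the sequence spelled out by a vector of tile-variant indices."""
    return "".join(
        TILE_LIBRARY[position][variant]
        for position, variant in enumerate(tile_vector)
    )


genome_a = [0, 0, 0]  # reference variant at every tile position
genome_b = [1, 2, 0]  # non-reference variants at positions 0 and 1

print(decode(genome_a))
print(decode(genome_b))
```

Two genomes that share a tile variant store the same small integer at that position, so comparing genomes reduces to comparing vectors.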
GENERATING THE REFERENCE TILE LIBRARY
Human Reference Genome
(with tag sets highlighted)
Tag Set: …
1. Choose a tag set of unique 24-base sequences
2. Map tag set to a reference genome
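As a rough sketch of the two steps above, the function below keeps only candidate tags that occur exactly once in a reference string and records where they map. The name `map_tags` and the example data are hypothetical, and the example uses 4-base tags purely for readability where the slides use 24-base tags.

```python
def map_tags(reference, tags):
    """Return {tag: start offset} for tags occurring exactly once in reference."""
    positions = {}
    for tag in tags:
        first = reference.find(tag)
        if first == -1:
            continue  # tag does not occur in the reference
        if reference.find(tag, first + 1) != -1:
            continue  # tag occurs more than once, so it is not unique
        positions[tag] = first
    return positions


# Toy reference and 4-base tags (illustrative only).
reference = "AAAACCGTTTTGGACCCCTAG"
print(map_tags(reference, ["AAAA", "TTTT", "CCCC", "GGGG"]))
```

A plain substring search is enough for a toy string; mapping tags to a full human reference would call for an indexed aligner, which this sketch does not attempt.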
GENERATING THE REFERENCE TILE LIBRARY
1. Save the sequence between each tag pair to the tile library
2. Give each of these reference sequences the value 0
Tile Position Id
00.0000
00.0001
00.0002
…
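The slicing step can be sketched as follows. Two things here are assumptions rather than facts from the slides: that a tile runs from the start of one tag to the end of the next, and that the tile position IDs shown (for example 00.002b through 00.0030 on later slides) are a two-digit prefix plus a four-digit hexadecimal counter. `build_reference_library` and `tile_id` are hypothetical names.

```python
def build_reference_library(reference, tag_starts, tag_len):
    """Cut the reference at consecutive tag positions; each slice is variant 0."""
    starts = sorted(tag_starts)
    library = {}
    for position, (begin, next_begin) in enumerate(zip(starts, starts[1:])):
        # Assumption: a tile spans from the start of one tag
        # through the end of the following tag.
        tile_sequence = reference[begin:next_begin + tag_len]
        library[position] = [tile_sequence]  # list index 0 == reference value 0
    return library


def tile_id(step, prefix=0):
    """Format an ID like '00.002b' (assumed layout: hex prefix, dot, hex step)."""
    return f"{prefix:02x}.{step:04x}"


reference = "AAAACCGTTTTGGACCCCTAG"
library = build_reference_library(reference, [0, 7, 14], tag_len=4)
for position, variants in library.items():
    print(tile_id(position), variants[0])
```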
EXTENDING THE TILE LIBRARY
…010020…
…011031…
Tile Library
Tile Position Id
00.002b
00.002c
00.002d
00.002e
00.002f
00.0030
…
EXTENDING THE TILE LIBRARY
…00201*…
…1*11**…
Tile Library
Tile Position Id
00.002b
00.002c
00.002d
00.002e
00.002f
00.0030
…
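One way to picture extending the library: when a genome's tile sequence is already known at a position, reuse its index; otherwise append it and hand out the next index. This is a sketch under that assumption, with a hypothetical `encode_tile` helper and invented sequences; it does not model the `*` entries shown in the vectors above.

```python
def encode_tile(library, position, sequence):
    """Return the variant index of sequence at position, adding it if new."""
    variants = library.setdefault(position, [])
    if sequence in variants:
        return variants.index(sequence)
    variants.append(sequence)
    return len(variants) - 1


# Library seeded with reference variants (index 0) at two positions.
library = {0: ["ACGT"], 1: ["GGCC"]}

first_genome = [encode_tile(library, p, s) for p, s in enumerate(["ACGT", "GGCA"])]
second_genome = [encode_tile(library, p, s) for p, s in enumerate(["ACTT", "GGCA"])]

print(first_genome, second_genome)
print(library)
```

The second genome reuses index 1 at position 1 because the first genome already introduced that sequence, which is why the library grows more slowly than the number of genomes encoded.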
RATE OF GROWTH OF THE TILE LIBRARY
CGF AND TILE LIBRARY FACILITATE QUERIES ON SEQUENCES
Requires: beginning locus and end locus
Returns: the sequences between the two loci for all people in the population
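A locus-range query of this shape might look like the sketch below. Everything in it is hypothetical: the tile start offsets, the library, the population, and the `query` function. For simplicity it treats tiles as non-overlapping and returns the whole tiles that cover the requested range rather than trimming to exact bases.

```python
from bisect import bisect_right

# Hypothetical data for illustration only.
TILE_STARTS = [0, 6, 11]  # reference offset at which each tile position begins
TILE_LIBRARY = {
    0: ["ACGTAC", "ACGAAC"],
    1: ["TTGCA", "TTGGA"],
    2: ["GGATC"],
}
POPULATION = {
    "person_1": [0, 0, 0],
    "person_2": [1, 1, 0],
}


def query(begin, end):
    """Return, per person, the tile sequences covering reference loci [begin, end)."""
    first = bisect_right(TILE_STARTS, begin) - 1    # tile containing begin
    last = bisect_right(TILE_STARTS, end - 1) - 1   # tile containing end - 1
    return {
        name: "".join(
            TILE_LIBRARY[position][vector[position]]
            for position in range(first, last + 1)
        )
        for name, vector in POPULATION.items()
    }


print(query(7, 12))
```

Because the locus-to-tile lookup is done once and then applied to every vector, the per-person cost is a handful of index lookups rather than a scan of that person's variant calls.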
TIME USED FOR QUERIES ON SEQUENCES
TIME PER BASE FOR QUERIES ON SEQUENCES
CGF FACILITATES SEVERAL IMPORTANT ANALYSIS TYPES
Unsupervised Machine Learning
Supervised Machine Learning (Case/Control)
GWAS
These analyses encompass all variation, not just SNP variation
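Since each genome is already a vector of small integers, turning a population into a feature matrix is mostly bookkeeping. The sketch below one-hot encodes tile variants, which is one common choice and an assumption here, not something the slides prescribe; `one_hot` and the sample population are hypothetical.

```python
def one_hot(population):
    """Turn equal-length tile vectors into one-hot rows.

    Returns (columns, rows): columns is a sorted list of (position, variant)
    pairs seen in the population; rows maps each name to a 0/1 list.
    """
    n_positions = len(next(iter(population.values())))
    columns = sorted(
        {(p, vector[p]) for vector in population.values() for p in range(n_positions)}
    )
    column_index = {column: i for i, column in enumerate(columns)}
    rows = {}
    for name, vector in population.items():
        row = [0] * len(columns)
        for position, variant in enumerate(vector):
            row[column_index[(position, variant)]] = 1
        rows[name] = row
    return columns, rows


columns, rows = one_hot({"a": [0, 0, 0], "b": [1, 2, 0]})
print(columns)
print(rows)
```

Each column stands for a whole tile variant, so an insertion, deletion, or multi-base change becomes a feature in the same way a SNP does.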
COMPACT GENOME FORMAT FINAL THOUGHTS
• Allows annotations
The Tile Library can be annotated by canonical and in-house annotation pipelines; the annotations then apply automatically to all CGF files
• Small
• Standardized
• Fast to query
• Designed for machine learning
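The annotation point can be illustrated with a small sketch: if annotations are stored once per tile variant in the library, any genome picks them up by lookup. The annotation table, its contents, and the `annotate` helper are hypothetical.

```python
# Hypothetical annotations keyed by (tile position, variant index).
ANNOTATIONS = {
    (1, 2): ["example note from an in-house pipeline"],
}


def annotate(tile_vector):
    """Return the library annotations that apply to one genome's tile vector."""
    return {
        (position, variant): ANNOTATIONS[(position, variant)]
        for position, variant in enumerate(tile_vector)
        if (position, variant) in ANNOTATIONS
    }


print(annotate([0, 0, 0]))  # no annotated variants in this genome
print(annotate([1, 2, 0]))  # carries the annotated variant at position 1
```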
Thank you!
Any Questions?
Preliminary implementation: lightning-dev3.curoverse.com/brca
Source code: https://github.com/curoverse/lightning
Software license: GNU AGPLv3
GENERATING THE REFERENCE TILE LIBRARY WITH MULTIPLE TAG SETS
Tag Set ……Tag Set
RATE OF GROWTH OF THE TILE LIBRARY (NO CALLS CREATE VARIANTS)
