Analyzing Genomic Data
with PyEnsembl and
Varcode
Alex Rubinsteyn
SciPy - July 9th, 2015
HammerLab @ Mount Sinai
http://www.hammerlab.org/research/
Not biologists!
Most lab members have a background in
Computer Science and programming:
• Distributed data management
• Machine learning/statistics
• Programming languages
• Data visualization
Lab focus
Software development: building high-quality
open source software for biomedicine
Research: use that software to change how a
physician treats a patient
Lab research projects
• Can we predict response to immunotherapy from
melanoma & lung cancer mutations? (collaboration with Alex Snyder
& Matt Hellman)
• Does chemotherapy increase heterogeneity of
mutations across cells in ovarian cancer? (collaborations with
John Martignetti and Alex Snyder)
• Can we reprogram the immune system to attack
cancer cells making mutated proteins? (collaboration with Nina
Bhardwaj’s lab)
– Personalized cancer vaccine pipeline written in Python
– Phase I clinical trial (~20 patients) starting soon (!!)
tl;dr of high throughput DNA
sequencing for cancer
[Diagram: Peripheral Blood → Normal DNA → Fragments → Sequenced “Reads”; Tumor → Tumor DNA → Fragments → Sequenced “Reads”; result: Cancer (“somatic”) Mutations]
How do we find mutations?
[Diagram: Normal DNA Reads and Tumor DNA Reads shown against the Reference Genome Sequence]
Variant file format: VCF
Source: wiki.nci.nih.gov
Entry for variant
“chr20:1235678 GTC>G”
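The variant string above carries the same information as the contig, position, reference allele and alternate allele of a VCF record. As a minimal sketch in plain Python (not Varcode's parser; `SimpleVariant`, `parse_variant` and `variant_kind` are hypothetical names used only for illustration), it could be split into fields like this:

```python
import re
from collections import namedtuple

# Hypothetical container for illustration; Varcode has its own Variant class.
SimpleVariant = namedtuple("SimpleVariant", ["contig", "pos", "ref", "alt"])

# Matches strings such as "chr20:1235678 GTC>G" (the "chr" prefix is optional).
_PATTERN = re.compile(r"^(?:chr)?(\w+):(\d+)\s+([ACGTN]+)>([ACGTN]+)$")


def parse_variant(text):
    """Split a 'contig:position REF>ALT' string into its four fields."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError("unrecognized variant string: %r" % text)
    contig, pos, ref, alt = match.groups()
    return SimpleVariant(contig, int(pos), ref, alt)


def variant_kind(variant):
    """Rough classification based only on allele lengths."""
    if len(variant.ref) == len(variant.alt):
        return "substitution"
    return "deletion" if len(variant.ref) > len(variant.alt) else "insertion"


v = parse_variant("chr20:1235678 GTC>G")
print(v, variant_kind(v))
```

Here GTC>G removes two bases after the shared leading G, so the sketch reports a deletion.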
Cancer-specific variant
file format: MAF
Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_position
End_position Strand Variant_Classification Variant_Type
Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS
dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode
Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1
Tumor_Validation_Allele2 Match_Norm_Validation_Allele1
Match_Norm_Validation_Allele2 Verification_Status Validation_Status
Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score
BAM_file Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID
AGL 178 genome.wustl.edu 37 1 100349684 100349684 +
Missense_Mutation SNP G G A TCGA-13-1405-01A-01W-0494-09
TCGA-13-1405-10A-01W-0495-09 G G G A G G Unknown Valid Somatic
4 WXS 454_PCR_WGA 1 dbGAP Illumina GAIIx
c0d1de72-4cce-4d74-93f0-29c462dc1426 89f04056-0478-4305-b1ce-486ae469b4dd
SASS6 163786 genome.wustl.edu 37 1 100573197 100573197 +
Missense_Mutation SNP G G A TCGA-04-1542-01A-01W-0553-09
TCGA-04-1542-10A-01W-0553-09 G G G A G G Unknown Valid Somatic
4 WXS 454_PCR_WGA 1 dbGAP Illumina GAIIx
317a63af-e862-43df-8ef5-7c555b2cb678 b94052a8-c3d2-4e47-81e2-62242bc0841a
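A MAF file is a tab-separated table with one row per mutation. The sketch below reads such a table with only the standard library; it is an illustration, not how Varcode loads MAF files. The in-memory sample keeps just five of the columns listed above, with values taken from the two rows on this slide, and `read_maf` is a hypothetical helper name.

```python
import csv
import io

# Reduced sample: five of the MAF columns, values copied from the slide.
MAF_TEXT = (
    "Hugo_Symbol\tChromosome\tStart_position\tReference_Allele\tTumor_Seq_Allele2\n"
    "AGL\t1\t100349684\tG\tA\n"
    "SASS6\t1\t100573197\tG\tA\n"
)


def read_maf(handle):
    """Return one small dict per MAF row, skipping '#' comment lines."""
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    rows = []
    for row in reader:
        rows.append({
            "gene": row["Hugo_Symbol"],
            "contig": row["Chromosome"],
            "start": int(row["Start_position"]),
            "ref": row["Reference_Allele"],
            "alt": row["Tumor_Seq_Allele2"],
        })
    return rows


rows = read_maf(io.StringIO(MAF_TEXT))
for r in rows:
    print("%(gene)s chr%(contig)s:%(start)d %(ref)s>%(alt)s" % r)
```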
Working With Genomic
Mutation Data in Python
PyEnsembl
• Python interface to the Ensembl genome annotations
• Where is each gene/transcript/exon?
• What’s the “biotype” of each transcript? (e.g. long
noncoding RNA, protein coding)
Varcode
• Read VCF (& MAF) files
• Filter variants (e.g. only protein coding genes)
• Compare collections of variants (e.g. do these two
patients’ cancers share mutations?)
• Predict effect of mutations on protein sequence
Getting Started
$ pip install pyensembl varcode
Download reference genome annotation data:
$ pyensembl install --release 75
…and wait ~10 minutes for the download and indexing of
several GB of reference sequence and annotation data.
Ensembl releases are specific to the version of the human genome
that variants are aligned against:
• GRCh37/NCBI37/hg19 = Ensembl release 75
• GRCh38 = Ensembl release 80 (newer releases coming)
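The pairing of genome build and Ensembl release above can be kept in a small lookup so that analysis code never mixes coordinates from different builds. This is only a sketch: the table holds just the two pairings named on this slide, and `ensembl_release_for` is a hypothetical helper, not part of PyEnsembl.

```python
# Build name -> Ensembl release, exactly as listed on the slide.
RELEASE_FOR_BUILD = {
    "grch37": 75,
    "ncbi37": 75,
    "hg19": 75,
    "grch38": 80,
}


def ensembl_release_for(build):
    """Look up the Ensembl release for a genome build name (case-insensitive)."""
    try:
        return RELEASE_FOR_BUILD[build.lower()]
    except KeyError:
        raise ValueError("unknown genome build: %r" % build)


print(ensembl_release_for("hg19"))
print(ensembl_release_for("GRCh38"))

# The resulting number could then be passed to pyensembl, for example
# EnsemblRelease(ensembl_release_for("GRCh37")), once that release is installed.
```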
PyEnsembl Example: Looking
up info about the TP53 gene
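The code shown on this slide did not survive in the text. A sketch of what a lookup might look like follows; it assumes PyEnsembl's `EnsemblRelease` object with a `genes_by_name` method and gene attributes `gene_id`, `contig`, `start`, `end`, `strand`, `biotype` and `transcripts` (attribute names have varied between PyEnsembl versions, so check the installed one). `describe_gene` is a hypothetical helper written for this example.

```python
def describe_gene(genome, gene_name):
    """Return one summary dict per gene annotation matching gene_name.

    `genome` is expected to behave like a pyensembl EnsemblRelease object.
    """
    return [
        {
            "id": gene.gene_id,
            "contig": gene.contig,
            "start": gene.start,
            "end": gene.end,
            "strand": gene.strand,
            "biotype": gene.biotype,
            "n_transcripts": len(gene.transcripts),
        }
        for gene in genome.genes_by_name(gene_name)
    ]


# Usage, assuming pyensembl is installed and release 75 has been downloaded:
#   from pyensembl import EnsemblRelease
#   for info in describe_gene(EnsemblRelease(75), "TP53"):
#       print(info)
```

Passing the genome object in as an argument keeps the helper independent of which release is used.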
PyEnsembl Example: What’s
the sequence of TP53-001?
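The original code for this slide is missing from the text. As a hedged sketch, a transcript's sequences might be fetched as below, assuming PyEnsembl's `transcripts_by_name` method and the transcript attributes `transcript_id`, `sequence` (spliced cDNA) and `protein_sequence`. `transcript_sequences` is a hypothetical helper, and the transcript name TP53-001 may not exist under that name in Ensembl releases other than the one used in the talk.

```python
def transcript_sequences(genome, transcript_name):
    """Map transcript ID -> cDNA and protein sequence for a transcript name.

    `genome` is expected to behave like a pyensembl EnsemblRelease object.
    """
    results = {}
    for transcript in genome.transcripts_by_name(transcript_name):
        results[transcript.transcript_id] = {
            "cdna": transcript.sequence,
            # Expected to be empty/None for transcripts that are not protein coding.
            "protein": transcript.protein_sequence,
        }
    return results


# Usage, assuming pyensembl is installed and release 75 has been downloaded:
#   from pyensembl import EnsemblRelease
#   for tid, seqs in transcript_sequences(EnsemblRelease(75), "TP53-001").items():
#       print(tid, len(seqs["cdna"]), seqs["protein"][:20])
```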
Varcode Example: Loading a
MAF file
• Each mutation is represented by a Variant object
• A collection of Variant objects is called a … VariantCollection!
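The loading code from this slide is not in the text. A sketch of how the loaded collection might be inspected follows; it assumes Varcode's `load_maf` function and that each Variant exposes `contig`, `start`, `ref` and `alt`. The file path in the usage comment and the helper `summarize_variants` are hypothetical.

```python
from itertools import islice


def summarize_variants(variants, limit=5):
    """Format the first `limit` variants as 'chrN:pos REF>ALT' strings.

    `variants` can be any iterable of objects with contig/start/ref/alt,
    such as a varcode VariantCollection.
    """
    return [
        "chr%s:%d %s>%s" % (v.contig, v.start, v.ref, v.alt)
        for v in islice(variants, limit)
    ]


# Usage, assuming varcode is installed (the path below is a placeholder):
#   from varcode import load_maf
#   variants = load_maf("path/to/mutations.maf")
#   print(len(variants))
#   print("\n".join(summarize_variants(variants)))
```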
Varcode: Which gene is most
mutated in ovarian cancer?
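Answering this question comes down to counting how many variants fall in each gene. The sketch below shows that counting step in plain Python with `collections.Counter`. It is not Varcode's own grouping code, `most_mutated_genes` is a hypothetical helper, and the gene symbols are made-up placeholders rather than results from any dataset.

```python
from collections import Counter


def most_mutated_genes(gene_symbols, n=3):
    """Return the n most frequent gene symbols with their variant counts."""
    return Counter(gene_symbols).most_common(n)


# Placeholder symbols, one entry per variant (for example the Hugo_Symbol
# column of a MAF file). These are illustrative, not real data.
symbols = ["GENE_A", "GENE_B", "GENE_A", "GENE_C", "GENE_A", "GENE_B"]
print(most_mutated_genes(symbols, n=2))
```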
Varcode: What are the protein
effects of mutations in TP53?
Many effect classes correspond to kinds of changes in protein
sequence (or describe which non-coding region a mutation
affects). Examples:
• Substitution (change of one amino acid into another)
• Frameshift (change in codon translation frame)
• Intronic (mutation gets spliced out of transcript)
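To make the frameshift idea concrete: an insertion or deletion shifts the codon reading frame whenever the net length change is not a multiple of three. The sketch below applies only that length rule. It is a deliberate simplification and not Varcode's effect prediction, which works from transcript annotations; `classify_by_length` and its labels are hypothetical.

```python
def classify_by_length(ref, alt, region="coding"):
    """Very rough effect label from allele lengths alone.

    region: "coding" or "intron" (assumed to be known from annotation).
    """
    if region == "intron":
        return "Intronic"
    delta = len(alt) - len(ref)
    if delta % 3 != 0:
        # Net change is not a whole number of codons.
        return "Frameshift"
    if delta != 0:
        return "In-frame indel"
    # Same length: could be a substitution or a silent change;
    # telling them apart requires translating the codon.
    return "Same-length change"


print(classify_by_length("GTC", "G"))
print(classify_by_length("GTCA", "G"))
print(classify_by_length("G", "A"))
print(classify_by_length("G", "A", region="intron"))
```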
Varcode: What’s the sequence
of a mutant protein?
We don’t just want to know what kind of effect a mutation has,
but specifically what the sequence of the new protein will be.
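The underlying idea can be sketched without any library: apply the nucleotide change to the coding sequence, then translate both versions with the standard genetic code. This is a toy illustration, not Varcode's implementation. The 15-base sequence is made up (it is not TP53), and `translate` and `apply_variant` are hypothetical helper names.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def apply_variant(cds, offset, ref, alt):
    """Replace `ref` with `alt` at a 0-based offset in the coding sequence."""
    if cds[offset:offset + len(ref)] != ref:
        raise ValueError("reference allele does not match sequence")
    return cds[:offset] + alt + cds[offset + len(ref):]


cds = "ATGGAGGAGCCGTAA"          # made-up sequence: M E E P stop
mutant_cds = apply_variant(cds, 3, "G", "A")
print(translate(cds), "->", translate(mutant_cds))
```

In this toy case the second codon changes from GAG to AAG, so the second amino acid changes from E to K while the rest of the protein is unchanged.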
Thanks!
PyEnsembl
www.github.com/hammerlab/pyensembl
Varcode
www.github.com/hammerlab/varcode
