SlideShare a Scribd company logo
1 of 43
Download to read offline
© Desktop Genetics Ltd. 2016 A n Illumina-backed company
PERSONALIZING GENOME
SURGERY (WITH PYTHON)
RILEY DOYLE, CEO AND TECHNICAL LEAD
2
AI-POWERED GENOME EDITING
TECH TO USE CRISPR FOR PERSONALIZED GENOMIC SURGERY
@DESKTOPGENETICS| @DOYLE_RILEY
3
BIGGEST BIOTECH BREAKTHROUGH OF THE CENTURY
GLOBAL COVERAGE ACROSS SCIENCE AND TECH MEDIA
GENE EDITING SAVES GIRL DYING FROM
LEUKAEMIA IN WORLD FIRST
5 November 2015
HIV GENES HAVE BEEN CUT OUT OF LIVE
ANIMALS USING CRISPR
15 May 2016
CHINA USED CRISPR TO FIGHT CANCER
IN A REAL, LIVE HUMAN
18 November 2016
CRISPR: GENE EDITING IS JUST THE
BEGINNING
07 March 2016
@DESKTOPGENETICS| @DOYLE_RILEY
4
GENE THERAPY TACKLES DISEASES
CRISPR IS USED TO TREAT PATIENTS AND DISCOVER CURES
DTG
DTG
DTG
@DESKTOPGENETICS| @DOYLE_RILEY
5
CRISPR IS GETTING BIGGER EVERY DAY
GLOBAL REACH OF CRIPR LABS & SCIENTISTS
GENOME EDITING PLASMIDS DISTRIBUTED
BY ADDGENE FROM 2005 TO 2014
CRISPR TALE ZFN SYNBIO
@DESKTOPGENETICS| @DOYLE_RILEY
6
GENOME EDITING PROCESS
AI REQUIRED TO AUTOMATE DECISION MAKING THROUGHOUT THE PROCESS
@DESKTOPGENETICS| @DOYLE_RILEY
7
AGENDA
1. Brief intro to CRISPR
2. Applying machine learning to DNA
3. Our CRISPR design process
4. The path forward
@DESKTOPGENETICS| @DOYLE_RILEY
© Desktop Genetics Ltd. 2016 A n Illumina-backed company
1. BRIEF INTRO TO CRISPR
9
CRISPR AT A GLANCE
MOLECULAR INTERACTIONS AND MODELS
@DESKTOPGENETICS| @DOYLE_RILEY
Email Pydata@deskgen.com
for video and 3D molecule
10
CRISPR OVERVIEW
PROGRAMMABLE TWO COMPONENT SYSTEM
CAS9
NUCLEASE
RNA
COMPONENT
@DESKTOPGENETICS| @DOYLE_RILEY
11
CRISPR OVERVIEW
PROGRAMMABLE TWO COMPONENT SYSTEM
CAS9
NUCLEASE
RNA
COMPONENT
VARIABLE
20 BP GUIDE RNA
(sgRNA)
CONSTANT
REGION
(tracrRNA)
@DESKTOPGENETICS| @DOYLE_RILEY
12
CRISPR OVERVIEW
PROGRAMMABLE TWO COMPONENT SYSTEM
CAS9
NUCLEASE
RNA
COMPONENT
ACTIVE
RNA-GUIDED CAS9
COMPLEX
@DESKTOPGENETICS| @DOYLE_RILEY
13
CRISPR OVERVIEW
CUT + REPAIR = GENOME EDITING
NGG PAM SEQUENCE
NUCLEASE DOMAINS
GENOME
SEQUENCE
sgRNA-DNA
BASE PAIRING
@DESKTOPGENETICS| @DOYLE_RILEY
14
WHY EDIT GENOMES?
RESEARCH AND DEVELOPMENT → CLINICAL CURES
- Degenerative blindness
- Custom cancer models
- Humanization of heart valves
- Swine fever resistance
- HIV eradicated in vitro
- Immuno-oncology
- Clinical trials cured cancer
- Clinical trials cured HIV
@DESKTOPGENETICS| @DOYLE_RILEY
© Desktop Genetics Ltd. 2016 A n Illumina-backed company
2. APPLYING MACHINE
LEARNING TO DNA
16
CRISPR HAS SEVERAL COMPUTATIONAL PROBLEMS
WHAT ARE WE ACTUALLY TRYING TO PREDICT ANYWAY?
ActivitySpecificity
Patient
Outcome
Biological
Importance
Instrument
Signal
@DESKTOPGENETICS| @DOYLE_RILEY
RECURRING CRISPR PROBLEMS
USER ANALYTICS REVEALED COMMON PROBLEMS
HUMAN MACHINE
Guide
selection
Get tired of choosing many
guides for each gene
Considers all guides, choses
consistently
Scoring
function(s)
Undue weight given to some
scoring functions
Weights of features carefully
controlled
Genotype
data
Considers only reference
genome
Considers actual genome
sequence
Overall
objective
Few “winning” guides
Balanced, orthogonal training
set
@DESKTOPGENETICS| @DOYLE_RILEY 17
SELECTION OF BIOCHEMISTRY BASED FEATURES
SEVERAL MACRO & CONTEXTUAL FEATURES IDENTIFIED FROM BIOCHEMISTRY LITERATURE
DESIGN RULE TYPE RANGE CONSIDERS RESULT
NAG PAM (Control) Negative {0,1} (PAM) Sequence ✔
GC% Negative [0,1] Sequence ✔
Homopolymer (N4) Negative {0,1} Sequence ✔
SNP Collision Negative {0,1} Location ✔
UUU Triplet Negative {0,1} Sequence ✔
Non-constitutive Transcript Negative {0,1} Location ✔
1st
third CDS Positive {0,1} Location ✖
Functional domain Positive {0,1} Location ?
Truncated guide Positive {0,1} Sequence ✖
Microhomology Positive [0,1] Sequence ✖
Specificity (Hsu, 2013) Negative [0,1] Sequence ?
@DESKTOPGENETICS| @DOYLE_RILEY 18
19
GUIDE RNA SEQUENCE FEATURES
SEQUENCES EMBEDDED INTO VECTOR SPACE USING ONE-HOT ENCODING OF K-MER@POSITION
Number of non-overlapping, position-dependent
sequence features is:
● We used k [1,3] for ~4700 features total
● Resulting embedding is very sparse.
● Too many dimensions + insufficient data = over fitting
where k = feature size (nt) and n is length of sequence
4 States: A → [1000], C → [0100], G → [0010], T → [0001]
at each position in n; repeat for all k-mers.
@DESKTOPGENETICS| @DOYLE_RILEY
20
REAL GENOMES HAVE MUTATIONS
INDIVIDUAL GENOME VARIANTS CAN GENERATE NOISE
@DESKTOPGENETICS| @DOYLE_RILEY
21
GENOME SEQUENCING IS DATA INTENSIVE
OUR SYSTEM NEEDS TO HANDLE LARGE VOLUMES OF DATA
500 GB + 1 GB + 2 GB + 2 GB +
@DESKTOPGENETICS| @DOYLE_RILEY
22
DESKGEN INFRASTRUCTURE
HANDLING GENOME DATA AT SCALE
SaltStack Control Layer orchestrates instance groups in both development and production environments.
Github
Sequencer
Remote
Stores
Salt
Master
Vendors
Browser
BioInfo
Worker
BioInfo
Worker
BioInfo
Workers Cloud
Storage
BioInfo
Worker
BioInfo
Worker
Production
Hosts
PRODUCT TEAMTECH R&D TEAM
@DESKTOPGENETICS| @DOYLE_RILEY
Google Cloud Platform
23
DESKGEN HOST LEVEL ARCHITECTURE
GENOME CONTEXT MADE AVAILABLE ACROSS STEPS OF ML PIPELINE
ML PIPELINE either imports Python code directly or uses CLI commands.
dgregistry
(Tornado)
dgcli
(Click)
genome_fs
(C ext)
Omics
Tools
(Click)
Postgresql
(Alembic)
manifest
(Python2)
salt-minion
(Salt)
GCStorage
(gcloud sdk)
Specialized
Services
(C ext)
Browser
(Vue.js)
Vendors
(Requests)
Align to
Genome
Compute
Features
Compute
Performance
Values
Train Model
Report and
Bank Model
MACHINE LEARNING ENV (Jupyter Notebooks + PANDAS + SciKit Learn)
IN-SILICO OF TARGET GENOME (Common Instance Image)
API
BioInfo
Library
(C ext)
@DESKTOPGENETICS| @DOYLE_RILEY
MEASURING GUIDE PERFORMANCE
EVOLUTION SAYS GUIDES ACTIVE AGAINST ESSENTIAL GENES SHOULD KILL CELLS
24
PLASMID
POOL
Transfection
INITIAL
TIMEPOINT
CRISPR KO &
Depletion
FINAL
TIMEPOINT
Day 0 NGS Day 23 NGS
sgRNA
Count
sgRNA
Count
@DESKTOPGENETICS| @DOYLE_RILEY
GUIDE SCORING
NON-ESSENTIAL GENE TARGETS RESULT IN UNDETECTABLE GUIDES
● Remove non-essential genes from analysis as sgRNA activity cannot be detected.
@DESKTOPGENETICS| @DOYLE_RILEY 25
VARIANCE OF THE SAME GUIDE
AN ACTIVE GUIDE
In active guides, there is little variance
between biological replicates, and
different experiments.
@DESKTOPGENETICS| @DOYLE_RILEY 26
VARIANCE OF THE SAME GUIDE
AN INACTIVE GUIDE
In inactive guides - there is large
variance between biological replicates,
and different experiments
@DESKTOPGENETICS| @DOYLE_RILEY 27
GUIDE SCORING
REMOVING NON-ESSENTIAL GENES INCREASES ROBUSTNESS OF GUIDE ACTIVITY DETECTION
28
Wang
(1878)
Strain H
(291)
Strain A
(396)
166 125 235 161
1518
Full Essential‘Essential’ Genes
Sabatini data: Wang et al. Science. 2015 Nov 27;350(6264):1096-101
log2fc
Doench 2016 Score (Full)
log2fc
Wang et al. (2015): Conducted CRISPR screen in the near-haploid human KBM7 chronic
myelogenous leukemia (CML) cell line and confirmed essentiality using gene-trap.
@DESKTOPGENETICS| @DOYLE_RILEY
DATA ANALYSIS PIPELINE
1. Normalization
1.1. Normalized so that read count across columns was consistent per experiment
2. Selection
2.1. Removed rows where there was a read count < 30
2.2. Removed rows where gene was 'NA' or null
2.3. Removed guides targeting non-coding regions
2.4. Selected guides targeting essential genes using MAGeCK
2.4.1. Human: 6509 guides (5.61% of dataset)
2.4.2. Mouse: 8006 guides (5.58% of dataset)
3. Scoring derived from first-order kinetic rate law
POST-PROCESSING AND NORMALIZATION CRITICAL TO MODEL
@DESKTOPGENETICS| @DOYLE_RILEY 29
© Desktop Genetics Ltd. 2016 A n Illumina-backed company
3. OUR CRISPR DESIGN
PROCESS
LINEAR MODEL PERFORMED SURPRISINGLY WELL
BOTH PEARSON AND SPEARMAN METRICS IMPROVED
Comparison of performance between DTG and Doench 2016 models
● Executing this algorithm found DTG’s model is an 84% improvement over state of the art
(Doench 2016)
● Generalized Linear Model performed as well as ConvNet and RandomForest
@DESKTOPGENETICS| @DOYLE_RILEY 31
MODEL DOES NOT GENERALIZE ACROSS SPECIES
Comparison of performance between DTG and Doench models
MOUSE PERFORMANCE ALSO IMPROVED BUT IS NOT AS GOOD AS HUMAN MODEL
● Executing this algorithm found DTG’s model is an 100% improvement over Doench 2016
● No literature list of essential genes available for Mouse
● Still unclear why performance is different
@DESKTOPGENETICS| @DOYLE_RILEY 32
MODEL COEFFICIENTS CONFIRM POSITION-DEPENDENT SEQUENCE EFFECT
● We examined the coefficients of the ridge regression model
● We determined the importance of single bases varies a lot of the range of the flank
PRIOR WORK EXTENDED INTO NEW TRAINING DATA
@DESKTOPGENETICS| @DOYLE_RILEY 33
MARGINAL BENEFIT OF ADDITIONAL DATA
HUMAN AND MOUSE MODELS BOTH IMPROVE AS FURTHER WET LAB DATA ADDED
● Relationship between model performance and data used = more data will help build a better model
SpearmanCorrelation
SpearmanCorrelation
@DESKTOPGENETICS| @DOYLE_RILEY 34
© Desktop Genetics Ltd. 2016 A n Illumina-backed company
4. THE PATH FORWARD
36
CONCLUSIONS
1. De-noising and normalization of the training data and feature engineering resulted in a
linear model which outperformed more complex types.
2. Linear model currently predicts guide performance up to current variance seen
experimentally.
3. Model generalized across cell lines but not across species. We are currently unsure
why.
4. Prior knowledge about essential genes and target genome significantly improved the
model (ie. human genome better curated than mouse).
5. Model performance increased linearly with more training data, but less rapidly for
mouse.
@DESKTOPGENETICS| @DOYLE_RILEY
SIGNIFICANTLY MORE ACCURATE GUIDE ACTIVITY PREDICTIONS WERE POSSIBLE
37
LESSONS LEARNED
1. Task queues (Celery), microservices, containers (Docker, Kubernetes), and
Postgresql significantly increased dev-ops burden, dependencies, code
maintenance requirements, and learning curve without increasing productivity.
Pure python code nearly always ended up getting used more.
2. Scikit Learn Model serialization (cPickle) is not portable as ABI breaks between
minor and patch versions. Significant source of errors in production. Acute need for
better way to serialize more complex models.
3. Docker Containers did not provide a “silver bullet” replacement for Python packaging,
dependency management, or model portability. Instead they introduced significant
learning curve as most bioinformatics tools expect direct access to a shared
filesystem.
4. Data Science and BioInformatics team strongly preferred working with Conda
environment vs. PyEnv + VirutualEnv.
5. Google Cloud Storage critical to working with large genomic data sets.
@DESKTOPGENETICS| @DOYLE_RILEY
ETL PIPELINE, FEATURES, AND DATA PROCESSING WERE CRITICAL TO SUCCESS
38
TAKING CRISPR AI TO THE CLINIC
EXTENDING APPROACH TO IMPROVE GENOME EDITING SAFETY AND EFFICACY
@DESKTOPGENETICS| @DOYLE_RILEY
RECOGNITION
TECH, BIOTECH AND EVERYTHING IN BETWEEN
39@DESKTOPGENETICS| @DOYLE_RILEY
GETTING INVOLVED WITH CRISPR
OPTIMISE AND IMPROVE
1. Dataset available on GitHub – try it yourself
https://github.com/DeskGen/guide-cluster
2. Larger dataset with API coming March 2017
https://github.com/DeskGen/dgcli
3. Hiring full time at Desktop Genetics
https://www.deskgen.com/landing/company#about-careers
40@DESKTOPGENETICS| @DOYLE_RILEY
JOBS AT DESKTOP GENETICS HQ
JOIN US IN SHOREDITCH - TELL YOUR FRIENDS!
41@DESKTOPGENETICS| @DOYLE_RILEY
GET EVERYTHING YOU JUST HEARD AND MORE
SLIDES, FUTURE MEETUPS, CRISPR RESOURCES, JOB OPPORTUNITIES
PyData@deskgen.com
42@DESKTOPGENETICS| @DOYLE_RILEY
Send an empty email to
© Desktop Genetics Ltd. 2017
pydata@deskgen.com

More Related Content

What's hot

High-performance web services for gene and variant annotations
High-performance web services for gene and variant annotationsHigh-performance web services for gene and variant annotations
High-performance web services for gene and variant annotationsChunlei Wu
 
Real-Time Genome Sequencing of Resistant Bacteria Provides Precision Infectio...
Real-Time Genome Sequencing of Resistant Bacteria Provides Precision Infectio...Real-Time Genome Sequencing of Resistant Bacteria Provides Precision Infectio...
Real-Time Genome Sequencing of Resistant Bacteria Provides Precision Infectio...ExternalEvents
 
2nd CRISPR Congress Boston, 23-25 February 2016
2nd CRISPR Congress Boston, 23-25 February 2016 2nd CRISPR Congress Boston, 23-25 February 2016
2nd CRISPR Congress Boston, 23-25 February 2016 Diane McKenna
 
Integration of heterogeneous data
Integration of heterogeneous dataIntegration of heterogeneous data
Integration of heterogeneous dataLars Juhl Jensen
 
2015 6 bd2k_biobranch_knowbio
2015 6 bd2k_biobranch_knowbio2015 6 bd2k_biobranch_knowbio
2015 6 bd2k_biobranch_knowbioBenjamin Good
 
Using Machine Learning Models Based on Phenotypic Data to Discover New Molecu...
Using Machine Learning Models Based on Phenotypic Data to Discover New Molecu...Using Machine Learning Models Based on Phenotypic Data to Discover New Molecu...
Using Machine Learning Models Based on Phenotypic Data to Discover New Molecu...Sean Ekins
 
CBGW Brian Van Doormaal
CBGW Brian Van DoormaalCBGW Brian Van Doormaal
CBGW Brian Van DoormaalGenome Alberta
 
Stephen Friend Koo Foundation / Sun Yat-Sen Cancer Center 2012-03-12
Stephen Friend Koo Foundation / Sun Yat-Sen Cancer Center 2012-03-12Stephen Friend Koo Foundation / Sun Yat-Sen Cancer Center 2012-03-12
Stephen Friend Koo Foundation / Sun Yat-Sen Cancer Center 2012-03-12Sage Base
 
Optimization of sterile male ratio of oriental fruit fly, bactrocera dorsalis...
Optimization of sterile male ratio of oriental fruit fly, bactrocera dorsalis...Optimization of sterile male ratio of oriental fruit fly, bactrocera dorsalis...
Optimization of sterile male ratio of oriental fruit fly, bactrocera dorsalis...Md. Julfiker Rahman
 
Open data, compound repurposing, and rare diseases -- Point Loma Nazarene Uni...
Open data, compound repurposing, and rare diseases -- Point Loma Nazarene Uni...Open data, compound repurposing, and rare diseases -- Point Loma Nazarene Uni...
Open data, compound repurposing, and rare diseases -- Point Loma Nazarene Uni...Andrew Su
 
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...c.titus.brown
 
HIV RT Inhibitions in vitro and or or in vivo: materials and methods
HIV RT Inhibitions in vitro and or or in vivo: materials and methods HIV RT Inhibitions in vitro and or or in vivo: materials and methods
HIV RT Inhibitions in vitro and or or in vivo: materials and methods Titas Mallick
 
Jan2015 using the pilot genome rm for clinical validation steve lincoln
Jan2015 using the pilot genome rm for clinical validation steve lincolnJan2015 using the pilot genome rm for clinical validation steve lincoln
Jan2015 using the pilot genome rm for clinical validation steve lincolnGenomeInABottle
 
2013 caltech-edrn-talk
2013 caltech-edrn-talk2013 caltech-edrn-talk
2013 caltech-edrn-talkc.titus.brown
 
Pptpodcast2
Pptpodcast2Pptpodcast2
Pptpodcast2cwihhs
 
Humn330 amezcua genetic engineering
Humn330 amezcua genetic engineeringHumn330 amezcua genetic engineering
Humn330 amezcua genetic engineeringAlejandro Amezcua
 
Poster MEDINFO 2010
Poster MEDINFO 2010Poster MEDINFO 2010
Poster MEDINFO 2010Timothy Cook
 
Applications of Whole Genome Sequencing (WGS) to Food Safety – Perspective fr...
Applications of Whole Genome Sequencing (WGS) to Food Safety – Perspective fr...Applications of Whole Genome Sequencing (WGS) to Food Safety – Perspective fr...
Applications of Whole Genome Sequencing (WGS) to Food Safety – Perspective fr...ExternalEvents
 
2016 bd2k bgood_wikidata
2016 bd2k bgood_wikidata2016 bd2k bgood_wikidata
2016 bd2k bgood_wikidataBenjamin Good
 

What's hot (20)

High-performance web services for gene and variant annotations
High-performance web services for gene and variant annotationsHigh-performance web services for gene and variant annotations
High-performance web services for gene and variant annotations
 
Real-Time Genome Sequencing of Resistant Bacteria Provides Precision Infectio...
Real-Time Genome Sequencing of Resistant Bacteria Provides Precision Infectio...Real-Time Genome Sequencing of Resistant Bacteria Provides Precision Infectio...
Real-Time Genome Sequencing of Resistant Bacteria Provides Precision Infectio...
 
2nd CRISPR Congress Boston, 23-25 February 2016
2nd CRISPR Congress Boston, 23-25 February 2016 2nd CRISPR Congress Boston, 23-25 February 2016
2nd CRISPR Congress Boston, 23-25 February 2016
 
Integration of heterogeneous data
Integration of heterogeneous dataIntegration of heterogeneous data
Integration of heterogeneous data
 
2015 6 bd2k_biobranch_knowbio
2015 6 bd2k_biobranch_knowbio2015 6 bd2k_biobranch_knowbio
2015 6 bd2k_biobranch_knowbio
 
Using Machine Learning Models Based on Phenotypic Data to Discover New Molecu...
Using Machine Learning Models Based on Phenotypic Data to Discover New Molecu...Using Machine Learning Models Based on Phenotypic Data to Discover New Molecu...
Using Machine Learning Models Based on Phenotypic Data to Discover New Molecu...
 
2013 alumni-webinar
2013 alumni-webinar2013 alumni-webinar
2013 alumni-webinar
 
CBGW Brian Van Doormaal
CBGW Brian Van DoormaalCBGW Brian Van Doormaal
CBGW Brian Van Doormaal
 
Stephen Friend Koo Foundation / Sun Yat-Sen Cancer Center 2012-03-12
Stephen Friend Koo Foundation / Sun Yat-Sen Cancer Center 2012-03-12Stephen Friend Koo Foundation / Sun Yat-Sen Cancer Center 2012-03-12
Stephen Friend Koo Foundation / Sun Yat-Sen Cancer Center 2012-03-12
 
Optimization of sterile male ratio of oriental fruit fly, bactrocera dorsalis...
Optimization of sterile male ratio of oriental fruit fly, bactrocera dorsalis...Optimization of sterile male ratio of oriental fruit fly, bactrocera dorsalis...
Optimization of sterile male ratio of oriental fruit fly, bactrocera dorsalis...
 
Open data, compound repurposing, and rare diseases -- Point Loma Nazarene Uni...
Open data, compound repurposing, and rare diseases -- Point Loma Nazarene Uni...Open data, compound repurposing, and rare diseases -- Point Loma Nazarene Uni...
Open data, compound repurposing, and rare diseases -- Point Loma Nazarene Uni...
 
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...
 
HIV RT Inhibitions in vitro and or or in vivo: materials and methods
HIV RT Inhibitions in vitro and or or in vivo: materials and methods HIV RT Inhibitions in vitro and or or in vivo: materials and methods
HIV RT Inhibitions in vitro and or or in vivo: materials and methods
 
Jan2015 using the pilot genome rm for clinical validation steve lincoln
Jan2015 using the pilot genome rm for clinical validation steve lincolnJan2015 using the pilot genome rm for clinical validation steve lincoln
Jan2015 using the pilot genome rm for clinical validation steve lincoln
 
2013 caltech-edrn-talk
2013 caltech-edrn-talk2013 caltech-edrn-talk
2013 caltech-edrn-talk
 
Pptpodcast2
Pptpodcast2Pptpodcast2
Pptpodcast2
 
Humn330 amezcua genetic engineering
Humn330 amezcua genetic engineeringHumn330 amezcua genetic engineering
Humn330 amezcua genetic engineering
 
Poster MEDINFO 2010
Poster MEDINFO 2010Poster MEDINFO 2010
Poster MEDINFO 2010
 
Applications of Whole Genome Sequencing (WGS) to Food Safety – Perspective fr...
Applications of Whole Genome Sequencing (WGS) to Food Safety – Perspective fr...Applications of Whole Genome Sequencing (WGS) to Food Safety – Perspective fr...
Applications of Whole Genome Sequencing (WGS) to Food Safety – Perspective fr...
 
2016 bd2k bgood_wikidata
2016 bd2k bgood_wikidata2016 bd2k bgood_wikidata
2016 bd2k bgood_wikidata
 

Similar to PyData London January 2017

VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...Denis C. Bauer
 
Application of Whole Genome Sequencing in the infectious disease’ in vitro di...
Application of Whole Genome Sequencing in the infectious disease’ in vitro di...Application of Whole Genome Sequencing in the infectious disease’ in vitro di...
Application of Whole Genome Sequencing in the infectious disease’ in vitro di...ExternalEvents
 
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...Allen Day, PhD
 
How novel compute technology transforms life science research
How novel compute technology transforms life science researchHow novel compute technology transforms life science research
How novel compute technology transforms life science researchDenis C. Bauer
 
Enabling Patient-Driven Medicine Using Graph Database
Enabling Patient-Driven Medicine Using Graph DatabaseEnabling Patient-Driven Medicine Using Graph Database
Enabling Patient-Driven Medicine Using Graph DatabaseNeo4j
 
Illumina-General-Overview-Q1-17
Illumina-General-Overview-Q1-17Illumina-General-Overview-Q1-17
Illumina-General-Overview-Q1-17Matthew Holguin
 
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning ModelsMining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning ModelsSean Ekins
 
Whole Genome Trait Association in SVS
Whole Genome Trait Association in SVSWhole Genome Trait Association in SVS
Whole Genome Trait Association in SVSGolden Helix
 
Tilling @ sid
Tilling @ sidTilling @ sid
Tilling @ sidsidjena70
 
Expanding Your Research Capabilities Using Targeted NGS
Expanding Your Research Capabilities Using Targeted NGSExpanding Your Research Capabilities Using Targeted NGS
Expanding Your Research Capabilities Using Targeted NGSIntegrated DNA Technologies
 
Grand round whsiao_may2015
Grand round whsiao_may2015Grand round whsiao_may2015
Grand round whsiao_may2015IRIDA_community
 
How Can We Make Genomic Epidemiology a Widespread Reality? - William Hsiao
How Can We Make Genomic Epidemiology a Widespread Reality?  - William HsiaoHow Can We Make Genomic Epidemiology a Widespread Reality?  - William Hsiao
How Can We Make Genomic Epidemiology a Widespread Reality? - William HsiaoWilliam Hsiao
 
YASHIKA JAISWAL 1 (3) (1).pptx
YASHIKA JAISWAL 1 (3) (1).pptxYASHIKA JAISWAL 1 (3) (1).pptx
YASHIKA JAISWAL 1 (3) (1).pptxYashikaJaiswal10
 
Cloud Accelerated Genomics
Cloud Accelerated GenomicsCloud Accelerated Genomics
Cloud Accelerated GenomicsIdan Tohami
 
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / PhoenixAllen Day, PhD
 
Mining Big datasets to create and validate machine learning models
Mining Big datasets to create and validate machine learning modelsMining Big datasets to create and validate machine learning models
Mining Big datasets to create and validate machine learning modelsSean Ekins
 
Meaningful (meta)data at scale: removing barriers to precision medicine research
Meaningful (meta)data at scale: removing barriers to precision medicine researchMeaningful (meta)data at scale: removing barriers to precision medicine research
Meaningful (meta)data at scale: removing barriers to precision medicine researchNolan Nichols
 
Machine Learning in Healthcare Diagnostics
Machine Learning in Healthcare DiagnosticsMachine Learning in Healthcare Diagnostics
Machine Learning in Healthcare DiagnosticsLarry Smarr
 
CINECA webinar slides: Modular and reproducible workflows for federated molec...
CINECA webinar slides: Modular and reproducible workflows for federated molec...CINECA webinar slides: Modular and reproducible workflows for federated molec...
CINECA webinar slides: Modular and reproducible workflows for federated molec...CINECAProject
 
Ramil Mauleon: Galaxy: bioinformatics for rice scientists
Ramil Mauleon: Galaxy: bioinformatics for rice scientistsRamil Mauleon: Galaxy: bioinformatics for rice scientists
Ramil Mauleon: Galaxy: bioinformatics for rice scientistsGigaScience, BGI Hong Kong
 

Similar to PyData London January 2017 (20)

VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...
 
Application of Whole Genome Sequencing in the infectious disease’ in vitro di...
Application of Whole Genome Sequencing in the infectious disease’ in vitro di...Application of Whole Genome Sequencing in the infectious disease’ in vitro di...
Application of Whole Genome Sequencing in the infectious disease’ in vitro di...
 
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
 
How novel compute technology transforms life science research
How novel compute technology transforms life science researchHow novel compute technology transforms life science research
How novel compute technology transforms life science research
 
Enabling Patient-Driven Medicine Using Graph Database
Enabling Patient-Driven Medicine Using Graph DatabaseEnabling Patient-Driven Medicine Using Graph Database
Enabling Patient-Driven Medicine Using Graph Database
 
Illumina-General-Overview-Q1-17
Illumina-General-Overview-Q1-17Illumina-General-Overview-Q1-17
Illumina-General-Overview-Q1-17
 
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning ModelsMining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
 
Whole Genome Trait Association in SVS
Whole Genome Trait Association in SVSWhole Genome Trait Association in SVS
Whole Genome Trait Association in SVS
 
Tilling @ sid
Tilling @ sidTilling @ sid
Tilling @ sid
 
Expanding Your Research Capabilities Using Targeted NGS
Expanding Your Research Capabilities Using Targeted NGSExpanding Your Research Capabilities Using Targeted NGS
Expanding Your Research Capabilities Using Targeted NGS
 
Grand round whsiao_may2015
Grand round whsiao_may2015Grand round whsiao_may2015
Grand round whsiao_may2015
 
How Can We Make Genomic Epidemiology a Widespread Reality? - William Hsiao
How Can We Make Genomic Epidemiology a Widespread Reality?  - William HsiaoHow Can We Make Genomic Epidemiology a Widespread Reality?  - William Hsiao
How Can We Make Genomic Epidemiology a Widespread Reality? - William Hsiao
 
YASHIKA JAISWAL 1 (3) (1).pptx
YASHIKA JAISWAL 1 (3) (1).pptxYASHIKA JAISWAL 1 (3) (1).pptx
YASHIKA JAISWAL 1 (3) (1).pptx
 
Cloud Accelerated Genomics
Cloud Accelerated GenomicsCloud Accelerated Genomics
Cloud Accelerated Genomics
 
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
 
Mining Big datasets to create and validate machine learning models
Mining Big datasets to create and validate machine learning modelsMining Big datasets to create and validate machine learning models
Mining Big datasets to create and validate machine learning models
 
Meaningful (meta)data at scale: removing barriers to precision medicine research
Meaningful (meta)data at scale: removing barriers to precision medicine researchMeaningful (meta)data at scale: removing barriers to precision medicine research
Meaningful (meta)data at scale: removing barriers to precision medicine research
 
Machine Learning in Healthcare Diagnostics
Machine Learning in Healthcare DiagnosticsMachine Learning in Healthcare Diagnostics
Machine Learning in Healthcare Diagnostics
 
CINECA webinar slides: Modular and reproducible workflows for federated molec...
CINECA webinar slides: Modular and reproducible workflows for federated molec...CINECA webinar slides: Modular and reproducible workflows for federated molec...
CINECA webinar slides: Modular and reproducible workflows for federated molec...
 
Ramil Mauleon: Galaxy: bioinformatics for rice scientists
Ramil Mauleon: Galaxy: bioinformatics for rice scientistsRamil Mauleon: Galaxy: bioinformatics for rice scientists
Ramil Mauleon: Galaxy: bioinformatics for rice scientists
 

Recently uploaded

(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)riyaescorts54
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Nistarini College, Purulia (W.B) India
 
Evidences of Evolution General Biology 2
Evidences of Evolution General Biology 2Evidences of Evolution General Biology 2
Evidences of Evolution General Biology 2John Carlo Rollon
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxSwapnil Therkar
 
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |aasikanpl
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentationtahreemzahra82
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRlizamodels9
 
Solution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutionsSolution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutionsHajira Mahmood
 
BREEDING FOR RESISTANCE TO BIOTIC STRESS.pptx
BREEDING FOR RESISTANCE TO BIOTIC STRESS.pptxBREEDING FOR RESISTANCE TO BIOTIC STRESS.pptx
BREEDING FOR RESISTANCE TO BIOTIC STRESS.pptxPABOLU TEJASREE
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024AyushiRastogi48
 
Twin's paradox experiment is a meassurement of the extra dimensions.pptx
Twin's paradox experiment is a meassurement of the extra dimensions.pptxTwin's paradox experiment is a meassurement of the extra dimensions.pptx
Twin's paradox experiment is a meassurement of the extra dimensions.pptxEran Akiva Sinbar
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naJASISJULIANOELYNV
 

Recently uploaded (20)

(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 
Evidences of Evolution General Biology 2
Evidences of Evolution General Biology 2Evidences of Evolution General Biology 2
Evidences of Evolution General Biology 2
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
 
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdf
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentation
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
 
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort ServiceHot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
 
Solution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutionsSolution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutions
 
BREEDING FOR RESISTANCE TO BIOTIC STRESS.pptx
BREEDING FOR RESISTANCE TO BIOTIC STRESS.pptxBREEDING FOR RESISTANCE TO BIOTIC STRESS.pptx
BREEDING FOR RESISTANCE TO BIOTIC STRESS.pptx
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024
 
Twin's paradox experiment is a meassurement of the extra dimensions.pptx
Twin's paradox experiment is a meassurement of the extra dimensions.pptxTwin's paradox experiment is a meassurement of the extra dimensions.pptx
Twin's paradox experiment is a meassurement of the extra dimensions.pptx
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by na
 

PyData London January 2017

  • 1. © Desktop Genetics Ltd. 2016 A n Illumina-backed company PERSONALIZING GENOME SURGERY (WITH PYTHON) RILEY DOYLE, CEO AND TECHNICAL LEAD
  • 2. 2 AI-POWERED GENOME EDITING TECH TO USE CRISPR FOR PERSONALIZED GENOMIC SURGERY @DESKTOPGENETICS| @DOYLE_RILEY
  • 3. 3 BIGGEST BIOTECH BREAKTHROUGH OF THE CENTURY GLOBAL COVERAGE ACROSS SCIENCE AND TECH MEDIA GENE EDITING SAVES GIRL DYING FROM LEUKAEMIA IN WORLD FIRST 5 November 2015 HIV GENES HAVE BEEN CUT OUT OF LIVE ANIMALS USING CRISPR 15 May 2016 CHINA USED CRISPR TO FIGHT CANCER IN A REAL, LIVE HUMAN 18 November 2016 CRISPR: GENE EDITING IS JUST THE BEGINNING 07 March 2016 @DESKTOPGENETICS| @DOYLE_RILEY
  • 4. 4 GENE THERAPY TACKLES DISEASES CRISPR IS USED TO TREAT PATIENTS AND DISCOVER CURES DTG DTG DTG @DESKTOPGENETICS| @DOYLE_RILEY
  • 5. 5 CRISPR IS GETTING BIGGER EVERY DAY GLOBAL REACH OF CRIPR LABS & SCIENTISTS GENOME EDITING PLASMIDS DISTRIBUTED BY ADDGENE FROM 2005 TO 2014 CRISPR TALE ZFN SYNBIO @DESKTOPGENETICS| @DOYLE_RILEY
  • 6. 6 GENOME EDITING PROCESS AI REQUIRED TO AUTOMATE DECISION MAKING THROUGHOUT THE PROCESS @DESKTOPGENETICS| @DOYLE_RILEY
  • 7. 7 AGENDA 1. Brief intro to CRISPR 2. Applying machine learning to DNA 3. Our CRISPR design process 4. The path forward @DESKTOPGENETICS| @DOYLE_RILEY
  • 8. © Desktop Genetics Ltd. 2016 A n Illumina-backed company 1. BRIEF INTRO TO CRISPR
  • 9. 9 CRISPR AT A GLANCE MOLECULAR INTERACTIONS AND MODELS @DESKTOPGENETICS| @DOYLE_RILEY Email Pydata@deskgen.com for video and 3D molecule
  • 10. 10 CRISPR OVERVIEW PROGRAMMABLE TWO COMPONENT SYSTEM CAS9 NUCLEASE RNA COMPONENT @DESKTOPGENETICS| @DOYLE_RILEY
  • 11. 11 CRISPR OVERVIEW PROGRAMMABLE TWO COMPONENT SYSTEM CAS9 NUCLEASE RNA COMPONENT VARIABLE 20 BP GUIDE RNA (sgRNA) CONSTANT REGION (tracrRNA) @DESKTOPGENETICS| @DOYLE_RILEY
  • 12. 12 CRISPR OVERVIEW PROGRAMMABLE TWO COMPONENT SYSTEM CAS9 NUCLEASE RNA COMPONENT ACTIVE RNA-GUIDED CAS9 COMPLEX @DESKTOPGENETICS| @DOYLE_RILEY
  • 13. 13 CRISPR OVERVIEW CUT + REPAIR = GENOME EDITING NGG PAM SEQUENCE NUCLEASE DOMAINS GENOME SEQUENCE sgRNA-DNA BASE PAIRING @DESKTOPGENETICS| @DOYLE_RILEY
  • 14. 14 WHY EDIT GENOMES? RESEARCH AND DEVELOPMENT → CLINICAL CURES - Degenerative blindness - Custom cancer models - Humanization of heart valves - Swine fever resistance - HIV eradicated in vitro - Immuno-oncology - Clinical trials cured cancer - Clinical trials cured HIV @DESKTOPGENETICS| @DOYLE_RILEY
  • 15. © Desktop Genetics Ltd. 2016 A n Illumina-backed company 2. APPLYING MACHINE LEARNING TO DNA
  • 16. 16 CRISPR HAS SEVERAL COMPUTATIONAL PROBLEMS WHAT ARE WE ACTUALLY TRYING TO PREDICT ANYWAY? ActivitySpecificity Patient Outcome Biological Importance Instrument Signal @DESKTOPGENETICS| @DOYLE_RILEY
  • 17. RECURRING CRISPR PROBLEMS USER ANALYTICS REVEALED COMMON PROBLEMS HUMAN MACHINE Guide selection Get tired of choosing many guides for each gene Considers all guides, choses consistently Scoring function(s) Undue weight given to some scoring functions Weights of features carefully controlled Genotype data Considers only reference genome Considers actual genome sequence Overall objective Few “winning” guides Balanced, orthogonal training set @DESKTOPGENETICS| @DOYLE_RILEY 17
  • 18. SELECTION OF BIOCHEMISTRY BASED FEATURES SEVERAL MACRO & CONTEXTUAL FEATURES IDENTIFIED FROM BIOCHEMISTRY LITERATURE DESIGN RULE TYPE RANGE CONSIDERS RESULT NAG PAM (Control) Negative {0,1} (PAM) Sequence ✔ GC% Negative [0,1] Sequence ✔ Homopolymer (N4) Negative {0,1} Sequence ✔ SNP Collision Negative {0,1} Location ✔ UUU Triplet Negative {0,1} Sequence ✔ Non-constitutive Transcript Negative {0,1} Location ✔ 1st third CDS Positive {0,1} Location ✖ Functional domain Positive {0,1} Location ? Truncated guide Positive {0,1} Sequence ✖ Microhomology Positive [0,1] Sequence ✖ Specificity (Hsu, 2013) Negative [0,1] Sequence ? @DESKTOPGENETICS| @DOYLE_RILEY 18
  • 19. 19 GUIDE RNA SEQUENCE FEATURES SEQUENCES EMBEDDED INTO VECTOR SPACE USING ONE-HOT ENCODING OF K-MER@POSITION Number of non-overlapping, position-dependent sequence features is: ● We used k [1,3] for ~4700 features total ● Resulting embedding is very sparse. ● Too many dimensions + insufficient data = over fitting where k = feature size (nt) and n is length of sequence 4 States: A → [1000], C → [0100], G → [0010], T → [0001] at each position in n; repeat for all k-mers. @DESKTOPGENETICS| @DOYLE_RILEY
  • 20. 20 REAL GENOMES HAVE MUTATIONS INDIVIDUAL GENOME VARIANTS CAN GENERATE NOISE @DESKTOPGENETICS| @DOYLE_RILEY
  • 21. 21 GENOME SEQUENCING IS DATA INTENSIVE OUR SYSTEM NEEDS TO HANDLE LARGE VOLUMES OF DATA 500 GB + 1 GB + 2 GB + 2 GB + @DESKTOPGENETICS| @DOYLE_RILEY
  • 22. 22 DESKGEN INFRASTRUCTURE HANDLING GENOME DATA AT SCALE SaltStack Control Layer orchestrates instance groups in both development and production environments. Github Sequencer Remote Stores Salt Master Vendors Browser BioInfo Worker BioInfo Worker BioInfo Workers Cloud Storage BioInfo Worker BioInfo Worker Production Hosts PRODUCT TEAMTECH R&D TEAM @DESKTOPGENETICS| @DOYLE_RILEY Google Cloud Platform
  • 23. 23 DESKGEN HOST LEVEL ARCHITECTURE GENOME CONTEXT MADE AVAILABLE ACROSS STEPS OF ML PIPELINE ML PIPELINE either imports Python code directly or uses CLI commands. dgregistry (Tornado) dgcli (Click) genome_fs (C ext) Omics Tools (Click) Postgresql (Alembic) manifest (Python2) salt-minion (Salt) GCStorage (gcloud sdk) Specialized Services (C ext) Browser (Vue.js) Vendors (Requests) Align to Genome Compute Features Compute Performance Values Train Model Report and Bank Model MACHINE LEARNING ENV (Jupyter Notebooks + PANDAS + SciKit Learn) IN-SILICO OF TARGET GENOME (Common Instance Image) API BioInfo Library (C ext) @DESKTOPGENETICS| @DOYLE_RILEY
  • 24. MEASURING GUIDE PERFORMANCE EVOLUTION SAYS GUIDES ACTIVE AGAINST ESSENTIAL GENES SHOULD KILL CELLS 24 PLASMID POOL Transfection INITIAL TIMEPOINT CRISPR KO & Depletion FINAL TIMEPOINT Day 0 NGS Day 23 NGS sgRNA Count sgRNA Count @DESKTOPGENETICS| @DOYLE_RILEY
  • 25. GUIDE SCORING NON-ESSENTIAL GENE TARGETS RESULT IN UNDETECTABLE GUIDES ● Remove non-essential genes from analysis as sgRNA activity cannot be detected. @DESKTOPGENETICS| @DOYLE_RILEY 25
  • 26. VARIANCE OF THE SAME GUIDE AN ACTIVE GUIDE In active guides, there is little variance between biological replicates, and different experiments. @DESKTOPGENETICS| @DOYLE_RILEY 26
  • 27. VARIANCE OF THE SAME GUIDE AN INACTIVE GUIDE In inactive guides - there is large variance between biological replicates, and different experiments @DESKTOPGENETICS| @DOYLE_RILEY 27
  • 28. GUIDE SCORING REMOVING NON-ESSENTIAL GENES INCREASES ROBUSTNESS OF GUIDE ACTIVITY DETECTION 28 Wang (1878) Strain H (291) Strain A (396) 166 125 235 161 1518 Full Essential‘Essential’ Genes Sabatini data: Wang et al. Science. 2015 Nov 27;350(6264):1096-101 log2fc Doench 2016 Score (Full) log2fc Wang et al. (2015): Conducted CRISPR screen in the near-haploid human KBM7 chronic myelogenous leukemia (CML) cell line and confirmed essentiality using gene-trap. @DESKTOPGENETICS| @DOYLE_RILEY
  • 29. DATA ANALYSIS PIPELINE 1. Normalization 1.1. Normalized so that read count across columns was consistent per experiment 2. Selection 2.1. Removed rows where there was a read count < 30 2.2. Removed rows where gene was 'NA' or null 2.3. Removed guides targeting non-coding regions 2.4. Selected guides targeting essential genes using MAGeCK 2.4.1. Human: 6509 guides (5.61% of dataset) 2.4.2. Mouse: 8006 guides (5.58% of dataset) 3. Scoring derived from first-order kinetic rate law POST-PROCESSING AND NORMALIZATION CRITICAL TO MODEL @DESKTOPGENETICS| @DOYLE_RILEY 29
  • 30. © Desktop Genetics Ltd. 2016 A n Illumina-backed company 3. OUR CRISPR DESIGN PROCESS
  • 31. LINEAR MODEL PERFORMED SURPRISINGLY WELL BOTH PEARSON AND SPEARMAN METRICS IMPROVED Comparison of performance between DTG and Doench 2016 models ● Executing this algorithm found DTG’s model is an 84% improvement over state of the art (Doench 2016) ● Generalized Linear Model performed as well as ConvNet and RandomForest @DESKTOPGENETICS| @DOYLE_RILEY 31
  • 32. MODEL DOES NOT GENERALIZE ACROSS SPECIES Comparison of performance between DTG and Doench models MOUSE PERFORMANCE ALSO IMPROVED BUT IS NOT AS GOOD AS HUMAN MODEL ● Executing this algorithm found DTG’s model is an 100% improvement over Doench 2016 ● No literature list of essential genes available for Mouse ● Still unclear why performance is different @DESKTOPGENETICS| @DOYLE_RILEY 32
  • 33. MODEL COEFFICIENTS CONFIRM POSITION-DEPENDENT SEQUENCE EFFECT ● We examined the coefficients of the ridge regression model ● We determined the importance of single bases varies a lot of the range of the flank PRIOR WORK EXTENDED INTO NEW TRAINING DATA @DESKTOPGENETICS| @DOYLE_RILEY 33
  • 34. MARGINAL BENEFIT OF ADDITIONAL DATA HUMAN AND MOUSE MODELS BOTH IMPROVE AS FURTHER WET LAB DATA ADDED ● Relationship between model performance and data used = more data will help build a better model SpearmanCorrelation SpearmanCorrelation @DESKTOPGENETICS| @DOYLE_RILEY 34
  • 35. © Desktop Genetics Ltd. 2016 A n Illumina-backed company 4. THE PATH FORWARD
  • 36. 36 CONCLUSIONS 1. De-noising and normalization of the training data and feature engineering resulted in a linear model which outperformed more complex types. 2. Linear model currently predicts guide performance up to current variance seen experimentally. 3. Model generalized across cell lines but not across species. We are currently unsure why. 4. Prior knowledge about essential genes and target genome significantly improved the model (ie. human genome better curated than mouse). 5. Model performance increased linearly with more training data, but less rapidly for mouse. @DESKTOPGENETICS| @DOYLE_RILEY SIGNIFICANTLY MORE ACCURATE GUIDE ACTIVITY PREDICTIONS WERE POSSIBLE
  • 37. 37 LESSONS LEARNED 1. Task queues (Celery), microservices, containers (Docker, Kubernetes), and Postgresql significantly increased dev-ops burden, dependencies, code maintenance requirements, and learning curve without increasing productivity. Pure python code nearly always ended up getting used more. 2. Scikit Learn Model serialization (cPickle) is not portable as ABI breaks between minor and patch versions. Significant source of errors in production. Acute need for better way to serialize more complex models. 3. Docker Containers did not provide a “silver bullet” replacement for Python packaging, dependency management, or model portability. Instead they introduced significant learning curve as most bioinformatics tools expect direct access to a shared filesystem. 4. Data Science and BioInformatics team strongly preferred working with Conda environment vs. PyEnv + VirutualEnv. 5. Google Cloud Storage critical to working with large genomic data sets. @DESKTOPGENETICS| @DOYLE_RILEY ETL PIPELINE, FEATURES, AND DATA PROCESSING WERE CRITICAL TO SUCCESS
  • 38. 38 TAKING CRISPR AI TO THE CLINIC EXTENDING APPROACH TO IMPROVE GENOME EDITING SAFETY AND EFFICACY @DESKTOPGENETICS| @DOYLE_RILEY
  • 39. RECOGNITION TECH, BIOTECH AND EVERYTHING IN BETWEEN 39@DESKTOPGENETICS| @DOYLE_RILEY
  • 40. GETTING INVOLVED WITH CRISPR OPTIMISE AND IMPROVE 1. Dataset available on GitHub – try it yourself https://github.com/DeskGen/guide-cluster 2. Larger dataset with API coming March 2017 https://github.com/DeskGen/dgcli 3. Hiring full time at Desktop Genetics https://www.deskgen.com/landing/company#about-careers 40@DESKTOPGENETICS| @DOYLE_RILEY
  • 41. JOBS AT DESKTOP GENETICS HQ JOIN US IN SHOREDITCH - TELL YOUR FRIENDS! 41@DESKTOPGENETICS| @DOYLE_RILEY
  • 42. GET EVERYTHING YOU JUST HEARD AND MORE SLIDES, FUTURE MEETUPS, CRISPR RESOURCES, JOB OPPORTUNITIES PyData@deskgen.com 42@DESKTOPGENETICS| @DOYLE_RILEY Send an empty email to
  • 43. © Desktop Genetics Ltd. 2017 pydata@deskgen.com