Desktop Genetics is developing an AI-powered system for personalized genome editing using CRISPR. They applied machine learning to improve the prediction of guide RNA activity and developed a linear model that outperforms existing methods. Their model performance increases with additional training data from wet lab experiments. Desktop Genetics aims to take their CRISPR AI approach to the clinic to improve genome editing safety and efficacy.
3. BIGGEST BIOTECH BREAKTHROUGH OF THE CENTURY
GLOBAL COVERAGE ACROSS SCIENCE AND TECH MEDIA
GENE EDITING SAVES GIRL DYING FROM LEUKAEMIA IN WORLD FIRST (5 November 2015)
HIV GENES HAVE BEEN CUT OUT OF LIVE ANIMALS USING CRISPR (15 May 2016)
CHINA USED CRISPR TO FIGHT CANCER IN A REAL, LIVE HUMAN (18 November 2016)
CRISPR: GENE EDITING IS JUST THE BEGINNING (07 March 2016)
@DESKTOPGENETICS| @DOYLE_RILEY
4. GENE THERAPY TACKLES DISEASES
CRISPR IS USED TO TREAT PATIENTS AND DISCOVER CURES
5. CRISPR IS GETTING BIGGER EVERY DAY
GLOBAL REACH OF CRISPR LABS & SCIENTISTS
GENOME EDITING PLASMIDS DISTRIBUTED BY ADDGENE FROM 2005 TO 2014
[Chart legend: CRISPR, TALE, ZFN, SYNBIO]
6. GENOME EDITING PROCESS
AI REQUIRED TO AUTOMATE DECISION MAKING THROUGHOUT THE PROCESS
7. AGENDA
1. Brief intro to CRISPR
2. Applying machine learning to DNA
3. Our CRISPR design process
4. The path forward
16. CRISPR HAS SEVERAL COMPUTATIONAL PROBLEMS
WHAT ARE WE ACTUALLY TRYING TO PREDICT ANYWAY?
Activity
Specificity
Patient Outcome
Biological Importance
Instrument Signal
17. RECURRING CRISPR PROBLEMS
USER ANALYTICS REVEALED COMMON PROBLEMS
Guide selection
  HUMAN: Get tired of choosing many guides for each gene
  MACHINE: Considers all guides, chooses consistently
Scoring function(s)
  HUMAN: Undue weight given to some scoring functions
  MACHINE: Weights of features carefully controlled
Genotype data
  HUMAN: Considers only reference genome
  MACHINE: Considers actual genome sequence
Overall objective
  HUMAN: Few “winning” guides
  MACHINE: Balanced, orthogonal training set
18. SELECTION OF BIOCHEMISTRY BASED FEATURES
SEVERAL MACRO & CONTEXTUAL FEATURES IDENTIFIED FROM BIOCHEMISTRY LITERATURE
DESIGN RULE TYPE RANGE CONSIDERS RESULT
NAG PAM (Control) Negative {0,1} (PAM) Sequence ✔
GC% Negative [0,1] Sequence ✔
Homopolymer (N4) Negative {0,1} Sequence ✔
SNP Collision Negative {0,1} Location ✔
UUU Triplet Negative {0,1} Sequence ✔
Non-constitutive Transcript Negative {0,1} Location ✔
1st third CDS Positive {0,1} Location ✖
Functional domain Positive {0,1} Location ?
Truncated guide Positive {0,1} Sequence ✖
Microhomology Positive [0,1] Sequence ✖
Specificity (Hsu, 2013) Negative [0,1] Sequence ?
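A few of the sequence-based design rules in the table can be sketched as simple feature functions. This is only an illustration, not the DESKGEN code: the function name is hypothetical, "Homopolymer (N4)" is assumed to mean a run of four or more identical bases, and the UUU triplet is assumed to appear as TTT in the DNA form of the guide.

```python
import re


def design_rule_features(guide: str) -> dict:
    """Compute three illustrative sequence features for a DNA guide (hypothetical helper)."""
    seq = guide.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("guide must be a non-empty string over A, C, G, T")

    # GC%: fraction of G or C bases, in the range [0, 1].
    gc_fraction = (seq.count("G") + seq.count("C")) / len(seq)

    # Assumption: "Homopolymer (N4)" = any base repeated 4 or more times in a row.
    homopolymer = 1 if re.search(r"A{4}|C{4}|G{4}|T{4}", seq) else 0

    # Assumption: a UUU triplet in the guide RNA shows up as TTT in the DNA sequence.
    uuu_triplet = 1 if "TTT" in seq else 0

    return {
        "gc": gc_fraction,
        "homopolymer_n4": homopolymer,
        "uuu_triplet": uuu_triplet,
    }
```

Location-based rules (SNP collision, transcript, CDS position) need genome annotation and are left out of this sketch.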
19. GUIDE RNA SEQUENCE FEATURES
SEQUENCES EMBEDDED INTO VECTOR SPACE USING ONE-HOT ENCODING OF K-MER@POSITION
Number of non-overlapping, position-dependent sequence features is: [equation not preserved in transcript]
where k = feature size (nt) and n = length of sequence
4 states: A → [1000], C → [0100], G → [0010], T → [0001] at each position in n; repeat for all k-mers.
● We used k ∈ [1,3] for ~4700 features total
● Resulting embedding is very sparse.
● Too many dimensions + insufficient data = overfitting
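A minimal sketch of a one-hot k-mer@position embedding follows. The function names are hypothetical, and the windowing is an assumption: every start position is used for each k (stride 1), which may differ from the counting rule on the slide, whose formula is not preserved here.

```python
from itertools import product

BASES = "ACGT"


def kmer_position_index(n: int, ks=(1, 2, 3)) -> dict:
    """Map every (k, position, k-mer) triple to a column index for sequences of length n."""
    index = {}
    for k in ks:
        for pos in range(n - k + 1):  # assumption: sliding window with stride 1
            for kmer in product(BASES, repeat=k):
                index[(k, pos, "".join(kmer))] = len(index)
    return index


def encode(seq: str, index: dict, ks=(1, 2, 3)) -> list:
    """Return a sparse 0/1 vector with one active column per (k, position) window."""
    seq = seq.upper()
    vec = [0] * len(index)
    for k in ks:
        for pos in range(len(seq) - k + 1):
            key = (k, pos, seq[pos:pos + k])
            if key not in index:
                raise ValueError(f"window {key} is outside the feature index")
            vec[index[key]] = 1
    return vec
```

Under this windowing the number of columns is the sum over k of 4^k × (n − k + 1), and only one column per window is non-zero, which is why the embedding is so sparse.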
20. REAL GENOMES HAVE MUTATIONS
INDIVIDUAL GENOME VARIANTS CAN GENERATE NOISE
21. GENOME SEQUENCING IS DATA INTENSIVE
OUR SYSTEM NEEDS TO HANDLE LARGE VOLUMES OF DATA
500 GB + 1 GB + 2 GB + 2 GB +
22. DESKGEN INFRASTRUCTURE
HANDLING GENOME DATA AT SCALE
SaltStack Control Layer orchestrates instance groups in both development and production environments.
[Diagram, Google Cloud Platform. Components: Github, Sequencer, Remote Stores, Salt Master, Vendors, Browser, BioInfo Workers, Cloud Storage, Production Hosts. Teams: PRODUCT TEAM, TECH R&D TEAM]
23. DESKGEN HOST LEVEL ARCHITECTURE
GENOME CONTEXT MADE AVAILABLE ACROSS STEPS OF ML PIPELINE
ML PIPELINE either imports Python code directly or uses CLI commands.
Components: dgregistry (Tornado), dgcli (Click), genome_fs (C ext), Omics Tools (Click), Postgresql (Alembic), manifest (Python2), salt-minion (Salt), GCStorage (gcloud sdk), Specialized Services (C ext), Browser (Vue.js), Vendors (Requests), API, BioInfo Library (C ext)
Pipeline steps: Align to Genome → Compute Features → Compute Performance Values → Train Model → Report and Bank Model
MACHINE LEARNING ENV (Jupyter Notebooks + PANDAS + SciKit Learn)
IN-SILICO OF TARGET GENOME (Common Instance Image)
24. MEASURING GUIDE PERFORMANCE
EVOLUTION SAYS GUIDES ACTIVE AGAINST ESSENTIAL GENES SHOULD KILL CELLS
PLASMID POOL → Transfection → INITIAL TIMEPOINT (Day 0 NGS, sgRNA count) → CRISPR KO & Depletion → FINAL TIMEPOINT (Day 23 NGS, sgRNA count)
25. GUIDE SCORING
NON-ESSENTIAL GENE TARGETS RESULT IN UNDETECTABLE GUIDES
● Remove non-essential genes from analysis as sgRNA activity cannot be detected.
26. VARIANCE OF THE SAME GUIDE
AN ACTIVE GUIDE
In active guides, there is little variance between biological replicates and different experiments.
27. VARIANCE OF THE SAME GUIDE
AN INACTIVE GUIDE
In inactive guides, there is large variance between biological replicates and different experiments.
28. GUIDE SCORING
REMOVING NON-ESSENTIAL GENES INCREASES ROBUSTNESS OF GUIDE ACTIVITY DETECTION
[Figure: Venn diagram of essential-gene sets: Wang (1878), Strain H (291), Strain A (396); region counts 166, 125, 235, 161, 1518. Plot labels: Full, Essential, ‘Essential’ Genes. Axes: log2fc, Doench 2016 Score (Full)]
Sabatini data: Wang et al. Science. 2015 Nov 27;350(6264):1096-101
Wang et al. (2015): Conducted CRISPR screen in the near-haploid human KBM7 chronic myelogenous leukemia (CML) cell line and confirmed essentiality using gene-trap.
29. DATA ANALYSIS PIPELINE
1. Normalization
1.1. Normalized so that read count across columns was consistent per experiment
2. Selection
2.1. Removed rows where there was a read count < 30
2.2. Removed rows where gene was 'NA' or null
2.3. Removed guides targeting non-coding regions
2.4. Selected guides targeting essential genes using MAGeCK
2.4.1. Human: 6509 guides (5.61% of dataset)
2.4.2. Mouse: 8006 guides (5.58% of dataset)
3. Scoring derived from first-order kinetic rate law
POST-PROCESSING AND NORMALIZATION CRITICAL TO MODEL
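The normalization, selection and scoring steps can be sketched roughly as below. Several details are assumptions rather than the talk's actual method: the function names, filtering on the raw day-0 count, the pseudocount, and the specific first-order form N(t) = N0·exp(−kt) over the 23-day screen. The MAGeCK-based essential-gene selection is not reproduced.

```python
import math


def normalize(counts, target_total=1_000_000):
    """Scale raw read counts so that every sample sums to the same total."""
    total = sum(counts.values())
    return {guide: n * target_total / total for guide, n in counts.items()}


def depletion_scores(day0, day23, gene_of, days=23.0, min_reads=30, pseudocount=1.0):
    """Return {guide: (log2 fold change, first-order rate constant per day)}."""
    norm0, norm23 = normalize(day0), normalize(day23)
    scores = {}
    for guide, raw0 in day0.items():
        if raw0 < min_reads:  # step 2.1 (assumed to apply to the raw day-0 count)
            continue
        if gene_of.get(guide) in (None, "NA"):  # step 2.2
            continue
        # Pseudocount (an assumption) avoids log(0) for fully depleted guides.
        ratio = (norm23.get(guide, 0.0) + pseudocount) / (norm0[guide] + pseudocount)
        log2fc = math.log2(ratio)
        # Assumed rate law: N(t) = N0 * exp(-k * t)  =>  k = -ln(N / N0) / t
        rate = -math.log(ratio) / days
        scores[guide] = (log2fc, rate)
    return scores
```

A positive rate constant means the guide was depleted between the two timepoints, which is the expected signature of an active guide against an essential gene.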
31. LINEAR MODEL PERFORMED SURPRISINGLY WELL
BOTH PEARSON AND SPEARMAN METRICS IMPROVED
Comparison of performance between DTG and Doench 2016 models
● Executing this algorithm found DTG’s model is an 84% improvement over state of the art (Doench 2016)
● Generalized Linear Model performed as well as ConvNet and RandomForest
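For reference, the two metrics can be computed as in the following dependency-free sketch (function names are hypothetical; in practice a statistics library would normally be used). Pearson measures linear agreement between predicted and measured scores, while Spearman is Pearson applied to ranks and so only measures whether guides are ordered correctly.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

A monotonic but non-linear relationship gives a Spearman value of 1 and a Pearson value below 1, which is one reason to report both.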
32. MODEL DOES NOT GENERALIZE ACROSS SPECIES
Comparison of performance between DTG and Doench models
MOUSE PERFORMANCE ALSO IMPROVED BUT IS NOT AS GOOD AS HUMAN MODEL
● Executing this algorithm found DTG’s model is a 100% improvement over Doench 2016
● No literature list of essential genes available for Mouse
● Still unclear why performance is different
33. MODEL COEFFICIENTS CONFIRM POSITION-DEPENDENT SEQUENCE EFFECT
● We examined the coefficients of the ridge regression model
● We determined that the importance of single bases varies a lot over the range of the flank
PRIOR WORK EXTENDED INTO NEW TRAINING DATA
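To show what inspecting ridge coefficients involves, here is a small closed-form ridge fit, w = (XᵀX + λI)⁻¹Xᵀy, written without dependencies. It is a toy sketch under stated assumptions: no intercept term, hypothetical function names, and no relation to the actual DESKGEN training code, which per the talk used scikit-learn.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ridge_fit(X, y, lam=1.0):
    """Return ridge weights for rows X and targets y (no intercept; lam > 0 recommended)."""
    d = len(X[0])
    xtx = [
        [sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0) for j in range(d)]
        for i in range(d)
    ]
    xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    return solve(xtx, xty)
```

With one-hot positional features, each weight belongs to one base at one position, so plotting the weights by position shows where along the guide and flank a base matters most. The penalty λ shrinks the weights toward zero, which helps when features outnumber observations.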
34. MARGINAL BENEFIT OF ADDITIONAL DATA
HUMAN AND MOUSE MODELS BOTH IMPROVE AS FURTHER WET LAB DATA ADDED
● Relationship between model performance and data used = more data will help build a better model
[Plots: Spearman correlation (y-axis) for the human and mouse models]
36. CONCLUSIONS
1. De-noising and normalization of the training data and feature engineering resulted in a linear model which outperformed more complex types.
2. Linear model currently predicts guide performance up to current variance seen experimentally.
3. Model generalized across cell lines but not across species. We are currently unsure why.
4. Prior knowledge about essential genes and target genome significantly improved the model (i.e. human genome better curated than mouse).
5. Model performance increased linearly with more training data, but less rapidly for mouse.
SIGNIFICANTLY MORE ACCURATE GUIDE ACTIVITY PREDICTIONS WERE POSSIBLE
37. LESSONS LEARNED
1. Task queues (Celery), microservices, containers (Docker, Kubernetes), and Postgresql significantly increased dev-ops burden, dependencies, code maintenance requirements, and learning curve without increasing productivity. Pure Python code nearly always ended up getting used more.
2. Scikit Learn Model serialization (cPickle) is not portable as ABI breaks between minor and patch versions. Significant source of errors in production. Acute need for better way to serialize more complex models.
3. Docker Containers did not provide a “silver bullet” replacement for Python packaging, dependency management, or model portability. Instead they introduced significant learning curve as most bioinformatics tools expect direct access to a shared filesystem.
4. Data Science and BioInformatics team strongly preferred working with Conda environment vs. PyEnv + VirtualEnv.
5. Google Cloud Storage critical to working with large genomic data sets.
ETL PIPELINE, FEATURES, AND DATA PROCESSING WERE CRITICAL TO SUCCESS
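For a plain linear model, one possible way around the pickle portability problem in lesson 2 is to store only the coefficients in a text format. The sketch below is a suggestion, not something the talk describes; the function names, the JSON layout and the feature-name strings are all hypothetical, and it does not cover the more complex models the slide mentions.

```python
import json


def dump_linear_model(coef, intercept, feature_names):
    """Serialize a linear model as JSON text, independent of any library version."""
    if len(coef) != len(feature_names):
        raise ValueError("coef and feature_names must have the same length")
    payload = {
        "format": 1,  # hypothetical schema version for future changes
        "intercept": float(intercept),
        "coef": {name: float(w) for name, w in zip(feature_names, coef)},
    }
    return json.dumps(payload, sort_keys=True)


def load_linear_model(text):
    """Return (coef_by_feature, intercept) from text produced by dump_linear_model."""
    data = json.loads(text)
    return data["coef"], data["intercept"]


def predict(coef, intercept, features):
    """Score one sample given as {feature_name: value}; missing features count as 0."""
    return intercept + sum(w * features.get(name, 0.0) for name, w in coef.items())
```

Because prediction only needs a dot product, the stored file can be read by any Python version, or by another language entirely, without importing the training library.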
38. TAKING CRISPR AI TO THE CLINIC
EXTENDING APPROACH TO IMPROVE GENOME EDITING SAFETY AND EFFICACY
40. GETTING INVOLVED WITH CRISPR
OPTIMISE AND IMPROVE
1. Dataset available on GitHub – try it yourself
https://github.com/DeskGen/guide-cluster
2. Larger dataset with API coming March 2017
https://github.com/DeskGen/dgcli
3. Hiring full time at Desktop Genetics
https://www.deskgen.com/landing/company#about-careers
41. JOBS AT DESKTOP GENETICS HQ
JOIN US IN SHOREDITCH - TELL YOUR FRIENDS!
42. GET EVERYTHING YOU JUST HEARD AND MORE
SLIDES, FUTURE MEETUPS, CRISPR RESOURCES, JOB OPPORTUNITIES
Send an empty email to PyData@deskgen.com