
#Code2Cure: A field guide for software engineers on their journey to the world of genomics.

10,339 views

Recording on YouTube: https://www.youtube.com/watch?v=G419mmAL9qw

We are at the beginning of a pivotal chapter in the history of medicine.

It took more than a decade and billions of dollars to assemble the first human genome. Today, this tedious task can be done in only a few hours, at a cost as low as $1000.

This historical advancement in the genomics world has created a significant challenge and opportunity for storing, analyzing and understanding genomics information. Fortunately, the software industry, while working on massive ad networks, video games, and social applications, has invented tools, approaches and solutions that can directly be applied to the genomics world and enable the future of medicine.

This talk serves as a preliminary field guide for general software engineers with no experience in genomics, covering what it takes to transition from the internet world to the genomics world: an industry ripe for innovation, with great potential for applying big data, algorithms, system design and user interface design best practices from the software world.

Lots of us are tired of working on yet another ad network, social game, or mobile game. We want to work on things that change the world and affect human life in a positive way. And what would be better than curing human diseases?

At the end of this talk you will know what skills you can bring to the genomics world from the software world, where to find the best resources for a software engineer exploring genomics, and which open-source tools and libraries are most widely used in the genomics industry.

Amirhossein Kiani
Sr. Lead Software Engineer
Bina Technologies Inc.
www.bina.com

Published in: Data & Analytics

  1. #Code2Cure: Engineering Genomics. A field guide for software engineers on their journey to the world of genomics. Amirhossein Kiani, Sr. Lead Software Engineer (@mirkiani, amir@bina.com). Image courtesy of http://circos.ca. DISCLAIMER: The views expressed in this talk are mine alone and not those of my employer. Bina products are for Research Use Only. Not for use in diagnostic procedures. Also, I’m a computer scientist by training, trying to help those with a similar background learn about the field of genomics. The scientific concepts in this talk are therefore highly simplified.
  2. https://www.youtube.com/watch?v=G1ZLyGW8rKY
  3. Why Genomics?
      Past: $3,000,000,000, 13 years (http://en.wikipedia.org/wiki/Human_Genome_Project)
      Present: $1000, 24 hours
      Future
  4. Why Genomics? Some things we could do with genomics:
      • Carrier Screening
      • Prenatal Screening
      • Newborn Screening
      • Inherited Disease
      • Infectious Disease
      • Cancer Diagnostics
      • Microbiome
      • Personalized Medicine
  5. But I have no genomics background! It’s ok.
  6. My personal story… Now, Then
  7. What is a cell? What is DNA?
      http://en.wikipedia.org/wiki/Cell_%28biology%29
      http://en.wikipedia.org/wiki/DNA
      Images courtesy of Pinterest and Tumblr
  8. Crash Course on Genomics. Genomics: the field that studies the structure of genomes.
      DNA → RNA → Protein → You!
      http://en.wikipedia.org/wiki/Genomics
      http://en.wikipedia.org/wiki/RNA
      http://en.wikipedia.org/wiki/Protein
  9. How do we figure out what’s in DNA? Like everything else, we turn the analog signal into a digital one, and then analyze it.
      Sequencers (Illumina, Ion Torrent, Genia, …) → Primary Analysis → FASTQ Format
      http://en.wikipedia.org/wiki/DNA_sequencing
      http://en.wikipedia.org/wiki/FASTQ_format
      Image courtesy of PersonalGenomes.org
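To make the FASTQ output of primary analysis concrete, here is a minimal sketch of a reader, assuming the common four-line record layout (header, sequence, `+` separator, quality string) and the Phred+33 quality encoding. The function name `read_fastq` and the sample record are made up for illustration; real files may need a more tolerant parser.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (read_id, sequence, quality_scores) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line carries no data we need
        quality = handle.readline().rstrip("\n")
        # Phred+33 encoding (an assumption): score = ASCII code of the character - 33
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores  # drop the leading '@' from the read id


# A single made-up record, just to exercise the parser.
example = StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
print(records)
```

Each base gets one quality character, so the sequence and the score list always have the same length.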
  10. Raw Data to Variants (Secondary Analysis). Step 1. Alignment
      http://en.wikipedia.org/wiki/DNA_sequencing
      http://en.wikipedia.org/wiki/FASTQ_format
      Images courtesy of Wall Woodworks and Wallpaper Up
  11. From “Raw” DNA to “Variants” (Secondary Analysis). Step 1. Short-Read Sequence Alignment
      Reference Genome (~3B bases):
        AACACACCCAAGGGGGAAACTTTGGTCCACCCAAGGGGGAAACCCAAGGGGGAAACTTTG
      [Figure: short reads stacked under the reference genome, with the labels Coverage, Deletion, and Single Nucleotide Polymorphism]
        ACTTTGGTCCACCCAAGG
        AAGGGGGACACCCAAGGACACCC__GGGGGAAACT
        GGACACCCAAGGGGGAA
        ACCCAAGGGGGACACCC
        ACCC__GGGGGAAACTTTG
        AACACACCC__GGGGGAA
      http://en.wikipedia.org/wiki/Reference_genome
      http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism
      http://en.wikipedia.org/wiki/Indel
      http://en.wikipedia.org/wiki/Structural_variation
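The "Coverage" label on the alignment slide refers to how many reads overlap each reference position. As a rough sketch, assuming each read has already been placed at a known 0-based start position (the toy reference and reads below are invented, not the ones from the slide), per-base coverage can be counted like this:

```python
def coverage(ref_length, alignments):
    """Return per-base read depth for reads given as (start, sequence) pairs."""
    depth = [0] * ref_length
    for start, read in alignments:
        # Every reference position the read overlaps gains one unit of depth.
        for pos in range(start, min(start + len(read), ref_length)):
            depth[pos] += 1
    return depth


# Toy data: three reads that match a 13-base reference at offsets 0, 2 and 4.
reference = "AACACACCCAAGG"
alignments = [(0, "AACACA"), (2, "CACACC"), (4, "CACCCAAGG")]
depth = coverage(len(reference), alignments)
print(depth)
```

Real pipelines compute this from the aligner's output rather than from hand-placed reads, but the counting idea is the same.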
  12. From “Raw” DNA to “Variants” (Secondary Analysis)
      • Burrows-Wheeler Aligner (BWA)
        • Uses Burrows-Wheeler transform (also used in bzip2)
        • Uses Smith-Waterman algorithm
        • Written in C
        • Uses ~4GB memory for human genome
      Example: $ bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
      http://bio-bwa.sourceforge.net
      http://bioinformatics.oxfordjournals.org/content/25/14/1754.full.pdf+html
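The two algorithms named on the BWA slide can be illustrated with textbook versions. This is a naive sketch for intuition only, not how BWA implements them: BWA builds a compressed index on top of the transform and uses heavily optimized alignment code. The scoring parameters (`match`, `mismatch`, `gap`) are arbitrary assumptions.

```python
def bwt(text):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    text += "$"  # end marker that sorts before the letters
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            score[i][j] = max(0, diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            best = max(best, score[i][j])
    return best


print(bwt("banana"))
print(smith_waterman("ACGT", "TTACGTTT"))
```

The transform groups similar contexts together, which is what makes it useful both for compression and for fast substring search; Smith-Waterman then scores how well a read matches a candidate location.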
  13. From “Raw” DNA to “Variants” (Secondary Analysis)
      FASTQ → Alignment → SAM → Convert to Binary, with BGZF (blocked gzip) compression (samtools) → BAM File + BAM File Index
      http://samtools.github.io/hts-specs/SAMv1.pdf
      http://samtools.github.io
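SAM is the text form of the alignment output, with one tab-separated record per read. The sketch below pulls out a few of the mandatory columns and splits the CIGAR string; the `SamRecord` type, the function names, and the sample line are illustrative assumptions, and in practice a library built on the SAM/BAM specification would be the safer choice.

```python
import re
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str  # read name
    flag: int   # bitwise flags (0x4 means unmapped)
    rname: str  # reference sequence name
    pos: int    # 1-based leftmost mapping position
    mapq: int   # mapping quality
    cigar: str  # alignment operations, e.g. 8M2D4M
    seq: str    # read sequence


def parse_sam_line(line):
    """Parse the mandatory columns we care about from one SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5], f[9])


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


# Made-up record: 8 aligned bases, a 2-base deletion, then 4 aligned bases.
line = "read1\t0\tchr1\t101\t60\t8M2D4M\t*\t0\t0\tACCCAAGGAAAC\t*"
record = parse_sam_line(line)
operations = parse_cigar(record.cigar)
is_unmapped = bool(record.flag & 0x4)
print(record, operations, is_unmapped)
```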
  14. From “Raw” DNA to “Variants” (Secondary Analysis)
      BAM File + BAM File Index →
      • Visualize: Integrative Genomics Viewer (IGV), http://www.broadinstitute.org/igv
      • Variant Calling: freebayes (https://github.com/ekg/freebayes, http://arxiv.org/abs/1207.3907), GATK (https://www.broadinstitute.org/gatk)
      Example: $ freebayes -f ref.fa aln.bam > var.vcf
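To give a feel for what a variant caller does, here is a deliberately naive pileup-based sketch: it stacks the aligned bases at each reference position and reports a single-base change when nearly all reads agree on a non-reference base. This is not the statistical model used by freebayes or GATK; the thresholds `min_depth` and `min_fraction` and the toy reads are assumptions for illustration.

```python
from collections import Counter


def call_snps(reference, alignments, min_depth=2, min_fraction=0.8):
    """Return (1-based position, ref base, alt base, depth) for simple majority-vote SNPs."""
    pileup = [Counter() for _ in reference]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if pos < len(reference):
                pileup[pos][base] += 1
    calls = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything
        base, count = counts.most_common(1)[0]
        if base != reference[pos] and count / depth >= min_fraction:
            calls.append((pos + 1, reference[pos], base, depth))
    return calls


# Toy data: all three reads carry G where the reference has C (0-based index 4).
reference = "AACACACCCAAGG"
alignments = [(0, "AACAGA"), (2, "CAGACC"), (4, "GACCCAAGG")]
calls = call_snps(reference, alignments)
print(calls)
```

Real callers also handle indels, base and mapping qualities, and diploid genotypes, which is where most of the complexity lives.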
  15. From “Raw” DNA to “Variants” (Secondary Analysis)
      … and here are your variants (VCF file)!
      http://samtools.github.io/hts-specs/VCFv4.2.pdf
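A VCF file is tab-separated text: metadata lines start with `##`, the column header starts with `#`, and each remaining line describes one variant. The following sketch reads the eight fixed columns; the function name `parse_vcf` and the sample record are invented for illustration, and the VCF specification linked above remains the authority on the format.

```python
def parse_vcf(lines):
    """Yield one dict per variant line, skipping '##' metadata and the '#CHROM' header."""
    for line in lines:
        if line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt, _qual, _filter, info = line.rstrip("\n").split("\t")[:8]
        info_fields = {}
        for item in info.split(";"):
            if item in ("", "."):
                continue  # '.' means the INFO column is empty
            key, _, value = item.partition("=")
            info_fields[key] = value if value else True  # flags have no value
        yield {
            "chrom": chrom,
            "pos": int(pos),       # 1-based position
            "ref": ref,
            "alts": alt.split(","),  # several alternate alleles may be listed
            "info": info_fields,
        }


# Made-up three-line file: metadata, header, and one variant record.
vcf_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t5\t.\tC\tG\t50\tPASS\tDP=3;DB\n",
]
variants = list(parse_vcf(vcf_lines))
print(variants)
```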
  16. What do we do with variant calls then? Zooming in on the Central Dogma of Molecular Biology:
      • There is redundancy in the genetic code: several codons encode the same amino acid.
      • But a mutation can still change the protein that is encoded.
      Image courtesy of Wikipedia
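The redundancy mentioned on this slide can be seen directly in the standard genetic code, where 64 codons map to 20 amino acids plus a stop signal. The sketch below builds that table and translates a short DNA string; the helper name `translate` and the example sequences are made up, and the code ignores details such as reading frames and strands.

```python
from itertools import product

# Standard genetic code, listed for codons in TCAG order ('*' is a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA string, three bases at a time, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


original = translate("ATGGCC")   # Met-Ala
silent = translate("ATGGCT")     # GCC -> GCT still encodes Ala
changed = translate("ATGGAC")    # GCC -> GAC turns Ala into Asp
print(original, silent, changed)
```

The second sequence differs from the first by one base yet yields the same protein, while the third changes an amino acid.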
  17. What do we do with variant calls then? Annotation & Interpretation
      • Functional Annotation: figure out if the mutation is dangerous (use SnpEff, http://snpeff.sourceforge.net)
        • Synonymous
        • Non-Synonymous
        • Frame-shift
        • …
      • Put in the context of existing findings
        • dbSNP (http://www.ncbi.nlm.nih.gov/SNP)
        • ClinVar
        • COSMIC
        • ESP
        • 1000 Genomes
        • …
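As a rough illustration of the two annotation steps, the sketch below labels each variant by the length change it introduces and then looks it up in a table of previously reported variants. Everything here is hypothetical: `KNOWN_VARIANTS` stands in for a resource such as dbSNP or ClinVar, its identifier is a placeholder, and the frame-shift label is only meaningful for variants inside protein-coding regions, which this sketch does not check. Tools such as SnpEff do far more.

```python
def variant_type(ref, alt):
    """Classify a variant by how the alternate allele changes the sequence length."""
    if len(ref) == len(alt):
        return "SNV" if len(ref) == 1 else "MNV"
    # A length change that is not a multiple of 3 shifts the codon reading frame.
    if abs(len(ref) - len(alt)) % 3 != 0:
        return "frameshift indel"
    return "in-frame indel"


# Hypothetical lookup table standing in for a database of known variants.
KNOWN_VARIANTS = {("chr1", 5, "C", "G"): "EXAMPLE-ID-1"}


def annotate(variants, known=KNOWN_VARIANTS):
    """Attach a type label and any known identifier to each (chrom, pos, ref, alt) tuple."""
    annotated = []
    for chrom, pos, ref, alt in variants:
        annotated.append({
            "variant": (chrom, pos, ref, alt),
            "type": variant_type(ref, alt),
            "known_id": known.get((chrom, pos, ref, alt)),
        })
    return annotated


calls = [("chr1", 5, "C", "G"), ("chr1", 9, "CAG", "C"), ("chr1", 20, "CAGT", "C")]
results = annotate(calls)
for result in results:
    print(result)
```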
  18. CASE STUDY:
  19. Bringing three disciplines together
      Diagram labels: Statistics, Data Analytics, Bioinformatics, Genomics, Big Data Technologies, Compute and Data Science
  20. Case Study: Bina GMS
      Sequencing → 2° Analysis → 3° Analysis → Interpretation → Meaningful Results & Clinical Relevance
      20+ DBs with 140+ annotations: HGMD // PGMD // ClinVar // COSMIC // dbNSFP // TRANSFAC // 1000 Genomes and more.
      Tools & Workflows for: WGS // WES // RNAseq // Somatic Mutations // Multi sample // Gene Panels
      Bina Products are for Research Use Only
  21. Bina RAVE Architecture (1)
      Diagram labels: Secure REST Interface; Portal Server(s); Portal Backend (Distributed); Task Dependency Graphs; Distributed Workflow Orchestration; Secure Push Interface; Workflow Generation; Interactive UI // Command Line SDK; Execution Engine; Executor; Executor Nodes / VMs; Dynamic Scheduling; Static Scheduling; Local Storage; Network Storage (Input/Output Data); Workflows; Tools; Commands
      • Workflow Definition
      • Templates
      • QC/Monitoring
      • System Management/Updates
  22. Bina RAVE Architecture (2)
      Diagram labels: Workflows (DNA, RNA ..); Tools (BWA, GATK, SVs); Services (Logging, Storage, Caching, Streaming); Commands (Samtools, GATK, URL, ..); Genome-aware Workflow Generation; Distributed Coordination; Task Graph; JSON Request (UI/CMD/SDK); Nodes / VMs; Executor; Dynamic Scheduling; Graph Triggers; Updates; Genome-aware Distributed Execution Framework; Syncing all Nodes; Dependency Graph; Task Status; Network Storage (Input/Output Data); Local Storage; Input (BAM, FASTQ); Output (VCF, SV); Static Scheduling
      • Dependency Aware Execution
      • Locality Aware Execution (Caching)
      • Streaming Through “Engines”
      • In-Memory Computation
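The architecture slides mention task dependency graphs and dependency-aware execution. The slides do not show how Bina implements this, so the following is only a generic sketch of the idea using Python's standard `graphlib` module (Python 3.9+): tasks whose prerequisites are finished become ready together and could be dispatched to different nodes. The task names in `workflow` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical secondary-analysis workflow: each task maps to the tasks it depends on.
workflow = {
    "align": set(),
    "sort_and_index": {"align"},
    "call_variants": {"sort_and_index"},
    "call_structural_variants": {"sort_and_index"},
    "annotate": {"call_variants", "call_structural_variants"},
}


def run_in_waves(graph):
    """Group tasks into waves; every task in a wave has all of its dependencies done."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that could run in parallel now
        waves.append(ready)
        sorter.done(*ready)  # a real scheduler would mark tasks done as they finish
    return waves


waves = run_in_waves(workflow)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Here the two variant-calling tasks land in the same wave because both depend only on the sorted, indexed alignments.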
  23. Bina AAiM Architecture
      Diagram labels: Annotation and Indexing Engine; Input VCF; UI/CMD; Clinical Annotations; Genomic Context; Prediction; Func. Impact; Population Frequency; Distributed Execution Framework; Annotation (Join static DBs); Indexing & Functional Filters; MapReduce Jobs; Analytics Engine; NoSQL Data Store; Indices; Metadata Store; Tumor/Normal; Pedigree; Queries, Filters, Variant Sets, Reports; Bina Secondary; Cohort Study; Proband
  24. What next?
      • Apply this process to different domains and applications
      • Come up with ways of ranking variants
      • Keep learning from data
      • Sequence everyone!
        • Genomics England 100,000 Genomes Project (http://www.genomicsengland.co.uk)
        • Personal Genome Project (http://www.personalgenomes.org)
      • Decrease cost
      • Increase accuracy
      • Make the technology faster and more usable!
      Map of sequencers around the globe: http://omicsmaps.com
  25. Challenges in Genomics
      • Accuracy
        • Gold standard? Which tool is best? There are so many!
        • NIST, DREAM Challenge
      • Need to speak the same language… interoperability
        • Global Alliance
        • API, format, metadata, …
      • Regulations
        • HIPAA, CLIA: security, accuracy, anonymity and encryption
      • Scalability
        • Storage
          • Need terabytes
          • Each genome could be up to 1 TB
        • Computation
          • We still pretty much have no idea what most of DNA is doing…
          • Can’t run on a single machine. Need to scale to many nodes
          • Need to leverage cloud technologies
      • Provenance and auditability
      • Importance of usability
        • Different personas
        • Errors are very expensive (life and death)
        • Better visualization → faster discovery → faster cure
  26. Why should software engineers move to genomics? Because genomics needs you, and you need genomics. Work on something that matters! (#Code2Cure)
      Things that SWEs do very well:
      • Automation
      • Elegant solutions for complex problems
      • Enabling non-savvy users by making the technology robust and accessible
      • Scale
      • Optimization
      • Building production-grade platforms
        • Tested
        • Robust
        • Secure
      THESE ARE ALL NEEDED IN GENOMICS YESTERDAY!
      Image courtesy of http://silvsoul.blogspot.com
  27. Open projects/resources to check out/contribute to
      Projects/Conferences
      • Galaxy -- http://galaxyproject.org
      • Arvados -- https://arvados.org
      • Open Bio Conference -- http://www.open-bio.org
      • BioVis -- http://www.biovis.net
      • BioPython -- http://biopython.org
      • Global Alliance for Genomics and Health -- http://ga4gh.org
      • Rosalind Project -- http://rosalind.info
      Blogs/Websites
      • http://bcb.io
      • http://nextgenseek.com/
      • http://ngs-expert.com/
      • http://seqanswers.com/
      • http://core-genomics.blogspot.com
      • http://www.genomesunzipped.org
      • http://genomeweb.com
  28. Thank you. And I hope you consider moving to genomics! http://info.bina.com/code2cure-community
      Amirhossein Kiani, Sr. Lead Software Engineer (@mirkiani, amir@bina.com)
