Closing the Gap in Time: From Raw Data to Real Science Science as a Service (ScaaS) Justin H. Johnson Director of Bioinformatics EdgeBio Gaithersburg, MD
Edge Bio & Me Sequencing and Bioinformatics Shop I am a Bioinformatician IT Professional Scientist The lines are blurred…
NGS Exponential Growth (Nature Biotechnology, Volume 26, Number 10, October 2008)
Sequencing is Free? Machines and Vendors Lab Staffing, Integrations and LIMS Project Design Data Management and QC Bioinformatics and Data Analysis Data Computes and Storage Data Sharing
Machines and Vendors GnuBio
Ultra High Throughput + Lower Cost = Broader Applications
Bioinformatics Tools
* BFAST - Blat-like Fast Accurate Search Tool. Written by Nils Homer, Stanley F. Nelson and Barry Merriman at UCLA.
* Bowtie - Ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of 25 million reads per hour on a typical workstation with 2 gigabytes of memory. Uses a Burrows-Wheeler-Transformed (BWT) index. Written by Ben Langmead and Cole Trapnell. Linux, Windows, and Mac OS X.
* BWA - Heng Li's BWT alignment program, a progression from MAQ. BWA is a fast, lightweight tool that aligns short sequences to a sequence database, such as the human reference genome. By default, BWA finds an alignment within edit distance 2 of the query sequence. C++ source.
* ELAND - Efficient Large-Scale Alignment of Nucleotide Databases. Whole genome alignments to a reference genome. Written by Illumina author Anthony J. Cox for the Solexa 1G machine.
* Exonerate - Various forms of pairwise alignment (including Smith-Waterman-Gotoh) of DNA/protein against a reference. Authors are Guy St C Slater and Ewan Birney from EMBL. C for POSIX.
* GenomeMapper - A short read mapping tool designed for accurate read alignments. It quickly aligns millions of reads with either ungapped or gapped alignments. Created by the 1001 Genomes project. Source for POSIX.
* GMAP - Genomic Mapping and Alignment Program for mRNA and EST sequences. Developed by Thomas Wu and Colin Watanabe at Genentech. C/Perl for Unix.
* gnumap - The Genomic Next-generation Universal MAPper (gnumap) is designed to accurately map sequence data obtained from next-generation sequencing machines (specifically Solexa/Illumina) back to a genome of any size. It seeks to align reads from non-unique repeats using statistics. From authors at Brigham Young University. C source/Unix.
* MAQ - Mapping and Assembly with Qualities (renamed from MAPASS2). Designed particularly for Illumina, with preliminary functions to handle ABI SOLiD data. Written by Heng Li from the Sanger Centre. Features extensive supporting tools for DIP/SNP detection, etc. C++ source.
* MOSAIK - Produces gapped alignments using the Smith-Waterman algorithm. Features a number of support tools. Support for Roche FLX, Illumina, SOLiD, and Helicos. Written by Michael Strömberg at Boston College. Win/Linux/Mac OS X.
* MrFAST and MrsFAST - Designed to map short reads generated with the Illumina platform to reference genome assemblies in a fast and memory-efficient manner. Robust to INDELs; MrsFAST has a bisulphite mode. Authors are from the University of Washington. C source.
* MUMmer - A modular system for the rapid whole genome alignment of finished or draft sequence. Released as a package providing an efficient suffix tree library, seed-and-extend alignment, SNP detection, repeat detection, and visualization tools. Version 3.0 was developed by Stefan Kurtz, Adam Phillippy, Arthur L Delcher, Michael Smoot, Martin Shumway, Corina Antonescu and Steven L Salzberg, most of whom are at The Institute for Genomic Research in Maryland, USA. POSIX OS required.
* Novocraft - Tools for reference alignment of paired-end and single-end Illumina reads. Uses a Needleman-Wunsch algorithm. Can support Bis-Seq. Commercial. Available free for evaluation, educational use and for use on open not-for-profit projects. Requires Linux or Mac OS X.
* PASS - Supports Illumina, SOLiD and Roche-FLX data formats and allows the user to tune the sensitivity of the alignments very finely. Spaced-seed initial filter, then an NW dynamic algorithm followed by an SW-like local alignment. Authors are from CRIBI in Italy. Win/Linux.
* RMAP - Assembles 20-64 bp Illumina reads to a FASTA reference genome. By Andrew D. Smith and Zhenyu Xuan at CSHL (published in BMC Bioinformatics). POSIX OS required.
* SeqMap - Supports up to 5 or more bp mismatches/INDELs. Highly tunable. Written by Hui Jiang from the Wong lab at Stanford. Builds available for most OSs.
* SHRiMP - Assembles to a reference sequence. Developed with Applied Biosystems' colourspace genomic representation in mind. Authors are Michael Brudno and Stephen Rumble at the University of Toronto. POSIX.
* Slider - An application for Illumina Sequence Analyzer output that uses the probability files instead of the sequence files as input for alignment to a reference sequence or a set of reference sequences. Authors are from BCGSC.
* SOAP - Short Oligonucleotide Alignment Program. A program for efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. The updated version uses a BWT. Can call SNPs and INDELs. Author is Ruiqiang Li at the Beijing Genomics Institute. C++, POSIX.
* SSAHA - Sequence Search and Alignment by Hashing Algorithm. A tool for rapidly finding near-exact matches in DNA or protein databases using a hash table. Developed at the Sanger Centre by Zemin Ning, Anthony Cox and James Mullikin. C++ for Linux/Alpha.
* SOCS - Aligns SOLiD data. SOCS is built on an iterative variation of the Rabin-Karp string search algorithm, which uses hashing to reduce the set of possible matches, drastically increasing search speed. Authors are Ondov B, Varadarajan A, Passalacqua KD and Bergman NH.
* SWIFT - The SWIFT suite is a software collection for fast index-based sequence comparison. It contains SWIFT, a fast local alignment search guaranteed to find epsilon-matches between two sequences, and SWIFT BALSAM, a very fast program to find semiglobal non-gapped alignments based on k-mer seeds. Authors are Kim Rasmussen (SWIFT) and Wolfgang Gerlach (SWIFT BALSAM).
* SXOligoSearch - A commercial platform offered by the Malaysian-based Synamatix. Will align Illumina reads against a range of RefSeq RNA or NCBI genome builds for a number of organisms. Web portal. OS independent.
* Vmatch - A versatile software tool for efficiently solving large-scale sequence matching tasks. Vmatch subsumes the software tool REPuter, but is much more general, with a very flexible user interface and improved space and time requirements. Essentially a large string matching toolbox. POSIX.
* ZOOM - Zillions Of Oligos Mapped. Designed to map millions of short reads produced by next-generation sequencing technology back to reference genomes and to carry out post-analysis. ZOOM is developed to be highly accurate, flexible, and user-friendly, with speed being a critical priority. Commercial. Supports Illumina and SOLiD data.
Courtesy of SEQanswers.com
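Several of the aligners listed above (Bowtie, BWA, the updated SOAP) are described as using a Burrows-Wheeler transform. As a rough illustration only, here is a minimal Python sketch of the transform and its inverse. It uses the naive sorted-rotations construction, which is fine for a toy string but is not how any of these tools build their genome-scale indexes; the function names and the `$` sentinel are assumptions made for this example.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel  # the sentinel marks the end of the original string
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    # The row that ends with the sentinel is the original string.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)               # annb$aa
    print(inverse_bwt(transformed))  # banana
```

The transform tends to group identical characters together, which is what makes the compressed, searchable indexes used by BWT-based aligners possible.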
Experimental Design Considerations
Choice of Library Construction
Depth of coverage
Re$ources
Number of Replicates
Number of Samples and Control
Etc…
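Depth of coverage is the one item in this list that reduces to simple arithmetic. A minimal sketch, assuming the usual idealized model in which reads land uniformly at random on the target (real libraries are less even, so treat the numbers as a planning estimate). The function names are illustrative and not from the talk.

```python
import math


def expected_coverage(num_reads: int, read_length: int, target_size: int) -> float:
    """Average depth: total sequenced bases divided by target size (C = N * L / G)."""
    return num_reads * read_length / target_size


def reads_needed(coverage: float, read_length: int, target_size: int) -> int:
    """Number of reads required to reach a desired average depth."""
    return math.ceil(coverage * target_size / read_length)


def fraction_uncovered(coverage: float) -> float:
    """Expected fraction of bases with zero reads under a Poisson model, e^(-C)."""
    return math.exp(-coverage)


if __name__ == "__main__":
    # Hypothetical project: 5 Mb bacterial genome, 1 million reads of 100 bp.
    print(expected_coverage(1_000_000, 100, 5_000_000))  # 20.0
    # Reads needed for 30x over a hypothetical 3 Mb target with 100 bp reads.
    print(reads_needed(30, 100, 3_000_000))              # 900000
```

Multiplying the read count by the number of samples and replicates gives a first estimate of how many runs, and how much storage, a design will need.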
It's Building… 1 16 96 384 100,000 350,000,000 Tomorrow?
…and distributing Distributing the Problem Exchanging Data Refreshing Data
How do we avoid the Perfect Storm? http://www.flickr.com/photos/nationalmaritimemuseum/3115298137/
Life Vests? Science as a Service SaaS IaaS PaaS HaaS DaaS
Edge Bio
You can’t do this alone, neither can we. Scientific Advisory Board:
* Elaine Mardis, Ph.D., Co-Director, Genome Sequencing Center, Washington University School of Medicine
* Sam Levy, Ph.D., Director of Genomic Sciences, Professor of Translational Genomics & Human Genomic Medicine, Scripps Translational Science Institute, Scripps Health, Scripps Research Institute
* Michael Zody, Ph.D., Chief Technologist, Broad Institute of MIT
* Ken Dewar, Ph.D., Assistant Professor, McGill University and Quebec Genome
* Steven Salzberg, Ph.D., Director, Center for Bioinformatics and Computational Biology, University of Maryland
* Gabor Marth, Ph.D., Professor of Bioinformatics, Boston College
* Elliott Margulies, Ph.D., Investigator, Genome Informatics Section, National Human Genome Research Institute, National Institutes of Health
EdgeBio
Edge Bio
Bioinformatics Cloud Computing (IaaS, PaaS) Amazon, Google, Others NGS Software and Algorithms Commercial (CLC) and Open Source Frameworks Cloud BioLinux, Hadoop and Chef Data Sharing and Standards (GSC/M5)
Cloud
Cloud Not a talk on cloud computing… Either you already know what it is – OR – others can do it more justice with more time. The 4 dollar genome.
An Academic Problem Have 100K+ free computes avail (Teragrid) Can’t get to them easily Have unlimited pool of computes (Amazon) Can’t figure out how to pay for them If you have figured those out… Traditional Issues Pipes too small Security
Edge Science Gateway (ESG) Easily leverage any source at the back end Cloud TeraGrid Internal HPC Cluster Internal cloud (Eucalyptus) Commercial partners provide thin clients layer CLC Bio
Edge Science Gateway (ESG) Plug and Play Tools Chef and (cloud)Bio-Linux Real people, not services underneath it all. IFX and IT consulting
Edge Exome Analysis Pipeline Base Call, Quality Filter Map Genome Exon junctions Variant Analysis SNP INDEL Variant Annotation RefSeq (etc) dbSNP/1000 Genomes SIFT, PolyPhen OMIM Repeat Database Visualization Functional and Structural Analysis
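The first stage of the pipeline above is a quality filter. The slide does not say how Edge Bio implements it, so the following is only a generic sketch of one common approach: drop reads whose mean Phred quality falls below a threshold. The Phred+33 offset, the threshold of 20, and the `(name, sequence, quality)` tuple layout are all assumptions for illustration; some Illumina pipelines of that period used a +64 offset instead, which is why the offset is a parameter.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string), an assumed layout


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of an ASCII-encoded quality string."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def quality_filter(
    reads: Iterable[Read], min_mean_quality: float = 20.0, offset: int = 33
) -> Iterator[Read]:
    """Yield only well-formed reads whose mean quality meets the threshold."""
    for name, sequence, quality in reads:
        if len(sequence) != len(quality):
            continue  # malformed record: one quality value is expected per base
        if mean_phred(quality, offset) >= min_mean_quality:
            yield name, sequence, quality


if __name__ == "__main__":
    reads = [
        ("read1", "ACGT", "IIII"),  # 'I' encodes Q40 under Phred+33
        ("read2", "ACGT", "!!!!"),  # '!' encodes Q0 under Phred+33
    ]
    for name, _, _ in quality_filter(reads):
        print(name)  # read1
```

Filtering before mapping keeps low-quality reads from consuming alignment time and from producing spurious variant calls downstream.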
Sometimes money isn’t enough Traditional Infrastructure or New Era Clouds Still costs, whether up front or on back end Always want to minimize cost Frameworks and Algorithms R&D NGS Software and Algorithms cloud(Bio-Linux) Hadoop Chef
NGS Software and Algorithms Best of Breed through pluggable architecture Make them faster… MapReduce BLAST & GPU-Enabled BLAST SIMD Accelerated HMMER, ClustalW, SW (CLC Bio) …or run them less. Clustering algorithms Efficient data refreshes Sharing of results
Frameworks Cloud BioLinux Easily deploy others' software through virtualization Chef (an Intro) Open Source configuration management for your entire infrastructure MPI and Hadoop Speed up and I/O issue resolution with MPI CloudBurst – MapReduce-based NGS mapping
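For readers unfamiliar with the MapReduce model mentioned above, here is a single-process Python sketch of its three phases (map, shuffle, reduce) applied to k-mer counting across reads. It is not CloudBurst's algorithm and does not use Hadoop; it only illustrates why the pattern distributes well: the map and reduce steps each work on one read or one key at a time, so they can run on different machines. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_kmers(read: str, k: int) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (k-mer, 1) pair for every k-mer in one read."""
    for start in range(len(read) - k + 1):
        yield read[start:start + k], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle phase: group emitted values by key (the framework does this in Hadoop)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce phase: combine all values seen for one key."""
    return key, sum(values)


def count_kmers(reads: Iterable[str], k: int) -> Dict[str, int]:
    """Run the three phases in sequence on a local list of reads."""
    pairs = (pair for read in reads for pair in map_kmers(read, k))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(count_kmers(["ACGT", "CGTA"], k=3))
    # {'ACG': 1, 'CGT': 2, 'GTA': 1}
```

In a seed-based mapper the keys would be seeds shared between reads and the reference, and the reduce step would extend candidate alignments instead of summing counts.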
Edge BioServ Services   Project  goals and timelines Number of samples Number of reads/tags per sample Experiment and Project Design Sample Preparation  Library Preparation Sample Capture   Library Construction Sample QC and Quantification Adaptor ligation   Project Workflow Amplification & Sequence Fragment amplification Next Generation Sequencing run  Align sequence to reference genome or RNA database Secondary and Tertiary Analysis Data Analysis
Real World Examples: 1500+ Sample Epigenetic Study
Challenges
QC
Delivery
Solution
Geospiza
CLC Bio
Custom Algorithms
HPC and Storage
Onsite 200 TB NAS
S3 for Backup (Soon for Delivery)

 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 

Closing the Gap in Time: From Raw Data to Real Science

  • 1. Closing the Gap in Time: From Raw Data to Real Science Science as a Service (ScaaS) Justin H. Johnson Director of Bioinformatics EdgeBio Gaithersburg, MD
  • 2. Edge Bio & Me Sequencing and Bioinformatics Shop I am a Bioinformatician IT Professional Scientist The lines are blurred…
  • 3. NGS Exponential Growth Nature Biotechnology Volume 26 Number 10 October 2008
  • 4. Sequencing is Free? Machines and Vendors Lab Staffing, Integrations and LIMS Project Design Data Management and QC Bioinformatics and Data Analysis Data Computes and Storage Data Sharing
  • 6. Ultra High Throughput + Lower Cost = Broader Applications
  • 7. Bioinformatics Tools
      * BFAST - Blat-like Fast Accurate Search Tool. Written by Nils Homer, Stanley F. Nelson and Barry Merriman at UCLA.
      * Bowtie - Ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of 25 million reads per hour on a typical workstation with 2 gigabytes of memory. Uses a Burrows-Wheeler-Transformed (BWT) index. Written by Ben Langmead and Cole Trapnell. Linux, Windows, and Mac OS X.
      * BWA - Heng Li's BWT Alignment program, a progression from Maq. BWA is a fast, lightweight tool that aligns short sequences to a sequence database, such as the human reference genome. By default, BWA finds an alignment within edit distance 2 to the query sequence. C++ source.
      * ELAND - Efficient Large-Scale Alignment of Nucleotide Databases. Whole genome alignments to a reference genome. Written by Illumina author Anthony J. Cox for the Solexa 1G machine.
      * Exonerate - Various forms of pairwise alignment (including Smith-Waterman-Gotoh) of DNA/protein against a reference. Authors are Guy St C Slater and Ewan Birney from EMBL. C for POSIX.
      * GenomeMapper - A short read mapping tool designed for accurate read alignments. It quickly aligns millions of reads either with ungapped or gapped alignments. A tool created by the 1001 Genomes project. Source for POSIX.
      * GMAP - Genomic Mapping and Alignment Program for mRNA and EST sequences. Developed by Thomas Wu and Colin Watanabe at Genentech. C/Perl for Unix.
      * gnumap - The Genomic Next-generation Universal MAPper is a program designed to accurately map sequence data obtained from next-generation sequencing machines (specifically Solexa/Illumina) back to a genome of any size. It seeks to align reads from nonunique repeats using statistics. From authors at Brigham Young University. C source/Unix.
      * MAQ - Mapping and Assembly with Qualities (renamed from MAPASS2). Particularly designed for Illumina, with preliminary functions to handle ABI SOLiD data. Written by Heng Li from the Sanger Centre. Features extensive supporting tools for DIP/SNP detection, etc. C++ source.
      * MOSAIK - Produces gapped alignments using the Smith-Waterman algorithm. Features a number of support tools. Support for Roche FLX, Illumina, SOLiD, and Helicos. Written by Michael Strömberg at Boston College. Win/Linux/MacOSX.
      * MrFAST and MrsFAST - Designed to map short reads generated with the Illumina platform to reference genome assemblies in a fast and memory-efficient manner. Robust to INDELs, and MrsFAST has a bisulphite mode. Authors are from the University of Washington. C as source.
      * MUMmer - A modular system for the rapid whole genome alignment of finished or draft sequence. Released as a package providing an efficient suffix tree library, seed-and-extend alignment, SNP detection, repeat detection, and visualization tools. Version 3.0 was developed by Stefan Kurtz, Adam Phillippy, Arthur L Delcher, Michael Smoot, Martin Shumway, Corina Antonescu and Steven L Salzberg, most of whom are at The Institute for Genomic Research in Maryland, USA. POSIX OS required.
      * Novocraft - Tools for reference alignment of paired-end and single-end Illumina reads. Uses a Needleman-Wunsch algorithm. Can support Bis-Seq. Commercial. Available free for evaluation, educational use and for use on open not-for-profit projects. Requires Linux or Mac OS X.
      * PASS - Supports Illumina, SOLiD and Roche-FLX data formats and allows the user to modulate very finely the sensitivity of the alignments. Spaced seed initial filter, then NW dynamic algorithm to a SW(like) local alignment. Authors are from CRIBI in Italy. Win/Linux.
      * RMAP - Assembles 20 - 64 bp Illumina reads to a FASTA reference genome. By Andrew D. Smith and Zhenyu Xuan at CSHL (published in BMC Bioinformatics). POSIX OS required.
      * SeqMap - Supports up to 5 or more bp mismatches/INDELs. Highly tunable. Written by Hui Jiang from the Wong lab at Stanford. Builds available for most OS's.
      * SHRiMP - Assembles to a reference sequence. Developed with Applied Biosystems' colourspace genomic representation in mind. Authors are Michael Brudno and Stephen Rumble at the University of Toronto. POSIX.
      * Slider - An application for the Illumina Sequence Analyzer output that uses the probability files instead of the sequence files as an input for alignment to a reference sequence or a set of reference sequences. Authors are from BCGSC.
      * SOAP - Short Oligonucleotide Alignment Program. A program for efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. The updated version uses a BWT. Can call SNPs and INDELs. Author is Ruiqiang Li at the Beijing Genomics Institute. C++, POSIX.
      * SSAHA - Sequence Search and Alignment by Hashing Algorithm is a tool for rapidly finding near exact matches in DNA or protein databases using a hash table. Developed at the Sanger Centre by Zemin Ning, Anthony Cox and James Mullikin. C++ for Linux/Alpha.
      * SOCS - Aligns SOLiD data. SOCS is built on an iterative variation of the Rabin-Karp string search algorithm, which uses hashing to reduce the set of possible matches, drastically increasing search speed. Authors are Ondov B, Varadarajan A, Passalacqua KD and Bergman NH.
      * SWIFT - The SWIFT suite is a software collection for fast index-based sequence comparison. It contains SWIFT, a fast local alignment search guaranteeing to find epsilon-matches between two sequences, and SWIFT BALSAM, a very fast program to find semiglobal non-gapped alignments based on k-mer seeds. Authors are Kim Rasmussen (SWIFT) and Wolfgang Gerlach (SWIFT BALSAM).
      * SXOligoSearch - A commercial platform offered by the Malaysian-based Synamatix. Will align Illumina reads against a range of Refseq RNA or NCBI genome builds for a number of organisms. Web portal. OS independent.
      * Vmatch - A versatile software tool for efficiently solving large scale sequence matching tasks. Vmatch subsumes the software tool REPuter, but is much more general, with a very flexible user interface and improved space and time requirements. Essentially a large string matching toolbox. POSIX.
      * Zoom - ZOOM (Zillions Of Oligos Mapped) is designed to map millions of short reads produced by next-generation sequencing technology back to the reference genomes, and to carry out post-analysis. ZOOM is developed to be highly accurate, flexible, and user-friendly, with speed being a critical priority. Commercial. Supports Illumina and SOLiD data.
    Courtesy of SeqAnswers.com
  • 8.
  • 9. Choice of Library Construction
  • 13. Number of Samples and Control
  • 14.
  • 15. It's Building… 1 16 96 384 100,000 350,000,000 Tomorrow?
  • 16. …and distributing Distributing the Problem Exchanging Data Refreshing Data
  • 17. How do we avoid the Perfect Storm? http://www.flickr.com/photos/nationalmaritimemuseum/3115298137/
  • 18. Life Vests? Science as a Service SaaS IaaS PaaS HaaS DaaS
  • 20. You can’t do this alone, neither can we. Elaine Mardis, Ph.D. Co-Director, Genome Sequencing Center Washington University School of Medicine Sam Levy, Ph.D. Director of Genomic Sciences Professor of Translational Genomics & Human Genomic Medicine Scripps Translational Science Institute Scripps Health Scripps Research Institute Michael Zody, Ph.D. Chief Technologist Broad Institute of MIT Ken Dewar, Ph.D. Assistant Professor McGill University and Quebec Genome Steven Salzberg, Ph.D. Director, Center for Bioinformatics and Computational Biology University of Maryland Gabor Marth, Ph.D. Professor of Bioinformatics Boston College Elliott Margulies, Ph.D. Investigator Genome Informatics Section National Human Genome Research Institute National Institutes of Health Scientific Advisory Board
  • 23. Bioinformatics Cloud Computing (IaaS, PaaS) Amazon, Google, Others NGS Software and Algorithms Commercial (CLC) and Open Source Frameworks (cloud)Biolinux, Hadoop and Chef Data Sharing and Standards (GSC/M5)
  • 24. Cloud
  • 25. Cloud Not a talk on cloud computing… Either you already know what it is – OR – Others can do it more justice with more time. The 4 dollar genome.
  • 26. An Academic Problem Have 100K+ free computes avail (Teragrid) Can’t get to them easily Have unlimited pool of computes (Amazon) Can’t figure out how to pay for them If you have figured those out… Traditional Issues Pipes too small Security
  • 27. Edge Science Gateway (ESG) Easily leverage any source at the back end Cloud TeraGrid Internal HPC Cluster Internal cloud (Eucalyptus) Commercial partners provide thin clients layer CLC Bio
  • 28. Edge Science Gateway (ESG) Plug and Play Tools Chef and (cloud)Bio-Linux Real people, not services underneath it all. IFX and IT consulting
  • 29. Edge Exome Analysis Pipeline Base Call, Quality Filter Map Genome Exon junctions Variant Analysis SNP INDEL Variant Annotation Refseq (etc) dbSNP/1000 Genomes SIFT, PolyPhen OMIM Repeat Database Visualization Functional and Structural Analysis
  • 30. Sometimes money isn’t enough Traditional Infrastructure or New Era Clouds Still costs, whether up front or on back end Always want to minimize cost Frameworks and Algorithms R&D NGS Software and Algorithms cloud(Bio-Linux) Hadoop Chef
  • 31. NGS Software and Algorithms Best of Breed through pluggable architecture Make them faster… Map Reduce Blast & GPU Enable Blast SIMD Accelerated HMMER, ClustalW, SW (CLC Bio) …or run them less. Clustering algorithms Efficient data refreshes Sharing of results
  • 32. Frameworks Cloud(BioLinux) Easily deploy others' software through virtualization Chef (an Intro) Open Source configuration management for your entire infrastructure MPI and Hadoop Speed up and I/O issue resolution with MPI CloudBurst – MapReduce based NGS mapping
  • 33. Edge BioServ Services Project goals and timelines Number of samples Number of reads/tags per sample Experiment and Project Design Sample Preparation Library Preparation Sample Capture Library Construction Sample QC and Quantification Adaptor ligation Project Workflow Amplification & Sequence Fragment amplification Next Generation Sequencing run Align sequence to reference genome or RNA database Secondary and Tertiary Analysis Data Analysis
  • 34.
  • 35. QC
  • 36.
  • 42.
  • 43. Edge Exome Analysis Pipeline Base Call, Quality Filter Map Genome Exon junctions Variant Analysis SNP INDEL Variant Annotation Refseq (etc) dbSNP/1000 Genomes SIFT, PolyPhen OMIM Repeat Database Visualization Functional and Structural Analysis
  • 44.
  • 45. 3% of exons have 0-5x coverage
  • 46. 5% of exons have 0-10x coverage
  • 47.
  • 48. 10% of exons have 0-5x coverage
  • 49. 17% of exons have 0-10x coverage
  • 50.
  • 57. Questions? Thank You Twitter: @BioInfo & @EdgeBio Email: jjohnson@edgebio.com
  • 58.
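
The per-exon coverage figures on slides 45 to 49 (for example, "3% of exons have 0-5x coverage") can be derived from per-exon depth values. The following is a minimal sketch in Python, assuming each exon has already been summarized as a mean read depth. The function name `fraction_at_or_below` and the sample depths are hypothetical illustrations, not taken from the talk or from the Edge pipeline.

```python
def fraction_at_or_below(exon_depths, max_depth):
    """Return the fraction of exons whose mean depth is <= max_depth."""
    if not exon_depths:
        raise ValueError("exon_depths must not be empty")
    low = sum(1 for depth in exon_depths if depth <= max_depth)
    return low / len(exon_depths)


# Hypothetical mean depths for ten exons (illustrative values only).
depths = [0.0, 4.5, 8.0, 12.0, 25.0, 30.0, 41.0, 55.0, 60.0, 73.0]

for threshold in (5, 10):
    share = fraction_at_or_below(depths, threshold)
    print(f"{share:.0%} of exons have 0-{threshold}x coverage")
```

In practice the depths would come from a coverage tool run over the capture targets; here they are hard-coded so the sketch stays self-contained.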

Editor's Notes

  1. Won’t bore you with the entire history of Edge Bio. It’s been around 20 … I am sure there are several of you in this room who found yourselves doing something you didn’t expect, because NGS has posed new challenges to typical thinking and infrastructure.
  2. I am sure you have all seen some variation of this. You, like us, see the value in exploring the next-gen landscape, but with such an inflection, something has to give. Figure 1: The number of publications with keywords for nucleic acid detection and sequencing technologies. PubMed (http://www.ncbi.nlm.nih.gov/sites/entrez) was searched in two-year increments for key words and the number of hits plotted over time. For 2007–2008, results from January 1–March 31, 2008 were multiplied by four and added to those for 2007. Key words used were those listed in the legend except for new sequencing technologies (‘next-generation sequencing’ or ‘high-throughput sequencing’), ChIP (‘chromatin immunoprecipitation’ or ‘ChIP-Chip’ or ‘ChIP-PCR’ or ‘ChIP-Seq’), qPCR (TaqMan or qPCR or ‘real-time PCR’) and SNP analysis (SNPs or ‘single-nucleotide polymorphisms’ and not nitroprusside (nitroprusside is excluded because sodium nitroprusside is sometimes abbreviated as ‘SNP’ but is generally unrelated to genetics)).
  3. From the base of the inflection (or tipping point) in ’06, innovation and technology have been in a constant state of flux. Costs per base have plummeted. Sequencing still remains a significant capital expenditure in many areas.
  4. Let’s home in on a couple of those areas. These are the players in the sequencing field as it is, and as it is emerging. Where do you start? Each provides data at lower cost, and each has its strengths and weaknesses. How do you make sure you choose the right platform for your application?
  5. Now you have a platform. With the high throughput comes an immense range of applications. How do you become proficient?
  6. Now you have an application; how do you design your project? And, as many here are discussing today, how do you effectively analyze the data? To illustrate the complexity, this is a partial list of read mapping software from SeqAnswers. So testing, comparing, and homing in on an app is important. Then each one can be run in a multitude of ways: how many errors, colorspace or base space, etc. Daunting.
  7. Then there is the execution and operation of the projects. We all wish we had a bag of money to do the science “in the right way”, but compromises are made to suit funders, timelines, quality standards, etc. How do we balance those limited resources? Let’s touch on that later.
  8. I tend to start with the inherent challenges of the next-gen landscape to lay the groundwork for possible solutions.
  9. I was lucky to grow up in genomics. 16 was high throughput. Infrastructure as we know it was rendered obsolete by sheer scale, chemistries (flows and colorspace data), and error models. Significant issues lie ahead for next-next-gen in these areas.
  10. Distributing the problem: there is a lower barrier to entry than before, which has somewhat democratized sequencing.
  11. Service-based models are becoming common in IT. They allow abstractions, so you don’t have to care about the hardware, the software, or the platform that is built to accomplish the goals of the research organization or company.
  12. Edge abstracts it to the most basic of levels. Our aim is to help researchers with all these considerations, to come out with optimal results. DNA/RNA in, answers out.
  13. Thought leaders in the fields of genomics and bioinformatics.
  14. Let’s focus a bit more. Bioinformatics is a piece of the puzzle.
  15. More recently, it just so happens to be a larger piece.
  16. With limited time and so many cool things to talk about – I am going to touch on several. I would love to continue any one of these in a hallway or over coffee. My group’s goal is to leverage many technologies – commercial, academic, and internal to help our clients. Some of the areas we are focusing on are on this slide.
  17. So the first, and probably hottest, topic out there is cloud.
  18. Scalable. Pay as you go. Redundant data storage. Ubiquitous access. Lower upfront and capital costs.
  19. Access resources in non-traditional ways. A paradigm shift needs to happen with funding agencies, which think in traditional ways of funding capital infrastructure. We need to continue to educate.
  20. One needs to be nimble in one’s ability to leverage computes from many pools. With the ESG, we have built a flexible and scalable infrastructure.
  21. ESG is built around this plug-and-play architecture. We continue to explore tools, Chef and other virtualization mechanisms, that ease the burden of infrastructure maintenance.
  22. Note how each is a node in which we can hot-swap tools as needed.
  23. This isn’t a panacea, though. There are costs, whether front- or back-loaded. So how do we minimize them? Spend time making things better, not just building pipelines. I’ll touch a bit more on each of these in the upcoming slides.
  24. Best of breed. Things are changing so rapidly. So whether we or someone else figures out a way to do something better, we can take advantage. With software, make it better or run it less.
  25. How many times have you tried to install software that was sent to you, or that you downloaded from SourceForge, that wouldn’t build or run as planned? We quickly learned that a community-based AMI is very serial in its adoption and addition. Chef allows developers to construct infrastructure much like software.
  26. So, let’s wind this down by looking at some examples. I guess if you take one term away – I would like it to be Science as a Service. This is a concept, an abstraction for a method of building infrastructure and transparency. So – does it work?
  27. Note how each is a node into which we can plug tools as needed.
  28. Balance between cost and coverage – this provides a talking point for a slide coming up in a few.
  29. What is the best way to do this with what you have? If money is not an object, get high coverage and 100% accuracy. In reality, what is the best way to do it with what you have? Platform. Bfx.
  30. What is the best way to do this with what you have? If money is not an object, get high coverage and 100% accuracy. In reality, what is the best way to do it with what you have? Platform. Bfx.
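
The balance between cost and coverage raised in notes 28 to 30 can be roughed out with the standard estimate that mean coverage equals the sequenced bases landing on target divided by the target size. The following is a minimal sketch in Python under the assumption of uniform coverage. The function name `reads_needed`, the `on_target_fraction` parameter, and every number used are illustrative assumptions, not figures from the talk.

```python
import math


def reads_needed(target_bp, mean_coverage, read_length, on_target_fraction=1.0):
    """Estimate how many reads give the requested mean coverage of a target.

    Assumes coverage = (reads * read_length * on_target_fraction) / target_bp,
    i.e. uniform coverage with a fixed share of reads landing on target.
    """
    if not 0 < on_target_fraction <= 1:
        raise ValueError("on_target_fraction must be in (0, 1]")
    bases_required = target_bp * mean_coverage / on_target_fraction
    return math.ceil(bases_required / read_length)


# Hypothetical exome-style scenario (all values are assumptions):
# 50 Mb target, 30x mean coverage, 100 bp reads, half of the reads on target.
reads = reads_needed(
    target_bp=50_000_000,
    mean_coverage=30,
    read_length=100,
    on_target_fraction=0.5,
)
print(f"Approximate reads required: {reads:,}")
```

Real coverage is far from uniform, as the low-coverage exon percentages in the slides show, so an estimate like this is only a starting point for comparing platform and budget options.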