Jeff Melching
Monsanto: Lead Big Data Engineer
Twitter: @melchbox
8/7/2013
Intro to the genomic space @Monsanto
Strategy for Legacy Analysis
Example Use Cases
htseq-count: Counting reads in features
Genotype Scoring
Crossbow
“We succeed when farmers succeed.”
-Hugh Grant, Monsanto CEO
Monsanto Company is a leading global provider of
technology-based tools and agricultural products
that improve farm productivity and food quality.
We work to deliver agricultural products and
solutions to:
• Meet the world’s growing food needs
• Conserve natural resources
• Protect the environment
Genomics : a discipline in genetics that applies
recombinant DNA, DNA sequencing
methods, and bioinformatics to
sequence, assemble, and analyze the function
and structure of genomes (the complete set of
DNA within a single cell of an organism)
http://en.wikipedia.org/wiki/Genomics
New gene discovery, gene expression
Evolutionary population genetics
Insect Control
Disease resistance targets
Marker discovery and variation analysis
Genotyping and fingerprinting
New vegetable reference genomes
Marker discovery and variation analysis
Disease resistance
Viral and fungal resistance
Targets for topical RNAi
Seed Treatments
Yield & Stress
Agricultural Traits
Molecular Breeding
Vegetable Quality &
Disease
Plant Health
Chemistry
30+ years of increasing computational
power, open source tools and knowledge
Two distinct workloads
Production workflows
Discovery analytics
Computational pipelines
Grid Computing
High Performance
Storage
File Processing
Hadoop
HDFS
Block processing
perl
python
C/C++
R
bash
MapReduce
Java
Pig
Hive
The work is done, why port it to Java?
Can I get it done quickly?
Where’s the value?
Math is hard
Genomic algorithms are harder
Coding it is harder still
Coding it correctly…
http://gapingvoid.com/2008/06/13/now-what/
Minimizing change, so that the existing pipelines,
tools and knowledge can be used in their natural
state, requires a common platform that is
language neutral and easily consumable
stdin & stdout
Creates map and reduce tasks
Controls map and reduce defined executables
Feeds data to stdin of the process, collects output
from stdout
Equivalent to using pipes
[Diagram: Input → Mapper (stdin → Map Exe → stdout) → Reducer (stdin → Reduce Exe → stdout) → Output]
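The data flow on this slide can be imitated in a few lines of Python. This is only a sketch of the contract (lines in, tab-separated key/value lines out, a sort by key between the two stages), not Hadoop's implementation. The names `feature_mapper`, `sum_reducer` and `run_streaming`, and the `read_id<TAB>feature` input layout, are hypothetical.

```python
from itertools import groupby


def feature_mapper(line):
    """Map step: turn one 'read_id<TAB>feature' line into 'feature<TAB>1'."""
    _read_id, feature = line.rstrip("\n").split("\t", 1)
    yield "%s\t1" % feature


def sum_reducer(sorted_lines):
    """Reduce step: sum the values of adjacent lines that share a key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (key, sum(int(value) for _, value in group))


def run_streaming(lines, mapper, reducer):
    """Imitate the framework: map every line, sort by key, then reduce."""
    mapped = [out for line in lines for out in mapper(line)]
    mapped.sort(key=lambda out: out.split("\t", 1)[0])
    return list(reducer(mapped))


if __name__ == "__main__":
    toy_input = ["r1\tgeneA", "r2\tgeneB", "r3\tgeneA"]  # made-up records
    for out in run_streaming(toy_input, feature_mapper, sum_reducer):
        print(out)
```

The sort in the middle is the part the framework does for you; the mapper and reducer only ever see a stream of lines.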
Is the algorithm of the existing executables
parallelizable?
Can existing code operate on or be easily
modified to support stdin & stdout?
If not, can you wrap it?
Identify decision
points to split
code into
MapReduce style
http://www.recessframework.org/blog/category/PHP
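When an executable only accepts a file path, one way to wrap it is to spool stdin to a temporary file, run the tool on that file, and forward its output to stdout. The sketch below assumes the legacy tool takes the input path as its last argument and writes results to stdout; `run_wrapped` and the example command are hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def run_wrapped(command, stdin=sys.stdin, stdout=sys.stdout):
    """Spool stdin to a temp file, run `command + [path]`, forward its stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        for line in stdin:
            tmp.write(line)
        path = tmp.name
    try:
        # check=True makes a failing tool fail the task instead of passing silently.
        result = subprocess.run(
            command + [path], capture_output=True, text=True, check=True
        )
        stdout.write(result.stdout)
    finally:
        os.remove(path)


if __name__ == "__main__":
    # Hypothetical usage: every argument before the input path comes from argv.
    run_wrapped(sys.argv[1:])
```

Spooling costs one extra copy of the split, which is usually acceptable when the alternative is modifying the tool itself.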
Minimize Change
Test in local mode first
$ cat inputFile | mapper.sh | sort | reducer.sh > outputFile
http://wiki.apache.org/hadoop/HadoopStreaming
“Given a file with aligned sequencing reads and
a list of genomic features, a common task is to
count how many reads map to each feature.”
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
[Figure: reads aligned to a gene feature (labels: gene, read ×3, 2)]
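The counting task quoted above can be sketched without any genomics library. This is a simplified illustration, not htseq-count itself: features and reads are plain half-open intervals, a read counts for a feature only when it overlaps exactly one, and the name `count_reads` and the toy coordinates are made up.

```python
from collections import Counter


def count_reads(features, reads):
    """Count reads per feature.

    features: {name: (chrom, start, end)}, reads: [(chrom, start, end)],
    all intervals half-open. Returns (counts, empty, ambiguous).
    """
    counts = Counter({name: 0 for name in features})
    empty = ambiguous = 0
    for chrom, start, end in reads:
        hits = {
            name
            for name, (f_chrom, f_start, f_end) in features.items()
            if f_chrom == chrom and start < f_end and f_start < end
        }
        if not hits:
            empty += 1          # read overlaps no feature
        elif len(hits) > 1:
            ambiguous += 1      # read overlaps several features
        else:
            counts[hits.pop()] += 1
    return counts, empty, ambiguous


if __name__ == "__main__":
    toy_features = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 150, 300)}
    toy_reads = [("chr1", 110, 140), ("chr1", 160, 190), ("chr2", 10, 50)]
    print(count_reads(toy_features, toy_reads))
```

Each read is judged independently of every other read, which is what makes the task a good fit for a map step.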
hadoop jar $HADOOP_STREAMING \
  -input my_experiment.sam -output count \
  -mapper 'mapper -q -s no - features.gtf' \
  -reducer 'reducer.py' \
  -file dist/mapper -file features.gtf -file reducer.py
htseq-count -q -s no my_experiment.sam features.gtf
#… do crazy parsing using python libs and stuff…
try:
    # Read SAM records from stdin instead of a named file
    read_seq = iter( HTSeq.SAM_Reader( sys.stdin ) )
    first_read = next( read_seq )
    read_seq = itertools.chain( [ first_read ], read_seq )
    pe_mode = first_read.paired_end
    for r in read_seq:
        # do more algorithm and validation stuff…
        fs = set()
        for iv in iv_seq:
            if iv.chrom not in features.chrom_vectors:
                raise UnknownChrom
            for iv2, fs2 in features[ iv ].steps():
                fs = fs.union( fs2 )
        if fs is None or len( fs ) == 0:
            empty += 1
        elif len( fs ) > 1:
            ambiguous += 1
        else:
            counts[ list(fs)[0] ] += 1
# … exception handling elided …
# Emit one "feature<TAB>count" line per feature for the reducer
for fn in sorted( counts.keys() ):
    print( "%s\t%d" % ( fn, counts[fn] ) )
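#!/usr/bin/env python
import sys

current_fn = None
current_count = 0
fn = None

# Input arrives sorted by feature name, so equal keys are adjacent.
for line in sys.stdin:
    fn, count = line.split('\t', 1)
    count = int(count)
    if current_fn == fn:
        current_count += count
    else:
        # Key changed: emit the finished total, then start a new one.
        if current_fn:
            print('%s\t%s' % (current_fn, current_count))
        current_count = count
        current_fn = fn

# Emit the total for the last feature (skipped when there was no input).
if current_fn:
    print('%s\t%s' % (current_fn, current_count))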
Split python script into mapper and reducer
No Change to command line args
Reused all dependent libraries
Run in MR mode or local mode
$ cat my_experiment.sam | mapper -q -s no - features.gtf | sort |
python reducer.py
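The split works because per-feature counts are additive: counting each input split separately and summing the partial counts gives the same totals as one pass over the whole file. A minimal sketch of that property, using hypothetical names (`map_chunk`, `reduce_counts`) and a made-up list of feature assignments:

```python
from collections import Counter


def map_chunk(assignments):
    """Count the feature assignments seen in one input split."""
    return Counter(assignments)


def reduce_counts(partials):
    """Sum the partial counts produced by all map tasks."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    assigned = ["geneA", "geneB", "geneA", "geneC", "geneA", "geneB"]
    chunks = [assigned[:2], assigned[2:4], assigned[4:]]
    merged = reduce_counts(map_chunk(chunk) for chunk in chunks)
    # The merged result matches a single pass over all assignments.
    print(merged == Counter(assigned))
```

An algorithm whose result depends on seeing all records at once would not decompose this way and would need a different split point.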
Analysis determines variants and quality scores
Legacy code written in Java and R, part of a larger
pipeline
Embedded on an app server, responding to JMS messages
Parallelizable
Wrapped input to read from streamed files
Map operates on a plate of data at one time
Minimal change, only in how the data is passed
Reused existing Java and R code by simply writing
a transformation class
2 days of work during “Innovation Days” to modify
for Hadoop by following this strategy
Value: 75k plates in 4 minutes on a 14-node cluster
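The "one plate per map call" idea can be illustrated as follows. The actual code in the talk is Java and R and is not shown on the slides, so this Python sketch is only an analogy: the `plate_id<TAB>well<TAB>value` record layout, the function names, and the placeholder `score_plate` are all assumptions.

```python
from itertools import groupby


def plates(lines):
    """Group consecutive 'plate_id<TAB>well<TAB>value' lines by plate_id."""
    records = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for plate_id, group in groupby(records, key=lambda rec: rec[0]):
        yield plate_id, [rec[1:] for rec in group]


def score_plate(wells):
    """Placeholder for the existing scoring code; here it only counts wells."""
    return len(wells)


def run_mapper(lines, score=score_plate):
    """Hand each complete plate to the scoring function, one plate at a time."""
    for plate_id, wells in plates(lines):
        yield "%s\t%s" % (plate_id, score(wells))


if __name__ == "__main__":
    import sys
    for out in run_mapper(sys.stdin):
        print(out)
```

Only the thin adapter that turns streamed lines into a plate is new; the scoring function stands in for the code that stays untouched.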
Crossbow is a scalable software pipeline for whole
genome resequencing analysis. It combines
Bowtie, an ultrafast and memory efficient short
read aligner, and SOAPsnp, an accurate
genotyper. These tools are combined in an
automatic, parallel pipeline…
http://bowtie-bio.sourceforge.net/crossbow/index.shtml
“Hadoop also supports a 'streaming' mode of
operation whereby the map and reduce
functions are delegated to command-line scripts
or compiled programs written in any language.
…
This allows Crossbow to reuse existing software
for aligning reads and calling SNPs while
automatically gaining the scaling benefits of
Hadoop.”
http://genomebiology.com/2009/10/11/R134
http://www.cbcb.umd.edu/~mschatz/Posters/Crossbow_WABI_Sept2009.pdf
A Crossbow-specific output format was
implemented that encodes an alignment as a
tuple where the tuple's key identifies a
reference partition and the value describes the
alignment.
http://genomebiology.com/2009/10/11/R134
http://bowtie-bio.sourceforge.net/crossbow/
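Keying each alignment by a reference partition means the shuffle delivers all alignments for one stretch of the genome to the same reducer. The sketch below shows the idea only; the key layout (`chrom:index`), the partition size and the value fields are assumptions for illustration, not Crossbow's actual format.

```python
PARTITION_SIZE = 1000000  # hypothetical partition width in bases


def partition_key(chrom, pos, partition_size=PARTITION_SIZE):
    """Identify the reference partition that contains position `pos`."""
    return "%s:%d" % (chrom, pos // partition_size)


def encode_alignment(chrom, pos, read_seq):
    """Emit one 'key<TAB>value' line: partition key, then alignment details."""
    value = "%d\t%s" % (pos, read_seq)
    return "%s\t%s" % (partition_key(chrom, pos), value)


if __name__ == "__main__":
    print(encode_alignment("chr1", 2500000, "ACGT"))
```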
A new input format (option --12) was added, allowing
Bowtie to recognize the one-read-per-line format
produced by the Crossbow preprocessor.
http://genomebiology.com/2009/10/11/R134
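A one-read-per-line layout matters because line-oriented input can be split at any newline, while a FASTQ record spans four lines. The sketch below flattens FASTQ records into single tab-delimited lines; the field order (name, sequence, qualities) is an assumption here, so check the Bowtie manual for the exact layout `--12` expects. `fastq_to_lines` is a hypothetical helper, not the Crossbow preprocessor.

```python
def fastq_to_lines(fastq_lines):
    """Flatten 4-line FASTQ records into 'name<TAB>sequence<TAB>qualities'."""
    lines = [line.rstrip("\n") for line in fastq_lines if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("truncated FASTQ input: expected 4 lines per read")
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        if not header.startswith("@"):
            raise ValueError("bad FASTQ header: %r" % header)
        yield "\t".join((header[1:], seq, qual))


if __name__ == "__main__":
    import sys
    for out in fastq_to_lines(sys.stdin):
        print(out)
```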
The version of SOAPsnp used in Crossbow was
modified to accept alignment records output by
modified Bowtie ... None of the modifications made to
SOAPsnp fundamentally affect how consensus bases or
SNPs are called
http://genomebiology.com/2009/10/11/R134
Adoption of Hadoop for discovery
Rapid development times
Executables as stand-alone packages
Assess fit
It’s still MapReduce
Minimize change
Test in local mode
http://i.qkme.me/3pgc1j.jpg
Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Case Study - StampedeCon 2013

Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
 

Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Case Study - StampedeCon 2013

  • 1. Jeff Melching Monsanto: Lead Big Data Engineer Twitter: @melchbox 8/7/2013
  • 2. Intro to the genomic space @Monsanto Strategy for Legacy Analysis Example Use Cases htseq-count: Counting reads in features Genotype Scoring Crossbow
  • 3. “We succeed when farmers succeed.” -Hugh Grant, Monsanto CEO Monsanto Company is a leading global provider of technology-based tools and agricultural products that improve farm productivity and food quality. We work to deliver agricultural products and solutions to: • Meet the world’s growing food needs • Conserve natural resources • Protect the environment Monsanto Company Confidential
  • 4. Genomics : a discipline in genetics that applies recombinant DNA, DNA sequencing methods, and bioinformatics to sequence, assemble, and analyze the function and structure of genomes (the complete set of DNA within a single cell of an organism) http://en.wikipedia.org/wiki/Genomics
  • 5. New gene discovery, gene expression Evolutionary population genetics Insect Control Disease resistance targets Marker discovery and variation analysis Genotyping and fingerprinting New vegetable reference genomes Marker discovery and variation analysis Disease resistance Viral and fungal resistance Targets for topical RNAi Seed Treatments Yield & Stress Agricultural Traits Molecular Breeding Vegetable Quality & Disease Plant Health Chemistry
  • 6. 30+ years of increasing computational power, open source tools and knowledge Two distinct workloads Production workflows Discovery analytics Computational pipelines
  • 7.
  • 8. Grid Computing High Performance Storage File Processing Hadoop HDFS Block processing perl python C/C++ R bash MapReduce Java Pig Hive
  • 9. The work is done, why port it to java? Can I get it done quickly? Where’s the value?
  • 10. Math is hard Genomic algorithms are harder Coding it is harder still Coding it correctly…
  • 12. Minimizing change, in order to leverage the existing pipelines, tools and knowledge in their natural state, requires a common platform that is language neutral and easily consumable: stdin & stdout
  • 13. Creates map and reduce tasks. Controls map and reduce defined executables. Feeds data to stdin of the process, collects output from stdout. Equivalent to using pipes. [Diagram: Input → Mapper (stdin → Map Exe → stdout) → Reducer (stdin → Reduce Exe → stdout) → Output]
  • 14. Is the algorithm of existing executables parallelizable?
  • 15. Can existing code operate on or be easily modified to support stdin & stdout? If not, can you wrap it?
  • 16. Identify decision points to split code into MapReduce style http://www.recessframework.org/blog/category/PHP
  • 18. Test in local mode first:
      $ cat inputFile | mapper.sh | reducer.sh > outputFile
      http://wiki.apache.org/hadoop/HadoopStreaming
  • 19.
  • 20. “Given a file with aligned sequencing reads and a list of genomic features, a common task is to count how many reads map to each feature.” http://www-huber.embl.de/users/anders/HTSeq/doc/count.html [Figure labels: gene, read, read, read, 2]
  • 21. Hadoop Streaming invocation:
      hadoop jar $HADOOP_STREAMING \
        -input my_experiment.sam \
        -output count \
        --mapper 'mapper -q -s no - features.gtf' \
        --reducer 'reducer.py' \
        -file dist/mapper \
        -file features.gtf \
        -file reducer.py
      Original command line:
      htseq-count -q -s no my_experiment.sam features.gtf
  • 22. Mapper (excerpt):
      #… do crazy parsing using python libs and stuff…
      try:
          read_seq = iter( HTSeq.SAM_Reader( sys.stdin ) )
          first_read = read_seq.next()
          read_seq = itertools.chain( [ first_read ], read_seq )
          pe_mode = first_read.paired_end
          for r in read_seq:
              # do more algorithm and validation stuff…
              fs = set()
              for iv in iv_seq:
                  if iv.chrom not in features.chrom_vectors:
                      raise UnknownChrom
                  for iv2, fs2 in features[ iv ].steps():
                      fs = fs.union( fs2 )
              if fs is None or len( fs ) == 0:
                  empty += 1
              elif len( fs ) > 1:
                  ambiguous += 1
              else:
                  counts[ list(fs)[0] ] += 1
      for fn in sorted( counts.keys() ):
          print "%s\t%d" % ( fn, counts[fn] )
  • 23. Reducer:
      #!/usr/bin/env python
      import sys

      current_fn = None
      current_count = 0
      fn = None

      for line in sys.stdin:
          fn, count = line.split('\t', 1)
          count = int(count)
          if current_fn == fn:
              current_count += count
          else:
              if current_fn:
                  print '%s\t%s' % (current_fn, current_count)
              current_count = count
              current_fn = fn

      if current_fn == fn:
          print '%s\t%s' % (current_fn, current_count)
  • 24. Split python script into mapper and reducer. No change to command line args. Reused all dependent libraries. Run in MR mode or local mode:
      $ cat my_experiment.sam | mapper -q -s no - features.gtf | sort | python reducer.py
  • 25.
  • 26. Analysis determines variants and quality scores. Legacy code written in Java and R, part of a larger pipeline. Embedded on app server and responds to JMS.
  • 27. Parallelizable. Wrapped input to read from streamed files. Map operates on a plate of data at one time. Minimal change, only in how the data is passed.
  • 28. Reused existing Java and R code by simply writing a transformation class. 2 days of work during “Innovation Days” to modify for Hadoop by following this strategy. Value: 75k plates in 4 minutes on 14 node cluster.
  • 29.
  • 30. Crossbow is a scalable software pipeline for whole genome resequencing analysis. It combines Bowtie, an ultrafast and memory efficient short read aligner, and SoapSNP, an accurate genotyper. These tools are combined in an automatic, parallel pipeline… http://bowtie-bio.sourceforge.net/crossbow/index.shtml
  • 31.
  • 32. “Hadoop also supports a 'streaming' mode of operation whereby the map and reduce functions are delegated to command-line scripts or compiled programs written in any language. … This allows Crossbow to reuse existing software for aligning reads and calling SNPs while automatically gaining the scaling benefits of Hadoop.” http://genomebiology.com/2009/10/11/R134 http://www.cbcb.umd.edu/~mschatz/Posters/Crossbow_WABI_Sept2009.pdf
  • 33. Crossbow-specific output format was implemented that encodes an alignment as a tuple where the tuple's key identifies a reference partition and the value describes the alignment. http://genomebiology.com/2009/10/11/R134 http://bowtie-bio.sourceforge.net/crossbow/
  • 34. A new input format (option --12) was added, allowing Bowtie to recognize the one-read-per-line format produced by the Crossbow preprocessor. http://genomebiology.com/2009/10/11/R134
  • 35. The version of SOAPsnp used in Crossbow was modified to accept alignment records output by modified Bowtie ... None of the modifications made to SOAPsnp fundamentally affect how consensus bases or SNPs are called http://genomebiology.com/2009/10/11/R134
  • 36. Adoption of Hadoop for discovery. Rapid development times. Executables as stand-alone packages.
  • 37. Assess fit. It’s still MapReduce. Minimize change. Test in local mode. http://i.qkme.me/3pgc1j.jpg
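
The local-mode test on slides 18 and 24 (mapper, then sort, then reducer) can be emulated in a single process. The following is a minimal Python 3 sketch, not the code from the talk: `map_lines`, `reduce_lines` and `run_local` are hypothetical names, and the toy input format (`read_id<TAB>feature`) is an assumption standing in for the SAM parsing that the real mapper does through HTSeq. The reducer uses `itertools.groupby` in place of the hand-written key-change loop on slide 23.

```python
from itertools import groupby


def map_lines(lines):
    """Toy mapper (hypothetical): emit 'feature<TAB>1' for each input record.

    Assumes each input line is 'read_id<TAB>feature'; the mapper on
    slide 22 derives the feature from SAM alignments instead.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        _read_id, feature = line.split("\t", 1)
        yield "%s\t1" % feature


def reduce_lines(sorted_lines):
    """Sum the counts for each key; input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for feature, group in groupby(parsed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (feature, sum(int(count) for _, count in group))


def run_local(lines):
    """Emulate `cat input | mapper | sort | reducer` in one process."""
    return list(reduce_lines(sorted(map_lines(lines))))


if __name__ == "__main__":
    sample = ["r1\tgeneA\n", "r2\tgeneB\n", "r3\tgeneA\n"]
    for out in run_local(sample):
        print(out)
```

The `sorted` call stands in for Hadoop's shuffle phase. A reducer that detects key changes only gives correct totals when all lines for a key arrive together, which is why the slide 24 pipeline includes `sort`.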

Editor's Notes

  1. HTSeq: open source with modifications. Genotype Scoring: custom wrappers. Crossbow: open source, modifying input to underlying programs.
  2. Started in the ’70s and ’80s, took off in the ’90s as computational power and techniques grew. A toolbox full of tools which may or may not play well together, written by those with roots in academia and computational biology. Analytics software and algorithms have been built up over the past 30 years by contributions from both the public and private domain and written in a number of programming languages. When these software packages are brought in house and combined with the skills and preferences of internal bioinformatics researchers, what you get is a myriad of different technologies linked together in an analytics pipeline. Many different types of file formats, all widely accepted and used for different stages of analysis.
  3. An open source project in the wild that adopted the essence of strategy
  4. Let’s make a shift to the Hadoop platform… are these statements true?
  5. Assume Hadoop is a good fit for what we are trying to do. Can I trust myself or others to write, understand and excel at all these languages?
  6. Is the algorithm of existing executables parallelizable? Can your existing code operate on or be easily modified to support stdin? If not, can you wrap it? Is there a decision point in the code that makes sense to split into mapper and reducer? Do you need multiple jobs? Test in “local mode” first. Keep as much of it the same as possible; don’t want to screw it up.
  7. Is the algorithm of existing executables parallelizable? Can your existing code operate on or be easily modified to support stdin? If not, can you wrap it? Is there a decision point in the code that makes sense to split into mapper and reducer? Do you need multiple jobs? Test in “local mode” first. Keep as much of it the same as possible; don’t want to screw it up.
  8. Remember you don’t want to screw it up.
  9. An open source project in the wild that adopted the essence of strategy
  10. A feature is here an interval (i.e., a range of positions) on a chromosome or a union of such intervals. Used to identify gene expression levels. Python based. Part of a larger “pipeline” that includes additional R based analytics.
  11. An open source project in the wild that adopted the essence of strategy
  12. Reaching bottleneck of what could be processed. Needed a more parallel process.
  13. 75k plates. Read data from stdin rather than from db. Pass newly constructed object to existing algorithm. Old architecture would have taken nearly an hour (51 minutes) with same number of machines.
  14. An open source project in the wild that adopted the essence of strategy
  15. An open source project in the wild that adopted the essence of strategy
  16. Speed improvements were also made to SOAPsnp, including an improvement for the case where the input alignments cover only a small interval of a chromosome, as is the case when Crossbow invokes SOAPsnp on a single partition. These features allow many Bowtie processes, each acting as an independent mapper, to run in parallel on a multi-core computer while sharing a single in-memory image of the reference index. This maximizes alignment throughput when cluster computers contain many CPUs but limited memory.
  17. Speed improvements were also made to SOAPsnp, including an improvement for the case where the input alignments cover only a small interval of a chromosome, as is the case when Crossbow invokes SOAPsnp on a single partition. These features allow many Bowtie processes, each acting as an independent mapper, to run in parallel on a multi-core computer while sharing a single in-memory image of the reference index. This maximizes alignment throughput when cluster computers contain many CPUs but limited memory.
  18. NFS becomes unruly for shared environments. Python: PyInstaller. Don’t need to manage any kinds of dependencies or versions on the data nodes. Don’t even need Python installed. As long as the OS is the same, let’s roll. Java: duh.
  19. NFS becomes unruly for shared environments. Python: PyInstaller. Don’t need to manage any kinds of dependencies or versions on the data nodes. Don’t even need Python installed. As long as the OS is the same, let’s roll. Java: duh.
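
Notes 6 and 7 ask whether code that cannot read stdin can be wrapped. One way to do that, sketched here in Python 3 under stated assumptions, is to spool the streamed input to a temporary file and pass its path to the unmodified executable. `run_wrapped` is a hypothetical helper, not part of Hadoop or of the pipelines in the talk, and the sketch assumes the legacy tool takes the input path as its last argument and writes its results to stdout.

```python
import os
import subprocess
import sys
import tempfile


def run_wrapped(command, stdin_bytes):
    """Spool stdin_bytes to a temp file, run `command + [path]`, return its stdout.

    `command` is the legacy executable and its fixed arguments (assumed to
    accept the input file path as the final argument).
    """
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "split.in")
        with open(path, "wb") as spool:
            spool.write(stdin_bytes)
        result = subprocess.run(
            command + [path], stdout=subprocess.PIPE, check=True
        )
    return result.stdout


# Used as a streaming mapper: wrapper.py <legacy executable> [args...]
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.stdout.buffer.write(run_wrapped(sys.argv[1:], sys.stdin.buffer.read()))
```

Because the wrapper only changes how data reaches the tool, the legacy command line and its dependencies stay as they were. Reading the whole split into memory is a simplification; a larger input would be copied to the spool file in chunks.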