Monsanto Lead Big Data Engineer Discusses Genomics Strategy and Hadoop Adoption

Jeff Melching
Monsanto: Lead Big Data Engineer
Twitter: @melchbox
8/7/2013

Intro to the genomic space @Monsanto
Strategy for Legacy Analysis
Example Use Cases
htseq-count: Counting reads in features
Genotype Scoring
Crossbow

“We succeed when farmers succeed.”
-Hugh Grant, Monsanto CEO
Monsanto Company is a leading global provider of
technology-based tools and agricultural products
that improve farm productivity and food quality.
We work to deliver agricultural products and
solutions to:
• Meet the world’s growing food needs
• Conserve natural resources
• Protect the environment

Genomics: a discipline in genetics that applies
recombinant DNA, DNA sequencing
methods, and bioinformatics to
sequence, assemble, and analyze the function
and structure of genomes (the complete set of
DNA within a single cell of an organism)
http://en.wikipedia.org/wiki/Genomics

New gene discovery, gene expression
Evolutionary population genetics
Insect Control
Disease resistance targets
Marker discovery and variation analysis
Genotyping and fingerprinting
New vegetable reference genomes
Marker discovery and variation analysis
Disease resistance
Viral and fungal resistance
Targets for topical RNAi
Seed Treatments
Yield & Stress
Agricultural Traits
Molecular Breeding
Vegetable Quality &
Disease
Plant Health
Chemistry

30+ years of increasing computational
power, open source tools and knowledge
Two distinct workloads
Production workflows
Discovery analytics
Computational pipelines

Grid Computing
High Performance
Storage
File Processing
Hadoop
HDFS
Block processing
perl
python
C/C++
R
bash
MapReduce
Java
Pig
Hive

The work is done; why port it to Java?
Can I get it done quickly?
Where’s the value?

Math is hard
Genomic algorithms are harder
Coding it is harder still
Coding it correctly…

http://gapingvoid.com/2008/06/13/now-what/

Minimizing change in order to leverage the
existing pipelines, tools and knowledge in their
natural state, requires a common platform that
is language neutral and easily consumable
stdin & stdout

Hadoop Streaming:
Creates map and reduce tasks
Controls the executables defined for map and reduce
Feeds data to stdin of the process, collects output from stdout
Equivalent to using pipes
[Diagram: Input -> Mapper (stdin -> Map Exe -> stdout) -> Reducer (stdin -> Reduce Exe -> stdout) -> Output]
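The contract above can be modelled in a few lines. This is a minimal Python 3 sketch, not Hadoop code: the names run_streaming, word_mapper and sum_reducer are made up for illustration, and in-memory lists stand in for stdin and stdout. It only shows the shape of the data flow: lines in, tab-separated key/value lines out, a sort by key between the two phases.

```python
from itertools import groupby


def run_streaming(lines, mapper, reducer):
    """Toy model of a streaming job: map, sort by key, then reduce."""
    # The mapper consumes "stdin" lines and emits "key<TAB>value" lines.
    mapped = list(mapper(lines))
    # The framework sorts mapper output by key before the reduce phase.
    mapped.sort(key=lambda line: line.split("\t", 1)[0])
    # The reducer sees all lines for a given key next to each other.
    return list(reducer(mapped))


def word_mapper(lines):
    # Hypothetical mapper: emit each token with a count of 1.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def sum_reducer(lines):
    # Hypothetical reducer: sum the counts of each run of identical keys.
    keyed = (line.split("\t", 1) for line in lines)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(value) for _, value in group))


result = run_streaming(["a b a", "b a"], word_mapper, sum_reducer)
```

In a real job the mapper and reducer are separate executables and the framework does the sorting, which is why any language that can read stdin and write stdout qualifies.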

Is the algorithm of the existing executables
parallelizable?

Can existing code operate on or be easily
modified to support stdin & stdout?
If not, can you wrap it?
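One way to "wrap it", sketched in Python 3 under the assumption that the legacy tool only accepts file paths: spool stdin to a temporary file, run the tool, then stream its output file to stdout. Both wrap_file_tool and uppercase_tool are hypothetical names; uppercase_tool is a stand-in for a real legacy executable, which would typically be launched with subprocess instead.

```python
import os
import shutil
import tempfile


def wrap_file_tool(tool, stdin, stdout):
    """Adapt a file-in/file-out tool to the stdin/stdout contract."""
    workdir = tempfile.mkdtemp()
    try:
        in_path = os.path.join(workdir, "input.txt")
        out_path = os.path.join(workdir, "output.txt")
        # Spool everything arriving on stdin into a real file.
        with open(in_path, "w") as spool:
            shutil.copyfileobj(stdin, spool)
        # Run the unmodified tool against file paths, as it expects.
        tool(in_path, out_path)
        # Stream the tool's output file back out.
        with open(out_path) as produced:
            shutil.copyfileobj(produced, stdout)
    finally:
        shutil.rmtree(workdir)


def uppercase_tool(in_path, out_path):
    # Hypothetical legacy tool: reads one file, writes another.
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(line.upper())
```

As a streaming mapper this would be called as wrap_file_tool(uppercase_tool, sys.stdin, sys.stdout). The trade-off is that the whole input split is written to local disk before the tool starts.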

Identify decision
points to split
code into
MapReduce style
http://www.recessframework.org/blog/category/PHP

Test in local mode first
$ cat inputFile | mapper.sh | sort | reducer.sh > outputFile
http://wiki.apache.org/hadoop/HadoopStreaming
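The same local-mode check can be scripted, which helps when it should run as part of a test suite. This is a sketch in Python 3 (3.7 or later, for capture_output). The function run_local and the inlined MAPPER and REDUCER commands are hypothetical stand-ins for mapper.sh and reducer.sh, and the sort is done in Python rather than with the sort utility.

```python
import subprocess
import sys


def run_local(input_text, mapper_cmd, reducer_cmd):
    """Emulate `cat input | mapper | sort | reducer` without a cluster."""
    mapped = subprocess.run(mapper_cmd, input=input_text, capture_output=True,
                            text=True, check=True).stdout
    # Sorting whole lines stands in for the shuffle between map and reduce.
    shuffled = "".join(sorted(mapped.splitlines(keepends=True)))
    return subprocess.run(reducer_cmd, input=shuffled, capture_output=True,
                          text=True, check=True).stdout


# Hypothetical mapper: emit "token<TAB>1" for every token on stdin.
MAPPER = [sys.executable, "-c",
          "import sys\n"
          "for line in sys.stdin:\n"
          "    for w in line.split():\n"
          "        print(w + '\\t1')\n"]

# Hypothetical reducer: sum the values for each key.
REDUCER = [sys.executable, "-c",
           "import sys, collections\n"
           "c = collections.Counter()\n"
           "for line in sys.stdin:\n"
           "    k, v = line.rstrip('\\n').split('\\t')\n"
           "    c[k] += int(v)\n"
           "for k in sorted(c):\n"
           "    print(k + '\\t' + str(c[k]))\n"]

output = run_local("g1 g2 g1\n", MAPPER, REDUCER)
```

Any command list accepted by subprocess.run can replace MAPPER and REDUCER, so the real executables can be tested unchanged.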

“Given a file with aligned sequencing reads and
a list of genomic features, a common task is to
count how many reads map to each feature.”
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
[Figure: a gene with three aligned reads and a resulting count of 2]
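The counting task can be illustrated with a toy example. This Python 3 sketch is a simplification and not htseq-count's implementation: features and reads are plain half-open (start, end) intervals on a single chromosome, and the feature names, coordinates and the labels "no_feature" and "ambiguous" are made up for illustration.

```python
from collections import Counter

# Hypothetical features as (name, start, end), half-open, one chromosome.
FEATURES = [("geneA", 100, 200), ("geneB", 180, 300)]


def count_reads(reads, features):
    """Count reads per feature; reads hitting 0 or >1 features are tallied apart."""
    counts = Counter()
    for start, end in reads:
        # Collect every feature whose interval overlaps the read.
        hits = {name for name, f_start, f_end in features
                if start < f_end and f_start < end}
        if not hits:
            counts["no_feature"] += 1
        elif len(hits) > 1:
            counts["ambiguous"] += 1
        else:
            counts[hits.pop()] += 1
    return counts


reads = [(110, 150), (120, 160), (185, 195), (400, 450)]
counts = count_reads(reads, FEATURES)
```

Each read is scored independently of every other read, which is what makes the task a natural fit for a map phase followed by a summing reduce.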

$ hadoop jar $HADOOP_STREAMING \
    -input my_experiment.sam -output count \
    -mapper 'mapper -q -s no - features.gtf' \
    -reducer 'reducer.py' \
    -file dist/mapper -file features.gtf -file reducer.py
$ htseq-count -q -s no my_experiment.sam features.gtf

#… do crazy parsing using python libs and stuff…
try:
    read_seq = iter( HTSeq.SAM_Reader( sys.stdin ) )
    first_read = read_seq.next()
    read_seq = itertools.chain( [ first_read ], read_seq )
    pe_mode = first_read.paired_end
    for r in read_seq:
        # do more algorithm and validation stuff…
        fs = set()
        for iv in iv_seq:
            if iv.chrom not in features.chrom_vectors:
                raise UnknownChrom
            for iv2, fs2 in features[ iv ].steps():
                fs = fs.union( fs2 )
        if fs is None or len( fs ) == 0:
            empty += 1
        elif len( fs ) > 1:
            ambiguous += 1
        else:
            counts[ list(fs)[0] ] += 1
# (the matching except clause is not shown on the slide)
# emit one "feature<TAB>count" line per feature for the reducer
for fn in sorted( counts.keys() ):
    print "%s\t%d" % ( fn, counts[fn] )

#!/usr/bin/env python
import sys

current_fn = None
current_count = 0
fn = None

# input arrives sorted by feature name, one "feature<TAB>count" per line
for line in sys.stdin:
    fn, count = line.split('\t', 1)
    count = int(count)
    if current_fn == fn:
        current_count += count
    else:
        # feature changed: emit the finished total, then start a new one
        if current_fn:
            print '%s\t%s' % (current_fn, current_count)
        current_count = count
        current_fn = fn

# emit the total for the last feature
if current_fn == fn:
    print '%s\t%s' % (current_fn, current_count)
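For comparison, the same reduce step can be written more compactly with itertools.groupby. This is an alternative sketch in Python 3, not the code from the talk, and reduce_counts is a hypothetical name. Like the reducer above, it relies on its input being sorted by feature name.

```python
from itertools import groupby


def reduce_counts(lines):
    """Sum counts per feature; lines must already be sorted by feature name."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    # groupby yields one group per run of identical feature names.
    for fn, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (fn, sum(int(count) for _, count in group))


sample = ["geneA\t2\n", "geneA\t3\n", "geneB\t1\n"]
reduced = list(reduce_counts(sample))
```

As a streaming reducer it would loop over reduce_counts(sys.stdin) and print each result.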

Split python script into mapper and reducer
No Change to command line args
Reused all dependent libraries
Run in MR mode or local mode
$ cat my_experiment.sam | mapper -q -s no - features.gtf | sort |
python reducer.py

Analysis determines variants and quality scores
Legacy code, written in Java and R, is part of a larger
pipeline
Embedded in an app server and responds to JMS messages

Parallelizable
Wrapped input to read from streamed files
Map operates on a plate of data at one time
Minimal change: only in how the data is passed

Reused existing Java and R code by simply writing
a transformation class
2 days of work during “Innovation Days” to modify
for Hadoop by following this strategy
Value: 75k plates in 4 minutes on a 14-node cluster
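To put that figure in per-second terms, here is the arithmetic as a short Python 3 snippet. It assumes "75k" means exactly 75,000 plates and that the load was spread evenly over the nodes; neither is stated on the slide.

```python
plates = 75_000      # "75k plates", assumed to mean exactly 75,000
seconds = 4 * 60     # 4 minutes
nodes = 14

plates_per_second = plates / seconds                     # whole cluster
plates_per_second_per_node = plates_per_second / nodes   # even-split assumption
```

That works out to 312.5 plates per second across the cluster, or roughly 22 plates per second per node.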

Crossbow is a scalable software pipeline for whole
genome resequencing analysis. It combines
Bowtie, an ultrafast and memory efficient short
read aligner, and SoapSNP, an accurate
genotyper. These tools are combined in an
automatic, parallel pipeline…
http://bowtie-bio.sourceforge.net/crossbow/index.shtml

“Hadoop also supports a 'streaming' mode of
operation whereby the map and reduce
functions are delegated to command-line scripts
or compiled programs written in any language.
…
This allows Crossbow to reuse existing software
for aligning reads and calling SNPs while
automatically gaining the scaling benefits of
Hadoop.”
http://genomebiology.com/2009/10/11/R134
http://www.cbcb.umd.edu/~mschatz/Posters/Crossbow_WABI_Sept2009.pdf

Crossbow-specific output format was
implemented that encodes an alignment as a
tuple where the tuple's key identifies a
reference partition and the value describes the
alignment.
http://bowtie-bio.sourceforge.net/crossbow/
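The idea of keying each alignment by reference partition can be sketched as follows. This Python 3 example is purely illustrative and is not Crossbow's actual record layout: the partition size, the "chrom:partition" key and the comma-separated value are all assumptions made for the example.

```python
PARTITION_SIZE = 1_000_000  # hypothetical partition length in bases


def alignment_to_tuple(chrom, pos, strand, seq):
    """Encode one alignment as "key<TAB>value", keyed by reference partition."""
    partition = pos // PARTITION_SIZE
    # The key decides which reducer receives the alignment.
    key = "%s:%d" % (chrom, partition)
    # The value carries what is needed to describe the alignment itself.
    value = "%d,%s,%s" % (pos, strand, seq)
    return "%s\t%s" % (key, value)


record = alignment_to_tuple("chr1", 2_345_678, "+", "ACGT")
```

Because the framework groups records by key, all alignments that fall in the same stretch of the reference arrive at the same reduce task, which can then process that partition on its own.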

A new input format (option --12) was added, allowing
Bowtie to recognize the one-read-per-line format
produced by the Crossbow preprocessor.

The version of SOAPsnp used in Crossbow was
modified to accept alignment records output by
modified Bowtie ... None of the modifications made to
SOAPsnp fundamentally affect how consensus bases or
SNPs are called

Adoption of Hadoop for discovery
Rapid development times
Executables as stand-alone packages

Assess fit
It’s still MapReduce
Minimize change
Test in local mode
http://i.qkme.me/3pgc1j.jpg
