Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Case Study - StampedeCon 2013

At the StampedeCon 2013 Big Data conference in St. Louis, Jeff Melching, Big Data Engineer and Architect at Monsanto, discussed Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Case Study. The bioinformatics domain, and computational genomics in particular, has always had the problem of computing analytics against very large data sets. Traditionally, these analytics have leveraged grid and compute farm technologies. Additionally, the analytics software and algorithms have been built up over the past 30 years by contributions from both the public and private domain and written in a number of programming languages. When these software packages are brought in house and combined with the skills and preferences of internal bioinformatics researchers, what you get is a myriad of different technologies linked together in an analytics pipeline. The rise of technologies like MapReduce in Hadoop has made the execution of such pipelines much more efficient, but what about all those analytic pipelines I have built up over the years that aren’t written in MapReduce? Do I have to rewrite them? Do I have to know Java? This talk will explain how Hadoop Streaming can help you reuse instead of rewriting. It will also touch on techniques for packaging and deploying Hadoop applications without having to centrally manage software versions on the cluster.

Published in: Technology
Notes
  • HTSeq: open source with modifications. Genotype Scoring: custom wrappers. Crossbow: open source, modifying the input to the underlying programs.
  • Started in the 1970s and 1980s and took off in the 1990s as computational power and techniques grew. A toolbox full of tools, which may or may not play well together, written by those with roots in academia and computational biology. Analytics software and algorithms have been built up over the past 30 years by contributions from both the public and private domain and written in a number of programming languages. When these software packages are brought in house and combined with the skills and preferences of internal bioinformatics researchers, what you get is a myriad of different technologies linked together in an analytics pipeline. Many different types of file formats, all widely accepted and used for different stages of analysis.
  • An open source project in the wild that adopted the essence of the strategy.
  • Let’s make a shift to the Hadoop platform… are these statements true?
  • Assume Hadoop is a good fit for what we are trying to do. Can I trust myself or others to write, understand and excel at all these languages?
  • Is the algorithm of the existing executables parallelizable? Can your existing code operate on or be easily modified to support stdin? If not, can you wrap it? Is there a decision point in the code that makes sense to split into mapper and reducer? Do you need multiple jobs? Test in “local mode” first. Keep as much of it the same as possible; you don’t want to screw it up.
  • Remember you don’t want to screw it up.
  • A feature is here an interval (i.e., a range of positions) on a chromosome or a union of such intervals. Used to identify gene expression levels. Python based. Part of a larger “pipeline” that includes additional R-based analytics.
  • Reaching bottleneck of what could be processed. Needed a more parallel process.
  • 75k plates. Read data from stdin rather than from the database. Pass the newly constructed object to the existing algorithm. The old architecture would have taken nearly an hour (51 minutes) with the same number of machines.
  • Speed improvements were also made to SOAPsnp, including an improvement for the case where the input alignments cover only a small interval of a chromosome, as is the case when Crossbow invokes SOAPsnp on a single partition. These features allow many Bowtie processes, each acting as an independent mapper, to run in parallel on a multi-core computer while sharing a single in-memory image of the reference index. This maximizes alignment throughput when cluster computers contain many CPUs but limited memory.
  • NFS becomes unruly for shared environments. Python: PyInstaller. Don’t need to manage any kind of dependencies or versions on the data nodes. Don’t even need Python installed. As long as the OS is the same, let’s roll. Java: duh.

    1. Jeff Melching, Lead Big Data Engineer, Monsanto. Twitter: @melchbox. 8/7/2013
    2. Intro to the genomic space @Monsanto; Strategy for Legacy Analysis; Example Use Cases: htseq-count (counting reads in features), Genotype Scoring, Crossbow
    3. “We succeed when farmers succeed.” (Hugh Grant, Monsanto CEO) Monsanto Company is a leading global provider of technology-based tools and agricultural products that improve farm productivity and food quality. We work to deliver agricultural products and solutions to meet the world’s growing food needs, conserve natural resources, and protect the environment.
    4. Genomics: a discipline in genetics that applies recombinant DNA, DNA sequencing methods, and bioinformatics to sequence, assemble, and analyze the function and structure of genomes (the complete set of DNA within a single cell of an organism). http://en.wikipedia.org/wiki/Genomics
    5. New gene discovery, gene expression; Evolutionary population genetics; Insect Control; Disease resistance targets; Marker discovery and variation analysis; Genotyping and fingerprinting; New vegetable reference genomes; Marker discovery and variation analysis; Disease resistance; Viral and fungal resistance; Targets for topical RNAi; Seed Treatments; Yield & Stress; Agricultural Traits; Molecular Breeding; Vegetable Quality & Disease; Plant Health; Chemistry
    6. 30+ years of increasing computational power, open source tools and knowledge. Two distinct workloads: production workflows and discovery analytics. Computational pipelines.
    7. Grid Computing: High Performance Storage, File Processing; perl, python, C/C++, R, bash. Hadoop: HDFS, Block processing; MapReduce, Java, Pig, Hive.
    8. The work is done, why port it to Java? Can I get it done quickly? Where’s the value?
    9. Math is hard. Genomic algorithms are harder. Coding it is harder still. Coding it correctly…
    10. [Image] http://gapingvoid.com/2008/06/13/now-what/
    11. Minimizing change in order to leverage the existing pipelines, tools and knowledge in their natural state requires a common platform that is language neutral and easily consumable: stdin & stdout
    12. Creates map and reduce tasks. Controls the executables defined for map and reduce. Feeds data to stdin of the process, collects output from stdout. Equivalent to using pipes. [Diagram: Input → Mapper (stdin → Map Exe → stdout) → Reducer (stdin → Reduce Exe → stdout) → Output]
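
The "equivalent to using pipes" point on slide 12 can be illustrated with a small local sketch. Everything here is an assumption made for illustration: `run_stage` and `run_streaming_locally` are hypothetical helpers, the two inline scripts are stand-ins for real map and reduce executables, and a plain line sort stands in for Hadoop's shuffle.

```python
import subprocess
import sys


def run_stage(cmd, text):
    """Run one executable, feed `text` to its stdin, and return what it wrote to stdout."""
    result = subprocess.run(cmd, input=text, capture_output=True, text=True, check=True)
    return result.stdout


def run_streaming_locally(mapper_cmd, reducer_cmd, text):
    """Hypothetical helper: mapper -> sort -> reducer, like `cat in | map | sort | reduce`."""
    mapped = run_stage(mapper_cmd, text)
    # Sorting whole lines brings identical keys together; a stand-in for the shuffle.
    shuffled = "".join(sorted(mapped.splitlines(keepends=True)))
    return run_stage(reducer_cmd, shuffled)


# Stand-in executables: any program that reads stdin and writes stdout would do.
MAPPER_SRC = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.strip().upper() + '\\t1')\n"
)
REDUCER_SRC = (
    "import sys, collections\n"
    "totals = collections.Counter()\n"
    "for line in sys.stdin:\n"
    "    key, n = line.split('\\t')\n"
    "    totals[key] += int(n)\n"
    "for key in sorted(totals):\n"
    "    print(key + '\\t' + str(totals[key]))\n"
)
MAPPER_CMD = [sys.executable, "-c", MAPPER_SRC]
REDUCER_CMD = [sys.executable, "-c", REDUCER_SRC]
```

Because each stage only sees stdin and stdout, the same two commands could in principle be handed to a streaming job unchanged, which is the property the strategy relies on.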
    13. Is the algorithm of the existing executables parallelizable?
    14. Can existing code operate on or be easily modified to support stdin & stdout? If not, can you wrap it?
    15. Identify decision points to split code into MapReduce style. http://www.recessframework.org/blog/category/PHP
    16. Minimize change.
    17. Test in local mode first: $ cat inputFile | mapper.sh | reducer.sh > outputFile (http://wiki.apache.org/hadoop/HadoopStreaming)
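
For the "can you wrap it?" question on slide 14, one possible shape of a wrapper is sketched below. It assumes a legacy tool that only accepts input and output file paths; `legacy_tool` is a hypothetical stand-in and `wrap_for_streaming` is an illustrative name, not something shown in the talk.

```python
import os
import tempfile


def legacy_tool(in_path, out_path):
    """Hypothetical stand-in for a legacy program that only accepts file paths."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(line.upper())


def wrap_for_streaming(tool, stdin, stdout):
    """Spool stdin to a temp file, run the file-based tool, then replay its output to stdout."""
    with tempfile.TemporaryDirectory() as workdir:
        in_path = os.path.join(workdir, "input.txt")
        out_path = os.path.join(workdir, "output.txt")
        with open(in_path, "w") as spool:
            for line in stdin:
                spool.write(line)
        tool(in_path, out_path)  # the legacy code itself is left untouched
        with open(out_path) as result:
            for line in result:
                stdout.write(line)
```

In a mapper script this would be called as `wrap_for_streaming(legacy_tool, sys.stdin, sys.stdout)`. The cost is an extra copy of each input split on local disk, which may or may not be acceptable depending on split size.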
    18. “Given a file with aligned sequencing reads and a list of genomic features, a common task is to count how many reads map to each feature.” http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
    19. Hadoop Streaming job: hadoop jar $HADOOP_STREAMING -input my_experiment.sam -output count --mapper 'mapper -q -s no - features.gtf' --reducer 'reducer.py' -file dist/mapper -file features.gtf -file reducer.py
        Original command: htseq-count -q -s no my_experiment.sam features.gtf
    20. Mapper (excerpt):
        # … do crazy parsing using python libs and stuff …
        try:
            # read SAM records from stdin instead of a file
            read_seq = iter( HTSeq.SAM_Reader( sys.stdin ) )
            first_read = next( read_seq )
            read_seq = itertools.chain( [ first_read ], read_seq )
            pe_mode = first_read.paired_end
            for r in read_seq:
                # do more algorithm and validation stuff …
                fs = set()
                for iv in iv_seq:
                    if iv.chrom not in features.chrom_vectors:
                        raise UnknownChrom
                    for iv2, fs2 in features[ iv ].steps():
                        fs = fs.union( fs2 )
                if fs is None or len( fs ) == 0:
                    empty += 1
                elif len( fs ) > 1:
                    ambiguous += 1
                else:
                    counts[ list(fs)[0] ] += 1
        # … (exception handling not shown on the slide) …
        # emit one "feature<TAB>count" line per feature for the reducer
        for fn in sorted( counts.keys() ):
            print( "%s\t%d" % ( fn, counts[fn] ) )
    21. Reducer:
        #!/usr/bin/env python
        import sys
        current_fn = None
        current_count = 0
        fn = None
        # input arrives sorted by feature name, one "feature<TAB>count" per line
        for line in sys.stdin:
            fn, count = line.split('\t', 1)
            count = int(count)
            if current_fn == fn:
                current_count += count
            else:
                # feature changed: emit the finished total, then start a new one
                if current_fn:
                    print('%s\t%s' % (current_fn, current_count))
                current_count = count
                current_fn = fn
        # emit the total for the last feature (skipped when there was no input)
        if current_fn is not None:
            print('%s\t%s' % (current_fn, current_count))
    22. Split python script into mapper and reducer. No change to command line args. Reused all dependent libraries. Run in MR mode or local mode: $ cat my_experiment.sam | mapper -q -s no - features.gtf | sort | python reducer.py
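
The split on slides 20 to 22 works because per-split counts can be summed afterwards. A minimal sketch of that property follows, assuming the read-to-feature assignment has already been done; `map_counts`, `reduce_counts` and the gene names are hypothetical, and nothing here reproduces HTSeq's own logic.

```python
from collections import Counter
from itertools import groupby


def map_counts(feature_hits):
    """Mapper side: count hits per feature within one split, emit 'feature<TAB>count' lines."""
    counts = Counter(feature_hits)
    return ["%s\t%d" % (fn, counts[fn]) for fn in sorted(counts)]


def reduce_counts(sorted_lines):
    """Reducer side: sum partial counts per feature; input must be sorted by feature."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for fn, group in groupby(parsed, key=lambda pair: pair[0]):
        yield "%s\t%d" % (fn, sum(int(count) for _, count in group))


# Hypothetical per-read feature assignments, split across two map tasks.
split_a = ["geneA", "geneB", "geneA"]
split_b = ["geneB", "geneC", "geneA"]

# Two mappers, a sort standing in for the shuffle, then one reducer.
partials = sorted(map_counts(split_a) + map_counts(split_b))
merged = list(reduce_counts(partials))

# The same data counted in a single pass, as the unsplit script would do.
single_pass = map_counts(split_a + split_b)
```

Comparing `merged` with `single_pass` on a small input is one cheap way to check that a split has not changed the result before running it on a cluster.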
    23. Analysis determines variants and quality scores. Legacy code written in Java and R, part of a larger pipeline. Embedded on an app server and responds to JMS.
    24. Parallelizable. Wrapped input to read from streamed files. Map operates on a plate of data at one time. Minimal change, only in how the data is passed.
    25. Reused existing Java and R code by simply writing a transformation class. 2 days of work during “Innovation Days” to modify for Hadoop by following this strategy. Value: 75k plates in 4 minutes on a 14-node cluster.
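
The transformation step described on slides 24 and 25 was written in Java; the idea can still be sketched in Python for consistency with the other examples. The record layout (plate id, a tab, then comma-separated values) and the `score_plate` function are assumptions made for illustration and do not come from the talk.

```python
import io


def score_plate(plate):
    """Hypothetical stand-in for the existing scoring algorithm; here it just averages."""
    return sum(plate["values"]) / len(plate["values"])


def parse_plate(line):
    """Build the object the existing algorithm expects from one streamed record.

    Assumed layout (not from the talk): plate_id <TAB> comma-separated values.
    """
    plate_id, raw = line.rstrip("\n").split("\t", 1)
    return {"id": plate_id, "values": [float(v) for v in raw.split(",")]}


def plate_mapper(stdin, stdout):
    """Map task: one plate per input line, scored independently of every other plate."""
    for line in stdin:
        if not line.strip():
            continue  # skip blank lines
        plate = parse_plate(line)
        stdout.write("%s\t%.2f\n" % (plate["id"], score_plate(plate)))


# Small in-memory run standing in for streamed input.
demo_out = io.StringIO()
plate_mapper(io.StringIO("P1\t1.0,2.0,3.0\nP2\t4.0\n"), demo_out)
```

Only `parse_plate` knows about the streamed format, so the scoring code is called exactly as it was before, which matches the "minimal change, only in how the data is passed" point.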
    26. “Crossbow is a scalable software pipeline for whole genome resequencing analysis. It combines Bowtie, an ultrafast and memory efficient short read aligner, and SoapSNP, an accurate genotyper. These tools are combined in an automatic, parallel pipeline…” http://bowtie-bio.sourceforge.net/crossbow/index.shtml
    27. “Hadoop also supports a 'streaming' mode of operation whereby the map and reduce functions are delegated to command-line scripts or compiled programs written in any language. … This allows Crossbow to reuse existing software for aligning reads and calling SNPs while automatically gaining the scaling benefits of Hadoop.” http://genomebiology.com/2009/10/11/R134 http://www.cbcb.umd.edu/~mschatz/Posters/Crossbow_WABI_Sept2009.pdf
    28. A Crossbow-specific output format was implemented that encodes an alignment as a tuple where the tuple's key identifies a reference partition and the value describes the alignment. http://genomebiology.com/2009/10/11/R134 http://bowtie-bio.sourceforge.net/crossbow/
    29. A new input format (option --12) was added, allowing Bowtie to recognize the one-read-per-line format produced by the Crossbow preprocessor. http://genomebiology.com/2009/10/11/R134
    30. The version of SOAPsnp used in Crossbow was modified to accept alignment records output by modified Bowtie ... None of the modifications made to SOAPsnp fundamentally affect how consensus bases or SNPs are called. http://genomebiology.com/2009/10/11/R134
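
Slide 28 describes keying each alignment by a reference partition. A rough sketch of that idea is below; it is not Crossbow's actual record format. The bin width, the field order and the helper names `partition_key` and `to_streaming_line` are all assumptions chosen for illustration.

```python
PARTITION_LEN = 1_000_000  # assumed bin width, chosen only for illustration


def partition_key(chrom, pos, partition_len=PARTITION_LEN):
    """Key an alignment by (chromosome, bin) so one reducer call sees one reference partition."""
    return (chrom, pos // partition_len)


def to_streaming_line(chrom, pos, alignment):
    """Emit 'chrom<TAB>bin<TAB>pos<TAB>alignment'; zero-padding keeps text sort order numeric."""
    chrom, part = partition_key(chrom, pos)
    return "%s\t%010d\t%010d\t%s" % (chrom, part, pos, alignment)


# Hypothetical alignments, emitted out of order by different mappers.
records = [("chr1", 2_500_000, "readC"), ("chr1", 10, "readA"), ("chr1", 999_999, "readB")]
lines = sorted(to_streaming_line(c, p, a) for c, p, a in records)
```

After the sort, all alignments for one partition are adjacent and ordered by position, so a downstream program can process each partition as a contiguous run of stdin lines.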
    31. Adoption of Hadoop for discovery. Rapid development times. Executables as stand-alone packages.
    32. Assess fit. It’s still MapReduce. Minimize change. Test in local mode. http://i.qkme.me/3pgc1j.jpg
