Example Biotech Use Cases for Hadoop
Adam Muise, Systems Engineer, Cloudera
THUG – Toronto Hadoop User Group
September 18, 2012
This evening…
• I will discuss a Hadoop DNA sequencing use case actually implemented at a large biotech firm
• I will take questions from the biologists who are new to Hadoop and review the architecture
• We will walk through the sequencing workflow
• We will encourage an informal discussion of other ongoing biotech use cases that we can apply Hadoop to
• We will have an open discussion about other common bioinformatics toolsets and their compatibility with Hadoop
• You will tell me if we should continue to have a biotech-themed meetup on a quarterly basis
Use Case: Drug Development using NGS
• Implemented at a very large biotech firm, which we will call “N”
• NGS = Next Generation Sequencing and all of the scientific process that accompanies it
• Workflow previously ran on traditional HPC
Challenges at “N”
• NGS produces a great deal of DNA sequence data to process
• Prior to using Hadoop, the traditional process was a tangle of manually distributed R and Perl mixed with a traditional HPC cluster (MPI based, 2200 processing cores)
• The workflow we will focus on took 6 weeks for a full data set
• E.g. 300 R files submitted at the same time hung the system; processing this benchmark took several hours because the parallelism had to be reduced
Enter the Dragon
• The workflow was analyzed by Cloudera and implemented in Hadoop
• The resulting workflow went from 6 weeks to 2 days or less on a 23-node cluster (3 masters, 20 workers)
• The primary gains were the massive parallelism that Hadoop provides without requiring message passing, and the data locality built into MapReduce/HDFS
• 300 R files processed in 66 seconds
Symbolic workflow
• Ingest Data
 – Typically FASTA, FASTQ, QSEQ, or tab-delimited formats
• Pre-processing Steps
 – Remove bad reads
 – Count the bases (and other stats)
 – Remove adapters and barcodes (added by sequencers)
• Alignment
 – Align reads to a reference genome
 – Lots of replicated reads to process
• Post-processing Steps
 – Meaningful scientific analysis starts here
 – SNP
 – Variance analysis
 – Assess quality of sample
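The pre-processing steps in this workflow (remove bad reads, count the bases) can be illustrated with a minimal Python sketch. It is not the code used at “N”. It assumes 4-line FASTQ records with Phred+33 quality encoding, and it treats a read as "bad" when its mean quality falls below a hypothetical threshold of 20; the function names are made up for this example.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

Read = Tuple[str, str, str]  # (header, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    it = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times consumes one record per step;
    # an incomplete trailing record is silently dropped.
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header, seq, qual


def mean_quality(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a read (assumes Phred+33 encoding)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def preprocess(lines: Iterable[str],
               min_mean_q: float = 20.0) -> Tuple[List[Read], Counter]:
    """Drop low-quality reads and count the bases of the reads that remain.

    min_mean_q is a hypothetical threshold chosen for illustration.
    """
    kept: List[Read] = []
    base_counts: Counter = Counter()
    for header, seq, qual in parse_fastq(lines):
        if mean_quality(qual) < min_mean_q:
            continue  # "remove bad reads"
        base_counts.update(seq)  # "count the bases"
        kept.append((header, seq, qual))
    return kept, base_counts
```

Calling `preprocess(open("sample.fastq"))` on a hypothetical file would return the surviving reads and a per-base tally. Because each read is handled independently, the same logic splits naturally across many mappers.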
Resulting Workflow: Pre-processing Jobs
• Ingest into Hadoop cluster (HDFS put)
• Remove adapters (markers added by sequencers) if present
• Segregate data based on barcodes for small RNA (again, an artifact of sequencing and useful for sorting)
• Plug-in for any other tools that can be used with Hadoop Streaming (e.g. stdin and stdout to a Perl script or C program)
• Sequences converted to BAM (Binary Alignment/Map format)
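To make the Hadoop Streaming contract concrete (lines in on stdin, tab-separated key/value lines out on stdout), here is a rough sketch of a mapper that removes an adapter and keys each read by its barcode. It is an illustration only, not the job that ran at “N”. The input layout (`read_id<TAB>sequence`), the adapter sequence, and the 4-base barcode at the start of the read are all assumptions made for this example.

```python
import sys

ADAPTER = "TGGAATTCTCGG"  # hypothetical 3' adapter sequence
BARCODE_LEN = 4           # hypothetical barcode length at the start of the read


def trim_adapter(seq, adapter=ADAPTER):
    """Cut the read at the first occurrence of the adapter, if present."""
    idx = seq.find(adapter)
    return seq if idx == -1 else seq[:idx]


def map_line(line):
    """Turn 'read_id<TAB>sequence' into 'barcode<TAB>read_id<TAB>insert'.

    Returns None for malformed lines or reads with nothing left after trimming.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 2:
        return None
    read_id, seq = parts[0], parts[1]
    if len(seq) <= BARCODE_LEN:
        return None
    barcode = seq[:BARCODE_LEN]
    insert = trim_adapter(seq[BARCODE_LEN:])
    if not insert:
        return None
    # Streaming treats the text before the first tab as the key, so reads
    # sharing a barcode are grouped together for the reduce phase.
    return f"{barcode}\t{read_id}\t{insert}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Mapper loop: one input line in, at most one output line out."""
    for line in stdin:
        out = map_line(line)
        if out is not None:
            stdout.write(out + "\n")
```

Saved as a script with a final `main()` call, this could be handed to Hadoop Streaming as the mapper. A Perl script or C program would plug in the same way, since the only contract is stdin and stdout.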
Resulting Workflow: Alignment
• Used TopHat 2.0.4:
 – http://tophat.cbcb.umd.edu/
 – Implemented in Hadoop as a MapReduce job
 – Input/output file format is BAM
 – TopHat uses the Bowtie aligner
• This step maps the sequences to an established reference genome and separates the exons from the introns
• Splice junctions are identified via the reads that were not identified as exons
• This step is the bulk of the work
Resulting Workflow: Post-processing
• Various post-processing tools can be implemented depending on the specific workflow variation
• Frequently Cufflinks was used:
 – http://cufflinks.cbcb.umd.edu/index.html
• These tools were implemented in Hadoop primarily with Streaming; this allows Hadoop to be a very flexible, generalized parallel execution engine without imposing the complexity of traditional grid computing
Another Use Case: “M” Challenges
• Immense amount of critical genomic data. "Vanilla-brand RDBMS" unable to capture/process all raw logs.
• Governance: "Vanilla-brand RDBMS" is used for compliance today. Data is captured on tape, which is not easily accessed and carries many hidden costs. Accessing and analyzing data with "Vanilla-brand RDBMS" is very slow.
• Genome analysis and integration is highly manual
• Existing platforms don’t scale to expected volumes
 – 100s to 1000s of reference genomes
 – 10,000 prokaryotic genomes/year by 2014
 – 100,000 resequenced lines by 2015
• Next-generation sequence analysis workflows are manual today, which is not sustainable for forecasted growth
• By 2015, the "M" sequencing lab is expected to sequence more than two quadrillion nucleotides/year
“M” Challenges
Business Challenges before Cloudera Enterprise
• Ingesting data for genotyping and DNA sequencing is hitting scaling limitations:
 – new sequencing devices are generating more data
 – need to increase frequency and number of products being sequenced
 – need to bring openly available sequence datasets in-house
• DNA data has multiple dimensions; processing it in a relational database with Java-based applications runs into:
 – not meeting performance needs, not scaling to meet SLA, not cost-effective
 – infrastructure complexity, difficulty scaling horizontally
 – storing and accessing sparse, unstructured datasets
“M” Existing Infrastructure
• Systems which support "M"s R&D pipeline are constantly changing, evolving and becoming increasingly complex as the science and its corresponding processes evolve. The initial project phase is the adoption of complex computational analytics that are run on growing data sets. Additionally, operational improvements are found by streamlining and automating these analytical computations, which were previously run offline by a select few users.
• RDBMS: "Vanilla-brand RDBMS" (*SGE) for data capture, governance and compliance protocols
• Performance bottlenecks and costly investment, because the size of the data that the analysis can be run against is limited to the size of the hardware that the database engine is running on. Costly additional software/hardware licenses are required.
• HPC: Illumina for sequencing, genotyping and gene expression. "M" is not able to perform complex analytics on large datasets in parallel within a short period of time. Scientists run manual workloads to clean the data before they can access it.
• Java-based applications run on a *Sun Grid Engine managed compute farm. Java applications are the easiest and simplest method of performing computational analysis. However, Java-based analyses have several limiting factors. First, to perform any analysis, time must be spent retrieving data from a repository, which in most cases is an RDBMS. Second, the size of the data that the analysis can be run against at any one time is limited to the max heap size of the Java Virtual Machine and, in some respects, the size of memory on the machine. Finally, the amount of parallelization that can be performed at any one time is limited to the size of the hardware of the machine running the application.
“M” Implementation
Use Case
• Phase 1: ‘Sequence Search and Retrieval’
"M" uses CDH to capture and store Apache logs from internal (not public facing) R&D workflow applications. Today they mine 300TB of data on their CDH cluster, searching for trends and performance metrics for coordination of scientific workflow, including raw genetic sequencing. "M" has developed complex (proprietary) computational algorithms running on CDH. Phase one includes slow migration of data off tape storage to CDH for full governance and compliance.
• Phase 2: Genetic Data - One Repository
The R&D Biotech (seed breeding) group owns one cluster in production. They are building a scalable data architecture for next generation storage and analysis of genetic data, at scale, with real-time access. Analysis results will include specific gene detection.
“M” Implementation
Cloudera Enterprise Implementation
• Cloudera Manager 3.5 with CDH3u3 with HBase
• Sequence data dumped to NFS for cleansing and filtering (phase 2: move to HDFS/MR)
• Ingest cleansed data into HDFS for cost-effective storage and processing
• MR on ingested data for insert into HBase
• Real-time full/range scan on HBase depending on user queries for analysis (compare/contrast, patterns)
• For fingerprint data: Sqoop in/out of "Vanilla-brand RDBMS" for further processing (SimMatrix)
Results
• Store & process large amounts of data, in near real time, cost effectively
• Analyse (compare/contrast, patterns) against larger number of products very efficiently
• Run more queries against much larger data set more frequently. This was a major challenge pre Cloudera.
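The range scans mentioned for HBase rely on rows being stored in sorted row-key order, so a query for a contiguous key range touches only the rows it needs. The sketch below models that idea in plain Python with a sorted list; it does not use the HBase client API, and the row-key layout (`sample_id|zero-padded position`) is a hypothetical design, not the schema "M" used.

```python
import bisect


def row_key(sample_id: str, position: int) -> str:
    """Hypothetical composite key; zero-padding keeps numeric order
    consistent with lexicographic (byte-wise) key order."""
    return f"{sample_id}|{position:010d}"


class SortedStore:
    """Toy stand-in for a table whose rows are kept sorted by key."""

    def __init__(self):
        self._keys = []    # row keys, always sorted
        self._values = {}  # row key -> stored value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start, stop):
        """Return rows with start <= key < stop (start inclusive,
        stop exclusive, matching HBase scan semantics)."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = SortedStore()
store.put(row_key("lineA", 2000), "ACGT")
store.put(row_key("lineA", 5), "TTGA")
store.put(row_key("lineB", 7), "GGCC")
store.put(row_key("lineA", 100), "CATG")

# Range scan: all stored fragments of lineA below position 1000.
hits = store.scan(row_key("lineA", 0), row_key("lineA", 1000))
```

A full scan is the same call with the widest possible bounds. How well this works in practice depends on choosing a row key whose sort order matches the most common queries.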