CloudBurst
• CloudBurst: Highly Sensitive Short Read Mapping with MapReduce
• New parallel read-mapping algorithm optimized for mapping next-generation sequencing (NGS) data to the human genome and other reference genomes
• Applications: SNP discovery, genotyping, and personal genomics
CloudBurst
• Modeled after the short read mapping program RMAP
• Reports either all alignments or the unambiguous best alignment for each read, with any number of mismatches or differences
• This level of sensitivity could be prohibitively time-consuming, but CloudBurst uses the open-source Hadoop implementation of MapReduce to parallelize execution across multiple compute nodes
CloudBurst
• Running time scales linearly with the number of reads mapped, with near-linear speedup as the number of processors increases
• CloudBurst reduces the running time from hours to mere minutes for typical jobs that map millions of short reads to the human genome
Algorithm Overview
• CloudBurst uses a seed-and-extend algorithm to map reads to a reference genome
• Seed: for a read of length r to align with at most k differences, the alignment must contain a region of length s = r/(k+1), called a seed, that exactly matches the reference
• Extend: CloudBurst attempts to extend each seed into an end-to-end alignment with at most k mismatches or differences
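The seed rule above can be illustrated with a minimal Python sketch. It assumes the seed length is rounded down to an integer and considers mismatches only (no insertions or deletions); the function names are hypothetical and are not taken from CloudBurst itself.

```python
def seed_length(read_len: int, k: int) -> int:
    """Seed length s = floor(r / (k + 1)) for a read of length r and at most k differences."""
    if read_len <= 0 or k < 0:
        raise ValueError("read_len must be positive and k non-negative")
    return read_len // (k + 1)


def non_overlapping_seeds(read: str, k: int) -> list[tuple[int, str]]:
    """Split a read into k + 1 non-overlapping seeds, returned as (offset, seed) pairs."""
    s = seed_length(len(read), k)
    return [(i * s, read[i * s:(i + 1) * s]) for i in range(k + 1)]


def has_exact_seed(read: str, ref_window: str, k: int) -> bool:
    """True if at least one seed matches the reference window at the same offset.

    Mismatch-only assumption: with at most k mismatches spread over k + 1
    disjoint seeds, at least one seed must be free of mismatches.
    """
    return any(
        ref_window[offset:offset + len(seed)] == seed
        for offset, seed in non_overlapping_seeds(read, k)
    )


if __name__ == "__main__":
    print(seed_length(36, 3))  # a 36 bp read with k = 3 needs a 9 bp exact seed
    print(non_overlapping_seeds("ACGTACGTACGT", 2))
    print(has_exact_seed("ACGTACGTACGT", "ACGTAAGTACGA", 2))
```

With two mismatches and three seeds in the last call, the first seed is still intact, which is why the exact-match lookup finds the candidate location.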
Algorithm Overview
• CloudBurst uses the Hadoop implementation of MapReduce to catalog and extend the seeds
• Map phase emits
  – all length-s k-mers from the reference sequences
  – all non-overlapping length-s k-mers from the reads
• Shuffle phase
  – read and reference k-mers are brought together
• Reduce phase
  – the seeds are extended into end-to-end alignments
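The three phases can be mimicked in a single process to show how the data flows. The following is a rough sketch, not CloudBurst's Hadoop code: it assumes all reads have the same length, counts mismatches only during extension, and uses hypothetical function names and tuple layouts.

```python
from collections import defaultdict


def map_reference(ref_id, seq, s):
    """Map: emit every (overlapping) length-s k-mer of a reference sequence."""
    for pos in range(len(seq) - s + 1):
        yield seq[pos:pos + s], ("ref", ref_id, pos)


def map_read(read_id, seq, s):
    """Map: emit only the non-overlapping length-s k-mers of a read."""
    for off in range(0, len(seq) - s + 1, s):
        yield seq[off:off + s], ("read", read_id, off)


def shuffle(pairs):
    """Shuffle: group all values that share the same k-mer key."""
    groups = defaultdict(list)
    for kmer, value in pairs:
        groups[kmer].append(value)
    return groups


def reduce_seed(values, refs, reads, k):
    """Reduce: extend each shared seed into an end-to-end alignment (mismatches only)."""
    ref_hits = [v for v in values if v[0] == "ref"]
    read_hits = [v for v in values if v[0] == "read"]
    for _, ref_id, ref_pos in ref_hits:
        for _, read_id, read_off in read_hits:
            read, ref = reads[read_id], refs[ref_id]
            start = ref_pos - read_off  # where the full read would begin
            if start < 0 or start + len(read) > len(ref):
                continue
            window = ref[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= k:
                yield read_id, ref_id, start, mismatches


def align(refs, reads, k):
    """Run map, shuffle, and reduce in memory; assumes equal-length reads."""
    read_len = len(next(iter(reads.values())))
    s = read_len // (k + 1)
    pairs = []
    for ref_id, seq in refs.items():
        pairs.extend(map_reference(ref_id, seq, s))
    for read_id, seq in reads.items():
        pairs.extend(map_read(read_id, seq, s))
    alignments = set()  # a set removes duplicates found through different seeds
    for values in shuffle(pairs).values():
        alignments.update(reduce_seed(values, refs, reads, k))
    return sorted(alignments)


if __name__ == "__main__":
    refs = {"chr": "TTGACCATGCAAGTCCGTAT"}
    reads = {"r1": "CCTTGCAAGTCC"}  # reference positions 4..15 with one mismatch
    print(align(refs, reads, k=1))
```

In a real Hadoop job the shuffle is performed by the framework and each reduce call sees one k-mer's values; the dictionary grouping here only stands in for that step.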
Demo
• See Getting Started.docx
Related Tools
• Bowtie: Ultrafast short read alignment
• SoapSNP: Accurate SNP/consensus calling
• Tophat: RNA-Seq splice junction mapper
• Cufflinks: Isoform assembly, quantitation
• Hadoop: Open Source MapReduce
• CloudBurst: Sensitive MapReduce alignment
• Crossbow: Read Mapping and SNP calling in the clouds
• Jnomics: Cloud-Scale Sequence Analysis
• Contrail: Cloud-based de novo assembly
• Myrna: Cloud-Scale differential expression of RNAseq
Figure 1: A MapReduce approach for detecting genetic variants from high-throughput genome sequencing.
Source: http://www.nature.com/nbt/journal/v30/n3/fig_tab/nbt.2134_F1.html