Successfully reported this slideshow.
Your SlideShare is downloading. ×

Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Upcoming SlideShare
Scaling HDFS at Xiaomi
Scaling HDFS at Xiaomi
Loading in …3
×

Check these out next

1 of 35 Ad

Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

Download to read offline

Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual gene or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.

Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual gene or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.

Advertisement
Advertisement

More Related Content

Slideshows for you (20)

Similar to Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark (20)

Advertisement

More from DataWorks Summit (20)

Advertisement

Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

  1. 1. Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark Zhong Wang, Ph.D. Group Lead, Genome Analysis 05/23/2019
  2. 2. 1999-2007 2008-now: JGI as the DOE sequencing center dedicated to plants and microbes. DOE JGI: A brief history
  3. 3. Our Mission 3 DOE JGI, Serving as a genomic user facility in support of the DOE missions: • Walnut Creek 1999-2019 • Berkeley, CA • 250 employees • $70M annual budget bioenergy, carbon cycling, & biogeochemistry
  4. 4. Our sequencer lineups Miseq NextSeq 500 Hiseq 2500 PacBio RSII Oxford Nanopore Short-read technologies Long-read technologies Novaseq 6000 PacBio Sequel MinION Promethion 200Tb sequencing data in FY18 Illumina
  5. 5. Genomics big data is not typical big data Unstructured Volume, variety veracity increases during analytics
  6. 6. Metagenome is the genome of a microbial community 10s "intimate kiss" = 80 million bacteria Metagenomics questions: Who are there? What they do? How they interact?
  7. 7. Microbial communities are “dark matters” Number of Species Cow ~6000 Human ~1000 Soil, >100000 >90% of the species haven’t been seen before
  8. 8. Metagenome sequencing and assembly Harvest microbes Extract DNA Shear, & Sequencing Assembly Short Reads Reconstructed genomes Microbial Community Metagenome DNA
  9. 9. The metagenome assembly problem Library of Books Shredded Library “reconstructed” Library Genome ~= Book Metagenome ~= Library Sequencing ~= sampling the pieces and read them
  10. 10. Scale is an enemy 1 10 100 1,000 10,000 100,000 1,000,000 Typical Human Cow Ocean Soil Gigabases (Gb)
  11. 11. Complexity is another… Remove contaminants, sequencing errors Overlap graph de bruijn graph Contigs or clusters Repetitive elements Homologous genes Horizontal transferred genes
  12. 12. The ideal solution and the failed ones  Easy to develop  Robust  Scale to big data  Efficient BigMem • Easy to develop • Expensive • Not scale MPI • Fast • Hard to develop • Not robust Hadoop • Easy to develop • Scale • Slow
  13. 13. Addressing big data: Apache Spark • New scalable programming paradigm • Compatible with Hadoop-supported storage systems • Improves efficiency through: • In-memory computing primitives • General computation graphs • Improves usability through: • Rich APIs in Java, Scala, Python • Interactive shell  Scale to big data  Efficient  Easy to develop  Robust
  14. 14. Goal: Metagenome read clustering Read clustering can reduce metagenome problem to single-genome problem • Parallel Processing • Individualized optimization Reads Read clusters
  15. 15. Algorithm 2 3 1 Node: Read Edge: number of kmers two reads share Kmer to reads is what word to sentence Read graph containing all reads Graph Partitioning: LPA Kmer-mapping reads Graph Construction and Edge Reduction Label Propagation Algorithm
  16. 16. Clustering performance on long reads Read length = 500-20,000
  17. 17. Short reads? Not so much Read length = 150
  18. 18. Can long reads come in rescue?
  19. 19. Hybrid clustering
  20. 20. A tradeoff between cost and performance 0 50 100 150 200 250 0% 20% 40% 60% 80% 100% mean cluster size (K) #reads (M) #clusters Percent of long reads used
  21. 21. Short-read only: there is still a way out
  22. 22. More samples, better results: one vs 50
  23. 23. More data, better results: clustering success is dependent on coverage
  24. 24. Can we scale to big data?
  25. 25. Hardware and software environments Customized EMR Bridge nodes 20 20 8 cores 8 (160) 8 (160) 28 (224) memory 64 (1280) 61 (1220) 128 (1024) Hadoop 2.7.3 2.7.3 2.7.2 Spark 2.1.1 2.2.0 2.1.0
  26. 26. A quick reminder… 2 3 1 Node: Read Edge: number of kmers two reads share Kmer to reads is what word to sentence Read graph containing all reads Graph Partitioning: LPA Kmer-mapping reads (KMR) Graph Construction and Edge Reduction (Edges) Label Propagation Algorithm (LPA)
  27. 27. Scale to bigger data volume on a 20-node cluster 0 200 400 600 800 20 40 60 80 100 ExecutionTime(mins) Data Size (GB) KMR Edges LPA Total
  28. 28. Increasing nodes on a 50G-dataset 0 100 200 300 400 500 25 50 75 100 ExecutionTime(mins) Number of nodes 50G KMR Edges LPA Total
  29. 29. Fine tune parallelism 0 50 100 150 200 250 300 350 1 2 3 4 5 6 7 8 ExecutionTIme(mins) Spark default parallelism (log10) 50G 20G
  30. 30. Dataset complexity vs performance 146.33 44.5 0 20 40 60 80 100 120 140 160 Human Iso-Seq Alzheimer(PacBio) Cow Rumen(Illumina) ExecutionTime(mins) KMR Edges LPA
  31. 31. Platform comparison: Clouds and HPC Customized EMR Bridge nodes 20 20 8 cores 8 (160) 8 (160) 28 (224) memory 64 (1280) 61 (1220) 128 (1024) Time (min) 106 105 126
  32. 32. Now we have a big hammer…
  33. 33. Clustering for identifying genome contaminants Russula 70Mb Bradyrhizobium 7.2Mb Collimonas: 5.3Mb
  34. 34. Targeting big metagenome projects Dr. Morgan-Kiss @ Miami University Dr. Slonczewski @Kenyon University Two lakes, 1.2Tbp
  35. 35. Acknowledgements Spark Team Lizhen Shi @FSU Xiandong Meng Kexue Li, LiliWang and Li Deng @Shanghai U Kurt Labutti Elizabeth Tseng @PacBio Lisa Gerhardt , Evan Racah @ NERSC Yong Qin, Gary Jung, Greg Kurtzer, Bernard Li, @ HPC Philip Blood, Bryon Gill @PSC

×