Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case Studies and Beyond: Spark Summit East talk by Lucy Lu and Eric Kaczmarek

Spark data processing is shifting from on-premises deployments to cloud services to take advantage of horizontal resource scalability, better data accessibility, and easier manageability. However, fully utilizing the compute power, fast storage, and networking that cloud services offer can be challenging without a deep understanding of workload characteristics and solid software optimization expertise. In this presentation, we use a Spark-based programming framework, Genome Analysis Toolkit version 4 (GATK4, under development), as an example to present a process for configuring and optimizing an efficient Spark cluster on Google Cloud to speed up genome data processing. We first introduce PAT, a data profiling framework developed in-house, and discuss how to use it to quickly establish the combination of VM and Spark configurations that best utilizes cloud hardware resources and Spark's computational parallelism. In addition, we use PAT and other profiling tools to identify and fix software hotspots in the application. We show a case study in which we identify a thread scalability issue with the Java instanceof operator. The fix, made in the Scala language, substantially improves the performance of GATK4 and other Spark-based workloads.

  1. ACCELERATING SPARK GENOME SEQUENCING IN CLOUD – A DATA DRIVEN APPROACH, CASE STUDIES AND BEYOND
     Yingqi (Lucy) Lu, Mulugeta Mammo, Eric Kaczmarek
     Intel Corporation
  2. Legal Disclaimer
     • Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
     • No computer system can be absolutely secure.
     • Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
     Intel, the Intel logo, Xeon, Xeon Phi, Lake Crest, etc. are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © 2017 Intel Corporation
  3. Spark Deployment Is Moving to Cloud
     (Diagram: on-premises to cloud.)
  4. Spark Deployment Is Moving to Cloud
     + Quick deployment
     + Elasticity
     + Manageability/maintenance
  5. Spark Deployment Is Moving to Cloud
     - Don’t expect similar performance
     - Limited perf counters available
     - Need to re-profile and retune your application
  6. Cloud vs. On-Premises
     “Do I need 10 instances with 2 cores per instance and network-attached storage, or a single instance with 20 cores and attached storage?”
  7. Cloud vs. On-Premises
     “Do I need 10 instances with 2 cores per instance and network-attached storage, or a single instance with 20 cores and attached storage?”
     It depends. The performance of your application in a cloud environment is directly affected by how you partition your resources.
  8. Compute vs. IO
     A Spark application run on three setups:
     • Setup #1 (36 cores, 9 storage disks): CPU cycles spent waiting on IO; computation wasted.
     • Setup #2 (12 cores, 9 storage disks): CPU fully utilized, IO underutilized; storage wasted.
     • Setup #3 (15 cores, 9 storage disks): CPU fully utilized, IO fully utilized; best ROI.
     Pay attention to the IO vs. core ratio.
  9. Partition Resources in the Cloud
     Starting from an on-premises baseline, profile:
     • Spark application and Java Virtual Machine
       – Hot functions
       – Lock contention
       – Java garbage collection
  10. Partition Resources in the Cloud
      Starting from an on-premises baseline, profile:
      • Spark application and Java Virtual Machine
        – Hot functions
        – Lock contention
        – Java garbage collection
      • System*
        – Processor
        – Network and storage
        – Memory
      * Be aware of which tools and counters are available; not everything will actually work in the cloud.
  11. Case Study – Genome Analysis Toolkit
      Structured programming framework designed to enable rapid development of efficient and robust analysis tools for next-generation DNA sequencers
      – Industry standard for analyzing/sequencing human genome data
      – Developed by the Broad Institute of MIT and Harvard
  12. Profile Application and Java VM
      Java Flight Recorder
      − Ships with Oracle JDK
      − Thread lock contention
      − Hot functions
      − Garbage collection
      (Screenshots: hot function, lock contention, garbage collection.)
  13. Lock Contention Example
      • A Spark application using SynchronizedMap showed heavy lock contention (more than 50% of time spent waiting on the lock).
      • Replacing SynchronizedMap with ConcurrentHashMap improved performance by 3.5x.
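The kind of swap described on the lock contention slide can be sketched as follows. This is a minimal illustration, not the code from the talk: the class name `MapContentionSketch`, the key names, and the thread and operation counts are all hypothetical. It runs the same counting workload against a map wrapped with `Collections.synchronizedMap` (one lock for every operation) and against a `ConcurrentHashMap` (finer-grained synchronization), and reports elapsed time for each. Timings vary by machine, so no particular speedup is implied.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MapContentionSketch {

    // Runs `threads` workers that each increment counters for a small,
    // hypothetical key set; returns the elapsed wall-clock time in nanoseconds.
    static long run(Map<String, Integer> counts, int threads, int opsPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < opsPerThread; i++) {
                    // merge() is atomic on both map types; only the locking differs.
                    counts.merge("key-" + (i % 64), 1, Integer::sum);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(5, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return System.nanoTime() - start;
    }

    // Sum of all counters; call only after run() has returned.
    static int total(Map<String, Integer> counts) {
        return counts.values().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        int threads = 8;            // assumption: adjust to the machine
        int opsPerThread = 500_000; // assumption: arbitrary workload size

        Map<String, Integer> locked = Collections.synchronizedMap(new HashMap<>());
        Map<String, Integer> concurrent = new ConcurrentHashMap<>();

        long lockedNanos = run(locked, threads, opsPerThread);
        long concurrentNanos = run(concurrent, threads, opsPerThread);

        System.out.printf("synchronizedMap:   %d ms, total=%d%n",
                lockedNanos / 1_000_000, total(locked));
        System.out.printf("ConcurrentHashMap: %d ms, total=%d%n",
                concurrentNanos / 1_000_000, total(concurrent));
    }
}
```

Both maps end with the same totals; what differs is how much time threads spend blocked, which is the kind of signal a lock-contention profile exposes.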
  14. Uncover a Scala Scalability Issue
      • The problem resides in Scala APIs and is caused by highly concurrent instanceof calls in the Java VM.
      • The problem is exacerbated as the number of threads inside the Java VM increases.
  15. Scala API Fix
      • Use polymorphism instead of instanceof!
      • 1.6x performance improvement in the critical stage and 1.3x across the entire workload.
      • Code changes released in Scala 2.12.0.
      • https://issues.scala-lang.org/browse/SI-9823
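The general idea of "polymorphism instead of instanceof" can be illustrated with a small Java sketch. This is not the actual Scala library change from SI-9823; the types `Element`, `Expensive`, `Plain`, and `Heavy` are hypothetical and exist only to show the pattern. The first method decides per element with an `instanceof` check against an interface; the second lets each class answer through a virtual method, so the hot loop contains no type test.

```java
public class PolymorphismSketch {

    // Hypothetical base type: every element can report its own cost.
    interface Element {
        default int cost() { return 1; }
    }

    // Hypothetical marker interface, used only by the instanceof variant.
    interface Expensive extends Element { }

    static final class Plain implements Element { }

    static final class Heavy implements Expensive {
        @Override
        public int cost() { return 10; }
    }

    // Before: a type test against an interface on every element.
    static int totalWithInstanceof(Element[] elements) {
        int sum = 0;
        for (Element e : elements) {
            sum += (e instanceof Expensive) ? 10 : 1;
        }
        return sum;
    }

    // After: virtual dispatch; each class supplies its own answer.
    static int totalWithDispatch(Element[] elements) {
        int sum = 0;
        for (Element e : elements) {
            sum += e.cost();
        }
        return sum;
    }

    public static void main(String[] args) {
        Element[] elements = { new Plain(), new Heavy(), new Plain() };
        System.out.println(totalWithInstanceof(elements));
        System.out.println(totalWithDispatch(elements));
    }
}
```

Both methods compute the same result; the second simply moves the decision into the classes themselves, which is the direction the slide recommends for code that many threads execute concurrently.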
  16. Beyond Scala and Spark
      • The scalability issue with instanceof impacts other Java applications.
        – Apache Cassandra: https://issues.apache.org/jira/browse/CASSANDRA-12787
        – A similar fix results in 61% better throughput and a 15% reduction in 99th-percentile latency.
  17. Garbage Collection Example
      • The hottest GC function is PSPromotionManager::copy_to_survivor_space.
      • Tuning the following parameters improved performance by 10%:
        -XX:SurvivorRatio
        -XX:InitialTenuringThreshold
        -XX:MaxTenuringThreshold
      (Diagram: an object moving between Eden, Survivor Space #1, Survivor Space #2, and the Old Generation.)
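When experimenting with the flags named on the garbage collection slide, it helps to confirm which of them the JVM actually received and how much time collection took. The sketch below is one way to do that with the standard `java.lang.management` API; the class name `GcTuningCheck` is hypothetical. The slide does not give the values used, so the `<n>` placeholders in the comment are deliberately left unfilled.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

// Launch with the flags under test, for example (values not given in the talk):
//   java -XX:SurvivorRatio=<n> -XX:InitialTenuringThreshold=<n> \
//        -XX:MaxTenuringThreshold=<n> GcTuningCheck
public class GcTuningCheck {

    // Returns the survivor/tenuring flags that were passed to this JVM.
    static List<String> tenuringFlags() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments().stream()
                .filter(arg -> arg.startsWith("-XX:SurvivorRatio")
                        || arg.contains("TenuringThreshold"))
                .collect(Collectors.toList());
    }

    // Sums accumulated collection time across all collectors, in milliseconds.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("Tenuring-related flags: " + tenuringFlags());
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.println("Total GC time (ms): " + totalGcMillis());
    }
}
```

Calling `totalGcMillis()` at the end of a representative run, once per candidate setting, gives a simple before/after comparison to set alongside the profiler output.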
  18. Profile System
      • The baseline shows up to 40% of CPU cycles spent waiting on IO.
      • With the same total number of cores, changing the core vs. storage ratio from 32:1 to 4:1 provides a 1.4x performance improvement.
      (Chart: relative throughput with 1 storage disk per VM. 6 VMs with 32 vCPUs/VM: 1.0; 48 VMs with 4 vCPUs/VM: 1.4.)
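The arithmetic behind the two layouts on this slide can be worked through in a few lines. The sketch below uses a hypothetical `Layout` helper; the VM counts, vCPUs per VM, and one disk per VM come from the slide. Both layouts total 192 vCPUs, but the first has 6 disks (32 cores per disk) and the second has 48 disks (4 cores per disk).

```java
public class CoreToDiskRatio {

    // Hypothetical description of one cluster layout.
    static final class Layout {
        final int vms;
        final int vcpusPerVm;
        final int disksPerVm;

        Layout(int vms, int vcpusPerVm, int disksPerVm) {
            this.vms = vms;
            this.vcpusPerVm = vcpusPerVm;
            this.disksPerVm = disksPerVm;
        }

        int totalVcpus() { return vms * vcpusPerVm; }

        int totalDisks() { return vms * disksPerVm; }

        // Cores competing for each storage disk.
        double coresPerDisk() { return (double) totalVcpus() / totalDisks(); }
    }

    public static void main(String[] args) {
        Layout fewLargeVms = new Layout(6, 32, 1);  // baseline from the slide
        Layout manySmallVms = new Layout(48, 4, 1); // repartitioned layout from the slide

        for (Layout layout : new Layout[] { fewLargeVms, manySmallVms }) {
            System.out.printf("%d VMs x %d vCPUs: %d vCPUs, %d disks, %.0f cores per disk%n",
                    layout.vms, layout.vcpusPerVm, layout.totalVcpus(),
                    layout.totalDisks(), layout.coresPerDisk());
        }
    }
}
```

Holding total compute constant while lowering cores per disk is what isolates the IO effect: the 1.4x throughput gain reported on the slide comes from the added storage bandwidth, not from extra cores.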
  19. Summary
      • Spark deployment is moving from on-premises to cloud.
      • A cloud environment provides elastic deployment but also brings the challenge of repartitioning resources.
      • Profiling applications and understanding their behavior leads to good performance improvements.
  20. Acknowledgments
      Agata Gruza, Intel Corporation
      Olasoji Denloye, Intel Corporation
  21. Thank You.
      Yingqi (Lucy) Lu: Yingqi.Lu@intel.com
      Mulugeta Mammo: Mulugeta.Mammo@intel.com
      Eric Kaczmarek: Eric.Kaczmarek@intel.com
