
JVM and OS Tuning for accelerating Spark application



  1. © 2015 IBM Corporation. Optimizing Spark Applications with JVM- and OS-level Tuning. Feb. 8, 2016. Tatsuhiro Chiba (chiba@jp.ibm.com), IBM Research - Tokyo
  2. © 2016 IBM Corporation. Hadoop / Spark Conference Japan 2016. Performance Innovation Laboratory, IBM Research - Tokyo.
     Who am I?
     – Tatsuhiro Chiba (千葉 立寛), Staff Researcher at IBM Research – Tokyo
     – Research interests: parallel distributed systems and middleware; parallel distributed programming languages; high performance computing
     – Twitter: @tatsuhiro
     – Today's contents also appear in:
       – Appendix D of "Sparkによる実践データ解析" (O'Reilly Japan)
       – "Workload Characterization and Optimization of TPC-H Queries on Apache Spark", IBM Research Reports
  3. Summary – after applying JVM and OS tuning
     Machine spec: CPU: POWER8 3.3GHz (2 sockets x 12 cores), Memory: 1TB, Disk: 1TB, OS: Ubuntu 14.10 (kernel 3.16.0-31-generic)
     Optimized JVM options: -Xmx24g -Xms24g -Xmn12g -Xgcthreads12 -Xtrace:none -Xnoloa -XlockReservation -Xgcthreads6 -Xnocompactgc -Xdisableexplicitgc -XX:-RuntimeInstrumentation -Xlp
     Executor JVMs: 4
     OS settings: NUMA-aware affinity enabled, large pages enabled
     Spark version: 1.4.1
     JVM version: Java 1.8.0 (IBM J9 VM, build pxl6480sr2-20151023_01 (SR2))
     [Chart: execution time (sec.) of Q1, Q3, Q5, Q9, and kmeans, original vs. optimized, with speedup (%) of up to ~50%]
  4. Benchmark 1 – Kmeans
     – Kmeans application: varied the clustering number 'K' over the same dataset; the first Kmeans job takes much longer because of data loading into memory
     – Synthetic data generator: used BigDataBench (published at http://prof.ict.ac.cn/) to generate a 6GB dataset with over 65M data points

         // input data is cached
         val data = sc.textFile("file:///tmp/kmeans-data", 2)
         val parsedData = data.map(s =>
           Vectors.dense(s.split(' ').map(_.toDouble))).persist()

         // run Kmeans with a varying number of clusters
         var bestK = (Double.MaxValue, 1)   // (lowest cost so far, its K)
         for (k <- 2 to 11) {
           val clusters = new KMeans()
             .setK(k).setMaxIterations(5)
             .setRuns(1).setInitializationMode("random")
             .setEpsilon(1e-30).run(parsedData)
           // evaluate
           val error = clusters.computeCost(parsedData)
           if (bestK._1 > error) { bestK = (error, k) }
         }
  5. Benchmark 2 – TPC-H
     – TPC-H benchmark on Spark SQL: TPC-H is often used for SQL-on-Hadoop systems; Spark SQL can run HiveQL directly through hiveserver2 (thrift server) and beeline (JDBC client); we modified the TPC-H queries published at https://github.com/rxin/TPC-H-Hive
     – Table data generator: used the DBGEN program to generate a 100GB dataset (scale factor = 100); loaded the data into Hive tables in Parquet format with Snappy compression

     TPC-H Q1 (Hive):

         select l_returnflag, l_linestatus,
                sum(l_quantity) as sum_qty,
                sum(l_extendedprice) as sum_base_price,
                sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
                sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
                avg(l_quantity) as avg_qty,
                avg(l_extendedprice) as avg_price,
                avg(l_discount) as avg_disc,
                count(*) as count_order
         from lineitem
         where l_shipdate <= '1998-09-01'
         group by l_returnflag, l_linestatus
         order by l_returnflag, l_linestatus;
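The thrift-server-plus-beeline path above can be sketched as shell commands. The script below only assembles and prints the command lines rather than executing them: the SPARK_HOME path, the master URL, and the query file name q1.hive are illustrative assumptions, while start-thriftserver.sh and beeline's -u/-f flags are the standard Spark SQL CLI interface.

```shell
#!/bin/sh
# Sketch: running a TPC-H HiveQL query through Spark SQL's thrift server.
# SPARK_HOME, the master URL, and q1.hive are assumptions; the commands are
# printed, not executed, so this is safe to run anywhere.
SPARK_HOME=/opt/spark                  # assumed install location
START="$SPARK_HOME/sbin/start-thriftserver.sh --master local[48]"
RUN="$SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000 -f q1.hive"
echo "$START"
echo "$RUN"
```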
  6. Machine & Software Spec and Spark Settings

     Processor                | Cores                     | SMT                      | Memory | OS
     POWER8 3.30 GHz x 2      | 24 (2 sockets x 12 cores) | 8 (192 hardware threads) | 1TB    | Ubuntu 14.10 (kernel 3.16.0-31)
     Xeon E5-2699 v3 2.30 GHz | 36 (2 sockets x 18 cores) | 2 (72 hardware threads)  | 755GB  | Ubuntu 15.04 (kernel 3.19.0-26)

     Software      | Version
     Spark         | 1.4.1, 1.5.2, 1.6.0
     Hadoop (HDFS) | 2.6.0
     Java          | 1.8.0 (IBM J9 VM SR2)
     Scala         | 2.10.4

     Default Spark settings:
     – # of Executor JVMs: 1
     – # of worker threads: 48
     – Total heap size: 192GB (nursery = 48g, tenure = 144g)
  7. JVM Tuning – Heap Space Sizing
     – Garbage collection tuning points: GC algorithms, GC threads, heap sizing
     – Heap sizing is the simplest way to reduce GC overhead: a bigger young space alone achieved over 30% improvement
     – But a small old space may cause many global GCs, since cached RDDs stay in the Java heap

     Young space (-Xmn) | Execution time | GC ratio | Avg. minor GC pause | # Minor GC | # Major GC
     48g (default)      | 400 s          | 20 %     | 2.1 s               | 39         | 1
     96g                | 306 s          | 18 %     | 3.4 s               | 22         | 1
     144g               | 300 s          | 14 %     | 3.6 s               | 14         | 0

     [Chart: execution time (sec.) of Kmeans and TPC-H Q9 for -Xmn48g, -Xmn96g, -Xmn144g]
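The heap-sizing sweep above amounts to three J9 flag strings over a fixed 192GB heap. A minimal sketch that builds them, assuming the totals from the table (-Xmx/-Xms/-Xmn are the real JVM options from the slide; the loop and variable names are mine):

```shell
#!/bin/sh
# Build the heap flags for the three young-space settings from the table:
# fixed 192 GB total heap, -Xmn of 48/96/144 GB.
HEAP_GB=192
for YOUNG_GB in 48 96 144; do
  LINE="-Xmx${HEAP_GB}g -Xms${HEAP_GB}g -Xmn${YOUNG_GB}g"
  echo "$LINE"
done
```

In a Spark deployment such a string would typically be passed to executors via spark.executor.extraJavaOptions.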
  8. JVM Tuning – Other Options
     – JVM option tuning points: monitor thread tuning, GC tuning, Java thread tuning, JIT tuning, etc.
     – Result: proper JVM options improved application performance by over 20%

     #                   | JVM options
     Option 0 (baseline) | -Xmn96g -Xdump:heap:none -Xdump:system:none -XX:+RuntimeInstrumentation -agentpath:/path/to/libjvmti_oprofile.so -verbose:gc -Xverbosegclog:/tmp/gc.log -Xjit:verbose={compileStart,compileEnd},vlog=/tmp/jit.log
     Option 1 (Monitor)  | Option 0 + "-Xtrace:none"
     Option 2 (GC)       | Option 1 + "-Xgcthreads48 -Xnoloa -Xnocompactgc -Xdisableexplicitgc"
     Option 3 (Thread)   | Option 2 + "-XlockReservation"
     Option 4 (JIT)      | Option 3 + "-XX:-RuntimeInstrumentation"

     [Chart: execution time (sec.) of Q1 and Q5 for options 0–4, with speedup (%) of up to ~20%]
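The option sets in the table are cumulative. The sketch below just concatenates them into the final option-4 flag string; the flag names are taken from the slide, but I drop the baseline's measurement-only flags (the oprofile agent and the verbose GC/JIT logging) for readability, so this is not the author's exact baseline.

```shell
#!/bin/sh
# Cumulative IBM J9 option sets from the table (option 0 -> option 4),
# omitting the profiling/logging flags of the baseline for clarity.
BASE="-Xmn96g -Xdump:heap:none -Xdump:system:none"                    # option 0 (trimmed)
OPT1="$BASE -Xtrace:none"                                             # + monitor threads
OPT2="$OPT1 -Xgcthreads48 -Xnoloa -Xnocompactgc -Xdisableexplicitgc"  # + GC
OPT3="$OPT2 -XlockReservation"                                        # + Java threads
OPT4="$OPT3 -XX:-RuntimeInstrumentation"                              # + JIT / RI
echo "$OPT4"
```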
  9. JVM Tuning – JVM Counts
     – Experiment: kept the number of worker threads and total heap constant while changing the number of Executor JVMs
       – 1 JVM: 48 worker threads & 192GB heap
       – 2 JVMs: 24 worker threads & 96GB heap each
       – 4 JVMs: 12 worker threads & 48GB heap each
     – Result: a single big Executor JVM is not always best; dividing into smaller JVMs helps to reduce GC overhead and resource contention
     – Kmeans case: the performance gap comes from the first Kmeans job, especially from data loading; once the RDD is loaded in memory, computation performance is similar

     [Charts: execution time (sec.) of Q1, Q3, Q5, Q9, Kmeans for 1/2/4 JVMs with improvement (%); execution time of Kmeans clustering job iterations (K = 2, 3, .. 11) for 1/2/4 JVMs]
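The per-JVM arithmetic behind this experiment, fixed totals divided evenly across executors, can be checked with a small script (the output format is mine; the 48-thread and 192GB totals are from the slide):

```shell
#!/bin/sh
# Split the fixed totals (48 worker threads, 192 GB heap) across N executor JVMs.
TOTAL_THREADS=48
TOTAL_HEAP_GB=192
for N in 1 2 4; do
  THREADS=$((TOTAL_THREADS / N))
  HEAP=$((TOTAL_HEAP_GB / N))
  echo "${N} JVM(s): ${THREADS} worker threads, ${HEAP}g heap each"
done
```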
  10. OS Tuning – NUMA Aware Process Affinity
     – Setting NUMA-aware process affinity for each Executor JVM speeds things up by reducing scheduling overhead, cache misses, and stall cycles
     – Result: achieved 3–14% improvement across all benchmarks without any bad effects

     numactl -c [0-7],[8-15],[16-23],[24-31],[32-39],[40-47] <Spark Executor JVMs>

     [Diagram: 2 sockets / 4 NUMA nodes (NUMA0–NUMA3), each with local DRAM and one 12-thread Executor JVM (JVM 0–3) bound to it]

     [Chart: execution time (sec.) of Q1, Q5, Q9, Kmeans with NUMA off vs. NUMA on, with speedup (%)]
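One way to realize the binding in the diagram is to wrap each executor launch in numactl. The generator below only prints the wrappers: --cpunodebind and --membind are standard numactl flags, but binding one executor per node and the placeholder launch command are my reading of the slide, which itself shows numactl -c with explicit CPU ranges.

```shell
#!/bin/sh
# Generate one numactl wrapper per NUMA node, matching the 4-node diagram
# (one 12-thread executor JVM per node). Commands are printed, not executed.
for NODE in 0 1 2 3; do
  PREFIX="numactl --cpunodebind=${NODE} --membind=${NODE}"
  echo "executor ${NODE}: ${PREFIX} <spark executor launch command>"
done
```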
  11. OS Tuning – Large Pages
     – How to use large pages: reserve large pages on Linux by changing a kernel parameter, then append "-Xlp" to the Executor JVM options
     – Result: achieved 3–5% improvement

     [Chart: Kmeans execution time (sec.) for page size 64K vs. 16M, with NUMA off and NUMA on]
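The kernel-parameter step can look like the fragment below. This is a sketch, not the author's exact setup: vm.nr_hugepages and -Xlp are real knobs, but the page count (sized for one executor's -Xmx24g heap at the 16MB large-page size shown in the chart) is an illustrative assumption.

```shell
# /etc/sysctl.conf -- reserve large pages
# (assumed sizing: 1536 * 16 MB = 24 GB, enough for a -Xmx24g executor heap
#  at POWER8's 16 MB large-page size)
vm.nr_hugepages=1536

# apply without reboot:
#   sysctl -p
# then enable large pages in each executor JVM by appending -Xlp, e.g. via
# spark.executor.extraJavaOptions in spark-defaults.conf
```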
  12. Comparison of Default and Optimized Settings on Spark 1.4.1, 1.5.2, and 1.6.0
     – Newer versions generally achieve better performance
     – JVM & OS tuning is still helpful for improving Spark performance
     – Tungsten & other new features (e.g. Unified Memory Management) can reduce GC overhead drastically

     [Charts: execution time (sec.) of Q1, Q3, Q5 and of Q9, Q19, Q21 on Spark 1.4.1/1.5.2/1.6.0, default vs. optimized; two off-scale default bars are 711 s and 632 s]
