
Deep Dive into Spark 1.5 and Beyond



by Josh Rosen


  1. Deep Dive into Spark 1.5 and Beyond. Josh Rosen (@jshrsn), November 19, 2015
  2. About Me • Apache Spark committer and PMC member • Former UCB grad student • Software Engineer @ Databricks • Work on Spark Core, Spark SQL, and PySpark [Photo: Me at AMP Camp 1]
  3. About Databricks • Founded by the creators of Spark and remains the largest contributor • Offers a hosted service: Spark on EC2, notebooks, plot visualizations, cluster management, scheduled jobs • 14-day free trial: https://databricks.com • We're hiring!
  4. In this talk • A brief introduction to Spark SQL • Overview of Project Tungsten, a major performance optimization effort in Spark 1.5 and beyond • Brief tour of other major 1.5 features • A peek at Spark 1.6
  5. About Spark SQL • Part of the core distribution since Spark 1.0 (April 2014) • Graduated from alpha in 1.3 [Charts: # of commits per month and # of contributors]
  6. About Spark SQL • Part of the core distribution since Spark 1.0 (April 2014) • Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments (improved multi-version support in 1.4). Example: SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)
  7. About Spark SQL • Part of the core distribution since Spark 1.0 (April 2014) • Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments • Connect existing BI tools to Spark through JDBC
  8. About Spark SQL • Part of the core distribution since Spark 1.0 (April 2014) • Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments • Connect existing BI tools to Spark through JDBC • Bindings in Python, Scala, Java, and R
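As a hedged illustration of the Python binding (not code from the deck), the sketch below registers a Python function as a SQL UDF and runs a query in the style of the earlier hive_udf example. It assumes an existing SQLContext named sqlContext and a registered table hiveTable with a string column data.

```python
# Hedged sketch: running SQL from PySpark with a Python UDF standing in for
# hive_udf. Assumes an existing SQLContext `sqlContext` and a registered
# table "hiveTable" with a string column `data`.
from pyspark.sql.types import BooleanType

# Register a Python function so it can be called from SQL by name.
sqlContext.registerFunction(
    "my_udf", lambda s: s is not None and "spark" in s, BooleanType())

count = sqlContext.sql(
    "SELECT COUNT(*) FROM hiveTable WHERE my_udf(data)").collect()[0][0]
```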
  9. The not-so-secret truth: Spark SQL is about more than SQL.
  10. DataFrame (noun) [dey-tuh-freym] 1. A distributed collection of rows organized into named columns. 2. An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas). 3. Archaic: previously SchemaRDD (cf. Spark < 1.3).
  11. Write Less Code: Compute an Average. Using RDDs: data = sc.textFile(...).split("\t"); data.map(lambda x: (x[0], [int(x[1]), 1])).reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]).map(lambda x: [x[0], x[1][0] / x[1][1]]).collect(). Using DataFrames: sqlCtx.table("people").groupBy("name").agg("name", avg("age")).map(lambda …).collect(). Using SQL: SELECT name, avg(age) FROM people GROUP BY name. Full API docs: Python, Scala, Java, R.
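A hedged, runnable PySpark version of the slide's comparison is sketched below; the file path, and the assumption of an existing SparkContext sc and SQLContext sqlCtx with a registered people table, are illustrative rather than from the deck.

```python
# RDD version: track (sum, count) per name, then divide.
from pyspark.sql.functions import avg

data = sc.textFile("/tmp/people.txt").map(lambda line: line.split("\t"))
averages_rdd = (data
    .map(lambda x: (x[0], (int(x[1]), 1)))
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    .map(lambda x: (x[0], x[1][0] / x[1][1]))
    .collect())

# DataFrame version: declare the aggregation and let the optimizer plan it.
averages_df = (sqlCtx.table("people")
    .groupBy("name")
    .agg(avg("age"))
    .collect())
```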
  12. Write Less Code: Data Source API. Spark SQL's Data Source API can read and write DataFrames using a variety of formats: built-in sources (JSON, JDBC, ORC, plain text, and more) as well as external ones. Find more sources at http://spark-packages.org/
  13. Write Less Code: Input & Output. Unified interface to reading/writing data in a variety of formats: df = sqlContext.read.format("json").option("samplingRatio", "0.1").load("/home/michael/data.json"); df.write.format("parquet").mode("append").partitionBy("year").saveAsTable("fasterData")
  14. Write Less Code: Input & Output. Unified interface to reading/writing data in a variety of formats: df = sqlContext.read.format("json").option("samplingRatio", "0.1").load("/home/michael/data.json"); df.write.format("parquet").mode("append").partitionBy("year").saveAsTable("fasterData"). The read and write functions create new builders for doing I/O.
  15. Write Less Code: Input & Output. Unified interface to reading/writing data in a variety of formats: df = sqlContext.read.format("json").option("samplingRatio", "0.1").load("/home/michael/data.json"); df.write.format("parquet").mode("append").partitionBy("year").saveAsTable("fasterData"). Builder methods specify: format, partitioning, and handling of existing data.
  16. Write Less Code: Input & Output. Unified interface to reading/writing data in a variety of formats: df = sqlContext.read.format("json").option("samplingRatio", "0.1").load("/home/michael/data.json"); df.write.format("parquet").mode("append").partitionBy("year").saveAsTable("fasterData"). load(…), save(…) or saveAsTable(…) finish the I/O specification.
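For reference, here is a hedged, runnable form of the snippet repeated on slides 13-16, assuming a SQLContext sqlContext and a JSON file at the path shown on the slide.

```python
# Read JSON, inferring the schema from a 10% sample of the input.
df = (sqlContext.read
    .format("json")
    .option("samplingRatio", "0.1")
    .load("/home/michael/data.json"))

# Write the result as Parquet, appending to a table partitioned by year.
(df.write
    .format("parquet")
    .mode("append")
    .partitionBy("year")
    .saveAsTable("fasterData"))
```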
  17. Not Just Less Code, Faster Too! [Bar chart: time to aggregate 10 million int pairs (secs), comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, and DataFrame SQL]
  18. Plan Optimization & Execution. DataFrames and SQL share the same optimization/execution pipeline: a SQL AST or DataFrame becomes an unresolved logical plan; analysis against the catalog produces a logical plan; logical optimization yields an optimized logical plan; physical planning generates candidate physical plans; a cost model selects one physical plan; and code generation turns it into RDDs.
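A small, hedged way to see this pipeline from PySpark is DataFrame.explain(True), which prints the logical plans and the selected physical plan; the table name below is illustrative.

```python
# Hedged sketch: inspect the Catalyst plans behind a simple aggregation.
# Assumes an existing SQLContext `sqlContext` with a registered "people" table.
df = sqlContext.table("people").groupBy("name").agg({"age": "avg"})

# Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
df.explain(True)
```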
  19. Tungsten: Preparing Spark for the Next 5 Years. Substantially speed up execution by optimizing CPU efficiency, via: • Manual memory management and data structure layout • Runtime code generation • Optimized query operators
  20. Key areas of optimization: data representations, broadcasting and shuffling, code generation, joins, sorting, and aggregation. Inspired by traditional database systems.
  21. Optimized data representations • Java objects have two downsides: space overheads and garbage collection overheads • Tungsten sidesteps these problems by performing its own manual memory management.
  22. Java object-based row representation: 3 fields of type (int, string, string) with value (123, "data", "bricks") [Diagram: a GenericMutableRow pointing to an Array that holds BoxedInteger(123), String("data"), and String("bricks")]. 5+ objects; high space overhead; expensive hashCode().
  23. Tungsten's UnsafeRow format • Bit set for tracking null values • Every column appears in the fixed-length values region: small values are inlined; for variable-length values (strings), we store a relative offset into the variable-length data section • Equality comparison and hashing can be performed on raw bytes without requiring additional interpretation [Layout: null bit set (1 bit/field), values (8 bytes/field), variable-length data]
  24. Example of an UnsafeRow for (123, "data", "bricks") [Diagram: a null-tracking bitmap (0x0); a fixed-length region with fixed space per column holding the inline value 123 and offsets 32L and 48L into the variable-length region; and a variable-length region storing the field lengths 4 and 6 followed by the UTF8 string data "data" and "bricks"]
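To make the layout concrete, the sketch below packs the example row with Python's struct module. The exact encoding (a single 8-byte null bitmap, 8 bytes per field, and offset and length packed into one word per string, with no alignment padding) is a simplification assumed for illustration, not Tungsten's exact format.

```python
# Hedged sketch of an UnsafeRow-style layout for (123, "data", "bricks").
import struct

def pack_row(int_val, *strings):
    null_bitmap = struct.pack("<q", 0)            # no null fields in this row
    num_fields = 1 + len(strings)
    fixed_len = 8 + 8 * num_fields                # bitmap + one 8-byte word per field
    fixed, var_section = [struct.pack("<q", int_val)], b""
    offset = fixed_len                            # offsets are relative to row start
    for s in strings:
        data = s.encode("utf-8")
        # Variable-length field: store (offset << 32) | length in its fixed slot.
        fixed.append(struct.pack("<q", (offset << 32) | len(data)))
        var_section += data
        offset += len(data)
    return null_bitmap + b"".join(fixed) + var_section

row = pack_row(123, "data", "bricks")
# Rows can be compared and hashed as raw bytes, with no per-field objects.
assert row == pack_row(123, "data", "bricks")
```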
  25. java.util.HashMap [Diagram: an array of entries, each holding a key pointer, a value pointer, and a next pointer to separately allocated key and value objects] • Huge object overheads • Poor memory locality • Size estimation is hard
  26. Tungsten's BytesToBytesMap [Diagram: an array of (key hashcode, pointer) entries pointing into memory pages where each key and value are stored back to back] • Low space overheads • Good memory locality, especially for scans
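The sketch below is a rough, hedged illustration of the idea (not Spark's implementation): key and value bytes are appended into one contiguous page, and the map itself holds only small (hash, offset) entries, so lookups and scans touch tightly packed memory.

```python
# Hedged sketch of a BytesToBytesMap-style structure.
import struct

class BytesMap(object):
    def __init__(self):
        self.page = bytearray()   # backing "memory page" of raw key/value bytes
        self.entries = {}         # key hash -> offsets of records in the page

    def put(self, key, value):
        offset = len(self.page)
        # Record layout: [key length][value length][key bytes][value bytes]
        self.page += struct.pack("<ii", len(key), len(value)) + key + value
        self.entries.setdefault(hash(key), []).append(offset)

    def get(self, key):
        for offset in self.entries.get(hash(key), []):
            klen, vlen = struct.unpack_from("<ii", self.page, offset)
            start = offset + 8
            if bytes(self.page[start:start + klen]) == key:
                return bytes(self.page[start + klen:start + klen + vlen])
        return None

m = BytesMap()
m.put(b"spark", b"1.5")
assert m.get(b"spark") == b"1.5"
```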
  27. Code generation • Generic evaluation of expression logic is very expensive on the JVM: virtual function calls, branches based on expression type, object creation due to primitive boxing, and memory consumption by boxed primitive objects • Generating custom bytecode can eliminate these overheads [Chart: evaluating "SELECT a + a + a", query time in seconds: hand-written 9.33, code gen 9.36, interpreted projection 36.65]
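The toy sketch below shows the same idea without the JVM: rather than interpreting an expression tree for every row, compile the expression once into a specialized function and call it in the inner loop. The expression-tree encoding and names are illustrative assumptions.

```python
# Hedged sketch of expression code generation for "a + a + a".
expr = ("add", ("add", ("col", "a"), ("col", "a")), ("col", "a"))

def interpret(node, row):
    # Generic evaluator: type branches and recursion on every row (slow path).
    kind = node[0]
    if kind == "col":
        return row[node[1]]
    if kind == "add":
        return interpret(node[1], row) + interpret(node[2], row)
    raise ValueError(kind)

def compile_expr(node):
    # "Code generation": emit source for the expression and compile it once.
    def emit(n):
        if n[0] == "col":
            return "row[%r]" % n[1]
        return "(%s + %s)" % (emit(n[1]), emit(n[2]))
    return eval("lambda row: " + emit(node))

rows = [{"a": i} for i in range(5)]
fast = compile_expr(expr)
assert [interpret(expr, r) for r in rows] == [fast(r) for r in rows]
```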
  28. Rich Function Library (added in Spark 1.5) • 100+ native functions with optimized codegen implementations: string manipulation (concat, format_string, lower, lpad), date/time (current_timestamp, date_format, date_add, …), math (sqrt, randn, …), other (monotonicallyIncreasingId, sparkPartitionId, …). Python: from pyspark.sql.functions import *; yesterday = date_sub(current_date(), 1); df2 = df.filter(df.created_at > yesterday). Scala: import org.apache.spark.sql.functions._; val yesterday = date_sub(current_date(), 1); val df2 = df.filter(df("created_at") > yesterday)
  29. Optimized Aggregation (SPARK-8160) • Major refactoring of internal aggregate function interfaces, plus a new experimental UDAF interface • New aggregation operator based on our efficient Tungsten hashmap: automatically falls back to an external sort-based aggregate implementation when data is too large to fit in memory
  30. Optimized Aggregation [Diagram: incoming records probe and update the in-memory hashmap in place; when memory runs out, the in-memory data is sorted and spilled to disk; the sorted iterators are then merged, aggregating runs of equal keys. Special case: if there is no memory for the hashmap, sort-based aggregation is used directly]
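The sketch below mimics that flow in plain Python: aggregate into a dict until a (deliberately tiny) memory budget is reached, spill sorted runs, then merge the runs while combining consecutive equal keys. The budget and key/value shapes are illustrative assumptions.

```python
# Hedged sketch of hash aggregation with a sort-based spill fallback.
import heapq
from itertools import groupby

def aggregate_sums(records, max_in_memory=4):
    current, runs = {}, []
    for key, value in records:
        if key not in current and len(current) >= max_in_memory:
            # Hashmap is "full": sort its contents and spill them as one run.
            runs.append(sorted(current.items()))
            current = {}
        current[key] = current.get(key, 0) + value    # update in place
    runs.append(sorted(current.items()))

    # Merge the sorted runs and aggregate runs of equal keys.
    merged = heapq.merge(*runs)
    return [(k, sum(v for _, v in grp))
            for k, grp in groupby(merged, key=lambda kv: kv[0])]

data = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5), ("a", 10), ("b", 20)]
assert dict(aggregate_sums(data)) == {"a": 11, "b": 22, "c": 3, "d": 4, "e": 5}
```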
  31. Optimized Sorting (SPARK-7082) • AlphaSort-style prefix sort: store prefixes of sort keys inside the sort pointer array and, during the sort, compare prefixes to short-circuit and avoid full record comparisons • Use this to build an external sort which supports datasets larger than memory • Compression & serialization optimizations • Code-generated comparators [Diagram: the naïve layout stores only a pointer per record; the cache-friendly layout stores a (key prefix, pointer) pair per record]
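Below is a hedged Python sketch of the prefix trick: sort an array of (prefix, index) pairs and compare full keys only when two prefixes tie. The 8-byte prefix and the record layout are illustrative assumptions.

```python
# Hedged sketch of AlphaSort-style prefix sorting.
from functools import cmp_to_key

def prefix_of(key, n=8):
    # First n bytes of the sort key as an integer (short keys are zero-padded).
    return int.from_bytes(key[:n].ljust(n, b"\x00"), "big")

def prefix_sort(records):
    # records: list of (key_bytes, payload); the "pointer array" holds
    # (key prefix, record index) pairs instead of bare indices.
    pointers = [(prefix_of(k), i) for i, (k, _) in enumerate(records)]

    def compare(a, b):
        if a[0] != b[0]:                      # short-circuit on the prefix
            return -1 if a[0] < b[0] else 1
        ka, kb = records[a[1]][0], records[b[1]][0]
        return (ka > kb) - (ka < kb)          # full comparison only on ties

    pointers.sort(key=cmp_to_key(compare))
    return [records[i] for _, i in pointers]

rows = [(b"bricks", 2), (b"brick", 1), (b"data", 3), (b"aaaaaaaaz", 0)]
assert [payload for _, payload in prefix_sort(rows)] == [0, 1, 2, 3]
```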
  32. Optimized Sorting (SPARK-7082) [Diagram: incoming records go into in-memory buffers while (key prefix, pointer) entries go into the sort array; when memory fills, the in-memory data is sorted and spilled to disk; the sorted files are finally merged into a single sorted iterator]
  33. Performance results: aggregation query [Two charts: scale factor (# of groups) vs. runtime on a smaller and a larger dataset, comparing Spark 1.4 and Spark 1.5]. Performance is enhanced even for small queries, and scalability is greatly improved for large ones.
  34. Optimized Joins • Prefer (external) sort-merge join over hash join in shuffle joins, for left/right outer and inner joins (SPARK-7165): join data size is now bounded by disk rather than memory • Support for broadcast outer join (SPARK-4485)
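For intuition, here is a hedged sketch of an inner sort-merge join over two inputs already sorted by key: it buffers only the current run of matching keys instead of hashing an entire side, which is why memory use stays bounded.

```python
# Hedged sketch of an inner sort-merge join on key-sorted inputs.
from itertools import groupby

def sort_merge_join(left, right):
    # left, right: lists of (key, value) pairs sorted by key.
    lgroups = groupby(left, key=lambda kv: kv[0])
    rgroups = groupby(right, key=lambda kv: kv[0])
    out = []
    l, r = next(lgroups, None), next(rgroups, None)
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(lgroups, None)
        elif l[0] > r[0]:
            r = next(rgroups, None)
        else:
            # Matching keys: buffer just this run and emit the cross product.
            lvals = [v for _, v in l[1]]
            rvals = [v for _, v in r[1]]
            out.extend((l[0], lv, rv) for lv in lvals for rv in rvals)
            l, r = next(lgroups, None), next(rgroups, None)
    return out

left = [("a", 1), ("a", 2), ("b", 3)]
right = [("a", "x"), ("c", "y")]
assert sort_merge_join(left, right) == [("a", 1, "x"), ("a", 2, "x")]
```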
  35. Performance results: sort-merge join [Chart: scale factor vs. runtime, comparing Spark 1.4 and Spark 1.5]
  36. Optimized Broadcasting and Shuffling • The Tungsten binary row format already stores data in a serialized form, so there is no need to re-serialize when caching, broadcasting, or shuffling • Significant reduction in the number of objects provides large GC benefits for broadcasted tables
  37. New in 1.5: Exposing Execution Concepts • Metrics reported back for nodes of the physical execution tree [SPARK-8856] • Full visualization of the DataFrame execution tree (e.g. queries with broadcast joins) [SPARK-8862]
  38. Coming soon: Spark 1.6. Some key improvements and new features: • Adaptive query execution in Spark [SPARK-9850] • Unified memory management (by consolidating cache and execution memory) [SPARK-10000] • Improved session management in Spark SQL and DataFrames [SPARK-10810] • Type-safe API on top of Catalyst/DataFrame [SPARK-9999]
  39. Coming soon: Datasets (preview in Spark 1.6) • Type-safe: operate on domain objects with compiled lambda functions • Fast: code-generated encoders for fast serialization • Interoperable: easily convert DataFrame ↔ Dataset without boilerplate. Example: val df = ctx.read.json("people.json"); // Convert data to domain objects. case class Person(name: String, age: Int); val ds: Dataset[Person] = df.as[Person]; ds.filter(_.age > 30); // Compute a histogram of age by name. val hist = ds.groupBy(_.name).mapGroups { case (name, people: Iterator[Person]) => val buckets = new Array[Int](10); people.map(_.age).foreach { a => buckets(a / 10) += 1 }; (name, buckets) }
  40. Recap • Tungsten optimizations are enabled by default • Tungsten offers large performance and scalability improvements for SQL and DataFrames • Further improvements are planned in Spark 1.6 and beyond. Try Spark 1.5 in Databricks: https://databricks.com/registration
  41. Thank you!
