Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Spark Summit EU 2015: Reynold Xin Keynote

6,609 views

Published on

A Look Ahead at Spark's Development

Published in: Software
  • Hello! Get Your Professional Job-Winning Resume Here - Check our website! https://vk.cc/818RFv
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Spark Summit EU 2015: Reynold Xin Keynote

  1. 1. A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th,2015
  2. 2. SQL Streaming MLlib Spark Core (RDD) GraphX Spark stack diagram
  3. 3. Frontend (user facing APIs) Backend (execution) Spark stack diagram (a different take)
  4. 4. Frontend (RDD, DataFrame, ML pipelines, …) Backend (scheduler, shuffle, operators, …) Spark stack diagram (a different take)
  5. 5. Last 12 months of Spark evolution Frontend DataFrames Data sources R Machine learning pipelines … Backend Project Tungsten Sort-based shuffle Netty-based network …
  6. 6. Last 12 months of Spark evolution Frontend DataFrames Data sources R Machine learning pipelines … Backend Project Tungsten Sort-based shuffle Netty-based network …
  7. 7. DataFrame: A Frontend Perspective
  8. 8. Spark DataFrame >  head(filter(df,  df$waiting <  50))    #  an  example  in  R ##    eruptions  waiting ##1          1.750            47 ##2          1.750            47 ##3          1.867            48 Scalabledata frame for Java, Python, R, Scala Similar APIs as single-nodetools (Pandas, dplyr), i.e. easy to learn
  9. 9. Spark RDD Execution Java/Scala frontend JVM backend Python frontend Python backend opaque closures (user-defined functions)
  10. 10. Spark DataFrame Execution DataFrame frontend Logical Plan Physical execution Catalyst optimizer Intermediate representationfor computation
  11. 11. Spark DataFrame Execution Python DF Logical Plan Physical execution Catalyst optimizer Java/Scala DF R DF Intermediate representationfor computation Simple wrappers to create logical plan
  12. 12. Benefit of Logical Plan: Simpler Frontend Python : ~2000 line of code (built over a weekend) R : ~1000 line of code i.e. much easier to add newlanguagebindings (Julia, Clojure, …)
  13. 13. Performance 0 2 4 6 8 10 Java/Scala Python Runtime for an example aggregationworkload RDD
  14. 14. Benefit of Logical Plan: Performance Parity Across Languages 0 2 4 6 8 10 Java/Scala Python Java/Scala Python R SQL Runtime for an example aggregationworkload (secs) DataFrame RDD
  15. 15. Tungsten: A Backend Perspective
  16. 16. Hardware Trends Storage Network CPU
  17. 17. Hardware Trends 2010 Storage 50+MB/s (HDD) Network 1Gbps CPU ~3GHz
  18. 18. Hardware Trends 2010 2015 Storage 50+MB/s (HDD) 500+MB/s (SSD) Network 1Gbps 10Gbps CPU ~3GHz ~3GHz
  19. 19. Hardware Trends 2010 2015 Storage 50+MB/s (HDD) 500+MB/s (SSD) 10X Network 1Gbps 10Gbps 10X CPU ~3GHz ~3GHz L
  20. 20. Project Tungsten Substantially speed up execution by optimizing CPU efficiency, via: (1) Runtime code generation (2) Exploiting cachelocality (3) Off-heap memory management
  21. 21. From DataFrame to Tungsten Python DF Logical Plan Java/Scala DF R DF Tungsten Execution Initial phasein Spark 1.5 More work coming in 2016
  22. 22. 3 Things to Look Forward To
  23. 23. Dataset API in Spark 1.6 Typed interface over DataFrames / Tungsten case  class Person(name:   String,  age:  Int) val dataframe =  read.json(“people.json”) val ds:  Dataset[Person]   =  dataframe.as[Person] ds.filter(p   =>  p.name.startsWith(“M”)) .groupBy(“name”) .avg(“age”)
  24. 24. Dataset “Encoder”to specify type information so Spark can translate it into DataFrame and generateoptimized memory layouts CheckoutSPARK-9999 Dataset[T] DataFrame encoder
  25. 25. Streaming DataFrames Easier-to-use APIs (batch, streaming, and interactive) And optimizations: - Tungstenbackends - native support for out-of-order data - data sourcesand sinks val stream =  read.kafka("...") stream.window(5 mins,  10 secs) .agg(sum("sales")) .write.jdbc("mysql://...")
  26. 26. 3D XPoint - DRAM latency - SSD capacity - Byte addressible
  27. 27. Python Java/Scala RSQL … DataFrame Logical Plan LLVMJVM SIMD 3D XPoint Unified API, One Engine, Automatically Optimized Tungsten backend language frontend …
  28. 28. Tungsten Execution PythonSQL R Streaming DataFrame (& Dataset) Advanced Analytics
  29. 29. Office Hours Today @ Databricks booth Topic Area 10:30–11:30 Spark general(Reynold) 13:00–14:00 R and datascience (Hossein) 13:30–14:30 machine learning(Joseph) 14:00–15:00 Spark, YARN, etc (Andrew)

×