Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databricks)

3,998 views

Published on

Presentation at Spark Summit 2015

Published in: Data & Analytics
  • Hello! High Quality And Affordable Essays For You. Starting at $4.99 per page - Check our website! https://vk.cc/82gJD2
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databricks)

  1. 1. From DataFrames to Tungsten: A Peek into Spark’s Future Reynold Xin @rxin Spark Summit, San Francisco June 16th, 2015
  2. 2. DataFrame noun Making Spark accessible to everyone (data scientists, engineers, statisticians, …)
  3. 3. Tungsten noun Making Spark faster & prepare for the next five years.
  4. 4. How do DataFrames and Tungsten relate to each other?
  5. 5. Google Trends for “dataframe” Single-node tabular data structure, with API for relational algebra (filter, join, …) math and stats input/output (CSV, JSON, …) ad infinitum
  6. 6. Data frame: lingua franca for “small data”   head(flights)   #>  Source:  local  data  frame  [6  x  16]   #>     #>        year  month  day  dep_time  dep_delay  arr_time  arr_delay  carrier  tailnum   #>  1    2013          1      1            517                  2            830                11            UA    N14228   #>  2    2013          1      1            533                  4            850                20            UA    N24211   #>  3    2013          1      1            542                  2            923                33            AA    N619AA   #>  4    2013          1      1            544                -­‐1          1004              -­‐18            B6    N804JB   #>  ..    ...      ...  ...            ...              ...            ...              ...          ...          ...    
  7. 7. Spark DataFrame >  head(filter(df,  df$waiting  <  50))    #  an  example  in  R   ##    eruptions  waiting   ##1          1.750            47   ##2          1.750            47   ##3          1.867            48     Distributed data frame for Java, Python, R, Scala Similar APIs as single-node tools (Pandas, dplyr), i.e. easy to learn
  8. 8. data size KB MB GB TB PB Existing Single-node Data Frames Spark DataFrame
  9. 9. It is not Spark vs Python/R, but Spark and Python/R.
  10. 10. Spark and Python/R Spark DF scalability multi-core multi-machines Python/R DF Viz Machine Learning Stats wealth of libraries
  11. 11. Spark RDD Execution Java/Scala API JVM Execution Python API Python Execution opaque closures (user-defined functions)
  12. 12. Spark DataFrame Execution DataFrame Logical Plan Physical Execution Catalyst optimizer Intermediate representation for computation
  13. 13. Spark DataFrame Execution Python DF Logical Plan Physical Execution Catalyst optimizer Java/Scala DF R DF Intermediate representation for computation Simple wrappers to create logical plan
  14. 14. Benefit of Logical Plan: Simpler Frontend Python : ~2000 line of code (built over a weekend) R : ~1000 line of code i.e. much easier to add new language bindings (Julia, Clojure, …)
  15. 15. Performance 0 2 4 6 8 10 Java/Scala Python Runtime for an example aggregation workload RDD
  16. 16. Benefit of Logical Plan: Performance Parity Across Languages 0 2 4 6 8 10 Java/Scala Python Java/Scala Python R SQL Runtime for an example aggregation workload (secs) DataFrame RDD
  17. 17. What about Tungsten?
  18. 18. Hardware Trends Storage Network CPU
  19. 19. Hardware Trends 2010 Storage 50+MB/s (HDD) Network 1Gbps CPU ~3GHz
  20. 20. Hardware Trends 2010 2015 Storage 50+MB/s (HDD) 500+MB/s (SSD) Network 1Gbps 10Gbps CPU ~3GHz ~3GHz
  21. 21. Hardware Trends 2010 2015 Storage 50+MB/s (HDD) 500+MB/s (SSD) 10X Network 1Gbps 10Gbps 10X CPU ~3GHz ~3GHz L
  22. 22. Tungsten: Preparing Spark for Next 5 Years Substantially speed up execution by optimizing CPU efficiency, via: (1)  Runtime code generation (2)  Exploiting cache locality (3)  Off-heap memory management
  23. 23. From DataFrame to Tungsten Python DF Logical Plan Java/Scala DF R DF Tungsten Execution 5PM Deep Dive into Project Tungsten Developer Track by Josh Rosen
  24. 24. Initial Performance Results 0 200 400 600 800 1000 1200 1x 2x 4x 8x Runtime(seconds) Data set size (relative) Tungsten-off Tungsten-on
  25. 25. Python Java/Scala RSQL … DataFrame Logical Plan LLVMJVM GPU NVRAM Unified API, One Engine, Automatically Optimized Tungsten backend language frontend …
  26. 26. Tungsten Execution PythonSQL R Streaming DataFrame Advanced Analytics
  27. 27. Spark Office Hours Today Databricks booth A1 Topic Area 1:00-1:45 Core, YARN, Ops 1:45-2:30 Core/SQL/Data Science 3:00-3:40 Streaming 3:40-4:15 Core, Python, R 4:30-5:15 Machine Learning 5:15-6:00 Matei Zaharia

×