Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin

Keynote from Reynold Xin at Spark + AI Summit Europe

  • Login to see the comments

Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin

  1. 1. Spark + AI Summit @rxin
  2. 2. Lester
  3. 3. Prize$1m anonymized movie rating dataset best recommendation algorithm wins
  4. 4. MateiLester
  5. 5. The first unified analytics engine in 600 lines of code… Big Data Machine Learning
  6. 6. tied for best score 20 mins late
  7. 7. Apache Spark 1.0 (2014) Fast and general engine for distributed data processing
  8. 8. DataFrame + Tungsten (2015) Tungsten Execution PythonSQL R Streaming DataFrame Advanced Analytics
  9. 9. 2015 “Apache Spark is the Taylor Swift of big data software.” - Derrick Harris, Fortune, Sep 2015
  10. 10. Structured Streaming (2016) The simplest way to perform analytics is not having to reason about streaming Single API! Spark 1.3 Static DataFrames Spark 2.0 Infinite DataFrames
  11. 11. Continuous Processing (2017) sub-milliseconds streaming A new execution mode that follows fully pipelined execution. ● Streaming execution without microbatches ● Supports async checkpointing ~1ms latency ● no changes required for user code Proposal available at https://issues.apache.org/jira/browse/SPARK-20928
  12. 12. 100,000,000,000,000 streaming records processed on Databricks in 2018
  13. 13. Explosion of ML Frameworks …
  14. 14. Prep data Train models Embracing ML ecosystem as 1st-class citizens …
  15. 15. Two Challenges in Supporting ML Frameworks in Spark Data exchange: need to push data in high throughput between Spark and ML frameworks Execution model: fundamental incompatibility between Spark (embarrassingly parallel) vs ML frameworks (gang scheduled) 1 2
  16. 16. Introducing Project Hydrogen
  17. 17. Data Exchange Execution Model
  18. 18. User-Defined Functions (UDFs) Allows executing arbitrary code, often used for integration with ML frameworks Example: prediction on data using TensorFlow
  19. 19. Row-at-a-time Data Exchange Spark UDF (x + 1) 1 john 4.1 2 mike 3.5 3 sally 6.4 1 john 4.1 2 2 john 4.1
  20. 20. Spark Row-at-a-time Data Exchange UDF (x + 1) 1 john 4.1 2 mike 3.5 3 sally 6.4 2 3 mike 3.5 3 mike 3.5
  21. 21. Row-at-a-time Data Exchange Profile UDF lambda x: x + 1 Source: Li Jin, Pandas UDF: Scalable Analysis with Python and PySpark 8 Mb/s 92% in data exchange
  22. 22. Row-at-a-time Data Exchange 8 Mb/s 92% in data exchange Profile UDF lambda x: x + 1 Source: Li Jin, Pandas UDF: Scalable Analysis with Python and PySpark 92% CPU Cycles Wasted!!!
  23. 23. Vectorized Data Exchange Spark UDF (x + 1) 1 john 4.1 2 mike 3.5 3 sally 6.4 2 3 4 2 john 4.1 3 mike 3.5 4 sally 6.4 1 2 3 john 4.1 mike 3.5 sally 6.4
  24. 24. 0.9 1.1 7.2 Performance - 3 to 240X faster Plus One CDF Subtract Mean 3.15 242 117 Runtime (seconds - shorter is better) Row-at-a-time Vectorized
  25. 25. Data Exchange Execution Model
  26. 26. Execution Models Task 1 Task 2 Task 3 Spark Tasks are independent of each other Embarrassingly parallel & massively scalable Distributed ML Frameworks Complete coordination among tasks Optimized for communication Task 1 Task 2 Task 3 3
  27. 27. What if a task crashes? Task 1 Task 2 Task 3 Spark Tasks are independent of each other Embarrassingly parallel & massively scalable Distributed ML Frameworks Complete coordination among tasks Optimized for communication Task 1 Task 2 Task 3
  28. 28. Incompatible Execution Models Task 1 Task 2 Task 3 Spark Tasks are independent of each other Embarrassingly parallel & massively scalable If a task crashes, rerun that one Distributed ML Frameworks Complete coordination among tasks Optimized for communication If a task crashes, must rerun all tasks Task 1 Task 2 Task 3
  29. 29. Unifying Execution Models with Barrier Stage 1 data prep embarrassingly parallel Stage 2 distributed ML training gang scheduled Stage 3 data sink embarrassingly parallel tasks “all or nothing” to reconcile fundamental incompatibility between Spark and distributed ML frameworks
  30. 30. 10 to 100X Faster Data Exchange Unify Spark + ML Execution Model Project Hydrogen
  31. 31. Timeline Spark 2.3 (Spring 2018): Basic vectorized UDFs (SPARK-21190) Spark 2.4 (Fall 2018): Barrier scheduling (SPARK-24374), and more vectorized UDFs support (SPARK-22216) Spark 3.0 (2019): GA and standard format for data exchange (SPARK-24579)
  32. 32. Session Talk Highlights Experience Of Optimizing Spark SQL When Migrating from MPP Database Tim Hunter, Databricks Xiangrui Meng, Databricks Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark Apache Spark on K8S and HDFS Security Ilan Filonenko, Bloomberg Yucai Yu, eBay Yuming Wang, eBay

×