What's New in Apache Spark 2.3 & Why Should You Care

The Apache Spark 2.3 release marks a big step forward in speed, unification, and API support.

This talk will quickly walk through what’s new and how you can benefit from the upcoming improvements:

* Continuous Processing in Structured Streaming.

* PySpark support for vectorization, giving Python developers the ability to run native Python code fast.

* Native Kubernetes support, marrying the best of container orchestration and distributed data processing.


  1. What's New in Apache Spark™ 2.3 | Jules S. Damji, BASM Meetup, Bloomberg | @2twitme | May 15, 2018
  2. Spark Community & Developer Advocate @ Databricks; Program Chair, Spark + AI Summit; Developer Advocate @ Hortonworks; software engineering @ Sun Microsystems, Netscape, @Home, VeriSign, Scalix, Centrify, LoudCloud/Opsware, ProQuest. https://www.linkedin.com/in/dmatrix @2twitme
  3. Databricks’ Unified Analytics Platform: DATABRICKS RUNTIME (Delta, SQL, Streaming) + COLLABORATIVE WORKSPACE, a CLOUD NATIVE SERVICE powered by data engineers and data scientists. Unifies data engineers and data scientists; unifies data and AI technologies; eliminates infrastructure complexity.
  4. Major Features in Apache Spark 2.3: Continuous Processing, Data Source API V2, Stream-Stream Join, Spark on Kubernetes, History Server V2, UDF Enhancements, Various SQL Features, PySpark Performance, Native ORC Support, Stable Codegen, Image Reader, ML on Streaming. Over 1,400 issues resolved!
  5. This Talk (Apache Spark 2.3): Continuous Processing, Spark on Kubernetes, PySpark Performance, Streaming ML + Image Reader, Stream-Stream Join
  6. This Talk: Continuous Processing, Stream-Stream Join
  7. Building robust stream processing apps is hard
  8. Complexities in stream processing.
     COMPLEX DATA: diverse data formats (json, avro, binary, …); data can be dirty, late, out-of-order.
     COMPLEX SYSTEMS: diverse storage systems (Kafka, S3, Kinesis, RDBMS, …); system failures.
     COMPLEX WORKLOADS: combining streaming with interactive queries; joining two streams; machine learning with streams.
  9. Structured Streaming: stream processing on the Spark SQL engine. Fast, scalable, fault-tolerant. Rich, unified, high-level APIs: deal with complex data and complex workloads. Rich ecosystem of data sources: integrate with many storage systems.
  10. Structured Streaming – Continuous Processing Mode
  11. Structured Streaming Processing Modes
  12. Continuous Processing: Structured Streaming – Continuous Mode
  13. Structured Streaming: introduced in Spark 2.0, production-ready since Spark 2.2. Among Databricks customers: 10x more usage than DStreams; 100+ trillion records processed in production.
  14. Continuous Processing Execution Mode.
      Continuous processing (since the 2.3 release) [SPARK-20928]:
      • A new streaming execution mode (experimental)
      • Low (~1 ms) end-to-end latency
      • At-least-once guarantees
      Micro-batch processing (since the 2.0 release):
      • End-to-end latencies as low as ~100 ms
      • Exactly-once fault-tolerance guarantees
  15. Continuous Processing: the only change you need! [Benchmark/rating chart; a sketch of that one-line change follows the next slide.]
  16. Continuous Processing.
      Supported operations:
      • Map-like Dataset operations (projections, selections)
      • All SQL functions, except current_timestamp(), current_date(), and aggregation functions
      Supported sources:
      • Kafka source
      • Rate source
      Supported sinks:
      • Kafka sink
      • Memory sink
      • Console sink
      https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
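As a hedged sketch of that "only change": in PySpark, switching from micro-batch to continuous execution is just a different trigger on an otherwise unchanged Kafka-to-Kafka query (the broker address, topic names, and checkpoint path below are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continuous-demo").getOrCreate()

# Kafka is one of the supported continuous-mode sources
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "in-topic")
          .load())

# The one-line change: a continuous trigger with a checkpoint interval,
# instead of the default micro-batch trigger
query = (events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("topic", "out-topic")
         .option("checkpointLocation", "/tmp/checkpoints/continuous")
         .trigger(continuous="1 second")
         .start())
```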
  17. Stream-Stream Joins
  18. Inner Join + Time Constraints + Watermarks. Time constraints: impressions can be up to 2 hours late; clicks can be up to 3 hours late; a click can occur within 1 hour after the corresponding impression.

```scala
val impressionsWithWatermark = impressions
  .withWatermark("impressionTime", "2 hours")

val clicksWithWatermark = clicks
  .withWatermark("clickTime", "3 hours")

impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """))
```
  19. Stream-Stream Join support matrix:
      • Inner: supported; optionally specify a watermark on both sides plus time constraints for state cleanup.
      • Left outer: conditionally supported; must specify a watermark on the right plus time constraints for correct results; optionally specify a watermark on the left for all state cleanup.
      • Right outer: conditionally supported; must specify a watermark on the left plus time constraints for correct results; optionally specify a watermark on the right for all state cleanup.
      • Full outer: not supported.
  20. This Talk: Continuous Processing, Spark on Kubernetes, PySpark Performance, Streaming ML + Image Reader, Databricks Delta
  21. Streaming Machine Learning: model transformation/prediction on batch and streaming data with a unified API. After fitting a model or Pipeline, you can deploy it in a streaming job: val streamOutput = transformer.transform(streamDF)
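A minimal PySpark sketch of the same idea, assuming a Pipeline has already been fitted on batch data and saved; the model path, input directory, and input_schema below are illustrative, not from the deck:

```python
from pyspark.ml import PipelineModel

# Load a Pipeline fitted offline on batch data (path is a placeholder)
model = PipelineModel.load("/models/fitted_pipeline")

# A streaming DataFrame whose schema matches the training data;
# input_schema is assumed to be defined elsewhere
stream_df = (spark.readStream
             .schema(input_schema)
             .parquet("/data/incoming"))

# The same transform() call works on batch and streaming DataFrames
predictions = model.transform(stream_df)
query = predictions.writeStream.format("console").start()
```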
  22. Demo: Structured Streaming + MLlib: https://dbricks.co/ss_mllib_py (ss_mllib_py)
  23. Image Support in Spark. Spark image data source [SPARK-21866]: defines a standard API in Spark for loading and reading images, which deep learning frameworks can rely on. val df = ImageSchema.readImages("/data/images")
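The same data source is exposed to Python; a short sketch using the 2.3 PySpark API (the directory path is a placeholder):

```python
from pyspark.ml.image import ImageSchema

# Returns a DataFrame with a single "image" struct column
# (origin, height, width, nChannels, mode, data)
image_df = ImageSchema.readImages("/data/images")
image_df.printSchema()
```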
  24. Image Support in Spark
  25. This Talk: Continuous Processing, Spark on Kubernetes, PySpark Performance, Streaming ML + Image Reader, Databricks Delta
  26. PySpark: introduced in Spark 0.7 (~2013); became a first-class citizen in the DataFrame API in Spark 1.3 (~2015). Much slower than Scala/Java when using user-defined functions (UDFs), due to serialization and the Python interpreter. Note: most PyData tooling (e.g., pandas, NumPy) is implemented in C/C++.
  27. PySpark Performance: fast data serialization and execution using vectorized formats [SPARK-22216] [SPARK-21187].
      • Conversion from/to pandas: df.toPandas(), createDataFrame(pandas_df)
      • Pandas/vectorized UDFs (UDFs that use pandas to process data): Scalar Pandas UDFs and Grouped Map Pandas UDFs
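A short sketch of the Arrow-backed conversion path; spark.sql.execution.arrow.enabled is the 2.3 configuration flag that switches these conversions to Arrow (it is off by default):

```python
import pandas as pd

# Enable Arrow-based columnar data transfers between the JVM and Python
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = pd.DataFrame({"v": range(1000)})
df = spark.createDataFrame(pdf)   # pandas -> Spark, vectorized via Arrow
pdf2 = df.toPandas()              # Spark -> pandas, vectorized via Arrow
```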
  28. Pandas UDF. Scalar Pandas UDFs: used with functions such as select and withColumn. The Python function should take a pandas.Series as input and return a pandas.Series of the same length.
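A minimal scalar Pandas UDF using the 2.3 pandas_udf API (the column names are illustrative):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(pd.DataFrame({"v": [1.0, 2.0, 3.0]}))

@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v):
    # v arrives as a pandas.Series; return a Series of the same length
    return v + 1

df.withColumn("v_plus_one", plus_one(df["v"])).show()
```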
  29. Pandas UDF. Grouped Map Pandas UDFs: split-apply-combine. A Python function defines the computation for each group, and an output schema declares the result columns.
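And a grouped map Pandas UDF showing split-apply-combine, in the shape used by the 2.3 docs (column names are illustrative):

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0)], ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame holding one group; the returned
    # DataFrame must match the declared output schema
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean).show()
```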
  30. PySpark Performance. Blog: "Introducing Pandas UDFs for PySpark" (Two Sigma): http://dbricks.co/2rMwmW0. Apache Arrow, a columnar format for data exchange: https://arrow.apache.org
  31. Demo: try importing this notebook at home: https://dbricks.co/pandas_udf
  32. This Talk: Continuous Processing, Spark on Kubernetes, PySpark Performance, Streaming ML + Image Reader
  33. Native Spark Apps in K8s: a new Spark scheduler backend. The driver runs in a Kubernetes pod created by the submission client and creates pods that run the executors in response to requests from the Spark scheduler [K8S-34377] [SPARK-18278]. Makes direct use of Kubernetes clusters for multi-tenancy and sharing through namespaces and quotas, as well as administrative features such as pluggable authorization and logging.
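For reference, a 2.3-style spark-submit against a Kubernetes cluster looks like the following sketch from the 2.3 documentation (the API server address and container image are placeholders):

```bash
bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
```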
  34. Spark on Kubernetes.
      Supported:
      • Kubernetes 1.6 and up
      • Cluster mode only
      • Static resource allocation only
      • Java and Scala applications
      • Container-local and downloadable remote dependencies
      On the roadmap (2.4):
      • Client mode
      • Dynamic resource allocation + external shuffle service
      • Python and R support
      • Submission-client local dependencies + resource staging server (RSS)
      • Non-secured and Kerberized HDFS access (injection of Hadoop configuration)
  35. General Client Mode: https://databricks.com/blog/2018/03/06/apache-spark-2-3-with-native-kubernetes-support.html
  36. Continuous Processing, PySpark Performance, Streaming ML + Image Reader, Databricks Delta: try these on Databricks Runtime 4.0 today! Community Edition: https://community.cloud.databricks.com/
  37. Save $300 with code JulesPicks: https://tinyurl.com/saissf18
  38. Thank you! @2twitme jules@databricks.com
