Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Upcoming SlideShare
What to Upload to SlideShare
What to Upload to SlideShare
Loading in …3
×
1 of 29

Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal

4

Share

Download to read offline

Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community contributors have continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, Apache Spark 2.3 has made similar strides too, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of Spark 2.3 features:

Kubernetes Scheduler Backend
PySpark Performance and Enhancements
Continuous Structured Streaming Processing
DataSource v2 APIs
Spark History Server Performance Enhancements

Related Books

Free with a 30 day trial from Scribd

See all

Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal

  1. 1. Sameer Agarwal Spark Summit | San Francisco | June 6th 2018 What’s New in Apache Spark 2.3 #DevSAIS16
  2. 2. About Me 2 • Spark Committer and 2.3 Release Manager • Software Engineer at Facebook (Big Compute) • Previously at Databricks and UC Berkeley • Research on BlinkDB (Approximate Queries in Spark)
  3. 3. Spark 2.3 Release by the numbers • Released on 28th February 2018 • Development Span: July ‘17 – Feb ‘18 • 284 Contributors • 1406 JIRAs – SQL/Streaming (52%) – Spark Core (12%) – PySpark (9%) – ML (8%) 3
  4. 4. Overview 4 Continuous Processing Spark on Kubernetes PySpark Performance ML Streaming + Image Reader
  5. 5. Major Features in Spark 2.3 5 Continuous Processing Data Source API V2 Stream-stream Join Spark on Kubernetes History Server V2 UDF Enhancements Various SQL Features PySpark Performance Native ORC Support Stable Codegen Image Reader ML on Streaming https://spark.apache.org/releases/spark-release-2-3 -0.html
  6. 6. Overview 6 Continuous Processing Spark on Kubernetes PySpark Performance ML Streaming + Image Reader
  7. 7. Structured Streaming 7 Users: Treat a stream as an infinite table, no need to reason about micro-batches Developers: Decoupled the high-level API with the execution engine
  8. 8. Structured Streaming 8
  9. 9. Micro Batch Execution 9
  10. 10. Micro Batch Execution 10 Latency > 100ms Exactly-once Semantics
  11. 11. Continuous Processing Continuous Processing (SPARK-20928) 11 An experimental execution mode
  12. 12. Continuous Processing (SPARK-20928) 12
  13. 13. Continuous Processing (SPARK-20928) 13 Latency ~1ms At-least once Semantics
  14. 14. Continuous Processing (SPARK-20928) 14
  15. 15. Continuous Processing (SPARK-20928) Supported Operations • Map-like Dataset Operations – Projections – Selections • All SQL functions – Except current_timestamp(), current_date() and aggregation functions 15 Supported Sources • Kafka Source • Rate Source Supported Sinks • Kafka Sink • Memory Sink • Console Sink Blog: https://tinyurl.com/spark-cp
  16. 16. Overview 16 Continuous Processing Spark on Kubernetes PySpark Performance ML Streaming + Image Reader
  17. 17. ML on Streaming • Model transformation/prediction on batch and streaming data with unified API • After fitting a model or Pipeline, you can deploy it in a streaming job val streamOutput = transformer.transform(streamDF) 17
  18. 18. Image Support in Spark (SPARK-21866) • A standard API in Spark for reading images into DataFrames • Utilities for loading images from common formats • Deep learning frameworks can rely on this val df = ImageSchema.readImages("/data/images") 18
  19. 19. Overview 19 Continuous Processing Spark on Kubernetes PySpark Performance ML Streaming + Image Reader
  20. 20. PySpark • Introduced in Spark 0.7 (~2013); became first class citizen in the Dataframe API in Spark 1.3 (~2015) • Much slower than Scala/Java with UDFs due to serialization and Python interpreter • Note: Most PyData tooling (e.g., Pandas, numpy etc.) are written in C/C++ 20
  21. 21. PySpark Performance 21 Pandas UDFs perform much better than row-at-a-time UDFs across the board, ranging from 3x to over 100x.
  22. 22. 22 Scalar UDFs • Used with functions such as select and withColumn • The python function should take pandas.Series as input and return a pandas.Series of same length Pandas/Vectorized UDFs
  23. 23. 23 Pandas/Vectorized UDFs Grouped Map UDFs • Split-apply-Combine • A python function that defines the computation for each group • Input/Outputs are both pandas.DataFrame Blog: https://tinyurl.com/pyspark-udf
  24. 24. Overview 24 Continuous Processing Spark on Kubernetes PySpark Performance ML Streaming + Image Reader
  25. 25. Spark on Kubernetes (SPARK-18278) 25 Spark Core Spark SQL + DataFrames Structured Streaming MLlib GraphX Standalone YARN Mesos
  26. 26. Spark on Kubernetes (SPARK-18278) • Driver runs in a Kubernetes pod created by the submission client and creates pods that runs the executors in response to requests from Spark Scheduler • Make direct use of Kubernetes clusters for multi-tenancy and sharing through Namespaces and Quotas, as well as administrative features such as Pluggable Authorization and Logging 26
  27. 27. Spark on Kubernetes (SPARK-18278) 27 Apache Spark 2.3 • Supports K8S 1.6+ • Cluster Mode • Static Resource Allocation • Java/Scala Applications • Container-local and remote- dependencies that are downloadable Roadmap (Apache Spark 2.4+) • Client Mode • Dynamic Resource Allocation + External Shuffle Service • Python/R Applications • Client-local dependencies + Resource Staging Server (RSS) Blog: https://tinyurl.com/spark-k8s
  28. 28. Recap 28 Continuous Processing Spark on Kubernetes PySpark Performance ML Streaming + Image Reader
  29. 29. Sameer Agarwal Spark Summit | San Francisco | June 6th 2018 Questions? #DevSAIS16

×