Deep Dive into the New Features of Apache Spark 3.1
Data + AI Summit 2021
Xiao Li (gatorsmile), Wenchen Fan (cloud-fan)

Continuing the objectives of making Spark faster, easier, and smarter, Apache Spark 3.1 extends its scope with more than 1,500 resolved JIRAs. We will talk about the exciting new developments in Apache Spark 3.1 as well as some other major initiatives that are coming in the future. In this talk, we want to share with the community many of the most important changes, with examples and demos.

The following features are covered: SQL features for ANSI SQL compliance, new streaming features, Python usability improvements, and performance enhancements and new tuning tricks in the query compiler.

1. Deep Dive into the New Features of Apache Spark 3.1
   Data + AI Summit 2021
   Xiao Li (gatorsmile), Wenchen Fan (cloud-fan)
2. The Data and AI Company
   • Original creators of Apache Spark
   • 5000+ customers across the globe
   • Lakehouse: one simple platform to unify all of your data, analytics, and AI workloads
3. About Us
   • The Spark team at Databricks
   • Apache Spark Committer and PMC members
   Xiao Li (Github: gatorsmile)
   Wenchen Fan (Github: cloud-fan)
4. Apache Spark 3.1
   Areas: ANSI Compliance, Execution, Performance, Streaming, Python, and more
   Features: Node Decommissioning, Create Table Syntax, Sub-expression Elimination, Spark on K8S GA,
   Runtime Error, Partition Pruning, Explicit Cast, Char/Varchar, Predicate Pushdown, Shuffle Hash Join,
   Shuffle Removal, Nested Field Pruning, Catalog APIs for JDBC, Ignore Hints, Stream-stream Join,
   New Doc for PySpark, History Server Support for SS, Streaming Table APIs, Python Type Support,
   Dependency Management, Installation Option for PyPi, Search Function in Spark Doc,
   State Schema Validation, Stage-level Scheduling
5. ANSI SQL Compliance
   • Overflow Checking
   • Create Table Syntax
   • Explicit Cast
   • CHAR/VARCHAR
6. Fail Earlier for Invalid Data (Apache Spark 3.1)
   In Spark 3.0 we added the overflow check; in 3.1 more checks are added.
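   A hedged sketch of the runtime checking described above (the failing expressions are illustrative, not taken from the slides): with ANSI mode enabled, invalid data fails the query at runtime instead of silently producing NULL or a wrapped-around value.

   # Illustrative sketch of ANSI-mode runtime checks; expressions are examples, not from the deck.
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()
   spark.conf.set("spark.sql.ansi.enabled", "true")

   # Overflow check (added in 3.0): 2147483648 does not fit into INT.
   spark.sql("SELECT CAST(2147483648 AS INT)").show()   # raises an ArithmeticException at runtime

   # Invalid string-to-numeric casts also fail under ANSI mode instead of returning NULL.
   spark.sql("SELECT CAST('abc' AS INT)").show()         # raises a runtime error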
7-8. Forbid Confusing CAST (Apache Spark 3.1)
9. ANSI Mode GA in Spark 3.2 (In Development)
   • ANSI implicit CAST
   • More runtime failures for invalid input
   • No-failure alternatives: TRY_CAST, TRY_DIVIDE, TRY_ADD, …
   • ...
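   A hedged sketch of the expected behavior of the no-failure alternatives (these functions are slated for Spark 3.2, so the exact semantics may still change): even with ANSI mode on, TRY_CAST returns NULL for invalid input instead of raising an error.

   # Sketch only: expected Spark 3.2 behavior of TRY_CAST under ANSI mode.
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()
   spark.conf.set("spark.sql.ansi.enabled", "true")
   spark.sql("SELECT TRY_CAST('abc' AS INT)").show()   # returns NULL instead of failing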
10. Unified CREATE TABLE SQL Syntax (Apache Spark 3.1)
    CREATE TABLE t (col INT) USING parquet OPTIONS …
    CREATE TABLE t (col INT) STORED AS parquet SERDEPROPERTIES …
    -- Before: this creates a Hive text table, super slow!
    CREATE TABLE t (col INT)
    -- After Spark 3.1 …
    SET spark.sql.legacy.createHiveTableByDefault=false
    -- Now this creates a native parquet table
    CREATE TABLE t (col INT)
11. CHAR/VARCHAR Support (Apache Spark 3.1)
    Table insertion fails at runtime if the input string length exceeds the CHAR/VARCHAR length.
12. CHAR/VARCHAR Support (Apache Spark 3.1)
    CHAR type values will be padded to the declared length.
    This is mostly for ANSI compatibility; VARCHAR is recommended instead.
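    A hedged sketch of the two behaviors above (table name and values are illustrative):

    # Sketch: CHAR/VARCHAR length enforcement and CHAR padding in Spark 3.1 (illustrative names).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.sql("CREATE TABLE chars (c CHAR(5), v VARCHAR(5)) USING parquet")
    spark.sql("INSERT INTO chars VALUES ('ab', 'ab')")      # ok; 'ab' is padded to 'ab   ' in the CHAR column
    spark.sql("INSERT INTO chars VALUES ('abcdef', 'ab')")  # fails at runtime: length 6 exceeds CHAR(5)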
13. More ANSI Features (Apache Spark 3.1)
    • Unify SQL temp view and permanent view behaviors (SPARK-33138)
      • Re-parse and analyze the view SQL string when reading the view
    • Support column list in INSERT statement (SPARK-32976)
      • INSERT INTO t(col2, col1) VALUES …
    • Support ANSI nested bracketed comments (SPARK-28880)
    • ...
14. More ANSI Features Coming in Spark 3.2! (In Development)
    • ANSI year-month and day-time INTERVAL data types
      • Comparable and persist-able
    • ANSI TIMESTAMP WITHOUT TIMEZONE type
      • Simplifies timestamp handling
    • A new decorrelation framework for correlated subqueries
      • Support outer references in more places
    • LATERAL JOIN
      • FROM t1 JOIN LATERAL (SELECT t1.col + t2.col FROM t2)
    • SQL error codes
      • More searchable, cross-language, JDBC compatible
15. Node Decommissioning
16. Node Decommissioning (Apache Spark 3.1)
    Gracefully handle scheduled executor shutdown:
    • Auto-scaling: Spark decides to shut down one or more idle executors.
    • EC2 spot instances: the executor gets notified when it’s going to be killed soon.
    • GCE preemptible instances: same as above.
    • YARN & Kubernetes: kill containers with notification for higher-priority tasks.
17-19. Node Decommissioning (Apache Spark 3.1)
    Migrate RDD cache and shuffle blocks from executors that are going to be shut down to other live executors.
    (Diagram across three slides, showing the driver and executors 1-3:)
    1. The shutdown trigger sends a signal to notify executor 1 of the upcoming shutdown.
    2. Executor 1 notifies the driver and migrates its data to executors 2 and 3.
    3. The driver stops scheduling tasks on executor 1.
20. Summary (Apache Spark 3.1)
    • Migrate data blocks to other nodes before shutdown, to avoid recomputing them later.
    • Stop scheduling tasks on the decommissioning node, as they likely can’t complete and would waste resources.
    • Launch speculative tasks for tasks running on the decommissioning node that likely can’t complete.
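    A hedged configuration sketch for turning this on (the flag names below are my reading of the Spark 3.1 configuration and should be verified against the official docs; which ones you need also depends on the cluster manager):

    # Sketch: enabling graceful decommissioning and block migration (config names assumed, verify in docs).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.decommission.enabled", "true")                        # react to decommission signals
        .config("spark.storage.decommission.enabled", "true")                # migrate blocks off the executor
        .config("spark.storage.decommission.rddBlocks.enabled", "true")      # migrate cached RDD blocks
        .config("spark.storage.decommission.shuffleBlocks.enabled", "true")  # migrate shuffle blocks
        .getOrCreate()
    )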
21. SQL Performance
    • Shuffle Hash Join
    • Partition Pruning
    • Predicate Pushdown
    • Compile Latency Reduction
22. Shuffle Hash Join Improvement (Apache Spark 3.1)
    Spark prefers Sort Merge Join over Shuffle Hash Join to avoid OOM.
    (Diagram: both the build side and the probe side are partitioned by the join keys; each build-side partition is loaded into a hash table, and probe-side rows look it up to join.)
23. Shuffle Hash Join Improvement (Apache Spark 3.1)
    Makes Shuffle Hash Join on par with Sort Merge Join and Broadcast Hash Join:
    • Add code-gen for shuffled hash join (SPARK-32421)
    • Support full outer join in shuffled hash join (SPARK-32399)
    • Add handling for unique key in non-codegen hash join (SPARK-32420)
    • Preserve shuffled hash join build side partitioning (SPARK-32330)
    • ...
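    A hedged usage sketch (the tables and join key are illustrative): since Spark 3.0 a shuffled hash join can be requested explicitly with a join hint, and the 3.1 improvements above make that strategy a safer choice.

    # Sketch: asking for a shuffled hash join via hints (illustrative tables and key).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    orders = spark.table("orders")
    users = spark.table("users")

    joined = orders.join(users.hint("shuffle_hash"), "user_id", "left")
    joined.explain()  # the physical plan should show ShuffledHashJoin instead of SortMergeJoin

    # Equivalent SQL hint:
    spark.sql("SELECT /*+ SHUFFLE_HASH(users) */ * FROM orders JOIN users USING (user_id)")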
24-25. Partition Pruning Improvement (Apache Spark 3.1)
    Partition pruning is critical for file scan performance.
    (Diagram across two slides: 1. the file scan operator sends partition predicates (pushed filters) to the catalog; 2. it gets back the pruned file list; 3. it launches Spark tasks to read the files.)
26. Partition Pruning Improvement (Apache Spark 3.1)
    Push down more partition predicates:
    • Support Contains, StartsWith and EndsWith in partition pruning (SPARK-33458)
    • Support date type in partition pruning (SPARK-33477)
    • Support not-equals in partition pruning (SPARK-33582)
    • Support NOT IN in partition pruning (SPARK-34538)
    • ...
27. Predicate Pushdown Improvement (Apache Spark 3.1)
    Join condition mixed with columns from both sides:
    • Can push: FROM t1 JOIN t2 ON t1.key = 1 AND t2.key = 2
    • Cannot push: FROM t1 JOIN t2 ON t1.key = 1 OR t2.key = 2
    • Cannot push: FROM t1 JOIN t2 ON (t1.key = 1 AND t2.key = 2) OR t1.key = 3
    Predicates mixed with data columns and partition columns have similar issues.
28. Predicate Pushdown Improvement (Apache Spark 3.1)
    Conjunctive Normal Form (CNF):
    (a1 AND a2) OR (b1 AND b2) ->
    (a1 OR b1) AND (a1 OR b2) AND (a2 OR b1) AND (a2 OR b2)
    Push down more predicates, less disk IO.
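    A worked example of my own (not from the deck) connecting this to the "cannot push" case on slide 27: rewriting that condition into CNF exposes a conjunct that references only t1, so part of the filter can now be pushed down even though the original OR could not.

    (t1.key = 1 AND t2.key = 2) OR t1.key = 3
      -- in CNF -->
    (t1.key = 1 OR t1.key = 3) AND (t2.key = 2 OR t1.key = 3)
      -- the first conjunct references only t1, so it can be pushed down to the scan of t1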
29. Reduce Query Compiling Latency (coming in 3.2)
    • Optimize for short queries
    • A major improvement of the Catalyst framework
    Stay tuned!
30. Structured Streaming
31. Structured Streaming
    • > 120 trillion records/day processed on Databricks with Structured Streaming
    • Growth on Databricks with Structured Streaming: 2.5x over the past year, 12x since 2 years ago
    Topics: Stream-stream Join, History Server Support for SS, Streaming Table APIs, RocksDB State Store (3.2)
32. Streaming Table APIs (Apache Spark 3.1)
    The APIs to read/write continuous data streams as unbounded tables.
    input = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 10)
      .load()
    input.writeStream
      .option("checkpointLocation", "path/to/checkpoint/dir1")
      .format("delta")
      .toTable("myStreamTable")
33. Streaming Table APIs (Apache Spark 3.1)
    The APIs to read/write continuous data streams as unbounded tables.
    spark.readStream
      .table("myStreamTable")
      .select("value")
      .writeStream
      .option("checkpointLocation", "path/to/checkpoint/dir2")
      .format("delta")
      .toTable("newStreamTable")
    // Show the current snapshot
    spark.read
      .table("myStreamTable")
      .show(false)
34. Stream-stream Join (Apache Spark 3.1)
    (The slide diagram marks the join types added in this release as "New in 3.1"; full outer and left semi stream-stream joins are the new additions.)
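    A hedged sketch of one of the newly supported join types (sources, column names, rates, and time bounds are all illustrative; watermarks and a time-range condition are still required, as for the existing stream-stream joins):

    # Sketch: left semi stream-stream join, one of the join types added in 3.1 (illustrative data).
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr

    spark = SparkSession.builder.getOrCreate()

    impressions = (
        spark.readStream.format("rate").option("rowsPerSecond", 10).load()
        .selectExpr("value AS imp_ad_id", "timestamp AS imp_time")
        .withWatermark("imp_time", "10 minutes")
    )
    clicks = (
        spark.readStream.format("rate").option("rowsPerSecond", 5).load()
        .selectExpr("value AS click_ad_id", "timestamp AS click_time")
        .withWatermark("click_time", "20 minutes")
    )

    # Keep only impressions that received a click within one hour.
    matched = impressions.join(
        clicks,
        expr("click_ad_id = imp_ad_id AND "
             "click_time BETWEEN imp_time AND imp_time + interval 1 hour"),
        "left_semi",
    )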
35. Monitoring Structured Streaming (Apache Spark 3.1)
    Structured Streaming support in the Spark History Server is added in the 3.1 release.
36. State Store for Structured Streaming
    State Store backed operations, for example:
    • Stateful aggregations
    • Drop duplicates
    • Stream-stream joins
    The State Store maintains intermediate state data across batches, with:
    • High availability
    • Fault tolerance
37. State Store for Structured Streaming (Apache Spark 3.1)
    The metrics of the state store are added to the Live UI and the Spark History Server.
38. State Store
    HDFS-backed State Store (the default in Spark 3.1):
    • Each executor has hashmap(s) containing versioned data
    • Updates are committed as delta files in HDFS
    • Delta files are periodically collapsed into snapshots to improve recovery
    Drawbacks:
    • The amount of state that can be maintained is limited by the heap size of the executors.
    • State expiration by watermark and/or timeouts requires full scans over all the data.
    (Diagram: driver and executors, each executor holding key/value hash maps backed by delta files in HDFS.)
39. RocksDB State Store (In Development)
    RocksDB state store (the default in Databricks):
    • RocksDB can serve data from disk with a configurable amount of non-JVM memory.
    • Sorting keys by the appropriate column avoids full scans to find the keys that need to be dropped.
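    A hedged sketch of how the open-source RocksDB provider is expected to be enabled once it ships (the provider class name is my assumption, not from the slides; verify it against the Spark 3.2 documentation when it lands):

    # Sketch only: switching the streaming state store provider to RocksDB (expected in Spark 3.2).
    # The provider class name below is an assumption; check the 3.2 docs.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    )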
40. Project Zen: Python Usability
41. PySpark
    (Chart, 2020: Python 47%, SQL 41%, Scala and others 12%.)
    Topics: Python Type Support, Dependency Management, Visualization (3.2), pandas APIs (3.2)
42. Add the type hints [PEP 484] to PySpark! (Apache Spark 3.1)
    § Simple to use
      ▪ Autocompletion in IDE/Notebook. For example, in a Databricks notebook:
        ▪ Display Python docstring hints by pressing Shift+Tab
        ▪ Display a list of valid completions by pressing Tab
43. Add the type hints [PEP 484] to PySpark! (Apache Spark 3.1)
    § Simple to use
      ▪ Autocompletion in IDE/Notebook
    § Richer API documentation
      ▪ Automatic inclusion of input and output types
    § Better code quality
      ▪ Static analysis in IDE, error detection, type mismatch, etc.
      ▪ Running mypy on CI systems
44. Static Error Detection (Apache Spark 3.1)
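    A hedged sketch of what the new type hints enable (the function and the deliberate mistake are illustrative): a checker such as mypy can flag a type mismatch before the job ever runs.

    # Sketch: a deliberate type error that mypy can catch against PySpark's bundled type hints.
    from pyspark.sql import DataFrame, SparkSession
    from pyspark.sql.functions import col

    def keep_positive(df: DataFrame) -> DataFrame:
        return df.filter(col("id") > 0)

    spark = SparkSession.builder.getOrCreate()
    keep_positive(spark.range(10))        # ok: range() returns a DataFrame
    # keep_positive("not a dataframe")    # mypy: incompatible argument type "str"; expected "DataFrame"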
45. Python Dependency Management (Apache Spark 3.1)
    Supported environments:
    • Conda
    • Venv (Virtualenv)
    • PEX
    (Diagram: the driver ships an archived environment (tar) to each worker.)
    Blog post: How to Manage Python Dependencies in PySpark: http://tinyurl.com/pysparkdep
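    A hedged sketch of the Conda flow from that blog post (archive name, alias, and paths are illustrative; this is my summary of the steps, so check the post for the authoritative version):

    # Sketch: shipping a packed Conda environment to the executors (illustrative names/paths).
    # Beforehand, outside Spark:  conda pack -f -o pyspark_conda_env.tar.gz
    import os
    from pyspark.sql import SparkSession

    # Point the workers at the Python interpreter inside the unpacked archive.
    os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

    spark = (
        SparkSession.builder
        # "#environment" unpacks the archive under the alias "environment" on each node.
        .config("spark.archives", "pyspark_conda_env.tar.gz#environment")
        .getOrCreate()
    )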
46. Koalas
    Announced April 24, 2019. Pure Python library. Familiar if coming from pandas.
    § Aims at providing the pandas API on top of Spark
    § Unifies the two ecosystems with a familiar API
    § Seamless transition between small and large data
47. Koalas Growth (from 0 to 84 Countries)
    • > 2 million imports per month on Databricks
    • ~ 3 million PyPI downloads per month
48. pandas APIs (In Development)
49. pandas APIs (In Development)
    # pandas
    import pandas as pd
    df = pd.read_csv(file)
    df['x'] = df.y * df.z
    df.describe()
    df.plot.line(...)
    # Koalas
    import databricks.koalas as ks
    df = ks.read_csv(file)
    df['x'] = df.y * df.z
    df.describe()
    df.plot.line(...)
50-51. pandas APIs (In Development)
    # pandas
    import pandas as pd
    df = pd.read_csv(file)
    df['x'] = df.y * df.z
    df.describe()
    df.plot.line(...)
    # Koalas
    import databricks.koalas as ks
    df = ks.read_csv(file)
    df['x'] = df.y * df.z
    df.describe()
    df.plot.line(...)
    # pandas APIs on Spark
    import pyspark.pandas as ps
    df = ps.read_csv(file)
    df['x'] = df.y * df.z
    df.describe()
    df.plot.line(...)
52. Visualization and Plotting (In Development)
    import pyspark.pandas as ps
    df = ps.DataFrame({
        'a': [1, 2, 2.5, 3, 3.5, 4, 5],
        'b': [1, 2, 3, 4, 5, 6, 7],
        'c': [0.5, 1, 1.5, 2, 2.5, 3, 3.5]})
    df.plot.hist()
53. Visualization and Plotting (In Development)
54. Usability Enhancements
    • Error Messages
    • Date-Time Utility Functions
    • Ignore Hints
    • EXPLAIN
55. New Utility Functions for Unix Time (Apache Spark 3.1)
    timestamp_seconds: Creates a timestamp from the number of seconds (can be fractional) since the UTC epoch.
    > SELECT timestamp_seconds(1230219000);
      2008-12-25 07:30:00
    unix_seconds: Returns the number of seconds since 1970-01-01 00:00:00 UTC (the UTC epoch).
    > SELECT unix_seconds(TIMESTAMP('1970-01-01 00:00:01Z'));
      1
    date_from_unix_date: Creates a date from the number of days since 1970-01-01.
    > SELECT date_from_unix_date(1);
      1970-01-02
    unix_date: Returns the number of days since 1970-01-01.
    > SELECT unix_date(DATE('1970-01-02'));
      1
    Also: timestamp_millis, timestamp_micros, unix_millis and unix_micros.
56. Unix-time functions at a glance (name, version, description, example)
    unix_seconds / unix_micros / unix_millis (3.1)
      Returns the number of seconds/microseconds/milliseconds since 1970-01-01 00:00:00 UTC. Input: Timestamp => output: Long.
      > SELECT unix_seconds(TIMESTAMP('1970-01-01 00:00:01Z'));  1
    to_unix_timestamp (1.6), unix_timestamp (1.5; not recommended if the result is not for the current timestamp)
      Returns the number of seconds since 1970-01-01 00:00:00 UTC. Input: String => output: Long.
      > SELECT to_unix_timestamp('2016-04-08', 'yyyy-MM-dd');  1460098800
      > SELECT unix_timestamp();  -- returns the current timestamp  1460041200
      > SELECT unix_timestamp('2016-04-08', 'yyyy-MM-dd');  1460041200
    timestamp_seconds (3.1)
      Creates a timestamp from the number of seconds (can be fractional) since 1970-01-01 00:00:00 UTC. Input: Long => output: Timestamp.
      > SELECT timestamp_seconds(1230219000);  2008-12-25 07:30:00
    from_unixtime (1.5)
      Returns a timestamp string in the specified format, converted from the input number of seconds elapsed since 1970-01-01 00:00:00 UTC. Input: Long => output: String.
      > SELECT from_unixtime(0, 'yyyy-MM-dd HH:mm:ss');  1969-12-31 16:00:00
57. New Utility Functions for Time Zone (Apache Spark 3.1)
    current_timezone: Returns the current session local timezone.
    > SELECT current_timezone();
      Asia/Shanghai
    SET TIME ZONE: Sets the local time zone of the current session.
    -- Set time zone to the system default.
    SET TIME ZONE LOCAL;
    -- Set time zone to a region-based zone ID.
    SET TIME ZONE 'America/Los_Angeles';
    -- Set time zone to a zone offset.
    SET TIME ZONE '+08:00';
    Upcoming new data type: TIMESTAMP WITHOUT TIME ZONE
58. EXPLAIN FORMATTED (Apache Spark 3.1)
    § The Spark UI uses EXPLAIN FORMATTED output by default
    § New format for Adaptive Query Execution
    (Screenshots: the plan after cost-based optimization and the final plan after adaptive optimization.)
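    A hedged usage sketch (the query is illustrative): the same formatted plan output is available from both the DataFrame API and SQL.

    # Sketch: getting the formatted plan output (illustrative query).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000).selectExpr("id % 10 AS bucket").groupBy("bucket").count()
    df.explain(mode="formatted")   # DataFrame API

    # SQL equivalent:
    spark.sql("EXPLAIN FORMATTED SELECT id % 10 AS bucket, count(*) "
              "FROM range(1000) GROUP BY bucket").show(truncate=False)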
59. Ignore Hints (Apache Spark 3.1)
    • Is it FASTER?
    • Caused OOM?
60. Ignore Hints (Apache Spark 3.1)
    SET spark.sql.optimizer.disableHints=true
61. Standardize Error Messages (In Development)
    § Group exception messages in dedicated files for easy maintenance and auditing (SPARK-33539)
    § Improve error message quality
      ▪ Establish an error message guideline for developers: https://spark.apache.org/error-message-guidelines.html
    § Add language-agnostic, locale-agnostic identifiers/SQLSTATE
      Example: SPK-10000 MISSING_COLUMN_ERROR: cannot resolve 'bad_column' given input columns: [a, b, c, d, e]; (SQLSTATE 42704)
62. Documentation and Environments
    • New Doc for PySpark
    • Installation Option for PyPi
    • Search Function in Spark Doc
    • JAVA 17 (3.2)
63. New Doc for PySpark (Apache Spark 3.1)
    https://spark.apache.org/docs/latest/api/python/index.html
64-65. Search Function in Spark Doc (Apache Spark 3.1)
66. Installation Option for PyPI (Apache Spark 3.1)
    The default Hadoop version is changed to Hadoop 3.2.
67. Deprecations and Removals (Apache Spark 3.1)
    § Drop Python 2.7, 3.4 and 3.5 (SPARK-32138)
    § Drop R < 3.5 support (SPARK-32073)
    § Remove hive-1.2 distribution (SPARK-32981)
    In the upcoming Spark 3.2:
    § Deprecate support of Mesos (SPARK-35050)
68. Apache Spark 3.1 (recap of the feature overview from slide 4)
69. In Development (coming in Spark 3.2)
    Areas: ANSI SQL Compliance, Performance, Streaming, Python, and more
    Features: Decorrelation Framework, Timestamp w/o Time Zone, Adaptive Optimization, Scala 2.13 Beta,
    Error Code, Implicit Type Cast, Interval Type, Complex Type Support in ORC, Lateral Join,
    Compile Latency Reduction, Low-latency Scheduler, JAVA 17, Push-based Shuffle, Session Window,
    Visualization and Plotting, RocksDB State Store, Queryable State Store, Pythonic Error Handling,
    Richer Input/Output, pandas APIs, Parquet 1.12 (Column Index), State Store APIs, DML Metrics, ANSI Mode GA
70. Thank you for your contributions!
71. Xiao Li (lixiao @ databricks)
    Wenchen Fan (wenchen @ databricks)