What's New in Apache Spark 2.4?
| 2018
About Me
• Software Engineer at Databricks
• Apache Spark Committer and PMC Member
• Previously, IBM Master Inventor
• Spark SQL, Database Replication, Information Integration
• Ph.D. from the University of Florida
• GitHub: gatorsmile
Databricks Unified Analytics Platform
• Layers: Databricks Workspace, Databricks Runtime, Databricks Cloud Service
• Components: Databricks Delta, ML Frameworks, APIs, Jobs, Models, Notebooks, Dashboards
• End-to-end ML lifecycle
• Reliable & Scalable; Simple & Integrated
Databricks Customers Across Industries
Financial Services, Healthcare & Pharma, Media & Entertainment, Technology, Public Sector, Retail & CPG, Consumer Services, Energy & Industrial IoT, Marketing & AdTech, Data & Analytics Services
Release: Nov 8, 2018
Blog: https://t.co/k7kEHrNZXp
More than 1,100 tickets.
Major Features on Spark 2.4
• Barrier Execution
• Spark on Kubernetes
• Scala 2.12
• PySpark Improvement
• Native Avro Support
• Image Source
• Structured Streaming
• Built-in source Improvement
• Higher-order Functions
• Various SQL Features
Apache Spark: The First Unified Analytics Engine
Uniquely combines Data & AI technologies
• Big Data Processing: ETL + SQL + Streaming
• Machine Learning: MLlib + SparkR
• Spark Core Engine
• Runtime, Delta
The cross?
• Map/Reduce, RDD, Project Tungsten, DataFrame-based APIs, 50+ Data Sources, Python/Java/R interfaces, Structured Streaming, ML Pipelines API, Continuous Processing, Pandas UDF
• CaffeOnSpark, TensorFlowOnSpark, TensorFrames
• AI/ML: scikit-learn, pandas/numpy/scipy, LIBLINEAR, R, glmnet, xgboost, GraphLab, Caffe/PyTorch/MXNet, TensorFlow, Keras, Distributed TensorFlow, Horovod, tf.data, tf.transform, TF XLA
• ??
Project Hydrogen: Spark + AI
Adds gang scheduling to Apache Spark: a distributed DL job is embedded
as a Spark stage to simplify the distributed training workflow.
[SPARK-24374]
• Launch all the tasks in a stage at the same time
• Provide enough information and tooling to embed distributed DL
workloads
• Introduce a new fault-tolerance mechanism (when any task fails
midway, Spark aborts all the tasks and restarts the
stage)
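The three points above can be modeled in plain Python. This is only a sketch of the semantics using threads, not Spark's API: `run_barrier_stage` and `train_step` are hypothetical names, and the simulated failure is an assumption made for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_barrier_stage(task_fn, num_tasks, max_attempts=3):
    """Run num_tasks tasks together; on any failure, retry the whole stage."""
    for attempt in range(1, max_attempts + 1):
        barrier = threading.Barrier(num_tasks)

        def guarded(task_id):
            try:
                return task_fn(task_id, attempt, barrier)
            except Exception:
                barrier.abort()  # wake peers blocked in barrier.wait()
                raise

        # One worker per task, so every task in the stage runs at the same time.
        with ThreadPoolExecutor(max_workers=num_tasks) as pool:
            futures = [pool.submit(guarded, i) for i in range(num_tasks)]
            try:
                return [f.result() for f in futures], attempt
            except Exception:
                pass  # a task failed: discard this attempt and restart the stage
    raise RuntimeError("stage failed after %d attempts" % max_attempts)


def train_step(task_id, attempt, barrier):
    # Hypothetical worker: task 1 fails on the first attempt only.
    if attempt == 1 and task_id == 1:
        raise ValueError("simulated worker failure")
    barrier.wait(timeout=5)  # all tasks rendezvous before continuing
    return task_id * 10


results, attempts = run_barrier_stage(train_step, num_tasks=3)
```

The first attempt is thrown away as a whole even though only one task failed, which is the difference from Spark's usual per-task retry.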
Project Hydrogen: Spark + AI
• Barrier Execution Mode
• Optimized Data Exchange
• Accelerator Aware Scheduling
Demo: Project Hydrogen:
https://vimeo.com/274267107 & https://youtu.be/hwbsWkhBeWM
Timeline
Spark 2.4
• [SPARK-24374] barrier execution mode (basic scenarios)
• (Databricks) Horovod integration w/ barrier mode
Spark 2.5/3.0
• [SPARK-24374] barrier execution mode
• [SPARK-24579] optimized data exchange
• [SPARK-24615] accelerator-aware scheduling
Pandas UDFs
Spark 2.3 introduced vectorized Pandas UDFs that use Pandas to process
data, with faster data serialization and execution using vectorized formats.
• Grouped Aggregate Pandas UDFs
• pandas.Series -> a scalar
• returnType: primitive data type
• [SPARK-22274] [SPARK-22239]
Eager Evaluation
spark.sql.repl.eagerEval.enabled [SPARK-24215]
• Supports DataFrame eager evaluation for PySpark shell and Jupyter.
• When true, the top K rows of the Dataset are displayed.
Comparison: prior to 2.4 vs. since 2.4
Other Notable Features
[SPARK-24396] Add Structured Streaming ForeachWriter for Python
[SPARK-23030] Use Arrow stream format for creating from and collecting
Pandas DataFrames
[SPARK-24624] Support mixture of Python UDF and Scalar Pandas UDF
[SPARK-23874] Upgrade Apache Arrow to 0.10.0
• Allow for adding BinaryType support [ARROW-2141]
[SPARK-25004] Add spark.executor.pyspark.memory limit
Flexible Streaming Sink
[SPARK-24565] Exposing output rows of each microbatch as a
DataFrame
foreachBatch(f: (Dataset[T], Long) => Unit), where the Long argument is the batch ID
• Scala/Java/Python APIs in DataStreamWriter.
• Reuse existing batch data sources
• Write to multiple locations
• Apply additional DataFrame operations
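A plain-Python model of the idea, not the Spark API: each micro-batch is handed to a user function as an ordinary collection, so the function can transform it and write it to several places. `run_micro_batches`, `write_batch`, and the two list "sinks" are hypothetical names used only for illustration.

```python
def run_micro_batches(batches, foreach_batch):
    """Call foreach_batch(batch, batch_id) once per micro-batch, in order."""
    for batch_id, batch in enumerate(batches):
        foreach_batch(batch, batch_id)


# Two hypothetical sinks standing in for existing batch data sources.
archive_sink = []
alerts_sink = []


def write_batch(batch, batch_id):
    # Apply an additional transformation, then write to two locations.
    tagged = [dict(row, batch_id=batch_id) for row in batch]
    archive_sink.extend(tagged)
    alerts_sink.extend(r for r in tagged if r["level"] == "ERROR")


stream = [
    [{"level": "INFO", "msg": "started"}, {"level": "ERROR", "msg": "disk full"}],
    [{"level": "INFO", "msg": "recovered"}],
]
run_micro_batches(stream, write_batch)
```

Because the callback sees a whole batch at once, any writer that already works for batch data can be reused unchanged.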
Structured Streaming
[SPARK-24662] Support for the LIMIT operator for streams
in Append and Complete output modes.
[SPARK-24763] Remove redundant key data from value in streaming aggregation
[SPARK-24156] Faster generation of output results and/or state cleanup with
stateful operations (mapGroupsWithState, stream-stream join, streaming
aggregation, streaming dropDuplicates) when there is no data in the input
stream.
[SPARK-24730] Support for choosing either the min or max watermark when
there are multiple input streams in a query.
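The min/max watermark choice in [SPARK-24730] can be pictured with a small plain-Python model. The function names and the example timestamps are assumptions for illustration; in Spark the policy is selected through the `spark.sql.streaming.multipleWatermarkPolicy` setting rather than a function argument.

```python
def global_watermark(stream_watermarks, policy="min"):
    """Combine per-stream watermarks into one query-wide watermark."""
    if policy == "min":
        return min(stream_watermarks.values())
    if policy == "max":
        return max(stream_watermarks.values())
    raise ValueError("policy must be 'min' or 'max'")


def is_late(event_time, watermark):
    # Events older than the watermark are treated as too late.
    return event_time < watermark


# Hypothetical per-stream watermarks, in seconds.
watermarks = {"clicks": 100, "impressions": 70}
wm_min = global_watermark(watermarks, "min")
wm_max = global_watermark(watermarks, "max")
```

With `min`, the query waits for the slowest stream and keeps an event at time 80. With `max`, the query advances with the fastest stream and the same event counts as late.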
Kafka Client 2.0.0
[SPARK-18057] Upgraded Kafka client version from 0.10.0.1 to 2.0.0
[SPARK-25005] Support “kafka.isolation.level” to read only committed
records from Kafka topics that are written using a transactional
producer.
AVRO
• Apache Avro (https://avro.apache.org)
• A data serialization format
• Widely used in the Spark and Hadoop ecosystem, especially for
Kafka-based data pipelines.
• Spark-Avro package (https://github.com/databricks/spark-avro)
• Spark SQL can read and write Avro data.
• Inlining Spark-Avro package [SPARK-24768]
• Better experience for first-time users of Spark SQL and Structured
Streaming
• Expected to further improve the adoption of Structured Streaming
AVRO
[SPARK-24811] from_avro/to_avro functions to read and write Avro data
within a DataFrame instead of just files.
Example:
1. Decode the Avro data into a struct
2. Filter by column `favorite_color`
3. Encode the column `name` in Avro format
AVRO Performance
[SPARK-24800] Refactor Avro Serializer and Deserializer
• External library: AVRO Data -> Row -> InternalRow
• Native reader: AVRO Data -> InternalRow
Runtime comparison (lower is better): 2X faster
Notebook: https://dbricks.co/AvroPerf
AVRO Logical Types
Avro upgrade from 1.7.7 to 1.8.
[SPARK-24771]
Logical type support:
• Date [SPARK-24772]
• Decimal [SPARK-24774]
• Timestamp [SPARK-24773]
Options:
• compression
• ignoreExtension
• recordNamespace
• recordName
• avroSchema
Blog: Apache Avro as a Built-in Data Source in Apache Spark 2.4.
https://t.co/jks7j27PxJ
Image schema data source
[SPARK-22666] Spark datasource for image format
• Partition discovery [new]
• Loading recursively from directory [new]
• dropImageFailures
• Path wildcard matching
• 30+ PB data
• 5,000 tables
• 20,000 temporary tables
• 50,000 jobs
• 1,800+ nodes per day
• 15,000 jobs
Analytic Database
SQL
Parquet
Update from 1.8.2 to 1.10.0 [SPARK-23972].
• PARQUET-1025 - Support new min-max statistics in parquet-mr
• PARQUET-225 - INT64 support for delta encoding
• PARQUET-1142 Enable parquet.filter.dictionary.enabled by default.
Predicate pushdown
• STRING [SPARK-23972] [20x faster]
• Decimal [SPARK-24549]
• Timestamp [SPARK-24718]
• Date [SPARK-23727]
• Byte/Short [SPARK-24706]
• StringStartsWith [SPARK-24638]
• IN [SPARK-17091]
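Why min-max statistics make pushed-down predicates cheap can be shown with a plain-Python model. This is a sketch of the general row-group skipping technique, not parquet-mr code: `RowGroup` and `scan_equals` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    min_value: str
    max_value: str
    rows: List[str]


def scan_equals(row_groups, target):
    """Return (matching rows, number of row groups actually read)."""
    matches, groups_read = [], 0
    for rg in row_groups:
        if target < rg.min_value or target > rg.max_value:
            continue  # statistics prove no row can match: skip without reading
        groups_read += 1
        matches.extend(r for r in rg.rows if r == target)
    return matches, groups_read


row_groups = [
    RowGroup("apple", "cherry", ["apple", "banana", "cherry"]),
    RowGroup("dog", "fig", ["dog", "emu", "fig"]),
    RowGroup("grape", "kiwi", ["grape", "kiwi"]),
]
hit = scan_equals(row_groups, "banana")
miss_in_range = scan_equals(row_groups, "eel")
miss_out_of_range = scan_equals(row_groups, "zebra")
```

A value outside every range reads nothing at all, while a value inside a range still has to be checked row by row.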
ORC
Native vectorized ORC reader is now GA!
• Native ORC reader is on by default [SPARK-23456]
• Update ORC from 1.4.1 to 1.5.2 [SPARK-24576]
• Turn on ORC filter push-down by default [SPARK-21783]
• Use native ORC reader to read Hive serde tables by default
[SPARK-22279]
• Avoid creating reader for all ORC files [SPARK-25126]
CSV
• Option samplingRatio for schema inference [SPARK-23846]
• Option enforceSchema for throwing an exception when user-specified
schema doesn't match the CSV header [SPARK-23786]
• Option encoding for specifying the encoding of outputs. [SPARK-19018]
Performance:
• Parsing only required columns to the CSV parser [SPARK-24244]
• Speed up count() for JSON and CSV [SPARK-24959]
• Better performance by switching to uniVocity 2.7.3 [SPARK-24945]
JSON
• Option encoding for specifying the encoding of inputs and outputs.
[SPARK-23723]
• Option dropFieldIfAllNull for ignoring column of all null values or
empty array/struct during JSON schema inference [SPARK-23772]
• Option lineSep for defining the line separator that should be used for
parsing [SPARK-23765]
• Speed up count() for JSON and CSV [SPARK-24959]
JDBC
• Option queryTimeout for the number of seconds the driver will wait
for a Statement object to execute. [SPARK-23856]
• Option query for specifying the query to read from JDBC [SPARK-24423]
• Option pushDownFilters for specifying whether the filter pushdown is
allowed [SPARK-24288]
• Auto-correction of partition column names [SPARK-24327]
• Support Date/Timestamp in a JDBC partition column when reading in
parallel from multiple workers. [SPARK-22814]
• Add cascadeTruncate option to JDBC datasource [SPARK-22880]
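For the Date/Timestamp partition column item [SPARK-22814], the idea of a parallel read is that each worker gets its own slice of the column's range. The sketch below is a simplified plain-Python model with a hypothetical `date_partitions` helper; Spark's real partition clauses differ in detail (for example in how the first and last partitions are bounded).

```python
from datetime import date, timedelta


def date_partitions(lower, upper, num_partitions):
    """Split [lower, upper) into contiguous date ranges, one per worker."""
    total_days = (upper - lower).days
    # Never create more partitions than there are days to split.
    n = max(1, min(num_partitions, total_days))
    stride = total_days // n
    bounds, start = [], lower
    for i in range(n):
        # The last range absorbs any remainder so the whole span is covered.
        end = upper if i == n - 1 else start + timedelta(days=stride)
        bounds.append((start, end))
        start = end
    return bounds


january = date_partitions(date(2018, 1, 1), date(2018, 1, 31), 3)
short_span = date_partitions(date(2018, 1, 1), date(2018, 1, 3), 5)
```

Each (start, end) pair would become one range predicate on the partition column, so the workers read disjoint slices.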
Higher-order Functions
Transformation on complex objects like arrays, maps and structures
inside of columns.
UDF? Expensive data serialization
tbl_nested
|-- key: long (nullable = false)
|-- values: array (nullable = false)
| |-- element: long (containsNull = false)
1) Check for element existence
SELECT EXISTS(values, e -> e > 30) AS v
FROM tbl_nested;
2) Transform an array
SELECT TRANSFORM(values, e -> e * e) AS v
FROM tbl_nested;
3) Filter an array
SELECT FILTER(values, e -> e > 30) AS v
FROM tbl_nested;
4) Aggregate an array
SELECT REDUCE(values, 0, (acc, value) -> acc + value) AS sum
FROM tbl_nested;
Ref Databricks Blog: http://dbricks.co/2rUKQ1A
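For readers more used to Python than SQL lambdas, the four queries correspond roughly to the plain-Python expressions below. The sample rows are made up for illustration; only the table shape follows the tbl_nested schema.

```python
from functools import reduce

# Hypothetical rows shaped like tbl_nested (key: long, values: array<long>).
tbl_nested = [
    {"key": 1, "values": [10, 20, 40]},
    {"key": 2, "values": [5, 7]},
]

# 1) EXISTS(values, e -> e > 30)
exists_v = [any(e > 30 for e in row["values"]) for row in tbl_nested]

# 2) TRANSFORM(values, e -> e * e)
transform_v = [[e * e for e in row["values"]] for row in tbl_nested]

# 3) FILTER(values, e -> e > 30)
filter_v = [[e for e in row["values"] if e > 30] for row in tbl_nested]

# 4) REDUCE(values, 0, (acc, value) -> acc + value)
sum_v = [
    reduce(lambda acc, value: acc + value, row["values"], 0)
    for row in tbl_nested
]
```

In each case the lambda runs per array element inside the row, without exploding the array into separate rows first.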
Built-in Functions
[SPARK-23899] New or extended built-in functions for ArrayTypes and
MapTypes
• 26 functions for ArrayTypes
transform, filter, reduce, array_distinct, array_intersect, array_union,
array_except, array_join, array_max, array_min, ...
• 3 functions for MapTypes
map_from_arrays, map_from_entries, map_concat
Blog: Introducing New Built-in and Higher-Order Functions for
Complex Data Types in Apache Spark 2.4. https://t.co/p1TRRtabJJ
When having many columns/functions
[SPARK-16406] Analyzer: Improve performance of LogicalPlan.resolve
Add an indexing structure to resolve(...) in order to find potential
matches quicker.
[SPARK-23963] Properly handle large number of columns in query on
text-based Hive table
Turning a list into an array makes a Hive table scan 10 times faster
when there are a lot of columns.
[SPARK-23486] Analyzer: Cache the function name from the external
catalog for lookupFunctions
Optimizer/Planner
[SPARK-23803] Support Bucket Pruning
Prune buckets that cannot satisfy equal-to predicates, to reduce the
number of files to scan.
[SPARK-24802] Optimization Rule Exclusion
Disable a list of optimization rules in the optimizer
[SPARK-4502] Nested schema pruning for Parquet tables.
Column pruning on nested fields.
More: [SPARK-24339] [SPARK-23877] [SPARK-23957] [SPARK-25212] …
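Bucket pruning [SPARK-23803] can be pictured with a plain-Python model: rows are placed into buckets by a hash of the bucketing column, so an equal-to predicate only needs the one bucket its value hashes to. All names here are hypothetical, and CRC32 is a stand-in; Spark's actual bucketing hash is different.

```python
import zlib

NUM_BUCKETS = 4


def bucket_id(key, num_buckets=NUM_BUCKETS):
    # Stand-in hash (CRC32) chosen only because it is deterministic.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def write_bucketed(rows, key_col):
    """Distribute rows into NUM_BUCKETS lists, one per bucket file."""
    buckets = {b: [] for b in range(NUM_BUCKETS)}
    for row in rows:
        buckets[bucket_id(row[key_col])].append(row)
    return buckets


def scan_equal(buckets, key_col, value):
    """Answer key_col == value by scanning only the bucket that can hold it."""
    target = bucket_id(value)
    rows = [r for r in buckets[target] if r[key_col] == value]
    return rows, [target]  # matching rows and the bucket ids that were scanned


table = [{"user_id": i, "score": i * 2} for i in range(20)]
buckets = write_bucketed(table, "user_id")
rows, scanned = scan_equal(buckets, "user_id", 7)
```

Only one of the four buckets is read, which is where the reduction in files to scan comes from.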
SQL API Enhancement
[SPARK-24940] Coalesce and Repartition Hint for SQL Queries
INSERT OVERWRITE TABLE targetTable
SELECT /*+ REPARTITION(10) */ *
FROM sourceTable
[SPARK-19602] Support column resolution of fully qualified column name
(3 part name). (i.e., $DBNAME.$TABLENAME.$COLUMNNAME)
SELECT * FROM db1.t3
WHERE c1 IN (SELECT db1.t4.c2 FROM
db1.t4 WHERE db1.t4.c3 = db1.t3.c2)
More: INTERSECT ALL, EXCEPT ALL, Pivot, GROUPING SET, precedence rules
for set operations and …
[SPARK-21274] [SPARK-24035] [SPARK-24424] [SPARK-24966]
Blog: SQL Pivot: Converting Rows to Columns in Apache Spark 2.4
https://t.co/AgGKOcl2N4
Other Notable Changes in SQL
[SPARK-24596] Non-cascading Cache Invalidation
• Non-cascading mode for temporary views and Dataset.unpersist()
• Cascading mode for the rest
[SPARK-23880] Do not trigger any job for caching data
[SPARK-23510][SPARK-24312] Support Hive 2.2 and Hive 2.3 metastore
[SPARK-23711] Add fallback generator for UnsafeProjection
[SPARK-24626] Parallelize location size calculation in Analyze Table
command
Other Notable Features in Core
[SPARK-23243] Fix RDD.repartition() data correctness issue
[SPARK-24296] Support replicating blocks larger than 2 GB
[SPARK-24307] Support sending messages over 2GB from memory
Native Spark App in K8S
New Spark scheduler backend
• PySpark support [SPARK-23984]
• SparkR support [SPARK-24433]
• Client-mode support [SPARK-23146]
• Support for mounting K8S volumes
[SPARK-23529]
Blog: What’s New for Apache Spark on Kubernetes in
the Upcoming Apache Spark 2.4 Release
https://t.co/uUpdUj2Z4B
Scala 2.12 Beta Support
[SPARK-14220] Build Spark against Scala 2.12
All the tests PASS! https://dbricks.co/Scala212Jenkins
Databricks Runtime 5.0
Available now, including the Spark 2.4 features covered here: Barrier
Execution, Spark on Kubernetes, Scala 2.12, PySpark Improvement, Native
Avro Support, Image Source, Structured Streaming, Built-in source
Improvement, Higher-order Functions, Various SQL Features.
Try Databricks Runtime 5.0 for free:
https://databricks.com/try-databricks
Nike: Enabling Data Scientists to bring their Models to Market
Facebook: Vectorized Query Execution in Apache Spark at Facebook
Tencent: Large-scale Malicious Domain Detection with Spark AI
IBM: In-memory storage Evolution in Apache Spark
Capital One: Apache Spark and Sights at Speed: Streaming, Feature management and Execution
Apple: Making Nested Columns as First Citizen in Apache Spark SQL
EBay: Managing Apache Spark workload and automatic optimizing.
Google: Validating Spark ML Jobs
HP: Apache Spark for Cyber Security in big company
Microsoft: Apache Spark Serving: Unifying Batch, Streaming and RESTful Serving
ABSA Group: A Mainframe Data Source for Spark SQL and Streaming
Facebook: an efficient Facebook-scale shuffle service
IBM: Make your PySpark Data Fly with Arrow!
Facebook : Distributed Scheduling Framework for Apache Spark
Zynga: Automating Predictive Modeling at Zynga with PySpark
World Bank: Using Crowdsourced Images to Create Image Recognition Models and NLP to Augment Global Trade indicator
JD.com: Optimizing Performance and Computing Resource.
Microsoft: Azure Databricks with R: Deep Dive
ICL: Cooperative Task Execution for Apache Spark
Airbnb: Apache Spark at Airbnb
Netflix: Migrating to Apache Spark at Netflix
Microsoft: Infrastructure for Deep Learning in Apache Spark
Intel: Game playing using AI on Apache Spark
Facebook: Scaling Apache Spark @ Facebook
Lyft: Scaling Apache Spark on K8S at Lyft
Uber: Using Spark Mllib Models in a Production Training and Serving Platform
Apple: Bridging the gap between Datasets and DataFrames
Salesforce: The Rule of 10,000 Spark Jobs
Target: Lessons in Linear Algebra at Scale with Apache Spark
Workday: Lesson Learned Using Apache Spark

More Related Content

What's hot

Transactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and futureTransactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and futureDataWorks Summit
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseDataWorks Summit/Hadoop Summit
 
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the EnterpriseEnabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the EnterpriseDataWorks Summit/Hadoop Summit
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkDataWorks Summit
 
Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...DataWorks Summit
 
Hadoop Operations - Past, Present, and Future
Hadoop Operations - Past, Present, and FutureHadoop Operations - Past, Present, and Future
Hadoop Operations - Past, Present, and FutureDataWorks Summit
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache CalciteJulian Hyde
 
What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4DataWorks Summit
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BIDataWorks Summit
 
Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0DataWorks Summit
 
Running Enterprise Workloads in the Cloud
Running Enterprise Workloads in the CloudRunning Enterprise Workloads in the Cloud
Running Enterprise Workloads in the CloudDataWorks Summit
 
Scale-Out Resource Management at Microsoft using Apache YARN
Scale-Out Resource Management at Microsoft using Apache YARNScale-Out Resource Management at Microsoft using Apache YARN
Scale-Out Resource Management at Microsoft using Apache YARNDataWorks Summit/Hadoop Summit
 
Omid: scalable and highly available transaction processing for Apache Phoenix
Omid: scalable and highly available transaction processing for Apache PhoenixOmid: scalable and highly available transaction processing for Apache Phoenix
Omid: scalable and highly available transaction processing for Apache PhoenixDataWorks Summit
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloudgluent.
 
Meet HBase 2.0 and Phoenix-5.0
Meet HBase 2.0 and Phoenix-5.0Meet HBase 2.0 and Phoenix-5.0
Meet HBase 2.0 and Phoenix-5.0DataWorks Summit
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataDataWorks Summit
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3DataWorks Summit
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsDataWorks Summit
 

What's hot (20)

Transactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and futureTransactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and future
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
 
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the EnterpriseEnabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache Spark
 
Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...
 
Hadoop Operations - Past, Present, and Future
Hadoop Operations - Past, Present, and FutureHadoop Operations - Past, Present, and Future
Hadoop Operations - Past, Present, and Future
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
 
What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BI
 
Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0
 
Running Enterprise Workloads in the Cloud
Running Enterprise Workloads in the CloudRunning Enterprise Workloads in the Cloud
Running Enterprise Workloads in the Cloud
 
Scale-Out Resource Management at Microsoft using Apache YARN
Scale-Out Resource Management at Microsoft using Apache YARNScale-Out Resource Management at Microsoft using Apache YARN
Scale-Out Resource Management at Microsoft using Apache YARN
 
Omid: scalable and highly available transaction processing for Apache Phoenix
Omid: scalable and highly available transaction processing for Apache PhoenixOmid: scalable and highly available transaction processing for Apache Phoenix
Omid: scalable and highly available transaction processing for Apache Phoenix
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
 
Meet HBase 2.0 and Phoenix-5.0
Meet HBase 2.0 and Phoenix-5.0Meet HBase 2.0 and Phoenix-5.0
Meet HBase 2.0 and Phoenix-5.0
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streams
 
Creating the Internet of Your Things
Creating the Internet of Your ThingsCreating the Internet of Your Things
Creating the Internet of Your Things
 

Similar to What's new in Apache Spark 2.4

What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3DataWorks Summit
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal
 Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal
Overview of Apache Spark 2.3: What’s New? with Sameer AgarwalDatabricks
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3Databricks
 
2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 Chester Chen
 
Web Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectWeb Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectSaltlux Inc.
 
What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?confluent
 
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the streamSpark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the streamSpark Summit
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellDatabricks
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareDatabricks
 
Incorta spark integration
Incorta spark integrationIncorta spark integration
Incorta spark integrationDylan Wan
 
Spark Hsinchu meetup
Spark Hsinchu meetupSpark Hsinchu meetup
Spark Hsinchu meetupYung-An He
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsMonitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsDatabricks
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Helena Edelson
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 

Similar to What's new in Apache Spark 2.4 (20)

What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
What's new in Apache Spark 2.4

  • 1. What's New in Apache Spark 2.4? | 2018
  • 2. About Me • Software Engineer at Databricks • Apache Spark Committer and PMC Member • Previously, IBM Master Inventor • Spark SQL, Database Replication, Information Integration • Ph.D. from the University of Florida • GitHub: gatorsmile
  • 3. Databricks Unified Analytics Platform [diagram] • Layers: DATABRICKS WORKSPACE, DATABRICKS RUNTIME, DATABRICKS CLOUD SERVICE • Labels: Notebooks, Dashboards, Jobs, Models, APIs, End to end ML lifecycle, Databricks Delta, ML Frameworks, Reliable & Scalable, Simple & Integrated
  • 4. Databricks Customers Across Industries • Financial Services • Healthcare & Pharma • Media & Entertainment • Technology • Public Sector • Retail & CPG • Consumer Services • Energy & Industrial IoT • Marketing & AdTech • Data & Analytics Services
  • 5. Release: Nov 8, 2018 • Blog: https://t.co/k7kEHrNZXp • More than 1,100 tickets.
  • 6. Major Features on Spark 2.4 • Higher-order Functions • Structured Streaming • Built-in source Improvement • Spark on Kubernetes • PySpark Improvement • Native Avro Support • Image Source • Barrier Execution • Scala 2.12 • Various SQL Features
  • 8. Apache Spark: The First Unified Analytics Engine • Uniquely combines Data & AI technologies • [diagram] Spark Core Engine; Big Data Processing: ETL + SQL + Streaming; Machine Learning: MLlib + SparkR; Runtime; Delta
  • 9. The cross? [diagram of big data and AI/ML projects] • Labels: Map/Reduce, CaffeOnSpark, TensorFlowOnSpark, DataFrame-based APIs, 50+ Data Sources, Python/Java/R interfaces, Structured Streaming, ML Pipelines API, Continuous Processing, RDD, Project Tungsten, Pandas UDF, TensorFrames, scikit-learn, pandas/numpy/scipy, LIBLINEAR, R, glmnet, xgboost, GraphLab, Caffe/PyTorch/MXNet, TensorFlow, Keras, Distributed TensorFlow, Horovod, tf.data, tf.transform, TF XLA, AI/ML, ??
  • 10. Project Hydrogen: Spark + AI • Adds gang scheduling to Apache Spark so that a distributed DL job can be embedded as a Spark stage, simplifying the distributed training workflow. [SPARK-24374] • Launch the tasks in a stage at the same time • Provide enough information and tooling to embed distributed DL workloads • Introduce a new fault-tolerance mechanism: when any task fails midway, Spark aborts all the tasks and restarts the stage
  • 11. Project Hydrogen: Spark + AI • Barrier Execution Mode • Optimized Data Exchange • Accelerator Aware Scheduling • Demo: https://vimeo.com/274267107 and https://youtu.be/hwbsWkhBeWM
  • 12. Timeline • Spark 2.4: [SPARK-24374] barrier execution mode (basic scenarios); (Databricks) Horovod integration with barrier mode • Spark 2.5/3.0: [SPARK-24374] barrier execution mode; [SPARK-24579] optimized data exchange; [SPARK-24615] accelerator-aware scheduling
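The barrier execution behavior described on slides 10 to 12 can be illustrated without Spark. The sketch below is a toy simulation built on plain Python threads, not Spark's API; `run_barrier_stage` and `flaky_task` are hypothetical names. It starts every task of a stage together and, if any task fails, discards all results and reruns the whole stage, which is the fault-tolerance rule stated on slide 10.

```python
import threading


def run_barrier_stage(task_fn, num_tasks, max_attempts=3):
    """Toy model of a barrier stage: all tasks start together, and a
    failure in any task causes the whole stage to be retried."""
    for attempt in range(1, max_attempts + 1):
        # Every task waits here until all num_tasks tasks have been launched.
        start_line = threading.Barrier(num_tasks)
        results = [None] * num_tasks
        errors = []

        def worker(rank):
            try:
                start_line.wait(timeout=10)
                results[rank] = task_fn(rank, attempt)
            except Exception as exc:  # record the failure for the retry loop
                errors.append(exc)

        threads = [threading.Thread(target=worker, args=(rank,))
                   for rank in range(num_tasks)]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()

        if not errors:
            return attempt, results
        # At least one task failed: drop every result and rerun the stage.
    raise RuntimeError("stage failed after %d attempts" % max_attempts)


def flaky_task(rank, attempt):
    """Hypothetical task: rank 2 fails on the first attempt only."""
    if attempt == 1 and rank == 2:
        raise ValueError("simulated worker failure")
    return rank * 10


attempt, results = run_barrier_stage(flaky_task, num_tasks=4)
print(attempt, results)
```

In Spark 2.4 itself the entry point is `RDD.barrier()` followed by `mapPartitions`, with `BarrierTaskContext` giving each task a `barrier()` call for synchronization; the simulation only mirrors the scheduling and retry semantics.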
  • 14. Pandas UDFs • Spark 2.3 introduced vectorized Pandas UDFs, which use Pandas to process data and get faster data serialization and execution from vectorized formats. • New in 2.4: Grouped Aggregate Pandas UDFs [SPARK-22274] [SPARK-22239] • pandas.Series -> a scalar • returnType: primitive data type
  • 15. Eager Evaluation: spark.sql.repl.eagerEval.enabled [SPARK-24215] • Supports DataFrame eager evaluation for the PySpark shell and Jupyter. • When true, the top K rows of the Dataset are displayed. • [figure: output prior to 2.4]
  • 16. Eager Evaluation: spark.sql.repl.eagerEval.enabled [SPARK-24215] • [figure: output since 2.4]
  • 17. Other Notable Features • [SPARK-24396] Add Structured Streaming ForeachWriter for Python • [SPARK-23030] Use Arrow stream format for creating from and collecting Pandas DataFrames • [SPARK-24624] Support mixture of Python UDF and Scalar Pandas UDF • [SPARK-23874] Upgrade Apache Arrow to 0.10.0, which allows adding BinaryType support [ARROW-2141] • [SPARK-25004] Add spark.executor.pyspark.memory limit
  • 19. Flexible Streaming Sink [SPARK-24565] • Exposes the output rows of each micro-batch as a DataFrame: foreachBatch(f: (Dataset[T], Long) => Unit), where the second argument is the batch ID • Scala/Java/Python APIs in DataStreamWriter • Reuse existing batch data sources • Write to multiple locations • Apply additional DataFrame operations
  • 20. Reuse existing batch data sources
  • 21. Write to multiple locations
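Slides 20 and 21 carried examples that the transcript did not capture. As a stand-in, here is a minimal sketch of a function that could be passed to `foreachBatch` in PySpark to write each micro-batch to two locations. The function name, output formats, and paths are illustrative assumptions; `batch_df` is assumed to be the micro-batch DataFrame and `batch_id` its identifier.

```python
def write_to_two_sinks(batch_df, batch_id):
    """Write one micro-batch to two batch sinks (formats and paths are hypothetical)."""
    # Cache the micro-batch so it is not recomputed for the second write.
    batch_df.persist()
    batch_df.write.format("parquet").mode("append").save("/tmp/sink_a")
    batch_df.write.format("json").mode("append").save("/tmp/sink_b")
    batch_df.unpersist()


# Attaching it to a streaming DataFrame (requires a running SparkSession):
# query = streaming_df.writeStream.foreachBatch(write_to_two_sinks).start()
```

Because the function receives an ordinary DataFrame, any existing batch data source can be used inside it, which is the reuse point made on slide 20.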
  • 22. Structured Streaming • [SPARK-24662] Support for the LIMIT operator for streams in Append and Complete output modes. • [SPARK-24763] Remove redundant key data from value in streaming aggregation. • [SPARK-24156] Faster generation of output results and/or state cleanup with stateful operations (mapGroupsWithState, stream-stream join, streaming aggregation, streaming dropDuplicates) when there is no data in the input stream. • [SPARK-24730] Support for choosing either the min or max watermark when there are multiple input streams in a query.
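The min/max watermark choice from [SPARK-24730] comes down to how per-stream watermarks are combined into one global watermark. The sketch below models that in plain Python; the function names, stream names, and timestamps are hypothetical. The corresponding Spark setting is believed to be `spark.sql.streaming.multipleWatermarkPolicy` (`min` by default).

```python
def global_watermark(stream_watermarks, policy="min"):
    """Combine per-stream watermarks (event-time seconds) into one global value."""
    if policy == "min":
        return min(stream_watermarks.values())   # wait for the slowest stream
    if policy == "max":
        return max(stream_watermarks.values())   # follow the fastest stream
    raise ValueError("policy must be 'min' or 'max'")


def is_late(event_time, watermark):
    """An event older than the global watermark is treated as too late."""
    return event_time < watermark


watermarks = {"clicks": 100, "impressions": 60}   # hypothetical streams
event_time = 80

for policy in ("min", "max"):
    wm = global_watermark(watermarks, policy)
    print(policy, wm, is_late(event_time, wm))
```

With `min`, the event at time 80 is still accepted because the slower stream holds the watermark back; with `max`, the same event counts as late, trading completeness for faster results.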
  • 23. Kafka Client 2.0.0 • [SPARK-18057] Upgraded Kafka client version from 0.10.0.1 to 2.0.0 • [SPARK-25005] Support “kafka.isolation.level” to read only committed records from Kafka topics that are written using a transactional producer.
  • 25. AVRO • Apache Avro (https://avro.apache.org): a data serialization format, widely used in the Spark and Hadoop ecosystem, especially for Kafka-based data pipelines. • Spark-Avro package (https://github.com/databricks/spark-avro): lets Spark SQL read and write Avro data. • Inlining the Spark-Avro package [SPARK-24768]: a better experience for first-time users of Spark SQL and Structured Streaming; expected to further improve the adoption of Structured Streaming.
  • 26. AVRO • [SPARK-24811] from_avro/to_avro functions to read and write Avro data within a DataFrame instead of just files. • Example: 1. Decode the Avro data into a struct 2. Filter by column `favorite_color` 3. Encode the column `name` in Avro format
  • 27. AVRO Performance • [SPARK-24800] Refactor Avro Serializer and Deserializer • External library: converts between AVRO Data, Row and InternalRow • Native reader: converts directly between AVRO Data and InternalRow • [chart: runtime comparison, lower is better] 2X faster • Notebook: https://dbricks.co/AvroPerf
  • 28. AVRO Logical Types • Avro upgrade from 1.7.7 to 1.8 [SPARK-24771] • Logical type support: Date [SPARK-24772], Decimal [SPARK-24774], Timestamp [SPARK-24773] • Options: compression, ignoreExtension, recordNamespace, recordName, avroSchema • Blog: Apache Avro as a Built-in Data Source in Apache Spark 2.4. https://t.co/jks7j27PxJ
  • 30. Image schema data source • [SPARK-22666] Spark datasource for image format • Partition discovery [new] • Loading recursively from directory [new] • dropImageFailures • Path wildcard matching
  • 32. • 30+ PB data • 5,000 tables • 20,000 temporary tables • 50,000 jobs • 1,800+ nodes per day • 15,000 jobs • Analytic Database • SQL
  • 33. Parquet • Update from 1.8.2 to 1.10.0 [SPARK-23972]: PARQUET-1025 (support new min-max statistics in parquet-mr); PARQUET-225 (INT64 support for delta encoding); PARQUET-1142 (enable parquet.filter.dictionary.enabled by default) • Predicate pushdown: STRING [SPARK-23972] [20x faster]; Decimal [SPARK-24549]; Timestamp [SPARK-24718]; Date [SPARK-23727]; Byte/Short [SPARK-24706]; StringStartsWith [SPARK-24638]; IN [SPARK-17091]
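Predicate pushdown relies on per-row-group statistics such as the min-max values mentioned under PARQUET-1025. The simplified model below, in plain Python, shows how a reader can skip row groups whose value range cannot contain the requested value. It is not the Parquet reader; `RowGroup`, `scan_equal`, and the sample data are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class RowGroup(NamedTuple):
    """A chunk of column values plus the min/max statistics stored with it."""
    min_value: int
    max_value: int
    rows: List[int]


def scan_equal(row_groups: List[RowGroup], target: int) -> Tuple[List[int], int]:
    """Evaluate `col = target`, reading only row groups that may contain it."""
    matches = []
    groups_read = 0
    for group in row_groups:
        # The statistics prove that no row in this group can match: skip it.
        if target < group.min_value or target > group.max_value:
            continue
        groups_read += 1
        matches.extend(v for v in group.rows if v == target)
    return matches, groups_read


row_groups = [
    RowGroup(1, 10, list(range(1, 11))),
    RowGroup(11, 20, list(range(11, 21))),
    RowGroup(21, 30, list(range(21, 31))),
]
matches, groups_read = scan_equal(row_groups, 15)
print(matches, groups_read)
```

Only one of the three row groups is read here; the benefit grows when the data is sorted or clustered on the filtered column, since the ranges then overlap less.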
  • 34. ORC • The native vectorized ORC reader is now GA! • Native ORC reader is on by default [SPARK-23456] • Update ORC from 1.4.1 to 1.5.2 [SPARK-24576] • Turn on ORC filter push-down by default [SPARK-21783] • Use native ORC reader to read Hive serde tables by default [SPARK-22279] • Avoid creating reader for all ORC files [SPARK-25126]
  • 35. CSV • Option samplingRatio for schema inference [SPARK-23846] • Option enforceSchema for throwing an exception when the user-specified schema doesn't match the CSV header [SPARK-23786] • Option encoding for specifying the encoding of outputs [SPARK-19018] • Performance: parsing only the required columns in the CSV parser [SPARK-24244]; speed up count() for JSON and CSV [SPARK-24959]; better performance by switching to uniVocity 2.7.3 [SPARK-24945]
  • 36. JSON • Option encoding for specifying the encoding of inputs and outputs [SPARK-23723] • Option dropFieldIfAllNull for ignoring columns of all null values or empty arrays/structs during JSON schema inference [SPARK-23772] • Option lineSep for defining the line separator that should be used for parsing [SPARK-23765] • Speed up count() for JSON and CSV [SPARK-24959]
  • 37. JDBC • Option queryTimeout for the number of seconds the driver will wait for a Statement object to execute [SPARK-23856] • Option query for specifying the query to read from JDBC [SPARK-24423] • Option pushDownFilters for specifying whether filter pushdown is allowed [SPARK-24288] • Auto-correction of partition column names [SPARK-24327] • Support Date/Timestamp in a JDBC partition column when reading in parallel from multiple workers [SPARK-22814] • Add cascadeTruncate option to JDBC datasource [SPARK-22880]
  • 39. Higher-order Functions • Transformations on complex objects such as arrays, maps and structs inside columns. • Schema of tbl_nested: |-- key: long (nullable = false) |-- values: array (nullable = false) | |-- element: long (containsNull = false) • UDF? Expensive data serialization
  • 40. Higher-order Functions • Schema of tbl_nested: |-- key: long (nullable = false) |-- values: array (nullable = false) | |-- element: long (containsNull = false) • 1) Check for element existence: SELECT EXISTS(values, e -> e > 30) AS v FROM tbl_nested; • 2) Transform an array: SELECT TRANSFORM(values, e -> e * e) AS v FROM tbl_nested;
  • 41. Higher-order Functions • Schema of tbl_nested: |-- key: long (nullable = false) |-- values: array (nullable = false) | |-- element: long (containsNull = false) • 3) Filter an array: SELECT FILTER(values, e -> e > 30) AS v FROM tbl_nested; • 4) Aggregate an array: SELECT REDUCE(values, 0, (value, acc) -> value + acc) AS sum FROM tbl_nested; • Ref Databricks Blog: http://dbricks.co/2rUKQ1A
  • 42. Built-in Functions • [SPARK-23899] New or extended built-in functions for ArrayTypes and MapTypes • 26 functions for ArrayTypes: transform, filter, reduce, array_distinct, array_intersect, array_union, array_except, array_join, array_max, array_min, ... • 3 functions for MapTypes: map_from_arrays, map_from_entries, map_concat • Blog: Introducing New Built-in and Higher-Order Functions for Complex Data Types in Apache Spark 2.4. https://t.co/p1TRRtabJJ
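For readers new to these functions, the four queries on slides 40 and 41 behave like the following plain-Python operations applied to the `values` array of a single row. This is only an analogy with a made-up sample row; Spark evaluates the lambdas inside the engine rather than through a UDF. Note that open-source Spark 2.4 appears to expose the aggregation function under the name `aggregate`, so the `REDUCE` spelling on the slide may be specific to Databricks Runtime.

```python
from functools import reduce

values = [10, 25, 40]  # hypothetical content of one row's `values` column

# 1) EXISTS(values, e -> e > 30): is there any element greater than 30?
exists_v = any(e > 30 for e in values)

# 2) TRANSFORM(values, e -> e * e): apply the lambda to every element.
transform_v = [e * e for e in values]

# 3) FILTER(values, e -> e > 30): keep the elements that satisfy the predicate.
filter_v = [e for e in values if e > 30]

# 4) REDUCE(values, 0, (value, acc) -> value + acc): fold the array, starting at 0.
sum_v = reduce(lambda acc, value: value + acc, values, 0)

print(exists_v, transform_v, filter_v, sum_v)
```

Each expression returns one result per input row, so the array never has to be exploded and regrouped.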
  • 44. When having many columns/functions • [SPARK-16406] Analyzer: Improve performance of LogicalPlan.resolve. Adds an indexing structure to resolve(...) in order to find potential matches quicker. • [SPARK-23963] Properly handle large number of columns in query on text-based Hive table. Turns a list into an array, making a Hive table scan 10 times faster when there are many columns. • [SPARK-23486] Analyzer: Cache the function name from the external catalog for lookupFunctions
  • 45. Optimizer/Planner • [SPARK-23803] Support Bucket Pruning: prune buckets that cannot satisfy equal-to predicates, to reduce the number of files to scan. • [SPARK-24802] Optimization Rule Exclusion: disable a list of optimization rules in the optimizer. • [SPARK-4502] Nested schema pruning for Parquet tables: column pruning on nested fields. • More: [SPARK-24339] [SPARK-23877] [SPARK-23957] [SPARK-25212] …
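Bucket pruning [SPARK-23803] works because a row's bucket is a function of its bucketing key, so an equal-to predicate identifies the single bucket that can hold matching rows. The sketch below models this in plain Python. All names are hypothetical, and `zlib.crc32` is only a stand-in for the hash function Spark actually uses for bucketing.

```python
import zlib


def bucket_id(key, num_buckets):
    """Stand-in hash; Spark uses its own hash function for bucketing."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def write_bucketed(rows, num_buckets):
    """Distribute (key, value) rows into buckets by hashing the key."""
    buckets = {b: [] for b in range(num_buckets)}
    for key, value in rows:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    return buckets


def read_equal(buckets, key, num_buckets):
    """Answer `WHERE key = <key>` by scanning only the bucket that can hold it."""
    wanted = bucket_id(key, num_buckets)
    scanned_buckets = [wanted]
    matches = [row for row in buckets[wanted] if row[0] == key]
    return matches, scanned_buckets


NUM_BUCKETS = 8
table = write_bucketed([(i, "v%d" % i) for i in range(100)], NUM_BUCKETS)
matches, scanned = read_equal(table, 42, NUM_BUCKETS)
print(matches, len(scanned), "of", NUM_BUCKETS, "buckets scanned")
```

The other seven buckets are never opened, which is how the rule reduces the number of files to scan.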
  • 46. SQL API Enhancement • [SPARK-24940] Coalesce and Repartition Hint for SQL Queries: INSERT OVERWRITE TABLE targetTable SELECT /*+ REPARTITION(10) */ * FROM sourceTable • [SPARK-19602] Support column resolution of fully qualified column names (3-part names, i.e., $DBNAME.$TABLENAME.$COLUMNNAME): SELECT * FROM db1.t3 WHERE c1 IN (SELECT db1.t4.c2 FROM db1.t4 WHERE db1.t4.c3 = db1.t3.c2) • More: INTERSECT ALL, EXCEPT ALL, Pivot, GROUPING SET, precedence rules for set operations and … [SPARK-21274] [SPARK-24035] [SPARK-24424] [SPARK-24966] • Blog: SQL Pivot: Converting Rows to Columns in Apache Spark 2.4 https://t.co/AgGKOcl2N4
  • 47. Other Notable Changes in SQL • [SPARK-24596] Non-cascading Cache Invalidation: non-cascading mode for temporary views and Dataset.unpersist(); cascading mode for the rest • [SPARK-23880] Do not trigger any job for caching data • [SPARK-23510] [SPARK-24312] Support Hive 2.2 and Hive 2.3 metastore • [SPARK-23711] Add fallback generator for UnsafeProjection • [SPARK-24626] Parallelize location size calculation in Analyze Table command
  • 48. Other Notable Features in Core • [SPARK-23243] Fix RDD.repartition() data correctness issue • [SPARK-24296] Support replicating blocks larger than 2 GB • [SPARK-24307] Support sending messages over 2 GB from memory
  • 50. Native Spark App in K8S • New Spark scheduler backend • PySpark support [SPARK-23984] • SparkR support [SPARK-24433] • Client-mode support [SPARK-23146] • Support for mounting K8S volumes [SPARK-23529] • Blog: What’s New for Apache Spark on Kubernetes in the Upcoming Apache Spark 2.4 Release https://t.co/uUpdUj2Z4B
  • 52. Scala 2.12 Beta Support • [SPARK-14220] Build Spark against Scala 2.12 • All the tests PASS! https://dbricks.co/Scala212Jenkins
  • 53. Databricks Runtime 5.0 • Higher-order Functions • Structured Streaming • Built-in source Improvement • Spark on Kubernetes • PySpark Improvement • Native Avro Support • Image Source • Barrier Execution • Scala 2.12 • Various SQL Features
  • 54. Databricks Runtime 5.0: AVAILABLE NOW
  • 55. Try Databricks Runtime 5.0 For Free https://databricks.com/try-databricks
  • 57. • Nike: Enabling Data Scientists to bring their Models to Market • Facebook: Vectorized Query Execution in Apache Spark at Facebook • Tencent: Large-scale Malicious Domain Detection with Spark AI • IBM: In-memory storage Evolution in Apache Spark • Capital One: Apache Spark and Sights at Speed: Streaming, Feature management and Execution • Apple: Making Nested Columns as First Citizen in Apache Spark SQL • eBay: Managing Apache Spark workload and automatic optimizing • Google: Validating Spark ML Jobs • HP: Apache Spark for Cyber Security in big company • Microsoft: Apache Spark Serving: Unifying Batch, Streaming and RESTful Serving • ABSA Group: A Mainframe Data Source for Spark SQL and Streaming • Facebook: an efficient Facebook-scale shuffle service • IBM: Make your PySpark Data Fly with Arrow! • Facebook: Distributed Scheduling Framework for Apache Spark • Zynga: Automating Predictive Modeling at Zynga with PySpark • World Bank: Using Crowdsourced Images to Create Image Recognition Models and NLP to Augment Global Trade indicator • JD.com: Optimizing Performance and Computing Resource • Microsoft: Azure Databricks with R: Deep Dive • ICL: Cooperative Task Execution for Apache Spark • Airbnb: Apache Spark at Airbnb • Netflix: Migrating to Apache Spark at Netflix • Microsoft: Infrastructure for Deep Learning in Apache Spark • Intel: Game playing using AI on Apache Spark • Facebook: Scaling Apache Spark @ Facebook • Lyft: Scaling Apache Spark on K8S at Lyft • Uber: Using Spark Mllib Models in a Production Training and Serving Platform • Apple: Bridging the gap between Datasets and DataFrames • Salesforce: The Rule of 10,000 Spark Jobs • Target: Lessons in Linear Algebra at Scale with Apache Spark • Workday: Lesson Learned Using Apache Spark