In just OVER two words…
but let’s talk about it.
Let’s start from the beginning
• What is Apache Spark?
• An open source cluster computing framework.
• Originally developed at the University of California, Berkeley's AMPLab.
• Designed from the ground up as a Big Data computation framework.
Version History (major changes)
1.2 – Core
1.4 – SparkR
1.5 – Bug fixes and performance
1.6 – Dataset
• Major notables:
• Scala 2.11
• API Stability
• SQL:2003 support
• Structured Streaming
• R UDFs and MLlib algorithm implementations
• Spark doesn't like API changes
• The good news:
• To migrate, you’ll need to make little to no changes to your code.
• The (not so) bad news:
• To benefit from all the performance improvements, some old code might need more substantial changes.
• Unifying DataFrame and Dataset:
• Dataset[Row] = DataFrame
• SparkSession replaces SQLContext and HiveContext.
• Both kept for backwards compatibility.
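A minimal sketch of the unified entry point (the app name, master setting, and column names are illustrative, not from the slides):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// SparkSession is the single entry point that replaces SQLContext / HiveContext
val spark = SparkSession.builder()
  .appName("spark2-demo")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// DataFrame is now just a type alias for Dataset[Row]
val df: DataFrame = Seq((1, "a"), (2, "b")).toDF("id", "value")
val ds: Dataset[Row] = df // compiles: same type
```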
• A simpler, more performant accumulator API
• A new, improved Aggregator API for
typed aggregation in Datasets
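A hedged sketch of the typed Aggregator API; the averaging logic and all names here are my own illustration:

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// A typed Aggregator that averages a Dataset[Double].
// The buffer is a (sum, count) pair.
object Avg extends Aggregator[Double, (Double, Long), Double] {
  def zero: (Double, Long) = (0.0, 0L)
  def reduce(b: (Double, Long), a: Double): (Double, Long) = (b._1 + a, b._2 + 1)
  def merge(b1: (Double, Long), b2: (Double, Long)): (Double, Long) =
    (b1._1 + b2._1, b1._2 + b2._2)
  def finish(b: (Double, Long)): Double = if (b._2 == 0) 0.0 else b._1 / b._2
  def bufferEncoder: Encoder[(Double, Long)] =
    Encoders.tuple(Encoders.scalaDouble, Encoders.scalaLong)
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// usage on a Dataset[Double]: ds.select(Avg.toColumn)
```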
• Improved SQL functionality (SQL:2003 support)
• Can now run all 99 TPC-DS queries
• The parser supports ANSI SQL as well as HiveQL
• Subquery support
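As a sketch of what subquery support looks like, a query like the following now parses and runs (table and column names are made up; `spark` is a SparkSession):

```scala
// An uncorrelated scalar subquery in the WHERE clause
val expensive = spark.sql("""
  SELECT name, price
  FROM products
  WHERE price > (SELECT avg(price) FROM products)
""")
```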
SparkSQL new features
• Native CSV data source (Based on
Databricks’ spark-csv package)
• Better off-heap memory management
• Bucketing support (Hive-compatible)
• Performance improvements
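A minimal sketch of the now-native CSV reader (the path and options are illustrative; `spark` is a SparkSession):

```scala
// No spark-csv package needed anymore: CSV is a built-in data source
val cars = spark.read
  .option("header", "true")      // first line holds column names
  .option("inferSchema", "true") // sample the file to guess column types
  .csv("/data/cars.csv")
```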
Dataset was supposed to be the future
SQL 2003 is so 2003
The API lives on!
SQL 2003 is cool
Structured Streaming (Alpha)
Structured Streaming is a scalable and
fault-tolerant stream processing engine
built on the Spark SQL engine.
You can express your streaming computation the same way you would express a batch computation on static data.
Structured Streaming (cont.)
You can use the Dataset / DataFrame API in Scala, Java or Python to
express streaming aggregations, event-time windows, stream-to-
batch joins, etc.
The computation is executed on the same optimized Spark SQL engine.
Exactly-once fault-tolerance guarantees through checkpointing and
Write Ahead Logs.
How many vehicles entered each toll booth every 5 minutes?
val windowedCounts = cars.groupBy(
  window($"timestamp", "5 minutes", "5 minutes"),
  $"tollBoothId" // hypothetical booth-id column
).count()
Still a micro-batching process. SparkSQL is the future.
Project Tungsten - Phase 2
Off-heap memory management
Let’s pop the hood
Bring Spark performance closer to the bare metal, through:
• Native memory management
• Runtime code generation
Started @ Version 1.4
The cornerstone that enabled the Catalyst engine
Project Tungsten - Phase 2
Whole stage code generation
a. A technique that blends state-of-the-art from modern compilers and MPP databases.
b. Gives a performance boost of up to 9x
c. Emits optimized bytecode at runtime that collapses the entire query into a single function
d. Eliminates virtual function calls and leverages CPU registers for intermediate data
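To see whole-stage code generation at work, `explain()` on a physical plan marks fused operators with a leading `*` in Spark 2.0 (a small sketch; `spark` is a SparkSession):

```scala
// A simple pipeline of range -> project -> filter
val df = spark.range(1000)
  .selectExpr("id * 2 AS doubled")
  .filter("doubled > 10")

// Operators prefixed with '*' in the printed plan were collapsed
// into a single generated function at runtime
df.explain()
```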
Project Tungsten - Phase 2
Optimized input / output
a. Caching for DataFrames is based on Parquet
b. Faster Parquet reader
c. Google Guava is OUT
d. Smarter HadoopFS connector
You have to be running on the DataFrame / Dataset API
Overall Judges Ruling
I want to complain, but I don’t know what about.
Internal performance improvements
aside, this feels more like Spark 1.7
I like flink...
All is good
SparkSQL is for sure the future of Spark
The competition has done well
Thank you (Questions?)
Long live Spark (and Flink)
Eyal Ben Ivri