A New Look at Spark 2 Features and Under the Hood. We take an examining look at Apache Spark's latest release: still loving it, but also criticising it.
3. In just OVER two words…
Not Much,
but let’s talk about it.
4. Let’s start from the beginning
• What is Apache Spark?
• An open source cluster computing framework.
• Originally developed at the University of California, Berkeley's AMPLab.
• Designed from the start to be a Big Data computational framework.
8. Version History (major changes)
• 1.0 – SparkSQL (formerly the Shark project)
• 1.1 – Streaming support for Python
• 1.2 – Core engine improvements; GraphX graduates
• 1.3 – DataFrame; Python engine improvements
• 1.4 – SparkR
• 1.5 – Bugs and performance
• 1.6 – Dataset (experimental)
9. Spark 2.0.x
• Major notables:
• Scala 2.11
• SparkSession
• Performance
• API Stability
• SQL:2003 support
• Structured Streaming
• R UDFs and MLlib algorithm implementations
10. API
• Spark doesn't like API changes.
• The good news:
• Migrating requires little to no change to your code.
• The (not so) bad news:
• To benefit from all the performance improvements, some old code may need further refactoring.
11. Programming API
• Unifying DataFrame and Dataset:
• Dataset[Row] = DataFrame
• SparkSession replaces SQLContext/HiveContext
• Both are kept for backwards compatibility.
• A simpler, more performant accumulator API
• A new, improved Aggregator API for typed aggregation in Datasets
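A minimal sketch of the new entry point and the DataFrame/Dataset unification. This assumes Spark 2.x on the classpath; the app name and the Car data are illustrative, not from the deck:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// SparkSession replaces SQLContext/HiveContext as the single entry point.
val spark = SparkSession.builder()
  .appName("spark2-demo")   // illustrative name
  .enableHiveSupport()      // optional: what HiveContext used to provide
  .getOrCreate()

import spark.implicits._

case class Car(plate: String, booth: Int)

// A typed Dataset...
val ds: Dataset[Car] = Seq(Car("12-345-67", 1), Car("98-765-43", 2)).toDS()

// ...and in 2.0 a DataFrame is just an alias for Dataset[Row],
// so the two assignments below refer to the same thing.
val df: DataFrame = ds.toDF()
val sameThing: Dataset[Row] = df
```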
12. SQL Language
• Improved SQL functionality (SQL:2003 support)
• Can now run all 99 TPC-DS queries
• The parser supports ANSI SQL as well as HiveQL
• Subquery support
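A hedged example of the new subquery support. The `tolls` view and its columns are made up for illustration; a SparkSession named `spark` is assumed:

```scala
// Spark 2.0's parser accepts, among others, uncorrelated scalar
// subqueries and IN/EXISTS predicate subqueries.
val aboveAverage = spark.sql("""
  SELECT booth, amount
  FROM tolls
  WHERE amount > (SELECT AVG(amount) FROM tolls)
""")
```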
13. SparkSQL new features
• Native CSV data source (based on Databricks' spark-csv package)
• Better off-heap memory management
• Bucketing support (Hive implementation)
• Performance, performance, performance
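With the native CSV source, no external package is needed. A minimal sketch, assuming a SparkSession named `spark`; the file path is hypothetical:

```scala
// Reads a CSV file directly, no spark-csv dependency required.
val cars = spark.read
  .option("header", "true")       // first line holds column names
  .option("inferSchema", "true")  // sample the file to infer column types
  .csv("/data/toll_booth_events.csv")
```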
15. Judges Ruling
Dataset was supposed to be the future like 6 months ago
SQL:2003 is so 2003
The API lives on!
SQL:2003 is cool
16. Structured Streaming (Alpha)
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data.
17. Structured Streaming (cont.)
You can use the Dataset / DataFrame API in Scala, Java or Python to express streaming aggregations, event-time windows, stream-to-batch joins, etc.
The computation is executed on the same optimized Spark SQL engine.
Exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs.
18. Windowing streams
How many vehicles entered each toll booth every 5 minutes?
// Requires: import org.apache.spark.sql.functions.window
//           import spark.implicits._  (for the $"..." syntax)
val windowedCounts = cars.groupBy(
  window($"timestamp", "5 minutes", "5 minutes"),
  $"car"
).count()
23. Tungsten Project - Phase 2
Performance improvements
Memory management
Connector optimizations
Let's pop the hood
24. Project Tungsten
Bring Spark performance closer to the bare metal, through:
• Native memory management
• Runtime code generation
Started at version 1.4
A cornerstone of Spark SQL performance, working alongside the Catalyst optimizer
25. Project Tungsten - Phase 2
Whole-stage code generation
a. A technique that blends state-of-the-art ideas from modern compilers and MPP databases
b. Gives a performance boost of up to 9x
c. Emits optimized bytecode at runtime that collapses the entire query into a single function
d. Eliminates virtual function calls and keeps intermediate data in CPU registers
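You can spot whole-stage code generation in a query plan: operators fused into one generated function are prefixed with an asterisk in `explain()` output. A minimal sketch, assuming a SparkSession named `spark`; the query itself is illustrative:

```scala
// In Spark 2.x, stages collapsed into a single generated function
// appear as "*Project", "*HashAggregate", etc. (WholeStageCodegen)
// when the physical plan is printed.
spark.range(0, 1000 * 1000)
  .selectExpr("id % 10 AS key")
  .groupBy("key")
  .count()
  .explain()
```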
26. Project Tungsten - Phase 2
Optimized input / output
a. Caching for DataFrames is based on Parquet
b. Faster Parquet reader
c. Google Guava is out
d. Smarter HadoopFS connector
To get these benefits, you have to be running on the DataFrame / Dataset API.
27. Overall Judges Ruling
I want to complain but I don't know about what!
Internal performance improvements aside, this feels more like Spark 1.7
I like Flink...
All is good
SparkSQL is for sure the future of Spark
The competition has done well for Spark