
What's New in Spark 2?


A look at Spark 2's features and what's under the hood. We examine Apache Spark's latest release with a critical eye: still loving it, but also criticising it.


  1. What’s New in Spark 2? Eyal Ben Ivri
  2. In just two words… Not Much
  3. In just OVER two words… Not Much, but let’s talk about it.
  4. Let’s start from the beginning • What is Apache Spark? • An open source cluster computing framework. • Originally developed at the University of California, Berkeley’s AMPLab. • Aimed and designed to be a Big Data computational framework.
  5. Spark Components Data Sources • Spark Core (Batch Processing, RDD, SparkContext) • SparkSQL (DataFrame) • Spark Streaming • Spark MLlib (ML Pipelines) • Spark GraphX • Spark Packages
  6. Spark Components (v2) Data Sources • Spark Core (Batch Processing, RDD, SparkContext) • Spark Streaming • Spark MLlib (ML Pipelines) • Spark GraphX • Spark Packages • Spark SQL (SparkSession, DataSet)
  7. Timeline UC Berkeley’s AMPLab (2009) • Open Sourced (2010) • Apache Foundation (2013) • Top-Level Apache Project (Feb 2014) • Version 1.0 (May 2014) • World record in large-scale sorting (Nov 2014) • Version 1.6 (Jan 2016) • Version 2.0.0 (Jul 2016) • Version 2.0.1 (Oct 2016, Current)
  8. Version History (major changes) 1.0 – SparkSQL (formerly the Shark project) • 1.1 – Streaming support for Python • 1.2 – Core engine improvements. GraphX graduates. • 1.3 – DataFrame. Python engine improvements. • 1.4 – SparkR • 1.5 – Bugs and performance • 1.6 – Dataset (experimental)
  9. Spark 2.0.x • Major notables: • Scala 2.11 • SparkSession • Performance • API Stability • SQL:2003 support • Structured Streaming • R UDFs and MLlib algorithm implementations
  10. API • Spark doesn’t like API changes • The good news: • To migrate, you’ll need little to no change to your code. • The (not so) bad news: • To benefit from all the performance improvements, some old code might need more refactoring.
  11. Programming API • Unifying DataFrame and Dataset: • Dataset[Row] = DataFrame • SparkSession replaces SQLContext/HiveContext • Both are kept for backwards compatibility. • Simpler, more performant accumulator API • A new, improved Aggregator API for typed aggregation in Datasets
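A minimal sketch of the unified entry point described above (the app name and data are illustrative, not from the slides):

```scala
import org.apache.spark.sql.SparkSession

// SparkSession is the single entry point in Spark 2,
// replacing SQLContext and HiveContext.
val spark = SparkSession.builder()
  .appName("spark2-demo")   // illustrative name
  .master("local[*]")       // local mode, just for the sketch
  .getOrCreate()
import spark.implicits._

// In Spark 2, DataFrame is simply an alias for Dataset[Row].
val df = spark.range(5).toDF("n")   // DataFrame == Dataset[Row]
val ds = df.as[Long]                // typed Dataset view of the same data

println(ds.count())
spark.stop()
```

The old `SQLContext` still works, but new code gets both the untyped (DataFrame) and typed (Dataset) views from one session object.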
  12. SQL Language • Improved SQL functionality (SQL:2003 support) • Can now run all 99 TPC-DS queries • The parser supports ANSI SQL as well as HiveQL • Subquery support
  13. SparkSQL new features • Native CSV data source (based on Databricks’ spark-csv package) • Better off-heap memory management • Bucketing support (Hive implementation) • Performance, performance, performance
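The native CSV source means `spark-csv` no longer needs to be added as a package. A self-contained sketch (the file and its contents are made up for illustration):

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-demo").master("local[*]").getOrCreate()

// Write a tiny CSV so the example is self-contained (data is made up).
val path = Files.createTempFile("people", ".csv")
Files.write(path, "name,age\nAda,36\nLinus,52\n".getBytes("UTF-8"))

// Spark 2 reads CSV natively; no external package required.
val people = spark.read
  .option("header", "true")       // first line holds column names
  .option("inferSchema", "true")  // age is inferred as an integer column
  .csv(path.toString)

people.show()
```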
  14. Demo Spark API • SparkSQL
  15. Judges Ruling Dataset was supposed to be the future, like, 6 months ago • SQL:2003 is so 2003 • The API lives on! • SQL:2003 is cool
  16. Structured Streaming (Alpha) Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data.
  17. Structured Streaming (cont.) You can use the Dataset / DataFrame API in Scala, Java or Python to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine. Exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs.
  18. Windowing streams How many vehicles entered each toll booth every 5 minutes? val windowedCounts = cars.groupBy( window($"timestamp", "5 minutes", "5 minutes"), $"car" ).count()
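The slide leaves the `cars` stream undefined. One way to build a stand-in stream for experimentation is the built-in `rate` source; the `car` column and its values here are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("windowing-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in for the slide's `cars` stream: the rate source emits
// (timestamp, value) rows; we derive a fake car id from value.
val cars = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 10)
  .load()
  .select($"timestamp", concat(lit("car-"), $"value" % 3).as("car"))

// Tumbling 5-minute windows (slide duration == window duration), per car.
val windowedCounts = cars
  .groupBy(window($"timestamp", "5 minutes", "5 minutes"), $"car")
  .count()
```

Because the slide duration equals the window duration, these are tumbling (non-overlapping) windows; a shorter slide would make them overlap.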
  19. Demo Structured Streaming
  20. Judges Ruling Still a micro-batching process • SparkSQL is the future
  21. Tungsten Project - Phase 2 Performance improvements • Off-heap memory management • Connector optimizations • Let’s pop the hood
  22. Project Tungsten Bring Spark performance closer to the bare metal, through: • Native memory management • Runtime code generation Started @ version 1.4 The cornerstone that enabled the Catalyst engine
  23. Project Tungsten - Phase 2 Whole-stage code generation a. A technique that blends the state of the art from modern compilers and MPP databases. b. Gives a performance boost of up to 9x. c. Emits optimized bytecode at runtime that collapses the entire query into a single function. d. Eliminates virtual function calls and leverages CPU registers for intermediate data.
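Whole-stage code generation can be observed via `explain()`: operators that were fused into one generated function are marked with an asterisk in the physical plan. A small sketch (the query itself is illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("codegen-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Filter + aggregate collapse into a single generated function
// under whole-stage code generation.
val q = spark.range(1000).filter($"id" % 2 === 0).agg(sum("id"))

// Fused operators appear prefixed with "*" in the plan output.
q.explain()

val total = q.first().getLong(0)  // sum of the even numbers 0..998
```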
  24. Project Tungsten - Phase 2 Optimized input / output a. Caching for DataFrames is based on Parquet b. Faster Parquet reader c. Google Guava is OUT d. Smarter HadoopFS connector (you have to be on the DataFrame / Dataset API to benefit)
  25. Overall Judges Ruling I want to complain but I don’t know what about!! • Internal performance improvements aside, this feels more like Spark 1.7 • I like Flink... • All is good • SparkSQL is for sure the future of Spark • The competition has done well for Spark
  26. Thank you (Questions?) Long live Spark (and Flink) Eyal Ben Ivri