
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0

Apache Spark 2.0 was released this summer and is already being widely adopted. In this keynote, Matei Zaharia discusses how changes in the API have made it easier to write batch, streaming, and real-time applications. The Dataset API, now integrated with DataFrames, makes it possible to benefit from powerful optimizations such as pushing queries into data sources, while the Structured Streaming extension to this API makes it possible to run many of the same computations automatically in a streaming fashion.


  1. Matei Zaharia (@matei_zaharia): Simplifying Big Data in Apache Spark 2.0
  2. A Great Year for Apache Spark
     • Meetup members: 66K (2015) → 225K (2016)
     • Developers contributing: 600 (2015) → 1,100 (2016)
     • 2.0: new major version
  3. About Spark 2.0
     • Remains highly compatible with 1.x
     • Builds on key lessons and simplifies the API
     • 2,000 patches from 280 contributors
  4. What’s Hard About Big Data?
     • Complex combination of processing tasks, storage systems & modes: ETL, aggregation, machine learning, streaming, etc.
     • Hard to get both productivity and performance
  5. Apache Spark’s Approach
     • Unified engine: express the entire workflow in one API; connect existing libraries & storage
     • High-level APIs with space to optimize: RDDs, DataFrames, ML pipelines
     • Libraries on top: SQL, Streaming, ML, Graph, …
  6. New in 2.0
     • Structured API improvements (DataFrame, Dataset, SQL)
     • Whole-stage code generation
     • Structured Streaming
     • Simpler setup with SparkSession (see the sketch below)
     • SQL 2003 support
     • MLlib model persistence
     • MLlib R bindings
     • SparkR user-defined functions
     • …
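     A minimal sketch of the simpler SparkSession setup, in Scala and assuming
     local mode purely for illustration:

        import org.apache.spark.sql.SparkSession

        // One SparkSession entry point replaces the separate SparkContext,
        // SQLContext and HiveContext objects from 1.x.
        val spark = SparkSession.builder()
          .appName("spark2-demo")   // hypothetical app name
          .master("local[*]")       // assumption: local mode
          .getOrCreate()

        val df = spark.read.json("/logs")   // hypothetical input path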
  7. Original Spark API
     Arbitrary Java functions on Java objects
     + Can organize your app using functions, classes and types
     - Difficult for the engine to optimize
       • Inefficient in-memory format
       • Hard to do cross-operator optimizations

        val lines = sc.textFile("s3://...")
        val points = lines.map(line => new Point(line))
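     A runnable version of this snippet might look like the sketch below; the
     Point case class and the "x,y" line format are assumptions, since the
     slide leaves them unspecified:

        case class Point(x: Double, y: Double)

        // sc is the SparkContext, as on the slide; the path is elided there.
        val lines  = sc.textFile("s3://...")
        val points = lines.map { line =>
          val Array(x, y) = line.split(",")   // assumed "x,y" line format
          Point(x.toDouble, y.toDouble)
        }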
  8. Structured APIs
     New APIs for data with a fixed schema (table-like)
     • Efficient storage taking advantage of the schema (e.g. columnar)
     • Operators take expressions in a special DSL that Spark can optimize
     DataFrames (untyped), Datasets (typed), and SQL
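     A sketch of the three structured entry points over the same data; the
     Event schema and column names are assumptions for illustration:

        import spark.implicits._   // enables .as[T] and $"col" syntax

        case class Event(userid: String, loc: String, status: String,
                         duration: Long)

        val events = spark.read.json("/logs")   // DataFrame: untyped rows
        val typed  = events.as[Event]           // Dataset[Event]: typed objects

        events.createOrReplaceTempView("events")
        val bySql = spark.sql(
          "SELECT loc, status, avg(duration) FROM events GROUP BY loc, status")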
  9. Structured API Example
     DataFrame API:

        events = spark.read.json("/logs")
        stats = events.join(users)
          .groupBy("loc", "status")
          .avg("duration")
        errors = stats.where(stats.status == "ERR")

     Optimized Plan: SCAN logs, SCAN users, JOIN, AGG, FILTER

     Specialized Code:

        while (logs.hasNext) {
          e = logs.next
          if (e.status == "ERR") {
            u = users.get(e.uid)
            key = (u.loc, e.status)
            sum(key) += e.duration
            count(key) += 1
          }
        }
        ...
  10. Structured API Example (continued)
     The same DataFrame code, after the optimizer pushes the ERR filter into
     the scan of the logs:

        events = spark.read.json("/logs")
        stats = events.join(users)
          .groupBy("loc", "status")
          .avg("duration")
        errors = stats.where(stats.status == "ERR")

     Optimized Plan: FILTERED SCAN logs, SCAN users, JOIN, AGG (the separate
     FILTER stage is gone)

     Specialized Code:

        while (logs.hasNext) {
          e = logs.next
          if (e.status == "ERR") {
            u = users.get(e.uid)
            key = (u.loc, e.status)
            sum(key) += e.duration
            count(key) += 1
          }
        }
        ...
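     You can watch the optimizer do this yourself: explain(true) prints the
     parsed, analyzed, optimized, and physical plans for a query. A sketch,
     reusing the hypothetical events and users DataFrames from above:

        import spark.implicits._

        val stats = events.join(users, "userid")
          .groupBy("loc", "status")
          .avg("duration")

        // The ERR filter should appear pushed toward the scan in the
        // optimized plan, rather than running after the aggregation.
        stats.where($"status" === "ERR").explain(true)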
  11. New in 2.0
     • Whole-stage code generation: fuse across multiple operators
     • Optimized Parquet I/O
     • Scan throughput: Spark 1.6: 14M rows/s → Spark 2.0: 125M rows/s; Parquet in 1.6: 11M rows/s → Parquet in 2.0: 90M rows/s
     • Merging DataFrame & Dataset: DataFrame = Dataset[Row]
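     The merge is literal in the 2.0 Scala API: DataFrame is now just a type
     alias, type DataFrame = Dataset[Row]. A sketch of moving between the two,
     reusing the hypothetical Event class from above:

        import org.apache.spark.sql.{DataFrame, Dataset}
        import spark.implicits._

        val df: DataFrame      = spark.read.json("/logs")
        val ds: Dataset[Event] = df.as[Event]   // add compile-time types
        val back: DataFrame    = ds.toDF()      // back to untyped rows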
  12. Beyond Batch & Interactive: Higher-Level API for Streaming
  13. What’s Hard In Using Streaming?
     • Complex semantics: what possible results can the program give? What happens if a node runs slowly? If one fails?
     • Integration into a complete application: serve real-time queries on the result of a stream; give results consistent with batch jobs
  14. Structured Streaming
     • High-level streaming API based on DataFrames / Datasets
       - Same semantics & results as batch APIs
       - Event time, windowing, sessions, transactional I/O
     • Rich integration with complete Apache Spark apps
       - Memory sink for ad-hoc queries (see the sketch below)
       - Joins with static data
       - Change queries at runtime
     • Not just streaming, but “continuous applications”
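     A sketch of the memory sink mentioned above: the stream’s current results
     are kept in an in-memory table that ad-hoc queries can hit while the job
     runs. Here streamingStats stands for some streaming aggregation and is an
     assumption:

        val query = streamingStats.writeStream
          .format("memory")          // keep results in an in-memory table
          .queryName("stats")        // table name visible to SQL
          .outputMode("complete")    // aggregations need complete/update mode
          .start()

        spark.sql("SELECT * FROM stats").show()   // ad-hoc query on live data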
  15. Structured Streaming API
     Incrementalize an existing DataFrame/Dataset/SQL query. Example batch job:

        logs = spark.read.format("json").load("hdfs://logs")
        logs.groupBy("userid", "hour").avg("latency")
          .write.format("parquet")
          .save("s3://...")
  16. Structured Streaming API
     The same example as streaming:

        logs = spark.readStream.format("json").load("hdfs://logs")
        logs.groupBy("userid", "hour").avg("latency")
          .writeStream.format("parquet")
          .start("s3://...")

     Results are always the same as a batch job on a prefix of the data.
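     Building on the slide’s example: start() returns a StreamingQuery handle,
     and in practice a checkpoint location (a hypothetical path below) is what
     lets the engine recover offsets and state for its transactional
     guarantees. A sketch:

        val query = logs.groupBy("userid", "hour").avg("latency")
          .writeStream
          .format("parquet")
          .option("checkpointLocation", "hdfs://checkpoints/log-agg")  // assumed path
          .start("s3://...")

        query.isActive            // true while the stream is running
        query.awaitTermination()  // block the driver until the query stops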
  17. Under the Hood
     The batch plan (Scan Files → Aggregate → Write to S3) is automatically
     transformed into a continuous plan (Scan New Files → Stateful Aggregate →
     Update S3).
  18. End Goal: Full Continuous Apps
     A pure streaming system covers only input stream → streaming computation →
     output sink. A continuous application adds, around that same stream,
     ad-hoc queries on the output sink, static data, and batch jobs consistent
     with the streaming results.
  19. Development Status
     • 2.0.1: supports ETL workloads from file systems and S3
     • 2.0.2: Kafka input source, monitoring metrics
     • 2.1.0: event-time aggregation workloads & watermarks (see the sketch below)
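     A sketch of the 2.1.0 event-time features noted above: aggregate over
     1-hour windows of each record’s own timestamp, with a watermark letting
     Spark discard state for records more than 10 minutes late. The column
     names are assumptions:

        import org.apache.spark.sql.functions.window
        import spark.implicits._

        val counts = events
          .withWatermark("timestamp", "10 minutes")   // bound allowed lateness
          .groupBy(window($"timestamp", "1 hour"), $"userid")
          .count()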
  20. Demo (Greg Owen)
