
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0


The next release of Apache Spark will be 2.0, marking a big milestone for the project. In this talk, I’ll cover how the community has grown to reach this point, and some of the major features in 2.0. The largest additions are performance improvements for Datasets, DataFrames and SQL through Project Tungsten, as well as a new Structured Streaming API that provides simpler and more powerful stream processing. I’ll also discuss a bit of what’s in the works for future versions.


  1. Matei Zaharia (@matei_zaharia): Apache Spark 2.0
  2. Apache Spark 2.0: Next major release, coming out this month • Unstable preview release at • Remains highly compatible with Apache Spark 1.X • Over 2000 patches from 280 contributors!
  3. Apache Spark Philosophy: (1) Unified engine: support end-to-end applications (SQL, Streaming, ML, Graph, …). (2) High-level APIs: easy to use, rich optimizations. (3) Integrate broadly: storage systems, libraries, etc.
  4. New in 2.0: Structured API improvements (DataFrame, Dataset, SparkSession), Structured Streaming, MLlib model export, MLlib R bindings, SQL 2003 support, Scala 2.12 support. Broader community: deep learning libraries (Baidu, Yahoo!, Berkeley, Databricks), GraphFrames, PyData integration, reactive streams, C# bindings (Mobius), JS bindings (EclairJS). All build on the common interface of RDDs & DataFrames.
  5. Deep Dive: Structured APIs
     DataFrame API:
       events ="/logs")
       stats = events.join(users)
         .groupBy("loc", "status")
         .avg("duration")
       errors = stats.where(stats.status == "ERR")
     Optimized Plan: READ logs, READ users → JOIN → AGG → FILTER
     Specialized Code:
       while (logs.hasNext) {
         e =
         if (e.status == "ERR") {
           u = users.get(e.uid)
           key = (u.loc, e.status)
           sum(key) += e.duration
           count(key) += 1
         }
       }
       ...
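The slide above shows a DataFrame query being compiled down to one specialized loop. As a minimal sketch of what that fused loop computes (plain Python, not Spark; the sample data and field names are invented for illustration), the join becomes a hash lookup and the filter and aggregation fold into the same pass:

```python
# Plain-Python sketch of the fused loop on the slide: join logs with
# users, keep only "ERR" rows, group by (loc, status), average duration.
from collections import defaultdict

logs = [
    {"uid": 1, "status": "ERR", "duration": 10},
    {"uid": 1, "status": "OK",  "duration": 30},
    {"uid": 2, "status": "ERR", "duration": 20},
]
users = {1: {"loc": "US"}, 2: {"loc": "EU"}}

sums, counts = defaultdict(int), defaultdict(int)
for e in logs:                       # while (logs.hasNext) { e = ... }
    if e["status"] == "ERR":         # the FILTER, pushed into the loop
        u = users[e["uid"]]          # the JOIN, as a hash lookup
        key = (u["loc"], e["status"])
        sums[key] += e["duration"]   # the AGG: running sum and count
        counts[key] += 1

errors = {k: sums[k] / counts[k] for k in sums}
# errors == {("US", "ERR"): 10.0, ("EU", "ERR"): 20.0}
```

The point of the optimizer is that the user writes the declarative query at the top of the slide, and code like this loop is generated automatically.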
  6. New in 2.0: Whole-stage code generation
     • Fuse across multiple operators: Spark 1.6: 14M rows/s → Spark 2.0: 125M rows/s
     Optimized input/output
     • Apache Parquet + built-in cache: Parquet in 1.6: 11M rows/s → Parquet in 2.0: 90M rows/s
  7. Structured Streaming: High-level streaming API built on DataFrames
     • Event time, windowing, sessions, sources & sinks
     Also supports interactive & batch queries
     • Aggregate data in a stream, then serve using JDBC
     • Change queries at runtime
     • Build and apply ML models
     Not just streaming, but "continuous applications"
  8. Apache Spark 1.X: static DataFrames. Apache Spark 2.0: infinite DataFrames. A single API for both: the Structured Streaming API.
  9. Example: Batch App
       logs ="json").open("s3://logs")
       logs.groupBy("userid", "hour").avg("latency")
         .write.format("jdbc")
         .save("jdbc:mysql://...")
  10. Example: Continuous App
        logs ="json").stream("s3://logs")
        logs.groupBy("userid", "hour").avg("latency")
          .write.format("jdbc")
          .startStream("jdbc:mysql://...")
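The only change from the batch app is stream()/startStream(); the query itself is untouched. A minimal sketch of the semantics behind that (plain Python, not Spark; the batch contents are invented for illustration): the same groupBy/avg is maintained incrementally as micro-batches arrive, and its running result matches what the batch query over all the data would compute.

```python
# Incremental group-by average over arriving micro-batches:
# the idea behind running one query as either batch or stream.
from collections import defaultdict

batches = [
    [("alice", 1, 120), ("bob", 1, 80)],   # (userid, hour, latency)
    [("alice", 1, 60), ("bob", 2, 100)],
]

sums, counts = defaultdict(int), defaultdict(int)
for batch in batches:            # each trigger processes only new rows
    for userid, hour, latency in batch:
        key = (userid, hour)
        sums[key] += latency
        counts[key] += 1
    # After each batch, the updated result table can go to the sink (JDBC).
    result = {k: sums[k] / counts[k] for k in sums}

# Final state equals the batch answer over the full data set.
# result == {("alice", 1): 90.0, ("bob", 1): 80.0, ("bob", 2): 100.0}
```

This is why the talk calls these "continuous applications": the engine keeps the result of an ordinary query up to date, rather than exposing a separate streaming programming model.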
  11. More details at the conference: Engine: Structuring Spark, Structured Streaming, deep dives. ML: SparkR, MLlib 2.0, new algorithms. Other: deep learning, GraphFrames, Solr, Cassandra, … Try 2.0-preview at
  12. Growing the Community: New initiatives from Databricks
  13. The largest challenge in applying big data is the skills gap. (Source: StackOverflow Developer Survey 2016)
  14. Databricks Community Edition: Free version of Databricks with: • Interactive tutorials • Apache Spark and popular data science libraries • Visualization & debug tools. GA today!
  15. Massive Open Online Courses: Free 5-course series on big data with Apache Spark:
      • Introduction to Apache Spark
      • Distributed Machine Learning with Apache Spark
      • Big Data Analysis with Apache Spark
      • Advanced Apache Spark for Data Science and Data Engineering
      • Advanced Machine Learning with Apache Spark
  16. Demo: Michael Armbrust