2016 Spark Summit East Keynote: Matei Zaharia

Databricks CTO and Spark creator Matei Zaharia's keynote at Spark Summit East 2016: Planned major expansions to Apache Spark

  1. Spark 2.0 (Matei Zaharia, February 17, 2016)
  2. 2015: A Great Year for Spark. Summit attendees: 1,100 (2014) to 3,900 (2015); Meetup members: 12K to 66K; total contributors: 500 to 1,000.
  3. Meetup Groups: January 2015 (source: meetup.com)
  4. Meetup Groups: January 2016 (source: meetup.com)
  5. New Components: DataFrames, SparkR, Data Sources, Project Tungsten, Streaming ML, Kafka Connector, ML Pipelines, Debug UI, Dataset API
  6. Spark 2.0: the next major release, coming in April / May. Builds on all we learned in the past 2 years.
  7. Versioning in Spark: in a release number like 1.6.0, the major version may change APIs, the minor version adds APIs / features, and the patch version contains only bug fixes. In reality, we hate breaking APIs! We will not do so except for some dependency conflicts (e.g. Guava).
  8. Major Features in 2.0: Tungsten Phase 2 (speedups of 5-10x); Structured Streaming (a real-time engine on SQL/DataFrames); unifying Datasets and DataFrames.
  9. Tungsten Phase 2
  10. Background on Project Tungsten: CPU speeds have not kept up with I/O in the past 5 years. Bring Spark performance closer to bare metal through: • Native memory management • Runtime code generation
  11. Tungsten So Far: Spark 1.4–1.6 added binary storage and basic code gen. DataFrame + Dataset APIs enable Tungsten in user programs • Also used under Spark SQL + parts of MLlib
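To make the binary-storage point concrete, here is a minimal sketch (the case class and names are illustrative, not from the talk) of how the Dataset API's encoders map JVM objects into Tungsten's compact binary row format, using APIs available since Spark 1.6:

      import org.apache.spark.sql.{Encoder, Encoders}

      case class Point(x: Double, y: Double)

      // Encoders translate JVM objects to and from Tungsten's compact
      // binary row format, avoiding Java object overhead and GC pressure.
      val enc: Encoder[Point] = Encoders.product[Point]
      println(enc.schema.simpleString) // struct<x:double,y:double>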
  12. New in 2.0: Whole-stage code generation • Remove expensive iterator calls • Fuse across multiple operators (Spark 1.6: 14M rows/s; Spark 2.0: 125M rows/s). Optimized input / output • Parquet + built-in cache (Parquet in 1.6: 11M rows/s; Parquet in 2.0: 90M rows/s). Automatically applies to SQL, DataFrames, Datasets.
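As a quick illustration of where whole-stage code generation applies (assuming the Spark 2.0 APIs as they eventually shipped, with a SparkSession named `spark`), the physical plan printed by explain() marks the fused operators:

      // A simple aggregation over a generated range of ids.
      val df = spark.range(1000L * 1000 * 1000)
        .selectExpr("id % 100 AS key")
        .groupBy("key")
        .count()

      // Operators fused into a single compiled loop appear under a
      // WholeStageCodegen node (prefixed with "*") in the plan.
      df.explain()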
  13. Structured Streaming
  14. Background: Real-time processing is increasingly important. Most apps need to combine it with batch & interactive queries • Track state using a stream, then run SQL queries • Train an ML model offline, then update it. Spark is very well-suited to do this.
  15. Structured Streaming: a high-level streaming API built on the Spark SQL engine • Declarative API that extends DataFrames / Datasets • Event time, windowing, sessions, sources & sinks. Also supports interactive & batch queries • Aggregate data in a stream, then serve using JDBC • Change queries at runtime • Build and apply ML models. Not just streaming, but “continuous applications”
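A rough sketch of this declarative API follows; since the design was still evolving at the time of this talk, the socket source, console sink, and option values below are illustrative, following the shape the API took in the released Spark 2.0:

      // Read a stream of lines as an unbounded DataFrame.
      val lines = spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", "9999")
        .load()

      // The same DataFrame operations used for batch data apply to the stream.
      val counts = lines.groupBy("value").count()

      // Continuously maintain the aggregate and print updates to the console.
      val query = counts.writeStream
        .outputMode("complete")
        .format("console")
        .start()
      query.awaitTermination()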
  16. Goal: end-to-end continuous applications. [Diagram: an example pipeline in which Kafka feeds an ETL job into a database that serves reporting applications, an ML model, and ad-hoc queries; traditional streaming covers only part of this, the rest are other processing types.]
  17. Details on Structured Streaming: Spark 2.0 will have a first version focused on ETL [SPARK-8360]. Later versions will add more operators & libraries. See Reynold’s keynote tomorrow for a deep dive!
  18. Datasets & DataFrames
  19. Datasets and DataFrames: In 2015, we added DataFrames & Datasets as structured data APIs • DataFrames are collections of rows with a schema • Datasets add static types, e.g. Dataset[Person] • Both run on Tungsten. Spark 2.0 will merge these APIs: DataFrame = Dataset[Row]
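In the released Spark 2.0 this merge is literally a type alias (type DataFrame = Dataset[Row] in the org.apache.spark.sql package object), so the two APIs interoperate freely. A small sketch, with the case class and file name as illustrative assumptions:

      import org.apache.spark.sql.{DataFrame, Dataset}
      import spark.implicits._ // encoders for case classes

      case class Person(name: String, age: Long)

      val df: DataFrame = spark.read.json("people.json") // untyped: Dataset[Row]
      val ds: Dataset[Person] = df.as[Person]            // typed view of the same data
      val back: DataFrame = ds.toDF()                    // and back to rows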
  20. Example
      case class User(name: String, id: Int)
      case class Message(user: User, text: String)

      val dataframe = sqlContext.read.json("log.json") // DataFrame, i.e. Dataset[Row]
      val messages = dataframe.as[Message]             // Dataset[Message]
      val users = messages
        .filter(m => m.text.contains("Spark"))
        .map(m => m.user)                              // Dataset[User]
      pipeline.train(users) // MLlib takes either DataFrames or Datasets
  21. Benefits: Simpler to understand • We only kept Dataset separate to keep binary compatibility in 1.x. Libraries can take data of both forms. With Streaming, the same API will also work on streams.
  22. Long-Term: RDD will remain the low-level API in Spark. Datasets & DataFrames give richer semantics and optimizations • New libraries will increasingly use these as an interchange format • Examples: Structured Streaming, MLlib, GraphFrames
  23. Thank you! Enjoy Spark Summit
