2016 Spark Summit East Keynote: Matei Zaharia

Databricks CTO and Spark creator Matei Zaharia's keynote at Spark Summit East 2016: Planned major expansions to Apache Spark

  1. Spark 2.0 (Matei Zaharia, February 17, 2016)
  2. 2015: A Great Year for Spark. Summit attendees: 1,100 (2014) to 3,900 (2015); Meetup members: 12K to 66K; total contributors: 500 to 1,000.
  3. Meetup Groups: January 2015 (source: meetup.com)
  4. Meetup Groups: January 2016 (source: meetup.com)
  5. New Components: DataFrames, SparkR, Data Sources, Project Tungsten, Streaming ML, Kafka Connector, ML Pipelines, Debug UI, Dataset API
  6. Spark 2.0: the next major release, coming in April / May. Builds on all we learned in the past 2 years.
  7. Versioning in Spark: in a release number like 1.6.0, the major version may change APIs, the minor version adds APIs / features, and the patch version contains only bug fixes. In reality, we hate breaking APIs! We will not do so except for some dependency conflicts (e.g. Guava).
  8. Major Features in 2.0: Tungsten Phase 2 (speedups of 5-10x); Structured Streaming (a real-time engine on SQL/DataFrames); unifying Datasets and DataFrames.
  9. Tungsten Phase 2
  10. Background on Project Tungsten: CPU speeds have not kept up with I/O in the past 5 years. Bring Spark performance closer to bare metal through: • Native memory management • Runtime code generation
  11. Tungsten So Far: Spark 1.4–1.6 added binary storage and basic code gen. DataFrame + Dataset APIs enable Tungsten in user programs • Also used under Spark SQL + parts of MLlib
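To make the binary-storage point concrete, here is a minimal sketch (the case class and names are illustrative, not from the talk) of how the Dataset API's encoders map JVM objects into Tungsten's compact binary row format, using APIs available since Spark 1.6:

      import org.apache.spark.sql.{Encoder, Encoders}

      case class Point(x: Double, y: Double)

      // Encoders translate JVM objects to and from Tungsten's compact
      // binary row format, avoiding Java object overhead and GC pressure.
      val enc: Encoder[Point] = Encoders.product[Point]
      println(enc.schema.simpleString) // struct<x:double,y:double>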
  12. New in 2.0: Whole-stage code generation • Remove expensive iterator calls • Fuse across multiple operators (Spark 1.6: 14M rows/s; Spark 2.0: 125M rows/s). Optimized input / output • Parquet + built-in cache (Parquet in 1.6: 11M rows/s; Parquet in 2.0: 90M rows/s). Automatically applies to SQL, DataFrames, Datasets.
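As a quick illustration of where whole-stage code generation applies (assuming the Spark 2.0 APIs as they eventually shipped, with a SparkSession named `spark`), the physical plan printed by explain() marks the fused operators:

      // A simple aggregation over a generated range of ids.
      val df = spark.range(1000L * 1000 * 1000)
        .selectExpr("id % 100 AS key")
        .groupBy("key")
        .count()

      // Operators fused into a single compiled loop appear under a
      // WholeStageCodegen node (prefixed with "*") in the plan.
      df.explain()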
  13. Structured Streaming
  14. Background: Real-time processing is increasingly important. Most apps need to combine it with batch & interactive queries • Track state using a stream, then run SQL queries • Train an ML model offline, then update it. Spark is very well-suited to do this.
  15. Structured Streaming: a high-level streaming API built on the Spark SQL engine • Declarative API that extends DataFrames / Datasets • Event time, windowing, sessions, sources & sinks. Also supports interactive & batch queries • Aggregate data in a stream, then serve using JDBC • Change queries at runtime • Build and apply ML models. Not just streaming, but “continuous applications”
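A rough sketch of this declarative API follows; since the design was still evolving at the time of this talk, the socket source, console sink, and option values below are illustrative, following the shape the API took in the released Spark 2.0:

      // Read a stream of lines as an unbounded DataFrame.
      val lines = spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", "9999")
        .load()

      // The same DataFrame operations used for batch data apply to the stream.
      val counts = lines.groupBy("value").count()

      // Continuously maintain the aggregate and print updates to the console.
      val query = counts.writeStream
        .outputMode("complete")
        .format("console")
        .start()
      query.awaitTermination()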
  16. Goal: end-to-end continuous applications. [Diagram: an example pipeline in which Kafka feeds an ETL job into a database that serves reporting applications, an ML model, and ad-hoc queries; traditional streaming covers only part of this, the rest are other processing types.]
  17. Details on Structured Streaming: Spark 2.0 will have a first version focused on ETL [SPARK-8360]. Later versions will add more operators & libraries. See Reynold’s keynote tomorrow for a deep dive!
  18. Datasets & DataFrames
  19. Datasets and DataFrames: In 2015, we added DataFrames & Datasets as structured data APIs • DataFrames are collections of rows with a schema • Datasets add static types, e.g. Dataset[Person] • Both run on Tungsten. Spark 2.0 will merge these APIs: DataFrame = Dataset[Row]
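In the released Spark 2.0 this merge is literally a type alias (type DataFrame = Dataset[Row] in the org.apache.spark.sql package object), so the two APIs interoperate freely. A small sketch, with the case class and file name as illustrative assumptions:

      import org.apache.spark.sql.{DataFrame, Dataset}
      import spark.implicits._ // encoders for case classes

      case class Person(name: String, age: Long)

      val df: DataFrame = spark.read.json("people.json") // untyped: Dataset[Row]
      val ds: Dataset[Person] = df.as[Person]            // typed view of the same data
      val back: DataFrame = ds.toDF()                    // and back to rows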
  20. Example
      case class User(name: String, id: Int)
      case class Message(user: User, text: String)

      val dataframe = sqlContext.read.json("log.json") // DataFrame, i.e. Dataset[Row]
      val messages = dataframe.as[Message]             // Dataset[Message]
      val users = messages
        .filter(m => m.text.contains("Spark"))
        .map(m => m.user)                              // Dataset[User]
      pipeline.train(users) // MLlib takes either DataFrames or Datasets
  21. Benefits: Simpler to understand • We only kept Dataset separate to keep binary compatibility in 1.x. Libraries can take data of both forms. With Streaming, the same API will also work on streams.
  22. Long-Term: RDD will remain the low-level API in Spark. Datasets & DataFrames give richer semantics and optimizations • New libraries will increasingly use these as an interchange format • Examples: Structured Streaming, MLlib, GraphFrames
  23. Thank you! Enjoy Spark Summit
