SlideShare a Scribd company logo

Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0

Databricks
Databricks
DatabricksDeveloper Marketing and Relations at MuleSoft

The next release of Apache Spark will be 2.0, marking a big milestone for the project. In this talk, I’ll cover how the community has grown to reach this point, and some of the major features in 2.0. The largest additions are performance improvements for Datasets, DataFrames and SQL through Project Tungsten, as well as a new Structured Streaming API that provides simpler and more powerful stream processing. I’ll also discuss a bit of what’s in the works for future versions.

Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0

1 of 16
Download to read offline
Matei Zaharia
@matei_zaharia
Apache Spark 2.0
Apache Spark 2.0
Next major release,coming out this month
• Unstable previewrelease at spark.apache.org
Remains highly compatible with ApacheSpark 1.X
Over 2000 patches from 280 contributors!
Apache Spark Philosophy
Unified engine
Support end-to-end applications
High-level APIs
Easy to use, rich optimizations
Integrate broadly
Storage systems, libraries, etc
SQLStreaming ML Graph
…
1
2
3
New in 2.0
Structured API improvements
(DataFrame, Dataset, SparkSession)
Structured Streaming
MLlib model export
MLlib R bindings
SQL 2003 support
Scala 2.12 support
Deep learning libraries
(Baidu, Yahoo!, Berkeley, Databricks)
GraphFrames
PyData integration
Reactive streams
C# bindings:Mobius
JS bindings:EclairJS
Broader Community
Build on common interface of RDDs & DataFrames
Deep Dive: Structured APIs
events =
sc.read.json(“/logs”)
stats =
events.join(users)
.groupBy(“loc”,“status”)
.avg(“duration”)
errors = stats.where(
stats.status == “ERR”)
DataFrame API Optimized Plan Specialized Code
READ logs READ users
JOIN
AGG
FILTER
while(logs.hasNext) {
e = logs.next
if(e.status == “ERR”) {
u = users.get(e.uid)
key = (u.loc, e.status)
sum(key) += e.duration
count(key) += 1
}
}
...
New in 2.0
Whole-stage code generation
• Fuse across multiple operators
Spark 1.6 14M
rows/s
Spark 2.0 125M
rows/s
Parquet
in 1.6
11M
rows/s
Parquet
in 2.0
90M
rows/s
Optimized input / output
• Apache Parquet + built-incache
Ad

Recommended

Spark Summit San Francisco 2016 - Ali Ghodsi Keynote
Spark Summit San Francisco 2016 - Ali Ghodsi KeynoteSpark Summit San Francisco 2016 - Ali Ghodsi Keynote
Spark Summit San Francisco 2016 - Ali Ghodsi KeynoteDatabricks
 
Distributed ML in Apache Spark
Distributed ML in Apache SparkDistributed ML in Apache Spark
Distributed ML in Apache SparkDatabricks
 
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit
 
Spark Summit EU talk by Michael Nitschinger
Spark Summit EU talk by Michael NitschingerSpark Summit EU talk by Michael Nitschinger
Spark Summit EU talk by Michael NitschingerSpark Summit
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 Databricks
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Spark Summit
 
H2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks CloudH2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks CloudSri Ambati
 
Spark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit
 

More Related Content

What's hot

Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
 
Scalable And Incremental Data Profiling With Spark
Scalable And Incremental Data Profiling With SparkScalable And Incremental Data Profiling With Spark
Scalable And Incremental Data Profiling With SparkJen Aman
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingDatabricks
 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Databricks
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Databricks
 
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...Databricks
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnDatabricks
 
Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit
 
Lessons from Running Large Scale Spark Workloads
Lessons from Running Large Scale Spark WorkloadsLessons from Running Large Scale Spark Workloads
Lessons from Running Large Scale Spark WorkloadsDatabricks
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedDatabricks
 
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...Databricks
 
Spark Summit 2015 keynote: Making Big Data Simple with Spark
Spark Summit 2015 keynote: Making Big Data Simple with SparkSpark Summit 2015 keynote: Making Big Data Simple with Spark
Spark Summit 2015 keynote: Making Big Data Simple with SparkDatabricks
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIDatabricks
 
Stream All Things—Patterns of Modern Data Integration with Gwen Shapira
Stream All Things—Patterns of Modern Data Integration with Gwen ShapiraStream All Things—Patterns of Modern Data Integration with Gwen Shapira
Stream All Things—Patterns of Modern Data Integration with Gwen ShapiraDatabricks
 
End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkBurak Yavuz
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark Summit
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleMLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleDatabricks
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillFrom R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillDatabricks
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Databricks
 
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaTrends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaSpark Summit
 

What's hot (20)

Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 
Scalable And Incremental Data Profiling With Spark
Scalable And Incremental Data Profiling With SparkScalable And Incremental Data Profiling With Spark
Scalable And Incremental Data Profiling With Spark
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
 
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
 
Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John Musser
 
Lessons from Running Large Scale Spark Workloads
Lessons from Running Large Scale Spark WorkloadsLessons from Running Large Scale Spark Workloads
Lessons from Running Large Scale Spark Workloads
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons Learned
 
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
 
Spark Summit 2015 keynote: Making Big Data Simple with Spark
Spark Summit 2015 keynote: Making Big Data Simple with SparkSpark Summit 2015 keynote: Making Big Data Simple with Spark
Spark Summit 2015 keynote: Making Big Data Simple with Spark
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
 
Stream All Things—Patterns of Modern Data Integration with Gwen Shapira
Stream All Things—Patterns of Modern Data Integration with Gwen ShapiraStream All Things—Patterns of Modern Data Integration with Gwen Shapira
Stream All Things—Patterns of Modern Data Integration with Gwen Shapira
 
End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache Spark
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin Odersky
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleMLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillFrom R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep Gill
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
 
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaTrends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
 

Viewers also liked

Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Databricks
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 
Big Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the CloudBig Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the CloudJen Aman
 
Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow Jen Aman
 
2016 Spark Summit East Keynote: Matei Zaharia
2016 Spark Summit East Keynote: Matei Zaharia2016 Spark Summit East Keynote: Matei Zaharia
2016 Spark Summit East Keynote: Matei ZahariaDatabricks
 
Airstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At AirbnbAirstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At AirbnbJen Aman
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark StreamingJen Aman
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development KitJen Aman
 
From MapReduce to Apache Spark
From MapReduce to Apache SparkFrom MapReduce to Apache Spark
From MapReduce to Apache SparkJen Aman
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Huohua: A Distributed Time Series Analysis Framework For Spark
Huohua: A Distributed Time Series Analysis Framework For SparkHuohua: A Distributed Time Series Analysis Framework For Spark
Huohua: A Distributed Time Series Analysis Framework For SparkJen Aman
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache SparkJen Aman
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousJen Aman
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on MesosJen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityJen Aman
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Interactive Visualization of Streaming Data Powered by Spark
Interactive Visualization of Streaming Data Powered by SparkInteractive Visualization of Streaming Data Powered by Spark
Interactive Visualization of Streaming Data Powered by SparkSpark Summit
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibJen Aman
 

Viewers also liked (20)

Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
Big Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the CloudBig Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the Cloud
 
Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow
 
2016 Spark Summit East Keynote: Matei Zaharia
2016 Spark Summit East Keynote: Matei Zaharia2016 Spark Summit East Keynote: Matei Zaharia
2016 Spark Summit East Keynote: Matei Zaharia
 
Airstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At AirbnbAirstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At Airbnb
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
From MapReduce to Apache Spark
From MapReduce to Apache SparkFrom MapReduce to Apache Spark
From MapReduce to Apache Spark
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Huohua: A Distributed Time Series Analysis Framework For Spark
Huohua: A Distributed Time Series Analysis Framework For SparkHuohua: A Distributed Time Series Analysis Framework For Spark
Huohua: A Distributed Time Series Analysis Framework For Spark
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on Mesos
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Interactive Visualization of Streaming Data Powered by Spark
Interactive Visualization of Streaming Data Powered by SparkInteractive Visualization of Streaming Data Powered by Spark
Interactive Visualization of Streaming Data Powered by Spark
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 

Similar to Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0

Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldDatabricks
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Databricks
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Databricks
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksDatabricks
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
 
Introducing Kafka's Streams API
Introducing Kafka's Streams APIIntroducing Kafka's Streams API
Introducing Kafka's Streams APIconfluent
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkIke Ellis
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamDataWorks Summit
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesAmazon Web Services
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Paulo Gutierrez
 
20181003 Whirlwind tour into Pyspark
20181003 Whirlwind tour into Pyspark20181003 Whirlwind tour into Pyspark
20181003 Whirlwind tour into PysparkAndrey Vykhodtsev
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the unionDatabricks
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)James Serra
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Simplilearn
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterPaolo Castagna
 

Similar to Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0 (20)

Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and Weld
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on Databricks
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
 
Introducing Kafka's Streams API
Introducing Kafka's Streams APIIntroducing Kafka's Streams API
Introducing Kafka's Streams API
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache Beam
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
 
20181003 Whirlwind tour into Pyspark
20181003 Whirlwind tour into Pyspark20181003 Whirlwind tour into Pyspark
20181003 Whirlwind tour into Pyspark
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the union
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Recently uploaded

maximum subarray ppt for killing camp students
maximum subarray ppt for killing camp studentsmaximum subarray ppt for killing camp students
maximum subarray ppt for killing camp studentsssuser82c38d
 
Getting Started with Trello for Beginners.pptx
Getting Started with Trello for Beginners.pptxGetting Started with Trello for Beginners.pptx
Getting Started with Trello for Beginners.pptxmavinoikein
 
Sql server types of joins with example.pptx
Sql server types of joins with example.pptxSql server types of joins with example.pptx
Sql server types of joins with example.pptxsameer gaikwad
 
100 TOOLS TO MEASURE AND ANALYSE YOUR DIGITAL MARKETING EFFORTS
100 TOOLS TO MEASURE AND ANALYSE YOUR DIGITAL MARKETING EFFORTS100 TOOLS TO MEASURE AND ANALYSE YOUR DIGITAL MARKETING EFFORTS
100 TOOLS TO MEASURE AND ANALYSE YOUR DIGITAL MARKETING EFFORTSi-engage
 
Les02 Restricting and Sorting Data using SQL.ppt
Les02 Restricting and Sorting Data using SQL.pptLes02 Restricting and Sorting Data using SQL.ppt
Les02 Restricting and Sorting Data using SQL.pptDrZeeshanBhatti
 
Essence of Requirements Engineering: Pragmatic Insights for 2024
Essence of Requirements Engineering: Pragmatic Insights for 2024Essence of Requirements Engineering: Pragmatic Insights for 2024
Essence of Requirements Engineering: Pragmatic Insights for 2024Asher Sterkin
 
Steps to Build a PWA with Odoo.pdf
Steps to Build a PWA with Odoo.pdfSteps to Build a PWA with Odoo.pdf
Steps to Build a PWA with Odoo.pdfayushinwizards
 
App Builder - Hierarchical Data Apps.pptx
App Builder - Hierarchical Data Apps.pptxApp Builder - Hierarchical Data Apps.pptx
App Builder - Hierarchical Data Apps.pptxPoojitha B
 
AUTOKEYUNLOCKER-BRANDS-SUPPORT-STANDARD-VERSION.pdf
AUTOKEYUNLOCKER-BRANDS-SUPPORT-STANDARD-VERSION.pdfAUTOKEYUNLOCKER-BRANDS-SUPPORT-STANDARD-VERSION.pdf
AUTOKEYUNLOCKER-BRANDS-SUPPORT-STANDARD-VERSION.pdfAutokey
 
Self scaling Multi cloud nomad workloads
Self scaling Multi cloud nomad workloadsSelf scaling Multi cloud nomad workloads
Self scaling Multi cloud nomad workloadsBram Vogelaar
 
Embracing Change - The Impact of Generative AI on Strategic Portfolio Management
Embracing Change - The Impact of Generative AI on Strategic Portfolio ManagementEmbracing Change - The Impact of Generative AI on Strategic Portfolio Management
Embracing Change - The Impact of Generative AI on Strategic Portfolio ManagementOnePlan Solutions
 
unit 1 lecture 1 - Introduction - Software Engineering Myths.pdf
unit 1 lecture 1 - Introduction - Software Engineering Myths.pdfunit 1 lecture 1 - Introduction - Software Engineering Myths.pdf
unit 1 lecture 1 - Introduction - Software Engineering Myths.pdfStephenTec
 
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...confluent
 
Enabling Enterprise-wide OT Data access with Matrikon Data Broker.pdf
Enabling Enterprise-wide OT Data access  with Matrikon Data Broker.pdfEnabling Enterprise-wide OT Data access  with Matrikon Data Broker.pdf
Enabling Enterprise-wide OT Data access with Matrikon Data Broker.pdfJohn Archer
 
unit I lecture 3 - Software Process Models.pdf
unit I lecture 3 - Software Process Models.pdfunit I lecture 3 - Software Process Models.pdf
unit I lecture 3 - Software Process Models.pdfStephenTec
 
Open Sprintera (Where Open Source Sparks a Sprint of Possibilities)
Open Sprintera (Where Open Source Sparks a Sprint of Possibilities)Open Sprintera (Where Open Source Sparks a Sprint of Possibilities)
Open Sprintera (Where Open Source Sparks a Sprint of Possibilities)GDSCNiT
 
India's_Generative_AI_Startup_Landscape_Report_2023_Inc42 (1).pdf
India's_Generative_AI_Startup_Landscape_Report_2023_Inc42 (1).pdfIndia's_Generative_AI_Startup_Landscape_Report_2023_Inc42 (1).pdf
India's_Generative_AI_Startup_Landscape_Report_2023_Inc42 (1).pdfgranitesrijan
 
Manual de la Mezcladora SoundCraft Notepad -12Fx
Manual de la Mezcladora SoundCraft Notepad -12FxManual de la Mezcladora SoundCraft Notepad -12Fx
Manual de la Mezcladora SoundCraft Notepad -12Fxjavierdavidvelasco17
 
unit I lecture 4 - AGILE DEVELOPMENT AND PLAN-DRIVEN.pdf
unit I lecture 4 - AGILE DEVELOPMENT AND PLAN-DRIVEN.pdfunit I lecture 4 - AGILE DEVELOPMENT AND PLAN-DRIVEN.pdf
unit I lecture 4 - AGILE DEVELOPMENT AND PLAN-DRIVEN.pdfStephenTec
 
unit I lecture 2 - Software Engineering Ethics - Software Process.pdf
unit I lecture 2 - Software Engineering Ethics - Software Process.pdfunit I lecture 2 - Software Engineering Ethics - Software Process.pdf
unit I lecture 2 - Software Engineering Ethics - Software Process.pdfStephenTec
 

Recently uploaded (20)

maximum subarray ppt for killing camp students
maximum subarray ppt for killing camp studentsmaximum subarray ppt for killing camp students
maximum subarray ppt for killing camp students
 
Getting Started with Trello for Beginners.pptx
Getting Started with Trello for Beginners.pptxGetting Started with Trello for Beginners.pptx
Getting Started with Trello for Beginners.pptx
 
Sql server types of joins with example.pptx
Sql server types of joins with example.pptxSql server types of joins with example.pptx
Sql server types of joins with example.pptx
 
100 TOOLS TO MEASURE AND ANALYSE YOUR DIGITAL MARKETING EFFORTS
100 TOOLS TO MEASURE AND ANALYSE YOUR DIGITAL MARKETING EFFORTS100 TOOLS TO MEASURE AND ANALYSE YOUR DIGITAL MARKETING EFFORTS
100 TOOLS TO MEASURE AND ANALYSE YOUR DIGITAL MARKETING EFFORTS
 
Les02 Restricting and Sorting Data using SQL.ppt
Les02 Restricting and Sorting Data using SQL.pptLes02 Restricting and Sorting Data using SQL.ppt
Les02 Restricting and Sorting Data using SQL.ppt
 
Essence of Requirements Engineering: Pragmatic Insights for 2024
Essence of Requirements Engineering: Pragmatic Insights for 2024Essence of Requirements Engineering: Pragmatic Insights for 2024
Essence of Requirements Engineering: Pragmatic Insights for 2024
 
Steps to Build a PWA with Odoo.pdf
Steps to Build a PWA with Odoo.pdfSteps to Build a PWA with Odoo.pdf
Steps to Build a PWA with Odoo.pdf
 
App Builder - Hierarchical Data Apps.pptx
App Builder - Hierarchical Data Apps.pptxApp Builder - Hierarchical Data Apps.pptx
App Builder - Hierarchical Data Apps.pptx
 
AUTOKEYUNLOCKER-BRANDS-SUPPORT-STANDARD-VERSION.pdf
AUTOKEYUNLOCKER-BRANDS-SUPPORT-STANDARD-VERSION.pdfAUTOKEYUNLOCKER-BRANDS-SUPPORT-STANDARD-VERSION.pdf
AUTOKEYUNLOCKER-BRANDS-SUPPORT-STANDARD-VERSION.pdf
 
Self scaling Multi cloud nomad workloads
Self scaling Multi cloud nomad workloadsSelf scaling Multi cloud nomad workloads
Self scaling Multi cloud nomad workloads
 
Embracing Change - The Impact of Generative AI on Strategic Portfolio Management
Embracing Change - The Impact of Generative AI on Strategic Portfolio ManagementEmbracing Change - The Impact of Generative AI on Strategic Portfolio Management
Embracing Change - The Impact of Generative AI on Strategic Portfolio Management
 
unit 1 lecture 1 - Introduction - Software Engineering Myths.pdf
unit 1 lecture 1 - Introduction - Software Engineering Myths.pdfunit 1 lecture 1 - Introduction - Software Engineering Myths.pdf
unit 1 lecture 1 - Introduction - Software Engineering Myths.pdf
 
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
 
Enabling Enterprise-wide OT Data access with Matrikon Data Broker.pdf
Enabling Enterprise-wide OT Data access  with Matrikon Data Broker.pdfEnabling Enterprise-wide OT Data access  with Matrikon Data Broker.pdf
Enabling Enterprise-wide OT Data access with Matrikon Data Broker.pdf
 
unit I lecture 3 - Software Process Models.pdf
unit I lecture 3 - Software Process Models.pdfunit I lecture 3 - Software Process Models.pdf
unit I lecture 3 - Software Process Models.pdf
 
Open Sprintera (Where Open Source Sparks a Sprint of Possibilities)
Open Sprintera (Where Open Source Sparks a Sprint of Possibilities)Open Sprintera (Where Open Source Sparks a Sprint of Possibilities)
Open Sprintera (Where Open Source Sparks a Sprint of Possibilities)
 
India's_Generative_AI_Startup_Landscape_Report_2023_Inc42 (1).pdf
India's_Generative_AI_Startup_Landscape_Report_2023_Inc42 (1).pdfIndia's_Generative_AI_Startup_Landscape_Report_2023_Inc42 (1).pdf
India's_Generative_AI_Startup_Landscape_Report_2023_Inc42 (1).pdf
 
Manual de la Mezcladora SoundCraft Notepad -12Fx
Manual de la Mezcladora SoundCraft Notepad -12FxManual de la Mezcladora SoundCraft Notepad -12Fx
Manual de la Mezcladora SoundCraft Notepad -12Fx
 
unit I lecture 4 - AGILE DEVELOPMENT AND PLAN-DRIVEN.pdf
unit I lecture 4 - AGILE DEVELOPMENT AND PLAN-DRIVEN.pdfunit I lecture 4 - AGILE DEVELOPMENT AND PLAN-DRIVEN.pdf
unit I lecture 4 - AGILE DEVELOPMENT AND PLAN-DRIVEN.pdf
 
unit I lecture 2 - Software Engineering Ethics - Software Process.pdf
unit I lecture 2 - Software Engineering Ethics - Software Process.pdfunit I lecture 2 - Software Engineering Ethics - Software Process.pdf
unit I lecture 2 - Software Engineering Ethics - Software Process.pdf
 

Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0

  • 2. Apache Spark 2.0 Next major release,coming out this month • Unstable previewrelease at spark.apache.org Remains highly compatible with ApacheSpark 1.X Over 2000 patches from 280 contributors!
  • 3. Apache Spark Philosophy Unified engine Support end-to-end applications High-level APIs Easy to use, rich optimizations Integrate broadly Storage systems, libraries, etc SQLStreaming ML Graph … 1 2 3
  • 4. New in 2.0 Structured API improvements (DataFrame, Dataset, SparkSession) Structured Streaming MLlib model export MLlib R bindings SQL 2003 support Scala 2.12 support Deep learning libraries (Baidu, Yahoo!, Berkeley, Databricks) GraphFrames PyData integration Reactive streams C# bindings:Mobius JS bindings:EclairJS Broader Community Build on common interface of RDDs & DataFrames
  • 5. Deep Dive: Structured APIs events = sc.read.json(“/logs”) stats = events.join(users) .groupBy(“loc”,“status”) .avg(“duration”) errors = stats.where( stats.status == “ERR”) DataFrame API Optimized Plan Specialized Code READ logs READ users JOIN AGG FILTER while(logs.hasNext) { e = logs.next if(e.status == “ERR”) { u = users.get(e.uid) key = (u.loc, e.status) sum(key) += e.duration count(key) += 1 } } ...
  • 6. New in 2.0 Whole-stage code generation • Fuse across multiple operators Spark 1.6 14M rows/s Spark 2.0 125M rows/s Parquet in 1.6 11M rows/s Parquet in 2.0 90M rows/s Optimized input / output • Apache Parquet + built-incache
  • 7. Structured Streaming High-levelstreaming APIbuilt on DataFrames • Eventtime, windowing,sessions,sources& sinks Also supports interactive & batch queries • Aggregate datain a stream,then serve using JDBC • Change queriesat runtime • Build and apply ML models Not just streaming, but “continuous applications”
  • 8. Apache Spark 2.0: Infinite DataFrames Apache Spark 1.X: Static DataFrames Single API Structured Streaming API
  • 9. logs = ctx.read.format("json").open("s3://logs") logs.groupBy(“userid”, “hour”).avg(“latency”) .write.format("jdbc") .save("jdbc:mysql//...") Example: Batch App
  • 10. logs = ctx.read.format("json").stream("s3://logs") logs.groupBy(“userid”, “hour”).avg(“latency”) .write.format("jdbc") .startStream("jdbc:mysql//...") Example: Continuous App
  • 11. More Details in Conference Engine: Structuring Spark, StructuredStreaming, deep dives ML: SparkR, MLlib 2.0, newalgorithms Other: deep learning, GraphFrames, Solr,Cassandra, … Try 2.0-preview at spark.apache.org
  • 12. Growing the Community New initiatives from Databricks
  • 13. The largest challenge in applying big data is the skills gap. StackOverflow Developer Survey 2016
  • 14. Databricks Community Edition Free version of Databricks with: • Interactive tutorials • Apache Spark and popular data science libraries • Visualization& debug tools GA Today! databricks.com/ce
  • 15. Massive Open Online Courses Free 5-course series on big data with Apache Spark dbricks.co/mooc16 Introduction to Apache Spark TM Distributed Machine Learning with Apache Spark TM Big Data Analysis with Apache Spark TM Advanced Apache Spark for Data Science and Data Engineering TM Advanced Machine Learning with Apache Spark TM