SlideShare a Scribd company logo
1 of 36
Download to read offline
Apache Spark Release 1.6
Patrick Wendell
About Me @pwendell
U.C. BerkeleyPhD, left to co-found Databricks
Coordinate community roadmap
Frequentreleasemanager for Spark
About Databricks
Founded by Spark team, donated Spark to Apachein 2013 and lead
developmenttoday.
Collaborative, cloud-hosted data platform powered by Spark
Free trial to check it out
https://databricks.com/
We’re hiring!
…
Apache Spark Engine
Spark Core
Spark
Streaming
Spark SQL MLlib GraphX
Unified engineacross diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, and R
APIs
Standard libraries
Users Distributors & Apps
Spark’s 3 Month Release Cycle
For production jobs, use the latest
release
To try out unreleasedfeaturesor
fixes, use nightly builds
people.apache.org/~pwendell/spark-nightly/
master
branch-1.6
V1.6.0 V1.6.1
Spark 1.6
Spark 1.6 Release
Will ship upstreamthrough Apachefoundation in December (likely)
Key themes
Out of the box performance
Previews of key new API’s
Follow along with me at http://bit.ly/1OBkjMM
Follow along
http://bit.ly/1lrvdLc
Memory Management in Spark: <= 1.5
• Two separate memory managers:
• Execution memory: computation of shuffles, joins,sorts, aggregations
• Storage memory: caching and propagating internal data sources across
cluster
• Challengeswith this:
• Manual intervention to avoid unnecessary spilling
• No good defaultsfor all workloads – meaning lost efficiency
• Goal: Allowmemory regionsto shrink/growdynamically
Unified Memory Management in Spark 1.6
• Can cross between execution and storage memory
• When execution memory exceedsits own region,it can borrow as much of the
storage space as isfree and vice versa
• Borrowed storage memory can be evicted at any time
• Significantly reducesconfiguration
• Can define low water mark for storage (below which we won’tevict)
• Reference:[SPARK-10000]
History of Spark API’s
RDD API (2011) Distribute collection of JVM objects
Functional
operators (map, filter, etc)
DataFrame API (2013) Distribute collection of Row objects
Expression-based
operations and UDF’s
Logical plans and
optimizer
Fast/efficient
Dataset
“Encoder”converts from JVM Objectinto
a Dataset Row
Checkout[SPARK-9999]
JVM Object
Dataset
Row
encoder
Dataset API in Spark 1.6
Typed interface over DataFrames / Tungsten
case class Person(name: String, age: Long)
val dataframe = read.json(“people.json”)
val ds: Dataset[Person] = dataframe.as[Person]
ds.filter(p => p.name.startsWith(“M”))
.groupBy($“name”)
.avg(“age”)
Tungsten Execution
PythonSQL R Streaming
DataFrame (& Dataset)
Advanced
Analytics
Other Notable Core Engine Features
SQL directly over files
Advanced JSON parsing
Better instrumentation for SQL operators
Demos of What We Learned So Far
Advanced Layout of Cached Data
Storing partitioning and orderingschemes in In-memory table scan
allowsfor performance improvements: e.g. in Joins, an extra partition step can
be saved based on thisinformation
Adding distributeBy and localSort to DF API
Similar to HiveQL’s DISTRIBUTE BY
allowsthe userto control the partitioning and ordering of a data set
Checkout [SPARK-4849]
[Streaming] New improved state management
Introducing a DStream transformation for stateful streamprocessing
Does notscan every key
Easier to implementcommon use cases
timeout of idle data
returning items otherthan state
SupercedesupdateStateByKey in functionality and performance.
trackStateByKey (note, this name may change)
[Streaming] trackStateByKey example
(name may change)
// Initial RDD input
val initialRDD = ssc.sparkContext.parallelize(...)
// ReceiverInputDStream
val lines = ssc.socketTextStream(...)
val words = lines.flatMap(...)
val wordDStream = words.map(x => (x, 1))
// stateDStream using trackStateByKey
val trackStateFunc = (...) { ... }
val stateDStream =
wordDStream.trackStateByKey(StateSpec.function(trackStateFunc).initialSta
te(initialRDD))
[Streaming] Display the failed output op in Streaming
Checkout:
[SPARK-10885] PR#8950
[MLlib]: Pipeline persistence
Persist ML Pipelinesto:
Save models in the spark.ml API
Re-run workflows in a reproducible manner
Export modelsto non-Sparkapps (e.g., model server)
This is more complexthan ML model persistence because:
Must persist Transformers and Estimators, not just Models.
We need a standard way to persist Params.
Pipelines and other meta-algorithms can contain other Transformers and Estimators,
including as Params.
We should save feature metadata with Models
[MLlib]: Pipeline persistence
Reference[SPARK-6725]
Adding model export/import to
the spark.ml API.
Adding the internal
Saveable/Loadable API and
Parquet-based format
R-like statistics for GLMs
Provide R-like summary
statistics for ordinary least
squares via normal equation
solver
Check out[SPARK-9836]
Performance
SPARK-10000 Unified Memory Management- Shared memory for execution and
caching instead of exclusive division ofthe regions.
SPARK-10917,SPARK-11149 In-memory ColumnarCache Performance - Significant
(up to 14x) speed up when caching data that containscomplextypes in
DataFrames or SQL.
SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring query
execution to occur using off-heap memory to avoid GC overhead
Performance (continued)
SPARK-4849 Advanced Layoutof Cached Data - storing partitioning and ordering
schemesin In-memory table scan, and adding distributeBy and localSort to DF
API
SPARK-9858 Adaptive query execution - Initial supportfor automatically selecting
the number of reducersfor joinsand aggregations.
Spark SQL
SPARK-9999 Dataset API
SPARK-11197 SQL Querieson Files
SPARK-11745 Reading non-standard JSON files
SPARK-10412 Per-operator Metrics for SQL Execution
SPARK-11329 Star (*) expansion forStructTypes
SPARK-11111 Fastnull-safe joins
SPARK-10978 Datasource API Avoid Double Filter
Spark Streaming
API Updates
SPARK-2629 New improved state management
SPARK-11198 Kinesisrecord deaggregation
SPARK-10891 Kinesismessage handlerfunction
SPARK-6328 Python Streaming ListenerAPI
UI Improvements
Made failuresvisible in the streaming tab, in the timelines,batch list, and batch
detailspage.
Made outputoperationsvisible in the streaming tab as progress bars
MLlib: New algorithms / models
SPARK-8518 Survival analysis- Log-linearmodel for survival analysis
SPARK-9834 Normal equation forleast squares - Normal equation solver,providing
R-like model summary statistics
SPARK-3147 Online hypothesistesting - A/B testing in the Spark Streaming
framework
SPARK-9930 New feature transformers - ChiSqSelector, QuantileDiscretizer,SQL
transformer
SPARK-6517 Bisecting K-Meansclustering - Fast top-down clustering variantof K-
Means
MLlib: API Improvements
ML Pipelines
SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with partial coverage of
spark.ml algorithms
SPARK-5565 LDA in ML Pipelines- API for Latent Dirichlet Allocation in ML Pipelines
R API
SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via
summary(model)
SPARK-9681 Feature interactions in R formula -Interaction operator ":"in R formula
Python API - Many improvementsto Python API to approach feature parity
MLlib: Miscellaneous Improvements
SPARK-7685 , SPARK-9642 Instance weightsfor GLMs - Logistic and Linear
Regression can take instance weights
SPARK-10384,SPARK-10385 Univariate and bivariate statistics in DataFrames -
Variance, stddev,correlations, etc.
SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
For More Information
Apache Spark 1.6.0 Release Preview:http://apache-spark-developers-
list.1001551.n3.nabble.com/ANNOUNCE-Spark-1-6-0-Release-Preview-
td15314.html
Spark 1.6 Preview available in Databricks:
https://databricks.com/blog/2015/11/20/announcing-spark-1-6-preview-in-
databricks.html
Notebooks
Spark 1.6 ImprovementsNotebook:
http://cdn2.hubspot.net/hubfs/438089/notebooks/Spark_1.6_Improvements.ht
ml?t=1448929686268
Spark 1.6 R ImprovementsNotebook:
http://cdn2.hubspot.net/hubfs/438089/notebooks/Spark_1.6_R_Improvements.
html?t=1448946977231
Join us at
Spark Summit East
February16-18, 2016 | New York City
Thanks!

More Related Content

What's hot

Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and RDatabricks
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkDatabricks
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...Databricks
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
 
The BDAS Open Source Community
The BDAS Open Source CommunityThe BDAS Open Source Community
The BDAS Open Source Communityjeykottalam
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsMiklos Christine
 
New directions for Apache Spark in 2015
New directions for Apache Spark in 2015New directions for Apache Spark in 2015
New directions for Apache Spark in 2015Databricks
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Databricks
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesDatabricks
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Databricks
 
Operational Tips for Deploying Spark
Operational Tips for Deploying SparkOperational Tips for Deploying Spark
Operational Tips for Deploying SparkDatabricks
 
Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Denny Lee
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Databricks
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and RDatabricks
 

What's hot (20)

Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with Spark
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
The BDAS Open Source Community
The BDAS Open Source CommunityThe BDAS Open Source Community
The BDAS Open Source Community
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
New directions for Apache Spark in 2015
New directions for Apache Spark in 2015New directions for Apache Spark in 2015
New directions for Apache Spark in 2015
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
 
Operational Tips for Deploying Spark
Operational Tips for Deploying SparkOperational Tips for Deploying Spark
Operational Tips for Deploying Spark
 
Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and R
 

Viewers also liked

Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRDatabricks
 
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big DataLegacy Typesafe (now Lightbend)
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Databricks
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 

Viewers also liked (6)

Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkR
 
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
Map reduce vs spark
Map reduce vs sparkMap reduce vs spark
Map reduce vs spark
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 

Similar to Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell

Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonChristian Perone
 
Apache spark 2.4 and beyond
Apache spark 2.4 and beyondApache spark 2.4 and beyond
Apache spark 2.4 and beyondXiao Li
 
What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3DataWorks Summit
 
Intro to apache spark
Intro to apache sparkIntro to apache spark
Intro to apache sparkAmine Sagaama
 
Scalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache SparkScalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache Sparkfelixcss
 
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonDataWorks Summit
 
Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R PackagesCraig Warman
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Simplilearn
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformEva Tse
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data PlatformAmazon Web Services
 
What's New in Spark 2?
What's New in Spark 2?What's New in Spark 2?
What's New in Spark 2?Eyal Ben Ivri
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Databricks
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingDatabricks
 
What's new in Apache Spark 2.4
What's new in Apache Spark 2.4What's new in Apache Spark 2.4
What's new in Apache Spark 2.4boxu42
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overviewKaran Alang
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 

Similar to Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell (20)

Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 
Apache spark 2.4 and beyond
Apache spark 2.4 and beyondApache spark 2.4 and beyond
Apache spark 2.4 and beyond
 
What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3
 
Intro to apache spark
Intro to apache sparkIntro to apache spark
Intro to apache spark
 
Scalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache SparkScalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache Spark
 
AI at Scale
AI at ScaleAI at Scale
AI at Scale
 
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
 
Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R Packages
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
What's New in Spark 2?
What's New in Spark 2?What's New in Spark 2?
What's New in Spark 2?
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
 
What's new in Apache Spark 2.4
What's new in Apache Spark 2.4What's new in Apache Spark 2.4
What's new in Apache Spark 2.4
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 

Recently uploaded (20)

Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 

Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell

  • 1. Apache Spark Release 1.6 Patrick Wendell
  • 2. About Me @pwendell U.C. BerkeleyPhD, left to co-found Databricks Coordinate community roadmap Frequentreleasemanager for Spark
  • 3. About Databricks Founded by Spark team, donated Spark to Apachein 2013 and lead developmenttoday. Collaborative, cloud-hosted data platform powered by Spark Free trial to check it out https://databricks.com/ We’re hiring!
  • 4. … Apache Spark Engine Spark Core Spark Streaming Spark SQL MLlib GraphX Unified engineacross diverse workloads & environments Scale out, fault tolerant Python, Java, Scala, and R APIs Standard libraries
  • 5.
  • 7. Spark’s 3 Month Release Cycle For production jobs, use the latest release To try out unreleasedfeaturesor fixes, use nightly builds people.apache.org/~pwendell/spark-nightly/ master branch-1.6 V1.6.0 V1.6.1
  • 9. Spark 1.6 Release Will ship upstreamthrough Apachefoundation in December (likely) Key themes Out of the box performance Previews of key new API’s Follow along with me at http://bit.ly/1OBkjMM
  • 11. Memory Management in Spark: <= 1.5 • Two separate memory managers: • Execution memory: computation of shuffles, joins,sorts, aggregations • Storage memory: caching and propagating internal data sources across cluster • Challengeswith this: • Manual intervention to avoid unnecessary spilling • No good defaultsfor all workloads – meaning lost efficiency • Goal: Allowmemory regionsto shrink/growdynamically
  • 12. Unified Memory Management in Spark 1.6 • Can cross between execution and storage memory • When execution memory exceedsits own region,it can borrow as much of the storage space as isfree and vice versa • Borrowed storage memory can be evicted at any time • Significantly reducesconfiguration • Can define low water mark for storage (below which we won’tevict) • Reference:[SPARK-10000]
  • 13. History of Spark API’s RDD API (2011) Distribute collection of JVM objects Functional operators (map, filter, etc) DataFrame API (2013) Distribute collection of Row objects Expression-based operations and UDF’s Logical plans and optimizer Fast/efficient
  • 14. Dataset “Encoder”converts from JVM Objectinto a Dataset Row Checkout[SPARK-9999] JVM Object Dataset Row encoder
  • 15. Dataset API in Spark 1.6 Typed interface over DataFrames / Tungsten case class Person(name: String, age: Long) val dataframe = read.json(“people.json”) val ds: Dataset[Person] = dataframe.as[Person] ds.filter(p => p.name.startsWith(“M”)) .groupBy($“name”) .avg(“age”)
  • 16. Tungsten Execution PythonSQL R Streaming DataFrame (& Dataset) Advanced Analytics
  • 17. Other Notable Core Engine Features SQL directly over files Advanced JSON parsing Better instrumentation for SQL operators
  • 18. Demos of What We Learned So Far
  • 19. Advanced Layout of Cached Data Storing partitioning and orderingschemes in In-memory table scan allowsfor performance improvements: e.g. in Joins, an extra partition step can be saved based on thisinformation Adding distributeBy and localSort to DF API Similar to HiveQL’s DISTRIBUTE BY allowsthe userto control the partitioning and ordering of a data set Checkout [SPARK-4849]
  • 20. [Streaming] New improved state management Introducing a DStream transformation for stateful streamprocessing Does notscan every key Easier to implementcommon use cases timeout of idle data returning items otherthan state SupercedesupdateStateByKey in functionality and performance. trackStateByKey (note, this name may change)
  • 21. [Streaming] trackStateByKey example (name may change) // Initial RDD input val initialRDD = ssc.sparkContext.parallelize(...) // ReceiverInputDStream val lines = ssc.socketTextStream(...) val words = lines.flatMap(...) val wordDStream = words.map(x => (x, 1)) // stateDStream using trackStateByKey val trackStateFunc = (...) { ... } val stateDStream = wordDStream.trackStateByKey(StateSpec.function(trackStateFunc).initialSta te(initialRDD))
  • 22. [Streaming] Display the failed output op in Streaming Checkout: [SPARK-10885] PR#8950
  • 23. [MLlib]: Pipeline persistence Persist ML Pipelinesto: Save models in the spark.ml API Re-run workflows in a reproducible manner Export modelsto non-Sparkapps (e.g., model server) This is more complexthan ML model persistence because: Must persist Transformers and Estimators, not just Models. We need a standard way to persist Params. Pipelines and other meta-algorithms can contain other Transformers and Estimators, including as Params. We should save feature metadata with Models
  • 24. [MLlib]: Pipeline persistence Reference[SPARK-6725] Adding model export/import to the spark.ml API. Adding the internal Saveable/Loadable API and Parquet-based format
  • 25. R-like statistics for GLMs Provide R-like summary statistics for ordinary least squares via normal equation solver Check out[SPARK-9836]
  • 26. Performance SPARK-10000 Unified Memory Management- Shared memory for execution and caching instead of exclusive division ofthe regions. SPARK-10917,SPARK-11149 In-memory ColumnarCache Performance - Significant (up to 14x) speed up when caching data that containscomplextypes in DataFrames or SQL. SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring query execution to occur using off-heap memory to avoid GC overhead
  • 27. Performance (continued) SPARK-4849 Advanced Layoutof Cached Data - storing partitioning and ordering schemesin In-memory table scan, and adding distributeBy and localSort to DF API SPARK-9858 Adaptive query execution - Initial supportfor automatically selecting the number of reducersfor joinsand aggregations.
  • 28. Spark SQL SPARK-9999 Dataset API SPARK-11197 SQL Querieson Files SPARK-11745 Reading non-standard JSON files SPARK-10412 Per-operator Metrics for SQL Execution SPARK-11329 Star (*) expansion forStructTypes SPARK-11111 Fastnull-safe joins SPARK-10978 Datasource API Avoid Double Filter
  • 29. Spark Streaming API Updates SPARK-2629 New improved state management SPARK-11198 Kinesisrecord deaggregation SPARK-10891 Kinesismessage handlerfunction SPARK-6328 Python Streaming ListenerAPI UI Improvements Made failuresvisible in the streaming tab, in the timelines,batch list, and batch detailspage. Made outputoperationsvisible in the streaming tab as progress bars
  • 30. MLlib: New algorithms / models SPARK-8518 Survival analysis- Log-linearmodel for survival analysis SPARK-9834 Normal equation forleast squares - Normal equation solver,providing R-like model summary statistics SPARK-3147 Online hypothesistesting - A/B testing in the Spark Streaming framework SPARK-9930 New feature transformers - ChiSqSelector, QuantileDiscretizer,SQL transformer SPARK-6517 Bisecting K-Meansclustering - Fast top-down clustering variantof K- Means
  • 31. MLlib: API Improvements ML Pipelines SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with partial coverage of spark.ml algorithms SPARK-5565 LDA in ML Pipelines- API for Latent Dirichlet Allocation in ML Pipelines R API SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model) SPARK-9681 Feature interactions in R formula -Interaction operator ":"in R formula Python API - Many improvementsto Python API to approach feature parity
  • 32. MLlib: Miscellaneous Improvements SPARK-7685 , SPARK-9642 Instance weightsfor GLMs - Logistic and Linear Regression can take instance weights SPARK-10384,SPARK-10385 Univariate and bivariate statistics in DataFrames - Variance, stddev,correlations, etc. SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
  • 33. For More Information Apache Spark 1.6.0 Release Preview:http://apache-spark-developers- list.1001551.n3.nabble.com/ANNOUNCE-Spark-1-6-0-Release-Preview- td15314.html Spark 1.6 Preview available in Databricks: https://databricks.com/blog/2015/11/20/announcing-spark-1-6-preview-in- databricks.html
  • 34. Notebooks Spark 1.6 ImprovementsNotebook: http://cdn2.hubspot.net/hubfs/438089/notebooks/Spark_1.6_Improvements.ht ml?t=1448929686268 Spark 1.6 R ImprovementsNotebook: http://cdn2.hubspot.net/hubfs/438089/notebooks/Spark_1.6_R_Improvements. html?t=1448946977231
  • 35. Join us at Spark Summit East February16-18, 2016 | New York City