Spark Community Update - Spark Summit San Francisco 2015

Databricks
DatabricksDeveloper Marketing and Relations at MuleSoft
Spark Community Update
Matei Zaharia & Patrick Wendell
June 15th, 2015
A Great Year for Spark
Most active open source project in data processing
New language: R
Many new features & community projects
Community Growth
June 2014 June 2015
total contributors 255 730
contributors/month 75 135
lines of code 175,000 400,000
Community Growth
June 2014 June 2015
total contributors 255 730
contributors/month 75 135
lines of code 175,000 400,000
Mostly in libraries
Users
1000+ companies
…
Distributors + Apps
50+ companies
…
Large-Scale Usage
Largest cluster: 8000 nodes
Largest single job: 1 petabyte
Top streaming intake: 1 TB/hour
2014 on-disk sort record
Open Source
Ecosystem
Applications
Environments Data Sources
Current Spark Components
Spark Core
Spark
Streaming
Spark SQL MLlib GraphX
Unified engine across diverse workloads & environments
Major Directions in 2015
Data Science
Similar interfaces to
single-node tools
Platform APIs
Growing the ecosystem
Data Science
DataFrames: popular API for data transformation
Machine Learning Pipelines: inspired by scikit-learn
R Language
1.3
1.4
1.4
Platform APIs
{JSON}
Data Sources
•  Uniform interface to diverse
sources (DataFrames + SQL)
Spark Packages
•  Community site with 70+ libraries
•  spark-packages.org
…
Spark
Your app
Ongoing Engine Improvements
Project Tungsten
•  Code generation, binary processing, off-heap memory
DAG visualization & debugging tools
Spark’s 1.4 Release
R Language Support
14	
  
R API based on Spark’s
DataFrames
An R Runtime for Big Data
15	
  
Spark’s scale
Thousands of machines and
cores
Spark’s performance
Runtime optimizer, code
generation, memory
management
Access to Spark’s I/O Packages
#	
  Dataframe	
  from	
  JSON	
  
>	
  people	
  <-­‐	
  read.df("people.json",	
  "json")	
  
	
  
#	
  ...	
  from	
  MySQL	
  
>	
  people	
  <-­‐	
  read.df("jdbc:mysql://sql01",	
  "jdbc")	
  
	
  
#	
  ...	
  from	
  Hive	
  
>	
  people	
  <-­‐	
  read.table("orders")	
  
16	
  
CSV ElasticSearch
Avro Cassandra
Parquet MongoDB
SequoiaDB HBase
ML Pipelines
17	
  
Data
Frame
tokenizer hashingTF lr
Pipeline Model	
  
//	
  create	
  pipeline	
  
tok	
  =	
  Tokenizer(in="text",	
  out="words”)	
  
tf	
  =	
  HashingTF(in=“words”,	
  out="features”)	
  
lr	
  =	
  LogisticRegression(maxIter=10,	
  regParam=0.01)	
  
pipeline	
  =	
  Pipeline(stages=[tok,	
  tf,	
  lr])	
  
Data
Frame
//	
  train	
  pipeline	
  
df	
  =	
  sqlCtx.table(”training”)	
  
model	
  =	
  pipeline.fit(df)	
  
	
  
//	
  make	
  predictions	
  
df	
  =	
  sqlCtx.read.json("/path/to/test”)	
  
model.transform(df)	
  
	
  	
  .select(“id”,	
  “text”,	
  “prediction”)	
  
	
  
18	
  
ML Pipelines
Stable API with hooks for third party pipeline components
Feature transformers New algorithms
VectorAssembler
String/VectorIndexer
OneHotEncoder
PolynomialExpansion
….
GLM with elastic-net
Tree classifiers
Tree regressors
OneVsRest
…
19	
  
Performance and Project Tungsten
Managed memory for aggregations
Memory efficient shuffles
1.3 1.4
Customer Pipeline
Latency
What’s Coming in Spark 1.5+?
Project Tungsten: Code generation, improved sort + aggregation
Spark Streaming: Flow control, optimized state management
ML: Single machine solvers, scalability to many features
SparkR: Integration with Spark’s machine learning APIs
20	
  
Join Us Today at Office Hours!
21	
  
Area
1:00-1:45 Spark Core, YARN
Spark Streaming
1:45-2:30 Spark SQL
3:00-3:40 Spark Ops
3:40-4:15 Spark SQL
4:30-5:15 Spark Core, PySpark
Spark MLlib
5:15-6:00 Spark MLlib
Databricks booth (A1)
More tomorrow…
Thanks!
1 of 22

Recommended

Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ... by
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Spark Summit
1.9K views29 slides
End-to-end Data Pipeline with Apache Spark by
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
6.9K views25 slides
ETL with SPARK - First Spark London meetup by
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
30.7K views48 slides
A Journey into Databricks' Pipelines: Journey and Lessons Learned by
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedDatabricks
3K views30 slides
Spark's Role in the Big Data Ecosystem (Spark Summit 2014) by
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Databricks
3.6K views20 slides
A Developer’s View into Spark's Memory Model with Wenchen Fan by
A Developer’s View into Spark's Memory Model with Wenchen FanA Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen FanDatabricks
1.5K views49 slides

More Related Content

What's hot

Writing Continuous Applications with Structured Streaming Python APIs in Apac... by
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Databricks
4.2K views49 slides
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi... by
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Databricks
900 views122 slides
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ... by
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Databricks
2.8K views30 slides
Strata NYC 2015: What's new in Spark Streaming by
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingDatabricks
4K views48 slides
Spark what's new what's coming by
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
4.8K views33 slides
Spark Application Carousel: Highlights of Several Applications Built with Spark by
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkDatabricks
2.1K views29 slides

What's hot(20)

Writing Continuous Applications with Structured Streaming Python APIs in Apac... by Databricks
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Databricks4.2K views
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi... by Databricks
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Databricks900 views
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ... by Databricks
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Databricks2.8K views
Strata NYC 2015: What's new in Spark Streaming by Databricks
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
Databricks4K views
Spark what's new what's coming by Databricks
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
Databricks4.8K views
Spark Application Carousel: Highlights of Several Applications Built with Spark by Databricks
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with Spark
Databricks2.1K views
Spark Under the Hood - Meetup @ Data Science London by Databricks
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science London
Databricks2.5K views
Unified Big Data Processing with Apache Spark (QCON 2014) by Databricks
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks3.7K views
Apache Spark Performance: Past, Future and Present by Databricks
Apache Spark Performance: Past, Future and PresentApache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and Present
Databricks2.1K views
Extreme Apache Spark: how in 3 months we created a pipeline that can process ... by Josef A. Habdank
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Josef A. Habdank25.4K views
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette by Spark Summit
Time Series Analytics with Spark: Spark Summit East talk by Simon OuelletteTime Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Spark Summit5K views
Spark streaming state of the union by Databricks
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the union
Databricks3.9K views
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu by Databricks
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan ZhuBuilding a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Databricks7.3K views
SMACK Stack 1.1 by Joe Stein
SMACK Stack 1.1SMACK Stack 1.1
SMACK Stack 1.1
Joe Stein5.7K views
Combining Machine Learning Frameworks with Apache Spark by Databricks
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
Databricks6.9K views
Processing Large Data with Apache Spark -- HasGeek by Venkata Naga Ravi
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi12.4K views
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara... by Databricks
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Databricks3.8K views
Intro to Spark development by Spark Summit
 Intro to Spark development  Intro to Spark development
Intro to Spark development
Spark Summit10K views
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng by Databricks
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Databricks1.2K views
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ... by Brian O'Neill
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Brian O'Neill2.3K views

Viewers also liked

Spark Summit 2015 keynote: Making Big Data Simple with Spark by
Spark Summit 2015 keynote: Making Big Data Simple with SparkSpark Summit 2015 keynote: Making Big Data Simple with Spark
Spark Summit 2015 keynote: Making Big Data Simple with SparkDatabricks
7.6K views14 slides
Karakteristik transistor_fajar priyambada by
Karakteristik transistor_fajar priyambadaKarakteristik transistor_fajar priyambada
Karakteristik transistor_fajar priyambadafajar_priyambada
280 views23 slides
Go to the market by
Go to the marketGo to the market
Go to the marketProFAX
485 views10 slides
Experian Consumer Holiday Shopping Survey by
Experian Consumer Holiday Shopping SurveyExperian Consumer Holiday Shopping Survey
Experian Consumer Holiday Shopping SurveyExperian_US
3.6K views21 slides
Sessao 8a by
Sessao 8aSessao 8a
Sessao 8aFilipaNeves
377 views1 slide
Silhouette Photography Slideshow by
Silhouette Photography SlideshowSilhouette Photography Slideshow
Silhouette Photography Slideshowlhuotari
1K views10 slides

Viewers also liked(20)

Spark Summit 2015 keynote: Making Big Data Simple with Spark by Databricks
Spark Summit 2015 keynote: Making Big Data Simple with SparkSpark Summit 2015 keynote: Making Big Data Simple with Spark
Spark Summit 2015 keynote: Making Big Data Simple with Spark
Databricks7.6K views
Karakteristik transistor_fajar priyambada by fajar_priyambada
Karakteristik transistor_fajar priyambadaKarakteristik transistor_fajar priyambada
Karakteristik transistor_fajar priyambada
fajar_priyambada280 views
Go to the market by ProFAX
Go to the marketGo to the market
Go to the market
ProFAX485 views
Experian Consumer Holiday Shopping Survey by Experian_US
Experian Consumer Holiday Shopping SurveyExperian Consumer Holiday Shopping Survey
Experian Consumer Holiday Shopping Survey
Experian_US3.6K views
Silhouette Photography Slideshow by lhuotari
Silhouette Photography SlideshowSilhouette Photography Slideshow
Silhouette Photography Slideshow
lhuotari1K views
academic / small company collaborations for rare and neglected diseasesv2 by Sean Ekins
 academic / small company collaborations for rare and neglected diseasesv2 academic / small company collaborations for rare and neglected diseasesv2
academic / small company collaborations for rare and neglected diseasesv2
Sean Ekins663 views
Connected commuter high fashion by JCDecauxUK
Connected commuter high fashionConnected commuter high fashion
Connected commuter high fashion
JCDecauxUK305 views
Pover jaar voor firma Bartel Van Riet by Thierry Debels
Pover jaar voor firma Bartel Van RietPover jaar voor firma Bartel Van Riet
Pover jaar voor firma Bartel Van Riet
Thierry Debels485 views
Presentacion 1computación by Nona Muñoz
Presentacion 1computaciónPresentacion 1computación
Presentacion 1computación
Nona Muñoz251 views
テストあああるらしいすよ by roap_jp
テストあああるらしいすよテストあああるらしいすよ
テストあああるらしいすよ
roap_jp788 views
Herding Cats 101 Be a Project Management Rockstar by TorranceLearning
Herding Cats 101 Be a Project Management RockstarHerding Cats 101 Be a Project Management Rockstar
Herding Cats 101 Be a Project Management Rockstar
TorranceLearning1K views
Cuestionario de detección de rasgos autistas en el entorno escolar by olgapluna
Cuestionario de detección de rasgos autistas en el entorno escolarCuestionario de detección de rasgos autistas en el entorno escolar
Cuestionario de detección de rasgos autistas en el entorno escolar
olgapluna1.1K views
5030 Digitalstorytelling Mediatypes Storyboarding Feb 05 09ppt by Neil Foote
5030 Digitalstorytelling Mediatypes Storyboarding Feb 05 09ppt5030 Digitalstorytelling Mediatypes Storyboarding Feb 05 09ppt
5030 Digitalstorytelling Mediatypes Storyboarding Feb 05 09ppt
Neil Foote461 views

Similar to Spark Community Update - Spark Summit San Francisco 2015

Dev Ops Training by
Dev Ops TrainingDev Ops Training
Dev Ops TrainingSpark Summit
8.6K views301 slides
New directions for Apache Spark in 2015 by
New directions for Apache Spark in 2015New directions for Apache Spark in 2015
New directions for Apache Spark in 2015Databricks
11.7K views16 slides
Unified Big Data Processing with Apache Spark by
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
2K views55 slides
20170126 big data processing by
20170126 big data processing20170126 big data processing
20170126 big data processingVienna Data Science Group
1.3K views76 slides
Composable Parallel Processing in Apache Spark and Weld by
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldDatabricks
3.6K views47 slides
New Directions for Spark in 2015 - Spark Summit East by
New Directions for Spark in 2015 - Spark Summit EastNew Directions for Spark in 2015 - Spark Summit East
New Directions for Spark in 2015 - Spark Summit EastDatabricks
3.2K views19 slides

Similar to Spark Community Update - Spark Summit San Francisco 2015(20)

New directions for Apache Spark in 2015 by Databricks
New directions for Apache Spark in 2015New directions for Apache Spark in 2015
New directions for Apache Spark in 2015
Databricks11.7K views
Unified Big Data Processing with Apache Spark by C4Media
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media2K views
Composable Parallel Processing in Apache Spark and Weld by Databricks
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and Weld
Databricks3.6K views
New Directions for Spark in 2015 - Spark Summit East by Databricks
New Directions for Spark in 2015 - Spark Summit EastNew Directions for Spark in 2015 - Spark Summit East
New Directions for Spark in 2015 - Spark Summit East
Databricks3.2K views
Jump Start with Apache Spark 2.0 on Databricks by Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
Databricks2.8K views
Tapping into Scientific Data with Hadoop and Flink by Michael Häusler
Tapping into Scientific Data with Hadoop and FlinkTapping into Scientific Data with Hadoop and Flink
Tapping into Scientific Data with Hadoop and Flink
Michael Häusler263 views
State of the Semantic Web by Ivan Herman
State of the Semantic WebState of the Semantic Web
State of the Semantic Web
Ivan Herman923 views
A look ahead at spark 2.0 by Databricks
A look ahead at spark 2.0 A look ahead at spark 2.0
A look ahead at spark 2.0
Databricks2.1K views
Why apache Flink is the 4G of Big Data Analytics Frameworks by Slim Baltagi
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
Slim Baltagi30.7K views
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S... by Gezim Sejdiu
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
Gezim Sejdiu391 views
Intro to Spark and Spark SQL by jeykottalam
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
jeykottalam51.2K views
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics... by Debraj GuhaThakurta
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ... by Debraj GuhaThakurta
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
Apache spark-melbourne-april-2015-meetup by Ned Shawa
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
Ned Shawa1.1K views
Introduction to apache kafka, confluent and why they matter by Paolo Castagna
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
Paolo Castagna1.7K views
Congressional PageRank: Graph Analytics of US Congress With Neo4j by William Lyon
Congressional PageRank: Graph Analytics of US Congress With Neo4jCongressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4j
William Lyon1.4K views

More from Databricks

DW Migration Webinar-March 2022.pptx by
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
4.3K views25 slides
Data Lakehouse Symposium | Day 1 | Part 1 by
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
1.5K views43 slides
Data Lakehouse Symposium | Day 1 | Part 2 by
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
741 views16 slides
Data Lakehouse Symposium | Day 4 by
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
1.8K views74 slides
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop by
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
6.3K views64 slides
Democratizing Data Quality Through a Centralized Platform by
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
1.4K views36 slides

More from Databricks(20)

DW Migration Webinar-March 2022.pptx by Databricks
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks4.3K views
Data Lakehouse Symposium | Day 1 | Part 1 by Databricks
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks1.5K views
Data Lakehouse Symposium | Day 1 | Part 2 by Databricks
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks741 views
Data Lakehouse Symposium | Day 4 by Databricks
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks1.8K views
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop by Databricks
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks6.3K views
Democratizing Data Quality Through a Centralized Platform by Databricks
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks1.4K views
Learn to Use Databricks for Data Science by Databricks
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks1.6K views
Why APM Is Not the Same As ML Monitoring by Databricks
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks743 views
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix by Databricks
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks689 views
Stage Level Scheduling Improving Big Data and AI Integration by Databricks
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks850 views
Simplify Data Conversion from Spark to TensorFlow and PyTorch by Databricks
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks1.8K views
Scaling your Data Pipelines with Apache Spark on Kubernetes by Databricks
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks2.1K views
Scaling and Unifying SciKit Learn and Apache Spark Pipelines by Databricks
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks667 views
Sawtooth Windows for Feature Aggregations by Databricks
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks605 views
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink by Databricks
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks677 views
Re-imagine Data Monitoring with whylogs and Spark by Databricks
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks550 views
Raven: End-to-end Optimization of ML Prediction Queries by Databricks
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks450 views
Processing Large Datasets for ADAS Applications using Apache Spark by Databricks
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks513 views
Massive Data Processing in Adobe Using Delta Lake by Databricks
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks719 views
Machine Learning CI/CD for Email Attack Detection by Databricks
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks389 views

Spark Community Update - Spark Summit San Francisco 2015

  • 1. Spark Community Update Matei Zaharia & Patrick Wendell June 15th, 2015
  • 2. A Great Year for Spark Most active open source project in data processing New language: R Many new features & community projects
  • 3. Community Growth June 2014 June 2015 total contributors 255 730 contributors/month 75 135 lines of code 175,000 400,000
  • 4. Community Growth June 2014 June 2015 total contributors 255 730 contributors/month 75 135 lines of code 175,000 400,000 Mostly in libraries
  • 6. Large-Scale Usage Largest cluster: 8000 nodes Largest single job: 1 petabyte Top streaming intake: 1 TB/hour 2014 on-disk sort record
  • 8. Current Spark Components Spark Core Spark Streaming Spark SQL MLlib GraphX Unified engine across diverse workloads & environments
  • 9. Major Directions in 2015 Data Science Similar interfaces to single-node tools Platform APIs Growing the ecosystem
  • 10. Data Science DataFrames: popular API for data transformation Machine Learning Pipelines: inspired by scikit-learn R Language 1.3 1.4 1.4
  • 11. Platform APIs {JSON} Data Sources •  Uniform interface to diverse sources (DataFrames + SQL) Spark Packages •  Community site with 70+ libraries •  spark-packages.org … Spark Your app
  • 12. Ongoing Engine Improvements Project Tungsten •  Code generation, binary processing, off-heap memory DAG visualization & debugging tools
  • 14. R Language Support 14   R API based on Spark’s DataFrames
  • 15. An R Runtime for Big Data 15   Spark’s scale Thousands of machines and cores Spark’s performance Runtime optimizer, code generation, memory management
  • 16. Access to Spark’s I/O Packages #  Dataframe  from  JSON   >  people  <-­‐  read.df("people.json",  "json")     #  ...  from  MySQL   >  people  <-­‐  read.df("jdbc:mysql://sql01",  "jdbc")     #  ...  from  Hive   >  people  <-­‐  read.table("orders")   16   CSV ElasticSearch Avro Cassandra Parquet MongoDB SequoiaDB HBase
  • 17. ML Pipelines 17   Data Frame tokenizer hashingTF lr Pipeline Model   //  create  pipeline   tok  =  Tokenizer(in="text",  out="words”)   tf  =  HashingTF(in=“words”,  out="features”)   lr  =  LogisticRegression(maxIter=10,  regParam=0.01)   pipeline  =  Pipeline(stages=[tok,  tf,  lr])   Data Frame //  train  pipeline   df  =  sqlCtx.table(”training”)   model  =  pipeline.fit(df)     //  make  predictions   df  =  sqlCtx.read.json("/path/to/test”)   model.transform(df)      .select(“id”,  “text”,  “prediction”)    
  • 18. 18   ML Pipelines Stable API with hooks for third party pipeline components Feature transformers New algorithms VectorAssembler String/VectorIndexer OneHotEncoder PolynomialExpansion …. GLM with elastic-net Tree classifiers Tree regressors OneVsRest …
  • 19. 19   Performance and Project Tungsten Managed memory for aggregations Memory efficient shuffles 1.3 1.4 Customer Pipeline Latency
  • 20. What’s Coming in Spark 1.5+? Project Tungsten: Code generation, improved sort + aggregation Spark Streaming: Flow control, optimized state management ML: Single machine solvers, scalability to many features SparkR: Integration with Spark’s machine learning APIs 20  
  • 21. Join Us Today at Office Hours! 21   Area 1:00-1:45 Spark Core, YARN Spark Streaming 1:45-2:30 Spark SQL 3:00-3:40 Spark Ops 3:40-4:15 Spark SQL 4:30-5:15 Spark Core, PySpark Spark MLlib 5:15-6:00 Spark MLlib Databricks booth (A1) More tomorrow…