How Spark Usage is
Evolving in 2015
Matei Zaharia
October 28,2015
A Great Year for Spark
Most active open source projectin big data
New language: R
Widespread industry support& adoption
Community Growth
2014 2015
Summit
Attendees
2014 2015
Meetup
Members
2014 2015
Developers
Contributing
3900
1100
42K
12K
350
600
Meetup Groups: January 2015
source: meetup.com
Meetup Groups: October 2015
source: meetup.com
What Spark Provides
Generalenginewith libraries for
many data analysis tasks
Accessto diverse data sources
Simple, unified API
SQLStreaming ML Graph
…
Major focus in past 2 years
Data sourceAPI added 2015
What Changed in 2015?
Databricks Survey
1400 respondentsfrom 840 companies
Threetrends:
1) Diverse applications
2) More runtime environments
3) More types of users
Industries Using Spark
Other
Software
(SaaS, Web, Mobile)
Consulting (IT)
Retail,
e-Commerce
Advertising,
Marketing, PR
Banking, Finance
Health, Medical,
Pharmacy, Biotech
Carriers,
Telecommunications
Education
Computers, Hardware
29.4%
17.7%
14.0%
9.6%
6.7%
6.5%
4.4%
4.4%
3.9%
3.5%
Top Applications
29%
36%
40%
44%
52%
68%
Faud Detection / Security
User-Facing Services
Log Processing
Recommendation
Data Warehousing
BusinessIntelligence
Spark Components Used
58%
58%
62%
69%
MLlib + GraphX
Spark Streaming
DataFrames
Spark SQL
75%
of users use more
than one component
Diverse Runtime Environments
Hadoop: combined
compute + storage
HDFS
MapReduce
Spark: independent
of storage layer
Spark
HDFS SQL
e.g. Oracle
NoSQL
e.g. Cassandra
Diverse Runtime Environments
2014 2015
Hadoop
Use a
little
Use a
lot
Hadoop
61%
31%
NoSQL Proprietary
SQL
46%
34%
43%
36% 37%
21%
Diverse Runtime Environments
HOW RESPONDENTS ARE
RUNNING SPARK
51%
on a public cloud
MOST COMMON SPARK DEPLOYMENT
ENVIRONMENTS (CLUSTER MANAGERS)
48%
40%
11%
Standalone mode YARN Mesos
Cluster Managers
Diversity of Users
84%
38% 38%
71%
31%
58%
18%
LanguagesUsed: 2014 LanguagesUsed: 2015
Fastest Growing Components
+280%
increase in
Windowsusers
+56%
production use
of Streaming
+380%
production
use of SQL
Are We Done?
No! Development is faster than ever.
Biggest technical changein 2015 was DataFrames
• Movesmany computationsonto the relational Spark SQL optimizer
Enables both newAPIs and more optimization, which is now
happening throughProject Tungsten
Traditional Spark DataFrames
RDDs DataFrames
Opaque
Java
objects
User code
Storage
DataFrame API SQL
Schema-
aware
cache
Structured
data sources
Java functions Expressions
Optimizer
Query pushdown
Coming in Spark 1.6
Dataset API: typed interface over DataFrames / Tungsten
• Common ask from developerswho saw DataFrames
case class Person(name: String, age: Int)
val dataframe = read.json(“people.json”)
val ds: Dataset[Person] = dataframe.as[Person]
ds.filter(p => p.name.startsWith(“M”))
.groupBy(“name”)
.avg(“age”)
Other Upcoming Features
DataFrame integration with GraphXand Streaming
More Tungstenfeatures: faster in-memory cache,SSD storage,
better code generation
Data sourcesfor Streaming
See Reynold’s talk tomorrow for details
Dank je!
Enjoy Spark Summit

Spark Summit EU 2015: Matei Zaharia keynote