Spark Summit EU 2015: Matei Zaharia keynote

How Spark Usage is Evolving in 2015
Matei Zaharia
October 28, 2015

A Great Year for Spark
Most active open source project in big data
New language: R
Widespread industry support & adoption

Community Growth
Summit attendees: 1100 (2014), 3900 (2015)
Meetup members: 12K (2014), 42K (2015)
Developers contributing: 350 (2014), 600 (2015)

Meetup Groups: January 2015
source: meetup.com

Meetup Groups: October 2015
source: meetup.com

What Spark Provides
General engine with libraries for many data analysis tasks
Access to diverse data sources
Simple, unified API
SQL, Streaming, ML, Graph, …
Major focus in past 2 years
Data source API added 2015

Databricks Survey
1400 respondents from 840 companies
Three trends:
1) Diverse applications
2) More runtime environments
3) More types of users

Industries Using Spark
[Pie chart of respondents by industry]
Categories: Other; Software (SaaS, Web, Mobile); Consulting (IT); Retail, e-Commerce; Advertising, Marketing, PR; Banking, Finance; Health, Medical, Pharmacy, Biotech; Carriers, Telecommunications; Education; Computers, Hardware
Slice values (not matched to categories): 29.4%, 17.7%, 14.0%, 9.6%, 6.7%, 6.5%, 4.4%, 4.4%, 3.9%, 3.5%

Top Applications
Business Intelligence: 68%
Data Warehousing: 52%
Recommendation: 44%
Log Processing: 40%
User-Facing Services: 36%
Fraud Detection / Security: 29%

Spark Components Used
Spark SQL: 69%
DataFrames: 62%
Spark Streaming: 58%
MLlib + GraphX: 58%
75% of users use more than one component

Diverse Runtime Environments
Hadoop: combined compute + storage
  MapReduce on HDFS
Spark: independent of storage layer
  Spark on HDFS, SQL (e.g. Oracle), or NoSQL (e.g. Cassandra)

[Bar charts, 2014 and 2015. Series: Hadoop, NoSQL, Proprietary SQL. Legend: "Use a little", "Use a lot". Values: 61%, 31%, 46%, 34%, 43%, 36%, 37%, 21%]

How Respondents Are Running Spark
51% on a public cloud
Most common Spark deployment environments (cluster managers):
Standalone mode: 48%
YARN: 40%
Mesos: 11%

Diversity of Users
[Bar charts: Languages Used: 2014 and Languages Used: 2015. Values: 84%, 38%, 38%, 71%, 31%, 58%, 18%]

Fastest Growing Components
+280% increase in Windows users
+56% production use of Streaming
+380% production use of SQL

Are We Done?
No! Development is faster than ever.
Biggest technical change in 2015 was DataFrames
• Moves many computations onto the relational Spark SQL optimizer
Enables both new APIs and more optimization, which is now happening through Project Tungsten

Traditional Spark vs. DataFrames
RDDs:
  User code: Java functions
  Storage: opaque Java objects
DataFrames:
  User code: expressions (DataFrame API, SQL)
  Optimizer
  Storage: schema-aware cache, structured data sources (query pushdown)
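
The contrast in this diagram can be sketched in a few lines of Scala. This is only an illustration, assuming a local Spark setup with the 1.x-style `SQLContext` entry point; the `Person` class, the sample rows, the age threshold, and the object and method names are all hypothetical and do not come from the talk. The same filter is written once as an opaque function on an RDD and once as a column expression on a DataFrame.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type, defined at top level so Spark can derive its schema.
case class Person(name: String, age: Int)

object RddVsDataFrame {
  // RDD style: the predicate is an opaque Scala closure. Spark can call it,
  // but cannot look inside it to optimize or push it down.
  def namesViaRdd(sc: SparkContext, people: Seq[Person]): Seq[String] =
    sc.parallelize(people).filter(p => p.age > 28).map(_.name).collect().sorted.toSeq

  // DataFrame style: the predicate is a column expression, so the Spark SQL
  // optimizer sees which column is read and what it is compared against.
  def namesViaDataFrame(sqlContext: SQLContext, people: Seq[Person]): Seq[String] = {
    import sqlContext.implicits._
    val df = sqlContext.sparkContext.parallelize(people).toDF()
    df.filter($"age" > 28).select("name").collect().map(_.getString(0)).sorted.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally; a real deployment would set the master elsewhere.
    val conf = new SparkConf().setAppName("rdd-vs-dataframe").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val people = Seq(Person("Matei", 30), Person("Ann", 25), Person("Mia", 41))

    println(namesViaRdd(sc, people))
    println(namesViaDataFrame(sqlContext, people))
    sc.stop()
  }
}
```

Both methods return the same names; the difference is in what Spark knows about the computation. Calling `explain(true)` on the filtered DataFrame prints the plans the optimizer produced, which has no equivalent for the RDD closure.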

Coming in Spark 1.6
Dataset API: typed interface over DataFrames / Tungsten
• Common ask from developers who saw DataFrames
case class Person(name: String, age: Int)
val dataframe = read.json("people.json")
val ds: Dataset[Person] = dataframe.as[Person]
ds.filter(p => p.name.startsWith("M"))
  .groupBy("name")
  .avg("age")

Other Upcoming Features
DataFrame integration with GraphX and Streaming
More Tungsten features: faster in-memory cache, SSD storage, better code generation
Data sources for Streaming
See Reynold's talk tomorrow for details
