New Directions for Spark in 2015
Matei Zaharia
February 20, 2015
What is Apache Spark?
A fast and general engine for big data processing, with libraries for SQL, streaming, and advanced analytics
Most active open source project in big data
About Databricks
Founded by the creators of Spark in 2013
Largest organization contributing to Spark: 3/4 of the code in 2014
End-to-end hosted service: Databricks Cloud
2014: an Amazing Year for Spark
Total contributors: 150 => 500
Lines of code: 190K => 370K
500 active production deployments
Contributors per Month to Spark
[Chart: contributors per month, 2011–2015]
Most active project at Apache
On-Disk Sort Record: Time to Sort 100 TB
2013 record (Hadoop): 2,100 machines, 72 minutes
2014 record (Spark): 207 machines, 23 minutes
About 3x faster with roughly 10x fewer machines.
Source: Daytona GraySort benchmark, sortbenchmark.org
Distributors and Applications
[Slide of logos: Spark distributors and applications built on Spark]
New Directions in 2015
Data Science: high-level interfaces similar to single-machine tools
Platform Interfaces: plug in data sources and algorithms
DataFrames
Similar API to data frames in R and Pandas
Automatically optimized via Spark SQL
Coming in Spark 1.3

df = jsonFile("tweets.json")
df[df["user"] == "matei"] \
  .groupBy("date") \
  .sum("retweets")
[Chart: running time of the same computation in Python, Scala, and the DataFrame API]
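For context, a minimal runnable sketch of this query in the Spark 1.3-era Python API (the SparkContext/SQLContext setup is an assumption not shown on the slide):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="tweets")
sqlContext = SQLContext(sc)

# Load a JSON file into a DataFrame; the schema is inferred automatically
df = sqlContext.jsonFile("tweets.json")

# Filter, group, and aggregate; Spark SQL's optimizer plans the execution
result = (df[df["user"] == "matei"]
          .groupBy("date")
          .sum("retweets"))
result.show()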
R Interface (SparkR)
Arrives in Spark 1.4 (June)
Exposes DataFrames, RDDs, and ML library in R

df = jsonFile("tweets.json")
summarize(
  group_by(
    df[df$user == "matei", ],
    "date"),
  sum("retweets"))
Machine Learning Pipelines
High-level API inspired by scikit-learn
Featurization, evaluation, model tuning

tokenizer = Tokenizer()
tf = HashingTF(numFeatures=1000)
lr = LogisticRegression()
pipe = Pipeline(stages=[tokenizer, tf, lr])
model = pipe.fit(df)
[Diagram: a DataFrame flows through tokenizer → TF → LR to produce a model]
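A fuller usage sketch (hedged: the column parameters, maxIter, and the train/test DataFrames are assumptions beyond the slide; in the released pyspark.ml API each stage also takes input/output column names):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(numFeatures=1000, inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
pipe = Pipeline(stages=[tokenizer, tf, lr])

# trainDF is assumed to have "text" and "label" columns
model = pipe.fit(trainDF)
# transform() appends a "prediction" column to testDF
predictions = model.transform(testDF)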
External Data Sources
Platform API to plug smart
data sources into Spark
Returns DataFrames usable
in Spark apps or SQL
Pushes logic into sources
[Diagram: Spark connected to external data sources, e.g. {JSON}]
Example: a query that joins a MySQL table with a Hive table; the filter is pushed down into MySQL:

SELECT * FROM mysql_users u JOIN hive_logs h
WHERE u.lang = 'en'

Query sent to MySQL: SELECT * FROM users WHERE lang = 'en'
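A hedged PySpark sketch of using such a source (Spark 1.4-style reader API; the JDBC URL, table names, and join key are illustrative assumptions):

# Load the MySQL table through the JDBC data source
users = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://host:3306/mydb",  # hypothetical connection URL
    dbtable="users").load()
users.registerTempTable("mysql_users")

# hive_logs is assumed to already exist in the Hive metastore
en_users = sqlContext.sql("""
    SELECT * FROM mysql_users u JOIN hive_logs h
      ON u.id = h.user_id  -- join key assumed; the slide omits it
    WHERE u.lang = 'en'""")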
Goal: one engine for all data sources,
workloads and environments
To Learn More
Two free massive online courses on Spark: databricks.com/moocs
Try Databricks Cloud: databricks.com
