2. What is Apache Spark?
Fast and general engine for big data processing, with libraries for SQL, streaming, and advanced analytics
Most active open source project in big data
3. About Databricks
Founded by the creators of Spark in 2013
Largest organization contributing to Spark – 3/4 of the code in 2014
End-to-end hosted service: Databricks Cloud
4. 2014: an Amazing Year for Spark
Total contributors: 150 => 500
Lines of code: 190K => 370K
500 active production deployments
9. New Directions in 2015
Data Science: high-level interfaces similar to single-machine tools
Platform Interfaces: plug in data sources and algorithms
10. DataFrames
Similar API to data frames in R and Pandas
Automatically optimized via Spark SQL
Coming in Spark 1.3
df = jsonFile("tweets.json")
(df[df["user"] == "matei"]
   .groupBy("date")
   .sum("retweets"))
[Chart: running time of the same query written in Python, in Scala, and with DataFrames]
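The query above is a filtered group-by aggregation. As a hedged sketch of what it computes, here is the same logic in plain Python (a hypothetical `tweets` sample stands in for tweets.json; no Spark required):

```python
from collections import defaultdict

# Hypothetical sample records standing in for tweets.json
tweets = [
    {"user": "matei", "date": "2015-01-01", "retweets": 3},
    {"user": "matei", "date": "2015-01-01", "retweets": 2},
    {"user": "matei", "date": "2015-01-02", "retweets": 7},
    {"user": "ali",   "date": "2015-01-01", "retweets": 9},
]

# Equivalent of: df[df["user"] == "matei"].groupBy("date").sum("retweets")
sums = defaultdict(int)
for t in tweets:
    if t["user"] == "matei":              # filter
        sums[t["date"]] += t["retweets"]  # group-by sum

print(dict(sums))  # {'2015-01-01': 5, '2015-01-02': 7}
```

With DataFrames, Spark SQL sees this whole expression as a logical plan and can optimize it before execution, which is why the DataFrame version closes the gap between Python and Scala in the chart.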
11. R Interface (SparkR)
Arrives in Spark 1.4 (June)
Exposes DataFrames, RDDs, and the ML library in R
df = jsonFile("tweets.json")
summarize(
  group_by(
    df[df$user == "matei", ],
    "date"),
  sum("retweets"))
12. Machine Learning Pipelines
High-level API inspired by scikit-learn
Featurization, evaluation, and model tuning
tokenizer = Tokenizer()
tf = HashingTF(numFeatures=1000)
lr = LogisticRegression()
pipe = Pipeline(stages=[tokenizer, tf, lr])
model = pipe.fit(df)
[Diagram: DataFrame → tokenizer → TF → LR → model]
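The pipeline above is chained fit/transform stages: each stage is fitted, its output is fed to the next. A toy sketch of that pattern in plain Python (hypothetical stage classes, not the Spark ML API):

```python
class Tokenizer:
    """Toy stage: splits strings into token lists."""
    def fit(self, data):
        return self
    def transform(self, data):
        return [s.split() for s in data]

class HashingTF:
    """Toy stage: hashes tokens into fixed-size count vectors."""
    def __init__(self, num_features=1000):
        self.num_features = num_features
    def fit(self, data):
        return self
    def transform(self, data):
        out = []
        for tokens in data:
            vec = [0] * self.num_features
            for tok in tokens:
                vec[hash(tok) % self.num_features] += 1
            out.append(vec)
        return out

class Pipeline:
    """Fits each stage in order, feeding each stage's output forward."""
    def __init__(self, stages):
        self.stages = stages
    def fit_transform(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data

docs = ["spark is fast", "spark is general"]
vectors = Pipeline([Tokenizer(), HashingTF(num_features=16)]).fit_transform(docs)
```

In the real API, `Pipeline.fit` returns a fitted model you can reuse on new DataFrames; this sketch collapses that into a single `fit_transform` pass to keep the chaining idea visible.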
13. External Data Sources
Platform API to plug smart data sources into Spark
Returns DataFrames usable in Spark apps or SQL
Pushes logic into sources
[Diagram: Spark connected to external data sources, including JSON]
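"Pushes logic into sources" means the source, not Spark, evaluates eligible filters, so only matching rows cross the wire. A toy sketch of that idea (a hypothetical `scan(filters)` interface; the real Data Sources API differs):

```python
class ToyMySQLSource:
    """Pretend remote table; scan() applies pushed-down filters itself."""
    def __init__(self, rows):
        self.rows = rows

    def scan(self, filters):
        # The source evaluates the filters, so only matching rows
        # ever travel back to the engine.
        return [r for r in self.rows if all(f(r) for f in filters)]

users = ToyMySQLSource([
    {"name": "matei", "lang": "en"},
    {"name": "ali",   "lang": "en"},
    {"name": "yuki",  "lang": "ja"},
])

# Equivalent of pushing WHERE lang = 'en' down to the source
english = users.scan([lambda r: r["lang"] == "en"])
print([r["name"] for r in english])  # ['matei', 'ali']
```

The next slide shows the same idea in SQL terms: Spark keeps the join, and the filter travels down to MySQL.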
14. External Data Sources (continued)
SELECT * FROM mysql_users u JOIN hive_logs h
WHERE u.lang = 'en'
[Diagram: Spark joins the MySQL and Hive sources; the filter is pushed down to MySQL as SELECT * FROM users WHERE lang = 'en']