End-to-End Data Pipelines with Apache Spark
Burak Yavuz
December 27, 2015
Who Am I?
• Software Engineer at Databricks
• MS Management Science & Eng. @ Stanford University
• BS Mechanical Eng. @ Bogazici University, Istanbul
• Contributor to Spark Core, MLlib, SQL, and Streaming
• Maintainer of Spark Packages
Outline
• Intro - Spark & Ecosystem
• Build an End-to-End Data Product
• Step 1: Understand your Data
  • Spark SQL - DataFrames
• Step 2: Build your Service
  • Spark MLlib - ML Pipelines
• Step 3: Monitor your Service
  • Spark Streaming
  • Kafka
Timeline of Spark
• 2010: a research paper
• 2010-13: a project under github/mesos
• 2013-14: Apache Incubator -> top-level project (TLP)
• 2014: the most active project in the ASF
Apache Spark
Spark Ecosystem
• 770 contributors
• 6000+ forks on GitHub
• 14000+ commits!
https://github.com/apache/spark
http://go.databricks.com/hubfs/DataBricks_Surveys_-_Content/Spark-Survey-2015-Infographic.pdf
Spark Packages
http://spark-packages.org
• a community index of 3rd-party packages
• helps users find packages
• helps package developers meet users
• users provide feedback through voting and commenting
• index maintained by Databricks
Types of Packages Currently Available
• Data Source Connectors
  • spark-avro, spark-redshift, spark-mongodb, spark-sequoiadb, spark-cassandra-connector, …
• Deployment Scripts
  • spark_azure, spark_gce, sbt-spark-ec2
• Machine Learning Algorithms
  • spark-hash, spark-mrmr-feature-selection, streaming-matrix-factorization, generalized-kmeans-clustering
• and many more…
What’s new in Spark 1.6
• Dataset API
• Automatic memory configuration
• Optimized state storage in Spark Streaming
• Pipeline persistence in Spark ML
Demo
Source Code: http://brkyvz.github.io/spark-pipeline
Scenario: As an e-commerce company, we would like to recommend
products that users may like in order to increase sales and profit.
Dataset: http://jmcauley.ucsd.edu/data/amazon/
- 18 GB
- 82.83 million reviews
We will use a subset with 24 million reviews
Recommendation Engines
• Finding Similar Items
• Clustering using:
• Metadata
• Matrix Factorization
• Frequent Itemsets
• Ranking
• Rating Prediction using:
• Matrix Factorization
Architecture
[Architecture diagram; labels: Web Service 1, Web Service 2, Web Service 3, Cassandra, Sales Data, Database, Spark, Sales + Ratings, Rating Data, ML Model, Recommendations, Request]
Step 1: Understand your Data
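A first look at the review data could be sketched as follows. This is plain Python rather than Spark SQL, so it runs without a cluster. The field names reviewerID, asin, and overall are assumed from the JSON layout of the Amazon review dataset, and the sample rows are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample records. Field names are assumed from the Amazon
# review dataset (reviewerID, asin, overall); the values are made up.
reviews = [
    {"reviewerID": "u1", "asin": "p1", "overall": 5.0},
    {"reviewerID": "u1", "asin": "p2", "overall": 3.0},
    {"reviewerID": "u2", "asin": "p1", "overall": 4.0},
    {"reviewerID": "u3", "asin": "p3", "overall": 1.0},
    {"reviewerID": "u3", "asin": "p1", "overall": 5.0},
]


def rating_histogram(rows):
    """Count how many reviews gave each star rating."""
    return Counter(row["overall"] for row in rows)


def average_rating_per_product(rows):
    """Return {asin: (review_count, mean_rating)}."""
    totals = defaultdict(lambda: [0, 0.0])
    for row in rows:
        entry = totals[row["asin"]]
        entry[0] += 1               # number of reviews seen for this product
        entry[1] += row["overall"]  # running sum of ratings
    return {asin: (n, s / n) for asin, (n, s) in totals.items()}
```

The same two questions (how are ratings distributed, and which products have enough reviews to be worth modeling) map onto group-by aggregations over a DataFrame.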
Step 2: Build your Service
Solution Proposal
Use Matrix Factorization to understand customers
and items.
Then:
1) Predict the rating for a product for a given user
2) Find similar products, and show top k
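The two uses listed above can be sketched once factor vectors exist. This is a minimal illustration in plain Python, not the demo's code: the user and item names and their two-dimensional factor values are hypothetical, and cosine similarity is one assumed choice of similarity measure among several.

```python
import math

# Hypothetical latent factors, as a matrix factorization might produce.
user_factors = {"alice": [1.0, 0.5], "bob": [0.2, 1.5]}
item_factors = {
    "book": [4.0, 1.0],
    "ebook": [3.9, 1.1],
    "lamp": [0.5, 3.0],
    "desk": [1.0, 2.5],
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def predict_rating(user, item):
    """Predicted rating is the dot product of the user and item factors."""
    return dot(user_factors[user], item_factors[item])


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def top_k_similar(item, k):
    """Rank all other items by cosine similarity to `item`, keep the best k."""
    target = item_factors[item]
    scored = [
        (cosine(target, vec), other)
        for other, vec in item_factors.items()
        if other != item
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]
```

Here `predict_rating("alice", "book")` evaluates to 4.5, and `top_k_similar("book", 2)` returns `["ebook", "desk"]`.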
Matrix Factorization
https://databricks-training.s3.amazonaws.com/slides/Spark_Summit_MLlib_070214_v2.pdf
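To make the idea concrete, here is a small sketch of factorizing a sparse rating matrix into user and item factors. Spark MLlib provides an ALS implementation for this task; the sketch below instead swaps in plain stochastic gradient descent because it fits in a few lines. The sample ratings, the hyperparameters, and the function names are all assumptions for illustration.

```python
import math
import random

# Hypothetical observed ratings: (user, item, rating).
ratings = [
    ("u1", "a", 5.0), ("u1", "b", 3.0),
    ("u2", "a", 4.0), ("u2", "c", 1.0),
    ("u3", "b", 2.0), ("u3", "c", 5.0),
]


def predict(P, Q, user, item):
    """Approximate a rating as the dot product of the two factor vectors."""
    return sum(p * q for p, q in zip(P[user], Q[item]))


def factorize(ratings, n_factors=2, epochs=1000, lr=0.02, reg=0.02, seed=0):
    """Fit user factors P and item factors Q with SGD on observed ratings."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    P = {u: [rng.uniform(0.1, 1.0) for _ in range(n_factors)] for u in users}
    Q = {i: [rng.uniform(0.1, 1.0) for _ in range(n_factors)] for i in items}
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(n_factors):
                p, q = P[u][f], Q[i][f]  # keep old values for both updates
                P[u][f] += lr * (err * q - reg * p)
                Q[i][f] += lr * (err * p - reg * q)
    return P, Q


def rmse(ratings, P, Q):
    """Root mean squared error over the observed ratings."""
    sq = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(sq / len(ratings))
```

Calling `factorize(ratings)` and comparing `rmse` against the result of `factorize(ratings, epochs=0)` shows the training error dropping from its random-initialization value.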
Step 3: Monitor your Service
Apache Kafka
• Distributed messaging system
• High-throughput
• Fast
• Scalable
• Durable
• http://kafka.apache.org/
Architecture
[Architecture diagram; labels: Web Service 1, Web Service 2, Web Service 3, Kafka, Spark Streaming]
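The kind of aggregation a monitoring job might run over events arriving from the web services can be sketched as below. This is plain Python, not the Kafka or Spark Streaming API: an in-memory list stands in for a topic, and the event shape (timestamp, service name), the service names, and the 10-second window are all hypothetical.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, service) events into fixed windows, count per service."""
    counts = defaultdict(lambda: defaultdict(int))
    for timestamp, service in events:
        # Align each event to the start of the window that contains it.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start][service] += 1
    return {start: dict(per_service) for start, per_service in sorted(counts.items())}


# Hypothetical request log: (unix timestamp in seconds, emitting web service).
events = [
    (100, "web-service-1"),
    (103, "web-service-2"),
    (109, "web-service-1"),
    (112, "web-service-3"),
    (118, "web-service-1"),
]

per_window = tumbling_window_counts(events, window_seconds=10)
```

With these events, `per_window` holds two windows: the one starting at 100 has two requests from web-service-1 and one from web-service-2, and the one starting at 110 has one request each from web-service-3 and web-service-1.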
Thank you.
burak@databricks.com
