Introduction to Apache Spark and MLlib

Apache Spark
Pushkar Umaranikar
CS 267
Guided by: Dr. Tran

Outline…
• Why Spark?
• Introduction to Spark
• Spark Programming model
• Installing Spark
• Launching Spark on Amazon EC2
• Example
• Observation
• Introduction to MLlib
• KMeans Algorithm

Why Spark?
• MapReduce became popular for complex multi-stage
algorithms, e.g. machine learning and iterative graph
algorithms.
• Multi-stage and interactive applications
require faster data sharing across parallel jobs.
• Why not run MapReduce in memory?

What is Spark?
• Fast, MapReduce-like engine.
• Uses in-memory cluster computing.
• Compatible with Hadoop's storage APIs.
• Has APIs in Scala, Java, and Python.
• Useful for large datasets and iterative
algorithms.
• Up to 40x faster than Hadoop.
• Support for
Shark – Hive on Spark
MLlib – Machine learning library
GraphX – Graph Processing

History
• Started as a research project at the UC
Berkeley AMPLab in 2009, and was open
sourced in early 2010.
• After being released, Spark grew a developer
community on GitHub and entered Apache in
2013 as its permanent home.
• Codebase size
Spark : 20,000 LOC
Hadoop 1.0 : 90,000 LOC

Spark : Programming Model
• Resilient Distributed Datasets (RDDs) are the basic
building block.
   - Distributed collections of objects that can be cached in memory across
     cluster nodes.
   - Automatically rebuilt on failure.
• RDD operations
   - Transformations: create a new dataset from an existing one, e.g. map.
   - Actions: return a value to the driver program after running a computation
     on the dataset, e.g. reduce.
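
The difference between transformations and actions can be sketched in plain Python. This is only an illustration and not Spark's API: `ToyRDD` is a hypothetical class invented here, and it runs on one machine with no caching or fault recovery. It shows that a transformation such as map only records work, while an action such as reduce runs it and returns a value.

```python
from functools import reduce


class ToyRDD:
    """Hypothetical stand-in for an RDD (not Spark's class)."""

    def __init__(self, data, steps=()):
        self._data = list(data)
        self._steps = tuple(steps)  # pending transformations, not yet applied

    def map(self, fn):
        # Transformation: describes a new dataset; nothing is computed yet.
        return ToyRDD(self._data, self._steps + (fn,))

    def reduce(self, fn):
        # Action: applies the pending transformations, then returns a value.
        items = self._data
        for step in self._steps:
            items = [step(x) for x in items]
        return reduce(fn, items)


calls = []


def square(x):
    calls.append(x)  # record when the function actually runs
    return x * x


squares = ToyRDD([1, 2, 3, 4]).map(square)
ran_before_action = len(calls)  # map alone has not called square
total = squares.reduce(lambda a, b: a + b)
print(ran_before_action, total)
```

Deferring the work until an action is what lets an engine plan a whole chain of transformations before touching the data.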

Data Sharing
• MapReduce
• Spark

Installing Apache Spark
• Get Spark from the downloads page of the Apache
Spark site.
• Go into the top-level Spark directory and run
$ sbt/sbt assembly

Launching Spark on EC2
• Set the environment variables AWS_ACCESS_KEY_ID and
AWS_SECRET_ACCESS_KEY.
• Go into the ec2 directory in the release of Spark you
downloaded.
• Run
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>

Launching Spark on EC2 (2)
-k <keypair> : Name of EC2 key pair
-i <key-file> : Private key of your key pair.
-s <num-slaves> : Number of slaves to launch.
launch <cluster-name> : Name to give your cluster.
-r <ec2-region> : EC2 region in which to launch instances.
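
As a rough sketch of how these options fit together, the helper below assembles the launch command as an argument list. The function name `build_launch_command` and the example values (key pair, key file, cluster name, region) are hypothetical; only the flags listed above are used, and the command is printed rather than executed.

```python
def build_launch_command(keypair, key_file, num_slaves, cluster_name, region=None):
    """Assemble the spark-ec2 launch command from the options on the slide."""
    args = ["./spark-ec2", "-k", keypair, "-i", key_file, "-s", str(num_slaves)]
    if region is not None:
        args += ["-r", region]  # optional EC2 region
    args += ["launch", cluster_name]
    return args


# Placeholder values for illustration only.
cmd = build_launch_command("my-keypair", "my-keypair.pem", 2, "demo-cluster",
                           region="us-west-1")
print(" ".join(cmd))
```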

Spark on EC2
• Terminating a cluster
Run ./spark-ec2 destroy <cluster-name>
• Stopping a cluster
Run ./spark-ec2 stop <cluster-name>
• Restarting a cluster
Run ./spark-ec2 -i <key-file> start <cluster-name>
• Accessing data in S3
s3n://<bucket>/path

Spark example :Word count
• The Spark Java API is defined in the org.apache.spark.api.java package and
includes a JavaSparkContext for initializing Spark.
• Arguments labelled in the slide's code (code image not preserved):
   - Location where Spark is installed
   - Name of your application
   - Collection of JARs to send to the cluster; can be local or from HDFS
   - Cluster URL to connect to

Spark Example : Word count(2)
• Use flatMap to split each line into words on whitespace.
• Map each word to a (word, 1) pair.
• Use reduceByKey to count the occurrences of each word.
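
The three steps can be mimicked in plain Python to show the data flow. This is a single-machine sketch, not the Spark Java code from the slide: `flat_map` and `reduce_by_key` are hypothetical helpers written here to imitate the Spark operations of the same names, and the input lines are made up.

```python
from itertools import chain


def flat_map(fn, items):
    # Apply fn to each item and flatten the resulting lists into one list.
    return list(chain.from_iterable(fn(x) for x in items))


def reduce_by_key(fn, pairs):
    # Combine all values that share a key using fn.
    result = {}
    for key, value in pairs:
        result[key] = fn(result[key], value) if key in result else value
    return result


lines = ["to be or not to be", "that is the question"]

words = flat_map(lambda line: line.split(), lines)   # split on whitespace
pairs = [(word, 1) for word in words]                # (word, 1) pairs
counts = reduce_by_key(lambda a, b: a + b, pairs)    # sum the 1s per word
print(counts)
```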

Word count example : MapReduce

Observation
• Word count program execution time
   - Apache Spark : 13.48s
   - MapReduce : 21.82s
• Spark ran the word count program roughly 1.6x faster than Hadoop MapReduce.

MLlib
• Spark's scalable machine learning library.
• Currently supports
   - Binary classification
   - Regression
   - Clustering
   - Collaborative filtering

KMeans Algorithm – Clustering Example
• Parameters – MLlib implementation
   - Number of clusters (k)
   - Maximum number of iterations to run
   - Epsilon (distance threshold for convergence)
   - Initialization mode (random or k-means||)
   - Initialization steps (number of steps in k-means||)
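
To show how these parameters interact, here is a minimal single-machine k-means sketch in plain Python. It is not MLlib's implementation: the function `kmeans`, its defaults, and the sample points are assumptions made for illustration, and only the "random" initialization mode is sketched (k-means|| is omitted).

```python
import math
import random


def kmeans(points, k, max_iterations=20, epsilon=1e-4, seed=0):
    """Plain k-means: stops after max_iterations or when centers move < epsilon."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # "random" initialization mode
    for _ in range(max_iterations):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its members.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        # Converged when no center moved by epsilon or more.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < epsilon:
            break
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers = sorted(kmeans(points, k=2))
print(centers)
```

A smaller epsilon or a larger iteration cap trades extra passes over the data for centers that have settled more tightly.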

KMeans Example
• Arguments labelled in the slide's code (code image not preserved):
   - Cluster URL to connect to
   - File containing the input matrix
   - Number of clusters
   - Convergence distance

