Guided by: Dr. Tran
• Why Spark?
• Introduction to Spark
• Spark Programming Model
• Installing Spark
• Launching Spark on Amazon EC2
• Introduction to MLlib
• K-Means Algorithm
• MapReduce became popular for complex multi-stage algorithms,
e.g. machine learning and iterative graph processing.
• Multi-stage and interactive applications
require faster data sharing across parallel jobs.
• Why not run MapReduce in memory?
What is Spark?
• A fast, MapReduce-like engine.
• Uses in-memory cluster computing.
• Compatible with Hadoop's storage APIs.
• Provides APIs in Scala, Java, and Python.
• Useful for large datasets and iterative algorithms.
• Up to 40x faster than Hadoop.
• Support for:
Shark – Hive on Spark
MLlib – Machine learning library
GraphX – Graph processing
• Started as a research project at the UC
Berkeley AMPLab in 2009, and was open
sourced in early 2010.
• After its release, Spark grew a developer
community on GitHub and moved to Apache,
its permanent home, in 2013.
• Codebase size:
Spark: 20,000 LOC
Hadoop 1.0: 90,000 LOC
Spark: Programming Model
• Resilient Distributed Datasets (RDDs) are Spark's basic abstraction:
distributed collections of objects that can be cached in memory across
cluster nodes and are automatically rebuilt on failure.
• RDD operations:
Transformations: create a new dataset from an existing one, e.g. map.
Actions: return a value to the driver program after running a computation
on the dataset, e.g. reduce.
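The transformation/action distinction can be sketched without a cluster. The following is a minimal pure-Python imitation, not the Spark API itself (the `TinyRDD` class name is invented for illustration): a transformation like `map` yields a new dataset, while an action like `reduce` returns a plain value to the driver.

```python
# Minimal sketch of transformations vs. actions, imitating
# (not using) Spark's RDD model. "TinyRDD" is an invented name.
from functools import reduce

class TinyRDD:
    def __init__(self, data):
        self._data = list(data)

    # Transformation: creates a new dataset from the existing one.
    def map(self, f):
        return TinyRDD(f(x) for x in self._data)

    # Action: runs a computation and returns a value to the "driver".
    def reduce(self, f):
        return reduce(f, self._data)

nums = TinyRDD([1, 2, 3, 4])
squares = nums.map(lambda x: x * x)          # transformation -> new dataset
total = squares.reduce(lambda a, b: a + b)   # action -> 30
print(total)
```

Real Spark additionally evaluates transformations lazily and only materializes work when an action is called; this sketch keeps just the dataset-in, value-out shape of the two operation kinds.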
Launching Spark on EC2
• Set the environment
variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
• Go into the ec2 directory in the release of Spark you downloaded.
• Run ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
Launching Spark on EC2 (2)
-k <keypair> : Name of your EC2 key pair.
-i <key-file> : Private key file of your key pair.
-s <num-slaves> : Number of slaves to launch.
<cluster-name> : The name to give your cluster.
-r <ec2-region> : The EC2 region in which to launch the cluster.
Spark on EC2
• Terminating a Cluster
Run ./spark-ec2 destroy <cluster-name>
• Stopping a Cluster
Run ./spark-ec2 stop <cluster-name>
• Restarting a Cluster
Run ./spark-ec2 -i <key-file> start <cluster-name>
• Accessing Data in S3
Use file URIs of the form s3n://<bucket>/<path>.
Spark Example: Word Count
• The Spark Java API is defined in the org.apache.spark.api.java package and
includes a JavaSparkContext for initializing Spark.
• The JavaSparkContext constructor takes the cluster URL, the name of your
application, the directory where Spark is installed, and a collection of JARs
to send to the cluster (which can come from local disk or from HDFS).
Spark Example: Word Count (2)
• Use flatMap to split each line into words.
• Use map to turn each word into a (word, 1) pair.
• Use reduceByKey to count the occurrences of each word.
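The three steps above can be sketched in plain Python, imitating flatMap / map / reduceByKey rather than calling the Spark API (in real Spark code these would be chained methods on an RDD, and the input would come from a file via the SparkContext):

```python
# Pure-Python imitation of the Spark word-count pipeline.
from collections import defaultdict

lines = ["to be or not to be", "to see or not to see"]

# flatMap: split each line into words, flattening into one list.
words = [w for line in lines for w in line.split()]

# map: turn each word into a (word, 1) pair.
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts for each distinct word.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))
```

On a cluster, reduceByKey also shuffles pairs so that all counts for the same word land on the same machine before being summed; the single-machine loop above keeps only the logical shape of that step.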
K-Means Algorithm – Clustering Example
• Parameters – MLlib implementation:
Number of clusters (k)
Maximum number of iterations to run
Epsilon (convergence distance)
Initialization mode (random or kmeans||)
Initialization steps (number of steps in kmeans||)
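For intuition, here is a minimal sketch of Lloyd's k-means using three of the parameters above (k, max iterations, epsilon). This is an illustration, not MLlib's implementation: it works on 2-D tuples on one machine and uses a naive take-the-first-k initialization instead of MLlib's random or kmeans|| modes.

```python
import math

def kmeans(points, k, max_iterations=20, epsilon=1e-4):
    """Plain Lloyd's k-means on 2-D points; illustrative, not MLlib's code."""
    # Naive deterministic initialization; MLlib instead offers
    # "random" or "kmeans||" initialization modes.
    centers = list(points[:k])
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        # Converged when no center moved farther than epsilon.
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved <= epsilon:
            break
    return centers

pts = [(0.0, 0.0), (0.1, 0.2), (9.0, 9.0), (9.1, 8.8)]
print(kmeans(pts, k=2))  # one center per well-separated cluster
```

MLlib parallelizes the assignment step across the cluster (each worker assigns its partition of the points) and aggregates the per-cluster sums to compute the new centers, but the per-iteration logic is the same assign-then-update loop shown here.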