2. Outline
• Why Spark?
• Introduction to Spark
• Spark Programming model
• Installing Spark
• Launching Spark on Amazon EC2
• Example
• Observation
• Introduction to MLlib
• Kmeans Algorithm
3. Why Spark?
• MapReduce became popular for complex multi-stage algorithms, e.g. machine learning and iterative graph algorithms.
• Multi-stage and interactive applications require faster data sharing across parallel jobs.
• Why not run MapReduce in memory?
4. What is Spark?
• Fast, MapReduce-like engine.
• Uses in-memory cluster computing.
• Compatible with Hadoop’s storage APIs.
• Has APIs in Scala, Java, and Python.
• Useful for large datasets and iterative algorithms.
• Up to 40x faster than Hadoop.
• Support for
Shark – Hive on Spark
MLlib – Machine learning library
GraphX – Graph processing
5. History
• Started as a research project at the UC
Berkeley AMPLab in 2009, and was open
sourced in early 2010.
• After its release, Spark grew a developer community on GitHub and moved to Apache in 2013, its permanent home.
• Codebase size
Spark : 20,000 LOC
Hadoop 1.0 : 90,000 LOC
6. Spark: Programming Model
• Resilient Distributed Datasets (RDDs) are the basic building block.
Distributed collections of objects that can be cached in memory across cluster nodes.
Automatically rebuilt on failure.
• RDD operations
Transformations: create a new dataset from an existing one, e.g. map.
Actions: return a value to the driver program after running a computation on the dataset, e.g. reduce.
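The split between transformations and actions can be imitated on a single machine. The sketch below is plain Python, not Spark's API: `MiniRDD` and its methods are hypothetical names chosen for illustration. It assumes that a transformation only records how to derive a dataset, and that an action forces the computation and hands a value back to the caller.

```python
from functools import reduce


class MiniRDD:
    """Toy single-machine stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source):
        # `source` is a zero-argument function that yields the elements,
        # so the dataset can be recomputed from scratch whenever needed.
        self._source = source

    @classmethod
    def parallelize(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    def map(self, fn):
        # Transformation: describes a new dataset, computes nothing yet.
        parent = self._source
        return MiniRDD(lambda: (fn(x) for x in parent()))

    def reduce(self, fn):
        # Action: runs the recorded chain and returns a single value.
        return reduce(fn, self._source())

    def collect(self):
        # Action: runs the recorded chain and returns all elements.
        return list(self._source())


calls = []  # records every element that `square` actually processes


def square(x):
    calls.append(x)
    return x * x


numbers = MiniRDD.parallelize([1, 2, 3, 4])
squares = numbers.map(square)          # transformation only
calls_before_action = list(calls)      # still empty: nothing has run
total = squares.reduce(lambda a, b: a + b)  # action triggers the work
print(calls_before_action, total)
```

Because each `MiniRDD` keeps only the recipe for producing its elements, it can be recomputed from its parent at any time, which loosely mirrors how a lost RDD is rebuilt after a failure.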
10. Launching Spark on EC2
• Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
• Go into the ec2 directory in the release of Spark you downloaded.
• Run ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
11. Launching Spark on EC2 (2)
-k <keypair> : Name of your EC2 key pair.
-i <key-file> : Private key file for your key pair.
-s <num-slaves> : Number of slaves to launch.
launch <cluster-name> : Action to run, followed by the name to give your cluster.
-r <ec2-region> : EC2 region in which to launch instances.
13. Spark on EC2
• Terminating a Cluster
Run ./spark-ec2 destroy <cluster-name>
• Stopping a Cluster
Run ./spark-ec2 stop <cluster-name>
• Restarting a Cluster
Run ./spark-ec2 -i <key-file> start <cluster-name>
• Accessing Data in S3
s3n://<bucket>/path
15. Spark Example: Word count
• The Spark Java API is defined in the org.apache.spark.api.java package and includes a JavaSparkContext for initializing Spark.
[Code figure: creating the JavaSparkContext. Callouts:]
• Cluster URL to connect to
• Name of your application
• Location where Spark is installed
• Collection of JARs to send to the cluster; can be local or on HDFS
16. Spark Example: Word count (2)
[Code figure: word count. Callouts:]
• To split the lines into words, we use flatMap to split each line on whitespace.
• Map each word to a (word, 1) pair.
• Use reduceByKey to count the occurrences of each word.
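The three word-count steps can be traced on a small input without a cluster. The following is a minimal plain-Python sketch, not Spark code: `flat_map` and `reduce_by_key` are hypothetical helpers that only imitate what flatMap and reduceByKey do to a collection, and the two input lines are made up for illustration.

```python
from itertools import chain


def flat_map(fn, items):
    # Apply fn to every item and flatten the resulting lists into one list.
    return list(chain.from_iterable(fn(item) for item in items))


def reduce_by_key(fn, pairs):
    # Combine all values that share a key using the binary function fn.
    result = {}
    for key, value in pairs:
        result[key] = fn(result[key], value) if key in result else value
    return result


lines = ["to be or not", "to be"]  # made-up input

# Step 1: split each line on whitespace and flatten into one list of words.
words = flat_map(lambda line: line.split(), lines)

# Step 2: map each word to a (word, 1) pair.
pairs = [(word, 1) for word in words]

# Step 3: sum the 1s for each distinct word.
counts = reduce_by_key(lambda a, b: a + b, pairs)
print(counts)
```

In Spark the same three steps run in parallel across the cluster; here they run sequentially, so only the data flow carries over.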
21. KMeans Algorithm – Clustering Example
• Parameters – MLlib implementation
Number of clusters (k)
Max. number of iterations to run
Epsilon (distance threshold below which the centers are considered converged)
Initialization mode (random or k-means||)
Initialization steps (number of steps in k-means||)
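To see how the number of clusters, the iteration cap, and epsilon interact, here is a minimal single-machine sketch of the standard k-means loop in plain Python. It is not MLlib's implementation: the function name `kmeans`, its signature, and the sample points are assumptions for illustration, and only the random initialization mode is shown (k-means|| is omitted).

```python
import math
import random


def kmeans(points, k, max_iterations=20, epsilon=1e-4, seed=0):
    """Plain k-means on tuples of floats (illustrative, not MLlib)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # "random" initialization mode

    for _ in range(max_iterations):  # max. number of iterations to run
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)

        # Update step: move each center to the mean of its members.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center

        # Stop once no center moved by epsilon or more.
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved < epsilon:
            break

    return centers


# Two well-separated groups of made-up 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = sorted(kmeans(points, k=2))
print(centers)
```

A smaller epsilon or a larger iteration cap lets the loop run longer before stopping; on well-separated data like this the centers settle after a few iterations either way.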
22. KMeans Example
[Code figure: running the KMeans example. Callouts:]
• Cluster URL to connect to
• File containing the input matrix
• Number of clusters
• Convergence distance