Introduction to Apache Spark and MLlib

Slide deck contains overview of Apache Spark and Machine learning library MLlib.

Presentation Transcript

  • Apache Spark • Pushkar Umaranikar • CS 267 • Guided by: Dr. Tran
  • Outline • Why Spark? • Introduction to Spark • Spark programming model • Installing Spark • Launching Spark on Amazon EC2 • Example • Observation • Introduction to MLlib • KMeans algorithm
  • Why Spark? • MapReduce became popular for complex multi-stage algorithms, e.g. machine learning and iterative graph algorithms. • Multi-stage and interactive applications require faster data sharing across parallel jobs. • Why not run MapReduce in memory?
  • What is Spark? • Fast, MapReduce-like engine. • Uses in-memory cluster computing. • Compatible with Hadoop’s storage APIs. • Has APIs in Scala, Java, and Python. • Useful for large datasets and iterative algorithms. • Up to 40x faster than Hadoop. • Supports: Shark (Hive on Spark), MLlib (machine learning library), GraphX (graph processing).
  • History • Started as a research project at the UC Berkeley AMPLab in 2009 and was open sourced in early 2010. • After its release, Spark grew a developer community on GitHub and moved to Apache in 2013 as its permanent home. • Codebase size: Spark, 20,000 LOC; Hadoop 1.0, 90,000 LOC.
  • Spark: Programming Model • Resilient Distributed Datasets (RDDs) are the basic building block: distributed collections of objects that can be cached in memory across cluster nodes and are automatically rebuilt on failure. • RDD operations: transformations create a new dataset from an existing one (e.g. map); actions return a value to the driver program after running a computation on the dataset (e.g. reduce).
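
The split between transformations and actions can be illustrated with a small single-process sketch in plain Python. This is not Spark code: `ToyRDD` and everything inside it are hypothetical names made up for illustration. The assumption is that a transformation only records the function to apply, while an action replays the recorded steps from the source data, which is also roughly how a lost partition could be rebuilt.

```python
from functools import reduce


class ToyRDD:
    """Single-process stand-in for an RDD; hypothetical, not a Spark API."""

    def __init__(self, source, lineage=()):
        self._source = source    # zero-argument callable that re-reads the base data
        self._lineage = lineage  # per-element functions recorded so far

    def map(self, fn):
        # Transformation: record fn and return a new dataset; nothing runs yet.
        return ToyRDD(self._source, self._lineage + (fn,))

    def _compute(self):
        # Replay the recorded lineage from the source data.
        for item in self._source():
            for fn in self._lineage:
                item = fn(item)
            yield item

    def reduce(self, fn):
        # Action: run the computation and return a value to the caller.
        return reduce(fn, self._compute())


calls = []


def square(x):
    calls.append(x)  # lets us observe when the work actually happens
    return x * x


numbers = ToyRDD(lambda: [1, 2, 3, 4])
squares = numbers.map(square)               # transformation: no work yet
calls_before_action = len(calls)
total = squares.reduce(lambda a, b: a + b)  # action: triggers the computation
```

Here `calls_before_action` stays at zero because `map` only extends the lineage; the squares are computed when `reduce` runs.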
  • Data Sharing • MapReduce • Spark
  • Installing Apache Spark • Get Spark from the downloads page of the Apache Spark site. • Go into the top-level Spark directory and run: $ sbt/sbt assembly
  • Installing Apache Spark (2)
  • Launching Spark on EC2 • Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. • Go into the ec2 directory in the release of Spark you downloaded. • Run ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
  • Launching Spark on EC2 (2) • -k <keypair>: name of the EC2 key pair. • -i <key-file>: private key file for your key pair. • -s <num-slaves>: number of slaves to launch. • launch <cluster-name>: the name to give your cluster. • -r <ec2-region>: the EC2 region in which to launch instances.
  • Launching Spark on EC2(3)
  • Spark on EC2 • Terminating a cluster: run ./spark-ec2 destroy <cluster-name> • Stopping a cluster: run ./spark-ec2 stop <cluster-name> • Restarting a cluster: run ./spark-ec2 -i <key-file> start <cluster-name> • Accessing data in S3: s3n://<bucket>/path
  • Components
  • Spark Example: Word count • The Spark Java API is defined in the org.apache.spark.api.java package and includes a JavaSparkContext for initializing Spark. • Code callouts: location where Spark is installed; name of your application; collection of JARs to send to the cluster (local or from HDFS); cluster URL to connect to.
  • Spark Example: Word count (2) • Use flatMap to split each line into words on whitespace. • Use map to turn each word into a (word, 1) pair. • Use reduceByKey to count the occurrences of each word.
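
The three steps on this slide can be mimicked in plain Python to show the data flow. This is only a sketch, not the Spark Java API: `flat_map` and `reduce_by_key` are hypothetical local helpers that imitate the behavior of the Spark operations with similar names, and the two input lines are made-up sample data.

```python
from itertools import chain


def flat_map(fn, items):
    # Apply fn to every item and flatten the resulting lists into one list.
    return list(chain.from_iterable(fn(item) for item in items))


def reduce_by_key(fn, pairs):
    # Merge the values of pairs that share a key using fn.
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged


lines = ["to be or not to be", "that is the question"]

words = flat_map(lambda line: line.split(), lines)  # split each line on whitespace
pairs = [(word, 1) for word in words]               # map each word to (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)   # sum the 1s for each word
```

With this input, `counts["to"]` and `counts["be"]` are both 2 and every other word maps to 1.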
  • Spark Example : Word count(3)
  • Word count example : MapReduce
  • Observation • Word count program execution time: Apache Spark, 13.48 s; MapReduce, 21.82 s. • In this test, Spark ran the word count roughly 1.6x faster than Hadoop MapReduce.
  • MLlib • Spark’s scalable machine learning library. • Currently supports: binary classification, regression, clustering, collaborative filtering.
  • KMeans Algorithm: Clustering Example • Parameters of the MLlib implementation: number of clusters (k); maximum number of iterations to run; epsilon (convergence distance); initialization mode (random or k-means||); initialization steps (number of steps in k-means||).
  • KMeans Example • Code callouts: cluster URL to connect to; file containing the input matrix; number of clusters; convergence distance.
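
To make the role of these parameters concrete, here is a minimal plain-Python sketch of the standard k-means iteration. It is not the MLlib implementation: the function name `kmeans`, its defaults, and the sample points are assumptions for illustration. Only the random initialization mode is shown (k-means|| is omitted), and epsilon is interpreted as "stop once no center moves farther than this distance".

```python
import math
import random


def kmeans(points, k, max_iterations=20, epsilon=1e-4, seed=0):
    """Hypothetical helper: k-means on tuples of floats."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # "random" initialization mode
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        # Convergence check: largest distance any center moved.
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved < epsilon:
            break
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers = sorted(kmeans(points, k=2))
```

On this small, well-separated sample the two centers settle at the means of the two groups of three points; on real data the result depends on the initialization, which is why the initialization mode is exposed as a parameter.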
  • KMeans Example(2)
  • References • https://www.youtube.com/watch?v=49Hr5xZyTEA • RDD: A fault-tolerant abstraction for in-memory cluster computing, http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf • https://spark.apache.org/documentation.html • https://github.com/mesos/spark/wiki/Spark-Programming-Guide