SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
An overview of Spark, SparkR, and scalable machine learning. A talk and discussion given to the Utah R users group at the University of Utah.

Presentation Transcript

  • Scalable Machine Learning (Alton Alexander, @10altoids)
  • Motivation to use Spark
      – http://spark.apache.org/
      – Speed
      – Ease of use
      – Generality
      – Integrated with Hadoop
      – Scalability
  • Performance
  • Architecture
      – New Amazon memory-optimized instance options: https://aws.amazon.com/about-aws/whats-new/2014/04/10/r3-announcing-the-next-generation-of-amazon-ec2-memory-optimized-instances/
  • Learn more about Spark
      – http://spark.apache.org/documentation.html (great documentation, with video tutorials)
      – June 2014 conference: http://spark-summit.org
      – Keynote talk at STRATA 2014 (use cases by Yahoo and other companies): http://youtu.be/KspReT2JjeE
      – Matei Zaharia (core developer, now at Databricks), detailed 30-minute talk: http://youtu.be/nU6vO2EJAb4?t=20m42s
  • Motivation to use R
      – Great community: "R: The most powerful and most widely used statistical software", https://www.youtube.com/watch?v=TR2bHSJ_eck
      – Statistics
      – Packages: "There's an R package for that" (Roger Peng, Johns Hopkins), https://www.youtube.com/watch?v=yhTerzNFLbo
      – Plots
  • Example: Word Count

        library(SparkR)
        sc <- sparkR.init(master = "local")

        # Read the text file into an RDD of lines
        lines <- textFile(sc, "hdfs://my_text_file")

        # Split each line into words
        words <- flatMap(lines, function(line) { strsplit(line, " ")[[1]] })

        # Emit (word, 1) pairs, then sum per word across 2 partitions
        wordCount <- lapply(words, function(word) { list(word, 1L) })
        counts <- reduceByKey(wordCount, "+", 2L)

        # Bring the counts back to the driver
        output <- collect(counts)
  • Learn more about SparkR
      – GitHub repository: https://github.com/amplab-extras/SparkR-pkg (how to install, examples)
      – An old but still good talk introducing SparkR (shows an MNIST demo): http://www.youtube.com/watch?v=MY0NkZY_tJw&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a
  • Backup Slides
  • Hands-on Exercises
      – http://spark-summit.org/2013/exercises/index.html
      – Walk through the tutorial
      – Set up a cluster on EC2
      – Data exploration
      – Stream processing with Spark Streaming
      – Machine learning
  • Local box
      – Start with a micro dev box using the latest public build on Amazon EC2: spark.ami.pvm.v9 (ami-5bb18832)
      – Or start by just installing it on your laptop:
        wget http://d3kbcqa49mib13.cloudfront.net/spark-0.9.1-bin-hadoop1.tgz
      – Add AWS keys as environment variables:
        AWS_ACCESS_KEY_ID=
        AWS_SECRET_ACCESS_KEY=
  • Run the examples
      – Load pyspark and work interactively:
        /root/spark-0.9.1-bin-hadoop1/bin/pyspark
        >>> help(sc)
      – Estimate pi:
        ./bin/pyspark python/examples/pi.py local[4] 20
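    For reference, a minimal sketch of the Monte Carlo estimate that the bundled pi.py example performs, assuming an interactive pyspark session where the SparkContext sc already exists (as above); the sample count per partition is illustrative:

        from random import random
        from operator import add

        # Sample points uniformly in the square [-1, 1] x [-1, 1];
        # the fraction landing inside the unit circle approximates pi/4.
        partitions = 4
        n = 100000 * partitions

        def inside(_):
            x = random() * 2 - 1
            y = random() * 2 - 1
            return 1 if x * x + y * y <= 1 else 0

        count = sc.parallelize(range(1, n + 1), partitions).map(inside).reduce(add)
        print("Pi is roughly %f" % (4.0 * count / n))

    Increasing n tightens the estimate at the cost of proportionally more work; the partitions are processed in parallel across the cores requested by local[4].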
  • Start Cluster
      – Configure the cluster and start it:
        spark-0.9.1-bin-hadoop1/ec2/spark-ec2 -k spark-key -i ~/spark-key.pem -s 1 launch spark-test-cluster
      – Log onto the master:
        spark-0.9.1-bin-hadoop1/ec2/spark-ec2 login spark-test-cluster
  • Ganglia: The Cluster Dashboard
  • Run these Demos
      – http://spark.apache.org/docs/latest/mllib-guide.html
      – Talks about each of the algorithms
      – Gives some demos in Scala
      – More demos in Python
  • Clustering

        from pyspark.mllib.clustering import KMeans
        from numpy import array
        from math import sqrt

        # Load and parse the data (space-separated numeric features)
        data = sc.textFile("data/kmeans_data.txt")
        parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

        # Build the model (cluster the data)
        clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=30,
                                initializationMode="random")

        # Evaluate clustering by computing the Within Set Sum of Squared Errors
        def error(point):
            center = clusters.centers[clusters.predict(point)]
            return sqrt(sum([x**2 for x in (point - center)]))

        WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
        print("Within Set Sum of Squared Error = " + str(WSSSE))
  • Python Code
      – Python API for Spark: http://spark.incubator.apache.org/docs/latest/api/pyspark/index.html
      – Package MLlib:
          – Classification
          – Clustering
          – Recommendation
          – Regression
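    As a rough sketch of the classification piece of that API (not from the slides): in the Spark 0.9 Python API, training data is an RDD of NumPy arrays whose first element is the label, so a logistic regression run looks approximately like this; the data path is hypothetical:

        from pyspark.mllib.classification import LogisticRegressionWithSGD
        from numpy import array

        # Hypothetical input: one space-separated row per example,
        # with the label in the first column (Spark 0.9 convention)
        data = sc.textFile("data/sample_svm_data.txt")
        parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

        # Train a logistic regression model with stochastic gradient descent
        model = LogisticRegressionWithSGD.train(parsedData)

        # Compare predictions against the training labels
        labelsAndPreds = parsedData.map(
            lambda point: (int(point[0]), model.predict(point[1:])))
        trainErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() \
            / float(parsedData.count())
        print("Training Error = " + str(trainErr))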
  • Clustering Skullcandy Followers

        from pyspark.mllib.clustering import KMeans
        from numpy import array
        from math import sqrt

        # Load and parse the follower data
        data = sc.textFile("../skullcandy.csv")
        parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

        # Build the model (cluster the data)
        clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=30,
                                initializationMode="random")

        # Evaluate clustering by computing the Within Set Sum of Squared Errors
        def error(point):
            center = clusters.centers[clusters.predict(point)]
            return sqrt(sum([x**2 for x in (point - center)]))

        WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
        print("Within Set Sum of Squared Error = " + str(WSSSE))
  • Clustering Skullcandy Followers (cluster visualization slides)
  • Apply model to all followers

        # Predict the cluster for every follower
        predictions = parsedData.map(lambda follower: clusters.predict(follower))

        # Save this out for visualization
        predictions.saveAsTextFile("predictions.csv")
  • Predicted Groups
  • Skullcandy Dashboard
  • Backup
  • Upgrade to Python 2.7: https://spark-project.atlassian.net/browse/SPARK-922
  • Correlation Matrix
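    This backup slide is a figure; as an illustrative sketch (not from the deck), a correlation matrix over the parsed follower features could be computed on the driver with NumPy, assuming parsedData from the Skullcandy clustering example and a sample small enough to bring local:

        import numpy as np

        # Pull a small sample of feature rows back to the driver;
        # collecting the full RDD is only safe for small data
        rows = np.array(parsedData.take(1000))

        # np.corrcoef treats each row as a variable, so transpose
        # to get one row per feature column
        corr = np.corrcoef(rows.T)
        print(corr)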