SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th

An overview of Spark, SparkR, and scalable machine learning. A talk and discussion given to the Utah R users group at the University of Utah.

Transcript of "SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th"

  1. Scalable Machine Learning
     Alton Alexander (@10altoids)
  2. Motivation to use Spark
     • http://spark.apache.org/
       – Speed
       – Ease of use
       – Generality
       – Integrated with Hadoop
     • Scalability
  3. Performance
  4. Architecture
     • New Amazon Memory Optimized Options
       https://aws.amazon.com/about-aws/whats-new/2014/04/10/r3-announcing-the-next-generation-of-amazon-ec2-memory-optimized-instances/
  5. Learn more about Spark
     • http://spark.apache.org/documentation.html
       – Great documentation (with video tutorials)
       – June 2014 conference: http://spark-summit.org
     • Keynote talk at Strata 2014
       – Use cases by Yahoo and other companies
       – http://youtu.be/KspReT2JjeE
     • Matei Zaharia
       – Core developer, now at Databricks
       – 30-minute talk in detail: http://youtu.be/nU6vO2EJAb4?t=20m42s
  6. Motivation to use R
     • Great community
       – R: the most powerful and most widely used statistical software
       – https://www.youtube.com/watch?v=TR2bHSJ_eck
     • Statistics
     • Packages
       – "There's an R package for that"
       – Roger Peng, Johns Hopkins
       – https://www.youtube.com/watch?v=yhTerzNFLbo
     • Plots
  7. Example: Word Count

     library(SparkR)
     sc <- sparkR.init(master="local")
     lines <- textFile(sc, "hdfs://my_text_file")
     words <- flatMap(lines, function(line) {
       strsplit(line, " ")[[1]]
     })
     wordCount <- lapply(words, function(word) {
       list(word, 1L)
     })
     counts <- reduceByKey(wordCount, "+", 2L)
     output <- collect(counts)
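As a point of comparison, here is a minimal PySpark sketch of the same word count (an illustrative addition, assuming an existing SparkContext `sc` and the same hypothetical HDFS path):

      # Minimal PySpark equivalent of the SparkR word count above.
      # Assumes `sc` is an existing SparkContext (e.g. from the pyspark shell).
      lines = sc.textFile("hdfs://my_text_file")
      counts = (lines.flatMap(lambda line: line.split(" "))  # split lines into words
                     .map(lambda word: (word, 1))            # pair each word with 1
                     .reduceByKey(lambda a, b: a + b))       # sum counts per word
      output = counts.collect()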
  8. Learn more about SparkR
     • GitHub repository
       – https://github.com/amplab-extras/SparkR-pkg
       – How to install
       – Examples
     • An old but still good talk introducing SparkR
       – http://www.youtube.com/watch?v=MY0NkZY_tJw&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a
       – Shows an MNIST demo
  9. Backup Slides
  10. Hands-on Exercises
      • http://spark-summit.org/2013/exercises/index.html
        – Walk through the tutorial
        – Set up a cluster on EC2
        – Data exploration
        – Stream processing with Spark Streaming
        – Machine learning
  11. Local box
      • Start with a micro dev box using the latest public build on Amazon EC2
        – spark.ami.pvm.v9 - ami-5bb18832
      • Or start by just installing it on your laptop
        – wget http://d3kbcqa49mib13.cloudfront.net/spark-0.9.1-bin-hadoop1.tgz
      • Add AWS keys as environment variables
        – AWS_ACCESS_KEY_ID=
        – AWS_SECRET_ACCESS_KEY=
  12. Run the examples
      • Load pyspark and work interactively
        – /root/spark-0.9.1-bin-hadoop1/bin/pyspark
        – >>> help(sc)
      • Estimate pi (a sketch of what the bundled script does follows this slide)
        – ./bin/pyspark python/examples/pi.py local[4] 20
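For readers curious what the bundled pi.py example is doing, here is a minimal Monte Carlo sketch in the same spirit (illustrative only; it assumes an existing SparkContext `sc`):

      # Monte Carlo estimate of pi, in the spirit of Spark's bundled pi.py.
      # Assumes `sc` is an existing SparkContext (e.g. from the pyspark shell).
      import random

      def inside(_):
          # Sample a random point in the unit square and test whether it
          # falls inside the quarter circle of radius 1.
          x, y = random.random(), random.random()
          return x * x + y * y < 1

      n = 100000
      count = sc.parallelize(range(n)).filter(inside).count()
      print("Pi is roughly %f" % (4.0 * count / n))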
  13. Start Cluster
      • Configure the cluster and start it
        – spark-0.9.1-bin-hadoop1/ec2/spark-ec2 -k spark-key -i ~/spark-key.pem -s 1 launch spark-test-cluster
      • Log onto the master
        – spark-0.9.1-bin-hadoop1/ec2/spark-ec2 login spark-test-cluster
  14. Ganglia: The Cluster Dashboard
  15. Run these Demos
      • http://spark.apache.org/docs/latest/mllib-guide.html
        – Describes each of the algorithms
        – Gives some demos in Scala
        – More demos in Python
  16. Clustering

      from pyspark.mllib.clustering import KMeans
      from numpy import array
      from math import sqrt

      # Load and parse the data
      data = sc.textFile("data/kmeans_data.txt")
      parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

      # Build the model (cluster the data)
      clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=30,
                              initializationMode="random")

      # Evaluate clustering by computing the Within Set Sum of Squared Errors
      def error(point):
          center = clusters.centers[clusters.predict(point)]
          return sqrt(sum([x**2 for x in (point - center)]))

      WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
      print("Within Set Sum of Squared Error = " + str(WSSSE))
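One natural extension of this demo (an illustrative sketch, not part of the MLlib guide excerpt above) is to reuse the same WSSSE computation to compare several values of k; it assumes `parsedData` and the imports from the slide above:

      # Compare WSSSE across several values of k to pick a cluster count.
      # Assumes `parsedData`, `KMeans`, and `sqrt` from the slide above.
      def wssse(model, data):
          def point_error(point):
              center = model.centers[model.predict(point)]
              return sqrt(sum([x**2 for x in (point - center)]))
          return data.map(point_error).reduce(lambda x, y: x + y)

      for k in [2, 3, 4, 5]:
          model = KMeans.train(parsedData, k, maxIterations=10)
          print("k = %d, WSSSE = %f" % (k, wssse(model, parsedData)))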
  17. Python Code
      • http://spark.incubator.apache.org/docs/latest/api/pyspark/index.html
      • Python API for Spark
      • Package MLlib
        – Classification
        – Clustering
        – Recommendation
        – Regression
  18. Clustering Skullcandy Followers

      from pyspark.mllib.clustering import KMeans
      from numpy import array
      from math import sqrt

      # Load and parse the data
      data = sc.textFile("../skullcandy.csv")
      parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

      # Build the model (cluster the data)
      clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=30,
                              initializationMode="random")

      # Evaluate clustering by computing the Within Set Sum of Squared Errors
      def error(point):
          center = clusters.centers[clusters.predict(point)]
          return sqrt(sum([x**2 for x in (point - center)]))

      WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
      print("Within Set Sum of Squared Error = " + str(WSSSE))
  19. Clustering Skullcandy Followers
  20. Clustering Skullcandy Followers
  21. Apply model to all followers
      • predictions = parsedData.map(lambda follower: clusters.predict(follower))
      • Save this out for visualization
        – predictions.saveAsTextFile("predictions.csv")
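If the visualization needs each follower's features alongside its predicted group, one option is to write them out together (an illustrative sketch; the output directory name is hypothetical, and `parsedData`/`clusters` come from the earlier slides):

      # Write "feature,...,cluster" rows so predicted groups can be joined
      # back to follower features downstream. Output path is hypothetical.
      labeled = parsedData.map(lambda follower: ",".join(
          [str(x) for x in follower] + [str(clusters.predict(follower))]))
      labeled.saveAsTextFile("labeled_predictions")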
  22. Predicted Groups
  23. Skullcandy Dashboard
  24. Backup
  25. • Upgrade to Python 2.7
      • https://spark-project.atlassian.net/browse/SPARK-922
  26. Correlation Matrix