SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th

An overview of Spark, SparkR, and scalable machine learning. A talk and discussion given to the Utah R users group at the University of Utah.

Transcript

  • 1. Scalable Machine Learning, Alton Alexander (@10altoids)
  • 2. Motivation to use Spark
    • http://spark.apache.org/
      – Speed
      – Ease of use
      – Generality
      – Integrated with Hadoop
    • Scalability
  • 3. Performance
  • 4. Architecture
    • New Amazon memory-optimized instance options: https://aws.amazon.com/about-aws/whats-new/2014/04/10/r3-announcing-the-next-generation-of-amazon-ec2-memory-optimized-instances/
  • 5. Learn more about Spark
    • http://spark.apache.org/documentation.html
      – Great documentation (with video tutorials)
      – June 2014 conference: http://spark-summit.org
    • Keynote talk at STRATA 2014
      – Use cases by Yahoo and other companies
      – http://youtu.be/KspReT2JjeE
      – Matei Zaharia, core developer and now at Databricks
      – 30-minute detailed talk: http://youtu.be/nU6vO2EJAb4?t=20m42s
  • 6. Motivation to use R
    • Great community
      – R: the most powerful and most widely used statistical software
      – https://www.youtube.com/watch?v=TR2bHSJ_eck
    • Statistics
    • Packages
      – "There's an R package for that" (Roger Peng, Johns Hopkins)
      – https://www.youtube.com/watch?v=yhTerzNFLbo
    • Plots
  • 7. Example: Word Count

    library(SparkR)
    sc <- sparkR.init(master = "local")
    lines <- textFile(sc, "hdfs://my_text_file")
    words <- flatMap(lines, function(line) { strsplit(line, " ")[[1]] })
    wordCount <- lapply(words, function(word) { list(word, 1L) })
    counts <- reduceByKey(wordCount, "+", 2L)
    output <- collect(counts)
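
  For comparison (not from the talk), the same word count as a minimal pyspark sketch; hdfs://my_text_file is the same placeholder path used in the slide:

    from pyspark import SparkContext

    sc = SparkContext("local", "WordCount")
    lines = sc.textFile("hdfs://my_text_file")  # placeholder path, as above
    counts = (lines.flatMap(lambda line: line.split(" "))   # split lines into words
                   .map(lambda word: (word, 1))             # pair each word with a count of 1
                   .reduceByKey(lambda a, b: a + b))        # sum counts per word
    output = counts.collect()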
  • 8. Learn more about SparkR
    • GitHub repository
      – https://github.com/amplab-extras/SparkR-pkg
      – How to install
      – Examples
    • An old but still good talk introducing SparkR
      – http://www.youtube.com/watch?v=MY0NkZY_tJw&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a
      – Shows an MNIST demo
  • 9. Backup Slides
  • 10. Hands-on Exercises
    • http://spark-summit.org/2013/exercises/index.html
      – Walk through the tutorial
      – Set up a cluster on EC2
      – Data exploration
      – Stream processing with Spark Streaming
      – Machine learning
  • 11. Local box
    • Start with a micro dev box using the latest public build on Amazon EC2
      – spark.ami.pvm.v9 (ami-5bb18832)
    • Or start by just installing it on your laptop
      – wget http://d3kbcqa49mib13.cloudfront.net/spark-0.9.1-bin-hadoop1.tgz
    • Add AWS keys as environment variables
      – AWS_ACCESS_KEY_ID=
      – AWS_SECRET_ACCESS_KEY=
  • 12. Run the examples
    • Load pyspark and work interactively
      – /root/spark-0.9.1-bin-hadoop1/bin/pyspark
      – >>> help(sc)
    • Estimate pi (the bundled example is sketched below)
      – ./bin/pyspark python/examples/pi.py local[4] 20
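
  For reference, the bundled pi.py boils down to a Monte Carlo estimate along these lines (a sketch against Spark 0.9's Python API, not the exact bundled script):

    from random import random
    from pyspark import SparkContext

    sc = SparkContext("local[4]", "PiEstimate")
    n = 100000  # number of random samples

    def inside(_):
        # Sample a point in the unit square; count it if it falls
        # inside the quarter circle of radius 1.
        x, y = random(), random()
        return 1 if x * x + y * y < 1 else 0

    count = sc.parallelize(range(n), 4).map(inside).reduce(lambda a, b: a + b)
    print("Pi is roughly %f" % (4.0 * count / n))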
  • 13. Start Cluster
    • Configure the cluster and start it
      – spark-0.9.1-bin-hadoop1/ec2/spark-ec2 -k spark-key -i ~/spark-key.pem -s 1 launch spark-test-cluster
    • Log onto the master
      – spark-0.9.1-bin-hadoop1/ec2/spark-ec2 login spark-test-cluster
  • 14. Ganglia: The Cluster Dashboard
  • 15. Run these Demos
    • http://spark.apache.org/docs/latest/mllib-guide.html
      – Talks about each of the algorithms
      – Gives some demos in Scala
      – More demos in Python
  • 16. Clustering

    from pyspark.mllib.clustering import KMeans
    from numpy import array
    from math import sqrt

    # Load and parse the data
    data = sc.textFile("data/kmeans_data.txt")
    parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

    # Build the model (cluster the data)
    clusters = KMeans.train(parsedData, 2, maxIterations=10,
                            runs=30, initializationMode="random")

    # Evaluate clustering by computing the Within Set Sum of Squared Errors
    def error(point):
        center = clusters.centers[clusters.predict(point)]
        return sqrt(sum([x**2 for x in (point - center)]))

    WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
    print("Within Set Sum of Squared Error = " + str(WSSSE))
  • 17. Python Code
    • http://spark.incubator.apache.org/docs/latest/api/pyspark/index.html
    • Python API for Spark
    • Package MLlib (a classification sketch follows)
      – Classification
      – Clustering
      – Recommendation
      – Regression
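
  As a hedged illustration of the classification entry in the list above: Spark 0.9's Python MLlib trained on RDDs of numpy arrays whose first element is the label, and the file name here is hypothetical:

    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from numpy import array

    # Hypothetical space-delimited file: 0/1 label first, then features.
    data = sc.textFile("data/sample_svm_data.txt")
    parsed = data.map(lambda line: array([float(x) for x in line.split(' ')]))

    # Spark 0.9's Python MLlib takes arrays with the label as element 0.
    model = LogisticRegressionWithSGD.train(parsed)

    # Training error: fraction of points the model misclassifies.
    pairs = parsed.map(lambda p: (p[0], model.predict(p[1:])))
    train_err = pairs.filter(lambda lp: lp[0] != lp[1]).count() / float(parsed.count())
    print("Training error = " + str(train_err))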
  • 18. Clustering Skullcandy Followers

    from pyspark.mllib.clustering import KMeans
    from numpy import array
    from math import sqrt

    # Load and parse the data
    data = sc.textFile("../skullcandy.csv")
    parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

    # Build the model (cluster the data)
    clusters = KMeans.train(parsedData, 2, maxIterations=10,
                            runs=30, initializationMode="random")

    # Evaluate clustering by computing the Within Set Sum of Squared Errors
    def error(point):
        center = clusters.centers[clusters.predict(point)]
        return sqrt(sum([x**2 for x in (point - center)]))

    WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
    print("Within Set Sum of Squared Error = " + str(WSSSE))
  • 19. Clustering Skullcandy Followers
  • 20. Clustering Skullcandy Followers
  • 21. Apply model to all followers
    • predictions = parsedData.map(lambda follower: clusters.predict(follower))
    • Save this out for visualization (a richer variant is sketched below)
      – predictions.saveAsTextFile("predictions.csv")
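
  If the clusters are to be plotted against the original features, one option is to write each follower's features together with its predicted cluster; this is an assumption about the desired output, not code from the talk:

    # Sketch: one CSV line per follower, features followed by cluster id.
    labeled = parsedData.map(
        lambda follower: ",".join(str(x) for x in follower)
                         + "," + str(clusters.predict(follower)))
    labeled.saveAsTextFile("followers_with_clusters.csv")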
  • 22. Predicted Groups
  • 23. Skullcandy Dashboard
  • 24. Backup
  • 25. Upgrade to Python 2.7
    • https://spark-project.atlassian.net/browse/SPARK-922
  • 26. Correlation Matrix