Intro to Apache Spark - Lab

An intro lab with Apache Spark.

  1. Introduction to Apache Spark
  2. www.mammothdata.com | @mammothdataco Lab Overview ● ‘Hello world’ RDD example ● Importing a dataset ● DataFrame operations and visualizations ● Using MLlib on the dataset
  3. Lab — Hello World ● ./run_spark
  4. Lab — Hello World ● val text = sc.parallelize(Seq("your text here")) ● val words = text.flatMap(line => line.split(" ")) ● words.collect
  5. Lab — Hello World ● val taggedWords = words.map(word => (word, 1)) ● val counts = taggedWords.reduceByKey(_ + _) ● counts.collect
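The word count on the two slides above can also be traced without a cluster. This is a plain-Scala sketch of the same pipeline (illustrative only, not Spark code): `flatMap` splits lines into words, `map` tags each word with a 1, and `groupBy` plus a sum stands in for `reduceByKey`.

```scala
// The same word count as above, on plain Scala collections rather than RDDs.
val text = Seq("to be or not to be")                 // stand-in for the parallelized RDD
val words = text.flatMap(line => line.split(" "))    // split each line into words
val tagged = words.map(word => (word, 1))            // tag each word with a count of 1
val counts = tagged
  .groupBy(_._1)                                     // group the (word, 1) pairs by word...
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // ...and sum, like reduceByKey
// counts contains: to -> 2, be -> 2, or -> 1, not -> 1
```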
  6. Lab — Dataset ● https://archive.ics.uci.edu/ml/datasets/Wine ● Chemical analysis of wines from three different cultivars, all grown in the same region of Italy ● 178 entries (small!)
  7. Lab — Loading the Wine Dataset ● val wines = sqlContext.read.json("wine.json") ● wines.registerTempTable("wines")
  8. Lab — Showing the Generated Schema ● wines.printSchema
  9. Lab — DataFrame Operations ● wines.first
  10. Lab — DataFrame Operations ● sqlContext.sql("SELECT Type, count(Type) AS count FROM wines GROUP BY Type").show
  11. Lab — DataFrame Operations ● Experiment with %sql on the dataset (SELECT, COUNT, etc.)
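A few queries worth trying in the shell, to get started with the experimentation suggested above. These run against the `wines` temp table registered earlier; the column names (`Type`, `Alcohol`, `Hue`, `Proline`) are the ones this lab uses later, and the specific queries are only suggestions:

```scala
// How many rows are in the table?
sqlContext.sql("SELECT COUNT(*) AS total FROM wines").show

// Average alcohol content per wine type.
sqlContext.sql("SELECT Type, AVG(Alcohol) AS avgAlcohol FROM wines GROUP BY Type").show

// Filter: only the high-proline wines.
sqlContext.sql("SELECT Type, Proline FROM wines WHERE Proline > 1000").show
```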
  12. Lab — K-means Clustering ● K-means clustering is an unsupervised algorithm which splits a dataset into a number of clusters (k) based on a notion of similarity between points. It is often applied to real-world data to obtain a picture of structure hidden in large datasets, for example identifying location clusters or breaking sales down into distinct purchasing groups.
  13. Lab — K-means Clustering ● k initial "means" (in this case k = 3) are randomly generated within the data domain (shown in colour).
  14. Lab — K-means Clustering ● k clusters (in this case, 3) are created by assigning each data point to its closest mean.
  15. Lab — K-means Clustering ● The centroid of each of these clusters is found, and these centroids become the new means. New clusters are formed by assigning each data point to its closest new mean, as in Step 2. The process repeats until the means converge (or until the iteration limit is reached).
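The three steps above can be sketched in a few lines of plain Scala. This is a hypothetical 1-D illustration of the loop, not how MLlib implements `KMeans.train`; the initial means are chosen deterministically here (first k points) rather than randomly, to keep the example reproducible:

```scala
// Illustrative 1-D k-means loop on plain Scala collections (not MLlib code).
def kMeansSketch(points: Seq[Double], k: Int, iterations: Int): Seq[Double] = {
  // Step 1: choose k initial means (first k points here; MLlib picks them randomly).
  var means = points.take(k)
  for (_ <- 1 to iterations) {
    // Step 2: assign each point to its closest mean.
    val clusters = points.groupBy(p => means.minBy(m => math.abs(p - m)))
    // Step 3: move each mean to the centroid of its cluster (keep it if the cluster is empty).
    means = means.map(m => clusters.get(m).map(c => c.sum / c.size).getOrElse(m))
  }
  means
}

val means = kMeansSketch(Seq(1.0, 1.1, 0.9, 5.0, 5.2, 4.8), k = 2, iterations = 10)
// the two means converge near the two obvious groups, roughly 1.0 and 5.0
```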
  16. Lab — K-means Clustering: Imports ● import org.apache.spark.mllib.clustering.KMeans ● import org.apache.spark.mllib.linalg.Vectors ● import org.apache.spark.sql._
  17. Lab — K-means Clustering: Features ● val featureCols = wines.select("Alcohol", "Hue", "Proline") ● val features = featureCols.rdd.map { case Row(a: Double, h: Double, p: Double) => Vectors.dense(a, h, p) } ● features.cache
  18. Lab — K-means Clustering: Training the Model ● val numClusters = 2 ● val numIterations = 20 ● val model = KMeans.train(features, numClusters, numIterations)
  19. Lab — K-means Clustering: Finding k ● k can be any number you like! ● WSSSE: Within Set Sum of Squared Errors ● The sum of the squared distances between each point and its respective centroid ● val wssse = model.computeCost(features)
  20. Lab — K-means Clustering: Finding k ● Test k = 1 to 5 ● (1 to 5).map(k => KMeans.train(features, k, numIterations).computeCost(features)) ● WSSSE normally decreases as k increases ● Look for the ‘elbow’
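To make the WSSSE definition concrete, here is the same quantity computed by hand on toy 1-D data (hypothetical centroids and points, not the wine dataset or the MLlib implementation): for each point, take the squared distance to its nearest centroid, then sum over all points.

```scala
// Hand-computed WSSSE on toy 1-D data, mirroring what computeCost returns.
val centroids = Seq(1.0, 5.0)
val points = Seq(0.5, 1.5, 4.0, 6.0)
val wssse = points.map { p =>
  val nearest = centroids.minBy(c => math.abs(p - c))   // closest centroid to p
  (p - nearest) * (p - nearest)                         // squared distance
}.sum
// 0.25 + 0.25 + 1.0 + 1.0 = 2.5
```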
  21. Lab — K-means Clustering: Training the Model ● val numClusters = 1 ● val numIterations = 20 ● val wssse = KMeans.train(features, numClusters, numIterations).computeCost(features)
  22. Lab — K-means Clustering: k = 3 ● val numClusters = 3 ● val numIterations = 10 ● val model = KMeans.train(features, numClusters, numIterations)
  23. Lab — K-means Clustering: Obtaining Type Predictions ● val predictions = features.map(feature => model.predict(feature))
  24. Lab — K-means Clustering: Comparing to Labels ● val counts = predictions.map(p => (p, 1)).reduceByKey(_ + _) ● counts.collect
  25. Lab — Next Steps ● Looks good, right? Let’s look at what the labels for each point really are. ● val featureCols = wines.select("Type", "Alcohol", "Hue", "Proline") ● val features = featureCols.rdd.map { case Row(t: Double, a: Double, h: Double, p: Double) => (t, Vectors.dense(a, h, p)) } ● val predictions = features.map(feature => (feature._1, model.predict(feature._2))) ● val counts = predictions.map(p => (p, 1)).reduceByKey(_ + _) ● counts.collect ● A slightly different story!
  26. Lab — Next Steps ● k-means clustering: useful, but not perfect! ● Try again with more features in the vector and see if the clustering improves. ● Naive Bayes? Random Forests? Both in MLlib, with similar interfaces!
  27. Lab — Next Steps ● spark.apache.org
  28. Lab — Questions ● ?
