Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Like this presentation? Why not share!

- Latent Semantic Analysis of Wikiped... by Sandy Ryza 4672 views
- Large Scale Machine Learning with A... by Cloudera, Inc. 6404 views
- From Data to AI with the Machine Le... by Louis Dorard 1025 views
- Value IT - Journey from Cost functi... by Nikhil Chadha 128 views
- 4 pengenalan-supervised-dan-unsuper... by Nera Ajahh 53 views
- Architecting for Data Science by Johann Schleier-S... 750 views

6,758 views

Published on

Bio:

Sandy Ryza is an engineer on the data science team at Cloudera. He is a committer on Apache Hadoop and recently led Cloudera's Apache Spark development.

No Downloads

Total views

6,758

On SlideShare

0

From Embeds

0

Number of Embeds

69

Shares

0

Downloads

266

Comments

0

Likes

21

No embeds

No notes for slide

- 1. ● Data scientist at Cloudera ● Recently lead Apache Spark development at Cloudera ● Before that, committing on Apache Hadoop ● Before that, studying combinatorial optimization and distributed systems at Brown
- 2. ● How many kinds of stuff are there? ● Why is some stuff not like the others? ● How do I contextualize new stuff? ● Is there a simpler way to represent this stuff?
- 3. ● Learn hidden structure of your data ● Interpret new data as it relates to this structure
- 4. ● Clustering ○ Partition data into categories ● Dimensionality reduction ○ Find a condensed representation of your data
- 5. ● Designing a system for processing huge data in parallel ● Taking advantage of it with algorithms that work well in parallel
- 6. bigfile.txt lines val lines = sc.textFile (“bigfile.txt”) numbers Partition Partition Partition Partition Partition Partition HDFS sum Driver val numbers = lines.map ((x) => x.toDouble) numbers.sum()
- 7. bigfile.txt lines val lines = sc.textFile (“bigfile.txt”) numbers Partition Partition Partition Partition Partition Partition HDFS sum Driver val numbers = lines.map ((x) => x.toInt) numbers.cache() .sum()
- 8. bigfile.txt lines numbers Partition Partition Partition sum Driver
- 9. Discrete Continuous Supervised Classification ● Logistic regression (and regularized variants) ● Linear SVM ● Naive Bayes ● Random Decision Forests (soon) Regression ● Linear regression (and regularized variants) Unsupervised Clustering ● K-means Dimensionality reduction, matrix factorization ● Principal component analysis / singular value decomposition ● Alternating least squares
- 10. Discrete Continuous Supervised Classification ● Logistic regression (and regularized variants) ● Linear SVM ● Naive Bayes ● Random Decision Forests (soon) Regression ● Linear regression (and regularized variants) Unsupervised Clustering ● K-means Dimensionality reduction, matrix factorization ● Principal component analysis / singular value decomposition ● Alternating least squares
- 11. ● Anomalies as data points far away from any cluster
- 12. val data = sc.textFile("kmeans_data.txt") val parsedData = data.map( _.split(' ').map(_.toDouble)) // Cluster the data into two classes using KMeans val numIterations = 20 val numClusters = 2 val clusters = KMeans.train(parsedData, numClusters, numIterations)
- 13. ● Alternate between two steps: ○ Assign each point to a cluster based on existing centers ○ Recompute cluster centers from the points in each cluster
- 14. ● Alternate between two steps: ○ Assign each point to a cluster based on existing centers ■ Process each data point independently ○ Recompute cluster centers from the points in each cluster ■ Average across partitions
- 15. // Find the sum and count of points mapping to each center val totalContribs = data.mapPartitions { points => val k = centers.length val dims = centers(0).vector.length val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]]) val counts = Array.fill(k)(0L) points.foreach { point => val (bestCenter, cost) = KMeans.findClosest(centers, point) costAccum += cost sums(bestCenter) += point.vector counts(bestCenter) += 1 } val contribs = for (j <- 0 until k) yield { (j, (sums(j), counts(j))) } contribs.iterator }.reduceByKey(mergeContribs).collectAsMap()
- 16. // Update the cluster centers and costs var changed = false var j = 0 while (j < k) { val (sum, count) = totalContribs(j) if (count != 0) { sum /= count.toDouble val newCenter = new BreezeVectorWithNorm(sum) if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) { changed = true } centers(j) = newCenter } j += 1 } if (!changed) { logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations") } cost = costAccum.value
- 17. ● K-Means is very sensitive to initial set of center points chosen. ● Best existing algorithm for choosing centers is highly sequential.
- 18. ● Start with random point from dataset ● Pick another one randomly, with probability proportional to distance from the closest already chosen ● Repeat until initial centers chosen
- 19. ● Initial cluster has expected bound of O(log k) of optimum cost
- 20. ● Requires k passes over the data
- 21. ● Do only a few (~5) passes ● Sample m points on each pass ● Oversample ● Run K-Means++ on sampled points to find initial centers
- 22. Discrete Continuous Supervised Classification ● Logistic regression (and regularized variants) ● Linear SVM ● Naive Bayes ● Random Decision Forests (soon) Regression ● Linear regression (and regularized variants) Unsupervised Clustering ● K-means Dimensionality reduction, matrix factorization ● Principal component analysis / singular value decomposition ● Alternating least squares
- 23. ● Select a basis for your data that ○ Is orthonormal ○ Maximizes variance along its axes
- 24. ● Find dominant trends
- 25. ● Find a lower-dimensional representation that lets you visualize the data ● Feature learning - find a representation that’ s good for clustering or classification ● Latent Semantic Analysis
- 26. val data: RDD[Vector] = ... val mat = new RowMatrix(data) // compute the top 5 principal components val principalComponents = mat.computePrincipalComponents(5) // project data into subspace val transformed = data.map(_.toBreeze * mat.toBreeze)
- 27. ● Center data ● Find covariance matrix ● Its eigenvectors are the principal components
- 28. Datam n Covariance Matrix n n
- 29. Data m n Data Data Data Data Data
- 30. Data m n Data Data Data Data Data n n n n ...
- 31. Data m n Data Data Data Data Data n n n n ... ...
- 32. n n
- 33. n n
- 34. n n
- 35. def computeGramianMatrix (): Matrix = { val n = numCols().toInt val nt: Int = n * (n + 1) / 2 // Compute the upper triangular part of the gram matrix. val GU = rows.aggregate( new BDV[Double](new Array[Double](nt)))( seqOp = (U, v) => { RowMatrix.dspr( 1.0, v, U.data) U }, combOp = (U1, U2) => U1 += U2 ) RowMatrix.triuToFull(n, GU.data) }
- 36. n n
- 37. ● n^2 must fit in memory
- 38. ● n^2 must fit in memory ● Not yet implemented: EM algorithm can do it with O(kn), where k is the number of principal components

No public clipboards found for this slide

×
### Save the most important slides with Clipping

Clipping is a handy way to collect and organize the most important slides from a presentation. You can keep your great finds in clipboards organized around topics.

Be the first to comment