Unsupervised Learning with Apache Spark

Unsupervised learning refers to a branch of algorithms that try to find structure in unlabeled data. Clustering algorithms, for example, try to partition elements of a dataset into related groups. Dimensionality reduction algorithms search for a simpler representation of a dataset. Spark's MLlib module contains implementations of several unsupervised learning algorithms that scale to huge datasets. In this talk, we'll dive into uses and implementations of Spark's K-means clustering and Singular Value Decomposition (SVD).

Bio:
Sandy Ryza is an engineer on the data science team at Cloudera. He is a committer on Apache Hadoop and recently led Cloudera's Apache Spark development.

    Presentation Transcript

    • ● Data scientist at Cloudera
      ● Recently led Apache Spark development at Cloudera
      ● Before that, a committer on Apache Hadoop
      ● Before that, studied combinatorial optimization and distributed systems at Brown
    • ● How many kinds of stuff are there?
      ● Why is some stuff not like the others?
      ● How do I contextualize new stuff?
      ● Is there a simpler way to represent this stuff?
    • ● Learn hidden structure of your data
      ● Interpret new data as it relates to this structure
    • ● Clustering
        ○ Partition data into categories
      ● Dimensionality reduction
        ○ Find a condensed representation of your data
    • ● Designing a system for processing huge data in parallel
      ● Taking advantage of it with algorithms that work well in parallel
    • [Diagram: bigfile.txt in HDFS → lines → numbers, split across partitions, with the sum computed on the driver]

      val lines = sc.textFile("bigfile.txt")
      val numbers = lines.map((x) => x.toDouble)
      numbers.sum()
    • [Diagram: the same pipeline, with the numbers RDD cached across partitions]

      val lines = sc.textFile("bigfile.txt")
      val numbers = lines.map((x) => x.toInt)
      numbers.cache().sum()
    • [Diagram: with numbers cached, the sum runs from the in-memory partitions]
    • MLlib's algorithms, by learning type (supervised/unsupervised) and output type (discrete/continuous):
      Supervised, discrete: Classification
        ● Logistic regression (and regularized variants)
        ● Linear SVM
        ● Naive Bayes
        ● Random Decision Forests (soon)
      Supervised, continuous: Regression
        ● Linear regression (and regularized variants)
      Unsupervised, discrete: Clustering
        ● K-means
      Unsupervised, continuous: Dimensionality reduction, matrix factorization
        ● Principal component analysis / singular value decomposition
        ● Alternating least squares
    • ● Anomalies as data points far away from any cluster
    • import org.apache.spark.mllib.clustering.KMeans
      import org.apache.spark.mllib.linalg.Vectors

      val data = sc.textFile("kmeans_data.txt")
      val parsedData = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

      // Cluster the data into two classes using KMeans
      val numIterations = 20
      val numClusters = 2
      val clusters = KMeans.train(parsedData, numClusters, numIterations)
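      Tying this snippet back to the anomaly idea on the previous slide: a minimal sketch, assuming the clusters model and parsedData above (the cutoff of 10 points is an arbitrary illustration, not from the talk), that scores each point by its squared distance to the nearest learned center and surfaces the farthest points as anomaly candidates.

      // Score every point by squared Euclidean distance to its closest center
      val scored = parsedData.map { point =>
        val center = clusters.clusterCenters(clusters.predict(point))
        val sqDist = point.toArray.zip(center.toArray)
          .map { case (p, c) => (p - c) * (p - c) }.sum
        (point, sqDist)
      }

      // The points farthest from any center are the anomaly candidates
      val anomalies = scored.top(10)(Ordering.by(_._2))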
    • ● Alternate between two steps:
        ○ Assign each point to a cluster based on existing centers
        ○ Recompute cluster centers from the points in each cluster
    • ● Alternate between two steps:
        ○ Assign each point to a cluster based on existing centers
          ■ Process each data point independently
        ○ Recompute cluster centers from the points in each cluster
          ■ Average across partitions
    • // Find the sum and count of points mapping to each center
      // (from Spark's KMeans implementation; centers, costAccum, and
      // mergeContribs are defined elsewhere in the surrounding class)
      val totalContribs = data.mapPartitions { points =>
        val k = centers.length
        val dims = centers(0).vector.length
        val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
        val counts = Array.fill(k)(0L)
        // Each partition independently assigns its points to the closest center
        points.foreach { point =>
          val (bestCenter, cost) = KMeans.findClosest(centers, point)
          costAccum += cost
          sums(bestCenter) += point.vector
          counts(bestCenter) += 1
        }
        val contribs = for (j <- 0 until k) yield {
          (j, (sums(j), counts(j)))
        }
        contribs.iterator
      // Per-partition sums and counts are then merged across partitions
      }.reduceByKey(mergeContribs).collectAsMap()
    • // Update the cluster centers and costs
      var changed = false
      var j = 0
      while (j < k) {
        val (sum, count) = totalContribs(j)
        if (count != 0) {
          // New center = mean of the points assigned to cluster j
          sum /= count.toDouble
          val newCenter = new BreezeVectorWithNorm(sum)
          // We have converged only when no center moves more than epsilon
          if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
            changed = true
          }
          centers(j) = newCenter
        }
        j += 1
      }
      if (!changed) {
        logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
      }
      cost = costAccum.value
    • ● K-means is very sensitive to the initial set of centers chosen.
      ● The best existing algorithm for choosing centers is highly sequential.
    • ● Start with a random point from the dataset
      ● Pick another one randomly, with probability proportional to its squared distance from the closest center already chosen
      ● Repeat until all initial centers are chosen
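      A minimal driver-side sketch of this seeding procedure, assuming squared Euclidean distance and plain arrays of doubles for points (the function name and representation are illustrative, not MLlib's):

      import scala.util.Random

      def sqDist(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

      def kMeansPlusPlusInit(points: Array[Array[Double]], k: Int, rand: Random): Array[Array[Double]] = {
        // Start with one uniformly random point
        val centers = scala.collection.mutable.ArrayBuffer(points(rand.nextInt(points.length)))
        while (centers.length < k) {
          // Weight each point by squared distance to its closest chosen center
          val weights = points.map(p => centers.map(c => sqDist(p, c)).min)
          // Sample one point with probability proportional to its weight
          var r = rand.nextDouble() * weights.sum
          var i = 0
          while (i < points.length - 1 && r >= weights(i)) {
            r -= weights(i)
            i += 1
          }
          centers += points(i)
        }
        centers.toArray
      }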
    • ● The initial clustering has expected cost within an O(log k) factor of the optimal cost
    • ● Requires k passes over the data
    • ● Do only a few (~5) passes
      ● Sample m points on each pass
      ● Oversample
      ● Run K-means++ on the sampled points to find the initial centers (sketched below)
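      A hedged sketch of this scheme on an RDD[Vector]; the names, the per-pass oversampling factor l, and the crude final step are illustrative assumptions (MLlib's real k-means|| implementation weights the candidates and runs k-means++ on them, as in the previous sketch):

      import scala.util.Random
      import org.apache.spark.mllib.linalg.{Vector, Vectors}
      import org.apache.spark.rdd.RDD

      def kMeansParallelInit(data: RDD[Vector], k: Int, passes: Int, l: Double): Array[Vector] = {
        // Start from one uniformly random point
        val centers = data.takeSample(withReplacement = false, 1).toBuffer
        for (_ <- 1 to passes) {
          val current = centers.toArray
          // Cost of each point = squared distance to its closest chosen center
          val costs = data.map(p => current.map(c => Vectors.sqdist(p, c)).min)
          val totalCost = costs.sum()
          // Keep each point independently with probability l * cost / totalCost,
          // oversampling an expected ~l new candidates per pass in a single scan
          val sampled = data.zip(costs)
            .filter { case (_, cost) => Random.nextDouble() < l * cost / totalCost }
            .keys.collect()
          centers ++= sampled
        }
        // Crude stand-in for the final local k-means++ run over the candidate set
        new Random(0).shuffle(centers).take(k).toArray
      }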
    • (The same taxonomy slide again; next up: dimensionality reduction via principal component analysis / SVD.)
    • ● Select a basis for your data that
        ○ Is orthonormal
        ○ Maximizes variance along its axes
    • ● Find dominant trends
    • ● Find a lower-dimensional representation that lets you visualize the data
      ● Feature learning: find a representation that's good for clustering or classification
      ● Latent Semantic Analysis
    • import org.apache.spark.mllib.linalg.Vector
      import org.apache.spark.mllib.linalg.distributed.RowMatrix
      import org.apache.spark.rdd.RDD

      val data: RDD[Vector] = ...
      val mat = new RowMatrix(data)

      // Compute the top 5 principal components
      val principalComponents = mat.computePrincipalComponents(5)

      // Project the data into the 5-dimensional subspace
      val transformed = mat.multiply(principalComponents)
    • ● Center data
      ● Find covariance matrix
      ● Its eigenvectors are the principal components
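      A small, local Breeze sketch of exactly this recipe (the toy matrix and names are illustrative assumptions; this is the serial version of what the following slides distribute):

      import breeze.linalg._

      // Toy dataset: m = 4 points, n = 2 features
      val data = DenseMatrix(
        (1.0, 2.0),
        (3.0, 5.0),
        (4.0, 4.0),
        (6.0, 9.0))
      val m = data.rows

      // Center the data: subtract each column's mean from that column
      val colMeans = (data.t * DenseVector.ones[Double](m)) / m.toDouble
      val centered = data(*, ::) - colMeans

      // Covariance matrix (n x n)
      val cov = (centered.t * centered) / (m - 1).toDouble

      // Its eigenvectors are the principal components; eigSym returns
      // eigenvalues in ascending order, so the last column is the top one
      val es = eigSym(cov)
      val topComponent = es.eigenvectors(::, cov.cols - 1)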
    • [Diagram: an m × n data matrix reduced to an n × n covariance matrix]
    • [Diagram: the data matrix split row-wise across partitions]
    • [Diagram: each partition computing a local n × n contribution]
    • [Diagram: the per-partition n × n contributions summed into the full n × n matrix]
    • [Diagram: the resulting n × n matrix, small enough for a single machine]
    • def computeGramianMatrix(): Matrix = {
        val n = numCols().toInt
        val nt: Int = n * (n + 1) / 2

        // Compute the upper triangular part of the gram matrix.
        val GU = rows.aggregate(new BDV[Double](new Array[Double](nt)))(
          seqOp = (U, v) => {
            RowMatrix.dspr(1.0, v, U.data)
            U
          },
          combOp = (U1, U2) => U1 += U2
        )

        RowMatrix.triuToFull(n, GU.data)
      }
    • ● n^2 must fit in memory
    • ● n^2 must fit in memory
      ● Not yet implemented: an EM algorithm can do it with O(kn) memory, where k is the number of principal components (the single-component idea is sketched below)
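      To make the O(kn) idea concrete for k = 1: a hedged sketch (not MLlib's EM algorithm; the names are illustrative) of power iteration, which recovers the top principal component of already-centered data by computing (A^T A)v in one pass over the rows per iteration, never materializing the n × n matrix:

      import breeze.linalg.{norm, DenseVector => BDV}
      import org.apache.spark.mllib.linalg.Vector
      import org.apache.spark.rdd.RDD

      // rows: the centered data, one n-dimensional point per record
      def topComponent(rows: RDD[Vector], n: Int, iters: Int = 50): BDV[Double] = {
        var v = BDV.fill(n)(scala.util.Random.nextDouble())
        for (_ <- 1 to iters) {
          val cur = v // stable snapshot for the closure
          // One distributed pass computes A^T (A v) = sum over rows r of (r . v) * r,
          // keeping only O(n) state on the driver
          val av = rows.map { r =>
            val br = BDV(r.toArray)
            br * (br dot cur)
          }.reduce(_ + _)
          v = av / norm(av)
        }
        v // converges to the direction of maximum variance
      }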