● Data scientist at Cloudera
● Recently led Apache Spark development at
Cloudera
● Before that, committing on Apache Hadoop
● Before that, studying combinatorial
optimization and distributed systems at
Brown
● How many kinds of stuff are there?
● Why is some stuff not like the others?
● How do I contextualize new stuff?
● Is there a simpler way to represent this stuff?
● Learn hidden structure of your data
● Interpret new data as it relates to this
structure
● Clustering
○ Partition data into categories
● Dimensionality reduction
○ Find a condensed representation of your
data
● Designing a system for processing huge
data in parallel
● Taking advantage of it with algorithms that
work well in parallel
[Diagram: bigfile.txt (HDFS) → lines → numbers, each shown as partitions; sum is returned to the driver]
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toDouble)
numbers.sum()
[Diagram: bigfile.txt (HDFS) → lines → numbers, each shown as partitions; sum is returned to the driver]
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toInt)
numbers.cache().sum()
[Diagram: bigfile.txt, lines, numbers; three partitions; sum is returned to the driver]
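The map-then-sum flow above can be imitated without a cluster. This is a minimal local sketch in plain Scala, not Spark code: a sequence of sequences stands in for an RDD's partitions, and `PartitionedSum`, `parse`, and `sum` are hypothetical names chosen for illustration.

```scala
object PartitionedSum {
  // Hypothetical stand-in for an RDD: a collection of partitions.
  type Partitions[A] = Seq[Seq[A]]

  // Counterpart of lines.map(_.toDouble): applied to each partition independently.
  def parse(lines: Partitions[String]): Partitions[Double] =
    lines.map(_.map(_.toDouble))

  // Counterpart of numbers.sum(): each partition computes a partial sum,
  // and only those partial sums are combined on the driver.
  def sum(numbers: Partitions[Double]): Double = {
    val partialSums = numbers.map(_.sum) // one value per partition
    partialSums.sum                      // combined on the driver
  }
}
```

For example, `PartitionedSum.sum(PartitionedSum.parse(Seq(Seq("1", "2"), Seq("3.5"))))` evaluates to 6.5. The point of the design is that the amount of data sent to the driver depends on the number of partitions, not on the number of lines.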
Supervised, discrete: Classification
● Logistic regression (and regularized variants)
● Linear SVM
● Naive Bayes
● Random Decision Forests (soon)
Supervised, continuous: Regression
● Linear regression (and regularized variants)
Unsupervised, discrete: Clustering
● K-means
Unsupervised, continuous: Dimensionality reduction, matrix factorization
● Principal component analysis / singular value decomposition
● Alternating least squares
● Anomalies as data points far away from any
cluster
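import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data: one space-separated point per line.
// MLlib 1.0+ expects an RDD[Vector] rather than an RDD[Array[Double]].
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
// Cluster the data into two classes using KMeans
val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters, numIterations)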
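Once cluster centers are known, the idea of anomalies as points far from any cluster can be sketched as a distance check. This is a plain-Scala illustration under assumptions: `AnomalyScore`, `distanceToNearest`, `anomalies`, and the `threshold` parameter are hypothetical names, and the centers are passed in as arrays (with MLlib they would come from the trained model's `clusterCenters`).

```scala
object AnomalyScore {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Distance from a point to its nearest cluster center.
  def distanceToNearest(centers: Seq[Point], p: Point): Double =
    math.sqrt(centers.map(c => squaredDistance(c, p)).min)

  // Keep the points whose nearest center is farther away than the threshold.
  def anomalies(centers: Seq[Point], points: Seq[Point], threshold: Double): Seq[Point] =
    points.filter(p => distanceToNearest(centers, p) > threshold)
}
```

How to pick the threshold is left open here; one option is a high percentile of the distances observed on the training data.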
● Alternate between two steps:
○ Assign each point to a cluster based on
existing centers
■ Process each data point independently
○ Recompute cluster centers from the
points in each cluster
■ Average across partitions
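The two alternating steps can be sketched locally before looking at a distributed version. The following plain-Scala sketch is an illustration, not MLlib's code: `LloydStep` and its methods are hypothetical names, and nested sequences stand in for partitions.

```scala
object LloydStep {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def add(a: Point, b: Point): Point =
    a.zip(b).map { case (x, y) => x + y }

  // Index of the center closest to p.
  def closest(centers: IndexedSeq[Point], p: Point): Int =
    centers.indices.minBy(j => squaredDistance(centers(j), p))

  // Per-partition work: sum and count of the points assigned to each center.
  def contributions(centers: IndexedSeq[Point], partition: Seq[Point]): Map[Int, (Point, Long)] =
    partition.groupBy(p => closest(centers, p)).map { case (j, ps) =>
      (j, (ps.reduce((a, b) => add(a, b)), ps.size.toLong))
    }

  // Merge two (sum, count) pairs that belong to the same center.
  def merge(a: (Point, Long), b: (Point, Long)): (Point, Long) =
    (add(a._1, b._1), a._2 + b._2)

  // One iteration: assign within each partition, then average across partitions.
  def step(centers: IndexedSeq[Point], partitions: Seq[Seq[Point]]): IndexedSeq[Point] = {
    val totals = partitions
      .flatMap(part => contributions(centers, part).toSeq)
      .groupBy(_._1)
      .map { case (j, kvs) => (j, kvs.map(_._2).reduce((a, b) => merge(a, b))) }
    // A center with no assigned points keeps its previous position.
    centers.indices.map { j =>
      totals.get(j).map { case (sum, n) => sum.map(_ / n) }.getOrElse(centers(j))
    }
  }
}
```

Only the per-center sums and counts cross partition boundaries, which is why the recompute step parallelizes well.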
// Find the sum and count of points mapping to each center
val totalContribs = data.mapPartitions { points =>
  val k = centers.length
  val dims = centers(0).vector.length
  val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
  val counts = Array.fill(k)(0L)
  points.foreach { point =>
    val (bestCenter, cost) = KMeans.findClosest(centers, point)
    costAccum += cost
    sums(bestCenter) += point.vector
    counts(bestCenter) += 1
  }
  val contribs = for (j <- 0 until k) yield {
    (j, (sums(j), counts(j)))
  }
  contribs.iterator
}.reduceByKey(mergeContribs).collectAsMap()
// Update the cluster centers and costs
var changed = false
var j = 0
while (j < k) {
  val (sum, count) = totalContribs(j)
  if (count != 0) {
    sum /= count.toDouble
    val newCenter = new BreezeVectorWithNorm(sum)
    if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
      changed = true
    }
    centers(j) = newCenter
  }
  j += 1
}
if (!changed) {
  logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
}
cost = costAccum.value
● K-Means is very sensitive to the initial set of
center points chosen.
● The best existing algorithm for choosing centers
(K-Means++) is highly sequential.
● Start with a random point from the dataset
● Pick another one randomly, with probability
proportional to its squared distance from the
closest center already chosen
● Repeat until k initial centers are chosen
● Initial clustering has expected cost within a
factor of O(log k) of the optimum
● Requires k passes over the data
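The seeding procedure above can be sketched in plain Scala. This is a minimal local illustration with hypothetical names (`KMeansPlusPlus`, `init`), assuming a non-empty dataset and squared Euclidean distance as the sampling weight; each loop iteration corresponds to one pass over the data.

```scala
import scala.util.Random

object KMeansPlusPlus {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Choose k initial centers from points (assumes points is non-empty).
  def init(points: IndexedSeq[Point], k: Int, rng: Random): Vector[Point] = {
    // Start with a random point from the dataset.
    var centers = Vector(points(rng.nextInt(points.length)))
    while (centers.length < k) {
      // Weight of each point: squared distance to its closest chosen center.
      val weights = points.map(p => centers.map(c => squaredDistance(c, p)).min)
      val total = weights.sum
      val next =
        if (total == 0.0) {
          // Every point coincides with a chosen center: fall back to a uniform pick.
          points(rng.nextInt(points.length))
        } else {
          // Sample an index with probability proportional to its weight.
          val target = rng.nextDouble() * total
          val cumulative = weights.scanLeft(0.0)(_ + _).tail
          val i = cumulative.indexWhere(_ > target)
          points(if (i >= 0) i else points.length - 1)
        }
      centers = centers :+ next
    }
    centers
  }
}
```

A point that is already a center has weight zero, so it is not picked again unless all remaining weights are zero. The loop also makes the sequential dependency visible: each pick needs the distances to all previous picks.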
● Do only a few (~5) passes
● Sample m points on each pass
● Oversample
● Run K-Means++ on sampled points to find
initial centers
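The oversampling passes can be sketched as follows. This is a hedged plain-Scala illustration, not MLlib's implementation: `KMeansParallelInit`, `candidates`, `m`, and `passes` are hypothetical names, and the sketch assumes each point is kept with probability proportional to its squared distance from the closest candidate so far, as in K-Means++. The final step, running K-Means++ on the collected candidates, is not shown.

```scala
import scala.util.Random

object KMeansParallelInit {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Cost of a point: squared distance to its closest candidate center.
  def cost(candidates: Seq[Point], p: Point): Double =
    candidates.map(c => squaredDistance(c, p)).min

  // Collect candidate centers in a fixed number of passes (assumes points is non-empty).
  // m is the expected number of points sampled per pass (an assumed parameter).
  def candidates(points: IndexedSeq[Point], m: Int, passes: Int, rng: Random): Vector[Point] = {
    var chosen = Vector(points(rng.nextInt(points.length)))
    for (_ <- 0 until passes) {
      val costs = points.map(p => cost(chosen, p))
      val total = costs.sum
      // Each point is kept independently of the others, so within a pass
      // the partitions only need to share the total cost.
      val sampled = points.zip(costs).collect {
        case (p, c) if total > 0.0 && rng.nextDouble() < m * c / total => p
      }
      chosen = chosen ++ sampled
    }
    chosen
  }
}
```

Unlike the sequential seeding, the number of passes here is fixed in advance and does not grow with k.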
● Select a basis for your data that
○ Is orthonormal
○ Maximizes variance along its axes
● Find dominant trends
● Find a lower-dimensional representation that
lets you visualize the data
● Feature learning - find a representation that's
good for clustering or classification
● Latent Semantic Analysis
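import org.apache.spark.mllib.linalg.{Matrix, Vector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD
val data: RDD[Vector] = ...
val mat = new RowMatrix(data)
// Compute the top 5 principal components (an n x 5 local matrix)
val principalComponents: Matrix = mat.computePrincipalComponents(5)
// Project the rows into the 5-dimensional principal subspace
val transformed: RowMatrix = mat.multiply(principalComponents)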
● Center data
● Find covariance matrix
● Its eigenvectors are the principal
components
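The three steps can be traced on a tiny in-memory dataset. The sketch below is plain Scala with hypothetical names (`TinyPca`, `center`, `covariance`, `topComponent`). It substitutes power iteration for a full eigendecomposition, so it recovers only the first principal component, and it assumes the largest eigenvalue is distinct and the starting vector is not orthogonal to its eigenvector.

```scala
object TinyPca {
  type Row = Array[Double]

  // Step 1: subtract the column means so every feature is centered at zero.
  def center(rows: Seq[Row]): Seq[Row] = {
    val n = rows.head.length
    val means = Array.tabulate(n)(j => rows.map(_(j)).sum / rows.size)
    rows.map(r => Array.tabulate(n)(j => r(j) - means(j)))
  }

  // Step 2: n x n sample covariance matrix of already-centered rows.
  def covariance(centered: Seq[Row]): Array[Array[Double]] = {
    val n = centered.head.length
    Array.tabulate(n, n) { (i, j) =>
      centered.map(r => r(i) * r(j)).sum / (centered.size - 1)
    }
  }

  // Step 3 (simplified): power iteration approaches the eigenvector with the
  // largest eigenvalue, which is the first principal component.
  def topComponent(cov: Array[Array[Double]], iterations: Int = 100): Array[Double] = {
    val n = cov.length
    var v = Array.fill(n)(1.0 / math.sqrt(n))
    for (_ <- 0 until iterations) {
      val w = cov.map(row => row.zip(v).map { case (a, b) => a * b }.sum)
      val norm = math.sqrt(w.map(x => x * x).sum)
      if (norm > 0.0) v = w.map(_ / norm)
    }
    v
  }
}
```

For rows lying on the line y = 2x, such as (1, 2), (2, 4), (3, 6), the recovered component points along (1, 2) normalized to unit length.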
[Diagram: m x n data matrix, split by rows into blocks; each block yields an n x n partial matrix, and the partial matrices are summed into the n x n covariance matrix]
def computeGramianMatrix(): Matrix = {
  val n = numCols().toInt
  val nt: Int = n * (n + 1) / 2
  // Compute the upper triangular part of the gram matrix.
  val GU = rows.aggregate(new BDV[Double](new Array[Double](nt)))(
    seqOp = (U, v) => {
      RowMatrix.dspr(1.0, v, U.data)
      U
    },
    combOp = (U1, U2) => U1 += U2
  )
  RowMatrix.triuToFull(n, GU.data)
}
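What the packed accumulator holds can be shown without BLAS or Spark. The sketch below is a plain-Scala illustration with hypothetical names (`PackedGramian`, `addOuter`, `toFull`, `gramian`). It assumes the column-major upper-triangular packing used by BLAS `dspr` with the upper option, where entry (i, j) with i <= j is stored at index i + j(j + 1)/2.

```scala
object PackedGramian {
  // Rank-1 update U += v * v^T, touching only the upper triangle.
  // Entry (i, j) with i <= j lives at index i + j * (j + 1) / 2.
  def addOuter(packed: Array[Double], v: Array[Double]): Array[Double] = {
    var j = 0
    while (j < v.length) {
      var i = 0
      while (i <= j) {
        packed(i + j * (j + 1) / 2) += v(i) * v(j)
        i += 1
      }
      j += 1
    }
    packed
  }

  // Expand the packed upper triangle into a full symmetric n x n matrix.
  def toFull(n: Int, packed: Array[Double]): Array[Array[Double]] =
    Array.tabulate(n, n) { (r, c) =>
      val (i, j) = if (r <= c) (r, c) else (c, r)
      packed(i + j * (j + 1) / 2)
    }

  // A^T A over row partitions: each partition folds its rows into a packed
  // accumulator (the seqOp role), then accumulators are added (the combOp role).
  def gramian(partitions: Seq[Seq[Array[Double]]], n: Int): Array[Array[Double]] = {
    val nt = n * (n + 1) / 2
    val partials = partitions.map { part =>
      part.foldLeft(new Array[Double](nt))((acc, row) => addOuter(acc, row))
    }
    val total = partials.foldLeft(new Array[Double](nt)) { (a, b) =>
      a.indices.foreach(i => a(i) += b(i))
      a
    }
    toFull(n, total)
  }
}
```

Storing n(n + 1)/2 values instead of n^2 roughly halves the size of each partial result that has to be merged.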
● The n x n covariance matrix (n^2 entries) must fit in memory
● Not yet implemented: an EM algorithm can do it
with O(kn) memory, where k is the number of
principal components

 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Intro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdfIntro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdf
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 

Unsupervised Learning with Apache Spark

  • 2. ● Data scientist at Cloudera
       ● Recently led Apache Spark development at Cloudera
       ● Before that, committing on Apache Hadoop
       ● Before that, studying combinatorial optimization and distributed systems at Brown
  • 14. ● How many kinds of stuff are there?
        ● Why is some stuff not like the others?
        ● How do I contextualize new stuff?
        ● Is there a simpler way to represent this stuff?
  • 15. ● Learn hidden structure of your data
        ● Interpret new data as it relates to this structure
  • 16. ● Clustering
          ○ Partition data into categories
        ● Dimensionality reduction
          ○ Find a condensed representation of your data
  • 17. ● Designing a system for processing huge data in parallel
        ● Taking advantage of it with algorithms that work well in parallel
  • 21. [Diagram labels: bigfile.txt (HDFS), lines, numbers, Partition (x6), sum, Driver]
        val lines = sc.textFile("bigfile.txt")
        val numbers = lines.map((x) => x.toDouble)
        numbers.sum()
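A minimal sketch of what the diagram on this slide suggests, written with plain Scala collections rather than Spark so it runs on its own: the input is split into partitions, each partition is parsed and summed independently, and the partial sums are combined in one place (the role the driver plays). The names `PartitionedSumSketch`, `partition`, and `numPartitions` are hypothetical and are not part of Spark's API.

```scala
object PartitionedSumSketch {
  // Split the input lines into roughly equal chunks, standing in for partitions.
  def partition(lines: Seq[String], numPartitions: Int): Seq[Seq[String]] = {
    val size = math.max(1, math.ceil(lines.size.toDouble / numPartitions).toInt)
    lines.grouped(size).toSeq
  }

  // Parse and sum each partition independently, then combine the partial sums
  // in a single place, as the driver does in the diagram.
  def sum(lines: Seq[String], numPartitions: Int): Double = {
    val partialSums = partition(lines, numPartitions).map(_.map(_.toDouble).sum)
    partialSums.sum
  }
}
```

For example, `PartitionedSumSketch.sum(Seq("1.5", "2.5", "3", "4"), 3)` sums two chunks of two lines each and returns 11.0.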
  • 23. [Diagram labels: bigfile.txt (HDFS), lines, numbers, Partition (x6), sum, Driver]
        val lines = sc.textFile("bigfile.txt")
        val numbers = lines.map((x) => x.toInt)
        numbers.cache().sum()
  • 27. Supervised
          Discrete: Classification
            ● Logistic regression (and regularized variants)
            ● Linear SVM
            ● Naive Bayes
            ● Random Decision Forests (soon)
          Continuous: Regression
            ● Linear regression (and regularized variants)
        Unsupervised
          Discrete: Clustering
            ● K-means
          Continuous: Dimensionality reduction, matrix factorization
            ● Principal component analysis / singular value decomposition
            ● Alternating least squares
  • 31. ● Anomalies as data points far away from any cluster
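One way to read this slide, sketched in plain Scala: given cluster centers that have already been computed, a point is flagged as an anomaly when its distance to the nearest center exceeds a cutoff. The `threshold` parameter and all names here are hypothetical; the slides do not say how a cutoff would be chosen.

```scala
object AnomalySketch {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Euclidean distance from a point to its nearest cluster center.
  def distanceToNearest(centers: Seq[Point], p: Point): Double =
    math.sqrt(centers.map(squaredDistance(_, p)).min)

  // Keep the points that lie farther than `threshold` from every center.
  def anomalies(centers: Seq[Point], points: Seq[Point], threshold: Double): Seq[Point] =
    points.filter(p => distanceToNearest(centers, p) > threshold)
}
```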
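  • 37. import org.apache.spark.mllib.clustering.KMeans
        import org.apache.spark.mllib.linalg.Vectors
        // Load the data and parse each line into a dense vector
        // (KMeans.train takes an RDD[Vector] as of Spark 1.0)
        val data = sc.textFile("kmeans_data.txt")
        val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
        // Cluster the data into two classes using KMeans
        val numIterations = 20
        val numClusters = 2
        val clusters = KMeans.train(parsedData, numClusters, numIterations)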
  • 38. ● Alternate between two steps:
          ○ Assign each point to a cluster based on existing centers
          ○ Recompute cluster centers from the points in each cluster
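The two alternating steps can be sketched for a single iteration in plain Scala, on in-memory data. This is an illustration only, not MLlib's implementation; `LloydSketch`, `closest`, and `step` are hypothetical names, and leaving an empty cluster's center unchanged is an assumption of this sketch.

```scala
object LloydSketch {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Step 1: index of the center closest to p.
  def closest(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(j => squaredDistance(centers(j), p))

  // Step 2: replace each center with the mean of the points assigned to it.
  // A center with no assigned points is left unchanged.
  def step(centers: Seq[Point], points: Seq[Point]): Seq[Point] = {
    val assigned = points.groupBy(p => closest(centers, p))
    centers.indices.map { j =>
      assigned.get(j) match {
        case Some(ps) =>
          val dims = ps.head.length
          Array.tabulate(dims)(d => ps.map(_(d)).sum / ps.size)
        case None => centers(j)
      }
    }
  }
}
```

Calling `step` repeatedly until the centers stop moving (or an iteration limit is reached) gives the full loop.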
  • 44. ● Alternate between two steps:
          ○ Assign each point to a cluster based on existing centers
            ■ Process each data point independently
          ○ Recompute cluster centers from the points in each cluster
            ■ Average across partitions
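A sketch of why both steps parallelize, using plain Scala sequences as stand-ins for partitions: each partition produces, per center, a (vector sum, count) pair without looking at any other partition, and averaging across partitions only requires adding those pairs together. All names (`contribs`, `merge`, `newCenters`) are hypothetical and the code is not taken from Spark.

```scala
object PartitionedKMeansSketch {
  type Point = Array[Double]
  // Per-center contribution: (vector sum of assigned points, number of assigned points).
  type Contrib = (Array[Double], Long)

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Within one partition: assign each point independently and
  // accumulate per-center sums and counts.
  def contribs(centers: Seq[Point], partition: Seq[Point]): Map[Int, Contrib] = {
    val dims = centers.head.length
    val sums = Array.fill(centers.size)(new Array[Double](dims))
    val counts = new Array[Long](centers.size)
    partition.foreach { p =>
      val j = centers.indices.minBy(i => squaredDistance(centers(i), p))
      for (d <- 0 until dims) sums(j)(d) += p(d)
      counts(j) += 1
    }
    centers.indices.map(j => (j, (sums(j), counts(j)))).toMap
  }

  // Across partitions: add sums and counts element-wise.
  def merge(a: Contrib, b: Contrib): Contrib =
    (a._1.zip(b._1).map { case (x, y) => x + y }, a._2 + b._2)

  // Merge every partition's contributions, then divide each sum by its count.
  // Assumes at least one partition; empty clusters keep their old center.
  def newCenters(centers: Seq[Point], partitions: Seq[Seq[Point]]): Seq[Point] = {
    val total = partitions.map(contribs(centers, _)).reduce { (m1, m2) =>
      m1.map { case (j, c) => (j, merge(c, m2(j))) }
    }
    centers.indices.map { j =>
      val (sum, count) = total(j)
      if (count == 0) centers(j) else sum.map(_ / count)
    }
  }
}
```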
  • 45. // Find the sum and count of points mapping to each center
        val totalContribs = data.mapPartitions { points =>
          val k = centers.length
          val dims = centers(0).vector.length
          val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
          val counts = Array.fill(k)(0L)
          points.foreach { point =>
            val (bestCenter, cost) = KMeans.findClosest(centers, point)
            costAccum += cost
            sums(bestCenter) += point.vector
            counts(bestCenter) += 1
          }
          val contribs = for (j <- 0 until k) yield {
            (j, (sums(j), counts(j)))
          }
          contribs.iterator
        }.reduceByKey(mergeContribs).collectAsMap()
  • 46. // Update the cluster centers and costs
        var changed = false
        var j = 0
        while (j < k) {
          val (sum, count) = totalContribs(j)
          if (count != 0) {
            sum /= count.toDouble
            val newCenter = new BreezeVectorWithNorm(sum)
            if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
              changed = true
            }
            centers(j) = newCenter
          }
          j += 1
        }
        if (!changed) {
          logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
        }
        cost = costAccum.value
  • 48. ● K-Means is very sensitive to the initial set of center points chosen.
        ● The best existing algorithm for choosing centers is highly sequential.
  • 50. ● Start with a random point from the dataset
        ● Pick another one randomly, with probability proportional to its squared distance from the closest already-chosen center
        ● Repeat until all k initial centers are chosen
  • 51. ● The initial clustering has expected cost within a factor of O(log k) of the optimum cost
  • 52. ● Requires k passes over the data
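The K-Means++ seeding procedure from the preceding slides can be sketched as follows in plain Scala. It assumes the standard formulation, in which a point is drawn with probability proportional to its squared distance from the nearest center chosen so far; `KMeansPlusPlusSketch`, `sampleIndex`, and `init` are hypothetical names. Each new center needs the distances of all points to the current centers, which is why the loop makes one pass over the data per center.

```scala
import scala.util.Random

object KMeansPlusPlusSketch {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Draw an index with probability proportional to its weight.
  // Falls back to the last index if rounding leaves no match.
  def sampleIndex(weights: Seq[Double], rng: Random): Int = {
    val target = rng.nextDouble() * weights.sum
    val cumulative = weights.scanLeft(0.0)(_ + _).tail
    val idx = cumulative.indexWhere(_ > target)
    if (idx >= 0) idx else weights.size - 1
  }

  // Choose k initial centers from `points` (assumed non-empty).
  def init(points: IndexedSeq[Point], k: Int, rng: Random): Seq[Point] = {
    // First center: a uniformly random point.
    var centers = Vector(points(rng.nextInt(points.size)))
    while (centers.size < k) {
      // One pass over the data: squared distance to the nearest chosen center.
      val weights = points.map(p => centers.map(squaredDistance(_, p)).min)
      centers = centers :+ points(sampleIndex(weights, rng))
    }
    centers
  }
}
```

Points that are already centers have weight zero, so they are not drawn again except through the rounding fallback.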
  • 53. ● Do only a few (~5) passes
        ● Sample m points on each pass
        ● Oversample
        ● Run K-Means++ on sampled points to find initial centers
  • 64. ● Select a basis for your data that
          ○ Is orthonormal
          ○ Maximizes variance along its axes
  • 67. ● Find a lower-dimensional representation that lets you visualize the data
        ● Feature learning: find a representation that's good for clustering or classification
        ● Latent Semantic Analysis
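  • 68. import org.apache.spark.mllib.linalg.Vector
        import org.apache.spark.mllib.linalg.distributed.RowMatrix
        import org.apache.spark.rdd.RDD
        val data: RDD[Vector] = ...
        val mat = new RowMatrix(data)
        // compute the top 5 principal components (returned as a local matrix)
        val principalComponents = mat.computePrincipalComponents(5)
        // project data into the subspace spanned by those components
        val transformed = mat.multiply(principalComponents)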
  • 69. ● Center data
        ● Find covariance matrix
        ● Its eigenvectors are the principal components
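The three steps can be sketched on a small in-memory dataset in plain Scala. The eigenvector step here uses power iteration, a technique swapped in for brevity that finds only the leading principal component; it is not how the slides or MLlib compute eigenvectors, and it assumes the starting vector is not orthogonal to that component. All names are hypothetical.

```scala
object PcaSketch {
  type Matrix = Array[Array[Double]]

  // Step 1: subtract the per-column mean from every row.
  def center(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
    val n = rows.head.length
    val means = Array.tabulate(n)(j => rows.map(_(j)).sum / rows.size)
    rows.map(r => Array.tabulate(n)(j => r(j) - means(j)))
  }

  // Step 2: sample covariance matrix (n x n) of already-centered rows.
  // Assumes at least two rows.
  def covariance(centered: Seq[Array[Double]]): Matrix = {
    val n = centered.head.length
    Array.tabulate(n, n)((i, j) => centered.map(r => r(i) * r(j)).sum / (centered.size - 1))
  }

  // Step 3 (leading component only): power iteration repeatedly applies the
  // matrix and renormalizes, converging toward the dominant eigenvector.
  def topEigenvector(m: Matrix, iterations: Int = 100): Array[Double] = {
    var v = Array.fill(m.length)(1.0 / math.sqrt(m.length))
    for (_ <- 0 until iterations) {
      val mv = m.map(row => row.zip(v).map { case (a, b) => a * b }.sum)
      val norm = math.sqrt(mv.map(x => x * x).sum)
      if (norm > 0) v = mv.map(_ / norm)
    }
    v
  }
}
```

For rows that vary only along the first axis, such as (1, 0), (2, 0), (3, 0), the covariance matrix has a single nonzero entry and the leading component comes out as the first coordinate axis.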
  • 77. def computeGramianMatrix(): Matrix = {
          val n = numCols().toInt
          val nt: Int = n * (n + 1) / 2
          // Compute the upper triangular part of the gram matrix.
          val GU = rows.aggregate(new BDV[Double](new Array[Double](nt)))(
            seqOp = (U, v) => {
              RowMatrix.dspr(1.0, v, U.data)
              U
            },
            combOp = (U1, U2) => U1 += U2
          )
          RowMatrix.triuToFull(n, GU.data)
        }
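A plain-Scala sketch of the idea in the code above: because the Gramian is symmetric, only its upper triangle (n * (n + 1) / 2 entries) needs to be accumulated, one rank-1 update per row, and the full matrix is expanded once at the end. The helpers `addOuterProduct` and `toFull` are hypothetical stand-ins for the roles that `dspr` and `triuToFull` play on the slide, and the column-by-column packed layout is an assumption of this sketch.

```scala
object GramianSketch {
  // Rank-1 update U += v * v^T, touching only the upper triangle, which is
  // stored column by column in a packed array of length n * (n + 1) / 2.
  def addOuterProduct(v: Array[Double], packed: Array[Double]): Unit = {
    var idx = 0
    for (j <- v.indices; i <- 0 to j) {
      packed(idx) += v(i) * v(j)
      idx += 1
    }
  }

  // Expand the packed upper triangle into a full symmetric n x n matrix.
  def toFull(n: Int, packed: Array[Double]): Array[Array[Double]] = {
    val full = Array.ofDim[Double](n, n)
    var idx = 0
    for (j <- 0 until n; i <- 0 to j) {
      full(i)(j) = packed(idx)
      full(j)(i) = packed(idx)
      idx += 1
    }
    full
  }

  // Gramian A^T A of a matrix given as a non-empty sequence of rows.
  def gramian(rows: Seq[Array[Double]]): Array[Array[Double]] = {
    val n = rows.head.length
    val packed = new Array[Double](n * (n + 1) / 2)
    rows.foreach(addOuterProduct(_, packed))
    toFull(n, packed)
  }
}
```

For the rows (1, 2) and (3, 4), `gramian` returns the matrix with rows (10, 14) and (14, 20).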
  • 80. ● n^2 must fit in memory
        ● Not yet implemented: EM algorithm can do it with O(kn) memory, where k is the number of principal components
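To give the n^2 constraint a sense of scale, here is a small arithmetic sketch. It assumes the packed upper-triangular layout of n * (n + 1) / 2 entries at 8 bytes per double; the function names are hypothetical.

```scala
object GramianMemorySketch {
  // Number of doubles in the packed upper triangle of an n x n matrix.
  def packedEntries(n: Long): Long = n * (n + 1) / 2

  // Bytes needed for those entries at 8 bytes per double.
  def packedBytes(n: Long): Long = packedEntries(n) * 8L
}
```

With n = 10,000 columns this comes to 400,040,000 bytes (about 400 MB); with n = 100,000 it is 40,000,400,000 bytes (about 40 GB), on a single machine.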