Spark MLlib
ABOUT ME
• Experience
Vpon Data Engineer
TWM, Keywear, Nielsen
• Bryan’s notes for data analysis
http://bryannotes.blogspot.tw
• Spark.TW
• LinkedIn
https://tw.linkedin.com/pub/bryan-yang/7b/763/a79
Agenda
• Introduction to Machine Learning
• Why MLlib
• Basic Statistics
• K-means
• Logistic Regression
• ALS
• Demo
Introduction to Machine
Learning
Definition
• Machine learning is a study of computer algorithms
that improve automatically through experience.
Terminology
• Observation
- The basic unit of data.
• Features
- Features are properties that can be used to describe each
observation in a quantitative manner.
• Feature vector
- An n-dimensional vector of numerical features that
represents some object.
• Training/ Testing/ Evaluation set
- Sets of data used to discover potential predictive relationships
and to check how well they hold
Learning(Training)
• Features:
1. Color: Red/ Green
2. Type: Fruit
3. Shape: nearly Circle
etc…
Learning(Training)
ID  Color  Type   Shape          Is apple (Label)
1   Red    Fruit  Circle         Y
2   Red    Fruit  Circle         Y
3   Black  Logo   nearly Circle  N
4   Blue   N/A    Circle         N
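To make the idea of a feature vector concrete, the four observations in the table can be turned into numeric vectors plus a label. This is only a minimal sketch: the integer codes in `COLOR`, `TYPE` and `SHAPE` are hypothetical choices made for illustration and are not part of the original slides.

```python
# Hypothetical category-to-number encodings (assumptions for illustration only).
COLOR = {"Red": 0, "Green": 1, "Black": 2, "Blue": 3}
TYPE = {"Fruit": 0, "Logo": 1, "N/A": 2}
SHAPE = {"Circle": 0, "nearly Circle": 1}

# The four observations from the table: (color, type, shape, is_apple).
observations = [
    ("Red", "Fruit", "Circle", "Y"),
    ("Red", "Fruit", "Circle", "Y"),
    ("Black", "Logo", "nearly Circle", "N"),
    ("Blue", "N/A", "Circle", "N"),
]

def encode(obs):
    """Return (label, feature_vector) for one observation."""
    color, kind, shape, is_apple = obs
    label = 1.0 if is_apple == "Y" else 0.0
    features = [float(COLOR[color]), float(TYPE[kind]), float(SHAPE[shape])]
    return label, features

encoded = [encode(o) for o in observations]
for label, features in encoded:
    print(label, features)
```

Each pair has the same shape as the (label, features) arguments that a labeled point takes later in the deck.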
Categories of Machine Learning
• Classification: predict class from observations.
• Clustering: group observations into meaningful
groups.
• Regression: predict value from observations.
Cluster the observations with no Labels
https://en.wikipedia.org/wiki/Cluster_analysis
Cut the observations
http://stats.stackexchange.com/questions/159957/how-to-do-one-vs-one-classification-for-logistic-regression
Find a model to describe the
observations
https://commons.wikimedia.org/wiki/File:Linear_regression.svg
Machine Learning
Pipelines
http://www.nltk.org/book/ch06.html
What is MLlib
• MLlib is an Apache Spark component focusing on
machine learning:
• MLlib is Spark’s core ML library
• Developed by MLbase team in AMPLab
• 80+ contributions from various organizations
• Support Scala, Python, and Java APIs
Algorithms in MLlib
• Statistics: Description, correlation
• Clustering: k-means
• Collaborative filtering: ALS
• Classification: SVMs, naive Bayes, decision tree.
• Regression: linear regression, logistic regression
• Dimensionality reduction: SVD, PCA
Why MLlib
• Scalability
• Performance
• User-friendly documentation and APIs
• Cost of maintenance
Performance
Data Type
• Dense vector
• Sparse vector
• Labeled point
Dense & Sparse
• Raw Data:
ID  A  B  C  D  E  F
1   1  0  0  0  0  3
2   0  1  0  1  0  2
3   1  1  1  0  1  1
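The difference between the two representations can be sketched in plain Python using the three raw-data rows above. The helper `to_sparse` is a hypothetical name introduced here; the (size, indices, values) layout it returns is assumed to mirror the argument order of the `SparseVector` constructor used later in this deck.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

# Features A..F for IDs 1..3, taken from the raw-data table.
rows = {
    1: [1.0, 0.0, 0.0, 0.0, 0.0, 3.0],
    2: [0.0, 1.0, 0.0, 1.0, 0.0, 2.0],
    3: [1.0, 1.0, 1.0, 0.0, 1.0, 1.0],
}

sparse_rows = {row_id: to_sparse(dense) for row_id, dense in rows.items()}
for row_id in sorted(sparse_rows):
    print(row_id, sparse_rows[row_id])
```

Row 1 shrinks to two stored values, while row 3 is mostly non-zero, so the sparse form (indices plus values) would take more room than the dense one. Sparse storage only pays off when most entries are zero.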
Dense vs Sparse
• Training Set:
- number of examples: 12 million
- number of features: 500
- sparsity: 10%
• The sparse format not only saves storage but also gives
a roughly 4x speedup
         Dense  Sparse
Storage  47GB   7GB
Time     240s   58s
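A back-of-the-envelope estimate shows where storage figures of this size can come from. It rests on assumptions not stated on the slide: that "sparsity: 10%" means 10% of the entries are non-zero, that each value is a 64-bit float, that each sparse index is a 32-bit integer, and that per-object overhead is ignored.

```python
NUM_EXAMPLES = 12 * 10**6
NUM_FEATURES = 500
NONZERO_FRACTION = 0.10   # assumption: 10% of entries are non-zero
BYTES_PER_VALUE = 8       # assumption: 64-bit float per stored value
BYTES_PER_INDEX = 4       # assumption: 32-bit integer per sparse index

# Dense: every entry is stored.
dense_bytes = NUM_EXAMPLES * NUM_FEATURES * BYTES_PER_VALUE

# Sparse: only non-zero entries are stored, each with its index.
nonzeros_per_row = int(round(NUM_FEATURES * NONZERO_FRACTION))
sparse_bytes = NUM_EXAMPLES * nonzeros_per_row * (BYTES_PER_VALUE + BYTES_PER_INDEX)

dense_gb = dense_bytes / 1e9
sparse_gb = sparse_bytes / 1e9
print("dense  ~ %.1f GB" % dense_gb)
print("sparse ~ %.1f GB" % sparse_gb)
```

Under these assumptions the estimate comes to about 48 GB dense and 7.2 GB sparse, the same order of magnitude as the figures in the table.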
Labeled Point
• Dummy variable (1,0)
• Categorical variable (0, 1, 2, …)
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
DEMO
Descriptive Statistics
• Supported functions:
- count
- max
- min
- mean
- variance
…
• Supported data types
- Dense
- Sparse
- Labeled Point
Example
from math import sqrt
from pyspark.mllib.stat import Statistics
from pyspark.mllib.linalg import Vectors
import numpy as np
## Example data (at least a 2 x 2 matrix); each row is one observation
data = np.array([[1.0,2.0,3.0,4.0,5.0],[1.0,2.0,3.0,4.0,5.0]])
## Convert to an RDD (sc is the SparkContext provided by the PySpark shell)
distData = sc.parallelize(data)
## Compute column-wise summary statistics
summary = Statistics.colStats(distData)
print("Duration Statistics:")
print(" Mean: {}".format(round(summary.mean()[0],3)))
print(" St. deviation: {}".format(round(sqrt(summary.variance()[0]),3)))
print(" Max value: {}".format(round(summary.max()[0],3)))
print(" Min value: {}".format(round(summary.min()[0],3)))
print(" Total value count: {}".format(summary.count()))
print(" Number of non-zero values: {}".format(summary.numNonzeros()[0]))
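The statistics above are computed per column, which is easy to miss. The plain-Python sketch below reproduces the same quantities for the same 2 x 5 example data without Spark. It assumes that the variance reported by `colStats` is the sample variance (divisor n - 1), which is what `statistics.variance` computes; the `col_*` names are illustrative.

```python
from statistics import mean, variance

# Same example data: 2 observations (rows) with 5 features (columns).
data = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.0, 2.0, 3.0, 4.0, 5.0],
]

# Transpose so that each tuple holds all values of one column.
columns = list(zip(*data))

col_mean = [mean(c) for c in columns]
col_var = [variance(c) for c in columns]  # sample variance, divisor n - 1
col_max = [max(c) for c in columns]
col_min = [min(c) for c in columns]
col_nonzero = [sum(1 for v in c if v != 0) for c in columns]
count = len(data)  # number of observations, not number of cells

print(col_mean, col_var, col_max, col_min, col_nonzero, count)
```

Because both rows are identical, every column has zero variance, and index `[0]` in the Spark example refers to the first column only.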
DEMO
Quiz 1
• Which statement is NOT a purpose of descriptive statistics?
1. Find the extreme values of the data.
2. Gain a preliminary understanding of the data's properties.
3. Describe the full properties of the data.
Correlation
https://commons.wikimedia.org/wiki/File:Linear_Correlation_Examples
Example
• Data Set
from pyspark.mllib.stat import Statistics
## Choose the correlation method
corrType = 'pearson'
## Create the test data sets
group1 = sc.parallelize([1,3,4,3,7,5,6,8])
group2 = sc.parallelize([1,2,3,4,5,6,7,8])
corr = Statistics.corr(group1, group2, corrType)
print(corr)
# approximately 0.89
A B C D E F G H
Case1 1 3 4 3 7 5 6 8
Case2 1 2 3 4 5 6 7 8
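For reference, the Pearson coefficient for the two cases in the table can be computed by hand in plain Python. This is a sketch of the textbook formula (covariance divided by the product of the standard deviations), not of Spark's internal implementation; `pearson` is a hypothetical helper name.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

case1 = [1, 3, 4, 3, 7, 5, 6, 8]
case2 = [1, 2, 3, 4, 5, 6, 7, 8]

r = pearson(case1, case2)
print(round(r, 2))
```

For this data the sums work out to 35.5 / sqrt(37.875 * 42), so the coefficient is about 0.89.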
Shape is important
http://2012books.lardbucket.org/books/beginning-psychology/
DEMO
Quiz 2
• In which situation below is Pearson correlation
not suitable?
1. When two variables have no correlation
2. When two variables have correlation of indices
3. When count of data rows > 30
4. When the variables are normally distributed.
Clustering Model
K-Means
K-means
• K-means clustering aims to partition n observations
into k clusters in which each observation belongs to
the cluster with the nearest mean, serving as a
prototype of the cluster.
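The definition above translates into a short loop that alternates between assigning points to the nearest center and moving each center to the mean of its points. The sketch below is a plain-Python illustration of that loop, not MLlib's distributed implementation. The sample points and the fixed starting centers are made up for the example, and the helper names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))

def kmeans(points, centers, max_iterations=10):
    """Alternate assignment and update steps until the centers stop moving."""
    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(c) / float(len(members)) for c in zip(*members))
            if members else center
            for members, center in zip(clusters, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

# Made-up sample data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2),
          (9.0, 9.0), (9.1, 9.1), (9.2, 9.2)]
centers = kmeans(points, centers=[(0.0, 0.0), (9.0, 9.0)])
labels = [nearest(p, centers) for p in points]
print(centers, labels)
```

Fixed starting centers keep the sketch deterministic; real implementations usually pick them randomly or with a seeding scheme, so results can differ between runs.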
Clustering with K-Means
Summary
https://en.wikipedia.org/wiki/K-means_clustering
K-Means: Python
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt
# Load and parse the data
data = sc.textFile("data/mllib/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10,
                        runs=10, initializationMode="random")
# Evaluate clustering by computing Within Set Sum of Squared Errors
# Note: error() returns the Euclidean distance to the nearest center, so the
# total below sums distances; remove sqrt() to sum squared distances instead.
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))
WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
DEMO
Quiz 3
• Which data is not suitable for K-means analysis?
Classification Model
Logistic Regression
linear regression
https://commons.wikimedia.org/wiki/File:Linear_regression.svg
When outcome is only 1/0
http://www.slideshare.net/21_venkat/logistic-regression-17406472
Hypotheses function
• hypotheses:
• sigmoid function:
• Maximum Likelihood estimation
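The formulas for these three items did not survive in the text of this slide. For reference, the standard textbook forms are given here. The notation is an assumption and may differ from the original slide: \theta is the parameter vector, x^{(i)} and y^{(i)} are the i-th training example and its 0/1 label, and m is the number of examples.

```latex
% Hypothesis and sigmoid function
h_\theta(x) = g(\theta^{T} x), \qquad g(z) = \frac{1}{1 + e^{-z}}

% Likelihood maximized by maximum likelihood estimation
L(\theta) = \prod_{i=1}^{m}
  h_\theta\bigl(x^{(i)}\bigr)^{y^{(i)}}
  \bigl(1 - h_\theta\bigl(x^{(i)}\bigr)\bigr)^{1 - y^{(i)}}
```

Since g(z) always lies between 0 and 1, h_\theta(x) can be read as the estimated probability that y = 1.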
Cost Function
• Case y = 0
• Case y = 1
• Regularized form
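The cost cases can be sketched in plain Python using the standard log-loss. The slide gives no formula for the regularized form, so the L2 penalty used here (lam / (2m) times the sum of squared parameters, applied to every parameter including any intercept) is an assumption chosen for illustration. The function names and sample data are hypothetical as well.

```python
from math import exp, log

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def cost(h, y):
    """Per-example cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -log(h) if y == 1 else -log(1.0 - h)

def regularized_cost(theta, xs, ys, lam):
    """Mean per-example cost plus an (assumed) L2 penalty on theta."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += cost(h, y)
    penalty = lam / (2.0 * m) * sum(t * t for t in theta)
    return total / m + penalty

# Made-up data: two examples with two features each.
xs = [[1.0, 0.5], [1.0, -1.5]]
ys = [1, 0]

baseline = regularized_cost([0.0, 0.0], xs, ys, lam=1.0)
print(baseline)
```

With all parameters at zero the model predicts 0.5 for every example, so the cost equals log(2) whatever the labels are. A confident wrong prediction (h near 0 when y = 1) is penalized far more heavily than a confident correct one.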
Sample Code
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from numpy import array
# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])
data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)
# Build the model
model = LogisticRegressionWithSGD.train(parsedData)
# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
# Training error: fraction of points whose prediction differs from the label
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))
Quiz 4
• Which situation is not suitable for logistic regression?
1. data with multiple label point
2. data with no label
3. data with more than 1,000 features
4. data with dummy features
DEMO
Recommendation Model
Alternating Least Squares
Definition
• Collaborative Filtering (CF) is a family of algorithms
that use other users' ratings of items (selections or
purchase information can also be used), together with
the target user's history, to recommend an item that
the target user has not yet rated.
• The fundamental assumption behind this approach is
that other users' preferences over items can be used
to recommend an item to a user who has not yet seen
or purchased it.
Definition
• CF differs from content-based methods in that the
attributes of the user or the item themselves play no
role in the recommendation; what matters is how
(rating) and by whom (user) a particular item was rated.
(Preferences are shared across a group of users who
selected, purchased, or rated items similarly.)
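One way to turn this idea into numbers is the alternating least squares method named in this section's title: give every user and every item a hidden factor, then alternately solve for the user factors with the item factors fixed, and vice versa. The sketch below is a heavily simplified plain-Python illustration that uses a single factor per user and item (rank 1) and made-up ratings. It is not MLlib's implementation, and `als_rank1` and its parameters are hypothetical names.

```python
def als_rank1(ratings, num_users, num_items, iterations=20, lam=0.01):
    """Fit rating(u, i) ~ p[u] * q[i] by alternating regularized least squares."""
    p = [1.0] * num_users  # one factor per user
    q = [1.0] * num_items  # one factor per item
    for _ in range(iterations):
        # Fix the item factors and solve for each user factor in closed form.
        for u in range(num_users):
            rated = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            num = sum(r * q[i] for i, r in rated)
            den = sum(q[i] ** 2 for i, r in rated) + lam
            p[u] = num / den
        # Fix the user factors and solve for each item factor in closed form.
        for i in range(num_items):
            raters = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            num = sum(r * p[u] for u, r in raters)
            den = sum(p[u] ** 2 for u, r in raters) + lam
            q[i] = num / den
    return p, q

# Made-up ratings keyed by (user, item); user 1 has not rated item 2.
# User 1 rates everything twice as high as user 0.
ratings = {
    (0, 0): 1.0, (0, 1): 2.0, (0, 2): 3.0,
    (1, 0): 2.0, (1, 1): 4.0,
}

p, q = als_rank1(ratings, num_users=2, num_items=3)
predicted = p[1] * q[2]  # estimate for the missing (user 1, item 2) rating
print(predicted)
```

Because user 1's known ratings are double those of user 0, the estimate for the missing rating should land near 6, slightly shrunk by the regularization term `lam`. Neither the user's nor the item's own attributes are used, only the pattern of ratings.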
Recommendation
Spark MLlib - Training Material
ALS: Python
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
# Load and parse the data
data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 20
model = ALS.train(ratings, rank, numIterations)
# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))
DEMO
Quiz 5
• Search for other recommendation algorithms and point
out how they differ from ALS.
- Item-based
- User-based
- Content-based
Reference
• Machine Learning Library (MLlib) Guide
http://spark.apache.org/docs/1.4.1/mllib-guide.html
• MLlib: Spark's Machine Learning Library
http://www.slideshare.net/jeykottalam/mllib
• Recent Developments in Spark MLlib and Beyond
http://www.slideshare.net/Hadoop_Summit
• Introduction to Machine Learning
http://www.slideshare.net/rahuldausa/introduction-to-machine-learning

More Related Content

What's hot

Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
Paolo Castagna
 
Observability in the world of microservices
Observability in the world of microservicesObservability in the world of microservices
Observability in the world of microservices
Chandresh Pancholi
 
APACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka StreamsAPACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka Streams
Ketan Gote
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
Daniel Marcous
 
Spark overview
Spark overviewSpark overview
Spark overview
Lisa Hua
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedIn
Discover Pinterest
 
Flume vs. kafka
Flume vs. kafkaFlume vs. kafka
Flume vs. kafka
Omid Vahdaty
 
Purple Team Exercise Framework Workshop #PTEF
Purple Team Exercise Framework Workshop #PTEFPurple Team Exercise Framework Workshop #PTEF
Purple Team Exercise Framework Workshop #PTEF
Jorge Orchilles
 
Why Micro Focus Chose Pulsar for Data Ingestion - Pulsar Summit NA 2021
Why Micro Focus Chose Pulsar for Data Ingestion - Pulsar Summit NA 2021Why Micro Focus Chose Pulsar for Data Ingestion - Pulsar Summit NA 2021
Why Micro Focus Chose Pulsar for Data Ingestion - Pulsar Summit NA 2021
StreamNative
 
Grafana introduction
Grafana introductionGrafana introduction
Grafana introduction
Rico Chen
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
DataWorks Summit
 
Grafana Loki: like Prometheus, but for Logs
Grafana Loki: like Prometheus, but for LogsGrafana Loki: like Prometheus, but for Logs
Grafana Loki: like Prometheus, but for Logs
Marco Pracucci
 
Evading Microsoft ATA for Active Directory Domination
Evading Microsoft ATA for Active Directory DominationEvading Microsoft ATA for Active Directory Domination
Evading Microsoft ATA for Active Directory Domination
Nikhil Mittal
 
Cyber Threat Hunting Workshop
Cyber Threat Hunting WorkshopCyber Threat Hunting Workshop
Cyber Threat Hunting Workshop
Digit Oktavianto
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
Whizlabs
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame Graphs
Brendan Gregg
 
Threat hunting - Every day is hunting season
Threat hunting - Every day is hunting seasonThreat hunting - Every day is hunting season
Threat hunting - Every day is hunting season
Ben Boyd
 
Bigtable and Dynamo
Bigtable and DynamoBigtable and Dynamo
Bigtable and Dynamo
Iraklis Psaroudakis
 
Building Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and KafkaBuilding Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and Kafka
ScyllaDB
 

What's hot (20)

Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 
Observability in the world of microservices
Observability in the world of microservicesObservability in the world of microservices
Observability in the world of microservices
 
APACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka StreamsAPACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka Streams
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedIn
 
Flume vs. kafka
Flume vs. kafkaFlume vs. kafka
Flume vs. kafka
 
Purple Team Exercise Framework Workshop #PTEF
Purple Team Exercise Framework Workshop #PTEFPurple Team Exercise Framework Workshop #PTEF
Purple Team Exercise Framework Workshop #PTEF
 
Why Micro Focus Chose Pulsar for Data Ingestion - Pulsar Summit NA 2021
Why Micro Focus Chose Pulsar for Data Ingestion - Pulsar Summit NA 2021Why Micro Focus Chose Pulsar for Data Ingestion - Pulsar Summit NA 2021
Why Micro Focus Chose Pulsar for Data Ingestion - Pulsar Summit NA 2021
 
Grafana introduction
Grafana introductionGrafana introduction
Grafana introduction
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
 
Grafana Loki: like Prometheus, but for Logs
Grafana Loki: like Prometheus, but for LogsGrafana Loki: like Prometheus, but for Logs
Grafana Loki: like Prometheus, but for Logs
 
Evading Microsoft ATA for Active Directory Domination
Evading Microsoft ATA for Active Directory DominationEvading Microsoft ATA for Active Directory Domination
Evading Microsoft ATA for Active Directory Domination
 
Cyber Threat Hunting Workshop
Cyber Threat Hunting WorkshopCyber Threat Hunting Workshop
Cyber Threat Hunting Workshop
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame Graphs
 
Threat hunting - Every day is hunting season
Threat hunting - Every day is hunting seasonThreat hunting - Every day is hunting season
Threat hunting - Every day is hunting season
 
Bigtable and Dynamo
Bigtable and DynamoBigtable and Dynamo
Bigtable and Dynamo
 
Building Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and KafkaBuilding Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and Kafka
 

Viewers also liked

Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for Training
Bryan Yang
 
Chapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares RegressionChapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares Regression
nszakir
 
Sparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya Hristakeva
Spark Summit
 
02 newton-raphson
02 newton-raphson02 newton-raphson
02 newton-raphson
stephanus_ananda
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
Choosing Regression Models
Choosing Regression ModelsChoosing Regression Models
Choosing Regression Models
Stephen Senn
 
Build your ETL job using Jenkins - step by step
Build your ETL job using Jenkins - step by stepBuild your ETL job using Jenkins - step by step
Build your ETL job using Jenkins - step by step
Bryan Yang
 
Skiena algorithm 2007 lecture10 graph data strctures
Skiena algorithm 2007 lecture10 graph data strcturesSkiena algorithm 2007 lecture10 graph data strctures
Skiena algorithm 2007 lecture10 graph data strctures
zukun
 
Building your bi system-HadoopCon Taiwan 2015
Building your bi system-HadoopCon Taiwan 2015Building your bi system-HadoopCon Taiwan 2015
Building your bi system-HadoopCon Taiwan 2015
Bryan Yang
 
Data Scientist's Daily Life
Data Scientist's Daily LifeData Scientist's Daily Life
Data Scientist's Daily Life
Bryan Yang
 
Anomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark StreamingAnomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark Streaming
Keira Zhou
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
Bryan Yang
 
DSP 資料科學計畫簡介
DSP 資料科學計畫簡介DSP 資料科學計畫簡介
DSP 資料科學計畫簡介
codefortomorrow
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Databricks
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
Cloudera, Inc.
 
Machine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibMachine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlib
IMC Institute
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
Paco Nathan
 

Viewers also liked (17)

Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for Training
 
Chapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares RegressionChapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares Regression
 
Sparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya Hristakeva
 
02 newton-raphson
02 newton-raphson02 newton-raphson
02 newton-raphson
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Choosing Regression Models
Choosing Regression ModelsChoosing Regression Models
Choosing Regression Models
 
Build your ETL job using Jenkins - step by step
Build your ETL job using Jenkins - step by stepBuild your ETL job using Jenkins - step by step
Build your ETL job using Jenkins - step by step
 
Skiena algorithm 2007 lecture10 graph data strctures
Skiena algorithm 2007 lecture10 graph data strcturesSkiena algorithm 2007 lecture10 graph data strctures
Skiena algorithm 2007 lecture10 graph data strctures
 
Building your bi system-HadoopCon Taiwan 2015
Building your bi system-HadoopCon Taiwan 2015Building your bi system-HadoopCon Taiwan 2015
Building your bi system-HadoopCon Taiwan 2015
 
Data Scientist's Daily Life
Data Scientist's Daily LifeData Scientist's Daily Life
Data Scientist's Daily Life
 
Anomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark StreamingAnomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark Streaming
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
 
DSP 資料科學計畫簡介
DSP 資料科學計畫簡介DSP 資料科學計畫簡介
DSP 資料科學計畫簡介
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
 
Machine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibMachine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlib
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 

Similar to Spark MLlib - Training Material

Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
Benjamin Bengfort
 
Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learning
my6305874
 
Nose Dive into Apache Spark ML
Nose Dive into Apache Spark MLNose Dive into Apache Spark ML
Nose Dive into Apache Spark ML
Ahmet Bulut
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
Modern Data Stack France
 
More on Pandas.pptx
More on Pandas.pptxMore on Pandas.pptx
More on Pandas.pptx
VirajPathania1
 
Python for data analysis
Python for data analysisPython for data analysis
Python for data analysis
Savitribai Phule Pune University
 
Recommendation Systems
Recommendation SystemsRecommendation Systems
Recommendation Systems
Robin Reni
 
Python-for-Data-Analysis.pdf
Python-for-Data-Analysis.pdfPython-for-Data-Analysis.pdf
Python-for-Data-Analysis.pdf
ssuser598883
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptx
Sandeep Singh
 
Python for Data Analysis.pdf
Python for Data Analysis.pdfPython for Data Analysis.pdf
Python for Data Analysis.pdf
JulioRecaldeLara1
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptx
tangadhurai
 
Building High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsBuilding High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning Applications
Yalçın Yenigün
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
Abhishek M Shivalingaiah
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and Tracking
Databricks
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Spark Summit
 
Guiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineGuiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning Pipeline
Michael Gerke
 
background.pptx
background.pptxbackground.pptx
background.pptx
KabileshCm
 
Machine Learning for Everyone
Machine Learning for EveryoneMachine Learning for Everyone
Machine Learning for Everyone
Aly Abdelkareem
 
Ember
EmberEmber
Ember
mrphilroth
 
The ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptxThe ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptx
Ruby Shrestha
 

Similar to Spark MLlib - Training Material (20)

Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learning
 
Nose Dive into Apache Spark ML
Nose Dive into Apache Spark MLNose Dive into Apache Spark ML
Nose Dive into Apache Spark ML
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
 
More on Pandas.pptx
More on Pandas.pptxMore on Pandas.pptx
More on Pandas.pptx
 
Python for data analysis
Python for data analysisPython for data analysis
Python for data analysis
 
Recommendation Systems
Recommendation SystemsRecommendation Systems
Recommendation Systems
 
Python-for-Data-Analysis.pdf
Python-for-Data-Analysis.pdfPython-for-Data-Analysis.pdf
Python-for-Data-Analysis.pdf
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptx
 
Python for Data Analysis.pdf
Python for Data Analysis.pdfPython for Data Analysis.pdf
Python for Data Analysis.pdf
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptx
 
Building High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsBuilding High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning Applications
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and Tracking
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
 
Guiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineGuiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning Pipeline
 
background.pptx
background.pptxbackground.pptx
background.pptx
 
Machine Learning for Everyone
Machine Learning for EveryoneMachine Learning for Everyone
Machine Learning for Everyone
 
Ember
EmberEmber
Ember
 
The ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptxThe ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptx
 

More from Bryan Yang

敏捷開發心法
敏捷開發心法敏捷開發心法
敏捷開發心法
Bryan Yang
 
Data pipeline essential
Data pipeline essentialData pipeline essential
Data pipeline essential
Bryan Yang
 
Docker 101
Docker 101Docker 101
Docker 101
Bryan Yang
 
資料分析的快樂就是如此樸實無華且枯燥
資料分析的快樂就是如此樸實無華且枯燥資料分析的快樂就是如此樸實無華且枯燥
資料分析的快樂就是如此樸實無華且枯燥
Bryan Yang
 
Data pipeline 101
Data pipeline 101Data pipeline 101
Data pipeline 101
Bryan Yang
 
Building a data driven business
Building a data driven businessBuilding a data driven business
Building a data driven business
Bryan Yang
 
產業數據力-以傳統零售業為例
產業數據力-以傳統零售業為例產業數據力-以傳統零售業為例
產業數據力-以傳統零售業為例
Bryan Yang
 
Serverless ETL
Serverless ETLServerless ETL
Serverless ETL
Bryan Yang
 
敏捷開發心法
敏捷開發心法敏捷開發心法
敏捷開發心法
Bryan Yang
 
Introduction to docker
Introduction to dockerIntroduction to docker
Introduction to docker
Bryan Yang
 

More from Bryan Yang (10)

敏捷開發心法
敏捷開發心法敏捷開發心法
敏捷開發心法
 
Data pipeline essential
Data pipeline essentialData pipeline essential
Data pipeline essential
 
Docker 101
Docker 101Docker 101
Docker 101
 
資料分析的快樂就是如此樸實無華且枯燥
資料分析的快樂就是如此樸實無華且枯燥資料分析的快樂就是如此樸實無華且枯燥
資料分析的快樂就是如此樸實無華且枯燥
 
Data pipeline 101
Data pipeline 101Data pipeline 101
Data pipeline 101
 
Building a data driven business
Building a data driven businessBuilding a data driven business
Building a data driven business
 
產業數據力-以傳統零售業為例
產業數據力-以傳統零售業為例產業數據力-以傳統零售業為例
產業數據力-以傳統零售業為例
 
Serverless ETL
Serverless ETLServerless ETL
Serverless ETL
 
敏捷開發心法
敏捷開發心法敏捷開發心法
敏捷開發心法
 
Introduction to docker
Introduction to dockerIntroduction to docker
Introduction to docker
 

Spark MLlib - Training Material

  • 2. ABOUT ME • Experience: Vpon Data Engineer; TWM, Keywear, Nielsen • Bryan’s notes for data analysis http://bryannotes.blogspot.tw • Spark.TW • LinkedIn https://tw.linkedin.com/pub/bryan-yang/7b/763/a79
  • 3. Agenda • Introduction to Machine Learning • Why MLlib • Basic Statistics • K-means • Logistic Regression • ALS • Demo
  • 5. Definition • Machine learning is the study of computer algorithms that improve automatically through experience.
  • 6. Terminology • Observation - The basic unit of data. • Features - Attributes that can be used to describe each observation in a quantitative manner. • Feature vector - An n-dimensional vector of numerical features that represents some object. • Training/ Testing/ Evaluation set - Sets of data used to discover potential predictive relationships.
  • 7. Learning(Training) • Features: 1. Color: Red/ Green 2. Type: Fruit 3. Shape: nearly Circle etc…
  • 8. Learning(Training)
      ID  Color  Type   Shape          is apple (Label)
      1   Red    Fruit  Circle         Y
      2   Red    Fruit  Circle         Y
      3   Black  Logo   nearly Circle  N
      4   Blue   N/A    Circle         N
  • 9. Categories of Machine Learning • Classification: predict class from observations. • Clustering: group observations into meaningful groups. • Regression: predict value from observations.
  • 10. Cluster the observations with no Labels https://en.wikipedia.org/wiki/Cluster_analysis
  • 12. Find a model to describe the observations https://commons.wikimedia.org/wiki/File:Linear_regression.svg
  • 15. What is MLlib • MLlib is an Apache Spark component focusing on machine learning: • MLlib is Spark’s core ML library • Developed by the MLbase team in AMPLab • 80+ contributions from various organizations • Supports Scala, Python, and Java APIs
  • 16. Algorithms in MLlib • Statistics: Description, correlation • Clustering: k-means • Collaborative filtering: ALS • Classification: SVMs, naive Bayes, decision tree • Regression: linear regression, logistic regression • Dimensionality reduction: SVD, PCA
  • 17. Why MLlib • Scalability • Performance • User-friendly documentation and APIs • Cost of maintenance
  • 19. Data Type • Dense vector • Sparse vector • Labeled point
  • 20. Dense & Sparse • Raw Data:
      ID  A  B  C  D  E  F
      1   1  0  0  0  0  3
      2   0  1  0  1  0  2
      3   1  1  1  0  1  1
  • 21. Dense vs Sparse • Training Set: - number of examples: 12 million - number of features: 500 - sparsity: 10% • The sparse format not only saves storage but also gives about a 4x speed-up:
               Dense   Sparse
      Storage  47GB    7GB
      Time     240s    58s
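To make the dense/sparse distinction concrete, here is a minimal pure-Python sketch (no Spark required). The helper names `to_sparse` and `to_dense` are hypothetical, introduced only for illustration; the (size, indices, values) layout mirrors the argument order of `SparseVector(size, indices, values)` used on the Labeled Point slide.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list of values from the sparse triple."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Row with ID 1 from the raw-data table (columns A..F).
row1 = [1.0, 0.0, 0.0, 0.0, 0.0, 3.0]
sparse_row1 = to_sparse(row1)
print(sparse_row1)  # (6, [0, 5], [1.0, 3.0])
```

Only two of the six values are stored for this row, which is where the storage saving on sparse data comes from.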
  • 22. Labeled Point • Dummy variable (1, 0) • Categorical variable (0, 1, 2, …)
      from pyspark.mllib.linalg import SparseVector
      from pyspark.mllib.regression import LabeledPoint

      # Create a labeled point with a positive label and a dense feature vector.
      pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])

      # Create a labeled point with a negative label and a sparse feature vector.
      neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
  • 23. DEMO
  • 24. Descriptive Statistics • Supported functions: - count - max - min - mean - variance … • Supported data types: - Dense - Sparse - Labeled Point
  • 25. Example
      from math import sqrt  # needed for the standard deviation below

      from pyspark.mllib.stat import Statistics
      from pyspark.mllib.linalg import Vectors
      import numpy as np

      ## example data (2 x 2 matrix at least)
      data = np.array([[1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0, 5.0]])

      ## to RDD (sc is the existing SparkContext)
      distData = sc.parallelize(data)

      ## compute column-wise summary statistics
      summary = Statistics.colStats(distData)
      print("Duration Statistics:")
      print(" Mean: {}".format(round(summary.mean()[0], 3)))
      print(" St. deviation: {}".format(round(sqrt(summary.variance()[0]), 3)))
      print(" Max value: {}".format(round(summary.max()[0], 3)))
      print(" Min value: {}".format(round(summary.min()[0], 3)))
      print(" Total value count: {}".format(summary.count()))
      print(" Number of non-zero values: {}".format(summary.numNonzeros()[0]))
  • 26. DEMO
  • 27. Quiz 1 • Which of the following is not a purpose of descriptive statistics? 1. Finding the extreme values of the data. 2. Gaining a preliminary understanding of the data's properties. 3. Describing the full properties of the data.
  • 29. Example • Data Set
             A  B  C  D  E  F  G  H
      Case1  1  3  4  3  7  5  6  8
      Case2  1  2  3  4  5  6  7  8

      from pyspark.mllib.stat import Statistics

      ## choose the correlation method
      corrType = 'pearson'

      ## create the test data sets (sc is the existing SparkContext)
      group1 = sc.parallelize([1, 3, 4, 3, 7, 5, 6, 8])
      group2 = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8])

      corr = Statistics.corr(group1, group2, corrType)
      print(corr)  # approximately 0.89
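As a cross-check that needs no Spark, the Pearson coefficient for the two cases in the data set can be computed directly from its definition. This is a minimal sketch; the function name `pearson` is a hypothetical helper, not part of MLlib.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


case1 = [1, 3, 4, 3, 7, 5, 6, 8]
case2 = [1, 2, 3, 4, 5, 6, 7, 8]
r = pearson(case1, case2)
print(round(r, 2))  # 0.89
```

The shared 1/n factors cancel between numerator and denominator, so the sums can be used without dividing by the sample size.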
  • 31. DEMO
  • 32. Quiz 2 • In which of the situations below is Pearson correlation not suitable? 1. When two variables have no correlation 2. When two variables have correlation of indices 3. When the count of data rows > 30 4. When the variables are normally distributed.
  • 34. K-means • K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
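The definition above can be sketched as two alternating steps, usually called Lloyd's algorithm: assign each observation to its nearest center, then move each center to the mean of its cluster. The code below is a minimal pure-Python illustration, not MLlib's implementation; the function names, the sample points, and the starting centers are all assumptions made for the example.

```python
def closest(point, centers):
    """Index of the center nearest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center))
             for center in centers]
    return dists.index(min(dists))


def kmeans(points, centers, max_iterations=10):
    """Alternate assignment and update steps until the centers stop moving."""
    for _ in range(max_iterations):
        # Assignment step: group every point under its nearest center.
        groups = [[] for _ in centers]
        for point in points:
            groups[closest(point, centers)].append(point)
        # Update step: each center becomes the mean of its group.
        # An empty group keeps its previous center.
        new_centers = []
        for group, old in zip(groups, centers):
            if group:
                size = float(len(group))
                new_centers.append(tuple(sum(dim) / size for dim in zip(*group)))
            else:
                new_centers.append(old)
        if new_centers == centers:
            break
        centers = new_centers
    return centers


# Hypothetical 2-D observations forming two obvious groups.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2),
          (9.0, 9.0), (9.1, 9.1), (9.2, 9.2)]
centers = kmeans(points, centers=[(0.0, 0.0), (9.0, 9.0)])
print(centers)
```

With these inputs the centers settle near (0.1, 0.1) and (9.1, 9.1). Different starting centers can lead to different results, which is why library implementations expose an initialization mode.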
  • 42. K-Means: Python
      from pyspark.mllib.clustering import KMeans, KMeansModel
      from numpy import array
      from math import sqrt

      # Load and parse the data
      data = sc.textFile("data/mllib/kmeans_data.txt")
      parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

      # Build the model (cluster the data)
      clusters = KMeans.train(parsedData, 2, maxIterations=10,
                              runs=10, initializationMode="random")

      # Evaluate clustering by computing Within Set Sum of Squared Errors
      def error(point):
          center = clusters.centers[clusters.predict(point)]
          return sqrt(sum([x**2 for x in (point - center)]))

      WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
      print("Within Set Sum of Squared Error = " + str(WSSSE))
  • 43. DEMO
  • 44. Quiz 3 • Which data is not suitable for K-means analysis?
  • 47. When outcome is only 1/0 http://www.slideshare.net/21_venkat/logistic-regression-17406472
  • 49. Cost Function • Maximum likelihood estimation • Two cases: if y = 1, if y = 0 • Regularized [formula images not preserved in the transcript]
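The formulas on this slide did not survive in the transcript. For orientation, the block below gives the standard textbook form of the logistic regression cost; it is an assumption that the slide showed this form, and the L2 penalty and the symbols m (number of examples), n (number of features), theta (weights), and lambda (regularization strength) are likewise assumed.

```latex
% Hypothesis: the sigmoid of a linear combination of the features
h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}

% Per-example cost, split by label
\mathrm{Cost}\bigl(h_\theta(x), y\bigr) =
\begin{cases}
  -\log\bigl(h_\theta(x)\bigr)     & \text{if } y = 1 \\
  -\log\bigl(1 - h_\theta(x)\bigr) & \text{if } y = 0
\end{cases}

% Regularized cost over m examples (L2 penalty, intercept excluded)
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
  \Bigl[ y^{(i)} \log h_\theta\bigl(x^{(i)}\bigr)
       + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta\bigl(x^{(i)}\bigr)\bigr) \Bigr]
  + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^{2}
```

Minimizing the unregularized part of J is equivalent to maximizing the likelihood of the labels under the model.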
  • 50. Sample Code
      from pyspark.mllib.classification import LogisticRegressionWithSGD
      from pyspark.mllib.regression import LabeledPoint
      from numpy import array

      # Load and parse the data: the first value is the label, the rest are features
      def parsePoint(line):
          values = [float(x) for x in line.split(' ')]
          return LabeledPoint(values[0], values[1:])

      data = sc.textFile("data/mllib/sample_svm_data.txt")
      parsedData = data.map(parsePoint)

      # Build the model
      model = LogisticRegressionWithSGD.train(parsedData)

      # Evaluate the model on the training data
      labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
      # Index into the pair instead of unpacking it in the lambda signature,
      # which only works in Python 2
      trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
      print("Training Error = " + str(trainErr))
  • 51. Quiz 4 • Which situation is not suitable for logistic regression? 1. data with multiple labels 2. data with no label 3. data with more than 1,000 features 4. data with dummy features
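What the prediction and training-error steps compute can be shown without Spark. The sketch below assumes a 0.5 decision threshold and uses made-up weights, an intercept, and four hand-written observations; none of these values come from the sample data file, and `predict` here is a hypothetical stand-in for the model's method.

```python
from math import exp


def sigmoid(z):
    """Map any real score to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))


def predict(weights, intercept, features, threshold=0.5):
    """Return 1.0 if the predicted probability exceeds the threshold, else 0.0."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 if sigmoid(z) > threshold else 0.0


# Hypothetical model parameters and (label, features) pairs.
weights = [2.0, -1.0]
intercept = -0.5
data = [
    (1.0, [1.0, 0.0]),
    (0.0, [0.0, 1.0]),
    (1.0, [0.2, 0.5]),
    (0.0, [0.1, 0.9]),
]

labels_and_preds = [(label, predict(weights, intercept, feats))
                    for label, feats in data]
# Training error: the fraction of observations whose prediction differs
# from the label.
train_err = sum(1 for v, p in labels_and_preds if v != p) / float(len(data))
print("Training Error = " + str(train_err))  # Training Error = 0.25
```

Three of the four points land on the correct side of the threshold, so the error is 1/4. Because it is measured on the same data the model was fitted to, this figure tends to be optimistic compared with the error on a held-out set.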
  • 52. DEMO
  • 54. Definition • Collaborative Filtering (CF) is a family of algorithms that use other users and items, along with their ratings (selection or purchase information can also be used) and the target user's history, to recommend an item that the target user has not rated. • The fundamental assumption behind this approach is that other users' preferences over the items can be used to recommend an item to a user who has not seen or purchased it before.
  • 55. Definition • CF differs from content-based methods in that the user or the item itself does not play a role in the recommendation; what matters is how (rating) and by which users (user) a particular item was rated. (Users' preferences are shared across a group of users who selected, purchased, or rated items similarly.)
  • 61. ALS: Python
      from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

      # Load and parse the data: each line is "user,product,rating"
      data = sc.textFile("data/mllib/als/test.data")
      ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

      # Build the recommendation model using Alternating Least Squares
      rank = 10
      numIterations = 20
      model = ALS.train(ratings, rank, numIterations)

      # Evaluate the model on training data
      testdata = ratings.map(lambda p: (p[0], p[1]))
      predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
      ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
      MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
      print("Mean Squared Error = " + str(MSE))
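The evaluation step in the ALS example joins actual and predicted ratings on the (user, product) key and averages the squared differences. The same bookkeeping can be sketched in plain Python; the ratings and predictions below are made-up values for illustration, not output from a trained model.

```python
# (user, product, rating) triples; all values are hypothetical.
ratings = [(1, 1, 5.0), (1, 2, 1.0), (2, 1, 5.0), (2, 2, 1.0)]
predictions = [(1, 1, 4.5), (1, 2, 1.5), (2, 1, 5.0), (2, 2, 2.0)]

# Key both sides by (user, product), like the map steps before the join.
actual = {(user, product): rating for user, product, rating in ratings}
predicted = {(user, product): rating for user, product, rating in predictions}

# The join keeps only the keys present on both sides.
shared = [key for key in actual if key in predicted]

# Mean of the squared differences between actual and predicted ratings.
mse = sum((actual[k] - predicted[k]) ** 2 for k in shared) / float(len(shared))
print("Mean Squared Error = " + str(mse))  # Mean Squared Error = 0.375
```

Here the squared errors are 0.25, 0.25, 0.0, and 1.0, so their mean is 0.375.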
  • 62. DEMO
  • 63. Quiz 5 • Search for other recommendation algorithms and point out how they differ from ALS. - Item-based - User-based - Content-based
  • 64. Reference • Machine Learning Library (MLlib) Guide http://spark.apache.org/docs/1.4.1/mllib-guide.html • MLlib: Spark's Machine Learning Library http://www.slideshare.net/jeykottalam/mllib • Recent Developments in Spark MLlib and Beyond http://www.slideshare.net/Hadoop_Summit • Introduction to Machine Learning http://www.slideshare.net/rahuldausa/introduction-to-machine-learning