PAGE | GRACE HOPPER CELEBRATION INDIA 2016 | #GHCI16
PRESENTED BY THE ANITA BORG INSTITUTE AND THE ASSOCIATION FOR COMPUTING MACHINERY INDIA
Introduction to Predictive Analytics:
Hands-On Workshop Using R & Python
Presenters:
Python
Lavanya Sita Tekumalla
Sharmistha Jat
R
Maheshwari Dhandapani
Subramanian Lakshminarayanan
Sowmya Venugopal
Bindu
Agenda
•Basics of Predictive Modeling Techniques (30m)
•Hands-on Workshop: Regression
  (1) Build Model: R (30m)
  (2) Build Model: Python (30m)
What is Predictive Analytics?
Learn from available data and make meaningful predictions
Why Predictive Analytics?
Too much data – too many scenarios...
Hard for humans to explicitly describe predictive rules for all scenarios
Exercise: let's predict something…
Predict how long it takes to reach home
Common Analytics Tasks...
Supervised Learning
Regression : Predict continuous target
Can I predict time taken to get home from past history?
Can I predict Sensex Value from past market history?
Common Analytics Tasks...
Supervised Learning
Classification : Predict the class/type of object
Can I tell images of cats from images of dogs by studying examples?
Can I identify handwritten digits by studying examples?
Common Analytics Tasks...
Unsupervised Learning
Clustering : Identify groups inherent in data
Given a set of news articles, what are the underlying topics or themes?
Predict Movie Success ??
Predict Movie Success: Features
•Features:
–Actors
–Director
–Gross budget
–Social media feedback
–Genre and keywords
–Release date
Example: Predict Movie Sales?
Known Data:
Available advertising dollars and corresponding sales for lots of prior movies
Prediction Task:
For a new movie, given the advertising budget, can you forecast sales?
Regression:
Sales = f(Advertising budget)
How to learn f?
Example: Movie Hit / Flop from budget and trailer Facebook likes?
Known Data:
Available budgets and Facebook statistics of various hit and flop movies...
Prediction Task:
For a new movie, I know the budget and the Facebook likes on the trailer. What is the probability of a hit?
Classification:
Can I learn the separating line between hit and flop movies?
[Figure: hit and flop movies plotted by Budget and Facebook likes]
The Predictive Analytics Framework
Data/Examples → Feature Extraction → Learning Algorithm → Model
New Data Instance → Model → Prediction
Evaluation: How well is my algorithm working ?
Model Selection: What learning Algorithm to use ?
Important Aspects of Analytics Framework:
•Feature Engineering: Finding the discerning characteristics
•Data Collection: Collecting the right data / combining multiple sources
•Cleanup: Huge effort - noise / missing data / format conversion...
"If you torture the data long enough, it will confess to anything." -- Ronald Coase
"The goal is to turn data into information and information into insight." -- Carly Fiorina
Regression Analysis
What ?
●“Regression analysis is a way of finding and representing the relationship between two or more variables.”
●A simple yet effective tool for prediction and estimation
Why ?
●To predict an event/outcome using the attributes or features influencing it.
Examples
• Why don't UPS truck drivers take left turns?
• Predict movie rating
Regression Analysis
How ?
The key is to arrive at an equation that captures the relationship between the outcome and the features that influence it.
It answers the questions:
• Which variables matter the most or the least?
– Independent variables / predictors / features
– Dependent variable / outcome
• How do those variables interact with each other?
Y = β₀ + β₁x₁ + β₂x₂ + ... + ε
Example: Y = movie rating, x₁ = budget, x₂ = duration
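As a minimal sketch of how the coefficients in the equation above can be estimated, the Python snippet below fits Y = β₀ + β₁x₁ + β₂x₂ + ε by least squares with NumPy. The data is synthetic and the "true" coefficients (2.0, 0.03, 0.01) are assumptions made up for illustration; they are not taken from the workshop dataset.

```python
import numpy as np

# Synthetic data (made up for illustration): rating depends on budget and duration.
rng = np.random.default_rng(0)
budget = rng.uniform(1, 100, size=200)       # x1, arbitrary units
duration = rng.uniform(80, 180, size=200)    # x2, minutes
noise = rng.normal(0, 0.1, size=200)         # epsilon
rating = 2.0 + 0.03 * budget + 0.01 * duration + noise

# Design matrix with a leading column of ones for the intercept beta0.
X = np.column_stack([np.ones_like(budget), budget, duration])

# Least-squares estimate of (beta0, beta1, beta2).
beta, *_ = np.linalg.lstsq(X, rating, rcond=None)
print("estimated coefficients:", beta)
```

The estimated values should land close to the coefficients used to generate the data, which is the sense in which regression "learns" the relationship.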
Data Exploration
Identify the nature of the data and the patterns in the underlying set
Descriptive analysis: describes or summarizes the raw data, making it more human-interpretable. It condenses data into nuggets of information (mean, median)
- Missing data: when to impute, when to omit (R packages: mice, VIM, Amelia)
- Nature of the data distribution (spread around the mean, skewness, outliers)
Data variable types:
- Continuous (quantitative)
- Categorical (qualitative)
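The descriptive summaries listed above can be sketched in a few lines of Python. The durations below are hypothetical values chosen for illustration, and the skewness formula used is the simple average cubed z-score, one of several common definitions.

```python
import numpy as np

# Hypothetical movie durations in minutes; np.nan marks a missing value.
duration = np.array([95.0, 102.0, 110.0, np.nan, 98.0, 240.0, 105.0])

missing = int(np.isnan(duration).sum())
observed = duration[~np.isnan(duration)]

mean = observed.mean()
median = np.median(observed)
std = observed.std()
# Skewness as the average cubed z-score; a value > 0 means a long right tail.
skewness = np.mean(((observed - mean) / std) ** 3)

print(f"missing={missing} mean={mean:.1f} median={median:.1f} skew={skewness:.2f}")
```

A mean well above the median, together with positive skewness, hints at an outlier on the high side (here the 240-minute entry).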
Visualize Data Distribution
Visualizing relationships between variables
- How are two features/variables related to one another? (correlation coefficient)
• -1.00 → as one variable increases, the other decreases (perfect negative correlation)
• +1.00 → as one variable increases, the other increases too (perfect positive correlation)
• 0 → no linear relationship
- Is there a redundancy?
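A small Python sketch of the correlation values described above, using `numpy.corrcoef` on synthetic variables. The variable names and the way they are generated are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
budget = rng.uniform(1, 100, size=500)
gross = 3.0 * budget + rng.normal(0, 20, size=500)   # tends to rise with budget
unrelated = rng.normal(0, 1, size=500)               # generated independently

# Rows are variables; entry [i, j] is the correlation between variables i and j.
corr = np.corrcoef(np.vstack([budget, gross, -gross, unrelated]))
print(np.round(corr, 2))
```

Here `gross` and `-gross` have a correlation of -1, so one of them carries no extra information. A pair of features with a correlation near +1 or -1 is a candidate for the redundancy check mentioned above.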
Data Cleansing
What is cleansing?
“Conversion of raw data → technically correct data → consistent data”
Why is cleansing important?
Incorrect or inconsistent data can lead to false conclusions.
• Removal of outliers, which can skew your results
• Removal (or imputation) of missing data
• Removal of duplicates
• Transformation of data
List of R packages for data cleansing:
MICE, Amelia, missForest, Hmisc, mi
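The cleansing steps above can also be sketched in Python with NumPy. The rows are hypothetical, and the 1.5 × IQR outlier rule is one common convention, assumed here for illustration rather than taken from the workshop.

```python
import numpy as np

# Hypothetical rows of (budget, duration); np.nan marks a missing value.
rows = np.array([
    [10.0, 95.0],
    [10.0, 95.0],      # exact duplicate of the first row
    [25.0, np.nan],    # missing duration
    [40.0, 110.0],
    [55.0, 102.0],
    [70.0, 98.0],
    [30.0, 900.0],     # implausible duration, likely a data-entry error
])

# 1) Drop rows with any missing value.
complete = rows[~np.isnan(rows).any(axis=1)]

# 2) Drop exact duplicate rows (np.unique also sorts the rows).
deduped = np.unique(complete, axis=0)

# 3) Drop duration outliers using the 1.5 * IQR rule.
q1, q3 = np.percentile(deduped[:, 1], [25, 75])
iqr = q3 - q1
keep = (deduped[:, 1] >= q1 - 1.5 * iqr) & (deduped[:, 1] <= q3 + 1.5 * iqr)
clean = deduped[keep]
print(clean)
```

Dropping rows is the bluntest option. The imputation packages listed above fill in missing values instead of discarding the row.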
Data Cleansing
Plotting missing data using the mice package in R
Feature selection
To identify the important variables for building predictive models that are free from “correlated variables”, “bias” and “unwanted noise”.
e.g. Boruta package in R → identifies important variables using Random Forest
Building the Model
R - Workshop
R Setup
• Copy the install binaries and packages to your laptop
• Install R & RStudio
• Install the packages (ggplot2, VIM, mice, Hmisc, etc.)
• Copy the model code, RDS file and the dataset
• Set the working directory using
• setwd(<dir where you have the script, dataset, RDS file>)
Explore Data using R
Validate the model
• Run the model against the “test” data set that was set aside before training
• Compare the predicted values with the actual observed values
• (Cross-)validation is done to assess the fitness of the model
• The model should neither under-fit nor over-fit future unseen data
• Validate regression using
– R² (higher is better)
– Residuals (ideally randomly scattered with constant variance, i.e. no heteroscedasticity)
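The two validation measures above can be computed by hand. The Python sketch below uses the standard definition R² = 1 − SS_res / SS_tot. The actual and predicted ratings are made-up numbers for five hypothetical held-out movies.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up actual vs predicted ratings for five held-out movies.
y_test = np.array([6.0, 7.0, 8.0, 5.0, 9.0])
y_pred = np.array([6.5, 6.5, 7.5, 5.5, 8.5])

residuals = y_test - y_pred
print("R^2:", r_squared(y_test, y_pred))
print("residuals:", residuals)
```

Plotting the residuals against the predicted values is the usual way to check for heteroscedasticity: a funnel shape suggests that the error variance changes with the prediction.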
Python - Workshop
Basic Pipeline
1) Data loading and Inspection
2) Cleaning and Preprocessing
3) Train , Test partitioning
4) Feature Selection
5) Regression
6) Model Selection, parameter tuning, regularisation
Data Loading
# load the IMDB data into a Python list of rows (the first row is the header)
import csv
with open('movie_metadata.csv', newline='') as csv_file:
    imdb_data = list(csv.reader(csv_file))
Columns in Data
'color'
'director_name'
'num_critic_for_reviews'
'duration'
'director_facebook_likes'
'actor_3_facebook_likes'
'actor_2_name'
'actor_1_facebook_likes'
'gross'
'genres'
'actor_1_name'
'movie_title'
'num_voted_users'
'cast_total_facebook_likes',
'actor_3_name',
'facenumber_in_poster',
'plot_keywords',
'movie_imdb_link',
'num_user_for_reviews',
'language',
'country',
'content_rating',
'budget',
'title_year',
'actor_2_facebook_likes',
'imdb_score',
'aspect_ratio',
'movie_facebook_likes'
Preprocessing of data
Steps:
1) Convert text fields to numbers
2) Convert strings (numbers in a CSV are read as strings) to float or int type
3) Remove NaNs
4) Remove uninteresting columns from data
5) Feature selection
import numpy as np
# preprocessing() is a helper (not shown on the slide) that applies the steps above
data_float = preprocessing(imdb_data)
data_np = np.array(data_float)
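The slides do not show the body of `preprocessing`. The sketch below is a hypothetical stand-in that covers steps 1 to 4 under simple assumptions: the first row is the header, empty cells count as missing, and text values are mapped to integer codes. The `drop_columns` parameter and the toy rows are invented for illustration.

```python
import numpy as np

def preprocessing(rows, drop_columns=()):
    """Hypothetical sketch: rows[0] is the header, the rest are string records."""
    header, records = rows[0], rows[1:]
    # Step 4: keep only the columns that are not listed in drop_columns.
    keep = [i for i, name in enumerate(header) if name not in drop_columns]
    # Step 1: one dictionary per column maps each distinct text value to a code.
    codes = [dict() for _ in header]

    cleaned = []
    for record in records:
        out = []
        for i in keep:
            cell = record[i].strip()
            if cell == "":
                out = None                    # step 3: skip rows with missing values
                break
            try:
                out.append(float(cell))       # step 2: numeric strings to float
            except ValueError:
                out.append(float(codes[i].setdefault(cell, len(codes[i]))))
        if out is not None:
            cleaned.append(out)
    return cleaned

# Toy rows, invented for illustration (not taken from movie_metadata.csv).
imdb_data = [
    ["color", "duration", "movie_title", "imdb_score"],
    ["Color", "120", "Movie A", "7.5"],
    ["Color", "", "Movie B", "6.0"],
    ["Black and White", "96", "Movie C", "7.1"],
]
data_float = preprocessing(imdb_data, drop_columns=("movie_title",))
data_np = np.array(data_float)
print(data_np)
```

Integer codes impose an arbitrary order on categories. The later slide refers to `data_np_onehot`, which suggests the workshop code one-hot encoded categorical columns instead.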
Train and Test data partitioning
from sklearn.model_selection import train_test_split
# remove label (column 20) from data
data_np_x = np.delete(data_np, [20], axis=1)
# data partitioning: hold out 25% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(
    data_np_x, data_np[:, 20], test_size=0.25, random_state=0)
Regression
# apply regression and voila !!
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
regr_0 = Ridge(alpha=1.0)
regr_0.fit(x_train, y_train)
y_pred = regr_0.predict(x_test)
# model evaluation on the held-out test set
print('absolute error: ', mean_absolute_error(y_test, y_pred))
print('squared error: ', mean_squared_error(y_test, y_pred))
Feature Selection
Select important columns which correlate well with the output. Benefits:
1) Faster model learning and inference
2) Improved accuracy
Dimensionality reduction with truncated SVD (a PCA-like projection; unlike PCA, it does not center the data):
from sklearn.decomposition import TruncatedSVD
from copy import deepcopy
# project the one-hot encoded data onto 5 components
svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)
data_svd = deepcopy(data_np_onehot)
data_svd = svd.fit_transform(data_svd)
Model Selection
How to select parameters of a model
Types of Regression
Popular regression models
1) Linear Regression
2) Ridge Regression: L2 Smoothing
3) Kernel regression: Higher order/non-linear
4) Lasso Regression: L1 Smoothing
5) Decision Tree regression (CART)
6) Random Forest Regression
Ridge Regression: Regularization
Why Regularization?
-- Less Training Data: Avoid Overfitting
-- Noisy Data: Smoothing / Robustness to Outliers
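To make the effect of regularization concrete, the Python sketch below uses the closed-form ridge solution w = (XᵀX + αI)⁻¹Xᵀy, without an intercept, on a small synthetic data set. The data and the list of alpha values are assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution (X^T X + alpha * I)^-1 X^T y, no intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                  # few samples, made-up features
true_w = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ true_w + rng.normal(0, 1.0, size=20)  # noisy targets

for alpha in (0.0, 1.0, 10.0, 100.0):
    w = ridge_fit(X, y, alpha)
    print(f"alpha={alpha:>5}: ||w|| = {np.linalg.norm(w):.3f}")
```

As alpha grows, the norm of the weight vector shrinks. This is the smoothing that keeps a model from chasing noise when training data is scarce.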
Ridge Regression: Regularization
# apply Ridge regression !!
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
regr_ridge = Ridge(alpha=10)
regr_ridge.fit(x_train, y_train)
y_pred = regr_ridge.predict(x_test)
# model evaluation
print('ridge absolute error: ', mean_absolute_error(y_test, y_pred))
print('ridge squared error: ', mean_squared_error(y_test, y_pred))
# alpha determines how much smoothing/regularization of weights we want
How to select Parameter alpha?
K-fold Cross Validation:
verbose_level = 10
from sklearn.model_selection import GridSearchCV
# 3-fold cross-validation over the candidate alpha values
regr_ridge = GridSearchCV(Ridge(), cv=3, verbose=verbose_level,
    param_grid={"alpha": [10, 1, 0.1]})
regr_ridge.fit(x_train, y_train)
y_pred = regr_ridge.predict(x_test)
print(regr_ridge.best_params_)
# model evaluation
print('ridge absolute error: ', mean_absolute_error(y_test, y_pred))
print('ridge squared error: ', mean_squared_error(y_test, y_pred))
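The idea behind k-fold cross-validation can be sketched by hand in Python. This is not how `GridSearchCV` is implemented: the sketch assumes shuffled folds, a closed-form ridge fit without an intercept, and mean squared error as the score, all chosen for illustration. The helper names `ridge_fit` and `kfold_mse` are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution, no intercept."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def kfold_mse(X, y, alpha, k=3, seed=0):
    """Average validation mean squared error of ridge over k folds."""
    indices = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(indices, k)
    errors = []
    for i in range(k):
        val = folds[i]                                    # held-out fold
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], alpha)
        errors.append(np.mean((X[val] @ w - y[val]) ** 2))
    return float(np.mean(errors))

# Synthetic data, made up for illustration.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(0, 0.5, size=60)

scores = {alpha: kfold_mse(X, y, alpha) for alpha in (10, 1, 0.1)}
best_alpha = min(scores, key=scores.get)
print(scores, "best:", best_alpha)
```

Each candidate alpha is scored only on data the model did not see during fitting, so the chosen value reflects generalization rather than training fit.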
Lasso regression: Feature Sparsity
Another form of regularization, using the L1 norm:
# Lasso Regression
from sklearn.linear_model import Lasso
regr_0 = Lasso(alpha=1.0)
regr_0.fit(x_train, y_train)
y_pred = regr_0.predict(x_test)
# alpha determines how much sparsity-inducing smoothing/regularization of weights we want
Lasso regression: Feature Sparsity
[Figure: plots of the coefficients in Ridge Regression vs Lasso Regression]
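The difference between the two penalties is easiest to see in the special case of a design matrix with orthonormal columns (XᵀX = I), where both solutions have a closed form. The Python sketch below assumes that special case and uses made-up least-squares coefficients. The penalty scaling is chosen for readability and differs from scikit-learn's.

```python
import numpy as np

# Ordinary least-squares coefficients for a design with orthonormal columns
# (X^T X = I); the values are made up for illustration.
beta_ols = np.array([3.0, -1.5, 0.4, -0.2, 0.05])
alpha = 0.5

# Ridge, minimizing 0.5*||y - Xb||^2 + 0.5*alpha*||b||^2: uniform shrinkage.
beta_ridge = beta_ols / (1.0 + alpha)

# Lasso, minimizing 0.5*||y - Xb||^2 + alpha*||b||_1: soft-thresholding.
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - alpha, 0.0)

print("ridge:", beta_ridge)
print("lasso:", beta_lasso)
```

Ridge scales every coefficient down but leaves all of them non-zero. Lasso sets the small ones exactly to zero, which is the feature sparsity named in the slide title.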
Lasso Regularization Regression
verbose_level = 1
from sklearn.linear_model import Lasso
# 2-fold cross-validation over the candidate alpha values
regr_ls = GridSearchCV(Lasso(), cv=2, verbose=verbose_level,
    param_grid={"alpha": [0.01, 0.1, 1, 10]})
regr_ls.fit(x_train, y_train)
y_pred = regr_ls.predict(x_test)
print(regr_ls.best_params_)
# model evaluation
print('Lasso absolute error: ', mean_absolute_error(y_test, y_pred))
print('Lasso squared error: ', mean_squared_error(y_test, y_pred))
Decision Tree Regression
Decision Tree Regression: Visualization with depth
[Figure panels: Depth 1, Depth 2, Depth 5]
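A depth-1 regression tree (a "stump") is small enough to write out by hand. The Python sketch below picks the single threshold that minimizes the squared error on one feature. The data and the helper names `fit_stump` and `predict_stump` are invented for illustration. A deeper tree repeats the same split search inside each half.

```python
import numpy as np

def fit_stump(x, y):
    """Depth-1 regression tree on one feature: best threshold by squared error."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (np.inf, None, None, None)
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                      # no threshold fits between equal values
        left, right = y[:i], y[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best[0]:
            threshold = (x[i] + x[i - 1]) / 2.0
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    """Predict the mean of the side of the threshold that each x falls on."""
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# Made-up data: ratings jump once the budget passes roughly 50.
budget = np.array([10.0, 20.0, 30.0, 40.0, 60.0, 70.0, 80.0, 90.0])
rating = np.array([5.0, 5.2, 4.8, 5.0, 7.0, 7.2, 6.8, 7.0])

stump = fit_stump(budget, rating)
print(stump, predict_stump(stump, np.array([25.0, 75.0])))
```

Increasing the depth lets the tree fit finer steps in the data, at the cost of fitting noise, which is why `max_depth` is tuned by cross-validation.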
Decision Tree Regression
from sklearn.tree import DecisionTreeRegressor
# 2-fold cross-validation over the candidate tree depths
regr_dt = GridSearchCV(DecisionTreeRegressor(), cv=2, verbose=verbose_level,
    param_grid={"max_depth": [2, 3, 4, 5, 6]})
#regr_dt = DecisionTreeRegressor(max_depth=2)
regr_dt.fit(x_train, y_train)
y_pred = regr_dt.predict(x_test)
print(regr_dt.best_params_)
# model evaluation
print('decision tree absolute error: ', mean_absolute_error(y_test, y_pred))
print('decision tree squared error: ', mean_squared_error(y_test, y_pred))
Random Forest for Regression
--> Learn multiple decision trees, each on a random sample of the data
--> Predict the value as the average of the predictions from the individual trees
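The two steps above can be sketched in Python with bootstrap samples and depth-1 trees. This is a simplified illustration under stated assumptions: real random forests also subsample features at each split and grow deeper trees. The data and the helper names are made up.

```python
import numpy as np

def fit_stump(x, y):
    """Depth-1 tree: choose the threshold that minimizes squared error."""
    best_sse, best = np.inf, (x.mean(), y.mean(), y.mean())
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_sse, best = sse, (t, left.mean(), right.mean())
    return best

def predict_stump(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.where(x > 5, 7.0, 5.0) + rng.normal(0, 0.3, size=200)   # made-up data

# Train each tree on a bootstrap sample (rows drawn with replacement).
trees = []
for _ in range(25):
    idx = rng.integers(0, len(x), size=len(x))
    trees.append(fit_stump(x[idx], y[idx]))

# The ensemble prediction is the average over the individual trees.
x_new = np.array([2.0, 8.0])
y_hat = np.mean([predict_stump(t, x_new) for t in trees], axis=0)
print(y_hat)
```

Averaging many trees trained on different samples reduces the variance of any single tree's prediction.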
Random Forest Regression
from sklearn.ensemble import RandomForestRegressor
# 2-fold cross-validation over the candidate tree depths
regr_rf = GridSearchCV(RandomForestRegressor(), cv=2, verbose=verbose_level,
    param_grid={"max_depth": [2, 3, 4, 5]})
#regr_rf = RandomForestRegressor(max_depth=2)
regr_rf.fit(x_train, y_train)
y_pred = regr_rf.predict(x_test)
print(regr_rf.best_params_)
# model evaluation
print('Random Forest absolute error: ', mean_absolute_error(y_test, y_pred))
print('Random Forest squared error: ', mean_squared_error(y_test, y_pred))
Other Forms Of Regression
# Support Vector Regression
from sklearn.svm import SVR
kfold_regr = GridSearchCV(SVR(), cv=5, verbose=10,
    param_grid={"C": [10, 1, 0.1, 1e-2], "epsilon": [0.05, 0.1, 0.2]})
# Gaussian Process Regression
from sklearn.gaussian_process import GaussianProcessRegressor
kfold_regr = GridSearchCV(GaussianProcessRegressor(kernel=None), cv=5,
    verbose=10, param_grid={"alpha": [10, 1, 0.1, 1e-2]})
Recap of Python Session
Preprocessing -
--> Feature selection
--> Handling missing data
--> Handling categorical data
Model Evaluation: splitting data into training and testing sets
Model Selection -
--> Find parameters: cross-validation
--> Various regression models:
a. Simple model: linear regression
b. Regularization (L2 norm): ridge regression
c. Sparse regularization (L1 norm): lasso regression
d. Interpretable: decision trees
e. Random forests: ensembles of decision trees
Thank you

Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 

Predictive Analytics - Workshop

  • 1. PAGE | GRACE HOPPER CELEBRATION INDIA 2016 | #GHCI16 PRESENTED BY THE ANITA BORG INSTITUTE AND THE ASSOCIATION FOR COMPUTING MACHINERY INDIA
      Introduction to Predictive Analytics: Hands-On Workshop Using R & Python
      Presenters (Python): Lavanya Sita Tekumalla, Sharmistha Jat
      Presenters (R): Maheshwari Dhandapani, Subramanian Lakshminarayanan, Sowmya Venugopal, Bindu
  • 2. Agenda
      - Basics of Predictive Modeling Techniques (30m)
      - Hands-on Workshop: Regression
          (1) Build Model: R (30m)
          (2) Build Model: Python (30m)
  • 3. What is Predictive Analytics?
      Learn from available data and make meaningful predictions.
      Why Predictive Analytics?
      Too much data, too many scenarios: it is hard for humans to explicitly describe predictive rules for all scenarios.
      Exercise: let's predict something. Predict how long it takes to reach home.
  • 4. Common Analytics Tasks: Supervised Learning
      Regression: predict a continuous target.
      - Can I predict the time taken to get home from past history?
      - Can I predict the Sensex value from past market history?
  • 5. Common Analytics Tasks: Supervised Learning
      Classification: predict the class/type of an object.
      - Can I tell images of cats from images of dogs by learning from examples?
      - Can I identify handwritten digits by studying examples?
  • 6. Common Analytics Tasks: Unsupervised Learning
      Clustering: identify groups inherent in the data.
      - Given a set of news articles, what are the underlying topics or themes?
  • 7. Predict Movie Success?
  • 8. Predict Movie Success: Features
      - Actors
      - Director
      - Gross budget
      - Social media feedback
      - Genre and keywords
      - Release date
  • 9. Example: Predict Movie Sales?
      Known data: advertising dollars and the corresponding sales for many prior movies.
      Prediction task: for a new movie, given its advertising budget, can you forecast sales?
      Regression: Sales = f(Advertising budget). How do we learn f?
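One common way to learn f, when f is assumed to be a straight line, is ordinary least squares. The sketch below is only an illustration: the function name `fit_line` and the budget/sales numbers are made up, not taken from the workshop material.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) minimising the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data in arbitrary units: here sales = 1 + 2 * budget exactly.
budgets = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_line(budgets, sales)
forecast = intercept + slope * 5.0   # forecast sales for a new budget of 5.0
print(intercept, slope, forecast)
```

With real data the points will not lie exactly on a line, and the fitted line is the one with the smallest total squared error.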
  • 10. Example: Movie Hit / Flop from Budget and Trailer Facebook Likes?
      Known data: budgets and Facebook statistics of various hit and flop movies.
      Prediction task: for a new movie, I know the budget and the Facebook likes on its trailer. What is the probability of a hit?
      Classification: can I learn the separating line between hit and flop movies?
      (Figure: hit and flop movies plotted with axes Budget and Facebook Likes.)
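As a rough illustration of learning a separating line, the sketch below uses the classic perceptron rule. This technique is swapped in for simplicity and is not necessarily what the presenters used. It outputs a hard hit/flop label rather than a probability; a model such as logistic regression would be needed for probabilities. All data points are made up.

```python
def train_perceptron(points, labels, max_epochs=1000):
    """Learn weights w and bias b so that sign(w . x + b) matches the labels."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            # A point is misclassified if it lies on the wrong side of (or on) the line
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:
                w[0] += y * x1
                w[1] += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:   # every training point is on the correct side
            break
    return w, b


def predict(w, b, point):
    return 1 if w[0] * point[0] + w[1] * point[1] + b > 0 else -1


# Hypothetical (budget, trailer likes) pairs in arbitrary units; +1 = hit, -1 = flop.
points = [(3, 3), (4, 3), (3, 4), (1, 1), (2, 1), (1, 2)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_perceptron(points, labels)
predictions = [predict(w, b, p) for p in points]
print(predictions)
```

The loop is only guaranteed to stop early when the two classes can actually be separated by a line, as in this toy data.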
  • 11. The Predictive Analytics Framework
      Data/Examples → Feature Extraction → Learning Algorithm → Model
      New Data Instance → Model → Prediction
      Evaluation: how well is my algorithm working?
      Model selection: which learning algorithm to use?
  • 12. Important Aspects of the Analytics Framework
      - Feature engineering: finding the discerning characteristics
      - Data collection: collecting the right data / combining multiple sources
      - Cleanup: a huge effort (noise, missing data, format conversion, ...)
      "If you torture the data long enough, it will confess to anything." -- Ronald Coase
      "The goal is to turn data into information and information into insight." -- Carly Fiorina
  • 13. Regression Analysis
      What?
      - "Regression analysis is a way of finding and representing the relationship between two or more variables."
      - A simple yet effective tool for prediction and estimation
      Why?
      - To predict an event/outcome using the attributes or features influencing it
      Examples
      - Why don't UPS truck drivers take left turns?
      - Predict movie rating
  • 14. Regression Analysis
      How? The key is to arrive at an equation that captures the relationship between the outcome and its influencing features. It answers the questions:
      - Which variables matter the most or the least?
          - Independent variables / predictors / features
          - Dependent variable / outcome
      - How do those variables interact with each other?
      $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \varepsilon$, with Y = movie rating, x1 = budget, x2 = duration
  • 15. Data Exploration
      Identify the nature of the data and the patterns in the underlying set.
      Descriptive analysis describes or summarizes the raw data, making it more human-interpretable. It condenses data into nuggets of information (mean, median).
      - Missing data: when to impute, when to omit (R packages: mice, VIM, Amelia)
      - Nature of the data distribution (spread around the mean, skewness, outliers)
      Data variable types: continuous (quantitative) and categorical (qualitative)
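A small sketch of the descriptive step, using only Python's standard `statistics` module. The duration values are made up, and filling the gap with the median is just one possible imputation choice.

```python
import statistics

# Hypothetical movie durations in minutes; None marks a missing value.
durations = [90, 95, 100, 105, None, 110, 240]

observed = [d for d in durations if d is not None]
mean = statistics.mean(observed)
median = statistics.median(observed)

# A mean well above the median hints at right skew (here caused by the 240 outlier).
right_skewed = mean > median

# One simple imputation choice: fill missing values with the median.
filled = [d if d is not None else median for d in durations]
print(mean, median, right_skewed, filled)
```

Omitting the row instead of imputing is the other option the slide mentions; which is better depends on how much data is missing and why.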
  • 16. Visualize Data Distribution
  • 17. Visualization of Variable Relationships
      - How are two features/variables related to one another? (correlation coefficient)
          - -1.00: an increase in one variable goes with a decrease in the other
          - +1.00: an increase in one variable goes with an increase in the other
          - 0: no linear correlation
      - Is there redundancy?
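The values -1, 0 and +1 above are values of the Pearson correlation coefficient. A minimal sketch of how it is computed follows; the helper name `pearson` and the toy lists are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4]
r_up = pearson(xs, [2, 4, 6, 8])     # moves together with xs
r_down = pearson(xs, [8, 6, 4, 2])   # moves opposite to xs
r_none = pearson(xs, [1, 2, 2, 1])   # no linear relationship
print(r_up, r_down, r_none)
```

Two features with a correlation close to +1 or -1 carry largely the same information, which is the redundancy the slide asks about.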
  • 18. Data Cleansing
      What is cleansing? "Conversion of raw data → technically correct data → consistent data"
      Why is cleansing important? Incorrect or inconsistent data can lead to false conclusions.
      - Removal of outliers, which can skew your results
      - Removal of missing data
      - Removal of duplicates
      - Transformation of data
      R packages for data cleansing: MICE, Amelia, missForest, Hmisc, mi
  • 19. Data Cleansing: plotting missing data using the mice package in R
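The cleansing steps can be sketched in plain Python as below. The rows are made up, and the outlier rule (distance from the median measured in median absolute deviations, with a cutoff of 5) is an assumption chosen for illustration, not a rule from the workshop.

```python
import statistics

# Hypothetical (title, budget) rows.
raw = [
    ("A", 100.0), ("B", 110.0), ("B", 110.0),   # duplicate row
    ("C", None),                                # missing value
    ("D", 95.0), ("E", 105.0), ("F", 5000.0),   # likely outlier
]

# 1) Remove exact duplicates while keeping the original order.
deduped = list(dict.fromkeys(raw))

# 2) Remove rows with missing values.
complete = [row for row in deduped if row[1] is not None]

# 3) Remove outliers: keep values within 5 median absolute deviations (MAD)
#    of the median. The factor 5 is an arbitrary illustrative cutoff.
values = [value for _, value in complete]
median = statistics.median(values)
mad = statistics.median(abs(v - median) for v in values)
cleaned = [row for row in complete if abs(row[1] - median) <= 5 * mad]
print(cleaned)
```

A median-based rule is used here because, with so few rows, the mean and standard deviation would themselves be dominated by the outlier.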
  • 20. Feature Selection
      Identify the important variables so that the predictive model is free from correlated variables, bias, and unwanted noise.
      E.g. the Boruta package in R identifies important variables using Random Forest.
  • 21. Building the Model
  • 22. R - Workshop
  • 23. R Setup
      - Copy the install binaries and packages to your laptop
      - Install R & RStudio
      - Install the packages (ggplot2, VIM, mice, Hmisc, etc.)
      - Copy the model code, RDS file, and the dataset
      - Set the working directory using setwd(<dir where you have the script, dataset, RDS file>)
  • 24. Explore Data using R
  • 25. Validate the Model
      - Run the model against the "test" data set, which was set aside before training
      - Check the predicted values against the actual observed values
      - (Cross-)validation is done to assess the fitness of the model
      - The model should neither under-fit nor over-fit future unseen data
      - Validate the regression using:
          - R² (higher is better)
          - Residuals (ideally randomly distributed; a pattern such as heteroscedasticity signals a problem)
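For reference, R² and the residuals can be computed by hand as in the sketch below. It is written in Python for consistency with the later part of the workshop, and the observed/predicted values are made up.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical observed values and model predictions on a held-out test set.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]

residuals = [t - p for t, p in zip(y_true, y_pred)]
r2 = r_squared(y_true, y_pred)
print(residuals, r2)
```

Plotting the residuals against the predicted values is the usual way to check that they show no pattern.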
  • 26. Python - Workshop
  • 27. Basic Pipeline
      1) Data loading and inspection
      2) Cleaning and preprocessing
      3) Train/test partitioning
      4) Feature selection
      5) Regression
      6) Model selection, parameter tuning, regularisation
  • 28. Data Loading
        # load the IMDB data into a Python list of rows (each row is a list of strings)
        import csv
        with open('movie_metadata.csv', newline='') as f:
            imdb_data = list(csv.reader(f))
  • 29. Columns in Data
      'color', 'director_name', 'num_critic_for_reviews', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name', 'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link', 'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio', 'movie_facebook_likes'
  • 30. Preprocessing of Data
      Steps:
      1) Convert text fields to numbers
      2) Convert strings (numbers in a CSV are read as strings) to float or int type
      3) Remove NaNs
      4) Remove uninteresting columns from the data
      5) Feature selection
        import numpy as np
        # preprocessing() is the workshop's helper for the steps above (its body is not shown on the slide)
        data_float = preprocessing(imdb_data)
        data_np = np.array(data_float)
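The body of `preprocessing` is not shown on the slides. As a rough, hypothetical sketch of steps 2 to 4 (string-to-float conversion, dropping rows with missing values, keeping only selected columns), something like the following could work. The name `preprocessing_sketch`, its `keep_columns` parameter and the sample rows are all assumptions; encoding text fields as numbers (step 1) is left out.

```python
import math


def preprocessing_sketch(rows, keep_columns):
    """Keep the named columns, convert cells to float, drop rows with missing values."""
    header, body = rows[0], rows[1:]
    keep_idx = [header.index(name) for name in keep_columns]
    cleaned = []
    for row in body:
        try:
            values = [float(row[i]) for i in keep_idx]
        except ValueError:
            continue   # empty or non-numeric cell: skip the row
        if any(math.isnan(v) for v in values):
            continue   # the literal string "nan" parses to NaN: skip the row too
        cleaned.append(values)
    return cleaned


# Hypothetical rows in the shape produced by csv.reader (header first, all strings).
sample = [
    ["movie_title", "duration", "budget", "imdb_score"],
    ["Movie A", "120", "1000", "7.5"],
    ["Movie B", "", "2000", "6.0"],      # missing duration
    ["Movie C", "90", "nan", "5.0"],     # missing budget
    ["Movie D", "100", "3000", "8.1"],
]
data_float = preprocessing_sketch(sample, ["duration", "budget", "imdb_score"])
print(data_float)
```

Dropping incomplete rows is the simplest option; imputing values would keep more rows at the cost of introducing estimated data.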
  • 31. Train and Test Data Partitioning
        import numpy as np
        from sklearn.model_selection import train_test_split
        # separate the features from the label (column 20)
        data_np_x = np.delete(data_np, [20], axis=1)
        # hold out 25% of the rows as test data
        x_train, x_test, y_train, y_test = train_test_split(
            data_np_x, data_np[:, 20], test_size=0.25, random_state=0)
  • 32. Regression
        from sklearn.linear_model import Ridge
        from sklearn.metrics import mean_absolute_error, mean_squared_error
        # apply regression and voila!
        regr_0 = Ridge(alpha=1.0)
        regr_0.fit(x_train, y_train)
        y_pred = regr_0.predict(x_test)
        # model evaluation
        print('absolute error: ', mean_absolute_error(y_test, y_pred))
        print('squared error: ', mean_squared_error(y_test, y_pred))
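The two evaluation metrics used above are simple averages over the test set. The sketch below spells them out in plain Python; the helper names mirror the scikit-learn function names only for readability, and the scores are made up.

```python
def mean_absolute_error_sketch(y_true, y_pred):
    """Average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error_sketch(y_true, y_pred):
    """Average of (actual - predicted)^2; penalises large errors more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical IMDB scores and predictions.
y_true = [7.0, 6.0, 8.0]
y_pred = [6.5, 6.0, 9.0]

mae = mean_absolute_error_sketch(y_true, y_pred)
mse = mean_squared_error_sketch(y_true, y_pred)
print(mae, mse)
```

Because the squared error grows quadratically, a single badly predicted movie affects it far more than it affects the absolute error.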
  • 33. Feature Selection
      Select important columns which correlate well with the output:
      1) Faster model learning and inference
      2) Improved accuracy
      3) Feature selection using PCA (here via TruncatedSVD, which does not center the data)
        from sklearn.decomposition import TruncatedSVD
        from copy import deepcopy
        # data_np_onehot: the one-hot encoded data matrix (not defined on the slides)
        svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)
        data_svd = deepcopy(data_np_onehot)
        # project the data onto its 5 leading singular directions
        data_svd = svd.fit_transform(data_svd)
  • 34. Model Selection
      How to select the parameters of a model; types of regression.
      Popular regression models:
      1) Linear regression
      2) Ridge regression: L2 smoothing
      3) Kernel regression: higher order / non-linear
      4) Lasso regression: L1 smoothing
      5) Decision tree regression (CART)
      6) Random forest regression
  • 35. Ridge Regression: Regularization
      Why regularization?
      - Less training data: avoid overfitting
      - Noisy data: smoothing / robustness to outliers
  • 36. Ridge Regression: Regularization
        from sklearn.linear_model import Ridge
        from sklearn.metrics import mean_absolute_error, mean_squared_error
        # apply Ridge regression; alpha determines how much
        # smoothing/regularization of the weights we want
        regr_ridge = Ridge(alpha=10)
        regr_ridge.fit(x_train, y_train)
        y_pred = regr_ridge.predict(x_test)
        # model evaluation
        print('ridge absolute error: ', mean_absolute_error(y_test, y_pred))
        print('ridge squared error: ', mean_squared_error(y_test, y_pred))
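To see what alpha does to the weights, consider the simplest case: one feature and no intercept. Minimising the squared error plus alpha times the squared weight gives weight = sum(x*y) / (sum(x*x) + alpha). The sketch below uses that closed form with made-up data. It is a simplification, since scikit-learn's `Ridge` also fits an intercept by default.

```python
def ridge_slope(xs, ys, alpha):
    """Single-feature ridge weight without intercept: sum(x*y) / (sum(x*x) + alpha)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


# Hypothetical data where y = 2 * x exactly.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for alpha in [0.0, 1.0, 14.0]:
    # alpha = 0 recovers plain least squares; larger alpha shrinks the weight towards 0
    print(alpha, ridge_slope(xs, ys, alpha))
```

Shrinking the weight makes the fit less sensitive to individual noisy points, at the price of some bias.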
  • 37. How to Select the Parameter alpha? K-fold Cross-Validation
  • 38. How to Select the Parameter alpha? K-fold Cross-Validation
        from sklearn.linear_model import Ridge
        from sklearn.model_selection import GridSearchCV
        verbose_level = 10
        # 3-fold cross-validated search over the candidate alpha values
        regr_ridge = GridSearchCV(Ridge(), cv=3, verbose=verbose_level,
                                  param_grid={"alpha": [10, 1, 0.1]})
        regr_ridge.fit(x_train, y_train)
        y_pred = regr_ridge.predict(x_test)
        print(regr_ridge.best_params_)
        # model evaluation
        print('ridge absolute error: ', mean_absolute_error(y_test, y_pred))
        print('ridge squared error: ', mean_squared_error(y_test, y_pred))
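The idea behind the grid search can be sketched without scikit-learn: split the training rows into k folds, train on k-1 of them, measure the error on the remaining fold, and pick the alpha with the lowest average error. Everything below is an illustrative assumption: the helper names, the toy data, the single-feature ridge formula, and mean squared error as the score (`GridSearchCV` uses the estimator's own score by default).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx); every sample lands in exactly one validation fold."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


def ridge_slope(xs, ys, alpha):
    """Single-feature ridge weight without intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def cv_error(xs, ys, alpha, k=3):
    """Average validation mean squared error over the k folds."""
    fold_errors = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        slope = ridge_slope([xs[i] for i in train_idx], [ys[i] for i in train_idx], alpha)
        mse = sum((ys[i] - slope * xs[i]) ** 2 for i in val_idx) / len(val_idx)
        fold_errors.append(mse)
    return sum(fold_errors) / len(fold_errors)


# Hypothetical noiseless data (y = 2 * x), so the least regularised candidate wins.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
folds = list(k_fold_indices(len(xs), 3))
best_alpha = min([10, 1, 0.1], key=lambda a: cv_error(xs, ys, a))
print(best_alpha)
```

On noisy real data the smallest alpha is not always the winner, which is exactly why the search is worth running.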
  • 39. Lasso Regression: Feature Sparsity
      Another form of regularization, with the L1 norm:
        # Lasso regression; alpha determines how much sparsity-inducing
        # smoothing/regularization of the weights we want
        from sklearn.linear_model import Lasso
        regr_0 = Lasso(alpha=1.0)
        regr_0.fit(x_train, y_train)
        y_pred = regr_0.predict(x_test)
  • 40. Lasso Regression: Feature Sparsity
      (Figure: plots of the coefficients in Ridge regression vs. Lasso regression.)
  • 41. Lasso Regularization Regression
        from sklearn.linear_model import Lasso
        from sklearn.model_selection import GridSearchCV
        verbose_level = 1
        # 2-fold cross-validated search over the candidate alpha values
        regr_ls = GridSearchCV(Lasso(), cv=2, verbose=verbose_level,
                               param_grid={"alpha": [0.01, 0.1, 1, 10]})
        regr_ls.fit(x_train, y_train)
        y_pred = regr_ls.predict(x_test)
        print(regr_ls.best_params_)
        # model evaluation
        print('Lasso absolute error: ', mean_absolute_error(y_test, y_pred))
        print('Lasso squared error: ', mean_squared_error(y_test, y_pred))
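Why Lasso produces exact zeros while Ridge does not can be seen in an idealised setting. Assume orthonormal features and the objectives ½·squared error + λ·(L1 norm) for Lasso and ½·squared error + (λ/2)·(squared L2 norm) for Ridge. The Lasso solution then soft-thresholds each least-squares coefficient, and the Ridge solution rescales it by 1/(1 + λ). The sketch below applies both to made-up coefficients; scikit-learn scales its `alpha` differently, so the numbers are illustrative only.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def ridge_shrink(z, lam):
    """Ridge rescales the coefficient but never sets it exactly to zero."""
    return z / (1 + lam)


# Hypothetical least-squares coefficients for four features.
ols_coefs = [3.0, -0.5, 0.2, -2.0]
lam = 1.0

lasso_coefs = [soft_threshold(z, lam) for z in ols_coefs]
ridge_coefs = [ridge_shrink(z, lam) for z in ols_coefs]
print(lasso_coefs)
print(ridge_coefs)
```

The weak features (coefficients -0.5 and 0.2) are dropped entirely by Lasso, which is the feature sparsity the slide title refers to.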
  • 42. Decision Tree Regression
  • 43. Decision Tree Regression: Visualization with Depth
      (Figure panels: Depth 1, Depth 2, Depth 5.)
  • 44. Decision Tree Regression
        from sklearn.tree import DecisionTreeRegressor
        from sklearn.model_selection import GridSearchCV
        # cross-validated search over the maximum tree depth
        regr_dt = GridSearchCV(DecisionTreeRegressor(), cv=2, verbose=verbose_level,
                               param_grid={"max_depth": [2, 3, 4, 5, 6]})
        # regr_dt = DecisionTreeRegressor(max_depth=2)
        regr_dt.fit(x_train, y_train)
        y_pred = regr_dt.predict(x_test)
        print(regr_dt.best_params_)
        # model evaluation
        print('decision tree absolute error: ', mean_absolute_error(y_test, y_pred))
        print('decision tree squared error: ', mean_squared_error(y_test, y_pred))
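A depth-1 regression tree (a "stump") shows the basic building block. It tries every threshold on a feature, predicts the mean of the targets on each side, and keeps the threshold with the smallest squared error. Deeper trees repeat the same step inside each side. The helper name `fit_stump` and the data are made up for illustration.

```python
def fit_stump(xs, ys):
    """Return (threshold, left_mean, right_mean) of the best single split."""
    best = None
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct feature values
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


# Hypothetical data: low x values have target 5, high x values have target 20.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]

threshold, left_mean, right_mean = fit_stump(xs, ys)
print(threshold, left_mean, right_mean)
```

Each extra level of depth lets the tree carve the feature space into finer pieces, which is what the depth panels in the figure illustrate.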
  • 45. Random Forest for Regression
      - Learn multiple decision trees with random partitions of the data
      - Predict the value as the average of the predictions from the multiple trees
  • 46. Random Forest Regression
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.model_selection import GridSearchCV
        # cross-validated search over the maximum depth of the trees
        regr_rf = GridSearchCV(RandomForestRegressor(), cv=2, verbose=verbose_level,
                               param_grid={"max_depth": [2, 3, 4, 5]})
        # regr_rf = RandomForestRegressor(max_depth=2)
        regr_rf.fit(x_train, y_train)
        y_pred = regr_rf.predict(x_test)
        print(regr_rf.best_params_)
        # model evaluation
        print('Random Forest absolute error: ', mean_absolute_error(y_test, y_pred))
        print('Random Forest squared error: ', mean_squared_error(y_test, y_pred))
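The "many trees on random samples, then average" idea can be sketched with depth-1 trees and bootstrap samples, as below. This is a simplified illustration with made-up data and hypothetical helper names. A real random forest, such as scikit-learn's `RandomForestRegressor`, grows deeper trees and can also randomise the features considered at each split.

```python
import random


def fit_stump(xs, ys):
    """Depth-1 regression tree; returns a predict(x) function."""
    overall_mean = sum(ys) / len(ys)
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:   # the sample has a single distinct x: predict the mean
        return lambda x: overall_mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample; predict with the average of all stumps."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample rows with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(tree(x) for tree in trees) / len(trees)


# Hypothetical data: low x values have target 5, high x values have target 20.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]

forest = fit_forest(xs, ys)
low_pred, high_pred = forest(2), forest(11)
print(low_pred, high_pred)
```

Averaging over many trees smooths out the quirks of any single tree, which usually reduces variance compared with one deep tree.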
  • 47. Other Forms of Regression
        from sklearn.svm import SVR
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.model_selection import GridSearchCV
        # Support Vector Regression
        kfold_regr = GridSearchCV(SVR(), cv=5, verbose=10,
                                  param_grid={"C": [10, 1, 0.1, 1e-2],
                                              "epsilon": [0.05, 0.1, 0.2]})
        # Gaussian Process Regression
        kfold_regr = GridSearchCV(GaussianProcessRegressor(kernel=None), cv=5, verbose=10,
                                  param_grid={"alpha": [10, 1, 0.1, 1e-2]})
  • 48. Recap of Python Session
      Preprocessing:
      - Feature selection
      - Handling missing data
      - Handling categorical data
      Model evaluation: making training and testing data
      Model selection:
      - Find parameters: cross-validation
      - Various regression models:
          a. Simple model: linear regression
          b. Regularization (L2 norm): ridge regression
          c. Sparse regularization: lasso regression
          d. Interpretable: decision trees
          e. Random forests: ensembles of decision trees
  • 49. Thank you