Building and Deploying Big Data Analytic Models
Robert L. Grossman
Collin Bennett
Open Data Group
December 4, 2012
5.1 SAMs
A SAM runs from scores to actions to measures: the model consumer scores events, the scores drive actions, and measures (reported in dashboards) track the results.

Analytic Diamond
• Analytic strategy & governance
• Analytic algorithms & models
• Analytic infrastructure
• Analytic operations, security & compliance

Building models: a model producer builds a model from data (CART, SVM, k-means, etc.) and exports it as PMML.
Consuming models: a model consumer deploys the model, producing scores, actions, measures and dashboards.
Life Cycle of a Predictive Model
1. Exploratory data analysis (R); process the data (Hadoop).
2. Build the model in a dev/modeling environment (analytic modeling).
3. Deploy the model, via PMML, in operational systems with a scoring application (Hadoop, streams & Augustus).
4. Refine the model using log files from operations.
5. Retire the model and deploy an improved model.
5.2 Analytic Algorithms & Models
Key Modeling Activities
 Exploratory data analysis – the goal is to understand the data so that features can be generated and models evaluated
 Generating features – the goal is to shape events into meaningful features that enable meaningful prediction of the dependent variable
 Building models – the goal is to derive statistically valid models from the learning set
 Evaluating models – the goal is to develop a meaningful report so that the performance of one model can be compared to another
? Naïve Bayes
? k-nearest neighbor
1957 k-means
1977 EM algorithm
1984 CART
1993 C4.5
1994 Apriori
1995 SVM
1997 AdaBoost
1998 PageRank

Beware of any vendor whose unique value proposition includes some “secret analytic sauce.”

Source: X. Wu et al., Top 10 algorithms in data mining, Knowledge and Information Systems, Volume 14, 2008, pages 1-37.
Pessimistic View: We Get a Significantly
New Algorithm Every Decade or So
1970s neural networks
1980s classification and regression trees
1990s support vector machines
2000s graph algorithms

In general, understanding the data through exploratory analysis and generating good features is much more important than the type of predictive model you use.
Questions About Datasets
• Number of records
• Number of data fields
• Size
– Does it fit into memory?
– Does it fit on a single disk or disk array?
• How many missing values?
• How many duplicates?
• Are there labels (for the dependent variable)?
• Are there keys?
Questions About Data Fields
• Data fields
– What is the mean, variance, …?
– What is the distribution? Plot the histograms.
– What are extreme values, missing values, unusual
values?
• Pairs of data fields
– What is the correlation? Plot the scatter plots.
• Visualize the data
– Histograms, scatter plots, etc.
(Figure: cluster plots showing stable clusters and a new cluster emerging.)
Computing Count Distributions
def valueDistrib(b, field, select=None):
    # Given a table b of records and an attribute field, return a
    # dict mapping each value of the field to its count, optionally
    # restricted to the row indices in select.
    ccdict = {}
    rows = select if select is not None else range(len(b))
    for i in rows:
        count = ccdict.get(b[i][field], 0)
        ccdict[b[i][field]] = count + 1
    return ccdict
Features
• Normalize
– [0, 1] or [-1, 1]
• Aggregate
• Discretize or bin
• Transform continuous to continuous or
discrete to discrete
• Compute ratios
• Use percentiles
• Use indicator variables
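Several of these transforms are one-liners. A minimal sketch (the function names here are mine, not from the slides):

```python
def normalize01(xs):
    # Scale values to [0, 1]; assumes xs is not constant.
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def binize(x, edges):
    # Discretize: index of the first bin whose upper edge exceeds x.
    for i, e in enumerate(edges):
        if x < e:
            return i
    return len(edges)

def indicator(x, category):
    # Indicator variable for a categorical value.
    return 1 if x == category else 0
```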
Predicting Office Building Renewal
Using Text-based Features

Classification   Avg R      Classification   Avg R
imports          5.17       technology       2.03
clothing         4.17       apparel          2.03
van              3.18       law              2.00
fashion          2.50       casualty         1.91
investment       2.42       bank             1.88
marketing        2.38       technologies     1.84
oil              2.11       trading          1.82
air              2.09       associates       1.81
system           2.06       staffing         1.77
foundation       2.04       securities       1.77
Twitter Intention & Prediction
• Use machine learning classification techniques:
– Support vector machines
– Trees
– Naïve Bayes
• Models require clean data and lots of hand-scored examples
(Figure: comparison of SVM, naïve Bayes and tree classifiers.)
• Namibia floods classified by the R package e1071.
• Results are often implementation dependent.
If You Can Ask For Something
• Ask for more data
• Ask for “orthogonal data”
5.3 Multiple Models
Trees
For CART trees: L. Breiman, J. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Chapman & Hall, 1984.
Trees Partition Feature Space
• Trees partition the feature space into regions by asking whether an
attribute is less than a threshold.
(Figure: splits M < 2.45, M < 4.95, and R < 1.75 partition the (M, R) plane into regions labeled A, B, and C.)
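Read as code, such a tree is just a few nested threshold comparisons. A sketch using the thresholds in the figure, with the region labels as I read them from the diagram (so treat the exact assignments as illustrative):

```python
def classify(M, R):
    # Each question compares one attribute to a threshold.
    if M < 2.45:
        return "A"
    if R < 1.75:
        return "B" if M < 4.95 else "C"
    return "C"
```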
Split Using Entropy
(Figure: objects with attributes are divided by candidate splits 1-4; starting information = 0.64, the best split increases information by 0.32, the worst by 0.)
Growing the Tree
Step 1. Class proportions. Node u has n objects: n1 of class A (red), n2 of class B (blue), etc.
Step 2. Entropy: I(u) = - Σj (nj / n) log(nj / n)
Step 3. Split proportions. m1 objects are sent to child node u1, m2 to child node u2.
Step 4. Choose the attribute that maximizes the information gain
D = I(u) - Σj (mj / n) I(uj)
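Steps 1-4 translate directly into code. A minimal sketch (base-2 logarithms assumed; the slide does not fix the base):

```python
import math

def entropy(counts):
    # Step 2: I(u) = -sum_j (n_j / n) log(n_j / n)
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def info_gain(parent, children):
    # Step 4: D = I(u) - sum_j (m_j / n) I(u_j)
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)
```

A perfect split of a balanced two-class node gains one full bit; an uninformative split gains nothing.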
Multiple Models: Ensembles
Key Idea: Combine Weak Learners
• Average several weak models, rather than build
one complex model.
(Figure: the same data feeds Models 1, 2 and 3, whose outputs are combined.)
Combining Weak Learners
(Figure: Pascal’s triangle of binomial coefficients, rows 1 through 1 5 10 10 5 1.)

1 Classifier   3 Classifiers   5 Classifiers
55%            57.4%           59.3%
60%            64.0%           68.2%
65%            71.0%           76.5%
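The table's numbers follow from the binomial distribution: with n independent classifiers, each correct with probability p, a majority vote is correct whenever more than half of them are right. A quick check:

```python
from math import comb

def majority_accuracy(p, n):
    # Probability that more than half of n independent classifiers
    # (n odd, so there are no ties) are correct.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))
```

For p = 0.55 this gives about 0.575 for 3 classifiers and 0.593 for 5, matching the table to within rounding.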
Ensembles
Ensembles are unreasonably effective at building models.
Average the individual predictions f1(x), f2(x), …, f64(x) for regression trees; use majority voting for classification.
Building ensembles over clusters of commodity workstations has been used since the 1990s.
Ensemble Models
1. Partition the data and scatter it.
2. Build a model on each partition (e.g. a tree-based model).
3. Gather the models.
4. Form the collection of models into an ensemble (e.g. majority vote for classification, averaging for regression).
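The four steps can be mimicked in a few lines of plain Python. A toy sketch in which each "model" simply fits the mean of its partition — a stand-in for a tree learner, purely illustrative:

```python
import statistics

def partition(data, k):
    # Step 1: partition the data and scatter it (round-robin here).
    return [data[i::k] for i in range(k)]

def build_model(part):
    # Step 2: build a model on each partition; here, fit the mean.
    mu = statistics.mean(part)
    return lambda: mu

def ensemble_predict(models):
    # Steps 3-4: gather the models and average their predictions
    # (averaging for regression; use a majority vote for classification).
    return sum(m() for m in models) / len(models)
```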
Example: Fraud
• Ensembles of trees have proved remarkably effective in detecting fraud
• Trees can be very large, yet are very fast to score
• Lots of variants
– Random forests
– Random trees
• Sometimes you need to reverse-engineer reason codes
CUSUM
 Assume that we have a mean and standard deviation for both the baseline and observed distributions.
f0(x) = baseline density
f1(x) = observed density
g(x) = log(f1(x) / f0(x))
Z0 = 0
Zn+1 = max{Zn + g(xn+1), 0}
Alert if Zn+1 > threshold
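The recursion above is only a few lines of code. A sketch assuming both densities are Gaussian with known mean and standard deviation, as the slide's assumption suggests (class and function names are mine):

```python
import math

def log_gauss(x, mu, sigma):
    # Log density of a Gaussian N(mu, sigma^2) at x.
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

class Cusum:
    def __init__(self, mu0, sigma0, mu1, sigma1, threshold):
        self.baseline = (mu0, sigma0)   # f0
        self.observed = (mu1, sigma1)   # f1
        self.threshold = threshold
        self.z = 0.0                    # Z_0 = 0

    def update(self, x):
        # g(x) = log(f1(x) / f0(x)); Z_{n+1} = max{Z_n + g(x), 0}
        g = log_gauss(x, *self.observed) - log_gauss(x, *self.baseline)
        self.z = max(self.z + g, 0.0)
        return self.z > self.threshold  # alert?
```

With baseline N(0, 1) and observed N(1, 1), each observation of 1.0 adds 0.5 to the statistic, so an alert fires after a short run of shifted values, while baseline-like values keep the statistic pinned at zero.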
Cubes of Models (Segmented Models)
• Divide & conquer the data (segment) using multidimensional data cubes.
• Build a separate segmented model for every entity of interest.
• For each distinct cube cell, estimate separate baselines for each quantity of interest.
• Detect changes from the baselines.
• Example cube dimensions: entity (bank, etc.), geospatial region, type of transaction.
Examples - Change Detection
Using Cubes of Models
• Highway traffic data
– each day (7) x each hour (24) x each sensor (hundreds) x each weather condition (5) x each special event (dozens)
– 50,000 baseline models used in the current testbed
Multiple Models in Hadoop
• Trees in the mapper
• Build lots of small trees (over historical data in Hadoop)
• Load all of them into the mapper
• Mapper
– For each event, score against each tree
– Map emits key = event id, value = (tree id, score)
• Reducer
– Aggregate the scores for each event
– Select a “winner” for each event
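The same pattern can be simulated in a few lines of plain Python. The "trees" below are toy threshold functions standing in for real tree models, and the winner is a simple majority vote (all names and thresholds are illustrative):

```python
from collections import defaultdict

# Toy stand-ins for small trees: each scores an event as 0 or 1.
trees = {
    "t1": lambda e: 1 if e > 1 else 0,
    "t2": lambda e: 1 if e > 2 else 0,
    "t3": lambda e: 1 if e > 3 else 0,
}

def mapper(events):
    # For each event, score against each tree;
    # emit key = event id, value = (tree id, score).
    for event_id, e in events:
        for tree_id, tree in trees.items():
            yield event_id, (tree_id, tree(e))

def reducer(pairs):
    # Aggregate the scores for each event and select a
    # "winner" by majority vote.
    scores = defaultdict(list)
    for event_id, (_, s) in pairs:
        scores[event_id].append(s)
    return {eid: int(sum(s) > len(s) / 2) for eid, s in scores.items()}
```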
5.4 Evaluating and
Comparing Models
The Take Home Message
• Given two predictive models, develop a
methodology to determine which is better.
• Get a predictive model up in development
quickly, no matter how bad. Call it the
Champion.
• Two methodologies
– Champion-Challenger
– Automated Testing and Deployment Environment
Champion-Challenger Methodology
• Develop a new model each week. Call it the
Challenger.
• Meet each week to discuss the performance
of the model. Compare the two models and
keep the better one as the new Champion.
Automated Testing & Deployment
Environment
• Develop a custom environment for the large-scale testing and deployment of many (tens, hundreds, thousands) of different models.
• Develop an appropriate experimental design.
• Test on a small scale.
• Deploy those that test well on a large(r) scale.
• Continuously improve the design and automation of the process.
• Think of it as an automated scale-up of A/B testing.
Have Weekly Meetings
• Have the modeler, the business owner and the engineer on the operations side meet once a week.
• Discuss misclassifications.
• Discuss the business value from the alerts.
• Avoid jargon (detection rate, false positives, etc.) whenever possible. Instead…
• Always use visualizations.
2-Class Confusion Matrix
• There are two rates:
true positive rate = TPrate = #TP / #P
false positive rate = FPrate = #FP / #N

                       Predicted positive   Predicted negative
True positive (#P)     #TP                  #P - #TP
True negative (#N)     #FP                  #N - #FP
Precision and Accuracy
• Recall the TP and FP rates are defined by:
– TPrate = TP / P = recall
– FPrate = FP / N
• Precision and accuracy are defined by:
– Precision = TP / (TP + FP)
– Classifier assigns TP + FP to the positive class
– Accuracy = (TP + TN) / (P + N)
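These four definitions in code, checked against illustrative counts (TP = 60, FN = 40, FP = 20, TN = 80; the function name is mine):

```python
def rates(tp, fn, fp, tn):
    # Compute TP rate, FP rate, precision, and accuracy
    # from the four cells of a 2-class confusion matrix.
    P, N = tp + fn, fp + tn        # actual positives and negatives
    return {
        "TPrate": tp / P,          # recall
        "FPrate": fp / N,
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (P + N),
    }
```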
(Figure: the graph used to produce the ROC curve.)
Why Visualize Performance?
• A single number tells only part of the story.
– There are fundamentally two numbers of interest (FP and TP), which cannot be summarized with a single number.
– How are errors distributed across the classes?
• The shapes of the curves are much more informative.
Example: 3 Classifiers
(In each confusion matrix, rows are the true class and columns the predicted class.)

Classifier 1        Classifier 2        Classifier 3
      pos  neg            pos  neg            pos  neg
pos    60   40      pos    70   30      pos    40   60
neg    20   80      neg    50   50      neg    30   70

Classifier 1: TP rate = 0.6, FP rate = 0.2
Classifier 2: TP rate = 0.7, FP rate = 0.5
Classifier 3: TP rate = 0.4, FP rate = 0.3
ROC Plot for the 3 Classifiers
(Figure: the three classifiers as points in ROC space, with false positive rate on the x-axis and true positive rate on the y-axis; the perfect classifier is at the top left, and a random classifier lies on the diagonal.)
ROC Curve
• Recall the TP and FP rates are defined by:
– TPrate = TP / P
– FPrate = FP / N
• The ROC Curve is defined by:
– x = FPrate
– y = TPrate
Creating an ROC Curve
• A classifier produces a single ROC point.
• If the classifier has a threshold or sensitivity
parameter, varying it produces a series of ROC
points (confusion matrices).
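A minimal threshold sweep over a scored dataset (the function and variable names are mine, not from the slides):

```python
def roc_points(scores, labels):
    # For each threshold t, classify score >= t as positive and
    # record (FP rate, TP rate); labels are 1 = positive, 0 = negative.
    P = sum(labels)
    N = len(labels) - P
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / N, tp / P))
    return points
```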
ROC Curve
Performance of a Random Classifier
• When there are an equal number of positives (P) and negatives (N), a random classifier is accurate about 50% of the time.
• The lift measures the relative performance of a classifier as compared to a random classifier.
• Lift is especially important with imbalanced classes.
Lift Curve
• The Lift Curve is defined by:
– x = (TP + FP) / (P + N)
– y = TP
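The definition turns into code by ranking events by score: after targeting the top k of n events, x = (TP + FP) / (P + N) = k / n and y is the number of true positives captured so far. A sketch (names mine):

```python
def lift_curve(scores, labels):
    # Rank events by score, descending; trace (fraction targeted,
    # true positives captured) as the targeting depth grows.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n = len(scores)
    points, tp = [], 0
    for k, i in enumerate(order, start=1):
        tp += labels[i]
        points.append((k / n, tp))
    return points
```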
Questions?
For the most current version of these slides,
please see tutorials.opendatagroup.com
About Open Data
• Open Data began operations in 2001 and has
built predictive models for companies for over
ten years
• Open Data provides management consulting,
outsourced analytic services, & analytic staffing
• For more information
• www.opendatagroup.com
• info@opendatagroup.com

More Related Content

What's hot

Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin R
Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin RSelf-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin R
Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin R
shanelynn
 
Svm and kernel machines
Svm and kernel machinesSvm and kernel machines
Svm and kernel machines
Nawal Sharma
 
SVM Tutorial
SVM TutorialSVM Tutorial
SVM Tutorialbutest
 
Support Vector Machines- SVM
Support Vector Machines- SVMSupport Vector Machines- SVM
Support Vector Machines- SVM
Carlo Carandang
 
Svm vs ls svm
Svm vs ls svmSvm vs ls svm
Svm vs ls svm
Pulipaka Sai Ravi Teja
 
Introduction to Boosted Trees by Tianqi Chen
Introduction to Boosted Trees by Tianqi ChenIntroduction to Boosted Trees by Tianqi Chen
Introduction to Boosted Trees by Tianqi Chen
Zhuyi Xue
 
Ppt shuai
Ppt shuaiPpt shuai
Ppt shuai
Xiang Zhang
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
Sakis Sotiropoulos
 
Tutorial - Support vector machines
Tutorial - Support vector machinesTutorial - Support vector machines
Tutorial - Support vector machinesbutest
 
Support vector regression and its application in trading
Support vector regression and its application in tradingSupport vector regression and its application in trading
Support vector regression and its application in tradingAashay Harlalka
 
Machine Learning using Support Vector Machine
Machine Learning using Support Vector MachineMachine Learning using Support Vector Machine
Machine Learning using Support Vector Machine
Mohsin Ul Haq
 
Support vector machine
Support vector machineSupport vector machine
Support vector machine
zekeLabs Technologies
 
Binary Class and Multi Class Strategies for Machine Learning
Binary Class and Multi Class Strategies for Machine LearningBinary Class and Multi Class Strategies for Machine Learning
Binary Class and Multi Class Strategies for Machine Learning
Paxcel Technologies
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machinesnextlib
 
Svm Presentation
Svm PresentationSvm Presentation
Svm Presentationshahparin
 
Vc dimension in Machine Learning
Vc dimension in Machine LearningVc dimension in Machine Learning
Vc dimension in Machine Learning
VARUN KUMAR
 
L1 intro2 supervised_learning
L1 intro2 supervised_learningL1 intro2 supervised_learning
L1 intro2 supervised_learning
Yogendra Singh
 
Image Classification And Support Vector Machine
Image Classification And Support Vector MachineImage Classification And Support Vector Machine
Image Classification And Support Vector MachineShao-Chuan Wang
 
How to use SVM for data classification
How to use SVM for data classificationHow to use SVM for data classification
How to use SVM for data classification
Yiwei Chen
 
Support vector machines
Support vector machinesSupport vector machines
Support vector machines
Ujjawal
 

What's hot (20)

Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin R
Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin RSelf-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin R
Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin R
 
Svm and kernel machines
Svm and kernel machinesSvm and kernel machines
Svm and kernel machines
 
SVM Tutorial
SVM TutorialSVM Tutorial
SVM Tutorial
 
Support Vector Machines- SVM
Support Vector Machines- SVMSupport Vector Machines- SVM
Support Vector Machines- SVM
 
Svm vs ls svm
Svm vs ls svmSvm vs ls svm
Svm vs ls svm
 
Introduction to Boosted Trees by Tianqi Chen
Introduction to Boosted Trees by Tianqi ChenIntroduction to Boosted Trees by Tianqi Chen
Introduction to Boosted Trees by Tianqi Chen
 
Ppt shuai
Ppt shuaiPpt shuai
Ppt shuai
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 
Tutorial - Support vector machines
Tutorial - Support vector machinesTutorial - Support vector machines
Tutorial - Support vector machines
 
Support vector regression and its application in trading
Support vector regression and its application in tradingSupport vector regression and its application in trading
Support vector regression and its application in trading
 
Machine Learning using Support Vector Machine
Machine Learning using Support Vector MachineMachine Learning using Support Vector Machine
Machine Learning using Support Vector Machine
 
Support vector machine
Support vector machineSupport vector machine
Support vector machine
 
Binary Class and Multi Class Strategies for Machine Learning
Binary Class and Multi Class Strategies for Machine LearningBinary Class and Multi Class Strategies for Machine Learning
Binary Class and Multi Class Strategies for Machine Learning
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 
Svm Presentation
Svm PresentationSvm Presentation
Svm Presentation
 
Vc dimension in Machine Learning
Vc dimension in Machine LearningVc dimension in Machine Learning
Vc dimension in Machine Learning
 
L1 intro2 supervised_learning
L1 intro2 supervised_learningL1 intro2 supervised_learning
L1 intro2 supervised_learning
 
Image Classification And Support Vector Machine
Image Classification And Support Vector MachineImage Classification And Support Vector Machine
Image Classification And Support Vector Machine
 
How to use SVM for data classification
How to use SVM for data classificationHow to use SVM for data classification
How to use SVM for data classification
 
Support vector machines
Support vector machinesSupport vector machines
Support vector machines
 

Similar to Building and deploying analytics

background.pptx
background.pptxbackground.pptx
background.pptx
KabileshCm
 
Intro to Machine Learning for non-Data Scientists
Intro to Machine Learning for non-Data ScientistsIntro to Machine Learning for non-Data Scientists
Intro to Machine Learning for non-Data Scientists
Parinaz Ameri
 
Online advertising and large scale model fitting
Online advertising and large scale model fittingOnline advertising and large scale model fitting
Online advertising and large scale model fitting
Wush Wu
 
Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learning
my6305874
 
07 dimensionality reduction
07 dimensionality reduction07 dimensionality reduction
07 dimensionality reduction
Marco Quartulli
 
Cheatsheet machine-learning-tips-and-tricks
Cheatsheet machine-learning-tips-and-tricksCheatsheet machine-learning-tips-and-tricks
Cheatsheet machine-learning-tips-and-tricks
Steve Nouri
 
Silicon valleycodecamp2013
Silicon valleycodecamp2013Silicon valleycodecamp2013
Silicon valleycodecamp2013
Sanjeev Mishra
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
Saad Elbeleidy
 
Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender Systems
YONG ZHENG
 
Scalable machine learning
Scalable machine learningScalable machine learning
Scalable machine learning
Tien-Yang (Aiden) Wu
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
Ivo Andreev
 
林守德/Practical Issues in Machine Learning
林守德/Practical Issues in Machine Learning林守德/Practical Issues in Machine Learning
林守德/Practical Issues in Machine Learning
台灣資料科學年會
 
random forest.pptx
random forest.pptxrandom forest.pptx
random forest.pptx
PriyadharshiniG41
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017
StampedeCon
 
Machine_Learning_Trushita
Machine_Learning_TrushitaMachine_Learning_Trushita
Machine_Learning_Trushita
Trushita Redij
 
Analytics Boot Camp - Slides
Analytics Boot Camp - SlidesAnalytics Boot Camp - Slides
Analytics Boot Camp - Slides
Aditya Joshi
 
Recommender Systems from A to Z – Model Training
Recommender Systems from A to Z – Model TrainingRecommender Systems from A to Z – Model Training
Recommender Systems from A to Z – Model Training
Crossing Minds
 
Deep Learning Introduction - WeCloudData
Deep Learning Introduction - WeCloudDataDeep Learning Introduction - WeCloudData
Deep Learning Introduction - WeCloudData
WeCloudData
 

Similar to Building and deploying analytics (20)

Planet
PlanetPlanet
Planet
 
background.pptx
background.pptxbackground.pptx
background.pptx
 
Intro to Machine Learning for non-Data Scientists
Intro to Machine Learning for non-Data ScientistsIntro to Machine Learning for non-Data Scientists
Intro to Machine Learning for non-Data Scientists
 
Online advertising and large scale model fitting
Online advertising and large scale model fittingOnline advertising and large scale model fitting
Online advertising and large scale model fitting
 
Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learning
 
07 dimensionality reduction
07 dimensionality reduction07 dimensionality reduction
07 dimensionality reduction
 
Cheatsheet machine-learning-tips-and-tricks
Cheatsheet machine-learning-tips-and-tricksCheatsheet machine-learning-tips-and-tricks
Cheatsheet machine-learning-tips-and-tricks
 
Silicon valleycodecamp2013
Silicon valleycodecamp2013Silicon valleycodecamp2013
Silicon valleycodecamp2013
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
 
Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender Systems
 
Scalable machine learning
Scalable machine learningScalable machine learning
Scalable machine learning
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
 
林守德/Practical Issues in Machine Learning
林守德/Practical Issues in Machine Learning林守德/Practical Issues in Machine Learning
林守德/Practical Issues in Machine Learning
 
random forest.pptx
random forest.pptxrandom forest.pptx
random forest.pptx
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017
 
Machine_Learning_Trushita
Machine_Learning_TrushitaMachine_Learning_Trushita
Machine_Learning_Trushita
 
Analytics Boot Camp - Slides
Analytics Boot Camp - SlidesAnalytics Boot Camp - Slides
Analytics Boot Camp - Slides
 
Recommender Systems from A to Z – Model Training
Recommender Systems from A to Z – Model TrainingRecommender Systems from A to Z – Model Training
Recommender Systems from A to Z – Model Training
 
Enar short course
Enar short courseEnar short course
Enar short course
 
Deep Learning Introduction - WeCloudData
Deep Learning Introduction - WeCloudDataDeep Learning Introduction - WeCloudData
Deep Learning Introduction - WeCloudData
 

Recently uploaded

Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 

Recently uploaded (20)

Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 

Building and deploying analytics

  • 1. Robert L. Grossman Collin Bennett Open Data Group December 4, 2012 Building and Deploying Big Data Analytic Models
  • 3. Analytic Diamond Analytic strategy & governance Analytic algorithms & models Analytic Infrastructure Analytic operations, security & compliance
  • 6. Life Cycle of Predictive Model Exploratory Data Analysis (R) Process the data (Hadoop) Build model in dev/modeling environment Deploy model in operational systems with scoring application (Hadoop, streams & Augustus) Refine model Analytic modeling Operations PMML Log files Retire model and deploy improved model
  • 8. Key Modeling Activities  Exploring data analysis – goal is to understand the data so that features can be generated and model evaluated  Generating features – goal is to shape events into meaningful features that enable meaningful prediction of the dependent variable  Building models – goal is to derive statistically valid models from learning set  Evaluating models – goal is to develop meaningful report so that the performance of one model can be compared to another model
  • 9. ? Naïve Bayes ? K nearest neighbor 1957 K-means 1977 EM Algorithm 1984 CART 1993 C4.5 1994 Apriori 1995 SVM 1997 AdaBoost 1998 PageRank Beware of any vendor whose unique value proposition includes some “secret analytic sauce.” Source: X. Wu et al., Top 10 algorithms in data mining, Knowledge and Information Systems, Volume 14, 2008, pages 1-37.
  • 10. Pessimistic View: We Get a Significantly New Algorithm Every Decade or So 1970’s neural networks 1980’s classification and regression trees 1990’s support vector machines 2000’s graph algorithms
  • 11. In general, understanding the data through exploratory analysis and generating good features is much more important than the type of predictive model you use.
  • 12. Questions About Datasets • Number of records • Number of data fields • Size – Does it fit into memory? – Does it fit on a single disk or disk array? • How many missing values? • How many duplicates? • Are there labels (for the dependent variable)? • Are there keys?
  • 13. Questions About Data Fields • Data fields – What is the mean, variance, …? – What is the distribution? Plot the histograms. – What are extreme values, missing values, unusual values? • Pairs of data fields – What is the correlation? Plot the scatter plots. • Visualize the data – Histograms, scatter plots, etc.
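The dataset and field questions above can be answered with a few lines of code. Here is a minimal profiling sketch; the function name, the list-of-dicts representation, and the field name are illustrative, not from the deck:

```python
import statistics

def field_summary(rows, field):
    # quick numeric profile of one data field:
    # record count, missing values, mean, variance, extreme values
    vals = [r[field] for r in rows if r.get(field) is not None]
    return {"n": len(rows), "missing": len(rows) - len(vals),
            "mean": statistics.mean(vals), "var": statistics.pvariance(vals),
            "min": min(vals), "max": max(vals)}

rows = [{"x": 1.0}, {"x": 3.0}, {"x": None}]
```

Histograms and scatter plots for the distribution and correlation questions would follow the same pattern, one field (or pair of fields) at a time.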
  • 15. (Figure: stable clusters over time, with a new cluster emerging)
  • 17. Computing Count Distributions

    def valueDistrib(b, field, select=None):
        # given a dataframe b and an attribute field,
        # returns the count distribution of the field's values
        ccdict = {}
        if select:
            for i in select:
                count = ccdict.get(b[i][field], 0)
                ccdict[b[i][field]] = count + 1
        else:
            for i in range(len(b)):
                count = ccdict.get(b[i][field], 0)
                ccdict[b[i][field]] = count + 1
        return ccdict
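The same count distribution can be cross-checked in a couple of lines with collections.Counter; the list-of-dicts `b` and the field name below are hypothetical stand-ins for the dataframe in the slide:

```python
from collections import Counter

# hypothetical stand-in for the dataframe b in the slide
b = [{"color": "red"}, {"color": "blue"}, {"color": "red"}]

def value_distrib(b, field, select=None):
    # same result as the slide's valueDistrib, via collections.Counter
    rows = select if select is not None else range(len(b))
    return dict(Counter(b[i][field] for i in rows))
```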
  • 18. Features • Normalize – [0, 1] or [-1, 1] • Aggregate • Discretize or bin • Transform continuous to continuous or discrete to discrete • Compute ratios • Use percentiles • Use indicator variables
  • 19. Predicting Office Building Renewal Using Text-based Features (term, Avg R):
    imports 5.17, clothing 4.17, van 3.18, fashion 2.50, investment 2.42,
    marketing 2.38, oil 2.11, air 2.09, system 2.06, foundation 2.04,
    technology 2.03, apparel 2.03, law 2.00, casualty 1.91, bank 1.88,
    technologies 1.84, trading 1.82, associates 1.81, staffing 1.77, securities 1.77
  • 20. Twitter Intention & Prediction • Use machine learning classification techniques: – Support Vector Machines – Trees – Naïve Bayes • Models require clean data and lots of hand-scored examples
  • 21. SVM, Naïve Bayes, Trees • Namibia floods classified by the R package e1071. • Results are often implementation dependent.
  • 22. If You Can Ask For Something • Ask for more data • Ask for “orthogonal data”
  • 24. Trees For CART trees: L. Breiman, J. Friedman, R. A. Olshen, C. J. Stone, Classification and Regression Trees, 1984, Chapman & Hall.
  • 25. Trees Partition Feature Space • Trees partition the feature space into regions by asking whether an attribute is less than a threshold. (Figure: splits at M < 2.45, M < 4.95, and R < 1.75 partition the (R, M) plane into regions A, B, and C.)
  • 26. Split Using Entropy (Figure: objects have one of two attribute values; information = 0.64 before splitting. Comparing four candidate splits, one increases information by 0.32, another by 0.)
  • 27. Growing the Tree Step 1. Class proportions: node u has n objects, n1 of class A (red), n2 of class B (blue), etc. Step 2. Entropy: I(u) = - Σj (nj/n) log(nj/n). Step 3. Split proportions: m1 objects sent to child 1 (node u1), m2 sent to child 2 (node u2). Step 4. Choose the attribute that maximizes Δ = I(u) - Σj (mj/n) I(uj).
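The entropy and gain formulas in Steps 2 and 4 translate directly into code. A sketch, using log base 2 and per-class count lists (function names are mine):

```python
import math

def entropy(counts):
    # I(u) = -sum_j (n_j/n) * log2(n_j/n), over the class counts at node u
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def info_gain(parent_counts, child_counts_list):
    # D = I(u) - sum_j (m_j/n) * I(u_j): entropy drop from a candidate split
    n = sum(parent_counts)
    weighted = sum(sum(cc) / n * entropy(cc) for cc in child_counts_list)
    return entropy(parent_counts) - weighted
```

A node with 5 red and 5 blue objects has entropy 1 bit; a split that separates the classes perfectly gains the full bit, while a split that preserves the class mix gains nothing.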
  • 29. Key Idea: Combine Weak Learners • Average several weak models, rather than build one complex model. Model 1 Model 2 Model 3
  • 30. Combining Weak Learners (the rows of Pascal's triangle give the binomial coefficients behind the majority-vote probabilities)
    1 classifier | 3 classifiers | 5 classifiers
    55%          | 57.40%        | 59.30%
    60%          | 64.00%        | 68.20%
    65%          | 71.00%        | 76.50%
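The pattern in the table follows from the binomial distribution: if each of n independent classifiers is correct with probability p, the majority is correct with probability given by summing the binomial tail. A sketch (function name is mine; the computed values match the table up to rounding):

```python
from math import comb

def majority_vote_accuracy(p, n):
    # probability that a majority of n independent classifiers,
    # each correct with probability p, votes for the right answer
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))
```

The key assumption is independence of the members' errors; correlated weak learners gain much less from voting.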
  • 31. Ensembles are unreasonably effective at building models. Average for regression trees: (f1(x) + f2(x) + … + f64(x)) / 64. Use majority voting for classification.
  • 32. Building ensembles over clusters of commodity workstations has been used since the 1990’s.
  • 33. Ensemble Models 1. Partition data and scatter 2. Build models (e.g. tree-based models) 3. Gather models 4. Form the collection of models into an ensemble (e.g. majority vote for classification & averaging for regression)
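Steps 1–4 above can be sketched in a few lines. The per-partition "model" here is just a mean predictor, a stand-in for the tree-based models the slide has in mind; all names are illustrative:

```python
import statistics

def build_model(part):
    # toy per-partition model: predict the partition mean (regression stub)
    m = statistics.mean(part)
    return lambda x: m

def ensemble(models):
    # combine by averaging member predictions (use majority vote for classification)
    return lambda x: sum(f(x) for f in models) / len(models)

data = list(range(12))
parts = [data[i::3] for i in range(3)]      # 1. partition data and scatter
models = [build_model(p) for p in parts]    # 2. build one model per partition
score = ensemble(models)                    # 3-4. gather models, form ensemble
```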
  • 34. Example: Fraud • Ensembles of trees have proved remarkably effective in detecting fraud • Trees can be very large • Very fast to score • Lots of variants – Random forests – Random trees • Sometimes need to reverse engineer reason codes.
  • 35. CUSUM
  • 36. CUSUM  Assume that we have a mean and standard deviation for both the baseline and observed distributions. f0(x) = baseline density, f1(x) = observed density, g(x) = log(f1(x)/f0(x)). Z0 = 0; Zn+1 = max{Zn + g(xn+1), 0}. Alert if Zn+1 > threshold.
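The CUSUM recursion on this slide is a few lines of code. In the example below I assume unit-variance Gaussian baseline and observed densities with means 0 and 2, for which the log-likelihood ratio g(x) = log(f1(x)/f0(x)) simplifies to 2x - 2; the function names are mine:

```python
def cusum(xs, g, threshold):
    # Z_0 = 0; Z_{n+1} = max(Z_n + g(x_{n+1}), 0); alert when Z exceeds threshold
    z, alerts = 0.0, []
    for i, x in enumerate(xs):
        z = max(z + g(x), 0.0)
        if z > threshold:
            alerts.append(i)
    return alerts

# g(x) for N(0,1) baseline vs N(2,1) observed:
# log f1/f0 = x^2/2 - (x-2)^2/2 = 2x - 2
g = lambda x: 2.0 * x - 2.0
```

While the stream tracks the baseline, g is negative and the max with 0 keeps Z pinned near zero; after a shift to the observed distribution, Z climbs and crosses the threshold within a few observations.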
  • 37. Cubes of Models (Segmented Models) Build a separate model for every entity of interest. Divide & conquer the data (segment) using multidimensional data cubes, e.g. entity (bank, etc.) x geospatial region x type of transaction. For each distinct cube, estimate separate baselines for each quantity of interest. Detect changes from the baselines.
  • 38. Examples - Change Detection Using Cubes of Models • Highway Traffic Data – each day (7) x each hour (24) x each sensor (hundreds) x each weather condition (5) x each special event (dozens) – 50,000 baseline models used in current testbed
  • 39. Multiple Models in Hadoop • Trees in the Mapper • Build lots of small trees (over historical data in Hadoop) • Load all of them into the mapper • Mapper – For each event, score against each tree – Map emits • key = event id, value = (tree id, score) • Reducer – Aggregate scores for each event – Select a “winner” for each event
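The mapper/reducer pattern above, sketched in plain Python rather than actual Hadoop streaming code; the toy "trees" and event fields are invented for illustration:

```python
from collections import defaultdict

# toy stand-ins for small trees loaded into every mapper;
# each maps an event to a 0/1 class
trees = {0: lambda e: e["x"] > 1,
         1: lambda e: e["x"] > 3,
         2: lambda e: e["x"] > 2}

def mapper(events):
    # emit key = event id, value = (tree id, score), once per tree
    for e in events:
        for tid, tree in trees.items():
            yield e["id"], (tid, int(tree(e)))

def reducer(pairs):
    # aggregate the per-tree scores for each event and pick a majority winner
    scores = defaultdict(list)
    for eid, (tid, s) in pairs:
        scores[eid].append(s)
    return {eid: int(sum(v) > len(v) / 2) for eid, v in scores.items()}

events = [{"id": "a", "x": 0}, {"id": "b", "x": 4}]
```

In a real Hadoop streaming job the mapper and reducer would read and write tab-separated key/value lines on stdin/stdout, but the shuffle-by-key logic is the same.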
  • 41. The Take Home Message • Given two predictive models, develop a methodology to determine which is better. • Get a predictive model up in development quickly, no matter how bad. Call it the Champion. • Two methodologies – Champion-Challenger – Automated Testing and Deployment Environment
  • 42. Champion Challenger Methodology • Develop a new model each week. Call it the Challenger. • Meet each week to discuss the performance of the model. Compare the two models and keep the better one as the new Champion.
  • 43. Automated Testing & Deployment Environment • Develop a custom environment for the large scale testing and deployment of many (tens, hundreds, thousands) of different models. • Develop an appropriate experimental design. • Test on a small scale. • Deploy those that test well on a large(r) scale. • Continuously improve the design and automation of the process. • Think of it as an automated scale-up of A/B testing.
  • 44. Have Weekly Meetings • Have the modeler, the business owner, and the engineer on the operations side meet once a week. • Discuss misclassifications. • Discuss business value from the alerts. • Avoid jargon (detection rate, false positives, etc.) whenever possible. Instead: • Always use visualizations.
  • 45. 2-Class Confusion Matrix • There are two rates:
    true positive rate = TPrate = (#TP)/(#P)
    false positive rate = FPrate = (#FP)/(#N)
    True class \ Predicted class | positive | negative
    positive (#P)                | #TP      | #P - #TP
    negative (#N)                | #FP      | #N - #FP
  • 46. Confusion Matrix • Recall the TP and FP rates are defined by: – TPrate = TP / P = recall – FPrate = FP / N • Precision and accuracy are defined by: – Precision = TP / (TP + FP) – Classifier assigns TP + FP to the positive class – Accuracy = (TP + TN) / (P + N)
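The definitions above fit in one small helper; the function name and the (tp, fn, fp, tn) calling convention are mine:

```python
def rates(tp, fn, fp, tn):
    # TPrate (recall) = TP/P, FPrate = FP/N,
    # precision = TP/(TP+FP), accuracy = (TP+TN)/(P+N)
    p, n = tp + fn, fp + tn
    return {"tpr": tp / p, "fpr": fp / n,
            "precision": tp / (tp + fp),
            "accuracy": (tp + tn) / (p + n)}
```

For example, a classifier with 60 true positives, 40 false negatives, 20 false positives, and 80 true negatives has TPrate 0.6, FPrate 0.2, precision 0.75, and accuracy 0.7.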
  • 48. Why visualize performance? • A single number tells only part of the story – There are fundamentally two numbers of interest (FP and TP), which cannot be summarized with a single number. – How are errors distributed across the classes? • The shapes of the curves are much more informative.
  • 49. Example: 3 classifiers (rows = true class; counts are predicted pos / predicted neg)
    Classifier 1 (TP = 0.4, FP = 0.3): true pos 40 / 60, true neg 30 / 70
    Classifier 2 (TP = 0.7, FP = 0.5): true pos 70 / 30, true neg 50 / 50
    Classifier 3 (TP = 0.6, FP = 0.2): true pos 60 / 40, true neg 20 / 80
  • 50. ROC plot for the 3 Classifiers (Figure: False Positive Rate vs. True Positive Rate; the perfect classifier sits at the top left, a random classifier lies on the diagonal.)
  • 51. ROC Curve • Recall the TP and FP rates are defined by: – TPrate = TP / P – FPrate = FP / N • The ROC Curve is defined by: – x = FPrate – y = TPrate
  • 52. Creating an ROC Curve • A classifier produces a single ROC point. • If the classifier has a threshold or sensitivity parameter, varying it produces a series of ROC points (confusion matrices).
  • 54. Performance of a Random Classifier • When there are an equal number of positives (P) and negatives (N), a random classifier is accurate about 50% of the time. • The lift measures the relative performance of a classifier compared to a random classifier. • Especially important with imbalanced classes.
  • 55. Lift Curve • The Lift Curve is defined by: – x = (TP + FP) / (P + N) – y = TP
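The lift curve's definition above can be traced by scoring the population, flagging the top-k examples, and recording ((TP+FP)/(P+N), TP) as k grows; the function name and sample data are mine:

```python
def lift_points(scores, labels):
    # x = fraction of population flagged = (TP + FP) / (P + N)
    # y = number of true positives captured so far
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    pts, tp = [], 0
    for k, i in enumerate(order, 1):
        tp += labels[i]
        pts.append((k / len(scores), tp))
    return pts
```

A good model's curve rises steeply at the left (most true positives are captured in the highest-scoring slice), while a random classifier's lift is a straight line.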
  • 56. Questions? For the most current version of these slides, please see tutorials.opendatagroup.com
  • 57. About Open Data • Open Data began operations in 2001 and has built predictive models for companies for over ten years. • Open Data provides management consulting, outsourced analytic services, & analytic staffing. • For more information: • www.opendatagroup.com • info@opendatagroup.com