This document provides an overview of analytics tools and methods from the perspective of guest lecturer Scott Allen Mongeau. It includes details about Mongeau's background and experience in data science, as well as sections on people and roles in data science, technologies and tools, processes and methods, predictive analytics using machine learning, descriptive analytics using unsupervised learning, and causal modeling.
2. 2
Education
• PhD (ABD)
• MBA
• MA Financial Mgmt
• Cert. Finance
• GD IT Mgmt
• MA Com Tech
Experience
• SAS Institute
Sr. Mgr. Business Solutions
• Deloitte
Manager Analytics
• Nyenrode University
Lecturer Analytics
• SARK7
Owner / Principal Consultant
• Genentech Inc. / Roche
Principal Analyst / Sr. Mgr.
• Atradius
Sr. R&D Engineer
• CFSI
CIO
Data Scientist
Cyber Analytics
scott.mongeau@sas.com
+31 (0)64 235 3427
Scott Allen Mongeau
Certified Analytics Professional (CAP)
YouTube
• Introduction to Advanced Analytics
• Introduction to Cognitive Analytics
• TedX RSM: Data Analytics
Blog: sctr7.com
Twitter: sark7
Web: sark7.com
IT solutions
Research methods
Finance
Data analytics
Consulting
3. 3
• 40 years of business analytics
• World’s largest privately held software company (#1)
• 14,000 SAS employees worldwide
• 93 of the top 100 companies on the Global 500 list
• 80,000+ customer sites in 148 countries
• US $ 3.2 B, with continuous revenue growth since 1976
• 23% annual reinvestment in R&D
• Data analytics market leader
5. 5
MOORE’S LAW: EXPONENTIAL GROWTH OF COMPUTING POWER
[Figure labels: 25,000 x; home computers; high-capacity servers; smartphone explosion; cloud, AI / Watson, IoT; 2015]
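As a rough arithmetic sketch of what the 25,000 x figure on this slide implies: a growth factor corresponds to log2(factor) doublings. The two-year doubling period used below is an assumption (a commonly cited form of Moore's law), not something stated on the slide.

```python
import math

GROWTH_FACTOR = 25_000          # figure shown on the slide
ASSUMED_DOUBLING_YEARS = 2.0    # assumption: not stated on the slide

# Number of doublings needed to reach the growth factor.
doublings = math.log2(GROWTH_FACTOR)

# Time span implied under the assumed doubling period.
implied_years = doublings * ASSUMED_DOUBLING_YEARS

print(f"{doublings:.1f} doublings, about {implied_years:.0f} years")
```

Changing ASSUMED_DOUBLING_YEARS shows how sensitive the implied time span is to the doubling period chosen.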
38. 38
Fair use: illustrate publication and article of issue in question. The Economist.
http://en.wikipedia.org/wiki/Category:Fair_use_The_Economist_magazine_covers
41. 41
Public domain Agricultural Research Service
http://en.wikipedia.org/wiki/File:Orange_juice_1.jpg
GNU Free Documentation License: Ibanix Suzuki Shahid DL650 motorcycle
http://commons.wikimedia.org/wiki/File:Suzuki_vstrom_dl650_motorcycle.jpg
43. 43
Supervised learning - predictive
• k-Nearest Neighbors (k-NN)
• Decision Trees (DT), including random forests and boosted trees
• Naïve Bayes classifier
• Neural networks
• Support Vector Machine (SVM)
• Ensembles / Ensemble Learning
[Figure labels: decision tree; machine learning; support vector machines]
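To make the idea of supervised learning concrete, here is a minimal sketch of the simplest possible decision tree: a one-level "stump" that learns a single threshold from labeled examples. The function names and the toy data are hypothetical, chosen only for illustration; real tools grow deeper trees over many features.

```python
def fit_stump(xs, ys):
    """Fit a one-level decision tree (a "stump") on a single numeric feature.

    Tries a threshold midway between each pair of neighbouring values and keeps
    the one with the fewest training errors for the rule: predict 1 if x > t.
    """
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_errors = None, len(xs) + 1
    for t in candidates:
        errors = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold


def predict(threshold, x):
    """Apply the learned rule to a new case."""
    return 1 if x > threshold else 0


# Hypothetical toy data: transaction amount -> label (1 = unusual, 0 = normal).
amounts = [12, 25, 30, 41, 300, 450, 520, 800]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

threshold = fit_stump(amounts, labels)
print(threshold, predict(threshold, 35), predict(threshold, 600))
```

The labels are what make this supervised: the threshold is chosen to reproduce known outcomes, then applied to new cases.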
44. 44
MACHINE LEARNING PREDICTION (SUPERVISED)
[Figure labels: CAR Engine; training set, validation set; non-criminal, criminal; normal, unusual; device; time of day; source location; IP; threat intelligence; amount; at-risk profile; destination location; secure profile; known devices; average amount; known location; known destination]
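The training set / validation set distinction on this slide can be sketched as a simple holdout split: the model is fitted on one part of the labeled data and checked on the part it has not seen. The record layout, the 30% validation fraction, and the function name below are assumptions for illustration.

```python
import random


def train_validation_split(records, validation_fraction=0.3, seed=42):
    """Shuffle labeled records and split them into training and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical labeled transactions: (amount, known_device, label).
records = [(20 + 10 * i, i % 2 == 0, "unusual" if i % 5 == 0 else "normal")
           for i in range(20)]

training_set, validation_set = train_validation_split(records)
print(len(training_set), len(validation_set))
```

Shuffling before splitting avoids a validation set that only contains the most recent or the most similar cases.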
45. 45
EXAMPLE MACHINE LEARNING TOOLS
Open source
• R
• Python
• Weka
Commercial
• SAS BASE & JMP
• SAS Enterprise Miner
• IBM SPSS
• Oracle Data Mining
• RapidMiner
Bose, Ranjit (2009). "Advanced analytics: opportunities and challenges". Industrial Management & Data Systems, Vol. 109, Iss. 2, pp. 155-172.
http://dx.doi.org/10.1108/02635570910930073
48. 48
• Data preparation
• Model development
• Model management
• Model deployment
http://www.sas.com/en_gb/insights/articles/analytics/Industrialize-your-analytics-today.html
50. 50
CONFUSION MATRIX
A confusion matrix separates out the decisions made by the classifier, making explicit how one class is being confused for another. In this way different sorts of errors may be dealt with separately.
Provost; Fawcett. Data Science for Business: What you need to know about data mining and data-analytic thinking. Chapter 7: Decision Analytic Thinking
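A minimal sketch of how the four cells of a binary confusion matrix are counted. The labels and predictions are made-up illustration data, and the function name is hypothetical.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Hypothetical labels: 1 = positive class, 0 = negative class.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

matrix = confusion_matrix(actual, predicted)
print(matrix)
```

Keeping false positives and false negatives in separate cells is what allows each kind of error to be given its own cost.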
51. 51
RECEIVER OPERATING CHARACTERISTICS (ROC) & AREA UNDER THE CURVE (AUC)
“A ROC graph is a two-dimensional plot of a classifier with false positive rate on the x axis against true positive rate on the y axis.
ROC graph depicts relative trade-offs that a classifier makes between benefits (true positives) and costs (false positives).”
Provost; Fawcett. Data Science for Business. Chapter 8: Visualizing Model Performance
Area Under the Curve (AUC): area under a classifier’s curve expressed as a fraction of the unit square. Its value ranges from zero to one.
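The ROC curve and AUC described above can be sketched in a few lines: sweep a threshold down through the model scores, record (false positive rate, true positive rate) at each step, and integrate with the trapezoidal rule. The data and function names are hypothetical, for illustration only.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points, one per threshold.

    Thresholds are swept from the highest score downward; tied scores are
    handled together so each distinct threshold yields one point.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        is_last_of_tie = i == len(ranked) - 1 or ranked[i + 1][0] != score
        if is_last_of_tie:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical data: 1 = positive class; higher score = more likely positive.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
print(points, auc(points))
```

In this toy data one negative case outranks one positive case, so the area is 8/9 rather than 1; a random ranking would give about 0.5.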
52. 52
CUMULATIVE RESPONSE / LIFT CURVE
• Lift: how much the line representing the model performance is lifted up over the random performance diagonal
• E.g. “our model gives a two times (or a 2X) lift”: this means that at the chosen threshold (often not mentioned), the lift curve shows that the model’s targeting is twice as good as random
Provost; Fawcett. Data Science for Business. Chapter 8: Visualizing Model Performance
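A small sketch of the "2X lift" statement: lift at a threshold is the response rate among the top-scored cases divided by the overall response rate. The campaign data below is invented so that the top half of the ranking gives a lift of two; the function name is hypothetical.

```python
def lift_at(labels, scores, fraction):
    """Lift when targeting the top `fraction` of cases ranked by model score.

    Lift = response rate among the targeted cases / overall response rate.
    """
    ranked = [label for _, label in sorted(zip(scores, labels), reverse=True)]
    n_targeted = max(1, round(len(ranked) * fraction))
    targeted_rate = sum(ranked[:n_targeted]) / n_targeted
    overall_rate = sum(ranked) / len(ranked)
    return targeted_rate / overall_rate


# Hypothetical campaign data: 1 = responded, 0 = did not respond.
labels = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]

print(lift_at(labels, scores, 0.5))   # top 50% of cases
print(lift_at(labels, scores, 1.0))   # everyone: no lift over random
```

The same model gives a different lift at every threshold, which is why a lift figure quoted without its threshold is hard to interpret.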
59. 59
DATA ANALYTICS DRIVERS: V4C
[Figure labels: social and mobile; data analytics; interactive platforms; real-time systems]
• VOLUME
• VELOCITY
• VARIETY
• VARIABILITY
• COMPLEXITY
60. 60
MODEL ERRORS: INHERENT RANDOMNESS
• Cases where prediction is not “deterministic”
• Bayes rate: theoretical maximum accuracy that can be achieved for a problem
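A worked sketch of the Bayes rate for a made-up problem. When the label is only probabilistically related to the input, even a classifier that knows the true probabilities is sometimes wrong. The distribution below is hypothetical.

```python
# Hypothetical problem with one discrete input x. For each value of x:
# (probability of observing x, probability that the label is 1 given x).
distribution = {
    "low":    (0.5, 0.10),
    "medium": (0.3, 0.50),
    "high":   (0.2, 0.90),
}

# The best possible classifier predicts the more likely label for each x,
# so it is right with probability max(p, 1 - p) whenever that x occurs.
bayes_accuracy = sum(p_x * max(p_1, 1 - p_1)
                     for p_x, p_1 in distribution.values())
bayes_error = 1 - bayes_accuracy

print(round(bayes_accuracy, 3), round(bayes_error, 3))
```

Here no model, however much data it is given, can exceed 78% accuracy, because the "medium" cases are a coin flip.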
61. 61
MODEL ERRORS: BIAS
• Bias: even with ‘Big Data’, a model will never reach the perfect accuracy of the true model
• Example
• Linear regression model to predict response to an advertising campaign…
• Model is an abstraction…
• True model is always more complex
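A minimal sketch of bias, assuming a curved "true model" (y = x squared, chosen purely for illustration) that is approximated by a straight line. The data is noise-free, so the remaining error comes entirely from the model being too simple, and it does not shrink as more data is added.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope


def mse_of_linear_fit(n):
    """Fit a line to n noise-free points of the curve y = x**2 on [0, 1]."""
    xs = [i / (n - 1) for i in range(n)]
    ys = [x ** 2 for x in xs]          # assumed "true model": a curve
    intercept, slope = fit_line(xs, ys)
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)) / n


for n in (10, 100, 10_000):
    print(n, round(mse_of_linear_fit(n), 5))
```

The mean squared error settles near 1/180 (about 0.0056) instead of approaching zero: more rows do not fix a model that is structurally too simple.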
62. 62
MODEL ERRORS: VARIANCE
• Variance: procedures with more variance tend to produce models with larger errors
• Accuracy tends to vary across training sets
• Given a finite sample set…
• Different models emerge from different samples
• Different models tend to have different accuracy
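A small simulation of the point that different samples produce different models. The data-generating process (y = 2x plus Gaussian noise) and all names are assumptions for illustration; the sketch fits a line to many independent training samples and measures how much the fitted slope moves around.

```python
import random
import statistics


def fitted_slope(xs, ys):
    """Least-squares slope of y on x."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))


def slope_spread(sample_size, n_samples=200, seed=0):
    """Standard deviation of the fitted slope across independent training samples."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_samples):
        xs = [rng.random() for _ in range(sample_size)]
        # Assumed data-generating process: y = 2x plus Gaussian noise.
        ys = [2 * x + rng.gauss(0, 1) for x in xs]
        slopes.append(fitted_slope(xs, ys))
    return statistics.stdev(slopes)


print(slope_spread(sample_size=5), slope_spread(sample_size=500))
```

With only five points per sample the fitted slope varies widely from sample to sample; with 500 points it is far more stable.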
63. 63
BALANCE: BIAS VERSUS VARIANCE
Big Data
• Complex model
• Many variables
• Low bias…
• but high variance
• Subject to overfitting
Strong models
– Tested abstraction
– Few, but significant variables
– Low variance…
– but high bias
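The three error sources on the preceding slides (inherent randomness, bias, variance) are commonly summarized by the standard decomposition of expected squared error. The notation is an assumption, not taken from the slides: the data follow y = f(x) + ε with noise variance σ², and f̂ is the model fitted on a randomly drawn training set.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{inherent randomness}}
```

Complex models shrink the first term and inflate the second; simple, well-tested models do the reverse. The third term cannot be reduced by any model.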
Jno. T-62 tank in Russian service. http://www.aviation.ru/jno/Kubinka02
http://commons.wikimedia.org/wiki/File:T-62_tank_in_Russian_service_(2).jpg
67. 67
EXPLANATORY ANALYTICS
• Explanatory performance is NOT EQUAL to predictive efficacy (and vice versa); this reflects the difference between inductive and deductive methods/thinking
• This is a (sometimes heated) methodological debate amongst practitioners/academics…
• Is it really a debate, or a religious (professional/Kuhnian) dispute? Econometrics + machine learning (H. Varian)
68. 68
MACHINE LEARNING AND ECONOMETRICS
• Varian, Hal R. 2014. Machine Learning and Econometrics. Stanford lecture slides:
https://web.stanford.edu/class/ee380/Abstracts/140129-slides-Machine-Learning-and-Econometrics.pdf
• Varian, Hal R. 2013. Big Data: New Tricks for Econometrics. Paper:
http://people.ischool.berkeley.edu/~hal/Papers/2013/ml.pdf
69. 69
MODEL MANAGEMENT
• Ensemble learning…
• Promising: averages over many predictive cases to reduce the impact of variance
• However, it is CORRELATIVE, not CAUSAL
• CAUSAL data analysis requires
• Investment in data acquisition
• Similarity measurements
• Expected value calculations
• Correlation understanding
• Identifying informative variables
• Fitting equations to data
• Significance testing
• Domain knowledge
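The claim that ensembles average over many predictive cases to reduce the impact of variance can be illustrated with a simulation. This is a sketch under a strong assumption: each model's error is independent noise around the truth. In real ensembles the members' errors are correlated, so the reduction is smaller. All names and numbers are hypothetical.

```python
import random
import statistics


def noisy_prediction(rng, truth=10.0, noise_sd=2.0):
    """Stand-in for one model's prediction: the truth plus independent noise."""
    return truth + rng.gauss(0, noise_sd)


def prediction_spread(ensemble_size, n_trials=2_000, seed=1):
    """Standard deviation of the averaged prediction over repeated trials."""
    rng = random.Random(seed)
    averaged = []
    for _ in range(n_trials):
        predictions = [noisy_prediction(rng) for _ in range(ensemble_size)]
        averaged.append(sum(predictions) / ensemble_size)
    return statistics.stdev(averaged)


single = prediction_spread(ensemble_size=1)
ensemble = prediction_spread(ensemble_size=25)
print(round(single, 2), round(ensemble, 2))
```

Averaging 25 independent predictions cuts the spread by roughly a factor of five (the square root of 25). Note that this improves predictive stability only; it says nothing about causation.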