Khosrow Hassibi, PhD
CWRU EECS Seminar Series
• "Data Science (DS)" is nothing new; only the term itself and the
recent level of interest in it are.
• Data mining
• Predictive analytics / Advanced analytics
• Machine learning
Machine Learning (ML) vs. Traditional Statistics (TS)
• ML: Goal: "learning" from data of all sorts
  TS: Goal: analyzing and summarizing data
• ML: No rigid pre-assumptions about the problem and data distributions in general
  TS: Tight assumptions about the problem and data distributions
• ML: More liberal in techniques and approaches
  TS: Conservative in techniques and approaches
• ML: Generalization is pursued empirically through training, validation, and test datasets
  TS: Generalization is pursued using statistical tests on the training dataset
• ML: Not shy of using heuristics in search of a "good solution"
  TS: Uses tight initial assumptions about the data and the problem, typically in search of an optimal solution under those assumptions
• ML: Redundancy in features (variables) is okay, and often helpful. Preferable to use algorithms designed to handle a large number of features
  TS: Often requires independent features. Preferable to use fewer input features
• ML: Does not promote data reduction prior to learning. Promotes a culture of abundance: "the more data, the better"
  TS: Promotes data reduction as much as possible before modeling (sampling, fewer inputs, …)
• ML: Has dealt with more complex problems in learning, reasoning, perception, knowledge representation, …
  TS: Mainly focused on traditional data analysis
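The ML row on generalization (training, validation, and test datasets) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the synthetic rows, the 60/20/20 split, and the one-feature threshold rule standing in for a real model. The training set fits each candidate, the validation set chooses between candidates, and the test set is touched only once to estimate generalization.

```python
import random
from statistics import mean

def split_dataset(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle rows and split them into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

def fit_threshold(rows, j):
    """'Train' a rule on feature j: midpoint between the two class means."""
    pos = [x[j] for x, y in rows if y]
    neg = [x[j] for x, y in rows if not y]
    return (mean(pos) + mean(neg)) / 2

def accuracy(rows, j, threshold):
    """Share of rows where the rule `x[j] > threshold` matches the label."""
    return sum((x[j] > threshold) == y for x, y in rows) / len(rows)

# Hypothetical data: feature 0 separates the classes, feature 1 is noise.
rows = [((i % 5 + (10 if i % 2 == 0 else 0), (i * 37) % 100), i % 2 == 0)
        for i in range(100)]

train_set, valid_set, test_set = split_dataset(rows)

# Fit one candidate rule per feature on the training set only.
thresholds = {j: fit_threshold(train_set, j) for j in (0, 1)}

# Choose between candidates on the validation set.
best = max(thresholds, key=lambda j: accuracy(valid_set, j, thresholds[j]))

# Report generalization once, on data never used for fitting or selection.
test_acc = accuracy(test_set, best, thresholds[best])
print(best, test_acc)
```

The point of the three-way split is that the test accuracy is not contaminated by either fitting or model selection, which is the empirical counterpart of the statistical tests used on the TS side.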
Figure: Competitive Advantage vs. Analytic Sophistication, ranging from Basic Analytics (descriptive and reactive) to Advanced Analytics (predictive and proactive).
• Std. Reports: What happened?
• Ad hoc Reports: What happened specifically?
• OLAP Drill Down: Where exactly is the problem?
• Dashboard & Visualize: What is happening overall?
• Alerts: What actions are needed?
• Statistical Analysis: Why is this happening?
• Forecasting: What is the trend?
• Predictive Modeling: What will happen?
• Optimization: What is the best that can happen given constraints?
• Speed of processing response: advanced analytics, real-time
• Data preparation: advanced analytics
• Three progressive levels of analytics sophistication
Organization Category   Information Management Proficiency   Analytics Proficiency   Data Culture
Aspirational            Low                                  Low                     Line of business driven
Experienced             Medium                               Medium                  Moving toward enterprise driven
Transformed             High                                 High                    Enterprise driven
Source: MIT Sloan, “Analytics: The Widening Divide,” 2011.
• Big Data Scenarios in Transformed or Experienced Analytics Environments
Data Scenario   Big Data?   Storage        Analysis       Business Value
1               No          Standard       Standard       Known
2               Yes         Possible       Nonstandard    Somewhat known
3               Yes         Possible       Not possible   Not known
4               Yes         Not possible   -              Not known
    Platform                             Architecture   Storage
1   Workstation                          Multicore      Local
2   Enterprise Server                    SMP            Shared
3   Cluster or Grid                      CCSS           Shared
4   General MPP Database                 SN             Distributed data
5   Hadoop                               SN             Distributed data
6   MPP Analytics Appliance              SN             Distributed data
7   MPP In-Memory Analytics Appliance    SNIM           Distributed in-memory data (volatile)
SMP: Symmetric Multi-Processing
SN: Shared Nothing Distributed Computing
CCSS: Cluster Computing with Shared Storage
SNIM: Shared Nothing In-Memory Distributed Computing
• Myth #1:
• Myth #2:
• Myth #3:
• Myth #4:
Aggregations, Joins, Sorts, Transformations
The warehouse/data lake hosts data from different sources, which together provide a
comprehensive view of customer information:
• Product holdings
• Banking tenure
• Account balances
• Checking account data
• Demographic/firmographic data
• Web data
• ATM transaction data
• Discount brokerage data
• Online/bill pay data
• Call center data
• Other accounts data
• Savings account data
• Marketing response data
• …
Figure: Model development and deployment. Diagram labels: Data Store/Warehouse, Development ADS, Model Development, Model, Production ADS, Model Deployment, Scores.
The example model is a decision tree for credit risk with Yes/No splits on Income>$40K, Debt<10% of Income, and Debt=0%, and with leaves labeled Good Credit Risks and Bad Credit Risks.
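As a rough illustration of how a tree like the one in the figure turns into scores, the sketch below hand-codes the three splits as nested rules and applies them to a small production data set. The arrangement of the splits (income at the root, one debt test under each branch) is an assumption, since the figure layout does not survive in the text, and `credit_risk` and the sample rows are hypothetical names and values.

```python
def credit_risk(income, debt):
    """Classify one applicant with hand-written rules from an assumed tree."""
    if income > 40_000:
        # Assumed higher-income branch: debt below 10% of income is acceptable.
        return "good" if debt < 0.10 * income else "bad"
    # Assumed lower-income branch: only debt-free applicants are good risks.
    return "good" if debt == 0 else "bad"

# Hypothetical production rows; deployment here just means applying the
# developed model to each row to produce a score.
production_rows = [
    {"income": 50_000, "debt": 4_000},
    {"income": 50_000, "debt": 6_000},
    {"income": 30_000, "debt": 0},
    {"income": 30_000, "debt": 100},
]
scores = [credit_risk(**row) for row in production_rows]
print(scores)
```

In practice the tree would be learned from the development data set rather than written by hand; the fixed rules only show what the deployed artifact does at scoring time.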
Figure: Model Development with Reusable ADS. Diagram labels: Analytic Server for Model Development; Aggregations, Joins, Sorts, Transformations.
Horizontals: Applications
Functional areas: Customer, Marketing, Sales, Operations, Finance
• Customer lifestyle, life-stage, and lifetime value
• Customer satisfaction
• Survey analysis
• Customer acquisition
• Campaign effectiveness
• Customer retention
• Cross sell / up sell
• Propensity to buy
• Market segmentation
• Identity theft
• Failure/defect detection
• Fraud prevention
• Six Sigma
• Process yield optimization
• Demand forecasting
• Risk management
• Financial forecast
• Pricing analysis
• Customer segmentation
• Product recommendations
Figure: Customer Lifetime Value (cost and revenue) over the subscription life in months, from the point where the customer joins (or rejoins) to the point where the customer leaves. Labeled components: Create Interest, Acquisition Cost, Recurring Revenue, Direct Cost to Serve, Cross-sell/Upsell, Renewal, Migration, Churn, Bad debt, Win back.
Source: McKinsey & Co
Many Offers, Many Channels, Millions of Customers
• Offers: Offer 1, Offer 2, Offer 3, Offer 4, Offer 5, Offer 6
• Channels: Call Centres; Face 2 Face (Retail / Dealer / Sales force); Web Presence / Email; SMS / MMS / WAP; Direct Mail / Bill Inserts / Bill Messages
• Constraints: Product Targets; Contact Centre Capacity; Contact Frequency / Permissions / Preferences; Min List Sizes
• Risks: Saturation; Wrong Timing; Missed Opportunity
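To make the interaction between offers, channels, and constraints concrete, here is a minimal sketch of a greedy assignment. It is a simple heuristic, not the optimization approach the slide refers to, and it is not guaranteed to be optimal. The propensity scores, customer names, channel capacities, and the `assign_offers` function are all hypothetical.

```python
def assign_offers(scores, channel_capacity, max_contacts_per_customer=1):
    """Greedy plan: take the highest-propensity contacts first, skipping any
    that would exceed channel capacity or the per-customer contact limit."""
    remaining = dict(channel_capacity)   # contacts left per channel
    contacts = {}                        # contacts already planned per customer
    plan = []
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (customer, offer, channel), _propensity in ranked:
        if remaining.get(channel, 0) <= 0:
            continue  # channel capacity constraint
        if contacts.get(customer, 0) >= max_contacts_per_customer:
            continue  # contact frequency constraint
        plan.append((customer, offer, channel))
        remaining[channel] -= 1
        contacts[customer] = contacts.get(customer, 0) + 1
    return plan

# Hypothetical propensity-to-respond scores per (customer, offer, channel).
scores = {
    ("alice", "Offer 1", "email"): 0.30,
    ("alice", "Offer 2", "call"): 0.50,
    ("bob", "Offer 1", "call"): 0.40,
    ("bob", "Offer 3", "email"): 0.20,
    ("carol", "Offer 2", "call"): 0.10,
}
capacity = {"call": 1, "email": 5}

plan = assign_offers(scores, capacity)
print(plan)
```

With a single call-centre slot, the greedy pass gives it to the highest-scoring contact and falls back to email for the next customer. A real campaign optimizer would also handle product targets and minimum list sizes, which this sketch leaves out.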
Real-time Transactional Fraud Detection using Neural Networks
Figure: Real-time fraud detection. Raw inputs and profiles (memory) feed feature computation, which produces cooked inputs for the predictive model (ANNs); the model outputs a score.
OCR of machine-print cursive text using neural networks (typically using hundreds of
thousands of weights).
* See Hinton lecture at Google
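The fraud-detection pipeline above (raw inputs plus profiles, then feature computation, then a model score) can be sketched in a few lines. This is only an illustration under stated assumptions: the profile fields, the two features, the decay factor, and the weights are all hypothetical, and a single logistic unit stands in for the neural network.

```python
import math

def update_profile(profile, amount, decay=0.9):
    """Return a new profile with an exponentially weighted average spend."""
    updated = dict(profile)
    updated["avg_amount"] = decay * profile["avg_amount"] + (1 - decay) * amount
    updated["n_txn"] = profile["n_txn"] + 1
    return updated

def compute_features(txn, profile):
    """Turn a raw transaction into cooked inputs using the account profile."""
    amount_ratio = txn["amount"] / max(profile["avg_amount"], 1.0)
    foreign = 1.0 if txn["country"] != profile["home_country"] else 0.0
    return [amount_ratio, foreign]

def score(features, weights, bias):
    """Single logistic unit used here as a stand-in for a trained ANN."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical profile (the "memory") and hand-picked, untrained weights.
profile = {"avg_amount": 50.0, "n_txn": 20, "home_country": "US"}
weights, bias = [0.5, 2.0], -4.0

typical = {"amount": 45.0, "country": "US"}
unusual = {"amount": 900.0, "country": "RO"}

s_typical = score(compute_features(typical, profile), weights, bias)
s_unusual = score(compute_features(unusual, profile), weights, bias)
print(round(s_typical, 3), round(s_unusual, 3))

# After scoring, the profile is updated so later transactions see this one.
profile = update_profile(profile, typical["amount"])
```

Keeping a compact per-account profile is what makes real-time scoring feasible: each transaction is compared with remembered behavior without re-reading the account history.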
Recent Book: “High-Performance Data Mining and Big Data Analytics”
My Blog
What is Data Science and How to Succeed in it
