This module presents the fundamental notions a manager needs in order to work with technical data scientists.
After some terminology, the differences between big data and data science are discussed. A basic prediction (classification) task is illustrated through an example. No technical background is assumed, and the core material requires no math or coding. The module concludes with a hands-on case study (bank direct marketing) that introduces participants to problem formulation for data science.
Data Science for Business Managers - The bare minimum a manager should know
1. Data Science for Business Managers
Akın Osman Kazakçı
MINES ParisTech
Balazs Kégl
Ecole Polytechnique, CNRS
2. Plan
• Data science: basic notions
• Data representation
• Types of Machine Learning Problems
• Classification (continued)
• Clustering
• Case Study: Bank direct marketing
3. Terminology - Data science
[Diagram. Labels: Data Science; Machine Learning; Artificial Intelligence (robotics, computer vision, expert systems…); Data Modeling; Rule-based; Inference. Data types: numerical, text, sound & speech, image]
4. Terminology - Big data
• While data science refers to the technical and scientific aspects of data (i.e. algorithms and models) …
• … big data is more related to engineering concerns (and economic value): handling large volumes of data (often in real time) for improved decision-making
• You will often hear about Hadoop:
5. Machine Learning
Can we enable computers to learn programs instead of being explicitly programmed?
Yes, under two conditions:
1. examples
2. algorithms that can generalise from examples
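For readers who want to see the idea in code, here is a minimal optional sketch contrasting a hand-written rule with a rule learned from examples. The data, the function names and the hand-chosen cut-off are all hypothetical, made up for illustration only; the learning procedure is one simple possibility, not a standard library method.

```python
# Hypothetical toy examples: (hours of machine use, did it break down?)
examples = [(10, False), (20, False), (35, False),
            (50, True), (65, True), (80, True)]

def explicit_rule(hours):
    """Explicit programming: a human picks the cut-off by hand (hypothetical value)."""
    return hours > 100

def learn_threshold(examples):
    """Learning: try candidate cut-offs and keep the one with the fewest errors."""
    values = sorted(x for x, _ in examples)
    # Candidate cut-offs are the midpoints between neighbouring values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        return sum((x > t) != label for x, label in examples)

    return min(candidates, key=errors)

threshold = learn_threshold(examples)

def learned_rule(hours):
    """The 'program' obtained from the examples rather than written by hand."""
    return hours > threshold

print(threshold)         # cut-off found from the examples
print(learned_rule(70))  # prediction for a new, unseen case
```

The two conditions from the slide both appear here: the list of examples, and an algorithm (`learn_threshold`) that turns them into a rule usable on new cases.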
6. Haberman survival data
The Haberman survival data (HSD) are recorded cases from a study of the survival of breast cancer surgery patients, conducted between 1958 and 1970 at the University of Chicago's Billings Hospital.
What can be learned from this data? And why?
Image credits: Rebecca Bilbro
7. Haberman survival data
This is a “classification” problem.
Labels are “categorical”.
If we learn a model of this data, future instances can be
classified as “will survive” or “will not survive”.
8. What is an ML model?
[Figure annotation: this line represents a model]
ML models are functions y = f(x).
The function f:
• should represent the data
• should generalise to new data
9. Examples
• Will my employees leave? Or perform well?
• Which machine will break down, and when?
• How likely is it that a client will repay their debt?
• Which other product can I sell to this client (cross-selling)?
10. Types of ML problems (non-exhaustive)
Classification: find the correct category
Clustering: find meaningful groups
Regression: find the correct value or probability
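As an optional illustration, the three problem types can be sketched on the same toy inputs. Everything below is an assumption made for the example: the numbers are invented, and the three methods (nearest centroid, least-squares line, two-group k-means) are simple stand-ins chosen for brevity, not the methods used in the course.

```python
from statistics import mean

# Hypothetical toy data: one input number per example.
xs = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]

# Classification: the answers are categories.
labels = ["low", "low", "low", "high", "high", "high"]

def classify(x):
    """Nearest centroid: pick the category whose average input is closest to x."""
    centroids = {c: mean(v for v, l in zip(xs, labels) if l == c)
                 for c in set(labels)}
    return min(centroids, key=lambda c: abs(x - centroids[c]))

# Regression: the answers are numbers.
ys = [2.0, 4.0, 6.0, 16.0, 18.0, 20.0]

def fit_line(xs, ys):
    """Least-squares line y = a*x + b."""
    mx, my = mean(xs), mean(ys)
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Clustering: no answers are given; groups are found from the inputs alone.
def two_means(xs, steps=10):
    """1-D k-means with two groups, started from the smallest and largest value."""
    c1, c2 = min(xs), max(xs)
    for _ in range(steps):
        g1 = [x for x in xs if abs(x - c1) <= abs(x - c2)]
        g2 = [x for x in xs if abs(x - c1) > abs(x - c2)]
        c1, c2 = mean(g1), mean(g2)
    return g1, g2

print(classify(2.5))     # a category
print(fit_line(xs, ys))  # slope and intercept used to predict a value
print(two_means(xs))     # two groups discovered without labels
```

The difference between the three lies in what is given and what is returned: categories for classification, numbers for regression, and no labels at all for clustering.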
11. Back to classification
Unfortunately, most datasets are not linearly separable
Standard models
Simple linear model: many red and blue items are misclassified
A complex non-linear model: better separation of data (with other potential problems; see next module)
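A small optional sketch of this contrast, under stated assumptions: the red and blue points are invented and arranged so that no straight line separates them, the straight-line rule is just one of many possible lines, and nearest neighbour is used here as one example of a non-linear model.

```python
# Hypothetical toy data: red sits in two opposite corners, blue in the other two.
points = [((0.1, 0.1), "red"), ((0.2, 0.0), "red"),
          ((0.9, 0.9), "red"), ((1.0, 0.8), "red"),
          ((0.2, 0.9), "blue"), ((0.0, 0.8), "blue"),
          ((0.9, 0.2), "blue"), ((0.8, 0.0), "blue")]

def linear_model(p):
    """A straight-line boundary: below the line x + y = 1 is red, above is blue."""
    x, y = p
    return "red" if x + y < 1 else "blue"

def nearest_neighbour(p, training=points):
    """A non-linear model: copy the colour of the closest known point."""
    def dist2(q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return min(training, key=lambda item: dist2(item[0]))[1]

def errors(model):
    """Number of known points the model gets wrong."""
    return sum(model(p) != label for p, label in points)

print(errors(linear_model))              # the straight line misclassifies several points
print(errors(nearest_neighbour))         # the non-linear model fits the known points
print(nearest_neighbour((0.05, 0.85)))   # prediction for a new point
```

Note that the nearest-neighbour model is scored here on the very points it memorised, so its perfect score says little about new data. This is one of the potential problems the slide alludes to.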
12. Case study
Bank direct marketing (handouts)
[Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.
Direct marketing: the business of selling products or services directly to the public, e.g. by mail order or telephone selling, rather than through retailers.
13. Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student","blue-collar","self-employed","retired","technician","services")
3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)
4 - education (categorical: "unknown","secondary","primary","tertiary")
5 - default: has credit in default? (binary: "yes","no")
6 - balance: average yearly balance, in euros (numeric)
7 - housing: has housing loan? (binary: "yes","no")
8 - loan: has personal loan? (binary: "yes","no")
# related with the last contact of the current campaign:
9 - contact: contact communication type (categorical: "unknown","telephone","cellular")
10 - day: last contact day of the month (numeric)
11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
12 - duration: last contact duration, in seconds (numeric)
# other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
15 - previous: number of contacts performed before this campaign and for this client (numeric)
16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")