DSCI 552 MACHINE LEARNING FOR
DATA SCIENCE
Ke-Thia Yao
Lecture 1, 12 January 2023
Textbook
 Ethem Alpaydin
 Introduction to Machine
Learning, Fourth Edition
 MIT Press
 ISBN 9780262043793
Optional Textbook for Scikit-Learn
 Aurélien Géron
 Hands-On Machine Learning with
Scikit-Learn, Keras, and
TensorFlow, 3rd Edition
 Available online through USC
library
https://libraries.usc.edu/databases/safari-books
 Scikit-Learn website provides
excellent documentation and user
guides
 https://scikit-learn.org/stable/index.html
Office Hours
 USC ISI Office:
 4676 Admiralty Way, Suite 835
 Marina del Rey, CA 90292
 (310) 448-8297
 kyao@isi.edu
 USC Marina del Rey Shuttle
 http://transnet.usc.edu/index.php/bus-map-schedules/
 Office Hours:
 Tuesdays 2-4PM on Zoom
https://usc.zoom.us/j/95896335860?pwd=MkhtMEsvR1BsUThvU3hMYjNHZE5Gdz09&from=addon
 Thursdays 2-4PM on campus, location TBD
Grading
 Homework / Programming Assignments: 35%
 Class participation: 5%
 Midterm: 20%
 Final Exam: 20%
 Semester Project: 20%
Viterbi Code of Academic Integrity
"A Community of Honor"
We are the USC Viterbi School of Engineering, a community of
academic and professional integrity. As students, faculty, and staff, our
fundamental purpose is the pursuit of knowledge and truth. We
recognize that ethics and honesty are essential to this mission and
pledge to uphold the highest standards of these principles. As
responsible men and women of engineering, our lifelong commitment
is to respect others and be fair in all endeavors. Our actions will reflect
and promote a community of honor.
Schedule
Date Topic
12-Jan-23 Introduction to ML, Supervised learning, Bias, K-nearest neighbor vs. linear least squares
19-Jan-23 Bayesian decision theory, Naïve Bayes, Jupyter, SciKit Learn
26-Jan-23 Parametric Methods, Bias/Variance Trade-off
2-Feb-23 Nonparametric methods, Decision Trees
9-Feb-23 Dimension reduction
16-Feb-23 Clustering
23-Feb-23 Linear Discrimination, Multilayer Perceptrons
2-Mar-23 Midterm
9-Mar-23 Deep Learning
16-Mar-23 Spring Recess
23-Mar-23 Local Models, Kernel Machines
30-Mar-23 Graph Models, Boltzmann Machines, Quantum Adiabatic Annealer
6-Apr-23 Hidden Markov Models
13-Apr-23 Combining Multiple Learners
20-Apr-23 Reinforcement Learning
27-Apr-23 Presentation
What is Machine Learning
 Machine learning is the science (and art) of programming computers so they
can learn from data
 Here is a slightly more general definition:
[Machine learning is the] field of study that gives computers the ability to learn
without being explicitly programmed.
Arthur Samuel, 1959
 And a more engineering-oriented one:
A computer program is said to learn from experience E with respect to some task T
and some performance measure P, if its performance on T, as measured by P,
improves with experience E.
Tom Mitchell, 1997
Why Use Machine Learning
[Figure: traditional approach vs. machine learning approach]
Why “Learn”?
 Machine learning is programming computers to optimize a
performance criterion using example data or past experience.
 There is no need to “learn” to calculate payroll
 Learning is used when:
 Human expertise does not exist (navigating on Mars),
 Humans are unable to explain their expertise (speech recognition)
 Solution changes in time (routing on a computer network)
 Solution needs to be adapted to particular cases (user biometrics)
Big Data
 Widespread use of personal computers and wireless communication
leads to “big data”
 We are both producers and consumers of data
 Data is not random; it has structure, e.g., customer behavior
 We need “big theory” to extract that structure from data for
(a) Understanding the process
(b) Making predictions for the future
 Cheaper computational power (e.g., GPUs).
Why Mine Data? Scientific Viewpoint
 Data collected and stored at
enormous speeds (GB/hour)
 remote sensors on a satellite
 telescopes scanning the skies
 microarrays generating gene expression data
 scientific simulations generating terabytes of data
 Traditional techniques infeasible for raw data
 Data mining may help scientists
 in classifying and segmenting data
 in hypothesis formation
Why Mine Data? Commercial Viewpoint
 Lots of data is being collected
and warehoused
 Web data, e-commerce
 purchases at department/grocery stores
 Bank/Credit Card
transactions
 Computers have become cheaper and more powerful
 Competitive Pressure is Strong
 Provide better, customized services for an edge (e.g. in Customer
Relationship Management)
Big Data Opportunity
 Unlock significant value by making information transparent and usable
 Collect and store more accurate and detailed data in digital form
 Allows ever-narrower segmentation of customers and precisely tailored
products & services
 Sophisticated analytics to substantially improve decision making
 Improve the next generation of products and services
Source: McKinsey & Company
Big Data Opportunity (cont.)
McKinsey Report
 Data have swept into every industry and business function and are
now an important factor of production, alongside labor and capital.
 The use of big data will become a key basis of competition and
growth for individual firms.
 The use of big data will underpin new waves of productivity growth
and consumer surplus.
 There will be a shortage of talent necessary for organizations to take
advantage of big data.
Data Mining
 Retail: Market basket analysis, Customer relationship management
(CRM)
 Finance: Credit scoring, fraud detection
 Manufacturing: Control, robotics, troubleshooting
 Medicine: Medical diagnosis
 Telecommunications: Spam filters, intrusion detection
 Bioinformatics: Motifs, alignment
 Web mining: Search engines
 ...
What We Talk About When We Talk About
“Learning”
 Learning general models from data of particular examples
 Data is cheap and abundant (data warehouses, data marts);
knowledge is expensive and scarce.
 Example in retail: Customer transactions to consumer behavior:
People who bought “Blink” also bought “Outliers” (www.amazon.com)
 Build a model that is a good and useful approximation to the data.
What is Machine Learning?
 Optimize a performance criterion using example data or past
experience
 Role of Statistics: Inference from a sample
 Role of Computer science: Efficient algorithms to
 Solve the optimization problem
 Represent and evaluate the model for inference
 Role of domain knowledge
 Selecting the right attributes, representation and datasets
Machine Learning Tasks
 Supervised Learning
 Classification
 Regression
 Unsupervised Learning
 Association
 Reinforcement Learning
Supervised Learning: Classification
 Given training set with labels
 Predict label for a new instance that is not in the training set
Classification
 Example: Credit scoring
 Differentiating between
low-risk and high-risk
customers from their
income and savings
Discriminant: IF income > θ1 AND savings > θ2
THEN low-risk ELSE high-risk
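A minimal sketch of this discriminant in Python; the threshold values and customer records below are made up for illustration (in practice θ1 and θ2 would be learned from labeled data):

# Hypothetical thresholds: theta1 for income, theta2 for savings.
THETA1, THETA2 = 40_000, 10_000

def credit_risk(income, savings):
    # The slide's discriminant: low-risk iff income > theta1 AND savings > theta2.
    return "low-risk" if income > THETA1 and savings > THETA2 else "high-risk"

customers = [(55_000, 15_000),   # passes both tests  -> low-risk
             (55_000,  5_000),   # fails savings test -> high-risk
             (30_000, 20_000)]   # fails income test  -> high-risk
for income, savings in customers:
    print(income, savings, credit_risk(income, savings))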
Classification: Applications
 Also known as pattern recognition
 Face recognition: Pose, lighting, occlusion (glasses, beard), make-up,
hair style
 Character recognition: Different handwriting styles.
 Speech recognition: Temporal dependency.
 Medical diagnosis: From symptoms to illnesses
 Biometrics: Recognition/authentication using physical and/or
behavioral characteristics: face, iris, signature, etc.
 Outlier/novelty detection
Face Recognition
[Figure: training examples of a person and test images; ORL dataset, AT&T Laboratories, Cambridge UK]
Supervised Learning: Regression
 Given training set with target numerical values
 Predict target for a new instance that is not in the training set
Regression
 Example: Price of a used car
 x : car attributes
y : price
y = g(x | θ)
g(·): model, θ: parameters
Linear model: y = wx + w0
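A minimal sketch of fitting the linear model y = wx + w0 by least squares; the car ages and prices below are invented for illustration:

import numpy as np

# Made-up data: x = car age in years, y = price in thousands of dollars.
x = np.array([1.0, 2.0, 3.0, 5.0, 7.0, 9.0])
y = np.array([28.0, 24.5, 21.0, 16.0, 11.5, 8.0])

# Least-squares fit of y = w*x + w0.
w, w0 = np.polyfit(x, y, deg=1)
print(f"y = {w:.2f}x + {w0:.2f}")

# Predict the price of a 4-year-old car, an instance not in the training set.
print("predicted price:", w * 4.0 + w0)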
Regression Applications
 Navigating a car: Angle of the steering wheel
 Kinematics of a robot arm
α1 = g1(x, y)
α2 = g2(x, y)
[Figure: two-joint robot arm with joint angles α1, α2 reaching the point (x, y)]
 Response surface design
Supervised Learning: Uses
 Prediction of future cases: Use the rule to predict the output for future
inputs
 Knowledge extraction: The rule is easy to understand
 Compression: The rule is simpler than the data it explains
 Outlier detection: Exceptions that are not covered by the rule, e.g.,
fraud
Unsupervised Learning
 Learning “what normally happens”
 Clustering: Grouping similar instances
 Example applications
 Customer segmentation in CRM
 Image compression: Color quantization
 Bioinformatics: Learning motifs (nucleotide or amino-acid sequence patterns)
Unsupervised Learning: Clustering
 Given training set with no labels
 Group similar instances into clusters
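A minimal clustering sketch using scikit-learn's KMeans; the two blobs of 2-D points are made up for illustration:

import numpy as np
from sklearn.cluster import KMeans

# Made-up unlabeled data: two blobs of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(3.0, 0.5, size=(50, 2))])

# Group similar instances into k = 2 clusters; no labels are involved.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster centers:\n", kmeans.cluster_centers_)
print("cluster of a new point:", kmeans.predict([[2.8, 3.1]]))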
Unsupervised Learning:
Anomaly Detection
 Given training set with no labels
 Assign new instance as either normal or anomaly
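A minimal sketch, assuming scikit-learn's IsolationForest as the detector (one of several possible anomaly detectors) and made-up data:

import numpy as np
from sklearn.ensemble import IsolationForest

# Train on made-up unlabeled data clustered around the origin ("normal").
rng = np.random.default_rng(1)
X_train = rng.normal(0.0, 1.0, size=(200, 2))

detector = IsolationForest(random_state=0).fit(X_train)

# predict() returns +1 for inliers (normal) and -1 for anomalies.
print(detector.predict([[0.1, -0.3],    # near the training cloud: likely +1
                        [6.0, 6.0]]))   # far from it:             likely -1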
Unsupervised Learning:
Dimension Reduction
 Given training set with high number of features (say images)
 Output training set with lower number of features
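A minimal sketch using scikit-learn's PCA; random data stands in for real images here:

import numpy as np
from sklearn.decomposition import PCA

# Made-up training set: 100 instances with 64 features (e.g., 8x8 pixel images).
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 64))

# Project onto the 10 directions of highest variance: 64 features -> 10.
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)    # (100, 64) -> (100, 10)
print("variance kept:", pca.explained_variance_ratio_.sum())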
Learning Associations
 Basket analysis
 Given training dataset containing baskets of products/services
 Find
P(Y | X): the probability that somebody who buys X also buys Y, where X and Y
are products/services.
Example: P(chips | beer) = 0.7
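A minimal sketch estimating P(Y | X) from basket counts; the baskets are made up, so this toy data gives P(chips | beer) = 2/3 rather than the 0.7 of the slide:

# Made-up baskets; each basket is the set of products bought together.
baskets = [
    {"beer", "chips", "salsa"},
    {"beer", "chips"},
    {"beer", "diapers"},
    {"chips", "soda"},
]

def conditional_prob(baskets, x, y):
    # P(Y | X) = (# baskets containing both X and Y) / (# baskets containing X)
    with_x = [b for b in baskets if x in b]
    return sum(1 for b in with_x if y in b) / len(with_x) if with_x else 0.0

print("P(chips | beer) =", conditional_prob(baskets, "beer", "chips"))  # 2/3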
Reinforcement Learning
 Learning a policy: A sequence of
outputs
 No supervised output but delayed
reward
 Credit assignment problem
 Game playing
 Robot in a maze
 Multiple agents, partial
observability, ...
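A minimal sketch of one reinforcement learning algorithm, tabular Q-learning, on an invented 1-D corridor task; the environment and hyperparameters are illustrative only:

import numpy as np

# States 0..4 in a corridor; actions 0 = left, 1 = right. Only reaching
# state 4 pays reward 1, so earlier moves are credited only through the
# discounted return (the credit assignment problem).
N_STATES, GOAL = 5, 4
alpha, gamma, eps = 0.5, 0.9, 0.1     # learning rate, discount, exploration
Q = np.zeros((N_STATES, 2))
rng = np.random.default_rng(3)

def choose_action(s):
    # Epsilon-greedy; also pick randomly when the Q-values are tied.
    if rng.random() < eps or Q[s, 0] == Q[s, 1]:
        return int(rng.integers(2))
    return int(Q[s].argmax())

for _ in range(500):                  # episodes
    s = 0
    while s != GOAL:
        a = choose_action(s)
        s2 = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s2 == GOAL else 0.0
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

print("greedy policy:", Q.argmax(axis=1)[:GOAL])   # expect [1 1 1 1] (go right)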
The data mining process
Data Mining Process
Step                              Time to complete   Importance to success
1. Exploring the problem          20%                80%
2. Exploring the solution
3. Implementation specification
4. Data mining                    80%                20%
   a. Data preparation
   b. Data surveying
   c. Data modeling
Inductive Bias
 Important decisions in learning systems:
 Structure of the model (language)
 Order to search the space of structures
 Way that overfitting to the particular training data is avoided
 Type of inductive bias:
 Language bias
 Search bias
 Overfitting-avoidance bias
[Figure: example dataset with the line y = 0.5]
Linear Least Square Classification
 Predictive Method: Linear regression
 Find the best line f(x) that divides the space into positive and negative
regions
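A minimal sketch of this idea, assuming made-up 2-D data with labels coded as ±1; the analytic least-squares solution gives the dividing line:

import numpy as np

# Made-up 2-D data, labels coded as +1 / -1.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(+1.5, 1.0, size=(50, 2)),
               rng.normal(-1.5, 1.0, size=(50, 2))])
y = np.array([+1.0] * 50 + [-1.0] * 50)

# Analytic least-squares solution for f(x) = w1*x1 + w2*x2 + b;
# classify by the sign of f(x), so f(x) = 0 is the dividing line.
A = np.hstack([X, np.ones((len(X), 1))])     # append a bias column
w = np.linalg.lstsq(A, y, rcond=None)[0]

pred = np.sign(A @ w)
print("weights:", w)
print("training accuracy:", (pred == y).mean())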
Linear Least Square Fit Bias
 Language bias
 The function f(x) is linear
 Search bias
 Analytical solution minimizing the error (sum of squared residuals)
 Overfitting-avoidance bias
 Not needed. Language is too simple.
K Nearest Neighbor
 Let the k nearest neighbors vote on the classification
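A minimal sketch of the k-nearest-neighbor vote written out by hand in numpy; the training points are made up:

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # Labels of the k training points nearest to x (Euclidean distance)...
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    # ...vote; the majority label wins.
    values, counts = np.unique(nearest, return_counts=True)
    return values[counts.argmax()]

# Made-up training set: class 0 near (0, 0), class 1 near (3, 3).
X_train = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # -> 0
print(knn_predict(X_train, y_train, np.array([3.5, 3.5])))  # -> 1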
[Figures: k-nearest-neighbor decision boundaries; low bias, high variance]
K Nearest Neighbor Bias
 Language bias
 Represent a point by its k nearest neighbors
 Search bias
 Deterministic
 Overfitting-avoidance bias
 Adjust k using validation/dev data set
Model Selection Using Holdout Validation
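A minimal sketch of holdout validation for choosing k, using scikit-learn and a synthetic two-moons dataset as a stand-in for real data:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

# Hold out 25% of the data as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

best_k, best_acc = None, -1.0
for k in [1, 3, 5, 9, 15, 25]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = model.score(X_val, y_val)          # accuracy on held-out data
    print(f"k = {k:2d}  validation accuracy = {acc:.3f}")
    if acc > best_acc:
        best_k, best_acc = k, acc
print("selected k:", best_k)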
Optimal Bayes Decision Boundary
Decision Boundaries
Generalization as search
 Inductive learning: find a concept description that fits the data
 Example: rule sets as description language
 Enormous, but finite, search space
 Simple solution:
 enumerate the concept space
 eliminate descriptions that do not fit examples
 surviving descriptions contain target concept
Bias and Learning Example
ID   Pump Type   Pump Size   Max Load   Pump Eff.   Class
1    A           Large       Low        High        Normal
2    B           Small       High       Low         Failure
3    B           Large       High       High        Normal
4    …           …           …          …           …
 Attributes
 ID is integer
 Pump Type is {A, B}
 Pump Size is {Large, Small}
 Max Load is {High, Low}
 Pump Eff. is {High, Low}
 Class is {Normal, Failure}
 Ignoring the ID and Class attributes, how many distinct instances are possible?
The size of the instance space is 2 × 2 × 2 × 2 = 16
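A quick check of this count with itertools, using the attribute values from the table above:

from itertools import product

# Enumerate the instance space: 2 * 2 * 2 * 2 = 16 distinct instances.
types, sizes = ["A", "B"], ["Large", "Small"]
loads, effs = ["High", "Low"], ["High", "Low"]
instances = list(product(types, sizes, loads, effs))
print(len(instances))   # 16
print(instances[0])     # ('A', 'Large', 'High', 'High'), i.e. i1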
Modeling Language: Hypothesis Space
 Suppose for this problem, the four attributes (instance language) precisely capture
the features of the domain
 Let the instance space be I
 Instances <type, size, load, efficiency>:
i1 = <A, Large, High, High>
i2 = <A, Large, High, Low>
i3 = <A, Large, Low, High>
…
i15 = <B, Small, Low, High>
i16 = <B, Small, Low, Low>

I = {i1, i2, i3, …, i16}
Power Set
 The power set of a set S is the set of all possible subsets of S.
 The power set of {a, b, c} is
{{}, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c}}
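A minimal power-set sketch in Python using itertools:

from itertools import chain, combinations

def power_set(s):
    # All subsets of s, from the empty set up to s itself: 2**|s| of them.
    items = list(s)
    return [set(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]

ps = power_set({"a", "b", "c"})
print(len(ps))   # 2**3 = 8
print(ps)        # [set(), {'a'}, {'b'}, {'c'}, {'a', 'b'}, ..., {'a', 'b', 'c'}]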
Hypothesis Space
 Let the modeling language H be the power set of I:

H = 2^I = {{}, {i1}, {i2}, …, {i16}, {i1, i2}, {i1, i3}, …, {i15, i16},
           {i1, i2, i3}, {i1, i2, i4}, …, {i14, i15, i16}, …, {i1, i2, …, i16}}

What is the size of the modeling language (hypothesis space)?

|H| = |2^I| = 2^16 = 65536;  H = {h1, h2, …, h65536}
Learning Algorithm
 A hypothesis h is consistent with a training instance i if:
 if i is labeled Normal, then h contains i
 if i is labeled Failure, then h does not contain i
 Learning the model
 Initially, let the candidate set C = H
 Remove from C every hypothesis that is not consistent with an instance in the
training set
 Classification: given a new instance i, each hypothesis h in C votes
 +1 if h contains i (predicting Normal)
 -1 otherwise (predicting Failure)
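A minimal sketch of this learning algorithm over the pump domain, assuming instances are coded as the integers 1..16 and each hypothesis is the set of instances it labels Normal:

from itertools import chain, combinations

INSTANCES = range(1, 17)

def power_set(items):
    items = list(items)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]

def consistent(h, instance, label):
    # h must contain Normal instances and exclude Failure instances.
    return (instance in h) if label == "Normal" else (instance not in h)

def learn(training_set):
    # Start with the full hypothesis space C = H (all 2**16 = 65536 subsets)
    # and discard every hypothesis inconsistent with a training instance.
    C = power_set(INSTANCES)
    for instance, label in training_set:
        C = [h for h in C if consistent(h, instance, label)]
    return C

def vote(C, instance):
    # Each surviving hypothesis votes +1 (Normal) if it contains the
    # instance and -1 (Failure) otherwise.
    return sum(+1 if instance in h else -1 for h in C)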
Example Training
 Training set: instance a is labeled Normal, instance c is labeled Failure
 Candidate hypotheses consistent with this training set must contain instance
a and must not contain instance c
Is Unbiased Learning Possible?
 There are only 16 unique instances
 Suppose the training set contains 15 instances (i1, i2, …, i15), and they
are all labeled failure
 What is the content of candidate set C?
 C = { {}, {i16} }
 What is the vote count for i16?
 Zero
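Running the sketch from the Learning Algorithm section on this scenario reproduces the collapse:

# 15 training instances, all labeled Failure: the candidate set collapses to
# { {}, {i16} } and the votes for i16 cancel out, so nothing is predicted.
training_set = [(i, "Failure") for i in range(1, 16)]
C = learn(training_set)
print([set(h) for h in C])   # [set(), {16}]
print(vote(C, 16))           # 0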
Summary
 Machine learning
 Analysis of (often large amounts of) data to find unsuspected patterns and to
summarize them in novel ways
 Machine learning process involves
 Exploring the problem, exploring the solution, implementation specification, data
preparation, data surveying, data modeling
 Machine learning task types
 Association
 Supervised Learning: Classification, Regression
 Unsupervised Learning
 Importance of inductive bias in data mining