Lecture 01: Machine Learning for Language Technology - Introduction

Lecture 1: Introduction
Last updated: 2 Sept 2013
Uppsala University
Department of Linguistics and Philology, September 2013
1 Lecture 1: Introduction
Machine Learning for Language Technology
(Schedule)

Practical Information
Reading list, assignments, exams, etc.
Lecture 1: Introduction2
Most slides from previous courses (i.e. from E. Alpaydin and J. Nivre) with adaptations

Course web pages:
 http://stp.lingfil.uu.se/~santinim/ml/ml_fall2013.pdf
 http://stp.lingfil.uu.se/~santinim/ml/MachineLearning_fall20
13.htm
 Contact details:
 santinim@stp.lingfil.uu.se
 (marinasantini.ms@gmail.com)

About the Course
4
 Introduction to machine learning
 Focus on methods used in Language Technology
and NLP
 Decision trees and nearest neighbor methods (Lecture 2)
 Linear models –The Weka ML package (Lectures 3 and
4)
 Ensemble methods – Structured Predictions (Lectures 5
and 6)
 Text Mining and Big Data - R and Rapid Miner(Lecture 7)
 Unsupervised learning (clustering) (Lecture 8, Magnus
Rosell)
 Builds on Statistical Methods in NLPLecture 1: Introduction

Digression: Generative vs. Discriminative Methods
 A generative method only applies to probabilistic
models. A model is generative if it gives us the
model of the joint distribution of x and y together (P
(Y , X ). It is called generative because you can
generate with the correct probability distribution data
points.
 Conditional methods model the conditional
distribution of the output given the input: P (Y | X ).
 Discriminative methods do not model probabilities at
all, but they map the input to the output directly.

Compulsory Reading List
6
 Main textbooks:
1. Ethem Alpaydin. 2010. Introduction to Machine Learning.
Second Edition. MIT Press (free online version)
2. Hal Daumé III. 2012. A Course in Machine Learning (free
online version)
3. Ian H. Witten, Eibe Frank Data Mining: Practical Machine
Learning Tools and Techniques, Second Edition (free online
version)
 Additional material :
1. Dietterich, T. G. (2000). Ensemble Methods in Machine
Learning. In J. Kittler and F. Roli (Ed.) First International
Workshop on Multiple Classifier Systems
2. Michael Collins. 2002. Discriminative Training Methods for
Hidden Markov Models: Theory and Experiments with
Perceptron Algorithms.In Proceedings of the 2002
Conference on Empirical Methods in Natural Language
Processing
3. Hanna M. Wallach. 2004. Conditional Random Fields: An
Introduction.Technical Report MS-CIS-04-21. Department of
Computer and Information Science, University of
Pennsylvania.

Optional Reading
 Hal Daumé III, John Langford, Daniel Marcu (2005)
Search-Based Structured Prediction as
Classification. NIPS Workshop on Advances in
Structured Learning for Text and Speech Processing
(ASLTSP).

Assignments and Examination
8
 Three Assignments:
 Decision trees and nearest neighbors
 Perceptron learning
 Clustering
 General Info:
 No lab sessions, supervision by email
 Reports and Assignments 1 & 2 must be submitted to
santinim@stp.lingfil.uu.se
 Report and Assignment 3 must be submitted to rosell@kth.se
 Examination:
 Written report submitted for each assignment
 All three assignments necessary to pass the course
 Grade determined by majority grade on assignments

Practical Organization
9
 Marina Santini (1-7); Magnus Rosell (8)
 45min + 15 min break
 Lectures on Course webpage and SlideShare
 Email all your questions to me:
santinim@stp.lingfil.uu.se
 Video Recordings of the previous ML course:
http://stp.lingfil.uu.se/~nivre/master/ml.html
 Send me an email, so I make sure that I have all the
correct email addresses to santinim@stp.lingfil.uu.se

Schedule: http://stp.lingfil.uu.se/~santinim/ml/ml_fall2013.pdf

What is Machine Learning?
Introduction to:
•Classification
•Regression
•Supervised Learning
•Unsupervised
Learning
•Reinforcement
Learning Lecture 1: Introduction11

What is Machine Learning
 Machine learning is programming computers to
optimize a performance criterion for some task using
example data or past experience
 Why learning?
 No known exact method – vision, speech recognition,
robotics, spam filters, etc.
 Exact method too expensive – statistical physics
 Task evolves over time – network routing
 Compare:
 No need to use machine learning for computing payroll…
we just need an algorithm

Machine Learning – Data Mining – Artificial
Intelligence – Statistics
 Machine Learning: creation of a model that uses training
data or past experience
 Data Mining: application of learning methods to large
datasets (ex. physics, astronomy, biology, etc.)
 Text mining = machine learning applied to unstructured textual
data (ex. sentiment analyisis, social media monitoring, etc. Text
Mining, Wikipedia)
 Artificial intelligence: a model that can adapt to a
changing environment.
 Statistics: Machine learning uses the theory of
statistics in building mathematical models, because
the core task is making inference from a sample.

The bio-cognitive analogy
 Imagine that a learning algorithm as a single neuron.
 This neuron receives input from other neurons, one
for each input feature.
 The strength of these inputs are the feature values.
 Each input has a weight and the neuron simply sums
up all the weighted inputs.
 Based on this sum, the neuron decides whether to
“fire” or not. Firing is interpreted as being a positive
example and not firing is interpreted as being a
negative example.

Elements of Machine Learning
15
1. Generalization:
 Generalize from specific examples
 Based on statistical inference
2. Data:
 Training data: specific examples to learn from
 Test data: (new) specific examples to assess
performance
3. Models:
 Theoretical assumptions about the task/domain
 Parameters that can be inferred from data
4. Algorithms:
 Learning algorithm: infer model (parameters) from data
 Inference algorithm: infer predictions from model

Types of Machine Learning
 Association
 Supervised Learning
 Classification
 Regression
 Unsupervised Learning
 Reinforcement Learning

Learning Associations
 Basket analysis:
P (Y | X ) probability that somebody who buys X also
buys Y where X and Y are products/services
Example: P ( chips | beer ) = 0.7

Classification
 Example: Credit
scoring
 Differentiating
between low-risk and
high-risk customers
from their income and
savings
Discriminant: IF income > θ1 AND savings > θ2
THEN low-risk ELSE high-risk

Classification in NLP
19
 Binary classification:
 Spam filtering (spam vs. non-spam)
 Spelling error detection (error vs. non error)
 Multiclass classification:
 Text categorization (news, economy, culture, sport, ...)
 Named entity classification (person, location, organization,
...)
 Structured prediction:
 Part-of-speech tagging (classes = tag sequences)
 Syntactic parsing (classes = parse trees)

Regression
 Example:
Price of used car
 x : car attributes
y : price
y = g (x | q )
g ( ) model,
q parameters
y = wx+w0

Uses of Supervised Learning
 Prediction of future cases:
 Use the rule to predict the output for future inputs
 Knowledge extraction:
 The rule is easy to understand
 Compression:
 The rule is simpler than the data it explains
 Outlier detection:
 Exceptions that are not covered by the rule, e.g., fraud

Unsupervised Learning
 Finding regularities in data
 No mapping to outputs
 Clustering:
 Grouping similar instances
 Example applications:
 Customer segmentation in CRM
 Image compression: Color quantization
 NLP: Unsupervised text categorization

Reinforcement Learning
 Learning a policy = sequence of outputs/actions
 No supervised output but delayed reward
 Example applications:
 Game playing
 Robot in a maze
 NLP: Dialogue systems, for example:
 NJFun: A Reinforcement Learning Spoken Dialogue System
(http://acl.ldc.upenn.edu/W/W00/W00-0304.pdf)
 Reinforcement Learning for Spoken Dialogue Systems:
Comparing Strengths and Weaknesses for Practical
Deployment
(http://research.microsoft.com/apps/pubs/default.aspx?id=7029
5)


Supervised Learning
Introduction to:
•Margin
•Noise
•Bias

Supervised Classification
 Learning the class C of a “family car” from
examples
 Prediction: Is car x a family car?
 Knowledge extraction: What do people expect
from a family car?
 Output (labels):
Positive (+) and negative (–) examples
 Input representation (features):
x1: price, x2 : engine power

Training set X

X  {xt
,rt
}t1
N

r 
1 if x is positive
0 if x is negative



26

x 
x1
x2







Hypothesis class H

p1  price  p2  AND e1  engine power  e2 

Empirical (training) error

h(x) 
1 if h says x is positive
0 if h says x is negative




E(h | X)  1 h xt
  rt
 t1
N

Empirical error of h on X:

S, G, and the Version Space
most specific hypothesis, S
most general hypothesis, G
h  H, between S and G
is consistent [E( h | X) =
0] and make up the
version space

Margin
 Choose h with largest margin

Noise
Unwanted anomaly in data
 Imprecision in input attributes
 Errors in labeling data points
 Hidden attributes (relative to H)
Consequence:
 No h in H may be consistent!

Noise and Model Complexity
Arguments for simpler model (Occam’s razor principle)
1. Easier to make predictions
2. Easier to train (fewer parameters)
3. Easier to understand
4. Generalizes better (if data is noisy)

Inductive Bias
 Learning is an ill-posed problem
 Training data is never sufficient to find a unique solution
 There are always infinitely many consistent hypotheses
 We need an inductive bias:
 Assumptions that entail a unique h for a training set X
1. Hypothesis class H – axis-aligned rectangles
2. Learning algorithm – find consistent hypothesis with max-
margin
3. Hyperparameters – trade-off between training error and
margin

Model Selection and Generalization
 Generalization – how well a model performs on new
data
 Overfitting: H more complex than C
 Underfitting: H less complex than C

Triple Trade-Off
 Trade-off between three factors:
1. Complexity of H, c(H)
2. Training set size N
3. Generalization error E on new data
 Dependencies:
 As N E
 As c(H) first E and then E

Model Selection  Generalization Error
 To estimate generalization error, we need data
unseen during training:
 Given models (hypotheses) h1, ..., hk induced from
the training set X, we can use E(hi | V ) to select the
model hi with the smallest generalization error

ˆE  E(h |V)  1 h xt
  rt
 t1
M


V  {xt
,rt
}t1
M
 X

Model Assessment
 To estimate the generalization error of the best
model hi, we need data unseen during training and
model selection
 Standard setup:
1. Training set X (50–80%)
2. Validation (development) set V (10–25%)
3. Test (publication) set T (10–25%)
 Note:
 Validation data can be added to training set before testing
 Resampling methods can be used if data is limited

Cross-Validation
121
31
2
2
2
32
1
1
1



K
K
K
K
K
K
XXXTXV
XXXTXV
XXXTXV




 K-fold cross-validation: Divide X into X1, ..., XK
 Note:
 Generalization error estimated by means across K folds
 Training sets for different folds share K–2 parts
 Separate test set must be maintained for model
assessment

Bootstrapping
3680
1
1 1
.





 
e
N
N
 Generate new training sets of size N from X by
random sampling with replacement
 Use original training set as validation set (V = X )
 Probability that we do not pick an instance after N
draws
that is, only 36.8% of instances are new!

Measuring Error
 Error rate = # of errors / # of instances = (FP+FN) / N
 Accuracy = # of correct / # of instances = (TP+TN) / N
 Recall = # of found positives / # of positives = TP / (TP+FN)
 Precision = # of found positives / # of found = TP / (TP+FP)

Statistical Inference
41
 Interval estimation to quantify the precision of our
measurements
 Hypothesis testing to assess whether differences
between models are statistically significant

m  1.96

N
e01  e10 1 
2
e01  e10
~ X1
2

Supervised Learning – Summary
42
 Training data + learner  hypothesis
 Learner incorporates inductive bias
 Test data + hypothesis  estimated generalization
 Test data must be unseen
 Next lectures:
 Different learners in LT with different inductive biases

Anatomy of a Supervised Learner
(Dimensions of a supervised machine learning
algorithm)
 Model:
 Loss function:
 Optimization
procedure:

g x |q 

E q | X  L rt
,g xt
|q  t


q*  arg min
q
E q | X 

Reading
 Alpaydin (2010): Ch. 1-2; 19 (mathematical
underpinnings)
 Witten and Frank (2005): Ch. 1 (examples and
domains of application)

End of Lecture 1
Thanks for your attention

Lecture 01: Machine Learning for Language Technology - Introduction

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Lecture 01: Machine Learning for Language Technology - Introduction

Similar to Lecture 01: Machine Learning for Language Technology - Introduction (20)

More from Marina Santini

More from Marina Santini (20)

Recently uploaded

Recently uploaded (20)

Lecture 01: Machine Learning for Language Technology - Introduction

Editor's Notes