Introduction to Machine Learning: Presentation Transcript

    • Introduction to Machine Learning
      Jinhyuk Choi, Human-Computer Interaction Lab @ Information and Communications University
    • Contents
      - Concepts of Machine Learning
      - Multilayer Perceptrons
      - Decision Trees
      - Bayesian Networks
    • What is Machine Learning?
      - Large storage / large amounts of data
      - Data that looks random but contains certain patterns: web log data, medical records, network optimization, bioinformatics, machine vision, speech recognition, …
      - No complete identification of the underlying process
      - A good or useful approximation
    • What is Machine Learning? Definition
      - Programming computers to optimize a performance criterion using example data or past experience
      - Role of statistics: inference from a sample
      - Role of computer science: efficient algorithms to solve the optimization problem; representing and evaluating the model for inference
      - Descriptive (training) / predictive (generalization)
      - Learning from human-generated data??
    • What is Machine Learning? Concept Learning
      - Inducing general functions from specific training examples (positive or negative)
      - Looking for the hypothesis that best fits the training examples
      - Example (objects -> concept): attributes such as eyes, nose, legs, reproductive ability, wings, beak, feathers, … map to the concept Bird; a boolean function Bird(animal) -> "true or not" (inanimate objects -> false)
      - Concepts: describe some subset of objects or events defined over a larger set; a boolean-valued function
    • What is Machine Learning? Concept Learning
      - Inferring a boolean-valued function from training examples of its input and output
      - (Figure: two hypotheses approximating the target concept over positive and negative examples, drawn from domains such as web log data, medical records, network optimization, bioinformatics, machine vision, speech recognition, …)
    • What is Machine Learning? Learning Problem Design
      - Do you enjoy sports? Learn to predict the value of "EnjoySports" for an arbitrary day, based on the values of its other attributes
      - What problem? Why learning?
      - Attribute selection: effective? enough?
      - What learning algorithm?
    • Applications
      - Learning associations
      - Classification
      - Regression
      - Unsupervised learning
      - Reinforcement learning
    • Examples (1)
      - TV program preference inference based on web usage data
      - (Figure: web pages #1–#4 fed into a classifier that outputs TV programs #1–#4, in three steps)
      - What are we supposed to do at each step?
    • Examples (2) from a HW of Neural Networks Class (KAIST-2002)
      - Function approximation (Mexican hat): f_3(x_1, x_2) = sin(2 sqrt(x_1^2 + x_2^2)), with x_1, x_2 ∈ [−1, 1]
    • Examples (3) from a HW of Machine Learning Class (ICU-2006)
      - Face image classification
    • Examples (4) from a HW of Machine Learning Class (ICU-2006)
    • Examples (5) from a HW of Machine Learning Class (ICU-2006)
      - SenSay
    • Examples (6)
      - A. Krause et al., "Unsupervised, Dynamic Identification of Physiological and Activity Context in Wearable Computing", ISWC 2005
    • #1. Multilayer Perceptrons
    • Neural Network?
      - Many variants: Adaline, MLP, SOM, Hopfield network, RBFN, bifurcating neuron networks, …
    • Multilayer Networks of Sigmoid Units
      - Supervised learning
      - 2-layer
      - Fully connected
      - Really looks like the brain??
    • Sigmoid Unit
    • The back-propagation algorithm
      - Network model: input layer x_i, hidden layer y_j, output layer o_k, with weights v_ji (input to hidden) and w_kj (hidden to output)
      - y_j = s( Σ_i v_ji x_i ),  o_k = s( Σ_j w_kj y_j )
      - Error function: E(v, w) = (1/2) Σ_k (t_k − o_k)^2
      - Stochastic gradient descent
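      As a concrete sketch of this network model (sizes, weights, and targets below are arbitrary, chosen only for illustration), the forward pass and the error can be written in a few lines of MATLAB/Octave:

        s = @(a) 1 ./ (1 + exp(-a));    % sigmoid unit
        x = rand(3, 1);                 % input vector x_i (3 inputs, arbitrary)
        V = randn(4, 3);                % input-to-hidden weights v_ji (4 hidden units)
        W = randn(2, 4);                % hidden-to-output weights w_kj (2 outputs)
        y = s(V * x);                   % hidden activations y_j = s(sum_i v_ji x_i)
        o = s(W * y);                   % outputs o_k = s(sum_j w_kj y_j)
        t = [1; 0];                     % target vector t_k (arbitrary)
        E = 0.5 * sum((t - o).^2);      % error E(v,w) = 1/2 sum_k (t_k - o_k)^2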
    • Gradient-Descent Function Minimization
    • Gradient-descent function minimization
      - In order to find a vector parameter x that minimizes a function f(x):
        - Start with a random initial value x = x_0.
        - Determine the direction of steepest descent in the parameter space: ∇f = (∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_n)
        - Move a step in that direction: x_{i+1} = x_i − η ∇f(x_i)
        - Repeat the above two steps until x no longer changes.
      - For gradient descent to work:
        - The function to be minimized should be continuous.
        - The function should not have too many local minima.
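      A minimal sketch of this generic procedure in MATLAB/Octave; the toy function f(x) = Σ x_i^2 and the step size are my own choices for illustration, not from the slides:

        f     = @(x) sum(x.^2);          % toy convex function to minimize
        gradf = @(x) 2 * x;              % its analytic gradient
        x   = randn(5, 1);               % random initial value x_0
        eta = 0.1;                       % learning rate (step size)
        for i = 1:1000
            step = eta * gradf(x);
            x = x - step;                % x_{i+1} = x_i - eta * grad f(x_i)
            if norm(step) < 1e-8         % stop when x no longer changes appreciably
                break
            end
        end
        fprintf('f(x) at the minimum found: %.2e\n', f(x));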
    • Back-propagation
    • Derivation of back-propagation algorithm
      - Adjustment of w_kj:
        ∂E/∂w_kj = ∂/∂w_kj [ (1/2) Σ_k (t_k − o_k)^2 ] = (1/2) ∂/∂w_kj [ t_k − s( Σ_j w_kj y_j ) ]^2
                 = −y_j o_k (1 − o_k)(t_k − o_k)
        Δw_kj = −η ∂E/∂w_kj = η o_k (1 − o_k)(t_k − o_k) y_j = η δ_k^o y_j,
        where δ_k^o = o_k (1 − o_k)(t_k − o_k)
    • Derivation of back-propagation algorithm
      - Adjustment of v_ji:
        ∂E/∂v_ji = ∂/∂v_ji [ (1/2) Σ_k ( t_k − s( Σ_j w_kj s( Σ_i v_ji x_i ) ) )^2 ]
                 = −x_i y_j (1 − y_j) Σ_k w_kj o_k (1 − o_k)(t_k − o_k)
        Δv_ji = −η ∂E/∂v_ji = η y_j (1 − y_j) ( Σ_k w_kj δ_k^o ) x_i = η δ_j^y x_i,
        where δ_j^y = y_j (1 − y_j) Σ_k w_kj δ_k^o
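      Putting the two delta rules together, a self-contained one-step sketch of the update (same arbitrary sizes as the forward-pass sketch above; eta is a made-up learning rate):

        s = @(a) 1 ./ (1 + exp(-a));               % sigmoid
        x = rand(3, 1);  t = [1; 0];               % one training case (arbitrary)
        V = randn(4, 3);  W = randn(2, 4);         % weights v_ji, w_kj
        y = s(V * x);  o = s(W * y);               % forward pass
        eta = 0.5;                                 % learning rate
        delta_o = o .* (1 - o) .* (t - o);         % delta_k^o = o_k(1-o_k)(t_k-o_k)
        delta_y = y .* (1 - y) .* (W' * delta_o);  % delta_j^y = y_j(1-y_j) sum_k w_kj delta_k^o
        W = W + eta * delta_o * y';                % Delta w_kj = eta * delta_k^o * y_j
        V = V + eta * delta_y * x';                % Delta v_ji = eta * delta_j^y * x_i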
    • Backpropagation
    • Batch learning vs. incremental learning
      - Batch standard backprop proceeds as follows:
        - Initialize the weights W.
        - Repeat the following steps:
          - Process all the training data DL to compute the gradient of the average error function AQ(DL, W).
          - Update the weights by subtracting the gradient times the learning rate.
      - Incremental standard backprop can be done as follows:
        - Initialize the weights W.
        - Repeat the following steps for j = 1 to NL:
          - Process one training case (y_j, X_j) to compute the gradient of the error (loss) function Q(y_j, X_j, W).
          - Update the weights by subtracting the gradient times the learning rate.
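      The difference between the two schemes is just where the weight update sits relative to the loop over training cases. A sketch on a toy linear least-squares model (data and learning rate are made up for illustration; this is not the MLP above):

        X = randn(100, 3);                         % 100 hypothetical training cases
        y = X * [1; -2; 0.5] + 0.01 * randn(100, 1);
        eta = 0.01;

        w = zeros(3, 1);                           % batch: one update per pass over all data
        for epoch = 1:200
            grad = X' * (X * w - y) / size(X, 1);  % gradient of the average squared error
            w = w - eta * grad;
        end

        w_inc = zeros(3, 1);                       % incremental: one update per training case
        for epoch = 1:200
            for j = 1:size(X, 1)
                grad_j = X(j, :)' * (X(j, :) * w_inc - y(j));   % gradient for case j only
                w_inc = w_inc - eta * grad_j;
            end
        end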
    • Training
    • Overfitting
    • #2. Decision Trees
    • Introduction
      - Divide & conquer
      - Hierarchical model
      - Sequence of recursive splits
      - Decision node vs. leaf node
      - Advantage: interpretability (IF-THEN rules)
    • Divide and Conquer
      - Internal decision nodes
        - Univariate: uses a single attribute, x_i
          - Numeric x_i: binary split: x_i > w_m
          - Discrete x_i: n-way split for n possible values
        - Multivariate: uses all attributes, x
      - Leaves
        - Classification: class labels, or proportions
        - Regression: numeric; the average of r, or a local fit
      - Learning
        - Construction of the tree using training examples
        - Looking for the simplest tree among the trees that code the training data without error
        - Based on heuristics: finding the optimal tree is NP-complete
        - "Greedy": find the best split recursively (Breiman et al., 1984; Quinlan, 1986, 1993)
    • Classification Trees
      - The split is the main procedure in tree construction, chosen by an impurity measure
      - For node m, N_m instances reach m, and N_m^i of them belong to class C_i; the estimate is
        P(C_i | x, m) = p_m^i = N_m^i / N_m
      - Node m is pure if p_m^i is 0 or 1 (to be pure!!!)
      - Measure of impurity is entropy: I_m = − Σ_{i=1..K} p_m^i log2 p_m^i
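      A small sketch of the impurity computation, with made-up class counts for one node:

        counts = [40 10];                      % N_m^i: instances of each class reaching node m
        p = counts / sum(counts);              % p_m^i = N_m^i / N_m
        p = p(p > 0);                          % drop empty classes (0 * log 0 is taken as 0)
        I_m = -sum(p .* log2(p));              % I_m = -sum_i p_m^i log2 p_m^i
        fprintf('impurity I_m = %.3f\n', I_m); % about 0.722 for counts [40 10]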
    • Representation
      - Each node specifies a test of some attribute of the instance
      - Each branch corresponds to one of the possible values of this attribute
    • Best Split
      - If node m is pure, generate a leaf and stop; otherwise split and continue recursively
      - Impurity after the split: N_mj of the N_m instances take branch j, and N_mj^i of them belong to C_i; the estimate is
        P(C_i | x, m, j) = p_mj^i = N_mj^i / N_mj
        I'_m = − Σ_{j=1..n} (N_mj / N_m) Σ_{i=1..K} p_mj^i log2 p_mj^i
      - Find the variable and split that minimize impurity (among all variables, and among all split positions for numeric variables)
      - Q) "Which attribute should be tested at the root of the tree?"
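      A sketch of the post-split impurity I'_m for a single candidate split, using hypothetical branch-by-class counts; the candidate with the smallest value would be chosen:

        branch_counts = [40 10;                % rows = branches j, columns = classes i (N_mj^i)
                          5 45];
        N_m = sum(branch_counts(:));
        I_split = 0;
        for j = 1:size(branch_counts, 1)
            N_mj = sum(branch_counts(j, :));
            p = branch_counts(j, :) / N_mj;    % p_mj^i
            p = p(p > 0);
            I_split = I_split - (N_mj / N_m) * sum(p .* log2(p));   % weighted branch entropy
        end
        fprintf('impurity after split = %.3f\n', I_split);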
    • Top-Down Induction of Decision Trees
    • Entropy
      - "Measure of uncertainty"; "expected number of bits to resolve uncertainty"
      - Suppose Pr{X = 0} = 1/8. If the other events are equally likely, the number of events is 8; to indicate one out of that many events, one needs lg 8 bits.
      - Consider a binary random variable X s.t. Pr{X = 0} = 0.1. The expected number of bits: 0.1 lg(1/0.1) + (1 − 0.1) lg(1/(1 − 0.1))
      - In general, if a random variable X has c values with probabilities p_1, …, p_c, the expected number of bits is H = Σ_{i=1..c} p_i lg(1/p_i) = − Σ_{i=1..c} p_i lg p_i
    • Entropy Example
      - 14 examples: Entropy([9+,5−]) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
      - Entropy 0: all members positive or all negative
      - Entropy 1: equal numbers of positive & negative
      - 0 < Entropy < 1: unequal numbers of positive & negative
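      The 0.940 figure can be checked in a couple of lines of MATLAB/Octave:

        p = [9 5] / 14;                                             % 9 positive, 5 negative examples
        fprintf('Entropy([9+,5-]) = %.3f\n', -sum(p .* log2(p)));   % prints 0.940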
    • Information Gain
      - Measures the expected reduction in entropy caused by partitioning the examples
    • Information Gain: ICU-Student tree, candidate split on Gender (Male / Female), with IQ and Height tested next
      - Root: # of samples = 100, # of positive samples = 50, entropy = 1
      - Left side: # of samples = 50, # of positive samples = 40, entropy = 0.72
      - Right side: # of samples = 50, # of positive samples = 10, entropy = 0.72
      - On average: entropy = 0.5 × 0.72 + 0.5 × 0.72 = 0.72
      - Reduction in entropy = 0.28 → the information gain
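      A quick check of the 0.28 information gain in this example:

        H = @(p) -p .* log2(p) - (1 - p) .* log2(1 - p);   % binary entropy
        H_root  = H(50/100);                   % 100 samples, 50 positive -> 1.0
        H_left  = H(40/50);                    % 50 samples, 40 positive -> about 0.72
        H_right = H(10/50);                    % 50 samples, 10 positive -> about 0.72
        gain = H_root - (0.5 * H_left + 0.5 * H_right);    % about 0.28
        fprintf('information gain = %.2f\n', gain);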
    • Training Examples
    • Selecting the Next Attribute
    • Partially learned tree
    • Hypothesis Space Search
      - Hypothesis space: the set of all possible decision trees
      - The search is guided by the information gain measure. Occam's razor??
    • Overfitting
      - Why "over"-fitting? A model can become more complex than the true target function (concept) when it tries to satisfy noisy data as well
    • Avoiding over-fitting the data
      - Two classes of approaches to avoid overfitting:
        - Stop growing the tree earlier
        - Post-prune the tree after overfitting
      - OK, but how to determine the optimal size of a tree?
        - Use validation examples to evaluate the effect of pruning (stopping)
        - Use a statistical test to estimate the effect of pruning (stopping)
        - Use a measure of complexity for encoding the decision tree
      - Approaches based on the first strategy:
        - Reduced-error pruning
        - Rule post-pruning
    • Rule Extraction from Trees C4.5Rules (Quinlan, 1993)
    • #3. Bayesian Networks
    • Bayes' Rule
      - posterior = prior × likelihood / evidence:
        P(C | x) = P(C) p(x | C) / p(x)
      - P(C = 0) + P(C = 1) = 1
      - p(x) = p(x | C = 1) P(C = 1) + p(x | C = 0) P(C = 0)
      - P(C = 0 | x) + P(C = 1 | x) = 1
    • Bayes' Rule: K > 2 Classes
      - P(C_i | x) = p(x | C_i) P(C_i) / p(x) = p(x | C_i) P(C_i) / Σ_{k=1..K} p(x | C_k) P(C_k)
      - P(C_i) ≥ 0 and Σ_{i=1..K} P(C_i) = 1
      - Choose C_i if P(C_i | x) = max_k P(C_k | x)
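      A minimal sketch of this rule in MATLAB/Octave, with made-up priors and likelihoods for K = 3 classes:

        prior = [0.5 0.3 0.2];                 % P(C_i), sums to 1 (hypothetical)
        lik   = [0.10 0.40 0.05];              % p(x | C_i) for the observed x (hypothetical)
        joint = lik .* prior;                  % p(x | C_i) P(C_i)
        post  = joint / sum(joint);            % P(C_i | x): divide by the evidence p(x)
        [~, i_star] = max(post);               % choose the class with maximum posterior
        fprintf('chosen class: C%d\n', i_star);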
    • Bayesian Networks
      - Also called graphical models, probabilistic networks
      - Model causality and influence
      - Nodes are hypotheses (random variables), and the probabilities correspond to our belief in the truth of each hypothesis
      - Arcs are direct influences between hypotheses
      - The structure is represented as a directed acyclic graph (DAG)
      - Representation of the dependencies among random variables
      - The parameters are the conditional probabilities on the arcs
      - (Compared with all possible combinations in the full joint, a B.N. stores only a small set of probabilities relating each node to its neighboring nodes.)
    • Bayesian Networks
      - Learning
        - Inducing a graph: from prior knowledge, or from structure learning
        - Estimating parameters: EM
      - Inference
        - Beliefs from evidence
        - Especially among the nodes not directly connected
    • Structure
      - Initial configuration of a BN
        - Root nodes: prior probabilities
        - Non-root nodes: conditional probabilities given all possible combinations of their direct predecessors
      - (Figure: DAG with root nodes A, B and non-root nodes C, D, E; priors P(a), P(b); conditionals P(c|a), P(c|¬a); P(d|a,b), P(d|a,¬b), P(d|¬a,b), P(d|¬a,¬b); P(e|d), P(e|¬d).)
    • Causes and Bayes' Rule
      - Diagnostic inference: knowing that the grass is wet, what is the probability that rain is the cause?
      - P(R | W) = P(W | R) P(R) / P(W)
                 = P(W | R) P(R) / [ P(W | R) P(R) + P(W | ¬R) P(¬R) ]
                 = (0.9 × 0.4) / (0.9 × 0.4 + 0.2 × 0.6) = 0.75
      - (Figure: causal direction R → W vs. diagnostic direction W → R.)
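      The 0.75 on this slide follows directly from the numbers given there:

        P_R = 0.4;  P_W_R = 0.9;  P_W_nR = 0.2;            % P(R), P(W|R), P(W|~R)
        P_W = P_W_R * P_R + P_W_nR * (1 - P_R);            % evidence P(W)
        fprintf('P(R|W) = %.2f\n', P_W_R * P_R / P_W);     % Bayes' rule: prints 0.75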
    • Causal vs. Diagnostic Inference
      - Causal inference: if the sprinkler is on, what is the probability that the grass is wet?
        P(W|S) = P(W|R,S) P(R|S) + P(W|~R,S) P(~R|S)
               = P(W|R,S) P(R) + P(W|~R,S) P(~R)
               = 0.95 × 0.4 + 0.9 × 0.6 = 0.92
      - Diagnostic inference: if the grass is wet, what is the probability that the sprinkler is on?
        P(S|W) = 0.35 > 0.2 = P(S)
        P(S|R,W) = 0.21
      - Explaining away: knowing that it has rained decreases the probability that the sprinkler is on.
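      Likewise, the causal figure of 0.92 is just the weighted average on the slide (the diagnostic value 0.35 needs the remaining CPTs of the network, which are not shown here):

        P_R = 0.4;  P_W_RS = 0.95;  P_W_nRS = 0.9;         % P(R), P(W|R,S), P(W|~R,S)
        P_W_S = P_W_RS * P_R + P_W_nRS * (1 - P_R);        % P(W|S), using P(R|S) = P(R)
        fprintf('P(W|S) = %.2f\n', P_W_S);                 % prints 0.92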
    • Bayesian Networks: Causes
      - Causal inference:
        P(W|C) = P(W|R,S) P(R,S|C) + P(W|~R,S) P(~R,S|C) + P(W|R,~S) P(R,~S|C) + P(W|~R,~S) P(~R,~S|C),
        using the fact that P(R,S|C) = P(R|C) P(S|C)
      - Diagnostic: P(C|W) = ?
    • Bayesian Nets: Local Structure
      - P(F | C) = ?
      - P(X_1, …, X_d) = Π_{i=1..d} P(X_i | parents(X_i))
    • Bayesian Networks: Inference
      - P(C,S,R,W,F) = P(C) P(S|C) P(R|C) P(W|R,S) P(F|R)
      - P(C,F) = Σ_S Σ_R Σ_W P(C,S,R,W,F)
      - P(F|C) = P(C,F) / P(C): not efficient!
      - Belief propagation (Pearl, 1988)
      - Junction trees (Lauritzen and Spiegelhalter, 1988)
      - Independence assumption
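      A brute-force sketch of this "sum out everything else" inference; every CPT value below is made up purely for illustration, and states are indexed 1 = false, 2 = true:

        P_C = [0.5 0.5];
        P_S_C = [0.9 0.5; 0.1 0.5];            % P_S_C(s,c) = P(S=s | C=c)
        P_R_C = [0.8 0.2; 0.2 0.8];            % P_R_C(r,c) = P(R=r | C=c)
        P_W_RS = zeros(2, 2, 2);               % P_W_RS(w,r,s) = P(W=w | R=r, S=s)
        P_W_RS(:,1,1) = [1.0; 0.0];  P_W_RS(:,2,1) = [0.1; 0.9];
        P_W_RS(:,1,2) = [0.1; 0.9];  P_W_RS(:,2,2) = [0.01; 0.99];
        P_F_R = [0.95 0.3; 0.05 0.7];          % P_F_R(f,r) = P(F=f | R=r)

        c = 2;                                 % condition on C = true
        P_CF = zeros(1, 2);                    % P(C=c, F=f), summing out S, R, W
        for s = 1:2, for r = 1:2, for w = 1:2, for f = 1:2
            P_CF(f) = P_CF(f) + P_C(c) * P_S_C(s,c) * P_R_C(r,c) * P_W_RS(w,r,s) * P_F_R(f,r);
        end, end, end, end
        disp(P_CF / P_C(c));                   % P(F | C=c); enumeration is exponential in general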
    • Inference: Evidence & Belief Propagation
      - Evidence: the values of observed nodes, e.g. V3 = T, V6 = 3
      - Our belief in what the value of each Vi "should" be changes
      - This belief is propagated, as if the CPTs of the observed nodes became deterministic (Figure: network V1…V6 with P(V3 = T) = 1.0 and P(V6 = 3) = 1.0 after observation)
    • Belief Propagation
      - Bayes' law: P(A | B) = P(B | A) P(A) / P(B)
      - "Causal" (π) message: going down an arrow, sum out the parent
      - "Diagnostic" (λ) message: going up an arrow, apply Bayes' law
      - (Some figures from: Peter Lucas, BN lecture course)
    • The π Messages
      - What are the messages? For simplicity, let the nodes be binary
      - The message passes on information. What information?
      - Observe: P(V2) = P(V2 | V1 = T) P(V1 = T) + P(V2 | V1 = F) P(V1 = F)
      - The information needed is the distribution of V1, π(V1) (CPTs in the figure: P(V1 = T) = 0.8, P(V1 = F) = 0.2; P(V2 = T | V1 = T) = 0.4, P(V2 = T | V1 = F) = 0.9)
      - π messages capture information passed from parent to child
    • The λ Messages
      - We know what the π messages are; what about λ?
      - Assume E = {V2} and compute by Bayes' rule:
        P(V1 | V2) = P(V1) P(V2 | V1) / P(V2) = α P(V1) P(V2 | V1)
      - The information not available at V1 is P(V2 | V1), to be passed upwards by a λ-message
      - Again, this is not in general exactly the CPT, but the belief based on the evidence down the tree
    • Belief Propagation
      - (Figure: node V with parents U1, U2 and children V1, V2; π messages π(U1), π(U2), π(V1), π(V2) flow down the arcs, and λ messages λ(U1), λ(U2), λ(V1), λ(V2) flow up.)
    • Evidence & Belief
      - (Figure: evidence entered at some nodes of the network V1…V6 propagates to update the beliefs at the remaining nodes.)
      - Works for classification??
    • Naive Bayes' Classifier
      - Given C, the x_j are independent: p(x|C) = p(x_1|C) p(x_2|C) … p(x_d|C)
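      A minimal naive Bayes sketch with binary features; priors and per-feature parameters are made up for illustration:

        prior = [0.6 0.4];                     % P(C=1), P(C=2) (hypothetical)
        theta = [0.8 0.3;                      % theta(c,j) = P(x_j = 1 | C = c) (hypothetical)
                 0.2 0.7];
        x = [1 0];                             % observed binary feature vector
        X = repmat(x, 2, 1);                   % one copy of x per class
        lik  = prod(theta .^ X .* (1 - theta) .^ (1 - X), 2)';   % p(x|C) = prod_j p(x_j|C)
        post = lik .* prior / sum(lik .* prior);                 % P(C|x) by Bayes' rule
        [~, c_hat] = max(post);
        fprintf('predicted class: %d\n', c_hat);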
    • Application Procedures for Classification
      - MLP
        - Data collection & pre-processing (training data / test data)
        - Decision node selection (output node)
        - Network training
        - Generalization
        - Parameter tuning & pruning
        - Final network
      - Decision trees
        - Data collection & pre-processing (training data / test data)
        - Decision attribute selection
        - Tree construction
        - Pruning
        - Final tree
      - Bayesian networks
        - Data collection & pre-processing (training data / test data)
        - Structure configuration: prior knowledge
        - Parameter learning
        - Decision node selection
        - Inference (classification): evidence & belief
        - Final network
    • Simulation Packages
      - WEKA (Java): http://www.cs.waikato.ac.nz/ml/weka/
      - FullBNT (MATLAB): http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html
      - MSBNx: http://research.microsoft.com/msbn/
      - MATLAB Neural Networks Toolbox: http://www.mathworks.com/products/neuralnet/
      - C4.5: http://www.rulequest.com/Personal/
    • WEKA
    • FullBNT

        clear all
        N = 4;                          % number of nodes
        dag = zeros(N,N);               % empty network-structure (adjacency) matrix
        C = 1; S = 2; R = 3; W = 4;     % name each node
        dag(C,[R S]) = 1;               % specify the network structure
        dag(R,W) = 1;
        dag(S,W) = 1;
        %discrete_nodes = 1:N;
        node_sizes = 2*ones(1,N);       % number of values each node can take
        %node_sizes = [4 2 3 5];
        %onodes = [];
        %bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes);
        bnet = mk_bnet(dag, node_sizes, 'names', {'C','S','R','W'}, 'discrete', 1:4);
        %C = bnet.names('cloudy');      % bnet.names is an associative array
        %%%%%% Specified parameters
        %bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
        %bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
        %bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
        %bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
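      To actually use this network, the usual BNT workflow continues by uncommenting the CPD lines above and querying the model through an inference engine. A sketch along the lines of the BNT sprinkler tutorial (assuming the standard BNT functions tabular_CPD, jtree_inf_engine, enter_evidence, and marginal_nodes; states are 1 = false, 2 = true):

        bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
        bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
        bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
        bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);

        engine = jtree_inf_engine(bnet);       % junction-tree inference engine
        evidence = cell(1, N);
        evidence{W} = 2;                       % observe W = true (the grass is wet)
        engine = enter_evidence(engine, evidence);
        marg = marginal_nodes(engine, R);      % posterior over R given the evidence
        disp(marg.T);                          % marg.T(2) = P(R = true | W = true)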
    • MSBNx
    • References
      - Textbooks
        - Ethem Alpaydin, Introduction to Machine Learning, The MIT Press, 2004
        - Tom Mitchell, Machine Learning, McGraw-Hill, 1997
        - Richard E. Neapolitan, Learning Bayesian Networks, Prentice Hall, 2003
      - Materials
        - Serafín Moral, Learning Bayesian Networks, University of Granada, Spain
        - Zheng Rong Yang, Connectionism, Exeter University
        - KyuTae Cho, Jeong Ki Yoo, HeeJin Lee, Uncertainty in AI, Probabilistic Reasoning, Especially for Bayesian Networks
        - Gary Bradski, Sebastian Thrun, Bayesian Networks in Computer Vision, Stanford University
      - Recommended textbooks
        - Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
        - J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1992
        - Simon S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, 1999
        - Finn V. Jensen, Bayesian Networks and Decision Graphs, Springer, 2007