The attribute that should be tested at the root of the decision tree is the one that yields the maximum information gain, i.e. the minimum entropy after the split, when used to partition the training data. In other words, it is the attribute that best separates the data according to the target classes, creating the "purest" child nodes with respect to those classes.
1. Introduction to Machine Learning
Jinhyuk Choi
Human-Computer Interaction Lab @ Information and Communications University
2. Contents
Concepts of Machine Learning
Multilayer Perceptrons
Decision Trees
Bayesian Networks
3. What is Machine Learning?
Large storage / large amount of data
Looks random but certain patterns
Web log data
Medical record
Network optimization
Bioinformatics
Machine vision
Speech recognition…
No complete identification of the process
A good or useful approximation
4. What is Machine Learning?
Definition
Programming computers to optimize a performance criterion using example data or past experience
Role of Statistics
Inference from a sample
Role of Computer science
Efficient algorithms to solve the optimization problem
Representing and evaluating the model for inference
Descriptive (training) / predictive (generalization)
Learning from Human-generated data??
5. What is Machine Learning?
Concept Learning
• Inducing general functions from specific training examples (positive or negative)
• Looking for the hypothesis that best fits the training examples
Objects → Concept (example):
• {eyes, nose, legs}, {reproductive ability, wings, beak, … feathers …} → Bird
• a boolean function Bird(animal): "true or not" (inanimate objects → false)
• Concepts:
- describing some subset of objects or events defined over a larger set
- a boolean-valued function
6. What is Machine Learning?
Concept Learning
Inferring a boolean-valued function from training examples of its input and output
[Figure: two hypotheses (Hypothesis 1, Hypothesis 2) approximating a concept, with positive and negative examples drawn from domains such as web log data, medical records, network optimization, bioinformatics, machine vision, speech recognition, …]
7. What is Machine Learning?
Learning Problem Design
Do you enjoy sports?
Learn to predict the value of "EnjoySports" for an arbitrary day, based on the values of its other attributes
What problem?
Why learning?
Attributes selection
Effective?
Enough?
What learning algorithm?
9. Examples (1)
TV program preference inference based on web usage data
[Figure: web pages #1–#4 are fed into a classifier (steps 1, 2, 3) that infers TV program preferences #1–#4]
What are we supposed to do at each step?
10. Examples (2)
from a HW of Neural Networks Class (KAIST-2002)
Function approximation ("Mexican hat"):
f₃(x₁, x₂) = sin(2(x₁² + x₂²)),  x₁, x₂ ∈ [−1, 1]
11. Examples (3)
from a HW of Machine Learning Class (ICU-2006)
Face image classification
19. The back-propagation algorithm
Network model
Input layer (xᵢ) → hidden layer (y_j) → output layer (o_k), with weights v_ji (input→hidden) and w_kj (hidden→output)
y_j = s(Σᵢ v_ji xᵢ)
o_k = s(Σ_j w_kj y_j)
Error function: E(v, w) = ½ Σ_k (t_k − o_k)²
Stochastic gradient descent
21. Gradient-descent function minimization
In order to find a vector parameter x that minimizes a function f(x):
Start with a random initial value x⁽⁰⁾.
Determine the direction of steepest descent in the parameter space: ∇f = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ)
Move a step in that direction: x⁽ⁱ⁺¹⁾ = x⁽ⁱ⁾ − h∇f
Repeat the above two steps until no more change in x.
For gradient-descent to work…
The function to be minimized should be continuous.
The function should not have too many local minima.
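The two steps above can be sketched in Python. This is an illustrative sketch, not from the slides; the quadratic objective and the step size h = 0.1 are assumptions:

```python
import numpy as np

def grad_descent(f_grad, x0, h=0.1, tol=1e-8, max_iter=10_000):
    """Minimize f by stepping against its gradient until x stops changing."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - h * f_grad(x)            # x_{i+1} = x_i - h * grad f(x_i)
        if np.linalg.norm(x_new - x) < tol:  # "no more change in x"
            break
        x = x_new
    return x

# Example: f(x) = x1^2 + x2^2, so grad f(x) = (2 x1, 2 x2); the minimum is the origin.
x_min = grad_descent(lambda x: 2 * x, x0=[3.0, -4.0])
```

Note how the two caveats on the slide show up here: the gradient must exist everywhere along the path, and with many local minima the returned x depends entirely on the starting point.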
23. Derivation of back-propagation algorithm
Adjustment of w_kj:
∂E/∂w_kj = ∂/∂w_kj [½ Σ_k (t_k − o_k)²] = ∂/∂w_kj [½ (t_k − s(Σ_j w_kj y_j))²]
= ½ · 2(t_k − o_k) · (−o_k(1 − o_k) y_j)
= −y_j o_k(1 − o_k)(t_k − o_k)
Δw_kj = −h ∂E/∂w_kj = h o_k(1 − o_k)(t_k − o_k) y_j = h δ_k^o y_j,
where δ_k^o ≡ o_k(1 − o_k)(t_k − o_k)
24. Derivation of back-propagation algorithm
Adjustment of v_ji:
∂E/∂v_ji = ∂/∂v_ji [½ Σ_k (t_k − o_k)²] = ∂/∂v_ji [½ Σ_k (t_k − s(Σ_j w_kj y_j))²]
= ∂/∂v_ji [½ Σ_k (t_k − s(Σ_j w_kj s(Σᵢ v_ji xᵢ)))²]
= −xᵢ y_j(1 − y_j) Σ_k w_kj o_k(1 − o_k)(t_k − o_k)
Δv_ji = −h ∂E/∂v_ji = h y_j(1 − y_j) [Σ_k w_kj o_k(1 − o_k)(t_k − o_k)] xᵢ
= h y_j(1 − y_j) (Σ_k w_kj δ_k^o) xᵢ = h δ_j^y xᵢ,
where δ_j^y ≡ y_j(1 − y_j) Σ_k w_kj δ_k^o
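The two update rules can be collected into one stochastic-gradient step. The NumPy sketch below is illustrative (the network sizes, the learning rate h, and the logistic choice of s are assumptions consistent with the derivation, not code from the slides):

```python
import numpy as np

def s(z):
    """Logistic activation s(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, v, w, h=0.5):
    """One stochastic-gradient step for a one-hidden-layer net,
    applying the slide's rules dw_kj = h*d_k*y_j and dv_ji = h*d_j*x_i in place."""
    y = s(v @ x)                      # hidden activations y_j = s(sum_i v_ji x_i)
    o = s(w @ y)                      # outputs o_k = s(sum_j w_kj y_j)
    d_o = o * (1 - o) * (t - o)       # output deltas d_k^o
    d_y = y * (1 - y) * (w.T @ d_o)   # hidden deltas d_j^y = y_j(1-y_j) sum_k w_kj d_k^o
    w += h * np.outer(d_o, y)         # w_kj += h d_k^o y_j
    v += h * np.outer(d_y, x)         # v_ji += h d_j^y x_i
    return 0.5 * np.sum((t - o) ** 2) # error E = 1/2 sum_k (t_k - o_k)^2 before the update
```

Calling `backprop_step` repeatedly on training cases drives E down, which is the stochastic-gradient-descent loop the slides describe.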
26. Batch learning vs. Incremental learning
Batch standard backprop proceeds as follows:
Initialize the weights W.
Repeat the following steps:
Process all the training data DL to compute the gradient of the average error function AQ(DL,W).
Update the weights by subtracting the gradient times the learning rate.

Incremental standard backprop can be done as follows:
Initialize the weights W.
Repeat the following steps for j = 1 to NL:
Process one training case (y_j, X_j) to compute the gradient of the error (loss) function Q(y_j, X_j, W).
Update the weights by subtracting the gradient times the learning rate.
30. Introduction
Divide & conquer
Hierarchical model
Sequence of recursive splits
Decision node vs. leaf node
Advantage
Interpretability
IF-THEN rules
31. Divide and Conquer
Internal decision nodes
Univariate: Uses a single attribute, xi
Numeric xi : Binary split : xi > wm
Discrete xi : n-way split for n possible values
Multivariate: Uses all attributes, x
Leaves
Classification: Class labels, or proportions
Regression: Numeric; r average, or local fit
Learning
Construction of the tree using training examples
Looking for the simplest tree among the trees that code the training data without error
Based on heuristics
NP-complete
“Greedy”; find the best split recursively (Breiman et al, 1984; Quinlan, 1986, 1993)
32. Classification Trees
Split is the main procedure for tree construction, guided by an impurity measure
For node m, N_m instances reach m, and N_m^i of them belong to class Cᵢ:
P̂(Cᵢ | x, m) ≡ p_m^i = N_m^i / N_m
Node m is pure if p_m^i is 0 or 1 (to be pure!!!)
Measure of impurity is entropy: I_m = −Σᵢ₌₁^K p_m^i log₂ p_m^i
33. Representation
Each node specifies a test of some attribute of the instance
Each branch corresponds to one of the possible values of this attribute
34. Best Split
If node m is pure, generate a leaf and stop; otherwise split and continue recursively
Impurity after split: N_mj of the N_m instances take branch j, and N_mj^i of them belong to Cᵢ:
P̂(Cᵢ | x, m, j) ≡ p_mj^i = N_mj^i / N_mj
I′_m = −Σ_j₌₁^n (N_mj / N_m) Σᵢ₌₁^K p_mj^i log₂ p_mj^i
Find the variable and split that minimize impurity (among all variables, and among split positions for numeric variables)
Q) "Which attribute should be tested at the root of the tree?"
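As a sketch of the answer: score each candidate attribute by the post-split impurity I′_m and put the minimizer (equivalently, the maximum-information-gain attribute) at the root. Illustrative Python, with made-up attribute names and toy data:

```python
import math
from collections import Counter

def entropy(labels):
    """I_m = -sum_i p^i log2 p^i for the class labels reaching a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_impurity(rows, attr):
    """I'_m: branch-size-weighted entropy after splitting on attr.
    rows is a list of (attribute_dict, class_label) pairs."""
    n = len(rows)
    branches = {}
    for x, label in rows:
        branches.setdefault(x[attr], []).append(label)
    return sum(len(b) / n * entropy(b) for b in branches.values())

def best_attribute(rows, attrs):
    """The root test: the attribute with minimum post-split impurity."""
    return min(attrs, key=lambda a: split_impurity(rows, a))
```

Applied recursively to each branch's subset of rows, this greedy choice is exactly the tree-growing loop of the earlier slide.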
36. Entropy
“Measure of uncertainty”
“Expected number of bits to resolve uncertainty”
Suppose Pr{X = 0} = 1/8. If the other events are equally likely, the number of events is 8; to indicate one out of 8 events, one needs lg 8 = 3 bits.
Consider a binary random variable X s.t. Pr{X = 0} = 0.1.
The expected number of bits: 0.1 lg(1/0.1) + 0.9 lg(1/0.9)
In general, if a random variable X has c values with probabilities p₁, …, p_c:
The expected number of bits: H = Σᵢ₌₁^c pᵢ lg(1/pᵢ) = −Σᵢ₌₁^c pᵢ lg pᵢ
37. Entropy
Example: 14 examples
Entropy([9+, 5−]) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.940
Entropy = 0: all members positive or all negative
Entropy = 1: equal numbers of positive & negative
0 < Entropy < 1: unequal numbers of positive & negative
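The example can be checked numerically; a small sketch (the function name is ours):

```python
import math

def entropy(pos, neg):
    """Entropy of a node with pos positive and neg negative examples."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:                      # 0 * lg 0 is taken as 0
            p = c / total
            h -= p * math.log2(p)
    return h

# entropy(9, 5) is approximately 0.940; entropy(7, 7) == 1.0; entropy(14, 0) == 0.0
```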
38. Information Gain
Measures the expected reduction in entropy caused by partitioning the examples
39. Information Gain
ICU-Student tree example, candidate split: Gender (other candidate attributes: IQ, Height)
• Root: # of samples = 100, # of positive samples = 50, Entropy = 1
• Male branch: # of samples = 50, # of positive samples = 40, Entropy = 0.72
• Female branch: # of samples = 50, # of positive samples = 10, Entropy = 0.72
• On average: Entropy = 0.5 × 0.72 + 0.5 × 0.72 = 0.72
• Reduction in entropy = 1 − 0.72 = 0.28 → the information gain
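The gender-split example works out as follows (illustrative sketch using the slide's counts):

```python
import math

def h(p):
    """Binary entropy of a node whose positive fraction is p."""
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

parent = h(50 / 100)                 # root: 100 samples, 50 positive -> entropy 1.0
left, right = h(40 / 50), h(10 / 50) # each branch holds 50 of the 100 samples
after = (50 / 100) * left + (50 / 100) * right   # weighted average, about 0.72
gain = parent - after                # expected reduction in entropy, about 0.28
```

Note that 40/50 positive and 10/50 positive give the same entropy (0.72): entropy depends only on how unbalanced the branch is, not on which class dominates.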
43. Hypothesis Space Search
Hypothesis space: the set of all possible decision trees
DT learning is guided by the information gain measure.
Occam's razor ??
44. Overfitting
• Why “over”-fitting?
– A model can become more complex than the true target function (concept) when it tries to satisfy noisy data as well
45. Avoiding over-fitting the data
Two classes of approaches to avoid overfitting
Stop growing the tree earlier.
Post-prune the tree after overfitting
Ok, but how to determine the optimal size of a tree?
Use validation examples to evaluate the effect of pruning (stopping)
Use a statistical test to estimate the effect of pruning (stopping)
Use a measure of complexity for encoding decision tree.
Approaches based on the second strategy (post-pruning)
Reduced error pruning
Rule post-pruning
48. Bayes’ Rule
Introduction
posterior = prior × likelihood / evidence:
P(C | x) = P(C) p(x | C) / p(x)
P(C = 0) + P(C = 1) = 1
p(x) = p(x | C = 1) P(C = 1) + p(x | C = 0) P(C = 0)
P(C = 0 | x) + P(C = 1 | x) = 1
49. Bayes’ Rule: K>2 Classes
P(Cᵢ | x) = p(x | Cᵢ) P(Cᵢ) / p(x) = p(x | Cᵢ) P(Cᵢ) / Σ_k₌₁^K p(x | C_k) P(C_k)
P(Cᵢ) ≥ 0 and Σᵢ₌₁^K P(Cᵢ) = 1
choose Cᵢ if P(Cᵢ | x) = max_k P(C_k | x)
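A sketch of the K-class rule; the prior and likelihood numbers below are made up for illustration:

```python
import numpy as np

def posterior(prior, likelihood):
    """P(C_i | x) = p(x | C_i) P(C_i) / sum_k p(x | C_k) P(C_k)."""
    joint = np.asarray(prior) * np.asarray(likelihood)
    return joint / joint.sum()       # the denominator is the evidence p(x)

# Hypothetical 3-class case: priors P(C_i) and likelihoods p(x | C_i) at some x.
post = posterior([0.5, 0.3, 0.2], [0.1, 0.4, 0.4])
choice = int(np.argmax(post))        # choose the C_i with maximal posterior
```

Normalizing by the evidence guarantees the posteriors sum to 1, so the argmax rule is a proper Bayes decision.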
50. Bayesian Networks
Graphical models, probabilistic networks
causality and influence
Nodes are hypotheses (random vars) and the prob corresponds to our
belief in the truth of the hypothesis
Arcs are direct influences between hypotheses
The structure is represented as a directed acyclic graph (DAG)
Representation of the dependencies among random variables
The parameters are the conditional probs in the arcs
With a B.N., only a small set of probabilities relating neighboring nodes is needed, instead of probabilities for all possible combinations of circumstances
51. Bayesian Networks
Learning
Inducing a graph
From prior knowledge
From structure learning
Estimating parameters
EM
Inference
Beliefs from evidences
Especially among the nodes not directly connected
52. Structure
Initial configuration of BN
Root nodes
Prior probabilities
Non-root nodes
Conditional probabilities given all possible combinations of direct predecessors
[Figure: DAG with root nodes A (P(a)) and B (P(b)); C depends on A with P(c|a), P(c|¬a); D depends on A and B with P(d|a,b), P(d|a,¬b), P(d|¬a,b), P(d|¬a,¬b); E depends on D with P(e|d), P(e|¬d)]
53. Causes and Bayes’ Rule
Diagnostic inference: knowing that the grass is wet, what is the probability that rain is the cause? (The R → W arc is causal; reasoning from W back to R is diagnostic.)
P(R | W) = P(W | R) P(R) / P(W)
= P(W | R) P(R) / [P(W | R) P(R) + P(W | ¬R) P(¬R)]
= (0.9 × 0.4) / (0.9 × 0.4 + 0.2 × 0.6) = 0.75
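The same diagnostic computation in code, using the probabilities from the slide:

```python
# Diagnostic inference on the wet-grass example: P(R | W) by Bayes' rule.
p_r = 0.4        # P(R)
p_w_r = 0.9      # P(W | R)
p_w_nr = 0.2     # P(W | ~R)

p_w = p_w_r * p_r + p_w_nr * (1 - p_r)   # evidence P(W) = 0.48
p_r_w = p_w_r * p_r / p_w                # P(R | W) = 0.36 / 0.48 = 0.75
```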
54. Causal vs Diagnostic Inference
Causal inference: if the sprinkler is on, what is the probability that the grass is wet?
P(W | S) = P(W | R, S) P(R | S) + P(W | ¬R, S) P(¬R | S)
= P(W | R, S) P(R) + P(W | ¬R, S) P(¬R)
= 0.95 × 0.4 + 0.9 × 0.6 = 0.92
Diagnostic inference: if the grass is wet, what is the probability that the sprinkler is on? P(S | W) = 0.35 > P(S) = 0.2
P(S | R, W) = 0.21
Explaining away: knowing that it has rained decreases the probability that the sprinkler is on.
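The causal direction in code, again with the slide's numbers (here R is independent of S, so P(R | S) = P(R)):

```python
# Causal inference: P(W | S) with rain independent of the sprinkler.
p_r = 0.4        # P(R) = P(R | S)
p_w_rs = 0.95    # P(W | R, S)
p_w_nrs = 0.90   # P(W | ~R, S)

p_w_s = p_w_rs * p_r + p_w_nrs * (1 - p_r)   # 0.95*0.4 + 0.9*0.6 = 0.92
```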
55. Bayesian Networks: Causes
Causal inference:
P(W | C) = P(W | R, S) P(R, S | C) + P(W | ¬R, S) P(¬R, S | C) + P(W | R, ¬S) P(R, ¬S | C) + P(W | ¬R, ¬S) P(¬R, ¬S | C)
and use the fact that P(R, S | C) = P(R | C) P(S | C)
Diagnostic: P(C|W ) = ?
56. Bayesian Nets: Local structure
P(F | C) = ?
P(X₁, …, X_d) = Πᵢ₌₁^d P(Xᵢ | parents(Xᵢ))
57. Bayesian Networks: Inference
P (C,S,R,W,F ) = P (C ) P (S |C ) P (R |C ) P (W |R,S ) P (F |R )
P (C,F ) = ∑S ∑R ∑W P (C,S,R,W,F )
P (F |C) = P (C,F ) / P(C ) Not efficient!
Belief propagation (Pearl, 1988)
Junction trees (Lauritzen and Spiegelhalter, 1988)
Independence assumption
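The "not efficient" enumeration route can be sketched directly from the factorization. The CPT numbers below are illustrative placeholders, not values given on the slide:

```python
from itertools import product

# Illustrative CPTs for the Cloudy/Sprinkler/Rain/WetGrass/Roof(F) network.
P_C = 0.5                               # P(C = T)
P_S = {True: 0.1, False: 0.5}           # P(S = T | C)
P_R = {True: 0.8, False: 0.1}           # P(R = T | C)
P_W = {(True, True): 0.95, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.10}   # P(W = T | R, S)
P_F = {True: 0.7, False: 0.1}           # P(F = T | R)

def joint(c, s, r, w, f):
    """P(C,S,R,W,F) = P(C) P(S|C) P(R|C) P(W|R,S) P(F|R)."""
    def b(p, v):                        # P(X = v) given P(X = T) = p
        return p if v else 1 - p
    return (b(P_C, c) * b(P_S[c], s) * b(P_R[c], r)
            * b(P_W[(r, s)], w) * b(P_F[r], f))

# P(F=T | C=T) = P(C=T, F=T) / P(C=T), summing out S, R, W by brute force.
num = sum(joint(True, s, r, w, True) for s, r, w in product([True, False], repeat=3))
p_f_given_c = num / P_C
```

This triple sum is exactly the ∑S ∑R ∑W on the slide; belief propagation and junction trees exist precisely to avoid this exponential enumeration on larger networks.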
58. Inference
Evidence & Belief Propagation
Evidence: the values of observed nodes, e.g. V3 = T, V6 = 3
Our belief in what the value of Vᵢ "should" be changes, and this belief is propagated, as if the CPTs became deterministic:
[Figure: network V1…V6; the observed tables become P(V3 = T) = 1.0, P(V3 = F) = 0.0 and P(V6 = 3) = 1.0, P(V6 = 1) = P(V6 = 2) = 0.0 regardless of V2]
59. Belief Propagation
Bayes Law: P(A | B) = P(B | A) P(A) / P(B)
"Causal" message (π): going down an arrow, sum out the parent
"Diagnostic" message (λ): going up an arrow, apply Bayes Law
* some figures from: Peter Lucas BN lecture course
60. The Messages
• What are the messages?
• For simplicity, let the nodes be binary
The message passes on information. What information? Observe:
P(V2) = P(V2 | V1 = T) P(V1 = T) + P(V2 | V1 = F) P(V1 = F)
The information needed is the belief over V1, e.g. P(V1 = T) = 0.8, P(V1 = F) = 0.2, combined with the CPT P(V2 = T | V1 = T) = 0.4, P(V2 = T | V1 = F) = 0.9: this is the message π(V1)
π messages capture information passed from parent to child
61. The Messages
• We know what the π messages are
• What about λ?
Assume E = {V2} and compute by Bayes' rule:
P(V1 | V2) = P(V1) P(V2 | V1) / P(V2) = α P(V1) P(V2 | V1)
The information not available at V1 is P(V2 | V1), to be passed upwards by a λ-message. Again, this is not in general exactly the CPT, but the belief based on evidence down the tree.
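Using the binary example above (P(V1=T) = 0.8, P(V2=T | V1=T) = 0.4, P(V2=T | V1=F) = 0.9), the downward π and upward λ computations can be sketched as:

```python
# pi message (down the arrow): sum out the parent.
p_v1 = 0.8                             # P(V1 = T)
p_v2_given = {True: 0.4, False: 0.9}   # P(V2 = T | V1)

# P(V2 = T) = P(V2=T|V1=T) P(V1=T) + P(V2=T|V1=F) P(V1=F)
p_v2 = p_v2_given[True] * p_v1 + p_v2_given[False] * (1 - p_v1)

# lambda message (up the arrow): evidence V2 = T, apply Bayes' rule.
# P(V1=T | V2=T) = P(V1=T) P(V2=T | V1=T) / P(V2=T)
p_v1_given_v2 = p_v2_given[True] * p_v1 / p_v2
```

Observing V2 = T lowers the belief in V1 = T from 0.8 to 0.64, since V2 = T is more likely under V1 = F; that shift is exactly what the λ-message carries upward.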
62. Belief Propagation
[Figure: node V with parents U1, U2 and children V1, V2; π messages π(U1), π(U2) flow down from the parents into V and π(V1), π(V2) from V to its children, while λ messages λ(U1), λ(U2), λ(V1), λ(V2) flow up along the same arcs]
63. Evidence & Belief
[Figure: network V1…V6 with evidence entering at V1 and V6; the resulting beliefs propagate to the internal nodes V2–V5]
Works for classification ??
70. References
Textbooks
Ethem ALPAYDIN, Introduction to Machine Learning, The MIT Press, 2004
Tom Mitchell, Machine Learning, McGraw Hill, 1997
Neapolitan, R.E., Learning Bayesian Networks, Prentice Hall, 2003
Materials
Serafín Moral, Learning Bayesian Networks, University of Granada, Spain
Zheng Rong Yang, Connectionism, Exeter University
KyuTae Cho, Jeong Ki Yoo, HeeJin Lee, Uncertainty in AI, Probabilistic reasoning,
Especially for Bayesian Networks
Gary Bradski, Sebastian Thrun, Bayesian Networks in Computer Vision, Stanford
University
Recommended Textbooks
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1992
Haykin, Simon S., Neural networks : a comprehensive foundation, Prentice Hall, 1999
Jensen, Finn V., Bayesian networks and decision graphs, Springer, 2007