The document introduces information theory concepts such as entropy, joint entropy, conditional entropy, and mutual information, and then relates them to generalization in deep learning models. Specifically, it argues that because the PAC-Bayesian bound is data-dependent, a model with high VC dimension can still generalize when the data is clean: little noise is memorized, so the KL divergence between the prior and posterior weight distributions stays low.
5. VC Bound
• Over-fitting is caused by high VC dimension.
• For a given dataset (n is constant), search for the best VC dimension.
• VC generalization bound: with probability at least 1-δ,
\epsilon(h) \le \hat{\epsilon}(h) + \sqrt{\frac{8}{n} \log \frac{4\,(2n)^d}{\delta}}
where \epsilon(h) is the testing error, \hat{\epsilon}(h) is the training error, n is the number of training instances, and d is the VC dimension (model complexity).
[Figure: training and testing error as a function of the VC dimension d; the testing error is lowest at the best VC dimension, and over-fitting grows as d approaches n (shattering).]
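As a quick numeric check of the gap term in this bound, here is a minimal sketch (the helper function, the sample size, the VC dimensions, and δ are illustrative values, not from the slides):

```python
import math

def vc_bound_gap(n, d, delta=0.05):
    """Gap term of the VC bound: sqrt((8/n) * log(4 * (2n)^d / delta))."""
    # Expand log(4 * (2n)^d / delta) to avoid overflow for large d.
    log_term = math.log(4.0) + d * math.log(2 * n) - math.log(delta)
    return math.sqrt(8.0 / n * log_term)

# The gap grows with the VC dimension d and shrinks with the sample size n.
for d in (10, 100, 1000):
    print(d, round(vc_bound_gap(n=50_000, d=d), 3))
```

For large d relative to n the gap term exceeds 1, i.e. the bound becomes vacuous, which is the over-fitting regime sketched in the figure.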
6. VC Dimension
• VC dimension of a linear model: O(W), where W is the number of parameters.
• VC dimension of a fully-connected neural network: O(LW log W), where L is the number of layers and W is the number of parameters.
• The VC dimension is independent of the data distribution; it depends only on the model.
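To make the O(W) and O(LW log W) scalings concrete, here is a small sketch that counts the parameters of a fully-connected network and evaluates both expressions (the layer sizes are made-up examples, and the constants hidden by the O(·) notation are ignored):

```python
import math

def param_count(layer_sizes):
    """Number of weights (including biases) of a fully-connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

layers = [784, 512, 512, 10]   # example MLP: input, two hidden layers, output
W = param_count(layers)        # number of parameters
L = len(layers) - 1            # number of weight layers

print("W =", W)
print("linear model VC dim ~ O(W)         :", W)
print("fully-connected net ~ O(L W log W) :", int(L * W * math.log(W)))
```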
11. Generalization in Deep Learning
• A deep neural network (Inception) is trained on three versions of the same task: the original dataset (CIFAR), the same features with random labels, and random noise features.
• The network can shatter all three: training error \hat{\epsilon}(h) \approx 0 in every case.
[Figure: the three training sets (original CIFAR, random labels, random noise features) with example labels, all fit by the same network.]
12. Generalization in Deep Learning
• The same three cases, now with the testing error:
  original dataset (CIFAR): \hat{\epsilon}(h) \approx 0, \epsilon(h) \approx 0.14
  random labels:            \hat{\epsilon}(h) \approx 0, \epsilon(h) \approx 0.9
  random noise features:    \hat{\epsilon}(h) \approx 0, \epsilon(h) \approx 0.9
13. Generalization in Deep Learning
• The testing error depends on the data distribution.
• However, the VC bound does not depend on the data distribution.
[Figure: original dataset vs. random labels vs. random noise features; \epsilon(h) \approx 0.14 for the original data and \epsilon(h) \approx 0.9 for the randomized versions, even though the VC bound is identical in all cases.]
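A minimal sketch of how the three variants in this experiment can be constructed (NumPy only; the `features` and `labels` arrays are stand-ins for the real CIFAR data, and the Gaussian-noise parameters are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder arrays standing in for the real CIFAR data.
features = rng.random((1000, 32 * 32 * 3)).astype(np.float32)
labels = rng.integers(0, 10, size=1000)

# 1) original dataset: (features, labels) unchanged
# 2) random labels: same features, labels permuted independently of the inputs
random_labels = rng.permutation(labels)
# 3) random noise features: inputs replaced by Gaussian noise, labels unchanged
noise_features = rng.normal(size=features.shape).astype(np.float32)

# A large enough network can reach ~0 training error on all three variants,
# but only the original dataset yields a low testing error.
```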
19. PAC-Bayesian Bound for Deep Learning
• With probability at least 1-δ, the following inequality (the PAC-Bayesian bound) holds:
KL\big(\hat{\epsilon}(Q) \,\|\, \epsilon(Q)\big) \le \frac{KL(Q \,\|\, P) + \log\frac{n}{\delta}}{n-1}
where n is the number of training instances, P is the distribution of models before training (the prior), Q is the distribution of models after training (the posterior), KL(Q \| P) is the KL divergence between the posterior and the prior, and KL(\hat{\epsilon}(Q) \| \epsilon(Q)) is the KL divergence between the training error and the testing error.
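Since the bound controls the binary KL divergence between training and testing error, turning it into an upper bound on the testing error requires inverting that KL. A small sketch (the bisection-based inversion and the example numbers are illustrative):

```python
import math

def binary_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    return (q * math.log((q + eps) / (p + eps))
            + (1 - q) * math.log((1 - q + eps) / (1 - p + eps)))

def pac_bayes_error_bound(train_err, kl_qp, n, delta=0.05):
    """Largest testing error consistent with KL(train_err || test_err) <= rhs."""
    rhs = (kl_qp + math.log(n / delta)) / (n - 1)
    lo, hi = train_err, 1.0 - 1e-9
    for _ in range(100):               # bisection on the testing error
        mid = 0.5 * (lo + hi)
        if binary_kl(train_err, mid) > rhs:
            hi = mid
        else:
            lo = mid
    return lo

print(pac_bayes_error_bound(train_err=0.02, kl_qp=5_000.0, n=60_000))
```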
20. PAC-Bayesian Bound for Deep Learning
• A relaxed form of the bound: with probability at least 1-δ,
\epsilon(Q) \le \hat{\epsilon}(Q) + \sqrt{\frac{KL(Q \,\|\, P) + \log\frac{n}{\delta} + 2}{2n - 1}}
• Low KL(Q \| P): the gap term is low and KL(\hat{\epsilon}(Q) \| \epsilon(Q)) is low, but the model under-fits.
• Moderate KL(Q \| P): appropriate fitting; KL(\hat{\epsilon}(Q) \| \epsilon(Q)) is moderate.
• High KL(Q \| P): the posterior Q moves far from the prior P, the gap term is high, and the model over-fits; KL(\hat{\epsilon}(Q) \| \epsilon(Q)) is high.
[Figure: prior P and posterior Q over the weights for the three regimes, shown against the training and testing data.]
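A quick numerical look at the relaxed bound across the three regimes (a sketch using the bound above; the KL(Q||P) values are invented to illustrate the low / moderate / high cases):

```python
import math

def relaxed_gap(kl_qp, n, delta=0.05):
    """Gap term of the relaxed PAC-Bayesian bound."""
    return math.sqrt((kl_qp + math.log(n / delta) + 2) / (2 * n - 1))

n = 60_000
for name, kl_qp in [("low", 100.0), ("moderate", 5_000.0), ("high", 200_000.0)]:
    print(f"{name:8s} KL(Q||P) = {kl_qp:9.0f}  gap <= {relaxed_gap(kl_qp, n):.3f}")
```

The gap between training and testing error grows with KL(Q||P), matching the under-fitting / appropriate fitting / over-fitting picture above.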
28. Conditional Entropy
\begin{aligned}
H(Y \mid X) &= -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y \mid x) \log p(y \mid x) \\
&= -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y \mid x)\,\big(\log p(x, y) - \log p(x)\big) \\
&= -\sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x, y) \log p(x, y) + \sum_{x \in \mathcal{X}} p(x) \log p(x) \\
&= H(X, Y) - H(X)
\end{aligned}
[Information diagram: H(Y|X) is the part of H(Y) lying outside H(X) within H(X, Y).]
29. Conditional Entropy
• X and Y are independent:
  P(X,Y)   Y=1   Y=2
  X=1      0.4   0.4
  X=2      0.1   0.1
  H(X,Y) = 1.722, H(X) = 0.722, H(Y) = 1
  H(Y|X) = H(X,Y) - H(X) = 1 = H(Y)
• Y is a stochastic function of X:
  P(X,Y)   Y=1   Y=2
  X=1      0.4   0.1
  X=2      0.1   0.4
  H(X,Y) = 1.722, H(X) = 1, H(Y) = 1
  H(Y|X) = H(X,Y) - H(X) = 0.722
• Y is a deterministic function of X:
  P(X,Y)   Y=1   Y=2
  X=1      0.5   0
  X=2      0     0.5
  H(X,Y) = 1, H(X) = 1, H(Y) = 1
  H(Y|X) = H(X,Y) - H(X) = 0
[Information diagrams: for independent X and Y, H(Y|X) = H(Y); for a deterministic function, H(Y|X) = 0.]
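These three worked examples can be checked directly from the joint tables; a small NumPy sketch (entropies in bits, matching the slide's numbers up to rounding):

```python
import numpy as np

def entropies(joint):
    """Return H(X), H(Y), H(X,Y), H(Y|X) in bits for a joint table P[x, y]."""
    joint = np.asarray(joint, dtype=float)
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    h = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
    hxy = h(joint.ravel())
    return h(px), h(py), hxy, hxy - h(px)

for name, table in [("independent",   [[0.4, 0.4], [0.1, 0.1]]),
                    ("stochastic fn", [[0.4, 0.1], [0.1, 0.4]]),
                    ("deterministic", [[0.5, 0.0], [0.0, 0.5]])]:
    hx, hy, hxy, hyx = entropies(table)
    print(f"{name:14s} H(X)={hx:.3f} H(Y)={hy:.3f} H(X,Y)={hxy:.3f} H(Y|X)={hyx:.3f}")
```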
30. Mutual Information
• The mutual dependence between two variables X and Y:
I(X; Y) = \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
• X and Y are independent:
  P(X,Y)   Y=1   Y=2
  X=1      0.4   0.4
  X=2      0.1   0.1
  I(X;Y) = 0
• Y is a stochastic function of X:
  P(X,Y)   Y=1   Y=2
  X=1      0.4   0.1
  X=2      0.1   0.4
  I(X;Y) = 0.278
• Y is a deterministic function of X:
  P(X,Y)   Y=1   Y=2
  X=1      0.5   0
  X=2      0     0.5
  I(X;Y) = 1
31. Mutual Information
\begin{aligned}
I(X; Y) &= \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \\
&= \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x, y)\,\big({-\log p(x)} - \log p(y) + \log p(x, y)\big) \\
&= -\sum_{x \in \mathcal{X}} p(x) \log p(x) - \sum_{y \in \mathcal{Y}} p(y) \log p(y) + \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x, y) \log p(x, y) \\
&= H(X) + H(Y) - H(X, Y)
\end{aligned}
[Information diagram: I(X; Y) is the overlap of H(X) and H(Y) inside H(X, Y).]
32. Mutual Information
• X and Y are independent:
  P(X,Y)   Y=1   Y=2
  X=1      0.4   0.4
  X=2      0.1   0.1
  H(X,Y) = 1.722, H(X) = 0.722, H(Y) = 1
  I(X;Y) = H(X) + H(Y) - H(X,Y) = 0
• Y is a stochastic function of X:
  P(X,Y)   Y=1   Y=2
  X=1      0.4   0.1
  X=2      0.1   0.4
  H(X,Y) = 1.722, H(X) = 1, H(Y) = 1
  I(X;Y) = H(X) + H(Y) - H(X,Y) = 0.278
• Y is a deterministic function of X:
  P(X,Y)   Y=1   Y=2
  X=1      0.5   0
  X=2      0     0.5
  H(X,Y) = 1, H(X) = 1, H(Y) = 1
  I(X;Y) = H(X) + H(Y) - H(X,Y) = 1
[Information diagrams: no overlap between H(X) and H(Y) in the independent case, partial overlap in the stochastic case, and H(Y) fully contained in H(X) in the deterministic case.]
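The same tables can be used to check I(X;Y) from the definition, which agrees with the identity I(X;Y) = H(X) + H(Y) - H(X,Y); a small NumPy sketch (values in bits):

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in bits from the definition sum_{x,y} p(x,y) log p(x,y)/(p(x)p(y))."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / (px * py)[mask]))

for name, table in [("independent",   [[0.4, 0.4], [0.1, 0.1]]),   # I = 0
                    ("stochastic fn", [[0.4, 0.1], [0.1, 0.4]]),   # I = 0.278
                    ("deterministic", [[0.5, 0.0], [0.0, 0.5]])]:  # I = 1
    print(f"{name:14s} I(X;Y) = {mutual_information(table):.3f}")
```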
37. Cause of Over-fitting
• Training loss (cross-entropy):
H_{p,q}(y \mid x, w) = \mathbb{E}_{x,y}\big[-p(y \mid x, w) \log q(y \mid x, w)\big]
  p: probability density function of the data
  q: probability density function predicted by the model
  x: input feature of the training data
  y: label of the training data
  w: weights of the model
  \theta: latent parameters of the data distribution
38. Cause of Over-fitting
H_{p,q}(y \mid x, w) = H_p(y \mid x, w) + \mathbb{E}_{x,w}\, KL\big(p(y \mid x, w) \,\|\, q(y \mid x, w)\big)
where H_p(y \mid x, w) is the uncertainty of y given w and x.
[Information diagram: H_p(y | x, w) shown inside H_p(y), relative to H_p(x) and H_p(w).]
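This decomposition of the cross-entropy into the data's own uncertainty plus a KL term can be checked numerically for a single input x; a minimal sketch (the two distributions p and q are made-up examples):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1, 0.0])   # "true" p(y|x, w) for one input x (example values)
q = np.array([0.4, 0.3, 0.2, 0.1])   # model prediction q(y|x, w) (example values)

mask = p > 0
cross_entropy = -np.sum(p[mask] * np.log(q[mask]))      # H_{p,q}(y|x, w)
entropy = -np.sum(p[mask] * np.log(p[mask]))            # H_p(y|x, w)
kl = np.sum(p[mask] * np.log(p[mask] / q[mask]))        # KL(p || q)

print(cross_entropy, entropy + kl)   # the two numbers agree: H_{p,q} = H_p + KL(p||q)
```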
39. Cause of Over-fitting
• Lower H_p(y \mid x, w) -> lower uncertainty of y given w and x -> lower training error.
• Example: given an input x and a fixed w,
lower H_p(y \mid x, w):
\begin{cases} p(y=1 \mid x, w) = 0.9 \\ p(y=2 \mid x, w) = 0.1 \\ p(y=3 \mid x, w) = 0.0 \\ p(y=4 \mid x, w) = 0.0 \end{cases}
higher H_p(y \mid x, w):
\begin{cases} p(y=1 \mid x, w) = 0.3 \\ p(y=2 \mid x, w) = 0.3 \\ p(y=3 \mid x, w) = 0.2 \\ p(y=4 \mid x, w) = 0.2 \end{cases}
H_{p,q}(y \mid x, w) = H_p(y \mid x, w) + \mathbb{E}_{x,w}\, KL\big(p(y \mid x, w) \,\|\, q(y \mid x, w)\big)
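For the two example conditionals above, the entropies can be computed directly; a small sketch (values in bits):

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy_bits([0.9, 0.1, 0.0, 0.0]))   # ~0.47 bits: lower H_p(y|x, w)
print(entropy_bits([0.3, 0.3, 0.2, 0.2]))   # ~1.97 bits: higher H_p(y|x, w)
```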
40. Cause of Over-fitting
H_{p,q}(y \mid x, w) = H_p(y \mid x, w) + \mathbb{E}_{x,w}\, KL\big(p(y \mid x, w) \,\|\, q(y \mid x, w)\big)
H_p(y \mid x, w) = H_p(y \mid x, \theta) + I(y; \theta \mid x, w) - I(y; w \mid x, \theta)
\theta: latent parameters of the (training & testing) data distribution
[Information diagram: H_p(y|x) and H_p(y_{test}|x_{test}) shown relative to H_p(\theta).]
41. Cause of Over-fitting
\theta: latent parameters of the (training & testing) data distribution
[Figure: information diagram over H_p(y|x), H_p(y_{test}|x_{test}), and H_p(\theta), with toy x -> y examples. Regions H_p(y|x, \theta) and I_p(y; \theta|x) mark the useful information in the training data, the noisy information and outliers in the training data, the noise and outliers in the testing data, and the normal samples not in the training data.]
42. Cause of Over-fitting
H_p(y \mid x, w) = H_p(y \mid x, \theta) + I(y; \theta \mid x, w) - I(y; w \mid x, \theta)
• H_p(y \mid x, w): the uncertainty of y given w and x
• H_p(y \mid x, \theta): noisy information and outliers in the training data
• I(y; \theta \mid x, w): useful information not learned by the weights
• I(y; w \mid x, \theta): noisy information and outliers learned by the weights
[Information diagram: the three terms shown as regions of H_p(y) relative to H_p(x), H_p(\theta), and H_p(w).]
47. Cause of Over-fitting
H_p(y \mid x, w) = H_p(y \mid x, \theta) + I(y; \theta \mid x, w) - I(y; w \mid x, \theta)
• I(y; w \mid x, \theta): noisy information and outliers learned by the weights
[Information diagram: the region I(y; w | x, \theta) highlighted relative to H_p(x), H_p(y), H_p(\theta), and H_p(w).]
48. Cause of Over-fitting
• Cause of over-fitting: the weights memorize the noisy information in the training data.
H_p(y \mid x, w) = H_p(y \mid x, \theta) + I(y; \theta \mid x, w) - I(y; w \mid x, \theta)
• High VC dimension, but clean data -> little noise to memorize
• High VC dimension, and noisy data -> much noise to memorize
[Figure: training and testing data for the clean and noisy cases, with the fitted weights w.]
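The clean-data vs. noisy-data contrast can be reproduced with any high-capacity learner; a minimal scikit-learn sketch (the synthetic dataset, the 30% noise rate, and the choice of an unpruned decision tree are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def fit_and_score(y_train):
    # An unpruned tree has enough capacity to memorize the training set.
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_train)
    return model.score(X_tr, y_train), model.score(X_te, y_te)

# Clean labels: high capacity, but little noise to memorize.
print("clean :", fit_and_score(y_tr))

# Noisy labels: flip 30% of the training labels; the tree memorizes the noise
# and the testing accuracy drops.
rng = np.random.default_rng(0)
flip = rng.random(len(y_tr)) < 0.3
y_noisy = np.where(flip, 1 - y_tr, y_tr)
print("noisy :", fit_and_score(y_noisy))
```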
49. Information in the Weights as a Regularizer
• I(y; w \mid x, \theta) is unknown and cannot be computed.
• I(D; w), the information in the weights, is an upper bound of I(y; w \mid x, \theta):
I(y; w \mid x, \theta) \le I(y, x; w \mid \theta) = I(D; w \mid \theta) \le I(D; w)
where D = (x, y) is the training data, so that H_p(x, y) = H_p(D).
[Information diagram: I(y; w | x, \theta) contained inside I(D; w).]
50. Information in the Weights as a Regularizer
• The actual data distribution p is unknown.
• Estimate I(D; w) by I_q(D; w): I_p(D; w) \approx I_q(D; w)
• New loss function, with I_q(D; w) as a regularizer:
L\big(q(w \mid D)\big) = H_{p,q}(y \mid x, w) + I_q(D; w)
51. Connection with Flat Minimum
• A flat minimum has low information in the weights:
I_q(w; D) \le \frac{K}{2}\Big[\log \|\hat{w}\|_2^2 + \log \|H\|_* - K \log\frac{K^2}{2}\Big]
\|H\|_*: nuclear norm of the Hessian at the local minimum
• Flat minimum -> low nuclear norm of the Hessian -> low information I_q(w; D)
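The nuclear norm of the Hessian is the sum of its singular values (equal to the sum of eigenvalues for a positive semi-definite Hessian at a minimum), so flatter minima give smaller values. A toy sketch with two made-up Hessians:

```python
import numpy as np

# Hessians at two local minima of a toy 3-parameter loss (illustrative values).
H_sharp = np.diag([50.0, 20.0, 10.0])   # sharp minimum: large curvature
H_flat = np.diag([0.5, 0.2, 0.1])       # flat minimum: small curvature

nuclear_norm = lambda H: np.linalg.norm(H, ord="nuc")   # sum of singular values

print("sharp ||H||_* =", nuclear_norm(H_sharp))   # 80.0
print("flat  ||H||_* =", nuclear_norm(H_flat))    # 0.8 -> lower information in the weights
```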
52. Connection with PAC-Bayesian Bound
• Given a prior distribution p(w) over the weights, we have:
I_q(w; D) = \mathbb{E}_D\, KL\big(q(w \mid D) \,\|\, q(w)\big) \le \mathbb{E}_D\, KL\big(q(w \mid D) \,\|\, p(w)\big)
where q(w \mid D) is the distribution of weights after training on dataset D (the posterior) and p(w) is the distribution of weights before training (the prior).
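When both the prior p(w) and the posterior q(w|D) are modeled as diagonal Gaussians (a common modeling choice, not something the slides prescribe), the KL term that upper-bounds I_q(w; D) has a closed form. A minimal sketch with invented posterior parameters:

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ), in nats."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    return 0.5 * np.sum(var_q / var_p + (mu_p - mu_q) ** 2 / var_p - 1.0
                        + np.log(var_p / var_q))

rng = np.random.default_rng(0)
mu_p, sigma_p = np.zeros(1000), np.ones(1000)   # prior: weights before training

# Posterior far from the prior (large weights, small variance) -> large KL,
# i.e. many nats of information about D stored in the weights.
print(gaussian_kl(rng.normal(0, 3, 1000), 0.1 * np.ones(1000), mu_p, sigma_p))
# Posterior close to the prior -> small KL, little information in the weights.
print(gaussian_kl(rng.normal(0, 0.1, 1000), 0.9 * np.ones(1000), mu_p, sigma_p))
```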
53. Connection with PAC-Bayesian Bound
• Loss function with the regularizer (using I_p(D; w) \approx I_q(D; w)):
L\big(q(w \mid D)\big) = H_{p,q}(y \mid x, w) + I_q(D; w) \le H_{p,q}(y \mid x, w) + \mathbb{E}_D\, KL\big(q(w \mid D) \,\|\, p(w)\big)
• PAC-Bayesian bound:
\mathbb{E}_D\big[L_{test}\big(q(w \mid D)\big)\big] \le H_{p,q}(y \mid x, w) + \frac{L_{max}\,\mathbb{E}_D\big[KL\big(q(w \mid D) \,\|\, p(w)\big)\big]}{n\,\big(1 - \tfrac{1}{2}\big)}
  L_{test}: test error of the network with weights q(w \mid D)
  L_{max}: maximum per-sample loss