CS290c – Machine Learning
Final Project
Examples and Implications of Physics Math in Machine Learning
Daniel A. O’Leary
In the course of lecture and discussion for this class, it appeared as though there
exists a pattern of discovery for machine learning theories. Several instances of machine
learning research exhibit the following steps:
• a topic or problem is identified
• some reasonable solutions are suggested
• one theory is chosen as best
• this theory is investigated and implemented
• it is discovered that the math of this solution coincides with the math of a known physics property
• the physics math reveals new facts regarding the nature of the problem
being investigated.
Not only does this seem to occur with enough frequency to suggest that physics and machine learning may share a common theme, but the ideas that come out of these examples of mathematics crossover are often the type of elegant or “big” ideas that shape and propel a discipline forward. As such, it seems reasonable that another way to search for these “big” leaps in understanding would be to look at the physics that applies to machine learning, identify trends or similarities, and extrapolate from these trends possible future mathematics crossovers that will prove true in machine learning.
I propose to investigate the use of physics principles as metaphors in machine
learning. Specifically, I will review energy, temperature, mean field theory, and
momentum. In doing so, I hope to discover an aspect of these topics in physics that has
been overlooked in the machine learning community. Essentially, I hope to extend
physics analogies to illuminate a machine learning problem. Failing that, I would like to
reach a conclusion about topics of physics that represent a high likelihood of being used
in machine learning in the future. In identifying such areas, a “reverse engineering” process might be performed: a search among existing machine learning questions for patterns that correspond to the math of high-likelihood physics topics. In identifying
these similarities, understanding of machine learning can be advanced through existing
physics knowledge.
Energy
Energy is a challenging topic in physics; it is simultaneously obvious and elusive. We all
know what energy is; generally, it is a measure of the ability to do work. However, energy does not have a consistent shape or a discrete unit. It is easy to conceive, but hard to define. Adding to this difficulty, Einstein’s famous E = mc² suggests that the line we draw
between matter and energy is purely a mental construct. It is perfectly valid to suggest
that any discussion in physics is, in fact, a discussion on energy.
Nonetheless, we can put these issues aside and accept our more intuitive
definition of energy for now (the ability to do work), and concentrate on energy’s applications in machine learning. Energy in a neural network first appeared in 1982, with the Hopfield network, and
has since become commonplace. A review of literature on the topic shows that energy is
consistently used as a function of weights, whose minima describe the attractors within
the network, in Boltzmann distributions [Dayan], within Gibbs free energy [Csato], and as a tool for classification, regression, constraint satisfaction, and determining latent variables [Yann], all of which are derived from the Hopfield energy equation [Hertz].
Within the context of a neural network, energy (H) is defined in terms of the weights between agreeing and disagreeing variables. Consider the variables Si and Sj, binary variables that take the value positive or negative one, with the weight between them wij. The energy of the network is H = -½ Σij wij Si Sj, such that weights between agreeing variables decrease the system’s energy and weights between disagreeing variables increase it. The factor of ½ accounts for the fact that wij = wji, so each weighted edge is counted twice in the sum; halving the total eliminates this redundancy. The update rule is Si = sign(Σj wji Sj): each Si takes the sign of its weighted input from the units connected to it.
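As a minimal sketch of these two formulas (written in Python with NumPy purely for illustration; the language and variable names are my own choices, not anything from the sources), the energy and the deterministic update rule can be written as:

    import numpy as np

    def energy(w, s):
        # Hopfield energy H = -1/2 * sum_ij w_ij * S_i * S_j for states S_i in {-1, +1}.
        return -0.5 * s @ w @ s

    def sign_update(w, s):
        # S_i = sign(sum_j w_ji * S_j): each unit takes the sign of its weighted input.
        return np.where(w @ s >= 0, 1, -1)

Repeatedly applying sign_update never increases the energy, which is why the network settles into a minimum.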
The graph of an energy function can be considered as a landscape of hills and valleys (see Figure 1). Patterns from the training set act as attractors and sit at the minima of the landscape. Some other minima are the echoes of surrounding attractors; these minima are called spurious mixture states because they mimic attractors but are the result of a mixture of true attractors. Further, some spurious minima are unrelated to the attractors; these are called spin glass states and have their basis in a different physics topic that is too great a digression to investigate here. Ultimately, it is the minima and their relationship to the attractors that make the energy function useful.

Figure 1: An example of an energy landscape.
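To make the attractor picture concrete, here is a short sketch that stores a few ±1 patterns using the standard Hebbian weight prescription from the Hopfield literature (the prescription is an assumption of this example; it appears in [Hertz] but is not derived above) and recalls one pattern from a corrupted cue:

    import numpy as np

    def hebbian_weights(patterns):
        # w_ij = (1/N) * sum over stored patterns of xi_i * xi_j, with zero diagonal.
        n = patterns.shape[1]
        w = patterns.T @ patterns / n
        np.fill_diagonal(w, 0.0)
        return w

    def recall(w, s, steps=20):
        # Repeated sign updates descend the energy landscape toward an attractor.
        for _ in range(steps):
            s = np.where(w @ s >= 0, 1, -1)
        return s

    patterns = np.array([[1, -1, 1, -1, 1, 1], [-1, -1, 1, 1, -1, 1]])
    w = hebbian_weights(patterns)
    noisy = patterns[0].copy()
    noisy[0] *= -1                # corrupt one bit of the first stored pattern
    print(recall(w, noisy))       # settles back onto the stored pattern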
Energy in the neural network context is compared to the energy exhibited by an array of micromagnets, each magnet exhibiting a spin that corresponds to the machine learning variable’s positive and negative one outputs. These arrays of micromagnets naturally settle into the minima of lowest energy, just like the neural networks. Further, this physics metaphor is extended to show other physics metaphors in the machine learning world. It also exhibits the qualities that I suggest correlate among physics math in machine learning: the magnetic array example is discrete, binary, it deals with physics at the atomic level, and it deals with energy as electrostatic forces.
Temperature
Temperature in the physics world can be thought of as the energy contained in an object. For example, the difference between a cold piece of iron and a warm piece of iron is that the warm piece contains more energy. This energy can be considered the velocity of the atoms in the substance: the warm piece of iron has the same atoms as the cold piece of iron, but its atoms are moving faster.

Figure 2: Energy function exhibiting spurious minima.
Within the machine learning context, temperature is an expression of noise within
a system and a parameter controlling the update rule. Temperature plays a role in
Boltzmann machines and Helmholtz free energy equations [Tanaka]. Temperature decay is
used in annealing to avoid spurious minima [Galland].
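As a small, hedged illustration of that idea (the geometric decay below is a generic schedule chosen for the example, not one taken from [Galland]), a temperature-decay schedule for annealing might look like:

    def annealing_schedule(t_start=10.0, decay=0.95, steps=50):
        # Geometric temperature decay: early high temperatures supply enough noise
        # to escape shallow spurious minima; later low temperatures let the state
        # settle into a deep attractor.
        t = t_start
        for _ in range(steps):
            yield t
            t *= decay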
Temperature (T) enters the machine learning formulas through the inverse temperature (β = 1/T). In the physics world, machine learning temperature exhibits properties similar to temperature within a micromagnetic array (just like the micromagnetic array discussed under energy). In our previous example of micromagnetic arrays, the spin of an atom was determined solely by the spins of its surrounding atoms; in fact, this is only true as a material approaches absolute zero. At higher temperatures, the energy in the material (indicated by temperature) can cause a spin to flip. Thus, spin becomes probabilistic: Si = ±1 with probability g(hi) = Fβ(±hi) = 1/(1 + exp(∓2βhi)). This is a sigmoid function that flattens as temperature increases, becoming completely flat at a critical temperature Tc. Thus the critical temperature is the temperature at which knowledge of the spin states offers no indication of what a connected spin state will be (the point at which noise exceeds the ability to draw any similarities). Temperature is used to reduce the impact of spurious minima by adding noise greater than the spurious minima’s depth, thereby allowing us to be “kicked out” of spurious states and continue in search of attractors. In being an extension of the magnetic array, this physics property also has the qualities of being discrete, binary, occurring at the atomic level, and involving energy as an electrostatic force.

Figure 3: Energy function reducing spurious minima through temperature.
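A minimal sketch of this stochastic update (again Python/NumPy, my own illustration; it updates all units at once purely for brevity) uses exactly the Fβ above as the flip probability:

    import numpy as np

    def stochastic_update(w, s, temperature, rng=np.random.default_rng()):
        # P(S_i = +1) = F_beta(h_i) = 1 / (1 + exp(-2 * beta * h_i)), with beta = 1 / T.
        beta = 1.0 / temperature
        h = w @ s
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * h))
        return np.where(rng.random(len(s)) < p_plus, 1, -1)

At very low temperature the sigmoid is nearly a step function and this reduces to the deterministic sign update; at very high temperature every unit is a coin flip, which is the flattening described above.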
Mean Field Theory
The mean field theory is a powerful, yet simple concept. It is the aggregation of spin states into a single spin state and the use of that spin state (the mean field) as the only spin state involved. This greatly simplifies the problem and offers insight into the average spin state (<Si>) of an array element.

Figure 3: Predictability of spin state with respect to temperature.
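For a concrete picture of that curve, here is a sketch (my own illustration, not taken from the cited sources) of the textbook mean-field self-consistency iteration m = tanh(βm), where m stands in for the average spin <S>; the solution collapses to zero above the critical temperature Tc = 1, matching the sharp drop in the predictability figure above.

    import numpy as np

    def mean_field_spin(temperature, iterations=500):
        # Fixed-point iteration of the self-consistency equation m = tanh(beta * m).
        beta = 1.0 / temperature
        m = 1.0  # start from a fully ordered state
        for _ in range(iterations):
            m = np.tanh(beta * m)
        return m

    # <S> stays close to 1 well below Tc = 1 and falls to 0 above it.
    for t in (0.5, 0.9, 1.1, 2.0):
        print(t, round(mean_field_spin(t), 3))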
The transition of the mean field theory to machine learning is trivial: because the math closely follows the magnetic array example, mean field theory applies directly. Further, the complexity of finding minima in an energy function grows exponentially with the number of paths, such that many interesting problems are too complex because of the number of paths their graphs contain. Use of the mean field theory makes such intractable problems manageable and greatly extends the use of Boltzmann machines [Tanaka], Gibbs free energy, and belief propagation [Csato].
On a conceptual level, the mean field theory offers an unexpected insight into the impact of noise in pattern recognition. Just as temperature in the physics model creates a sharp change in behavior at a particular noise level, noise in the machine learning model undergoes a similar phase transition with respect to the number of correctly labeled bits. “One might have assumed naively that the behavior would change smoothly as T was varied, but in a large system this is often not the case. . . In the present context [mean field theory] says that a large network abruptly ceases to function at all if a certain noise level is exceeded.” [Hertz] As shown in Figure 6, when the critical temperature in a stochastic network is reached, an input pattern offers no more insight than random guessing.

Figure 6: The number of correct bits retrieved in a pattern with respect to temperature.
Momentum
Momentum, in the physics sense, is the product of mass and velocity. A bowling
ball is thrown down a bowling lane; when it is released from the player’s hand, it is no longer acted on by external forces (assuming a frictionless vacuum), but continues to
move forward. It is momentum that allows the ball to move in the absence of maintained
force.
In machine learning, momentum refers to a damping effect placed on weight changes in back-propagation algorithms. Momentum (α) appears in the following equation for the change in weight [Hertz, 123]:

Δwpq(t+1) = -η ∂E/∂wpq + α Δwpq(t)

This momentum acts as an average force, compelling each new iteration of weight changes to coincide with previous weight changes.
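A short sketch of this rule as generic gradient descent with momentum (the loss, learning rate η, and momentum α below are placeholder choices of mine, not values from [Hertz]):

    import numpy as np

    def descend_with_momentum(grad, w, eta=0.1, alpha=0.9, steps=200):
        # delta_w(t+1) = -eta * dE/dw + alpha * delta_w(t)
        delta_w = np.zeros_like(w)
        for _ in range(steps):
            delta_w = -eta * grad(w) + alpha * delta_w
            w = w + delta_w
        return w

    # Example: E(w) = ||w||^2 / 2, whose gradient is w; the iterate heads toward zero.
    print(descend_with_momentum(lambda w: w, np.array([3.0, -2.0])))

Setting α = 0 recovers plain gradient descent; the α term is what carries the “average force” of the previous steps forward.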
I include momentum in this paper for several reasons. First, it is a physics concept and as such clearly falls into the category of topics under investigation. Second, it deals with energy, so it coincides with the physics topics discussed so far. Third, it differs from previous topics in that it involves large objects (like a car or a ball rolling to a stop), unlike the atomic scope of the previous topics. Finally, it offers a counter-example to the trends I identify as consistent among physics properties used in machine learning: momentum is not discrete, it is not binary, and it is not concerned with energy in an electrostatic-force sense. It is my suspicion that machine learning comes to the term momentum not due to its mathematical similarities to physics momentum, but because of the common-usage meaning of momentum. In common usage, momentum means impetus, a driving force, which is very different from the physics term.
Conclusion
Having recognized a high propensity of physics metaphors in machine learning, I
entered into this paper with two goals. The first was to review the examples of physics
math in machine learning with the belief that there exists some extension of these physics
principles that applies to machine learning but has not been explored by the machine
learning community.
I made no discovery extending machine learning based on the physics metaphors
that machine learning employs; nonetheless, I believe that such extensions exist. These
metaphors have been revealed in the past and I am confident they will continue to be
revealed in the future. I did not discover any extension because in this endeavor, my
reach exceeds my grasp. It was a little naive (and perhaps a little arrogant) to think that I
could review literature for a few weeks and identify a pattern that had been missed by a
community that has been investigating the topic since its inception. However - and I
cannot state this strongly enough - I do not imply that absence of proof is proof of
absence. I remain confident in the premise of my search, that the math shared in machine
learning and physics represents a “hidden truth” – that both are linked by a truth
regarding the way in which information is transferred and processed – and that until such
a truth is revealed, the more mature science of physics will offer many insights into the
relatively new endeavor of machine learning.
My second goal for this paper was to look at the physics math that is applied to
machine learning and draw a conclusion about which aspects of physics would more
likely offer insight into fruitful future research in the realm of machine learning.
Obviously, not all physics knowledge applies to machine learning, but in identifying
qualities of the topics in physics that prove most applicable to machine learning we can
extrapolate which topics in physics are more likely to offer insights to machine learning.
My conclusions in this area are far from earth-shattering. For example, the idea that
physics representations of objects at their smallest level, where events are discrete and
binary, are more often applicable to the discrete and binary questions in machine learning
is a very reasonable, perhaps even intuitive notion.
Quantum mechanics deals with quantized information, at an atomic level, with a
high propensity to exhibit binary behavior. By and large, the physics that applies to
machine learning shares these qualities. This alone suggests that research in this direction
is worth considering. Add to that the fact that quantum mechanics’ narrow, non-intuitive
nature places it outside the sphere of interest of most computer scientists. This suggests
that machine learning, viewed through the lens of quantum math, is probably an underdeveloped topic. I feel confident in suggesting that this is an area worthy of greater investigation, and I consider this a successful result for the second goal in writing this paper.
References

Hertz, J., Krogh, A., and Palmer, R. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, Calif., 1991.

Reif, F. Fundamentals of Statistical and Thermal Physics. McGraw-Hill, 1985.

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. The Helmholtz machine. Neural Computation, vol. 7, no. 5, Sept. 1995, pp. 889-904.

Csato, L., Opper, M., and Winther, O. TAP Gibbs free energy, belief propagation and sparsity. Advances in Neural Information Processing Systems, vol. 14, 2002, pp. 657-664.

Tanaka, T. Mean-field theory of Boltzmann machine learning. Physical Review E, vol. 58, no. 2, Aug. 1998, pp. 2302-2310.

Kivinen, J., and Warmuth, M. K. Boosting as entropy projection. Proceedings of the Twelfth Annual Conference on Computational Learning Theory, ACM, 1999, pp. 134-144.

Galland, C. C. The limitations of deterministic Boltzmann machine learning. Network: Computation in Neural Systems, vol. 4, no. 3, Aug. 1993, pp. 355-379.

LeCun, Y., and Huang, F. J. Loss functions for discriminative training of energy-based models. Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (AIStats'05), 2005.