A discussion of AI/ML as an empirical science: key concepts in the field, how to position ourselves, how to plan research, what empirical methods in AI/ML look like, and how to build up a theory of AI.
6. WHY EMPIRICAL?
• Unless something is theoretically proven, we need realistic validation.
• Physics theories are great, but they rest on assumptions → they need experimental validation.
• Toy theories don't scale → they have zero impact → AI winter.
  - Hence, theories have scaled down a bit, e.g., down the Chomsky hierarchy in NLP, and down to surface pattern recognition in CV.
• Really difficult problems lead to new theoretical setups.
• Empirical observations offer insights; they suggest problem formulations, induction and abduction.
• They push the field forward. AI is now strongly driven by industry.
25/06/2020 6
7. EMPIRICAL RESEARCH != IGNORANCE OF THEORY
Theory guides the search. It quantifies the goals and makes sure that what we obtained is correct.
Current AI theories:
• Optimization. Mostly non-convex. The loss landscape of deep nets.
• Learnability and generalization. Learning dynamics, convergence and stability.
• Probabilistic inference. Quantification of uncertainty. Algorithmic assurance.
• First-order logic.
• Constraint satisfaction.
8. IS EMPIRICAL ML/AI DIFFICULT?
Yes! This is the real world. It is messy and must be difficult:
• Robotics
• Healthcare
• Drug design
No! The "effective" solution space is small and constrained:
• Machine translation
• Self-driving cars
• Drug design
9. TWO RESEARCH STYLES
Methodological: one method, many application scenarios. The method will fail at some point.
Applied: one application, many competing methods. Some methods are intrinsically better than others, e.g., CNNs for image processing.
10. HOW TO POSITION YOURSELF
"[…] the dynamics of the game will evolve. In the long run, the right way of playing football is to position yourself intelligently and to wait for the ball to come to you. You'll need to run up and down a bit, either to respond to how the play is evolving or to get out of the way of the scrum when it looks like it might flatten you." (Neil Lawrence, 7/2015, now with Amazon)
http://inverseprobability.com/2015/07/12/Thoughts-on-ICML-2015/
11. "Then ask yourself: what is your unique position? What are your strengths and advantages that other people do not have? Can you move faster than others? It may be by having access to data, access to expertise in the neighborhood, or borrowing angles from outside the field. Sometimes digging up old ideas is highly beneficial, too.
Alternatively, just calm down and do boring-but-important stuff. Important problems are like the goal areas in ball games. The ball will surely come."
https://letdataspeak.blogspot.com/2016/12/making-dent-in-machine-learning-or-how.html
12. WISDOM OF THE ELDERS
Max Planck:
• Science advances one death at a time.
Geoff Hinton:
• When you have a cool idea but everyone says it is rubbish, then it is a really good idea.
• The next breakthrough will be made by some grad student.
Richard Hamming:
• After 7 years you will run out of ideas. Better to change to a totally new field and restart.
• If you solve one problem, another will come. Better to solve all the problems of next year, even if we don't know what they are.
http://paulgraham.com/hamming.html
13. STRATEGIES TO SELECT RESEARCH AREAS
• Ask what intelligence means, and what the human brain would do.
• Ask what a physicist would do:
  - the simplest laws/tricks that would do 99% of the job;
  - minimizing some free energy;
  - looking for invariance and symmetry.
• Set an intelligence level (bacteria, insect, fish, mouse, dog, monkey, human).
• Check solvable unsolved problems. There are LOTS of them!
• Look for impact.
• Look for being different. Fill the gap. Be unique. Be recognizable.
• Ride the early wave.
14. INSIGHTS FROM NEUROSCIENCE
The brain is the only known intelligent system. It can provide inspiration for AI, and it can validate AI claims.
Indeed, many AI concepts originated from neuroscience.
• However, the influence is often only at the initial step.
• AI doesn't have the constraints that neuroscience does.
• AI can do things that are impossible within neuroscience.
Hassabis, Demis, et al. "Neuroscience-inspired artificial intelligence." Neuron 95.2 (2017): 245-258.
Schmidhuber, Jürgen. "Deep learning in neural networks: An overview." Neural Networks 61 (2015): 85-117.
16. #REF: A Cognitive Architecture Approach to Interactive Task Learning, John E. Laird, University of Michigan
17. #REF: A Cognitive Architecture Approach to Interactive Task Learning, John E. Laird, University of Michigan
[Figure: Laird's cognitive architecture, annotated with the corresponding ML components: knowledge graphs; planning and RL; NTM, attention and RL; unsupervised learning; CNN; episodic memory. Question: where is the processor?]
18. THINKING & LEARNING, FAST AND SLOW
Thinking:
• Fast: reactive, primitive, associative, animal-like.
• Slow: deliberative, reasoning, logical.
Learning:
• Fast: often memory-based, one-shot learning style, fast weights, e.g., k-NN.
• Slow: often model-based, iterative, slow weights, e.g., backprop.
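The fast/slow learning contrast can be made concrete with a toy sketch (illustrative code, not from the slides; the functions and data are made up):

```python
# Toy contrast between "fast" memory-based learning (k-NN: learning is
# just storing examples) and "slow" iterative learning (gradient descent
# repeatedly adjusting a model weight). Purely illustrative.

def knn_predict(memory, x, k=1):
    """Fast learner: predict from the k nearest stored examples."""
    nearest = sorted(memory, key=lambda xy: abs(xy[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def sgd_fit(data, lr=0.1, epochs=200):
    """Slow learner: fit y = w*x by many small gradient steps."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            w -= lr * (w * x - y) * x  # gradient of 0.5 * (w*x - y)**2
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2x
memory = list(data)   # one-shot: "training" is a single copy
w = sgd_fit(data)     # iterative: hundreds of weight updates
```

The k-NN learner absorbs a new example instantly (`memory.append(...)`), while the gradient learner must re-iterate over the data; this is the fast-weights vs slow-weights distinction in miniature.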
19. UNSOLVED PROBLEMS (1)
• Learn to organize and remember ultra-long sequences.
• Learn to generate arbitrary objects, with zero support.
• Reason about objects, relations, causality, self and other agents.
• Imagine scenarios, act on the world and learn from the feedback.
• Continual, never-ending learning across tasks, domains and representations.
• Learn by socializing.
• Learn just by observing and self-prediction.
• Organize and reason about (common-sense) knowledge.
• Ground symbols in sensory signals.
• Automate the discovery of physical laws.
• Solve genetics, neuroscience and healthcare.
• Automate the physical sciences.
• Automate software engineering.
21. LECUN'S LIST (2018)
• Deep learning on new domains (beyond multi-dimensional arrays)
  - Graphs, structured data...
• Marrying deep learning and (logical) reasoning
  - Replacing symbols by vectors and logic by algebra
• Self-supervised learning of world models
  - Dealing with uncertainty in high-dimensional continuous spaces
• Learning hierarchical representations of control space
  - Instantiating complex/abstract action plans into simpler ones
• Theory!
• Compilers for differentiable programming.
22. A TYPICAL EMPIRICAL PROCESS
Hypothesis generation:
• mostly by our brain,
• but AI can help the process too!
Hypothesis testing by experiments. Two types:
• Discovery: it is there, waiting to be found. Mostly theories of nature.
• Invention: it is not there, but it can be made. Mostly engineering.
23. HOW TO PLAN EMPIRICAL RESEARCH
• Get the right people. Smart. Self-motivated. Can read maths. Can code. Can do dirty data cleaning.
• Choose the right topic: impactful ones at the right time. Look for current failures. Set the targets.
• Explore theoretical options.
• Break the problem into solvable parts.
  - Before: 3 years of PhD, one problem.
  - Now: every 3-6 months to meet a conference deadline (Feb-March, May-June, Aug-Sept, Nov-Dec).
• Collect data. LOTS of it. Clean it up.
• Code it up. In DL, this means using your favourite programming framework (e.g., PyTorch, TensorFlow).
• Evaluate the options on some small data, then big data, then a variety of data.
• Repeat.
Question: when should we stop evaluating?
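The "small data first, then big data" step can be sketched as a fail-fast screening loop (illustrative only; `evaluate` and the option tuples are hypothetical stand-ins for real methods and datasets):

```python
# Fail-fast model selection: screen candidate methods cheaply on a small
# dataset, and spend the expensive big-data evaluation only on survivors.

def evaluate(option, dataset):
    """Hypothetical stand-in for training/evaluating a method; returns a score."""
    quality, noise = option
    return quality - noise / max(len(dataset), 1)

def fail_fast_select(options, small_data, big_data, threshold=0.5):
    # Stage 1: cheap screening on small data.
    survivors = [o for o in options if evaluate(o, small_data) > threshold]
    # Stage 2: expensive evaluation on big data, only for survivors.
    return max(survivors, key=lambda o: evaluate(o, big_data), default=None)

options = [(0.9, 0.2), (0.3, 0.1)]  # toy (quality, noise) pairs
best = fail_fast_select(options, small_data=[0] * 10, big_data=[0] * 10000)
```

The design choice here is simply to order experiments by cost, so that a bad option is rejected after seconds of compute rather than days.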
24. HOW TO BE PRODUCTIVE AND BE HAPPY
• Maximize your quantity and quality of enlightenment.
• Fail fast. Fail smart. Start small. Scale later.
• Use the simplest possible method to confirm or reject a hypothesis.
• If the hypothesis is partially confirmed, refine the hypothesis or improve the method, and repeat.
• Time spent per experiment should be limited to an hour or so (though winning code seems to run for 2 weeks).
• Productivity = knowledge units gained / time unit.
25. ML TALES
"Machine learning has very high expectations in terms of
• very rapidly producing a lot of successful work and
• exerting influence over the firehose of everyone else's rapidly produced work.
Ilya Sutskever has over 50,000 Google Scholar citations; none of the four most recent Fields Medal winners has over 5,000 citations."
https://veronikach.com/how-i-fail/how-i-fail-ian-goodfellow-phd14-computer-science/
26. WHAT IS INVOLVED IN AN AI/ML PAPER?
• Something that is wrong, doesn't work, or needs to be done (otherwise the cost of inaction is too high), or something overlooked by previous work.
  - It must be related to intelligence (learning, reasoning, etc.).
• A high-level model of the problem (e.g., graphical models, RNNs, memory, GANs).
• A math problem resulting from optimizing the model (e.g., convergence, stability, generalization).
• A math solution (e.g., a theorem, an end-to-end solution).
• An engineering solution for scaling, speeding up, reducing the data needed, etc.
• Lots of experiments (synthetic and realistic) to:
  - validate that the solutions are correct;
  - show evidence that the problems are worth solving;
  - hint at the future impact and follow-up work.
27. CONSTANT FAILURE AS PART OF THE WORKFLOW
Failure is the rule, not the exception.
• See Ian Goodfellow's case at https://veronikach.com/how-i-fail/how-i-fail-ian-goodfellow-phd14-computer-science/
  - "Failure of specific ideas is just a constant part of my workflow."
• Those who are never rejected are not trying hard enough.
  - Hinton: models that are not overfitting are too small.
  - Management wisdom: staff are always incompetent.
• But don't come back to your boss/supervisor every single time with a failure report!
28. WHAT CAN WE LEARN FROM FAILURE?
• The idea is fundamentally wrong, or
• the hyper-parameters, especially for deep networks, are wrong, or
• there are software bugs, or
• the expectation is wrong.
It is important to know when, where and why things don't add up.
• Too-good results are usually wrong.
• Too-bad results are usually wrong.
29. CHOOSING HIGH-IMPACT PROBLEMS
• Impact on a large population (e.g., 100M people or more, per Andrew Ng).
• Start a new paradigm, a new trend.
• Ride the wave.
• There is no need for a competitive "mine beats yours" mindset.
30. CHOOSING IMPORTANT BUT LESS SHINY PROBLEMS
Computer vision and NLP are currently shiny areas, dominated by industry. You can't compete well without good data, computing resources, incentives and a high concentration of talent.
Some of the best problems lie at the intersection of areas, especially those requiring some domain knowledge (which you can learn fast):
• Biomedicine, e.g., drug design, genomics, clinical data.
• Sustainability, e.g., Earth health, oceans, crop management, land use, climate change, energy, waste management.
• Social good: law, developing countries, access to healthcare and education, clean water, transportation.
• Value alignment, ethics, safety, assurance.
• Accelerating the sciences, e.g., molecule design/exploration, inverse materials design, quantum worlds.
31. RICHARD HAMMING ... AGAIN
"The average scientist [...] spends almost all his time working on problems which he believes will not be important and he also doesn't believe that they will lead to important problems."
33. PATTERNS REPEATED EVERYWHERE
• Lots of problems can be formulated in similar ways.
• We often need an intermediate representation, e.g., graphs, which appear in many fields.
• Solving the generic problem is often faster and more generalizable. See Sutton's remarks.
• Many domains can be solved at the same time. After all, we have just one brain that does everything.
• Methods that survive tend to be simple, generic and scalable. PCA, CNN, RNN and SGD are examples.
34. METHODS VERSUS CONCEPTS
Methods come and go, but concepts stay:
• Decision trees, kernels → template matching, data geometry
• Random forests, gradient boosting → smoothing, variance reduction
• Neural nets → differentiable programming, function reuse, credit assignment
"Machine learning is the new algorithms":
https://nlpers.blogspot.com.au/2014/10/machine-learning-is-new-algorithms.html
35. PARADIGMS, FRAMEWORKS, ALGORITHMS
• Paradigms are major shifts in a science, happening rarely; they can be centuries or decades apart.
• Frameworks are specific designs within a paradigm (e.g., differentiable programming).
• Algorithms are specific realizations of a framework, aiming at efficiency and scalability (e.g., quicksort).
36. BIG INNOVATION = SYNERGY OF SMALL IMPROVEMENTS
Case study: the CRF (2001).
• A conditional MRF? Really?
• Log-linear parameterization? Since the early 1970s.
• Maximum entropy? Since 1957.
• The forward-backward algorithm? A special case of belief propagation. Since 1988.
• Large-scale convex optimization? Since the early 1990s.
So why the big deal?
• Right time: lots of data, big models, computers strong enough, ML starting to take off.
• A nice combination of many components, each of which was not very well known in the ML community.
• Importantly, the empirical results were fantastic compared to HMMs and MEMMs.
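For reference, the combination above yields the familiar log-linear conditional model (standard linear-chain CRF notation from Lafferty et al., 2001; the symbols are not from the slides):

```latex
% Linear-chain CRF: a conditional MRF with a log-linear parameterization.
% x: input sequence, y: label sequence, f_k: feature functions, \lambda_k: weights.
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\Big( \sum_{t} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\Big( \sum_{t} \sum_{k} \lambda_k f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big).
```

The forward-backward algorithm computes $Z(\mathbf{x})$ exactly for chains, which is part of what made this combination practical at scale.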
37. TREND IN AI/ML: HOMOGENIZATION
• CNN: convolution
• RNN: translation
• Transformer: attention
• ResNet: iterative refinement
• FiLM: gated iterative refinement
• R. Sutton: learning + search
• D. Knuth: search + sort
38. RULES OF INNOVATION
• Fill in the missing blocks along an axis of innovation.
  - E.g., single → multiple; flat → hierarchy.
• Combinatorial innovation.
  - E.g., MRF + MaxEnt → CRF.
• The three-word rule:
  - E.g., PCA, SVM, RNN, CNN, GCN, ICA, HMM, ANN, SNE, LLE, MAP, MRF, CRF, DRF, IBP, HDP, MDP, GMM, GBM, GAN, VAE, NSM, NTM, DNC, DCA, TDA, CCA, LPP.
• Timing.
  - Works that come too early are not appreciated.
  - Works that come too late are not appreciated.
39. ACQUIRING A GOOD TASTE
Like music, film and fashion, taste is important in research.
Andrej Karpathy:
• "When you pitch a potential problem to your adviser you'll […] sense the excitement in their eyes as they contemplate the uncharted territory ripe for exploration. In that split second a lot happens: an evaluation of the problem's importance, difficulty, its sexiness, its historical context [...]"
• "A lot of the problems I was excited about at the time were in retrospect poorly conceived, intractable, or irrelevant."
A tricky part:
• "Even if you think your taste is impeccable, that doesn't matter one bit; to publish papers and earn a Ph.D., you need to do work that resonates with senior researchers in your field."
• A.k.a. the feeling of in-group belonging!
40. WHAT IS A GOOD PROBLEM? BY ANDREJ KARPATHY, 2016
• A fertile ground: a chain of papers, in short a research program. Think ahead. Pay attention to the field and how it is moving forward.
  - Later: ride the wave.
• Plays to your adviser's interests and strengths: oh, well! More support, promotion, connections, better-quality advice, a sense of alignment, a sense of group contribution.
• Be ambitious: [...] thinking 10x forces you out of the box, to confront the real limitations of an approach, to think from first principles, to change the strategy completely, to innovate.
• Ambitious but with an attack: important problems != great projects. Hamming: having an attack makes a problem important.
41. TASTE AND RESEARCH PERSONALITY
• The scientist gets into the fundamentals of things, curious about how things work, without worrying about the application.
  - Mostly a single method/problem for a long time. Many possible applications can result.
• The engineer wants to solve real-world problems.
  - A single application for a long time. Many possible methods apply.
42. BUILDING GRAND THEORIES
• They are more like meta-theories or paradigms (e.g., parallel distributed processing (PDP), AGI).
• They inspire research programs (e.g., representation learning, deep learning, machine reasoning).
• They are mostly not harmed by refutation (e.g., neural networks didn't work in the 1990s).
• They have sceptical critics as well as enthusiastic supporters (e.g., symbolic vs sub-symbolic).
Why theory?
• Those who build theory usually have a better scientific career.
43. PARADIGM, THEORY AND MODEL
• A paradigm is like a meta-theory; within it there are many competing theories.
• A theory concerns the nature of the processes in a particular domain.
• A model is an account of specific tasks and the behaviour associated with them.
  - Hence, theories normally originate in thought about [intelligence] mechanisms, and models in attempts to provide specific accounts of empirical data.
45. MORE ON AXES OF INVENTION
• Unstructured → structured (vector & set → sequence, tree, grid, graph)
• Low-dim → high-dim
• Dense → sparse
• Uninformative prior → informative prior
• Homogeneous → heterogeneous
• Single → ensemble
• One scale → nesting
• Static → dynamic, phase transition
• Diverse → universal
• Non-human → with human
46. CASE STUDY: BERT
• Non-contextual embedding → contextual embedding.
• It leverages lots of engineering advances.
• It is an engineering feat, NOT a conceptual breakthrough.
• It is related to pseudo-likelihood (invented in 1975!) and Markov random fields (invented in the 1970s or even earlier).
  - Whole-sentence language models were built in the 1990s.
• It can be beaten quickly (e.g., by XLNet, in just a year).
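The pseudo-likelihood connection can be spelled out (standard notation, a sketch rather than BERT's exact loss): Besag's pseudo-likelihood replaces the joint likelihood of a set of variables with a product of per-variable conditionals, and BERT's masked-token objective has the same shape, applied to a random subset of token positions:

```latex
% Besag (1975) pseudo-likelihood over variables x_1, ..., x_n:
\mathrm{PL}(\theta) = \prod_{i=1}^{n} p_\theta\big(x_i \mid \mathbf{x}_{\setminus i}\big)
% BERT-style masked LM loss; M is a random set of masked positions,
% and the model conditions on the unmasked remainder of the sentence:
\mathcal{L}_{\mathrm{MLM}}(\theta) = - \sum_{i \in M} \log p_\theta\big(x_i \mid \mathbf{x}_{\setminus M}\big)
```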
47. RIDING THE EARLY WAVES
• Neural networks, cycle 1: 1985-1995.
  - Backprop.
• SVMs: 1995-2005.
  - Nice theory, strong results for moderate data.
• Ensemble methods: 1995-2005.
  - Nice theory, strong results for moderate data.
• Probabilistic graphical models: 1992-2002.
  - Nice theory, opened up new thinking, difficult in practice.
• Topic models & Bayesian non-parametrics: 2002-2012.
  - Nice theory, indirect applications.
• Feature learning: 2005-2015.
  - Little theory, toy results, great excitement.
• Neural networks, cycle 2: 2012-present.
  - Differentiable programming, backprop + adaptive SGD.
  - Little theory, strong results for large data.
For PhD students: be an expert when the trend is at its peak!
48. MY OWN JOURNEY
• In 2004 I bet my PhD thesis on conditional random fields. They died when I finished in 2008.
• In 2007 I bet my theoretical research area on deep learning, back then restricted Boltzmann machines. RBMs died around 2012, but the field moved on and has now reached its peak.
• In 2012 I bet my applied research area on biomedicine. This is super-hot now.
• In 2017 I am betting on "learning to reason" → cognitive architectures.
49. THE CURSE OF TEXTBOOKS
Truyen's law: "whenever a universally accepted textbook is published, the field is saturated."
• Decision trees (1984)
• Neural nets, cycle 1 (1995)
• Information retrieval (1998)
• Reinforcement learning (1998)
• Multi-agent systems (2001)
• Ensemble methods (2001)
• Kernels (2002)
• Probabilistic graphical models (2006)
• Structured output learning (2007)
• Learning to rank (2009)
• Bayesian non-parametrics (2012)
• Deep learning (2016)
50. HOW MUCH COMPUTING POWER?
• The larger the better. It gives you freedom.
  - Sutton ("the bitter lesson"): generic methods + lots of compute win!
  - See the history of speech recognition and machine translation.
• However, for faster discovery and learning, it is better to maximize the number of reasonably good GPUs → more people, more parallel experiments.
• On the other hand, having little compute forces us to be smart. There are lots of important problems with lean data. See "You and Your Research" by Richard Hamming.
• Our brain uses very little energy (somewhere around 10 watts). There must be clever ways to arrange computation.
51. "SO YOU WANT TO BE A RESEARCH SCIENTIST" BY VINCENT VANHOUCKE, NOV 2018
• Research is about ill-posed questions with multiple (or no) answers.
• You will spend your entire career working on things that don't work.
• Your work will probably be obsolete the minute you publish it.
• With infinite freedom comes infinite responsibility.
• Paradoxically, much of research is about risk management.
• You will need to retool often.
• You'll have to subject yourself to intense scrutiny.
• Your entire career will largely be measured by one number.
• You won't work a day in your life.
https://medium.com/s/story/so-you-want-to-be-a-research-scientist-363c075d3d4c
52. WHAT TO OPTIMIZE?
Vincent Vanhoucke:
• "Measuring progress in units of learning, as opposed to units of solving, is one of the key paradigm shifts one has to undergo to be effective in a research setting."
Meaning?
• Read, think and write more; code less.
• Give priority to depth over breadth.
• If you code, choose fast prototyping.
• If you run experiments, make them small and fail fast, and learn from there.
• After acquiring a certain amount of empirical experience, don't just keep doing the same thing. The returns will diminish.