Can Cognitive Neuroscience Provide a Theory of Deep Learning Capacity?
Ted Willke and the Mind’s Eye Team
Intel Labs
May 20, 2016
“Breakthrough innovation occurs when we bring down boundaries
and encourage disciplines to learn from each other”
― Gyan Nagpal, Talent Economics: The Fine Line Between
Winning and Losing the Global War for Talent
MIND’S EYE
Cognitive Neuroscience
Cognitive Neuroscience
• Is the study of the neurobiological mechanisms that underlie cognitive processes, like attention, control, and decision making
• Answers questions like: How does the brain coordinate behaviour to achieve goals? What are the brain structures upon which these functions depend? How does brain function differ amongst people?
• Draws upon brain imaging/recordings and other observations to derive models
Context-Dependent Decision Making
Michael Shvartsman, Vaibhav Srivastava, Narayanan Sundaram, Jonathan D. Cohen, “Using behavior to decode allocation of attention in context dependent decision making,” accepted at International Conference on Cognitive Modeling, 2016.
Selective Forgetting
Ghootae Kim, Jarrod A. Lewis-Peacock, Kenneth A. Norman, Nicholas B. Turk-Browne, “Pruning of memories by context-based prediction error,” Proceedings of the National Academy of Sciences, 2014.
Production and comprehension of naturalistic narrative speech
Silbert LJ, Honey CJ, Simony E, Poeppel D, Hasson U (2014) Coupled neural systems underlie the production and comprehension of naturalistic narrative speech. Proc Natl Acad Sci USA 111:E4687-4696.
CRACKS APPEAR, DISRUPTIVE IDEAS 30 years on
MIT Press, 1986
Cognitive Neuroscience
Adapted from Marvin Minsky in Artificial Intelligence at MIT, Expanding Frontiers, Patrick H. Winston (Ed.), Vol. 1, MIT Press, 1990. Reprinted in AI Magazine, Summer 1991
Neural networks
Neural Network preliminaries
http://wiki.apache.org/hama/MultiLayerPerceptron
Neural Network preliminaries
LeCun et al., “Deep Learning” in Nature (2015)
Arbitrary functions
https://upload.wikimedia.org/wikipedia/commons/7/7b/XOR_perceptron_net.png
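The figure linked above is the classic two-layer XOR network. As a minimal sketch (the weights below are a standard textbook construction, not values read off the figure), a single hidden layer of threshold units is already enough to compute XOR, which no single perceptron can represent:

    import numpy as np

    def step(x):
        return (x > 0).astype(float)

    W1 = np.array([[1.0, 1.0],     # columns: hidden unit 1 (~OR), hidden unit 2 (~AND)
                   [1.0, 1.0]])
    b1 = np.array([-0.5, -1.5])    # thresholds: 0.5 for OR, 1.5 for AND
    W2 = np.array([1.0, -1.0])     # output fires for "OR and not AND"
    b2 = -0.5

    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        h = step(np.array(x) @ W1 + b1)
        y = step(h @ W2 + b2)
        print(x, int(y))           # prints the XOR truth table: 0, 1, 1, 0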
The original tenets of parallel distributed processing (roughly)
1. Cognitive processes arise from the real-time propagation of activation via weighted connections
2. Active representations are patterns of activation distributed over ensembles of units
3. Processing is interactive (bidirectional)
4. Knowledge is encoded in the connection weights (not in a separate store)
5. Learning and long-term memory depend on changes to these weights
6. Processing, learning, and representation are graded and continuous
7. Processing, learning, and representation depend on the environment
T.T. Rogers, J.L. McClelland, Cognitive Science 38 (2014)
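A toy illustration (mine, not from the slides) of tenets 1-5: a distributed pattern is stored purely in symmetric, bidirectional connection weights, and interactive propagation of activation lets a degraded input settle back to the stored representation:

    import numpy as np

    pattern = np.array([1, -1, 1, -1, 1])   # a distributed representation over 5 units
    W = np.outer(pattern, pattern)          # Hebbian weights: the knowledge lives here
    np.fill_diagonal(W, 0)

    state = np.array([1, 1, 1, -1, 1])      # degraded input (one unit flipped)
    for _ in range(5):                      # interactive (bidirectional) settling
        state = np.sign(W @ state)
    print(state)                            # recovers [ 1 -1  1 -1  1]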
Brain-Inspired machine learning
Structure-Inspired Learning
• Neurons (e.g., spiking models)
• Networks (e.g., deep belief networks)
• Architectures (e.g., Human Brain Project)
Cognitive-Inspired Learning
• Reinforcement Learning
• Context-based Memory
• Noisy Decision Making
"Gray754" by Henry Vandyke Carter - Henry Gray
(1918) Anatomy of the Human Body
Deep learning takes advantage of parallel distributed processing
http://www.amax.com/blog/wp-content/uploads/2015/12/blog_deeplearning3.jpg
Winning top spots in visual recognition challenges, etc.
(1) Lin et al., 2015, (2) https://www.cityscapes-dataset.com/dataset-overview/ (3) Deng et al., 2009 (4) http://lsun.cs.princeton.edu/2015.html
MS COCO (Common Objects in Context), CityScapes (Semantic Understanding), ImageNet (Object Localization), LSUN (Saliency Prediction)
Yang et al. (2015)
What are sitting in the basket on a bicycle?
Yang et al. (2015)
Stacked Attention Networks for Image Question Answering
The Glory and the remaining mystery
We have achieved…
• Exceeding human-level performance on visual recognition tasks
• Mastering more and more complex games (Go)
• Demonstrating human-level control in reinforcement learning (Atari)
• Question-answering and other AI services are upon us
but we still don’t know…
• How learnt (feature) representations are encoded (or whether they converge for the same networks trained on the same data)
• The capacity for learning representations
• The trade-off between efficiency of representation and flexibility of processing
• How the things learnt interfere with each other
Representations and Learning Capacity
Li et al. (ICLR 2016)
Representation encoding: meaningful and consistent?
Can we reliably map feature representations between these networks?
Li et al. (ICLR 2016)
Convergent Learning?
Conclusions:
1. Some features are learned reliably in multiple nets (some are not)
2. Units learn to span low-D subspaces, which are common (but specific basis vectors are not)
3. Representations are encoded as a mix of single unit and slightly distributed codes
4. Mean activation values across different networks converge to a nearly identical distribution
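A minimal sketch of the kind of analysis behind these conclusions (random placeholder activations and a greedy matching here, not the paper’s code): correlate each hidden unit in one network with every unit in the other over the same stimuli and look for reliable one-to-one matches:

    import numpy as np

    rng = np.random.default_rng(0)
    acts_a = rng.standard_normal((1000, 64))   # stimuli x units, network A (placeholder)
    acts_b = rng.standard_normal((1000, 64))   # stimuli x units, network B (placeholder)

    za = (acts_a - acts_a.mean(0)) / acts_a.std(0)
    zb = (acts_b - acts_b.mean(0)) / acts_b.std(0)
    corr = za.T @ zb / len(za)                 # 64 x 64 cross-network correlation matrix

    best_match = corr.argmax(axis=1)           # best-matching unit in B for each unit in A
    best_corr = corr.max(axis=1)
    print(best_corr.mean())                    # near 0 for random data; high values would
                                               # indicate features learned in both nets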
Can cognitive neuroscience provide any insight into the nature of learning and task capacity?
The appeal of highly-parallel neural networks
Both cognitive neuroscience and machine learning applications exploit the following two features of neural networks to great benefit:
a) The ability to learn and process complex representations, taking into account a large number of interrelated and interacting constraints
b) The ability for the same network to process a wide range of potentially disparate representations (or tasks), sometimes called “multitask learning”
But what are their limits?
The brain: The black box at the end of our necks
• Facts:
 - Only 2% of body weight but uses up to 20% of energy
 - ~200B neurons
 - Neurons fire at up to ~1 kHz
 - 1K to 10K connections per neuron
• Cerebral neocortex:
 - ~20B neurons
 - ~125 trillion synapses
There are more ways to organize the neocortex’s ~125 trillion synapses than stars in the known universe
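A rough back-of-envelope check of that claim (assumed figures: roughly $10^{24}$ stars in the observable universe, and each synapse treated as merely present or absent): $2^{1.25\times10^{14}} \approx 10^{3.8\times10^{13}} \gg 10^{24}$, so even this crude binary count dwarfs the number of stars.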
The paradox – one task at a time
A fundamental puzzle concerning human processing
Why, in some circumstances, is the brain capable of a remarkable degree of parallelism (e.g., locomotion, navigation, speech, and bimanual gesticulation), while in others its capacity for parallelism is radically limited (e.g., the inability to conduct mental arithmetic while constructing a grocery list at the same time)?
A theory
• The difference in multitasking ability may reflect the degree to which different tasks rely on shared representations
• The more that different processes interact, the stronger the imposition of seriality
• May reflect a fundamental trade-off in neural network architectures between the efficiency of shared representations (and the capacity for generalization that they afford) and the effectiveness of multitasking
Multi-tasking and cross-talk
Feng et al. (CABN 2014)
You will see a sequence of words.
Quickly say the color of the letters.
SNOW
Ready!
BLUE, RED, BLACK, GREEN, BLACK, BLUE, GREEN, RED, BLUE
Now with the words upside down.
BLACK, GREEN, BLACK, RED
Were you faster to answer?
A Demonstration of interference
Stroop (1935)
Multi-tasking interference (in the Stroop test)
Cohen et al. (1990)
(Model diagram: Color and Word input pathways, Task units, Verbalize response)
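A minimal sketch of the idea behind that model (the weights and the gain-style task bias are my own assumptions, not the published parameters): word and colour inputs converge on a shared response layer, the word pathway is stronger because it is more practised, so naming the ink colour of an incongruent word needs top-down task support and still shows cross-talk:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    colour_in = np.array([1.0, 0.0])   # ink colour: RED
    word_in   = np.array([0.0, 1.0])   # the word reads: GREEN

    W_colour = 1.0 * np.eye(2)         # weaker, less-practised pathway
    W_word   = 2.5 * np.eye(2)         # stronger, highly practised pathway
    task_gain_colour = 3.0             # top-down bias toward the colour-naming task

    response = softmax(task_gain_colour * (W_colour @ colour_in) + W_word @ word_in)
    print(response)                    # ~[0.62, 0.38]: RED wins, but only weakly,
                                       # because the word pathway leaks through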
Control-Demanding Behavior (Feng et al. 2014)
• First to describe the trade-off between the efficiency of representation (“multiplexing”) and the simultaneous engagement of different processing pathways (“multitasking”)
• Showed that even a modest amount of multiplexing rapidly introduces cross-talk among processing pathways
• Proposed that the large advantages of efficient encoding have driven the human brain to favour this over the capacity for control-demanding processes
Types of interference
Maximum independent set (MIS)
The MIS is the largest set of processes in the network that can be simultaneously executed without interference.
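A minimal sketch of that definition (toy graph, not from the talk): build an interference graph whose nodes are processes and whose edges join pairs that share a pathway, then brute-force the largest edge-free subset:

    from itertools import combinations

    tasks = ["A", "B", "C", "D", "E"]
    interferes = {("A", "B"), ("B", "C"), ("C", "D")}   # pairs that share a representation

    def independent(subset):
        return all((u, v) not in interferes and (v, u) not in interferes
                   for u, v in combinations(subset, 2))

    mis = max((s for r in range(len(tasks) + 1)
               for s in combinations(tasks, r) if independent(s)), key=len)
    print(mis)   # ('A', 'C', 'E'): at most three of these tasks can run in parallel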
Network structure (distribution complexity)
• The network capacity for multitasking depends on the distribution of in-degrees and out-degrees of the network (here we only vary the in-degree of the output components)
• We represent this with a “distribution complexity” symmetry measure, maximized for a uniform distribution (one possible reading of the measure is sketched below)
• We study the characteristics of the network with DC fixed
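As a sketch of how such a measure could be computed (my reading of the slide, not the paper’s exact definition): treat distribution complexity as the entropy of the in-degree distribution, which peaks when the in-degrees are spread uniformly:

    import numpy as np

    def distribution_complexity(in_degrees):
        counts = np.bincount(in_degrees)   # how many output components have each in-degree
        p = counts[counts > 0] / counts.sum()
        return -(p * np.log2(p)).sum()     # Shannon entropy, maximal for a uniform spread

    print(distribution_complexity(np.array([2, 2, 2, 2])))   # 0.0: every node identical
    print(distribution_complexity(np.array([1, 2, 3, 4])))   # 2.0: maximally spread out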
Takeaway: Even modest amounts of process overlap impose dramatic constraints on parallel processing capability
Trade-off between generalization and parallelism: Feed-Forward simulation
Training/Test details
Training
• 20 network groups, 20 random initializations per group
• All networks trained on the same stimuli, 16 tasks
• Trained to generate 1-hot task outputs (MSE < 0.0001)
Test
• 70/30 train/test split
• Generalization is the average MSE over ALL stimuli in the test set
• Parallel processing is measured by activating 2, 3, or 4 tasks simultaneously and computing the MSE against the target pattern (sketched below)
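A minimal sketch of the two measurements (the shapes and the `net(stimulus, task_vector)` callable are placeholders of mine, not the talk’s code): generalization is scored on held-out single-task trials, while multitasking is scored by switching on 2, 3, or 4 task units at once and comparing the output with the combined target:

    import numpy as np
    from itertools import combinations

    def mse(pred, target):
        return ((pred - target) ** 2).mean()

    def generalization_error(net, stimuli, task_onehots, targets):
        # average MSE over ALL held-out single-task trials
        return np.mean([mse(net(x, t), y)
                        for x, t, y in zip(stimuli, task_onehots, targets)])

    def multitask_error(net, stimulus, n_tasks, k, target_for):
        # activate k tasks (k = 2, 3, or 4) at once, score against the combined target
        errs = []
        for subset in combinations(range(n_tasks), k):
            task_vec = np.zeros(n_tasks)
            task_vec[list(subset)] = 1.0
            errs.append(mse(net(stimulus, task_vec), target_for(subset)))
        return np.mean(errs)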
Shared Representations
(a) Smaller weights, (b) Larger weights
Generalization vs parallel processing capability
Parallel processing capability vs max initial weights
Future work
• Extend analysis to weighted graphs
• Study more complex networks (e.g., deeper structures, recurrent connections)
• Study human performance (via neuroimaging data)!
C. elegans
The OpenWorm Project (image generated by neuroConstruct)
SINCE 1986
Thank you!

Ted Willke, Sr Principal Engineer, Intel at MLconf SEA - 5/20/16