Cognitive plausibility in learning algorithms
With application to natural language processing
Arvi Tavast, PhD
Qlaara Labs, UT, TLU
Tallinn, 10 May 2016
Motivation
Why cognitive plausibility?
Objective: best product vs best research
Model the brain
End-to-end learning from raw unlabelled data
Grounded cognition
Cognitive computing, neuromorphic computing
Feedback loop: using the model to better understand the
object to be modelled
Outline
A heretical view of language, an established learning model, and an application to NLP
1 Introduction
2 Understanding humans
Understanding human communication
Understanding human learning
Rescorla-Wagner learning model
3 Results
4 Application
Naive Discriminative Learning
My background
mainly in linguistics
1993 TUT computer systems
1989-2004 IT translation
2000-2006 Microsoft MILS
2002 UT MA linguistics
2008 UT PhD linguistics
2015 University of Tübingen postdoc, quantitative linguistics
Understanding human communication
How do we explain the observation that verbal communication sometimes works?
The channel metaphor
Speaking is like sending things by train, selecting suitable
wagons (words) for each thing (thought)
Hearing is like decoding the message
⇒ meanings are properties of words
Communication as uncertainty reduction
Speaking is like sending blueprints for building things, which
the receiver will have to follow (subject to their abilities,
available materials, etc.)
Hearing is like using hints to reduce our uncertainty about
the message
⇒ meanings are properties of people
Understanding human communication
When can the channel metaphor work?
The encoding of a message must have at least as many
discriminable states as the message to be encoded
or:
Encoding thoughts with words can only work if the number
of possible thoughts is smaller than or equal to the number
of possible words
This is the case only in restricted domains (e.g. weather forecasts)
Compare: reconstructing a document based on its hash sum
Understanding human learning
Compositional vs discriminative
Possible ways of conceptualising biological learning
Compositional model: we start as a blank page, adding
knowledge like articles in an encyclopedia
Discriminative model: we start by perceiving a single object
(the world) and gradually learn to discriminate between its
parts
If discriminative:
Human language models cannot be constant across time or
subjects
The Rescorla-Wagner learning model
Language acquisition can be described as building statistical relationships between cues and outcomes
The Rescorla-Wagner model: how do we learn that cue Cj predicts outcome O? (update rule sketched after the list below)
if we see that Cj ⇒ O, the relationship is strengthened
less, if there are other cues
if we see that Cj ⇒ ¬O, the relationship is weakened
more, if there are other cues
(if we see that ¬Cj ⇒ O, the relationship is weakened)
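A minimal sketch of one Rescorla-Wagner update step in R, assuming a single combined salience/learning-rate parameter (alpha_beta) and lambda = 1 when the outcome is present; rw_update and its arguments are hypothetical names, not part of the ndl API.

rw_update <- function(weights, present_cues, outcome_present,
                      alpha_beta = 0.1, lambda_max = 1) {
  lambda  <- if (outcome_present) lambda_max else 0   # target activation
  v_total <- sum(weights[present_cues])               # summed support of all present cues
  delta   <- alpha_beta * (lambda - v_total)          # shared prediction error
  weights[present_cues] <- weights[present_cues] + delta
  weights                                             # absent cues are left unchanged
}

With more cues present, v_total grows, so each cue is strengthened less when the outcome occurs and weakened more when it does not, which is the behaviour listed above.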
Feature-label-order effect
Creating the relationship between word and concept is only possible in one direction
Feature-label-order effect
If concept ⇒ word, the relationship is strengthened
If word ⇒ concept, the relationship is not strengthened
Number of objects in the world ≫ number of words in the
language
Abstraction inevitably and irreversibly discards information
Recovering a meaning from a word is necessarily
underspecified
Ramscar, M., Yarlett, D., Dye, M., Denny, K., and Thorpe, K. (2010). The effects of feature-label-order and their
implications for symbolic learning. Cognitive Science, 34(6), 909–957.
Aging and cognitive decline
Why do our verbal abilities seem to fail around the age of 65?
Ramscar, M., Hendrix, P., Shaoul, C., Milin, P., and Baayen, H. (2014). The myth of cognitive decline: Non-linear dynamics
of lifelong learning. Topics in Cognitive Science, 6(1), 5–42.
Morphology
Implicit morphology (without morphemes)
[Figure: boundary-marked letter trigram cues (e.g. #mA, #tA, mtA, tAk, ki#, ###) with learned association weights between roughly 0.1 and 0.6, illustrating morphology without explicit morphemes; trigram coding is sketched below]
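A minimal sketch (plain R, not the ndl API, which provides orthoCoding() for this purpose) of how a word form can be coded as the boundary-marked letter trigram cues shown in the figure; the example word is illustrative.

to_trigrams <- function(word, boundary = "#") {
  s <- paste0(boundary, word, boundary)               # add word boundary markers
  sapply(seq_len(nchar(s) - 2), function(i) substr(s, i, i + 2))
}
to_trigrams("mata")   # "#ma" "mat" "ata" "ta#"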
Naive Discriminative Learning
The R package: installation and basic usage
ndl: https://cran.r-project.org/web/packages/ndl/index.html
ndl2 (+ incremental learning): contact the authors
wm <- estimateWeights(events)  # Danks equilibria
wm <- learnWeights(events)     # incremental, ndl2 only
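A minimal sketch of the basic workflow, assuming a toy events data frame with the three columns shown on the next slide; within one event the individual cues are joined with underscores, and the rows here are illustrative.

library(ndl)
events <- data.frame(
  Cues      = c("aadress_S_SG_N", "aadress_S_PL_P", "aadress_S_SG_AD"),
  Outcomes  = c("aadress", "aadresse", "aadressil"),
  Frequency = c(1, 1, 4),
  stringsAsFactors = FALSE
)
wm <- estimateWeights(events)   # cue-by-outcome weight matrix (Danks equilibria)
round(wm, 3)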
Naive Discriminative Learning
Input data for Danks estimation: frequencies
Outcomes   | Cues             | Frequency
aadress    | aadress S SG N   | 1
aadresse   | aadress S PL P   | 1
aadressil  | aadress S SG AD  | 4
aadressile | aadress S SG ALL | 1
aasisid    | aasima V SID     | 1
aasta      | aasta S SG G     | 2
aasta      | aasta S SG N     | 1
aastane    | aastane A SG N   | 48
Naive Discriminative Learning
Input data for incremental learning: single events (one row per token; expansion sketched after the table)
Outcomes   | Cues             | Frequency
aadress    | aadress S SG N   | 1
aadresse   | aadress S PL P   | 1
aadressil  | aadress S SG AD  | 1
aadressil  | aadress S SG AD  | 1
aadressil  | aadress S SG AD  | 1
aadressil  | aadress S SG AD  | 1
aadressile | aadress S SG ALL | 1
aasisid    | aasima V SID     | 1
aasta      | aasta S SG G     | 1
aasta      | aasta S SG G     | 1
aasta      | aasta S SG N     | 1
aastane    | aastane A SG N   | 1
aastane    | aastane A SG N   | 1
aastane    | aastane A SG N   | 1
...
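A minimal sketch of how a frequency table like the one on the previous slide could be expanded into single events; freq_table is a hypothetical data frame with the Outcomes, Cues and Frequency columns shown above.

# Repeat each row as many times as its frequency, then set Frequency to 1.
events <- freq_table[rep(seq_len(nrow(freq_table)), freq_table$Frequency), ]
events$Frequency <- 1
# Incremental learning is order-sensitive, so shuffle (or keep corpus order).
events <- events[sample(nrow(events)), ]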
Naive Discriminative Learning
Output: weight matrix, cues x outcomes
Cues               | Outcomes      | Application
letter ngrams      | words         | reading
character features | words         | reading
words              | lexomes       | POS tagging
lexomes            | letter ngrams | morphological synthesis
contexts           | words         | distributional semantics
audio signal       | words         | speech recognition
words              | audio signal  | speech synthesis
Naive Discriminative Learning
About the weight matrix
What we can look at (sketched in code below):
Similarity of outcome vectors
Similarity of cue vectors
MAD (median absolute deviation) of outcome vector
Competing cues
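A minimal sketch of these inspections, assuming wm is the cue-by-outcome weight matrix from estimateWeights(); cosine() is a helper defined here (not an ndl function) and the row and column names are illustrative.

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
cosine(wm[, "aasta"], wm[, "aastane"])            # similarity of two outcome vectors (columns)
cosine(wm["SG", ], wm["PL", ])                    # similarity of two cue vectors (rows)
mad(wm[, "aadressil"])                            # spread of one outcome's support across cues
sort(wm[, "aadressil"], decreasing = TRUE)[1:5]   # strongest competing cues for one outcome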
Naive Discriminative Learning
About the weight matrix
Other properties:
No dimensionality reduction (tested with matrices of ca 200k x 100k)
Danks equations are subject to R's 2^32 element limit (matrix
pseudoinverse)
Slow (weeks on ca 16 cores with 200 GB of RAM)
Performance below word2vec and similar models, but comparable
Some NLP tools
How to get started quickly with NLP
Python: NLTK, EstNLTK, Gensim (incl. word2vec), DISSECT
Java: GATE (also web), Stanford NLP, Deeplearning4j (incl. word2vec)
C: word2vec
R: ndl
Language understanding
What’s missing from full language understanding
Training material
Inter-annotator agreement is less than perfect
The corpus is heterogeneous
This is not a methodological flaw
Communicative intent and self-awareness
If cues are lexomes (=what the speaker wanted to say), the
system must want something.
Thanks for listening
Contacts and recommended reading
Contact
arvi@qlaara.com
Easy reading
blog.qlaara.com
Recommended reading
Harald Baayen
www.sfs.uni-tuebingen.de/hbaayen/
Michael Ramscar
https://michaelramscar.wordpress.com/
