José Hernández-Orallo, Full Professor, Department of Information Systems and Computation at the Universitat Politècnica de València, presentation “Evaluating Cognitive Systems: Task-oriented or Ability-oriented?” as part of the Cognitive Systems Institute Speaker Series.
1. José Hernández-Orallo
Dep. de Sistemes Informàtics i Computació,
Universitat Politècnica de València
jorallo@dsic.upv.es
Talk for the Cognitive Systems Institute Speaker series
8 June 2017
* Based on parts of the book:
“The Measure of All Minds”:
http://allminds.org
2. EVALUATING COGNITIVE SYSTEMS: TASK-ORIENTED OR ABILITY-ORIENTED?
“Greatest accuracy, at the frontiers of science,
requires greatest effort, and probably the most
expensive or complicated of measurement
instruments and procedures”
(David Hand, 2004).
3. COGNITIVE SYSTEMS: MUCH MORE THAN AI
Computers:
AI or AGI systems, robots, bots, …
Cognitively-enhanced organisms, cognitive prosthetics
Cyborgs, technology-enhanced humans
Biologically-enhanced computers:
Human computation and their data
(Hybrid) collectives
Virtual social networks, crowdsourcing
Minimal or rare cognition
Artificial life (more like bacteria, plants, etc.)
Societal impact on work, leisure, health, etc. is difficult to assess because we do not know the cognitive capabilities of all these new systems.
4. THE EVALUATION DISCORDANCE: AI EVALUATION
Edited image, originally from wikicommons
"[AI is] the science of making
machines do things that would require
intelligence if done by [humans]."
Marvin Minsky (1968).
They can do the “things” (tasks) without featuring intelligence.
Once the task is solved (“superhuman”), it is no longer an AI problem (the “AI effect”).
AI has progressed very significantly (see, e.g., Nilsson, 2009, chap. 32, or Bostrom, 2014, Table 1, pp. 12–13).
But AI is now full of idiot savants.
5. THE EVALUATION DISCORDANCE: AI EVALUATION
Specific (task-oriented) AI systems
Machine translation, information retrieval, summarisation
PR: computer vision, speech recognition, etc.
Robotic navigation
Driverless vehicles
Prediction and estimation
Planning and scheduling
Automated deduction
Knowledge-based assistants
Game playing
Warning! Intelligence NOT included. (stamped on each system above)
All images from wikicommons
6. THE EVALUATION DISCORDANCE: AI EVALUATION
Specific domain evaluation settings:
CADE ATP System Competition PROBLEM BENCHMARKS
Termination Competition PROBLEM BENCHMARKS
The reinforcement learning competition PROBLEM BENCHMARKS
Program synthesis (Syntax-guided synthesis) PROBLEM BENCHMARKS
Loebner Prize HUMAN DISCRIMINATION
Robocup and FIRA (robot football/soccer) PEER CONFRONTATION
International Aerial Robotics Competition (pilotless aircraft) PROBLEM BENCHMARKS
DARPA driverless cars, Cyber Grand Challenge, Rescue Robotics PROBLEM BENCHMARKS
The planning competition PROBLEM BENCHMARKS
General game playing AAAI competition PEER CONFRONTATION
BotPrize (videogame player) contest HUMAN DISCRIMINATION
World Computer Chess Championship PEER CONFRONTATION
Computer Olympiad PEER CONFRONTATION
Annual Computer Poker Competition PEER CONFRONTATION
Trading agent competition PEER CONFRONTATION
Robo Chat Challenge HUMAN DISCRIMINATION
UCI repository, PRTools, or KEEL dataset repositories PROBLEM BENCHMARKS
KDD-cup challenges and ML Kaggle competitions PROBLEM BENCHMARKS
Machine translation corpora: Europarl, SE times corpus, the euromatrix, Tenjinno competitions… PROBLEM BENCHMARKS
NLP corpora: linguistic data consortium, … PROBLEM BENCHMARKS
Warlight AI Challenge PEER CONFRONTATION
The Arcade Learning Environment PROBLEM BENCHMARKS
Pathfinding benchmarks (gridworld domains) PROBLEM BENCHMARKS
Genetic programming benchmarks PROBLEM BENCHMARKS
CAPTCHAs HUMAN DISCRIMINATION
Graphics Turing Test HUMAN DISCRIMINATION
FIRA HuroCup humanoid robot competitions PROBLEM BENCHMARKS
…
7. THE EVALUATION DISCORDANCE: AI EVALUATION
Cognitive robots
Intelligent assistants
Pets, animats and other artificial companions
Smart environments
Agents, avatars, chatbots
Web-bots, Smartbots, Security bots…
How to evaluate general-purpose systems and cognitive components?
Warning! Some intelligence MAY BE included. (stamped on each system above)
8. THE EVALUATION DISCORDANCE: AI EVALUATION
“Mythical Turing Test” (Sloman, 2014)
and its myriad variants…
Mythical human-level machine intelligence
A red herring for general-purpose AI!
9. THE EVALUATION DISCORDANCE: AI EVALUATION
What benchmarks? More comprehensive?
ARISTO (Allen Institute for AI) : College science exams
Winograd Schema Challenge : Questions targeting understanding.
Weston et al. “AI-Complete Question Answering” (bAbI)
CLEVR : Relations over visual objects
BEWARE: AI-completeness has been claimed before, for calculation, chess, Go, the Turing test, …
Now AI is superhuman on most of them!
(e.g., https://arxiv.org/pdf/1706.01427.pdf)
10. THE EVALUATION DISCORDANCE: TEST MISMATCH
What about psychometric tests or animal tests in AI?
In 2003, Sanghi & Dowe wrote a simple program that passed many IQ tests:
about 960 lines of code in Perl!
This made the point unequivocally: programs passing IQ tests are not necessarily intelligent.
11. THE EVALUATION DISCORDANCE: TEST MISMATCH
This has not been a deterrent!
Psychometric AI (Bringsjord and Schimanski 2003): an “agent is
intelligent if and only if it excels at all established, validated tests of intelligence”.
Detterman, editor of the journal Intelligence, posed “A
challenge to Watson” (Detterman 2011)
2nd level to “be truly intelligent”: tests not seen
beforehand.
“IQ tests are not for machines, yet” (Dowe & Hernandez-Orallo
2012)
12. THE EVALUATION DISCORDANCE: TEST MISMATCH
What about developmental tests (or tests for children)?
Developmental robotics:
Battery of tests (Sinapov, Stoytchev, Schenk 2010-13)
Cognitive architectures:
Newell “test” (Anderson and Lebiere 2003)
“Cognitive Decathlon” (Mueller 2007)
AGI: high-level competency areas (Adams et al. 2012), task breadth (Goertzel et al. 2009, Rohrer 2010), robot preschool (Goertzel and Bugaj 2009)
(Slide annotations: a taxonomy for cognitive architectures; a psychometric taxonomy, CHC)
13. THE EVALUATION DISCORDANCE: TEST MISMATCH
Adapting tests between disciplines (AI, psychometrics, comparative
psychology) is problematic:
Test from one group only valid and reliable for the original group.
Not necessary and/or not sufficient for the ability.
Machines and hybrids represent a new population.
Nowadays, many benchmarks assume that AI will use deep learning or millions of examples.
But machines and hybrids are also an opportunity to understand how
to evaluate cognition. Still,
We need a different foundation
14. THE ALGORITHMIC CONFLUENCE: WHAT IQ TESTS MEASURE
“Beyond the Turing Test”…
“Intelligence” definition and test (C-test) based on algorithmic
information theory (Hernandez-Orallo 1998-2000).
Letter series common in cognitive tests (Thurstone).
Here generated from a TM with properties (projectibility, stability, …).
Their difficulty is calculated by Kt
Linked with Levin’s universal search, Solomonoff’s inductive
inference, Kolmogorov complexity.
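As a toy illustration of the idea (the rule encoding, function names and the difficulty formula are simplifications of mine, not the actual C-test generator), a Thurstone-style letter series can be produced by a tiny “program” of cyclic increments, with a Levin-style Kt difficulty (description length plus log of the running time):

```python
import math

def next_letter(c, k=1):
    """Advance letter c by k positions in the alphabet, wrapping around."""
    return chr((ord(c) - ord('a') + k) % 26 + ord('a'))

def generate_series(start, rule, n):
    """Generate an n-letter series by applying a list of increments cyclically.

    The rule list plays the role of the generating program: shorter
    rules give series that are easier to continue.
    """
    out, c = [start], start
    for i in range(n - 1):
        c = next_letter(c, rule[i % len(rule)])
        out.append(c)
    return ''.join(out)

def kt_difficulty(rule, n):
    """Toy Kt-style difficulty: length of the description (the rule)
    plus log2 of the steps needed to print the series, echoing
    Levin's Kt = |p| + log(time)."""
    return len(rule) + math.log2(max(n, 2))
```

For instance, generate_series('a', [1], 5) gives 'abcde', and a three-increment rule is assigned a higher difficulty than a one-increment rule over a series of the same length.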
15. THE ALGORITHMIC CONFLUENCE: WHAT IQ TESTS MEASURE
Metric derived by slicing performance by difficulty h (given by Kt) and aggregating across difficulties.
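The equation on the slide was an image; a hedged reconstruction consistent with the surrounding text (with $r$ an assumed per-item score and $w$ an assumed weighting over difficulties) is:

```latex
% Performance of subject \pi at difficulty h: expected score over
% items x whose Kt-difficulty is h; then aggregate across difficulties.
\Psi(\pi, h) = \mathbb{E}\!\left[\, r(\pi, x) \mid Kt(x) = h \,\right]
\qquad
\Psi(\pi) = \sum_{h} w(h)\, \Psi(\pi, h)
```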
This is IQ-test re-engineering!
Intelligence no longer “what intelligence tests measure” (Boring, 1923).
Clues about what IQ tests really measure? Inductive inference.
Human performance correlated with the difficulty (h) of each exercise.
But remember Sanghi and Dowe 2003!
16. THE ALGORITHMIC CONFLUENCE: SITUATED TESTS
Passive to interactive view:
Intelligence as performance in a range of worlds.
The set of worlds M is described by Turing machines.
Intelligence is measured as an aggregate over environments, where R aggregates the rewards ri and p assigns probabilities to environments. How?
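The aggregate equation was an image on the slide; in the style of Legg and Hutter's universal intelligence (a reconstruction, with $V^{\pi}_{\mu}$ the R-aggregated reward of policy $\pi$ in environment $\mu$) it reads:

```latex
% Expected performance over the class of environments M, weighted by
% an environment distribution p; Legg and Hutter choose
% p(\mu) = 2^{-K(\mu)}, where K is Kolmogorov complexity.
\Upsilon(\pi) = \sum_{\mu \in M} p(\mu)\, V^{\pi}_{\mu}
```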
(Diagram: agent π interacts with environment μ, exchanging actions ai, observations oi and rewards ri.)
17. THE ALGORITHMIC CONFLUENCE: SOLUTIONAL APPROACH
Three approaches, along two dimensions: range of difficulties and diversity of solutions.
Choices: [universal, e.g. Legg and Hutter], [uniform] [universal], [universal] [uniform] [uniform].
(With the choices in brackets, they are NOT equivalent.)
18. THE ALGORITHMIC CONFLUENCE: SOLUTIONAL APPROACH
A different view of “general intelligence”:
Policy-general intelligence: aggregate by difficulty (e.g., bounded
uniform distribution) and for each difficulty look for diversity.
Connected to the task-independence of the g factor.
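A minimal sketch of this kind of aggregation under a bounded uniform distribution over difficulties (names and data are mine for illustration; the actual proposal looks for diversity of policies, not just scores): every difficulty slice receives the same weight, however many tasks fall into it:

```python
from collections import defaultdict

def policy_general_score(results):
    """Aggregate (difficulty, score) pairs with a uniform weight per
    difficulty slice: average the scores within each difficulty level,
    then average those slice means, so no difficulty level dominates
    just because it contains more tasks."""
    by_h = defaultdict(list)
    for h, score in results:
        by_h[h].append(score)
    slice_means = [sum(v) / len(v) for v in by_h.values()]
    return sum(slice_means) / len(slice_means)
```

With two easy tasks scoring 1.0 and 0.0 and one hard task scoring 0.5, both slices contribute 0.5, so the aggregate is 0.5.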
Raises a fascinating question: Is there a universal g factor?
Ability to find, integrate and emulate a diverse range of successful policies.
19. FROM TASKS TO ABILITIES: CLUSTERING BY SIMILARITY
Focus first on intermediate levels between tasks and abilities:
Do we have an intrinsic notion of similarity between tasks?
Task breadth? How to arrange abilities?
Hierarchically (e.g., Cattell-Horn-Carroll)
Spatially (e.g., Guttman’s model)
20. FROM TASKS TO ABILITIES: CLUSTERING BY SIMILARITY
Example (elementary cellular automaton (ECA) rules as tasks).
Task description is not used. No population is used either.
The best solutions are used instead and compared.
Using similarity as difficulty increases (18 rules of difficulty 8):
Figures: dendrogram using complete linkage; metric multidimensional scaling.
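The clustering step behind such a dendrogram can be sketched as follows (a pure-Python stand-in with hypothetical distances; in practice one would use standard hierarchical-clustering tools):

```python
def complete_linkage_clusters(dist, threshold):
    """Agglomerative clustering with complete linkage.

    dist: symmetric matrix of pairwise task distances (e.g. one minus
    the similarity of the tasks' best solutions).  Repeatedly merges
    the two closest clusters until the smallest complete-linkage
    distance exceeds threshold; returns clusters as sets of indices.
    """
    clusters = [{i} for i in range(len(dist))]

    def d(a, b):
        # Complete linkage: distance between the farthest pair of members.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best, ai, bi = min((d(a, b), ai, bi)
                           for ai, a in enumerate(clusters)
                           for bi, b in enumerate(clusters) if ai < bi)
        if best > threshold:
            break
        clusters[ai] |= clusters[bi]
        del clusters[bi]
    return clusters
```

With three tasks where tasks 0 and 1 are near (distance 0.1) and task 2 is far (0.9), a threshold of 0.5 groups 0 and 1 together and leaves 2 on its own.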
21. NEW AI EVALUATION PLATFORMS: A COSMOS
Here they are:
Facebook’s bAbI
Arcade Learning Env. (Atari)
Video Game Definition Language
OpenAI Gym and Universe
Microsoft’s Project Malmo
DeepMind Lab
Facebook’s TorchCraft
Facebook’s CommAI
AI Magazine report: “A New AI Evaluation Cosmos”
Most (except CommAI) are oriented towards the evaluation of very embodied AI, but what about more abstract cognition?
22. IS THIS SUFFICIENT? OPEN QUESTIONS
What do these platforms / tests measure?
Depends on the tasks we define!
Many things to be done:
Task analysis: their similarities, difficulties, their requirements (data)
Abilities: to be conceptualised and identified
Ability-oriented (or feature-oriented) evaluation
Incremental, gradual, curriculum, …: task similarity → dependency
Recent (EGPAI@ECAI2016, MAIN@NIPS2016) and upcoming workshops
EGPAI@IJCAI2017
MAIN@NIPS2017 ?
23. IS THIS SUFFICIENT? OPEN QUESTIONS
We want cognitive components that could be easily integrated into
standalone cognitive systems.
What to measure:
“specific entities”, “networks” or “services” (Spohrer and Banavar 2015)
We need a different kind of ‘specification’ of:
What the components are able to do.
What the integrated systems will be able to do, depending on their integration (tight, loose, teams, etc.).
Understanding the inclusion or emergence of general abilities.
24. CONCLUSIONS
Increasing need for the evaluation of cognitive systems:
Plethora of new systems: AI, hybrids, collectives, etc.
Crucial to assess their cognitive profiles, which are unlike and go beyond humans’.
Critical for recognising what professions can be automated first.
Compensating for several cognitive impairments (e.g., aging).
From a task-oriented to an ability-oriented evaluation:
Evaluating cognitive abilities requires a change of paradigm:
From a populational to a universal perspective,
From agglomerative (task diversity) to solutional (policy diversity) approaches,
Hierarchical view, clustering bottom-up.
25.
THANK YOU!
More info:
BOOK
“The Measure of All Minds: Evaluating Natural
and Artificial Intelligence”, Cambridge
University Press, 2017. http://www.allminds.org
An AI Evaluation Survey:
“Evaluation in artificial intelligence: From task-oriented to ability-oriented measurement”, Artificial Intelligence Review, 2016