José Hernández-Orallo, Full Professor, Department of Information Systems and Computation at the Universitat Politècnica de València, presentation “Evaluating Cognitive Systems: Task-oriented or Ability-oriented?” as part of the Cognitive Systems Institute Speaker Series.
1. José Hernández-Orallo
Dep. de Sistemes Informàtics i Computació,
Universitat Politècnica de València
jorallo@dsic.upv.es
Talk for the Cognitive Systems Institute Speaker series
8 June 2017
* Based on parts of the book:
“The Measure of All Minds”:
http://allminds.org
2. EVALUATING COGNITIVE SYSTEMS: TASK-ORIENTED OR ABILITY-ORIENTED?
“Greatest accuracy, at the frontiers of science,
requires greatest effort, and probably the most
expensive or complicated of measurement
instruments and procedures”
(David Hand, 2004).
3. COGNITIVE SYSTEMS: MUCH MORE THAN AI
Computers:
AI or AGI systems, robots, bots, …
Cognitively-enhanced organisms, cognitive prosthetics
Cyborgs, technology-enhanced humans
Biologically-enhanced computers:
Human computation and their data
(Hybrid) collectives
Virtual social networks, crowdsourcing
Minimal or rare cognition
Artificial life (more like bacteria, plants, etc.)
Societal impact on work, leisure, health, etc. is difficult to assess because we do not know the cognitive capabilities of all these new systems.
4. THE EVALUATION DISCORDANCE: AI EVALUATION
Edited image, originally from wikicommons
"[AI is] the science of making
machines do things that would require
intelligence if done by [humans]."
Marvin Minsky (1968).
They can do the “things” (tasks) without featuring intelligence.
Once the task is solved (“superhuman”), it is no longer an AI problem (the “AI effect”).
AI has progressed very significantly (see, e.g., Nilsson, 2009, chap. 32, or Bostrom, 2014, Table 1, pp. 12–13).
But AI is now full of idiot savants.
5. THE EVALUATION DISCORDANCE: AI EVALUATION
Specific (task-oriented) AI systems
Machine translation, information retrieval, summarisation
PR: computer vision, speech recognition, etc.
Robotic navigation
Driverless vehicles
Prediction and estimation
Planning and scheduling
Automated deduction
Knowledge-based assistants
Game playing
Warning! Intelligence NOT included. (stamped on each system above)
All images from wikicommons
6. THE EVALUATION DISCORDANCE: AI EVALUATION
Specific domain evaluation settings:
CADE ATP System Competition PROBLEM BENCHMARKS
Termination Competition PROBLEM BENCHMARKS
The reinforcement learning competition PROBLEM BENCHMARKS
Program synthesis (Syntax-guided synthesis) PROBLEM BENCHMARKS
Loebner Prize HUMAN DISCRIMINATION
Robocup and FIRA (robot football/soccer) PEER CONFRONTATION
International Aerial Robotics Competition (pilotless aircraft) PROBLEM BENCHMARKS
DARPA driverless cars, Cyber Grand Challenge, Rescue Robotics PROBLEM BENCHMARKS
The planning competition PROBLEM BENCHMARKS
General game playing AAAI competition PEER CONFRONTATION
BotPrize (videogame player) contest HUMAN DISCRIMINATION
World Computer Chess Championship PEER CONFRONTATION
Computer Olympiad PEER CONFRONTATION
Annual Computer Poker Competition PEER CONFRONTATION
Trading agent competition PEER CONFRONTATION
Robo Chat Challenge HUMAN DISCRIMINATION
UCI repository, PRTools, or KEEL dataset repositories PROBLEM BENCHMARKS
KDD-cup challenges and ML Kaggle competitions PROBLEM BENCHMARKS
Machine translation corpora: Europarl, SE times corpus, the euromatrix, Tenjinno competitions… PROBLEM BENCHMARKS
NLP corpora: linguistic data consortium, … PROBLEM BENCHMARKS
Warlight AI Challenge PEER CONFRONTATION
The Arcade Learning Environment PROBLEM BENCHMARKS
Pathfinding benchmarks (gridworld domains) PROBLEM BENCHMARKS
Genetic programming benchmarks PROBLEM BENCHMARKS
CAPTCHAs HUMAN DISCRIMINATION
Graphics Turing Test HUMAN DISCRIMINATION
FIRA HuroCup humanoid robot competitions PROBLEM BENCHMARKS
…
7. THE EVALUATION DISCORDANCE: AI EVALUATION
Cognitive robots
Intelligent assistants
Pets, animats and other artificial companions
Smart environments
Agents, avatars, chatbots
Web-bots, Smartbots, Security bots…
How to evaluate general-purpose systems and cognitive components?
Warning! Some intelligence MAY BE included. (stamped on each system above)
8. THE EVALUATION DISCORDANCE: AI EVALUATION
“Mythical Turing Test” (Sloman, 2014)
and its myriad variants…
Mythical human-level machine intelligence
A red herring for general-purpose AI!
9. THE EVALUATION DISCORDANCE: AI EVALUATION
What benchmarks? More comprehensive?
ARISTO (Allen Institute for AI) : College science exams
Winograd Schema Challenge : Questions targeting understanding.
Weston et al. “AI-Complete Question Answering” (bAbI)
CLEVR : Relations over visual objects
BEWARE: AI-completeness has been claimed before, for calculation, chess, Go, the Turing test, …
Now AI is superhuman on most of them!
(e.g., https://arxiv.org/pdf/1706.01427.pdf)
10. THE EVALUATION DISCORDANCE: TEST MISMATCH
What about psychometric tests or animal tests in AI?
In 2003, Sanghi & Dowe wrote a simple program that passed many IQ tests:
about 960 lines of code in Perl!
This made the point unequivocally: programs passing IQ tests are not necessarily intelligent.
11. THE EVALUATION DISCORDANCE: TEST MISMATCH
This has not been a deterrent!
Psychometric AI (Bringsjord and Schimanski 2003): an “agent is
intelligent if and only if it excels at all established, validated tests of intelligence”.
Detterman, editor of the journal Intelligence, posed “A
challenge to Watson” (Detterman 2011)
2nd level to “be truly intelligent”: tests not seen
beforehand.
“IQ tests are not for machines, yet” (Dowe & Hernandez-Orallo
2012)
12. THE EVALUATION DISCORDANCE: TEST MISMATCH
What about developmental tests (or tests for children)?
Developmental robotics:
Battery of tests (Sinapov, Stoytchev, Schenk 2010-13)
Cognitive architectures:
Newell “test” (Anderson and Lebiere 2003)
“Cognitive Decathlon” (Mueller 2007)
AGI: high-level competency areas (Adams et al. 2012), task breadth (Goertzel et al. 2009, Rohrer 2010), robot preschool (Goertzel and Bugaj 2009)
(Slide annotations: a taxonomy for cognitive architectures; a psychometric taxonomy, CHC)
13. THE EVALUATION DISCORDANCE: TEST MISMATCH
Adapting tests between disciplines (AI, psychometrics, comparative
psychology) is problematic:
Test from one group only valid and reliable for the original group.
Not necessary and/or not sufficient for the ability.
Machines and hybrids represent a new population.
Nowadays, many benchmarks assume that AI will use deep learning or millions of examples.
But machines and hybrids are also an opportunity to understand how
to evaluate cognition. Still,
We need a different foundation
14. THE ALGORITHMIC CONFLUENCE: WHAT IQ TESTS MEASURE
“Beyond the Turing Test”…
“Intelligence” definition and test (C-test) based on algorithmic
information theory (Hernandez-Orallo 1998-2000).
Letter series common in cognitive tests (Thurstone).
Here generated from a TM with properties (projectibility, stability, …).
Their difficulty is calculated by Kt
Linked with Levin’s universal search, Solomonoff’s inductive
inference, Kolmogorov complexity.
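As a toy illustration of the idea (the rule encoding, function names and the difficulty formula are simplifications of mine, not the actual C-test generator), a Thurstone-style letter series can be produced by a tiny “program” of cyclic increments, with a Levin-style Kt difficulty (description length plus log of the running time):

```python
import math

def next_letter(c, k=1):
    """Advance letter c by k positions in the alphabet, wrapping around."""
    return chr((ord(c) - ord('a') + k) % 26 + ord('a'))

def generate_series(start, rule, n):
    """Generate an n-letter series by applying a list of increments cyclically.

    The rule list plays the role of the generating program: shorter
    rules give series that are easier to continue.
    """
    out, c = [start], start
    for i in range(n - 1):
        c = next_letter(c, rule[i % len(rule)])
        out.append(c)
    return ''.join(out)

def kt_difficulty(rule, n):
    """Toy Kt-style difficulty: length of the description (the rule)
    plus log2 of the steps needed to print the series, echoing
    Levin's Kt = |p| + log(time)."""
    return len(rule) + math.log2(max(n, 2))
```

For instance, generate_series('a', [1], 5) gives 'abcde', and a three-increment rule is assigned a higher difficulty than a one-increment rule over a series of the same length.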
15. THE ALGORITHMIC CONFLUENCE: WHAT IQ TESTS MEASURE
Metric derived by slicing performance by difficulty h (given by Kt) and aggregating across difficulties.
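The equation on the slide was an image; a hedged reconstruction consistent with the surrounding text (with $r$ an assumed per-item score and $w$ an assumed weighting over difficulties) is:

```latex
% Performance of subject \pi at difficulty h: expected score over
% items x whose Kt-difficulty is h; then aggregate across difficulties.
\Psi(\pi, h) = \mathbb{E}\!\left[\, r(\pi, x) \mid Kt(x) = h \,\right]
\qquad
\Psi(\pi) = \sum_{h} w(h)\, \Psi(\pi, h)
```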
This is IQ-test re-engineering!
Intelligence no longer “what intelligence tests measure” (Boring, 1923).
Clues about what IQ tests really measure? Inductive inference.
Human performance correlated with the difficulty (h) of each exercise.
But remember Sanghi and Dowe 2003!
16. THE ALGORITHMIC CONFLUENCE: SITUATED TESTS
Passive to interactive view:
Intelligence as performance in a range of worlds.
The set of worlds M is described by Turing machines.
Intelligence is measured as an aggregate over environments, where R aggregates the rewards ri and p assigns probabilities to environments. How?
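The aggregate equation was an image on the slide; in the style of Legg and Hutter's universal intelligence (a reconstruction, with $V^{\pi}_{\mu}$ the R-aggregated reward of policy $\pi$ in environment $\mu$) it reads:

```latex
% Expected performance over the class of environments M, weighted by
% an environment distribution p; Legg and Hutter choose
% p(\mu) = 2^{-K(\mu)}, where K is Kolmogorov complexity.
\Upsilon(\pi) = \sum_{\mu \in M} p(\mu)\, V^{\pi}_{\mu}
```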
(Diagram: agent π interacts with environment μ, exchanging actions ai, observations oi and rewards ri.)
17. THE ALGORITHMIC CONFLUENCE: SOLUTIONAL APPROACH
Three approaches, along two dimensions: range of difficulties and diversity of solutions.
Choices: [universal, e.g. Legg and Hutter], [uniform] [universal], [universal] [uniform] [uniform].
(With the choices in brackets, they are NOT equivalent.)
18. THE ALGORITHMIC CONFLUENCE: SOLUTIONAL APPROACH
A different view of “general intelligence”:
Policy-general intelligence: aggregate by difficulty (e.g., bounded
uniform distribution) and for each difficulty look for diversity.
Connected to the task-independence of the g factor.
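A minimal sketch of this kind of aggregation under a bounded uniform distribution over difficulties (names and data are mine for illustration; the actual proposal looks for diversity of policies, not just scores): every difficulty slice receives the same weight, however many tasks fall into it:

```python
from collections import defaultdict

def policy_general_score(results):
    """Aggregate (difficulty, score) pairs with a uniform weight per
    difficulty slice: average the scores within each difficulty level,
    then average those slice means, so no difficulty level dominates
    just because it contains more tasks."""
    by_h = defaultdict(list)
    for h, score in results:
        by_h[h].append(score)
    slice_means = [sum(v) / len(v) for v in by_h.values()]
    return sum(slice_means) / len(slice_means)
```

With two easy tasks scoring 1.0 and 0.0 and one hard task scoring 0.5, both slices contribute 0.5, so the aggregate is 0.5.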
Raises a fascinating question: Is there a universal g factor?
Ability to find, integrate and emulate a diverse range of successful policies.
19. FROM TASKS TO ABILITIES: CLUSTERING BY SIMILARITY
Focus first on intermediate levels between tasks and abilities:
Do we have an intrinsic notion of similarity between tasks?
Task breadth? How to arrange abilities?
Hierarchically (e.g., Cattell-Horn-Carroll)
Spatially (e.g., Guttman’s model)
20. FROM TASKS TO ABILITIES: CLUSTERING BY SIMILARITY
Example (elementary cellular automaton (ECA) rules as tasks).
Task description is not used. No population is used either.
The best solutions are used instead and compared.
Using similarity as difficulty increases (18 rules of difficulty 8):
Figures: dendrogram using complete linkage; metric multidimensional scaling.
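The clustering step behind such a dendrogram can be sketched as follows (a pure-Python stand-in with hypothetical distances; in practice one would use standard hierarchical-clustering tools):

```python
def complete_linkage_clusters(dist, threshold):
    """Agglomerative clustering with complete linkage.

    dist: symmetric matrix of pairwise task distances (e.g. one minus
    the similarity of the tasks' best solutions).  Repeatedly merges
    the two closest clusters until the smallest complete-linkage
    distance exceeds threshold; returns clusters as sets of indices.
    """
    clusters = [{i} for i in range(len(dist))]

    def d(a, b):
        # Complete linkage: distance between the farthest pair of members.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best, ai, bi = min((d(a, b), ai, bi)
                           for ai, a in enumerate(clusters)
                           for bi, b in enumerate(clusters) if ai < bi)
        if best > threshold:
            break
        clusters[ai] |= clusters[bi]
        del clusters[bi]
    return clusters
```

With three tasks where tasks 0 and 1 are near (distance 0.1) and task 2 is far (0.9), a threshold of 0.5 groups 0 and 1 together and leaves 2 on its own.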
21. NEW AI EVALUATION PLATFORMS: A COSMOS
Here they are:
Facebook’s bAbI
Arcade Learning Env. (Atari)
Video Game Definition Language
OpenAI Gym and Universe
Microsoft’s Project Malmo
DeepMind Lab
Facebook’s TorchCraft
Facebook’s CommAI
AI Magazine report: “A New AI Evaluation Cosmos”
Most (except CommAI) are oriented towards the evaluation of very embodied AI, but what about more abstract cognition?
22. IS THIS SUFFICIENT? OPEN QUESTIONS
What do these platforms / tests measure?
Depends on the tasks we define!
Many things to be done:
Task analysis: their similarities, difficulties, their requirements (data)
Abilities: to be conceptualised and identified
Ability-oriented (or feature-oriented) evaluation
Incremental, gradual, curriculum, …: task similarity → dependency
Recent (EGPAI@ECAI2016, MAIN@NIPS2016) and upcoming workshops
EGPAI@IJCAI2017
MAIN@NIPS2017 ?
23. IS THIS SUFFICIENT? OPEN QUESTIONS
We want cognitive components that could be easily integrated into
standalone cognitive systems.
What to measure:
“specific entities”, “networks” or “services” (Spohrer and Banavar 2015)
We need a different kind of ‘specification’ of:
What the components are able to do.
What the integrated systems will be able to do, depending on their integration (tight, loose, teams, etc.).
Understanding the inclusion or emergence of general abilities.
24. CONCLUSIONS
Increasing need for the evaluation of cognitive systems:
Plethora of new systems: AI, hybrids, collectives, etc.
Crucial to assess their cognitive profiles, which are unlike and go beyond humans’.
Critical for recognising what professions can be automated first.
Compensating for several cognitive impairments (e.g., aging).
From a task-oriented to an ability-oriented evaluation:
Evaluating cognitive abilities requires a change of paradigm:
From a populational to a universal perspective,
From agglomerative (task diversity) to solutional (policy diversity) approaches,
Hierarchical view, clustering bottom-up.
25.
THANK YOU!
More info:
BOOK
“The Measure of All Minds: Evaluating Natural
and Artificial Intelligence”, Cambridge
University Press, 2017. http://www.allminds.org
An AI Evaluation Survey:
“Evaluation in artificial intelligence: From task-oriented to ability-oriented measurement”, Artificial Intelligence Review, 2016