Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Cognitive systems institute talk 8 june 2017 - v.1.0

207 views

Published on

José Hernández-Orallo, Full Professor, Department of Information Systems and Computation at the Universitat Politecnica de València, presentation “Evaluating Cognitive Systems: Task-oriented or Ability-oriented?” as part of the Cognitive Systems Institute Speaker Series.

Published in: Technology
  • Be the first to comment

Cognitive systems institute talk 8 june 2017 - v.1.0

  1. 1. José Hernández-Orallo Dep. de Sistemes Informàtics i Computació, Universitat Politècnica de València jorallo@dsic.upv.es Talk for the Cognitive Systems Institute Speaker series 8 June 2017* Based on parts of the book: “The Measure of All Minds”: http://allminds.org
  2. 2. E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 2 “Greatest accuracy, at the frontiers of science, requires greatest effort, and probably the most expensive or complicated of measurement instruments and procedures” (David Hand, 2004).
  3. 3. COGNITIVE SYSTEMS: MUCH MORE THAN AI  Computers:  AI or AGI systems, robots, bots, …  Cognitively-enhanced organisms, cognitive prosthetics  Cyborgs, technology-enhanced humans  Biologically-enhanced computers:  Human computation and their data  (Hybrid) collectives  Virtual social networks, crowdsourcing  Minimal or rare cognition  Artificial life (more like bacteria, plants, etc.) E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 3 Societal impact on work, leisure, health, etc., difficult to assess as we do not know the cognitive capabilities of all these new systems.
  4. 4. THE EVALUATION DISCORDANCE: AI EVALUATION E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 4 Edited image, originally from wikicommons "[AI is] the science of making machines do things that would require intelligence if done by [humans]." Marvin Minsky (1968).  They can do the “things” (tasks) without featuring intelligence.  Once the task is solved (“superhuman”), it is no longer an AI problem (“AI effect”)  AI would have progressed very significantly (see, e.g., Nilsson, 2009, chap. 32, or Bostrom, 2014, Table 1, pp. 12–13).  But AI is now full of idiots savants.
  5. 5. THE EVALUATION DISCORDANCE: AI EVALUATION  Specific (task-oriented) AI systems E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 5 Machine translation, information retrieval, summarisation Warning! Intelligence NOT included. PR: computer vision, speech recognition, etc. Robotic navigation Driverless vehicles Prediction and estimation Planning and scheduling Automated deduction Knowledge- based assistants Game playing Warning! Intelligence NOT included. Warning! Intelligence NOT included. Warning! Intelligence NOT included. Warning! Intelligence NOT included. Warning! Intelligence NOT included. Warning! Intelligence NOT included. Warning! Intelligence NOT included. Warning! Intelligence NOT included. All images from wikicommons
  6. 6. THE EVALUATION DISCORDANCE: AI EVALUATION E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 6  Specific domain evaluation settings:  CADE ATP System Competition  PROBLEM BENCHMARKS  Termination Competition  PROBLEM BENCHMARKS  The reinforcement learning competition  PROBLEM BENCHMARKS  Program synthesis (Syntax-guided synthesis)  PROBLEM BENCHMARKS  Loebner Prize  HUMAN DISCRIMINATION  Robocup and FIRA (robot football/soccer)  PEER CONFRONTATION  International Aerial Robotics Competition (pilotless aircraft)  PROBLEM BENCHMARKS  DARPA driverless cars, Cyber Grand Challenge, Rescue Robotics  PROBLEM BENCHMARKS  The planning competition  PROBLEM BENCHMARKS  General game playing AAAI competition  PEER CONFRONTATION  BotPrize (videogame player) contest  HUMAN DISCRIMINATION  World Computer Chess Championship  PEER CONFRONTATION  Computer Olympiad  PEER CONFRONTATION  Annual Computer Poker Competition  PEER CONFRONTATION  Trading agent competition  PEER CONFRONTATION  Robo Chat Challenge  HUMAN DISCRIMINATION  UCI repository, PRTools, or KEEL dataset repository.  PROBLEM BENCHMARKS  KDD-cup challenges and ML kaggle competitions  PROBLEM BENCHMARKS  Machine translation corpora: Europarl, SE times corpus, the euromatrix, Tenjinno competitions…  PROBLEM BENCHMARKS  NLP corpora: linguistic data consortium, …  PROBLEM BENCHMARKS  Warlight AI Challenge  PEER CONFRONTATION  The Arcade Learning Environment  PROBLEM BENCHMARKS  Pathfinding benchmarks (gridworld domains)  PROBLEM BENCHMARKS  Genetic programming benchmarks  PROBLEM BENCHMARKS  CAPTCHAs  HUMAN DISCRIMINATION  Graphics Turing Test  HUMAN DISCRIMINATION  FIRA HuroCup humanoid robot competitions  PROBLEM BENCHMARKS  …
  7. 7. THE EVALUATION DISCORDANCE: AI EVALUATION E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 7 Cognitive robots Intelligent assistants Pets, animats and other artificial companions Smart environments Agents, avatars, chatbots Web-bots, Smartbots, Security bots…  How to evaluate general-purpose systems and cognitive components? Warning! Some intelligence MAY BE included. Warning! Some intelligence MAY BE included. Warning! Some intelligence MAY BE included. Warning! Some intelligence MAY BE included. Warning! Some intelligence MAY BE included. Warning! Some intelligence MAY BE included.
  8. 8. THE EVALUATION DISCORDANCE: AI EVALUATION  “Mythical Turing Test” (Sloman, 2014) and its myriad variants…  Mythical human-level machine intelligence  A red herring for general-purpose AI! E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 8
  9. 9. THE EVALUATION DISCORDANCE: AI EVALUATION  What benchmarks? More comprehensive?  ARISTO (Allen Institute for AI) : College science exams  Winograd Schema Challenge : Questions targeting understanding.  Weston et al. “AI-Complete Question Answering” (bAbI)  CLEVR : Relations over visual objects E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 9 BEWARE: AI-Completeness claimed before Calculation, Chess, Go, Turing test, … Now AI is superhuman on most of them! (e.g., https://arxiv.org/pdf/1706.01427.pdf)
  10. 10. THE EVALUATION DISCORDANCE: TEST MISMATCH  What about psychometric tests or animal tests in AI?  In 2003, Sanghi & Dowe :  simple program passing many IQ tests.  About 960 lines of code in Perl! E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 10 This made the point unequivocally: programs passing IQ tests are not necessarily intelligent
  11. 11. THE EVALUATION DISCORDANCE: TEST MISMATCH  This has not been a deterrent!  Psychometric AI (Bringsjord and Schmimanski 2003):  An “agent is intelligent if and only if it excels at all established, validated tests of intelligence”.  Detterman, editor of the Intelligence Journal, posed “A challenge to Watson” (Detterman 2011)  2nd level to “be truly intelligent”: tests not seen beforehand.  “IQ tests are not for machines, yet” (Dowe & Hernandez-Orallo 2012) E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 11
  12. 12. THE EVALUATION DISCORDANCE: TEST MISMATCH  What about developmental tests (or tests for children)? E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 12  Developmental robotics:  Battery of tests (Sinapov, Stoytchev, Schenk 2010-13)  Cognitive architectures:  Newell “test” (Anderson and Lebiere 2003)  “Cognitive Decathlon” (Mueller 2007).  AGI: high-level competency areas (Adams et al. 2012), task breadth (Goertzel et al 2009, Rohrer 2010), robot preschool (Goertzel and Bugaj 2009). a taxonomy for cognitive architectures a psychometric taxonomy (CHC)
  13. 13. THE EVALUATION DISCORDANCE: TEST MISMATCH  Adapting tests between disciplines (AI, psychometrics, comparative psychology) is problematic:  Test from one group only valid and reliable for the original group.  Not necessary and/or not sufficient for the ability.  Machines and hybrids represent a new population.  Nowadays, many benchmarks are assuming that AI will use deep learning or millions of examples.  But machines and hybrids are also an opportunity to understand how to evaluate cognition. Still, E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 13 We need a different foundation
  14. 14. THE ALGORITHMIC CONFLUENCE: WHAT IQ TESTS MEASURE E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 14  “Beyond the Turing Test”…  “Intelligence” definition and test (C-test) based on algorithmic information theory (Hernandez-Orallo 1998-2000).  Letter series common in cognitive tests (Thurstone).  Here generated from a TM with properties (projectibility, stability, …).  Their difficulty is calculated by Kt  Linked with Levin’s universal search, Solomonoff’s inductive inference, Kolmogorov complexity.
  15. 15. THE ALGORITHMIC CONFLUENCE: WHAT IQ TESTS MEASURE  Metric derived by slicing by difficulty h (Kt) and :  This is IQ-test re-engineering!  Intelligence no longer “what intelligence tests measure” (Boring, 1923).  Clues about what IQ tests really measure? Inductive inference. E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 15 Human performance correlated with the difficulty (h) of each exercise. But remember Sanghi and Dowe 2003!
  16. 16. THE ALGORITHMIC CONFLUENCE: SITUATED TESTS  Passive to interactive view:  Intelligence as performance in a range of worlds.  The set of worlds M is described by Turing machines.  Intelligence is measured as an aggregate:  R aggregates ri and p assigns probabilities to environments. How? E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 16 π μ ri oi ai
  17. 17. THE ALGORITHMIC CONFLUENCE: SOLUTIONAL APPROACH E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 17  Three approaches: Range of difficulties Diversity of solutions [universal, e.g. Legg and Hutter] [uniform] [universal] [universal][uniform][uniform] [With the choices in brackets, they are NOT equivalent]
  18. 18. THE ALGORITHMIC CONFLUENCE: SOLUTIONAL APPROACH  A different view of “general intelligence”:  Policy-general intelligence: aggregate by difficulty (e.g., bounded uniform distribution) and for each difficulty look for diversity.  Connected to the task-independence of the g factor. E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 18 Raises a fascinating question: Is there a universal g factor? Ability to find, integrate and emulate a diverse range of successful policies.
  19. 19. FROM TASKS TO ABILITIES: CLUSTERING BY SIMILARITY  Focus first on intermediate levels between tasks and abilities:  Do we have an intrinsic notion of similarity between tasks? E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 19  Task breadth? Arrange abilities?  Hierarchically (e.g., Catell-Horn-Carroll)  Spatially (e.g., Guttman’s model)
  20. 20. FROM TASKS TO ABILITIES: CLUSTERING BY SIMILARITY  Example (ECA rules as tasks).  Task description is not used. No population is used either.  The best solutions are used instead and compared.  Using similarity as difficulty increases (18 rules of difficulty 8): E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 20 Dendrogram using complete linkageMetric multidimensional scaling
  21. 21. NEW AI EVALUATION PLATFORMS: A COSMOS  Here they are:  Facebook’s bAbi  Arcade Learning Env. (Atari)  Video Game Definition Language  OpenAI Gym and Universe  Microsoft’s Project Malmo  DeepMind Lab  Facebook’s TorchCraft  Facebook’s CommAI  AI Magazine report: “A New AI Evaluation Cosmos” E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 21 Most (except CommAI) oriented towards the evaluation of very embodied AI, but what about more abstract cognition?
  22. 22. IS THIS SUFFICIENT? OPEN QUESTIONS  What do these platforms / test measure?  Depends on the tasks we define!  Many things to be done  Task analysis, their similarities, difficulties, their requirements (data)  Abilities: be conceptualised and identified.  Ability-oriented (or feature-oriented) evaluation  Incremental, gradual, curriculum, …: task similarity → dependency  Recent (EGPAI@ECAI2016, MAIN@NIPS2016) and upcoming workshops E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 22 EGPAI@IJCAI2017 MAIN@NIPS2017 ?
  23. 23. IS THIS SUFFICIENT? OPEN QUESTIONS  We want cognitive components that could be easily integrated into standalone cognitive systems.  What to measure:  “specific entities”, “networks” or “services” (Spohrer and Banavar 2015)  We need a different kind of 'specification' of  What the components are able to do.  What the integrated systems will be able to do,  Depending on their integration (tight, loose, teams, etc.).  Understanding the inclusion or emergence of general abilities. E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 23
  24. 24. CONCLUSIONS  Increasing need for the evaluation of cognitive systems:  Plethora of new systems: AI, hybrids, collectives, etc.  Crucial to assess their cognitive profiles unlike and beyond humans’.  Critical for recognising what professions can be automated first.  Compensating for several cognitive impairments (e.g., aging).  From a task-oriented to an ability-oriented evaluation:  Evaluating cognitive abilities requires a change of paradigm:  From a populational to a universal perspective,  From agglomerative (task diversity) to solutional (policy diversity) approaches,  Hierarchical view, clustering bottom-up. E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 24
  25. 25. E V A L U A T I N G C O G N I T I V E S Y S T E M S : T A S K - O R I E N T E D O R A B I L I T Y - O R I E N T E D ? 25 THANK YOU!  More info:  BOOK  “The Measure of All Minds: Evaluating Natural and Artificial Intelligence”, Cambridge University Press, 2017. http://www.allminds.org  An AI Evaluation Survey  "Evaluation in artificial intelligence: From task- oriented to ability-oriented measurement", Artificial Intelligence Review, 2016

×