The Role of Openness in Creating a Mind for Life
Open Source, AI, & BiologyAn AI breakthrough can come from an application in biologyIt is imperative that this be open sourceSome steps toward (and questions about) creating an open source AI for understanding life
The first artificial mind will think about molecular biology“You can’t think about thinking without thinking about thinking about something.”Seymour Papert, 1974“A thorough study of Human Physiology is, in itself, an education broader and more comprehensive than much that passes under that name. There is no side of the intellect which it does not call into play, no region of human knowledge into which either its roots, or its branches, do not extend.”Thomas Huxley,1893
Why AI hasn’t succeeded (yet)People know a lot about the world implicitly Conversing with a partnerwho doesn’t know these basic things is very frustrating50 years of failing to capture this “common sense” information computationally suggests:Lack of explicit enumeration makes capture very expensive (encyclopedias don’t have it!) Still no idea of the extent of this knowledge
People don’t have implicit knowledge of molecular biologyEverything anyone knows about MolBio comes from some combination of:TextbooksScientific publicationsDatabases (e.g. NCBI)Experiments done in one’s own labThere is no elicitation barrier to capturing everything known about molecular biologyWhy would biologists care?They have to understand genome-scale data, in the context of all that is already known.Magic bullets and biomarkers are not enoughThe idea of finding a single marker of disease state, and addressing it with a specifically targeted drug is not panningout as well as hoped.
XJ.J. Hornberg et al. / BioSystems 83 (2006) 81–90Homeostatic networks foil single markers and drugsoutcometarget
Networks change through timeMjolsness, Sharp, Reinitz,  A Connectionist Model of Development J. Theoretical Bio 1991
Understanding the data“We are close to having a $1,000 genome sequence, but this may be accompanied by a $1,000,000 interpretation.”	- Bruce Korf, president American College of Medical GeneticsNot only is the cost of sequencing essentially free, but big computers and big storage are cheap, too.  What will keep us busy for the next 50 years is understanding the data”	- Russ Altman, chair of Biomedical Engineering at Stanford
The Hard ProblemGiven a set of genomic regions, variants, gene products, and/or concentrations empirically involved in a defined phenotype…Produce:An explanation of the reasons that those genomic regions / variants / products / concentrations are (or are not) relevant to the phenotypeEvidence to support the explanation(s)Alternative explanationsReasons to prefer one explanation over another
Answering Why? questionsFundamental to human cognitive developmentAmazing human facilityEven to confabulationCausal explanation is central to scienceThe only question “big data”doesn’t seem to be enoughto answer (cfRamachandran & Hovy, 2002)
Abductive inference“However man may have acquired his faculty of divining the ways of Nature, it has certainly not been by a self-controlled and critical logic.  Even now he cannot give any exact reason for his best guesses…. For though it goes wrong oftener than right, yet the relative frequency with which it is right is on the whole the most wonderful thing in our constitution.”The Essential Peirce: Selected Philosophical Writings v. 2 p. 217
“Two paradoxes are better than one; they may even suggest a solution” –Edward TellerMolecular Systems Biology+ Artificial Intelligence
Explanation is hardNot just about the connection between an explanation and the thing explained, but must also be “consonant” with other explanations.Knowledge is keyHave to know many other explanations.Need “judgment” to compare the qualities of alternative explanations.Racunas & Shah’s HyBrow system, but required extensive manually represented knowledgeA “complete enough” knowledge-base?
Knowledge-based Computational BiologyWidespread use, e.g.Simulation systems (e.g. BioCyc)Question answering systems (e.g. AskHermes or Watson Medicine)High-throughput result analysis (e.g. GOEAST, Ontologizer)Hypothesis generation / testing (e.g. HyQue)Anything that uses an ontologyAnnotations (e.g. GOA)Cross-species comparisons NCBO
KB for explanationKnowledge base qualityCorrectness, timeliness (tracking changes)CompletenessA constantly receding goal, that obviously cannot be achieved, but is important anywayNeed to cover the material inTextbooksJournal articlesDatabases
Explanatory inferenceEven if all the relevant knowledge were available in computationally tractable form…We need inferential methods toIdentify possible explanations of complex biological phenomena (symbolic?)Compare alternative explanations in the light of existing evidence (numeric?)History of explanatory inference in AI is suggestive, but key open problems remain
Why does openness matter?Productivity: Attacking hard problems efficientlyRapid assimilation of effective methodsBuilding on (not ignoring) each other’s resultsEquity: Access to scientists with low budgetsDistribution to the widest possible communityEthics: Transparency for AI is a moral value
Transparency is a moral valueAI matters – lots of social concerns about loss of control, etc.  2001, RobopocolypseAI is cheap to replicate, and will diverge (if you can build one mind, building millions more is easy).  Too important to be privateTechnological development in the face of such broad social concern requires earning the trust of the society
Getting thereBuild on track records of opennessOBO &Community-curated OntologiesSemantic Web / OWL / SPARQL / SWRLOpen Access PublishingLinked Life DataBreaking down barriersInfrastructureIncentives
Opening a BazzarTo get the productivity advantage, infrastructure mattersTechnical infrastructure to share, compare and integrate code Social infrastructure to work together to solve hard problemsMotivationCompetitionCooperation
Confronting the temptations of being proprietaryThe temptations:Potential future payoffAvoid effort to conform to the infrastructureFear of not being able to improve in the futureCompetition errorsWrong task / evaluation / supplied dataPoor process (timing, execution, infrastructure)Doesn’t evolve toward worthy end
GoalsParticipation from many, previously disparate communitiesBio focused: BioCreative, BioNLP,Comp Ling: ACL Shared Tasks, CONLLNIST: TREC, TACA living, open source collection of useful, modular, repurposable, state of the art software for understanding biomedical textsMajor advances in AI
Facilitating an OS communityProviding ResourcesSoftware (UIMA, U-COMPARE)Compute powerTraining data (CRAFT, Analysis of analysts)Signal EventsSeries of competitions based on CRAFTIncentivesPrizes for significant achievements
http://bionlp-corpora.sourceforge.net/CRAFT/http://bionlp.sourceforge.net
Remaining challengesPubmed Central and open accessCorporate ownership (Ontotext & LLD)Semantic compatibility of various sourcesUMLS breadth vs. BFO logicSharing inference methods & rulesRule syntax (SWRL) is not enough.  DL inference is not enoughUIMA equivalent?
How to participateHelp design CRAFT competitionsConfront publishers about PMC bulk downloadsHelp define inferential benchmarks
A01-Openness in knowledge-based systems

A01-Openness in knowledge-based systems

  • 1.
    The Role ofOpenness in Creating a Mind for Life
  • 2.
    Open Source, AI,& BiologyAn AI breakthrough can come from an application in biologyIt is imperative that this be open sourceSome steps toward (and questions about) creating an open source AI for understanding life
  • 3.
    The first artificialmind will think about molecular biology“You can’t think about thinking without thinking about thinking about something.”Seymour Papert, 1974“A thorough study of Human Physiology is, in itself, an education broader and more comprehensive than much that passes under that name. There is no side of the intellect which it does not call into play, no region of human knowledge into which either its roots, or its branches, do not extend.”Thomas Huxley,1893
  • 4.
    Why AI hasn’tsucceeded (yet)People know a lot about the world implicitly Conversing with a partnerwho doesn’t know these basic things is very frustrating50 years of failing to capture this “common sense” information computationally suggests:Lack of explicit enumeration makes capture very expensive (encyclopedias don’t have it!) Still no idea of the extent of this knowledge
  • 5.
    People don’t haveimplicit knowledge of molecular biologyEverything anyone knows about MolBio comes from some combination of:TextbooksScientific publicationsDatabases (e.g. NCBI)Experiments done in one’s own labThere is no elicitation barrier to capturing everything known about molecular biologyWhy would biologists care?They have to understand genome-scale data, in the context of all that is already known.Magic bullets and biomarkers are not enoughThe idea of finding a single marker of disease state, and addressing it with a specifically targeted drug is not panningout as well as hoped.
  • 6.
    XJ.J. Hornberg etal. / BioSystems 83 (2006) 81–90Homeostatic networks foil single markers and drugsoutcometarget
  • 7.
    Networks change throughtimeMjolsness, Sharp, Reinitz, A Connectionist Model of Development J. Theoretical Bio 1991
  • 8.
    Understanding the data“Weare close to having a $1,000 genome sequence, but this may be accompanied by a $1,000,000 interpretation.” - Bruce Korf, president American College of Medical GeneticsNot only is the cost of sequencing essentially free, but big computers and big storage are cheap, too. What will keep us busy for the next 50 years is understanding the data” - Russ Altman, chair of Biomedical Engineering at Stanford
  • 9.
    The Hard ProblemGivena set of genomic regions, variants, gene products, and/or concentrations empirically involved in a defined phenotype…Produce:An explanation of the reasons that those genomic regions / variants / products / concentrations are (or are not) relevant to the phenotypeEvidence to support the explanation(s)Alternative explanationsReasons to prefer one explanation over another
  • 10.
    Answering Why? questionsFundamentalto human cognitive developmentAmazing human facilityEven to confabulationCausal explanation is central to scienceThe only question “big data”doesn’t seem to be enoughto answer (cfRamachandran & Hovy, 2002)
  • 11.
    Abductive inference“However manmay have acquired his faculty of divining the ways of Nature, it has certainly not been by a self-controlled and critical logic. Even now he cannot give any exact reason for his best guesses…. For though it goes wrong oftener than right, yet the relative frequency with which it is right is on the whole the most wonderful thing in our constitution.”The Essential Peirce: Selected Philosophical Writings v. 2 p. 217
  • 12.
    “Two paradoxes arebetter than one; they may even suggest a solution” –Edward TellerMolecular Systems Biology+ Artificial Intelligence
  • 13.
    Explanation is hardNotjust about the connection between an explanation and the thing explained, but must also be “consonant” with other explanations.Knowledge is keyHave to know many other explanations.Need “judgment” to compare the qualities of alternative explanations.Racunas & Shah’s HyBrow system, but required extensive manually represented knowledgeA “complete enough” knowledge-base?
  • 14.
    Knowledge-based Computational BiologyWidespreaduse, e.g.Simulation systems (e.g. BioCyc)Question answering systems (e.g. AskHermes or Watson Medicine)High-throughput result analysis (e.g. GOEAST, Ontologizer)Hypothesis generation / testing (e.g. HyQue)Anything that uses an ontologyAnnotations (e.g. GOA)Cross-species comparisons NCBO
  • 15.
    KB for explanationKnowledgebase qualityCorrectness, timeliness (tracking changes)CompletenessA constantly receding goal, that obviously cannot be achieved, but is important anywayNeed to cover the material inTextbooksJournal articlesDatabases
  • 16.
    Explanatory inferenceEven ifall the relevant knowledge were available in computationally tractable form…We need inferential methods toIdentify possible explanations of complex biological phenomena (symbolic?)Compare alternative explanations in the light of existing evidence (numeric?)History of explanatory inference in AI is suggestive, but key open problems remain
  • 17.
    Why does opennessmatter?Productivity: Attacking hard problems efficientlyRapid assimilation of effective methodsBuilding on (not ignoring) each other’s resultsEquity: Access to scientists with low budgetsDistribution to the widest possible communityEthics: Transparency for AI is a moral value
  • 18.
    Transparency is amoral valueAI matters – lots of social concerns about loss of control, etc. 2001, RobopocolypseAI is cheap to replicate, and will diverge (if you can build one mind, building millions more is easy). Too important to be privateTechnological development in the face of such broad social concern requires earning the trust of the society
  • 19.
    Getting thereBuild ontrack records of opennessOBO &Community-curated OntologiesSemantic Web / OWL / SPARQL / SWRLOpen Access PublishingLinked Life DataBreaking down barriersInfrastructureIncentives
  • 20.
    Opening a BazzarToget the productivity advantage, infrastructure mattersTechnical infrastructure to share, compare and integrate code Social infrastructure to work together to solve hard problemsMotivationCompetitionCooperation
  • 21.
    Confronting the temptationsof being proprietaryThe temptations:Potential future payoffAvoid effort to conform to the infrastructureFear of not being able to improve in the futureCompetition errorsWrong task / evaluation / supplied dataPoor process (timing, execution, infrastructure)Doesn’t evolve toward worthy end
  • 22.
    GoalsParticipation from many,previously disparate communitiesBio focused: BioCreative, BioNLP,Comp Ling: ACL Shared Tasks, CONLLNIST: TREC, TACA living, open source collection of useful, modular, repurposable, state of the art software for understanding biomedical textsMajor advances in AI
  • 23.
    Facilitating an OScommunityProviding ResourcesSoftware (UIMA, U-COMPARE)Compute powerTraining data (CRAFT, Analysis of analysts)Signal EventsSeries of competitions based on CRAFTIncentivesPrizes for significant achievements
  • 24.
  • 25.
    Remaining challengesPubmed Centraland open accessCorporate ownership (Ontotext & LLD)Semantic compatibility of various sourcesUMLS breadth vs. BFO logicSharing inference methods & rulesRule syntax (SWRL) is not enough. DL inference is not enoughUIMA equivalent?
  • 26.
    How to participateHelpdesign CRAFT competitionsConfront publishers about PMC bulk downloadsHelp define inferential benchmarks

Editor's Notes

  • #5 I will have more to say about the importance of broad knowledge later
  • #12 Humans are very facile at generating explanations. Confabulation is what happens when the process is disconnected from relevant sources of information. Split brain patients explaining why the other hemisphere did something, or Capgrassymdome
  • #14 I believe that both of these goals are within reach in the next generation.