Natural Language Processing & Semantic Models in an Imperfect World


Published in: Technology

  1. Natural Language Processing & Semantic Models in an Imperfect World
       • Presenter: Marc Hadfield
       • [email_address]
     Confidential. Copyright Alitora Systems, Inc. 2009
  2. Marc Hadfield
       • CTO of Alitora Systems
       • Computer Science
       • Research in Bioinformatics
           • NLP
           • Big (Fuzzy) Networks
       • Generalized Semantic Data Platform
  3. Alitora Systems
       • System Approach
     … Talk about Systems & Apps more than Modules.
  4. Discussion Today
       • Storing Data – Semantic Repository
       • Generating Data – NLP
       • Modeling Data – Semantic Models
       • Analyzing Data – Methodology
       • Using Data – Application
  5. Alitora Systems Architecture
  6. Alitora Systems API (ASAPI)
       • User Interfaces
       • ASAPI Collaboration
       • kHarmony™ Semantic DB
       • Alitora Foundry
           • Text-Mining
       • UMIS Secure Distributed URIs
           • URI to Named Graphs
  7. ASAPI Cloud: Multi-Billion Triples
  8. kHarmony™ Semantic DB
       • Semantic / Graph DB
       • Cloud Deployable
           • Distribute Data over Servers
           • Layers of Cache
       • Data Analytics / Clustering
           • Determine High-Value Knowledge
           • Knowledge Relevancy
       • Embedded Scripting
       • Data Entitlements
           • Users, Teams, Organizations, Colleagues
       • Base Ontology
  9. Alitora Foundry
       • Manages NLP processes
           • Annotators which add metadata to text
               • Includes external services like OpenCalais as annotators
           • Workflows to link annotators together
           • Common data representation across components
           • RDF in, RDF out
           • Ontology includes representation of certainty, error
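
The annotator and workflow idea on this slide can be illustrated with a minimal sketch. Everything named here (`Document`, `Annotation`, `run_workflow`, the two toy annotators, the 0.6 certainty value) is a hypothetical stand-in, not Foundry's API, and plain Python objects take the place of the RDF representation the slide describes.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    start: int        # character offset where the annotated text begins
    end: int          # character offset just past the annotated text
    label: str        # kind of metadata, e.g. "token" or "candidate_entity"
    certainty: float  # annotator's confidence in [0, 1]


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)


# An annotator is any callable that adds annotations to a shared document.
Annotator = Callable[[Document], None]


def token_annotator(doc: Document) -> None:
    """Rule-based step: mark every run of non-space characters as a token."""
    for m in re.finditer(r"\S+", doc.text):
        doc.annotations.append(Annotation(m.start(), m.end(), "token", 1.0))


def capitalized_annotator(doc: Document) -> None:
    """Toy stand-in for an entity recognizer: flags capitalized tokens."""
    tokens = [a for a in doc.annotations if a.label == "token"]
    for a in tokens:
        if doc.text[a.start].isupper():
            # 0.6 is an arbitrary illustrative certainty
            doc.annotations.append(
                Annotation(a.start, a.end, "candidate_entity", 0.6))


def run_workflow(doc: Document, steps: List[Annotator]) -> Document:
    """Apply annotators in order; each sees the output of earlier steps."""
    for step in steps:
        step(doc)
    return doc


doc = run_workflow(
    Document("Suppression of endogenous Bim greatly inhibits Gadd45a"),
    [token_annotator, capitalized_annotator],
)
candidates = [doc.text[a.start:a.end]
              for a in doc.annotations if a.label == "candidate_entity"]
```

Because every step reads and writes the same `Document`, an external service could be wrapped as just another annotator, which is the role the slide gives to OpenCalais.
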
  10. Foundry Workflow
       • Independent Workflows based on type of text
       • Combine ML & Rule-based systems
  11. Foundry Data Model
       • Two-dimensional representation of tokens
           • Labels/Spans to tag token ranges (features in machine learning)
       • Allows multiple interpretations of tokens
           • Chemical names tokenized differently than personal names
       • Sequence Recognition and Categorization (with scoring/likelihood)
           • Entities, Entity Types, Normalized (Disambiguated) Entities (ER vs. ER)
       • Shared across workflow steps
       • Direct RDF representation
     “Span”
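
One way to picture spans over tokens with more than one tokenization of the same text is sketched below. The class names, the two regular expressions, the sample sentence, and the scores are all assumptions made for illustration; the slides do not show Foundry's actual data structures.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Span:
    start: int    # index of the first token covered
    end: int      # index one past the last token covered
    label: str    # e.g. "person" or "chemical"
    score: float  # likelihood assigned by a recognizer, in [0, 1]


@dataclass
class Interpretation:
    text: str
    offsets: List[Tuple[int, int]]  # (start, end) character offsets per token
    spans: List[Span]

    def text_of(self, span: Span) -> str:
        """Recover the exact source text a span covers."""
        first = self.offsets[span.start][0]
        last = self.offsets[span.end - 1][1]
        return self.text[first:last]


def tokenize(text: str, pattern: str) -> List[Tuple[int, int]]:
    return [(m.start(), m.end()) for m in re.finditer(pattern, text)]


text = "Dr. Smith studied 2,4-dinitrophenol"

# Interpretation 1: whitespace tokens, convenient for personal names.
person_view = Interpretation(
    text, tokenize(text, r"\S+"), [Span(0, 2, "person", 0.9)])

# Interpretation 2: punctuation split out, convenient for chemical names.
chem_view = Interpretation(
    text,
    tokenize(text, r"[A-Za-z]+|\d+|[^\sA-Za-z\d]"),
    [Span(4, 9, "chemical", 0.8)],
)

person_text = person_view.text_of(person_view.spans[0])
chem_text = chem_view.text_of(chem_view.spans[0])
```

Both interpretations index into the same underlying text, so spans from different tokenizations can coexist and be compared by character offset.
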
  12. NLP In Action
  13. Foundry Relationship Extraction
       • Sentence: “Suppression of endogenous Bim greatly inhibits Gadd45a induction of apoptosis.”
       • Parse:
           [action, inhibit,
             [action, suppress,
               [unknown],
               [gp, endogenous Bim]
             ],
             [action, induce,
               [gp, Gadd45a],
               [process, apoptosis]
             ],
           ]
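
The nested parse on this slide maps naturally onto nested lists. The sketch below transcribes it and walks it; the helper names `describe` and `actions` are hypothetical, and treating every element after the verb as an argument of the action is an assumption about how the parse is meant to be read.

```python
# The parse from the slide, transcribed as nested Python lists.
parse = [
    "action", "inhibit",
    ["action", "suppress",
        ["unknown"],
        ["gp", "endogenous Bim"]],
    ["action", "induce",
        ["gp", "Gadd45a"],
        ["process", "apoptosis"]],
]


def describe(node):
    """Render a parse node as a compact functional expression."""
    kind, rest = node[0], node[1:]
    if kind == "action":
        verb, args = rest[0], rest[1:]
        return f"{verb}({', '.join(describe(a) for a in args)})"
    if not rest:
        return "?"  # a slot with no value, such as [unknown]
    return f"{kind}:{rest[0]}"


def actions(node):
    """Yield the verbs of all action nodes in pre-order."""
    if node[0] == "action":
        yield node[1]
        for child in node[2:]:
            yield from actions(child)


summary = describe(parse)
verbs = list(actions(parse))
```
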
  14. Alitora Knowledge Ontology
       • Data Representation: Each Object is a Named Graph. Unique URI. “Chunks” of RDF.
       • OWL2 “Core” Model
  15. Alitora Knowledge Ontology
       • Named Graphs:
       • URI
       • “Reified”
       • Provenance
           • Hash/Signature
           • Creation, Modification, Expiration Dates
       • Certainty/Error
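
As a rough illustration of a named graph that carries its own provenance and certainty, consider the sketch below. The `NamedGraph` class, its field names, the `urn:example:` identifiers, and the choice of SHA-256 over sorted triples as the signature are all assumptions; the slides list the metadata but not how it is stored or computed.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import FrozenSet, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass(frozen=True)
class NamedGraph:
    uri: str                     # unique identifier of this chunk of RDF
    triples: FrozenSet[Triple]
    created: datetime
    certainty: float             # confidence in the content, in [0, 1]
    modified: Optional[datetime] = None
    expires: Optional[datetime] = None

    def signature(self) -> str:
        """Content hash that does not depend on the order of the triples."""
        canonical = "\n".join(" ".join(t) for t in sorted(self.triples))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def is_expired(self, now: datetime) -> bool:
        return self.expires is not None and now >= self.expires


graph = NamedGraph(
    uri="urn:example:graph/1",
    triples=frozenset({
        ("urn:example:a", "urn:example:relatesTo", "urn:example:b"),
        ("urn:example:a", "urn:example:type", "urn:example:Thing"),
    }),
    created=datetime(2009, 1, 1, tzinfo=timezone.utc),
    certainty=0.75,
    expires=datetime(2010, 1, 1, tzinfo=timezone.utc),
)
```

Attaching the metadata to the graph as a whole, rather than to individual triples, is what lets certainty and provenance travel with each chunk of knowledge.
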
  16. Alitora Knowledge Ontology
     Lesson: “Reification” at the model level. Expose the topology of the knowledge.
  17. Alitora Knowledge Ontology
       • Semantic Knowledge Statements
       • Domain Ontology + Instance Statements
  18. Alitora Knowledge Ontology
       • Semantic Collaborative Statements
  19. Alitora Knowledge Ontology
       • Fact Representation
           • This example has 9 Named Graphs
           • The “Relation” is the head
           • Any number of Relation-Parts
           • Relation-Parts are chained
     “Company Merger”
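
The head-plus-chain shape described on this slide resembles a linked list hanging off a relation object. The sketch below shows that shape only: the class names, role names, and `urn:example:` identifiers are hypothetical, and it does not reproduce the nine named graphs of the slide's figure.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class RelationPart:
    role: str                                   # what this part contributes
    value_uri: str                              # URI of the entity or value
    next_part: Optional["RelationPart"] = None  # link to the next part


@dataclass
class Relation:
    uri: str
    kind: str
    first: Optional[RelationPart] = None  # the head points at the chain
    certainty: float = 1.0

    def parts(self) -> Iterator[RelationPart]:
        """Follow the chain from the head to the last part."""
        part = self.first
        while part is not None:
            yield part
            part = part.next_part


def chain(*parts: RelationPart) -> Optional[RelationPart]:
    """Link parts in order and return the first one."""
    for current, following in zip(parts, parts[1:]):
        current.next_part = following
    return parts[0] if parts else None


merger = Relation(
    uri="urn:example:relation/1",
    kind="CompanyMerger",
    first=chain(
        RelationPart("acquirer", "urn:example:company/A"),
        RelationPart("target", "urn:example:company/B"),
        RelationPart("announced", "urn:example:date/2009-01-01"),
    ),
    certainty=0.8,
)
roles = [p.role for p in merger.parts()]
```

Because the relation is an object in its own right, metadata such as certainty attaches to the fact as a whole and more parts can be appended without changing the model.
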
  20. Alitora Knowledge Ontology
       • OWL
       • “Reified”
       • Knowledge Representation
           • Certainty, Error, Provenance, …
       • Graph + Semantic
           • Topology Interpretation
           • Logical Interpretation
  21. MemomicsBio Ontology (Domain)
       • Extends Alitora Knowledge Ontology
           • Inherits knowledge representation structures
       • OWL
       • Domain Specific
       • Defines types of “facts” specific to biomedical domain
       • A general AKO fact can be mapped/asserted into a Memomics BioOntology fact
  22. Where are we?
       • Store Data
       • Generate data with NLP
       • Represent data in a general knowledge model
       • Have a domain specific ontology
           • Where the “action” happens
       • Need some analysis to push facts into the domain ontology
       • Query, Inference using the domain ontology
  23. Relevancy
       • The shape or “topology” of the graph helps to identify relevant knowledge.
       • The “paths” connecting a User to knowledge, based on search usage, factor into Relevancy
       • “Knowledge Rank”
           • “Best” facts
     Relevancy based on Graph Topology
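
The slides do not say how “Knowledge Rank” is computed. As a familiar example of scoring nodes purely from graph topology, the sketch below swaps in a basic PageRank power iteration; the function name, the damping value, and the toy edge list are assumptions, and nothing here should be read as the ranking Alitora actually uses.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    outgoing = {n: [] for n in nodes}
    for src, dst in edges:
        outgoing[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # A node with no outgoing edges spreads its rank over all nodes.
            targets = outgoing[n] or nodes
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: a user and two facts all point at factA.
edges = [
    ("user", "factA"),
    ("factB", "factA"),
    ("factC", "factA"),
    ("factA", "factB"),
]
rank = pagerank(edges)
best = max(rank, key=rank.get)
```

In this toy graph the node with the most incoming paths ends up ranked highest, which matches the slide's idea that paths leading to a fact contribute to its relevancy.
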
  24. Scripting, Analysis, Inference
       • Submitted Scripts applied over Graph Walk
           • Groovy Scripts (Java Interface)
           • Can calculate “scores”
       • Offline Clustering and Analysis Algorithms
           • Grid/Cloud based
       • Inference process utilizes knowledge
           • Asserting statements (Relation → Statement)
           • Prolog, HiLog, F-Logic
           • Use all features in inferencing (such as certainty)
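
The idea of a submitted script applied during a graph walk can be sketched as a breadth-first traversal that calls a user-supplied function at each node. The slide names Groovy as the scripting language; Python is used here only for illustration, and `walk_and_score`, the adjacency-dict graph, and the distance-based score are all hypothetical.

```python
from collections import deque


def walk_and_score(graph, start, script, max_depth=2):
    """Breadth-first walk from `start`, applying `script(node, depth)` to
    every node reached within `max_depth` hops."""
    scores = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        scores[node] = script(node, depth)
        if depth < max_depth:
            for neighbour in graph.get(node, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, depth + 1))
    return scores


# Hypothetical adjacency list: node -> directly connected nodes.
graph = {
    "user": ["doc1", "doc2"],
    "doc1": ["fact1"],
    "fact1": ["fact2"],
}


def closeness(node, depth):
    """Example script: the score decays with distance from the start."""
    return 1.0 / (1 + depth)


scores = walk_and_score(graph, "user", closeness, max_depth=2)
```

Keeping the traversal separate from the scoring function means a different script can be submitted without touching the walk itself.
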
  25. Certainty
       • How accurate (F-score) are your NLP extractions?
       • How accurate is the source material?
       • How dynamic is your domain?
       • Can facts be independently verified?
           • Do multiple sources reinforce a “fact”?
       • Can your community of users curate or validate information?
       • How sensitive are you to error?
           • Will users tolerate error (such as in search) or are you trying to inference over absolute “truth”?
  26. Certainty
       • Choose to assert facts (or not) based on certainty assessments
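
One simple way to let several sources reinforce a fact and then decide whether to assert it is a noisy-OR combination followed by a threshold. The slides do not specify a formula, so the independence assumption behind noisy-OR, the function names, and the 0.9 threshold below are all illustrative choices.

```python
def combined_certainty(certainties):
    """Noisy-OR: probability that at least one independent source is right."""
    all_wrong = 1.0
    for c in certainties:
        all_wrong *= (1.0 - c)
    return 1.0 - all_wrong


def should_assert(certainties, threshold=0.9):
    """Assert the fact only if the combined certainty reaches the threshold."""
    return combined_certainty(certainties) >= threshold


# Two moderately confident extractions of the same fact.
two_sources = [0.6, 0.7]
# A third, weaker source reporting the same fact.
three_sources = [0.6, 0.7, 0.5]

decision_two = should_assert(two_sources)
decision_three = should_assert(three_sources)
```

With two sources the combined certainty is 1 - 0.4 × 0.3 = 0.88, just under the threshold; the third source raises it to 1 - 0.4 × 0.3 × 0.5 = 0.94, so the fact is asserted. A domain that is more sensitive to error would simply use a higher threshold.
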
  27. Guided Inference
       • Inference is guided by ranked knowledge
       • Analysis can be performed offline
  28. Guided Inference
       • Dynamic Inference / Rules
       • A question/query is posed to initiate the inference
       • Knowledge base is queried to collect relevant data
           • Certainty Thresholds can be used
           • Relevancy Thresholds can be used
       • AKO Relations are asserted as “facts” to extend the inference
       • Process is repeated to add assertions
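
The loop on this slide (query, filter by thresholds, assert, repeat) can be sketched as an iterative expansion from the entity named in the question. The knowledge-base layout, the entity names `A` to `E`, the `relates_to` predicate, and the threshold values are all hypothetical, and a real system would run an inference engine over the asserted facts rather than just collecting them.

```python
def guided_inference(start, kb, min_certainty=0.8, min_relevancy=0.5,
                     max_rounds=10):
    """Repeatedly pull relations that touch the current scope and pass both
    thresholds, assert them, and widen the scope with their entities."""
    scope = {start}
    asserted = set()
    for _ in range(max_rounds):
        new_facts = {
            (s, p, o)
            for (s, p, o, certainty, relevancy) in kb
            if (s in scope or o in scope)
            and certainty >= min_certainty
            and relevancy >= min_relevancy
            and (s, p, o) not in asserted
        }
        if not new_facts:
            break  # nothing more can be added under these thresholds
        asserted |= new_facts
        for s, _, o in new_facts:
            scope.update((s, o))
    return asserted


# Hypothetical knowledge base rows:
# (subject, predicate, object, certainty, relevancy)
kb = [
    ("A", "relates_to", "B", 0.90, 0.9),
    ("B", "relates_to", "C", 0.85, 0.7),
    ("C", "relates_to", "D", 0.40, 0.9),  # fails the certainty threshold
    ("A", "relates_to", "E", 0.95, 0.2),  # fails the relevancy threshold
]
facts = guided_inference("A", kb)
```

Lowering either threshold lets the expansion reach further, at the cost of admitting noisier or less relevant relations.
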
  29. Demonstrations
       • Alitora News Tracker
       • Sage Commons, Biomedical Domain
       • Match Engine, Consumer Application
  30. Alitora News Tracker
       • Track highly relevant news in domain niche
       • Use NLP to extract entities and relations of interest
       • Use certainty assessments as thresholds to consider entities/relations
       • Use a score (an embedded script) to assign a relevancy to news articles
           • Heuristic including entity types in articles, relationship types, et cetera
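
A relevancy heuristic of the kind this slide describes might weight each extracted entity or relation by its type and ignore anything below a certainty threshold. The weights, type names, threshold, and tuple layout in this sketch are invented for illustration and are not taken from the News Tracker.

```python
# Hypothetical weights: how much each extraction type matters to this niche.
ENTITY_WEIGHTS = {"company": 2.0, "person": 1.0}
RELATION_WEIGHTS = {"merger": 5.0}


def article_score(extractions, min_certainty=0.7):
    """Sum type weights scaled by certainty, skipping uncertain extractions.

    Each extraction is a tuple (kind, type_name, certainty), where kind is
    "entity" or "relation".
    """
    score = 0.0
    for kind, type_name, certainty in extractions:
        if certainty < min_certainty:
            continue  # too uncertain to count toward relevancy
        weights = ENTITY_WEIGHTS if kind == "entity" else RELATION_WEIGHTS
        score += weights.get(type_name, 0.0) * certainty
    return score


extractions = [
    ("entity", "company", 0.9),
    ("entity", "person", 0.5),    # dropped: below the certainty threshold
    ("relation", "merger", 0.8),
]
score = article_score(extractions)
```

Here the score is 2.0 × 0.9 + 5.0 × 0.8 = 5.8; the low-certainty person mention contributes nothing.
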
  31. Application: News Tracker
  32. Application: Sage Commons
       • Share networks of biomedical data across the community of researchers
           • Million-node networks, billions of triples
       • Extended AKO with Sage Ontology
           • Use for structured data and unstructured data
       • Allow combination of structured data with NLP derived data
       • Use certainty thresholds to cut down on noise
       • Use relevancy for efficient queries
       • Expose data for guided inferencing
  33. Application: Match Engine
       • Match Engine
       • Extended AKO with Match Ontology
       • Foundry for extracting music event entities
           • Performer, Venue, Price, Genre
       • Certainty for reducing noise
       • Match Engine uses inference with multiple sources of “evidence” to match users with events
       • Demo Application: Bandalay Facebook App
  34. NLP and (Un)Certainty
       • Capture Error / Uncertainty in Model from NLP
       • “Reify” relationships so metadata will “fit”
       • Use multiple types of analysis
           • Rules, Machine Learning, Topology, Curation, User Feedback
       • Separate general model and domain model
           • Allows asserting a fact in the domain model or not (don’t “decide” everything at once)
       • Use semantics to make decisions about data
       • Inference can use thresholds to decide to assert facts (or not)
       • Guided Inference can make informed choice about facts to add/remove from model
  35. Contact Information
       • 750 Menlo Ave, Suite 340, Menlo Park, CA 94025, (415) 310-4406
       • 155 Water Street, Brooklyn, NY 11201, (917) 463-4776
       • [email_address]
       • [email_address]