
Stefan Geißler | Hybrid semantic document enrichment using machine learning and linguistics - The Cogito Studio


Published in: Technology

  1. 1. Hybrid semantic document enrichment using machine learning and linguistics Stefan Geißler, SEMANTICS, Leipzig Sept 14 2016 Expert System
  2. 2. What is this? A graph showing the distribution of large cities in the world. Axes: size of the city (population) against the city's rank.
  3. 3. What is this? A graph showing the richest people of the world. Axes: wealth of the person against the person's rank.
  4. 4. What is this? A graph showing the most frequent words from a large text corpus. Axes: frequency of the word against the word's rank.
  5. 5. Empirical evidence: Many types of data from physics, the social sciences, etc. follow such a distribution. "Zipf's law": The number of data points (cities, rich people, words) with a value higher than S (on the y-axis) is proportional to 1/S.
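
The relation stated on this slide can be checked on idealized data. The sketch below is only an illustration, not taken from the talk: it assumes a perfectly Zipfian data set in which the item at rank r has the value `top_value / r` (the function names and the numbers are made up for the example) and counts how many data points lie above a threshold S. Doubling S should roughly halve the count.

```python
def zipf_values(n_items, top_value=1_000_000.0):
    """Idealized Zipfian data (an assumption): rank r has value top_value / r."""
    return [top_value / rank for rank in range(1, n_items + 1)]


def count_above(values, threshold):
    """Number of data points whose value is strictly greater than threshold."""
    return sum(1 for v in values if v > threshold)


values = zipf_values(10_000)

# Each doubling of the threshold S roughly halves the number of points above it,
# which is what "proportional to 1/S" means.
for threshold in (1_000, 2_000, 4_000, 8_000):
    print(threshold, count_above(values, threshold))
```

Real city sizes, fortunes, or word frequencies follow this pattern only approximately.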
  6. 6. Distribution of categories in many categorized/tagged corpora. Axes: frequency of the category against the category's rank.
  7. 7. Problem #1: How does that fit the requirement, stated at the start of many categorization projects, that a category needs a decent amount of data (>100 documents) to be trained? Larger categories can be trained (learned automatically); smaller ones often can't.
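
To get a feeling for the size of this problem, the following sketch distributes a corpus over categories whose sizes are proportional to 1/rank and counts how many categories exceed the 100-document threshold. The corpus size, the number of categories, and the function names are assumptions chosen for illustration; they do not come from the talk.

```python
def category_sizes(n_documents, n_categories):
    """Split n_documents over categories with size proportional to 1/rank."""
    harmonic = sum(1.0 / r for r in range(1, n_categories + 1))
    return [n_documents / (harmonic * r) for r in range(1, n_categories + 1)]


def count_trainable(sizes, min_docs=100):
    """Number of categories with more than min_docs documents."""
    return sum(1 for size in sizes if size > min_docs)


# Hypothetical corpus: 100,000 documents spread over 1,000 categories.
sizes = category_sizes(n_documents=100_000, n_categories=1_000)
trainable = count_trainable(sizes)
print(f"{trainable} of {len(sizes)} categories have more than 100 documents")
```

Under these assumptions only a small minority of the categories reaches the threshold, although the corpus contains 100 documents per category on average.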
  8. 8. Problem #2: Even for the categories that are frequent enough: Is a training corpus really representative? Is "Greece" always about "debt crisis"? Is "Ansbach" always about "terror"? The learning method may learn unwanted associations.
  9. 9. Solution? More data? No, because: - The graph here is scale-free - More data is often not available or very costly. Axes: frequency of the category against the category's rank.
  10. 10. Solution: Let the human expert refine the automatically created model. Human document categorization: IF ("Etna" OR "Vesuv" OR "Pinatubo") AND ("lava" OR "eruption") THEN "Volcanism". Machine document categorization:
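
The hand-written rule on this slide could be evaluated as in the following sketch. It is an illustration in plain Python, not the rule language used in the talk: the tokenization, the function names, and the case-insensitive matching are assumptions, and the third volcano name uses the conventional spelling "Pinatubo".

```python
import re


def tokens(text):
    """Lower-cased set of word tokens (a deliberately simple tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def is_volcanism(text):
    """IF (Etna OR Vesuv OR Pinatubo) AND (lava OR eruption) THEN Volcanism."""
    words = tokens(text)
    volcano_named = bool(words & {"etna", "vesuv", "pinatubo"})
    eruption_mentioned = bool(words & {"lava", "eruption"})
    return volcano_named and eruption_mentioned


print(is_volcanism("Lava flows down the slopes of Etna"))    # True
print(is_volcanism("Etna is a popular hiking destination"))  # False
```

A rule of this kind needs no training documents at all, which is why it also works for categories in the long tail.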
  11. 11. This is seldom a subject in scientific work on document categorization. Different classification methods are most often compared only on the basis of their (automatic) performance on an evaluation corpus.
  12. 12. … but this is often a requirement in real-world document categorization projects. • Training corpora alone are often not enough to attain the expected levels of quality. • Additional data is hard to find (manual preparation or curation is very costly). • Existing corpora may not always be representative.
  13. 13. Our suggestion • Use available training data to train a model • Make the model available in a human-readable formal language • Allow the user to inspect and refine the model where needed in a dedicated development and testing environment
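
The first two steps of this suggestion can be sketched as follows. The learner below is a deliberately naive stand-in and not the method used in the talk: it picks, for each category, the words most specific to that category in a toy training set and prints them as a readable rule. The training documents, the scoring, and the rule syntax are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def learn_keywords(training_docs, top_n=3):
    """For each category, pick the top_n words most specific to it."""
    per_category = defaultdict(Counter)  # document frequency per category
    overall = Counter()                  # document frequency over all docs
    for text, category in training_docs:
        words = set(tokenize(text))
        per_category[category].update(words)
        overall.update(words)
    model = {}
    for category, counts in per_category.items():
        # Rank by share of the word's documents in this category,
        # then by document count, then alphabetically for stable output.
        ranked = sorted(
            counts, key=lambda w: (-counts[w] / overall[w], -counts[w], w)
        )
        model[category] = ranked[:top_n]
    return model


def render_rule(category, keywords):
    """Render the learnt keywords in a human-readable rule notation."""
    alternatives = " OR ".join(f'"{w}"' for w in keywords)
    return f'IF ({alternatives}) THEN "{category}"'


# Toy training set (hypothetical).
training_docs = [
    ("Greece debt talks continue in Athens", "Banking Crisis"),
    ("Athens: Greece faces new debt deadline", "Banking Crisis"),
    ("Etna eruption sends lava towards villages", "Volcanism"),
    ("Lava from the eruption reached the coast", "Volcanism"),
]

model = learn_keywords(training_docs)
for category, keywords in model.items():
    print(render_rule(category, keywords))
```

On such a small corpus the learner also picks up incidental words, which is exactly the kind of result a human reviewer would want to inspect and correct in the third step.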
  14. 14. • A rich formal language (strings, lemmas, regexps, semantic concepts, operators …) makes it possible to express learnt associations for bag-of-words models • … as well as detailed syntactic/semantic constraints • … and to visualize and evaluate the result in the same application
  15. 15. • For the reasons explained above, the statistical learning approach may erroneously learn a rule that the words "Athens" or "Greece" alone justify assigning the document to "Banking Crisis" • The user can refine the learnt rule, adding the further constraint that features like "Debt", "Schäuble" or "Troika" are required before the category is assigned.
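
The refinement described on this slide might look like the sketch below. The `Rule` class and its field names are hypothetical and only illustrate the idea; they are not the rule language of the tool presented in the talk.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Assign `category` if a trigger word occurs and, when required
    features are given, at least one of them occurs as well."""
    category: str
    any_of: set
    also_any_of: set = field(default_factory=set)

    def matches(self, text):
        words = set(re.findall(r"\w+", text.lower()))
        if not words & self.any_of:
            return False
        return not self.also_any_of or bool(words & self.also_any_of)


# The rule as it might be learnt automatically from a skewed corpus ...
learnt = Rule("Banking Crisis", {"athens", "greece"})
# ... and the same rule after manual refinement by the user.
refined = Rule(
    "Banking Crisis",
    {"athens", "greece"},
    also_any_of={"debt", "schäuble", "troika"},
)

travel = "Holiday tips for Athens and the Greek islands"
finance = "Troika returns to Athens for talks"

print(learnt.matches(travel), refined.matches(travel))    # True False
print(learnt.matches(finance), refined.matches(finance))  # True True
```

The learnt rule fires on the travel text, while the refined rule only fires when one of the additional features is present.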
  16. 16. … Sample projects • <US Media company> • Large category schema for news articles • Task: set up solution that allows combining automatically created rule sets with manual refinement • <Insurance company> • Categorize medical reports using ICD category scheme • Go beyond quality that can be attained by using only the manually coded training set
  17. 17. Conclusion • Requirements in industry categorization projects are sometimes not identical to the scenarios in academic categorization benchmarks • Available training data is sometimes limited, even in the age of big data • Allow the seamless (one language, one development environment) application of both learnt and manually crafted rules
  18. 18. Expert System Who we are
  19. 19. Expert System: Largest European provider of pure semantic technologies • 7 Geographies • 250+ team members • Listed on the AIM exchange • Recommended by Gartner, Forrester, IDC ... • Experiences from hundreds of projects • Award winning technology: Taxonomy / Ontology Management, NLP, Information extraction, Question Answering, Cognitive Computing
  20. 20. Global Positioning – Selected Clients ENERGY, OIL & GAS GOVERNMENT FEDERAL AGENCIES MEDIA & PUBLISHING Life Sciences FINANCE