Language as social sensor - Marko Grobelnik - Dubrovnik - HrTAL2016 - 30 Sep 2016

At the HrTAL2016 conference I presented the talk "Language as a Social Sensor to operate with Knowledge". The talk included a section on language as an interface between physical nature and the world of the human mind and human society. The role of language as a 'sensor' has several consequences for the uncertainty and inexactness of language evolution as we know it. The talk was accompanied by several live demonstrations of systems for semantic annotation (wikifier.org) and media monitoring (eventregistry.org).

Published in: Data & Analytics

  1. 1. Language as a Social Sensor to operate with Knowledge Marko Grobelnik Jozef Stefan Institute, Slovenia Marko.Grobelnik@ijs.si Dubrovnik, Sep 30th 2016
  2. 2. Reflection on what should be the goal of NLP • The (mostly) forgotten long-term aim of NLP is to understand the text • …and not so much ‘processing’ itself (as the name NLP suggests) • The curse of shallow solutions that work well enough for too many problems has kept people (and researchers) happy for too long • …as useful as information retrieval and text mining are, they delayed the development of “text understanding”
  3. 3. Language vs. World • …if we agree with the above statement, then at this point in time we have ‘language’, but the ‘world’ is more or less missing • So what could a ‘world’ or ‘world model’ be?
  4. 4. Language is really a social sensor… • Nature’s physical reality is very complex… • …but manifests itself in a simple and structured way • Humans need a mechanism to capture the complexity they need to survive, evolve and communicate • …that’s why the language appeared as a necessity • Consequently, human language is a reflection of the world in which we live and our perception of it: • Some of the key properties: Uncertainty, dynamics, compressed information
  5. 5. Crystallization of Nature, Perception, Language and Knowledge [Diagram labels: Nature; Human (×3); Perception (×3); Language (×2); Common Understanding] • Nature is complex, but whenever Nature gets optimized it moves towards a simple and clear structure (crystallization is an obvious process of acquiring structure) • Human perception is just a simplified reflection of how Nature shows itself • Language is a means of communicating perception, a kind of sensor for the structures beneath (since it is optimized, it has the form of a crystal) • The common understanding of Nature is what we call Knowledge; it still emits clear structures (clear Knowledge has a nice crystal structure)
  6. 6. Positioning language towards knowledge • Language has the difficult task of encoding Nature’s complexity in a way that is efficient for humans… • …to describe Nature • …to express uncertainty, since we do not fully understand Nature’s complexity • …to be efficient when communicating • …to reflect the dynamics of a changing environment • …to abstract physical reality into abstract forms, which we call Knowledge
  7. 7. Why do we need to represent knowledge in a formal way? • The key element for operating with knowledge is “Reasoning” • Since we cannot express all the facts in a formalized way, we need a mechanism that combines knowledge fragments to derive new knowledge • …this is called reasoning
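A minimal sketch of what "combining knowledge fragments to derive new knowledge" can look like. The facts, the relation names (`ISA`, `GENLS`) and the helper functions are illustrative assumptions, not part of any Cyc API; the only rule applied is that an instance of a class is also an instance of every superclass of that class.

```python
# Toy knowledge fragments (illustrative assumptions, not taken from a real KB).
ISA = {("Earth", "Planet")}  # (instance, class)
GENLS = {("Planet", "CelestialBody"), ("CelestialBody", "Thing")}  # (subclass, superclass)


def superclasses(cls, genls):
    """Return every class reachable from cls by following subclass links."""
    found, frontier = set(), [cls]
    while frontier:
        current = frontier.pop()
        for sub, sup in genls:
            if sub == current and sup not in found:
                found.add(sup)
                frontier.append(sup)
    return found


def derive_isa(isa, genls):
    """Combine isa and subclass fragments into isa facts that were never stated."""
    derived = set()
    for inst, cls in isa:
        for sup in superclasses(cls, genls):
            if (inst, sup) not in isa:
                derived.add((inst, sup))
    return derived


NEW_FACTS = derive_isa(ISA, GENLS)
```

Neither derived fact is stored explicitly; both follow from chaining one instance fragment with two subclass fragments, which is the simplest case of the mechanism the slide calls reasoning.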
  8. 8. Popular ways to encode and reason with knowledge • In current science we have several ways to express knowledge, with the aim of encoding the complexity of the world: • …simple forms of knowledge expressed as a collection of points in high-dimensional spaces • Efficient, thanks to linear and other algebras and the corresponding tools • Most popular nowadays: machine learning, statistics, text mining and statistical NLP mostly use these forms • Reasoning is often straightforward • …probabilistic structures such as Bayesian networks • Expressive, more expensive to encode, but still manageable for reasoning • …various kinds of logic to formulate ontological knowledge • Very expressive, not always easy to use for reasoning
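As an illustration of the first family (knowledge as points in a high-dimensional space), the sketch below turns two texts into word-count vectors and compares them with cosine similarity. The whitespace tokenization and the function names are simplifying assumptions chosen for this example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as a sparse vector of lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


DOC_1 = bag_of_words("language is a social sensor")
DOC_2 = bag_of_words("language is a sensor")
DOC_3 = bag_of_words("crystal structure")
SIM_CLOSE = cosine(DOC_1, DOC_2)
SIM_FAR = cosine(DOC_1, DOC_3)
```

"Reasoning" in this representation reduces to geometry: texts that share vocabulary end up close together, and texts with no words in common get a similarity of zero.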
  9. 9. Structure of a Common Sense Knowledge (CycKB at http://opencyc.org/) [Diagram: CYC KNOWLEDGE BASE. Relation labels: isa, subclass, located in. Concept labels: Thing, Universe, Celestial Body, Planet, Earth, Animal, Human, Physics, Money, Mathematics, Chemistry, Time, Learning, Food, Vehicles, Event, Education, School, Language, Love, Emotions, Going for a walk, Death, Cat, Euro, Working, Words, Driving, Rain, Stabbing someone, Nature, Tree, Hatred, Fear]
  10. 10. Why do we need a World model? • Model of the world… • …beyond surface knowledge • …to interconnect contextualized fragments • Why? • To make reasoning capable of connecting isolated fragments of knowledge • To derive new knowledge beyond materialized factual knowledge [Diagram labels: World model; Top-down KA; Bottom-up KA; Multimodal data]
  11. 11. Simple forms of knowledge extraction and reasoning
  12. 12. What can be extracted from a document? • Lexical level • Tokenization – extracting tokens from a document (words, separators, …) • Sentence splitting – set of sentences to be further processed • Linguistic level • Part-of-Speech – assigning word types (nouns, verbs, adjectives, …) • Deep Parsing – constructing parse trees from sentences • Triple extraction – subject-predicate-object triple extraction • Named entity extraction – identifying names of people, places, organizations • Semantic level • Co-reference resolution – replacing pronouns with corresponding names; merging different surface forms of names into a single entity • Semantic labeling – assigning semantic identifiers to names (e.g. LOD/DBpedia/Freebase) including disambiguation • Topic classification – assigning topic categories to a document (e.g. DMoz) • Summarization – assigning importance to parts of a document • Fact extraction – extracting relevant facts from a document
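The lexical level of this list can be sketched with two small regular-expression helpers. This is a deliberately naive illustration (it will mis-split abbreviations such as "e.g."), and the function names are assumptions made for this example rather than any particular toolkit's API.

```python
import re


def split_sentences(text):
    """Naive sentence splitting: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Naive tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


TEXT = "Galileo was born in Pisa. He was an astronomer."
SENTENCES = split_sentences(TEXT)
TOKENS = [tokenize(sentence) for sentence in SENTENCES]
```

The linguistic and semantic levels build on this output: a tagger, parser or entity linker would take these sentences and tokens as input.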
  13. 13. Wikipedia as a World model (http://wikifier.org) [Demo] Annotation and disambiguation of general texts into Wikipedia concepts with a changing vocabulary in 100 languages
  14. 14. Global Media as a playground to understand social dynamics through shallow knowledge extraction (http://eventregistry.org/) [Demo] Imported articles: 150M Identified events: 5M (2014-2016) News sources: 154,969 Unique concepts: 2,698,213 Categories: 5,015
  15. 15. Event description through entities and Semantic keywords
  16. 16. Collection of events described through Entity relatedness
  17. 17. Collection of events described through trending concepts
  18. 18. Collection of events described through three level categorization
  19. 19. Events identified across languages
  20. 20. Collection of events described through a story-line of related events
  21. 21. Linguistic processing on semantically augmented texts • The goal is to use traditional corpus linguistic tools on top of semantically enriched texts • Example: “UN” string -> “United Nations” concept -> “Organization” higher level concept -> … • The purpose is to reuse existing tools for many languages to accurately extract micro-context within the text • Using SketchEngine (https://www.sketchengine.co.uk/) to preprocess the NewsFeed.ijs.si documents (100M+ docs) • Covering the following languages: Arabic, Catalan, Czech, German, English, film, French, Croatian, Hungarian, Italian, Korean, Dutch, Polish, Russian, Spanish, Serbian and Swedish • Login: https://ondra.sketchengine.co.uk/ / username: test / password: preview
  22. 22. Infobox extraction for events: (structured event representation) • Structured event representation describes an event by its “Event Type” and corresponding information slots to be filled • Event Types should be taken from “Event Taxonomy” • …at this stage of development this level of representation still requires human intervention to achieve high accuracy (Precision/Recall) extraction • Example on the right – Wikipedia event infobox: • 2011 Tōhoku earthquake and tsunami
  23. 23. Deeper means to model and reason with knowledge
  24. 24. One of the challenges for the future: Micro-reading • It is “easier” to understand millions of documents than a single document • …reading and understanding a single document is micro-reading • The following experiment is on how much knowledge we can extract from individual documents • …extraction is in a form of first order inferentially productive Cyc logic • …allowing us full reasoning to identify new facts • …minimizing human involvement, optimizing precision and recall Document Assertions Reasoning Dialogue
  25. 25. Disambiguation with a world model (CycKB) World model used as a set of common-sense semantic constraints to disambiguate text
  26. 26. Cyc Knowledge Base and Reasoning
  27. 27. Cycorp © 2006 The Cyc Ontology Thing Intangible Thing Individual Temporal Thing Spatial Thing Partially Tangible Thing Paths Sets Relations Logic Math Human Artifacts Social Relations, Culture Human Anatomy & Physiology Emotion Perception Belief Human Behavior & Actions Products Devices Conceptual Works Vehicles Buildings Weapons Mechanical & Electrical Devices Software Literature Works of Art Language Agent Organizations Organizational Actions Organizational Plans Types of Organizations Human Organizations Nations Governments Geo-Politics Business, Military Organizations Law Business & Commerce Politics Warfare Professions Occupations Purchasing Shopping Travel Communication Transportation & Logistics Social Activities Everyday Living Sports Recreation Entertainment Artifacts Movement State Change Dynamics Materials Parts Statics Physical Agents Borders Geometry Events Scripts Spatial Paths Actors Actions Plans Goals Time Agents Space Physical Objects Human Beings Organization Human Activities Living Things Social Behavior Life Forms Animals Plants Ecology Natural Geography Earth & Solar System Political Geography Weather General Knowledge about Various Domains Specific data, facts, and observations
  28. 28. Cyc High-level Architecture (Cycorp © 2006) [Diagram labels: Cyc Reasoning Modules; Interface to External Data Sources; CycAPI; Knowledge Entry Tools; User Interface (with Natural Language Dialog); Data Bases; Web Pages; Text Sources; Other KBs; Cyc Ontology & Knowledge Base]
  29. 29. Cyc KB Extended w/Domain Knowledge (Cycorp © 2006) [Diagram: the Cyc Ontology map from slide 27, here labeled “General Knowledge about Terrorism” and “Specific data, facts, and observations about terrorist groups and activities”] General Knowledge about Terrorism: • Terrorist groups are capable of directing assassinations: (implies (isa ?GROUP TerroristGroup) (behaviorCapable ?GROUP AssassinatingSomeone directingAgent)) • … • If a terrorist group considers an agent an enemy, that agent is vulnerable to an attack by that group: (implies (and (isa ?GROUP TerroristGroup) (considersAsEnemy ?GROUP ?TARGET)) (vulnerableTo ?GROUP ?TARGET TerroristAttack))
  30. 30. Cyc KB Extended w/Domain Knowledge (Cycorp © 2006) [Diagram: the Cyc Ontology map from slide 27, here labeled “General Knowledge about Terrorism” and “Specific data, facts, and observations about terrorist groups and activities”] Specific Facts about Al Qaida: • (basedInRegion AlQaida Afghanistan) Al-Qaida is based in Afghanistan. • (hasBeliefSystems AlQaida IslamicFundamentalistBeliefs) Al-Qaida has Islamic fundamentalist beliefs. • (hasLeaders AlQaida OsamaBinLaden) Al-Qaida is led by Osama bin Laden. • … • (affiliatedWith AlQaida AlQudsMosqueOrganization) Al-Qaida is affiliated with the Al Quds Mosque. • (affiliatedWith AlQaida SudaneseIntelligenceService) Al-Qaida is affiliated with the Sudanese Intell Service • … • (sponsors AlQaida HarakatUlAnsar) Al-Qaida sponsors Harakat ul-Ansar. • (sponsors AlQaida LaskarJihad) Al-Qaida sponsors Laskar Jihad. • … • (performedBy EmbassyBombingInNairobi AlQaida) Al-Qaida bombed the Embassy in Nairobi. • (performedBy EmbassyBombingInTanzania AlQaida) Al-Qaida bombed the Embassy in Tanzania.
  31. 31. Example of automatically translating text into Cyc Logic and back to text Source: “Galileo Galilei was an Italian physicist and astronomer.” Learn Logic: (#$and (#$isa #$GalileoGalilei #$ItalianPerson) (#$isa #$GalileoGalilei #$Physicist) (#$isa #$GalileoGalilei #$Astronomer)) Fact: Galileo was an Italian, a physicist, and an astronomer. Source: “Galileo was born in Pisa on February 15, 1564.” Learn Logic: (#$and (#$birthDate #$GalileoGalilei (#$DayFn 15 (#$MonthFn #$February (#$YearFn 1564)))) (#$birthPlace #$GalileoGalilei #$CityOfPisaItaly)) Fact: Galileo was born on February 15, 1564 and he was born in Pisa. Source: “Albert Einstein was born in 1879 in Ulm, Germany.” Learn Logic: (#$birthDate #$AlbertEinstein (#$YearFn 1879)) Fact: Albert Einstein was born in 1879.
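The logic expressions on this slide are nested parenthesized lists, so they can be loaded into a program with a small recursive reader. The sketch below is a generic S-expression parser written for this illustration; `parse_cycl` and `tokenize_cycl` are hypothetical names, not functions of the Cyc API, and only symbols and non-negative integers are handled.

```python
def tokenize_cycl(text):
    """Split an expression into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse_cycl(text):
    """Parse one parenthesized expression into nested Python lists."""
    tokens = tokenize_cycl(text)
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        token = tokens[pos]
        pos += 1
        if token == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise ValueError("missing closing parenthesis")
            pos += 1  # consume ")"
            return items
        if token == ")":
            raise ValueError("unexpected closing parenthesis")
        # Plain digits become integers; everything else stays a symbol string.
        return int(token) if token.isdigit() else token

    expression = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after expression")
    return expression


PARSED = parse_cycl("(#$birthDate #$AlbertEinstein (#$YearFn 1879))")
```

Once an assertion is a nested list, the predicate is the first element and the arguments follow, which is a convenient shape for storing and matching facts.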
  32. 32. Example of text and extracted Cyc assertions (1/2) Automatically Extracted Assertions: • (isa ?V1 ProsecutingEvent) • (agent ?V1 RudyGiuliani) • (genls Entity Agent) • (isa RudyGiuliani Agent) • (isa RudyGiuliani Entity) • (isa ?V3 OrganizingEvent) • (patient ?V3 (IntersectionFn OrganizedCrime WallStreet)) • (isa (IntersectionFn OrganizedCrime WallStreet) Patient) • (genls Entity Patient) • (isa OrganizedCrime Patient) • (isa OrganizedCrime Entity) • (isa WallStreet Patient) • (isa WallStreet Entity) Sentence: He prosecuted a number of high-profile cases, including ones against organized crime and Wall_Street financiers.
  33. 33. Example of text and extracted Cyc assertions (2/2) Automatically Extracted Assertions: • (isa ?V1 SubstitutingEvent) • (temporal ?V1 Lincoln) • (genls Entity Agent) • (isa Lincoln Agent) • (genls Person Entity) • (isa Lincoln Entity) • (isa Lincoln Person) • (isa ?V3 SucceedingEvent) • (temporal ?V3 Grant) • (isa Grant Agent) • (isa Grant Entity) • (isa Grant Person) Sentence: Each time a general failed, Lincoln substituted another until finally Grant succeeded in 1865.
  34. 34. Reasoning on extracted assertions (Cyc) Query: (and (isa ?Per Person) (birthDate ?Per ?BD) (occursBefore ?BD WorldWarII) (thereExistsAtLeast 2 ?Role (lifeRole ?Per ?Role) (roleInIndustry ?Role FilmIndustry) ) ) Answers: Sir Derek_George_Jacobi Sir Alexander_Korda Victor Lonzo_Fleming John_Francis_Junkin Cornel_Wilde George_Stevens Bertrand_Blier NL Query: People born before World War II who had at least two roles in the film industry KB?
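To make the structure of this query concrete, here is a toy evaluation over a hand-made fact table. Everything in the data is hypothetical (the people `PersonA` to `PersonC`, their birth years and roles), and `occursBefore ?BD WorldWarII` is simplified to "birth year earlier than 1939"; Cyc's actual inference engine works very differently.

```python
WORLD_WAR_II_START = 1939  # simplification of (occursBefore ?BD WorldWarII)

# Hypothetical facts: birth year plus a mapping from life role to industry.
PEOPLE = {
    "PersonA": {"birth_year": 1920,
                "roles": {"Actor": "FilmIndustry", "Director": "FilmIndustry"}},
    "PersonB": {"birth_year": 1950,
                "roles": {"Actor": "FilmIndustry", "Producer": "FilmIndustry"}},
    "PersonC": {"birth_year": 1910,
                "roles": {"Actor": "FilmIndustry", "Novelist": "Publishing"}},
}


def answer_query(people, before_year, industry, min_roles):
    """Return people born before before_year with at least min_roles roles in industry."""
    answers = []
    for name, info in people.items():
        role_count = sum(1 for ind in info["roles"].values() if ind == industry)
        if info["birth_year"] < before_year and role_count >= min_roles:
            answers.append(name)
    return sorted(answers)


ANSWERS = answer_query(PEOPLE, WORLD_WAR_II_START, "FilmIndustry", 2)
```

Each conjunct of the logical query maps to one condition in the loop, and `thereExistsAtLeast 2` becomes a count over the matching roles.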
  35. 35. Text query Query (semi) automatically translated in the First Order Logic Answers to the query Cyc’s front-end: “Cyc Analytic Environment” – querying (1/2) Who has a motive for the assassination of Rafik Hariri?
  36. 36. Query & Answer Justification Sources for Reasoning and Justification Cyc’s front-end: “Cyc Analytic Environment” – justification (2/2)
  37. 37. Some of the challenges for the future • Background knowledge in the form of a World Model • …to have knowledge contextualized • Representing knowledge and reasoning over it at scale with operational soft logic • …to decrease the brittleness of logic and increase scale • Economically viable structured knowledge acquisition with high precision and recall • …to increase the reach of what we can acquire • Emphasizing understanding vs. applying black box models
