Knowledge acquisition using automated techniques


Published in: Technology
  • The definition of knowledge is a matter of ongoing debate among philosophers, but for this talk I have taken this definition from Wikipedia.
  • Predicting market: to predict whether people like Lux soap or its advertisements. Ex: advertising a Bengali concert to the Bengali community in Hyderabad.
  • Scarcity is not the issue; abundance is! It is easy for humans to understand the meaning lying in different documents, but it becomes difficult for a user to find a document of interest.
  • Manual filtering: too much labour, time-consuming, biased. Automated techniques: for huge data, an intelligent way is to formulate an algorithm that performs the repetitive computation with systems instead of manual labour; less time-consuming; this is what I will talk about in this presentation. Combination of both: I consider it more appropriate, since it combines the advantages of systems (scalability and accuracy) with the intelligence of humans. In my thesis I have opted for this approach. I am not talking about it today; I will cover it in a later presentation.
  • Systems that are built over some algorithms. Automation: the use of methods for controlling industrial processes automatically, especially by electronically controlled systems, often reducing manpower.
  • Broad overview of how the system works. According to me, these are the five main components.
  • The type of extraction method depends on the application. A highly sophisticated system can achieve a maximum of about 70% accuracy. The accuracy of automated techniques cannot surpass human intelligence.
  • Knowledge acquisition using automated techniques

    1. 1. Methods of Knowledge Extraction. Deepti Aggarwal, SIEL | SERL, IIIT-Hyderabad, India
    2. 2. Agenda: Introduction to the Web as a knowledge repository. Automated extraction techniques (input sources, extracted structures, input pre-processing, extraction methods, output generation). Issues with automated extraction
    3. 3. What is knowledge? A familiarity with someone or something, gained through experience. Includes facts, information, descriptions, skills
    4. 4. Types of Knowledge. Explicit knowledge: always present explicitly in records for analysis; objective facts having a definite answer; e.g., Hyderabad is the capital of A.P. Implicit knowledge: not present explicitly; cultural beliefs with subjective judgments; e.g., Hyderabad is the best city to live in India.
    5. 5. How is knowledge represented over a period of time? From public library to global library
    6. 6. How is knowledge represented over the web? Millions of documents, blogs, forums, and social networks scattered on the web. Diverse topics, different formats, from diverse people in diverse languages, with different points of view
    7. 7. Benefits of knowledge extraction over the Web. Explicit knowledge: question answering systems, search engines, validating knowledge, tracking a particular piece of information. Implicit knowledge: predicting markets, polls etc.; community advertisements
    8. 8. Problems with knowledge acquisition over the web: abundance of data, relevance of information, personalized retrieval
    9. 9. Possible approaches Manual filtering Automated techniques Combination of both
    10. 10. Automated Extraction
    11. 11. Working of automated extraction systems. Input sources feed the extraction system (defining output structures, input pre-processing, extraction methods, output processing), which produces a database of all facts and relations
    12. 12. Input sources Types
    13. 13. Input sources: web documents, news articles, blogs, social network activities (user profiles, posts, comments). Sentence-level parsing required.
    14. 14. Defining the structures of output: named entities and their relations
    15. 15. Output structures: named entities; named entity relations
    16. 16. 1. Named Entity: Definition. It is an atomic element in a body of text. Types: person, organization, location etc. Different named entities, when linked together, form a relation.
    17. 17. 1. Named Entity: An example. "Sachin Tendulkar was born in Bombay." Sachin Tendulkar: NE of type 'Person'. Bombay: NE of type 'Location'
    18. 18. 2. Named Entity Relationship: Structure. Subject - Relation - Object. Subject: NE of any type. Relation: verb, adjective, adverb. Object: NE of any type
    19. 19. 2. Named Entity Relationship: An Example. Sachin Tendulkar (Subject) was born in (Relation) Bombay (Object)
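One way to hold the Subject - Relation - Object structure from the slide in code is a pair of small record types. This is only a minimal sketch: the names `Entity`, `Relation`, `kind`, and `as_triple` are illustrative choices, not something the slides define.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    text: str   # surface form, e.g. "Sachin Tendulkar"
    kind: str   # entity type, e.g. "PERSON" or "LOCATION"


class Relation(NamedTuple):
    subject: Entity
    relation: str   # connecting verb / adjective / adverb phrase
    obj: Entity


def as_triple(rel: Relation) -> tuple:
    """Flatten a Relation into a plain (subject, relation, object) tuple."""
    return (rel.subject.text, rel.relation, rel.obj.text)


# The example sentence from the slide, encoded by hand.
born_in = Relation(
    subject=Entity("Sachin Tendulkar", "PERSON"),
    relation="was born in",
    obj=Entity("Bombay", "LOCATION"),
)
print(as_triple(born_in))  # ('Sachin Tendulkar', 'was born in', 'Bombay')
```

Keeping the entity type next to the text lets later steps filter relations by type, for example only PERSON to LOCATION pairs.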
    20. 20. Co-referencing: Sachin was born in Bombay. He is a ... Sachin Tendulkar ... Mr. Tendulkar ... Master Blaster ...
    21. 21. Input pre-processing: Libraries
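The co-referencing slide lists several mentions that all point to the same person. A deliberately naive sketch of the idea is to rewrite every known mention to one canonical name. The alias list comes from the slide; the blanket rule that every "He" refers to Sachin Tendulkar is an assumption that holds only for this toy text, and the sample sentence is made up for illustration.

```python
import re

CANONICAL = "Sachin Tendulkar"
# Mentions taken from the slide.
ALIASES = ["Sachin Tendulkar", "Mr. Tendulkar", "Master Blaster", "Sachin"]
# Assumption: in this toy text every "he" refers to the same person.
PRONOUNS = ["He", "he"]


def resolve_coreferences(text: str) -> str:
    """Replace every known mention with the canonical entity name."""
    # Longest mention first, so "Sachin Tendulkar" is matched as a whole
    # before the shorter alias "Sachin" gets a chance.
    mentions = sorted(ALIASES + PRONOUNS, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(m) for m in mentions) + r")\b"
    return re.sub(pattern, CANONICAL, text)


sample = "Sachin was born in Bombay. He is also called Master Blaster."
print(resolve_coreferences(sample))
```

A real co-reference resolver has to decide which entity a pronoun refers to; a lookup table like this cannot do that once a second person appears in the text.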
    22. 22. NLP libraries: Splitting each sentence into tokens (words, digits) using a Sentence Tokenizer. Recognizing language constructs (nouns, verbs, pronouns) using a Part-of-speech Tagger. Example: Sachin/NNP Tendulkar/NNP was/VBD born/VBN in/IN Bombay/NNP
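To make the tagged example concrete, the sketch below reads the `word/TAG` string from the slide into (word, tag) pairs and merges consecutive proper nouns (NNP) into candidate entity mentions. It does not perform tagging itself; the tags are assumed to come from whichever part-of-speech tagger is in use, and the function names are hypothetical.

```python
from itertools import groupby

# Tagger output as shown on the slide.
TAGGED = "Sachin/NNP Tendulkar/NNP was/VBD born/VBN in/IN Bombay/NNP"


def parse_tagged(tagged: str) -> list:
    """Split 'word/TAG' items into (word, tag) pairs."""
    pairs = []
    for item in tagged.split():
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


def proper_noun_chunks(pairs: list) -> list:
    """Join runs of consecutive NNP tokens into single mentions."""
    chunks = []
    for is_nnp, group in groupby(pairs, key=lambda p: p[1] == "NNP"):
        if is_nnp:
            chunks.append(" ".join(word for word, _ in group))
    return chunks


pairs = parse_tagged(TAGGED)
print(proper_noun_chunks(pairs))  # ['Sachin Tendulkar', 'Bombay']
```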
    23. 23. NLP libraries (contd.): Linking individual constituents of a sentence with a Parser to form a parse tree. Identifying types of named entity using a Named Entity Recognizer. Example: Sachin Tendulkar/PERSON was born in Bombay/LOCATION
    24. 24. NLP libraries (contd.): Identify all co-references and replace them with the actual entity using a Co-reference Resolution tool. Identify the specific meaning of a word using Word Sense Disambiguation. External vocabularies: MindNet, DBpedia, WordNet. E.g., contextual meaning of 'crane': noun-bird, verb-lift/move
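The 'crane' example can be illustrated with a simplified Lesk-style overlap count: pick the sense whose gloss shares the most words with the sentence. This is a sketch of one common disambiguation heuristic, not necessarily the method the talk has in mind. The two glosses are written by hand for illustration and are not taken from WordNet or any other vocabulary.

```python
# Hand-written glosses (hypothetical, not copied from WordNet).
CRANE_SENSES = {
    "noun-bird": "large long-necked wading bird with long legs",
    "verb-lift/move": "lift or move a heavy load with a machine",
}


def disambiguate(context: str, senses: dict) -> str:
    """Return the sense whose gloss overlaps most with the context words."""
    context_words = set(context.lower().split())

    def overlap(sense: str) -> int:
        return len(context_words & set(senses[sense].split()))

    # On a tie, max() keeps the first sense in dictionary order.
    return max(senses, key=overlap)


print(disambiguate("the crane is a bird with long legs", CRANE_SENSES))
print(disambiguate("workers crane the heavy load onto the truck", CRANE_SENSES))
```

With glosses this short, a single shared function word can decide the outcome, which is one way the contextual noise mentioned later in the talk creeps in.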
    25. 25. Extraction methods
    26. 26. Extracting relationships among NEs: Standard process. 1. Identify named entities within a sentence. 2. Find the verb or adjective that connects the identified named entities. 3. Connect them together to form a relation.
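The three steps of the standard process can be sketched on a sentence that has already been tagged. The input format (word, POS tag, entity label, with "O" for tokens outside any entity) and the function name are assumptions made for this example; the annotations are written by hand rather than produced by a tagger.

```python
from itertools import groupby

# Hand-annotated tokens: (word, POS tag, entity label or "O").
TOKENS = [
    ("Sachin", "NNP", "PERSON"), ("Tendulkar", "NNP", "PERSON"),
    ("was", "VBD", "O"), ("born", "VBN", "O"), ("in", "IN", "O"),
    ("Bombay", "NNP", "LOCATION"),
]


def extract_relations(tokens: list) -> list:
    """Return (subject, relation, object) triples found in one sentence."""
    # Step 1: group consecutive tokens that share an entity label.
    spans = [
        (label, [(word, pos) for word, pos, _ in group])
        for label, group in groupby(tokens, key=lambda t: t[2])
    ]

    def text(span):
        return " ".join(word for word, _ in span)

    relations = []
    for (l1, left), (lm, mid), (l2, right) in zip(spans, spans[1:], spans[2:]):
        if l1 != "O" and lm == "O" and l2 != "O":
            # Step 2: the connecting words must contain a verb or adjective.
            if any(pos.startswith(("VB", "JJ")) for _, pos in mid):
                # Step 3: connect the two entities through those words.
                relations.append((text(left), text(mid), text(right)))
    return relations


print(extract_relations(TOKENS))
# [('Sachin Tendulkar', 'was born in', 'Bombay')]
```

Grouping by label alone would merge two adjacent entities of the same type into one, so a real system would need explicit entity boundaries.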
    27. 27. Extracting relationships among NEs: Required process. 1. Identify part-of-speech constructs: noun, verb, adjective etc. 2. Determine co-references, acronyms and abbreviations. 3. Connect them together to form a relationship.
    28. 28. Extraction Methods. Natural Language Processing: rule based. Based on sentence structure. E.g., for the English language, a rule can be “noun-verb-noun”. Machine Learning: supervised and unsupervised learning. Features are detected from the training data. E.g., to extract instances of some medical diseases, the system is trained over all the symptoms of each given disease.
    29. 29. Extraction Methods (contd.). Other methods: vocabulary-based systems, context-based clustering. Maintaining a mapping file of all countries and their nationalities helps to determine the nationality of a person when his birth place is known. Hybrid: NLP-based libraries to pre-process the input data, then a machine learning approach to extract the relations, using some external vocabulary such as WordNet.
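The vocabulary-based idea on this slide amounts to a table lookup. In the sketch below, the country-to-nationality table stands in for the mapping file the slide describes; the city-to-country table is an extra assumption added so that a birth place such as Bombay can be resolved. Both tables are tiny hand-made samples.

```python
# Stand-in for the mapping file of countries and nationalities.
COUNTRY_TO_NATIONALITY = {"India": "Indian", "Japan": "Japanese"}
# Assumed second lookup, so that a city can be traced to its country.
CITY_TO_COUNTRY = {"Bombay": "India", "Hyderabad": "India", "Tokyo": "Japan"}


def infer_nationality(birth_place: str):
    """Guess a nationality from a birth place, or return None if unknown."""
    # If the place is a known city, map it to its country first;
    # otherwise treat the place itself as a country name.
    country = CITY_TO_COUNTRY.get(birth_place, birth_place)
    return COUNTRY_TO_NATIONALITY.get(country)


print(infer_nationality("Bombay"))    # Indian
print(infer_nationality("Atlantis"))  # None
```

Birth place is only a heuristic for nationality, so an answer produced this way is a guess rather than an extracted fact.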
    30. 30. Output generation
    31. 31. Types of output systems. 1. Identify all mentions of named entities and their relations. E.g., from a given corpus, extract all named entity relations. 2. Identify missing relations of a database. E.g., given a database, extract the missing attributes of given entities from the corpus. 3. Link various entities within a database. E.g., given a database, link two entities together with some relation extracted from the corpus.
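The second output type, filling in missing attributes of a database, can be sketched as a merge of extracted triples into existing records. The record layout, the relation-to-attribute mapping, and the function name are all assumptions made for this illustration.

```python
def fill_missing(database: dict, triples: list, relation_to_attribute: dict) -> dict:
    """Return a copy of the database with missing attributes filled in."""
    updated = {name: dict(attrs) for name, attrs in database.items()}
    for subject, relation, obj in triples:
        attribute = relation_to_attribute.get(relation)
        if attribute is None or subject not in updated:
            continue  # unknown relation or entity: nothing to fill
        # Only fill gaps; values already in the database are kept as they are.
        if updated[subject].get(attribute) is None:
            updated[subject][attribute] = obj
    return updated


database = {"Sachin Tendulkar": {"birth_place": None}}
triples = [("Sachin Tendulkar", "was born in", "Bombay")]
mapping = {"was born in": "birth_place"}

print(fill_missing(database, triples, mapping))
# {'Sachin Tendulkar': {'birth_place': 'Bombay'}}
```

Returning a copy leaves the original records untouched, so extracted values can be reviewed before they are committed.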
    32. 32. Working of automated extraction systems. Input sources feed the extraction system (defining output structures, input pre-processing, extraction methods, output processing), which produces a database of all facts and relations
    33. 33. Issues with automated extraction: accuracy, running time, dependency
    34. 34. Issue 1: Challenges of language structure: co-reference resolution; ambiguous, complex sentences; abbreviations; acronyms
    35. 35. See an example... “Tom called his father last night. They talked for an hour. He said he would be home the next day.” Who is 'He' referring to: Tom or his father?
    36. 36. “You see sir, I can talk English, I can walk English, I can laugh English, I can run English, because English is such a funny language.” Amitabh in Namak Halal
    37. 37. Issue 2: Accuracy. Named entity detection: 90%; relationship extraction: 50-70%. Noise is introduced at each step. E.g., disambiguating the word 'crane' with WordNet introduces contextual errors, which then decrease the accuracy of rule-based relationship extraction
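A rough way to see how noise accumulates across steps is to multiply per-stage accuracies. This rests on two assumptions that the slides do not state: that errors in different stages are independent, and that each figure is the accuracy of a stage given correct input from the stage before it. Under those assumptions, and reading the upper relationship figure of 70% that way, the arithmetic looks like this:

```python
import math


def pipeline_accuracy(stage_accuracies: list) -> float:
    """End-to-end accuracy if every stage must succeed independently."""
    return math.prod(stage_accuracies)


# Entity detection at 90% followed by relationship extraction at 70%.
print(round(pipeline_accuracy([0.9, 0.7]), 2))  # 0.63

# Hypothetical five-stage pipeline where each stage is 90% accurate.
print(round(pipeline_accuracy([0.9] * 5), 2))   # 0.59
```

Even when every individual stage looks strong, the product drops quickly as stages are added, which fits the slide's point about noise at each step.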
    38. 38. Issue 3: Efficiency. Feature detection steps are expensive. They require days of computation
    39. 39. Issue 4: Dependency on external vocabulary sources, like Wikipedia, WordNet, MindNet etc. Maintenance and updating of vocabulary sources is manual: costly and requires expertise. Limited size produces context-based noise. Domain-dependent: medical domain. Corpus-dependent: Wikipedia, news corpus. Relation-specific: Date- and Place-of-event
    40. 40. Issue 5: Problem with implicit knowledge extraction. Community knowledge is learned and shared. No one can be an expert. Cultural competence and perception of workers are fed into a system as variables. Cultural Consensus Theory provides models to include such variables into the system.
    41. 41. Can we do better? Can we seek human intelligence to improve the accuracy of automated techniques?
    42. 42. References. [1] I. Tuomi. Data is more than knowledge: implications of the reversed knowledge hierarchy for knowledge management and organizational memory. J. Manage. Inf. Syst., 16(3):103–117, Dec. 1999. [2] S. Sekine. Named Entity: History and Future. 2004. [3] S. Sarawagi. Information extraction. Found. Trends Databases, 1(3):261–377, Mar. 2008. [4] S. C. Weller. Cultural consensus theory: Applications and frequently asked questions. Field Methods, 19(4):339–368, 2007.
    43. 43. References (contd.). [5] Z. Syed, E. Viegas, and S. Parastatidis. Automatic discovery of semantic relations using MindNet. LREC, 2010. [6] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. WordNet: An on-line lexical database. International Journal of Lexicography, 3:235–244, 1990. [7] T. S. Jayram, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Zhu. Avatar information extraction system. IEEE Data Eng. Bull., pages 40–48, 2006. [8] E. Greengrass. Information retrieval: A survey, 2000.
    44. 44. Thank you Questions?