
Domain-Specific Term Extraction for Concept Identification in Ontology Construction


Published in: Science

  1. Domain-Specific Term Extraction for Concept Identification in Ontology Construction
     Kiruparan Balachandran and Surangika Ranathunga
     Department of Computer Science and Engineering, University of Moratuwa, Sri Lanka
     2016 IEEE/WIC/ACM International Conference on Web Intelligence
  2. Introduction - Ontology
     An ontology is a formal and explicit specification of a shared conceptualization.
     • An ontology consists of:
       • Classes
       • Properties (taxonomic and non-taxonomic)
       • Individuals
       • Values
       • Axioms, used to verify the consistency of the ontology
         • E.g. “a sorting algorithm can be considered an algorithm if and only if it solves a certain Computer Science problem”
     [Figure: example ontology with the nodes Problem, Complexity, Algorithm, and Sorting Algorithm, connected by the edges Has, Is a, and Solve]
  3. Issues in Manual Construction
     • Time consuming
     • Noise
     • Experts have different viewpoints, assumptions, and needs regarding the same domain
  4. Ontology Learning
     • Ontology learning (OL) is a solution to overcome issues related to the manual construction of an ontology.
     • Can be an automatic or a semi-automatic process
     • Building an ontology from scratch
     • Enriching or adapting an existing ontology
  5. Ontology Learning Layer-Cake Approach
     Layers, from bottom to top, with examples:
     • Terms: {Randomized Algorithm, Sorting Algorithm, System Software, Application Software}
     • Synonyms: {Randomized Algorithm, Sorting Algorithm}, {System Software, Application Software}
     • Concepts: Algorithm (I, E, L)
     • Concept Hierarchies: isA(Sorting Algorithm, Algorithm), known as a taxonomic relationship
     • Relations: solve(Algorithm, Problem), known as a non-taxonomic relationship
     • Rules: isA(Sorting Algorithm, Algorithm) -> solve(Sorting Algorithm, Problem)
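The rule layer on slide 5 can be illustrated with a small sketch. It assumes a deliberately simple reading of the example rule: a concept inherits every relation of its direct parent. The function name `apply_inheritance_rule` and the tuple layout are hypothetical, chosen only for illustration, and this is not the authors' implementation.

```python
def apply_inheritance_rule(is_a, relations):
    """Return relations plus those inherited by children from their direct parents.

    is_a:      set of (child, parent) pairs, e.g. ("Sorting Algorithm", "Algorithm")
    relations: set of (name, subject, object) triples, e.g. ("solve", "Algorithm", "Problem")
    Assumption: only one inheritance step is applied (no transitive closure).
    """
    inferred = set(relations)
    for child, parent in is_a:
        for name, subject, obj in relations:
            if subject == parent:
                # isA(child, parent) and name(parent, obj) -> name(child, obj)
                inferred.add((name, child, obj))
    return inferred


is_a = {("Sorting Algorithm", "Algorithm")}
relations = {("solve", "Algorithm", "Problem")}
result = apply_inheritance_rule(is_a, relations)
```

With the slide's example facts, `result` contains both the original `solve(Algorithm, Problem)` triple and the inferred `solve(Sorting Algorithm, Problem)` triple.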
  6. Unresolved Issues in Term Extraction
     • Assume that the domain expert feeds domain-specific terms
     • Corpus selection based on word count
     • Considering a single contrastive corpus
     [Figure: target domain corpus (Computer Science) set against a contrastive corpus made up of Bio Medical, Cricket, and other domains]
  7. Unresolved Issues in Term Extraction
     • Issues in term extraction
       • Inverse document frequency: inadequate to identify the cross-domain distribution
       • Domain relevance (DR): fails if a term is used at a higher frequency in a few documents in a domain, but not equally across domains
       • Domain consensus (DC): only considers the term distribution within a domain but not the term distribution across domains
       • DR with DC for a single contrastive corpus: when a large number of corpora are combined, each term has a significant count from the individual corpora, and this count misleads the calculation of the statistical distribution
     • Complex term extraction
       • Rules do not consider all possible POS tags
       • Rules do not limit the size of complex terms
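The two measures named on slide 7 can be sketched as follows. The slide does not give formulas, so this assumes the commonly cited definitions: domain relevance as the term's probability in the target domain divided by its highest probability in any domain, and domain consensus as the entropy of the term's distribution over the documents of one domain. The function names and the input layout are hypothetical.

```python
import math


def domain_relevance(term, target, domain_probs):
    """DR: P(term | target) divided by the largest P(term | domain) over all domains.

    domain_probs maps a domain name to a dict {term: probability}.
    """
    best = max(probs.get(term, 0.0) for probs in domain_probs.values())
    if best == 0.0:
        return 0.0
    return domain_probs[target].get(term, 0.0) / best


def domain_consensus(doc_counts):
    """DC: entropy of a term's frequency distribution over documents of one domain.

    doc_counts is a list with the term's count in each document.
    """
    total = sum(doc_counts)
    if total == 0:
        return 0.0
    shares = [c / total for c in doc_counts if c > 0]
    return sum(-s * math.log(s) for s in shares)


probs = {"cs": {"algorithm": 0.02}, "bio": {"algorithm": 0.005}}
dr_cs = domain_relevance("algorithm", "cs", probs)
dr_bio = domain_relevance("algorithm", "bio", probs)
dc_even = domain_consensus([5, 5])
dc_skewed = domain_consensus([10, 0])
```

Under these definitions a term concentrated in a single document gets zero consensus, while DR never looks at the per-document spread. This matches the slide's point that each measure covers only part of the distribution.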
  8. Improving Domain-Specific Term Extraction Process
     Objective
     [Figure: ontology learning layer cake (Terms, Synonyms, Concepts, Concept Hierarchies, Relations, Rules); the objective targets the Terms layer]
  9. Domain-Specific Term Extraction Process
     • Selecting and organizing corpora: target domain corpus with contrastive corpora
     • Corpus annotation: POS-tagged corpus
     • Extracting domain-specific terms
  10. Selecting Corpora
      Select corpora that are good in lexical richness.
      Occurrence (normalized by length), in %, without stop words:

      Frequency of words   MK     NUS    G      FAO    C      R
      1                    1.31   2.23   1.69   2.46   0.08   20.82
      2                    0.39   0.55   0.81   0.63   0      6.49
      3                    0.19   0.25   0.39   0.27   2.26   3.09
      4                    0.12   0.15   0.26   0.17   0.12   1.82
      5                    0.08   0.10   0.19   0.11   0      1.26
      Total                2.11   3.31   3.63   3.66   2.48   33.50
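One possible reading of the table on slide 10, sketched below: for each frequency k from 1 to 5, count the distinct words that occur exactly k times after stop-word removal, and express that count as a percentage of the corpus length in tokens. This interpretation, the function name `low_frequency_profile`, and the toy stop-word list are assumptions, not details confirmed by the slide.

```python
from collections import Counter


def low_frequency_profile(tokens, stop_words=frozenset(), max_freq=5):
    """Percentage of corpus length taken by word types occurring exactly k times.

    Returns {k: percentage} for k = 1..max_freq, computed without stop words.
    """
    words = [t.lower() for t in tokens if t.lower() not in stop_words]
    length = len(words)
    if length == 0:
        return {k: 0.0 for k in range(1, max_freq + 1)}
    word_counts = Counter(words)                   # word -> occurrences
    freq_of_freq = Counter(word_counts.values())   # occurrences -> number of word types
    return {k: 100.0 * freq_of_freq.get(k, 0) / length
            for k in range(1, max_freq + 1)}


tokens = "a b b c c c the the".split()
profile = low_frequency_profile(tokens, stop_words={"the"})
richness = sum(profile.values())
```

Summing the profile gives a single figure comparable to the "Total" row, so corpora can be ranked by it under this reading.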
  11. Organizing Corpora
      • Corpora: Mikalai Krapivin (Computer Science), GENIA (Bio Medical), Cricinfo RSS
      • The target domain is iteratively selected: each corpus in turn serves as the target domain, while the other two serve as contrastive domains
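The iterative selection on slide 11 amounts to rotating the target role over the corpora. A minimal sketch, with the helper name `target_contrastive_splits` being hypothetical:

```python
def target_contrastive_splits(corpus_names):
    """Yield (target, contrastive_list) pairs, one per corpus."""
    for target in corpus_names:
        contrastive = [name for name in corpus_names if name != target]
        yield target, contrastive


corpora = ["Mikalai Krapivin", "GENIA", "Cricinfo RSS"]
splits = list(target_contrastive_splits(corpora))
```

Each of the three corpora appears once as the target, paired with the remaining two as contrastive corpora.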
  12. Extracting Domain-Specific Terms
      • Linguistic rules to extract simple terms and complex terms
      • Statistical distribution calculation to support multiple contrastive corpora
      • Example domain-specific terms: “rca algorithm”, “time complexity”, “computational complexity”, “processor”
  13. Extracting Simple and Complex Terms
      • Corpora: Mikalai Krapivin, GENIA, Cricinfo
      • Tokenize, annotate with POS
      • Find simple and complex terms (linguistic rules):
        {algorithm/NN}, {computational/JJ complexity/NN}, {time/NN complexity/NN}, {RCA/NNP algorithm/NN}, {Minkowski/NNP sum/NN}, {convex/NN subpolygons/NNS}
      • Domain weight calculation for each term t (e.g. “processor”, “computational complexity”, etc.)
      • Select candidate terms:
        {algorithm/NN}, {computational/JJ complexity/NN}, {time/NN complexity/NN}, {RCA/NNP algorithm/NN}
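The linguistic rules themselves are not listed on slide 13. As a rough sketch, the example below assumes one plausible rule: a candidate is a run of adjectives and nouns that ends in a noun, capped at a maximum length. The tag sets, the `max_len` default, and the function name `candidate_terms` are assumptions, and the input is expected to be already POS-tagged with Penn Treebank tags.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
MODIFIER_TAGS = NOUN_TAGS | {"JJ"}


def candidate_terms(tagged_tokens, max_len=3):
    """Collect simple and complex term candidates from (word, tag) pairs.

    A span qualifies when its last token is a noun and every earlier
    token is a noun or an adjective; spans longer than max_len are skipped.
    """
    terms = set()
    n = len(tagged_tokens)
    for start in range(n):
        for end in range(start + 1, min(start + max_len, n) + 1):
            span = tagged_tokens[start:end]
            head_is_noun = span[-1][1] in NOUN_TAGS
            modifiers_ok = all(tag in MODIFIER_TAGS for _, tag in span[:-1])
            if head_is_noun and modifiers_ok:
                terms.add(" ".join(word.lower() for word, _ in span))
    return terms


sentence = [("The", "DT"), ("RCA", "NNP"), ("algorithm", "NN"), ("has", "VBZ"),
            ("low", "JJ"), ("time", "NN"), ("complexity", "NN")]
found = candidate_terms(sentence)
```

Capping the span length addresses the slide 7 concern that existing rules do not limit the size of complex terms. The cap of three tokens is arbitrary here.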
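  14. Weigh Domain-Specific Terms
      • Target corpus: Mikalai Krapivin; contrastive corpora: GENIA, Cricinfo
      • Domain weight calculation for each term t (e.g. “processor”, “computational complexity”, etc.)
      • For a complex term, probabilities are taken for the term and for each component word: p(“computational complexity”), p(“computational”), p(“complexity”)
      • Across the contrastive corpora the maximum is used: p_MAX(“computational complexity”) = MAX(p_GENIA(“computational complexity”), p_Cricinfo(“computational complexity”)), and likewise p_MAX(“computational”) and p_MAX(“complexity”)
      • Formula terms: P(term in target domain) and P(a_i in target domain), taken over the components a_i with i up to n, set against arg max over m of P(term in d_m) and P(a_i in d_m), taken over the same components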
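A minimal sketch of the weighting idea on slide 14, under two assumptions that the slide does not confirm: the score of a term in a corpus is P(term) multiplied by the product of the probabilities of its component words, and the domain weight is the target score divided by the largest score among the contrastive corpora. The names `corpus_score` and `domain_weight`, the smoothing constant, and all probability values are hypothetical.

```python
from math import prod


def corpus_score(term, probs):
    """P(term) times the product of component-word probabilities (assumed aggregation)."""
    parts = term.split()
    components = parts if len(parts) > 1 else []   # simple terms have no components
    return probs.get(term, 0.0) * prod(probs.get(p, 0.0) for p in components)


def domain_weight(term, target_probs, contrastive, eps=1e-12):
    """Target score relative to the strongest contrastive corpus (assumed ratio form).

    contrastive maps a corpus name to a dict {term_or_word: probability}.
    eps avoids division by zero when no contrastive corpus contains the term.
    """
    strongest = max(corpus_score(term, probs) for probs in contrastive.values())
    return corpus_score(term, target_probs) / (strongest + eps)


# Made-up probabilities for illustration only.
target = {"computational complexity": 0.02, "computational": 0.05,
          "complexity": 0.04, "cell": 0.0001}
contrastive = {
    "GENIA": {"computational complexity": 0.001, "computational": 0.01,
              "complexity": 0.01, "cell": 0.03},
    "Cricinfo": {},
}
w_complex = domain_weight("computational complexity", target, contrastive)
w_cell = domain_weight("cell", target, contrastive)
```

Taking the maximum over the contrastive corpora, rather than pooling their counts, keeps one large corpus from dominating the comparison, which is the concern raised on slide 7.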
  15. Evaluation – Domain-Specific Term Extraction
      Based on comparing existing studies with the same data sets.
      • Top 700: evaluation of the best 700 simple terms and best 700 complex terms for the Computer Science domain
        • Our approach, ComSci precision for complex terms: 52.5%
        • Our approach, ComSci precision for simple terms: 55%
        • Existing approach DC, precision for simple terms: 47%
        • Existing approach C-value/NC-value, precision for simple and complex terms: 28%
      • Top 300: evaluation of the best 300 simple terms and best 300 complex terms for the Bio Medical domain
        • Our approach, Bio Medical precision for complex terms: 62%
        • Our approach, Bio Medical precision for simple terms: 80%
        • Existing approach DC, precision for simple terms: 55%
        • Existing approach C-value/NC-value, precision for simple and complex terms: 32%
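The figures on slide 15 are precision values over the top-ranked terms. A short sketch of that measure follows. It assumes a ranked term list and a reference set of accepted terms, and the slides do not describe how the reference set was built. The function name `precision_at_k` and the sample data are hypothetical.

```python
def precision_at_k(ranked_terms, accepted_terms, k):
    """Share of the first k ranked terms that appear in the accepted set."""
    top = ranked_terms[:k]
    if not top:
        return 0.0
    hits = sum(1 for term in top if term in accepted_terms)
    return hits / len(top)


ranked = ["time complexity", "processor", "wicket", "rca algorithm"]
accepted = {"time complexity", "rca algorithm"}
p_at_2 = precision_at_k(ranked, accepted, 2)
p_at_4 = precision_at_k(ranked, accepted, 4)
```

For the slide's setting, k would be 700 for the Computer Science lists and 300 for the Bio Medical lists.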
  16. Conclusion
      • Our contribution in term extraction for ontology learning
        • Implemented a mechanism to select corpora and discussed an approach to organize corpora
        • Implemented a calculation of statistical distribution that can extract simple and complex terms from multiple domains by using multiple contrastive corpora
      • Improvement
        • Consider synonyms of each domain-specific term, which helps to identify more terms specific to a domain
  17. Questions? Thank you
