The impact of standardized terminologies and domain-ontologies in multilingual information processing
Published in Education, Technology
Transcript

  • 1. The impact of standardized terminologies and domain ontologies in multilingual information processing. Maruf Hasan, D.Eng., Senior Researcher, Thai Computational Linguistics Laboratory, Thailand; National Institute of Information and Communication Technology, Japan
  • 2. Outline
    • Natural Language Processing (NLP) Research
      • Cross Language Information Retrieval
      • Named Entity Extraction
    • Integrated Knowledge Management Scenario
    • Terminology and Ontology Initiatives
    • The Future: Bootstrapping
    • NiCT resources and technologies
    • Conclusions
  • 3. NLP Research
    • Corpus-based statistical NLP has become a popular research theme in recent years
      • many smart applications exist (e.g., Google search engine, MS Word’s grammar checking, etc.)
      • semantics and knowledge still remain obscured behind words (symbols)
      • meaning and concepts are difficult to extract or build with statistics alone → bootstrapping helps
  • 4. New Research Trends
    • While still relying heavily on sophisticated NLP techniques, researchers are paying increasing attention to semi-automatically built lexical and knowledge resources
    • Outcomes
      • Increasing number of monolingual lexical resources
      • Increasing number of multilingual dictionaries, thesauri, and generalized ontologies
      • Increasing number of specialized ontologies
      • Increasing number of bootstrapping approaches to get the best from both ends
        • augmenting statistically extracted knowledge with manually encoded knowledge, and vice versa.
  • 5. Two perspectives of Information/Knowledge
    • Content Management Perspective
      • Metadata (e.g., Dublin Core Metadata, 15 elements)
      • Taxonomy/thesauri (augmenting the Keyword field)
        • Analogy: HTML (Fixed set of tags)
    • Content Harnessing Perspective
      • Machine understandable content
      • Conceptual and associative hierarchy based on content
      • Ontology (Modeling a domain with concepts and their relationships from domain-expert’s perspective)
        • Analogy: XML (Tags are not fixed)
  • 6. Interoperability
    • XML technology revolutionized the computing industry in terms of data interoperability and exchange
    • Ontologies have started to bring similar benefits to the modeling of information and knowledge
      • Traditional dictionaries and thesauri suffered badly from interoperability problems
      • Ontologies offer a flexible framework for knowledge modeling (similar to what XML offers for data manipulation)
  • 7. Bootstrapping: How It Helps
    • Two major pitfalls with ontology
      • Developing ontologies (expensive! requires Knowledge Engineers)
      • Populating ontologies (labor intensive! Semi-automatic means exist)
    • Bootstrapping: a simple example
      • X is identified as a Person in the ontology but Y is not
      • Analyzing a piece of text with NLP tools, we find evidence that, for example, X and Y are conducting research on some projects in the same organization.
      • It is then easy to infer that Y is a person (and also her affiliation, research interests, etc., through similar analysis)
        • NLP techniques help in semi-automatically populating an ontology
      • NLP tools and algorithms can be further augmented with the help of ontology-driven knowledge
        • What if the text gives no evidence that X is also a family friend of Y? How can we deal with such cases?
        • → I will show an example later
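The bootstrapping example on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the triple format, the relation name `conducts_research_at`, and the function `propose_types` are hypothetical, and the proposals are meant to be reviewed by a person before they enter the ontology (the "semi-automatic" part).

```python
# Seed ontology: entity -> type. X is known to be a Person, Y is unknown.
ontology = {"X": "Person"}

# Hypothetical output of NLP tools: (subject, relation, object) triples.
triples = [
    ("X", "conducts_research_at", "Organization A"),
    ("Y", "conducts_research_at", "Organization A"),
]

def propose_types(ontology, triples):
    """Propose types for unknown subjects that share a relation
    with a subject whose type is already in the ontology."""
    # Learn which type each relation takes as its subject.
    subject_type = {}
    for subj, rel, _obj in triples:
        if subj in ontology:
            subject_type[rel] = ontology[subj]
    # Any unknown subject of the same relation gets the same type proposed.
    proposals = {}
    for subj, rel, _obj in triples:
        if subj not in ontology and rel in subject_type:
            proposals[subj] = subject_type[rel]
    return proposals

proposals = propose_types(ontology, triples)
print(proposals)
```

Accepted proposals can be merged back into `ontology`, which in turn lets the next pass over the text type more entities.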
  • 8. Human factors: Why MT fails but IR wins
    • So far, Information Retrieval (IR) applications, including search engines such as Google, have been largely successful, whereas Machine Translation (MT) systems have been much less so.
      • Reasons include
        • Failures in modeling linguistic and extra-linguistic phenomena, context and concepts, etc.
        • Human tolerance in finding information and in translation quality varies
          • Human tolerance: [(low) Written → Audio → Video (high)]
    • Case-Study: Telstra Voice-operated Directory Service – a failure from user’s perspective but a successful investment from Telstra’s point of view
        • Many queries (70%) are repeated, and the system can handle them quickly (success from Telstra’s perspective). But when a user enquires about rare entities, the system fails (failure from the user’s perspective).
  • 9. Cross-Language Information Retrieval
    • Cross-language Information Retrieval is crucial
      • Why: Querying in one’s native language is comfortable, but often the most valuable information related to a search is available only in another language
      • How: Translate the queries or the document collection (using a simplified MT model) to find information in other languages
      • Economic factor: Relevant information can be found at low cost using noisy translation. After receiving a list of documents and selecting the relevant ones (as we often do with Google), we can make the costly decision of whether or not to translate them.
        • That is, even if our foreign-language proficiency is limited, we can still make sense of the information from other cues (tables, graphs, etc.) and make the right decision.
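The "How" step above, translating the query with a simple and noisy model and then retrieving in the other language, can be sketched as follows. The dictionary entries, the document collection, and the function names are hypothetical toy data, and the scoring is plain term overlap rather than any particular system's ranking.

```python
# Toy Japanese-to-English dictionary; each term maps to all candidate
# translations, so the translated query is deliberately noisy.
JA_EN = {
    "大学": ["university", "college"],
    "研究": ["research", "study"],
}

DOCS = {
    "d1": "university research on rice genetics",
    "d2": "college football results",
    "d3": "weather report for bangkok",
}

def translate_query(terms, dictionary):
    """Replace each term by all of its dictionary translations."""
    translated = []
    for term in terms:
        # Untranslatable terms are kept unchanged.
        translated.extend(dictionary.get(term, [term]))
    return translated

def retrieve(query_terms, docs):
    """Rank documents by the number of query terms they contain."""
    scores = {}
    for doc_id, text in docs.items():
        words = set(text.split())
        score = sum(1 for term in query_terms if term in words)
        if score:
            scores[doc_id] = score
    return sorted(scores, key=lambda d: (-scores[d], d))

query = translate_query(["大学", "研究"], JA_EN)
ranking = retrieve(query, DOCS)
print(query, ranking)
```

Only the documents the user judges relevant from this ranked list would then be passed to full (costly) translation.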
  • 10. Cross-Language Information Retrieval (2)
    • Multilingual dictionaries or simplistic MT models are typically used
    • Although noisy to some extent, language pairs such as Chinese and Japanese can take advantage of Hanzi (Kanji) semantics
      • also applicable to alphabetic languages if we map words to their root forms
    • Further enhancements, such as Latent Semantic Indexing (or other conceptual retrieval techniques), help in mapping symbolic words to abstract concepts
    • Statistically built dictionaries (based on statistical correlation) also proved effective in CLIR
    • CLIR Demo
      • In CLIR, the best results are achieved when the user is guided by both a correlation dictionary (statistically created) and an ontology (manually crafted).
        • Associative relationships are better captured by statistical correlation
        • Hierarchical relationships are better captured in ontologies or KBs
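A minimal sketch of how the two resources might be combined to guide a user's query: associative suggestions come from a correlation dictionary, hierarchical ones from an ontology. The terms, correlation values, and the `expand` function are invented for illustration and do not describe the demo system.

```python
# Statistically derived associations: term -> {related term: correlation}.
CORRELATION = {
    "rice": {"paddy": 0.8, "irrigation": 0.6, "price": 0.2},
}

# Manually crafted hierarchy: term -> broader term.
BROADER = {"rice": "cereal", "cereal": "crop"}

def expand(term, min_corr=0.5):
    """Suggest associative and hierarchical expansions for a query term."""
    associative = sorted(
        related
        for related, corr in CORRELATION.get(term, {}).items()
        if corr >= min_corr
    )
    # Walk up the hierarchy to collect all broader concepts.
    broader = []
    current = term
    while current in BROADER:
        current = BROADER[current]
        broader.append(current)
    return {"associative": associative, "broader": broader}

suggestions = expand("rice")
print(suggestions)
```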
  • 11. Searching Idiosyncrasies (pseudo CLIR)
    • Experiment with Kanji Semantics
      • Searching “大学” on Google
        • 大学 site:cn
        • 大学 site:jp
          • The word 大学 has the same meaning in both Japanese and Chinese
    • Experiment with different servers
      • Searching “DNA” on different Google local sites
        • www.google.co.jp
        • www.google.co.th
        • www.google.com
          • The retrieved results are quite different
    • When it comes to information, we prefer to harness it in an integrated fashion.
        • Communication and connectivity are no longer barriers, but languages are!
  • 12. Dilemma in Named-Entity Extraction
    • Named Entities play an important role in harnessing information
    • Significant research effort has been channeled into automatic Named Entity Extraction, using simple heuristics as well as sophisticated machine learning algorithms.
    • For various reasons, the task has remained restricted to a small set of entity types
      • Organization, Person, Location, Date, Time, Money, Percent
    • In specific domains such as Bio- or Agro- informatics, the notion of named-entities is broader (and different from the above, of course)
      • Domain specific entities are important. With carefully designed tools (using NLP techniques), it is possible to identify domain-specific entities
      • Event extraction is more difficult but crucial in harnessing information
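As a small illustration of the "simple heuristics" end of the spectrum, the sketch below tags three of the classic entity types (Date, Money, Percent) with regular expressions. The patterns are deliberately narrow assumptions (US-style dates, dollar amounts) and are not taken from any specific system; Person, Organization, and Location typically need gazetteers or learned models.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Hypothetical heuristic patterns for three entity types.
PATTERNS = {
    "DATE": re.compile(r"\b(?:" + MONTHS + r") \d{1,2}, \d{4}\b"),
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
    "PERCENT": re.compile(r"\b\d+(?:\.\d+)?%"),
}

def extract_entities(text):
    """Return (label, surface string) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, surface) for _start, label, surface in sorted(found)]

sample = "On April 5, 2004 the budget rose 12.5% to $1,200,000."
entities = extract_entities(sample)
print(entities)
```

Domain-specific entities (for example gene or crop names) would replace or extend `PATTERNS` with domain lexicons.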
  • 13. Integrated Knowledge Management
    • In an optimal scenario, we need to elicit knowledge from 3 different sources and manage it in an integrated fashion
      • Knowledge extracted from symbolic systems (written text, utterance, etc.) – relatively explicit but not so precise!
      • More precise knowledge encoded in ontologies and KBs (semi-automatic) – converted from implicit towards explicit forms!
      • Expert’s tacit knowledge – possible to capture in a system if the experts cooperate.
    • Ontology-based knowledge representation is the most appropriate representation so far, because it can be understood by both humans and machines
    • Ontologies, if not maintained regularly, soon become outdated.
    • There are certain other pitfalls, which can be circumvented through sophisticated NLP techniques, bootstrapping, and an indexing scheme. → See examples in the following slides
  • 14. An Integrated KM Scenario
    • An “academic ontology” about people, projects, organisations, project reports, etc. within an organization (precise knowledge: ontologies are populated semi-automatically, sometimes from databases)
    • A set of sophisticated “NLP tools” for tokenizing, parsing, text classification, etc. (non-precise knowledge: extracted from text automatically)
    • A group of users/experts who are motivated to make things better by giving feedback (tacit knowledge).
    • A Spreading Activation based indexing scheme is used to capture and propagate changes in a bootstrapped fashion
      • cf. Hasan, M.M. (2004). Spreading Activation Framework for Ontology-enhanced Effective Information Access within Organisations. In van Elst, L. et al. (eds.): "Agent-Mediated Knowledge Management". Springer’s Lecture Notes in Computer Science, Vol. 2926, pp. 288-296. Also published in the proceedings of the AAAI Spring Symposium, AMKM-2003, USA.
  • 15. Heterogeneous Sources of Knowledge
  • 16. But, Integrated Manipulation
    • Underneath, there is a spreading activation based indexing structure which changes over time
    • Expert’s feedback is also captured and propagated into the network
    • Commercial systems have been developed using similar techniques (e.g., TeSSI® from L&C Global in the pharmaceutical domain, which uses a multilingual pharmaceutical ontology developed under an EU initiative)
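For readers unfamiliar with spreading activation, here is a generic sketch of the idea, not the indexing scheme of the cited paper. The graph, edge weights, decay factor, and the `reinforce` feedback step are all assumptions made for illustration: activation starts at a seed node, flows along weighted links with decay, and expert feedback is modeled as a change in a link weight.

```python
# Hypothetical network linking ontology instances; values are link weights.
GRAPH = {
    "person:X": {"project:P": 0.9, "org:Lab": 0.5},
    "project:P": {"report:R1": 0.8, "person:Y": 0.9},
    "org:Lab": {},
    "report:R1": {},
    "person:Y": {},
}

def spread(graph, seeds, decay=0.5, threshold=0.1, max_steps=3):
    """Propagate activation from seed nodes along weighted links."""
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(max_steps):
        next_frontier = {}
        for node, energy in frontier.items():
            for neighbour, weight in graph.get(node, {}).items():
                passed = energy * weight * decay
                if passed < threshold:
                    continue  # too weak to propagate further
                next_frontier[neighbour] = next_frontier.get(neighbour, 0.0) + passed
        if not next_frontier:
            break
        for node, energy in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + energy
        frontier = next_frontier
    return activation

def reinforce(graph, source, target, delta=0.1):
    """Model expert feedback by strengthening one link (capped at 1.0)."""
    graph[source][target] = min(1.0, graph[source].get(target, 0.0) + delta)

result = spread(GRAPH, {"person:X": 1.0})
print(sorted(result.items(), key=lambda item: -item[1]))
```

Calling `reinforce(GRAPH, "project:P", "report:R1", 0.2)` and spreading again raises the activation that reaches `report:R1`, which is one simple way the structure can change over time.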
  • 17. Lexical and Ontological Resources
    • China: HowNet (similar to WordNet, with broader conceptual coverage)
    • Japan: EDR Dictionary, a set of dictionaries including bilingual E-J dictionaries and a Dictionary of Technical Terms and Concepts; NTT Goi Taikei, etc.
    • Korea: KORTERM initiative
    • Thailand: TCL’s Computational Lexicon
  • 18. Lexical and Ontological Resources (2)
    • GENIA Annotated Corpus and GENIA Ontology from the University of Tokyo, for bioinformatics research, based on MEDLINE abstracts
      • Multilingual specialized ontologies are still rare but likely to appear
    • Similar resources in Agricultural domain including AGROVOC thesaurus, and related ontologies and resources (corpora)
      • FAO’s Bio-Safety Ontology:
        • Frequent verbs (free-text corpus) → Arguments (NPs) → KAON concepts → Domain experts
      • a bootstrapping approach to creating an ontology
  • 19. NiCT Language Resources
    • EDR Lexicons
      • NiCT acquired the full copyright of the EDR electronic dictionary in 2002 and is able to distribute it for a nominal handling fee.
      • Word Dictionaries
        • Japanese Word Dictionary (260,000)
        • English Word Dictionary (190,000)
      • Bilingual Dictionaries
        • Jpn.-Eng. Bilingual Dictionary (240,000)
        • Eng.-Jpn. Bilingual Dictionary (160,000)
      • Concept Dictionary (410,000)
      • Co-occurrence Dictionary
        • Japanese Co-occurrence Dictionary (930,000)
          • 20,000 Japanese example sentences
        • English Co-occurrence Dictionary (460,000)
          • 12,000 English example sentences
      • Technical Terminology Dictionary
        • (110,000 Japanese & 70,000 English entries)
  • 20. NiCT Language Resources (2)
    • Multilingual Annotated Corpus
      • 40,000 Japanese sentences from Mainichi Newspaper (i.e., Kyoto University Corpus)
      • Morphologically and syntactically annotated
      • English translation (manually translated); Phrase alignment done
        • Syntactic annotation (based on Penn Treebank)
        • 10,000 sentences will be translated and aligned at the phrase level in April 2004 (tentative)
      • Chinese translation (manually translated)
        • 10,000 sentences are already translated
    • And, many other tools and linguistic resources
      • Project Gutenberg Corpus (English-Japanese Bilingual Sentence Aligned corpus)
      • SST Learner Corpora (with error annotation)
  • 21. Conclusions
    • In this new era of ubiquitous connectivity, integrated processing of information is a necessity.
      • Language (not physical communication or bandwidth) remains the strongest barrier.
    • Multilingual resources (dictionaries, thesauri, corpora) are either rare or incomplete
      • AGROVOC still doesn’t cover many languages (including Japanese)
    • Effective processing of multilingual information needs concerted effort in resource building and standardization
      • Especially in terminology and interoperable ontology standards
    • Multilingual resources, along with an effective bootstrapping strategy, will help us overcome the difficulties in NLP and multilingual information processing
      • With the resources and technologies we have at NiCT, it could be worthwhile to try extending AGROVOC and related ontologies to cover Japanese
        • AGROTERM from AFFRC-Japan contains 57,000 agricultural terms extracted from a corpus using NLP tools.
        • Aligning AGROTERM or other similar resources with AGROVOC semi-automatically is a useful challenge.
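One simple way to start such a semi-automatic alignment is to propose candidate matches by string similarity and leave the final decision to a domain expert. The sketch below assumes the source terms have already been mapped to English candidates (for example through a bilingual dictionary); the term lists, the threshold, and the `align` function are hypothetical and do not reflect the actual contents of AGROTERM or AGROVOC.

```python
from difflib import SequenceMatcher

def align(source_terms, target_terms, threshold=0.85):
    """Propose (source, target, score) pairs for expert review."""
    suggestions = []
    for source in source_terms:
        best, best_score = None, 0.0
        for target in target_terms:
            # Case-insensitive character-level similarity in [0, 1].
            score = SequenceMatcher(None, source.lower(), target.lower()).ratio()
            if score > best_score:
                best, best_score = target, score
        if best is not None and best_score >= threshold:
            suggestions.append((source, best, round(best_score, 2)))
    return suggestions

# Invented example terms.
candidates = align(["Rice", "tractor"], ["rice", "wheat"])
print(candidates)
```

Terms that fall below the threshold are left unaligned and would need manual handling or a richer matching method.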