Cross-lingual ontology lexicalisation, translation and information extraction  Net2 workshop, University of South Africa (UNISA)
 

  • Frame: VerbNet, …
    LinguisticOntology: GOLD, LexInfo2
    Form: SKOS
    LexicalSense-Ontology: SKOS-XL
    Node/Edge: parse structures (rare formats such as NEGRA Corpus / TIGER tag set by IMS Stuttgart, or Stanford Parser proprietary)
  • Also phrasal lexicon
  • Lemon distinguishes among different types of lexical forms
  • LexicalSense
    underspecified sense that points to a language-external reference
    unique ontological semantic object (depending on conditions and context)
    can have subsense and senseRelation with other lexicalSenses
    sememe relation between lexicalSense and ontological semantic object can be either: pref / alt / hiddenSem
  • Syntactic agreement: NP( NP_COMPANY VP( VB_sell NP_ASSETS ) )
    Semantic agreement: COMPANY, ASSETS
  • asset-backed-debt: “debts are backed by assets”
    Corresponds to a noun phrase BUT is analyzed by the lemon generator as a sentence: ‘asset backed debt’

Presentation Transcript

  • Cross-lingual ontology lexicalisation, translation and information extraction
    Net2 workshop, University of South Africa
    Tobias Wunner
    DERI, National University of Ireland, Galway
    ifrs:Revenue
    us-gaap:GainLossOnSaleOfOilAndGasProperty
    de-gaap:BilanzsummeSummeAktiva
    be-gaap:MinderwaardenBijDeRealisatieVanVasteActiva
    • Copyright 2010 Digital Enterprise Research Institute. All rights reserved, Paul Buitelaar
  • Outline
    1. Research challenge and motivation
    2. Ontology Translation
    3. Lexicalization (lemon)
    4. CLOBIE (CL Ontology-based Inf. Extraction)
  • Context and Motivation
    Monnet use case in financial domain
    query financial information
    Cross-vocabulary
    Cross-lingual
    Get result in your own language
    Research challenges
    localization & translation of vocabularies
    cross-lingual ontology-based information extraction
  • Finance Terminology is complex!
    “minimum finance lease payments receivable,
    at present value,
    end of period not later than one year”
    representative term of financial domain
    • 16 words
    • complex structure (conceptually & linguistically)
  • Some insight into finance terminology
    Sources: Domain Terminology (SAPTerm), Dictionary (WordNet), XBRL (IFRS)
    [Figure labels: domain-related, domain-independent, domain-specific]
  • Break down complexity
    3-faceted lexical enrichment: semantic, linguistic, terminological
    Example: asset
    Terminological: [asset], [available-for-sale], [financial], [financial asset], [non-financial asset], [available-for-sale financial asset]
    Linguistic: Noun_Sing: asset; Noun_Plural: assets; ?P: available-for-sale; Adjective: financial; NP: available-for-sale fin. asset; VP: to sell financial assets
    Semantic: [Figure: is-a hierarchy over asset, financial asset and non-financial asset]
    Term decomposition: available-for-sale + financial + asset
  • XBRL – Semantic Analysis
    “Enhance semantics to
    facilitate translation and
    information extraction.”
  • XBRL – Terminological Analysis
    ifrs:MinimumFinanceLeasePaymentsReceivableAtPresentValue
    ifrs:MinimumFinanceLeasePaymentsReceivable
    Minimum finance lease payments receivable, at present value
    sapTerm:payments
    googleDefine:leasePayments
    sapTerm:financeLease
    googleDefine:Finance_lease
    [Figure labels on the sub-terms: domain-independent, domain-related, domain-specific]
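The terminological analysis on this slide can be approximated in a few lines. The following is only a sketch: it assumes that an XBRL element name is plain CamelCase and that a sub-term lookup table is available. The function names and the `VOCABULARY` table are hypothetical; the identifiers inside the table are the ones shown on the slide.

```python
import re

# Hypothetical lookup table; the resource identifiers are taken from the slide.
VOCABULARY = {
    "payments": "sapTerm:payments",
    "lease payments": "googleDefine:leasePayments",
    "finance lease": "sapTerm:financeLease",
}

def xbrl_name_to_words(qname: str) -> list[str]:
    """Split the CamelCase local name of a prefixed element into lowercase words."""
    local_name = qname.split(":", 1)[-1]  # drop the "ifrs:" style prefix
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z0-9]*|[a-z0-9]+", local_name)
    return [part.lower() for part in parts]

def known_subterms(words: list[str], vocabulary: dict[str, str]) -> list[tuple[str, str]]:
    """Return every contiguous word sequence that is listed in the vocabulary."""
    found = []
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            candidate = " ".join(words[start:end])
            if candidate in vocabulary:
                found.append((candidate, vocabulary[candidate]))
    return found

words = xbrl_name_to_words("ifrs:MinimumFinanceLeasePaymentsReceivable")
subterms = known_subterms(words, VOCABULARY)
```

Here `words` holds the five words of the label and `subterms` pairs each recognised sub-term with the resource that defines it.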
  • XBRL – Linguistic Analysis
    XBRL term: minimum finance lease payments receivable
    Financial text: “… received minimum finance lease payments …”, “… lease payment …”, “… lease payments …”
    [Figure labels: verb, adverb, simple, complex, singular, plural]
  • Outline
    1. Research challenge and motivation
    2. Ontology Translation
    3. Lexicalization (lemon)
    4. CLOBIE (CL Ontology-based Inf. Extraction)
  • Translation using STL
    Models developed in Monnet
    English / German / Spanish / Dutch
    …Net2
    Afrikaans?
    Zulu?
    Xhosa?

    ifrs:MinimumFinanceLeasePaymentsPayable
    ifrs:ProfitLossBeforeTax
    ifrs:Revenue
  • Application in Machine Translation
    in Dutch
    available-for-sale financial assets
    1. term analysis using:
    IFRS, SAPTerm, GoogleDefine
    [available-for-sale] [financial] [assets]
    2. translate subterms using:
    domain TM (IFRS), Linked Open Data (DBPedia),
    Translation services (GoogleTranslate)
    [voor verkoop beschikbare] [financiële] [activa]
    3. term synthesis using:
    grammars (rules, statistical models)
    voor verkoop beschikbare financiële activa
  • Application in Machine Translation
    in Afrikaans
    available-for-sale financial assets
    1. term analysis using:
    IFRS, SAPTerm, GoogleDefine
    [available-for-sale] [financial] [assets]
    2. translate subterms using:
    domain TM (IFRS), Linked Open Data (DBPedia),
    Translation services (GoogleTranslate)
    [beskikbaar vir verkoop] [finansiële] [bates]
    3. term synthesis using:
    grammars (rules, statistical models)
    finansiële bates beskikbaar vir verkoop
  • Application in Machine Translation
    in Spanish
    available-for-sale financial assets
    1. term analysis using:
    IFRS, SAPTerm, GoogleDefine
    [available-for-sale] [financial] [assets]
    2. translate subterms using:
    domain TM (IFRS), Linked Open Data (DBPedia),
    Translation services (GoogleTranslate)
    [disponibles para la venta] [financieros] [activos]
    3. term synthesis using:
    grammars (rules, statistical models)
    activos financieros disponibles para la venta
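The three steps shown for Dutch, Afrikaans and Spanish (term analysis, sub-term translation, term synthesis) can be sketched as a small pipeline. This is a toy illustration, not the Monnet implementation: the lookup table stands in for the translation memory and translation services, the reordering table stands in for the synthesis grammars, and all names (`SUBTERM_TRANSLATIONS`, `SYNTHESIS_ORDER`, `translate_term`) are hypothetical. The sub-term translations are spelled as in the synthesised terms on the slides.

```python
# Step 2 resource (assumed): sub-term translations per target language.
SUBTERM_TRANSLATIONS = {
    "nl": {"available-for-sale": "voor verkoop beschikbare",
           "financial": "financiële", "assets": "activa"},
    "af": {"available-for-sale": "beskikbaar vir verkoop",
           "financial": "finansiële", "assets": "bates"},
    "es": {"available-for-sale": "disponibles para la venta",
           "financial": "financieros", "assets": "activos"},
}

# Step 3 resource (assumed): for each language, the source positions
# in the order in which they appear in the target term.
SYNTHESIS_ORDER = {"nl": (0, 1, 2), "af": (1, 2, 0), "es": (2, 1, 0)}

def analyse(term: str) -> list[str]:
    """Step 1: split the term into sub-terms (whitespace split as a stand-in)."""
    return term.split()

def translate_subterms(subterms: list[str], lang: str) -> list[str]:
    """Step 2: translate each sub-term independently."""
    return [SUBTERM_TRANSLATIONS[lang][subterm] for subterm in subterms]

def synthesise(translated: list[str], lang: str) -> str:
    """Step 3: reorder the translated sub-terms for the target language."""
    return " ".join(translated[i] for i in SYNTHESIS_ORDER[lang])

def translate_term(term: str, lang: str) -> str:
    return synthesise(translate_subterms(analyse(term), lang), lang)

dutch = translate_term("available-for-sale financial assets", "nl")
afrikaans = translate_term("available-for-sale financial assets", "af")
spanish = translate_term("available-for-sale financial assets", "es")
```

The fixed reordering tuples only cover this one three-part term; a rule-based or statistical grammar would be needed for anything more general.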
  • Outline
    1. Research challenge and motivation
    2. Ontology Translation
    3. Lexicalization (lemon)
    4. CLOBIE (CL Ontology-based Inf. Extraction)
  • Why do we need a lexicon?
    http://en.wikipedia.org/wiki/Finance_lease
    “loads of unlinked domain-specific
    terminology on the web !”
    An interoperable web for … ?
    re-use
    enable multilinguality
    cross-lingual search
    cross-lingual fact extraction
    http://www.investopedia.com/terms/l/lease-payments.asp
  • Lexicon standards overview
    ISO (XML)
    TEI (Text Encoding Initiative)
    LMF (Lexical Markup Framework)
    W3C & Semantic Web (RDF / OWL)
    built-in rdfs:label
    lightweight linguistic representations (SKOS, SKOS-XL)
    rich linguistic representations (GOLD, LexInfo)
  • SKOS – Multilingual Information
    SKOS concepts with…
    term relations
    multilingual labels
    resource references
    [Figure: ifrs:MinimumFinanceLeasePayments, dbpedia:Finance_lease and dbpedia:Lease_payments, linked by skos:related, skos:narrower and skos:broader]
  • SKOS – Multilingual Information
    Not much uptake yet? (example from http://data.nytimes.com/)
  • Ontology-Text Mismatch
    ‘Edificio-historico’ vs. ‘…edificio, declarado Monumento Histórico…’
    >> goes beyond SKOS (monolingual & multilingual term variants)
    >> requires representation of lexical information to compute linguistic variants, e.g.
    ‘edificio historico[apposVP[NP[Adj]]]’
  • A Lexicon Model for Ontologies
    Requirements for ‘ontology-lexicon’ model
    Represent linguistic information relative to ontology
    Avoid unnecessary ambiguities by representing only lexical features relevant to semantics of underlying application
    Keep semantics separate from linguistic info
    Separate clearly ‘world’ (properties of objects referred to by words) from ‘word’ (properties of words) knowledge
    Modular, minimal design
    Provide simple core model that can be easily extended upon need
  • Was there a solution already? - SKOS
    Simple Knowledge Organization System – SKOS
    General model for formalizing thesauri, terminologies and related semantic and knowledge resources
    Formalization of terminology in focus - terminology, classification, Semantic Web communities
    Does not address linguistic aspects of terminology and therefore not the lexicon-ontology interface
    http://www.w3.org/2004/02/skos/
  • Was there a solution already? - GOLD
    General Ontology for Linguistic Description – GOLD
    Community-based ontology of linguistics
    Linguistic study in focus - linguistics community
    Formal model of linguistics as an ontology, but not about connecting lexical features to ontological semantics
    Other issues: very big, modularity?
    http://linguistics-ontology.org/gold/2010
  • Was there a solution already? - OWN
    OntoWordNet – OWN
    Formal specification of WordNet through extension and axiomatization of its conceptual relations
    Formal knowledge representation in focus - logic, knowledge representation, Semantic Web communities
    Turns WordNet into an ontology but not about connecting lexical features to ontological semantics
    http://wiki.loa-cnr.it/index.php/LoaWiki:OWN
  • Was there a solution already? - LMF
    Lexical Markup Framework – LMF
    General model for formalizing and sharing of machine-readable dictionaries
    Lexical knowledge representation in focus - lexicography, NLP communities
    Very close to ontology-lexicon requirements, but no view on how lexical features link to ontological semantics – semantics is limited to a notion of sense based on synsets
    Other issues: incomplete formal model, focus on classes, less on properties/relations
    http://www.lexicalmarkupframework.org/
  • lemon
    lexicon model for ontologies: ‘lemon’
    General model for formalizing lexical features relative to independently defined ontological semantics
    http://www.monnet-project.eu/lemon
    Two-level modelling
    Abstract level (meta-model): lemon
    Instantiation level (lexicon model): e.g. ‘LexInfo2’
    http://lexinfo.net/
  • Many solutions…
    …with an a priori amount of linguistics or semantics!
  • lemon: Overview
  • lemon: Lexicon
    Lexicon: wild animals
    entry
    entry
    entry
    LE: Kudu
    LE: shaped like a Kudu
    LexicalEntry can be a Word, Phrase, or Part - such as an Affix
  • lemon: Form
    Lexicon: wild animals
    Form properties: canonicalForm, otherForm, abstractForm
    [Figure: three LexicalEntries (LE), each linked to a Form (F); written forms “kudu”, “greater”, “great”]
  • lemon: Structure
    LE: shaped like a Kudu
    Components: LE: shaped, LE: like, LE: a, LE: Kudu
    LexicalEntry can be decomposed into one or more Components and compositional structure can be represented
  • lemon: Structure - Example
    Phrase: shaped like a kudu
    decomposition: four :Component items, each pointing via element to a :LexicalEntry: shaped (lemma=“shape”), like (lemma=“like”), a, Kudu
    [Figure: phrase-structure tree of :node items connected by edge, with a leaf link from each terminal node to its Component; constituent labels: PP, VP, VBN, NP, IN, DT, NNP]
  • lemon: Meaning & Reference
    [Figure: LE: kudu (lexeme), linked by sense to an LS (sememe), which carries a reference]
  • lemon: Meaning & Reference
    [Figure: LE: kudu and LE: greater kudu, each linked by sense to an LS with a reference; further labels: narrower, preSem]
  • lemon: Meaning & Reference
    [Figure: LE: greater kudu and LE: lesser kudu, each linked by sense to an LS; the two LS are marked incompatible (lexical incompatibility); reference: dbpedia:Kudu]
  • lemon: Meaning & Reference
    [Figure: LE: kudu and LE: goat, each linked by sense to an LS with a reference; ontological incompatibility expressed with owl:disjointWith]
  • lemon: Lexical Projection
    LexicalEntry can introduce a syntactic frame with arguments that are mapped to LexicalSense and indirectly to ontological semantic objects/properties
  • Lexical projection (Verb Frame)
    syntactic frame
    S ( NP VP ( VB NP ) )
    …with semantic sugar!
    SAP AG sold long-term fixed rate conventional mortgage loans
  • …more frames with LexInfo2
    http://lexinfo.net/ontology/2.0/lexinfo#DitransitiveFrame
    ditransitive : Frame
    subject : Argument
    direct object : Argument
    indirect object : Argument
    verb: Frame
    extends
    synarg
    synarg
    synarg
    SAP AG sold Company X mortgage loans
  • …more frames with LexInfo2
    http://lexinfo.net/ontology/2.0/lexinfo#DitransitiveFrame_To
    ditransitive_to : Frame
    subject : Argument
    direct object : Argument
    indirect object : Argument
    ditransitive: Frame
    extends
    synarg
    synarg
    synarg
    SAP AG sold mortgage loans to Company X
  • Or Zulu morphology…
    LE:angoma
    sense
    sense
    class = lemon:MorphologicalPattern
    LE:tolo
    :zuluNC7_8 a lemon:MorphPattern ;
      lemon:transform [
        lemon:rule "isi(?=[^aeiou])~" ;
        lemon:rule "is(?=[aeiou])~" ;
        lemon:generates [ lexinfo:number lexinfo:singular ]
      ] , [
        lemon:rule "izi(?=[^aeiou])~" ;
        lemon:rule "iz(?=[aeiou])~" ;
        lemon:generates [ lexinfo:number lexinfo:plural ]
      ] .
    pattern
    pattern
    isitolo (shop)
    izangoma (witch doctors)
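One way to read the rules in the :zuluNC7_8 pattern is as prefix templates: the literal text before the lookahead is the class prefix, the lookahead constrains the first letter of the stem, and "~" marks where the stem goes. That reading is an assumption made for this sketch, not a statement of how the lemon tooling interprets the rule syntax. The `inflect` function and the `ZULU_NC7_8` table are hypothetical names.

```python
import re

# Rule strings copied from the :zuluNC7_8 pattern on the slide.
ZULU_NC7_8 = {
    "singular": ["isi(?=[^aeiou])~", "is(?=[aeiou])~"],
    "plural": ["izi(?=[^aeiou])~", "iz(?=[aeiou])~"],
}

def inflect(stem: str, number: str) -> str:
    """Return the first prefixed form whose rule accepts the stem."""
    for rule in ZULU_NC7_8[number]:
        prefix = rule.split("(", 1)[0]                # literal prefix, e.g. "isi"
        candidate = prefix + stem
        pattern = rule.replace("~", re.escape(stem))  # "~" assumed to be the stem slot
        if re.fullmatch(pattern, candidate):
            return candidate
    raise ValueError(f"no {number} rule applies to stem {stem!r}")

shop = inflect("tolo", "singular")
witch_doctors = inflect("angoma", "plural")
```

Under this reading the consonant-initial stem "tolo" takes the full prefix and the vowel-initial stem "angoma" takes the shortened one, which matches the two forms given on the slide.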
  • Lemon Editor and Generator
    http://monnetproject.deri.ie/Lemon-Editor
    “asset-backed-debts”
    Finance Ontology
    lemon lexicon
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix lemon: <http://www.monnet-project.eu/lemon#> .
    @prefix financeV4: <http://fadyart.com/financeV4#> .
    @prefix lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#> .
    @prefix pennbank: <http://www.monnet-project.eu/pennbank#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    <file:test#assetbackeddebt> lemon:phraseRoot [ lemon:edge [ lemon:edge [ lemon:edge [ lemon:leaf _:n6 ] ;
    lemon:constituent pennbank:NNP ] ;
    lemon:constituent pennbank:NP ] ,
    [ lemon:edge [ lemon:edge [ lemon:leaf _:n88 ] ;
    lemon:constituent pennbank:VBD ] ,
    [ lemon:edge [ lemon:edge [ lemon:leaf _:n69 ] ;
    lemon:constituent pennbank:NN ] ;
    lemon:constituent pennbank:NP ] ;
    lemon:constituent pennbank:VP ] ;
    lemon:constituent pennbank:S ] ;
    lemon:decomposition ( _:n6 _:n88 _:n69 ) ;
    lemon:sense [ lemon:reference financeV4:AssetBackedDebt ] ;
    lemon:canonicalForm [ lemon:writtenRep "Asset backed debt"@en ] .

    lemon Lexical Entries
    <file:test#back> lexinfo:partOfSpeech lexinfo:verb ;
    lemon:canonicalForm [ lexinfo:tense lexinfo:past ;
    lexinfo:verbFormMood lexinfo:indicative ;
    lemon:writtenRep "backed"@en ;
    lexinfo:aspect lexinfo:perfective ] .
    _:n88 rdf:type lemon:Component ;
    lexinfo:tense lexinfo:past ;
    lexinfo:verbFormMood lexinfo:indicative ;
    lemon:element <file:test#back> ;
    lexinfo:aspect lexinfo:perfective .
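The nested lemon:edge / lemon:constituent / lemon:leaf structure in the generated lexicon is easier to follow as an explicit tree. The sketch below mirrors the phrase root of the asset-backed-debt entry in plain Python; the `Node` class and both helper functions are hypothetical, and as a simplification each leaf is attached directly to its part-of-speech node rather than to a separate child node.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    constituent: str                      # e.g. "S", "NP", "VBD"
    edges: list["Node"] = field(default_factory=list)
    leaf: Optional[str] = None            # Component identifier on terminal nodes

def leaves(node: Node) -> list[str]:
    """Collect the Component identifiers left to right."""
    if node.leaf is not None:
        return [node.leaf]
    collected = []
    for child in node.edges:
        collected.extend(leaves(child))
    return collected

def bracketed(node: Node) -> str:
    """Render the tree in bracket notation."""
    if not node.edges:
        return node.constituent
    return f"{node.constituent}({' '.join(bracketed(child) for child in node.edges)})"

# Same shape as the lemon:phraseRoot in the Turtle listing.
phrase_root = Node("S", [
    Node("NP", [Node("NNP", leaf="_:n6")]),
    Node("VP", [
        Node("VBD", leaf="_:n88"),
        Node("NP", [Node("NN", leaf="_:n69")]),
    ]),
])

decomposition = leaves(phrase_root)
```

Reading the leaves left to right gives the same order as the lemon:decomposition list, and the bracket form makes it visible that the generator parsed the label as a sentence (S) rather than as a noun phrase.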
  • Outline
    1. Research challenge and motivation
    2. Ontology Translation & Inform. Extraction
    3. Lexicalization (lemon)
    4. CLOBIE (Cross-lingual Ontology-based Information Extraction)
  • What is CLOBIE
    Information Extraction
    Monolingual
    No semantics
    Cross-lingual Information Extraction
    Multilingual
    Ontology-based Information Extraction
    Semantics in the background
  • What is CLOBIE
    Example: “SAP sold risk securities at a value of 12b EUR.”
    Information extraction (monolingual)
    PATTERN: .*SAP.*(sells|sold|issues).*(risk securities).*[0-9]+b (EUR|USD).*
    Information extraction (multilingual)
    PATTERN_DE: .*SAP.*verkaufte*.*(RisikoWertpapiere).*[0-9]+b (EUR|USD).*
    Information extraction with semantics
    .*[COMPANY] sell [ASSETS] .*
    PATTERN: .*$COMPANY.*(sells|sold|issues).*$ASSETS.*$MONETARY_VALUE.*
    ASSETS: financial assets; non-financial assets; risk securities; Property, Plant & Equipment
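The step from a hard-coded pattern to a pattern with semantic slots can be illustrated with an ordinary regular expression whose slots are filled from concept label lists. This is a minimal sketch: the `COMPANIES` list is a hypothetical gazetteer, `ASSETS` holds the asset labels from the slide, and a real system would draw both from the ontology-lexicon rather than from constants.

```python
import re

COMPANIES = ["SAP", "Tesco"]  # hypothetical gazetteer for the COMPANY slot
ASSETS = ["financial assets", "non-financial assets",
          "risk securities", "Property, Plant & Equipment"]

def alternation(labels: list[str]) -> str:
    """Build a regex alternation, longest label first so that
    'non-financial assets' is tried before 'financial assets'."""
    ordered = sorted(labels, key=len, reverse=True)
    return "|".join(re.escape(label) for label in ordered)

PATTERN = re.compile(
    rf"(?P<company>{alternation(COMPANIES)})\b.*?"
    rf"\b(?P<verb>sells|sold|issues)\b.*?"
    rf"(?P<assets>{alternation(ASSETS)}).*?"
    rf"(?P<amount>[0-9]+)b (?P<currency>EUR|USD)"
)

def extract(sentence: str):
    """Return the filled slots as a dict, or None if the pattern does not apply."""
    match = PATTERN.search(sentence)
    return match.groupdict() if match else None

fact = extract("SAP sold risk securities at a value of 12b EUR.")
```

Named groups keep the extracted slots addressable by role, which is what the $COMPANY, $ASSETS and $MONETARY_VALUE placeholders on the slide suggest.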
  • Application in Information Extraction (IE)
    :MinimumFinanceLeasePaymentsReceivable
    rdfs:subClassOf xbrli:monetaryItemType ;
    rdfs:label "Minimum finance lease payments receivable"@en .
    semantically lifted
    Minimum finance lease payments receivable
    term analysis
    receivables
    payments received
    linguistic analysis
    Tesco’s Annual Report 2009
    …The fair value of the Group’s
    finance lease receivables at
    23 February 2008 was £5m…
    SAP Annual Report 2008
    ..As at December 31, 2008,
    the future minimum lease
    payments expected to be
    received was €16 million…
  • CLOBIE Interdisciplinary
    CLOBIE at the intersection of:
    Machine Translation: Statistical MT, Rule-based MT, Localization
    Information Extraction: Term extraction, Relation extraction, Extract. grammars
    NLP: Corpus query, Term analysis, POS tagging, Morph analysis
    Information Retrieval: TF-IDF, Web query, ranking algorithms, CLIR (ESA, MT-based)
    Semantic Web: Ontologies, SKOS, lemon, SPARQL queries
  • Why CLOBIE?
    Many unstructured resources (News, FinReps)
    Knowledge in SW is often:
    Not dynamic (no regular, only manual updates)
    Knowledge across languages/countries not integrated
  • CLOBIE blackboard architecture
    CLOBIE Search: reads from the Blackboard
    Blackboard layers: token_id / POS, token_id / token_id, sent_id / term, sent_id / concept
    Each annotator reads from and contributes to the Blackboard
    Annotators
    Basic NLP
    • Splitter
    • Tok. / POS
    Linguistic Analyzer
    • Morphology
    • Dependency Parser
    Term Analyzer
    Semantic Analyzer
    • Terminology DB
    Semantic / Terminological / Linguistic Enrichment Process
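The blackboard idea, where each annotator reads the layers written so far and contributes a new one, can be sketched as follows. This is an illustration under simplifying assumptions, not the CLOBIE code: the three annotators are reduced to toy functions, sub-string matching stands in for term analysis, and the term-to-concept table is hypothetical (the concept names are the DBpedia resources mentioned on the SKOS slide).

```python
from collections import defaultdict

# Hypothetical terminology table mapping terms to concepts.
TERM_TO_CONCEPT = {
    "finance lease": "dbpedia:Finance_lease",
    "lease payments": "dbpedia:Lease_payments",
}

class Blackboard:
    """Shared store of annotation layers, keyed by layer name and sentence id."""
    def __init__(self, sentences):
        self.layers = defaultdict(dict)
        self.layers["sentence"] = dict(enumerate(sentences))

def basic_nlp(board):
    # Reads the sentence layer, contributes a token layer.
    for sent_id, sentence in board.layers["sentence"].items():
        board.layers["tokens"][sent_id] = sentence.lower().replace(",", " ").split()

def term_analyzer(board):
    # Reads tokens, contributes the terms found per sentence.
    for sent_id, tokens in board.layers["tokens"].items():
        text = " ".join(tokens)
        board.layers["term"][sent_id] = [t for t in TERM_TO_CONCEPT if t in text]

def semantic_analyzer(board):
    # Reads terms, contributes the concepts they denote.
    for sent_id, terms in board.layers["term"].items():
        board.layers["concept"][sent_id] = [TERM_TO_CONCEPT[t] for t in terms]

def run(sentences, annotators=(basic_nlp, term_analyzer, semantic_analyzer)):
    board = Blackboard(sentences)
    for annotate in annotators:
        annotate(board)
    return board

board = run([
    "The fair value of the Group's finance lease receivables was 5m",
    "The future minimum lease payments expected to be received was 16 million",
])
```

Because annotators only communicate through the layers, a morphology or dependency-parsing annotator could be added to the sequence without changing the others.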
  • CLOBIE Data set (Wind Energy)
    10 companies in Wind Energy domain
    Financial reports in
    German / Spanish / English / Dutch
    IFRS / DE-GAAP
    Semantics defined by
    IFRS vocabulary
    xEBR vocabulary
  • Next steps…
    Benchmark development and evaluation on the basis of a data set in finance domain
    financial reports and news from different companies in wind energy domain
    multilingual (German, Dutch, Spanish, English)
    multi-vocabulary (IFRS, European local GAAPs, DBPedia)
    Cross-lingual ontology-based information retrieval system
    Generate ontology-based information extraction grammars from lemon ontology-lexicons