The document discusses cross-lingual ontology translation and lexicalization. It presents the lemon model for connecting ontology concepts to lexical information to facilitate tasks like machine translation. The lemon model represents lexical entries, forms, linguistic structure, meanings, and syntactic frames. It separates ontological semantics from lexical features to enable linking terminology to external resources for translation. The model supports representing multilingual labels and relating terms through concepts like narrower/broader. This enables cross-lingual information extraction and search over linked data.
Author Credits - Maaz Nomani
A Proposition Bank is a collection of sentences hand-annotated with semantic role labels. Currently, around 10,000 sentences containing 0.2 million words have been hand-annotated with semantic label information.
This is a natural language resource of very rich linguistic information which can be used in a variety of NLP applications such as semantic parsing, syntactic parsing, sentiment analysis, dialogue systems etc.
In this paper, we present one such resource for Urdu, a resource-poor Indian language. The Proposition Bank of Urdu is built on top of the existing Urdu Treebank (a Treebank is a corpus of sentences annotated with POS, morphological, head, TAM, and dependency label information). A PropBank adds a layer of semantic information over this Treebank and can therefore facilitate semantic parsing and other semantic-level operations on natural language sentences.
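As an illustration of how a PropBank layer sits on top of treebank annotation, consider the following sketch; the record format, example sentence, and frame id are invented for illustration, not the actual Urdu PropBank format:

```python
# Illustrative sketch only (not the actual Urdu PropBank file format):
# a treebank layer of token records plus a PropBank-style layer that
# marks the predicate and its numbered arguments. The example sentence
# 'Ali kitaab paRhtaa' ("Ali reads a book") and the frame id are invented.
treebank_tokens = [
    {"id": 1, "form": "Ali", "pos": "NNP", "head": 3, "deprel": "k1"},
    {"id": 2, "form": "kitaab", "pos": "NN", "head": 3, "deprel": "k2"},
    {"id": 3, "form": "paRhtaa", "pos": "VM", "head": 0, "deprel": "root"},
]

propbank_layer = {
    "predicate": 3,                       # token id of the verb
    "frame": "paRh.01",                   # invented frame identifier
    "args": {"ARG0": [1], "ARG1": [2]},   # ARG0 = agent, ARG1 = patient
}

def arg_text(layer, tokens, label):
    """Surface string of a PropBank argument, joining its token forms."""
    forms = {t["id"]: t["form"] for t in tokens}
    return " ".join(forms[i] for i in layer["args"][label])

print(arg_text(propbank_layer, treebank_tokens, "ARG0"))  # Ali
```

The point of the layered design is that the semantic layer only references token ids, so it can be added without modifying the underlying treebank annotation.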
In this talk I intend to review some basic and high-level concepts: formal languages, grammars, and ontologies. Languages transmit knowledge from a sender to a receiver; grammars formally specify languages; ontologies are formal specifications of specific knowledge domains. After this introductory revision, highlighting the role of each of these elements in the context of computer-based problem solving (programming), I will talk about a project aimed at automatically inferring and generating a grammar for a Domain Specific Language (DSL) from a given ontology that describes the domain. The transformation rules will be presented, and the system, Onto2Gra, which fully implements this "Ontological approach for DSL development", will be introduced.
Semantic Rules Representation in Controlled Natural Language in FluentEditor (Cognitum)
Abstract. The purpose of this paper is to present a way of representing semantic rules (SWRL) in a controlled natural language (English) in order to make the rules easier to understand for humans interacting with a machine. The rule representation is implemented in FluentEditor, an ontology editor with a controlled natural language (CNL). The representation can be used in many domains where people interact with machines and use specialized interfaces to define knowledge in a system (a semantic knowledge base), e.g. representing medical knowledge and guidelines, procedures in crisis management, or the management of any coordination process. Such knowledge bases can support decision making in any discipline, provided the knowledge is stored in a proper semantic form.
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL... (ijaia)
Much automatic translation work has addressed major European language pairs by taking advantage of large-scale parallel corpora, but very little research has been conducted on the Amharic-Arabic language pair due to the scarcity of parallel data. Moreover, no benchmark parallel Amharic-Arabic text corpus is available for the machine translation task. Therefore, a small parallel Quranic text corpus was constructed from the existing monolingual Arabic text and its equivalent Amharic translation available from Tanzil. Experiments are carried out on two Neural Machine Translation (NMT) models, based on Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), using an attention-based encoder-decoder architecture adapted from the open-source OpenNMT system. The LSTM-based and GRU-based NMT models and the Google Translate system are compared; LSTM-based OpenNMT outperforms GRU-based OpenNMT and Google Translate, with BLEU scores of 12%, 11%, and 6%, respectively.
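For readers unfamiliar with the metric, the BLEU comparison above can be illustrated with a toy implementation of modified n-gram precision plus a brevity penalty; real evaluations use a standard, smoothed, corpus-level implementation, and the example sentences here are invented:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of modified n-gram
    precisions times a brevity penalty. Standard evaluations use a
    corpus-level, properly smoothed implementation instead."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i+n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)       # crude smoothing
    # brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 2))  # 1.0
```

A perfect match scores 1.0; the 12% / 11% / 6% figures in the abstract correspond to scores of 0.12, 0.11, and 0.06 on this scale.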
Material of the 4th Intensive Summer school and collaborative workshop on Natural Language Processing (NAIST Franco-Thai Workshop 2010).
Bangkok, Thailand.
"Bilingual Terminology Extraction from TMX. A state-of-the-art overview." Presentation at Translating Europe Forum 2016. Focus on translation technology.
Identification of Fertile Translations in Comparable Corpora: a Morpho-Compos... (Estelle Delpech)
Material presented at the Tenth Biennial Conference of the
Association for Machine Translation in the Americas (AMTA 2012), San Diego, CA.
Download paper at http://hal.archives-ouvertes.fr/hal-00730325.
Institutions: Laboratoire d'Informatique de Nantes Atlantique (LINA), Lingua et Machina, Gremuts
Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange (Estelle Delpech)
Material presented at the TKE (Terminology and Knowledge Engineering) Conference 2010, Dublin, Ireland.
Download paper at http://hal.archives-ouvertes.fr/hal-00544403
Institutions: Laboratoire d'Informatique de Nantes Atlantique (LINA), Lingua et Machina.
Applicative evaluation of bilingual terminologies (Estelle Delpech)
Material presented at the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011), Riga, Latvia.
Download paper: http://hal.archives-ouvertes.fr/hal-00585187
Institutions: Laboratoire d'Informatique de Nantes Atlantique (LINA), Lingua et Machina
Extraction of domain-specific bilingual lexicon from comparable corpora: comp... (Estelle Delpech)
Material presented at the 24th International Conference on Computational Linguistics (COLING 2012), Mumbai, India.
Paper download at http://hal.archives-ouvertes.fr/hal-00743807.
Institutions: Laboratoire d'Informatique de Nantes Atlantique (LINA), Lingua et Machina, Gremuts.
Challenges in the linguistic exploitation of specialized republishable web co... (Adrien Barbaresi)
Talk at RESAW conference 2015 on web crawling and web corpus construction. Challenges to tackle: metadata extraction, quality assessment, and licensing.
Much of the work goes into ensuring the "scientificity" of web texts and making them not only available but also citable in a scholarly sense.
Presentation made in the context of the FAO AIMS Webinar titled “Knowledge Organization Systems (KOS): Management of Classification Systems in the case of Organic.Edunet” (http://aims.fao.org/community/blogs/new-webinaraims-knowledge-organization-systems-kos-management-classification-systems)
21/2/2014
Directions: This assignment is for a Reading Course (AlyciaGold776)
Directions: This assignment is for a Reading Course. The cross-disciplinary unit that I will be implementing in my classroom is Social Studies (Grade 11 US History). Attached you will find a copy of the lesson plan and an attachment of Reading Standards. Current resources and tools that would enhance the learning experience for all students are Kahoot, Quizlet, or Nearpod. Must use original work and must be APA formatted.
Please review the Special Accommodations and ELL section on the last page of the lesson plan; all benchmarks and state standards for the lesson are within the lesson plan.
Benchmark - Cross-Disciplinary Unit Narrative
For this benchmark, write a 750-1,000 word narrative about a cross-disciplinary unit you would implement in your classroom. Choose a minimum of two standards, at least one for the content area of your field experience classroom and at least one supportive literacy standard, to focus on for the unit narrative. You may use your Topic 3 "Instructional Strategies for Literacy Integration Matrix" as a guide to inform this assignment.
Your narrative must include:
· Unit Description and Rationale: Complete description of unit theme and purpose, including learning objectives, based on the content area standards and literacy standards.
· Learning Opportunities: Description of two learning opportunities that create ways for students to learn, practice, and master academic language in content areas
· Collaboration: Description of how you would facilitate students’ collaborative use of current tools and resources to maximize content learning in varied contexts
· Support: Description of support that would be implemented for student literacy development across content areas
· Differentiation: Description of how the lessons within the unit would provide differentiated instruction
· Strategies: Description of strategies that you would use within your unit to advocate for equity in your classroom
· Cultural Diversity: Description of the effect of cultural diversity in the classroom on reading and writing development. Describe how the unit capitalizes on cultural diversity.
· Resources: Description of current resources and tools that would enhance the learning experience for all students.
Support your findings with 3-5 scholarly resources.
ELA Standards and Technology Matrix (Grades 11-12)
Click on the standard to view more information in CPALMS. Click on the links to visit the websites for the featured technology tools.
Grade Standards Technology
11-12 | LAFS.1112.L.3.4: Determine or clarify the meaning of unknown and multiple-meaning words and phrases based on grades 11-12 reading and content, choosing flexibly from a range of strategies.
a. Use context (e.g., the overall meaning of a sentence, paragraph, or text; a word's position or function in a sentence) as a clue to the meaning of a word or phrase.
b. Identify and correctly use patterns of word changes that indic ...
lexicog: Overview of the New Module for Lexicography of OntoLex-lemon (Prêt-à-LLOD)
K Dictionaries and Lexicala Data Workshop, October 3, 2019. Sintra, Portugal
By JULIA BOSQUE-GIL
Distributed Information Systems Group
University of Zaragoza, Spain
jbosque@unizar.es
The objective of this webinar is to provide a brief overview of the Knowledge Organization Systems (KOS) and the tools used for managing them. The presentation will focus on the management of the multilingual Organic.Edunet ontology as a case study. In this context it will present aspects such as the collaborative work, multilinguality needs and update of the concepts using an online KOS management tool (MoKi).
Workshop on Learning Technology Standards for Agriculture and Rural Development (AgroLT 2008)
September 19, 2008, Athens, Greece
In conjunction with
4th International Conference on Information and Communication Technologies in Bio and Earth Sciences (HAICTA 2008)
Currently, the Experience API (xAPI) mostly focuses on providing "structural" interoperability of xAPI statements via JavaScript Object Notation (JSON). Structural interoperability defines the syntax of the data exchange and ensures the data exchanged between systems can be interpreted at the data-field level. In comparison, semantic interoperability leverages the structural interoperability of the data exchange but also provides a vocabulary so other systems and consumers can interpret the data. Analytics produced from xAPI statements would benefit from more consistent and semantic approaches to describing domain-specific verbs, activityTypes, attachments, and extensions. The xAPI specification recommends that implementers adopt community-defined vocabularies, but the only current guidance is to provide very basic, human-readable identifier metadata (e.g., literal string name (display), description). The main objective of the Vocabulary and Semantic Interoperability Working Group (WG) is to research machine-readable semantic technologies (e.g., RDF, JSON-LD) in order to produce guidance for Communities of Practice (CoPs) on creating, publishing, or managing controlled vocabulary datasets (e.g., verbs). In this session, you will see a brief introduction to modern controlled vocabulary practices and how they can be applied to xAPI to add the semantic expressiveness of controlled vocabularies. The progress and resources of the Vocabulary WG (started in April 2015) will also be shared.
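As a sketch of what such machine-readable vocabulary metadata could look like, the following builds a JSON-LD description of a hypothetical xAPI verb with SKOS labels, using only the standard library; the vocabulary IRI and the verb itself are invented for illustration, not part of any published CoP vocabulary:

```python
import json

# A minimal JSON-LD description of a hypothetical xAPI verb. The IRI
# and verb name are invented; real Communities of Practice mint their
# own identifiers. Language maps keep labels multilingual-ready.
verb = {
    "@context": {
        "skos": "http://www.w3.org/2004/02/skos/core#",
        "prefLabel": {"@id": "skos:prefLabel", "@container": "@language"},
        "definition": {"@id": "skos:definition", "@container": "@language"},
    },
    "@id": "http://example.org/xapi/verbs#annotated",
    "@type": "skos:Concept",
    "prefLabel": {"en": "annotated"},
    "definition": {"en": "Indicates the actor added an annotation to the object."},
}

print(json.dumps(verb, indent=2))
```

Unlike the bare display/description strings of the basic guidance, such a record can be consumed both as plain JSON by existing xAPI tooling and as RDF by semantic tooling.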
A special session about using DC metadata to describe scholarly research papers held during the DC-2006 conference in Manzanillo, Mexico in October 2006.
Cross-lingual ontology lexicalisation, translation and information extraction Net2 workshop, University of South Africa (UNISA)
2. Context and Motivation. Monnet use case in the financial domain: query financial information cross-vocabulary and cross-lingual, and get results in your own language. Research challenges: localization and translation of vocabularies; cross-lingual ontology-based information extraction.
8. XBRL – Semantic Analysis “Enhance semantics to facilitate translation and information extraction.”
9. XBRL – Terminological Analysis ifrs:MinimumFinanceLeasePaymentsReceivableAtPresentValue ifrs:MinimumFinanceLeasePaymentsReceivable Minimum finance lease payments receivable, at present value sapTerm:payments googleDefine:leasePayments sapTerm:financeLease googleDefine:Finance_lease Domain Independent Domain Related Domain Specific Domain Related Domain Independent Domain Independent Domain Specific
10. XBRL – Linguistic Analysis Financial text “… received minimum finance lease payments …” verb “… lease payment …” complex singular simple minimum finance lease payments receivable XBRL term adverb … lease payments … plural
11. Outline 1. Research challenge and motivation 2. Ontology Translation 3. Lexicalization (lemon) 4. CLOBIE (CL Ontology-based Inf. Extraction)
12. Translation using STL Models developed in Monnet English / German / Spanish / Dutch …Net2 Afrikaans? Zulu? Xhosa? … ifrs:MinimumFinanceLeasePaymentsPayable ifrs:ProfitLossBeforeTax ifrs:Revenue
13. Application in Machine Translation in Dutch available-for-sale financial assets IFRS, SAPTerm, GoogleDefine 1. term analysis using: domain TM (IFRS), Linked Open Data (DBPedia), Translation services (GoogleTranslate) [available-for-sale] [financial] [assets] 2. translate subterms using: [voorverkoopbeschikbare] [financiële] [activa] 3. term synthesis using: grammars (rules, statistical models) voor verkoop beschikbare financiële activa
14. Application in Machine Translation in Afrikaans available-for-sale financial assets IFRS, SAPTerm, GoogleDefine 1. term analysis using: domain TM (IFRS), Linked Open Data (DBPedia), Translation services (GoogleTranslate) [available-for-sale] [financial] [assets] 2. translate subterms using: [beskikbaarvirverkoop] [finansiële] [bates] 3. term synthesis using: grammars (rules, statistical models) finansiële bates beskikbaar vir verkoop
15. Application in Machine Translation in Spanish available-for-sale financial assets IFRS, SAPTerm, GoogleDefine 1. term analysis using: domain TM (IFRS), Linked Open Data (DBPedia), Translation services (GoogleTranslate) [available-for-sale] [financial] [assets] 2. translate subterms using: [disponiblespara la venta] [financia] [activos] 3. term synthesis using: grammars (rules, statistical models) activos financieros disponibles para la venta
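The three-step procedure on these slides (term analysis, subterm translation, term synthesis) can be sketched in a few lines of Python; the toy lexicon and the single reordering rule below are invented stand-ins for the Monnet system's actual resources and synthesis grammars:

```python
# Toy sketch of compositional term translation: (1) segment the term
# into subterms, (2) translate each subterm with a bilingual lexicon,
# (3) synthesize the target term with a reordering rule. The lexicon
# entries and the rule are illustrative assumptions only.
LEXICON_EN_AF = {
    "available-for-sale": "beskikbaar vir verkoop",
    "financial": "finansiële",
    "assets": "bates",
}

def translate_term(term, lexicon, head_first=True):
    subterms = term.split()                      # 1. term analysis
    translated = [lexicon[s] for s in subterms]  # 2. subterm translation
    if head_first:                               # 3. term synthesis:
        # move the modifier phrase (first English subterm) to the end,
        # approximating the Afrikaans order from the slide,
        # 'finansiële bates beskikbaar vir verkoop'
        translated = translated[1:] + [translated[0]]
    return " ".join(translated)

print(translate_term("available-for-sale financial assets", LEXICON_EN_AF))
# finansiële bates beskikbaar vir verkoop
```

In the system described on the slides, the reordering step would be driven by grammar rules or statistical models rather than a fixed rule.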
16. Outline 1. Research challenge and motivation 2. Ontology Translation 3. Lexicalization (lemon) 4. CLOBIE (CL Ontology-based Inf. Extraction)
17. Why do we need a lexicon? http://en.wikipedia.org/wiki/Finance_lease http://www.investopedia.com/terms/l/lease-payments.asp "loads of unlinked domain-specific terminology on the web!" An interoperable web for …? Re-use, enable multilinguality, cross-lingual search, cross-lingual fact extraction.
18. Lexicon standards overview. ISO (XML): TEI (Text Encoding Initiative), LMF (Lexical Markup Framework). W3C & Semantic Web (RDF / OWL): built-in rdfs:label; lightweight linguistic representations (SKOS, SKOS-XL); rich linguistic representations (GOLD, LexInfo).
20. SKOS – Multilingual Information Not much uptake yet? from http://data.nytimes.com/
21. Ontology-Text Mismatch ‘Edificio-historico’ vs. ‘…edificio, declarado Monumento Histórico…’ >> goes beyond SKOS (monolingual & multilingual term variants) >> requires representation of lexical information to compute linguistic variants, e.g. ‘edificio historico[apposVP[NP[Adj]]]’
22. A Lexicon Model for Ontologies. Requirements for an 'ontology-lexicon' model: represent linguistic information relative to the ontology; avoid unnecessary ambiguities by representing only the lexical features relevant to the semantics of the underlying application; keep semantics separate from linguistic information, clearly separating 'world' knowledge (properties of the objects referred to by words) from 'word' knowledge (properties of the words themselves); modular, minimal design: provide a simple core model that can easily be extended when needed.
23. Was there a solution already? - SKOS. Simple Knowledge Organization System (SKOS): a general model for formalizing thesauri, terminologies, and related semantic and knowledge resources. Formalization of terminology in focus (terminology, classification, and Semantic Web communities). Does not address the linguistic aspects of terminology, nor, therefore, the lexicon-ontology interface. http://www.w3.org/2004/02/skos/
24. Was there a solution already? - GOLD. General Ontology for Linguistic Description (GOLD): a community-based ontology of linguistics. Linguistic study in focus (linguistics community). A formal model of linguistics as an ontology, but not about connecting lexical features to ontological semantics. Other issues: very big; modularity? http://linguistics-ontology.org/gold/2010
25. Was there a solution already? - OWN. OntoWordNet (OWN): a formal specification of WordNet through extension and axiomatization of its conceptual relations. Formal knowledge representation in focus (logic, knowledge representation, and Semantic Web communities). Turns WordNet into an ontology, but not about connecting lexical features to ontological semantics. http://wiki.loa-cnr.it/index.php/LoaWiki:OWN
26. Was there a solution already? - LMF. Lexical Markup Framework (LMF): a general model for formalizing and sharing machine-readable dictionaries. Lexical knowledge representation in focus (lexicography and NLP communities). Very close to the ontology-lexicon requirements, but offers no view on how lexical features link to ontological semantics; semantics is limited to a notion of sense based on synsets. Other issues: incomplete formal model; focus on classes, less on properties/relations. http://www.lexicalmarkupframework.org/
27. lemon: lexicon model for ontologies. A general model for formalizing lexical features relative to independently defined ontological semantics. http://www.monnet-project.eu/lemon Two-level modelling: abstract level (meta-model): lemon; instantiation level (lexicon model): e.g. 'LexInfo2' http://lexinfo.net/
30. lemon: Lexicon. Example lexicon 'wild animals' with entries such as LE 'Kudu' and LE 'shaped like a Kudu'. A LexicalEntry can be a Word, a Phrase, or a Part, such as an Affix.
31. lemon: Form. A LexicalEntry is linked to its Forms via canonicalForm, otherForm, and abstractForm (e.g. the forms "kudu", "greater", "great").
32. lemon: Structure. The entry 'shaped like a Kudu' can be decomposed into the entries 'shaped', 'like', 'a', and 'Kudu': a LexicalEntry can be decomposed into one or more Components, and the compositional structure can be represented.
33. lemon: Structure - Example. The decomposition of 'shaped like a kudu' links Components to LexicalEntries via lexeme, element, leaf, node, and edge relations, and its constituent structure can be represented as a parse tree: VP [ VBN 'shaped' (lemma="shape"), PP [ IN 'like', NP [ DT 'a', NNP 'Kudu' ] ] ].
34. lemon: Meaning & Reference. A LexicalEntry (lexeme), e.g. 'kudu', has a sense pointing to a LexicalSense, which in turn points, via the sememe/reference relation, to an ontological entity.
35. lemon: Meaning & Reference. Senses can be related: the sense of 'greater kudu' stands in a narrower relation to the sense of 'kudu', each with its own reference (preSem).
36. lemon: Meaning & Reference. Lexical incompatibility: the senses of 'greater kudu' and 'lesser kudu' are incompatible, even though both reference dbpedia:Kudu.
37. lemon: Meaning & Reference. Ontological incompatibility: the senses of 'kudu' and 'goat' reference ontology classes related by owl:disjointWith.
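Assuming an RDF-style triple encoding, the entities on these lemon slides might be written roughly as follows; the namespaces and identifiers are illustrative approximations, not the exact lemon vocabulary:

```python
# Rough sketch of the lemon entities from the preceding slides as
# subject-predicate-object triples, using plain Python tuples rather
# than an RDF library so the structure is easy to inspect. IRIs and
# property names are illustrative, not normative lemon identifiers.
triples = [
    (":lex_kudu", "rdf:type", "lemon:LexicalEntry"),
    (":lex_kudu", "lemon:canonicalForm", ":form_kudu"),
    (":form_kudu", "lemon:writtenRep", '"kudu"@en'),
    (":lex_kudu", "lemon:sense", ":sense_kudu"),
    (":sense_kudu", "lemon:reference", "dbpedia:Kudu"),
    # sense-level relation: 'greater kudu' is narrower than 'kudu'
    (":sense_greater_kudu", "lemon:narrower", ":sense_kudu"),
]

def objects(subject, predicate):
    """All objects for a given subject and predicate."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]

print(objects(":sense_kudu", "lemon:reference"))  # ['dbpedia:Kudu']
```

Note how the ontological reference (dbpedia:Kudu) attaches to the LexicalSense, not to the entry itself, keeping 'word' knowledge separate from 'world' knowledge as the requirements slide demands.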
38. lemon: Lexical Projection LexicalEntry can introduce a syntactic frame with arguments that are mapped to LexicalSense and indirectly to ontological semantic objects/properties
44. Outline 1. Research challenge and motivation 2. Ontology Translation & Inform. Extraction 3. Lexicalization (lemon) 4. CLOBIE (Cross-lingual Ontology-based Information Extraction)
45. What is CLOBIE? Information Extraction: monolingual, no semantics. Cross-lingual Information Extraction: multilingual. Ontology-based Information Extraction: semantics in the background.
46. What is CLOBIE? Information extraction (monolingual) uses per-language surface patterns, e.g. PATTERN: .*SAP.*[sells|sold|issues].*[risk securities].*[0-9]+b [EUR|USD].* matching "SAP sold risk securities at a value of 12b EUR." Information extraction (multilingual) adds patterns per language, e.g. PATTERN_DE: .*SAP.*verkaufte.*[RisikoWertpapiere].*[0-9]+b [EUR|USD].* Information extraction with semantics abstracts over ontology classes: .*$COMPANY.*[sells|sold|issues].*$ASSETS.*$MONETARY_VALUE.*, where $ASSETS covers, e.g., financial assets, non-financial assets, risk securities, and Property, Plant & Equipment.
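As a sketch, the slide's monolingual pattern can be turned into a working regular expression; the slide's bracket notation is schematic, so a real regex needs proper alternation groups, and the named groups below are my own additions:

```python
import re

# The slide's schematic pattern rendered as an actual Python regex.
# Named groups (company, assets, value, currency) are illustrative;
# the ontology-driven variant would substitute class lexicalizations
# (e.g. all subterms of 'assets') for the literal alternations.
pattern = re.compile(
    r"(?P<company>SAP)\s+(?:sells|sold|issues)\s+"
    r"(?P<assets>risk securities)\s+at a value of\s+"
    r"(?P<value>[0-9]+b)\s+(?P<currency>EUR|USD)"
)

sentence = "SAP sold risk securities at a value of 12b EUR."
m = pattern.search(sentence)
if m:
    print(m.group("company"), m.group("assets"),
          m.group("value"), m.group("currency"))
    # SAP risk securities 12b EUR
```

The semantics-aware version described on the slide replaces the hard-coded alternatives with slots filled from ontology classes, so one pattern covers every lexicalization of, say, $ASSETS.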
47. Application in Information Extraction (IE). :MinimumFinanceLeasePaymentsReceivable rdfs:subClassOf xbrli:monetaryItemType ; rdfs:label "Minimum finance lease payments receivable"@en . The label "Minimum finance lease payments receivable" is semantically lifted; term analysis and linguistic analysis yield variants such as "receivables" and "payments received", which match report text, e.g. Tesco's Annual Report 2009: "…The fair value of the Group's finance lease receivables at 23 February 2008 was £5m…"; SAP Annual Report 2008: "…As at December 31, 2008, the future minimum lease payments expected to be received was €16 million…"
48. CLOBIE Interdisciplinary Statistical MT Rule-based MT Localization Term extraction Relation extraction Extract. grammars Machine Translation Information Extraction NLP Corpus query Term analysis POS tagging Morph analysis Information Retrieval CLOBIE Semantic Web TF-IDF Web query ranking algorithms CLIR (ESA, MT-based) Ontologies SKOS, lemon SPARQL queries
49. Why CLOBIE? Many unstructured resources (news, financial reports). Knowledge in the Semantic Web is often not dynamic (no regular, only manual updates), and knowledge across languages/countries is not integrated.
53. CLOBIE Data Set (Wind Energy). 10 companies in the wind energy domain; financial reports in German / Spanish / English / Dutch; IFRS / DE-GAAP; semantics defined by the IFRS vocabulary and the xEBR vocabulary.
54. Next steps… Benchmark development and evaluation on the basis of a data set in the finance domain: financial reports and news from different companies in the wind energy domain; multilingual (German, Dutch, Spanish, English); multi-vocabulary (IFRS, European local GAAPs, DBpedia). A cross-lingual ontology-based information retrieval system. Generating ontology-based information extraction grammars from lemon ontology-lexicons.
Editor's Notes
Frame: VerbNet, … | Linguistic ontology: GOLD, LexInfo2 | Form: SKOS | LexicalSense-Ontology: SKOS-XL | Node/Edge: parse structures (rare formats such as the NEGRA Corpus / TIGER tag set by IMS Stuttgart, or proprietary StanfordParser output)
Also phrasal lexicon
Lemon distinguishes among different types of lexical forms
LexicalSense: an underspecified sense that points to a language-external reference, a unique ontological semantic object (depending on conditions and context); it can have a subsense and senseRelation with other LexicalSenses; the sememe relation between a LexicalSense and an ontological semantic object can be either pref / alt / hiddenSem.