STL: A similarity measure based on semantic and linguistic information


Published in: Education, Technology


  1. 1. STL: A Similarity Measure Based on Semantic, Terminological and Linguistic Information<br />Nitish Aggarwal<br />joint work with Tobias Wunner, Mihael Arcan<br />DERI, NUI Galway<br /><br />Friday, 19th Aug, 2011<br />DERI, Friday Meeting<br />
  2. 2. Overview<br />Motivation & Applications<br />Why STL?<br />Semantic<br />Terminology<br />Linguistic<br />Evaluation<br />Conclusion and future work<br />
  3. 3. Motivation & Applications<br />Semantic Annotation<br />Similarity between corpus data and ontology concepts<br />SAP AG held €1615 million in short-term liquid assets (2009)<br />“dbpedia:SAP_AG” “xEBR:LiquidAssets” at “dbpedia:year:2009”<br />
  4. 4. Motivation & Applications<br />Semantic Search<br />Similarity between query and index object<br />SAP liquid asset in 2010<br />Current asset of SAP last year<br />“dbpedia:SAP_AG” “xEBR:liquid asset” at “dbpedia:year:2010”<br />Net cash of SAP in 2010<br />SAP total amount received in 2010<br />
  5. 5. Motivation & Applications<br />Ontology Matching & Alignment<br />Similarity between ontology concepts<br />ifrs:StatementOfFinancialPosition<br />xebr:KeyBalanceSheet<br />Assets<br />ifrs:Assets<br />ifrs:BiologicalAssets<br />xebr:SubscribedCapitalUnpaid<br />ifrs:CurrentAssets<br />ifrs:NonCurrentAssets<br />xebr:FixedAssets<br />xebr:CurrentAssets<br />ifrs:PropertyPlantAndEquipment<br />xebr:TangibleFixedAssets<br />xebr:IntangibleFixedAssets<br />xebr:Amount Receivable<br />xebr:LiquidAssets<br />Similarity = ?<br />Similarity = ?<br />ifrs:CashAndCashEquivalents<br />ifrs:TradeAndOtherCurrentReceivables<br />ifrs:Inventories<br />
  6. 6. Classical Approaches<br />String Similarity<br />Levenshtein distance, Dice coefficient<br />Corpus-based<br />LSA, ESA, Google distance, Vector-Space Model<br />Ontology-based<br />Path distance, Information content<br />Syntax Similarity<br />Word order, Part of Speech<br />
  7. 7. Why STL?<br />Semantic<br />Semantic structure and relations<br />Terminology<br />Complex terms expressing the same concept<br />Linguistic<br />Phrase and dependency structure<br />
  8. 8. STL<br />Definition<br />Linear combination of semantic, terminological and linguistic similarity scores<br />Weights obtained by linear regression<br />Formula used<br />STL = w1*S + w2*T + w3*L + Constant<br />w1, w2, w3 represent the contribution of each component<br />
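The slide does not show how the regression was run. The following is a minimal sketch, assuming ordinary least squares over per-pair (S, T, L) scores with one benchmark score per term pair. The helper names and the sample data are hypothetical and are not taken from the talk.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations update both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_stl_weights(features, targets):
    """Least-squares fit of STL = w1*S + w2*T + w3*L + Constant.

    features: list of (S, T, L) tuples, one per term pair.
    targets:  benchmark similarity scores for the same pairs.
    Returns [w1, w2, w3, constant].
    """
    rows = [[s, t, l, 1.0] for s, t, l in features]  # 1.0 carries the constant
    k = 4
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical scores for six term pairs. The targets are generated from
# known weights so that the fit can be checked against them.
demo_features = [(0.1, 0.9, 0.3), (0.8, 0.2, 0.5), (0.4, 0.4, 0.9),
                 (0.7, 0.6, 0.1), (0.2, 0.1, 0.2), (0.9, 0.8, 0.7)]
true_weights = (0.2, 0.5, 0.1, 0.1)
demo_targets = [true_weights[0] * s + true_weights[1] * t
                + true_weights[2] * l + true_weights[3]
                for s, t, l in demo_features]
fitted = fit_stl_weights(demo_features, demo_targets)
```

With real data, `demo_targets` would be replaced by human-judged similarity scores, and the fitted weights would indicate how much each component contributes.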
  9. 9. Semantic<br />Wu-Palmer<br />2*depth(MSCA) / (depth(c1) + depth(c2))<br />Resnik’s Information Content<br />IC(c) = -log p(c)<br />Intrinsic Information Content (Pirro09)<br />Avoids the need to analyse large corpora<br />
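The two formulas on this slide can be sketched on a toy hierarchy. This assumes that MSCA denotes the most specific common ancestor of the two concepts and that the root has depth 1. The `PARENT` map is a hypothetical fragment loosely based on the concept labels in these slides, not the actual xEBR taxonomy.

```python
import math

# Hypothetical child -> parent map (illustration only).
PARENT = {
    "Assets": None,
    "Fixed Assets": "Assets",
    "Current Assets": "Assets",
    "Tangible Fixed Assets": "Fixed Assets",
    "Amount Receivable": "Current Assets",
}


def ancestors(concept):
    """Path from the concept up to the root, starting with the concept itself."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Number of nodes on the path to the root (the root has depth 1)."""
    return len(ancestors(concept))


def msca(c1, c2):
    """Deepest concept that is an ancestor of (or equal to) both c1 and c2."""
    seen = set(ancestors(c1))
    for node in ancestors(c2):
        if node in seen:
            return node
    raise ValueError("concepts share no ancestor")


def wu_palmer(c1, c2):
    """2 * depth(MSCA) / (depth(c1) + depth(c2))."""
    return 2 * depth(msca(c1, c2)) / (depth(c1) + depth(c2))


def information_content(probability):
    """Resnik's IC(c) = -log p(c), with p(c) estimated from a corpus."""
    return -math.log(probability)
```

In this toy tree, "Tangible Fixed Assets" and "Amount Receivable" only meet at the root, so their Wu-Palmer score is 2*1 / (3 + 3).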
  10. 10. Cont.<br />Intrinsic information content (iIC)<br />[formula not preserved in the transcript]<br />where sub(c) is the number of sub-concepts of a given concept c.<br />Pirro_Similarity<br />
  11. 11. Cont.<br />MSCA<br />subconcepts = 48<br />IC (TFA) = 0.32<br />Assets<br />Subscribed Capital Unpaid<br />Fixed Assets<br />Current Assets<br />Pirro_Sim = 0.33<br />Pirro_Sim = ?<br />Stocks<br />Tangible Fixed Assets<br />Amount Receivable<br />subconcepts = 6<br />IC (AR) = 0.69<br />subconcepts = 9<br />IC (TFA) = 0.60<br />Amount Receivable [Total]<br />Amount Receivable within one year<br />Amount Receivable after more than one year<br />Other Tangible Fixed Assets<br />Property, Plant and Equipment<br />Payments on account and asset in construction<br />Furniture Fixture and Equipment<br />Trade Debtors<br />Other Fixture<br />Land and Building<br />Other Debtors<br />Plant and Machinery<br />Other Property, Plant and Equipment<br />Property, Plant and Equipment [Total]<br />
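The transcript does not preserve the Pirro_Similarity formula. One form that reproduces the 0.33 on this slide is the ratio IC(MSCA) / (IC(c1) + IC(c2) - IC(MSCA)). This is an assumption inferred from the numbers, and it further assumes that the 0.32 listed under MSCA is the information content of the MSCA itself. A minimal sketch:

```python
def pirro_similarity(ic_c1, ic_c2, ic_msca):
    """Shared information content relative to total information content.

    Assumed form: IC(msca) / (IC(c1) + IC(c2) - IC(msca)).
    It is not confirmed by the slides; it only matches their numbers.
    """
    denominator = ic_c1 + ic_c2 - ic_msca
    if denominator == 0:
        return 0.0  # no information content available to compare
    return ic_msca / denominator


# Values read off the slide: Tangible Fixed Assets (0.60),
# Amount Receivable (0.69), and their MSCA (0.32, assumed).
similarity = pirro_similarity(0.60, 0.69, 0.32)
```

Rounded to two decimals, `similarity` agrees with the Pirro_Sim value shown on the slide.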
  12. 12. Limitation<br />Does semantic structure always reflect similarity well?<br />Not necessarily<br />e.g. in xEBR, the parent-child relation is used to describe the layout of concepts<br />“Work in progress” is not a type of asset, although both are linked via the parent-child relationship<br />
  13. 13. Terminology<br />Definition<br />Common naming convention<br />N-grams vs. subterms<br />In the financial domain, the bigram “Intangible Fixed” is a substring of “Other Intangible Fixed Assets” but not a subterm.<br />Terminological similarity<br />Maximal subterm overlap<br />
  14. 14. Cont.<br />Trade Debts Payable After More Than One Year<br />[[Trade][Debts]][Payable][After More Than One Year]<br />[SAP:Payable]<br />[Ifrs:After More Than One Year]<br />[Investoword:Debt]<br />[FinanceDict:Trade Debts]<br />[Investopedia:Trade]<br />Financial[Debts][Payable][After More Than One Year]<br />Financial Debts Payable After More Than One Year<br />
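The slides do not give the exact overlap formula. The sketch below assumes a greedy longest-match segmentation against a term base, followed by a Dice-style overlap of the resulting subterm sets. `TERM_BASE` is a hypothetical stand-in for the terminological resources named on the slide, and both function names are illustrative.

```python
# Hypothetical mini term base (lower-cased entries).
TERM_BASE = {"trade debts", "trade", "debts", "payable",
             "after more than one year"}


def subterms(term, term_base):
    """Segment a term into known subterms, preferring the longest match.

    Unlike n-grams, a word sequence only counts if it is itself a term
    in the term base.
    """
    words = term.lower().split()
    found = []
    i = 0
    while i < len(words):
        for j in range(len(words), i, -1):  # try the longest span first
            candidate = " ".join(words[i:j])
            if candidate in term_base:
                found.append(candidate)
                i = j
                break
        else:
            i += 1  # no known term starts at this word; skip it
    return found


def subterm_overlap(term_a, term_b, term_base):
    """Dice coefficient over the two subterm sets (assumed overlap measure)."""
    a = set(subterms(term_a, term_base))
    b = set(subterms(term_b, term_base))
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


trade = "Trade Debts Payable After More Than One Year"
financial = "Financial Debts Payable After More Than One Year"
overlap = subterm_overlap(trade, financial, TERM_BASE)
```

Here the two labels share the subterms "payable" and "after more than one year", while "trade debts" and "debts" differ because the longest match wins.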
  15. 15. Multilingual Subterms<br />Translated subterms<br />Available in other languages<br />Advantage<br />Reflect terminological similarities that may be available in one language but not in others.<br />“Property Plant and Equipment”@en<br />“Sachanlagen”@de<br />“Tangible Fixed Asset”@en<br />
  16. 16. Linguistic<br />Syntactic Information<br />Beyond simple word order<br />Phrase structure<br />Dependency structure<br />Phrase structure<br />Intangible fixed : adj adj > ??<br />Intangible fixed assets : adj adj n > NP<br />Dependency structure<br />Amounts receivable : N Adv : receive:mod, amounts:head<br />Received amounts : V N : receive:mod, amounts:head<br />
  17. 17. Evaluation<br />Data Set<br />xEBR finance vocabulary<br />269 terms (concept labels)<br />72,361 (269 × 269) term pairs<br />Benchmarks<br />SimSem59: sample of 59 term pairs<br />SimSem200: sample of 200 term pairs (under construction)<br />
  18. 18. Experiment<br />An overview of similarity measures<br />
  19. 19. Experiment Results (SimSem59)<br />STL formula used<br />STL = 0.1531 * S + 0.5218 * T + 0.1041 * L + 0.1791<br />Correlation between similarity scores & SimSem59<br />Semantic contribution<br />Terminology contribution<br />Linguistic contribution<br />
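The fitted formula on this slide can be applied directly. The sketch below uses the reported coefficients and adds a Pearson correlation helper for comparing scores against a benchmark. The slide only says "correlation", so the choice of Pearson is an assumption, and the function names are illustrative.

```python
import math

# Coefficients as reported on the slide for SimSem59.
W_S, W_T, W_L, CONSTANT = 0.1531, 0.5218, 0.1041, 0.1791


def stl_score(s, t, l):
    """STL = 0.1531*S + 0.5218*T + 0.1041*L + 0.1791."""
    return W_S * s + W_T * t + W_L * l + CONSTANT


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Because T carries the largest weight, a change in the terminological score moves the STL value roughly five times as much as the same change in the linguistic score.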
  20. 20. Conclusion<br />STL outperforms more traditional similarity measures<br />Largest contribution by T (Terminological Analysis)<br />Multilingual subterms perform better than monolingual ones<br />
  21. 21. Future work<br />Evaluation on larger data set and vocabularies (IFRS)<br />3000+ terms<br />9M term pairs<br />Richer set of linguistic operations<br />“recognise” => “recognition” by derivation rule verb_lemma+"ion"<br />Similarity between subterms<br />“Staff Costs” and “Wages And Salaries”<br />