    STL: A similarity measure based on semantic and linguistic information (Presentation Transcript)

    • STL: A Similarity Measure Based on Semantic, Terminological and Linguistic Information
      Nitish Aggarwal
      joint work with Tobias Wunner, Mihael Arcan
      DERI, NUI Galway
      firstname.lastname@deri.org
      Friday, 19th Aug, 2011
      DERI, Friday Meeting
    • Overview
      Motivation & Applications
      Why STL?
      Semantic
      Terminology
      Linguistic
      Evaluation
      Conclusion and future work
      2
    • Motivation & Applications
      Semantic Annotation
      Similarity between corpus data and ontology concepts
      SAP AG held €1615 million in short-term liquid assets (2009)
      “dbpedia:SAP_AG” “xEBR:LiquidAssets” at “dbpedia:year:2009”
      3
    • Semantic Search
      Similarity between Query and index object
      Motivation & Applications
      SAP liquid asset in 2010
      Current asset of SAP last year
      “dbpedia:SAP_AG” “xEBR:liquid asset” at “dbpedia:year:2010”
      Net cash of SAP in 2010
      SAP total amount received in 2010
      4
    • Motivation & Applications
      Ontology Matching & Alignment
      Similarity between ontology concepts
      [Diagram: two concept hierarchies side by side. IFRS: ifrs:StatementOfFinancialPosition with ifrs:Assets, ifrs:BiologicalAssets, ifrs:CurrentAssets, ifrs:NonCurrentAssets, ifrs:PropertyPlantAndEquipment, ifrs:CashAndCashEquivalents, ifrs:TradeAndOtherCurrentReceivables, ifrs:Inventories. xEBR: xebr:KeyBalanceSheet with xebr:SubscribedCapitalUnpaid, xebr:FixedAssets, xebr:CurrentAssets, xebr:TangibleFixedAssets, xebr:IntangibleFixedAssets, xebr:AmountReceivable, xebr:LiquidAssets. Candidate alignments between the hierarchies are labelled "Similarity = ?"]
      5
    • Classical Approaches
      String Similarity
      Levenshtein distance, Dice coefficient
      Corpus-based
      LSA, ESA, Google distance, Vector-Space Model
      Ontology-based
      Path distance, Information content
      Syntax Similarity
      Word-order, Part of Speech
      6
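
A minimal sketch (not from the slides) of two of the string measures listed above; function names are illustrative, and the Dice coefficient here operates on character bigrams, which is one common variant.

```python
# Hedged sketch of two classical string-similarity measures.
def levenshtein(a, b):
    """Edit distance between strings a and b (insert/delete/substitute, cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def dice_coefficient(a, b):
    """Dice overlap of character bigrams (one common variant of the measure)."""
    bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    x, y = bigrams(a.lower()), bigrams(b.lower())
    return 2 * len(x & y) / (len(x) + len(y)) if (x or y) else 1.0

# e.g. levenshtein("liquid assets", "liquid asset") -> 1
```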
    • Why STL?
      Semantic
      Semantic structure and relations
      Terminology
      Complex terms expressing the same concept
      Linguistic
      Phrase and dependency structure
      7
    • STL
      Definition
      Linear combination of semantic, terminological and linguistic similarity scores,
      with weights obtained by linear regression (see the sketch below)
      Formula used
      STL = w1*S + w2*T + w3*L + Constant
      w1, w2, w3 represent the contribution of each component
      8
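
A minimal sketch of the linear combination above, assuming per-pair S, T and L scores and gold-standard judgements are already available; numpy's least-squares solver stands in for whatever regression tool was actually used.

```python
import numpy as np

def fit_stl_weights(S, T, L, gold):
    """Fit (w1, w2, w3, constant) by ordinary least squares."""
    X = np.column_stack([S, T, L, np.ones(len(S))])  # intercept column
    coeffs, *_ = np.linalg.lstsq(X, np.asarray(gold, dtype=float), rcond=None)
    return coeffs  # array([w1, w2, w3, constant])

def stl_score(s, t, l, w):
    """STL = w1*S + w2*T + w3*L + Constant."""
    return w[0] * s + w[1] * t + w[2] * l + w[3]
```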
    • Semantic
      Wu-Palmer
      2*depth(MSCA) / (depth(c1) + depth(c2))
      Resnik's Information Content
      IC(c) = -log p(c)
      Intrinsic Information Content (Pirro09)
      Avoids the need to analyse large corpora
      9
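
A minimal sketch of the Wu-Palmer formula over a hierarchy given as a hypothetical child-to-parent dict (single inheritance with a common root assumed); this is not the implementation behind the reported numbers.

```python
def depth(concept, parent_of):
    """Number of nodes on the path from `concept` up to the root (root has depth 1)."""
    d = 1
    while concept in parent_of:
        concept = parent_of[concept]
        d += 1
    return d

def ancestors(concept, parent_of):
    """The concept itself plus all of its ancestors, most specific first."""
    path = [concept]
    while concept in parent_of:
        concept = parent_of[concept]
        path.append(concept)
    return path

def wu_palmer(c1, c2, parent_of):
    """2*depth(MSCA) / (depth(c1) + depth(c2))."""
    anc1 = set(ancestors(c1, parent_of))
    msca = next(a for a in ancestors(c2, parent_of) if a in anc1)  # most specific common ancestor
    return 2 * depth(msca, parent_of) / (depth(c1, parent_of) + depth(c2, parent_of))

# Hypothetical fragment:
# parent_of = {"TangibleFixedAssets": "FixedAssets", "AmountReceivable": "FixedAssets",
#              "FixedAssets": "Assets"}
# wu_palmer("TangibleFixedAssets", "AmountReceivable", parent_of) -> 2*2 / (3+3) = 0.67
```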
    • Cont.
      Intrinsic information content (iIC)
      iIC(c) = 1 - log(sub(c) + 1) / log(N)
      where sub(c) is the number of sub-concepts of a given concept c and N is the total number of concepts in the ontology.
      Pirro_Similarity
      10
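
A minimal sketch of the intrinsic IC formula above and of an IC-based similarity in the Pirro & Seco style (3*iIC(MSCA) - iIC(c1) - iIC(c2)); the exact variant behind the numbers on the next slide is not spelled out, so treat this as an approximation.

```python
import math

def intrinsic_ic(sub_count, n_concepts):
    """iIC(c) = 1 - log(sub(c) + 1) / log(N), with N the total number of concepts."""
    return 1.0 - math.log(sub_count + 1) / math.log(n_concepts)

def pirro_style_sim(iic_c1, iic_c2, iic_msca):
    """Pirro & Seco style similarity: 3*iIC(MSCA) - iIC(c1) - iIC(c2)."""
    return 3 * iic_msca - iic_c1 - iic_c2

# Illustrative only: with the 269-concept xEBR vocabulary,
# intrinsic_ic(9, 269) is roughly 0.6 (cf. Tangible Fixed Assets on the next slide).
```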
    • Cont.
      [Diagram: fragment of the xEBR hierarchy under Assets (Subscribed Capital Unpaid, Fixed Assets, Current Assets, Stocks, Tangible Fixed Assets, Amount Receivable, and sub-concepts such as Property, Plant and Equipment, Land and Building, Plant and Machinery, Furniture Fixture and Equipment, Trade Debtors, Other Debtors, Amount Receivable [total], Amount Receivable within one year, Amount Receivable after more than one year). The MSCA has 48 sub-concepts, IC = 0.32; Tangible Fixed Assets has 9 sub-concepts, IC (TFA) = 0.60; Amount Receivable has 6 sub-concepts, IC (AR) = 0.69; one concept pair is annotated Pirro_Sim = 0.33, the other Pirro_Sim = ?]
      11
    • Limitation
      Does semantic structure reflect a good similarity?
      not necessarily
      e.g. in xEBR, the parent-child relation is also used to describe the layout of concepts:
      “Work in progress” is not a type of asset, although both are linked via the parent-child relationship
      12
    • Terminology
      Definition
      Common naming convention
      N-grams vs. subterms
      In the financial domain, the bigram “Intangible Fixed” is a substring of “Other Intangible Fixed Assets” but not a subterm.
      Terminological similarity
      Maximal subterm overlap (see the sketch below)
      13
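
A minimal sketch of a subterm-overlap score, with a hypothetical SUBTERM_LEXICON standing in for the financial dictionaries shown on the next slide; the maximal subterm overlap from the slides is approximated here by a Dice-style overlap of recognised subterm sets.

```python
# Hedged sketch: terminological similarity as overlap of recognised subterms.
SUBTERM_LEXICON = {
    "trade debts", "trade", "debts", "payable", "after more than one year", "financial",
}

def subterms(label):
    """All lexicon entries occurring in the (lower-cased) label."""
    text = label.lower()
    return {s for s in SUBTERM_LEXICON if s in text}

def term_similarity(label1, label2):
    """Dice-style overlap of the two subterm sets (approximates maximal subterm overlap)."""
    s1, s2 = subterms(label1), subterms(label2)
    return 2 * len(s1 & s2) / (len(s1) + len(s2)) if (s1 or s2) else 0.0

# term_similarity("Trade Debts Payable After More Than One Year",
#                 "Financial Debts Payable After More Than One Year")
```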
    • Cont.
      Trade Debts Payable After More Than One Year
      [[Trade][Debts]][Payable][After More Than One Year]
      [SAP:Payable]
      [Ifrs:After More Than One Year]
      [Investoword:Debt]
      [FinanceDict:Trade Debts]
      [Investopedia:Trade]
      Financial[Debts][Payable][After More Than One Year]
      Financial Debts Payable After More Than One Year
      14
    • Multilingual Subterms
      Translated subterms
      Available in other languages
      Advantage
      Reflect terminological similarities that may be available in one language but not in others.
      ”Property Plant and Equipment”@en
      ”Sachanlagen”@de
      ”Tangible Fixed Asset” @en
      15
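
A minimal sketch of the multilingual idea, using a hypothetical translation table: two English labels with no shared English subterm can still be linked through a shared German subterm such as "Sachanlagen".

```python
# Hedged sketch with hypothetical data.
translations = {
    "property plant and equipment": {"de": {"sachanlagen"}},
    "tangible fixed asset": {"de": {"sachanlagen"}},
}

def share_multilingual_subterm(term1, term2, lang="de"):
    """True if the two terms share at least one translated subterm in `lang`."""
    t1 = translations.get(term1, {}).get(lang, set())
    t2 = translations.get(term2, {}).get(lang, set())
    return bool(t1 & t2)

# share_multilingual_subterm("property plant and equipment", "tangible fixed asset") -> True
```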
    • Linguistic
      Syntactic Information
      Beyond simple word order
      phrase structure
      Dependency structure
      Phrase structure
      Intangible fixed : adj adj > ??
      Intangible fixed assets : adj adj n > NP
      Dependency structure
      Amounts receivable : N Adv : receive:mod, amounts:head
      Received amounts : V N : receive:mod, amounts:head
      16
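
A minimal sketch of a word-order-insensitive linguistic comparison: the hand-written (head, modifier) analyses below are illustrative, not parser output, and the Jaccard overlap is only one possible way to score them.

```python
def dep_similarity(deps1, deps2):
    """Jaccard overlap of (head_lemma, modifier_lemma) dependency pairs."""
    s1, s2 = set(deps1), set(deps2)
    return len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 0.0

# "Amounts receivable" and "received amounts" normalise to the same pair,
# so their linguistic similarity is 1.0 despite different word order and POS.
print(dep_similarity([("amounts", "receive")], [("amounts", "receive")]))
```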
    • Evaluation
      Data Set
      xEBR finance vocabulary
      269 terms (concept labels)
      72,361 (269*269) term pairs
      Benchmarks
      SimSem59: sample of 59 term pairs
      SimSem200 : sample of 200 term pairs (under construction)
      17
    • Experiment
      An overview of similarity measures
      18
    • Experiment Results (Simsem59)
      STL formula used
      STL = 0.1531 * S + 0.5218 * T + 0.1041 * L + 0.1791
      Correlation between similarity scores & SimSem59
      [Chart: correlation results broken down by semantic, terminological and linguistic contribution]
      19
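
A minimal sketch of the evaluation step, assuming hypothetical parallel lists of system scores and SimSem59 gold judgements; Pearson correlation is computed with scipy.

```python
from scipy.stats import pearsonr

def evaluate(system_scores, gold_scores):
    """Pearson correlation between a measure's scores and the benchmark judgements."""
    r, _p_value = pearsonr(system_scores, gold_scores)
    return r

# e.g. evaluate(stl_scores, simsem59_scores) for each measure in the comparison
```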
    • Conclusion
      STL outperforms more traditional similarity measures
      Largest contribution by T (Terminological Analysis)
      Multilingual subterms perform better than monolingual ones
      20
    • Future work
      Evaluation on larger data set and vocabularies (IFRS)
      3000+ terms
      9M term pairs
      Richer set of linguistic operations
      e.g. “recognise” => “recognition”
      by the derivation rule verb_lemma + "ion" (see the sketch below)
      Similarity between subterms
      “Staff Costs” and "Wages And Salaries"
      21
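
A minimal sketch of the kind of derivation rule mentioned above; the suffix rewrite ("-ise" to "-ition") is an illustrative approximation of the verb_lemma + "ion" pattern, not the authors' planned rule set.

```python
# Hedged sketch: a toy derivational rule linking verb and noun forms,
# e.g. "recognise" -> "recognition". Real rules would need a morphological lexicon.
DERIVATION_RULES = [("ise", "ition")]  # illustrative suffix rewrite only

def derive_noun(verb_lemma):
    for suffix, replacement in DERIVATION_RULES:
        if verb_lemma.endswith(suffix):
            return verb_lemma[: -len(suffix)] + replacement
    return None

print(derive_noun("recognise"))  # recognition
```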