STL: A similarity measure based on semantic and linguistic information
Presentation Transcript

  • STL: A Similarity Measure Based on Semantic, Terminological and Linguistic Information
    Nitish Aggarwal
    joint work with Tobias Wunner and Mihael Arcan
    DERI, NUI Galway
    firstname.lastname@deri.org
    Friday, 19th August 2011
    DERI Friday Meeting
  • Overview
    Motivation & Applications
    Why STL?
    Semantic
    Terminology
    Linguistic
    Evaluation
    Conclusion and future work
    2
  • Motivation & Applications
    Semantic Annotation
    Similarity between corpus data and ontology concepts
    SAP AG held €1615 million in short-term liquid assets (2009)
    “dbpedia:SAP_AG” “xEBR:LiquidAssets” at “dbpedia:year:2009”
    3
  • Motivation & Applications
    Semantic Search
    Similarity between query and index object
    SAP liquid asset in 2010
    Current asset of SAP last year
    “dbpedia:SAP_AG” “xEBR:liquid asset” at “dbpedia:year:2010”
    Net cash of SAP in 2010
    SAP total amount received in 2010
    4
  • Motivation & Applications
    Ontology Matching & Alignment
    Similarity between ontology concepts
    [Figure: two concept hierarchies shown side by side, the IFRS hierarchy (ifrs:StatementOfFinancialPosition, ifrs:Assets, ifrs:BiologicalAssets, ifrs:CurrentAssets, ifrs:NonCurrentAssets, ifrs:PropertyPlantAndEquipment, ifrs:CashAndCashEquivalents, ifrs:TradeAndOtherCurrentReceivables, ifrs:Inventories) and the xEBR balance-sheet hierarchy (xebr:KeyBalanceSheet, xebr:SubscribedCapitalUnpaid, xebr:FixedAssets, xebr:CurrentAssets, xebr:TangibleFixedAssets, xebr:IntangibleFixedAssets, xebr:AmountReceivable, xebr:LiquidAssets), with "Similarity = ?" links drawn between candidate concept pairs]
    5
  • Classical Approaches
    String Similarity
    Levenshtein distance, Dice coefficient (a minimal Levenshtein sketch follows this slide)
    Corpus-based
    LSA, ESA, Google distance, Vector-Space Model
    Ontology-based
    Path distance, information content
    Syntax Similarity
    Word order, part of speech
    6
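For reference, a minimal pure-Python sketch of the Levenshtein edit distance named on this slide; the function name and example strings are ours:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insertions, deletions,
    substitutions) needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

print(levenshtein("Liquid Assets", "LiquidAssets"))  # 1 (one deleted space)
```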
  • Why STL?
    Semantic
    Semantic structure and relations
    Terminology
    complex terms expressing the same concept
    Linguistic
    Phrase and dependency structure
    7
  • STL
    Definition
    Linear combination of the semantic, terminological and linguistic similarities
    with weights obtained by linear regression
    Formula used
    STL = w1*S + w2*T + w3*L + constant
    where w1, w2, w3 represent the contribution of each component (a fitting sketch follows this slide)
    8
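A minimal sketch of how such weights could be fitted, assuming scikit-learn; the per-pair component scores and gold ratings below are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row holds the (semantic, terminological, linguistic) scores of one term pair.
X = np.array([[0.8, 0.9, 0.7],
              [0.2, 0.1, 0.3],
              [0.5, 0.6, 0.4],
              [0.9, 0.8, 0.9]])
# Gold-standard human similarity ratings for the same pairs (toy values).
y = np.array([0.85, 0.15, 0.50, 0.90])

reg = LinearRegression().fit(X, y)
w1, w2, w3 = reg.coef_
print(f"STL = {w1:.3f}*S + {w2:.3f}*T + {w3:.3f}*L + {reg.intercept_:.3f}")
```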
  • Semantic
    Wu-Palmer
    Sim(c1, c2) = 2*depth(MSCA) / (depth(c1) + depth(c2))
    Resnik's Information Content
    IC(c) = -log p(c)
    Intrinsic Information Content (Pirro09)
    Avoids having to analyse large corpora (a Wu-Palmer sketch follows this slide)
    9
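A sketch of Wu-Palmer over a toy taxonomy encoded as child-to-parent links; the helper names and the hierarchy fragment are ours:

```python
def ancestors(taxonomy: dict, c: str) -> list:
    """Path from concept c up to the root, inclusive."""
    path = [c]
    while c in taxonomy:
        c = taxonomy[c]
        path.append(c)
    return path

def depth(taxonomy: dict, c: str) -> int:
    return len(ancestors(taxonomy, c))

def wu_palmer(taxonomy: dict, c1: str, c2: str) -> float:
    a1, a2 = ancestors(taxonomy, c1), set(ancestors(taxonomy, c2))
    msca = next(a for a in a1 if a in a2)  # most specific common ancestor
    return 2 * depth(taxonomy, msca) / (depth(taxonomy, c1) + depth(taxonomy, c2))

# Toy fragment of an assets hierarchy (child -> parent).
tax = {"FixedAssets": "Assets",
       "CurrentAssets": "Assets",
       "TangibleFixedAssets": "FixedAssets",
       "AmountReceivable": "CurrentAssets"}
print(wu_palmer(tax, "TangibleFixedAssets", "AmountReceivable"))  # 2*1/(3+3) = 0.33
```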
  • Cont.
    Intrinsic information content (iIC)
    iIC(c) = 1 - log(sub(c) + 1) / log(N)
    where sub(c) is the number of sub-concepts of a given concept c and N is the total number of concepts
    Pirro_Similarity: sim(c1, c2) = 3*iIC(MSCA) - iIC(c1) - iIC(c2) (a computation sketch follows this slide)
    10
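A sketch of the intrinsic IC computation under the Seco-style formulation above; N, the total concept count, is our assumption (269, taken from the evaluation slide), so the values only approximate those in the worked example:

```python
import math

def iic(sub_count: int, total_concepts: int) -> float:
    """Intrinsic information content estimated from sub-concept counts."""
    return 1 - math.log(sub_count + 1) / math.log(total_concepts)

def pirro_sim(iic_msca: float, iic_c1: float, iic_c2: float) -> float:
    """Pirro's combination of the intrinsic IC of two concepts and their MSCA."""
    return 3 * iic_msca - iic_c1 - iic_c2

N = 269                # assumption: size of the xEBR vocabulary
msca = iic(48, N)      # the MSCA of the two compared concepts
tfa = iic(9, N)        # Tangible Fixed Assets
ar = iic(6, N)         # Amount Receivable
print(round(msca, 2), round(tfa, 2), round(ar, 2))  # approx. 0.30 0.59 0.65
print(round(pirro_sim(msca, tfa, ar), 2))           # exact value depends on N
```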
  • Cont.
    [Figure: worked example on an Assets hierarchy. Assets (the MSCA, 48 sub-concepts, IC = 0.32) branches into Subscribed Capital Unpaid, Fixed Assets and Current Assets. Tangible Fixed Assets (9 sub-concepts, IC = 0.60) sits under Fixed Assets; Amount Receivable (6 sub-concepts, IC = 0.69) sits under Current Assets, next to Stocks. Their descendants include Property, Plant and Equipment, Land and Building, Plant and Machinery, Furniture Fixture and Equipment, Trade Debtors and Other Debtors. One concept pair is annotated Pirro_Sim = 0.33, another Pirro_Sim = ?]
    11
  • Limitation
    Does the semantic structure reflect a good similarity?
    Not necessarily
    e.g. in xEBR the parent-child relation is used to describe the layout of concepts:
    "Work in progress" is not a type of asset, although the two are linked by a parent-child relationship
    12
  • Terminology
    Definition
    Common naming convention
    N-grams vs. subterms
    In the financial domain, the bigram "Intangible Fixed" is a substring of "Other Intangible Fixed Assets" but not a subterm.
    Terminological similarity
    Maximal subterm overlap (a scoring sketch follows this slide)
    13
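One way to realise a subterm-overlap score, as a minimal sketch; it assumes the subterm decompositions are already available, and the sets below are toy stand-ins for lexicon lookups:

```python
def subterm_overlap(subs1: set, subs2: set) -> float:
    """Dice-style overlap between two subterm sets."""
    if not subs1 or not subs2:
        return 0.0
    return 2 * len(subs1 & subs2) / (len(subs1) + len(subs2))

t1 = {"Trade Debts", "Payable", "After More Than One Year"}
t2 = {"Financial Debts", "Payable", "After More Than One Year"}
print(round(subterm_overlap(t1, t2), 2))  # 0.67: two of three subterms shared
```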
  • Cont.
    Trade Debts Payable After More Than One Year
    [[Trade][Debts]] [Payable] [After More Than One Year]
    Subterm sources: [FinanceDict:Trade Debts], [Investopedia:Trade], [Investoword:Debt], [SAP:Payable], [Ifrs:After More Than One Year]
    Financial Debts Payable After More Than One Year
    Financial [Debts] [Payable] [After More Than One Year]
    14
  • Multilingual Subterms
    Translated subterms
    Available in other languages
    Advantage
    Reflect terminological similarities that are visible in one language but not in others (a matching sketch follows this slide):
    "Property Plant and Equipment"@en
    "Sachanlagen"@de
    "Tangible Fixed Asset"@en
    15
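A sketch of how translated subterms can connect labels that share no English subterm; the translation table is a toy stand-in for real multilingual term resources:

```python
# Toy table: English subterm -> German translation (stand-in for a real lexicon).
translations = {
    "Property Plant and Equipment": "Sachanlagen",
    "Tangible Fixed Asset": "Sachanlagen",
}

def multilingual_match(sub1: str, sub2: str) -> bool:
    """Two subterms match if they are equal or share a translation."""
    if sub1 == sub2:
        return True
    t1, t2 = translations.get(sub1), translations.get(sub2)
    return t1 is not None and t1 == t2

# "Sachanlagen"@de reveals a similarity that is invisible in English alone:
print(multilingual_match("Property Plant and Equipment", "Tangible Fixed Asset"))  # True
```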
  • Linguistic
    Syntactic Information
    Beyond simple word order:
    Phrase structure
    Dependency structure
    Phrase structure
    Intangible fixed : adj adj > ??
    Intangible fixed assets : adj adj n > NP
    Dependency structure (a parsing sketch follows this slide)
    Amounts receivable : N Adj : receive:mod, amounts:head
    Received amounts : V N : receive:mod, amounts:head
    16
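A sketch of extracting head/modifier structure with spaCy (our choice; the slides name no parser), illustrating that "amounts receivable" and "received amounts" keep the same head despite different word order:

```python
import spacy  # assumes: pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def head_and_modifiers(phrase: str):
    """Return the lemma of the syntactic head and the lemmas of its modifiers."""
    doc = nlp(phrase)
    root = next(tok for tok in doc if tok.dep_ == "ROOT")
    return root.lemma_, {tok.lemma_ for tok in doc if tok.i != root.i}

print(head_and_modifiers("amounts receivable"))  # head 'amount' (parses vary by model)
print(head_and_modifiers("received amounts"))    # head 'amount' again, different word order
```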
  • Evaluation
    Data Set
    xEBR finance vocabulary
    269 terms (concept labels)
    72,361 (269*269) term pairs
    Benchmarks
    SimSem59: sample of 59 term pairs
    SimSem200: sample of 200 term pairs (under construction)
    17
  • Experiment
    An overview of similarity measures
    18
  • Experiment Results (SimSem59)
    STL formula used
    STL = 0.1531 * S + 0.5218 * T + 0.1041 * L + 0.1791
    Correlation between similarity scores and SimSem59
    [Chart: correlation of each measure with SimSem59, broken down into the semantic, terminological and linguistic contributions; a correlation sketch follows this slide]
    19
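A sketch of the correlation computation, assuming SciPy; the scores below are invented stand-ins for real STL outputs and benchmark ratings:

```python
from scipy.stats import pearsonr

# Toy STL scores and gold ratings for five term pairs (illustrative only).
stl_scores = [0.91, 0.40, 0.75, 0.10, 0.60]
gold       = [0.95, 0.35, 0.80, 0.05, 0.55]

r, p = pearsonr(stl_scores, gold)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")
```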
  • Conclusion
    STL outperforms the classical similarity measures
    Largest contribution comes from T (terminological analysis)
    Multilingual subterms perform better than monolingual ones
    20
  • Future work
    Evaluation on larger data sets and vocabularies (IFRS)
    3000+ terms
    9M term pairs
    Richer set of linguistic operations
    "recognise" => "recognition"
    via the derivation rule verb_lemma + "ion" (a toy sketch follows this slide)
    Similarity between subterms
    "Staff Costs" and "Wages And Salaries"
    21
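A toy sketch of the "recognise" => "recognition" derivation; the slide's rule verb_lemma + "ion" glosses over the stem change, so this naive suffix rewrite stands in for a fuller morphological rule set:

```python
import re

def derive_noun(verb_lemma: str) -> str:
    """Naive derivation rule: rewrite a trailing '-se'/'-ze' to '-tion'."""
    return re.sub(r"[sz]e$", "tion", verb_lemma)

print(derive_noun("recognise"))  # recognition
```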