Text-similarity algorithm for the semantic annotation of Web Services
Published in: Technology

  • LCS = Least Common Subsumer (the most specific concept that subsumes both of the compared concepts)
  • Transcript

    • 1. Text-similarity algorithm for the semantic annotation of Web Services. SWAP research group, 27 July 2010. Michele Filannino, @bronko85
    • 2. Outline
      • The problem
      • Reference scenario
      • Similarity
      • SAWA
      • Word-to-word similarity
      • Text-to-text similarity
      • Experimental results
      • Quality of the results
      • Execution time
      • Future developments
      • Demo session
    • 3. The problem: how can the similarity between two texts be measured?
    • 4. Reference scenario [Diagram labels: WSDL file; natural language descriptions; CODEArchitects Annotation Tool; suggested annotations to approve/reject; SAWSDL file]
    • 5. Semantic similarity
      • Assign a meaning-based similarity measure to a set of terms and/or documents.
      • Similarity ≠ relatedness: "bank" and "money" are related although they are not similar at all.
      • Similarity ⇒ relatedness: "girl" and "maiden" are similar and therefore also related.
    • 6. Semantic similarity in SWOP
      • Concepts of the WS: RequestOrder, Order, BillingInformation, ...
      • Ontology concepts: Order, OrderNumber, OrderID, BillID, BillReference, BusinessFirm, Product, Catalog, ...
    • 7. Computational cost
      • Example: an ontology with 1,200 concepts and a WSDL with 15 annotations.
      • 1,200 × 15 = 18,000 runs of SAWA :(
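The figure on this slide follows from comparing every WSDL annotation with every ontology concept. A minimal sketch of that count, assuming exactly one SAWA run per (annotation, concept) pair; the function name `sawa_runs` is hypothetical and not part of the presented tool:

```python
def sawa_runs(num_concepts: int, num_annotations: int) -> int:
    """Number of text-to-text comparisons for an exhaustive pairwise match.

    Assumption: every annotation is compared once with every concept.
    """
    return num_concepts * num_annotations


# Values taken from the slide: 1,200 concepts and 15 annotations.
print(sawa_runs(1200, 15))  # 18000
```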
    • 8. SAWA: Similarity Algorithm Wikipedia-bAsed
    • 9. Word-to-word similarity
      • Given two words, determine how similar they are.
      • Types of algorithms for computing the similarity between words:
        • Corpus-based: pointwise mutual information, latent semantic analysis
        • Hierarchy-based: Leacock & Chodorow, Lesk, Wu & Palmer, Resnik, Lin, Jiang & Conrath
      • Input: two words. Output: a score between 0 and 1.
    • 10. Lin's algorithm (1998)
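The formula on this slide did not survive in the transcript. In its commonly cited hierarchy-based form, Lin's measure is sim(c1, c2) = 2·IC(LCS(c1, c2)) / (IC(c1) + IC(c2)), where IC(c) = -log P(c) is the information content of a concept and LCS is the least common subsumer. The sketch below illustrates that form on a toy taxonomy; the taxonomy, the probabilities, and all function names are assumptions made for illustration and are not taken from the slides or from DISCO.

```python
import math

# Toy taxonomy (child -> parent) and concept probabilities.
# Every value here is an illustrative assumption.
PARENT = {"tiger": "feline", "lion": "feline", "feline": "animal",
          "dog": "animal", "animal": None}
PROB = {"animal": 1.0, "feline": 0.2, "dog": 0.3, "tiger": 0.05, "lion": 0.05}


def ancestors(concept):
    """Return the concept followed by its ancestors, most specific first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def least_common_subsumer(c1, c2):
    """Most specific concept that subsumes both c1 and c2."""
    seen = set(ancestors(c1))
    for candidate in ancestors(c2):
        if candidate in seen:
            return candidate
    raise ValueError("no common subsumer")


def information_content(concept):
    """IC(c) = -log P(c): rarer concepts carry more information."""
    return -math.log(PROB[concept])


def lin_similarity(c1, c2):
    """Lin's measure: 2 * IC(LCS) / (IC(c1) + IC(c2)), a value in [0, 1]."""
    lcs = least_common_subsumer(c1, c2)
    denominator = information_content(c1) + information_content(c2)
    if denominator == 0:
        return 1.0  # both concepts are the root of the taxonomy
    return 2 * information_content(lcs) / denominator


print(round(lin_similarity("tiger", "lion"), 3))
print(lin_similarity("tiger", "dog") == 0)
```

In this toy data, two concepts that share only the root get a score of 0 because the root has zero information content, while a concept compared with itself scores 1.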
    • 11. Word-to-word similarity tool
      • Library used: LinguaTools DISCO.
      • Uses Wikipedia as the concept hierarchy: 202,578 concepts, updated to 1 January 2008.
      • Uses Lin's algorithm to compute similarity.
    • 12. Examples
      • Tiger, lion = 90%
      • Doctor, nurse = 70%
      • Stock, market = 47%
      • Love, sex = 46%
      • FBI, investigation = 35%
      • Professor, cucumber = 0.006%
    • 13. Quality of the algorithm
      • Corpus used to measure quality: WordSim353.
      • Pearson correlation coefficients: Wikipedia: 0.574; BNC: 0.415; PubMed: 0.105.
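The coefficients on this slide compare the algorithm's scores with human similarity judgments. A minimal sketch of how a Pearson correlation coefficient is computed, assuming two equally long lists of scores; the sample numbers are made up for illustration and are not WordSim353 data:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up ratings for four word pairs, NOT taken from WordSim353.
human_scores = [9.0, 7.5, 4.0, 0.5]
system_scores = [0.90, 0.70, 0.47, 0.01]
print(round(pearson(human_scores, system_scores), 3))
```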
    • 14. Text-to-text similarity
      • Given two texts, determine how similar they are.
      • A suitable extension of the word-to-word similarity algorithms.
      • Removal of stopwords, i.e. words with low discriminating power and high frequency of occurrence.
      • Input: two texts. Output: a score between 0 and 1.
    • 15. Stopwords
      • Before: "Returns the first and last name of each customer who is categorized as an individual consumer"
      • After stopword removal: "name customer categorized individual consumer"
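The filtering shown on this slide can be sketched as follows. The slides do not give the stopword list that was used, so the `STOPWORDS` set below is a hypothetical list chosen only so that the slide's example is reproduced; whitespace tokenization and lowercasing are likewise assumptions.

```python
# Hypothetical stopword list: the real list used by the tool is not given
# in the slides. It is chosen here only to reproduce the slide's example.
STOPWORDS = {"returns", "the", "first", "and", "last", "of",
             "each", "who", "is", "as", "an"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, split on whitespace and drop stopwords."""
    return " ".join(word for word in text.lower().split()
                    if word not in stopwords)


description = ("Returns the first and last name of each customer "
               "who is categorized as an individual consumer")
print(remove_stopwords(description))
# name customer categorized individual consumer
```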
    • 16. Corley & Mihalcea's algorithm (2005)
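The formula on this slide is missing from the transcript. As a rough sketch of the general idea behind Corley & Mihalcea's measure: each word of one text is matched with its most similar word in the other text, the matches are weighted by word specificity (idf), and the two directions are averaged. The version below is a simplification under stated assumptions: it ignores the part-of-speech partitioning of the original method, and `toy_word_sim`, `uniform_idf` and `TOY_SIM` are hypothetical stand-ins rather than the functions used by SAWA.

```python
def directional_similarity(source, target, word_sim, idf):
    """idf-weighted average of each source word's best match in target."""
    weighted_sum = 0.0
    weight_total = 0.0
    for word in source:
        best = max((word_sim(word, other) for other in target), default=0.0)
        weight = idf(word)
        weighted_sum += best * weight
        weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.0


def text_similarity(text1, text2, word_sim, idf):
    """Average of the two directional similarities, a score in [0, 1]."""
    return 0.5 * (directional_similarity(text1, text2, word_sim, idf)
                  + directional_similarity(text2, text1, word_sim, idf))


# Hypothetical word-to-word scores; a real system would query a
# word-similarity library instead.
TOY_SIM = {frozenset(("customer", "consumer")): 0.8}


def toy_word_sim(a, b):
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


def uniform_idf(word):
    return 1.0  # assumption: every word is equally specific


score = text_similarity(["name", "customer"], ["customer", "consumer"],
                        toy_word_sim, uniform_idf)
print(round(score, 2))
```

Passing `word_sim` and `idf` as parameters keeps the sketch independent of any particular similarity library or corpus.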
    • 17. Optimizations (v1.2)
      • Caching of the frequency of each term
      • Caching of the similarities between terms
      • Incremental learning
      • Fewer accesses to DISCO
      • Execution time reduced by up to about 10 times
    • 18. Experimental results: quality and execution time
    • 19. CHOSEN WSDL DOCUMENT DESCRIPTION: "returns the first and last name of each customer who is categorized as an individual consumer"
      RANKING OF SIMILAR ONTOLOGY CONCEPTS (with score):
      *------------------------------------------------------------*
      | Description | Score |
      *------------------------------------------------------------*
      | name: name of customer | 62.85% |
      | customer: Current customer individual information | 56.91% |
      | customeraddress: Customer address | 42.36% |
      | customercredicard: Customer credit card information | 35.08% |
      | salesreason: Reasons why a customer may purchase a particular product. | 30.35% |
      | customerstore: Stores of our Company (customer and resellers). | 17.31% |
      | salesorderdetail: Product details associated with a specific sales order. | 2.99% |
      | productinventory: Product inventory information. | 2.59% |
      | salesrepresentativeperson: Contains current sales information for the sales representative persons. | 2.39% |
      | productlocation: Product manufacturing locations | 2.36% |
      | salestaxrate: Sales Tax rate. | 2.36% |
      | salesterritory: Sales territory. | 2.22% |
      | employeeaddress: Employee information such as salary, department, and title. | 2.18% |
      | product: Products sold or used in the manfacturing of sold products. | 2.12% |
      | enterpricedepartment: Departments of Enterprise | 2.00% |
      | salesspecialoffer: Sales Special Offer (discounts). | 1.99% |
      | productlistpricehistory: Changes in the list price of a product over time. | 1.80% |
      | shipmethod: Shipping methods. | 1.79% |
      | salesorder: General sales order information (header). | 1.76% |
      | productdocument: Product Document | 1.73% |
      | productcosthistory: Changes in the cost of a product over time. | 1.68% |
      | productbillofmaterials: Bill Of Materials are items required to make products and product subassembl | 1.61% |
      | productmodel: Product model classification. | 1.48% |
      | currencyrate: Currency exchange rates. | 1.40% |
      | salesshoppingcartitem: Contains shopping cart items until the order is submitted or cancelled. | 1.29% |
      | productcategory: High-level product categorization. | 1.27% |
      | addresstype: Types of addresses | 0.95% |
      | unitmeasure: Unit of measure. | 0.80% |
      | currency: Standard ISO currencies. | 0.51% |
      | countryregion: ISO standard codes for countries and regions. | 0.51% |
      | stateprovince: States and provinces | 0.12% |
      *------------------------------------------------------------*
      Time elapsed: 9.4 seconds.
    • 20. CHOSEN WSDL DOCUMENT DESCRIPTION: "lists the names and addresses of all individual customers"
      RANKING OF SIMILAR ONTOLOGY CONCEPTS (with score):
      *------------------------------------------------------------*
      | Description | Score |
      *------------------------------------------------------------*
      | addresstype: Types of addresses | 51.77% |
      | customer: Current customer individual information | 24.03% |
      | customeraddress: Customer address | 10.83% |
      | name: name of customer | 6.32% |
      | productlistpricehistory: Changes in the list price of a product over time. | 4.91% |
      | customercredicard: Customer credit card information | 4.47% |
      | salesreason: Reasons why a customer may purchase a particular product. | 4.20% |
      | customerstore: Stores of our Company (customer and resellers). | 3.21% |
      | salesorder: General sales order information (header). | 2.72% |
      | salesspecialoffer: Sales Special Offer (discounts). | 2.53% |
      | salesorderdetail: Product details associated with a specific sales order. | 2.49% |
      | salesterritory: Sales territory. | 2.14% |
      | salesrepresentativeperson: Contains current sales information for the sales representative persons. | 2.08% |
      | employeeaddress: Employee information such as salary, department, and title. | 1.81% |
      | salestaxrate: Sales Tax rate. | 1.79% |
      | productlocation: Product manufacturing locations | 1.78% |
      | countryregion: ISO standard codes for countries and regions. | 1.64% |
      | product: Products sold or used in the manfacturing of sold products. | 1.62% |
      | productinventory: Product inventory information. | 1.60% |
      | currencyrate: Currency exchange rates. | 1.46% |
      | enterpricedepartment: Departments of Enterprise | 1.45% |
      | productmodel: Product model classification. | 1.38% |
      | shipmethod: Shipping methods. | 1.37% |
      | salesshoppingcartitem: Contains shopping cart items until the order is submitted or cancelled. | 1.36% |
      | productbillofmaterials: Bill Of Materials are items required to make products and product subassembl | 1.32% |
      | productdocument: Product Document | 1.27% |
      | productcosthistory: Changes in the cost of a product over time. | 1.26% |
      | productcategory: High-level product categorization. | 1.01% |
      | currency: Standard ISO currencies. | 0.85% |
      | stateprovince: States and provinces | 0.73% |
      | unitmeasure: Unit of measure. | 0.71% |
      *------------------------------------------------------------*
      Time elapsed: 4.177 seconds.
    • 21. CHOSEN WSDL DOCUMENT DESCRIPTION: "returns the name of each customer that is categorized as a store"
      RANKING OF SIMILAR ONTOLOGY CONCEPTS (with score):
      *------------------------------------------------------------*
      | Description | Score |
      *------------------------------------------------------------*
      | name: name of customer | 64.29% |
      | customeraddress: Customer address | 43.83% |
      | customer: Current customer individual information | 40.05% |
      | customercredicard: Customer credit card information | 36.52% |
      | salesreason: Reasons why a customer may purchase a particular product. | 31.74% |
      | customerstore: Stores of our Company (customer and resellers). | 21.07% |
      | employeeaddress: Employee information such as salary, department, and title. | 2.75% |
      | salesorderdetail: Product details associated with a specific sales order. | 2.67% |
      | productinventory: Product inventory information. | 2.52% |
      | salestaxrate: Sales Tax rate. | 2.22% |
      | salesterritory: Sales territory. | 2.19% |
      | salesrepresentativeperson: Contains current sales information for the sales representative persons. | 2.09% |
      | productlocation: Product manufacturing locations | 1.91% |
      | enterpricedepartment: Departments of Enterprise | 1.87% |
      | salesorder: General sales order information (header). | 1.84% |
      | product: Products sold or used in the manfacturing of sold products. | 1.79% |
      | salesspecialoffer: Sales Special Offer (discounts). | 1.72% |
      | productlistpricehistory: Changes in the list price of a product over time. | 1.68% |
      | productdocument: Product Document | 1.63% |
      | salesshoppingcartitem: Contains shopping cart items until the order is submitted or cancelled. | 1.61% |
      | shipmethod: Shipping methods. | 1.52% |
      | productbillofmaterials: Bill Of Materials are items required to make products and product subassembl | 1.47% |
      | productcosthistory: Changes in the cost of a product over time. | 1.43% |
      | productmodel: Product model classification. | 1.42% |
      | currencyrate: Currency exchange rates. | 1.30% |
      | productcategory: High-level product categorization. | 1.15% |
      | addresstype: Types of addresses | 1.02% |
      | unitmeasure: Unit of measure. | 0.93% |
      | countryregion: ISO standard codes for countries and regions. | 0.45% |
      | currency: Standard ISO currencies. | 0.44% |
      | stateprovince: States and provinces | 0.12% |
      *------------------------------------------------------------*
      Time elapsed: 1.245 seconds.
    • 22. Execution time
      | Test | Optimized | Not optimized |
      | 3 | 1.0 s | 9.4 s |
      | 6 | 1.7 s | 9.8 s |
      | 5 | 2.7 s | 18.1 s |
      | 7 | 3.6 s | 21.8 s |
      | 2 | 3.9 s | 15.5 s |
      | 8 | 5.6 s | 23.1 s |
      | 1 | 6.2 s | 14.3 s |
      | 4 | 9.4 s | 39.4 s |
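The timings on this slide can be turned into speedup factors by dividing the non-optimized time by the optimized time. The sketch below assumes that the leading number on each row is a test identifier and that the first time on each row is the optimized one, following the order of the slide's legend; `TIMINGS` and `speedups` are illustrative names.

```python
# test id -> (optimized seconds, non-optimized seconds), as read from the slide.
# Assumption: the first value of each pair is the optimized run.
TIMINGS = {
    3: (1.0, 9.4),
    6: (1.7, 9.8),
    5: (2.7, 18.1),
    7: (3.6, 21.8),
    2: (3.9, 15.5),
    8: (5.6, 23.1),
    1: (6.2, 14.3),
    4: (9.4, 39.4),
}


def speedups(timings):
    """Speedup factor per test: non-optimized time / optimized time."""
    return {test: slow / fast for test, (fast, slow) in timings.items()}


for test, factor in sorted(speedups(TIMINGS).items()):
    print(f"test {test}: {factor:.1f}x faster")
```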
    • 23. Future developments: imminent and longer-term
    • 24. Future developments
      • Imminent:
        • Implementation of the Web Service interface
        • Implementation of the Web interface (free)
        • Implementation of the network interface
        • Scientific dissemination
      • Others:
        • Introduction of thresholds to improve performance
        • Release of the source code under an open-source license
    • 25. Demo session