Information Extraction and Linked Data Cloud

In the media industry there is a great emphasis on providing descriptive metadata to consumers as part of the media assets. Information extraction (IE) is considered an important tool in the metadata generation process, and its performance depends largely on the knowledge base it uses. Advances in “Linked Data Cloud” research provide a great opportunity to generate such knowledge bases, which benefit from the participation of a wider community. In this talk, I will discuss our experiences of using the Linked Data Cloud in conjunction with a GATE-based IE system.

Published in: Education

  • In terms of data quality, we have found the following limitations of the DBpedia knowledge base. DBpedia is less formally structured and is governed by a number of ontologies, so retrieving a particular class of entity requires joining several of them. For example, a comprehensive list of footballers can only be retrieved by combining the YAGO, DBpedia and SKOS ontologies. The data quality is lower than we expected, as there are considerable inconsistencies within DBpedia. For example, some object properties do not link to other entities and instead link to temporal templates. Another example is the incorrect classification of entities: some bands are classified as persons. In addition to these shortcomings, we have our own view of the world and define entities differently in the PA Images ontology. As suggested by the DBpedia authors [9], an approach that combines the advantages of both worlds is to interlink DBpedia with hand-crafted ontologies, which enables applications to use the formal knowledge from these ontologies together with the instance data from DBpedia.
  • The required accuracy is close to 100%. As mentioned earlier, DBpedia's coverage is richer when multiple ontologies are used, which requires mapping one ontology to many in a way that keeps the coverage benefit while countering redundancy. We know of no automatic ontology mapping approach that fulfils these criteria. We have successfully used SPARQL CONSTRUCT [17] queries to map between the PA Images and DBpedia ontologies, to extract entities from the DBpedia KB, and to generate a clean, contextualised PA KB.
  • Information Extraction and Linked Data Cloud

    1. Information Extraction & Linked Data Cloud
       Dr. Dhaval Thakker, KTP Research Associate, Press Association Images & Nottingham Trent University
       12/10/10 © Dhaval Thakker, Press Association, Nottingham Trent University
    2. Outline
       • Press Association & its operations
       • Introduction to the Semantic Technology Project at PA Images
         • IE and Knowledge base systems
         • Semantic Web browsing
       • Problem of generating Knowledge bases
         • Introduction to Linked Data Cloud (LDC)
         • How do we use LDC
       • Current and Future Work
       • Conclusions
    3. Press Association (pressassociation.com)
       • Press Association & its operations
         • UK’s leading multimedia news & information provider
         • Core News Agency operation
         • Editorial services: sports data, entertainment guides, weather forecasting, photo syndication
       (Section navigation shown on each slide: Background, Semantic Web project, Knowledge base, Conclusions)
    4. Free-text versus Semantic Approach
       • Free-text
         • Lack of structure
         • Have to rely on the annotator to provide all possible keywords
         • Repetitive annotation effort
         • Low accuracy
       • Semantic
         • Adds structure (concepts and relationships)
         • Provides inference (implicit reasoning) capacity
         • Accurate results
         • “Related”, “Similarity” based browsing
    5. … the Semantic Web
       • Web was “invented” by Tim Berners-Lee (amongst others), a physicist working at CERN
       “The next generation WWW is a Web in which machines can converse in a meaningful way, rather than a web limited to humans requesting HTML pages.” (Tim Berners-Lee)
       … need to add “Semantics”
       • Use ontologies (dictionaries of terms) to help computers understand the meaning (semantics) of domain concepts
    6. PA Images Workflow
       [Diagram] Stages: Agency/Photographers; Metadata Company; Captioners; Website.
       Labels: Provides minimum metadata in IPTC; Images with metadata passed to Captioners for batch processing; Modifies existing and adds new metadata; Information Extraction; Storage & Browsing; Semantic structure.
    7. Utilisation of Semantic Technologies for Intelligent Indexing and Retrieval of PA Images Photo Collection
       • Development of a comprehensive semantic-based taxonomy for the PA Images domains of News, Entertainment and Sports.
       • Design and implementation of a web-based and semantics-transparent annotation tool.
         • Design and development of software programmes to semi-automate the annotation of legacy data.
       • Development of semantically-enabled search technology, specifically tailored for the PA Photos image retrieval engine.
    8. Text Mining System Overview
       [Diagram] Labels: Images with captions; GATE-based IE System; Gazetteer (known entities); JAPE Grammar (context rules); Disambiguation/Summarisation; Entities of interest; Annotated image captions; PA KB; Linked Data Cloud; What to store; What to extract; Confirmation; Captions; Learned Facts; Schema; PA Images view; PA Images ontology.
    9. PA Images Ontology (OWL)
    10. Knowledge base (KB)
        • Ontology (schema)
          • Royalty (Royal Family)
          • name
          • relationship
            • Type 1: Spouse (From, To)
            • Type 2: Partner (From, To)
          • predecessor
          • successor
          • father
          • mother
          • Title
        • Data
          • Royalty (Henry VIII)
          • name (Tudor, Henry / Henry VIII of England)
          • relationship
            • Spouse (Anne Boleyn)
            • Spouse (Catherine Parr)
            • Spouse (Jane Seymour)
            • Spouse (Anne of Cleves)
            • Spouse (Catherine Howard)
            • Spouse (Catherine of Aragon)
          • Predecessor (Henry VII)
          • Successor (Edward VI)
          • Father (Henry VII of England)
          • Mother (Elizabeth of York)
          • Title (king of England and Ireland)
    11. Scale of Things for KB
        • Emphasis on: People, Places, Organisations, Events
        • About 50 types of sports
          • Their events
          • Types of people in these sports (referees, players, etc.)
          • Types of locations for these sports
          • Variety of teams for these sports
          • And relationships between all of them!
        • Similarly for Entertainment and News
    12. Outsourcing KB: Linked Data Cloud (LDC)
        • Where do we get all this knowledge from?
        • We don’t want it in free-text form but in a semantic structure
        • It has to be comprehensive and accurate
        • Free, open, extractable, evolving
        • Uniform Resource Identifiers (URIs) and the Resource Description Framework (RDF) language are at the heart of the LOD
    13. Linked Data
        • “The term Linked Data is used to describe a method of exposing, sharing, and connecting data via dereferenceable URIs on the Web”
        • “The Semantic Web isn't just about putting data on the web. It is about making links, so that a person or machine can explore the web of data. With linked data, when you have some of it, you can find other, related, data.”
    14. Linked Data cloud (31/03/2008)
    15. DBpedia
        • Epicentre of the Linked Data Cloud
        • Generated primarily from the Wikipedia info-boxes and improved with linkage to other sources in the cloud.
        • The DBpedia knowledge base currently describes more than 2.6 million things, including at least 213,000 persons, 328,000 places, 57,000 music albums, 36,000 films and 20,000 companies.
        • Many organisations and researchers are using it.
    16. Linking Open Data Community
        • Community effort to
          • Publish existing open license datasets as Linked Data on the Web
          • Interlink things between different data sources
          • Develop clients that consume Linked Data from the Web
    17. Organizations participating in the LOD community
        • Companies: Press Association (UK); New York Times (USA); Thomson Reuters (USA), OpenCalais; BBC (UK), Music Beta website, BBC Earth; MusicBrainz; Yahoo Microsearch; OpenLink (UK); Talis (UK); Zitgist (USA); Garlik (UK); Mondeca (FR); Renault (FR); Boab Interactive (AUS); … and others who are indirect consumers
        • Universities and Research Institutes: Massachusetts Institute of Technology (USA); University of Southampton (UK); DERI (IRE); KMi, Open University (UK); University of London (UK); Universität Hannover (DE); University of Pennsylvania (USA); Universität Leipzig (DE); Universität Karlsruhe (DE); Joanneum (AT); Freie Universität Berlin (DE); Cyc Foundation (USA); Southeast University (CN)
    18. (no text on slide)
    19. Interested in Linking up?
        1. Use URIs as names for things
        2. Use HTTP URIs so that people can look up those names
        3. When someone looks up a URI, provide useful RDF information
        4. Include RDF statements that link to other URIs so that they can discover related things
        Tim Berners-Lee, 2007, http://www.w3.org/DesignIssues/LinkedData.html
    20. Our approach for LDC utilisation
        • Why not DBpedia as it is?
          • A great deal of noisy data; if we stored it as it is, storage would be huge
          • DBpedia is less formally structured.
          • The data quality is too low for production use and there are some inconsistencies within DBpedia.
          • We have our own domains and our own view of them
        • Our approach, which combines the advantages of both worlds, is to interlink DBpedia with hand-crafted ontologies such as the PA Images ontology, which enables applications to use the formal knowledge from these ontologies together with the data from DBpedia.
    21. Ontology Mapping: map the ontology and the data will follow
        [Diagram] Labels: Linked Data Cloud; PA Images Ontology; DBpedia, YAGO, Geonames, ...... (each with a sameAs link); Knowledgebase/data for our ontology; Similar Entities & Their Features.
    22. SPARQL CONSTRUCT
            PREFIX dbpedia-ont: <http://dbpedia.org/ontology/>
            PREFIX db: <http://dbpedia.org/>
            PREFIX pa: <http://localhost/pa/images/media/entities.owl#>
            PREFIX owl: <http://www.w3.org/2002/07/owl#>
            PREFIX foaf: <http://xmlns.com/foaf/0.1/>
            # Output triples, expressed in the PA Images ontology
            CONSTRUCT
            {
              ?newLoc a pa:City .
              ?newLoc pa:locationName ?name .
              ?newLoc pa:latitutedegrees ?lat
            }
            # Pattern matched against DBpedia
            WHERE
            {
              ?newLoc a dbpedia-ont:City .
              ?newLoc foaf:name ?name .
              ?newLoc dbpedia-ont:latitutedegrees ?lat
            }
    23. Has City -> City Of Country
            PREFIX dbpedia-ont: <http://dbpedia.org/ontology/>
            PREFIX pa: <http://localhost/pa/images/media/entities.owl#>
            PREFIX owl: <http://www.w3.org/2002/07/owl#>
            PREFIX foaf: <http://xmlns.com/foaf/0.1/>
            PREFIX db-prop: <http://dbpedia.org/property/>
            # Emit the city-to-country link in both directions
            CONSTRUCT
            {
              ?newLoc a pa:City .
              ?newLoc pa:cityOfCountry ?country .
              ?newLoc pa:locationName ?name .
              ?country pa:hasCity ?newLoc
            }
            WHERE
            {
              ?newLoc a dbpedia-ont:City .
              ?newLoc db-prop:subdivisionName ?country .
              # Keep only subdivision values that are typed as countries
              ?country a <http://dbpedia.org/ontology/Country> .
              ?newLoc foaf:name ?name
            }
    24. People: total > 200000
        • Footballers: 24k
        • Cricketers: 4k
        • American Footballers: 8k
        • Actors: 12k
        • Music Artists: 22k
        • Baseball players: 1200
        • Basketball players: 1200
        • British Royalty: 800
        • Cyclists: 2300
        • Politicians: 15k
        • F1 Racing Drivers: 1100 …
    25. Groups: total > 50k
        • National Football Teams: 400
        • Bands: 16000
        • Companies: 24k
        • Clubs: 800
    26. Work > 200000
        • Albums: 80k
        • Films: 80k
        • Singles: 27k
        • Books: 17k …
        And:
        • Events: 2000
        • Locations: 200000
    27. Conclusions
        • Linked data is very exciting
        • The intention is that we move from a web of documents to a web of data
          • The Web as database
        • PA Knowledge base generation using the Linked Data Cloud
        • A complete product that utilises semantic technologies to lower the cost of annotation and improve the search experience
    28. Acknowledgement
        • KTP Project, Press Association & Nottingham Trent University
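
The ontology-mapping idea behind the SPARQL CONSTRUCT query on slide 22 can be illustrated without a triple store. The following is a minimal Python sketch, not the project's implementation: plain tuples stand in for RDF triples, the prefixed names are copied from the slide, and the mapping tables, function name and sample data are assumptions made up for illustration. Like the WHERE clause, it only emits triples for subjects that have the mapped class and every mapped property.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]
RDF_TYPE = "rdf:type"

# Hypothetical mapping tables mirroring the CONSTRUCT/WHERE pair on slide 22.
CLASS_MAP: Dict[str, str] = {"dbpedia-ont:City": "pa:City"}
PROPERTY_MAP: Dict[str, str] = {
    "foaf:name": "pa:locationName",
    "dbpedia-ont:latitutedegrees": "pa:latitutedegrees",
}


def construct_pa_triples(source: Iterable[Triple]) -> Set[Triple]:
    """Return PA triples for subjects that match the whole source pattern."""
    types: Dict[str, Set[str]] = defaultdict(set)
    values: Dict[str, Dict[str, Set[str]]] = defaultdict(lambda: defaultdict(set))
    for s, p, o in source:
        if p == RDF_TYPE and o in CLASS_MAP:
            types[s].add(CLASS_MAP[o])
        elif p in PROPERTY_MAP:
            values[s][PROPERTY_MAP[p]].add(o)

    required = set(PROPERTY_MAP.values())
    result: Set[Triple] = set()
    for subject, pa_classes in types.items():
        props = values.get(subject, {})
        # Skip subjects that lack any mapped property, as the WHERE clause would.
        if required - set(props):
            continue
        for pa_class in pa_classes:
            result.add((subject, RDF_TYPE, pa_class))
        for pa_prop, objects in props.items():
            for obj in objects:
                result.add((subject, pa_prop, obj))
    return result


# Made-up sample data: one complete city, one city without a latitude, one band.
sample = [
    ("db:Nottingham", RDF_TYPE, "dbpedia-ont:City"),
    ("db:Nottingham", "foaf:name", "Nottingham"),
    ("db:Nottingham", "dbpedia-ont:latitutedegrees", "52"),
    ("db:Nottingham", "dbpedia-ont:abstract", "A city in England"),
    ("db:Leeds", RDF_TYPE, "dbpedia-ont:City"),
    ("db:Leeds", "foaf:name", "Leeds"),
    ("db:Oasis", RDF_TYPE, "dbpedia-ont:Band"),
    ("db:Oasis", "foaf:name", "Oasis"),
]

for triple in sorted(construct_pa_triples(sample)):
    print(triple)
```

Unmapped properties and classes are dropped, which is one way the resulting knowledge base stays smaller and cleaner than the source, in line with the concern about noisy data on slide 20.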

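
Slide 8 lists a gazetteer of known entities as one input to the GATE-based IE system. As a rough illustration of what a gazetteer lookup does, here is a minimal Python sketch of greedy longest-match lookup over a caption. It does not use or imitate GATE's API; the entity identifiers, type labels, function name and `max_len` parameter are all hypothetical, and only the names are taken from slide 10.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical gazetteer: lower-cased surface form -> (entity id, entity type).
GAZETTEER: Dict[str, Tuple[str, str]] = {
    "henry viii": ("pa:Henry_VIII", "Royalty"),
    "henry vii": ("pa:Henry_VII", "Royalty"),
    "anne boleyn": ("pa:Anne_Boleyn", "Person"),
}

Annotation = Tuple[int, int, str, str]  # (start offset, end offset, id, type)


def annotate(caption: str,
             gazetteer: Dict[str, Tuple[str, str]] = GAZETTEER,
             max_len: int = 4) -> List[Annotation]:
    """Annotate a caption using greedy longest-match lookup over word tokens."""
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", caption)]
    annotations: List[Annotation] = []
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tok[0] for tok in tokens[i:i + n])
            if phrase in gazetteer:
                entity_id, entity_type = gazetteer[phrase]
                annotations.append(
                    (tokens[i][1], tokens[i + n - 1][2], entity_id, entity_type))
                matched = n
                break
        # Jump past a match, or move on by one token when nothing matched.
        i += matched if matched else 1
    return annotations


caption = "Portrait of Henry VIII and Anne Boleyn"
for start, end, entity_id, entity_type in annotate(caption):
    print(caption[start:end], entity_id, entity_type)
```

Matching on whole word tokens keeps "Henry VII" from firing inside "Henry VIII". A production system would still need the context rules and disambiguation steps the slide names, which this sketch leaves out.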