LODIE overview

  1. LODIE: Linked Open Data Information Extraction
     Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang
     OAK Group, Department of Computer Science, The University of Sheffield, UK
     Semantic Web and Information Extraction (SWAIE) @ EKAW 2012, 9th October 2012
     Outline: Background, Methodology, Evaluation, Conclusions, References
  2. LODIE: Overview. Linked Open Data for Web-scale Information Extraction
     - Web-scale IE: large numbers of documents, domains and facts; efficient and effective methods required
     - Linked Open Data to seed learning: "[. . . ] a recommended best practice for exposing, sharing, and connecting data [. . . ] using URIs and RDF" (linkeddata.org); a large KB of typed instances, relations and annotations (e.g., RDFa)
     - Adapting to specific user information needs: users define specific IE tasks by specifying the types of instances and relations to be learnt
  3. Challenges: user information needs
     - SoA: defines a generic IE task (KnowItAll, StatSnowball, PROSPERA, NELL, Extreme Extraction [3, 4, 5, 11]); extracts "people, organisations, locations" etc. and their generic relations
     - RQ: how to let users define Web-IE tasks tailored to their own needs, e.g. "drugs that treat hayfever"
  4. Challenges: training data
     - SoA: requires a certain amount of training/learning resources to be manually specified
     - RQ: how to automatically obtain these (and filter noise) from the LOD
  5. Challenges: learning strategies
     - SoA: typically semi-supervised, bootstrapping-based learning from unstructured texts, prone to propagation of errors
     - RQ: how to combine multi-strategy learning (e.g., from both structured and unstructured content) to avoid drifting away from the learning task
  6. Challenges: publication of triples
     - SoA: no integration with existing KBs
     - RQ: how to integrate with the LOD
  7. LODIE Architecture Overview
     Figure: Architecture diagram.
  8. User needs formalisation
     Goal: support users in formalising their information needs in a machine-understandable format
     Hypothesis:
     - users define information needs in terms of ontologies
     - users use different vocabularies in ontology creation
     Methods:
     - Baseline: manually identify relevant ontologies on the LOD and define a view on them using tools like neon-toolkit.org
     - Ontology Design Patterns: reuse existing Content ODP building blocks and apply re-engineering patterns to bridge the "vocabulary gap"
  9. Learning Seed Identification and Filtering - I
     Goal 1: automatically identify training data in the form of triples and annotations to seed learning
     Hypothesis:
     - the LOD can already contain answers to user needs in the form of triples and annotations
     - the Web contains additional linguistic realisations of triples
     Method:
     - from the LOD: SPARQL queries to fetch seed triples (and annotations) matching the user needs
     - from the Web: search for linguistic realisations of the triples identified above, i.e. co-occurrence of related instances in textual contexts (e.g. sentences) and in structural elements (e.g. tables)
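The "from the LOD" step above can be sketched as plain query construction. `dbo:Drug` is real DBpedia vocabulary, but the `treats` property and the task itself are illustrative assumptions, not something the slide specifies:

```python
def build_seed_query(class_uri: str, prop_uri: str, limit: int = 100) -> str:
    """Build a SPARQL query fetching seed triples for a user-defined IE task:
    instances of `class_uri`, their English labels, and their `prop_uri` values."""
    return f"""PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?s ?label ?o WHERE {{
    ?s a <{class_uri}> ;
       rdfs:label ?label ;
       <{prop_uri}> ?o .
    FILTER (lang(?label) = "en")
}} LIMIT {limit}"""

# Hypothetical "drugs that treat hayfever"-style task; the property URI is illustrative.
query = build_seed_query("http://dbpedia.org/ontology/Drug",
                         "http://dbpedia.org/ontology/treats")
```

The resulting query text would then be submitted to a public SPARQL endpoint (e.g. DBpedia's) with any SPARQL client.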
  10. Learning Seed Identification and Filtering - II
     Goal 2: filter noisy training data and select the most informative examples for learning
     Hypothesis:
     - identified learning seeds can contain noise (causing "drifting away") and can be redundant (causing unnecessary overheads)
     - good learning examples are consistent w.r.t. the learning task, and diverse
     Method:
     - Consistency measure: cluster seed instances of different classes N times (varying parameters); does instance i always appear in the cluster representing the same class?
     - Variability measure: cluster seed instances of the same class; how many clusters are generated, and how dense are they?
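A minimal sketch of the consistency measure, assuming each clustering run is represented as a dict from instance to cluster id and each seed instance carries a class label (the slide does not specify the clustering algorithm or its parameters):

```python
from collections import Counter

def consistency(runs, labels, instance):
    """Fraction of clustering runs in which `instance` lands in a cluster whose
    majority seed class equals its own class.

    runs   -- list of dicts mapping instance -> cluster id (one dict per run)
    labels -- dict mapping instance -> seed class
    """
    hits = 0
    for assignment in runs:
        cluster = assignment[instance]
        members = [i for i, c in assignment.items() if c == cluster]
        majority = Counter(labels[m] for m in members).most_common(1)[0][0]
        hits += (majority == labels[instance])
    return hits / len(runs)
```

Seeds scoring well below 1.0 keep changing cluster "identity" across runs and are candidates for filtering.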
  11. Multi-Strategy Learning
     Goal: learn from different content types (e.g., structured, unstructured) with different strategies, to improve both recall and precision
     Hypothesis:
     - the same piece of knowledge can be repeated in different forms, e.g., in tables vs. sentences (reinforcing evidence: precision)
     - some knowledge may be found only in one form or another (recall)
     Method: multi-strategy learning
     - learning from structures such as tables and lists [10, 8]
     - inducing wrappers for regular pages [7]
     - lexical-syntactic pattern learning from free texts
     - combining the outputs of the different processes
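The combination step is not detailed on the slide; one simple possibility is weighted voting over the candidate triples each strategy proposes, sketched here with made-up strategy names:

```python
from collections import defaultdict

def combine(extractions, weights=None):
    """Merge candidate triples from several learning strategies, scoring each
    triple by the (weighted) number of strategies that proposed it.

    extractions -- dict mapping strategy name -> iterable of (s, p, o) triples
    weights     -- optional dict mapping strategy name -> vote weight
    """
    weights = weights or {}
    scores = defaultdict(float)
    for strategy, triples in extractions.items():
        w = weights.get(strategy, 1.0)
        for t in triples:
            scores[t] += w
    # Most corroborated triples first.
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Triples seen by several strategies (e.g. in both a table and a sentence) float to the top, matching the reinforcing-evidence intuition on the slide.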
  12. Integration with the LOD
     Goal: assign a unique identifier to the extracted knowledge
     Hypothesis: knowledge that already exists in the LOD can be re-extracted, and must be integrated
     Method: simple, scalable disambiguation methods, e.g., by feature overlap [2] and string distance metrics [9]
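A minimal sketch of such lightweight disambiguation, combining a Lesk-style feature (context-word) overlap with a normalised edit distance; the 0.5/0.5 weighting is an arbitrary assumption, not a value from the slides:

```python
def jaccard(a, b):
    """Feature overlap between two sets of context features."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def levenshtein(s, t):
    """Edit distance via dynamic programming (two-row variant)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

def candidate_score(mention, context, label, cand_features):
    """Score an LOD candidate for a textual mention: string similarity of the
    names plus overlap between the mention's context words and the candidate's
    features (e.g. abstract words, property values). Weights are illustrative."""
    denom = max(len(mention), len(label)) or 1
    name_sim = 1 - levenshtein(mention.lower(), label.lower()) / denom
    return 0.5 * name_sim + 0.5 * jaccard(context, cand_features)
```

The highest-scoring candidate URI would be assigned to the mention; ties or low scores can be left unlinked (a new identifier is minted instead).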
  13. User Feedback
     Goal: integrate users' feedback on learning
     Hypothesis: automatic extraction is imperfect, and users' feedback can help improve learning
     Method: expose the extracted knowledge via an interface and collect user feedback; errors can be used as negative examples for re-training
  14. Evaluation
     - suitability to formalise the user needs
     - suitability of the approach to IE
  15. Evaluation: suitability to formalise the user needs
     Task described in natural language -> equivalent IE task
     - feasibility and efficiency (reasonable time, limited overhead)
     - effectiveness:
       - is the result representative of the user needs? users judge both the description of the task and the resulting instances
       - is the result suitable to seed IE? usefulness of the triples in terms of learning, using the proposed quality measures
  16. Evaluation: suitability of the approach to IE
     - definition of a new task: population of sections of the schema.org ontology
       - precision
       - partial evaluation of recall on a fraction of the available annotated instances: checking recall with respect to already available annotated instances, not provided for training
     - comparative large-scale IE evaluations: TAC Knowledge Base Population [6], TREC Entity Extraction task [1]
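The two measures above reduce to simple set arithmetic once extractions and gold annotations are represented as sets of triples; note that recall is computed only against held-out annotated instances, since the full set of correct facts is unknowable at Web scale:

```python
def precision(extracted, gold):
    """Fraction of extracted items that are correct (appear in the gold set)."""
    return len(extracted & gold) / len(extracted) if extracted else 0.0

def partial_recall(extracted, held_out):
    """Recall measured only against annotated instances withheld from training,
    as an estimate of the true (unknowable) Web-scale recall."""
    return len(extracted & held_out) / len(held_out) if held_out else 0.0
```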
  17. Impact
     LODIE timeliness
     - the LOD is the first very large-scale information resource available for IE, covering a growing number of domains
     LODIE output
     - Web-scale IE task corpora, linked resources, etc.
     - developed code available as open source under the MIT licence
     - all generated data will be made available under a licence such as Open Data Commons (opendatacommons.org)
  18. References
     [1] Krisztian Balog and Pavel Serdyukov. Overview of the TREC 2010 Entity Track. In Proceedings of the Nineteenth Text REtrieval Conference (TREC 2010). NIST, 2011.
     [2] S. Banerjee and T. Pedersen. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In CICLing '02: Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pages 136-145, London, UK, 2002. Springer-Verlag.
     [3] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. Toward an Architecture for Never-Ending Language Learning. In AAAI 2010, pages 1306-1313, 2010.
     [4] Oren Etzioni, Ana-Maria Popescu, Daniel S. Weld, Doug Downey, and Alexander Yates. Web-Scale Information Extraction in KnowItAll (Preliminary Results). In WWW 2004, pages 100-110, 2004.
     [5] Marjorie Freedman, Lance Ramshaw, Elizabeth Boschee, Ryan Gabbard, Gary Kratkiewicz, Nicolas Ward, and Ralph Weischedel. Extreme Extraction: Machine Reading in a Week. Pages 1437-1446, 2011.
     [6] Heng Ji and Ralph Grishman. Knowledge Base Population: Successful Approaches and Challenges. 2011.
     [7] N. Kushmerick. Wrapper Induction for Information Extraction. In IJCAI 1997, 1997.
     [8] Girija Limaye et al. Annotating and Searching Web Tables Using Entities, Types and Relationships. 2010.
     [9] Vanessa Lopez, Miriam Fernández, Nico Stieler, and Enrico Motta. PowerAqua: Supporting Users in Querying and Exploring the Semantic Web Content.
     [10] David Milne and Ian H. Witten. Learning to Link with Wikipedia. Pages 509-518.
     [11] Ndapandula Nakashole and Martin Theobald. Scalable Knowledge Harvesting with High Precision and High Recall. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM 2011), pages 227-236, 2011.
