Extracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis


Presentation at the KCAP 2011 conference of the paper: http://data.open.ac.uk/applications/kcap2011.pdf

Published in: Technology, Education

  1. 1. Extracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis<br />Mathieu d’Aquin and Enrico Motta<br />Knowledge Media Institute<br />The Open University, Milton Keynes, UK<br />
  2. 2. Hey, Data! <br />I Love Data!<br />
  3. 3. Me?<br />My name is “one great dataset” and my namespace http://datasets.com/greatone/<br />Let’s see… You there, what are you about?<br />One great dataset<br />
  4. 4. 1,254,245 triples. <br />I also have a SPARQL endpoint!<br />OK, but what’s there?<br />One great dataset<br />
  5. 5. Uh… I have a VoID description… with links and all…<br />Can you be more explicit?<br />One great dataset<br />
  6. 6. You mean you want to see… my ontology?<br />Hmm… I mean, what are these triples saying?<br />One great dataset<br />
  7. 7. That would help… but can you tell me what I can ask you?<br />Like example SPARQL queries?<br />One great dataset<br />
  8. 8. Yeah… but I don’t know SPARQL, and how do you choose your examples anyway?<br />…<br />One great dataset<br />
  9. 9. Well… figure it out by yourself then!<br />One great dataset<br />
  10. 10. Summarizing an RDF dataset with questions <br />We would like to be able to give an entry point to a dataset by showing questions it is good at answering<br />In a way that can be navigated<br />Example:<br />Who are the people Tom knows?<br />Tom Heath’s FOAF profile<br />
  11. 11. A question<br />A list of characteristics of objects (clauses) based on the relationships between objects<br />Things that are people, i.e. instances of <Person><br />Related to <tom> through the relation <knows><br />For which the answer is a set of objects <br />All the objects that satisfy the clauses of the question<br />
  12. 12. Formal concept analysis<br />Lattice of concepts: set of objects (extension) with common properties (intension)<br />Formal context: objects with binary attributes<br />Example from: http://en.wikipedia.org/wiki/Formal_concept_analysis<br />
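To make the extension/intension pairing concrete, here is a minimal Python sketch that enumerates the formal concepts of a tiny context by brute force. The context data and all function names are hypothetical illustrations, not taken from the presentation, and the exhaustive enumeration is only practical for toy inputs (dedicated tools are used for real lattices).

```python
from itertools import combinations

# Hypothetical toy formal context: object -> set of binary attributes.
context = {
    "tom": {"Person", "knows-enrico"},
    "enrico": {"Person"},
    "paper1": {"Document"},
}

def common_attributes(objects, context):
    """Intension: attributes shared by every object in `objects`."""
    if not objects:
        # By convention, the empty set of objects shares every attribute.
        return set().union(*context.values())
    return set.intersection(*(context[o] for o in objects))

def matching_objects(attrs, context):
    """Extension: objects that carry every attribute in `attrs`."""
    return {o for o, a in context.items() if attrs <= a}

def concepts(context):
    """Enumerate all (extension, intension) pairs by closing every object subset."""
    found = set()
    objs = sorted(context)
    for r in range(len(objs) + 1):
        for subset in combinations(objs, r):
            intent = common_attributes(set(subset), context)
            extent = matching_objects(intent, context)
            found.add((frozenset(extent), frozenset(intent)))
    return found

all_concepts = concepts(context)
for extent, intent in sorted(all_concepts, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```

Ordering the resulting concepts by inclusion of their extensions gives the lattice that the slide refers to.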
  13. 13. RDF instances as individuals in a formal context<br />Present relations of objects as binary attributes:<br />RDF: tom a Person. tom knows enrico. jeff knows tom.<br />FCA: tom: {Class:-Person, knows:-enrico, jeff-:knows}<br />Include implicit information based on the ontology<br />tom: {Class:-Person, Class:-Agent, Class:-Thing, knows:-enrico, knows:-Person, knows:-Agent, knows:-Thing, jeff-:knows, Person-:knows, Agent-:knows, Thing-:knows}<br />
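The conversion described on this slide can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: triples are plain tuples, `a` stands for `rdf:type`, the class hierarchy Person < Agent < Thing is hard-coded, every individual is explicitly typed, and the names `superclass`, `ancestors` and `build_context` are hypothetical. Attribute naming follows the slide's pattern: `Class:-X` for types, `p:-o` for outgoing relations and `s-:p` for incoming ones.

```python
# Hypothetical toy data mirroring the slide's example.
triples = [
    ("tom", "a", "Person"),
    ("enrico", "a", "Person"),
    ("jeff", "a", "Person"),
    ("tom", "knows", "enrico"),
    ("jeff", "knows", "tom"),
]

# Assumed ontology: Person is a subclass of Agent, Agent of Thing.
superclass = {"Person": "Agent", "Agent": "Thing"}

def ancestors(cls):
    """Return cls followed by all of its superclasses."""
    chain = [cls]
    while chain[-1] in superclass:
        chain.append(superclass[chain[-1]])
    return chain

def build_context(triples):
    """Map each individual to its set of binary FCA attributes."""
    # First pass: collect the (inferred) classes of every individual.
    types = {}
    for s, p, o in triples:
        if p == "a":
            types.setdefault(s, set()).update(ancestors(o))

    context = {}
    for s, p, o in triples:
        if p == "a":
            context.setdefault(s, set()).update(f"Class:-{c}" for c in ancestors(o))
        else:
            # Outgoing relation, stated on the subject.
            attrs = context.setdefault(s, set())
            attrs.add(f"{p}:-{o}")
            attrs.update(f"{p}:-{c}" for c in types.get(o, ()))
            # Incoming relation, stated on the object.
            attrs = context.setdefault(o, set())
            attrs.add(f"{s}-:{p}")
            attrs.update(f"{c}-:{p}" for c in types.get(s, ()))
    return context

context = build_context(triples)
print(sorted(context["tom"]))
```

Generalizing each relation to the classes of its target is what lets the lattice group individuals under questions such as "people who know some Person", at the cost of many more attributes.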
  14. 14. Example lattice: Tom’s FOAF Profile<br />
  15. 15. Eliminating redundancies<br />Who are the people Tom knows?<br />
  16. 16. A concept in the lattice is a question<br />Intension = clauses of the question <br />Extension = answers <br />All the objects of the extension satisfy the clauses of the question<br />Different areas of the lattice focus on different topics<br />Questions are organized in a hierarchy<br />{Class:-Person, tom-:knows}<br />What are the (Person) that (tom knows)?<br />What are tom’s current projects?<br />What are the people?<br />What are the people that tom knows?<br />
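As a rough illustration of how an intension could be turned into question text, the Python sketch below renders a set of attributes in the bracketed style shown on this slide. The function name `render_question` and the fallback to `(Thing)` when no class clause is present are assumptions made for the example; the presentation does not describe the actual generation rules.

```python
def render_question(intent):
    """Render a set of FCA attributes as a bracketed question string."""
    prefix = "Class:-"
    classes = sorted(a[len(prefix):] for a in intent if a.startswith(prefix))
    relations = []
    for attr in sorted(a for a in intent if not a.startswith(prefix)):
        if "-:" in attr:
            # Incoming relation "s-:p" reads as "s p" (e.g. "tom knows").
            subject, predicate = attr.split("-:", 1)
            relations.append(f"{subject} {predicate}")
        elif ":-" in attr:
            # Outgoing relation "p:-o" reads as "p o" (e.g. "knows enrico").
            predicate, obj = attr.split(":-", 1)
            relations.append(f"{predicate} {obj}")
    # Assumed fallback: with no class clause, ask about things in general.
    noun = " and ".join(f"({c})" for c in classes) or "(Thing)"
    question = f"What are the {noun}"
    if relations:
        question += " that " + " and ".join(f"({r})" for r in relations)
    return question + "?"

question = render_question({"Class:-Person", "tom-:knows"})
print(question)
```

For the intension `{Class:-Person, tom-:knows}` this yields the slide's intermediate form, which a later step would still need to smooth into natural language.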
  17. 17. But…<br />The RDF → Formal Context process can generate a lot of attributes, and so a lot of questions<br />Ranging from things uninterestingly general<br /> What are the Things?<br />To the ones that might be interesting only in very specific cases<br /> What are the Indian restaurants located in San Diego that have been rated OK and are called “Chez Bob”?<br />Need to extract a list of questions as an entry point<br />
  18. 18. How to measure the interestingness of a question - metrics<br />Inspired by ontology summarization:<br />Coverage: if providing a list of questions, the questions should cover the entire lattice (i.e., at least one question per branch)<br />Level: Too general or too specific questions are not useful<br />Density: The number of clauses can have an impact (avoid too complex questions as well as too simple ones)<br />Inspired by FCA:<br />Support: the cardinality of the extent, i.e. the number of answers<br />Intensional Stability: How much a concept depends on particular elements of the extension<br />Extensional Stability: How much a concept depends on particular elements of the intension<br />
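The FCA-inspired metrics can be illustrated with a small Python sketch. It assumes the common subset-counting definition of stability (the fraction of subsets of the extension that still yield the same intension, and symmetrically for the extensional variant); the presentation does not state which formula was used, so treat this as one plausible reading. The toy context and function names are hypothetical, and the enumeration is exponential.

```python
from itertools import chain, combinations

# Hypothetical toy formal context: object -> set of binary attributes.
context = {
    "tom": {"Person", "knows-enrico"},
    "enrico": {"Person"},
    "paper1": {"Document"},
}

def intent_of(objects, context):
    """Attributes shared by all `objects` (all attributes if `objects` is empty)."""
    if not objects:
        return set().union(*context.values())
    return set.intersection(*(context[o] for o in objects))

def extent_of(attrs, context):
    """Objects carrying every attribute in `attrs`."""
    return {o for o, a in context.items() if attrs <= a}

def subsets(items):
    """All subsets of `items`, as tuples."""
    items = sorted(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def support(extent):
    """Number of answers to the question."""
    return len(extent)

def intensional_stability(extent, intent, context):
    """Share of subsets of the extension whose common attributes equal the intension."""
    subs = list(subsets(extent))
    return sum(intent_of(set(c), context) == intent for c in subs) / len(subs)

def extensional_stability(extent, intent, context):
    """Share of subsets of the intension whose matching objects equal the extension."""
    subs = list(subsets(intent))
    return sum(extent_of(set(d), context) == extent for d in subs) / len(subs)

extent, intent = {"tom", "enrico"}, {"Person"}
print(support(extent),
      intensional_stability(extent, intent, context),
      extensional_stability(extent, intent, context))
```

A high extensional stability means the same set of answers is reached from many combinations of clauses, so the question does not hinge on one particular attribute.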
  19. 19. Experiment: finding the relevant metrics<br />4 datasets in different domains<br />12 evaluators providing questions of interest for these datasets<br />Obtained 44 questions, out of which 27 are valid (no overlap)<br />Some are too complicated for our model (include disjunction, negation, aggregation functions)<br />“What is the highest point in Florida?”<br />A large part do not comply with the initial instructions: should be self-contained and answered by a list of objects<br />“How high is mountain x?”<br />“What are the restaurant in a given city?”<br />
  20. 20. Results<br />Level: Questions fall between levels 3 and 7, with an average of 4.46.<br /><ul><li>Interesting questions are located around the center of the lattice</li></ul>Density: Questions have between 1 and 3 clauses<br /><ul><li>Simple questions are preferred</li></ul>Support: Very large variations amongst the obtained questions<br />Intensional Stability: Very large variations amongst the obtained questions<br />Extensional Stability: High values (between 0.75 and 1.0), especially compared to the average (0.4)<br />Conclusion:<br />In order to establish a list of questions most likely to be of interest, a combination of level, density and extensional stability, together with coverage, should be used<br />
  21. 21. Evaluation<br />Algorithm to generate a set of questions from the lattice of an RDF dataset that<br />Cover the entire lattice<br />Are believed to be interesting according to a given measure<br />Datasets from data.open.ac.uk<br />614 course descriptions <br />1706 Video podcasts<br />Using the metrics: random, closeness to middle level, density close to 2, support, extensional stability, and <br />Aggregated = 1/3 level + 1/3 density + 1/3 stability<br />6 users scored the resulting sets of questions (6 metrics on 2 datasets: 12 sets in total) according to interestingness <br />
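The aggregated measure on this slide weights level, density and stability equally. The Python sketch below shows one way such a score could be computed. The normalizations in `level_score` and `density_score` are assumptions introduced for illustration (the slide only says "closeness to middle level" and "density close to 2"), as are all function and parameter names.

```python
def level_score(level, max_level):
    """Assumed normalization: 1.0 at the middle level, 0.0 at top or bottom."""
    mid = max_level / 2
    return 1 - abs(level - mid) / mid

def density_score(n_clauses, target=2):
    """Assumed normalization: 1.0 at `target` clauses, decaying with distance."""
    return 1 / (1 + abs(n_clauses - target))

def aggregated(level, max_level, n_clauses, stability):
    """Equal-weight combination: 1/3 level + 1/3 density + 1/3 stability."""
    return (level_score(level, max_level)
            + density_score(n_clauses)
            + stability) / 3

# Hypothetical candidates: (label, level, number of clauses, extensional stability).
candidates = [
    ("mid-lattice, 2 clauses, stable", 4, 2, 1.0),
    ("near the top, 5 clauses, unstable", 1, 5, 0.4),
]
ranked = sorted(candidates,
                key=lambda c: aggregated(c[1], 8, c[2], c[3]),
                reverse=True)
for label, level, n_clauses, stability in ranked:
    print(label, round(aggregated(level, 8, n_clauses, stability), 2))
```

Selecting the best-scoring question per branch of the lattice would then add the coverage requirement, which this sketch leaves out.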
  22. 22. Results<br />
  23. 23. Implementation: the whatoask interface<br />Dataset with SPARQL endpoint<br />SPARQL2RCF<br />Formal Context<br />CORON<br />Offline<br />Lattice<br />Online<br />Lattice Parser<br />Interface Generation (using metrics)<br />Interface with navigation in Browser<br />User<br />
  24. 24. Example: Open educational material (OpenLearn)<br />
  25. 25. Example: Database of reading experiences (Arts History project)<br />
  26. 26. Example: Open University Buildings<br />
  27. 27. Conclusion<br />The technique presented provides both a summary and an exploration mechanism over RDF data, using the underlying ontology and formal concept analysis<br />It provides an interface for documenting the dataset by examples rather than by specification<br />It favors serendipity in the exploration of the dataset, without the need for prior, specialized knowledge<br />The current beta interface is available as an online demo<br />Need to improve the question generation and navigation mechanisms<br />Ongoing experiment including information gathered through the links to external datasets, to generate unanticipated questions<br />Use-cases in research projects in Arts and Humanities<br />
  28. 28. Thank you!<br />More info<br />Demo: http://lucero-project.info/lb/2011/06/what-to-ask-linked-data/<br />Data.open.ac.uk (for some of the datasets used)<br />@mdaquin – m.daquin@open.ac.uk<br />