Extracting Relevant Questions to an RDF Dataset Using Formal Concept Analysis
Mathieu d’Aquin and Enrico Motta
Knowledge Media Institute, The Open University, Milton Keynes, UK
Hey, Data! I Love Data!
User: Let’s see… You there, what are you about?
One great dataset: Me? My name is “one great dataset”, and my namespace is http://datasets.com/greatone/
One great dataset: 1,254,245 triples. I also have a SPARQL endpoint!
User: OK, but what’s there?
One great dataset: Er… I have a VoID description… with links and all…
User: Can you be more explicit?
One great dataset: You mean you want to see… my ontology?
User: Hmm… I mean, what are these triples saying?
User: That would help… but can you tell me what I can ask you?
One great dataset: Like example SPARQL queries?
User: Yeah… but I don’t know SPARQL, and how do you choose your examples anyway?…
One great dataset: Well… figure it out by yourself then!
Summarizing an RDF dataset with questions
- We would like to give an entry point to a dataset by showing questions it is good at answering
- In a way that can be navigated
- Example: “Who are the people Tom knows?” (Tom Heath’s FOAF profile)
A question
- A list of characteristics of objects (clauses) based on the relationships between objects
  - Things that are people, i.e. instances of <Person>
  - Related to <tom> through the relation <knows>
- For which the answer is a set of objects: all the objects that satisfy the clauses of the question
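This notion of a question can be sketched in a few lines of Python. The attribute strings and the toy objects below are illustrative, following the notation used later in the slides (“Class:-Person” = instance of Person, “tom-:knows” = tom knows this object):

```python
# A question as a set of clauses, evaluated over a toy set of objects.
# Attribute strings follow the slides' notation: "Class:-Person" means
# "is an instance of Person"; "tom-:knows" means "tom knows this object".
objects = {
    "enrico": {"Class:-Person", "tom-:knows"},
    "jeff":   {"Class:-Person"},
    "tom":    {"Class:-Person"},
}

def answers(clauses):
    """All the objects that satisfy every clause of the question."""
    return {o for o, attrs in objects.items() if clauses <= attrs}

# "Who are the people that tom knows?"
print(answers({"Class:-Person", "tom-:knows"}))  # {'enrico'}
```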
Formal concept analysis
- Lattice of concepts: sets of objects (extension) with common properties (intension)
- Formal context: objects with binary attributes
- Example from: http://en.wikipedia.org/wiki/Formal_concept_analysis
RDF instances as individuals in a formal context
- Represent relations of objects as binary attributes:
  - RDF: tom a Person. tom knows enrico. jeff knows tom.
  - FCA: tom: {Class:-Person, knows:-enrico, jeff-:knows}
- Include implicit information based on the ontology:
  - tom: {Class:-Person, Class:-Agent, Class:-Thing, knows:-enrico, knows:-Person, knows:-Agent, knows:-Thing, jeff-:knows, Person-:knows, Agent-:knows, Thing-:knows}
Example lattice: Tom’s FOAF Profile
Eliminating redundancies
- “Who are the people Tom knows?”
A concept in the lattice is a question
- Intension = clauses of the question; Extension = answers
- All the objects of the extension satisfy the clauses of the question
- Different areas of the lattice focus on different topics
- Questions are organized in a hierarchy
- Example: {Class:-Person, tom-:knows} = “What are the (Person) that (tom knows)?”, i.e. “What are the people that tom knows?”
- Other questions in the hierarchy: “What are the people?”, “What are tom’s current projects?”
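For a toy context, the concepts (and therefore the questions) can be enumerated by brute force, closing every subset of objects. This is only a sketch for illustration; real FCA tools (such as CORON, mentioned later) use far more efficient algorithms:

```python
from itertools import chain, combinations

# Toy context (object -> attributes), as on the previous slides.
context = {
    "tom":    {"Class:-Person", "knows:-enrico"},
    "enrico": {"Class:-Person", "tom-:knows"},
    "jeff":   {"Class:-Person", "knows:-tom"},
}
objs = list(context)

def intent_of(subset):
    """Common attributes of a set of objects (top intent for the empty set)."""
    sets = [context[o] for o in subset]
    return set.intersection(*sets) if sets else set.union(*context.values())

def extent_of(intent):
    """All objects satisfying every clause of the intent."""
    return {o for o in objs if intent <= context[o]}

# Naive concept enumeration: close every subset of objects.
concepts = set()
for subset in chain.from_iterable(combinations(objs, r) for r in range(len(objs) + 1)):
    intent = intent_of(subset)
    concepts.add((frozenset(extent_of(intent)), frozenset(intent)))

for ext, intn in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(ext), "<-", sorted(intn))
```

Each printed pair is one concept: the intent reads as the clauses of a question, the extent as its answers (e.g. intent {Class:-Person} with extent {tom, enrico, jeff} is “What are the people?”).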
But…
- The RDF → Formal Context process can generate a lot of attributes, and so a lot of questions
- Ranging from uninterestingly general ones: “What are the Things?”
- To ones that might be interesting only in very specific cases: “What are the Indian restaurants located in San Diego that have been rated OK and are called ‘Chez Bob’?”
- Need to extract a list of questions as an entry point
How to measure the interestingness of a question: metrics
Inspired by ontology summarization:
- Coverage: if providing a list of questions, the questions should cover the entire lattice (i.e., at least one question per branch)
- Level: too general or too specific questions are not useful
- Density: the number of clauses can have an impact (avoid questions that are too complex, as well as ones that are too simple)
Inspired by FCA:
- Support: the cardinality of the extent, i.e. the number of answers
- Intensional stability: how much a concept depends on particular elements of the extension
- Extensional stability: how much a concept depends on particular elements of the intension
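Intensional stability, for instance, can be computed by brute force as the fraction of subsets of the extent whose common attributes are exactly the intent. The context below is a made-up toy example (the attribute `academic` is illustrative), and the subset enumeration is exponential, so this sketch only suits small extents:

```python
from itertools import chain, combinations

# Toy context for illustration (attribute names are made up).
context = {
    "tom":    {"Person", "academic"},
    "enrico": {"Person", "academic"},
    "jeff":   {"Person"},
}

def intent_of(subset):
    sets = [context[o] for o in subset]
    return set.intersection(*sets) if sets else set.union(*context.values())

def intensional_stability(extent, intent):
    """Fraction of subsets C of the extent with intent_of(C) == intent."""
    subsets = chain.from_iterable(
        combinations(extent, r) for r in range(len(extent) + 1))
    hits = sum(1 for c in subsets if intent_of(c) == intent)
    return hits / 2 ** len(extent)

# Concept "everything that is a Person":
print(intensional_stability(["tom", "enrico", "jeff"], {"Person"}))  # 0.5
```

Here the stability is 0.5: half of the subsets of {tom, enrico, jeff} (those containing jeff, plus the whole set) still close to the intent {Person}, so the concept survives the removal of some objects.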
Experiment: finding the relevant metrics
- 4 datasets in different domains
- 12 evaluators providing questions of interest for these datasets
- Obtained 44 questions, out of which 27 are valid (no overlap)
- Some are too complicated for our model (they include disjunction, negation, or aggregation functions): “What is the highest point in Florida?”
- A large part do not comply with the initial instructions (questions should be self-contained and answered by a list of objects): “How high is mountain x?”, “What are the restaurants in a given city?”
Results
- Level: questions between levels 3 and 7, with an average of 4.46. Interesting questions are located around the center of the lattice
- Density: questions have between 1 and 3 clauses. Simple questions are preferred
- Support: very large variations amongst the obtained questions
- Intensional stability: very large variations amongst the obtained questions
- Extensional stability: high values (between 0.75 and 1.0), especially compared to the average (0.4)
Conclusion: to establish a list of questions most likely to be of interest, a combination of level, density and extensional stability, together with coverage, should be used
Evaluation
- Algorithm to generate a set of questions from the lattice of an RDF dataset that:
  - Cover the entire lattice
  - Are believed to be interesting according to a given measure
- Datasets from data.open.ac.uk: 614 course descriptions, 1,706 video podcasts
- Metrics used: random, closeness to the middle level, density close to 2, support, extensional stability, and Aggregated = 1/3 level + 1/3 density + 1/3 stability
- 6 users scored the resulting sets of questions (6 metrics on 2 datasets: 12 sets in total) for interestingness
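The aggregated measure can be sketched as below. The slides only give the 1/3 + 1/3 + 1/3 weighting; the normalizations (level scored by closeness to the middle of the lattice, density by closeness to 2 clauses) are assumptions made for illustration:

```python
# Sketch of the aggregated ranking: 1/3 level + 1/3 density + 1/3 stability.
# The two normalizations below are assumptions, not the paper's definitions.

def level_score(level, max_level):
    """1.0 at the middle of the lattice, falling off toward top and bottom."""
    mid = max_level / 2
    return 1.0 - abs(level - mid) / mid

def density_score(n_clauses, target=2):
    """1.0 at the target number of clauses, decaying with distance from it."""
    return 1.0 / (1.0 + abs(n_clauses - target))

def aggregated(level, max_level, n_clauses, stability):
    """Equal-weight combination of the three normalized metrics."""
    return (level_score(level, max_level)
            + density_score(n_clauses)
            + stability) / 3.0

# A mid-level, two-clause, highly stable question scores near 1:
print(aggregated(level=5, max_level=10, n_clauses=2, stability=0.9))
```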
Results
Implementation: the what-to-ask interface
- Offline: Dataset with SPARQL endpoint → SPARQL2RCF → Formal Context → CORON → Lattice
- Online: Lattice → Lattice Parser → Interface Generation (using metrics) → Interface with navigation in the browser → User
Example: Open educational material (OpenLearn)
Example: Database of reading experiences (Arts History project)
Example: Open University Buildings
Conclusion
- The technique presented provides both a summary and an exploration mechanism over RDF data, using the underlying ontology and formal concept analysis
- It provides an interface for documenting the dataset by examples rather than by specification
- It favors serendipity in the exploration of the dataset, without the need for prior, specialized knowledge
- The current interface (in beta) is available in an online demo
- Need to improve the question generation and navigation mechanisms
- Ongoing experiment including information gathered through links to external datasets, to generate unanticipated questions
- Use cases in research projects in the Arts and Humanities
Thank you!
More info:
- Demo: http://lucero-project.info/lb/2011/06/what-to-ask-linked-data/
- data.open.ac.uk (for some of the datasets used)
- @mdaquin – m.daquin@open.ac.uk
