Towards Semantic Search with Topic Maps Lars Marius Garshol <larsga@bouvet.no> TMRA 2009, November 12, Leipzig

Semantic Search with Topic Maps


A description of a possible approach to a true semantic search based on Ontopia and Topic Maps, presented at TMRA 2009.

Published in: Technology


  1. 1. Towards Semantic Search with Topic Maps Lars Marius Garshol <larsga@bouvet.no> TMRA 2009, November 12, Leipzig
  2. 2. What this talk is about
     - Basically, moving from full-text search to a more semantic form of search
       - if the user types “hotels Leipzig”, can we do something more than look for documents containing these two words?
       - for example, can we turn this into “find hotels located in Leipzig”?
     - It describes some personal experiments with new approaches
       - what is described here needs more work
  3. 3. Two kinds of search
     - Web-wide search and site-wide search
       - these two are not the same kind of search
       - the first means searching everything
       - the second means searching in a limited domain
     - This proposal only deals with site-wide search
       - making it work for web-wide search is hard
       - so we don’t do that
  4. 4. Two (other) kinds of search
     - Natural language search
       - where users put questions to the machine, typically using something approaching complete sentences
       - users are assumed to be at least somewhat familiar with the domain
     - Web-site search
       - users behave unpredictably
       - users do not necessarily know the domain
       - users are unaware of what search technology is used
       - users cannot be trained
  5. 5. Algorithm
     - (1) Parse query into a list of tokens
       - categorize tokens as “instance”, “topic type”, “unknown”, ...
     - (2) Build an interpretation from the token list
       - the interpretation is a tolog query
       - if none found, fall back to full-text
     - (3) Verify interpretation against schema
       - if one is present, that is
     - (4) Run chosen interpretation, present results
       - also present interpretation, so the user knows what is happening
       - allow the user to override and fall back to normal full-text search
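
The four steps can be sketched as a small driver function. This is only an illustration, not the actual Jython script: the function names, the toy vocabulary, and the decision to fall back to full-text when verification fails are all assumptions made for the example.

```python
def semantic_search(query, parse, interpret, verify, run, full_text):
    """Run steps (1)-(4), falling back to full-text search when needed."""
    tokens = parse(query)                       # (1) query -> token list
    interpretation = interpret(tokens)          # (2) None if no interpretation found
    if interpretation is None or not verify(interpretation):   # (3)
        return {"mode": "full-text", "interpretation": None,
                "hits": full_text(query)}
    # (4) return the interpretation too, so the user can see what happened
    return {"mode": "semantic", "interpretation": interpretation,
            "hits": run(interpretation)}


# Toy stand-ins, for illustration only (hypothetical vocabulary and results).
KNOWN = {"hotels": ("T", "hotel"), "leipzig": ("I", "leipzig")}

def toy_parse(query):
    return [KNOWN.get(word, ("?", word)) for word in query.lower().split()]

def toy_interpret(tokens):
    # Assumption: give up as soon as one word is unrecognized.
    if any(kind == "?" for kind, _ in tokens):
        return None
    return tokens

def toy_search(query):
    return semantic_search(
        query,
        parse=toy_parse,
        interpret=toy_interpret,
        verify=lambda interpretation: True,      # no schema: accept everything
        run=lambda interpretation: ["Hotel Example"],
        full_text=lambda q: ["document mentioning " + q],
    )
```

With these stand-ins, `toy_search("hotels Leipzig")` takes the semantic path, while `toy_search("TMRA hotels")` falls back to full-text because “TMRA” is not recognized.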
  6. 6. Tokens
     - The types of tokens are:
       - T topic type (e.g., “person”)
       - I instance topic (e.g., “Lars Marius Garshol”)
       - A association type (e.g., “employed by”)
       - ? unrecognized word (e.g., “TMRA”)
     - For example, the search “hotels Leipzig” would typically be parsed into the following list of tokens
       - T hotel, topic type
       - I Leipzig, instance of city
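
A minimal sketch of this token classification, assuming a small in-memory vocabulary instead of lookups against a real topic map. The dictionaries, the `Token` class, and the greedy longest-match strategy for multi-word names are assumptions for illustration. Since the talk notes that stemming is missing, the plural “hotels” is simply listed as an extra name here.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical vocabulary; a real system would look names up in the topic map.
TOPIC_TYPES = {"hotel": "hotel", "hotels": "hotel", "person": "person"}
INSTANCES = {"leipzig": "city", "lars marius garshol": "person"}  # name -> type
ASSOC_TYPES = {"employed by"}

@dataclass(frozen=True)
class Token:
    kind: str                         # "T", "I", "A" or "?"
    text: str
    topic_type: Optional[str] = None  # only set for instance tokens

def classify(phrase):
    if phrase in TOPIC_TYPES:
        return Token("T", TOPIC_TYPES[phrase])
    if phrase in INSTANCES:
        return Token("I", phrase, INSTANCES[phrase])
    if phrase in ASSOC_TYPES:
        return Token("A", phrase)
    return Token("?", phrase)

def tokenize(query):
    words = query.lower().split()
    tokens, i = [], 0
    while i < len(words):
        # Try the longest phrase first so multi-word names become one token.
        for j in range(len(words), i, -1):
            token = classify(" ".join(words[i:j]))
            if token.kind != "?" or j == i + 1:
                tokens.append(token)
                i = j
                break
    return tokens
```

For example, `tokenize("hotels Leipzig")` yields a T token for `hotel` followed by an I token for `leipzig` with topic type `city`.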
  7. 7. Example: a photo topic map
     - I use a topic map to organize my digital photos
       - it now holds ~13,000 photos
       - online at http://www.garshol.priv.no/tmphoto/
     - A web application is used for search and navigation
       - I’ve added the semantic search to this application for demonstration purposes
     - [Diagram labels: Photo, Person, Event, Category, Location]
  8. 10. Hierarchies
     - In many cases, the generic “I” interpretation is too simplistic
       - none of the Sam Oh photos are marked as being taken in Canada; they are all marked as being taken in places that are contained in Canada
       - this is a very common case
     - Solved by using ontology annotation
       - Kal Ahmed has published a set of PSIs for indicating hierarchical association types
       - these are used by the Ontopia tools, at least
       - these can be used to pick up hierarchical association types and extend the interpretation of “I” terms to handle them
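
The effect of the hierarchical extension can be sketched with a transitive containment check. The data below is made up for illustration, and the sketch hardcodes a single containment hierarchy, whereas the approach in the talk discovers hierarchical association types through the PSI annotations.

```python
# Hypothetical data: place -> the place that directly contains it.
CONTAINED_IN = {"Montreal": "Quebec", "Quebec": "Canada", "Oslo": "Norway"}

# Hypothetical photos, each marked with the exact place it was taken.
PHOTOS = {"photo-1": "Montreal", "photo-2": "Oslo"}

def within(place, container):
    """True if place is the container or is transitively contained in it."""
    seen = set()
    while place is not None and place not in seen:
        if place == container:
            return True
        seen.add(place)                    # guard against cycles in the data
        place = CONTAINED_IN.get(place)    # walk one level up the hierarchy
    return False

def photos_taken_in(container):
    return sorted(p for p, loc in PHOTOS.items() if within(loc, container))
```

Here `photos_taken_in("Canada")` finds `photo-1` even though that photo is only marked as taken in Montreal.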
  9. 13. Hotel Europa is in Montreal. Ste Brigitte des Saults is on the road between Montreal and Quebec City.
  10. 15. Verifying the interpretation
      - Not all interpretations can actually produce results
        - for example, “puccini tenor” does not work, because no topics are related to both
      - We can actually work this out, based on the schema, because
        - there is no topic type to which both composers and voice types can be related
        - studying the schema will tell us this
      - Studying the schema also helps us explain the interpretation to the user
      - [Diagram labels: Sam Oh, Montréal, person, photo, person, location, location]
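
The schema check can be sketched as a set intersection: an interpretation relating two topic types can only produce results if some topic type can be associated with both. The schema below is a hypothetical, simplified stand-in, not the schema of any real topic map.

```python
# Hypothetical schema: pairs of topic types that some association type connects.
SCHEMA = {
    ("opera", "composer"),
    ("character", "opera"),
    ("character", "voice-type"),
    ("photo", "person"),
    ("photo", "location"),
}

def related_types(topic_type):
    """All topic types that can be directly associated with topic_type."""
    related = set()
    for a, b in SCHEMA:
        if a == topic_type:
            related.add(b)
        if b == topic_type:
            related.add(a)
    return related

def candidate_result_types(type_a, type_b):
    """Topic types that can be related to both type_a and type_b."""
    return related_types(type_a) & related_types(type_b)
```

With this schema, `candidate_result_types("composer", "voice-type")` is empty, so a query like “puccini tenor” can be rejected before it is run, while `candidate_result_types("person", "location")` contains `photo`.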
  11. 16. How to use this with your topic map
      - Install the component, then search
      - No configuration is necessary!
      - However, for better results you may want to
        - add more names for some topics
        - mark hierarchical association types as such (should be done already)
        - mark topic types with large instance sets as such
  12. 17. Current implementation
      - Just a Jython script using Ontopia
        - 541 lines
        - builds a set of token objects, then a set of constraint objects
        - then introspects the schema to remove hopeless constraints
      - Stemming is still missing!
        - need to modify Ontopia full-text search to do this
      - Run from a JSP file by means of the Jython API
        - just 10-15 lines of glue code
      - Longer-term this may turn into a proper Ontopia component
        - time horizon not at all clear
  13. 18. Weaknesses
      - No relevance ranking
        - given “beer Oslo”, all found photos are equally closely tied to “beer” and “Oslo”
        - there is nothing to rank their relevance by
        - on the other hand, all hits are definitely relevant to the query as given
      - Homonym support too simplistic
        - it’s not clear that it will actually handle all cases in practice
        - a better approach would be to construct multiple interpretations and then choose between them
        - ideally the user should be allowed to override the choice
      - Very closely tied to topic map structure
        - if the user uses the wrong terms, the approach does not work
        - only allows structured searches along the dimensions actually in the topic map
        - how much of an issue this is probably depends on the application
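
The “multiple interpretations” idea for homonyms could look roughly like the sketch below. It is not part of the implementation described in the talk: the readings table, the validity predicate, and the rule of picking the first valid candidate unless the user overrides it are all assumptions for illustration.

```python
from itertools import product

# Hypothetical readings: one word may name several topics (homonyms).
READINGS = {
    "paris": [("I", "Paris (city)"), ("I", "Paris (person)")],
    "hotels": [("T", "hotel")],
}

def interpretations(query):
    """Every combination of readings for the words in the query."""
    per_word = [READINGS.get(w, [("?", w)]) for w in query.lower().split()]
    return [list(combo) for combo in product(*per_word)]

def choose(candidates, is_valid, override=None):
    """Pick the first valid candidate, or the one the user selected."""
    valid = [c for c in candidates if is_valid(c)]
    if not valid:
        return None                 # caller falls back to full-text search
    if override is not None:
        return valid[override]      # index chosen by the user
    return valid[0]
```

For “hotels paris” this produces two candidates, one per reading of “paris”. The `is_valid` predicate would typically be a schema check, and `override` lets the user pick the other reading.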
  14. 19. Do users actually query this way?
      - Literature studies and log mining indicate that:
        - nearly all queries are just 1 or 2 words
        - 2-word queries tend to be either
          - the name of an entity (New York), or
          - qualified searches (Montréal city)
      - Conclusion
        - this feature has to be used with caution
        - it may work best when users can be told about it
        - site feedback may encourage users to use it more
      - More work is needed on this
  15. 20. Taking this further
      - Limitations
        - so far all queries use a single variable
        - no understanding of association types
        - no understanding of occurrence types
        - no notion of ordering (first, last, biggest, smallest, ...)
      - This can be implemented
        - an earlier prototype could interpret queries such as “operas based on works written by Shakespeare”
        - other elements also implementable
      - However, this takes the system further away from normal user searches
        - more thinking needed on how to handle this
        - make it a semi-formal language?
        - turn it into a full natural language search component?
  16. 21. Conclusion
      - The system really does have a kind of semantic understanding
        - you type “beer Oslo”, and it says “I think you want photos of beer taken in Oslo”
      - Easy to implement
        - no configuration necessary
        - component can be plugged into any web application based on Ontopia
        - (also easy to implement on top of other Topic Maps engines)
      - Does not match current user behaviour
        - more work necessary on this
      - Not as advanced as it could be
        - single-variable queries only
        - no understanding of association types
        - more work to be done on this, too