Mining the Web for Points of Interest

These are the slides for a talk at SIGIR 2012 (Portland, OR, USA) on extracting mentions of places from web text and then localising them using user-generated data.

  1. Adam Rae, Vanessa Murdock, Adrian Popescu, Hugues Bouchard. SIGIR 2012, Portland, Oregon, Entities Session
  2. I’m at Adam’s Bar…? Mining the Web for Points of Interest: using social media to increase our knowledge of the world
  3. Contents
     • Motivation
     • Point of Interest (POI) extraction using user-generated data
     • POI localisation using social media
     • Conclusions
  4. Motivation
     • Geographic Points of Interest are valuable representations of important places in the world around us.
     • Browsing and search of POIs increasingly important
       ›  Web search
       ›  Mobile
       ›  Navigation
  5. Where do POIs come from?
     • Editing listings coming from NMAs (national mapping agencies), commercial directories, etc.
       ›  Costly process
       ›  Expensive to maintain freshness
       ›  Coverage
     • Do they reflect the kind of places that people are interested in looking for?
  6. Can we get them from the web?
     • Un/semi-structured mentions of POIs throughout text on the web
       ›  Lots of context
     • Structured mentions of POIs in micro-blogging systems and Wikipedia articles
       ›  Easy to extract
  7. When is a POI not a POI?
     1. The White House is at 1600 Pennsylvania Avenue, Washington DC.
     2. The White House released a statement today suggesting the moon is made of cheese.
     3. The people living in the white house at the end of the street turned out to be Martians.
  8. Europe According to Foursquare
  9. The World According to Foursquare
  10. The World According to Gowalla
  11. The World According to Wikipedia
  12. Can we bootstrap using social media?
      • Train Conditional Random Fields (CRF) using web snippets bootstrapped from structured mentions in micro-blog entries
        ›  Extract POI, use as query to search engine
        ›  Resultant snippets filtered to those that contain POI
        ›  Sanitise
      • Also from geocoded Wikipedia articles (according to Yago2)
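The filtering step on the slide above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the tokeniser, the BIO label names (`B-POI`, `I-POI`, `O`) and the function names are all assumptions, and the search-engine call and the "sanitise" step are left out because the slides do not describe them.

```python
import re


def tokenize(text):
    # Assumed tokeniser: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def label_snippet(snippet, poi):
    """Return (token, label) pairs, or None if the snippet does not contain the POI."""
    tokens = tokenize(snippet)
    lowered = [t.lower() for t in tokens]
    poi_tokens = [t.lower() for t in tokenize(poi)]
    n = len(poi_tokens)
    if n == 0:
        return None
    labels = ["O"] * len(tokens)
    found = False
    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == poi_tokens:
            # Mark the first token of the mention and then its continuation.
            labels[i] = "B-POI"
            for j in range(i + 1, i + n):
                labels[j] = "I-POI"
            found = True
    return list(zip(tokens, labels)) if found else None


def build_training_data(poi, snippets):
    """Keep only snippets that mention the POI, labelled for sequence tagging."""
    labelled = (label_snippet(s, poi) for s in snippets)
    return [seq for seq in labelled if seq is not None]
```

Here `snippets` stands for the search results returned when the POI name is used as a query; snippets that never mention the POI are dropped, and the rest become labelled sequences for tagger training.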
  13. Ground Truth Data
      • Created by manual assessors given explicit instructions
        ›  1,337 examples of POIs in (some) context
        ›  1,066 unique POIs
        ›  Inter-assessor agreement:

      Ground Truth   Precision   Recall   F-Measure
      Assessor 1     0.749       0.792    0.770
      Assessor 2     0.814       0.716    0.762
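The F-Measure column is consistent with the balanced F-measure, the harmonic mean of precision and recall. The slides do not state which variant was used, so treating it as F1 is an assumption, but the short check below reproduces both reported values.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall pairs taken from the inter-assessor agreement table.
for precision, recall in [(0.749, 0.792), (0.814, 0.716)]:
    print(f"{f_measure(precision, recall):.3f}")
```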
  14. Sequential Tagging Model
      $$p(Y \mid X, \lambda) = \frac{1}{Z(X)} \exp\Big( \sum_j \lambda_j F_j(Y, X) \Big)$$
      $$\operatorname*{argmax}_{\Lambda} \left\{ \frac{1}{Z(X)} \exp\Big( \sum_j \lambda_j F_j(Y, X) \Big) \right\}$$
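To make the conditional probability concrete, the sketch below evaluates it on a toy sentence by enumerating every label sequence to obtain Z(X). The two-label set, the three feature functions and the weights are invented for illustration and are not from the talk. A real CRF computes Z(X) with dynamic programming and learns the weights from data; the sketch covers neither, and its `best_labelling` is decoding over Y, not the parameter estimation over Λ shown on the slide.

```python
import math
from itertools import product

LABELS = ("O", "POI")
WEIGHTS = (2.0, -2.0, 1.0)  # hypothetical lambda_j values, not learned


def feature_counts(labels, tokens):
    """Global features F_j(Y, X): each sums a local indicator over positions."""
    capitalised_poi = sum(1 for t, y in zip(tokens, labels) if t[0].isupper() and y == "POI")
    lowercase_poi = sum(1 for t, y in zip(tokens, labels) if t[0].islower() and y == "POI")
    poi_continues = sum(1 for a, b in zip(labels, labels[1:]) if a == b == "POI")
    return (capitalised_poi, lowercase_poi, poi_continues)


def score(labels, tokens, weights=WEIGHTS):
    """Weighted feature sum: sum_j lambda_j * F_j(Y, X)."""
    return sum(w * f for w, f in zip(weights, feature_counts(labels, tokens)))


def probability(labels, tokens, weights=WEIGHTS):
    """p(Y | X, lambda), with Z(X) computed by brute-force enumeration."""
    z = sum(math.exp(score(y, tokens, weights)) for y in product(LABELS, repeat=len(tokens)))
    return math.exp(score(labels, tokens, weights)) / z


def best_labelling(tokens, weights=WEIGHTS):
    """Most probable label sequence; Z(X) is constant, so the score suffices."""
    return max(product(LABELS, repeat=len(tokens)), key=lambda y: score(y, tokens, weights))
```

Enumeration costs time exponential in sentence length, so this is only usable on toy inputs such as `["left", "the", "Marriott", "Hotel", "that"]`.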
  15. Features
      • Lexical
        ›  Word identity, shape, position, etc.
      • Grammatical
        ›  Part of Speech, Apache OpenNLP
      • Statistical
        ›  Normalised Point-wise Mutual Information of mobile search query logs
      • Geographic
        ›  Gazetteer attributes from Yahoo! Placemaker
        ›  http://developer.yahoo.com/geo/placemaker/
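Two of these feature families are easy to illustrate. The sketch below shows one common way to compute a word shape and the standard normalised point-wise mutual information, NPMI = PMI / (-log p(x, y)). The exact shape encoding and the way counts were gathered from the query logs are not given in the slides, so the function names, the encoding and the count-based inputs are assumptions.

```python
import math
import re


def word_shape(token):
    """Map uppercase to 'X', lowercase to 'x' and digits to 'd' (assumed encoding)."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"\d", "d", shape)


def token_features(tokens, i):
    """Lexical features for the token at position i."""
    token = tokens[i]
    return {
        "word": token.lower(),
        "shape": word_shape(token),
        "position": i,
        "is_first": i == 0,
    }


def npmi(count_xy, count_x, count_y, total):
    """Normalised PMI in [-1, 1] from co-occurrence and marginal counts."""
    if count_xy == 0:
        return -1.0  # the two terms never co-occur
    p_xy = count_xy / total
    if p_xy == 1.0:
        return 0.0  # degenerate 0/0 case, defined as 0 here by assumption
    pmi = math.log(p_xy / ((count_x / total) * (count_y / total)))
    return pmi / -math.log(p_xy)
```

Values near 1 indicate terms that almost always appear together, 0 indicates independence, and -1 indicates terms that never co-occur.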
  16. Process Overview
      [Diagram: titles extracted from geocoded Wikipedia articles, and POI mentions extracted from Foursquare and Gowalla check-ins, are sent to a search engine (Bing). The raw web snippets bootstrapped from each source go through snippet processing and CRF model training, producing a Wikipedia-based, a Foursquare-based and a Gowalla-based POI tagger. Example snippet: "… was only after he had left the Marriott Hotel that he remembered…"]
  17. Results

      Training Data   Testing Data   Precision   Recall
      Y! Placemaker   Manual Data    0.237       0.228
      Wikipedia       Manual Data    0.514       0.337
      Foursquare      Manual Data    0.276       0.655
      Gowalla         Manual Data    0.360       0.414
      Wikipedia       10-fold CV     0.879       0.955
      Foursquare      10-fold CV     0.689       0.468
      Gowalla         10-fold CV     0.857       0.868
  18. Language Modelling
      • Partition the world into 1km cells
      • For each, create model from Flickr photos taken in that area
        $$P(t \mid \theta_L) = \frac{c_{user}(t, L)}{|L|}, \qquad |L| = \sum_{t_i \in L} c_{user}(t_i, L)$$
      • Treat problem as IR, match a POI (query) against the cells (document)
        ›  Return centroid of best matching cell
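A rough sketch of this localisation step follows, under several assumptions that the slides do not confirm: c_user(t, L) is read as the number of distinct users who applied tag t to photos in cell L, photos arrive already assigned to cells, unseen terms get a small constant floor `epsilon` instead of proper smoothing, and all function and parameter names are hypothetical.

```python
import math
from collections import defaultdict


def build_cell_models(photos):
    """photos: iterable of (cell_id, user_id, tags). Returns {cell: {tag: P(t | theta_L)}}."""
    users = defaultdict(lambda: defaultdict(set))
    for cell, user, tags in photos:
        for tag in tags:
            users[cell][tag].add(user)  # each user counts once per tag and cell
    models = {}
    for cell, tag_users in users.items():
        counts = {tag: len(u) for tag, u in tag_users.items()}
        total = sum(counts.values())  # plays the role of |L|
        models[cell] = {tag: c / total for tag, c in counts.items()}
    return models


def locate(query_terms, models, centroids, epsilon=1e-6):
    """Return the centroid of the cell whose model best explains the query terms."""
    def log_likelihood(model):
        # epsilon is an assumed floor for unseen terms, not a normalised smoothing.
        return sum(math.log(model.get(t, 0.0) + epsilon) for t in query_terms)

    best = max(models, key=lambda cell: log_likelihood(models[cell]))
    return centroids[best]
```

Here `centroids` maps each cell identifier to a (latitude, longitude) pair, and the query terms would be the words of the extracted POI name.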
  19. Performance

                              Placemaker   Cascade   Geo Scope   # Examples
      Placemaker POIs         0.29         0.29      0.29        134
      Placemaker Other Locs   4.19         2.90      2.12        131
      All Known Locs          1.17         0.82      0.79        265
      New Locations           -            439.0     5.88        130
      All Data                -            1.20      0.96        395
  20. Conclusions and Implications
      • POIs are valuable, but useful ones difficult to define
      • Generating evaluation data is hard
      • Can use web snippets bootstrapped with check-ins, and articles on Wikipedia to train POI tagger
        ›  Up to 88% precision on unlabelled data
        ›  Reflect the POIs users visit
        ›  Easily updated
        ›  Can be located accurately using hybrid gazetteer + Flickr language model technique
  21. Benefits of this approach
      • Discover POIs:
        ›  that we already know about (replace/extend existing sources)
        ›  that we didn’t already know about (novel POIs)
        ›  of more diverse types (increasing coverage)
        ›  that are fresher
      • Increase relevance of local and hyperlocal search using wisdom of the crowds
  22. Research Areas
      • Automatic POI detection in UGC
      • Learning how users refer to places
      • Localising media
      • Generating evaluation data
        ›  (This is hard)
      • Multi-source combination
      • Quality & Credibility
  23. Thank you
      Adam Rae (adamrae@yahoo-inc.com), Vanessa Murdock, Adrian Popescu, Hugues Bouchard
