WWW2013: Web Usage Mining with Semantic Analysis


Laura Hollink, Peter Mika and Roi Blanco. Web Usage Mining with Semantic Analysis. In proceedings of the International World Wide Web Conference, Rio de Janeiro, Brazil, May 2013.


  1. Web Usage Mining with Semantic Analysis. Laura Hollink (VU University Amsterdam), Peter Mika (Yahoo! Labs Barcelona), Roi Blanco (Yahoo! Labs Barcelona)
  2. Analysis of web user behavior: What are typical use cases? Are these carried out in a particular order? Which use cases are not satisfied? And to which other sites do users go?
  3. Analysis of web user behavior: What are typical use cases? Are these carried out in a particular order? Which use cases are not satisfied? And to which other sites do users go? Transaction logs: sessions of queries and clicks. [Figure: sample sessions from the log, e.g. "oakland as bradd pitt movie", "moneyball", movies.yahoo.com; "moneyball trailer", movies.yahoo.com; "money ball movie", www.imdb.com; "map of africa", www.africaguide.com]
  4. Analysis of web user behavior: Transaction logs: sessions of queries and clicks. [Figure: sample sessions from the log, e.g. "oakland as bradd pitt movie", "moneyball", movies.yahoo.com; "moneyball trailer", movies.yahoo.com; "money ball movie", www.imdb.com] Are these use cases typical for all movies? Recent movies? Only for Moneyball?
  5. Why are these questions difficult to answer? Sparsity of the event space: ‣ 64% of queries are unique within a year ‣ even the most frequent patterns have extremely low support. [Figure: the 12 most frequent sessions observed in our data]
  6. Tasks
     Question 1: what are typical use cases?
     ‣ Task 1: find sequences of events in the data that are more frequent (have a higher support) than a threshold.
     Question 2: what use cases are not satisfied?
     ‣ Task 2: learn to predict website abandonment from queries and clicks.
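
Task 1 can be illustrated with a minimal sketch. It assumes that a session is a list of events (query strings and clicked hosts), that "support" means the fraction of sessions containing a sequence, and that only contiguous sequences are counted. None of these choices is confirmed by the slides, and the function name `frequent_sequences` is hypothetical.

```python
from collections import Counter

def frequent_sequences(sessions, length, min_support):
    """Return contiguous event sequences of the given length whose support
    (assumed here: fraction of sessions containing them) is >= min_support."""
    if not sessions:
        return {}
    counts = Counter()
    for events in sessions:
        # Count each distinct sequence at most once per session.
        seen = {tuple(events[i:i + length])
                for i in range(len(events) - length + 1)}
        counts.update(seen)
    total = len(sessions)
    return {seq: n / total for seq, n in counts.items() if n / total >= min_support}

# Toy sessions loosely based on the sample log; not real data.
sessions = [
    ["moneyball", "movies.yahoo.com"],
    ["moneyball trailer", "movies.yahoo.com"],
    ["moneyball", "movies.yahoo.com", "www.imdb.com"],
    ["map of africa", "www.africaguide.com"],
]
frequent = frequent_sequences(sessions, length=2, min_support=0.5)
```

On raw query strings nearly every sequence is unique, which is the sparsity problem from slide 5; replacing each query with a more general label before counting is what makes a support threshold meaningful.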
  7. Approach: Connect queries to entities in the linked open data cloud and use properties of these entities to generalize and categorize queries. Applied to the movie domain. [Figure: example session "oakland as bradd pitt movie", "moneyball", movies.yahoo.com; "oakland as", wikipedia.org]
  8. Data processing and linking steps
     1. link queries to entities
     2. select types of entities (classes)
     3. detect modifier words (download, trailer, cast, date, etc.)
     4. identify navigational queries
     5. identify ‘losing’ queries
     [Figure: example session "oakland as bradd pitt movie", "moneyball", movies.yahoo.com; "oakland as", wikipedia.org]
  9. Step 1: Linking queries to entities in the LOD cloud • We link one entity to each query. • The intent of about 40% of unique Web queries is to find a particular entity [Pound, WWW2008]. • We link to Freebase (which has a lot of movie-related information) and DBpedia (because Wikipedia is widely used).
  10. Step 2: Select one type per entity
     • We use the Freebase API to get the semantic “types” of each query URI.
     • The Freebase ‘Notable types API’ is not official and not documented.
     • For repeatability and transparency, we have created our own heuristics to select one type for each entity:
       1. no internal or administrative types
       2. prefer established domains (‘Commons’) over user-defined schemas (‘Bases’)
       3. aggregate specific types into more general types
          a) subtypes of location -> location
          b) subtypes of award winners and nominees -> award_winner_nominee
          c) prefer movie-related types over other types: film, actor, artist, tv_program, tv_actor and location (in order of decreasing preference).
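
The three heuristics above can be sketched as a small selection function. This is an illustration under stated assumptions, not the authors' implementation: the prefixes treated as internal (`/common/`, `/type/`, `/freebase/`) and as user-defined (`/base/`, `/user/`), the full type identifiers in `PREFERRED`, and the names `generalize` and `select_type` are all hypothetical.

```python
# Assumption: which prefixes count as internal/administrative is not given in the slides.
INTERNAL_PREFIXES = ("/common/", "/type/", "/freebase/")
# Assumption: user-defined schemas ('Bases') are recognised by these prefixes.
USER_PREFIXES = ("/base/", "/user/")
# Preference order taken from the slide; the full identifiers are assumptions.
PREFERRED = ["/film/film", "/film/actor", "/music/artist",
             "/tv/tv_program", "/tv/tv_actor", "/location"]

def generalize(type_id):
    """Heuristic 3a/3b: collapse specific types into more general ones."""
    if type_id.startswith("/location/"):
        return "/location"
    if type_id in ("/award/award_winner", "/award/award_nominee"):
        return "/award/award_winner_nominee"  # hypothetical merged label
    return type_id

def select_type(type_ids):
    """Pick one type for an entity from its list of type identifiers."""
    # Heuristic 1: drop internal or administrative types.
    candidates = [t for t in type_ids if not t.startswith(INTERNAL_PREFIXES)]
    # Heuristic 2: prefer established domains; fall back to user-defined schemas.
    commons = [t for t in candidates if not t.startswith(USER_PREFIXES)]
    pool = [generalize(t) for t in (commons or candidates)]
    # Heuristic 3c: prefer movie-related types, in decreasing order of preference.
    for preferred in PREFERRED:
        if preferred in pool:
            return preferred
    return pool[0] if pool else None

chosen = select_type(["/common/topic", "/base/moneyball/topic",
                      "/people/person", "/film/actor"])
```

A fixed, published rule list like this gives the same answer on every run, which is the repeatability argument made on the slide.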
  11. Step 3: Detect modifier words in queries. Modifiers are the top 100 most frequent words that appear in the query log before or after entity names [Mika ISWC2009, Pantel WWW2012]: movie, movies, theater, cast, quotes, free, theaters, watch, 2011, new, tv, show, dvd, online, sex, video, cinema, trailer, list, theatre, ...
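
A minimal sketch of how such a modifier list could be counted, assuming whitespace tokenisation, case-insensitive matching and a known list of entity names. The helper names `find_span` and `modifier_counts` and the toy queries are hypothetical; the slides do not describe the actual procedure in this detail.

```python
from collections import Counter

def find_span(tokens, name_tokens):
    """Return (start, end) of the first occurrence of name_tokens in tokens, or None."""
    n = len(name_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == name_tokens:
            return i, i + n
    return None

def modifier_counts(queries, entity_names):
    """Count words that appear before or after an entity name in a query."""
    # Try longer names first so multi-word names are matched before shorter ones.
    names = sorted((name.lower().split() for name in entity_names),
                   key=len, reverse=True)
    counts = Counter()
    for query in queries:
        tokens = query.lower().split()
        for name_tokens in names:
            span = find_span(tokens, name_tokens)
            if span is not None:
                start, end = span
                counts.update(tokens[:start] + tokens[end:])
                break  # use at most one entity per query
    return counts

queries = ["moneyball trailer", "moneyball cast", "watch moneyball online",
           "captain america trailer", "brad pitt news", "map of africa"]
counts = modifier_counts(queries, ["Moneyball", "Captain America", "Brad Pitt"])
top_modifiers = [word for word, _ in counts.most_common(100)]
```

Queries without a known entity name ("map of africa" here) contribute nothing, so the counts reflect only words used around entities.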
  12. Step 4: Identifying navigational queries
     • A navigational query is a query entered with the intention of navigating to a particular website.
     • A common heuristic is to consider a query navigational when it matches the domain name of a clicked result.
     • The “official homepage” is the value of dbpedia:homepage, dbpedia:url, or foaf:homepage.
     Examples (query, clicked site): netflix login, www.netflix.com; banana, www.bananas.org; European Parliament, europarl.europa.eu
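
The common domain-name heuristic can be sketched as follows. Exact token matching against the host labels is an assumption made for this illustration, and `host_labels` and `is_navigational` are hypothetical names; real systems may use fuzzier matching.

```python
from urllib.parse import urlparse

def host_labels(url):
    """Return the host name labels of a URL, without 'www' and the top-level domain."""
    # urlparse only fills in the host when a scheme or a leading '//' is present.
    host = urlparse(url if "://" in url else "//" + url).hostname or ""
    labels = host.lower().split(".")
    return {label for label in labels[:-1] if label and label != "www"}

def is_navigational(query, clicked_url):
    """Heuristic: some query word, or the whole query joined together,
    equals a label of the clicked host name."""
    tokens = query.lower().split()
    labels = host_labels(clicked_url)
    return bool(set(tokens) & labels) or "".join(tokens) in labels

print(is_navigational("netflix login", "www.netflix.com"))           # True
print(is_navigational("banana", "www.bananas.org"))                  # False
print(is_navigational("European Parliament", "europarl.europa.eu"))  # False
```

The third case shows the limit of string matching: the query names an entity whose official site has a different host name, which a lookup of the entity's homepage property can catch and a string comparison cannot.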
  13. Step 5: Identify ‘losing’ queries • A ‘losing’ query is the query that leads a user to abandon a service in favor of another service. • Common definition: A user repeats the same query and clicks on another result in the list. • Our broader, semantic definition:
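
The common definition above can be sketched in a few lines. The sketch assumes a session is an ordered list of (query, clicked site) pairs and that "repeats" means two consecutive events with an identical query string; the function name `losing_events` is hypothetical, and the broader semantic definition is not covered.

```python
def losing_events(session):
    """Return (query, lost_site, gained_site) for each repeated query
    whose second click goes to a different site."""
    events = []
    for (q1, site1), (q2, site2) in zip(session, session[1:]):
        if q1 == q2 and site1 != site2:
            events.append((q1, site1, site2))
    return events

# Toy session loosely based on the sample log; not real data.
session = [
    ("moneyball trailer", "movies.yahoo.com"),
    ("moneyball trailer", "www.imdb.com"),
    ("brad pitt news", "www.zimbio.com"),
]
lost = losing_events(session)
# -> [("moneyball trailer", "movies.yahoo.com", "www.imdb.com")]
```

Requiring identical query strings misses reformulations such as "moneyball trailer" followed by "money ball movie trailer", which is one reason a broader definition is useful.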
  14. Evaluation
     1. Linking to entities and types
     2. Detection of frequent usage patterns
     3. Prediction of website abandonment
     Applied to the movie domain:
     • A sample of server logs of Yahoo! Search in the US from June 2011, split into sessions.
     • Only sessions that contain at least one visit to any of 16 popular movie sites.
     • 1.7 million sessions, containing over 5.8 million queries and over 6.8 million clicks.
  15. Evaluation of links to entities and types • Compare manually created <query, entity> and <entity, type> pairs to automatically created links. • 2 samples: the 50 most frequent queries and 50 random queries. Examples: • Ambiguous query: “Green Lantern”: the movie or the fictional character? • Wrong type: is “Oil peak” a serious game subject?
  16. Evaluation of links to entities and types [Figure: three panels (Queries, Entities, Types); axis: frequency of occurrence]
  17. Frequent usage patterns I • Freebase:release_date property of entities. [Figure: two panels, “Recent movies” and “Older movies”]
  18. Frequent usage patterns II • Sequences of consecutive query types.
  19. Frequent usage patterns III • A comparison of websites. • Most frequent query types that lead to a click on a website. [Figure: one bar chart per website; axis: proportion of queries that lead to a click on the website, 0.0 to 0.6. Types shown, in chart order: Website 1: /film, /film/actor, /tv_program, /people/person, /book/book, ional_universe/fictional_character, /music/artist, /tv/tv_actor, /location, /film/film_series. Website 2: /film, /location, /book/book, /film/actor, /business/employer, /fictional_universe/work_of_fiction, ional_universe/fictional_character, /tv_program, /architecture/building_function, /film/film_series. Website 3: /location, /business/employer, /film, /film/actor, /organization/organization, /architecture/building_function, /people/person, /tv_program, /tv/tv_network, /internet/website_category]
  20. Predicting website abandonment
     • 3 classification tasks: given a (part of a) session in which a user is lost/gained, predict...
       1. ...whether a user will be gained for a given website.
       2. ...given that the session includes a given website, whether this website is in the losing or gaining position.
       3. ...given that the session includes two given websites, which one is in the gaining position.
     • Classifier: Gradient Boosted Decision Trees.
  21. Discussion and future work
     • Mining patterns of entire queries suffers from data sparsity.
     • We interpret the structure and semantics of the queries, using openly available, up-to-date information on the Web. This lets us:
       • give a “semantic” definition of navigational and ‘losing’ queries
       • find patterns of user behavior
       • predict website abandonment
     • This is the beginning:
       • Use more properties of entities, more features.
       • Detect more complex patterns.
       • Explore other linked open datasets.
  22. Thank you! Questions?
