Gaining New Insights into Usage Log Data

Presented at SAS Business Analytics 2011 event in Singapore.


  1. Gaining New Insights into Usage Log Data via Explorative Visualisation
     Markus Kirchberg, Ryan K L Ko, and Bu Sung Lee, Hewlett-Packard Labs (HP Labs) Singapore
     Contact: Markus.Kirchberg@hp.com
     Business Analytics 2011 – A SAS Forum Event – May 25th, 2011
     M. Kirchberg et al. @ (SAS) Business Analytics 2011, Gaining New Insights into Usage Log Data, Slide 1 / 27
  2. Outline
     1 Introduction: Usage Log Analysis; Explorative Visualisation
     2 Web Usage Log Case Study: Basics; Relevance
     3 Conclusion
  3. Introduction
  4. Background and Motivation
     • Cloud computing, MPP/map-reduce, the data explosion, semantic technologies, ... have increased interest in data analytics.
     • Logged data is generated by almost all systems/services in use.
     • Capabilities to extract value from logs ≅ key distinguishing factor.
     • Current approaches (e.g., link & usage log analysis) need revision:
       Typically, time is considered as an orthogonal factor.
       This limits the potential impact of the measured importance.
       Real-world events, topics or keywords are not consistently interpreted over time.
     • Focus: Extract meaningful information (e.g., usage patterns or relevance indicators) and relate it to users / real-world events.
  5. Sample Events (figure slides; no text recovered)
  9. Usage Log Analysis – Basics
     Usage Log Types (It’s more than just Web server logs!):
     • Network / Firewall Logs (bandwidth per msg type, inbound vs outbound, Intranet vs Internet, ...)
     • Medical Device Usage Logs (proper usage, treatment improvement, ...)
     • Vehicle Usage Logs (ERP, road monitoring, accident prevention / investigation, ...)
     • Database Usage Logs (auditing, consistency, recovery, performance optimisation, ...)
     • Web, ftp, mail, ... server usage logs (usage statistics, relevancy, advertising, ...)
     • Call Center Usage Logs, Social Networking Usage Logs, ...
     Purposes: Data enrichment, identification of redundant data, data cleaning, detection of hidden patterns, statistical verification, usage context / relevancy, marketing / advertisement placement, ...
  10. Usage Log Analysis – Basics
      Raw HTTP usage log sample:
        140.203.154.206 - - [14/Dec/2010:13:16:51 +0000] "GET /sparql?query=DESCRIBE+%3Chttp%3A%2F%2Fdata.semanticweb.org%2Fconference%2Feswc%2F2006%2Fpaper%2Fpazienza-stellato%3E HTTP/1.0" 200 7112 "-" "-"
        66.249.72.196 - - [14/Dec/2010:13:17:11 +0000] "GET /person/venkatram-yadav-jaltar HTTP/1.1" 303 10133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
      Anonymised HTTP usage log sample:
        0.0.0.0 - - [14/Dec/2010:13:16:51 +0000] "GET /sparql?query=DESCRIBE+%3Chttp%3A%2F%2Fdata.semanticweb.org%2Fconference%2Feswc%2F2006%2Fpaper%2Fpazienza-stellato%3E HTTP/1.0" 200 7112 "-" "-" "IE" "d9de2b0c659e7bc7b199e0f0953cd15e1ef8fc0c"
        0.0.0.0 - - [14/Dec/2010:13:17:11 +0000] "GET /person/venkatram-yadav-jaltar HTTP/1.1" 303 10133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "US" "869b12b0ac5630349570f69ad6062b7793fb73a8"
      Usage log visualisation samples: (figures; not recovered)
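
The log samples above follow the common "combined" access log layout, so a short parsing sketch may help. The regular expression, the function names, and the anonymisation scheme (a salted SHA-1 of the IP address plus a caller-supplied country code) are assumptions made for illustration; the slides show only that the anonymised lines carry `0.0.0.0`, a country code, and a 40-character hexadecimal string.

```python
import hashlib
import re
from datetime import datetime

# host ident user [time] "request" status bytes "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    parts = entry["request"].split(" ")
    # A well-formed request is "METHOD PATH PROTOCOL"; keep the first two parts.
    entry["method"], entry["path"] = (parts[0], parts[1]) if len(parts) >= 2 else (None, None)
    return entry


def anonymise(entry, country_code, salt="example-salt"):
    """Hypothetical scheme: hide the IP, keep a country code and a salted hash."""
    digest = hashlib.sha1((salt + entry["host"]).encode("utf-8")).hexdigest()
    return {**entry, "host": "0.0.0.0", "country": country_code, "visitor_id": digest}


sample = ('66.249.72.196 - - [14/Dec/2010:13:17:11 +0000] '
          '"GET /person/venkatram-yadav-jaltar HTTP/1.1" 303 10133 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
entry = parse_line(sample)
anon = anonymise(entry, country_code="US")
```

Keeping a stable per-visitor hash lets later steps group hits by visitor without exposing the address itself.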
  11. Explorative Visualisation
      ‘Data science is the future and there cannot be data science without data visualization and vice versa.’ (David McCandless @ TED, July 2010)
      • Explorative visualisation ≅ graphics that give important clues and observations of patterns and consistent trends.
      • Useful to prove the existence or understanding of a certain phenomenon.
      • Assists with modelling findings as mathematics, algorithms or other formalisms that can reproduce such trends.
      • Proven to be of great value in analysing and exploring big data.
  12. Web Usage Log Case Study: Basics
      M. Kirchberg, R. K L Ko, B. S. Lee. From Linked Data to Relevant Data – Time is the Essence. In Proceedings of the 1st International Workshop on Usage Analysis and the Web of Data (USEWOD) held in conjunction with the 20th International World Wide Web Conference (WWW), 2011. (Best Paper Award)
  13. How Do We Obtain MEANINGFUL Web Usage Data?
      • Usage Log Analysis
        Non-invasive; implicitly collected; potential source of privacy concerns!
        Challenges: up to 90% of data is rubbish; lack of relevancy notion.
      • Social Tagging / Annotations
        Requires explicit user inputs; limited to social networking sites.
        Proven useful to define better folksonomies; but lack of use cases.
      • Explicit User Feedback (Like/Unlike, Rate Up/Down) in the GUI
        Requires new GUIs and explicit user inputs.
        Proven useful for location-dependent search; long-tail queries.
  14. Case Study: (Linked) Data Sets & their Usage Logs
      • Semantic Web Dog Food (SWDF): Web/Semantic Web publications, people and organisations. Usage logs cover 2 years from Nov 01, 2008 to Dec 14, 2010[1].

        Log Size   # Resources   # Accessed Resources   Days   Hits   # Successful Hits
        2GB        > 100,000     40,322                 720    8.1m   7.1m

      • DBpedia: twin of Wikipedia; focal point of the Web of data. Usage logs covering Jul 01, 2009 & Feb 01, 2010[1] (avg of 1m hits/day; 6m accessed resources).
      • SWDF serves a specific purpose; DBpedia is general-purpose.
  15. Case Study Evaluation Framework: Log-to-Database
      1 Evaluate log entries and remove hits with 4xx/5xx HTTP status codes.
        SWDF: very clean; entries conform to the CLF format.
        DBpedia: > 1,000 non-UTF8 / non-CLF-conformant entries.
      2 Map log entry fields to a specifically designed PostgreSQL DB.
      3 Post-process DB entries:
        URIs and matching HTML/RDF representations;
        Bots, spiders, crawlers, ... (user agent field, access to robots.txt, high frequency accesses); and
        Access types – Plain/HTML vs. Semantic vs. Search vs. SPARQL.
      4 Basic analysis of usage log data.
      5 Relevance-driven usage log analysis.
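
Steps 1 and 3 of the framework above can be sketched in a few lines. This is only an illustration on in-memory dictionaries, not the PostgreSQL-based pipeline the slide describes. The marker substrings, the per-host hit threshold (a crude stand-in for "high frequency accesses"), and the path prefixes used to label access types are all assumptions.

```python
from collections import Counter

BOT_MARKERS = ("bot", "spider", "crawler")  # assumed substrings, not the authors' list


def drop_error_hits(entries):
    """Step 1: keep only hits whose HTTP status is not 4xx or 5xx."""
    return [e for e in entries if e["status"] < 400]


def find_bot_hosts(entries, max_hits_per_host=1000):
    """Step 3: flag hosts by user agent, robots.txt access, or a high hit count."""
    hits_per_host = Counter(e["host"] for e in entries)
    bots = set()
    for e in entries:
        agent = e["agent"].lower()
        if any(marker in agent for marker in BOT_MARKERS):
            bots.add(e["host"])
        elif e["path"] == "/robots.txt":
            bots.add(e["host"])
        elif hits_per_host[e["host"]] > max_hits_per_host:
            bots.add(e["host"])
    return bots


def access_type(path):
    """Step 3: rough access-type label; the path conventions are assumptions."""
    if path.startswith("/sparql"):
        return "SPARQL"
    if path.startswith("/search"):
        return "Search"
    if path.endswith(".rdf"):
        return "Semantic"
    return "Plain/HTML"


# Made-up entries for illustration only.
entries = [
    {"host": "1.1.1.1", "status": 200, "path": "/papers", "agent": "Mozilla/5.0"},
    {"host": "2.2.2.2", "status": 404, "path": "/missing", "agent": "Mozilla/5.0"},
    {"host": "3.3.3.3", "status": 303, "path": "/person/x", "agent": "Googlebot/2.1"},
    {"host": "4.4.4.4", "status": 200, "path": "/robots.txt", "agent": "-"},
]
clean = drop_error_hits(entries)
bots = find_bot_hosts(clean)
human_hits = [e for e in clean if e["host"] not in bots]
```

Flagging whole hosts rather than single hits means that every request from a host that once fetched `robots.txt` is excluded, which is a deliberately conservative choice in this sketch.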
  16. Case Study: Basic Statistics & Findings
      • Top hits excluding bots & spiders are 10% of those overall.
      • Adequate filtering is vital to obtain better insights. However, on its own it is not enough to arrive at a useful notion of relevance.
      • Möller et al.[2] on a possible metric to determine relevance: ‘[...] In the case of the Dog Food dataset, the hypothesis is that requests for data from specific conferences would be noticeably higher around the time when the event took place. [...] Contrary to our expectations, there are no significantly higher access rates around the time of the event. [...]’.
  17. Case Study: Basic Statistics & Findings (figure slide; no text recovered)
  18. Web Usage Log Case Study: Relevance
      Web-site: http://usewod2011.thekirchbergs.info/
  19. Relevance – Basics
      • SWDF/DBpedia data sets provide clues pointing to concepts of relevance of Web resources with time and events in reality.
      • Consider two spaces in which semantic data are communicated:
        Real Space: where real-world events take place at unique time windows. An event with the same semantics (e.g., National Day) can take place frequently with the same objectives and content, BUT in different time windows, so temporal and situational context/meaning must be understood.
        Web Space: descriptions of Real Space events in the form of linked data. Without a time window, it is more difficult to give ‘meaning’ to a set of keywords/topics/Web data describing a Real Space event.
      • Study representations of events in Real Space recorded as linked data in Web Space.
      • Time windows + exploratory graphics ⇒ meaningful change.
      • Relevance ≅ time window, traffic & linked resources.
  20. Case Study: Key Contributions
      • Present evidence that Web usage logs can lead to a relevance notion.
      • Essential: Consider not only the interlinking of weighted resources, but also:
        Whether users make use of links (use versus mere existence),
        How users utilise links (browsing depth, browsing patterns, ...), and
        How the usage changes over time.
      • Conclude that time is indeed a key factor to be considered.
      • Propose a new approach by combining link and usage analysis for events based on time-windowed views over usage logs.
      • Event ≅ A situation that creates a need in a user to search or browse for related information which, in turn, triggers a visit to a Web resource that is associated with topics and keywords via the Web 3.0.
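
A time-windowed view over a usage log, as proposed above, can be approximated by bucketing hits into fixed-length windows. The seven-day window, the function name, and the sample timestamps below are assumptions chosen for illustration; the slides do not state the window length used in the study.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def hits_per_window(hits, start, window=timedelta(days=7)):
    """Map window index -> Counter of hits per resource inside that window.

    `hits` is an iterable of (timestamp, resource) pairs; hits before `start`
    are ignored. Window 0 covers [start, start + window), and so on.
    """
    views = defaultdict(Counter)
    for timestamp, resource in hits:
        if timestamp < start:
            continue
        index = (timestamp - start) // window  # timedelta // timedelta gives an int
        views[index][resource] += 1
    return views


# Made-up hits for illustration only.
start = datetime(2010, 1, 1)
hits = [
    (datetime(2010, 1, 1, 10), "/conference/www/2009"),
    (datetime(2010, 1, 4), "/conference/www/2009"),
    (datetime(2010, 1, 9), "/papers"),
]
views = hits_per_window(hits, start)
```

Comparing the counters of two neighbouring windows is then enough to see how the usage of a resource changes over time.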
  21. Case Study: Measuring Relevance
      • Web Travel Footprint (WTF) of an IP Address ≅ Road network on a map with the footprint being the user’s trail.
      • Characteristics from linking ‘referrer’ to ‘resource requested’:
        1 Fan – Linkages between a data resource and other data resources. Spread of influence of a resource; eliminates unused resources.
        2 Depth – How ‘deep’ a user surfs into the Web-site. Measure of ‘curiosity’ w.r.t. a certain set of resources.
      • Characteristics from counting a link’s hits within a time window:
        1 Weight – Number of times a path was accessed.
      • Relevancy is based on all three characteristics – not in isolation.
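
The three characteristics above can be derived from a visitor's chronologically ordered (referrer, resource) pairs. The sketch below is one possible reading, not the authors' implementation: fan is taken as the number of distinct resources a resource is connected to through links that were actually used, depth as the longest chain of first visits starting from an entry point, and weight as the hit count per link. The function name and the sample trail are hypothetical.

```python
from collections import Counter, defaultdict


def footprint(trail):
    """Return (fan, depth, weight) for one visitor's trail in one time window.

    `trail` is a chronologically ordered list of (referrer, resource) pairs,
    where a referrer of "-" marks an entry point without a referrer.
    """
    weight = Counter()             # (referrer, resource) -> number of hits
    neighbours = defaultdict(set)  # resource -> resources linked via used links
    depth = {}                     # resource -> depth at its first visit
    for referrer, resource in trail:
        # Unknown referrers and "-" count as depth 0, so entry points get depth 1.
        depth.setdefault(resource, depth.get(referrer, 0) + 1)
        if referrer == "-":
            continue
        weight[(referrer, resource)] += 1
        neighbours[referrer].add(resource)
        neighbours[resource].add(referrer)
    fan = {resource: len(linked) for resource, linked in neighbours.items()}
    return fan, max(depth.values(), default=0), weight


# Made-up trail for illustration only.
trail = [
    ("-", "/index.html"),
    ("/index.html", "/papers"),
    ("/papers", "/conference/www/2009"),
    ("/index.html", "/papers"),
]
fan, depth, weight = footprint(trail)
```

Because fan is built only from links that appear in the trail, resources that are linked but never visited drop out, which matches the slide's remark that fan eliminates unused resources.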
  22. Case Study: Measuring Relevance. Web Travel Footprint (WTF) of an IP Address (figure slide; no text recovered)
  23. Case Study: Kandinsky Graphs (KGs)
      • Kandinsky Graph ≅ Sum of all WTFs of visitors’ access paths & linkage of the resources within the site at a particular time window.
      • Exploratory graph sums of (1) how deep users have travelled into/within a site; (2) how resources are linked to each other; and (3) which resources are highly relevant – at a given time window.
      • Technically: GraphViz dot files as circo-layouts.
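
Since the slide above names GraphViz dot files with circo layouts as the technical basis, the following sketch shows how summed link weights could be written out as such a file. It builds the DOT text by hand and uses only the standard `layout`, `penwidth` and `label` attributes; the function name, the mapping of weight to line width, and the sample edges are assumptions.

```python
def to_dot(edge_weights, name="kandinsky"):
    """Render {(source, target): weight} as DOT text that requests the circo layout."""

    def quote(text):
        # Escape double quotes so resource names stay valid DOT string literals.
        return '"' + text.replace('"', '\\"') + '"'

    lines = [f"digraph {name} {{", "  layout=circo;"]
    for (source, target), weight in sorted(edge_weights.items()):
        lines.append(
            f'  {quote(source)} -> {quote(target)} '
            f'[penwidth={weight}, label="{weight}"];'
        )
    lines.append("}")
    return "\n".join(lines)


# Made-up edge weights for illustration only.
dot_text = to_dot({
    ("/index.html", "/papers"): 2,
    ("/papers", "/conference/www/2009"): 1,
})
```

Saved to a file such as `kg.dot`, the text can be rendered with the Graphviz command `circo -Tpng kg.dot -o kg.png`.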
  24. Case Study: Kandinsky Graphs (KGs) (figure slide; no text recovered)
  25. Case Study: Kandinsky Graphs for WWW 2010

      Recurring Top Relevant Resources in the SWDF Web-site   Paper Due   Before Conf   During Conf   After Conf
      http://data.semanticweb.org/conference/www/2009         2           2             1             3
      http://data.semanticweb.org/conference/iswc/2009        1           1             2             2
      http://data.semanticweb.org/papers                      3           3             3             4
      http://data.semanticweb.org/index.html                  -           -             -             1
  26. Case Study: DIFF-Kandinsky Graphs for WWW 2010
      • KGs capture relevance for each time window.
      • DIFF-KGs capture changes between time windows: Relevance(TimeWindow2) − Relevance(TimeWindow1), whereby weights are calculated using division.
      • Emphasis on new hits; remove/penalise edges with similar hits.
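
The slide states only that DIFF-KG weights are calculated using division, so the sketch below is an assumed reading rather than the authors' formula: each edge's weight in the later window is divided by its weight in the earlier one, edges whose ratio stays close to 1 are removed, and edges that appear only in the later window keep their full weight. The function name, the tolerance value, and the sample weights are hypothetical.

```python
def diff_edges(weights_before, weights_after, tolerance=0.25):
    """Return {edge: score} for edges whose hits changed noticeably between windows.

    Edges present only in the later window are emphasised with their full weight;
    edges whose after/before ratio is within `tolerance` of 1 are dropped.
    Edges that vanished in the later window are ignored in this sketch.
    """
    diff = {}
    for edge, after in weights_after.items():
        before = weights_before.get(edge, 0)
        if before == 0:
            diff[edge] = float(after)  # new edge: nothing to divide by
            continue
        ratio = after / before
        if abs(ratio - 1.0) > tolerance:
            diff[edge] = ratio
    return diff


# Made-up weights for illustration only.
before = {("/index.html", "/papers"): 10, ("/papers", "/conference/www/2009"): 4}
after = {
    ("/index.html", "/papers"): 11,
    ("/papers", "/conference/www/2009"): 8,
    ("/papers", "/conference/iswc/2009"): 5,
}
changes = diff_edges(before, after)
```

In this example the edge from `/index.html` to `/papers` is dropped because its hits barely changed, while the doubled edge and the new edge remain.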
  27. Conclusion
  28. Real Space; Web/Cyber Space (figure slide)
  29. Web/Cyber Space; Real Space (figure slide)
  30. Real Space; Web/Cyber Space; Real Space (figure slide)
  31. Real Space; Web/Cyber Space; Real Space
      • Did you notice something? No annotations!
      • Results/observations of relevance in active and purposeful Web-sites could only be achieved because of the fundamental linkage of time windows to the study of semantics in linked data.
      • Small but crucial step towards identification of data relevant to real-life events from previously deemed contextless data.
  34. Summary & Future Work
      • Argue: The sum of WTFs & the linkage of a site’s resources (time-windowed) give insights into what constitutes relevance. Important properties include: fan, depth of traversals & weight.
      • Lessons:
        Clean your data thoroughly!
        Visualisation helps to gain new perspectives.
        Visualisation is great for semi- & unstructured big data.
      • Future Work:
        Extend the notion of relevance to multiple data nodes.
        Determine the relevance value programmatically.
        Extend to other types of Usage Logs.
  35. Thank You! Questions and/or Comments?
      Contact: Markus.Kirchberg@hp.com
  36. References
      [1] Berendt, B., Hollink, L., Hollink, V., Luczak-Rösch, M., Möller, K. H., and Vallet, D. Usewod2011 – 1st international workshop on usage analysis and the web of data. In 20th International World Wide Web Conference (WWW) (Hyderabad, India, 2011).
      [2] Möller, K., Hausenblas, M., Cyganiak, R., Handschuh, S., and Grimnes, G. A. Learning from linked open data usage: Patterns & metrics. In Proceedings of the Web Science Conference (WebSci) (2010).
