Web Log, Text, and Other Data Mining
  • The WebQuilt Visualization - Nodes are visited web pages, and arrows are the traffic between the pages. Entry pages are colored green, and exit pages cyan. Thicker arrows represent heavier traffic. Arrow color is used to indicate time spent on a page before transitioning, where the closer the arrow to red, the longer spent in transition. The designer’s path is highlighted in blue. The zoom slider interface along the left hand side is used to change the zoom level. The checkboxes along the bottom indicate which participant paths are currently displayed, and can be used to add and remove paths from the display
  • Here, half of the users have been filtered out of the display
  • Zooming in to the upper left hand corner, we can see that nodes are now thumbnails, and more detailed path transitions are visible
  • Zooming in further to the start node, the page design is visible, arrows become translucent and originate from the links clicked, and the URL of the page can be read. Browser back actions (back-arrow icon) and navigational actions that the system cannot yet identify a location for, such as form buttons and dynamically generated links (computer-with-? icon), are represented by little icons on top. Navigation is done by selecting a node or group of nodes and zooming in/out or panning with gestures.
  • If we look at the view again…
  • Specifically, let's look at these nodes on the left. We find that they are users who chose to backtrack after finding the safety information in order to find the dealer locations. Designers can use behavior information such as this to recognize places where better navigation may shorten this path, for example by providing more obvious links to remove unnecessary transitions. One of these users exited the task on a page unrelated to the task (vehicle insurance and loan calculator).
  • These two users found the correct information, but for the wrong car model year. This suggests a lack of contextual cues to help the user realize they are not on the correct path.
  • Users spent a lot of time on these two pages before transitioning to the next page; both are long (see the size of the scrollbar). Because these two pages were in the “designer’s optimal path”, it may be important to consider how much time should be spent on them. In particular, users spent a long time reading before they decided to click on the specs button, and a few users made a few wrong clicks before returning and continuing on with the “specs” link.
  • Ping pongers: another simple task asked users to find a piece of information on the casa di fruita website. Most users found the information (top links), but one user went off into ping-pong land. *Note*: this is an image from an older rev of WebQuilt, and has a logfile format we no longer support, hence the crappy rendering and arrows not connecting to links.
  • 5 lab users
  • 5 remote users
  • 18 total unique issues identified (some issues appear in more than one category). Those in bold were found with WebQuilt remote usability testing. All issues were found in the lab; only 7 unique issues were found with WebQuilt. However, 7 of the 9 site design issues were found, including 3 of the 4 higher-severity issues. The WebQuilt methodology also revealed some problems with the test design and device via questionnaires.
  • Transcript

    • 1. Web Log, Text, and Other Data Mining Wayne Kao
    • 2. What is Data Mining?
      • “Automated extraction of hidden predictive information from large databases” -Kurt Thearling
      • “Quickly and thoroughly explore mountains of data, isolating the valuable, usable information -- the business intelligence” -SPSS site
    • 3. Possible Questions (Chi)
      • Usage
        • How has info been accessed? How frequently? What’s popular?
        • How do people enter the site? Where do people spend time? How long do they spend there?
        • How do people travel within a site? What are the [un]popular paths?
        • Who are the people accessing the site? From what geographical location? From what domains?
    • 4. Possible Questions (cont)
      • Structural
        • What information has been added? Modified? Remained the same but moved?
      • Usage + Structural
        • How is new info accessed? When does it become popular?
        • How does introducing new information change navigation patterns? Can people still navigate there to the desired info?
        • Do people look for deleted information?
    • 5. Usability Testing
      • Common usability testing techniques:
      • Interviews
      • Ethnographic and/or lab-style observations
      • Surveys
      • Focus groups
      • Good qualitative data
      • Problems with these techniques:
      • Time and effort are costly
      • Small sample sizes – quantitative results? (Spool)
      • How can we get usability testing more involved in the design cycles, so we can find problems and potential problems earlier?
      • (Figure: iterative design cycle of Design, Prototype, Evaluate)
    • 6. Remote Usability (Waterson)
      • Analyze clickstreams in the context of the task and user intentions
      • Human observers not present
      • Want methods that are
        • Easy to deploy on any website
        • Compatible with range of OS and browsers
      • Mobile computing adds further usability challenges
        • Small screen sizes
        • Limited and/or new interaction techniques
        • Devices are used in environments beyond the desktop
    • 7. Apache Web Log
      • 205.188.209.10 - - [29/Mar/2002:03:58:06 -0800] "GET /~sophal/whole5.gif HTTP/1.0" 200 9609 "http://www.csua.berkeley.edu/~sophal/whole.html" "Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"
      • 216.35.116.26 - - [29/Mar/2002:03:59:40 -0800] "GET /~alexlam/resume.html HTTP/1.0" 200 2674 "-" "Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; http://www.inktomi.com/slurp.html)"
      • 202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /~tahir/indextop.html HTTP/1.1" 200 3510 "http://www.csua.berkeley.edu/~tahir/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
      • 202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /~tahir/animate.js HTTP/1.1" 200 14261 "http://www.csua.berkeley.edu/~tahir/indextop.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
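Log lines like the ones above follow Apache's Combined Log Format (host, identity, user, timestamp, request, status, bytes, referrer, user agent). A minimal parsing sketch, assuming that format; the field names and the `parse_line` helper are illustrative choices, not part of any tool discussed in the talk:

```python
import re
from datetime import datetime

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # Apache writes "-" when no bytes were sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

sample = ('202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] '
          '"GET /~tahir/indextop.html HTTP/1.1" 200 3510 '
          '"http://www.csua.berkeley.edu/~tahir/" '
          '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"')
record = parse_line(sample)
print(record["host"], record["path"], record["status"], record["size"])
```

The referrer field is what lets a tool reconstruct page-to-page transitions from an otherwise flat list of requests.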
    • 8. Analog - One traditional tool
      • Reports number of requests, info about client machines, entry/exit points, charts (Chi et al.)
      • Generated on a daily basis
      • Typical stats
      • Prettier stats
    • 9. Readings
      • “Visualizing the Evolution of Web Ecologies” Chi et al., Xerox PARC, 1998
      • “Visualizing Association Rules for Text Mining” Wong, Whitney, & Thomas, Pacific Northwest, 1999
      • “VISVIP: 3D Visualization of Paths through Web Sites” Cugini & Scholtz, National Institute of Standards and Technology, 1999
      • “Case Study: E-Commerce Clickstream Visualization” Brainerd & Becker, Blue Martini Software, 2001
      • “What Did They Do? Understanding Clickstreams with the WebQuilt Visualization System” Waterson et al., UC Berkeley, 2002
    • 11. Evolution of Web Ecologies
      • Rather than hits, focus intermediate representation on (C)ontent, (U)sage, and (T)opology, sorted by URL.
        • URL1:
          • {day1: <link> <link> …}
          • {day2: <link> <link> …}
        • URL2:
          • {day1: <link> <link> …}
      • Visualize an entire web site in a small amount of space
      • Show temporal changes
    • 12. Disk Tree Visualization
      • Breadth first traversal
      • Each ring represents a tree level
      • All leaf nodes guaranteed some angular space (360 / # leaves)
      • Visual encodings:
        • Color (new, continued, deleted): lifecycle stage
        • Line size/brightness: page access frequency
        • Line mark in X and Y: tree links
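The ring-per-level, equal-angle-per-leaf idea can be sketched as follows. This is an illustrative reading of the slide, not the paper's actual algorithm: it assumes each internal node sits at the middle of the angular span covered by its leaves, and the `tree` dictionary and `disk_tree_layout` name are made up for the example.

```python
from collections import deque

def count_leaves(tree, node, cache):
    """Number of leaves under each node (a leaf counts as 1)."""
    children = tree.get(node, [])
    cache[node] = 1 if not children else sum(
        count_leaves(tree, child, cache) for child in children)
    return cache[node]

def disk_tree_layout(tree, root, ring_spacing=1.0):
    """Return node -> (radius, angle in degrees) using a breadth-first pass."""
    leaves = {}
    count_leaves(tree, root, leaves)
    per_leaf = 360.0 / leaves[root]      # every leaf gets the same slice
    positions = {}
    queue = deque([(root, 0, 0.0)])      # (node, depth, start angle of its span)
    while queue:
        node, depth, start = queue.popleft()
        span = leaves[node] * per_leaf
        positions[node] = (depth * ring_spacing, start + span / 2.0)
        child_start = start
        for child in tree.get(node, []):
            queue.append((child, depth + 1, child_start))
            child_start += leaves[child] * per_leaf
    return positions

site = {"/": ["/a", "/b"], "/a": ["/a/1", "/a/2", "/a/3"], "/b": ["/b/1"]}
layout = disk_tree_layout(site, "/")
for page, (radius, angle) in layout.items():
    print(page, radius, angle)
```

Because angular space is divided by leaf count, wide subtrees take wide wedges, which is why page-level detail gets hard to see on large sites.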
    • 13. Disk Tree Visualization (cont)
      • Pros
        • No occlusion problems since it’s 2D plane
        • Can use the 3rd dimension for other info (e.g. time)
        • Aesthetically pleasing to the eye (?)
      • Cons
        • Difficult to see any page-level detail
        • Confusing color choices
    • 14. Time Tube Visualization
      • Put Disk Trees along spatial axis
      • Rotated so that each slice gets equal screen area
      • Focus+context
      • Animation: Can fly through tube, mapping time onto time
    • 16. Interaction Model
      • Can rotate slices with a button click
      • Can focus a slice by clicking on it
      • Flicking gestures move slices around
      • Right-clicking zooms to an area
      • Mouseovers display more information about a node in a side window
      • Can bring up pages in the browser
      • Animation of slices
    • 17. Real-world Analyses
      • Deadwood: Shows pages becoming [un]popular
      • Shows effects of a redesign
    • 18. Real-world Analyses (cont)
      • Added items are being used
      • Deleted items aren’t negatively impacting the rest of the site
    • 19. Comments
      • Gives only a broad view of the data with no real way to get at the specifics
      • Interaction seems very advanced
      • Not sure how intuitive the whole idea of a circular tree is – seems kind of gratuitous
    • 21. Association Rule?
      • Quantitative rule that describes associations between sets of items
        • Not qualitative because no domain knowledge necessary for text mining
      • Implication X → Y where
        • X: set of antecedent items
        • Y: consequent item
      • Example: 80% of people who buy diapers and baby powder also buy baby oil.
    • 22. Association Rule? (cont)
      • Support/prevalence/joint probability
        • Percentage of articles in the total set that satisfy the union of items in the antecedent and in the consequent item
      • Confidence/predictability/conditional probability
        • Percentage of articles satisfying the antecedent that also satisfy the consequent item
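A small worked example may help keep the two measures apart. In the standard formulation, support is the fraction of all records that contain every item of the rule, and confidence is the fraction of records matching the antecedent that also contain the consequent. The basket data below is invented to reproduce the 80% diaper example; the function names are illustrative.

```python
def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(1 for record in records if items <= set(record)) / len(records)

def confidence(records, antecedent, consequent):
    """Fraction of antecedent-matching records that also hold the consequent."""
    rule_support = support(records, set(antecedent) | {consequent})
    antecedent_support = support(records, antecedent)
    return rule_support / antecedent_support if antecedent_support else 0.0

# Invented data: 5 of 10 baskets hold diapers and baby powder, 4 of those add baby oil.
baskets = [
    {"diapers", "baby powder", "baby oil"},
    {"diapers", "baby powder", "baby oil"},
    {"diapers", "baby powder", "baby oil"},
    {"diapers", "baby powder", "baby oil"},
    {"diapers", "baby powder"},
    {"milk"}, {"bread"}, {"milk", "bread"}, {"diapers"}, {"baby oil"},
]
rule_x = {"diapers", "baby powder"}
rule_y = "baby oil"
print(support(baskets, rule_x | {rule_y}))   # share of all baskets matching X and Y
print(confidence(baskets, rule_x, rule_y))   # share of X baskets that also hold Y
```

For text mining the same code applies with articles in place of baskets and topic words in place of products.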
    • 23. Association Rule Visualization
      • Must visualize
        • Antecedent items & consequent items
        • Associations between antecedent and consequent
        • Rules' support
        • Confidence
      • Traditional ways of visualizing it
        • 2D matrix
        • Directed graph
    • 24. 2D Matrix (figure 1)
      • Antecedent and consequent items on axes
      • Metadata icons in the cells that connect the antecedent to consequent contain support and confidence values
      Association rule: B → C
    • 25. 2D Matrix (cont)
      • Pros: one-to-one binary relationships
      • Cons:
        • Hard to see association rules in many-to-one relationships (A+B → C or A → C and B → C)
        • Grouping antecedents adds complexity
        • Object occlusion
    • 26. Directed graph
      • nodes = items
      • edges = associations
      • Cons:
        • Dozen or more items → tangled display
        • Selecting edges to display multiple rules requires significant human interaction
    • 27. Confusing?
    • 28. “Novel” Technique
      • Matrix: rule-to-item
        • rows = topics
        • columns = item associations
        • blue / red = antecedent and consequent
      • Bar graph = confidence/support
      • Can use queries to filter
      • Mouse zooming to support context/focus
    • 30. “Novel” Technique Advantages
      • Handles hundreds of multiple antecedent association rules
      • View topics and associations simultaneously
      • Individual items clearly shown
      • No antecedent groups
      • Few occlusions because metadata is plotted at the far end and bar graph is scaled
      • No screen swapping, animation, or serious interaction required
    • 31. “Novel” Technique Demo
      • Demo shows scalability
      • ~9 MB news article corpus of 100,000+ documents
      • Use word and concept-based text engines
      • Words evaluated on whether they’re interesting depending on their position in documents
      • Suffixes removed and common prepositions, pronouns, adjectives, gerunds ignored
      • Build a table of antecedents, consequents, confidences, and supports -> feed into viz
    • 34. Conclusions
      • Rule-to-item association
      • Very clear visualization if limited to a few dozen rules
      • Most web log visualizations jump to using a graph; this paper forces you to think twice.
    • 36. VISVIP
      • Captures individual movement between pages rather than aggregates
      • Shows paths - sequence of URLs
    • 38. Topology
      • Directed graph
      • Force-directed algorithm
        • Spring-like force
        • Nodes repel each other with force inversely proportional to the distance between them (i.e. closer nodes means closer pages)
        • Final force pulls nodes toward center
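The three forces listed above can be sketched as one update step. This is a generic force-directed sketch under assumed constants (`spring`, `rest`, `repel`, `gravity` are arbitrary illustrative values), not VISVIP's actual implementation:

```python
import math

def layout_step(pos, edges, spring=0.1, rest=1.0, repel=0.05, gravity=0.01):
    """Apply one round of the three forces and return the new positions."""
    force = {node: [0.0, 0.0] for node in pos}
    nodes = list(pos)

    # 1. Every pair of nodes repels with magnitude repel / distance.
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            push = repel / dist
            force[a][0] += push * dx / dist
            force[a][1] += push * dy / dist
            force[b][0] -= push * dx / dist
            force[b][1] -= push * dy / dist

    # 2. Linked nodes are pulled toward a rest length, like a spring.
    for a, b in edges:
        dx, dy = pos[b][0] - pos[a][0], pos[b][1] - pos[a][1]
        dist = math.hypot(dx, dy) or 1e-9
        pull = spring * (dist - rest)
        force[a][0] += pull * dx / dist
        force[a][1] += pull * dy / dist
        force[b][0] -= pull * dx / dist
        force[b][1] -= pull * dy / dist

    # 3. A weak force pulls every node toward the center (the origin here).
    for node in nodes:
        force[node][0] -= gravity * pos[node][0]
        force[node][1] -= gravity * pos[node][1]

    return {n: (pos[n][0] + force[n][0], pos[n][1] + force[n][1]) for n in nodes}

pos = {"home": (-3.0, 0.0), "specs": (3.0, 0.0)}
for _ in range(100):
    pos = layout_step(pos, [("home", "specs")])
gap = math.hypot(pos["specs"][0] - pos["home"][0], pos["specs"][1] - pos["home"][1])
print(round(gap, 2))
```

Repeating the step until positions stop moving gives the final layout; linked pages settle near the spring's rest length while unlinked pages drift apart.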
    • 39. Content
      • URLs abbreviated
        • http://sims.berkeley.edu/~bob/pics/large/abd.gif → ge/abd
      • Color-coded by content type
      • Mouseover reveals all the abbreviated information
    • 40. Simplification
      • Common problems
        • Noise nodes not significant to paths - image and mailto nodes
        • Over-connectivity - link back to home page or company logo
      • Solutions
        • Delete all edges connected to a node
        • Make one node the graph root
        • Focus on a subset of the graph
    • 41. Path Sequence
      • Showing subject paths as straight lines didn't work
        • Hard to follow single jagged path
        • Multiple paths overlapped
      • Spline representation
        • Each path is a smooth curve overlaid on the graph
        • Colors for groups of subjects (e.g. novices)
    • 43. Path Sequence (cont)
      • User path-oriented layouts
        • Simpler structure than when path is laid over a graph of the entire site
    • 44. Path Timing
      • Vertical bar with base on node, its height proportional to time spent on page
      • Animation runs through pages at 10-30 times real-time
      • Select a node to get detailed stats
    • 45. Comments
      • Capturing individual movements pretty innovative
      • Curved user paths and reorienting the layout based on user paths
      • Overall graph viz not too clear
      • Good tips for creating a web log mining viz
    • 47. Clickstream Visualizer
      • Aggregate nodes using an icon (e.g. all the checkout pages)
      • Edges represent transitions
        • Wider means more transitions
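Aggregating pages into groups and counting transitions between groups can be sketched as below. The grouping rule (first path segment) and the session data are assumptions made for the example, and dropping transitions that stay inside one group is a design choice of this sketch rather than something the slides specify:

```python
from collections import Counter

def page_group(url):
    """Hypothetical grouping: the first path segment, or 'home' for the root."""
    first_segment = url.strip("/").split("/")[0]
    return first_segment or "home"

def aggregate_transitions(sessions, group_of=page_group):
    """Count group-to-group transitions across all sessions."""
    edges = Counter()
    for pages in sessions:
        groups = [group_of(page) for page in pages]
        for src, dst in zip(groups, groups[1:]):
            if src != dst:               # moves inside one group are collapsed
                edges[(src, dst)] += 1
    return edges

sessions = [
    ["/", "/products/shoes", "/checkout/cart", "/checkout/pay"],
    ["/", "/products/hats", "/products/shoes"],
    ["/", "/checkout/cart"],
]
edges = aggregate_transitions(sessions)
heaviest = max(edges.values())
for (src, dst), count in edges.items():
    width = 1 + 4 * count / heaviest     # wider line for more transitions
    print(src, "->", dst, count, width)
```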
    • 48. Customer Segments
      • Collect
        • Clickstream
        • Purchase history
        • Demographic data
      • Associates customer data with their clickstream (scary...)
      • Different color for each customer segment
    • 49. Filtering
      • Using the mouse or table control, can filter by
        • Edge weight
        • Node selection
      • Example: select checkout nodes and see if users are exiting from nodes
    • 50. Layout
      • Using third party Tom Sawyer package
      • Hierarchical from higher out-degree to higher in-degree
        • Mirrors actual flow of site users
        • The default
      • Circular
        • Puts related nodes into circles
        • Shows relationships between groups of pages
    • 51. Layout (cont)
      • Aggregation based on file system path (good idea?)
    • 52. Initial Findings
      • Gender shopping differences (intriguing...)
    • 53. Initial Findings (cont)
      • Checkout process analysis
      • Newsletter hurting sales
    • 54. Comments
      • Visualizing clickstreams with demographic data
      • Grouping pages by type
      • Best use of color
      • Icons an interesting way of reducing complexity
    • 56. System Design
      • Log data with proxy
      • Infer actions
      • Aggregate data
      • Layout graph
      • Display interactive visualization
    • 57. Capturing Interaction
      • Typical HTTP request…
      • (Diagram: Client Browser ↔ Web Server)
    • 58. Capturing Interaction (cont)
      • WebQuilt captures interaction with a proxy
        • Proxies have typically been used for caching and firewalls
      • (Diagram: Client Browser ↔ Proxy ↔ Web Server, with the proxy writing the WebQuilt log)
    • 59. Capturing Interaction (cont)
      • If a page says: <A HREF="coolpage.html">
      • Change it to: <A HREF="http://webquiltproxy.cs.berkeley.edu/webquilt?replace=http://www.spiffypages.com/coolpage.html&tid=1&linkid=13">
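The link rewriting shown above can be sketched with a regular expression. This is a simplified illustration, not WebQuilt's code: it assumes double-quoted `href` attributes, numbers links in document order for `linkid`, treats `tid` as an opaque identifier passed in by the caller, and leaves the target URL unescaped to mirror the slide (a real proxy would percent-encode it).

```python
import re
from urllib.parse import urljoin

PROXY = "http://webquiltproxy.cs.berkeley.edu/webquilt"

def rewrite_links(html, page_url, tid):
    """Point every <a href="..."> at the proxy, keeping the original target."""
    link_count = 0

    def redirect(match):
        nonlocal link_count
        link_count += 1
        # Resolve relative links against the page they appeared on.
        target = urljoin(page_url, match.group(2))
        return (f'{match.group(1)}"{PROXY}?replace={target}'
                f'&tid={tid}&linkid={link_count}"')

    return re.sub(r'(<a\s+[^>]*?href=)"([^"]*)"', redirect, html,
                  flags=re.IGNORECASE)

page = '<A HREF="coolpage.html">'
rewritten = rewrite_links(page, "http://www.spiffypages.com/index.html", tid=1)
print(rewritten)
```

Because every followed link passes back through the proxy, the log records which link on which page was clicked, something a plain server log cannot show.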
    • 60. Capturing Interaction (cont)
      • Pros:
        • Don’t need access to servers
        • Can analyze sites without permission from the server
        • Can gather clickstreams from a variety of devices including PDAs, phones, desktop computers
      • Cons:
        • No direct access to the client
    • 61. Visualization
      • Interactive, zoomable directed graph
      • Nodes = web pages
      • Edges = aggregate traffic between pages
      • Java-based SATIN toolkit for gesturing & zooming interaction
      • Image rendering of web pages:
      • JacoZoom Java callable wrappers around an ActiveX component
      • MSIE window
    • 62. Directed graph
      • Nodes: visited pages
        • Color marks entry and exit nodes
      • Arrows: traversed links
        • Thicker: more heavily traversed
        • Color
          • Red / yellow : Time spent before clicking
          • Blue : optimal path chosen by designer
    • 63. Controls
      • Slider: Zoom in and out
      • Checkboxes: Filter paths to display
    • 64. Pages
      • Zooming in shows page thumbnails
      • Arrows
        • Originate from actual links or the Back button
        • Translucent & don’t cover details
    • 69. Layout
      • Layout system flexible…
      • Edge-weighted depth-first traversal
        • Most visited path along top
        • Recursively place less followed paths below
      • Grid positioning
        • Organizes distance between nodes
        • Avoid overlapping nodes
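One way to read the layout bullets above: walk the graph depth-first, always following the heaviest outgoing edge first so the most visited path runs along the top row, and start a new grid row for each less-followed branch. The sketch below is that reading under assumed data; the `grid_layout` name and the edge weights are illustrative and it is not WebQuilt's actual layout code.

```python
def grid_layout(edges, root):
    """edges maps (src, dst) -> traffic count. Returns node -> (column, row)."""
    children = {}
    for (src, dst), weight in edges.items():
        children.setdefault(src, []).append((weight, dst))

    positions = {}
    next_row = 0

    def place(node, column):
        nonlocal next_row
        positions[node] = (column, next_row)
        first_branch = True
        # Heaviest edge first: it continues on the same row as its parent.
        for weight, child in sorted(children.get(node, []), key=lambda wc: -wc[0]):
            if child in positions:
                continue                 # already placed via a heavier path
            if not first_branch:
                next_row += 1            # lighter branches drop to a fresh row
            place(child, column + 1)
            first_branch = False

    place(root, 0)
    return positions

traffic = {
    ("home", "specs"): 5, ("home", "dealers"): 2,
    ("specs", "safety"): 4, ("specs", "loans"): 1,
    ("safety", "dealers"): 3,
}
grid = grid_layout(traffic, "home")
for page, cell in grid.items():
    print(page, cell)
```

Each branch gets its own row and columns only grow along a path, so no two nodes land in the same grid cell.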
    • 70. Interaction
      • Selecting nodes
      • Zooming in and out
      • Navigational gestures
    • 71. Inferring & Aggregating
      • Take log files and infer actions, such as when the back button is pressed
        • Can infer back button pressed, but not combinations of back and forward
        • Extensible framework to add other inferred actions
      • Aggregate information, preserving individual paths
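Back-button inference can be sketched with a history stack: if a logged click starts from a page other than the one the user was last on, but that page appears earlier in the history, the user presumably pressed back until reaching it. This is a simplified illustration with invented page names, not WebQuilt's inference code, and like the system described it cannot recover mixed back and forward sequences.

```python
def infer_back_actions(transitions):
    """transitions: list of (from_page, to_page) clicks for one user, in order.
    Returns a list of ("click" | "back", from_page, to_page) actions."""
    actions = []
    history = []
    for src, dst in transitions:
        if not history or src not in history:
            history.append(src)          # first page, or an unexplained jump
        else:
            # Pop pages until the click's source is on top of the stack.
            while history[-1] != src:
                left = history.pop()
                actions.append(("back", left, history[-1]))
        actions.append(("click", src, dst))
        history.append(dst)
    return actions

log = [("home", "specs"), ("specs", "safety"), ("home", "dealers")]
actions = infer_back_actions(log)
for action in actions:
    print(action)
```

The individual clicks are kept alongside the inferred actions, so paths can still be aggregated per user afterwards.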
    • 72. Running a WebQuilt Remote Usability Test
      • Recruit users
      • Design and distribute tasks (via email)
      • Auto-collect! Watch and wait as users perform tasks and proxy logs data
      • Visualize, analyze
      • Use the results to change design
    • 73. Pilot Usability Study
      • Edmunds.com PDA web site
      • Handspring Visor equipped with an OmniSky wireless modem
      • 10 users asked to find…
        • Anti-lock brake information on the latest Nissan Sentra model
        • The Nissan dealer closest to them.
    • 79. In the Lab vs. Out in the Wild
      • Comparing in-lab usability testing with WebQuilt remote usability testing
      • 5 users were tested in the lab
      • 5 were given the device and asked to perform the task at their convenience
      • All task directions, demographic data, and follow-up questionnaire data were presented and collected in web forms as part of the WebQuilt testing framework.
    • 82. Classifying Usability Issues
      • Lab: Tester observations, participant comments and questionnaire data
      • Remote: WebQuilt visualization and questionnaire data
      • Four categories of issues
        • Browser
        • Device
        • Test design
        • Site design
      • Six severity levels
        • 0 indicates comment
        • 1-5 where 1 is a very minor issue and 5 is a critical issue
    • 83. Findings
    • 84. Findings
      • WebQuilt methodology is promising for uncovering site design related issues.
      • 1/3 of the issues were device or browser related.
        • Browser and device issues cannot be captured automatically with WebQuilt unless they cause an interaction with the server
        • They can, however, be revealed via the questionnaire data
    • 85. Testing Concerns
      • What to do when problems with running the test occur?
      • Understanding user motivation is still ambiguous: Curiosity vs. confusion?
      • Gathering qualitative feedback on mobile devices is difficult
        • PDA input difficult
        • Phones have potential for audio
    • 86. Comments
      • Zooming/filtering great for showing overview and page-level details
        • Can put screenshots directly into the viz
      • Layout in relation to intended path
      • Study compares remote usability tests to traditional tests - promising
      • Proxy logging very cool
    • 87. Future Work
      • Expanded mobile device interaction capture, specifically net-enabled cell phones
      • Improve filtering capabilities, integrating questionnaire and demographic data
      • Clever algorithms to simplify graph layout
      • Improved quantitative reporting
      • Improved controls/interaction
      • More rigorous evaluation with designers and usability experts
    • 88. Concluding Comments
      • Many incremental improvements in web log/data mining viz (using a graph, using demographic data, etc.)
      • Would be really good to see a study of usability engineers and web developers comparing the tools themselves
