Many things are not observable in weblogs (search information, shopping-cart events, registration forms, time to return results). Log more at the application server
External events: marketing promotions, advertisements, site changes
Collect as much data as you realistically can, because you do not know what might be relevant to a future question. (Subject to privacy constraints, but aggregated/anonymized data is usually OK.)
Collection example: form errors. Here is a good example of data collection that we introduced without knowing a priori whether it would help. If a web form was filled in and a field did not pass validation, we logged the field and the value filled in. This was the Bluefly home page when they went live. Looking at form errors, we saw thousands of errors every day on this page. Any guesses?
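A minimal sketch of that kind of app-server logging, assuming form fields arrive as a dict and each field has a validator function; all names here are illustrative, not Bluefly's actual implementation:

```python
import logging

logger = logging.getLogger("form_errors")

def validate_and_log(form_name, fields, validators):
    """Run each field's validator; on failure, log the field and the value filled."""
    errors = {}
    for field, value in fields.items():
        check = validators.get(field)
        if check is not None and not check(value):
            errors[field] = value
            # Log enough to analyze later: which form, which field, what was typed.
            # Sensitive fields (passwords, card numbers) should be excluded for privacy.
            logger.info("form=%s field=%s value=%r failed validation",
                        form_name, field, value)
    return errors

# Example: an email field on a signup form
validate_and_log("home_signup", {"email": "john.smith"},
                 {"email": lambda v: "@" in v})
```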
Many explanations we give for "success" are backward-looking. Hindsight is 20/20
Example: per-capita sales of sunglasses in Seattle vs. LA
Our intuition at assessing new ideas is usually very poor
We are especially bad at assessing ideas that are not incremental, i.e., radical changes
We commonly confuse ourselves with the target audience
Discoveries that contradict our prior thinking are usually the most interesting
Next set of slides are a series of examples where you can test your intuition, or your “prior probabilities.”
Do you believe in intuition? No, but I have a feeling I might someday
How Priors Fail Us. Warning: the graphic image may be disturbing to some people; however, it's just your priors. We tend to interpret the picture to the left as a serious problem.
We are not used to seeing pacifiers with teeth
Checkout Page example, from Bryan Eisenberg's article on clickz.com. The conversion rate is the percentage of visits to the website that include a purchase. Which version, A or B, has a higher conversion rate? Why?
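When comparing two versions, the raw rates alone are not enough; a two-proportion z-test is a standard way to check whether an observed A/B difference in conversion rate is statistically meaningful. A stdlib-only sketch with made-up counts (not the numbers from the article):

```python
import math

def compare_conversion_rates(purchases_a, visits_a, purchases_b, visits_b):
    """Two-proportion z-test on the difference in conversion rates."""
    p_a, p_b = purchases_a / visits_a, purchases_b / visits_b
    pooled = (purchases_a + purchases_b) / (visits_a + visits_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visits_a + 1 / visits_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return p_a, p_b, z, p_value

# Made-up counts: 220 purchases in 10,000 visits (A) vs. 275 in 10,000 (B)
print(compare_conversion_rates(220, 10_000, 275, 10_000))
```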
When reading help (from product or web), you have an option to give feedback
Office Online Feedback, versions A and B. Feedback A puts everything together, whereas feedback B is two-stage: the question follows the rating. Feedback A just has 5 stars, whereas B annotates the stars from "Not helpful" to "Very helpful" and makes them lighter. Which one has a higher response rate? By how much?
Berkeley graduate admissions: only 34% of women applicants were accepted, while 44% of men were accepted
Segmenting by department to isolate the bias, they found that most departments accept a higher percentage of women applicants than men. (If anything, there is a slight bias in favor of women!)
There is no conflict between the above statements. It's possible, and it happened
Bickel, P. J., Hammel, E. A., and O'Connell, J. W. (1975). Sex bias in graduate admissions: Data from Berkeley. Science, 187, 398-404.
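To see how this can happen, here are made-up numbers (not the actual Berkeley data): suppose a permissive Dept A accepts 48/80 men (60%) and 13/20 women (65%), while a selective Dept B accepts 4/20 men (20%) and 20/80 women (25%). Each department favors women, yet overall men are accepted at 52/100 (52%) vs. 33/100 (33%) for women, because women mostly applied to the selective department.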
(For those not familiar with baseball, batting average is the fraction of at-bats that result in a hit.)
One player can hit for a higher batting average than another player during the first half of the year
Do so again during the second half
But have a lower batting average for the entire year
Key to the "paradox" is that the segmenting variable (e.g., half of the year) interacts with "success" and with the counts. E.g., player A was sick and rarely played in the 1st half, then B was sick in the 2nd half, but the 1st half was "easier" overall.
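A tiny script with made-up hit/at-bat counts makes the arithmetic concrete: A out-hits B in each half yet trails for the season, because the at-bat counts are skewed across the halves.

```python
halves = {
    "1st half": {"A": (10, 25),  "B": (60, 200)},  # A .400 > B .300
    "2nd half": {"A": (50, 200), "B": (10, 50)},   # A .250 > B .200
}
season = {"A": [0, 0], "B": [0, 0]}
for half, players in halves.items():
    for name, (hits, at_bats) in players.items():
        season[name][0] += hits
        season[name][1] += at_bats
        print(f"{half} {name}: {hits}/{at_bats} = {hits/at_bats:.3f}")
for name, (hits, at_bats) in season.items():
    print(f"Season   {name}: {hits}/{at_bats} = {hits/at_bats:.3f}")
# Season: A = 60/225 = .267 < B = 70/250 = .280, despite A winning both halves
```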
Make sure time-series data exists for the whole period. It is very easy to conclude that this week was bad relative to last week when some data is simply missing (e.g., a collection bug); see the sketch after the next point
Synchronize clocks across all data collection points. In one example, some servers were set to GMT and others to EST, leading to strange anomalies. Even being a few minutes off can cause add-to-cart events to appear "prior" to the search
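A sketch of the missing-data check, assuming each log record carries a datetime under a timestamp key (the record shape is an assumption for illustration):

```python
from datetime import date, datetime, timedelta

def missing_days(events, start, end):
    """Return days in [start, end] with no logged events (a possible collection bug)."""
    logged = {e["timestamp"].date() for e in events}
    gaps, day = [], start
    while day <= end:
        if day not in logged:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps

# Example: events on Jan 1 and Jan 3 -> Jan 2 is flagged as missing
events = [{"timestamp": datetime(2024, 1, 1, 12)},
          {"timestamp": datetime(2024, 1, 3, 9)}]
print(missing_days(events, date(2024, 1, 1), date(2024, 1, 3)))
```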
Picking the right visualization is key to seeing patterns
On the left is traffic by day – note the weekends (but hard to see patterns)
On the right is a heatmap showing traffic colored from green to yellow to red, exploiting the cyclical nature of the week (days go up in columns). It's easy to see the weekends, Labor Day on Sept 3, and the effect of Sept 11
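A sketch of how such a heatmap can be built, using made-up traffic numbers rather than the actual log data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: 12 weeks of daily page views, with quieter weekends
rng = np.random.default_rng(0)
daily = rng.normal(100, 10, size=12 * 7)
daily[5::7] *= 0.6  # Saturdays
daily[6::7] *= 0.6  # Sundays

# One column per week, days running up each column; low = green, high = red
grid = daily.reshape(12, 7).T
plt.imshow(grid, cmap="RdYlGn_r", aspect="auto", origin="lower")
plt.xlabel("Week")
plt.ylabel("Day of week (0 = Monday)")
plt.colorbar(label="Daily traffic")
plt.show()
```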
Analyzing and measuring long-term impact of changes
Control/Treatment experiments give us short-term value. How do we address the long-term impact of changes?
For non-commerce sites, how do we measure user satisfaction? Example: users hit F1 for help in Microsoft Office and execute a series of queries, browsing through documents. How do we measure satisfaction other than through surveys?