Measuring the impact of Google Analytics

  1. Measuring the impact: Stephen Merity / @smerity
  2. Smerity @ Common Crawl: continuing the crawl; documenting best practices; guides for newcomers to Common Crawl + big data; reference for seasoned veterans; spending many hours blessing and/or cursing Hadoop. Before: University of Sydney '11, Harvard '14, Google Sydney, Grok Learning
  3. I was hoping on creating a tool that will automatically extract some of the most common memes ("But does it run Linux?" and "In Soviet Russia..." style jokes etc) and I needed a corpus - . I do intensely apologise. I wrote a primitive (threaded :S) web crawler and started it before I considered robots.txt -- Past Smerity (16/12/2007)
  4. Where did all the HTTP referrers go?
  5. Referrers: leaking browsing history If you click from to protected-bikeways-planned-for-the-embarcadero/ then SFBike knows you came from Reddit
  6. 1) How many websites is Google Analytics (GA) on? 2) How much of a user's browsing history does GA capture?
  7. Top 10k domains: 65.7%. Top 100k domains: 64.2%. Top million domains: 50.8%. It keeps dropping off, but by how much..?
  8. Estimate of captured browsing history... ?
  9. Referrers allow easy web tracking when done at Google's scale! No information: !GA → !GA. Full information: !GA → GA; GA → !GA → GA; GA → !GA → GA → !GA → GA → !GA → GA → !GA → GA.
  10. Key insight: leaked browsing history. Google only needs one in every two links to have GA in order to have your full browsing path* (*possibly less if link graph + click timing + machine learning used).
  11. Estimating leaked browser history: for each link = {page A} → {page B}: total_links += 1; if {page A} or {page B} has GA: total_leaked += 1. Estimate of leaked browser history is simply: total_leaked / total_links
  12. Joint project with Chad Hornbaker* at Harvard IACS *Best full name ever: Captain Charles Lafforest Hornbaker II
  13. The task: Google Analytics count: " "; generate link graph; merge link graph & GA count. NoGA 1 GA 6 GA 1 GA 244 -> <total times> -> 24
  14. Exciting age of open data Open data + Open tools + Cloud computing
  15. WARC: raw web data. WAT: metadata (links, title, ...) for each page. WET: extracted text.
  16. WARC = GA usage (raw web data). WAT = hyperlink graph (metadata: links, title, ... for each page).
  17. Estimating the task's size. Page level ( ): 3.5 billion nodes, 128 billion edges, 331GB compressed. Subdomain level ( ): 101 million nodes, 2 billion edges, 9.2GB compressed. Decided on using subdomains instead of page level. http:// /
  18. Engineering for scale ✓ Use the framework that matches best ✓ Debug locally ✓ Standard Hadoop optimizations (combiner, compression, re-use JVMs...) ✓ Many small jobs ≫ one big job ✓ Ganglia for metrics & monitoring
  19. Hadoop :'(
  20. Hadoop :'(
  21. Monitoring & metrics with Ganglia
  22. Engineering for cost ✓ Avoid Hadoop if it's simple enough ✓ Use spot instances everywhere* ✖ Use EMR if highly cost sensitive (Elastic MapReduce = hosted Hadoop) *Everywhere but the master node!
  23. Juggling spot instances c1.xlarge goes from $0.58 p/h to $0.064 p/h
  24. EMR: the good, the bad, the ugly. Good: significantly easier, one click setup. Bad: price is insane when using spot instances (spot = $0.075 with EMR = $0.12). Ugly: guess how many log files for a 100 node cluster?
  25. 584,764+ log files. Ouch.
  26. Cost projection. Best optimized small Hadoop job: 1/177th the dataset in 23 minutes (12 c1.xlarge machines + Hadoop master). Estimated full dataset job: ~210TB for web data + ~90TB for link data; ~$60 in EC2 costs (177 hours of spot instances); ~$100 in EMR costs (avoid EMR for cost!).
  27. Final results: 29.96% of 48 million domains have GA (top million domains was 50.8%). That means that one in every two hyperlinks will leak information to Google.
  28. The wider impact
  29. Want Big Open Data? Web Data Covers everything at scale! Languages... Topics... Demographics...
  30. Processing the web is feasible. Downloading it is a pain! Common Crawl does that for you. Processing it is scary! Big data frameworks exist and are (relatively) painless. These experiments are too expensive! Cloud computing means experiments can be just a few dollars.
  31. Get started now..! Want raw web data? Want hyperlink graph / web tables / RDFa? Want example code to get you started?
  32. Measuring the impact: Full write-up: Stephen Merity / @smerity
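
The estimate described on slide 11 can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the talk: the function name `estimate_leaked_fraction`, the `(source, target, count)` edge format, and the toy domains are all assumptions made for the example.

```python
from typing import Iterable, Set, Tuple

# Hypothetical edge format: (source domain, target domain, number of links).
Edge = Tuple[str, str, int]


def estimate_leaked_fraction(edges: Iterable[Edge], ga_domains: Set[str]) -> float:
    """Return the fraction of links where the source or the target has GA."""
    total_links = 0
    total_leaked = 0
    for source, target, count in edges:
        total_links += count
        # A link leaks if either end of it runs Google Analytics.
        if source in ga_domains or target in ga_domains:
            total_leaked += count
    if total_links == 0:
        return 0.0
    return total_leaked / total_links


# Toy data, invented for illustration only.
toy_edges = [
    ("a.example", "b.example", 3),
    ("b.example", "c.example", 1),
    ("c.example", "d.example", 4),
]
toy_ga = {"a.example"}
toy_fraction = estimate_leaked_fraction(toy_edges, toy_ga)
print(toy_fraction)
```

Weighting each edge by its link count is an assumption that suits a subdomain-level graph, where many page-level links collapse into a single edge. Passing a count of 1 for every edge gives the unweighted version of the same estimate.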
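
The cost projection on slide 26 can be approximated by scaling the small benchmark job linearly. The sketch below is a back-of-the-envelope calculation under stated assumptions, not necessarily how the slide's figures were derived: it assumes perfectly linear scaling, no per-hour billing round-up, all 13 machines billed at the $0.064 spot price from slide 23 (slide 22 notes the master should not be a spot instance, so this is a simplification), and the $0.12 figure from slide 24 read as an EMR fee per machine-hour. The function name `project_full_cost` is hypothetical.

```python
def project_full_cost(sample_minutes, sample_share, machines, hourly_rate):
    """Scale a sample job to the full dataset, assuming linear scaling.

    sample_share is the denominator of the fraction processed,
    e.g. 177 when the sample job covered 1/177th of the dataset.
    Returns (machine_hours, cost_in_dollars).
    """
    wall_clock_hours = sample_minutes / 60 * sample_share
    machine_hours = wall_clock_hours * machines
    return machine_hours, machine_hours * hourly_rate


# Figures taken from the slides; how they are combined is an assumption.
machine_hours, ec2_cost = project_full_cost(23, 177, 13, 0.064)
_, emr_fee = project_full_cost(23, 177, 13, 0.12)

print(f"machine-hours: {machine_hours:.2f}")
print(f"EC2 spot cost: ${ec2_cost:.2f}")
print(f"EMR fee:       ${emr_fee:.2f}")
```

Under these assumptions the estimate comes to roughly 882 machine-hours, about $56 in EC2 spot costs and about $106 in EMR fees, which lands near the ~$60 and ~$100 quoted on the slide.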