Have you ever been curious as to how widely Google Analytics is used across the web? Stop pondering, start coding! In this presentation, Stephen discusses how he used the Common Crawl dataset to perform wide scale analysis over billions of web pages and what this means for privacy on the web at large.
1.
Measuring the impact of Google Analytics
Stephen Merity
smerity.com / @smerity
2.
Smerity @ Common Crawl
Continuing the crawl
Documenting best practices
Guides for newcomers to Common Crawl + big data
Reference for seasoned veterans
Spending many hours blessing and/or cursing Hadoop
Before:
University of Sydney '11, Harvard '14
Google Sydney, Freelancer.com, Grok Learning
3.
banned@slashdot.org
I was hoping on creating a tool that will automatically extract
some of the most common memes ("But does it run Linux?" and
"In Soviet Russia..." style jokes etc) and I needed a corpus -
I wrote a primitive (threaded :S) web crawler and started it
before I considered robots.txt. I do intensely apologise.
-- Past Smerity (16/12/2007)
5.
Referrers: leaking browsing history
If you click from
http://www.reddit.com/r/sanfrancisco
to
http://www.sfbike.org/news/protected-bikeways-planned-for-the-embarcadero/
then SFBike knows you came from Reddit
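As a concrete illustration (not from the slides), here is a minimal Python sketch of the mechanism: when the browser follows that link, it attaches a Referer header to the request for the destination page, using the two URLs from the example above.

import urllib.request

# The browser adds a Referer header when following the link from Reddit.
# This sketch reproduces the same request by hand.
req = urllib.request.Request(
    "http://www.sfbike.org/news/protected-bikeways-planned-for-the-embarcadero/",
    headers={"Referer": "http://www.reddit.com/r/sanfrancisco"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # SFBike's server (and any analytics it embeds) sees the Referer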
6.
1) How many websites is Google Analytics (GA) on?
2) How much of a user's browsing history does GA capture?
7.
Top 10k domains: 65.7%
Top 100k domains: 64.2%
Top million domains: 50.8%
It keeps dropping off, but by how much?
9.
Referrers allow easy web tracking when done at Google's scale!
No information: !GA → !GA
Full information: !GA → GA
GA → !GA → GA
GA → !GA → GA → !GA → GA → !GA → GA → !GA → GA
10.
Key insight: leaked browsing history
Google only needs one in every two links to have GA in order to have your full browsing path*
*possibly less if link graph + click timing + machine learning used
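To make the one-in-two-links insight concrete, here is a small illustrative Python sketch (my own, not from the talk): GA observes a time-ordered stream of (referrer, page) pairs on GA-enabled pages, and the pages without GA still show up as referrers, so the full path can be stitched back together.

def reconstruct_path(observations):
    """observations: time-ordered (referrer, ga_page) pairs as seen by GA."""
    path = []
    for referrer, page in observations:
        # A non-GA page is still revealed when it appears as the referrer of a GA page
        if referrer and (not path or path[-1] != referrer):
            path.append(referrer)
        path.append(page)
    return path

# Path A → B → C → D → E where only A, C and E have GA:
# GA directly observes (None, A), (B, C) and (D, E)
print(reconstruct_path([(None, "A"), ("B", "C"), ("D", "E")]))
# ['A', 'B', 'C', 'D', 'E'] — the full browsing path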
11.
Estimating leaked browsing history
for each link {page A} → {page B}:
    total_links += 1
    if {page A} or {page B} has GA:
        total_leaked += 1
Estimate of leaked browsing history is simply:
total_leaked / total_links
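A runnable version of the pseudocode above, as a minimal sketch; the `links` iterable and `ga_domains` set are hypothetical stand-ins for the link graph and GA-count outputs described in the next slides.

def estimate_leaked(links, ga_domains):
    """links: iterable of (page_a, page_b) pairs; ga_domains: set of pages/domains with GA."""
    total_links = 0
    total_leaked = 0
    for page_a, page_b in links:
        total_links += 1
        if page_a in ga_domains or page_b in ga_domains:
            total_leaked += 1
    return total_leaked / total_links if total_links else 0.0

# Toy example: two of the three links touch a GA-enabled domain
links = [("a.com", "b.com"), ("b.com", "c.com"), ("c.com", "d.com")]
print(estimate_leaked(links, {"b.com"}))  # 0.666...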
12.
Joint project with Chad Hornbaker* at Harvard IACS
*Best full name ever: Captain Charles Lafforest Hornbaker II
13.
The task
Google Analytics count: ".google-analytics.com/ga.js"
Generate link graph
Merge link graph & GA count
GA count output:
www.winradio.net.au NoGA 1
www.winrar.com.cn GA 6
www.winratzart.com GA 1
www.winrenner.ch GA 244
Link graph output (domainA.com -> domainB.com <total times>):
cnet-cnec-driver.softutopia.com -> www.softutopia.com 24
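A hedged sketch of the per-page check behind the GA count (my own illustration, not the project's actual Hadoop code): look for the ga.js snippet in the raw HTML and emit the page's subdomain with a GA/NoGA label, which a later step can aggregate into counts like those shown above.

from urllib.parse import urlparse

GA_SNIPPET = ".google-analytics.com/ga.js"

def classify_page(url, html):
    """Return (subdomain, 'GA' or 'NoGA') for one crawled page."""
    domain = urlparse(url).netloc.lower()
    label = "GA" if GA_SNIPPET in html else "NoGA"
    return domain, label

print(classify_page("http://www.winrar.com.cn/download.html",
                    '<script src="http://www.google-analytics.com/ga.js"></script>'))
# ('www.winrar.com.cn', 'GA')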
14.
Exciting age of open data
Open data + Open tools + Cloud computing
15.
WARC: raw web data
WAT: metadata (links, title, ...) for each page
WET: extracted text
16.
WARC = GA usage (raw web data)
WAT = hyperlink graph (metadata: links, title, ... for each page)
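As an illustration of working with these formats, here is a hedged sketch that assumes the third-party warcio library and a locally downloaded crawl segment named example.warc.gz; the talk's own examples live in the cc-warc-examples repository linked at the end.

from warcio.archiveiterator import ArchiveIterator

GA_SNIPPET = b".google-analytics.com/ga.js"

# Iterate the raw WARC responses and flag pages that embed ga.js
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        body = record.content_stream().read()
        if GA_SNIPPET in body:
            print("GA", url)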
17.
Estimating the task's size
Page level (e.g. every page on http://en.wikipedia.org/):
3.5 billion nodes, 128 billion edges, 331GB compressed
Subdomain level (e.g. en.wikipedia.org):
101 million nodes, 2 billion edges, 9.2GB compressed
Decided on using subdomains instead of page level
18.
Engineering for scale
✓ Use the framework that matches best
✓ Debug locally
✓ Standard Hadoop optimizations
(combiner, compression, re-use JVMs...)
✓ Many small jobs ≫ one big job
✓ Ganglia for metrics & monitoring
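To illustrate the "standard Hadoop optimizations" point (combiner, compression, JVM re-use), here is a hedged sketch using the Python mrjob library rather than the raw Hadoop API the project used; the input format (domain<TAB>GA|NoGA lines) and the exact Hadoop property names are assumptions.

from mrjob.job import MRJob

class GADomainCount(MRJob):
    # Assumed Hadoop properties enabling map-output compression and JVM re-use
    JOBCONF = {
        "mapreduce.map.output.compress": "true",
        "mapred.job.reuse.jvm.num.tasks": "-1",
    }

    def mapper(self, _, line):
        # Expects lines like: www.winrar.com.cn<TAB>GA
        domain, label = line.split("\t")[:2]
        yield (domain, label), 1

    # The combiner pre-aggregates on each mapper, cutting shuffle traffic
    def combiner(self, key, counts):
        yield key, sum(counts)

    def reducer(self, key, counts):
        yield key, sum(counts)

if __name__ == "__main__":
    GADomainCount.run()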
22.
Engineering for cost
✓ Avoid Hadoop if it's simple enough
✓ Use spot instances everywhere*
✖ Use EMR if highly cost sensitive
(Elastic MapReduce = hosted Hadoop)
*Everywhere but the master node!
23.
Juggling spot instances
c1.xlarge goes from $0.58 p/h to $0.064 p/h
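Bidding for spot capacity can also be scripted; a minimal sketch using boto3 (not shown in the talk), with a placeholder AMI ID and the bid set just above the observed c1.xlarge spot price.

import boto3

ec2 = boto3.client("ec2")

# Request 12 spot instances (matching the cluster size used later),
# bidding a little above the observed $0.064/h and far below on-demand $0.58/h.
# "ami-xxxxxxxx" is a placeholder AMI ID.
response = ec2.request_spot_instances(
    SpotPrice="0.07",
    InstanceCount=12,
    LaunchSpecification={
        "ImageId": "ami-xxxxxxxx",
        "InstanceType": "c1.xlarge",
    },
)
print(response["SpotInstanceRequests"][0]["State"])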
24.
EMR: The good, the bad, the ugly
The good: significantly easier, one click setup
The bad: the price is insane when using spot instances (spot = $0.075/h, but $0.12/h with EMR)
The ugly: guess how many log files there are for a 100 node cluster?
26.
Cost projection
Best optimized small Hadoop job:
1/177th the dataset in 23 minutes
(12 c1.xlarge machines + Hadoop master)
Estimated full dataset job:
~210TB for web data + ~90TB for link data
~$60 in EC2 costs (177 hours of spot instances)
~$100 in EMR costs (avoid EMR for cost!)
27.
Final results
29.96% of 48 million domains have GA
(top million domains was 50.8%)
Since GA is concentrated on the most heavily linked domains, that means that
one in every two hyperlinks will leak information to Google
29.
Want Big Open Data?
Web Data
Covers everything at scale!
Languages...
Topics...
Demographics...
30.
Processing the web is feasible
Downloading it is a pain!
Common Crawl does that for you
Processing it is scary!
Big data frameworks exist and are (relatively) painless
These experiments are too expensive!
Cloud computing means experiments can be just a few dollars
31.
Get started now..!
Want raw web data?
CommonCrawl.org
Want hyperlink graph / web tables / RDFa?
WebDataCommons.org
Want example code to get you started?
https://github.com/Smerity/cc-warc-examples
32.
Measuring the impact of Google Analytics
Full write-up: http://smerity.com/cs205_ga/
Stephen Merity
smerity.com / @smerity