Hw09 Counting And Clustering And Other Data Tricks
 

  • The ability to compute across large data is a key to success. Accumulating data is easy; computing on it is hard. Personify, Omniture, and Webtrends are specialized tools, not general computing tools.
  • Fixed reports are great, and we can do that, but the really interesting part is asking questions after the fact in an ad hoc, exploratory manner.
  • We have always had data. Collecting data is not an issue for us. We are okay at that part.
  • Distribution of page views per user for July 2009. Most users view one page. Not new, but it shows we have some mastery over the data. This is based on just user data: 380 GB of compressed data covering over 700 million page views.
  • Mapping, collaborative filtering, segmentation.

Presentation Transcript

  • Cheap Parlor Tricks, Counting, and Clustering. Derek Gottfrid, The New York Times, October 2009
  • Evolution of Hadoop @ NYTimes.com
  • Early Days - 2007
    • Solution looking for a problem
  • Solution
    • Wouldn’t it be cool to use lots of EC2 instances
    • (it’s cheap; nobody will notice)
    • Wouldn’t it be cool to use Hadoop
    • (MapReduce Google style is awesome)
  • Found a Problem
    • Freeing up historical archives of NYTimes.com 1851-1922
  • Problem Bits
    • Articles are served as PDFs
    • Really need PDFs from 1851-1981
    • PDFs are dynamically generated
    • Free = more traffic
    • Real deadline
  • Background: What goes into making a PDF of a NYTimes.com article?
    • Each article is made up of many different pieces - multiple columns, different sized headings, multiple pages, photos.
  • Simple Answer
    • Pre-generate all 11 million PDFs and serve them statically.
  • Solution
    • Copy all the source data to S3
    • Use a bunch of EC2 instances and some Hadoop code to generate all the PDFs
    • Store the output PDFs in S3
    • Serve the PDFs out of S3 w/ a signed query string
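Serving out of S3 with a signed query string can be sketched roughly as follows. This is a minimal illustration of the legacy S3 query-string authentication scheme (HMAC-SHA1 over a short string to sign), not the code used for this project; the bucket name, object key, and credentials are hypothetical placeholders. Current AWS SDKs generate presigned URLs with a newer signing scheme, so treat this only as a picture of the idea.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def sign_s3_url(bucket, key, access_key_id, secret_key, expires):
    """Build a time-limited GET URL using legacy S3 query-string authentication.

    `expires` is a Unix timestamp after which the URL stops working.
    All argument values are supplied by the caller; none come from the talk.
    """
    resource = "/" + bucket + "/" + quote(key)
    # String to sign: verb, Content-MD5, Content-Type, expiry, resource.
    # Content-MD5 and Content-Type are empty for a plain GET.
    string_to_sign = "GET\n\n\n" + str(expires) + "\n" + resource
    digest = hmac.new(
        secret_key.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha1,
    ).digest()
    signature = base64.b64encode(digest).decode("ascii")
    return (
        "https://s3.amazonaws.com" + resource
        + "?AWSAccessKeyId=" + quote(access_key_id, safe="")
        + "&Expires=" + str(expires)
        + "&Signature=" + quote(signature, safe="")
    )
```

The web tier only needs the secret key and a clock: it hands the browser a URL that expires shortly afterwards, and S3 serves the static PDF directly.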
  • A Few Details
    • Limited HDFS - everything loaded in and out of S3
    • Reduce = 0 - only used for some stats and error reporting
  • Breakdown
    • 4.3 TB of source data into S3
    • 11M PDFs - 1.5 TB output
    • $240 for EC2 - 24hrs x 100 machines
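The breakdown figures imply a few derived numbers that are worth working out. The sketch below uses only the values on the slide; the per-instance-hour rate and per-PDF figures are derived from them, and the $240 covers EC2 only, so S3 storage and transfer are not included.

```python
# Figures taken from the slide.
machines = 100
hours = 24
total_cost_usd = 240.0
pdf_count = 11_000_000

# Derived values (not stated in the talk).
instance_hours = machines * hours
rate_per_instance_hour = total_cost_usd / instance_hours
cost_per_pdf = total_cost_usd / pdf_count
pdfs_per_instance_hour = pdf_count / instance_hours

print(f"{instance_hours} instance-hours at ${rate_per_instance_hour:.2f} each")
print(f"about {pdfs_per_instance_hour:.0f} PDFs per instance-hour")
print(f"about ${cost_per_pdf:.7f} of EC2 time per PDF")
```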
  • TimesMachine http://timesmachine.nytimes.com
  • Currently - 2009
    • All that darn data - Web Analytics
  • Data
    • Registration / Demographic
    • Articles 1851 - today
    • Usage Data / Web Logs
  • Counting
    • Classic cookie tracking - let’s add it up
    • Total PV
    • Total unique users
    • PV per user
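The three counts above fit the map/reduce pattern directly. The sketch below simulates it in memory in plain Python rather than Hadoop; the tab-separated log format (cookie ID, then URL) is an assumption made for illustration, not the actual NYTimes.com log layout.

```python
from collections import Counter, defaultdict


def map_log_line(line):
    """Map step: emit (cookie_id, 1) for one page-view record.

    Assumes a hypothetical 'cookie_id<TAB>url' line format.
    """
    cookie_id, _url = line.split("\t", 1)
    return cookie_id, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted 1s per cookie ID."""
    pv_per_user = defaultdict(int)
    for cookie_id, count in pairs:
        pv_per_user[cookie_id] += count
    return dict(pv_per_user)


def summarize(log_lines):
    """Return total PV, unique users, PV per user, and the PV distribution."""
    pv_per_user = reduce_counts(map_log_line(line) for line in log_lines)
    total_pv = sum(pv_per_user.values())
    unique_users = len(pv_per_user)
    # How many users viewed exactly n pages, keyed by n.
    distribution = Counter(pv_per_user.values())
    return total_pv, unique_users, pv_per_user, distribution
```

The `distribution` value is the kind of "page views per user" histogram mentioned in the speaker notes; on a cluster the same map and reduce functions would run over partitioned log files instead of a list.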
  • A Few Details
    • Using EC2 - 20 Machines
    • Hadoop 0.20.0
    • 12+TB of data
    • Straight MR in Java
  • Usage Data July 2009 ???M Page Views ??M Unique Users
  • Merging Data
    • Usage data combined with demographic data.
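Combining usage data with demographic data amounts to a join on a user key. A minimal in-memory sketch of a reduce-side join follows; the record shapes (user ID with age group, user ID with referrer) and the sample referrer value are hypothetical and chosen only to show how a per-age-group click count could fall out of the join.

```python
from collections import Counter, defaultdict


def join_usage_with_demographics(usage_records, demographic_records):
    """Group both inputs by user ID, then pair each view with the user's age group.

    usage_records: iterable of (user_id, referrer) tuples (assumed shape).
    demographic_records: iterable of (user_id, age_group) tuples (assumed shape).
    """
    grouped = defaultdict(lambda: {"age_group": None, "referrers": []})
    for user_id, age_group in demographic_records:
        grouped[user_id]["age_group"] = age_group
    for user_id, referrer in usage_records:
        grouped[user_id]["referrers"].append(referrer)

    joined = []
    for user_id, bucket in grouped.items():
        if bucket["age_group"] is None:
            continue  # inner join: drop views with no demographic record
        for referrer in bucket["referrers"]:
            joined.append((user_id, bucket["age_group"], referrer))
    return joined


def clicks_by_age_group(joined, referrer):
    """Count joined page views from one referrer, broken down by age group."""
    return Counter(age for _user, age, ref in joined if ref == referrer)
```

In a MapReduce job the grouping dictionary is replaced by the shuffle: both inputs emit the user ID as the key, and each reducer sees one user's demographic record alongside that user's page views.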
  • Twitter Click Backs By Age Group July 2009
  • Merging Data
    • Usage data with article metadata
  • Usage Data combined with Article Data July 2009 40 Articles
  • Usage Data combined with Article Data July 2009 40 Articles
  • Products
    • Coming soon...
  • Clustering
    • Moving beyond simple counting and joining
    • Join usage data, demographic information, and article metadata
    • Apply simple k-means clustering
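For readers unfamiliar with k-means, a small pure-Python sketch is shown here. It is not the implementation used in the talk: the feature vectors are hypothetical (for example, one user's page views in two sections), and the centroids are seeded naively from the first `k` points to keep the example deterministic.

```python
import math


def closest(point, centroids):
    """Return the index of the centroid nearest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def kmeans(points, k, iterations=20):
    """Cluster `points` into `k` groups; returns (centroids, labels)."""
    centroids = [tuple(p) for p in points[:k]]  # naive seeding, an assumption
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [closest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            updated.append(mean(members) if members else centroids[i])
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    labels = [closest(p, centroids) for p in points]
    return centroids, labels
```

Each iteration is itself a map/reduce pass (map: assign a point to a centroid; reduce: average the points per centroid), which is why the algorithm scales out naturally on the joined data set described above.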
  • Clustering
  • Clustering
  • Conclusion
    • Large scale computing is transformative for NYTimes.com.
  • Questions? [email_address] @derekg http://open.nytimes.com/