HW09: Counting and Clustering and Other Data Tricks

Slide notes

  • The ability to compute across large data sets is key to success. Accumulating data is easy; computing on it is hard. Personify, Omniture, and WebTrends are specialized tools, not general-purpose computing tools.
  • Fixed reports are great and we can do that, but the really interesting part is asking questions after the fact in an ad hoc, exploratory manner.
  • We have always had data. Collecting data is not an issue for us; we are okay at that part.
  • Distribution of page views per user for July 2009. Most users view one page. Not new, but it shows we have some mastery over the data. This is based on just the user data: 380 GB of compressed data covering over 700 million page views.
  • Mapping, collaborative filtering, segmentation.

Transcript

  • 1. Cheap Parlor Tricks, Counting, and Clustering. Derek Gottfrid, The New York Times, October 2009
  • 2. Evolution of Hadoop @ NYTimes.com
  • 3. Early Days - 2007
    • Solution looking for a problem
  • 4. Solution
    • Wouldn’t it be cool to use lots of EC2 instances
    • (it’s cheap; nobody will notice)
    • Wouldn’t it be cool to use Hadoop
    • (MapReduce Google style is awesome)
  • 5. Found a Problem
    • Freeing up historical archives of NYTimes.com 1851-1922
  • 6. Problem Bits
    • Articles are served as PDFs
    • Really need PDFs from 1851-1981
    • PDFs are dynamically generated
    • Free = more traffic
    • Real deadline
  • 7. Background: What goes into making a PDF of a NYTimes.com article?
    • Each article is made up of many different pieces: multiple columns, different-sized headings, multiple pages, photos.
  • 8. Simple Answer
    • Pre-generate all 11 million PDFs and serve them statically.
  • 9. Solution
    • Copy all the source data to S3
    • Use a bunch of EC2 instances and some Hadoop code to generate all the PDFs
    • Store the output PDFs in S3
    • Serve the PDFs out of S3 with a signed query string (see the signing sketch below)
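    The "signed query string" above refers to S3's query-string authentication. The following is a minimal sketch of the legacy 2009-era scheme (HMAC-SHA1 over a canonical string, with AWSAccessKeyId, Expires, and Signature parameters); the bucket, key, and credential values are placeholders, and java.util.Base64 is used only to keep the example self-contained. A modern client would instead request a presigned URL from the AWS SDK.

      import javax.crypto.Mac;
      import javax.crypto.spec.SecretKeySpec;
      import java.net.URLEncoder;
      import java.util.Base64;

      public class S3SignedUrl {
        // Builds a time-limited GET URL for a private S3 object using the legacy
        // query-string authentication scheme (AWSAccessKeyId / Expires / Signature).
        public static String sign(String accessKey, String secretKey,
                                  String bucket, String key, long expiresEpochSeconds)
            throws Exception {
          // String to sign for a simple GET (no Content-MD5, Content-Type, or amz headers).
          String stringToSign = "GET\n\n\n" + expiresEpochSeconds + "\n/" + bucket + "/" + key;

          Mac mac = Mac.getInstance("HmacSHA1");
          mac.init(new SecretKeySpec(secretKey.getBytes("UTF-8"), "HmacSHA1"));
          String signature = Base64.getEncoder()
              .encodeToString(mac.doFinal(stringToSign.getBytes("UTF-8")));

          return "https://" + bucket + ".s3.amazonaws.com/" + key
              + "?AWSAccessKeyId=" + accessKey
              + "&Expires=" + expiresEpochSeconds
              + "&Signature=" + URLEncoder.encode(signature, "UTF-8");
        }
      }
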
  • 10. A Few Details
    • Limited HDFS - everything loaded in and out of S3
    • Reduce = 0 - only used for some stats and error reporting (a map-only job is sketched after this slide)
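    A minimal sketch, assuming the Hadoop 0.20 Java API, of what a map-only job like the one described on slides 9-10 could look like: zero reducers, input and output on S3 rather than long-lived HDFS, and map output used only for status and stats. The PdfMapper class, the renderAndStorePdf helper, and the bucket paths are hypothetical, not the actual NYTimes code.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.NullWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class PdfGenerationJob {

        // Hypothetical mapper: each input record describes one article's source pieces;
        // map() renders the PDF, pushes it to S3, and emits only a status line.
        public static class PdfMapper extends Mapper<Object, Text, Text, NullWritable> {
          @Override
          protected void map(Object key, Text articleRecord, Context context)
              throws java.io.IOException, InterruptedException {
            String status = renderAndStorePdf(articleRecord.toString()); // hypothetical helper
            context.write(new Text(status), NullWritable.get());
          }

          private String renderAndStorePdf(String record) {
            // ... assemble columns, headings, and images into a PDF and upload it ...
            return "OK\t" + record.substring(0, Math.min(40, record.length()));
          }
        }

        public static void main(String[] args) throws Exception {
          Job job = new Job(new Configuration(), "pdf-generation");
          job.setJarByClass(PdfGenerationJob.class);
          job.setMapperClass(PdfMapper.class);
          job.setNumReduceTasks(0);            // "Reduce = 0": a map-only job
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(NullWritable.class);
          // Everything is loaded in and out of S3 rather than long-lived HDFS.
          FileInputFormat.addInputPath(job, new Path("s3n://example-source-bucket/articles/"));
          FileOutputFormat.setOutputPath(job, new Path("s3n://example-output-bucket/job-stats/"));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }
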
  • 11. Breakdown
    • 4.3 TB of source data into S3
    • 11M PDFs - 1.5 TB of output
    • $240 for EC2 - 24 hrs x 100 machines
  • 12. TimesMachine http://timesmachine.nytimes.com
  • 13. Currently - 2009
    • All that darn data - Web Analytics
  • 14. Data
    • Registration / Demographic
    • Articles 1851 - today
    • Usage Data / Web Logs
  • 15. Counting
    • Classic cookie tracking - let’s add it up
    • Total PV
    • Total unique users
    • PV per user (a counting job is sketched after this slide)
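    A minimal sketch of the kind of straight-Java MapReduce that computes these numbers from cookie-tracked logs: the mapper emits (cookieId, 1) per page view, the reducer sums page views per user, and job counters accumulate total page views and unique users. The tab-delimited log format, the cookie column index, and the counter names are assumptions; the job driver is omitted (it would look like the earlier sketch, but with a reducer configured).

      import java.io.IOException;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;

      public class PageViewCounts {

        enum Totals { PAGE_VIEWS, UNIQUE_USERS }

        // Map: one log line -> (cookieId, 1)
        public static class PvMapper extends Mapper<Object, Text, Text, LongWritable> {
          private static final LongWritable ONE = new LongWritable(1);
          private final Text cookieId = new Text();

          @Override
          protected void map(Object key, Text line, Context context)
              throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            if (fields.length > 3) {            // hypothetical: tracking cookie in column 3
              cookieId.set(fields[3]);
              context.write(cookieId, ONE);
            }
          }
        }

        // Reduce: sum page views per user; counters give the site-wide totals.
        public static class PvReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
          @Override
          protected void reduce(Text cookieId, Iterable<LongWritable> counts, Context context)
              throws IOException, InterruptedException {
            long pv = 0;
            for (LongWritable c : counts) {
              pv += c.get();
            }
            context.getCounter(Totals.PAGE_VIEWS).increment(pv);
            context.getCounter(Totals.UNIQUE_USERS).increment(1);
            context.write(cookieId, new LongWritable(pv));   // page views per user
          }
        }
      }

    The per-user output also gives the page-views-per-user distribution mentioned in the slide notes.
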
  • 16. A Few Details
    • Using EC2 - 20 Machines
    • Hadoop 0.20.0
    • 12+TB of data
    • Straight MR in Java
  • 17. Usage Data, July 2009: ???M Page Views, ??M Unique Users
  • 18. Merging Data
    • Usage data combined with demographic data (a reduce-side join sketch follows).
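    A minimal sketch of one way to do this merge as a reduce-side join keyed on the user/cookie id. It assumes two mappers (not shown), one per data set, each emitting (userId, tagged record) with a "U" tag for usage records and a "D" tag for the demographic record; the tags and field layout are hypothetical, not the actual NYTimes schema.

      import java.io.IOException;
      import java.util.ArrayList;
      import java.util.List;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Reducer;

      public class UsageDemographicJoin {

        // For one user id, the reducer sees tagged values from both data sets and emits
        // every usage record joined with that user's demographic record.
        public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
          @Override
          protected void reduce(Text userId, Iterable<Text> tagged, Context context)
              throws IOException, InterruptedException {
            String demographics = null;
            List<String> usageRecords = new ArrayList<String>();

            for (Text value : tagged) {
              String v = value.toString();
              if (v.startsWith("D\t")) {
                demographics = v.substring(2);   // at most one demographic record per user
              } else if (v.startsWith("U\t")) {
                usageRecords.add(v.substring(2));
              }
            }
            if (demographics == null) {
              return;                            // anonymous traffic: nothing to join
            }
            for (String usage : usageRecords) {
              context.write(userId, new Text(usage + "\t" + demographics));
            }
          }
        }
      }

    Downstream aggregations such as the Twitter click-backs-by-age-group chart on the next slide can then group the joined records by a demographic field.
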
  • 19. Twitter Click-Backs by Age Group, July 2009
  • 20. Merging Data
    • Usage data combined with article metadata
  • 21. Usage Data combined with Article Data, July 2009, 40 Articles
  • 22. Usage Data combined with Article Data, July 2009, 40 Articles
  • 23. Products
    • Coming soon...
  • 24. Clustering
    • Moving beyond simple counting and joining
    • Join usage data, demographic information, and article metadata
    • Apply simple k-means clustering (a minimal iteration is sketched after this slide)
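    A minimal plain-Java sketch of one k-means iteration (assign each point to its nearest centroid, then recompute centroids as means of their assigned points). In practice this step would be run repeatedly, typically as iterated MapReduce over the joined usage/demographic/article features; the dense double[] feature encoding here is an assumption for illustration.

      public class KMeansStep {

        static double squaredDistance(double[] a, double[] b) {
          double d = 0;
          for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            d += diff * diff;
          }
          return d;
        }

        /** Returns updated centroids after one assign-and-average pass over the points. */
        static double[][] step(double[][] points, double[][] centroids) {
          int k = centroids.length, dim = centroids[0].length;
          double[][] sums = new double[k][dim];
          int[] counts = new int[k];

          // Assignment: each point goes to its nearest centroid.
          for (double[] p : points) {
            int best = 0;
            for (int c = 1; c < k; c++) {
              if (squaredDistance(p, centroids[c]) < squaredDistance(p, centroids[best])) {
                best = c;
              }
            }
            counts[best]++;
            for (int i = 0; i < dim; i++) {
              sums[best][i] += p[i];
            }
          }

          // Update: new centroid = mean of its assigned points (keep the old one if empty).
          double[][] updated = new double[k][dim];
          for (int c = 0; c < k; c++) {
            for (int i = 0; i < dim; i++) {
              updated[c][i] = counts[c] == 0 ? centroids[c][i] : sums[c][i] / counts[c];
            }
          }
          return updated;
        }
      }
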
  • 25. Clustering
  • 26. Clustering
  • 27. Conclusion
    • Large-scale computing is transformative for NYTimes.com.
  • 28. Questions? [email_address] @derekg http://open.nytimes.com/