Hw09 Counting And Clustering And Other Data Tricks
Published in: Technology
  • The ability to compute across large data is a key to success. Accumulating data is easy; computing is hard. Personify, Omniture, and Webtrends are specialized tools, not general computing tools.
  • Fixed reports are great, and we can do that, but the really interesting part is asking questions after the fact in an ad hoc, exploratory manner.
  • We have always had data. Collecting data is not an issue for us. We are okay at that part.
  • Distribution of page views per user for July 2009. Most users view one page. Not new, but it shows we have some mastery over the data. This is based on just user data: 380 GB of compressed data covering over 700 million page views.
  • Mapping, collaborative filtering, segmentation.
  • Transcript

    • 1. Cheap Parlor Tricks, Counting, and Clustering - Derek Gottfrid, The New York Times, October 2009
    • 2. Evolution of Hadoop @ NYTimes.com
    • 3. Early Days - 2007
      • Solution looking for a problem
    • 4. Solution
      • Wouldn’t it be cool to use lots of EC2 instances
      • (it’s cheap; nobody will notice)
      • Wouldn’t it be cool to use Hadoop
      • (MapReduce Google style is awesome)
    • 5. Found a Problem
      • Freeing up historical archives of NYTimes.com 1851-1922
    • 6. Problem Bits
      • Articles are served as PDFs
      • Really need PDFs from 1851-1981
      • PDFs are dynamically generated
      • Free = more traffic
      • Real deadline
    • 7. Background - What goes into making a PDF of a NYTimes.com article?
      • Each article is made up of many different pieces - multiple columns, different sized headings, multiple pages, photos.
    • 8. Simple Answer
      • Pre-generate all 11 million PDFs and serve them statically.
    • 9. Solution
      • Copy all the source data to S3
      • Use a bunch of EC2 instances and some Hadoop code to generate all the PDFs
      • Store the output PDFs in S3
      • Serve the PDFs out of S3 w/ a signed query string
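The last step on slide 9, serving PDFs from S3 with a signed query string, can be sketched roughly as below. This is a minimal illustration that assumes the legacy S3 query-string authentication scheme (an HMAC-SHA1 signature over a string-to-sign that includes the expiry time). The talk does not show the code actually used, and the bucket name, object key, and credentials in the usage note are made-up placeholders.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def signed_s3_url(bucket, key, access_key, secret_key, expires_at):
    """Build a time-limited GET URL using legacy S3 query-string auth.

    expires_at is a Unix timestamp (seconds) after which S3 rejects the URL.
    All argument values are supplied by the caller; nothing here is taken
    from the talk.
    """
    # String-to-sign for a plain GET: verb, empty Content-MD5, empty
    # Content-Type, expiry, then the canonical resource path.
    resource = "/{}/{}".format(bucket, quote(key))
    string_to_sign = "GET\n\n\n{}\n{}".format(expires_at, resource)
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode("ascii")
    # The signature is base64, so it must be URL-encoded ("+", "/", "=").
    return ("https://{}.s3.amazonaws.com/{}"
            "?AWSAccessKeyId={}&Expires={}&Signature={}").format(
                bucket, quote(key), quote(access_key, safe=""),
                expires_at, quote(signature, safe=""))
```

A call such as `signed_s3_url("example-bucket", "1851/09/18/page1.pdf", "AKIDEXAMPLE", "not-a-real-secret", 1256000000)` returns a URL that only works until the given expiry, so the bucket itself can stay private while the PDFs are still served statically.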
    • 10. A Few Details
      • Limited HDFS - everything loaded in and out of S3
      • Reduce = 0 - only used for some stats and error reporting
    • 11. Breakdown
      • 4.3 TB of source data into S3
      • 11M PDFs - 1.5 TB output
      • $240 for EC2 - 24hrs x 100 machines
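The figures on slide 11 imply a few unit numbers worth spelling out. The short calculation below uses only the slide's values and assumes all 100 machines ran for the full 24 hours. It covers the EC2 charge only, since the slide gives no S3 storage or transfer costs.

```python
# Figures taken from slide 11.
machines = 100
hours = 24
ec2_cost_usd = 240
pdf_count = 11_000_000

# Derived values (EC2 only; S3 costs are not stated in the talk).
instance_hours = machines * hours
cost_per_instance_hour = ec2_cost_usd / instance_hours
pdfs_per_instance_hour = pdf_count / instance_hours
cost_per_1000_pdfs = ec2_cost_usd / pdf_count * 1000

print("instance-hours:", instance_hours)
print("USD per instance-hour: %.2f" % cost_per_instance_hour)
print("PDFs per instance-hour: %.0f" % pdfs_per_instance_hour)
print("USD per 1,000 PDFs: %.4f" % cost_per_1000_pdfs)
```

That works out to 2,400 instance-hours at about $0.10 each, or roughly 4,600 PDFs per machine per hour.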
    • 12. TimesMachine http://timesmachine.nytimes.com
    • 13. Currently - 2009
      • All that darn data - Web Analytics
    • 14. Data
      • Registration / Demographic
      • Articles 1851 - today
      • Usage Data / Web Logs
    • 15. Counting
      • Classic cookie tracking - let’s add it up
      • Total PV
      • Total unique users
      • PV per user
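The three counts on slide 15 fit the MapReduce pattern directly. The sketch below simulates the map and reduce phases in plain Python on a toy log. The talk says the real jobs were written as straight MapReduce in Java, so this is only an illustration of the idea, and the tab-separated log format (cookie id, then URL) is an assumption made for the example.

```python
from collections import defaultdict


def map_phase(log_lines):
    """Emit (user_id, 1) for every page-view record."""
    for line in log_lines:
        # Hypothetical record layout: "<cookie id>\t<url>".
        user_id, _url = line.split("\t", 1)
        yield user_id, 1


def reduce_phase(pairs):
    """Sum the emitted counts per user, as a reducer would after the shuffle."""
    pv_per_user = defaultdict(int)
    for user_id, count in pairs:
        pv_per_user[user_id] += count
    return dict(pv_per_user)


def summarize(pv_per_user):
    """Return (total page views, total unique users)."""
    return sum(pv_per_user.values()), len(pv_per_user)


logs = [
    "cookieA\t/2009/07/01/world/a.html",
    "cookieB\t/2009/07/01/sports/b.html",
    "cookieA\t/2009/07/02/business/c.html",
    "cookieC\t/2009/07/02/world/a.html",
    "cookieA\t/2009/07/03/arts/d.html",
]

pv_per_user = reduce_phase(map_phase(logs))
total_pv, unique_users = summarize(pv_per_user)
print(total_pv, unique_users, pv_per_user)
```

Page views per user is the reducer output itself. The two totals fall out of it as a sum of the values and a count of the keys.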
    • 16. A Few Details
      • Using EC2 - 20 Machines
      • Hadoop 0.20.0
      • 12+TB of data
      • Straight MR in Java
    • 17. Usage Data July 2009 ???M Page Views ??M Unique Users
    • 18. Merging Data
      • Usage data combined with demographic data.
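Combining usage data with demographic data amounts to a join on the user identifier. A minimal sketch follows, assuming the demographic table is small enough to hold in memory (a map-side join). The talk does not say which join strategy was used, and the field names and age-group labels are hypothetical.

```python
from collections import Counter


def views_by_age_group(page_views, demographics):
    """Join page views to age groups by user id and count views per group.

    page_views:   iterable of (user_id, url) pairs
    demographics: dict mapping user_id -> age group label
    Users with no registration record are counted under "unknown".
    """
    counts = Counter()
    for user_id, _url in page_views:
        counts[demographics.get(user_id, "unknown")] += 1
    return dict(counts)


# Toy inputs, invented for illustration.
demographics = {"u1": "18-24", "u2": "35-44", "u3": "18-24"}
page_views = [
    ("u1", "/a.html"),
    ("u2", "/b.html"),
    ("u3", "/a.html"),
    ("u1", "/c.html"),
    ("u9", "/a.html"),  # unregistered visitor
]

print(views_by_age_group(page_views, demographics))
```

Filtering `page_views` by referrer before the join would give a per-age-group breakdown for a single traffic source.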
    • 19. Twitter Click Backs By Age Group July 2009
    • 20. Merging Data
      • Usage data combined with article metadata
    • 21. Usage Data combined with Article Data July 2009 40 Articles
    • 22. Usage Data combined with Article Data July 2009 40 Articles
    • 23. Products
      • Coming soon...
    • 24. Clustering
      • Moving beyond simple counting and joining
      • Join usage data, demographic information, and article metadata
      • Apply simple k-means clustering
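For readers unfamiliar with k-means, here is a minimal single-machine sketch of the standard algorithm (Lloyd's iteration). The talk does not describe the features or the implementation used at scale, so the two-dimensional feature vectors in the example (say, each user's share of page views in two sections) are purely hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's k-means on equal-length tuples of floats.

    centroids is the caller's initial guess. Returns (centroids, assignments),
    where assignments[i] is the cluster index chosen for points[i] in the
    final pass.
    """
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignments = [
            min(range(len(centroids)),
                key=lambda i: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[i])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, assignments


# Hypothetical per-user feature vectors.
users = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
centers, labels = kmeans(users, centroids=[(1.0, 0.0), (0.0, 1.0)])
print(labels, centers)
```

In a MapReduce setting the assignment step maps naturally onto the map phase and the centroid update onto the reduce phase, with one job per iteration.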
    • 25. Clustering
    • 26. Clustering
    • 27. Conclusion
      • Large scale computing is transformative for NYTimes.com.
    • 28. Questions? [email_address] @derekg http://open.nytimes.com/