HW09: Counting and Clustering and Other Data Tricks

Speaker notes
  • The ability to compute across large data is key to success. Accumulating data is easy; computing on it is hard. Personify, Omniture, and WebTrends are specialized tools, not general computing tools.
  • Fixed reports are great, and we can do that, but the really interesting part is asking questions after the fact in an ad hoc, exploratory manner.
  • We have always had data. Collecting data is not an issue for us; we are okay at that part.
  • Distribution of page views per user for July 2009. Most users view one page. Not new, but it shows we have some mastery over the data. This is based on just the usage data: 380 GB of compressed data covering over 700 million page views.
  • Mapping, collaborative filtering, segmentation.

    1. Cheap Parlor Tricks, Counting, and Clustering - Derek Gottfrid, The New York Times, October 2009
    2. Evolution of Hadoop @ NYTimes.com
    3. Early Days - 2007: A solution looking for a problem
    4. Solution: Wouldn't it be cool to use lots of EC2 instances? (It's cheap; nobody will notice.) Wouldn't it be cool to use Hadoop? (MapReduce, Google style, is awesome.)
    5. Found a Problem: Freeing up the historical archives of NYTimes.com, 1851-1922
    6. Problem Bits: Articles are served as PDFs. We really need PDFs from 1851-1981. PDFs are dynamically generated. Free = more traffic. Real deadline.
    7. Background: What goes into making a PDF of a NYTimes.com article? Each article is made up of many different pieces: multiple columns, different-sized headings, multiple pages, photos.
    8. Simple Answer: Pre-generate all 11 million PDFs and serve them statically.
    9. Solution: Copy all the source data to S3. Use a bunch of EC2 instances and some Hadoop code to generate all the PDFs. Store the output PDFs in S3. Serve the PDFs out of S3 with a signed query string (a URL-signing sketch follows below).
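The deck does not show how the signed query string on slide 9 is built. The sketch below is only a minimal illustration of S3 query-string request authentication as it worked at the time (AWS Signature Version 2): HMAC-SHA1 over the verb, expiry time, and resource path, Base64-encoded and appended to the URL. The bucket, key, and credentials are hypothetical, and java.util.Base64 is a Java 8 convenience; in 2009 a third-party encoder would have been used.

```java
import java.net.URLEncoder;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class SignedS3Url {

    // Build a time-limited GET URL for a private S3 object using
    // query-string request authentication (AWS Signature Version 2).
    public static String sign(String accessKey, String secretKey,
                              String bucket, String key, long expiresEpochSeconds) throws Exception {
        // String to sign for a plain GET: verb, two blank header fields, expiry, resource path.
        String stringToSign = "GET\n\n\n" + expiresEpochSeconds + "\n/" + bucket + "/" + key;

        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(secretKey.getBytes("UTF-8"), "HmacSHA1"));
        String signature = Base64.getEncoder().encodeToString(mac.doFinal(stringToSign.getBytes("UTF-8")));

        return "https://" + bucket + ".s3.amazonaws.com/" + key
                + "?AWSAccessKeyId=" + URLEncoder.encode(accessKey, "UTF-8")
                + "&Expires=" + expiresEpochSeconds
                + "&Signature=" + URLEncoder.encode(signature, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical bucket and key; the link stays valid for one hour.
        long expires = System.currentTimeMillis() / 1000 + 3600;
        System.out.println(sign("AKIAEXAMPLE", "secretKeyExample", "timesmachine-pdfs",
                "1851/09/18/issue-0001.pdf", expires));
    }
}
```

A request made with such a URL before the Expires timestamp is accepted by S3 without any other credentials, which is what lets pre-generated PDFs sit in a private bucket and still be linked from article pages.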
    10. A Few Details: Limited HDFS - everything was loaded in and out of S3. Reduce = 0 - only used for some stats and error reporting (a map-only job sketch follows below).
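Slide 10's "Reduce = 0" describes a map-only Hadoop job: each map task does the rendering work and the framework simply writes whatever the mappers emit. Below is a minimal sketch of that shape against the org.apache.hadoop.mapreduce API that arrived with Hadoop 0.20; renderPdf(), the manifest layout, and the s3n:// paths are placeholders, not the actual NYTimes.com code.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PdfBatchJob {

    // Map-only task: each input line describes one article's source assets;
    // the mapper renders the PDF and emits a status line so failures can be tallied.
    public static class PdfMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text manifestLine, Context context)
                throws IOException, InterruptedException {
            String articleId = manifestLine.toString().split("\t")[0];
            try {
                byte[] pdf = renderPdf(manifestLine.toString());
                context.getCounter("pdf", "generated").increment(1);
                context.write(new Text(articleId), new Text("ok:" + pdf.length + " bytes"));
            } catch (Exception e) {
                context.getCounter("pdf", "failed").increment(1);
                context.write(new Text(articleId), new Text("error:" + e.getMessage()));
            }
        }

        // Placeholder for the real layout code that assembled columns, headings,
        // photos, and multiple pages into a single PDF and stored it in S3.
        private byte[] renderPdf(String manifest) throws Exception {
            return new byte[0];
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "timesmachine-pdf-generation");
        job.setJarByClass(PdfBatchJob.class);
        job.setMapperClass(PdfMapper.class);
        job.setNumReduceTasks(0);                                 // "Reduce = 0": map-only
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. s3n://bucket/manifests/
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // e.g. s3n://bucket/status/
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```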
    11. Breakdown: 4.3 TB of source data into S3. 11M PDFs - 1.5 TB of output. $240 for EC2 - 24 hrs x 100 machines.
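That $240 covers just the instance-hours: 100 machines x 24 hours at the $0.10 per instance-hour an EC2 small instance cost at the time works out to exactly $240; S3 storage and transfer were billed separately.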
    12. TimesMachine: http://timesmachine.nytimes.com
    13. Currently - 2009: All that darn data - web analytics
    14. Data: Registration / demographic. Articles, 1851 - today. Usage data / web logs.
    15. Counting: Classic cookie tracking - let's add it up. Total PV. Total unique users. PV per user. (A counting sketch follows below.)
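Slide 15's counts are the canonical MapReduce exercise, and slide 16's "straight MR in Java" suggests a job shaped like the sketch below. It assumes tab-separated logs with the tracking cookie in the first field (the actual log format is not shown in the deck): the output is page views per user, the number of output records is the unique-user count, and summing the values gives total page views.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageViewsPerUser {

    // Emit (cookieId, 1) for every page-view log line.
    public static class ViewMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text cookie = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed layout: tab-separated log with the tracking cookie in the first field.
            String[] fields = line.toString().split("\t");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                cookie.set(fields[0]);
                context.write(cookie, ONE);
            }
        }
    }

    // Sum views per cookie. The record count of the output is the unique-user total,
    // and the sum of all values is the total page-view count.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text cookie, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) {
                total += c.get();
            }
            context.write(cookie, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "page-views-per-user");
        job.setJarByClass(PageViewsPerUser.class);
        job.setMapperClass(ViewMapper.class);
        job.setCombinerClass(SumReducer.class);   // partial sums on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The per-user distribution mentioned in the speaker notes (most users view one page) would then be a second, much smaller pass over this output, bucketing users by their view count.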
    16. A Few Details: Using EC2 - 20 machines. Hadoop 0.20.0. 12+ TB of data. Straight MR in Java.
    17. Usage Data, July 2009: ???M page views, ??M unique users
    18. Merging Data: Usage data combined with demographic data. (A join sketch follows below.)
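Slide 18's merge is the classic reduce-side join: both datasets are mapped to the same user key, tagged by source, and combined in the reducer. The sketch below assumes tab-separated files with the user id in the first column of each; it is not the NYTimes.com code, and the new-API MultipleInputs shown here landed in Hadoop releases after the 0.20.0 named on slide 16.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UsageDemographicJoin {

    // Usage log lines: tag each record with "U", keyed by the user id (assumed first field).
    public static class UsageMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\t", 2);
            if (f.length == 2) ctx.write(new Text(f[0]), new Text("U\t" + f[1]));
        }
    }

    // Registration/demographic lines: tag with "D", keyed by the same user id.
    public static class DemoMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\t", 2);
            if (f.length == 2) ctx.write(new Text(f[0]), new Text("D\t" + f[1]));
        }
    }

    // For each user, pair every usage record with that user's demographic record.
    // Buffering one user's usage records in memory is fine for a sketch; at scale
    // a secondary sort would put the "D" record first and avoid the buffer.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text userId, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            String demo = null;
            List<String> usage = new ArrayList<String>();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("D\t")) demo = s.substring(2);
                else usage.add(s.substring(2));
            }
            if (demo == null) return;   // drop users we have no demographics for
            for (String u : usage) ctx.write(userId, new Text(demo + "\t" + u));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "usage-demographic-join");
        job.setJarByClass(UsageDemographicJoin.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, UsageMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, DemoMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same pattern, keyed on article URL instead of user id, covers the usage-plus-article-metadata merge on slide 20.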
    19. Twitter Click-Backs by Age Group, July 2009
    20. Merging Data: Usage data with article metadata
    21. Usage Data Combined with Article Data, July 2009 - 40 Articles
    22. Usage Data Combined with Article Data, July 2009 - 40 Articles
    23. Products: Coming soon...
    24. Clustering: Moving beyond simple counting and joining. Join usage data, demographic information, and article metadata. Apply simple k-means clustering. (A k-means sketch follows below.)
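The deck does not show the clustering code, so the sketch below is only a single-machine illustration of the k-means loop itself: assign each point to its nearest centroid, then recompute each centroid as the mean of its members. The per-user feature vectors are made up for illustration; at the data volumes on slide 16 this step would run as iterative MapReduce passes (or a library such as Mahout) rather than in one JVM.

```java
import java.util.Arrays;
import java.util.Random;

// Minimal single-machine k-means over per-user feature vectors
// (e.g. page views, visits, age), just to show the algorithm.
public class SimpleKMeans {

    static double distSq(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) d += (a[i] - b[i]) * (a[i] - b[i]);
        return d;
    }

    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = distSq(point, centroids[c]);
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    public static int[] cluster(double[][] points, int k, int iterations) {
        // Seed centroids from randomly chosen points (a fixed seed keeps runs repeatable).
        Random rnd = new Random(42);
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = points[rnd.nextInt(points.length)].clone();

        int[] assignment = new int[points.length];
        for (int iter = 0; iter < iterations; iter++) {
            // Assignment step: attach each user vector to its nearest centroid.
            for (int i = 0; i < points.length; i++) assignment[i] = nearest(points[i], centroids);

            // Update step: recompute each centroid as the mean of its members.
            double[][] sums = new double[k][points[0].length];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < points[i].length; d++) sums[assignment[i]][d] += points[i][d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue;   // keep empty clusters where they are
                for (int d = 0; d < sums[c].length; d++) centroids[c][d] = sums[c][d] / counts[c];
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Toy data: [page views, visits, age] per user (made up for illustration).
        double[][] users = { {1, 1, 22}, {2, 1, 25}, {40, 12, 34}, {55, 15, 31}, {8, 3, 61}, {6, 2, 58} };
        System.out.println(Arrays.toString(cluster(users, 3, 10)));
    }
}
```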
    25. Clustering
    26. Clustering
    27. Conclusion: Large-scale computing is transformative for NYTimes.com.
    28. Questions? [email_address] @derekg http://open.nytimes.com/
