
Hadoop World - Oct 2009


A review of what The New York Times has been up to with Hadoop, from the simple to the less simple.



  1. Cheap Parlor Tricks, Counting, and Clustering - Derek Gottfrid, The New York Times, October 2009
  2. Evolution of Hadoop @
  3. Early Days - 2007
     - Solution looking for a problem
  4. Solution
     - Wouldn't it be cool to use lots of EC2 instances? (it's cheap; nobody will notice)
     - Wouldn't it be cool to use Hadoop? (MapReduce, Google style, is awesome)
  5. Found a Problem
     - Freeing up the historical archives of 1851-1922
  6. Problem Bits
     - Articles are served as PDFs
     - Really need PDFs from 1851-1981
     - PDFs are dynamically generated
     - Free = more traffic
     - Real deadline
  7. Background: What goes into making a PDF of an article?
     - Each article is made up of many different pieces: multiple columns, different sized headings, multiple pages, photos.
  8. Simple Answer
     - Pre-generate all 11 million PDFs and serve them statically.
  9. Solution
     - Copy all the source data to S3
     - Use a bunch of EC2 instances and some Hadoop code to generate all the PDFs
     - Store the output PDFs in S3
     - Serve the PDFs out of S3 w/ a signed query string
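The "signed query string" on slide 9 refers to S3's query-string authentication. As a minimal sketch of that scheme (the legacy HMAC-SHA1 "V2" signing that S3 used in this era; the bucket, key, and credentials below are made up):

```python
import base64
import hashlib
import hmac
import urllib.parse


def sign_s3_url(bucket, key, access_key, secret_key, expires):
    """Build a time-limited S3 GET URL using query-string authentication."""
    # String-to-sign for a plain GET with no extra headers:
    # method, content-md5, content-type, expiry, resource path.
    string_to_sign = f"GET\n\n\n{expires}\n/{bucket}/{key}"
    signature = base64.b64encode(
        hmac.new(secret_key.encode(), string_to_sign.encode(), hashlib.sha1).digest()
    ).decode()
    params = urllib.parse.urlencode(
        {"AWSAccessKeyId": access_key, "Expires": expires, "Signature": signature}
    )
    return f"https://{bucket}.s3.amazonaws.com/{key}?{params}"
```

Anyone holding the URL can fetch the object until the `Expires` timestamp; after that, S3 rejects the request, which is what makes static-but-controlled serving of the PDFs workable.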
  10. A Few Details
      - Limited HDFS - everything loaded in and out of S3
      - Reduce = 0 - only used for some stats and error reporting
  11. Breakdown
      - 4.3 TB of source data into S3
      - 11M PDFs - 1.5 TB of output
      - $240 for EC2 - 24 hrs x 100 machines
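The cost breakdown implies a per-instance rate of $0.10/hour, which matches the small-instance on-demand price of that era (an assumption; the slide only gives the total). In integer cents:

```python
machines = 100
hours = 24
rate_cents_per_instance_hour = 10  # assumed $0.10/hr small-instance price

total_cents = machines * hours * rate_cents_per_instance_hour
total_dollars = total_cents // 100  # 240, matching the slide
```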
  12. TimesMachine
  13. Currently - 2009
      - All that darn data - Web Analytics
  14. Data
      - Registration / Demographic
      - Articles 1851 - today
      - Usage Data / Web Logs
  15. Counting
      - Classic cookie tracking - let's add it up
      - Total PV
      - Total unique users
      - PV per user
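Slide 16 says these jobs were straight MapReduce in Java; as a minimal sketch of the counting logic (in Python for brevity, with a made-up log format where the cookie id is the first whitespace-separated field):

```python
from collections import defaultdict


def map_phase(log_lines):
    """Map: emit (cookie_id, 1) for each page view in the logs."""
    for line in log_lines:
        cookie_id = line.split()[0]  # assumed log layout: cookie id first
        yield cookie_id, 1


def reduce_phase(pairs):
    """Reduce: sum counts per cookie id."""
    per_user = defaultdict(int)
    for cookie_id, count in pairs:
        per_user[cookie_id] += count
    return per_user


logs = ["u1 /front-page", "u2 /sports", "u1 /opinion"]
per_user = reduce_phase(map_phase(logs))
total_pv = sum(per_user.values())      # 3
unique_users = len(per_user)           # 2
pv_per_user = total_pv / unique_users  # 1.5
```

In the real job the reduce step runs in parallel because Hadoop partitions pairs by key, so each reducer sees all counts for a given cookie.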
  16. A Few Details
      - Using EC2 - 20 machines
      - Hadoop 0.20.0
      - 12+ TB of data
      - Straight MR in Java
  17. Usage Data, July 2009: ???M Page Views, ??M Unique Users
  18. Merging Data
      - Usage data combined with demographic data.
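A standard Hadoop pattern for this kind of merge is a reduce-side join: both datasets are keyed by user id, and the reducer combines the records that share a key. A minimal in-memory sketch (field names and record shapes are assumptions for illustration):

```python
from collections import defaultdict


def reduce_side_join(usage, demographics):
    """Join (uid, url) usage records with (uid, age_group) demographic records."""
    grouped = defaultdict(lambda: ([], []))  # uid -> (urls, age_groups)
    for uid, url in usage:
        grouped[uid][0].append(url)
    for uid, age_group in demographics:
        grouped[uid][1].append(age_group)

    joined = []
    for uid, (urls, age_groups) in grouped.items():
        for age_group in age_groups:
            for url in urls:
                joined.append((uid, age_group, url))
    return joined
```

In actual Hadoop the grouping by uid is done by the shuffle, so each reducer only buffers one key's records at a time instead of the whole dataset.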
  19. Twitter Click-Backs by Age Group, July 2009
  20. Merging Data
      - Usage data with article metadata
  21. Usage Data combined with Article Data, July 2009, 40 Articles
  22. Usage Data combined with Article Data, July 2009, 40 Articles
  23. Products
      - Coming soon...
  24. Clustering
      - Moving beyond simple counting and joining
      - Join usage data, demographic information, and article metadata
      - Apply simple k-means clustering
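The slides don't show the feature vectors built from the joined usage/demographic/article data, but "simple k-means" is a well-defined step. A plain Lloyd's-algorithm sketch over small tuples of numbers (not the talk's actual implementation):

```python
import random


def kmeans(points, k, iters=10, seed=0):
    """Plain k-means (Lloyd's algorithm) over tuples of numbers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each center to its cluster's mean
        # (keep the old center if the cluster is empty).
        centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers
```

In a MapReduce setting each iteration is one job: mappers assign points to the nearest current center, reducers average each cluster to produce the next centers.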
  25. Clustering
  26. Clustering
  27. Conclusion
      - Large-scale computing is transformative for
  28. Questions? [email_address] @derekg