AOL - Ian Holsman - Hadoop World 2010

AOL's Data Layer

Ian Holsman
AOL

Learn more @ http://www.cloudera.com/hadoop/



  1. The Data Layer: 'because data has needs'. Hadoop World, October 2010. Ian Holsman.
  2. Who Am I?
     • Ian Holsman
     • CTO of Relegence
     • Started in open source in 2000 on the Apache Web Server
     • Joined AOL in 2007
     • I work on the 'content' side of AOL
  3. AOL has
     • 3 large (>100 boxes) Hadoop clusters
       - 1 in advertising
       - 1 in search
       - 1 in content
     • I am talking today about the 'content' side of the house
  4. Agenda
     • The opportunity
     • How we addressed it
     • Unexpected benefits and issues we had
     • What we are doing today
  5. It started with a question: can we do better than a 'top stories' link?
  6. The Opportunity (circa 2008)
     • Get more information about our customers
     • Increase recirculation
     • Increase RPM of our pages
  7. Which we translated into
     • Build a better 'related' page module
       - Initially site-specific, but the plan was to make it site-wide
  8. How we addressed it
  9. How we addressed it
     • Custom JavaScript injected onto the page so we can start measuring things
     • Custom web server modules to handle cookies across multiple domains
     • Custom log-processing infrastructure to push data onto HDFS every 15 minutes
     • MapReduce jobs to provide reports and create MySQL databases
     • Built a co-visitation algorithm to produce related pages
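The deck doesn't show the co-visitation algorithm itself, but the idea is simple: pages viewed by the same visitor are 'related', and the most frequently co-visited ones fill the related-pages module. A minimal in-memory sketch under that assumption (the production version ran as MapReduce jobs over HDFS click logs; all names here are illustrative):

```python
from collections import defaultdict
from itertools import combinations

def covisitation(sessions):
    """Count how often each pair of URLs is viewed by the same visitor.
    `sessions` is a list of per-visitor URL lists."""
    pair_counts = defaultdict(int)
    for urls in sessions:
        # Each unique unordered pair within one visitor's history counts once.
        for a, b in combinations(sorted(set(urls)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def related(url, pair_counts, k=3):
    """Top-k pages most often co-visited with `url`."""
    scores = defaultdict(int)
    for (a, b), n in pair_counts.items():
        if a == url:
            scores[b] += n
        elif b == url:
            scores[a] += n
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

In the MapReduce version, the pair-counting loop becomes the map phase (emit each pair once per visitor) and the summation becomes the reduce phase.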
 10. Privacy
     • We have tried our best to keep things anonymous from the start
     • We don't track IP-level data; we translate to WOEIDs
       - So we can't tell you (or governments) what a particular IP did
     • It's not perfect
     • We avoided putting it on 'sensitive' sites
 11. Initial architecture (July 2008)
 12. Did it on the cheap
     • 2-3 person project
     • Grabbed 50 'spare' machines that were lying around
     • Installed Hadoop
     • Put our 'beacon' on a site (AOL Real Estate) and away we went
     • A 'skunkworks' with the blessing of the CIO; minimized red tape
 13. In 2-3 months
     • We had infrastructure up
     • We were processing page views and uniques
     • We installed the beacon on other 'small' sites
     • We had 'data' and a proof of concept that was meaningful for business owners
 14. We got people's attention
     • Started doing basic reports for Bebo: 300M PVs a day
     • Needed to move from a skunkworks to a 'real' project
 15. Major issues
     • Hadoop
       - MapReduce was slow to write and inflexible
       - Hadoop kept hanging: both the NameNode and our custom push jobs would stall
     • Operations
       - How to move from 0.18 to 0.19?
       - Failing jobs meant we were getting paged, and restartability was never really designed in
       - Felt like we were building our house on quicksand
       - We were running on factory defaults; the network wasn't optimized at all
     • People
       - Zero experience going in; people were learning by doing
       - Lots of new things made fault detection 'interesting'
       - Our group started becoming a bottleneck
       - MapReduce was hard to learn
 16. Operational issues
 17. Operational issues
     • Got 'real' machines
       - Put onto the same switches/racks
       - Built the filesystem to better match how we used Hadoop
       - Upgraded to 0.19 at the same time; took 48 hours to migrate
     • Spent some time listening to experts
       - Tuned our cluster a bit better
       - Removed developer access to the 'hadoop' user
     • Still not a 100% "production" system, but close enough for my liking
 18. Then Yahoo open-sourced 'Pig'
 19. Pig fixed a lot of 'people' issues
     • Easy to use
     • Didn't require much training
     • Enabled the system to be used by 'regular' channel developers
 20. We felt like Alice going through the rabbit hole (image: http://www.flickr.com/photos/spam/3355824586/)
 21. The data unlocked innovations
     • The channel developers knew their data
     • They used it in ways we never expected
 22. Some cool tools built off the data
 23. The heatmap tool
 24. The heatmap tool
 25. AOL's Traffic Exchange
 26. The URL viewer
     • Get stats about any URL
       - Page views
       - Google searches
       - Referrers
       - Exits
       - Custom parameters
       - Geographic regions
     • Have a similar tool for anonymous user IDs
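The URL viewer's per-URL counters amount to a straightforward aggregation over beacon events. A sketch of that rollup, with assumed field names ('url', 'referrer', 'user') rather than AOL's actual beacon schema:

```python
from collections import Counter, defaultdict

def url_stats(events):
    """Aggregate beacon events into per-URL report rows.
    Each event is a dict with 'url', 'referrer', and 'user' keys
    (hypothetical field names; the real schema is not in the slides)."""
    stats = defaultdict(lambda: {"page_views": 0,
                                 "uniques": set(),      # distinct anonymous user IDs
                                 "referrers": Counter()})
    for e in events:
        row = stats[e["url"]]
        row["page_views"] += 1
        row["uniques"].add(e["user"])
        row["referrers"][e["referrer"]] += 1
    return stats
```

The same grouping, expressed in Pig Latin as a GROUP BY url with COUNT and DISTINCT, is what made these reports easy for channel developers to write.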
 27. Some useful applications, using simple aggregation techniques and Mahout
 28. Shopping recommendations
     • Utilizes Mahout
     • Utilizes custom parameters
     • Better click-through rate than external vendors
     • Can apply the technique to any product-based channel
 29. User recommendations

     {
       "algo": "recoByPVPartDay",
       "UnauthId": "007e3dc60bbe11dfadba39f9fdfe11b5#2",
       "url-pv-Info": [
         { "pv": 54.0, "url": "joystiq.com/2010/07/14/sengoku-basara-samurai-heroes-sticks-six-swords-into-north-amer" },
         { "pv": 49.0, "url": "joystiq.com/2010/07/14/maxis-hiring-development-director-for-online-simulation-game" },
         { "pv": 35.0, "url": "joystiq.com/2010/07/14/how-to-play-sin-and-punishment-star-successor" },
         { "pv": 10.0, "url": "news.bigdownload.com/2010/07/14/bioware-co-founder-were-working-on-smaller-mmo-type-games" },
         { "pv": 3.0, "url": "news.bigdownload.com/2010/07/14/natural-selection-2-alpha-test-coming-july-26-for-special-editio" }
       ]
     }
 30. User Interests

     {
       "tags": [
         { "tag": "Video games", "score": 435.0 },
         { "tag": "Internet search engines", "score": 96.0 },
         { "tag": "Internet", "score": 84.0 }
       ],
       "unauthId": "007e3dc60bbe11dfadba39f9fdfe11b5"
     }
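A per-user interest profile like the one above can be produced by summing the tag scores of the pages each anonymous ID viewed. A sketch of that rollup, assuming each page view comes pre-tagged (as Relegence tagging would provide); the tag-weight inputs are illustrative, not from the talk:

```python
from collections import defaultdict

def user_interests(tagged_views):
    """Roll tagged page views up into per-user interest profiles.
    `tagged_views` is a list of (unauth_id, [(tag, weight), ...]) pairs."""
    profiles = defaultdict(lambda: defaultdict(float))
    for unauth_id, tags in tagged_views:
        for tag, weight in tags:
            profiles[unauth_id][tag] += weight
    # Emit in the same shape as the slide's JSON payload.
    return {
        uid: {"unauthId": uid,
              "tags": [{"tag": t, "score": s}
                       for t, s in sorted(scores.items(),
                                          key=lambda kv: -kv[1])]}
        for uid, scores in profiles.items()
    }
```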
 31. Where we are today
 32. The current deliverables
     • Get more information about our customers
     • Increase recirculation
     • Increase RPM of our pages
     • Build metrics into our platform
       - What works on pages
       - How we are performing
     • Build intelligence on the page
       - Collaborative filtering
       - Product recommendations
       - Top-K type lists
     • Make it closer to real time (not the focus of this talk)
 33. What data are we processing?
     • Beacon web servers
       - Tracking beacon injected into the HTML page via custom JavaScript
       - Tracks page views, page clicks, and custom events the content developer wants
       - Tracks standard things like referrers, user agents, and location
       - Developers can add custom parameters to tell us about the page
       - Needed a custom module to generate anonymous user IDs plus third-party domain tracking
       - Custom module to map IP addresses to geographic WOEID-based locations
     • Ad impressions
       - User viewed a campaign
       - Integrated with the campaign manager to determine actual revenue
     • URL context (through Relegence)
       - We can determine who and what an article is about, similar to what OpenCalais does
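A beacon request of this kind typically arrives as a query string on an image or script URL, which the log processors then decode into structured events. A sketch of that decoding step; the parameter names (url, ref, uid, and a c_ prefix for custom parameters) are assumptions, since the real beacon's wire format is not shown in the slides:

```python
from urllib.parse import parse_qsl

def parse_beacon(query_string):
    """Decode one beacon request's query string into an event dict.
    Parameter names are illustrative, not AOL's actual schema."""
    params = dict(parse_qsl(query_string))
    return {
        "url": params.get("url"),        # page being viewed
        "referrer": params.get("ref"),   # where the visitor came from
        "user": params.get("uid"),       # anonymous user ID from the cookie module
        # Developer-supplied custom parameters, namespaced with a prefix.
        "custom": {k[2:]: v for k, v in params.items()
                   if k.startswith("c_")},
    }
```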
 34. The Data Layer infrastructure today
