Apache Hadoop India Summit 2011 talk "Feeds Processing at Yahoo!" by Jean-Christophe Counio



  1. Feeds processing at Yahoo!
      One Platform, One Hadoop, Two Systems
      Yahoo! Inc., Apache Hadoop India Summit, 16th February 2011
  2. Agenda
      • Pacman
          • Design
          • Contributions
          • The small feeds problem
      • Pepper
          • Requirements
          • Design
      • Production numbers
      • Cover the whole spectrum
      • Examples of processing
      • Conclusion
  3. Pacman
      • Started in 2006 in Bangalore
      • Processes large feeds: millions of records in a few hours
      • Multi-tenant
      • Reliability, operability
      • Uses Hadoop M/R; one record is the unit of processing
      • Workflow semantics over Hadoop
      • Workflow defined by a DAG
      • Each node's result is stored in HDFS ‘Channels’
      • Feeds-processing-oriented API, abstracting M/R
      • High availability, cross-colo replication of HDFS data
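
The Pacman slide describes a workflow as a DAG whose nodes each write their result to a "channel", with one record as the unit of processing. The following is only a minimal in-memory sketch of that idea, not Pacman's API: `run_workflow`, the node names, and the step functions are all hypothetical, plain Python lists stand in for HDFS channels, and each node runs as a loop rather than as a Hadoop job.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(dag, steps, feed_records):
    """Run every workflow node after the nodes it depends on.

    dag:   node name -> set of upstream node names
    steps: node name -> function applied to one record at a time
    Returns the 'channels': node name -> list of records the node produced.
    """
    channels = {}
    for node in TopologicalSorter(dag).static_order():
        upstream = dag.get(node, set())
        if upstream:
            # A node reads the channels written by its upstream nodes.
            records = [r for up in sorted(upstream) for r in channels[up]]
        else:
            # Nodes without upstream read the incoming feed.
            records = feed_records
        # One record is the unit of processing; returning None drops it.
        results = [steps[node](r) for r in records]
        channels[node] = [r for r in results if r is not None]
    return channels


# Hypothetical three-node workflow: validate -> normalize -> publish.
dag = {"validate": set(), "normalize": {"validate"}, "publish": {"normalize"}}
steps = {
    "validate": lambda r: r if "id" in r else None,
    "normalize": lambda r: {**r, "title": r.get("title", "").strip()},
    "publish": lambda r: r,
}
feed = [{"id": 1, "title": " Car "}, {"title": "no id"}]
channels = run_workflow(dag, steps, feed)
```

Keeping every node's output in its own channel means a downstream node only ever depends on stored results, which is what lets each node run as a separate job.
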
  4. Design
      • Notification
      • Asynchronous processing
      • One job for each workflow (WF) node
      • State in DB
      • Feed copied onto the Grid
      • Reporting service exposes metrics and logs
  5. Contributions
      • Multiple output files for a job
      • Counters
      • Chaining of maps
      • Led to open-sourced Oozie
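
The three ideas named on the Contributions slide (chaining of maps, counters, multiple output files) can be illustrated with a small sketch. It is plain Python and does not use Hadoop; `run_chained_maps`, the counter names, and the sample records are assumptions made up for the illustration.

```python
from collections import Counter, defaultdict


def run_chained_maps(records, mappers, route):
    """Apply the mappers in order to each record in a single pass."""
    counters = Counter()
    outputs = defaultdict(list)  # output name -> records, one "file" per name
    for record in records:
        counters["records_in"] += 1
        for mapper in mappers:
            record = mapper(record)
            if record is None:  # a mapper may drop the record
                counters["records_dropped"] += 1
                break
        else:
            # Only records that survived every mapper are written out.
            outputs[route(record)].append(record)
            counters["records_out"] += 1
    return dict(outputs), counters


mappers = [
    lambda r: r if r.get("price", 0) > 0 else None,   # filter
    lambda r: {**r, "price": round(r["price"], 2)},   # normalize
]
outputs, counters = run_chained_maps(
    [
        {"kind": "auto", "price": 10.0},
        {"kind": "deal", "price": 0},
        {"kind": "deal", "price": 3.14159},
    ],
    mappers,
    route=lambda r: r["kind"],
)
```

Chaining the mappers avoids writing an intermediate result between the filter and the normalization step, while the counters give a per-run summary that a reporting service could expose.
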
  6. The small feeds problem
      • More and more small feeds onboarded (NPC, OMG, Green…)
      • Overhead of Pacman is high (Hadoop, DB…)
      • Too many small files on HDFS
      • Solution: process the nodes of the workflow in a web server farm
      • Lack of isolation
          • Between executions
          • Native libraries management
      • Operability issues (provisioning, …)
  7. Pepper requirements
      • Able to support all properties: News, Finance, Travel, …
      • Scalable (millions of feeds a day), elastic
      • Isolation, multiple native library versions
      • Low overhead (<5s)
      • Compatible with the Pacman API
      • Reuse Pacman code/infrastructure as much as possible
  8. Pepper
      • Servlet model
      • Synchronous in-memory execution of the workflow (very fast)
      • No use of HDFS
      • Shares Pacman API and infrastructure
          • Hadoop
          • Reporting, deployment…
      • Cloud-like qualities
          • Elastic, scalable
          • Isolation
  9. Design
      • Embedded Jetty server runs in a Map task and registers with ZooKeeper
      • 1 Hadoop job = 1 Map task = 1 Web Server = 1 WebApp = 1 Workflow
      • Proxy Router receives incoming requests, looks up ZooKeeper and redirects to the appropriate Web Server
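
The register-then-look-up flow on the Pepper design slide can be sketched as follows. This is an illustration under loud assumptions: the `Registry` class is an in-memory stand-in for ZooKeeper (no real ZooKeeper client is used), the class and method names are hypothetical, and the round-robin choice between several servers of the same workflow is an assumption, since the slide does not say how the router picks a server.

```python
class Registry:
    """In-memory stand-in for the ZooKeeper registrations."""

    def __init__(self):
        self._servers = {}  # workflow name -> list of "host:port" addresses

    def register(self, workflow, address):
        # Called by a web server once it has started inside its Map task.
        self._servers.setdefault(workflow, []).append(address)

    def lookup(self, workflow):
        return list(self._servers.get(workflow, []))


class ProxyRouter:
    """Picks the web server that should receive a request for a workflow."""

    def __init__(self, registry):
        self._registry = registry
        self._next = {}  # workflow name -> index of the next server to use

    def route(self, workflow):
        servers = self._registry.lookup(workflow)
        if not servers:
            raise LookupError(f"no web server registered for {workflow!r}")
        # Round-robin is an assumption, not something the slide specifies.
        index = self._next.get(workflow, 0) % len(servers)
        self._next[workflow] = index + 1
        return servers[index]


registry = Registry()
registry.register("news", "host1:8080")
registry.register("news", "host2:8080")
router = ProxyRouter(registry)
targets = [router.route("news") for _ in range(3)]
```

Because the router reads the registry on every request, a newly started web server becomes reachable as soon as it registers, with no router restart.
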
  10. Production numbers
      • Qualified with a simple workflow on a cluster of 3 Hadoop slaves
  11. Production numbers
      • Pacman:
          • 20+ solutions (Autos, Real Estate, Deals…)
          • 150,000 feeds
          • 250 requests/h
          • 200 million listings processed per week
      • Pepper:
          • News, Finance, NPC
          • 600,000 feeds
          • 10,000 requests/h… for now
          • Cluster of 20 Hadoop slaves (x2 colos)
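
To put the hourly and weekly production figures on a common scale, they can be converted to average per-second rates. The short calculation below assumes the load is spread evenly over time, which the slides do not claim; real traffic is likely burstier.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_WEEK = 7 * 24 * SECONDS_PER_HOUR


def per_second(requests_per_hour):
    """Average requests per second, assuming an evenly spread load."""
    return requests_per_hour / SECONDS_PER_HOUR


pacman_rps = per_second(250)        # Pacman: 250 requests/h
pepper_rps = per_second(10_000)     # Pepper: 10,000 requests/h
listings_per_second = 200_000_000 / SECONDS_PER_WEEK  # Pacman listings/week

print(f"Pacman: {pacman_rps:.3f} requests/s")
print(f"Pepper: {pepper_rps:.2f} requests/s")
print(f"Pacman listings: {listings_per_second:.0f} per second")
```

On these averages Pepper handles about 40 times more requests than Pacman, while each Pacman request carries far more records.
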
  12. Cover the whole spectrum
      • Clever switch between the 2 systems
      • Choice can be made upfront
          • ‘Sticky’ feeds go to Pacman
          • Feeds larger than 2MB go to Pacman
      • Failed feeds in Pepper are redirected to Pacman
          • OutOfMemory
          • TimeOut
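
The switching rules on this slide can be sketched as a small routing function. Everything named here is hypothetical: `choose_system` and `process` are illustrative helpers, 2MB is taken to mean 2 * 1024 * 1024 bytes (the slide does not say which convention is used), and Python's built-in `MemoryError` and `TimeoutError` stand in for the OutOfMemory and TimeOut failures.

```python
MAX_PEPPER_BYTES = 2 * 1024 * 1024  # assumed reading of the 2MB limit


def choose_system(feed_size_bytes, sticky=False):
    """Upfront choice: sticky or large feeds go to Pacman, the rest to Pepper."""
    if sticky or feed_size_bytes > MAX_PEPPER_BYTES:
        return "pacman"
    return "pepper"


def process(feed, run_pepper, run_pacman, sticky=False):
    """Process a feed (bytes) and report which system produced the result."""
    if choose_system(len(feed), sticky) == "pacman":
        return "pacman", run_pacman(feed)
    try:
        return "pepper", run_pepper(feed)
    except (MemoryError, TimeoutError):
        # Feeds that fail in Pepper are redirected to Pacman.
        return "pacman", run_pacman(feed)


def failing_pepper(feed):
    """Hypothetical Pepper run that times out."""
    raise TimeoutError("workflow took too long")


small_feed = b"<feed/>"
ok = process(small_feed, run_pepper=len, run_pacman=len)
redirected = process(small_feed, run_pepper=failing_pepper, run_pacman=len)
```

The fallback makes Pepper an optimization rather than a hard requirement: a feed that is too heavy for in-memory execution still gets processed, only more slowly.
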
  13. Examples of processing
      • Validation against schema
      • Filtering (security), image resizing
      • Send images to edge serving
      • Reformat to common model
      • Simple (in-line) enrichments
      • Categorization
      • Geocoding
      • Entity recognition
      • Clustering
  14. Conclusion
      • One common platform (deployment, reporting…)
      • Covers the whole spectrum of feeds
      • Shares the same Hadoop cluster
      • Very generic concepts
          • Pacman: workflow engine
          • Pepper: serving cloud on top of Hadoop
  15. Pepper future work
      • On-demand allocation of servers
      • Async NIO between Proxy Router and Map Web Engine to increase scalability
      • Improving distribution of requests across web servers
      • Follow Hadoop roadmap
  16. References
      • Oozie: oozie/
      • Pepper (new !!)
  17. Questions?