Apache Hadoop India Summit 2011 talk "Feeds Processing at Yahoo!" by Jean-Christophe Counio

1. Feeds processing at Yahoo! One Platform, One Hadoop, Two Systems. Yahoo! Inc., Apache Hadoop India Summit, 16th February 2011

3. Design

4. Contributions

5. The small feeds problem

6. Pepper

7. Requirements

8. Design

9. Production numbers

10. Cover the whole spectrum

11. Examples of processing

12. Conclusion

13. Pacman
- Started in 2006 in Bangalore
- Processes large feeds: millions of records in a few hours
- Multi-tenant
- Reliability, operability
- Uses Hadoop M/R; one record is the unit of processing
- Workflow semantics over Hadoop
  - Workflow defined by a DAG
  - Each node's result is stored in HDFS ‘Channels’
- Feeds-processing-oriented API, abstracting M/R
- High availability, cross-colo replication of HDFS data
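
The workflow model on this slide (a DAG of nodes, one record as the unit of processing, each node's result kept in a channel) can be illustrated with a minimal Python sketch. This is not Pacman's API: `run_workflow`, the step functions, and the in-memory `channels` dict (standing in for HDFS channels) are all hypothetical names chosen for the example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(dag, steps, feed_records):
    """Run every node of a workflow DAG in dependency order.

    dag   : node name -> set of upstream node names
    steps : node name -> function(record) -> record, or None to drop it
    Returns a dict mapping each node to its stored output (its 'channel').
    """
    channels = {}  # assumption: an in-memory stand-in for HDFS channels
    for node in TopologicalSorter(dag).static_order():
        upstream = dag.get(node, set())
        if upstream:
            # A node reads the channels written by its upstream nodes.
            inputs = [r for parent in sorted(upstream) for r in channels[parent]]
        else:
            # Root nodes read the raw feed.
            inputs = list(feed_records)
        # One record is the unit of processing: apply the step per record.
        results = (steps[node](record) for record in inputs)
        channels[node] = [r for r in results if r is not None]
    return channels


def validate(record):
    """Hypothetical step: drop records without a title."""
    return record if record.get("title", "").strip() else None


def reformat(record):
    """Hypothetical step: normalize the title."""
    return {"id": record["id"], "title": record["title"].strip().lower()}


dag = {"validate": set(), "reformat": {"validate"}}
feed = [{"id": 1, "title": " Car "}, {"id": 2, "title": ""}]
channels = run_workflow(dag, {"validate": validate, "reformat": reformat}, feed)
```

Keeping every node's output, rather than only the final one, mirrors the slide's point that each node result is stored, so a failed node could in principle be rerun from its inputs.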

14. Design
- Notification
- Asynchronous processing
- One job for each WF node
- State in DB
- Feed copied to the Grid
- Reporting service exposes metrics and logs

15. Contributions
- Multiple output files for a job
- Counters
- Chaining of Maps
- Led to open-sourced Oozie

16. The small feeds problem
- More and more small feeds onboarded (NPC, OMG, Green…)
- Overhead of Pacman is high (Hadoop, DB…)
- Too many small files on HDFS
- Solution: process nodes of the workflow in a web server farm
  - Lack of isolation between executions
  - Native libraries management
  - Operability issues (provisioning,…)

17. Pepper requirements
- Support all properties: News, Finance, Travel, …
- Scalable (millions of feeds a day), elastic
- Isolation, multiple native library versions
- Low overhead (<5s)
- Compatible with the Pacman API
- Reuse Pacman code/infrastructure as much as possible

18. Pepper
- Servlet model
- Synchronous in-memory execution of the workflow (very fast)
- No use of HDFS
- Shares Pacman API and infrastructure: Hadoop, reporting, deployment…
- Cloud-like qualities
  - Elastic, scalable
  - Isolation

19. Design
- Embedded Jetty server runs in a Map task and registers with ZooKeeper
- 1 Hadoop job = 1 Map task = 1 Web Server = 1 WebApp = 1 Workflow
- Proxy Router receives incoming requests, looks up ZooKeeper and redirects to the appropriate Web Server
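
The register, look-up, and redirect flow described on this slide can be sketched in a few lines of Python. Everything here is illustrative: `Registry` is an in-memory stand-in for ZooKeeper, `ProxyRouter` and the host names are hypothetical, and the round-robin choice between servers of the same workflow is an assumption, since the slides do not say how Pepper picks one.

```python
class Registry:
    """In-memory stand-in for ZooKeeper: workflow name -> server addresses."""

    def __init__(self):
        self._servers = {}

    def register(self, workflow, address):
        # Called by a web server when it starts inside its Map task.
        self._servers.setdefault(workflow, []).append(address)

    def lookup(self, workflow):
        # Return a copy so callers cannot mutate the registry.
        return list(self._servers.get(workflow, []))


class ProxyRouter:
    """Resolves an incoming request to the URL of a registered web server."""

    def __init__(self, registry):
        self._registry = registry
        self._next = {}  # per-workflow round-robin counter (assumed policy)

    def route(self, workflow, path):
        servers = self._registry.lookup(workflow)
        if not servers:
            return None  # no web server currently serves this workflow
        index = self._next.get(workflow, 0) % len(servers)
        self._next[workflow] = index + 1
        return f"http://{servers[index]}{path}"


registry = Registry()
registry.register("news", "host1:8080")  # hypothetical addresses
registry.register("news", "host2:8080")
router = ProxyRouter(registry)
first = router.route("news", "/process")
second = router.route("news", "/process")
third = router.route("news", "/process")
missing = router.route("finance", "/process")
```

Because each web server hosts exactly one workflow, the registry key can simply be the workflow name, and scaling a workflow amounts to registering more servers under the same key.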

20. Production numbers
- Qualified with a simple workflow on a cluster of 3 Hadoop slaves

21. Production numbers
- Pacman: 20+ solutions (Autos, Real Estate, Deals…)
  - 150,000 feeds
  - 250 requests/h
  - 200 million listings processed/week
- Pepper: News, Finance, NPC
  - 600,000 feeds
  - 10,000 requests/h… for now
  - Cluster of 20 Hadoop slaves (x2 colos)

22. Cover the whole spectrum
- Clever switch between the 2 systems
- Choice can be made upfront
  - ‘Sticky’ feeds go to Pacman
  - Feeds with size > 2MB go to Pacman
- Failed feeds in Pepper are redirected to Pacman
  - OutOfMemory
  - TimeOut
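
The switching rules on this slide can be written down as a small Python sketch. It is only an illustration under stated assumptions: the `Feed` type, `choose_system`, `process`, and the two runner callbacks are hypothetical, the 2MB limit is taken as 2 * 1024 * 1024 bytes, and Python's `MemoryError` and `TimeoutError` stand in for the OutOfMemory and TimeOut failures named above.

```python
from dataclasses import dataclass

# The slide's 2MB threshold; reading it as binary megabytes is an assumption.
SIZE_LIMIT_BYTES = 2 * 1024 * 1024


@dataclass
class Feed:
    name: str
    size_bytes: int
    sticky: bool = False  # 'sticky' feeds always go to Pacman


def choose_system(feed):
    """Upfront choice: sticky or large feeds go to Pacman, the rest to Pepper."""
    if feed.sticky or feed.size_bytes > SIZE_LIMIT_BYTES:
        return "pacman"
    return "pepper"


def process(feed, run_pepper, run_pacman):
    """Run the feed on the chosen system; fall back to Pacman if Pepper fails."""
    if choose_system(feed) == "pacman":
        return "pacman", run_pacman(feed)
    try:
        return "pepper", run_pepper(feed)
    except (MemoryError, TimeoutError):
        # Failed feeds in Pepper are redirected to Pacman.
        return "pacman", run_pacman(feed)


def slow_pepper(feed):
    """Hypothetical Pepper runner that always times out."""
    raise TimeoutError(feed.name)


def pacman(feed):
    """Hypothetical Pacman runner."""
    return f"processed {feed.name}"


small = Feed("headlines", 10_000)
large = Feed("listings", 3 * 1024 * 1024)
pinned = Feed("quotes", 10_000, sticky=True)
fallback = process(small, slow_pepper, pacman)
```

Only the two failure types named on the slide trigger the fallback here; any other exception would propagate to the caller.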

23. Example of processing
- Validation against schema
- Filtering (security), image resizing
- Send images to edge serving
- Reformat to common model
- Simple (in-line) enrichments
- Categorization
- Geocoding
- Entity recognition
- Clustering

24. Conclusion
- One common platform (deployment, reporting…)
- Covers the whole spectrum of feeds
- Shares the same Hadoop cluster
- Very generic concepts
  - Pacman: workflow engine
  - Pepper: serving cloud on top of Hadoop

25. Pepper future work
- On-demand allocation of servers
- Async NIO between Proxy Router & Map Web Engine to increase scalability
- Improving distribution of requests across web servers
- Follow Hadoop roadmap

26. References
- Oozie
  - http://yahoo.github.com/oozie/
  - http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-oozie/
- Pepper
  - http://yahoo.github.com/pepper/ (new !!)
  - http://www.computer.org/portal/web/csdl/doi/10.1109/CloudCom.2010.39
  - http://salsahpc.indiana.edu/CloudCom2010/slides/PDF/Pepper%20An%20Elastic%20Web%20Server%20Farm%20for%20Cloud%20based%20on%20Hadoop.pdf

27. Questions?
