Apache Hadoop India Summit 2011 talk "Feeds Processing at Yahoo!" by Jean-Christophe Counio
Transcript

  • 1. Feeds processing at Yahoo!
    One Platform, One Hadoop, Two Systems
    Yahoo! Inc.
    Apache Hadoop India Summit
    16th February 2011
  • 2. Agenda
  • 13. Pacman
    Started in 2006 in Bangalore
    Processes large feeds: millions of records in a few hours
    Multi-tenant
    Reliability, operability
    Uses Hadoop M/R; one record is the unit of processing
    Workflow semantics over Hadoop
    Workflow defined by a DAG (a minimal sketch follows this slide)
    Each node's result is stored in HDFS ‘Channels’
    Feeds-processing-oriented API, abstracting M/R
    High availability, cross-colo replication of HDFS data
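    As a rough illustration only (Pacman's API is not public), here is how one workflow node could look when expressed directly as a Hadoop map task: each map() call handles exactly one record, and the job's output directory plays the role of the node's HDFS channel. The class name and the validation logic are placeholders, not Pacman code.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical workflow node expressed as a map-only Hadoop job:
    // one record per map() call, output directory = this node's HDFS channel.
    public class ValidateNodeMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            // One record is the unit of processing.
            if (isValid(record.toString())) {
                context.write(NullWritable.get(), record);   // lands in the node's channel
            } else {
                context.getCounter("pacman", "invalid_records").increment(1);
            }
        }

        private boolean isValid(String record) {
            return !record.trim().isEmpty();                 // placeholder validation
        }
    }
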
  • 14. Design
    Notification
    Asynchronous processing
    One job for each WF node
    State kept in a DB
    Feed copied onto the Grid
    Reporting service exposes metrics and logs
  • 15. Contributions
    Multiple output files for a job
    Counters
    Chaining of maps (a usage sketch follows this slide)
    Led to the open-sourced Oozie
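    These contributions all have counterparts in today's Hadoop. Below is a minimal, illustrative driver and pair of mappers using the modern mapreduce API (not the original 2006-era patches): ChainMapper chains two maps in one task, counters track record statistics, and MultipleOutputs splits one job's output into named files. Class names and the accept/reject rule are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class FeedJob {

        // First map of the chain: count every record seen (counters).
        public static class ParseMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws java.io.IOException, InterruptedException {
                ctx.getCounter("feeds", "records_seen").increment(1);
                ctx.write(key, value);
            }
        }

        // Second map of the chain: split records into named outputs.
        public static class RouteMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
            private MultipleOutputs<LongWritable, Text> mos;

            @Override
            protected void setup(Context ctx) { mos = new MultipleOutputs<>(ctx); }

            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws java.io.IOException, InterruptedException {
                String name = value.toString().trim().isEmpty() ? "rejected" : "accepted";
                mos.write(name, key, value);                 // multiple output files
            }

            @Override
            protected void cleanup(Context ctx)
                    throws java.io.IOException, InterruptedException { mos.close(); }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "feed-node");
            job.setJarByClass(FeedJob.class);
            job.setNumReduceTasks(0);                        // map-only, like a Pacman node

            // Chaining of maps: both mappers run inside a single map task.
            ChainMapper.addMapper(job, ParseMapper.class, LongWritable.class, Text.class,
                    LongWritable.class, Text.class, new Configuration(false));
            ChainMapper.addMapper(job, RouteMapper.class, LongWritable.class, Text.class,
                    LongWritable.class, Text.class, new Configuration(false));

            // Multiple output files for one job.
            MultipleOutputs.addNamedOutput(job, "accepted", TextOutputFormat.class,
                    LongWritable.class, Text.class);
            MultipleOutputs.addNamedOutput(job, "rejected", TextOutputFormat.class,
                    LongWritable.class, Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
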
  • 16. The small feeds problem
    More and more small feeds onboarded (NPC, OMG, Green…)
    Overhead of Pacman is high (Hadoop, DB…)
    Too many small files on HDFS
    Solution: process workflow nodes in a web server farm
    Lack of isolation
    Between executions
    Native libraries management
    Operability issues (provisioning, …)
  • 17. Pepper requirements
    Be able to support all properties:
    News, Finance, Travel, …
    Scalable (millions of feeds a day), elastic
    Isolation, multiple native library versions
    Low overhead (<5s)
    Compatible with the Pacman API
    Reuse Pacman code/infrastructure as much as possible
  • 18. Pepper
    Servlet model (a minimal sketch follows this slide)
    Synchronous in-memory execution of the workflow (very fast)
    No use of HDFS
    Shares the Pacman API and infrastructure:
    Hadoop
    Reporting, deployment…
    Cloud-like qualities:
    Elastic, scalable
    Isolation
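    A minimal sketch of the servlet model described above, assuming a hypothetical WorkflowRunner that executes the whole workflow in memory within the request thread; this is only the shape of the idea, not Pepper's actual code.

    import java.io.IOException;
    import java.util.stream.Collectors;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class FeedWorkflowServlet extends HttpServlet {

        // Illustrative stand-in for Pepper's in-memory workflow execution.
        static class WorkflowRunner {
            String runInMemory(String feed) {
                return feed;   // real code would run the DAG of nodes over the records
            }
        }

        private final WorkflowRunner runner = new WorkflowRunner();

        @Override
        protected void doPost(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            String feed = req.getReader().lines().collect(Collectors.joining("\n"));
            String result = runner.runInMemory(feed);        // synchronous, no HDFS
            resp.setContentType("application/xml");
            resp.getWriter().write(result);
        }
    }
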
  • 19. Design
    An embedded Jetty server runs in a Map task and registers with ZooKeeper
    1 Hadoop job = 1 Map task = 1 web server = 1 WebApp = 1 workflow
    A Proxy Router receives incoming requests, looks up ZooKeeper and redirects to the appropriate web server (a minimal sketch follows this slide)
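    An illustrative sketch of this design using Jetty and ZooKeeper: the map task starts an embedded Jetty server hosting the servlet sketched under the previous slide and advertises its address as an ephemeral ZooKeeper node for the Proxy Router to look up. The hostnames, znode paths and one-workflow-per-server layout are assumptions, not Pepper's actual configuration.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.server.ServerConnector;
    import org.eclipse.jetty.servlet.ServletContextHandler;
    import org.eclipse.jetty.servlet.ServletHolder;

    public class MapWebEngine {

        public static void main(String[] args) throws Exception {
            // 1. Embedded Jetty server inside the (long-running) map task.
            Server server = new Server();
            ServerConnector connector = new ServerConnector(server);
            connector.setPort(0);                            // let the OS pick a free port
            server.addConnector(connector);

            ServletContextHandler ctx = new ServletContextHandler();
            ctx.setContextPath("/");
            ctx.addServlet(new ServletHolder(new FeedWorkflowServlet()), "/feeds/*");
            server.setHandler(ctx);
            server.start();

            // 2. Advertise host:port as an ephemeral znode (assumed path layout;
            //    parent znodes assumed to exist). The znode disappears automatically
            //    if the map task dies, so the router stops sending traffic here.
            String address = java.net.InetAddress.getLocalHost().getHostName()
                    + ":" + connector.getLocalPort();
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> { });
            zk.create("/pepper/servers/news-workflow", address.getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            // 3. Serve until the map task is torn down; the Proxy Router looks up
            //    the znode and redirects matching requests to this web server.
            server.join();
        }
    }
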
  • 20. Production numbers
    Qualified with a simple workflow on a 3-slave Hadoop cluster
  • 21. Production numbers
    Pacman:
    20+ solutions (Autos, Real Estate, Deals…)
    150,000 feeds
    250 requests/h
    200 million listings processed/week
    Pepper:
    News, Finance, NPC
    600,000 feeds
    10,000 requests/h… for now
    20-slave Hadoop cluster (x2 colos)
  • 22. Cover the whole spectrum
    Clever switch between the two systems (a minimal routing sketch follows this slide)
    The choice can be made upfront:
    ‘Sticky’ feeds go to Pacman
    Feeds > 2MB go to Pacman
    Failed feeds in Pepper are redirected to Pacman:
    OutOfMemory
    Timeout
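    A small sketch of the routing rule as stated on this slide; the feed parameters and the two submit functions are illustrative stand-ins for the real Pacman and Pepper clients.

    import java.util.function.Function;

    public class FeedRouter {

        static final long PACMAN_SIZE_THRESHOLD = 2L * 1024 * 1024;   // 2 MB

        enum Target { PACMAN, PEPPER }

        // Upfront choice, as on the slide: sticky or large feeds go to Pacman.
        static Target chooseUpfront(long sizeBytes, boolean sticky) {
            return (sticky || sizeBytes > PACMAN_SIZE_THRESHOLD) ? Target.PACMAN : Target.PEPPER;
        }

        // Submission with fallback: a feed that fails in Pepper (e.g. OutOfMemory
        // or a timeout) is redirected to Pacman.
        static String submit(String feedId, long sizeBytes, boolean sticky,
                             Function<String, String> pepper, Function<String, String> pacman) {
            if (chooseUpfront(sizeBytes, sticky) == Target.PACMAN) {
                return pacman.apply(feedId);
            }
            try {
                return pepper.apply(feedId);                 // fast, synchronous path
            } catch (RuntimeException | OutOfMemoryError e) {
                return pacman.apply(feedId);                 // redirect the failed feed
            }
        }
    }
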
  • 23. Example of processing
    Validation against schema
    Filtering (security), image resizing
    Send images to edge serving
    Reformat to common model
    Simple (in-line) enrichments (a pipeline sketch follows this slide):
    Categorization
    Geocoding
    Entity Recognition
    Clustering
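    Purely for illustration, the per-record steps above can be pictured as a chain of transforms; every method body here is a placeholder for the real node (schema validation, security filtering, enrichment services).

    public class ListingPipeline {

        static String validateAgainstSchema(String record) { return record; }  // schema check
        static String securityFilter(String record)        { return record; }  // strip unsafe content
        static String reformatToCommonModel(String record) { return record; }  // canonical listing model
        static String categorize(String record)            { return record; }  // in-line enrichment
        static String geocode(String record)               { return record; }  // in-line enrichment

        // One record flows through the whole chain; image resizing and the push
        // to edge serving are side effects omitted from this sketch.
        static String process(String rawRecord) {
            return geocode(categorize(reformatToCommonModel(
                    securityFilter(validateAgainstSchema(rawRecord)))));
        }
    }
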
  • 24. Conclusion
    One common platform (deployment, reporting…)
    Covers the whole spectrum of feeds
    Both share the same Hadoop cluster
    Very generic concepts:
    Pacman: workflow engine
    Pepper: serving cloud on top of Hadoop
  • 25. Pepper future work
    On-demand allocation of servers
    Async NIO between Proxy Router & Map Web Engine to increase scalability
    Improving distribution of requests across web servers
    Follow Hadoop roadmap
  • 26. References
    Oozie
    http://yahoo.github.com/oozie/
    http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-oozie/
    Pepper
    http://yahoo.github.com/pepper/ (new!)
    http://www.computer.org/portal/web/csdl/doi/10.1109/CloudCom.2010.39
    http://salsahpc.indiana.edu/CloudCom2010/slides/PDF/Pepper%20An%20Elastic%20Web%20Server%20Farm%20for%20Cloud%20based%20on%20Hadoop.pdf
  • 27. Questions?