HW09: Hadoop Applications at Yahoo!
 

  • Slide notes (Yahoo! front page case study and Hadoop applications):
    • Load Balancing: Brooklyn (DNS) directs users to their local datacenter.
    • RSS Feeds: Feed-norm leverages Yahoo! Traffic Server to normalize, cache, and proxy site feeds for Auto Apps.
    • Image and Video Delivery: all images and thumbnails displayed on the page; a substantial part of the 20-25 billion objects YCS serves a day (stats coming).
    • Site thumbnails (Auto Apps): the Metro applications generated from web sites that are added to the left column. Metro is currently storing about 220K thumbnails replicated on both US coasts; usage is currently about 55K/second (heavily cached by YCS), growing 100% month over month.
    • Attachment Store: Mail uses YMDB (the MObStor precursor) to store 10TB of attachments.
    • Search Index: data mining to obtain the top-n user search queries (an illustrative counting sketch follows these notes).
    • Ads Optimization: ongoing refreshes to the ad ranking model for revenue optimization.
    • Content Optimization: computation of content-centric user profiles for user segmentation; model generation and refresh for content categorization; a user-centric recommendation module.
    • Machine Learning: model creation for various purposes at Yahoo!.
    • Spam Filters: co-occurrence and other data-intensive techniques for mail spam detection.
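
The Search Index note above mentions mining logs for the top-n user search queries. Below is a minimal, hypothetical Hadoop Streaming sketch of the counting stage only (ranking the top n would be a follow-on step). The script name, the assumed log layout (tab-separated with the query in the second field), and the paths in the sample invocation are illustrative assumptions, not Yahoo!'s actual pipeline.

    #!/usr/bin/env python
    # Hypothetical Hadoop Streaming sketch: count query occurrences from search
    # logs as the first stage of top-n query mining. Assumes each input line is
    # tab-separated with the query string in the second field.
    import sys

    def mapper():
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 1 and fields[1].strip():
                # emit: query <TAB> 1
                print("%s\t1" % fields[1].strip().lower())

    def reducer():
        # Streaming sorts mapper output by key, so identical queries arrive adjacent.
        current, count = None, 0
        for line in sys.stdin:
            query, value = line.rstrip("\n").split("\t", 1)
            if query != current:
                if current is not None:
                    print("%s\t%d" % (current, count))
                current, count = query, 0
            count += int(value)
        if current is not None:
            print("%s\t%d" % (current, count))

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()

Such a script might be launched with the Hadoop Streaming jar along these lines (the jar location varies by installation):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
        -input /logs/search/raw -output /tmp/query_counts \
        -mapper "querycount.py map" -reducer "querycount.py reduce" \
        -file querycount.py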

Presentation Transcript

  • Hadoop at Yahoo! Eric Baldeschwieler, VP Hadoop Software Development, Yahoo!
  • What we want you to know about Yahoo! and Hadoop
    • Yahoo! is:
      • The largest contributor to Hadoop
      • The largest tester of Hadoop
      • The largest user of Hadoop
        • 4000 node clusters!
      • A great place to do Hadoop development, do internet scale science and change the world!
    • Also:
      • We release “The Yahoo! Distribution of Hadoop”
      • We contribute all of our Hadoop work to the Apache Foundation as open source
      • We continue to aggressively invest in Hadoop
      • We do not sell Hadoop service or support!
        • We use Hadoop services to run Yahoo!
    • The majority of all patches to Hadoop have come from Yahoo!s
      • 72% of core patches
        • Core = HDFS, Map-reduce, Common
      • Metric is an underestimate
        • Some Yahoo!s use apache.org accounts
        • Patch sizes vary
    • Yahoo! is the largest employer of Hadoop Contributors, by far!
      • We contribute ALL of our Hadoop development work back to Apache!
    • We are hiring!
      • Sunnyvale, Bangalore, Beijing
      • See http://hadoop.yahoo.com
    The Largest Hadoop Contributor (chart: Yahoo!'s share of All Patches and of Core Patches)
  • The Largest Hadoop Tester
    • Every release of The Yahoo! Distribution of Hadoop goes through multiple levels of testing before we declare it stable
    • 4 Tiers of Hadoop clusters
      • Development, Testing and QA (~10% of our hardware)
        • Continuous integration / testing of new code
      • Proof of Concepts and Ad-Hoc work (~10% of our hardware)
        • Runs the latest version of Hadoop – currently 0.20 (RC6)
      • Science and Research (~60% of our hardware)
        • Runs more stable versions – currently 0.20 (RC5)
      • Production (~20% of our hardware)
        • The most stable version of Hadoop – currently 0.18.3
    • We continue to grow our Quality Engineering team
      • We are hiring!!
  • The Largest Hadoop User (chart: growth in hardware and internal Hadoop users from 2006 to now; >82 PB of disk and >25,000 nodes today; building a new datacenter)
  • Collaborations around the globe
      • The Apache Foundation – http://hadoop.apache.org
      • Hundreds of contributors and thousands of users of Hadoop!
      • see http://wiki.apache.org/hadoop/PoweredBy
    • The Yahoo! Distribution of Hadoop – Opening up our testing
      • Cloudera
    • M45 - Yahoo!’s shared research supercomputing cluster
      • Carnegie Mellon University
      • The University of California at Berkeley
      • Cornell University
      • The University of Massachusetts at Amherst
    • Partners in India
      • Computational Research Laboratories (CRL), India's Tata Group
      • Universities – IIT-B, IISc, IIIT-H, PSG
    • Open Cirrus™ - cloud computing research & education
      • The University of Illinois at Urbana-Champaign
      • Infocomm Development Authority (IDA) in Singapore
      • The Karlsruhe Institute of Technology (KIT) in Germany
      • HP, Intel
      • The Russian Academy of Sciences, Electronics & Telecomm.
      • Malaysian Institute of Microelectronic Systems
  • Usage of Hadoop
  • Why Hadoop @ Yahoo! ?
    • Massive scale
      • 500M+ unique users per month
      • Billions of “transactions” per day
      • Many petabytes of data
    • Analysis and data processing key to our business
      • Need to process all of that data in a timely manner
      • Lots of ad hoc investigation to look for patterns, run reports…
    • Need to do this cost effectively
      • Use low cost commodity hardware
      • Share resources among multiple projects
      • Try new things quickly at huge scale
      • Handle the failure of some of that hardware every day
    • The Hadoop infrastructure provides these capabilities
  • Yahoo! front page - Case Study (slide diagram, built up across three slides, with callouts: Search Index, Ads Optimization, Content Optimization, Machine Learned Spam Filters, RSS Feeds)
  • Large Applications (2008 vs. 2009)
    • Webmap
      • 2008: ~70 hours runtime, ~300 TB shuffling, ~200 TB output
      • 2009: ~73 hours runtime, ~490 TB shuffling, ~280 TB output (on +55% hardware)
    • Sort benchmarks (Jim Gray contest)
      • 2008: 1 terabyte sorted in 209 seconds on 900 nodes
      • 2009: 1 terabyte sorted in 62 seconds on 1,500 nodes; 1 petabyte sorted in 16.25 hours on 3,700 nodes
    • Largest cluster
      • 2008: 2,000 nodes, 6 PB raw disk, 16 TB of RAM, 16K cores
      • 2009: 4,000 nodes, 16 PB raw disk, 64 TB of RAM, 32K cores (40% faster too!)
  • Tremendous Impact on Productivity
    • Makes Developers & Scientists more productive
      • Research questions answered in days, not months
      • Projects move from research to production easily
      • Easy to learn! “Even our rocket scientists use it!”
    • The major factors
      • You don’t need to find new hardware to experiment
      • You can work with all your data!
      • Production and research based on same framework
      • No need for R&D to do I.T., the clusters just work
  • Example: Search Assist™
    • Database for Search Assist™ is built using Hadoop.
    • 3 years of log data
    • 20 steps of map-reduce (an illustrative chaining sketch follows this slide)
    Before Hadoop vs. after Hadoop:
    • Time: 26 days before, 20 minutes after
    • Language: C++ before, Python after
    • Development time: 2-3 weeks before, 2-3 days after
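
To make the "20 steps of map-reduce" concrete, here is a hypothetical driver that chains several Hadoop Streaming passes, with each step reading the previous step's output. The jar location, HDFS paths, step names, and per-step scripts are illustrative assumptions; this is not the actual Search Assist pipeline.

    #!/usr/bin/env python
    # Hypothetical driver: chain Hadoop Streaming passes so that each step
    # consumes the previous step's output, the general shape a multi-step log
    # pipeline like the Search Assist build could take. All names and paths
    # below are illustrative assumptions.
    import subprocess

    STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"  # varies by install

    # (step name, mapper command, reducer command)
    STEPS = [
        ("extract",   "extract_queries.py",   "dedupe_sessions.py"),
        ("aggregate", "emit_query_counts.py", "sum_counts.py"),
        ("suggest",   "build_prefixes.py",    "rank_suggestions.py"),
        # ...further steps would follow the same pattern
    ]

    def run_pipeline(input_path, work_dir):
        current_input = input_path
        for name, mapper, reducer in STEPS:
            output = "%s/%s" % (work_dir, name)
            subprocess.check_call([
                "hadoop", "jar", STREAMING_JAR,
                "-input", current_input,
                "-output", output,
                "-mapper", mapper,
                "-reducer", reducer,
                "-file", mapper,
                "-file", reducer,
            ])  # a failed step raises and aborts the pipeline
            current_input = output
        return current_input

    if __name__ == "__main__":
        print(run_pipeline("/logs/search/raw", "/tmp/search_assist_build"))

Chaining through HDFS like this keeps each step restartable; Oozie, mentioned under Futures below, was built to manage exactly this kind of multi-step workflow.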
  • Futures
  • Current Yahoo! Development
    • Hadoop
      • Backwards compatibility
        • New APIs in 0.21, Avro
      • Append, Sync, Flush
      • Improved Scheduling
        • Capacity Scheduler, Hierarchical Queues in 0.21
      • Security - Kerberos authentication
      • GridMix3 / Mumak Simulator / Better trace & log analysis
    • PIG
      • SQL and Metadata
      • Column oriented storage access layer - Zebra
      • Multi-query, lots of other optimizations
    • Oozie
      • New workflow and scheduling system
    • Quality Engineering – Many more tests
      • Stress tests, Fault injection, Functional test coverage, performance regression…
  • Questions? Eric Baldeschwieler, VP Hadoop Software Development, Yahoo! For more information: http://hadoop.apache.org/ and http://hadoop.yahoo.com/ (including job openings)