Hadoop at Yahoo! (Hadoop World NY 2009)
Eric Baldeschwieler, VP Hadoop Software Development, Yahoo!
What we want you to know about Yahoo! and Hadoop

Yahoo! is:
- The largest contributor to Hadoop
- The largest tester of Hadoop
- The largest user of Hadoop: 4000-node clusters!
- A great place to do Hadoop development, do internet-scale science, and change the world!

Also:
- We release “The Yahoo! Distribution of Hadoop”
- We contribute all of our Hadoop work to the Apache Foundation as open source
- We continue to aggressively invest in Hadoop
- We do not sell Hadoop service or support! We use Hadoop services to run Yahoo!
The Largest Hadoop Contributor

- The majority of all patches to Hadoop have come from Yahoo!s: 72% of core patches (Core = HDFS, Map-reduce, Common)
- This metric is an underestimate: some Yahoo!s use apache.org accounts, and patch sizes vary
- Yahoo! is the largest employer of Hadoop contributors, by far!
- We contribute ALL of our Hadoop development work back to Apache!
- We are hiring! Sunnyvale, Bangalore, Beijing. See http://hadoop.yahoo.com

[Charts: All Patches and Core Patches, by contributor]
The Largest Hadoop Tester

- Every release of The Yahoo! Distribution of Hadoop goes through multiple levels of testing before we declare it stable
- 4 tiers of Hadoop clusters:
  - Development, Testing and QA (~10% of our hardware): continuous integration / testing of new code
  - Proof of Concepts and Ad-Hoc work (~10% of our hardware): runs the latest version of Hadoop, currently 0.20 (RC6)
  - Science and Research (~60% of our hardware): runs more stable versions, currently 0.20 (RC5)
  - Production (~20% of our hardware): the most stable version of Hadoop, currently 0.18.3
- We continue to grow our Quality Engineering team. We are hiring!!
The Largest Hadoop User

[Charts: growth from 2006 to now in hardware (disk: >82 PB today; nodes: >25,000 today; building a new datacenter) and in internal Hadoop users]
Collaborations around the globe

- The Apache Foundation: http://hadoop.apache.org
  - Hundreds of contributors and thousands of users of Hadoop! See http://wiki.apache.org/hadoop/PoweredBy
- The Yahoo! Distribution of Hadoop: opening up our testing
  - Cloudera
- M45, Yahoo!’s shared research supercomputing cluster:
  - Carnegie Mellon University
  - The University of California at Berkeley
  - Cornell University
  - The University of Massachusetts at Amherst
- Partners in India:
  - Computational Research Laboratories (CRL), India's Tata Group
  - Universities: IIT-B, IISc, IIIT-H, PSG
- Open Cirrus™, cloud computing research & education:
  - The University of Illinois at Urbana-Champaign
  - Infocomm Development Authority (IDA) in Singapore
  - The Karlsruhe Institute of Technology (KIT) in Germany
  - HP, Intel
  - The Russian Academy of Sciences, Electronics & Telecomm.
  - Malaysian Institute of Microelectronic Systems
Usage of Hadoop
Why Hadoop @ Yahoo!?

- Massive scale: 500M+ unique users per month, billions of “transactions” per day, many petabytes of data
- Analysis and data processing are key to our business
- We need to process all of that data in a timely manner: lots of ad hoc investigation to look for patterns, run reports…
- We need to do this cost effectively:
  - Use low-cost commodity hardware
  - Share resources among multiple projects
  - Try new things quickly at huge scale
  - Handle the failure of some of that hardware every day
- The Hadoop infrastructure provides these capabilities (see the sketch below)
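To make those capabilities concrete, here is a minimal sketch of the programming model Hadoop exposes, written as a Hadoop Streaming word count in Python. It is an illustration only, not one of Yahoo!'s actual jobs; the file names and paths are hypothetical.

    # mapper.py: emit "word<TAB>1" for every word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

    # reducer.py: sum the counts for each word. Hadoop's shuffle
    # delivers mapper output sorted by key, so all lines for a
    # given word arrive contiguously.
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

Submitted through the streaming jar bundled with the distribution (roughly: hadoop jar hadoop-streaming.jar -input logs/ -output counts/ -mapper mapper.py -reducer reducer.py, with hypothetical paths), the framework supplies the partitioning, sorting, scheduling, and re-execution on failed nodes described above.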
Yahoo! front page - Case Study

[Diagram, built up over three slides: the Yahoo! front page, connected to Search Index, Ads Optimization, Content Optimization, Machine Learned Spam filters, and RSS Feeds]
Large Applications: 2008 vs. 2009

Webmap:
- 2008: ~70 hours runtime, ~300 TB shuffling, ~200 TB output
- 2009: ~73 hours runtime, ~490 TB shuffling, ~280 TB output (+55% hardware)

Sort benchmarks (Jim Gray contest):
- 2008: 1 terabyte sorted in 209 seconds on 900 nodes
- 2009: 1 terabyte sorted in 62 seconds on 1500 nodes; 1 petabyte sorted in 16.25 hours on 3700 nodes

Largest cluster:
- 2008: 2000 nodes, 6 PB raw disk, 16 TB of RAM, 16K cores
- 2009: 4000 nodes, 16 PB raw disk, 64 TB of RAM, 32K cores (40% faster too!)
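A note on why sorting is the natural benchmark here: Hadoop's shuffle phase already sorts map output by key, so an identity mapper paired with an identity reducer produces output sorted within each reduce partition. A purely illustrative streaming analogue in Python:

    # identity.py: used as both mapper and reducer, this passes
    # records through untouched; the partition/sort/merge work that
    # actually orders the data happens in Hadoop's shuffle.
    import sys

    for line in sys.stdin:
        sys.stdout.write(line)

What TeraSort adds is a total order across reducers: it samples the input keys and partitions by key range, so reducer i receives only keys smaller than those sent to reducer i+1 and the concatenated outputs form one fully sorted dataset.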
Tremendous Impact on Productivity

- Makes developers & scientists more productive:
  - Research questions answered in days, not months
  - Projects move from research to production easily
  - Easy to learn! “Even our rocket scientists use it!”
- The major factors:
  - You don’t need to find new hardware to experiment
  - You can work with all your data!
  - Production and research are based on the same framework
  - No need for R&D to do I.T.; the clusters just work
Example: Search Assist™

The database for Search Assist™ is built using Hadoop: 3 years of log data, a 20-step map-reduce pipeline.

                      Before Hadoop    After Hadoop
    Time              26 days          20 minutes
    Language          C++              Python
    Development time  2-3 weeks        2-3 days
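A 20-step pipeline like this is a chain of MapReduce jobs, each consuming the previous job's output directory. As a hedged sketch of what one step might look like in Python via Hadoop Streaming (the log format, field layout, and ranking rule are invented for illustration, not Yahoo!'s actual code):

    # prefix_map.py: hypothetical pipeline step. Reads
    # "timestamp<TAB>user<TAB>query" log lines and emits every prefix
    # of the query paired with the full query, so the reducer can
    # rank candidate completions per prefix.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed records
        query = fields[2].strip().lower()
        for i in range(1, len(query)):
            print("%s\t%s" % (query[:i], query))

    # prefix_reduce.py: for each prefix (keys arrive grouped and
    # sorted), count completions and keep the three most frequent.
    import sys
    from collections import Counter

    def flush(prefix, counts):
        for query, n in counts.most_common(3):
            print("%s\t%s\t%d" % (prefix, query, n))

    current, counts = None, Counter()
    for line in sys.stdin:
        prefix, query = line.rstrip("\n").split("\t", 1)
        if prefix != current:
            if current is not None:
                flush(current, counts)
            current, counts = prefix, Counter()
        counts[query] += 1
    if current is not None:
        flush(current, counts)

Each step's output directory becomes the next step's input; the jump from 26 days to 20 minutes comes largely from every step running in parallel across the cluster instead of on a single machine.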
Futures
Current Yahoo! Development

Hadoop:
- Backwards compatibility: new APIs in 0.21, Avro
- Append, Sync, Flush
- Improved scheduling: Capacity Scheduler, hierarchical queues in 0.21
- Security: Kerberos authentication
- GridMix3 / Mumak simulator / better trace & log analysis

Pig:
- SQL and metadata
- Column-oriented storage access layer: Zebra
- Multi-query, lots of other optimizations

Oozie:
- New workflow and scheduling system

Quality Engineering: many more tests
- Stress tests, fault injection, functional test coverage, performance regression…
Questions?

Eric Baldeschwieler, VP Hadoop Software Development, Yahoo!

For more information:
- http://hadoop.apache.org/
- http://hadoop.yahoo.com/ (including job openings)


Editor's Notes

  • #10-#12 (Yahoo! front page case study):
    - Load Balancing: Brooklyn (DNS) directs users to their local datacenter
    - RSS Feeds: Feed-norm leverages Yahoo Traffic Server to normalize, cache, and proxy site feeds for Auto Apps
    - Image and Video Delivery: all images and thumbnails displayed on the page; a substantial part of the 20-25 billion objects YCS serves a day (stats coming)
    - Site thumbnails (Auto Apps): the Metro applications generated from web sites that are added to the left column; Metro is currently storing about 220K thumbnails replicated on both US coasts; usage is currently about 55K/second (heavily cached by YCS), growing 100% month over month
    - Attachment Store: Mail uses YMDB (MObStor precursor) to store 10 TB of attachments
    - Search Index: data mining to obtain the top-n user search queries
    - Ads Optimization: ongoing refreshes of the ad ranking model for revenue optimization
    - Content Optimization: computation of content-centric user profiles for user segmentation; model refreshes for content categorization; a user-centric recommendation module
    - Machine Learning: model creation for various purposes at Yahoo
    - Spam Filters: utilizing co-occurrence and other data-intensive techniques for mail spam detection