Hadoop at Yahoo! (Hadoop World NY 2009)
Eric Baldeschwieler, VP Hadoop Software Development, Yahoo!
What we want you to know about Yahoo! and Hadoop

Yahoo! is:
- The largest contributor to Hadoop
- The largest tester of Hadoop
- The largest user of Hadoop: 4000-node clusters!
- A great place to do Hadoop development, do internet-scale science, and change the world!

Also:
- We release “The Yahoo! Distribution of Hadoop”
- We contribute all of our Hadoop work to the Apache Foundation as open source
- We continue to aggressively invest in Hadoop
- We do not sell Hadoop service or support! We use Hadoop services to run Yahoo!
The Largest Hadoop Contributor

- The majority of all patches to Hadoop have come from Yahoo!s: 72% of core patches (Core = HDFS, Map-reduce, Common)
- This metric is an underestimate: some Yahoo!s use apache.org accounts, and patch sizes vary
- Yahoo! is the largest employer of Hadoop contributors, by far!
- We contribute ALL of our Hadoop development work back to Apache!
- We are hiring! Sunnyvale, Bangalore, Beijing. See http://hadoop.yahoo.com

[Charts: All Patches and Core Patches, by contributor]
The Largest Hadoop Tester

- Every release of The Yahoo! Distribution of Hadoop goes through multiple levels of testing before we declare it stable
- 4 tiers of Hadoop clusters:
  - Development, Testing and QA (~10% of our hardware): continuous integration / testing of new code
  - Proof of Concepts and Ad-Hoc work (~10% of our hardware): runs the latest version of Hadoop, currently 0.20 (RC6)
  - Science and Research (~60% of our hardware): runs more stable versions, currently 0.20 (RC5)
  - Production (~20% of our hardware): the most stable version of Hadoop, currently 0.18.3
- We continue to grow our Quality Engineering team. We are hiring!!
The Largest Hadoop User

[Charts: growth from 2006 to now in hardware (disk: >82 PB today; nodes: >25,000 today; building a new datacenter) and in internal Hadoop users]
Collaborations around the globe

- The Apache Foundation: http://hadoop.apache.org
  - Hundreds of contributors and thousands of users of Hadoop! See http://wiki.apache.org/hadoop/PoweredBy
- The Yahoo! Distribution of Hadoop: opening up our testing
  - Cloudera
- M45, Yahoo!’s shared research supercomputing cluster:
  - Carnegie Mellon University
  - The University of California at Berkeley
  - Cornell University
  - The University of Massachusetts at Amherst
- Partners in India:
  - Computational Research Laboratories (CRL), India's Tata Group
  - Universities: IIT-B, IISc, IIIT-H, PSG
- Open Cirrus™, cloud computing research & education:
  - The University of Illinois at Urbana-Champaign
  - Infocomm Development Authority (IDA) in Singapore
  - The Karlsruhe Institute of Technology (KIT) in Germany
  - HP, Intel
  - The Russian Academy of Sciences, Electronics & Telecomm.
  - Malaysian Institute of Microelectronic Systems
Usage of Hadoop
Why Hadoop @ Yahoo!?

- Massive scale: 500M+ unique users per month, billions of “transactions” per day, many petabytes of data
- Analysis and data processing are key to our business
- We need to process all of that data in a timely manner: lots of ad hoc investigation to look for patterns, run reports…
- We need to do this cost effectively:
  - Use low-cost commodity hardware
  - Share resources among multiple projects
  - Try new things quickly at huge scale
  - Handle the failure of some of that hardware every day
- The Hadoop infrastructure provides these capabilities (see the sketch below)
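To make those capabilities concrete, here is a minimal sketch of the programming model Hadoop exposes, written as a Hadoop Streaming word count in Python. It is an illustration only, not one of Yahoo!'s actual jobs; the file names and paths are hypothetical.

    # mapper.py: emit "word<TAB>1" for every word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

    # reducer.py: sum the counts for each word. Hadoop's shuffle
    # delivers mapper output sorted by key, so all lines for a
    # given word arrive contiguously.
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

Submitted through the streaming jar bundled with the distribution (roughly: hadoop jar hadoop-streaming.jar -input logs/ -output counts/ -mapper mapper.py -reducer reducer.py, with hypothetical paths), the framework supplies the partitioning, sorting, scheduling, and re-execution on failed nodes described above.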
Yahoo! front page - Case Study

[Diagram, built up over three slides: the Yahoo! front page, connected to Search Index, Ads Optimization, Content Optimization, Machine Learned Spam filters, and RSS Feeds]
Large Applications: 2008 vs. 2009

Webmap:
- 2008: ~70 hours runtime, ~300 TB shuffling, ~200 TB output
- 2009: ~73 hours runtime, ~490 TB shuffling, ~280 TB output (+55% hardware)

Sort benchmarks (Jim Gray contest):
- 2008: 1 terabyte sorted in 209 seconds on 900 nodes
- 2009: 1 terabyte sorted in 62 seconds on 1500 nodes; 1 petabyte sorted in 16.25 hours on 3700 nodes

Largest cluster:
- 2008: 2000 nodes, 6 PB raw disk, 16 TB of RAM, 16K cores
- 2009: 4000 nodes, 16 PB raw disk, 64 TB of RAM, 32K cores (40% faster too!)
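A note on why sorting is the natural benchmark here: Hadoop's shuffle phase already sorts map output by key, so an identity mapper paired with an identity reducer produces output sorted within each reduce partition. A purely illustrative streaming analogue in Python:

    # identity.py: used as both mapper and reducer, this passes
    # records through untouched; the partition/sort/merge work that
    # actually orders the data happens in Hadoop's shuffle.
    import sys

    for line in sys.stdin:
        sys.stdout.write(line)

What TeraSort adds is a total order across reducers: it samples the input keys and partitions by key range, so reducer i receives only keys smaller than those sent to reducer i+1 and the concatenated outputs form one fully sorted dataset.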
Tremendous Impact on Productivity

- Makes developers & scientists more productive:
  - Research questions answered in days, not months
  - Projects move from research to production easily
  - Easy to learn! “Even our rocket scientists use it!”
- The major factors:
  - You don’t need to find new hardware to experiment
  - You can work with all your data!
  - Production and research are based on the same framework
  - No need for R&D to do I.T.; the clusters just work
Example: Search Assist™

The database for Search Assist™ is built using Hadoop: 3 years of log data, a 20-step map-reduce pipeline.

                      Before Hadoop    After Hadoop
    Time              26 days          20 minutes
    Language          C++              Python
    Development time  2-3 weeks        2-3 days
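A 20-step pipeline like this is a chain of MapReduce jobs, each consuming the previous job's output directory. As a hedged sketch of what one step might look like in Python via Hadoop Streaming (the log format, field layout, and ranking rule are invented for illustration, not Yahoo!'s actual code):

    # prefix_map.py: hypothetical pipeline step. Reads
    # "timestamp<TAB>user<TAB>query" log lines and emits every prefix
    # of the query paired with the full query, so the reducer can
    # rank candidate completions per prefix.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed records
        query = fields[2].strip().lower()
        for i in range(1, len(query)):
            print("%s\t%s" % (query[:i], query))

    # prefix_reduce.py: for each prefix (keys arrive grouped and
    # sorted), count completions and keep the three most frequent.
    import sys
    from collections import Counter

    def flush(prefix, counts):
        for query, n in counts.most_common(3):
            print("%s\t%s\t%d" % (prefix, query, n))

    current, counts = None, Counter()
    for line in sys.stdin:
        prefix, query = line.rstrip("\n").split("\t", 1)
        if prefix != current:
            if current is not None:
                flush(current, counts)
            current, counts = prefix, Counter()
        counts[query] += 1
    if current is not None:
        flush(current, counts)

Each step's output directory becomes the next step's input; the jump from 26 days to 20 minutes comes largely from every step running in parallel across the cluster instead of on a single machine.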
Futures
Current Yahoo! Development

Hadoop:
- Backwards compatibility: new APIs in 0.21, Avro
- Append, Sync, Flush
- Improved scheduling: Capacity Scheduler, hierarchical queues in 0.21
- Security: Kerberos authentication
- GridMix3 / Mumak simulator / better trace & log analysis

Pig:
- SQL and metadata
- Column-oriented storage access layer: Zebra
- Multi-query, lots of other optimizations

Oozie:
- New workflow and scheduling system

Quality Engineering: many more tests
- Stress tests, fault injection, functional test coverage, performance regression…
Questions?

Eric Baldeschwieler, VP Hadoop Software Development, Yahoo!

For more information:
- http://hadoop.apache.org/
- http://hadoop.yahoo.com/ (including job openings)


Editor's Notes

  • #10-#12 (Yahoo! front page case study):
    - Load Balancing: Brooklyn (DNS) directs users to their local datacenter
    - RSS Feeds: Feed-norm leverages Yahoo Traffic Server to normalize, cache, and proxy site feeds for Auto Apps
    - Image and Video Delivery: all images and thumbnails displayed on the page; a substantial part of the 20-25 billion objects YCS serves a day (stats coming)
    - Site thumbnails (Auto Apps): the Metro applications generated from web sites that are added to the left column; Metro is currently storing about 220K thumbnails replicated on both US coasts; usage is currently about 55K/second (heavily cached by YCS), growing 100% month over month
    - Attachment Store: Mail uses YMDB (MObStor precursor) to store 10 TB of attachments
    - Search Index: data mining to obtain the top-n user search queries
    - Ads Optimization: ongoing refreshes of the ad ranking model for revenue optimization
    - Content Optimization: computation of content-centric user profiles for user segmentation; model refreshes for content categorization; a user-centric recommendation module
    - Machine Learning: model creation for various purposes at Yahoo
    - Spam Filters: utilizing co-occurrence and other data-intensive techniques for mail spam detection