Hadoop at Yahoo! Eric Baldeschwieler VP Hadoop Software Development [email_address]
What we want you to know about Yahoo! and Hadoop Yahoo! is: The largest contributor to Hadoop The largest tester of Hadoop The largest user of Hadoop 4000 node clusters! A great place to do Hadoop development, do internet scale science and change the world! Also: We release “The Yahoo! Distribution of Hadoop” We contribute all of our Hadoop work to the Apache Foundation as open source We continue to aggressively invest in Hadoop We do not sell Hadoop service or support!  We use Hadoop services to run Yahoo!
The Largest Hadoop Contributor The majority of all patches to Hadoop have come from Yahoo!s: 72% of core patches (Core = HDFS, Map-Reduce, Common) The metric is an underestimate: some Yahoo!s use apache.org accounts, and patch sizes vary Yahoo! is the largest employer of Hadoop contributors, by far! We contribute ALL of our Hadoop development work back to Apache! We are hiring! (interns & full-time) Sunnyvale, Bangalore, Beijing See http://hadoop.yahoo.com [Charts: All Patches, Core Patches]
The Largest Hadoop Tester Every release of  The Yahoo! Distribution of Hadoop  goes through multiple levels of testing before we declare it stable 4 Tiers of Hadoop clusters Development, Testing and QA  (~10% of our hardware) Continuous integration / testing of new code Proof of Concepts and Ad-Hoc work  (~10% of our hardware) Runs the latest version  of Hadoop –  currently 0.20 (RC6) Science and Research  (~60% of our hardware) Runs more stable versions –  currently 0.20 (RC5) Production  (~20% of our hardware) The most stable version of Hadoop –  currently 0.18.3 We continue to grow our Quality Engineering team We are hiring!!
The Largest Hadoop User [Charts: growth in hardware and internal Hadoop users, 2006 to now] Building a new datacenter; >82 PB of disk today; >25,000 nodes today
Collaborations around the globe The Apache Foundation – http://hadoop.apache.org Hundreds of contributors and thousands of users of Hadoop! see http://wiki.apache.org/hadoop/PoweredBy The Yahoo! Distribution of Hadoop – Opening up our testing Hundreds of git followers M45 - Yahoo!’s shared research supercomputing cluster Carnegie Mellon University The University of California at Berkeley Cornell University The University of Massachusetts at Amherst Partners in India Computational Research Laboratories (CRL), India's Tata Group Universities – IIT-B, IISc, IIIT-H, PSG Open Cirrus™ - cloud computing research & education The University of Illinois at Urbana-Champaign Infocomm Development Authority (IDA) in Singapore The Karlsruhe Institute of Technology (KIT) in Germany HP, Intel The Russian Academy of Sciences, Electronics & Telecomm. Malaysian Institute of Microelectronic Systems
Usage of Hadoop
Why Hadoop @ Yahoo! ? Massive scale 500M+ unique users per month Billions of “transactions” per day Many petabytes of data Analysis and data processing key to our business Need to process all of that data in a timely manner Lots of ad hoc investigation to look for patterns, run reports… Need to do this cost effectively Use low cost commodity hardware Share resources among multiple projects Try new things quickly at huge scale Handle the failure of some of that hardware every day The Hadoop infrastructure provides these capabilities
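To make the programming model behind these capabilities concrete, here is a minimal sketch of a job in the classic org.apache.hadoop.mapred API used by the 0.18–0.20 releases discussed above. The tab-separated log format, field layout, and class names are illustrative assumptions, not Yahoo! code.

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Hypothetical job: count log events per user ID, one input line per event.
public class EventCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    private final static LongWritable ONE = new LongWritable(1);
    private final Text user = new Text();

    public void map(LongWritable key, Text line,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      // Assume the first tab-separated field is a user ID.
      String[] fields = line.toString().split("\t");
      if (fields.length > 0) {
        user.set(fields[0]);
        out.collect(user, ONE);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text user, Iterator<LongWritable> counts,
                       OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().get();
      }
      out.collect(user, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(EventCount.class);
    conf.setJobName("event-count");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);   // combiner cuts shuffle volume
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
```

The framework handles splitting the input across thousands of machines, rerunning tasks that land on failed hardware, and sorting the map output by key before the reduce, which is what lets the same small program scale from a laptop to the clusters described here.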
Yahoo! front page - Case Study [Diagram slides, built up over three steps: Hadoop feeds the front page's Search Index, Ads Optimization, Content Optimization, Machine-Learned Spam Filters, and RSS Feeds]
Large Applications (2008 vs. 2009)
Webmap: ~70 hours runtime, ~300 TB shuffling, ~200 TB output in 2008; ~73 hours runtime, ~490 TB shuffling, ~280 TB output in 2009, on +55% hardware
Sort benchmarks (Jim Gray contest): 1 terabyte sorted in 209 seconds on 900 nodes in 2008; in 2009, 1 terabyte sorted in 62 seconds on 1,500 nodes and 1 petabyte sorted in 16.25 hours on 3,700 nodes
Largest cluster: 2,000 nodes, 6 PB raw disk, 16 TB of RAM, 16K cores in 2008; 4,000 nodes, 16 PB raw disk, 64 TB of RAM, 32K cores in 2009 (40% faster too!)
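A quick back-of-the-envelope check (our arithmetic, not from the slide) puts those sort numbers in per-node terms: 1 TB in 62 seconds is roughly 16 GB/s of aggregate sort throughput, or about 11 MB/s per node across 1,500 nodes; the petabyte run works out to about 17 GB/s aggregate (10^15 bytes over 58,500 seconds), or roughly 4.6 MB/s per node across 3,700 nodes. Since every byte crosses the disks and the network more than once during a sort, the effective per-node I/O rates are several times higher than these figures.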
Cost savings is not the point! Often “cloud” technologies are presented as primarily cost savers The real return is speed of R&D / business innovation Hadoop lets us build new products and improve existing products faster! Makes Developers & Scientists more productive Research questions answered in days, not months Projects move from research to production easily (it stays on Hadoop) Easy to learn! “Even our rocket scientists use it!” The major factors You don’t need to find new hardware to experiment You can work with all your data! Production and research based on same framework No need for R&D to do I.T., the clusters just work Tremendous Impact on Productivity
Example: Search Assist™ Database for Search Assist™ is built using Hadoop: 3 years of log data, 20 steps of map-reduce
                   Before Hadoop    After Hadoop
Time               26 days          20 minutes
Language           C++              Python
Development Time   2-3 weeks        2-3 days
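The talk notes the Hadoop version was written in Python (presumably via Hadoop Streaming); a multi-step pipeline like this is usually expressed by chaining jobs, with each step reading the previous step's output directory. Below is a minimal sketch of that chaining pattern in the old JobConf API, with mapper/reducer classes omitted and paths purely illustrative; it is not the actual Search Assist code.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class AssistPipeline {
  public static void main(String[] args) throws Exception {
    // Step 1: extract candidate query data from raw logs
    // (mapper/reducer classes would be set here; omitted for brevity).
    JobConf extract = new JobConf(AssistPipeline.class);
    extract.setJobName("assist-step1-extract");
    FileInputFormat.setInputPaths(extract, new Path("/logs/search"));
    FileOutputFormat.setOutputPath(extract, new Path("/tmp/assist/step1"));
    JobClient.runJob(extract);   // blocks until the step completes

    // Step 2: aggregate counts over step 1's output; steps 3..20 chain the same way.
    JobConf aggregate = new JobConf(AssistPipeline.class);
    aggregate.setJobName("assist-step2-aggregate");
    FileInputFormat.setInputPaths(aggregate, new Path("/tmp/assist/step1"));
    FileOutputFormat.setOutputPath(aggregate, new Path("/tmp/assist/step2"));
    JobClient.runJob(aggregate);
  }
}
```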
Challenges & Learnings Surprises we’ve encountered along the way
Users!!  Can’t live with them, can’t shoot them! There is always a new way to crash the system! DOS attacks, resource exhaustion, abuse of NFS mounts…, MANY small files, a HUGE # of tiny tasks, … DOSing other internal or external services from the grid… Using all RAM on a node…, extreme read patterns… Really novel use of features… Weird use of memory mapping crashed linux… We just want your cluster! Using maps to launch MPI jobs! Lots of other examples of just taking over machines and squatting on them Tragedy of the commons When have you seen a shared drive that is not full?  Need social / economic feedback I didn’t know how much RAM I need, so I always ask for the MAX allowed… Lots of gaming of any scheduling / sharing policy to maximize local utility!
Users!! (more) R&D workload varies hugely in composition! Paper deadlines, end of quarter, project… work spikes Makes it very hard to understand system behavior.  Did we fix something? Upgrades are hard on a shared service If you break any of 1000 apps, you need permission to upgrade… Semantics users count on are often subtle HOD -> 0.20 reduce caching Reuse of objects in Map-Reduce (broke 2 of 2000 scripts) We do love them of course, they pay our wages… Have a realistic model of users!  Understand their problems. They don’t have the answers you would like!  (How much RAM/time…) Make shared costs visible! Make easy things easy!  Bad designs lead to bad results
KISS – Keep it simple, stupid! The simplifying assumptions in Hadoop are key to success No Posix Java – Great tools to debug, static checking, … Single masters, in-RAM data structures Required to install and run on a single box with stock OS… Do you want HA or an available system? Single master policy has been a huge win (to date) Clever code takes a clever person to debug! Genius in system design is code that anyone can understand! Use the existing features!!! Resist new ones! Researchers & customers like adding things to systems; this makes them brittle!
Distributed performance analysis is Hard!
Isolation vs. Throughput You want to mix lots of low priority work with your high priority work, to saturate the cluster When “production” needs more CPU, just delay low priority work Share production data with research… Unfortunately research jobs crash clusters… This is unacceptable!! Need to invest in isolation Capacity scheduler Quotas, run as correct unix user (kill -9 problem, …) Smarter monitoring to detect odd user patterns
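One concrete piece of that isolation work is the Capacity Scheduler's named queues: production and research work get guaranteed shares of the cluster, and each job picks its queue at submit time. Here is a minimal sketch of the user-facing side using the 0.20-era mapred.job.queue.name property; the queue name and memory setting are illustrative assumptions, not Yahoo!'s configuration.

```java
import org.apache.hadoop.mapred.JobConf;

public class ResearchJob {
  public static JobConf configure() {
    JobConf conf = new JobConf(ResearchJob.class);
    conf.setJobName("research-experiment");

    // Route the job to a lower-priority queue so production capacity stays protected.
    conf.set("mapred.job.queue.name", "research");

    // Ask for only the heap each task actually needs; always requesting the maximum
    // allowed is exactly the "tragedy of the commons" behavior described above.
    conf.set("mapred.child.java.opts", "-Xmx512m");

    // ... mapper, reducer, input and output paths set as usual, then JobClient.runJob(conf)
    return conf;
  }
}
```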
Lots of details in building stable servers Fragmentation DOS, Brownout (very slow clients exhaust threads) RPC has many subtle issues when shedding load Critical sections, race conditions, data structure contention Versioning issues in distributed systems
Datacenter network Simple “flat” topology 200 Mb/s all-to-all bandwidth (5:1 oversubscription) Copper 1 Gb, ~8% of total cost “Flat” topology vastly reduces admin cost!! No moves, reconfigs Huge aggregate bandwidth Round-robin static routes (layer 3), simple, but an availability issue (failures not evenly redistributed) Tests show 30% increase in sort speed for 2:1 oversubscription The plan for the next data center (all fiber, 10 Gb to the rack) We want simple, robust, higher speed, lower cost networks!  Don’t care about fancy features!!! [Diagram: WAN routers and core routers (x8), ~160 Gb aggregate; rack switches at 8 Gb per rack; 40 nodes per rack; 400+ racks per data center]
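Those numbers are consistent with each other: 40 nodes at 1 Gb/s each demand 40 Gb/s into the rack switch, against an ~8 Gb/s uplink, which is the quoted 5:1 oversubscription; dividing the 8 Gb/s uplink across 40 nodes gives the ~200 Mb/s of all-to-all bandwidth available per node. A 2:1 design, per the sort-speed test above, would need roughly 20 Gb/s of uplink per rack.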
Testing -> Stability and Agility Two competing needs: Rapid development Adding new features Innovate Increase stability Hadoop is mission critical Pressure to move slowly! How do you move the curve? Invest in automated testing! Continuous integration Performance testing Stress testing Coverage, mock objects…
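One cheap way to make that testing investment pay off is to keep per-record logic out of the Hadoop plumbing, so plain unit tests can cover it with no cluster at all. A small sketch follows; the class, method, and record format are illustrative, not part of Hadoop's test suite.

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class LogParserTest {
  // Pure function pulled out of the Mapper so it can be unit tested in isolation.
  static String extractUser(String logLine) {
    String[] fields = logLine.split("\t");
    return fields.length > 0 ? fields[0] : "";
  }

  @Test
  public void extractsFirstField() {
    assertEquals("u123", extractUser("u123\tclick\tfrontpage"));
  }

  @Test
  public void handlesEmptyLine() {
    assertEquals("", extractUser(""));
  }
}
```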
Collaboration is hard! Open source promises lots of free help on your project! But it doesn’t really work that way… When Yahoo! invests in an area, others shift their focus, making collaboration challenging! We focus on our key concerns and let others innovate in other places Most people make lousy contributors! There is a big investment in training someone to be an effective contributor No dilettantes! To become a committer, you are expected to stay in the community and support your work and the work of others! Do you want to graduate or code Hadoop? We don’t want features to prove ideas!!!  KISS! So what does make sense? Educate us!  Build a prototype and write a paper with great data! Build an application on Hadoop or a tool that improves it Still want to contribute to Hadoop?  Start small, fix bugs Prove you can make the product better and can stick to it
Software Engineering 101 KISS!  (keep it simple!) Design with real user behavior in mind! Design in measurement and testing! Ship early and often Don’t optimize until you have proof you have solved a problem! You learn more by shipping than by over-designing Refactor aggressively Code rots, design mistakes are made Shipping early makes this worse! Open source makes this worse! (patch review) You need to rewrite, redesign, improve regularly
Futures & Opportunities
Current Yahoo! Development Hadoop: Backwards compatibility New APIs in 0.21, Avro Append, Sync, Flush Improved Scheduling: Capacity Scheduler, Hierarchical Queues in 0.21 Security - Kerberos authentication GridMix3 / Mumak Simulator / Better trace & log analysis PIG: SQL and Table storage model (rows and columns, schema) Column-oriented storage layer - Zebra Multi-query, lots of other optimizations Oozie: New workflow and scheduling system Quality Engineering – Many more tests: Stress tests, Fault injection, Functional test coverage, performance regression…
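For contrast with the old-API example earlier, the new org.apache.hadoop.mapreduce API referenced here replaces the OutputCollector/Reporter pair with a single Context object. A minimal sketch reusing the same illustrative per-user event count:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// New-style mapper: one Context object carries output, counters, and status reporting.
public class EventMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private final LongWritable ONE = new LongWritable(1);
  private final Text user = new Text();

  @Override
  protected void map(LongWritable key, Text line, Context context)
      throws IOException, InterruptedException {
    // Assume the first tab-separated field is a user ID (same illustrative format as before).
    String[] fields = line.toString().split("\t");
    if (fields.length > 0) {
      user.set(fields[0]);
      context.write(user, ONE);
    }
  }
}
```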
Some Areas to Explore: Systems Anomaly detection, algorithms to operate in the face of such! Scheduling improvements (Mumak simulator, GridMix3) We will be releasing logs from our clusters Isolation! Integration with Linux containers?  VMs?  Others? Lower cost ways of defending Hadoop resources from jobs? More pluggable APIs Scheduler, Block placement, logging Various implementation choices for Map-Reduce Push vs Pull, full sorted output?, reduce locality cases? Map-Reduce-Reduce - Can MR be made more efficient for multiple MR jobs? Load balancing tricks in HDFS (eliminate hot spots, slow nodes) Block placement strategies More failure domains (power? User zones?) Replication on every rack Various RAID / parity approaches Collocation of data (this is hard)
Some Areas to Explore: Applications Implementing standard algorithms (in Pig?) Joins, aggregations, vector operations? ML primitives… Programming model for Iterative computations Machine learning does a lot of this, how can we enhance the framework to support ML? (beyond MPI) Debugging & performance tools, UI etc One of the easiest ways to have impact! Log collection and management Hadoop should be better at monitoring itself and user jobs
Some Areas to Explore: Pig Memory Usage Java provides poor models for managing RAM.  This is key to Pig, Hive, HBase, the HDFS NN… Automated Hadoop Tuning Can Pig or Oozie or MR itself figure out how to configure Hadoop to best run a particular script / job? RDBMS tricks Cost-based optimization – how does current RDBMS technology carry over to the MR world? Indices, materialized views, etc. – How do these traditional RDBMS tools fit into the MR world? Build an optimizing compiler for Pig Latin, perhaps incorporating some database query optimization techniques Use data layout information for query optimization in Pig
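As a sense of what an automated tuner would have to decide, here is a hand-written sketch of the per-job knobs involved in the 0.20-era configuration; the property names are real, but the values and the sizing heuristic are illustrative guesses, not recommendations.

```java
import org.apache.hadoop.mapred.JobConf;

// A hand-tuned example of the per-job settings an automated tuner would need to choose.
public class TunedJob {
  public static JobConf tune(JobConf conf, long inputBytes) {
    // Size the map-side sort buffer (default is ~100 MB in the 0.20 era).
    conf.setInt("io.sort.mb", 200);

    // Compress map output to cut shuffle volume on large jobs.
    conf.setBoolean("mapred.compress.map.output", true);

    // Pick a reduce count proportional to input size (illustrative: ~10 GB per reducer)
    // rather than accepting the default of a single reducer.
    int reduces = (int) Math.max(1, inputBytes / (10L * 1024 * 1024 * 1024));
    conf.setNumReduceTasks(reduces);

    // Give each task JVM only the heap it needs.
    conf.set("mapred.child.java.opts", "-Xmx768m");
    return conf;
  }
}
```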
Questions? Eric Baldeschwieler VP Hadoop Software Development [email_address] For more information: http://hadoop.apache.org/  http://hadoop.yahoo.com/  (including job openings)


Editor's Notes

  • #10–#12 (Yahoo! front page case study): Load Balancing – Brooklyn (DNS) directs users to their local datacenter. RSS Feeds – Feed-norm leverages Yahoo Traffic Server to normalize, cache, and proxy site feeds for Auto Apps. Image and Video Delivery – all images and thumbnails displayed on the page, a substantial part of the 20-25 billion objects YCS serves a day (stats coming). Site thumbnails (Auto Apps) – the Metro applications generated from web sites added to the left column; Metro currently stores about 220K thumbnails replicated on both US coasts, with usage around 55K/second (heavily cached by YCS), growing 100% month over month. Attachment Store – Mail uses YMDB (MObStor precursor) to store 10 TB of attachments. Search Index – data mining to obtain the top-n user search queries. Ads Optimization – on-going refreshes of the ad ranking model for revenue optimization. Content Optimization – computation of content-centric user profiles for user segmentation, model refreshes for content categorization, and a user-centric recommendation module. Machine Learning – model creation for various purposes at Yahoo. Spam Filters – utilizing co-occurrence and other data-intensive techniques for mail spam detection.