Hadoop at Yahoo! -- University Talks
  • Slide notes: Yahoo! front page case study
    • Load Balancing: Brooklyn (DNS) directs users to their local datacenter
    • RSS Feeds: Feed-norm leverages Yahoo Traffic Server to normalize, cache, and proxy site feeds for Auto Apps
    • Image and Video Delivery: all images and thumbnails displayed on the page; a substantial part of the 20-25 billion objects YCS serves a day (stats coming)
    • Site thumbnails (Auto-apps): the Metro applications generated from web sites added to the left column; Metro currently stores about 220K thumbnails replicated on both US coasts; usage is currently about 55K/second (heavily cached by YCS), growing 100% month over month
    • Attachment Store: Mail uses YMDB (MObStor pre-cursor) to store 10 TB of attachments
    • Search Index: data mining to obtain the top-n user search queries
    • Ads Optimization: on-going refreshes of the ad ranking model for revenue optimization
    • Content Optimization: computation of content-centric user profiles for user segmentation; model generation and refresh for content categorization; user-centric recommendation module
    • Machine Learning: model creation for various purposes at Yahoo!
    • Spam Filters: utilizing co-occurrence and other data-intensive techniques for mail spam detection

Hadoop at Yahoo! -- University Talks -- Presentation Transcript

  • Hadoop at Yahoo! Eric Baldeschwieler VP Hadoop Software Development [email_address]
  • What we want you to know about Yahoo! and Hadoop
    • Yahoo! is:
      • The largest contributor to Hadoop
      • The largest tester of Hadoop
      • The largest user of Hadoop
        • 4000 node clusters!
      • A great place to do Hadoop development, do internet scale science and change the world!
    • Also:
      • We release “The Yahoo! Distribution of Hadoop”
      • We contribute all of our Hadoop work to the Apache Foundation as open source
      • We continue to aggressively invest in Hadoop
      • We do not sell Hadoop service or support!
        • We use Hadoop services to run Yahoo!
    • The majority of all patches to Hadoop have come from Yahoo!s
      • 72% of core patches
        • Core = HDFS, Map-reduce, Common
      • Metric is an underestimate
        • Some Yahoo!s use apache.org accounts
        • Patch sizes vary
    • Yahoo! is the largest employer of Hadoop Contributors, by far!
      • We contribute ALL of our Hadoop development work back to Apache!
    • We are hiring! (interns & full-time)
      • Sunnyvale, Bangalore, Beijing
      • See http://hadoop.yahoo.com
    The Largest Hadoop Contributor [charts: share of all patches vs. core patches]
  • The Largest Hadoop Tester
    • Every release of The Yahoo! Distribution of Hadoop goes through multiple levels of testing before we declare it stable
    • 4 Tiers of Hadoop clusters
      • Development, Testing and QA (~10% of our hardware)
        • Continuous integration / testing of new code
      • Proof of Concepts and Ad-Hoc work (~10% of our hardware)
        • Runs the latest version of Hadoop – currently 0.20 (RC6)
      • Science and Research (~60% of our hardware)
        • Runs more stable versions – currently 0.20 (RC5)
      • Production (~20% of our hardware)
        • The most stable version of Hadoop – currently 0.18.3
    • We continue to grow our Quality Engineering team
      • We are hiring!!
  • The Largest Hadoop User
    • [Chart: growth in hardware and internal Hadoop users, 2006 to now]
    • >82 PB of disk today; >25,000 nodes today; building a new datacenter
  • Collaborations around the globe
      • The Apache Foundation – http://hadoop.apache.org
      • Hundreds of contributors and thousands users of Hadoop!
      • see http://wiki.apache.org/hadoop/PoweredBy
    • The Yahoo! Distribution of Hadoop – Opening up our testing
      • Hundreds of git followers
    • M45 - Yahoo!’s shared research supercomputing cluster
      • Carnegie Mellon University
      • The University of California at Berkeley
      • Cornell University
      • The University of Massachusetts at Amherst
    • Partners in India
      • Computational Research Laboratories (CRL), India's Tata Group
      • Universities – IIT-B, IISc, IIIT-H, PSG
    • Open Cirrus™ - cloud computing research & education
      • The University of Illinois at Urbana-Champaign
      • Infocomm Development Authority (IDA) in Singapore
      • The Karlsruhe Institute of Technology (KIT) in Germany
      • HP, Intel
      • The Russian Academy of Sciences, Electronics & Telecomm.
      • Malaysian Institute of Microelectronic Systems
  • Usage of Hadoop
  • Why Hadoop @ Yahoo! ?
    • Massive scale
      • 500M+ unique users per month
      • Billions of “transactions” per day
      • Many petabytes of data
    • Analysis and data processing key to our business
      • Need to process all of that data in a timely manner
      • Lots of ad hoc investigation to look for patterns, run reports…
    • Need to do this cost effectively
      • Use low cost commodity hardware
      • Share resources among multiple projects
      • Try new things quickly at huge scale
      • Handle the failure of some of that hardware every day
    • The Hadoop infrastructure provides these capabilities
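The map-reduce model behind these capabilities can be sketched in a few lines. The following is an illustrative Python sketch (not Yahoo! code): a mapper emits (key, 1) pairs the way a Hadoop Streaming mapper writes to stdout, and a reducer sums per key, with a local sort standing in for Hadoop's shuffle.

```python
# Minimal sketch of the map-reduce model (illustrative only): count record
# occurrences the way a Hadoop Streaming mapper/reducer pair would, with the
# shuffle phase simulated by a local sort-and-group.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Emit a (word, 1) pair for every word, as a streaming mapper would.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    # Hadoop delivers pairs grouped by key; sorting simulates that shuffle.
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (key, sum(count for _, count in group))

logs = ["hadoop scales", "hadoop stores data", "data scales"]
counts = dict(reducer(mapper(logs)))
print(counts)  # {'data': 2, 'hadoop': 2, 'scales': 2, 'stores': 1}
```

On a real cluster the mapper and reducer run on different machines over petabytes of input; the programming model stays this small, which is much of the productivity argument above.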
  • Yahoo! front page - Case Study: Search Index, Ads Optimization, Content Optimization, RSS Feeds, Machine Learned Spam filters
  • Large Applications
    • Webmap
      • 2008: ~70 hours runtime, ~300 TB shuffling, ~200 TB output
      • 2009: ~73 hours runtime, ~490 TB shuffling, ~280 TB output (+55% hardware)
    • Sort benchmarks (Jim Gray contest)
      • 2008: 1 terabyte sorted in 209 seconds on 900 nodes
      • 2009: 1 terabyte sorted in 62 seconds on 1,500 nodes; 1 petabyte sorted in 16.25 hours on 3,700 nodes
    • Largest cluster
      • 2008: 2,000 nodes, 6 PB raw disk, 16 TB of RAM, 16K cores
      • 2009: 4,000 nodes, 16 PB raw disk, 64 TB of RAM, 32K cores (40% faster too!)
  • Tremendous Impact on Productivity
    • Cost savings is not the point!
      • Often “cloud” technologies are presented as primarily cost savers
      • The real return is speed of R&D / business innovation
      • Hadoop lets us build new products and improve existing products faster!
    • Makes Developers & Scientists more productive
      • Research questions answered in days, not months
      • Projects move from research to production easily (it stays on Hadoop)
      • Easy to learn! “Even our rocket scientists use it!”
    • The major factors
      • You don’t need to find new hardware to experiment
      • You can work with all your data!
      • Production and research based on same framework
      • No need for R&D to do I.T., the clusters just work
  • Example: Search Assist TM
    • Database for Search Assist™ is built using Hadoop.
    • 3 years of log-data
    • 20-steps of map-reduce
    • Before Hadoop vs. After Hadoop
      • Time: 26 days → 20 minutes
      • Language: C++ → Python
      • Development time: 2-3 weeks → 2-3 days
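The flavor of this pipeline can be sketched as follows. This is a hedged illustration only: the real Search Assist build chains roughly 20 map-reduce steps over 3 years of logs, while this condenses the idea into two local steps with invented data and helper names.

```python
# Illustrative two-step sketch of a top-n query pipeline (invented names/data;
# the real system is a ~20-step map-reduce chain over years of logs).
from collections import Counter
import heapq

def count_queries(log_lines):
    # Step 1 (map + reduce): normalize each query string and count it.
    return Counter(line.strip().lower() for line in log_lines if line.strip())

def top_n(counts, n):
    # Step 2: keep only the n most frequent queries.
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])

logs = ["weather", "News", "weather", "mail", "weather", "news"]
print(top_n(count_queries(logs), 2))  # [('weather', 3), ('news', 2)]
```

The Python-after-Hadoop column in the table above reflects exactly this style: short scripts gluing together per-step map and reduce logic.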
  • Challenges & Learnings Surprises we’ve encountered along the way
  • Users!! Can’t live with them, can’t shoot them!
    • There is always a new way to crash the system!
      • DOS attacks, resource exhaustion, abuse
        • NFS mounts…, MANY small files, HUGE numbers of tiny tasks, …
        • DOSing other internal or external service from grid…
        • Use all RAM on a node…, extreme read patterns…
      • Really novel use of features…
        • Weird use of memory mapping crashed linux…
      • We just want your cluster!
        • Using maps to launch MPI jobs!
        • Lots of other examples of just taking over machines and squatting on them
    • Tragedy of the commons
      • When have you seen a shared drive that is not full? Need social / economic feedback
      • I didn’t know how much RAM I need, so I always ask for the MAX allowed…
      • Lots of gaming of any scheduling / sharing policy to maximize local utility!
  • Users!! (more)
    • R&D workload varies hugely in composition!
      • Paper deadlines, end of quarter, project… work spikes
      • Makes it very hard to understand system behavior. Did we fix something?
    • Upgrades are hard on a shared service
      • If you break any of 1000 apps, you need permission to upgrade…
      • Semantics users count on are often subtle
        • Hod->20 reduce caching
        • Reuse of objects in Map-Reduce (broke 2 of 2000 scripts)
    • We do love them of course, they pay our wages…
      • Have a realistic model of users! Understand their problems.
        • They don’t have the answers you would like! (How much RAM/time…)
        • Make shared costs visible!
        • Make easy things easy! Bad designs lead to bad results
  • KISS – Keep it simple, stupid!
    • The simplifying assumptions in Hadoop are key to success
      • No Posix
      • Java – Great tools to debug, static checking, …
      • Single masters, in RAM data structures
      • Required to install and run on a single box with stock OS…
    • Do you want HA or an available system?
      • Single master policy has been a huge win (to date)
    • Clever code takes a clever person to debug!
      • Genius in system design is code that anyone can understand!
    • Use the existing features!!! Resist new ones!
      • Researchers & customers like adding things to systems, this makes them brittle!
  • Distributed performance analysis is Hard!
  • Isolation vs. Throughput
    • You want to mix lots of low priority work with your high priority work, to saturate the cluster
      • When “production” needs more CPU, just delay low priority work
      • Share production data with research…
    • Unfortunately research jobs crash clusters…
      • This is unacceptable!!
    • Need to invest in isolation
      • Capacity scheduler
      • Quotas, run as correct unix user (kill -9 problem, …)
      • Smarter monitoring to detect odd user patterns
  • Lots of details in building stable servers
    • Fragmentation
    • DOS, Brownout (very slow clients exhaust threads)
    • RPC has many subtle issues when shedding load
    • Critical sections, race conditions, data structure contention
    • Versioning issues in distributed systems
  • Datacenter network
    • Simple “flat” topology
      • 200 Mb all-to-all bandwidth
      • (5:1 over-subscription)
      • Copper 1Gb, ~8% total cost
    • “Flat” topology vastly reduces admin cost!!
      • No moves, reconfigs
      • Huge aggregate bandwidth
    • Round robin static routes (layer 3), simple, but an availability issue (failures not evenly redistributed)
    • Tests show 30% increase in sort speed for 2:1 over-subscription
      • The plan for next data center (all fiber, 10 Gb to the rack)
    • We want simple, robust, higher speed, lower cost networks! Don’t care about fancy features!!!
    [Diagram: WAN routers → core routers (x8, ~160 Gb) → rack switches (8 Gb/rack); 40 nodes per rack, 400+ racks per data center]
  • Testing -> Stability and Agility
    • Two competing needs:
      • Rapid development
        • Adding new features
        • Innovate
      • Increase stability
        • Hadoop is mission critical
        • Pressure to move slowly!
    • How do you move the curve?
      • Invest in automated testing!
      • Continuous integration
      • Performance testing
      • Stress testing
      • Coverage, mock objects…
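The "mock objects" bullet can be illustrated with a few lines of Python: map logic is tested in isolation by handing it a fake output collector instead of a live cluster. The mapper and collector interface here are invented stand-ins, not actual Hadoop APIs.

```python
# Testing map logic with a mock collector (invented interface, for
# illustration): no cluster needed, the mock records every emitted pair.
from unittest import mock

def tokenize_mapper(line, collector):
    # Emits a (word, 1) pair through whatever collector it is given.
    for word in line.split():
        collector.emit(word, 1)

collector = mock.Mock()
tokenize_mapper("grid computing grid", collector)
collector.emit.assert_any_call("grid", 1)
assert collector.emit.call_count == 3
print("mapper test passed")
```

Tests like this run in milliseconds in continuous integration, which is what makes the "move the curve" investment pay off.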
  • Collaboration is hard!
    • Open source promises lots of free help on your project!
    • But it doesn’t really work that way…
      • When Yahoo! invests in an area, others shift their focus, making collaboration challenging!
        • We focus on our key concerns and let others innovate in other places
      • Most people make lousy contributors!
        • There is a big investment in training someone to be an effective contributor
        • No dilettantes! To become a committer, you are expected to stay in the community and support your work and the work of others!
        • Do you want to graduate or code Hadoop?
        • We don’t want features to prove ideas!!! KISS!
    • So what does make sense?
      • Educate us! Build a prototype and write a paper with great data!
      • Build an application on Hadoop or a tool that improves it
      • Still want to contribute to Hadoop? Start small, fix bugs
        • Prove you can make the product better and can stick to it
  • Software Engineering 101
    • KISS! (keep it simple!)
    • Design with real user behavior in mind!
    • Design in measurement and testing!
    • Ship early and often
      • Don’t optimize until you have proof you have solved a problem!
      • You learn more by shipping than by over-designing
    • Refactor aggressively
      • Code rots, design mistakes are made
      • Shipping early makes this worse!
      • Open source makes this worse! (patch review)
      • You need to rewrite, redesign, improve regularly
  • Futures & Opportunities
  • Current Yahoo! Development
    • Hadoop
      • Backwards compatibility
        • New APIs in 0.21, Avro
      • Append, Sync, Flush
      • Improved Scheduling
        • Capacity Scheduler, Hierarchical Queues in 0.21
      • Security - Kerberos authentication
      • GridMix3 / Mumak Simulator / Better trace & log analysis
    • PIG
      • SQL and Table storage model (rows and columns, schema)
      • Column oriented storage layer - Zebra
      • Multi-query, lots of other optimizations
    • Oozie
      • New workflow and scheduling system
    • Quality Engineering – Many more tests
      • Stress tests, Fault injection, Functional test coverage, performance regression…
  • Some Areas to Explore: Systems
    • Anomaly detection, algorithms to operate in the face of such!
    • Scheduling improvements (Mumak simulator, gridmix3)
      • We will be releasing Logs from our clusters
    • Isolation!
      • Integration with Linux containers? VMs? Others?
      • Lower cost ways of defending Hadoop resources from jobs?
    • More pluggable APIs
      • Scheduler, Block placement, logging
    • Various implementation choices for Map-Reduce
      • Push vs Pull, full sorted output?, reduce locality cases?
      • Map-Reduce-Reduce - Can MR be made more efficient for multiple MR jobs?
    • Load balancing tricks in HDFS (eliminate hot spots, slow nodes)
    • Block placement strategies
      • More failure domains (power? User Zones?)
      • Replication on every rack
      • Various RAID / parity approach
      • Collocation of data (this is hard)
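As a baseline for these explorations, HDFS's default rack-aware policy places the first replica on the writer's node and the next two on a single remote rack. A simplified sketch of that idea, with invented function and topology names and no load or free-space checks:

```python
# Simplified sketch of rack-aware replica placement (illustrative; real HDFS
# also weighs node load, free space, and handles small topologies).
import random

def place_replicas(writer, nodes_by_rack):
    # nodes_by_rack: {rack_name: [node, ...]}; writer = (rack, node).
    writer_rack, writer_node = writer
    replicas = [writer_node]                      # 1st replica: writer's node
    remote_racks = [r for r in nodes_by_rack if r != writer_rack]
    rack = random.choice(remote_racks)            # pick one remote rack
    second, third = random.sample(nodes_by_rack[rack], 2)
    replicas += [second, third]                   # 2nd + 3rd: that remote rack
    return replicas

topology = {"rackA": ["a1", "a2"], "rackB": ["b1", "b2", "b3"]}
print(place_replicas(("rackA", "a1"), topology))
```

The research questions above amount to swapping this function for smarter variants: more failure domains, parity instead of full copies, or data collocation.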
  • Some Areas to Explore: Applications
    • Implementing standard algorithms (in Pig?)
      • Joins, aggregations, vector operations? ML primitives…
    • Programming model for Iterative computations
      • Machine learning does a lot of this, how can we enhance the framework to support ML? (beyond MPI)
    • Debugging & performance tools, UI etc
      • One of the easiest ways to have impact!
    • Log collection and management
      • Hadoop should be better at monitoring itself and user jobs
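The iterative-computation gap mentioned above is usually worked around today with an external driver that reruns a map-reduce job until the model converges. A hedged local sketch of that pattern, using 1-D k-means with two centroids (all names and data invented; on a real cluster each loop body would be a full job submission):

```python
# Driver-loop pattern for iterative ML on map-reduce: rerun a "job" until
# convergence. Here the job is one 1-D k-means step computed locally.
def kmeans_step(points, centroids):
    # Map: assign each point to its nearest centroid; reduce: average buckets.
    buckets = {c: [] for c in centroids}
    for p in points:
        nearest = min(centroids, key=lambda c: abs(c - p))
        buckets[nearest].append(p)
    return sorted(sum(b) / len(b) for b in buckets.values() if b)

points = [1.0, 2.0, 1.5, 10.0, 11.0, 10.5]
centroids = [0.0, 5.0]
for _ in range(10):                      # driver loop = repeated MR jobs
    new = kmeans_step(points, centroids)
    if new == centroids:                 # convergence test lives in the driver
        break
    centroids = new
print(centroids)  # [1.5, 10.5]
```

The framework question is whether this loop, job-launch overhead included, can move inside the system rather than being re-implemented by every ML team.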
  • Some Areas to Explore: Pig
    • Memory Usage
      • Java provides poor models for managing RAM. This is key to Pig, Hive, HBase, the HDFS NN…
    • Automated Hadoop Tuning
      • Can Pig or Oozie or MR itself figure out how to configure Hadoop to best run a particular script / job?
    • RDBMS tricks
      • Cost based optimization – how does current RDBMS technology carry over to MR world?
      • Indices, materialized views, etc. – How do these traditional RDBMS tools fit into the MR world?
      • Build an optimizing compiler for Pig Latin, perhaps incorporating some database query optimization techniques
      • Use data layout information for query optimization in Pig
  • Questions? Eric Baldeschwieler VP Hadoop Software Development [email_address] For more information: http://hadoop.apache.org/ http://hadoop.yahoo.com/ (including job openings)