Yahoo & Hadoop

Usage Rights

© All Rights Reserved

    Yahoo & Hadoop Presentation Transcript

    • YAHOO & HADOOP: USING AND IMPROVING APACHE HADOOP AT YAHOO! Eric Baldeschwieler, VP, Hadoop Software. © 2011 IBM Corporation
    • AGENDA: Brief Overview; Hadoop @ Yahoo!; Hadoop Momentum; The Future of Hadoop
    • what’s happening: Big Data is here! Unstructured data, petabyte scale, operationally critical (Photo: Flickr, sub_lime79)
    • turning data into insights: machine learning, logic regression, time series, content clustering algorithms, ad inventory modeling, user interest prediction, factorization models (Photo: Flickr, NASA Goddard Photo and Video)
    • making YAHOO relevant (Photo: Flickr, ogimogi)
    • hadoop: Powering Yahoo! science + big data + insight = personal relevance = VALUE (Photo: Flickr, DDFic)
    • WHAT IS HADOOP?
      • Diagram: Pig and Hive on top of MapReduce, on top of HDFS
      • Commodity: computers, network
      • Focus on: simplicity, redundancy, scale, availability
      • Transforms commodity equipment into a service that:
        • HDFS: stores petabytes of data reliably
        • MapReduce: allows huge distributed computations
      • Key attributes:
        • Redundant and reliable: doesn’t stop or lose data even as hardware fails
        • Easy to program: our rocket scientists use it directly!
        • Very powerful: allows the development of big data algorithms & tools
        • Batch processing centric
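
    The MapReduce model named on the slide above can be illustrated with a minimal single-process sketch. This is not Hadoop's API and not code from the talk: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen only to show the three steps that the framework distributes across machines.

    ```python
    from collections import defaultdict
    from itertools import chain


    def map_phase(record):
        """Map: emit a (word, 1) pair for every word in one input record."""
        return [(word.lower(), 1) for word in record.split()]


    def shuffle(pairs):
        """Shuffle: group emitted values by key (done by the framework in a real cluster)."""
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups


    def reduce_phase(key, values):
        """Reduce: combine all values for one key into a single result."""
        return key, sum(values)


    def run_job(records):
        """Run map, shuffle, and reduce in sequence on an in-memory list of records."""
        mapped = chain.from_iterable(map_phase(r) for r in records)
        return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


    if __name__ == "__main__":
        counts = run_job(["big data is here", "Big Data at petabyte scale"])
        print(counts["big"], counts["data"], counts["scale"])
    ```

    In a real cluster the map and reduce calls would run in parallel on many machines, with input and output stored in HDFS; only the shape of the computation is shown here.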
    • WHAT HADOOP ISN’T
      • A replacement for relational and data warehouse systems
      • A transactional / online / serving system
      • A low latency or streaming solution
    • HADOOP IN THE ENTERPRISE [Diagram: Business Intelligence and Applications sit above Hadoop cluster(s), RDBMS, EDW, and data marts. Inputs: transactions, interactions, and structured data from business applications; semi-structured or unstructured data from web logs, server logs, social media, etc.]
    • HADOOP @ YAHOO!
    • HADOOP @ YAHOO! “Where Science meets Data”
      • PRODUCTS: Data Analytics, Content Optimization, Content Enrichment, Yahoo! Mail Anti-Spam, Advertising Products, Ad Optimization, Ad Selection, Big Data Processing & ETL
      • APPLIED SCIENCE: User Interest Prediction, Ad inventory prediction, Machine learning - search ranking, Machine learning - ad targeting, Machine learning - spam filtering
      • HADOOP CLUSTERS: Tens of thousands of servers, 10s of Petabytes
      • Diagram labels: dimensional data, content, data pipelines, terabytes / day (compressed)
    • FROM PROJECT TO CORE PLATFORM: 40K+ Servers, 170 PB Storage, 5M+ Monthly Jobs [Chart: thousands of servers and petabytes of storage by year, 2006 to 2010]
    • HADOOP POWERS THE YAHOO! NETWORK: advertising optimization; data analytics; machine learning; search ranking; advertising data systems; Yahoo! Mail anti-spam; audience, ad and search pipelines; ad selection; Yahoo! Homepage; Content Optimization; ad inventory prediction; user interest prediction
    • CASE STUDY YAHOO! HOMEPAGE: Personalized for each visitor. Result: twice the engagement
      • Recommended links: +79% clicks vs. randomly selected
      • News Interests: +160% clicks vs. one size fits all
      • Top Searches: +43% clicks vs. editor selected
    • CASE STUDY YAHOO! HOMEPAGE
      • SCIENCE HADOOP CLUSTER: machine learning to build ever better categorization models. Input: user behavior. Output: categorization models (weekly)
      • PRODUCTION HADOOP CLUSTER: identify user interests using categorization models. Input: user behavior. Output: serving maps (every 5 minutes)
      • SERVING SYSTEMS: build customized home pages with latest data (thousands / second) for engaged users
      • Serving maps: users - interests
      • Cadence: five-minute production; weekly categorization models
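
    The two cadences on the slide above (a slowly retrained model, a frequently rebuilt serving map, and a fast lookup at serving time) can be sketched as follows. Everything here is hypothetical and for illustration only: the keyword-to-category "model", the function names, and the sample data are assumptions, not Yahoo!'s implementation, which the slide does not describe in detail.

    ```python
    from collections import Counter, defaultdict


    def train_model(labeled_examples):
        """Weekly step (hypothetical): build a keyword -> category lookup as a stand-in model."""
        return {keyword: category for keyword, category in labeled_examples}


    def build_serving_map(model, click_events):
        """Five-minute step (hypothetical): map each user to their most frequent category."""
        per_user = defaultdict(Counter)
        for user, keyword in click_events:
            category = model.get(keyword)
            if category is not None:  # ignore behavior the model cannot categorize
                per_user[user][category] += 1
        return {user: counts.most_common(1)[0][0] for user, counts in per_user.items()}


    def pick_homepage_module(serving_map, user, default="top stories"):
        """Serving step (hypothetical): a cheap lookup, with a fallback for unknown users."""
        return serving_map.get(user, default)


    if __name__ == "__main__":
        model = train_model([("nba", "sports"), ("nfl", "sports"), ("election", "news")])
        events = [("u1", "nba"), ("u1", "nfl"), ("u1", "election"), ("u2", "election")]
        serving_map = build_serving_map(model, events)
        print(pick_homepage_module(serving_map, "u1"))
        print(pick_homepage_module(serving_map, "u9"))
    ```

    The design point the sketch tries to capture is the split of work: expensive model building happens rarely, cheap model application happens often, and the serving path only reads a precomputed map.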
    • CASE STUDY YAHOO! MAIL: Enabling quick response in the spam arms race
      • 450M mail boxes
      • 5B+ deliveries/day
      • Antispam models retrained every few hours on Hadoop
      • “40% less spam than Hotmail and 55% less spam than Gmail”
      • Diagram labels: SCIENCE, PRODUCTION
    • YAHOO! & APACHE HADOOP
      • Yahoo! has contributed 70+% of Apache Hadoop code to date
      • Hadoop is not our business, but Hadoop is key to our business
        • Yahoo! benefits from open source eco-system around Hadoop
        • Hadoop drives revenue at Yahoo! by making our core products better
      • We need Hadoop to be rock solid
        • We invest heavily in core Hadoop development
        • We focus on scalability, reliability, availability
      • We fix bugs before you see them
        • We run very large clusters
        • We have a large QA effort
        • We run a huge variety of workloads
      • We are good Apache Hadoop citizens
        • We contribute our work to Apache
        • We share the exact code we run
    • HADOOP MOMENTUM
    • HADOOP IS GOING MAINSTREAM [Chart: 2007 to 2010. Source: The Datagraph Blog]
    • THE PLATFORM EFFECT: BIRTH OF AN ECOSYSTEM
      • Early Adopters: scale and productize Hadoop
      • Orgs with Internet Scale Problems: add tools / frameworks, enhance Hadoop
      • Service Providers: grow ecosystem (training, support, enhancements)
      • Mainstream / Enterprise adoption: drive further development, enhancements
      • Diagram labels: Apache Hadoop, Enhance Hadoop, Ecosystem
      • Virtuous Circle! Investment -> Adoption; Adoption -> Investment
    • THE FUTURE OF HADOOP
    • MAKING HADOOP ENTERPRISE-READY: WHAT’S NEXT
      • Hadoop is far from “done”
        • Current implementation is showing its age
        • Need to address several deficiencies in scalability, flexibility, ease of use & performance
      • Yahoo! is working on Next Generation of Hadoop
        • MapReduce: rewrite to improve performance; pluggable support for new programming models
        • HDFS: adding volumes to improve scalability; flush & sync support for applications that log to HDFS
      • Apache should remain the hub of Hadoop ecosystem
        • Yahoo! contributes all Hadoop changes back to Apache Hadoop
        • Everyone benefits from shared neutral foundation
    • Questions?