Yahoo & Hadoop

Transcript

  • 1. YAHOO & HADOOP: USING AND IMPROVING APACHE HADOOP AT YAHOO! Eric Baldeschwieler, VP, Hadoop Software. © 2011 IBM Corporation
  • 2. AGENDA: Brief Overview; Hadoop @ Yahoo!; Hadoop Momentum; The Future of Hadoop
  • 3. What’s happening: Big Data is here! Unstructured data; petabyte scale; operationally critical. (Photo: Flickr / sub_lime79)
  • 4. Turning data into insights: machine learning, logic regression, time series, content clustering algorithms, ad inventory modeling, user interest prediction, factorization models. (Photo: Flickr / NASA Goddard Photo and Video)
  • 5. Making YAHOO relevant. (Photo: Flickr / ogimogi)
  • 6. Hadoop: Powering Yahoo! science + big data + insight = personal relevance = VALUE. (Photo: Flickr / DDFic)
  • 7. WHAT IS HADOOP?
    Diagram: Pig and Hive, above MapReduce, above HDFS.
    Commodity: computers, network.
    Focus on: simplicity, redundancy, scale, availability.
    Transforms commodity equipment into a service that:
      - HDFS: stores petabytes of data reliably
      - MapReduce: allows huge distributed computations
    Key attributes:
      - Redundant and reliable: doesn’t stop or lose data even as hardware fails
      - Easy to program: our rocket scientists use it directly!
      - Very powerful: allows the development of big data algorithms & tools
      - Batch processing centric
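Slide 7 says that MapReduce "allows huge distributed computations" without showing the programming model. The sketch below is a toy, single-process illustration of the map, shuffle, and reduce steps using word counting. It does not use Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample input are assumptions made up for illustration. On a real cluster the framework would run many map and reduce tasks in parallel over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run the three steps in sequence, in one process (hypothetical driver)."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical input lines standing in for files stored in HDFS.
    print(run_job(["hadoop stores data", "hadoop processes data"]))
```

Because each map call sees only one record and each reduce call sees only one key, the work can be split across many machines, which is the property the slide relies on.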
  • 8. WHAT HADOOP ISN’T: a replacement for relational and data warehouse systems; a transactional / online / serving system; a low-latency or streaming solution
  • 9. HADOOP IN THE ENTERPRISE (diagram). Labels: Business Intelligence; Applications; Hadoop cluster(s); RDBMS; EDW; data marts; Business Applications. Data types shown: transactions, interactions, structured data; semi-structured or unstructured data (web logs, server logs, social media, etc.)
  • 10. HADOOP @ YAHOO!
  • 11. HADOOP @ YAHOO! “Where Science meets Data”
    Products: data analytics; content optimization; content enrichment; Yahoo! Mail anti-spam; advertising products; ad optimization; ad selection; big data processing & ETL.
    Hadoop clusters: tens of thousands of servers; 10s of petabytes.
    Applied science: user interest prediction; ad inventory prediction; machine learning for search ranking, ad targeting, and spam filtering.
    Other diagram labels: dimensional data; content; data pipelines; terabytes / day (compressed).
  • 12. FROM PROJECT TO CORE PLATFORM: 40K+ servers; 170 PB storage; 5M+ monthly jobs. (Chart: thousands of servers and petabytes of storage by year, 2006 to 2010.)
  • 13. HADOOP POWERS THE YAHOO! NETWORK: advertising optimization; data analytics; machine learning search ranking; advertising data systems; Yahoo! Mail anti-spam; audience, ad and search pipelines; ad selection; Yahoo! Homepage content optimization; ad inventory prediction; user interest prediction
  • 14. CASE STUDY: YAHOO! HOMEPAGE. Personalized for each visitor. Result: twice the engagement.
      - Recommended links: +79% clicks vs. randomly selected
      - News Interests: +160% clicks vs. one size fits all
      - Top Searches: +43% clicks vs. editor selected
  • 15. CASE STUDY: YAHOO! HOMEPAGE
      - Science Hadoop cluster: machine learning to build ever better categorization models. Input: user behavior. Output: categorization models (weekly).
      - Production Hadoop cluster: identify user interests using categorization models. Input: user behavior. Output: serving maps (every 5 minutes).
      - Serving systems: build customized home pages with latest data (thousands / second) for engaged users.
    Side notes: serving maps (users - interests); five-minute production; weekly categorization models.
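The flow on slide 15 has three stages running at different cadences: a weekly model build, a five-minute categorization pass, and per-request serving. The toy sketch below illustrates only that division of labor. The majority-vote "model", the function names, and the sample data are all assumptions made up for illustration; the slides do not describe the actual algorithms Yahoo! used.

```python
from collections import Counter


def build_categorization_model(labeled_clicks):
    """Weekly 'science' step (toy): map each URL to its most frequent category."""
    votes = {}
    for url, category in labeled_clicks:
        votes.setdefault(url, Counter())[category] += 1
    return {url: counts.most_common(1)[0][0] for url, counts in votes.items()}


def build_serving_map(model, recent_clicks):
    """Five-minute 'production' step (toy): map each user to a top interest."""
    interests = {}
    for user, url in recent_clicks:
        category = model.get(url)
        if category is not None:  # skip URLs the weekly model has not seen
            interests.setdefault(user, Counter())[category] += 1
    return {user: counts.most_common(1)[0][0] for user, counts in interests.items()}


def pick_homepage_module(serving_map, user, default="top stories"):
    """Serving step: a cheap lookup per request, with a fallback for unknown users."""
    return serving_map.get(user, default)


if __name__ == "__main__":
    # Hypothetical data; real inputs would be user behavior logs on Hadoop.
    model = build_categorization_model([("/nba", "sports"), ("/dow", "finance")])
    serving_map = build_serving_map(model, [("u1", "/nba"), ("u2", "/dow")])
    print(pick_homepage_module(serving_map, "u1"))
```

The design point the slide makes is that the expensive work (model building, categorization) happens in batch, so the serving path is reduced to a lookup that can handle thousands of requests per second.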
  • 16. CASE STUDY: YAHOO! MAIL. Enabling quick response in the spam arms race.
      - 450M mailboxes
      - 5B+ deliveries/day
      - Antispam models retrained every few hours on Hadoop (science to production)
    “40% less spam than Hotmail and 55% less spam than Gmail”
  • 17. YAHOO! & APACHE HADOOP
    Yahoo! has contributed 70+% of Apache Hadoop code to date.
    Hadoop is not our business, but Hadoop is key to our business:
      - Yahoo! benefits from the open source ecosystem around Hadoop
      - Hadoop drives revenue at Yahoo! by making our core products better
    We need Hadoop to be rock solid:
      - We invest heavily in core Hadoop development
      - We focus on scalability, reliability, availability
    We fix bugs before you see them:
      - We run very large clusters
      - We have a large QA effort
      - We run a huge variety of workloads
    We are good Apache Hadoop citizens:
      - We contribute our work to Apache
      - We share the exact code we run
  • 18. HADOOP MOMENTUM
  • 19. HADOOP IS GOING MAINSTREAM. (Chart, 2007 to 2010; source: The Datagraph Blog.)
  • 20. THE PLATFORM EFFECT: BIRTH OF AN ECOSYSTEM
      - (logo) and other early adopters: scale and productize Hadoop
      - Orgs with Internet-scale problems: add tools / frameworks, enhance Hadoop
      - Service providers: grow ecosystem (training, support, enhancements)
      - Mainstream / enterprise adoption: drive further development, enhancements
    Diagram labels: Apache Hadoop; Enhance Hadoop; Hadoop Ecosystem.
    Virtuous circle! Investment -> Adoption; Adoption -> Investment.
  • 21. THE FUTURE OF HADOOP
  • 22. MAKING HADOOP ENTERPRISE-READY: WHAT’S NEXT
    Hadoop is far from “done”:
      - Current implementation is showing its age
      - Need to address several deficiencies in scalability, flexibility, ease of use & performance
    Yahoo! is working on the Next Generation of Hadoop:
      - MapReduce: rewrite to improve performance; pluggable support for new programming models
      - HDFS: adding volumes to improve scalability; flush & sync support for applications that log to HDFS
    Apache should remain the hub of the Hadoop ecosystem:
      - Yahoo! contributes all Hadoop changes back to Apache Hadoop
      - Everyone benefits from a shared neutral foundation
  • 23. Questions?