
Yahoo & Hadoop

Published in: Technology, Business
Transcript of "Yahoo & Hadoop"

  1. YAHOO & HADOOP: USING AND IMPROVING APACHE HADOOP AT YAHOO! Eric Baldeschwieler, VP, Hadoop Software. © 2011 IBM Corporation
  2. AGENDA: Brief Overview; Hadoop @ Yahoo!; Hadoop Momentum; The Future of Hadoop
  3. what’s happening: Big Data is here! Unstructured data; petabyte scale; operationally critical. (Photo: Flickr : sub_lime79)
  4. turning data into insights: machine learning, logistic regression, time series, content clustering algorithms, ad inventory modeling, user interest prediction, factorization models. (Photo: Flickr : NASA Goddard Photo and Video)
  5. making YAHOO relevant. (Photo: Flickr : ogimogi)
  6. hadoop: Powering Yahoo! science + big data + insight = personal relevance = VALUE. (Photo: Flickr : DDFic)
  7. WHAT IS HADOOP?
     Stack: Pig, Hive, MapReduce, HDFS
     Commodity: computers, network
     Focus on: simplicity, redundancy, scale, availability
     Transforms commodity equipment into a service:
       • HDFS: stores petabytes of data reliably
       • MapReduce: allows huge distributed computations
     Key attributes:
       • Redundant and reliable: doesn’t stop or lose data even as hardware fails
       • Easy to program: our rocket scientists use it directly!
       • Very powerful: allows the development of big data algorithms & tools
       • Batch-processing centric
  8. WHAT HADOOP ISN’T
       • A replacement for relational and data warehouse systems
       • A transactional / online / serving system
       • A low latency or streaming solution
  9. HADOOP IN THE ENTERPRISE
     [Diagram] Business Intelligence Applications; Hadoop cluster(s); RDBMS, EDW, data marts
     Data sources:
       • Business applications: transactions, interactions, structured data
       • Web logs, server logs, social media, etc.: semi-structured or unstructured data
  10. HADOOP @ YAHOO!
  11. HADOOP @ YAHOO! “Where Science meets Data”
      PRODUCTS: Data Analytics; Content Optimization; Content Enrichment; Yahoo! Mail Anti-Spam; Advertising Products; Ad Optimization; Ad Selection; Big Data Processing & ETL
      APPLIED SCIENCE: User Interest Prediction; Ad inventory prediction; Machine learning (search ranking, ad targeting, spam filtering)
      HADOOP CLUSTERS: tens of thousands of servers; 10s of petabytes
      Data: dimensional data; content; data pipelines; terabytes / day (compressed)
  12. FROM PROJECT TO CORE PLATFORM: 40K+ servers; 170 PB storage; 5M+ monthly jobs. [Chart: thousands of servers and petabytes of storage by year, 2006 to 2010]
  13. HADOOP POWERS THE YAHOO! NETWORK: advertising optimization; data analytics; machine learning; search ranking; advertising data systems; Yahoo! Mail anti-spam; audience, ad and search pipelines; ad selection; Yahoo! Homepage; content optimization; ad inventory prediction; user interest prediction
  14. CASE STUDY: YAHOO! HOMEPAGE. Personalized for each visitor. Result: twice the engagement
       • Recommended links: +79% clicks vs. randomly selected
       • News Interests: +160% clicks vs. one size fits all
       • Top Searches: +43% clicks vs. editor selected
  15. CASE STUDY: YAHOO! HOMEPAGE
       • SCIENCE HADOOP CLUSTER: machine learning to build ever better categorization models. Input: user behavior. Output: categorization models (weekly)
       • PRODUCTION HADOOP CLUSTER: identify user interests using categorization models. Input: user behavior. Output: serving maps (every 5 minutes)
       • SERVING SYSTEMS: build customized home pages with latest data (thousands / second). Result: engaged users
      Side notes: Serving Maps (Users - Interests); Five Minute Production; Weekly Categorization models
  16. CASE STUDY: YAHOO! MAIL. Enabling quick response in the spam arms race
       • 450M mail boxes
       • 5B+ deliveries/day
       • Antispam models retrained every few hours on Hadoop
      “40% less spam than Hotmail and 55% less spam than Gmail”
      [Diagram labels: SCIENCE, PRODUCTION]
  17. YAHOO! & APACHE HADOOP
      Yahoo! has contributed 70+% of Apache Hadoop code to date
      Hadoop is not our business, but Hadoop is key to our business
       • Yahoo! benefits from open source eco-system around Hadoop
       • Hadoop drives revenue at Yahoo! by making our core products better
      We need Hadoop to be rock solid
       • We invest heavily in core Hadoop development
       • We focus on scalability, reliability, availability
      We fix bugs before you see them
       • We run very large clusters
       • We have a large QA effort
       • We run a huge variety of workloads
      We are good Apache Hadoop citizens
       • We contribute our work to Apache
       • We share the exact code we run
  18. HADOOP MOMENTUM
  19. HADOOP IS GOING MAINSTREAM. [Chart: 2007, 2008, 2009, 2010. Source: The Datagraph Blog]
  20. THE PLATFORM EFFECT: BIRTH OF AN ECOSYSTEM
       • and other Early Adopters: scale and productize Hadoop
       • Orgs with Internet Scale Problems: add tools / frameworks, enhance Hadoop
       • Service Providers: grow ecosystem (training, support, enhancements)
       • Mainstream / Enterprise adoption: drive further development, enhancements
      [Diagram labels: Apache Hadoop; Enhance Hadoop Ecosystem]
      Virtuous Circle!
       • Investment -> Adoption
       • Adoption -> Investment
  21. THE FUTURE OF HADOOP
  22. MAKING HADOOP ENTERPRISE-READY: WHAT’S NEXT
      Hadoop is far from “done”
       • Current implementation is showing its age
       • Need to address several deficiencies in scalability, flexibility, ease of use & performance
      Yahoo! is working on Next Generation of Hadoop
       • MapReduce: rewrite to improve performance; pluggable support for new programming models
       • HDFS: adding volumes to improve scalability; flush & sync support for applications that log to HDFS
      Apache should remain the hub of Hadoop ecosystem
       • Yahoo! contributes all Hadoop changes back to Apache Hadoop
       • Everyone benefits from shared neutral foundation
  23. Questions?
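
Slide 7 describes MapReduce as the layer that allows huge distributed computations. As a rough illustration only (it is not from the talk and is not Hadoop's Java API), the map, shuffle, and reduce steps of a word count can be sketched in plain Python on a single machine. The function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical, chosen for this sketch.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in a real job)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce sequentially over an in-memory list of lines."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    lines = ["big data is here", "big data at petabyte scale"]
    print(run_job(lines))
```

In a real cluster the framework runs many map and reduce tasks in parallel across machines and reads its input from HDFS; this sketch only shows the data flow between the phases.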