Yahoo & Hadoop

Transcript

  • 1. YAHOO & HADOOP: USING AND IMPROVING APACHE HADOOP AT YAHOO! Eric Baldeschwieler, VP, Hadoop Software. © 2011 IBM Corporation
  • 2. AGENDA: Brief Overview; Hadoop @ Yahoo!; Hadoop Momentum; The Future of Hadoop
  • 3. What’s happening: Big Data is here! Unstructured data; petabyte scale; operationally critical. (Photo: Flickr / sub_lime79)
  • 4. Turning data into insights: machine learning, logic regression, time series, content clustering algorithms, ad inventory modeling, user interest prediction, factorization models. (Photo: Flickr / NASA Goddard Photo and Video)
  • 5. Making YAHOO relevant. (Photo: Flickr / ogimogi)
  • 6. Hadoop: Powering Yahoo! science + big data + insight = personal relevance = VALUE. (Photo: Flickr / DDFic)
  • 7. WHAT IS HADOOP?
    Diagram labels: Pig, Hive, MapReduce, HDFS
    Commodity: • Computers • Network
    Focus on: • Simplicity • Redundancy • Scale • Availability
    Transforms commodity equipment into a service:
      • HDFS – Stores petabytes of data reliably
      • Map-Reduce – Allows huge distributed computations
    Key Attributes
      • Redundant and reliable – Doesn’t stop or lose data even as hardware fails
      • Easy to program – Our rocket scientists use it directly!
      • Very powerful – Allows the development of big data algorithms & tools
      • Batch processing centric
  • 8. WHAT HADOOP ISN’T
      • A replacement for relational and data warehouse systems
      • A transactional / online / serving system
      • A low latency or streaming solution
  • 9. HADOOP IN THE ENTERPRISE
    Diagram labels: Business Intelligence; Applications; Hadoop Cluster(s); RDBMS; EDW; Data Marts
    Data sources:
      • Transactions, Interactions, Structured Data (Business Applications)
      • Semi-Structured or Un-Structured Data (Web Logs, Server Logs, Social Media, etc…)
  • 10. HADOOP @ YAHOO!
  • 11. HADOOP @ YAHOO! “Where Science meets Data”
    PRODUCTS: Data Analytics; Content Optimization; Content Enrichment; Yahoo! Mail Anti-Spam; Advertising Products; Ad Optimization; Ad Selection; Big Data Processing & ETL
    HADOOP CLUSTERS: Tens of thousands of servers; 10s of Petabytes
    APPLIED SCIENCE: User Interest Prediction; Ad inventory prediction; Machine learning - search ranking; Machine learning - ad targeting; Machine learning - spam filtering
    Diagram labels: DIMENSIONAL DATA; CONTENT; DATA PIPELINES; Terabytes / Day (compressed)
  • 12. FROM PROJECT TO CORE PLATFORM: 40K+ Servers; 170 PB Storage; 5M+ Monthly Jobs. (Chart: Thousands of Servers and Petabytes, 2006 to 2010)
  • 13. HADOOP POWERS THE YAHOO! NETWORK: advertising optimization; data analytics; machine learning; search ranking; advertising data systems; Yahoo! Mail anti-spam; audience, ad and search pipelines; ad selection; Yahoo! Homepage Content Optimization; ad inventory prediction; user interest prediction
  • 14. CASE STUDY: YAHOO! HOMEPAGE. Personalized for each visitor. Result: twice the engagement
      • Recommended links: +79% clicks vs. randomly selected
      • News Interests: +160% clicks vs. one size fits all
      • Top Searches: +43% clicks vs. editor selected
  • 15. CASE STUDY: YAHOO! HOMEPAGE
      • SCIENCE HADOOP CLUSTER: Machine learning to build ever better categorization models. USER BEHAVIOR -> CATEGORIZATION MODELS (weekly)
      • PRODUCTION HADOOP CLUSTER: Identify user interests using Categorization models. USER BEHAVIOR -> SERVING MAPS (every 5 minutes)
      • SERVING SYSTEMS: Build customized home pages with latest data (thousands / second). ENGAGED USERS
    Side notes: Serving Maps (Users - Interests); Five Minute Production; Weekly Categorization models
  • 16. CASE STUDY: YAHOO! MAIL. Enabling quick response in the spam arms race
      • 450M mail boxes
      • 5B+ deliveries/day
      • Antispam models retrained every few hours on Hadoop
      • “40% less spam than Hotmail and 55% less spam than Gmail”
    Diagram labels: SCIENCE; PRODUCTION
  • 17. YAHOO! & APACHE HADOOP
    Yahoo! has contributed 70+% of Apache Hadoop code to date
    Hadoop is not our business, but Hadoop is key to our business
      • Yahoo! benefits from open source eco-system around Hadoop
      • Hadoop drives revenue at Yahoo! by making our core products better
    We need Hadoop to be rock solid
      • We invest heavily in core Hadoop development
      • We focus on scalability, reliability, availability
    We fix bugs before you see them
      • We run very large clusters
      • We have a large QA effort
      • We run a huge variety of workloads
    We are good Apache Hadoop citizens
      • We contribute our work to Apache
      • We share the exact code we run
  • 18. HADOOP MOMENTUM
  • 19. HADOOP IS GOING MAINSTREAM (Chart: 2007 to 2010; credit: The Datagraph Blog)
  • 20. THE PLATFORM EFFECT: BIRTH OF AN ECOSYSTEM
      • and other Early Adopters: Scale and productize Hadoop
      • Orgs with Internet Scale Problems: Add tools / frameworks, enhance Hadoop
      • Service Providers: Grow ecosystem - Training, support, enhancements
      • Mainstream / Enterprise adoption: Drive further development, enhancements
    Diagram labels: Apache Hadoop; Enhance Hadoop Ecosystem
    Virtuous Circle!
      • Investment -> Adoption
      • Adoption -> Investment
  • 21. THE FUTURE OF HADOOP
  • 22. MAKING HADOOP ENTERPRISE-READY: WHAT’S NEXT
    Hadoop is far from “done”
      • Current implementation is showing its age
      • Need to address several deficiencies in scalability, flexibility, ease of use & performance
    Yahoo! is working on Next Generation of Hadoop
      • MapReduce: Rewrite to improve performance; pluggable support for new programming models
      • HDFS: Adding volumes to improve scalability; Flush & sync support for applications that log to HDFS
    Apache should remain the hub of Hadoop ecosystem
      • Yahoo! contributes all Hadoop changes back to Apache Hadoop
      • Everyone benefits from shared neutral foundation
  • 23. Questions?
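
Slide 7 describes Map-Reduce as the layer that allows huge distributed computations over data stored in HDFS. As a rough illustration of that programming model only, here is a minimal single-process sketch in Python, assuming a simple word-count job. The function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample input are hypothetical and are not part of Hadoop's API; a real Hadoop job would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence, in one process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input standing in for lines of a large log file.
log_lines = ["hadoop stores data", "hadoop processes data"]
counts = run_job(log_lines)
print(counts)
```

Because each map call sees only one line and each reduce call sees only one key, the work can be split across machines without the steps needing to coordinate, which is what makes the model suit the batch-processing workloads the slides describe.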