hadoop @ Ibmbigdata

Eric Baldeschwieler's talk about Hadoop at Yahoo, given at IBM Big Data Symposium

Transcript

  • 1. YAHOO & HADOOP: USING AND IMPROVING APACHE HADOOP AT YAHOO! Eric Baldeschwieler, VP, Hadoop Software
  • 2. AGENDA
    • Brief Overview
    • Hadoop @ Yahoo!
    • Hadoop Momentum
    • The Future of Hadoop
  • 3. WHAT'S HAPPENING: Big Data is here!
    • unstructured data
    • petabyte scale
    • operationally critical
    (Flickr: sub_lime79)
  • 4. TURNING DATA INTO INSIGHTS: machine learning, logic regression, time series, content clustering algorithms, ad inventory modeling, user interest prediction, factorization models (Flickr: NASA Goddard Photo and Video)
  • 5. MAKING YAHOO RELEVANT (Flickr: ogimogi)
  • 6. HADOOP: POWERING YAHOO! science + big data + insight = personal relevance = VALUE (Flickr: DDFic)
  • 7. WHAT IS HADOOP?
    Stack (diagram):
    • Programming Languages: Pig, Hive
    • Computation: MapReduce
    • Storage: HDFS
    Commodity:
    • Computers
    • Network
    Focus on:
    • Simplicity
    • Redundancy
    • Scale
    • Availability
    Transforms commodity equipment into a service that:
    • HDFS: Stores petabytes of data reliably
    • Map-Reduce: Allows huge distributed computations
    Key Attributes
    • Redundant and reliable: Doesn't stop or lose data even as hardware fails
    • Easy to program: Our rocket scientists use it directly!
    • Very powerful: Allows the development of big data algorithms & tools
    • Batch processing centric
  • 8. WHAT HADOOP ISN'T
    • A replacement for relational and data warehouse systems
    • A transactional / online / serving system
    • A low latency or streaming solution
  • 9. HADOOP IN THE ENTERPRISE (diagram)
    • Business Intelligence Applications
    • HADOOP CLUSTER(S)
    • RDMS, EDW, Data Marts
    • Interactions: Semi-Structured or Un-Structured Data (Web Logs, Server Logs, Social Media, etc…)
    • Transactions, Structured Data
    • Business Applications
  • 10. HADOOP @ YAHOO!
  • 11. HADOOP @ YAHOO! "Where Science meets Data"
    PRODUCTS
    • Data Analytics
    • Content Optimization
    • Content Enrichment
    • Yahoo! Mail Anti-Spam
    • Advertising Products
    • Ad Optimization
    • Ad Selection
    • Big Data Processing & ETL
    HADOOP CLUSTERS
    • Tens of thousands of servers
    • 10s of Petabytes
    APPLIED SCIENCE
    • User Interest Prediction
    • Ad inventory prediction
    • Machine learning - search ranking
    • Machine learning - ad targeting
    • Machine learning - spam filtering
  • 12. FROM PROJECT TO CORE PLATFORM
    • 40K+ Servers
    • 170 PB Storage
    • 5M+ Monthly Jobs
    • "Behind every click"
    [Chart: Thousands of Servers and Petabytes by year, 2006 to 2010; phases labeled Research, Science Impact, Daily Production]
  • 13. HADOOP POWERS THE YAHOO! NETWORK: advertising optimization; data analytics; machine learning; search ranking; advertising data systems; Yahoo! Mail anti-spam; audience, ad and search pipelines; ad selection; Yahoo! Homepage; Content Optimization; ad inventory prediction; user interest prediction
  • 14. CASE STUDY: YAHOO! HOMEPAGE
    Personalized for each visitor
    Result: twice the engagement
    • Recommended links: +79% clicks vs. randomly selected
    • News Interests: +160% clicks vs. one size fits all
    • Top Searches: +43% clicks vs. editor selected
  • 15. CASE STUDY: YAHOO! HOMEPAGE
    • Serving Maps: Users - Interests
    • Five Minute Production
    • Weekly Categorization models
    Pipeline (diagram):
    • SCIENCE HADOOP CLUSTER: Machine learning to build ever better categorization models. Input: USER BEHAVIOR. Output: CATEGORIZATION MODELS (weekly)
    • PRODUCTION HADOOP CLUSTER: Identify user interests using Categorization models. Input: USER BEHAVIOR. Output: SERVING MAPS (every 5 minutes)
    • SERVING SYSTEMS: Build customized home pages with latest data (thousands / second) for ENGAGED USERS
  • 16. CASE STUDY: YAHOO! MAIL
    Enabling quick response in the spam arms race
    • 450M mail boxes
    • 5B+ deliveries/day
    • SCIENCE / PRODUCTION: Antispam models retrained every few hours on Hadoop
    40% less spam than Hotmail and 55% less spam than Gmail
  • 17. YAHOO! & APACHE HADOOP
    Yahoo! has contributed 70+% of Apache Hadoop code to date
    Hadoop is not our business, but Hadoop is key to our business
    • Yahoo! benefits from open source eco-system around Hadoop
    • Hadoop drives revenue at Yahoo! by making our core products better
    We need Hadoop to be rock solid
    • We invest heavily in core Hadoop development
    • We focus on scalability, reliability, availability
    We fix bugs before you see them
    • We run very large clusters
    • We have a large QA effort
    • We run a huge variety of workloads
    We are good Apache Hadoop citizens
    • We contribute our work to Apache
    • We share the exact code we run
  • 18. HADOOP MOMENTUM
  • 19. HADOOP IS GOING MAINSTREAM: 2007, 2008, 2009, 2010 (The Datagraph Blog)
  • 20. THE PLATFORM EFFECT: BIRTH OF AN ECOSYSTEM
    • … and other Early Adopters: Scale and productize Hadoop
    • Orgs with Internet Scale Problems: Add tools / frameworks, enhance Hadoop
    • Service Providers: Grow ecosystem - Training, support, enhancements
    • Mainstream / Enterprise adoption: Drive further development, enhancements
    Diagram labels: Apache Hadoop, Enhance Hadoop, Hadoop Ecosystem
    Virtuous Circle!
    • Investment -> Adoption
    • Adoption -> Investment
  • 21. THE FUTURE OF HADOOP
  • 22. MAKING HADOOP ENTERPRISE-READY: WHAT'S NEXT
    Hadoop is far from "done"
    • Current implementation is showing its age
    • Need to address several deficiencies in scalability, flexibility, ease of use & performance
    Yahoo! is working on Next Generation of Hadoop
    • MapReduce: Rewrite to improve performance; pluggable support for new programming models
    • HDFS: Adding volumes to improve scalability; Flush & sync support for applications that log to HDFS
    Apache should remain the hub of Hadoop ecosystem
    • Yahoo! contributes all Hadoop changes back to Apache Hadoop
    • Everyone benefits from shared neutral foundation
  • 23. Questions?
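
Slide 7 describes MapReduce only as the computation layer that "allows huge distributed computations". As a rough illustration of that programming model (not code from the talk, and not the Hadoop API), the following single-process Python sketch counts words in three steps: map, shuffle, and reduce. The function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical; in a real Hadoop job the map and reduce steps would run on many machines and the framework would perform the shuffle.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all values by key.

    A MapReduce framework does this between map and reduce;
    here it is simulated in memory.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    logs = ["hadoop stores data", "hadoop processes data"]
    print(word_count(logs))
```

Because each call to `map_phase` looks at one record and each call to `reduce_phase` looks at one key, both steps can in principle be spread across commodity machines, which is the property the slide emphasizes.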