hadoop @ Ibmbigdata

Eric Baldeschwieler's talk about Hadoop at Yahoo, given at IBM Big Data Symposium


    hadoop @ Ibmbigdata Presentation Transcript

    • YAHOO! & HADOOP: USING AND IMPROVING APACHE HADOOP AT YAHOO! Eric Baldeschwieler, VP, Hadoop Software
    • AGENDA
      - Brief Overview
      - Hadoop @ Yahoo!
      - Hadoop Momentum
      - The Future of Hadoop
    • WHAT'S HAPPENING: Big Data is here!
      - unstructured data
      - petabyte scale
      - operationally critical
      (Photo: Flickr, sub_lime79)
    • TURNING DATA INTO INSIGHTS: machine learning; logic regression; time series; content clustering algorithms; ad inventory modeling; user interest prediction; factorization models (Photo: Flickr, NASA Goddard Photo and Video)
    • MAKING YAHOO RELEVANT (Photo: Flickr, ogimogi)
    • HADOOP: POWERING YAHOO! science + big data + insight = personal relevance = VALUE (Photo: Flickr, DDFic)
    • WHAT IS HADOOP?
      Stack, top to bottom: Pig and Hive (programming languages); MapReduce (computation); HDFS (storage); commodity computers and network.
      Focus on: simplicity, redundancy, scale, availability.
      Transforms commodity equipment into a service:
      - HDFS: stores petabytes of data reliably
      - Map-Reduce: allows huge distributed computations
      Key attributes:
      - Redundant and reliable: doesn't stop or lose data even as hardware fails
      - Easy to program: our rocket scientists use it directly!
      - Very powerful: allows the development of big data algorithms & tools
      - Batch processing centric
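The Map-Reduce model named on the slide above can be illustrated with a small word count. This is only a single-process sketch in plain Python, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample input are hypothetical, and a real cluster would run the map and reduce steps in parallel across many machines over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework in a real system)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over an iterable of text lines."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input standing in for lines of a log file.
    print(run_job(["hadoop stores data", "hadoop processes data"]))
```

Because each map call looks at one record and each reduce call looks at one key, both steps can be spread over many machines, which is what allows the "huge distributed computations" the slide mentions.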
    • WHAT HADOOP ISN'T
      - A replacement for relational and data warehouse systems
      - A transactional / online / serving system
      - A low latency or streaming solution
    • HADOOP IN THE ENTERPRISE
      [Diagram labels: Business Intelligence Applications; Hadoop Cluster(s); Data Marts; RDBMS; EDW; Interactions; Transactions, Structured Data; Semi-Structured or Un-Structured Data (Web Logs, Server Logs, Social Media, etc.); Business Applications]
    • HADOOP @ YAHOO!
    • HADOOP @ YAHOO! “Where Science meets Data”
      - PRODUCTS: Data Analytics; Content Optimization; Content Enrichment; Yahoo! Mail Anti-Spam; Advertising Products; Ad Optimization; Ad Selection; Big Data Processing & ETL
      - HADOOP CLUSTERS: tens of thousands of servers; 10s of petabytes
      - APPLIED SCIENCE: User Interest Prediction; Ad inventory prediction; Machine learning (search ranking, ad targeting, spam filtering)
    • FROM PROJECT TO CORE PLATFORM
      - 40K+ servers, 170 PB storage, 5M+ monthly jobs
      - “Behind every click”
      [Chart, 2006 to 2010: thousands of servers and petabytes of storage over time, annotated with the stages Research, Science Impact, and Daily Production]
    • HADOOP POWERS THE YAHOO! NETWORK: advertising optimization; data analytics; machine learning; search ranking; advertising data systems; Yahoo! Mail anti-spam; audience, ad and search pipelines; ad selection; Yahoo! Homepage content optimization; ad inventory prediction; user interest prediction
    • CASE STUDY: YAHOO! HOMEPAGE
      Personalized for each visitor. Result: twice the engagement.
      - Recommended links: +79% clicks vs. randomly selected
      - News Interests: +160% clicks vs. one size fits all
      - Top Searches: +43% clicks vs. editor selected
    • CASE STUDY: YAHOO! HOMEPAGE
      - SCIENCE HADOOP CLUSTER: machine learning to build ever better categorization models. Input: user behavior. Output: categorization models (weekly).
      - PRODUCTION HADOOP CLUSTER: identify user interests using categorization models. Input: user behavior. Output: serving maps (every 5 minutes).
      - SERVING SYSTEMS: build customized home pages with latest data (thousands / second) for engaged users.
      Legend: Serving Maps (Users - Interests); Five Minute Production; Weekly Categorization models.
    • CASE STUDY: YAHOO! MAIL
      Enabling quick response in the spam arms race
      - 450M mail boxes
      - 5B+ deliveries/day
      - Antispam models retrained every few hours on Hadoop
      40% less spam than Hotmail and 55% less spam than Gmail
      [Diagram labels: SCIENCE, PRODUCTION]
    • YAHOO! & APACHE HADOOP
      Yahoo! has contributed 70+% of Apache Hadoop code to date
      Hadoop is not our business, but Hadoop is key to our business
      - Yahoo! benefits from open source eco-system around Hadoop
      - Hadoop drives revenue at Yahoo! by making our core products better
      We need Hadoop to be rock solid
      - We invest heavily in core Hadoop development
      - We focus on scalability, reliability, availability
      We fix bugs before you see them
      - We run very large clusters
      - We have a large QA effort
      - We run a huge variety of workloads
      We are good Apache Hadoop citizens
      - We contribute our work to Apache
      - We share the exact code we run
    • HADOOP IS GOING MAINSTREAM
      [Chart, 2007 to 2010. Source: The Datagraph Blog]
    • THE PLATFORM EFFECT: BIRTH OF AN ECOSYSTEM
      - [...] and other Early Adopters: scale and productize Hadoop
      - Orgs with Internet Scale Problems: add tools / frameworks, enhance Hadoop
      - Service Providers: grow ecosystem (training, support, enhancements)
      - Mainstream / Enterprise adoption: drive further development, enhancements
      [Diagram labels: Apache Hadoop; Enhance Hadoop; Hadoop Ecosystem]
      Virtuous Circle!
      - Investment -> Adoption
      - Adoption -> Investment
    • MAKING HADOOP ENTERPRISE-READY: WHAT'S NEXT
      Hadoop is far from “done”
      - Current implementation is showing its age
      - Need to address several deficiencies in scalability, flexibility, ease of use & performance
      Yahoo! is working on Next Generation of Hadoop
      - MapReduce: rewrite to improve performance; pluggable support for new programming models
      - HDFS: adding volumes to improve scalability; flush & sync support for applications that log to HDFS
      Apache should remain the hub of Hadoop ecosystem
      - Yahoo! contributes all Hadoop changes back to Apache Hadoop
      - Everyone benefits from shared neutral foundation
    • Questions?