hadoop @ Ibmbigdata

Eric Baldeschwieler's talk about Hadoop at Yahoo, given at IBM Big Data Symposium

Published in: Technology, Business

Transcript of "hadoop @ Ibmbigdata"

  1. YAHOO & HADOOP: USING AND IMPROVING APACHE HADOOP AT YAHOO!
     Eric Baldeschwieler, VP, Hadoop Software
  2. AGENDA
     • Brief Overview
     • Hadoop @ Yahoo!
     • Hadoop Momentum
     • The Future of Hadoop
  3. WHAT’S HAPPENING
     - Big Data is here!
     - unstructured data
     - petabyte scale
     - operationally critical
     Flickr: sub_lime79
  4. TURNING DATA INTO INSIGHTS
     machine learning, logic regression, time series, content clustering algorithms, ad inventory modeling, user interest prediction, factorization models
     Flickr: NASA Goddard Photo and Video
  5. MAKING YAHOO RELEVANT
     Flickr: ogimogi
  6. HADOOP: POWERING YAHOO!
     science + big data + insight = personal relevance = VALUE
     Flickr: DDFic
  7. WHAT IS HADOOP?
     Stack (top to bottom):
     • Programming Languages: Pig, Hive
     • Computation: MapReduce
     • Storage: HDFS
     Commodity:
     • Computers
     • Network
     Focus on:
     • Simplicity
     • Redundancy
     • Scale
     • Availability
     Transforms commodity equipment into a service that:
     • HDFS: Stores petabytes of data reliably
     • Map-Reduce: Allows huge distributed computations
     Key Attributes
     • Redundant and reliable: Doesn’t stop or lose data even as hardware fails
     • Easy to program: Our rocket scientists use it directly!
     • Very powerful: Allows the development of big data algorithms & tools
     • Batch processing centric
  8. WHAT HADOOP ISN’T
     • A replacement for relational and data warehouse systems
     • A transactional / online / serving system
     • A low latency or streaming solution
  9. HADOOP IN THE ENTERPRISE
     [Diagram labels: Business Intelligence Applications; HADOOP CLUSTER(S); RDBMS; EDW; Data Marts; Interactions; Transactions, Structured Data; Semi-Structured or Un-Structured Data (Web Logs, Server Logs, Social Media, etc…); Business Applications]
  10. HADOOP @ YAHOO!
  11. HADOOP @ YAHOO!: “Where Science meets Data”
     PRODUCTS
     • Data Analytics
     • Content Optimization
     • Content Enrichment
     • Yahoo! Mail Anti-Spam
     • Advertising Products
     • Ad Optimization
     • Ad Selection
     • Big Data Processing & ETL
     HADOOP CLUSTERS
     • Tens of thousands of servers
     • 10s of Petabytes
     APPLIED SCIENCE
     • User Interest Prediction
     • Ad inventory prediction
     • Machine learning - search ranking
     • Machine learning - ad targeting
     • Machine learning - spam filtering
  12. FROM PROJECT TO CORE PLATFORM
     • 40K+ Servers
     • 170 PB Storage
     • 5M+ Monthly Jobs
     [Chart: Thousands of Servers (axis 0 to 90) and Petabytes (axis 0 to 250) by year, 2006 to 2010; annotations: Research, Science Impact, Daily Production, “Behind every click”]
  13. HADOOP POWERS THE YAHOO! NETWORK
     advertising optimization; data analytics; machine learning; search ranking; advertising data systems; Yahoo! Mail anti-spam; audience, ad and search pipelines; ad selection; Yahoo! Homepage; Content Optimization; ad inventory prediction; user interest prediction
  14. CASE STUDY: YAHOO! HOMEPAGE
     Personalized for each visitor
     Result: twice the engagement
     • Recommended links: +79% clicks vs. randomly selected
     • News Interests: +160% clicks vs. one size fits all
     • Top Searches: +43% clicks vs. editor selected
  15. CASE STUDY: YAHOO! HOMEPAGE
     • Serving Maps: Users - Interests
     • Five Minute Production
     • Weekly Categorization models
     SCIENCE HADOOP CLUSTER
     » Machine learning to build ever better categorization models
     » USER BEHAVIOR -> CATEGORIZATION MODELS (weekly)
     PRODUCTION HADOOP CLUSTER
     » Identify user interests using Categorization models
     » USER BEHAVIOR -> SERVING MAPS (every 5 minutes)
     SERVING SYSTEMS
     » Build customized home pages with latest data (thousands / second)
     » ENGAGED USERS
  16. CASE STUDY: YAHOO! MAIL
     Enabling quick response in the spam arms race
     • 450M mail boxes
     • 5B+ deliveries/day
     • Antispam models retrained every few hours on Hadoop
     40% less spam than Hotmail and 55% less spam than Gmail
     [Diagram labels: SCIENCE, PRODUCTION]
  17. YAHOO! & APACHE HADOOP
     Yahoo! has contributed 70+% of Apache Hadoop code to date
     Hadoop is not our business, but Hadoop is key to our business
     • Yahoo! benefits from open source eco-system around Hadoop
     • Hadoop drives revenue at Yahoo! by making our core products better
     We need Hadoop to be rock solid
     • We invest heavily in core Hadoop development
     • We focus on scalability, reliability, availability
     We fix bugs before you see them
     • We run very large clusters
     • We have a large QA effort
     • We run a huge variety of workloads
     We are good Apache Hadoop citizens
     • We contribute our work to Apache
     • We share the exact code we run
  18. HADOOP MOMENTUM
  19. HADOOP IS GOING MAINSTREAM
     [Chart: 2007, 2008, 2009, 2010; source: The Datagraph Blog]
  20. THE PLATFORM EFFECT: BIRTH OF AN ECOSYSTEM
     • and other Early Adopters: Scale and productize Hadoop
     • Orgs with Internet Scale Problems: Add tools / frameworks, enhance Hadoop
     • Service Providers: Grow ecosystem - Training, support, enhancements
     • Mainstream / Enterprise adoption: Drive further development, enhancements
     Virtuous Circle!
     • Investment -> Adoption
     • Adoption -> Investment
     [Diagram labels: Apache Hadoop; Enhance Hadoop; Ecosystem]
  21. THE FUTURE OF HADOOP
  22. MAKING HADOOP ENTERPRISE-READY: WHAT’S NEXT
     Hadoop is far from “done”
     • Current implementation is showing its age
     • Need to address several deficiencies in scalability, flexibility, ease of use & performance
     Yahoo! is working on Next Generation of Hadoop
     • MapReduce: Rewrite to improve performance; pluggable support for new programming models
     • HDFS: Adding volumes to improve scalability; Flush & sync support for applications that log to HDFS
     Apache should remain the hub of Hadoop ecosystem
     • Yahoo! contributes all Hadoop changes back to Apache Hadoop
     • Everyone benefits from shared neutral foundation
  23. Questions?
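
Slide 7 says Map-Reduce "allows huge distributed computations" but does not show what a job looks like. The following is a minimal single-process sketch of the programming model in Python, offered only as an illustration. It does not use any Hadoop API, and the names `map_words`, `reduce_counts`, and `run_job` are hypothetical helpers invented for this example. A real cluster would run the map and reduce calls on many machines and perform the grouping step over the network.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical helper names: this imitates the MapReduce model in one
# process and is not Hadoop's API.


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce phase: sum all the counts emitted for a single word."""
    return word, sum(counts)


def run_job(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Run map, then group values by key (the shuffle), then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce each key's values independently, in sorted key order.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


logs = ["hadoop stores data", "hadoop processes data"]
word_counts = run_job(logs, map_words, reduce_counts)
print(word_counts)
```

Because each mapper call sees only one record and each reducer call sees only one key, both phases can be split across machines without coordination, which is the property the slide's "huge distributed computations" relies on.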