YAHOO & HADOOP
USING AND IMPROVING APACHE HADOOP AT YAHOO!

                Eric Baldeschwieler
                VP, Hadoop Software
AGENDA

•  Brief Overview
•  Hadoop @ Yahoo!
•  Hadoop Momentum
•  The Future of Hadoop
  
WHAT’S HAPPENING

- Big Data is here!
- unstructured data
- petabyte scale
- operationally critical




Flickr : sub_lime79
TURNING DATA
   INTO INSIGHTS

•  machine learning
•  logic regression
•  time series
•  content clustering algorithms
•  ad inventory modeling
•  user interest prediction
•  factorization models
Flickr : NASA Goddard Photo and Video
MAKING YAHOO
    RELEVANT




Flickr : ogimogi
HADOOP: POWERING YAHOO!

science + big data + insight = personal relevance = VALUE




Flickr : DDFic
WHAT IS HADOOP?
[Diagram: the Hadoop stack]
•  Programming Languages: Pig, Hive
•  Computation: MapReduce
•  Storage: HDFS

Commodity
•  Computers
•  Network

Focus on
•  Simplicity
•  Redundancy
•  Scale
•  Availability


Transforms commodity equipment into a service that:
•  HDFS – Stores petabytes of data reliably
•  MapReduce – Allows huge distributed computations

Key Attributes
•  Redundant and reliable – Doesn’t stop or lose data even as hardware fails
•  Easy to program – Our rocket scientists use it directly!
•  Very powerful – Allows the development of big data algorithms & tools
•  Batch processing centric
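
A minimal sketch of the MapReduce programming model named above, assuming a word count over a couple of made-up log lines. It runs in a single process and does not use the Hadoop API; the function names (map_words, reduce_counts, run_mapreduce) and the sample data are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce phase: sum all counts that were emitted for one word."""
    return word, sum(counts)


def run_mapreduce(
    lines: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Single-process stand-in for the map, shuffle and reduce steps."""
    # Shuffle: group every mapped value under its key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Each key is reduced independently of the others, which is what
    # lets a real cluster spread the reduce work across many machines.
    return dict(reducer(key, values) for key, values in groups.items())


# Hypothetical sample input.
logs = ["hadoop stores data", "hadoop processes data"]
word_counts = run_mapreduce(logs, map_words, reduce_counts)
print(word_counts)
```

The programmer writes only the map and reduce functions; grouping by key and distributing the work is the framework's job, which is one way to read the "easy to program" attribute above.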
WHAT HADOOP ISN’T

•  A replacement for relational and data warehouse systems
•  A transactional / online / serving system
•  A low latency or streaming solution
  
HADOOP IN THE ENTERPRISE
[Diagram: Hadoop cluster(s) alongside RDBMS, EDW and Data Marts, with Business Intelligence Applications on top]

Hadoop cluster(s) side:
•  Interactions
•  Semi-Structured or Un-Structured Data
•  Web Logs, Server Logs, Social Media, etc…

RDBMS / EDW / Data Marts side:
•  Transactions, Structured Data
•  Business Applications
  
HADOOP @ YAHOO!




HADOOP @ YAHOO!
“Where Science meets Data”

HADOOP CLUSTERS
•  Tens of thousands of servers
•  10s of Petabytes

PRODUCTS
•  Data Analytics
•  Content Optimization
•  Content Enrichment
•  Yahoo! Mail Anti-Spam
•  Advertising Products
•  Ad Optimization
•  Ad Selection
•  Big Data Processing & ETL

APPLIED SCIENCE
•  User Interest Prediction
•  Ad inventory prediction
•  Machine learning - search ranking
•  Machine learning - ad targeting
•  Machine learning - spam filtering
  
FROM PROJECT TO
CORE PLATFORM
[Chart: Hadoop growth at Yahoo!, 2006 to 2010. Left axis: Thousands of Servers (0 to 90). Right axis: Petabytes (0 to 250). Stages marked along the timeline: Research, Science Impact, Daily Production, “Behind every click”.]

•  40K+ Servers
•  170 PB Storage
•  5M+ Monthly Jobs
  
HADOOP POWERS THE
YAHOO! NETWORK



•  advertising optimization
•  data analytics
•  machine learning
•  search ranking
•  advertising data systems
•  Yahoo! Mail anti-spam
•  audience, ad and search pipelines
•  ad selection
•  Yahoo! Homepage Content Optimization
•  ad inventory prediction
•  user interest prediction
  
CASE STUDY
YAHOO! HOMEPAGE

Personalized for each visitor

Result: twice the engagement

•  Recommended links: +79% clicks vs. randomly selected
•  News Interests: +160% clicks vs. one size fits all
•  Top Searches: +43% clicks vs. editor selected
  
CASE STUDY
 YAHOO! HOMEPAGE

•  Serving Maps
       •  Users - Interests
•  Five Minute Production
•  Weekly Categorization models

[Diagram: homepage personalization pipeline]

SCIENCE HADOOP CLUSTER
»  Machine learning to build ever better categorization models
•  Labels: USER BEHAVIOR, CATEGORIZATION MODELS (weekly)

PRODUCTION HADOOP CLUSTER
»  Identify user interests using Categorization models
•  Labels: USER BEHAVIOR, SERVING MAPS (every 5 minutes)

SERVING SYSTEMS
•  Label: ENGAGED USERS

Build customized home pages with latest data (thousands / second)
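
A rough sketch of the two cadences on this slide, assuming a much simpler setup than the real pipeline: a lookup table stands in for the weekly categorization model, a small batch function stands in for the five-minute job that builds the serving map, and serving is a plain dictionary lookup. All names (CATEGORY_MODEL, build_serving_map, serve_homepage) and the sample events are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical stand-in for the weekly categorization model:
# a fixed table from page to interest category.
CATEGORY_MODEL: Dict[str, str] = {
    "nfl-scores": "sports",
    "box-office": "movies",
    "earnings": "finance",
}


def build_serving_map(
    events: List[Tuple[str, str]], model: Dict[str, str]
) -> Dict[str, str]:
    """Batch step (every five minutes on the slide): map each user
    to the category they clicked most in the recent events."""
    per_user: Dict[str, Counter] = defaultdict(Counter)
    for user, page in events:
        category = model.get(page)
        if category is not None:
            per_user[user][category] += 1
    return {
        user: counts.most_common(1)[0][0]
        for user, counts in per_user.items()
    }


def serve_homepage(
    user: str, serving_map: Dict[str, str], default: str = "top-stories"
) -> str:
    """Serving step: a cheap lookup, with no batch work on the request path."""
    return serving_map.get(user, default)


# Hypothetical (user, page) click events.
events = [
    ("alice", "nfl-scores"),
    ("alice", "nfl-scores"),
    ("alice", "earnings"),
    ("bob", "box-office"),
]
serving_map = build_serving_map(events, CATEGORY_MODEL)
print(serve_homepage("alice", serving_map))
print(serve_homepage("carol", serving_map))
```

Keeping the heavy computation in batch jobs and handing the serving tier a precomputed map is consistent with the earlier point that Hadoop is not itself an online serving system.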
  
CASE STUDY
YAHOO! MAIL

Enabling quick response in the spam arms race

•  450M mailboxes
•  5B+ deliveries/day
•  Antispam models retrained every few hours on Hadoop

[Diagram labels: SCIENCE, PRODUCTION]

40% less spam than Hotmail and 55% less spam than Gmail
  
YAHOO! & APACHE HADOOP
Yahoo! has contributed 70+% of Apache Hadoop code to date

Hadoop is not our business, but Hadoop is key to our business
•  Yahoo! benefits from open source eco-system around Hadoop
•  Hadoop drives revenue at Yahoo! by making our core products better

We need Hadoop to be rock solid
•  We invest heavily in core Hadoop development
•  We focus on scalability, reliability, availability

We fix bugs before you see them
•  We run very large clusters
•  We have a large QA effort
•  We run a huge variety of workloads

We are good Apache Hadoop citizens
•  We contribute our work to Apache
•  We share the exact code we run
HADOOP
MOMENTUM




HADOOP IS GOING
MAINSTREAM

[Timeline: 2007, 2008, 2009, 2010]

The Datagraph Blog
  
THE PLATFORM EFFECT
BIRTH OF AN ECOSYSTEM

[Diagram: Apache Hadoop at the center of a cycle labeled "Enhance Hadoop Ecosystem"]

•  [logo] and other Early Adopters: Scale and productize Hadoop
•  Orgs with Internet Scale Problems: Add tools / frameworks, enhance Hadoop
•  Service Providers: Grow ecosystem - Training, support, enhancements
•  Mainstream / Enterprise adoption: Drive further development, enhancements

Virtuous Circle!
•  Investment -> Adoption
•  Adoption -> Investment
  
THE FUTURE OF
HADOOP




MAKING HADOOP ENTERPRISE-READY
WHAT’S NEXT
Hadoop is far from “done”
•  Current implementation is showing its age
•  Need to address several deficiencies in scalability, flexibility,
   ease of use & performance

Yahoo! is working on Next Generation of Hadoop
•  MapReduce: Rewrite to improve performance;
   pluggable support for new programming models
•  HDFS: Adding volumes to improve scalability;
   Flush & sync support for applications that log to HDFS

Apache should remain the hub of Hadoop ecosystem
•  Yahoo! contributes all Hadoop changes back to Apache Hadoop
•  Everyone benefits from shared neutral foundation
  
Questions?





hadoop @ Ibmbigdata
