YAHOO & HADOOP
USING AND IMPROVING APACHE HADOOP AT YAHOO!

                Eric Baldeschwieler
                VP, Hadoop Software
AGENDA

         •  Brief Overview
         •  Hadoop @ Yahoo!
         •  Hadoop Momentum
         •  The Future of Hadoop

WHAT’S
    happening

                      - Big Data is here!
                      - unstructured data
                      - petabyte scale
                      - operationally critical




Flickr : sub_lime79
TURNING DATA
   INTO INSIGHTS

        machine learning
logic regression                            time series
      content clustering
      algorithms ad inventory modeling
            user interest prediction
                                        factorization models
Flickr : NASA Goddard Photo and Video
MAKING YAHOO
    RELEVANT




Flickr : ogimogi
HADOOP:
    POWERING
    YAHOO!
                 science + big data + insight =
                 personal relevance = VALUE




Flickr : DDFic
WHAT IS HADOOP?
[Diagram: the Hadoop stack, top to bottom]
•  Programming Languages – Pig, Hive
•  Computation – MapReduce
•  Storage – HDFS

Commodity
•  Computers
•  Network

Focus on
•  Simplicity
•  Redundancy
•  Scale
•  Availability


Transforms commodity equipment into a service that:
•  HDFS – Stores petabytes of data reliably
•  MapReduce – Allows huge distributed computations

Key Attributes
•  Redundant and reliable – Doesn’t stop or lose data even as hardware fails
•  Easy to program – Our rocket scientists use it directly!
•  Very powerful – Allows the development of big data algorithms & tools
•  Batch-processing centric
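
To make the MapReduce bullet concrete, here is a minimal single-process sketch of the programming model in Python. It is not Hadoop code and does not use any Hadoop API: the names `map_words`, `reduce_counts`, and `run_job` are hypothetical, chosen only for illustration, and the in-memory "shuffle" stands in for the work a real cluster would distribute across many machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)


def run_job(
    lines: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Hypothetical in-memory driver (an assumption, not a Hadoop API)."""
    # Shuffle: group every mapped value under its key.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    # Reduce: one call per distinct key.
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is here", "Big Data at petabyte scale"]
    print(run_job(sample, map_words, reduce_counts))
```

The point of the model is that the map and reduce functions carry no knowledge of where the data lives or how many machines run them, which is what lets a framework spread the same two functions over a large cluster.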
WHAT HADOOP ISN’T

•  A replacement for relational and data warehouse systems
•  A transactional / online / serving system
•  A low latency or streaming solution

HADOOP IN THE ENTERPRISE
[Diagram]
•  Business Intelligence Applications
•  HADOOP CLUSTER(S) | RDBMS | EDW | Data Marts
•  Interactions: Semi-Structured or Un-Structured Data (Web Logs, Server Logs, Social Media, etc.)
•  Transactions, Structured Data (Business Applications)

HADOOP @ YAHOO!




HADOOP @ YAHOO!
“Where Science meets Data”

HADOOP CLUSTERS
•  Tens of thousands of servers
•  10s of Petabytes

PRODUCTS
•  Data Analytics
•  Content Optimization
•  Content Enrichment
•  Yahoo! Mail Anti-Spam
•  Advertising Products
•  Ad Optimization
•  Ad Selection
•  Big Data Processing & ETL

APPLIED SCIENCE
•  User Interest Prediction
•  Ad inventory prediction
•  Machine learning - search ranking
•  Machine learning - ad targeting
•  Machine learning - spam filtering

FROM PROJECT TO
CORE PLATFORM
[Chart: growth of Hadoop at Yahoo!, 2006 to 2010. Left axis: Thousands of Servers (0 to 90); right axis: Petabytes (0 to 250). Phases marked on the curve: Research, Science Impact, Daily Production, “Behind every click”.]

•  40K+ Servers
•  170 PB Storage
•  5M+ Monthly Jobs

HADOOP POWERS THE
YAHOO! NETWORK



    advertising optimization data analytics
           machine learning search ranking
 advertising data systems   Yahoo! Mail anti-spam
  audience, ad and search pipelines          ad selection

 Yahoo! Homepage Content Optimization
                   ad inventory prediction
         user interest prediction

CASE STUDY
  YAHOO! HOMEPAGE

Personalized for each visitor
Result: twice the engagement

•  Recommended links: +79% clicks vs. randomly selected
•  News Interests: +160% clicks vs. one size fits all
•  Top Searches: +43% clicks vs. editor selected

CASE STUDY
 YAHOO! HOMEPAGE

•  Serving Maps
       •  Users - Interests
•  Five Minute Production
•  Weekly Categorization models

[Diagram]
•  SCIENCE HADOOP CLUSTER » Machine learning to build ever better categorization models
       •  USER BEHAVIOR
       •  CATEGORIZATION MODELS (weekly)
•  PRODUCTION HADOOP CLUSTER » Identify user interests using Categorization models
       •  USER BEHAVIOR
       •  SERVING MAPS (every 5 minutes)
•  SERVING SYSTEMS
•  ENGAGED USERS

Build customized home pages with latest data (thousands / second)

CASE STUDY
YAHOO! MAIL

    Enabling quick response in the spam arms race

•  450M mail boxes
•  5B+ deliveries/day
•  Antispam models retrained every few hours on Hadoop

[Diagram labels: SCIENCE, PRODUCTION]

40% less spam than Hotmail and 55% less spam than Gmail

YAHOO! & APACHE HADOOP
Yahoo! has contributed 70+% of Apache Hadoop code to date

Hadoop is not our business, but Hadoop is key to our business
•  Yahoo! benefits from open source eco-system around Hadoop
•  Hadoop drives revenue at Yahoo! by making our core products better

We need Hadoop to be rock solid
•  We invest heavily in core Hadoop development
•  We focus on scalability, reliability, availability

We fix bugs before you see them
•  We run very large clusters
•  We have a large QA effort
•  We run a huge variety of workloads

We are good Apache Hadoop citizens
•  We contribute our work to Apache
•  We share the exact code we run

HADOOP
MOMENTUM




HADOOP IS GOING
MAINSTREAM

[Chart, 2007 to 2010. Source: The Datagraph Blog]

THE PLATFORM EFFECT
  BIRTH OF AN ECOSYSTEM

[Diagram center: Apache Hadoop, Enhance Hadoop Ecosystem]

•  and other Early Adopters
       Scale and productize Hadoop
•  Orgs with Internet Scale Problems
       Add tools / frameworks, enhance Hadoop
•  Service Providers
       Grow ecosystem - Training, support, enhancements
•  Mainstream / Enterprise adoption
       Drive further development, enhancements

Virtuous Circle!
•  Investment -> Adoption
•  Adoption -> Investment

THE FUTURE OF
HADOOP




MAKING HADOOP ENTERPRISE-READY
WHAT’S NEXT
Hadoop is far from “done”
       •  Current implementation is showing its age
       •  Need to address several deficiencies in scalability, flexibility,
          ease of use & performance

Yahoo! is working on Next Generation of Hadoop
       •  MapReduce: Rewrite to improve performance;
          pluggable support for new programming models
       •  HDFS: Adding volumes to improve scalability;
          Flush & sync support for applications that log to HDFS

Apache should remain the hub of Hadoop ecosystem
       •  Yahoo! contributes all Hadoop changes back to Apache Hadoop
       •  Everyone benefits from shared neutral foundation

Questions?





By Design, not by Accident - Agile Venture Bolzano 2024
 
Enhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZEnhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZ
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 

hadoop @ Ibmbigdata

  • 1. YAHOO & HADOOP: USING AND IMPROVING APACHE HADOOP AT YAHOO! Eric Baldeschwieler, VP, Hadoop Software
  • 2. AGENDA • Brief Overview • Hadoop @ Yahoo! • Hadoop Momentum • The Future of Hadoop
  • 3. WHAT’S HAPPENING - Big Data is here! - unstructured data - petabyte scale - operationally critical Flickr : sub_lime79
  • 4. TURNING DATA INTO INSIGHTS machine learning, logic regression, time series, content clustering, algorithms, ad inventory modeling, user interest prediction, factorization models Flickr : NASA Goddard Photo and Video
  • 5. MAKING YAHOO RELEVANT Flickr : ogimogi
  • 6. HADOOP: POWERING YAHOO! science + big data + insight = personal relevance = VALUE Flickr : DDFic
  • 7. WHAT IS HADOOP? Pig, Hive (Programming Languages); MapReduce (Computation); HDFS (Storage). Commodity: • Computers • Network. Focus on: • Simplicity • Redundancy • Scale • Availability. Transforms commodity equipment into a service that: • HDFS: Stores petabytes of data reliably • MapReduce: Allows huge distributed computations. Key Attributes: • Redundant and reliable: Doesn’t stop or lose data even as hardware fails • Easy to program: Our rocket scientists use it directly! • Very powerful: Allows the development of big data algorithms & tools • Batch processing centric
  • 8. WHAT HADOOP ISN’T • A replacement for relational and data warehouse systems • A transactional / online / serving system • A low latency or streaming solution
  • 9. HADOOP IN THE ENTERPRISE [Diagram labels: Business Intelligence Applications; HADOOP CLUSTER(S); RDBMS, EDW, Data Marts; Business Applications; Interactions, Transactions, Structured Data; Semi-Structured or Unstructured Data: Web Logs, Server Logs, Social Media, etc…]
  • 11. HADOOP @ YAHOO! “Where Science meets Data” PRODUCTS: Data Analytics, Content Optimization, Content Enrichment, Yahoo! Mail Anti-Spam, Advertising Products, Ad Optimization, Ad Selection, Big Data Processing & ETL. HADOOP CLUSTERS: Tens of thousands of servers, 10s of Petabytes. APPLIED SCIENCE: User Interest Prediction, Ad inventory prediction, Machine learning - search ranking, Machine learning - ad targeting, Machine learning - spam filtering
  • 12. FROM PROJECT TO CORE PLATFORM • 40K+ Servers • 170 PB Storage • 5M+ Monthly Jobs • “Behind every click” [Chart: Thousands of Servers and Petabytes by year, 2006 to 2010; stages labeled Research, Science Impact, Daily Production]
  • 13. HADOOP POWERS THE YAHOO! NETWORK • advertising optimization • data analytics • machine learning • search ranking • advertising data systems • Yahoo! Mail anti-spam • audience, ad and search pipelines • ad selection • Yahoo! Homepage Content Optimization • ad inventory prediction • user interest prediction
  • 14. CASE STUDY YAHOO! HOMEPAGE Personalized for each visitor. Result: twice the engagement • Recommended links: +79% clicks vs. randomly selected • News Interests: +160% clicks vs. one size fits all • Top Searches: +43% clicks vs. editor selected
  • 15. CASE STUDY YAHOO! HOMEPAGE • SCIENCE HADOOP CLUSTER: Machine learning to build ever better categorization models; USER BEHAVIOR in, CATEGORIZATION MODELS out (weekly) • PRODUCTION HADOOP CLUSTER: Identify user interests using Categorization models; USER BEHAVIOR in, SERVING MAPS out (every 5 minutes) • SERVING SYSTEMS: Build customized home pages with latest data (thousands / second) for ENGAGED USERS • Serving Maps: Users - Interests • Five Minute Production • Weekly Categorization models
  • 16. CASE STUDY YAHOO! MAIL Enabling quick response in the spam arms race • 450M mailboxes • 5B+ deliveries/day • Antispam models retrained every few hours on Hadoop • 40% less spam than Hotmail and 55% less spam than Gmail [Diagram labels: SCIENCE, PRODUCTION]
  • 17. YAHOO! & APACHE HADOOP Yahoo! has contributed 70+% of Apache Hadoop code to date. Hadoop is not our business, but Hadoop is key to our business: • Yahoo! benefits from the open source ecosystem around Hadoop • Hadoop drives revenue at Yahoo! by making our core products better. We need Hadoop to be rock solid: • We invest heavily in core Hadoop development • We focus on scalability, reliability, availability. We fix bugs before you see them: • We run very large clusters • We have a large QA effort • We run a huge variety of workloads. We are good Apache Hadoop citizens: • We contribute our work to Apache • We share the exact code we run
  • 19. HADOOP IS GOING MAINSTREAM [Chart covering 2007 to 2010; source: The Datagraph Blog]
  • 20. THE PLATFORM EFFECT: BIRTH OF AN ECOSYSTEM • and other Early Adopters: Scale and productize Hadoop • Orgs with Internet Scale Problems: Add tools / frameworks, enhance Hadoop • Service Providers: Grow ecosystem - Training, support, enhancements • Mainstream / Enterprise adoption: Drive further development, enhancements. Virtuous Circle! • Investment -> Adoption • Adoption -> Investment [Diagram labels: Apache Hadoop, Enhance Hadoop, Ecosystem]
  • 22. MAKING HADOOP ENTERPRISE-READY: WHAT’S NEXT Hadoop is far from “done”: • Current implementation is showing its age • Need to address several deficiencies in scalability, flexibility, ease of use & performance. Yahoo! is working on the Next Generation of Hadoop: • MapReduce: Rewrite to improve performance; pluggable support for new programming models • HDFS: Adding volumes to improve scalability; Flush & sync support for applications that log to HDFS. Apache should remain the hub of the Hadoop ecosystem: • Yahoo! contributes all Hadoop changes back to Apache Hadoop • Everyone benefits from a shared neutral foundation
  • 23. Questions?