The Apache Way Done Right
The Success of Hadoop


Eric Baldeschwieler
CEO Hortonworks
Twitter: @jeric14
What is this talk really about?
• What Hadoop is, and what I’ve learned about Apache
  projects from organizing a team of Apache Hadoop
  committers for six years…

Sub-topics:
• Apache Hadoop Primer
• Hadoop and The Apache Way
• Where Do We Go From Here?




     Architecting the Future of Big Data              Page 2
What is Apache Hadoop?
•  Set of open source projects

•  Transforms commodity hardware
   into a service that:
    –  Stores petabytes of data reliably
    –  Allows huge distributed computations

•  Solution for big data
    –  Deals with complexities of high
       volume, velocity & variety of data

•  Key attributes:
    –  Redundant and reliable (no data loss)
    –  Easy to program
    –  Extremely powerful
    –  Batch processing centric
    –  Runs on commodity hardware

One of the best examples of open source driving innovation and creating a market



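The “easy to program” claim rests on the simplicity of the MapReduce model: the programmer writes only a map function and a reduce function. Below is a minimal word-count sketch in plain Python, illustrative only (real Hadoop jobs are written against the Java MapReduce API; the shuffle here stands in for what the framework does over the network):

```python
from collections import defaultdict

def map_phase(documents):
    """Mapper: emit a (word, 1) pair for every word in every input record."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big cluster", "big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

On a real cluster the framework runs many mapper and reducer instances in parallel across machines and handles failures, which is how the same two small functions scale to petabytes.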
Apache Hadoop projects

Core Apache Hadoop:
•  HDFS (Hadoop Distributed File System)
•  MapReduce (Distributed Programming Framework)

Related Hadoop Projects:
•  Pig (Data Flow)
•  Hive (SQL)
•  HBase (Columnar NoSQL Store)
•  HCatalog (Table & Schema Management)
•  Zookeeper (Coordination)
•  Ambari (Management)
Hadoop origins
2002 - 2004   •  Nutch project (web-scale, crawler-based search engine)
              •  Distributed, by necessity
              •  Ran on 4 nodes

2004 - 2006   •  DFS & MapReduce implementation added to Nutch
              •  Ran on 20 nodes

2006 - 2008   •  Yahoo! Search team commits to scaling Hadoop for big data
              •  Doug Cutting mentors team on Apache/Open process
              •  Hadoop becomes top-level Apache project
              •  Attained web-scale 2000 nodes, acceptable performance

2008 - 2009   •  Adoption by other internet companies
              •  Facebook, Twitter, LinkedIn, etc.

2010 - Today  •  Further scale improvements, now 4000 nodes, faster
              •  Service providers enter market
              •  Hortonworks, Amazon, Cloudera, IBM, EMC, …
              •  Growing enterprise adoption
Early adopters and uses


advertising optimization, analyzing web logs, data analytics, machine learning,
text mining, web search, mail anti-spam, content optimization, customer trend
analysis, ad selection, video & audio processing, data mining, user interest
prediction, social media




Hadoop is Mainstream Today




Big Data Platforms
Cost per TB, Adoption

[Chart: big data platforms plotted by cost per TB vs. adoption; size of bubble = cost effectiveness of solution]
HADOOP @ YAHOO!
               Some early use cases




© Yahoo 2011                          Page 9
CASE STUDY
  YAHOO! WEBMAP (2008)

   •  What is a WebMap?
       –  Gigantic table of information about every web site, page and link Yahoo! knows about
       –  Directed graph of the web
       –  Various aggregated views (sites, domains, etc.)
       –  Various algorithms for ranking, duplicate detection, region classification, spam detection, etc.
   •  Why was it ported to Hadoop?
          –  Custom C++ solution was not scaling
          –  Leverage scalability, load balancing and resilience of
             Hadoop infrastructure
          –  Focus Search guys on application, not infrastructure
CASE STUDY
 WEBMAP PROJECT RESULTS

   •  33% time savings over previous system on the same cluster
      (and Hadoop keeps getting better)

   •  Was largest Hadoop application, drove scale
       –  Over 10,000 cores in system
       –  100,000+ maps, ~10,000 reduces
       –  ~70 hours runtime
       –  ~300 TB shuffling
       –  ~200 TB compressed output

  •  Moving data to Hadoop increased number of
     groups using the data

CASE STUDY
   YAHOO SEARCH ASSIST™
	
  
	
  
	
  
      •  Database	
  for	
  Search	
  Assist™	
  is	
  built	
  using	
  Apache	
  Hadoop	
  
	
   •  Several	
  years	
  of	
  log-­‐data	
  
      •  20-­‐steps	
  of	
  MapReduce	
   	
  	
  
	
  twice	
  the	
  engagement	
  
             "                           Before Hadoop                 After Hadoop

            Time                         26 days                       20 minutes

            Language                     C++                           Python

            Development Time             2-3 weeks                     2-3 days


HADOOP @ YAHOO!
               TODAY
                        40K+ Servers
                        170 PB Storage
                        5M+ Monthly Jobs
                        1000+ Active users




CASE STUDY
  YAHOO! HOMEPAGE

  •  Personalized for each visitor

  •  Result: twice the engagement

     Recommended links:  +79% clicks vs. randomly selected
     News Interests:     +160% clicks vs. one size fits all
     Top Searches:       +43% clicks vs. editor selected

CASE STUDY
YAHOO! HOMEPAGE

•  Serving maps (users – interests)
•  Five minute production cycle
•  Weekly categorization models

SCIENCE HADOOP CLUSTER  »  machine learning builds ever better categorization
models from user behavior

PRODUCTION HADOOP CLUSTER  »  identifies user interests using the categorization
models (weekly), producing serving maps every 5 minutes

SERVING SYSTEMS  »  ENGAGED USERS, whose behavior feeds back into the clusters

Build customized home pages with latest data (thousands / second)
CASE STUDY
YAHOO! MAIL

Enabling quick response in the spam arms race

•  450M mail boxes
•  5B+ deliveries/day
•  Antispam models retrained every few hours on Hadoop

“40% less spam than Hotmail and 55% less spam than Gmail”

Traditional Enterprise Architecture
  Data Silos + ETL

Serving Applications (Web Serving, NoSQL, RDBMS, …)
connected by traditional ETL & message buses to
Traditional Data Warehouses, BI & Analytics (EDW, Data Marts, BI / Analytics)

Unstructured Systems (Serving Logs, Social Media, Sensor Data, Text Systems, …)
remain siloed outside this flow
Hadoop Enterprise Architecture
  Connecting All of Your Big Data

Serving Applications (Web Serving, NoSQL, RDBMS, …)
connected by traditional ETL & message buses to
Traditional Data Warehouses, BI & Analytics (EDW, Data Marts, BI / Analytics)

Apache Hadoop sits in the middle:
•  EsTsL (s = Store)
•  Custom Analytics

Unstructured Systems (Serving Logs, Social Media, Sensor Data, Text Systems, …)
now feed into Hadoop
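The slide’s “EsTsL (s = Store)” pattern means storing raw data at every step and transforming it later, rather than transforming before loading as classic ETL does. A minimal sketch in plain Python (a hypothetical illustration; the dict stands in for HDFS, and the record format is invented):

```python
import json

# Stand-in for HDFS: cheap, scalable storage that retains every dataset
store = {}

# Extract + Store: land the raw feed untouched, so it can always be re-processed
raw_records = ['{"user": "a", "clicks": 3}',
               '{"user": "b", "clicks": 5}',
               'corrupt-line']
store["raw/clicks"] = list(raw_records)

# Transform + Store: parse what parses today; bad records are skipped here
# but still survive in raw/, unlike a pipeline that transforms before storing
parsed = []
for line in store["raw/clicks"]:
    try:
        parsed.append(json.loads(line))
    except json.JSONDecodeError:
        pass
store["clean/clicks"] = parsed

# Load: only the small aggregate moves on to the warehouse / BI tier
total_clicks = sum(r["clicks"] for r in store["clean/clicks"])
print(total_clicks)  # 8
```

Because the raw copy is retained, a corrected parser can later be re-run over the full history, which is the property this architecture adds over transform-then-load ETL.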
Hadoop Enterprise Architecture
   Connecting All of Your Big Data

Apache Hadoop (EsTsL (s = Store), custom analytics) connects serving
applications (Web Serving, NoSQL, RDBMS, …), traditional data warehouses,
BI & analytics (EDW, Data Marts, BI / Analytics), and unstructured systems
(Serving Logs, Social Media, Sensor Data, Text Systems, …)

Gartner predicts 800% data growth over next 5 years

80-90% of data produced today is unstructured
Hadoop and The Apache Way




Yahoo! & Apache Hadoop
• Yahoo! committed to scaling Hadoop from prototype to
  web-scale big data solution in 2006

• Why would a corporation donate 300 person-years of
  software development to the Apache foundation?
  – Clear market need for a web-scale big data system
  – Belief that someone would build an open source solution
  – Proprietary software inevitably becomes expensive legacy software
  – Key competitors committed to building proprietary systems
  – Desire to attract top science & systems talent by demonstrating that
    Yahoo! was a center of big data excellence
  – Belief that a community and ecosystem could drive a big data platform
    faster than any individual company
  – A belief that The Apache Way would produce better, longer lived,
    more widely used code
The bet is paying off!
•  Hadoop is hugely successful!
  – Hadoop is perceived as the next data architecture for enterprise

•  Project has large, diverse and growing committer base
  – If Yahoo were to stop contributing, Hadoop would keep improving


•  Hadoop has become very fast, stable and scalable

•  Ecosystem building more than any one company could
  – Addition of new Apache components (Hive, HBase, Mahout, etc.)
  – Hardware, cloud and software companies are now contributing

•  I guess the Apache Way works…

But, success brings challenges
•  Huge visibility and big financial upside!
  –  Harsh project politics
  –  Vendors spreading FUD & negativity
•  The PMC challenged for control of Hadoop
  –  Calls for Standards body outside Apache
  –  Abuse of the Apache Hadoop brand guidelines
•  Increasing size and complexity of code base
  –  Very long release cycles & unstable trunk
  –  0.20 to 0.23 is taking ~3 years
•  New users finding Hadoop too hard to use
  –  It takes skilled people to manage and use
  –  There are not enough such people
What is the Apache Way?
•  What is Apache about? – From the Apache FAQ
  –  Transparency, consensus, non-affiliation, respect for fellow
     developers, and meritocracy, in no specific order.

•  What is Apache not about? – From the Apache FAQ
  –  To flame someone to shreds, to make code decisions on
     IRC, to demand someone else to fix your bugs.

•  The Apache Way is primarily about Community,
   Merit, and Openness, backed up by Pragmatism and
   Charity. - Shane Curcuru

•  Apache believes in Community over Code.
   (I hear this a lot)

Boiling it down a bit

•  Community over Code - Transparency,
   Openness, Mutual Respect

•  Meritocracy, Consensus, Non-affiliation

•  Pragmatism & Charity




Hadoop & the Apache Way, forward
•  Community over Code - Transparency, Openness,
   Mutual Respect
   –  Aggressive optimism & a no enemies policy pays dividends
   –  Leaders must publish designs, plans, roadmaps
   –  It’s ok if people meet and then share proposals on the list!
•  Meritocracy, Consensus, Non-affiliation
   –    Influence the community by contributing great new work!
   –    Acknowledge domain experts in various project components
   –    Fight! Your vote only counts if you speak up.
   –    Rejecting contributions is ok.
          •  Assigning work via whining is not!
•  Pragmatism & Charity
   –  Hadoop is big business, companies are here to stay, use them
   –  Mentor new contributors!
   –  Make Hadoop easier to use!
Where Do We Go From Here?
Vision:
Half of the world’s data will be processed
   by Apache Hadoop within 5 years

Ubiquity is the Goal!


How do we achieve ubiquity?...
• Integrate with existing data architectures
  – Extend Hadoop project APIs to make it easy to
    integrate and specialize Hadoop
  – Create an ecosystem of ISVs and OEMs


• Make Apache Hadoop easy to use
  – Fix user challenges, package working binaries
  – Improve and extend Hadoop documentation
  – Build training, support & pro-serve ecosystem



Build a strong partner ecosystem!


•    Unify the community around a strong Apache Hadoop offering

•    Make Apache Hadoop easier to integrate & extend
      –  Work closely with partners to define and build open APIs
      –  Everything contributed back to Apache

•    Provide enablement services as necessary to optimize integration

Partner categories: Hadoop Application Partners; Integration & Services
Partners; Serving & Unstructured Data Systems Partners; DW, Analytics & BI
Partners; Hardware Partners; Cloud & Hosting Platform Partners

To change the world… Ship code!
• Be aggressive - Ship early and often
  – Project needs to keep innovating and visibly improve
  – Aim for big improvements
  – Make early buggy releases


• Be predictable - Ship late too
  – We need to do regular sustaining engineering releases
  – We need to ship stable, working releases
  – Make packaged binary releases available



Hadoop: Now, Next, and Beyond
Roadmap Focus: Make Hadoop an Open, Extensible, and Enterprise Viable
Platform, and Enable More Applications to Run on Apache Hadoop

“Hadoop.Now” (hadoop 0.20.205)
•  Most stable version ever
•  RPMs and DEBs
•  HBase & security

“Hadoop.Next” (hadoop 0.23)
•  Extensible architecture
•  MapReduce re-write
•  Enterprise robustness
•  Extended APIs

“Hadoop.Beyond”
•  Stay tuned!!




Hortonworks @ ApacheCon
• Hadoop Meetup Tonight @ 8pm
  – Roadmap for Hadoop 0.20.205 and 0.23
  – Current status (suggestions, issues) of Hadoop integration with
    other projects


• Owen O’Malley Presentation, Tomorrow @ 9am
  – “State of the Elephant: Hadoop Yesterday, Today and Tomorrow”
  – Salon B


• Visit Hortonworks in the Expo to learn more




Thank You
Eric Baldeschwieler
Twitter: @jeric14




Extra links
•  WWW.Hortonworks.com

•  http://developer.yahoo.com/blogs/hadoop/posts/2011/01/the-backstory-
   of-yahoo-and-hadoop/

•  http://www.slideshare.net/jaaronfarr/the-apache-way-presentation

•  http://incubator.apache.org/learn/rules-for-revolutionaries.html





Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

Recently uploaded

Recently uploaded (20)

Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty Secure
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoft
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentation
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
Using IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandUsing IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & Ireland
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 

Keynote from ApacheCon NA 2011

  • 1. The Apache Way Done Right The Success of Hadoop Eric Baldeschwieler CEO Hortonworks Twitter: @jeric14
  • 2. What is this talk really about? • What is Hadoop and what I’ve learned about Apache projects by organizing a team of Apache Hadoop committers for six years… Sub-topics: • Apache Hadoop Primer • Hadoop and The Apache Way • Where Do We Go From Here? Architecting the Future of Big Data Page 2
  • 3. What is Apache Hadoop? • Set of open source projects • Transforms commodity hardware into a service that: – Stores petabytes of data reliably – Allows huge distributed computations • Solution for big data – Deals with complexities of high volume, velocity & variety of data • Key attributes: – Redundant and reliable (no data loss) – Easy to program – Extremely powerful – Batch processing centric – Runs on commodity hardware Callout: "One of the best examples of open source driving innovation and creating a market" Architecting the Future of Big Data Page 3
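The model behind these attributes, MapReduce over a distributed file system, can be sketched with a toy single-process simulation. This is not from the deck; real Hadoop distributes each phase across machines and persists intermediate data, but the map → shuffle → reduce flow is the same:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (word, 1) for every word in every input record."""
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data on commodity hardware", "big data at scale"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

In real Hadoop the same three roles are played by Mapper and Reducer classes (or streaming scripts), with HDFS holding the petabyte-scale input and output.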
  • 4. Apache Hadoop projects Core Apache Hadoop: HDFS (Hadoop Distributed File System), MapReduce (Distributed Programming Framework), HCatalog (Table & Schema Management). Related Hadoop Projects: Pig (Data Flow), Hive (SQL), HBase (Columnar NoSQL Store), Zookeeper (Coordination), Ambari (Management). Architecting the Future of Big Data Page 4
  • 5. Hadoop origins •  Nutch project (web-scale, crawler-based search engine) 2002 - 2004 •  Distributed, by necessity •  Ran on 4 nodes •  DFS & MapReduce implementation added to Nutch 2004 - 2006 •  Ran on 20 nodes •  Yahoo! Search team commits to scaling Hadoop for big data 2006 - 2008 •  Doug Cutting mentors team on Apache/Open process •  Hadoop becomes top-level Apache project •  Attained web-scale 2000 nodes, acceptable performance •  Adoption by other internet companies 2008 - 2009 •  Facebook, Twitter, LinkedIn, etc. •  Further scale improvements, now 4000 nodes, faster 2010 - Today •  Service providers enter market •  Hortonworks, Amazon, Cloudera, IBM, EMC, … •  Growing enterprise adoption Architecting the Future of Big Data Page 5
  • 6. Early adopters and uses: analyzing web logs, analytics, advertising optimization, machine learning, text mining, web search, mail anti-spam, content optimization, customer trend analysis, ad selection, video & audio processing, data mining, user interest prediction, social media Architecting the Future of Big Data Page 6
  • 7. Hadoop is Mainstream Today Page 7
  • 8. Big Data Platforms Cost per TB, Adoption Size of bubble = cost effectiveness of solution Source: Page 8
  • 9. HADOOP @ YAHOO! Some early use cases © Yahoo 2011 Page 9
  • 10. CASE STUDY YAHOO! WEBMAP (2008) • What is a WebMap? – Gigantic table of information about every web site, page and link Yahoo! knows about – Directed graph of the web – Various aggregated views (sites, domains, etc.) – Various algorithms for ranking, duplicate detection, region classification, spam detection, etc. • Why was it ported to Hadoop? – Custom C++ solution was not scaling – Leverage scalability, load balancing and resilience of Hadoop infrastructure – Focus Search guys on application, not infrastructure © Yahoo 2011 Page 10
  • 11. CASE STUDY WEBMAP PROJECT RESULTS • 33% time savings over previous system on the same cluster (and Hadoop keeps getting better) • Was largest Hadoop application, drove scale – Over 10,000 cores in system – 100,000+ maps, ~10,000 reduces – ~70 hours runtime – ~300 TB shuffling – ~200 TB compressed output • Moving data to Hadoop increased number of groups using the data © Yahoo 2011 Page 11
  • 12. CASE STUDY YAHOO SEARCH ASSIST™ • Database for Search Assist™ is built using Apache Hadoop • Several years of log-data • 20 steps of MapReduce | Before Hadoop → After Hadoop: Time 26 days → 20 minutes; Language C++ → Python; Development Time 2-3 weeks → 2-3 days © Yahoo 2011 Page 12
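The slide credits the speed-up partly to moving from C++ to Python: with Hadoop Streaming, each of the 20 MapReduce steps can be an ordinary script that reads stdin and writes tab-separated key/value lines. A toy sketch of one such step, counting query frequencies, simulated locally (the log format and field layout here are invented for illustration, not taken from the deck):

```python
def mapper(lines):
    """Streaming-style map step: emit (query, 1) for each log line
    of the assumed form 'timestamp<TAB>query'."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            yield parts[1], 1

def reducer(pairs):
    """Streaming reducers receive pairs sorted by key;
    sum the counts for each run of identical keys."""
    last, total = None, 0
    for key, value in pairs:
        if key != last and last is not None:
            yield last, total
            total = 0
        last = key
        total += value
    if last is not None:
        yield last, total

log = ["09:01\thadoop", "09:02\thadoop", "09:03\tpig"]
query_counts = dict(reducer(sorted(mapper(log))))
```

On a real cluster the framework, not `sorted()`, performs the shuffle-sort between the two scripts, and the same pair of programs runs unchanged over years of log data.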
  • 13. HADOOP @ YAHOO! TODAY 40K+ Servers 170 PB Storage 5M+ Monthly Jobs 1000+ Active users © Yahoo 2011 Page 13
  • 14. CASE STUDY YAHOO! HOMEPAGE Personalized for each visitor. Result: twice the engagement – Recommended links: +79% clicks vs. randomly selected – News Interests: +160% clicks vs. one size fits all – Top Searches: +43% clicks vs. editor selected © Yahoo 2011 Page 14
  • 15. CASE STUDY YAHOO! HOMEPAGE • SCIENCE Hadoop cluster: machine learning over user behavior to build ever better categorization models (weekly production of categorization models) • PRODUCTION Hadoop cluster: identify user interests using categorization models (every 5 minutes), producing serving maps of users to interests • Serving systems: build customized home pages with latest data (thousands / second), feeding user behavior back and driving engaged users © Yahoo 2011 Page 15
  • 16. CASE STUDY YAHOO! MAIL Enabling quick response in the spam arms race • 450M mail boxes • 5B+ deliveries/day • Antispam models retrained every few hours on Hadoop “40% less spam than Hotmail and 55% less spam than Gmail” © Yahoo 2011 Page 16
  • 17. Traditional Enterprise Architecture Data Silos + ETL: data sources (Web Data, Logs, Social Media, Sensor Data, Text Systems, Unstructured Systems, serving systems …) flow through traditional ETL & message buses into traditional data warehouses, serving applications, and BI & analytics (EDW, Data Marts, RDBMS, BI / Analytics, NoSQL serving) Architecting the Future of Big Data Page 17
  • 18. Hadoop Enterprise Architecture Connecting All of Your Big Data: the same sources (Web Data, Logs, Social Media, Sensor Data, Text Systems, Unstructured Systems …) and destinations (EDW, Data Marts, RDBMS, BI / Analytics, NoSQL serving), now with Apache Hadoop in the middle as EsTsL (s = Store) plus custom analytics Architecting the Future of Big Data Page 18
  • 19. Hadoop Enterprise Architecture Connecting All of Your Big Data: same architecture as the previous slide, annotated with Gartner predictions of 800% data growth over the next 5 years, with 80-90% of data produced today unstructured Architecting the Future of Big Data Page 19
  • 20. Hadoop and The Apache Way Architecting the Future of Big Data Page 20
  • 21. Yahoo! & Apache Hadoop • Yahoo! committed to scaling Hadoop from prototype to web-scale big data solution in 2006 • Why would a corporation donate 300 person-years of software development to the Apache foundation? – Clear market need for a web-scale big data system – Belief that someone would build an open source solution – Proprietary software inevitably becomes expensive legacy software – Key competitors committed to building proprietary systems – Desire to attract top science & systems talent by demonstrating that Yahoo! was a center of big data excellence – Belief that a community and ecosystem could drive a big data platform faster than any individual company – A belief that The Apache Way would produce better, longer lived, more widely used code Architecting the Future of Big Data Page 21
  • 22. The bet is paying off! • Hadoop is hugely successful! – Hadoop is perceived as the next data architecture for the enterprise • Project has large, diverse and growing committer base – If Yahoo were to stop contributing, Hadoop would keep improving • Hadoop has become very fast, stable and scalable • Ecosystem building more than any one company could – Addition of new Apache components (Hive, HBase, Mahout, etc.) – Hardware, cloud and software companies now contributing • I guess the Apache Way works… Architecting the Future of Big Data Page 22
  • 23. But, success brings challenges •  Huge visibility and big financial upside! –  Harsh project politics –  Vendors spreading FUD & negativity •  The PMC challenged for control of Hadoop –  Calls for Standards body outside Apache –  Abuse of the Apache Hadoop brand guidelines •  Increasing size and complexity of code base –  Very long release cycles & unstable trunk –  0.20 to 0.23 is taking ~3 years •  New users finding Hadoop too hard to use –  It takes skilled people to manage and use –  There are not enough such people Architecting the Future of Big Data Page 23
  • 24. What is the Apache Way? • What is Apache about? – From the Apache FAQ – Transparency, consensus, non-affiliation, respect for fellow developers, and meritocracy, in no specific order. • What is Apache not about? – From the Apache FAQ – To flame someone to shreds, to make code decisions on IRC, to demand someone else to fix your bugs. • The Apache Way is primarily about Community, Merit, and Openness, backed up by Pragmatism and Charity. - Shane Curcuru • Apache believes in Community over Code. (I hear this a lot) Architecting the Future of Big Data Page 24
  • 25. Boiling it down a bit •  Community over Code - Transparency, Openness, Mutual Respect •  Meritocracy, Consensus, Non-affiliation •  Pragmatism & Charity Architecting the Future of Big Data Page 25
  • 26. Hadoop & the Apache Way, forward •  Community over Code - Transparency, Openness, Mutual Respect –  Aggressive optimism & a no enemies policy pays dividends –  Leaders must publish designs, plans, roadmaps –  It’s ok if people meet and then share proposals on the list! •  Meritocracy, Consensus, Non-affiliation –  Influence the community by contributing great new work! –  Acknowledge domain experts in various project components –  Fight! Your vote only counts if you speak up. –  Rejecting contributions is ok. •  Assigning work via whining is not! •  Pragmatism & Charity –  Hadoop is big business, companies are here to stay, use them –  Mentor new contributors! –  Make Hadoop easier to use! Architecting the Future of Big Data Page 26
  • 27. Where Do We Go From Here? Vision: Half of the world’s data will be processed by Apache Hadoop within 5 years Ubiquity is the Goal! Architecting the Future of Big Data Page 27
  • 28. How do we achieve ubiquity?... • Integrate with existing data architectures – Extend Hadoop project APIs to make it easy to integrate and specialize Hadoop – Create an ecosystem of ISVs and OEMs • Make Apache Hadoop easy to use – Fix user challenges, package working binaries – Improve and extend Hadoop documentation – Build training, support & pro-serve ecosystem Architecting the Future of Big Data Page 28
  • 29. Build a strong partner ecosystem! • Unify the community around a strong Apache Hadoop offering • Make Apache Hadoop easier to integrate & extend – Work closely with partners to define and build open APIs – Everything contributed back to Apache • Provide enablement services as necessary to optimize integration (Partner categories: Application; Integration & Services; DW, Analytics & BI; Serving & Unstructured Data Systems; Hardware; Cloud & Hosting Platform) Architecting the Future of Big Data Page 29
  • 30. To change the world… Ship code! • Be aggressive - Ship early and often – Project needs to keep innovating and visibly improve – Aim for big improvements – Make early buggy releases • Be predictable - Ship late too – We need to do regular sustaining engineering releases – We need to ship stable, working releases – Make packaged binary releases available Architecting the Future of Big Data Page 30
  • 31. Hadoop: Now, Next, and Beyond Roadmap Focus: Make Hadoop an Open, Extensible, and Enterprise Viable Platform, and Enable More Applications to Run on Apache Hadoop • “Hadoop.Now” (hadoop 0.20.205): most stable version ever, RPMs and DEBs, HBase & security • “Hadoop.Next” (hadoop 0.23): extensible architecture, MapReduce re-write, enterprise robustness, extended APIs • “Hadoop.Beyond”: stay tuned!! Architecting the Future of Big Data Page 31
  • 32. Hortonworks @ ApacheCon • Hadoop Meetup Tonight @ 8pm – Roadmap for Hadoop 0.20.205 and 0.23 – Current status (suggestions, issues) of Hadoop integration with other projects • Owen O’Malley Presentation, Tomorrow @ 9am – “State of the Elephant: Hadoop Yesterday, Today and Tomorrow” – Salon B • Visit Hortonworks in the Expo to learn more Architecting the Future of Big Data Page 32
  • 33. Thank You Eric Baldeschwieler Twitter: @jeric14 Architecting the Future of Big Data Page 33
  • 34. Extra links • WWW.Hortonworks.com • http://developer.yahoo.com/blogs/hadoop/posts/2011/01/the-backstory-of-yahoo-and-hadoop/ • http://www.slideshare.net/jaaronfarr/the-apache-way-presentation • http://incubator.apache.org/learn/rules-for-revolutionaries.html Architecting the Future of Big Data Page 34