MODERNIZING YOUR IT
    INFRASTRUCTURE WITH HADOOP
    Merv Adrian, VP Research Gartner
    Charles Zedlewski, VP Product Cloudera




"Big Data" Crystallizes Extreme
Information Management Challenges

"Big Data" refers to high volume, velocity and variety information assets
that demand cost-effective, innovative forms of information processing for
enhanced insight and decision-making. They require:

Extreme information management - the concept that your current information
infrastructure must be intentionally managed along 12 complementary
dimensions to meet the challenges of the 21st century Information Age.

[Diagram of the 12 dimensions: Perishability, Fidelity, Validation, Linking,
Classification, Contracts, Technology, Pervasive Use, Velocity, Volume,
Variety, Complexity]
Where Does Big Data Come From?

[Diagram of data sources, with the labels: Enterprise "Dark Data";
Partner, Employee, Customer, Supplier; Public; Commercial; Email;
Transactions; Contracts; Observations; Credit; Weather; Sensors;
Population; Social Media; Economic; Network; Sentiment]

Correlations and patterns from disparate, linked data sources
yield the greatest insights and transformative opportunities
The Big Data Challenge: Putting Together the
Pieces Quickly and Efficiently

Through 2015:

• 85% of Fortune 500 organizations will be unable to exploit big
  data for competitive advantage.
• Business analytics needs will drive 70% of investments in the
  expansion and modernization of information infrastructure.

[Diagram labels: INERTIA, INFRASTRUCTURE, LEADERSHIP, RISKS, SKILLS,
ANALYTICS, INVESTMENT, ORGANIZATION, ARCHITECTURE]
Interest in "Big Data" Is Rising Rapidly,
Though "Hadoop" Remains Steady

[Chart: client searches for "Big Data" and "Hadoop" on gartner.com over 12 months ending June 12. Series: Hadoop, Big Data. Vertical axis: 0 to 500.]
Hadoop's A Good Idea, But Confusion
Has Slowed Commercial Adoption

• IT organizations are confused, but want to make
  a decision
• The standards steward — Apache Software
  Foundation — distributes some, but not all
  "projects" in Hadoop distributions
• Many distributions exist, and they differ in their
  components and their "openness"
• Choosing the right distributions should be driven
  by business analytics needs
Maturity is Growing and Will Spur
Adoption As Marketing Ramps Up

• Wave of new releases, beginning with Apache Hadoop
  1.0 and 2.0 (alpha), has crystallized the platform
• Current versions are beginning to address early
  problems with availability, security, performance
• Distribution vendors add features and support
  enterprises require
• Rapidly evolving ecosystem includes
  differentiated distributions, data
  integration, business intelligence, and services
  vendors

Recommendations
• Don't surrender to hype-ocracy. This is early, maturing
  technology.
• Deploy when you have a clear use case, not "to play with
  the technology"
• Use a commercially supported distribution when you
  move to production; till then, experiment "in the cloud" or
  on existing hardware with "free" downloads
• Choose the distribution based on business need —
  leverage specific, supported projects
• Skills will be key: define "data science" needs and
  hire/train for them
• Manage your big data – don't abdicate. Practice Extreme
  Information Management.
CLOUDERA: THE STANDARD FOR
APACHE HADOOP IN THE ENTERPRISE
CHARLES ZEDLEWSKI, VP PRODUCT
1   HIGH AVAILABILITY
    THERE'S NO DOWNTIME. YOUR DATA IS
    ALWAYS AVAILABLE FOR DECISIONS

2   GRANULAR SECURITY
    PROCESS AND CONTROL SENSITIVE
    DATA WITH CONFIDENCE

3   ROBUST MANAGEMENT
    ACHIEVE OPTIMAL PERFORMANCE VIA
    CENTRALIZED ADMINISTRATION

4   SCALABLE AND EXTENSIBLE
    ADAPTS TO YOUR WORKLOAD AND
    GROWS WITH THE BUSINESS

5   CERTIFIED AND COMPATIBLE
    EXTEND AND LEVERAGE EXISTING
    INFRASTRUCTURE INVESTMENTS

6   GLOBAL SUPPORT AND SERVICES
    ACHIEVE SLAs AND ADHERE TO
    EXISTING IT POLICIES

1 ADVANCED ANALYTICS                                 2 DATA PROCESSING
INDUSTRY TERM                      VERTICAL          INDUSTRY TERM
Social Network Analysis            Web               Clickstream Sessionization
Content Optimization               Media             Engagement
Network Analytics                  Telco             Mediation
Loyalty & Promotions Analysis      Retail            Data Factory
Fraud Analysis                     Financial         Trade Reconciliation
Entity Analysis                    Federal           SIGINT
Sequencing Analysis                Bioinformatics    Genome Mapping

CDH4
       Cloudera’s Distribution Including Apache Hadoop (CDH)
       Big Data storage, processing and analytics platform based on
       Apache Hadoop – 100% open source
       STORAGE, COMPUTATION, ACCESS, INTEGRATION

Cloudera Enterprise 4.0
       Cloudera Manager
       End-to-end management application for the deployment
       and operation of CDH
       DEPLOYMENT, CONFIGURATION, MONITORING, DIAGNOSTICS & REPORTING

       Production Support
       Our dedicated team of experts on call to help you
       meet your Service Level Agreements (SLAs)
       ISSUE RESOLUTION, ESCALATION PROCESSES, OPTIMIZATION, KNOWLEDGE BASE

Partner Ecosystem
       250+ partners across hardware, software, platforms and services

Cloudera University
       Equipping the Big Data workforce – 12,000+ trained

Professional Services
       Use case discovery, pilots, process & team development

All the industry leaders integrate with CDH.

CDH4
       Big Data storage, processing and analytics platform based
       on Apache Hadoop – 100% open source
       STORAGE, COMPUTATION, ACCESS, INTEGRATION

[Partner logos grouped by category: BI / Analytics, Data Integration, Database, OS / Cloud / Sys Mgmt, Hardware]

END-TO-END MANAGEMENT
APPLICATION FOR APACHE HADOOP


1    DEPLOY
     INSTALL, CONFIGURE AND START YOUR
     CLUSTER IN 3 SIMPLE STEPS




2    CONFIGURE & MANAGE
     ENSURE OPTIMAL SETTINGS FOR ALL HOSTS
     AND SERVICES




3    MONITOR, DIAGNOSE & REPORT
     FIND AND FIX PROBLEMS QUICKLY, VIEW CURRENT
     AND HISTORICAL ACTIVITY AND RESOURCE USAGE



REQUIRED SKILLS
      LINUX ADMIN OR DBA BACKGROUND
      JAVA KNOWLEDGE
      NETWORKING KNOWLEDGE




     RESPONSIBILITIES
        KEEP IMPORTANT WORKLOADS WITHIN SLA
        INSTALL, CONFIGURE AND UPGRADE HADOOP
        MONITOR SYSTEM HEALTH & PERFORMANCE
        PLAN FOR THE FUTURE



CLOUDERA MANAGER DEMO




Q&A

THANK YOU!

REGISTER NOW FOR THE REMAINING
'POWER OF HADOOP' WEBINARS:

         REALIZING THE PROMISE OF BIG DATA WITH HADOOP
         FORRESTER AND CLOUDERA
         THURSDAY, JULY 26, 11AM PST

         WHAT THE HADOOP: WHY YOUR BUSINESS CAN’T
         AFFORD TO IGNORE THE POWER OF HADOOP
         GIGAOM AND CLOUDERA
         WEDNESDAY, AUGUST 29, 10AM PST

         THE BUSINESS ADVANTAGE OF HADOOP:
         LESSONS FROM THE FIELD
         451 RESEARCH AND CLOUDERA
         WEDNESDAY, SEPTEMBER 26, 10AM PST


Modernizing Your IT Infrastructure with Hadoop - Cloudera Summer Webinar Series: Gartner


Editor's Notes

  • #2 The 21st century CIO will effectively have three major information management issues coming together in an almost coordinated, simultaneous "perfect storm." "Big data" is a first taste of the extreme information challenges that will become increasingly difficult to address. The concept is one of data volumes so large that, even as storage technology and network infrastructures increase in capacity, the data simply grows at a faster rate. In addition to volume, information assets will exhibit irregular rates of change, governance rules will become increasingly fluid, new asset types will continue to emerge, and the desire of analysts at all levels to use that data will increase exponentially. All of this in combination begins to define the extreme information management environment, and any one of 12 different dimensions can overwhelm your existing systems, let alone two or more of them in combination. The combination of consumerization and mobility has only begun to expose the many different types of information assets that will be introduced, as well as the many information "create" or "write" situations that will occur. This is only one obvious example of the proliferation of information sources that will highlight the importance of understanding how information assets are linked together and how reliable they inherently are. Comprehensive, complete and deliberate architectures are the only hope for creating standards or guidance to deal with the massive "news to noise" ratios that are pending, the wide diversity of information assets and the resulting use cases. Action Item: Determine whether the organization will develop in-house architectural expertise to lead the enterprise strategy through hiring and training, or whether it will outsource this expertise.
  • #3 By far, Hadoop is the technology getting the most attention in the market for analytic use cases involving big data at rest. Its early usage in Web 2.0 companies was largely designed for batch processing of large volumes of data, data mining-style, so it's not surprising that this should be so. And while there is more to Hadoop than this, client inquiries fielded by the information management team reinforce our position that such analytic use cases are the most typical catalyst for Hadoop consideration.
  • #4 While its acquisition costs are lower, using open-source Apache Hadoop entails bigger risks than traditional commercial database environments; it is less mature, is fragmented, and Apache lacks a commercial support organization. Other open-source offerings have a similar profile, but typical Hadoop implementations consist of more independent "moving parts" and version numbers than most. Enterprises can construct and maintain a Hadoop solution themselves if they have the time, resources and expertise to do so. Preferably, they will choose a commercial Apache Hadoop distribution whose component projects are pre-integrated and may be backed by the vendor's support. More mature distributions provide scripts to install all the pieces, often with a graphic script that allows the user to choose the pieces appropriate for their specific needs, hardware and network setup, etc. Numerous vendors offer Hadoop distributions with pre-integrated components, but the vendors have varying levels of credibility and experience, and offer different combinations of projects at different release stages. Data management leaders can easily go wrong. However, commercial distributions include different projects along with the core Hadoop projects, and no commercial distributions include or support all available projects. Distributions also have varying release levels of the included projects and update them at different rates. Thus, data management leaders run the risk of choosing a Hadoop solution that does not meet enterprise needs.
  • #5 Choose one of two options, depending on your needs and circumstances: build a custom stack, or use a distribution from a major Apache Hadoop provider. In the latter case, subscribing to support is recommended.
    ■ Choose your approach (custom or distribution) based on your team's skills, whether this is a tactical experiment or a strategic initiative, and on how well available distributions fit your use case.
    ■ Balance the enterprise's longer-term needs with immediate pressures to deliver, and consider projects that may be needed for future initiatives.
    ■ Evaluate the distribution vendor as a whole if you plan to run Apache Hadoop for the long term. Look at financial viability, support capabilities, partnerships and future technology plans. Above all, talk to reference customers; treat distributions as you would development tools or workbenches.