Taming the Elephant - Learn how
    Monsanto manages their Hadoop clusters
    to enable Genome/Sequence processing

          Erich Hochmuth          Bala Venkatrao
         Mark Seidenstricker      Aparna Ramani

•   Hadoop World 2012, New York, October 25th, 2012
Agenda
• Introductions
• Monsanto Hadoop Use Case
     • Operational Challenges
     • How Monsanto leverages Cloudera Manager & Product Demo
     • Key benefits of using Cloudera Manager
•   Cloudera Manager
     • Overview
     • Key Features
     • Roadmap
•   Q&A

2
Introductions
    • Monsanto
      • Erich Hochmuth – R&D IT Data & Analytics Lead
      • Mark Seidenstricker – Infrastructure R&D Architect


    • Cloudera
       • Bala Venkatrao – Director, Products
       • Aparna Ramani – Director, Engineering



3
Monsanto Serves Farmers Around the World
    Working With Growers Large and Small, Row Crops and Vegetables




4
Monsanto’s Approach to Driving Yield
    A System of Agriculture Working Together to Boost Productivity




                          BREEDING                   BIOTECHNOLOGY                AGRONOMICS




                   The art and science             The science of improving    The farm management
                   of combining genetic material   plants by inserting genes   practices involved in
                   to produce a new seed           into their DNA              growing plants

5
Increasing Yield through Big Data
    At the Cornerstone of Yield Increases is Information & Analytics
                                            Increased Yield




                      Variety                      Volume                          Velocity




         • Raw Sequence data              • PBs of NGS data              • 10’s millions yield dps/day
         • Unstructured sensor data       • 10’s TBs of genomic data     • 100’s million genotyping dps/day
         • Poly-structured genomic data   • TBs of yield data            • TBs of NGS data/week
         • Spatial data                   • Billions of genotyping dps

6
What are the Challenges of managing a Hadoop Cluster?
    Software Provisioning & Configuration Management
        •   Automated & simplified installation/patch management
        •   Streamlined cluster configuration

    Enterprise-ready Tools
        •   Enterprise-grade monitoring & management capabilities
        •   Integration with existing enterprise IT stack

    Reporting & Monitoring
        •   Proactive monitoring & alerting
        •   Capacity planning

    Support
        •   Midwest Location
        •   Lack of Hadoop expertise


7
What are the Solutions?
    With Cloudera Manager, you get…
    Intuitive Management Console
         •   Mission control style dashboard for entire cluster
         •   Centralized management of entire Hadoop ecosystem
         •   Treat the cluster as an appliance
         •   Configuration change audit & validation
    Integration with Enterprise IT Management Tools
         •   Connect to Corporate LDAP
         •   Cloudera Manager API integrates with existing BMC platform
    Comprehensive Monitoring & Alerting
         •   Proactive service-level alerts
         •   Summarized cluster-level graphs & charts
         •   Real-time series charts (MapReduce & HBase)
    Historical Cluster Metrics/Reports
         •   Capacity planning: disk usage / slot capacity
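
    The capacity-planning bullet can be illustrated with a small calculation. The sketch below is not Cloudera Manager code; it assumes you have exported historical disk-usage readings yourself, and the function name `days_until_full` and the sample figures are hypothetical. It fits a least-squares line through the readings and estimates how many days remain before usage reaches capacity.

```python
from datetime import date


def days_until_full(samples, capacity_tb):
    """Estimate days remaining until disk usage reaches capacity_tb.

    samples: list of (date, used_tb) pairs in chronological order
             (hypothetical readings exported from a monitoring tool).
    Returns None when usage is flat or shrinking, or when all
    readings fall on the same day.
    """
    first_day = samples[0][0]
    xs = [(day - first_day).days for day, _ in samples]
    ys = [used for _, used in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return None
    # Least-squares slope: growth in TB per day.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Day offset (from the first reading) at which the line hits capacity.
    full_day = (capacity_tb - intercept) / slope
    return max(0.0, full_day - xs[-1])


if __name__ == "__main__":
    readings = [
        (date(2012, 10, 1), 100.0),
        (date(2012, 10, 2), 102.0),
        (date(2012, 10, 3), 104.0),
    ]
    print(days_until_full(readings, capacity_tb=200.0))
```

    A straight-line fit is the simplest possible model; real usage often grows in bursts, so treat the result as a rough planning figure.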


8
What are the Benefits of Cloudera Manager?
    Lowers the barrier for Hadoop administration
        •   No need to rely solely on experts

        •   Reduces the number of administrators needed

    Provides a “one-stop” holistic view
        •   Easy to understand how the overall cluster is performing

    Includes pre-tuned configuration with best practices
        •   Get straight to solving the business problem

    Integrates with Cloudera support
        •   Leverage the real experts…not just for bugs
9
Cloudera Enterprise – The Platform for Big Data




10
Why You Need Cloudera Manager?
     Complexity
     Hadoop is more than a dozen services running across many machines
        • Hundreds of hardware components
        • Thousands of settings
        • Limitless permutations

     Context
     Hadoop is a system, not just a collection of parts
        • Everything is interrelated
        • Raw data about individual pieces is not enough
        • Must extract what’s important

     Efficiency
     Managing Hadoop with multiple tools & manual process takes longer
        • Complicated, error-prone workflows
        • Longer issue resolution
        • Lack of consistent & repeatable processes

11
Cloudera Manager
     End-to-End Administration for CDH




     1   Deploy
         Install, configure & start your cluster in 3
         simple steps



     2 Configure & Optimize
         Ensure optimal settings for all hosts & services




     3 Monitor, Diagnose & Report
         Find & fix problems quickly, view current &
         historical activity & resource usage



12
Managing Complexity
       One Tool For Everything
       Capabilities compared: Deployment & Configuration, Monitoring, Workflows, Events & Alerts, Log Search, Diagnostics, Reporting, Activity Monitoring

       [Figure: DO-IT-YOURSELF vs. CLOUDERA ENTERPRISE coverage of these capabilities]




      “In a recent Cloudera survey, >95% of respondents emphasized the importance of having a
                      single end-to-end tool to manage their Hadoop Operations”
 13
Raw Data vs. Hadoop Intelligence
     Providing Context




     1   Smart Configuration
         Auto-sets configurations & guards against user error

     2   Workflows
         Ensures that multi-step tasks are accomplished completely
         & in the correct sequence

     3   Dependencies
         Aware of how a particular action affects the rest of the
         cluster & manages the impact

     4   Events & Alerts
         Makes you aware of what’s important at a Hadoop system level

     5   History
         Compares current & past activities for context
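
     The Workflows and Dependencies items can be sketched as a dependency graph. This is an illustration of the idea only, not how Cloudera Manager is implemented: the dependency map and the helper names `start_order` and `affected_by` are assumptions made for the example. It uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map for illustration: each service lists the
# services that must be running before it can start.
SERVICE_DEPENDENCIES = {
    "zookeeper": set(),
    "hdfs": set(),
    "mapreduce": {"hdfs"},
    "hbase": {"hdfs", "zookeeper"},
}


def start_order(dependencies):
    """Return a start sequence in which every service follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def affected_by(service, dependencies):
    """Return the services that directly or indirectly depend on `service`."""
    affected = set()
    frontier = [service]
    while frontier:
        current = frontier.pop()
        for name, needs in dependencies.items():
            if current in needs and name not in affected:
                affected.add(name)
                frontier.append(name)
    return affected


if __name__ == "__main__":
    print(start_order(SERVICE_DEPENDENCIES))
    print(sorted(affected_by("hdfs", SERVICE_DEPENDENCIES)))
```

     The start order covers the "correct sequence" point, and `affected_by` answers the question of what else is touched when one service is stopped or restarted.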

14
Cloudera Manager Key Features
                  Installs the complete Hadoop stack in minutes via a wizard-based interface

                  Gives you complete, end-to-end visibility and control over your Hadoop cluster from a single
                  interface
                  Allows you to manage multiple clusters from a single instance of Cloudera Manager

                  Integrates Cloudera Manager with Active Directory

                  Establishes the time context globally for almost all views

                  Correlates jobs, activities, logs, system changes, configuration changes and service metrics along
                  a single timeline to simplify diagnosis
                  Sets server roles, configures services and manages security across the cluster

                  Gracefully starts, stops and restarts services as needed
                  Supports Administrator and Read-Only users

                  Maintains a complete record of configuration changes with the ability to roll back to previous
                  states
                  Monitors dozens of service performance metrics and alerts you when you approach critical
                  thresholds
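
                  Threshold-based alerting, as described in the last item, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Cloudera Manager's alerting logic: the metric names, threshold values and the function `check_thresholds` are hypothetical.

```python
def check_thresholds(metrics, thresholds):
    """Compare metric readings against warning and critical thresholds.

    metrics:    {metric_name: current_value}
    thresholds: {metric_name: (warning_level, critical_level)}
    Returns a list of (metric_name, level, value) tuples, sorted by
    metric name, for every metric at or above its warning level.
    Metrics without a configured threshold are ignored.
    """
    alerts = []
    for name, value in sorted(metrics.items()):
        if name not in thresholds:
            continue
        warning, critical = thresholds[name]
        if value >= critical:
            alerts.append((name, "critical", value))
        elif value >= warning:
            alerts.append((name, "warning", value))
    return alerts


if __name__ == "__main__":
    # Hypothetical readings and limits, in percent.
    readings = {"dfs_used_pct": 82.0, "heap_used_pct": 97.0, "cpu_pct": 10.0}
    limits = {
        "dfs_used_pct": (80.0, 95.0),
        "heap_used_pct": (85.0, 95.0),
        "cpu_pct": (90.0, 99.0),
    }
    for alert in check_thresholds(readings, limits):
        print(alert)
```

                  Using a separate warning level below the critical level is what gives the "when you approach critical thresholds" behavior: the warning fires before the limit itself is hit.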
15
Cloudera Manager Key Features (Cont'd)
                  Gather, view and search Hadoop logs collected from across the cluster

                  Scans Hadoop logs for irregularities and warns you before they impact the cluster
                  Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user
                  services and activities and makes them available for alerting and searching


                  Generates email alerts when certain events occur

                  Consolidates all cluster activity into a single, real-time view

                  View information pertaining to hosts in your cluster including status, resident memory, virtual
                  memory and roles
                  Visualize health status and metrics across the cluster to quickly identify problem nodes and take
                  action
                  Visualize current and historical disk usage by user, group and directory
                  Track MapReduce activity on the cluster by job or user
                  Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with
                  resolution
                  Easily integrate Cloudera Manager with your existing enterprise-wide management and monitoring
                  tools

16
Cloudera Manager Roadmap
     •   Cloudera Manager 4.1 – Released 10/24
           • Platform Support for CDH4.1
           • Cloudera Impala management & monitoring
           • New monitoring – ZooKeeper, Flume NG
           • Maintenance Mode
           • Host Decommissioning
           • Several Usability Enhancements


     •   Cloudera Manager 4.5 – Early 2013
           •   Rolling Upgrades/ Restarts
           •   Enhanced Monitoring, Cluster Heatmaps etc.
           •   Role Groups Configuration
           •   Cloud Support
           •   Others – SNMP support, Error handling, ISV integration etc.


17
Why Cloudera Manager?
      Simple
      End-to-End Hadoop administration in a single tool

      Intelligent
      Manages Hadoop at a system level – Cloudera’s experience realized in software

      Efficient
      Simplifies complex workflows & makes administrators more productive

      Best-in-Class
      The only enterprise-grade Hadoop management application available

18
Next Steps
     • Try out the FREE edition of Cloudera Manager
        •   Download from:
            http://www.cloudera.com/products-services/tools/
        •   Support available via scm-users@cloudera.org


     • For Cloudera Enterprise subscriptions, please contact:
       sales@cloudera.com

19
Q&A




20
Key Features
     Cloudera Manager




22
Install A Cluster In 3 Simple Steps
     Cloudera Manager Key Features


     1   Find Nodes
         Enter the names of the hosts which will be included in the Hadoop cluster. Click Continue.

     2   Install Components
         Cloudera Manager automatically installs the CDH components on the hosts you specified.

     3   Assign Roles
         Verify the roles of the nodes within your cluster. Make changes as necessary.


23
View Service Health & Performance
     Cloudera Manager Key Features




24
Get Host-Level Snapshots
     Cloudera Manager Key Features




25
Monitor & Diagnose Cluster Workloads
     Cloudera Manager Key Features




26
Gather, View & Search Hadoop Logs
     Cloudera Manager Key Features




27
Track Events From Across The Cluster
     Cloudera Manager Key Features




28
Report On System Performance & Usage
     Cloudera Manager Key Features




29
Visualize Health Status With Heatmaps
     Cloudera Manager Key Features




30
Manage Multiple CDH Clusters
     Cloudera Manager Key Features




31
Easily Configure High Availability
     Cloudera Manager Key Features




32
Set The Time Context Globally
     Cloudera Manager Key Features




33

Strata + Hadoop World 2012: Taming the Elephant - Learn how Monsanto manages their Hadoop clusters to enable Genome/Sequence processing


Editor's Notes

  • #5  Monsanto is a St. Louis-based agricultural company with one goal in mind – produce more food, fiber and fuel using fewer inputs like water and land, while improving the lives of the people around the world that benefit from our technology.

    Monsanto utilizes a systems approach to improving upon today’s agricultural offerings – Breeding, Biotechnology, and Advanced Agronomic Practices. These three facets of our approach help farmers improve productivity, reduce the costs of farming, and grow better foods for consumers and better feed for animals.

    We’re proud to have customers of all kinds; from large-acre, technology-driven row-crop farmers in Central Illinois all the way to farmers with very small landholdings who are just beginning to realize the benefits of modern agriculture in Africa.
  • #6 Sustainably increasing yield, while more efficiently using inputs and resources, requires every tool at farmers’ disposal. At Monsanto, we’re focused on three pillars for driving yield: breeding, biotechnology and improved agronomic practices. All three are required to meet our goals.

    Basics of Breeding
    Breeding, a technique that has been practiced by farmers for thousands of years, involves bringing together two parent plants to produce a new offspring that contains a mixture of parent characteristics. Monsanto has assembled a pool of elite seed genetics (germplasm) from around the world, and we use cutting-edge technology to help us more quickly, efficiently and accurately find desired traits for breeding. Our primary method is using genetic analysis – mapping the DNA of plants – to identify seeds with traits we want, such as improved yield, disease resistance, suitability for a particular climate, and in the case of vegetables better taste and nutrition.

    Basics of Biotechnology
    Biotechnology is the process of inserting a gene from one species, like a plant or a bacterium, into another species. We use biotechnology to give plants desirable characteristics (or traits) that often cannot be developed through breeding practices. The traits we develop help farmers produce more of their crop, reduce costs and conserve resources. Examples of these traits would be herbicide tolerance, insect-resistance and drought-tolerance. We also are working to develop traits that will benefit consumers, such as soybeans that produce healthier oils.

    Basics of Agronomics
    Agronomic practices are steps farmers incorporate into their farm management systems to improve soil quality, enhance water use, manage crop residue and improve the environment through better fertilizer management. These steps not only improve a farmer’s bottom line by decreasing input costs, but also improve the environment by decreasing water use and over-fertilization.

    Improved agronomics cover a broad range of practices, suitable for any type of farm. For example, a high-tech, high productivity grower may use GPS and computer systems to automate planting for optimal row spacing and varying inputs acre by acre, to produce more and conserve more. A subsistence farmer can see significant benefits by learning about input management and optimal plant spacing to reduce costs and improve yield. Conservation tillage is a broadly applicable technique that preserves topsoil and locks in moisture.