by Ran Silberman
DevOps for Big Data
Cluster management tools
20.4.2015
Hosted by:
FullStack Developers Israel
Ran Silberman,
Big Data Architect
...and amateur birder
● Explain Cluster Management
tools by example
● Demo Cloudera Management
● Pros and Cons
Agenda
Birds of Brazil Wiki application
● Input photos and locations
● Batch: Display statistics on bird,
location & photographer.
● Real-time: Count how many birds
were seen in the last minute from
each species
Application
requirements
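The real-time requirement above (count sightings per species over the last minute) can be illustrated with a small sliding-window counter. This is a minimal sketch in plain Python, not the Spark streaming job the talk has in mind; the class name `SpeciesWindowCounter`, its methods, and the assumption that events arrive in timestamp order are all hypothetical choices made for illustration.

```python
from collections import Counter, deque

WINDOW_SECONDS = 60  # "the last minute" from the requirement


class SpeciesWindowCounter:
    """Counts sightings per species over a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self._events = deque()    # (timestamp, species) pairs, oldest first
        self._counts = Counter()  # running count per species inside the window

    def add(self, timestamp, species):
        """Record one sighting; timestamps are assumed to arrive in order."""
        self._events.append((timestamp, species))
        self._counts[species] += 1

    def counts(self, now):
        """Return {species: count} for sightings in the interval (now - window, now]."""
        cutoff = now - self.window_seconds
        # Evict sightings that have fallen out of the window.
        while self._events and self._events[0][0] <= cutoff:
            _, species = self._events.popleft()
            self._counts[species] -= 1
            if self._counts[species] == 0:
                del self._counts[species]
        return dict(self._counts)
```

A streaming engine would typically offer windowed aggregation that replaces this bookkeeping; the sketch only shows the logic being asked for.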
● Volume growth
● Velocity of Streaming and Batch
● Same env from DEV to PROD
● Data from PROD to test on DEV
● Manage Deployment of many
applications on many nodes
Big Data lifecycle
considerations
● HDFS for storing the data
● Hive for batch processing
● Solr/Elasticsearch for search
● Spark for streaming
● ...Home-grown applications
Choosing the
Infrastructures
Many Infrastructures
How can we manage all those
infrastructures?
● Hortonworks Ambari
or
● Cloudera Manager
Choosing the
Management
tool
● All platforms & infrastructures are
installed by the tool
● Monitoring, Audits & logs are
built-in
● Easy installation and upgrade
● Save scripting work
What is new for
the DevOps
pipeline?
● Manage cluster with GUI or API
● Hadoop installation and setup
● System monitoring & alerts
● Built-in systems: ZooKeeper,
Spark, Hive, Impala and more
● Ability to add parcels
CM features
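Managing the cluster through the API, as opposed to the GUI, usually means sending authenticated HTTP requests to the Cloudera Manager server. The sketch below only builds such a request and does not send it. The port (`7180`), the API version (`v10`), and the helper name `build_cm_request` are assumptions for illustration; check the values that apply to your CM release.

```python
import base64
import urllib.request


def build_cm_request(host, path, user, password, port=7180, api_version="v10"):
    """Build (but do not send) a GET request for the Cloudera Manager REST API.

    The port and API version defaults are assumptions, not values from the talk.
    """
    url = f"http://{host}:{port}/api/{api_version}/{path.lstrip('/')}"
    # HTTP Basic authentication: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request
```

Passing the result to `urllib.request.urlopen` against a reachable server, for example with the path `clusters`, would be the next step; that call is left out here because it needs a live cluster.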
● Monolithic packages
● Relocatable
● sudo-less installs
● Rolling upgrade
Parcels
Custom Service Descriptors
● CSD is a descriptor for a service
used by CM
● Defines how to install, start and
stop a service, and the logic used
by CM
CSD
Demo
● Archive data in Hadoop
● Growing data affects DWH
performance & capabilities
● Creating realistic testing data
● Dev and Prod env. may differ in
cluster size (dev may be 1 node)
More DevOps
considerations
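One possible way to create realistic testing data for a small DEV cluster is to copy a deterministic sample of PROD records. The sketch below is an assumption about how that could be done, not a method described in the talk; the function names and the `photo_id` field used in the usage note are hypothetical. Hashing the record key keeps the sample stable between runs, so the same records land in DEV each time.

```python
import hashlib


def in_sample(record_key, percent):
    """Return True if the record falls into the `percent`% sample (0 to 100)."""
    digest = hashlib.sha256(record_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < percent


def sample_records(records, key_field, percent):
    """Keep only the records whose key hashes into the sample."""
    return [r for r in records if in_sample(str(r[key_field]), percent)]
```

For example, `sample_records(photos, "photo_id", 10)` would keep roughly a tenth of the records in a list of dictionaries named `photos`, and always the same ones.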
Tools Comparison
                  CM                         Ambari
Licence           Paid Enterprise edition    Free, Apache open source
Technology        Cloudera                   Puppet, Ganglia, Nagios
Dependency        CDH                        HDP
Manage cluster    Parcels                    Yum
REST API          +                          +
Extra Features: Rolling upgrade, 3rd-party management, extendable by REST API
CM features
                              Express    Enterprise
Subscription                  Free       Annual
Deployment & Configuration    +          +
Management                    +          +
Monitoring                    +          +
Diagnostic                    +          +
Extra Features                           Reports, Rollbacks, Rolling Upgrade, AD Kerberos, Kerberos wizard, Backup & DR
● Fast Deploy
● Easy management by GUI
● Built-in monitoring and alerts
● Simple upgrades
● Same management and deploy
in Dev and Prod
Pros. of Hadoop
Management
tools
● Tied to a specific vendor's
proprietary system
● Tied to system version by
Parcels
● Less flexibility for low-level
management
Cons. of Hadoop
Management
tools
THANK YOU
Ran Silberman
Email: ran@tikalk.com

Editor's Notes

  • #17
    ● Manage services health
    ● Show timeline
    ● Search box
    ● Start or stop services/cluster
    ● Enable HDFS high availability
    ● Enable Kerberos
    ● Change HDFS block size from Configuration
    ● View configuration history
    ● View host's status (charts) and processes
    ● Obtain version of CDH > hosts > hosts inspector
    ● Upgrade CDH using parcels
    ● Install CSD, change port