1. 1© Cloudera, Inc. All rights reserved.
Apache Hadoop Operations
for Production Systems
Philip Zeyliger, Philip Langdale, Kathleen Ting, Miklos Christine
Strata Hadoop NYC, 29 September 2015
2. 2© Cloudera, Inc. All rights reserved.
Your Hosts
Philip
Langdale
Philip
Zeyliger
Miklos
Christine
Kathleen
Ting
3. 3© Cloudera, Inc. All rights reserved.
Philip Langdale
• CM Architect
• philipl@cloudera
Philip Zeyliger
• CM Architect
• philip@cloudera
Kathleen Ting
• Customer Success Manager
• kate@cloudera
Miklos Christine
• Solutions Engineer
• mwc@databricks.com
$ whoami
4. 4© Cloudera, Inc. All rights reserved.
Overall Agenda
• Intro
• Installation
• Configuration
• Troubleshooting
• Enterprise Considerations
(lunch) 12:30-1:30pm
• Hands-on Lab Exercises
(break) 3-3:30pm
• Continue hands-on exercises
• AMA
Q&A at end of every section
(Opening Reception) 5pm @ 3E
5. 5© Cloudera, Inc. All rights reserved.
Prerequisites (for hands-on portions)
SSH Client
○ PuTTY for Windows
○ Terminal for OS X
Web Browser
Wi-Fi
6. 6© Cloudera, Inc. All rights reserved.
Hands-on Logistics
We’ll be handing out URLs to your clusters shortly!
7. 7© Cloudera, Inc. All rights reserved.
Asking Questions
sli.do/ops
(reasonably mobile-friendly)
8. 8© Cloudera, Inc. All rights reserved.
Why Apache Hadoop?
Solves problems that don’t fit on a single computer.
Doesn’t require you to be a distributed systems person.
Handles failures for you.
[Venn diagram: the overlap of Data Analysts and Distributed Systems People — Unicorns!]
9. 9© Cloudera, Inc. All rights reserved.
A Distributed System
Many processes, which, taken together, are trying to act
like a single system.
Ideally, users deal with the system as a whole.
For operations, you need to understand the system by
parts too.
10. 10© Cloudera, Inc. All rights reserved.
The Hadoop Stack
[Diagram: machines (Linux, JVM, Hadoop daemons; disk, CPU, memory) hosting the full stack — File storage: HDFS; Random-access storage: HBase; Resource management: YARN; Batch processing: MapReduce; In-memory processing: Spark; Interactive SQL: Impala; Search: Solr; Coordination: ZooKeeper; Security: Sentry; Event ingest: Flume, Kafka; DB import/export: Sqoop; User interface: Hue; Languages/APIs: Hive, Pig, Crunch, Kite, Mahout; plus other systems: httpd, SAS, custom apps, etc.]
11. 11© Cloudera, Inc. All rights reserved.
Storage
Random Access Storage: HBase
File Storage: HDFS
Search: Solr
12. 12© Cloudera, Inc. All rights reserved.
Execution
In-memory processing: Spark
Interactive SQL: Impala
Batch processing: MapReduce
Search: Solr
13. 13© Cloudera, Inc. All rights reserved.
Compilers
Hive—SQL to MR/Spark compiler, metadata mapping
between files and tables
Pig—PigLatin to MR compiler
Languages/APIs: Hive, Pig, Crunch,
Kite, Mahout
14. 14© Cloudera, Inc. All rights reserved.
Taken all together...
[Diagram: the full Hadoop stack from slide 10 again — all services layered over the machines]
15. 15© Cloudera, Inc. All rights reserved.
How to learn these things...
These systems are largely defined by the state they store on
disk (what will survive a restart?) and the defined protocols
(RPCs) they use to interact amongst themselves and with the
outside world.
17. 17© Cloudera, Inc. All rights reserved.
Apache Hadoop Operations
for Production Systems:
Installation
Philip Langdale
18. 18© Cloudera, Inc. All rights reserved.
Agenda
• Hardware Considerations
• Node types and recommended role allocations
• Host configuration
• Rack configuration
• Software Installation
• OS Prerequisites
• Installing Hadoop and other Ecosystem components
• Launch
• Initial Configuration
• Sanity testing
• Security considerations
19. 19© Cloudera, Inc. All rights reserved.
Hardware Considerations
• As a distributed system, Hadoop is going to be deployed onto
multiple interconnected hosts
• How large will the cluster be?
• What services will be deployed on the cluster?
• Can all services effectively run together on the same hosts or is
some form of physical partitioning required?
• What role will each host play in the cluster?
• This impacts the hardware profile (CPU, Memory, Storage, etc)
• How should the hosts be networked together?
20. 20© Cloudera, Inc. All rights reserved.
Host Roles within a Cluster
Master Node: HDFS NameNode, YARN ResourceManager, HBase Master, Impala StateStore, ZooKeeper
Worker Node: HDFS DataNode, YARN NodeManager, HBase RegionServer, Impalad
Utility Node: Relational Database, Management (e.g. Cloudera Manager), Hive Metastore, Oozie, Impala Catalog Server
Edge Node: Gateway Configuration, Client Tools, Hue, HiveServer2, Ingest (e.g. Flume)
For larger clusters, roles will be spread across multiple nodes of a given type
21. 21© Cloudera, Inc. All rights reserved.
Roles vs Cluster Size
Cluster Size       Master   Worker   Utility   Edge
Very Small (≤10)   1        ≤10      1 shared host
Small (≤20)        2        ≤20      1 shared host
Medium (≤200)      3        ≤200     2         1+
Large (≤500)       5        ≤500     2         1+
22. 22© Cloudera, Inc. All rights reserved.
Host Hardware Configuration
• CPU
• There’s no such thing as too much CPU
• Jobs typically do not saturate their cores, so raw clock speed is
not at a premium
• Cost and Budget are the major factors here
• Memory
• You really don’t want to overcommit and swap
• Java heaps should fit into physical RAM with additional space for
OS and non-Hadoop processes
23. 23© Cloudera, Inc. All rights reserved.
Host Configuration (cont.)
• Disk
• More spindles == More I/O capacity
• Larger drives == lower cost per TB
• More hosts with less capacity increases parallelism and
decreases re-replication costs when replacing a host
• Fewer hosts with more capacity generally means lower unit cost
• Rule of thumb: One disk per two configured YARN vcores
• Lower latency disks are generally not a good investment, except
for specific use-cases where random I/O is important
24. 24© Cloudera, Inc. All rights reserved.
A Hadoop Machine
[Diagram: a single machine — Linux on disk/CPU/memory, a JVM on Linux, Hadoop daemons in the JVM]
25. 25© Cloudera, Inc. All rights reserved.
A Hadoop Rack
[Diagram: a rack — several machines (Linux, JVM, Hadoop daemons) connected to a top-of-rack switch]
26. 26© Cloudera, Inc. All rights reserved.
A Hadoop Cluster
[Diagram: a cluster — several racks, each with machines behind a top-of-rack switch, joined by a backbone switch]
27. 27© Cloudera, Inc. All rights reserved.
[Diagram: multiple clusters, each with racks behind its own backbone switch, connected to each other over a WAN]
28. 28© Cloudera, Inc. All rights reserved.
Rack Configuration
• Common: 10Gb (or 20Gb bonded) to the server, 40Gb to the spine
• Cost sensitive: 2Gb bonded to the server, 10Gb to the spine
• This is likely to be a false economy, with the real potential of
being network bottlenecked with disks idle
• Look for 25/100 in the next couple of years
• Network I/O is generally consumed by reading/writing data from
disks on other nodes
• Teragen is a useful benchmark for network capacity
• Typically 3-9x more intensive than normal workloads
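As a rough sketch of using teragen/terasort to exercise the network (the jar path and HDFS directories are placeholders; adjust for your distribution — the block is guarded so it is a no-op on machines without a Hadoop client):

```shell
# 10^10 rows x 100 bytes/row = 1 TB of generated input.
ROWS=10000000000
# Placeholder path: wherever your distro installs the MapReduce examples jar.
EXAMPLES_JAR=/opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-examples.jar

if command -v hadoop >/dev/null 2>&1; then
  # With 3x replication, each generated block is also written to two remote
  # nodes, so the write path alone pushes ~2 TB across the network.
  hadoop jar "$EXAMPLES_JAR" teragen "$ROWS" /benchmarks/teragen
  # The sort's shuffle phase moves most of the data between nodes again.
  hadoop jar "$EXAMPLES_JAR" terasort /benchmarks/teragen /benchmarks/terasort
fi
```

Watch per-host network counters (sar, Ganglia) during the run to see whether the disks or the network saturate first.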
29. 29© Cloudera, Inc. All rights reserved.
Software Installation
• Linux Distribution
• Operating System Configuration
• Hadoop Distribution
• Distribution Lifecycle Management
30. 30© Cloudera, Inc. All rights reserved.
Linux Distributions
• All enterprise distributions are credible choices
• Do you already buy support from a vendor?
• Which distributions does your Hadoop distro support?
• What are you already familiar with?
• Cloudera supports
• RHEL 5.x, 6.x
• Ubuntu 12.04, 14.04
• Debian 6.x, 7.x
• SLES 11
• RHEL 6.x is the most common
31. 31© Cloudera, Inc. All rights reserved.
Operating System Configuration
• Turn off IPTables (or any other firewall tool)
• Turn off SELinux
• Turn down swappiness to 10 or less
• Turn off Transparent Huge Page Compaction
• Use a network time source to keep all hosts in sync (also timezones!)
• Make sure forward and reverse DNS work on each host to resolve all
other hosts consistently
• Use the Oracle JDK - OpenJDK is subtly different and may lead you to
grief
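The checklist above, sketched as a root-run script. Commands are RHEL 6-style (service/chkconfig); systemd-based distros differ, so treat this as an outline rather than a drop-in:

```shell
SWAPPINESS=10   # 10 or less, per the recommendation above

if [ "$(id -u)" -eq 0 ] && [ -e /etc/redhat-release ]; then
  # Best effort: each step tolerates a service that is not installed.
  service iptables stop || true; chkconfig iptables off || true  # no firewall between hosts
  setenforce 0 || true              # SELinux off (persist in /etc/selinux/config)
  sysctl -w vm.swappiness="$SWAPPINESS"   # discourage swapping of daemon heaps
  if [ -w /sys/kernel/mm/transparent_hugepage/defrag ]; then
    echo never > /sys/kernel/mm/transparent_hugepage/defrag  # no THP compaction
  fi
  service ntpd start || true; chkconfig ntpd on || true      # network time source
fi
```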
32. 32© Cloudera, Inc. All rights reserved.
Cloudera
Manager
provides a Host
Inspector to
check for these
situations
33. 33© Cloudera, Inc. All rights reserved.
Hadoop Distributions
• Well, we’re obviously biased here
• CDH is a jolly good Hadoop Distro. We recommend it
• You’re free to try others
• While it’s technically possible to have a go at running a cluster without using any management tools, it’s not going to be fun and we’re not going to talk about that much
34. 34© Cloudera, Inc. All rights reserved.
Distribution Lifecycle Management
• Theoretically, you might want to just install the binaries for services
and programs you are running on a specific node
• But honestly, space is not at that much of a premium
• Install everything everywhere and don’t worry about it again
• If you decide to alter the footprint of a service, you don’t need to
worry about binaries
• You don’t need different profiles for different hosts in the cluster
• Cloudera Manager works this way
35. 35© Cloudera, Inc. All rights reserved.
Distribution Lifecycle Management (cont.)
• For CDH, Cloudera Manager can handle lifecycle management
through parcels
• For package based installations, there are a variety of recognised
options
• Puppet, Chef, Ansible, etc
• But please use something
• Long term package management by hand is a recipe for disaster
• Exposes you to having inconsistencies between hosts
• Missing Packages
• Un-upgraded Packages
36. 36© Cloudera, Inc. All rights reserved.
Lifecycle
Management
with Parcels in
Cloudera
Manager
41. 41© Cloudera, Inc. All rights reserved.
Launch
• Initial Configuration
• Sanity testing
• Security Considerations
42. 42© Cloudera, Inc. All rights reserved.
Initial Configuration
• Recall our earlier discussion of hardware
• YARN vcores (or MapReduceV1 slots) proportional to physical cores
• Typically 1:1 or 1.5:1 (1.5:1 with hyperthreading)
• Heap sizes
• Don’t overcommit on memory, but don’t make Java heaps too large (GC)
• See Memory table in Appendix
• Mounting data disks
• One partition per disk
• No RAID
• Use a well established filesystem (ext4)
• Use a uniform naming scheme
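A sketch of disk preparation following the bullets above. The device name is hypothetical and the formatting function is deliberately defined but not invoked — call it once per data disk on a real host:

```shell
# Illustrative: prepare one data disk (ext4, no RAID, uniform /data/N names).
prepare_data_disk() {
  DISK=$1; N=$2                          # e.g. /dev/sdb1 and 1 (hypothetical)
  mkfs.ext4 -m 0 "$DISK"                 # reserve no blocks for root on data disks
  mkdir -p "/data/$N"
  mount -o noatime "$DISK" "/data/$N"    # noatime avoids metadata writes on reads
  echo "$DISK /data/$N ext4 noatime 0 0" >> /etc/fstab
}

# A uniform naming scheme makes dfs.datanode.data.dir easy to generate:
NUM_DISKS=12
DATA_DIRS=$(for i in $(seq 1 "$NUM_DISKS"); do printf '/data/%d/dfs/dn,' "$i"; done)
DATA_DIRS=${DATA_DIRS%,}                 # trim trailing comma
```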
43. 43© Cloudera, Inc. All rights reserved.
Initial Configuration (cont.)
• Think about space on the OS partition(s)
• /var is where your logs go by default
• /opt is where Cloudera Manager stores parcels
• By default, these are part of / on modern distros, which
works out fine
• Your IT policies may require separate partitions
• If these areas are too small, then you’ll need to change
these configurations
44. 44© Cloudera, Inc. All rights reserved.
Cloudera Manager can help
• Assigns roles based on cluster size
• Tries to detect masters based on physical
capabilities
• Sets vcore count based on detected host CPUs
• Sets heap sizes to avoid overcommitting RAM
• Autodetects mounted drives and assigns them for
DataNodes
45. 45© Cloudera, Inc. All rights reserved.
Initial Service
Configuration in
Cloudera
Manager
46. 46© Cloudera, Inc. All rights reserved.
Implications of Services in Use
• Different services running concurrently means we need to
consider how resources are shared between them
• Services that use YARN can be managed through YARN’s
scheduler
• But certain services do not - most visibly HBase, but also
services like Accumulo or Flume
• These services can run on a shared cluster through the use
of static resource partitioning with cgroups
• Cloudera Manager can configure these
49. 49© Cloudera, Inc. All rights reserved.
Sanity Testing
• Use the basic sanity tests provided by each service
• Submit example jobs: pi, sleep, teragen/terasort
• Work with sample tables in Hive/Impala
• Use Hue to do these things if your users will
• Repeat these tests when turning on
security/authentication/authorization mechanisms
• Make sure they succeed for the expected users and fail for
others
• Cloudera Manager provides some ‘canary’ tests for certain services
• HDFS create/read/write/delete
• HBase, ZooKeeper, etc.
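The basic sanity jobs above, sketched as commands (the examples jar path is a placeholder for wherever your distribution installs it; guarded so it is a no-op without a Hadoop client):

```shell
EXAMPLES_JAR=/opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-examples.jar
MAPS=10        # pi: number of map tasks
SAMPLES=100    # pi: samples per map

if command -v hadoop >/dev/null 2>&1; then
  hadoop jar "$EXAMPLES_JAR" pi "$MAPS" "$SAMPLES"      # quick end-to-end MR check
  hadoop jar "$EXAMPLES_JAR" teragen 1000000 /tmp/tg    # small HDFS write test
  hadoop jar "$EXAMPLES_JAR" terasort /tmp/tg /tmp/ts   # small shuffle/sort test
  hdfs dfs -rm -r -skipTrash /tmp/tg /tmp/ts            # clean up
fi
```

Run these once as an admin user and once as a regular user; after enabling security, repeat them and confirm unauthorized users fail.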
53. 53© Cloudera, Inc. All rights reserved.
Detailed Role Assignments
• Very Small (Up to 10 workers, No High Availability)
• 1 Master Node
• NameNode, YARN Resource Manager, Job History Server,
ZooKeeper, Impala StateStore
• 1 Utility/Edge Node
• Secondary NameNode, Cloudera Manager, Hive Metastore,
HiveServer2, Impala Catalog, Hue, Oozie, Flume, Relational
Database, Gateway configurations
• 3-10 Worker Nodes
• DataNode, NodeManager, Impalad, Llama
54. 54© Cloudera, Inc. All rights reserved.
Detailed Role Assignments
• Small (Up to 20 workers, High Availability)
• 2 Master Nodes
• NameNode (with JournalNode and FailoverController), YARN Resource
Manager, ZooKeeper
• (1 Node each) Job History Server, Impala StateStore
• 1 Utility/Edge Node
• Cloudera Manager, Hive Metastore, HiveServer2, Impala Catalog, Hue,
Oozie, Flume, Relational Database, Gateway configurations
• (requires dedicated spindle) Zookeeper, JournalNode
• 3-20 Worker Nodes
• DataNode, NodeManager, Impalad, Llama
55. 55© Cloudera, Inc. All rights reserved.
Detailed Role Assignments
• Medium (Up to 200 workers, High Availability)
• 3 Master Nodes
• (3 Nodes) Zookeeper, JournalNode
• (2 Nodes each) NameNode (with FailoverController), YARN Resource Manager
• (1 Node each) Job History Server, Impala StateStore
• 2 Utility Nodes
• Node 1: Cloudera Manager, Relational Database
• Node 2: CM Management Service, Hive Metastore, Catalog Server, Oozie
• 1+ Edge Nodes
• Hue, HiveServer2, Flume, Gateway configuration
• 50-200 Worker Nodes
• DataNode, NodeManager, Impalad, Llama
56. 56© Cloudera, Inc. All rights reserved.
Detailed Role Assignments
• Large (Up to 500 workers, High Availability)
• 5 Master Nodes
• (5 Nodes) Zookeeper, JournalNode
• (2 Nodes) NameNode (with FailoverController)
• (2 different Nodes) YARN Resource Manager
• (1 Node each) Job History Server, Impala StateStore
• 2 Utility Nodes
• Node 1: Cloudera Manager, Relational Database
• Node 2: CM Management Service, Hive Metastore, Catalog Server, Oozie
• 1+ Edge Nodes
• Hue, HiveServer2, Flume, Gateway configuration
• 200-500 Worker Nodes
• DataNode, NodeManager, Impalad, Llama
57. 57© Cloudera, Inc. All rights reserved.
Memory Allocation
Item                        RAM Allocated
Operating System Overhead   2 GB (minimum)
DataNode                    1-4 GB
YARN NodeManager            1 GB
YARN ApplicationMaster      1 GB
YARN Map/Reduce Containers  1-2 GB/container
HBase RegionServer          4-12 GB
Impala                      128 GB (can be reduced with spill-to-disk)
59. 59© Cloudera, Inc. All rights reserved.
Apache Hadoop Operations
for Production Systems:
Configuration
Philip Zeyliger
60. 60© Cloudera, Inc. All rights reserved.
Agenda
• Mechanics
• Key configurations
• Resource Management
• Configurations...are...living documents!
61. 61© Cloudera, Inc. All rights reserved.
What’s there to configure, anyway?
• On a 100-node cluster, there are likely 400+ daemon processes running (HDFS DataNode, YARN NodeManager, HBase RegionServer, Impalad). Each has environment variables, config files, and command line options!
• Where your daemons run is configuration too, but often implicit. Moving from machine A to machine B is a configuration change!
• Most (but not all) settings require a restart to take effect.
• Different kinds of restarts (e.g., rolling)
• A management tool will help you with “scoping.” Some configurations must
be the same globally (e.g., kerberos), some make sense within a service
(HDFS Trash), some per-daemon
62. 62© Cloudera, Inc. All rights reserved.
$ vi /etc/hadoop/conf/hdfs-site.xml
• Configs are key-value pairs, in a straightforward if verbose XML format
• When in doubt, place configuration everywhere, since you might not know
whether the client reads it or which daemons read it.
• Dear golly, use configuration management.
(This is one of many config files!)
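Each setting in these files is one property element; for example, in hdfs-site.xml (the property name is real, the paths are illustrative):

```xml
<!-- hdfs-site.xml: one <property> element per key-value pair -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
</property>
```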
65. 65© Cloudera, Inc. All rights reserved.
Step Three: at top-level, note that restart is needed
Step Four: review changes
Step five: restart
66. 66© Cloudera, Inc. All rights reserved.
Show me the files!
• core-site.xml, hdfs-site.xml, dfs_hosts_allow, dfs_hosts_exclude, hbase-site.xml, hive-env.sh, hive-site.xml, hue.ini, mapred-site.xml, oozie-site.xml, yarn-site.xml, zoo.cfg (and so on)
• e.g., /var/run/cloudera-scm-agent/process/*-NAMENODE
67. 67© Cloudera, Inc. All rights reserved.
Let’s take a look!
• Demo: find the files via UI
68. 68© Cloudera, Inc. All rights reserved.
How to double-check?
• http://ec2-54-209-51-178.compute-1.amazonaws.com:50070/conf
69. 69© Cloudera, Inc. All rights reserved.
Key Configuration Themes (everyone)
• Security (Kerberos, SSL, ACLs?)
• Ports
• Local Storage
• JVM (use Oracle JDK 1.7)
• Databases (back them up)
• Heap Sizes (different workload? → different config!)
70. 70© Cloudera, Inc. All rights reserved.
Rack Topology
• Hadoop cares about racks because:
• Shared failure zone
• Network bandwidth
• When operating a large cluster, tell Hadoop (by use of a rack locality script)
what machines are in which rack.
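A rack locality script is any executable that maps host names or IPs to rack paths; Hadoop is pointed at it via net.topology.script.file.name in core-site.xml. A minimal sketch (the subnet-to-rack mapping is invented):

```shell
#!/bin/sh
# Minimal rack topology script: Hadoop invokes it with one or more host
# names/IPs and expects one rack path per line in response.
rack_of() {
  case "$1" in
    10.1.1.*) echo "/rack1" ;;          # example: rack 1's subnet
    10.1.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;   # fallback for unknown hosts
  esac
}
for host in "$@"; do rack_of "$host"; done
```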
71. 71© Cloudera, Inc. All rights reserved.
Networking
• Does your DNS work?
• Like, forwards and backwards?
• For sure?
• Have you really checked?
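One way to really check, run on every host (a sketch; lookup failures are reported rather than fatal):

```shell
# Verify forward and reverse DNS agree for this host.
if command -v host >/dev/null 2>&1; then
  FQDN=$(hostname -f)
  host "$FQDN" || echo "forward lookup failed for $FQDN"
  IP=$(host "$FQDN" | awk '/has address/ {print $4; exit}')
  if [ -n "$IP" ]; then
    host "$IP" || echo "reverse lookup failed for $IP"   # address -> name
  fi
fi
```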
72. 72© Cloudera, Inc. All rights reserved.
HDFS Configurations of Note
• Heap sizes:
• Datanodes: linear in # of blocks
• Namenode: linear in # of blocks and # of files
select blocks_total, jvm_heap_used_mb
where roletype=DATANODE and hostname RLIKE "hodor-016.*"
73. 73© Cloudera, Inc. All rights reserved.
HDFS Configurations of Note
• Local Data Directories
• dfs.datanode.data.dir: one per spindle; avoid RAID
• dfs.namenode.name.dir: two copies of your metadata better than one
• High Availability
• Requires more daemons
• Two failover controllers co-located with namenodes
• Three journal nodes
74. 74© Cloudera, Inc. All rights reserved.
Yarn Configurations of Note
• Yarn doles out resources to applications across two axes: memory and CPU
• Define per-NodeManager resources
• yarn.nodemanager.resource.cpu-vcores
• yarn.nodemanager.resource.memory-mb
[Diagram: a NodeManager's 4 GB / 6 vcores drawn as bins, carved into containers of 1 GB + 1 core, 2 GB + 2 cores, and 1 GB + 2 cores]
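The two per-NodeManager knobs as they appear in yarn-site.xml (property names are real; the values are illustrative for a 16-core, 64 GB worker, leaving headroom for the OS and other daemons):

```xml
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>24</value>   <!-- e.g. 1.5 x 16 physical cores with hyperthreading -->
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>49152</value>   <!-- 48 GB of 64 GB, leaving room for OS/daemons -->
</property>
```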
75. 75© Cloudera, Inc. All rights reserved.
Resource management in YARN
• Fannie and Freddie go in on a large cluster together. Fannie ponies up 75%
of the budget; Freddie ponies up 25%.
• When cluster is idle, let Freddie use 100%
• When Fannie has a lot of work, Freddie can only use 25%.
• The “Fair Scheduler” implements DRF (Dominant Resource Fairness) to share the cluster fairly across CPU and memory resources.
• Configured by an “allocations.xml” file.
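The Fannie/Freddie split above might be expressed in the Fair Scheduler's allocations file roughly like this (queue names taken from the example; a 3:1 weight ratio yields the 75%/25% split when both are busy):

```xml
<?xml version="1.0"?>
<allocations>
  <queue name="fannie">
    <weight>3.0</weight>   <!-- 75% share when both queues have work -->
  </queue>
  <queue name="freddie">
    <weight>1.0</weight>   <!-- 25% share, but may use idle capacity -->
  </queue>
</allocations>
```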
77. 77© Cloudera, Inc. All rights reserved.
Configs as live documents
• Configurations that have to do with workloads will evolve.
[Diagram: feedback loop — configs shape workloads, metrics feed back into configs]
79. 79© Cloudera, Inc. All rights reserved.
Apache Hadoop Operations
for Production Systems:
Troubleshooting
Kathleen Ting
80. 80© Cloudera, Inc. All rights reserved.
Troubleshooting
Managing Hadoop Clusters
Troubleshooting Hadoop Systems
Debugging Hadoop Applications
81. 81© Cloudera, Inc. All rights reserved.
Troubleshooting
Managing Hadoop Clusters
Troubleshooting Hadoop Systems
Debugging Hadoop Applications
82. 82© Cloudera, Inc. All rights reserved.
Understanding Normal
• Establish normal
– Boring logs are good
• So that you can detect abnormal
– Find anomalies and outliers
– Compare performance
• And then isolate root causes
– Who are the suspects
– Pull the thread by interrogating
– Correlate across different subsystems
BORING
INFRASTRUCTURE
83. 83© Cloudera, Inc. All rights reserved.
Basic tools
• What happened?
– Logs
• What state are we in now?
– Metrics
– Thread stacks
• What happens when I do this?
– Tracing
– Dumps
• Is it alive?
– listings
– Canaries
• Is it OK?
– fsck / hbck
84. 84© Cloudera, Inc. All rights reserved.
Diagnostics for single machines
[Diagram: single-machine analysis — Linux on disk/CPU/memory; analysis and tools via @brendangregg]
85. 85© Cloudera, Inc. All rights reserved.
Diagnosis Tools
           HW/Kernel
Logs       /log, /var/log, dmesg
Metrics    /proc, top, iotop, sar, vmstat, netstat
Tracing    strace, tcpdump
Dumps      core dumps
Liveness   ping
Corrupt    fsck
86. 86© Cloudera, Inc. All rights reserved.
Diagnostics for the JVM
• Most Hadoop services run in the Java VM, which introduces new Java subsystems
– Just in time compiler
– Threads
– Garbage collection
• Dumping Current threads
– jstack (any threads stuck or blocked?)
• JVM Settings:
– Enabling GC Logs: -XX:+PrintGCDateStamps -XX:+PrintGCDetails
– Enabling OOME Heap dumps: -XX:+HeapDumpOnOutOfMemoryError
• Metrics on GC
– Get gc counts and info with jstat
• Memory Usage dumps
– Create heap dump: jmap -dump:live,format=b,file=heap.bin <pid>
– View heap dump from crash or jmap with jhat
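The commands above, pulled together for one daemon's JVM (a sketch; the NameNode is picked as an example, and jps/jstack/jstat/jmap ship with the JDK):

```shell
SAMPLES=5
INTERVAL_MS=1000
if command -v jps >/dev/null 2>&1; then
  PID=$(jps | awk '/NameNode/ {print $1; exit}')   # example: the NameNode's pid
  if [ -n "$PID" ]; then
    jstack "$PID" > "threads-$PID.txt"                      # any threads stuck/blocked?
    jstat -gcutil "$PID" "$INTERVAL_MS" "$SAMPLES"          # GC utilization over 5 seconds
    jmap -dump:live,format=b,file="heap-$PID.bin" "$PID"    # heap dump for jhat
  fi
fi
```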
87. 87© Cloudera, Inc. All rights reserved.
Diagnosis Tools
           HW/Kernel                                 JVM
Logs       /log, /var/log, dmesg                     enable GC logging
Metrics    /proc, top, iotop, sar, vmstat, netstat   jstat
Tracing    strace, tcpdump                           debugger, jdb, jstack
Dumps      core dumps                                jmap, enable OOME dump
Liveness   ping                                      jps
Corrupt    fsck
88. 88© Cloudera, Inc. All rights reserved.
Tools for lots of machines
[Diagram: many machines (Linux, JVM, Hadoop daemons) plus other systems — httpd, SAS, custom apps — side by side]
89. 89© Cloudera, Inc. All rights reserved.
Tools for lots of machines
[Diagram: the same machines, now monitored by Ganglia and Nagios, with per-host Ganglia*/Nagios* agents]
90. 90© Cloudera, Inc. All rights reserved.
Ganglia
• Collects and aggregates metrics from machines
• Good for what’s going on right now and describing normal perf
• Organized by physical resource (CPU, Mem, Disk)
– Good defaults
– Good for pinpointing machines
– Good for seeing overall utilization
– Uses RRDTool under the covers
• Some data scalability limitations, lossy over time.
• Dynamic for new machines, requires config for new metrics
91. 91© Cloudera, Inc. All rights reserved.
Graphite
• Popular alternative to Ganglia
• Can handle the scale of metrics coming
in
• Similar to Ganglia, but uses its own RRD-style database (Whisper)
• More aimed at dynamic metrics (as
opposed to statically defined metrics)
92. 92© Cloudera, Inc. All rights reserved.
Nagios
• Provides alerts from canaries and basic health checks for services on
machines
• Organized by Service (httpd, dns, etc)
• Defacto standard for service monitoring
• Lacks distributed-system know-how
– Requires bespoke setup for service slaves and masters
– Lacks details with multi-tenant services or short-lived jobs
93. 93© Cloudera, Inc. All rights reserved.
Tools for the Hadoop Stack
[Diagram: the machines again with Ganglia and Nagios monitoring them at the host level]
94. 94© Cloudera, Inc. All rights reserved.
Tools for the Hadoop Stack
[Diagram: the full Hadoop stack layered over the machines, with Ganglia and Nagios covering only the machine level]
95. 95© Cloudera, Inc. All rights reserved.
Diagnostics for the Hadoop Stack
• Single client call can trigger many RPCs spanning many machines
• Systems are evolving quickly
• A failure on one daemon, by design, does not cause failure of the entire service
• Logs:
– Each service’s master and slaves have their own logs: /var/log/
– There are a lot of logs and they change frequently
• Metrics:
– Each daemon offers metrics, often aggregated at masters
• Tracing:
– HTrace (integrated into Hadoop, HBase; currently in Apache Incubator)
• Liveness:
– Canaries, service/daemon web UIs
96. 96© Cloudera, Inc. All rights reserved.
Diagnosis Tools
           HW/Kernel                                 JVM                      Hadoop Stack
Logs       /log, /var/log, dmesg                     enable GC logging        *:/var/log/hadoop|hbase|…
Metrics    /proc, top, iotop, sar, vmstat, netstat   jstat                    *:/stacks, *:/jmx
Tracing    strace, tcpdump                           debugger, jdb, jstack    HTrace
Dumps      core dumps                                jmap, enable OOME dump   *:/dump
Liveness   ping                                      jps                      web UI
Corrupt    fsck                                                               HDFS fsck, HBase hbck
97. 97© Cloudera, Inc. All rights reserved.
Tools for the Hadoop Stack
[Diagram: the full stack over the machines, with Ganglia and Nagios at the machine level]
98. 98© Cloudera, Inc. All rights reserved.
Tools for the Hadoop Stack
[Diagram: every layer annotated with its diagnostic sources — logs and metrics from each service, HTrace for tracing, custom app logs; at the machine level /proc, dmesg, /var/log; at the JVM level jmx, jstack, jstat, GC logs]
99. 99© Cloudera, Inc. All rights reserved.
Tools for the Hadoop Stack
[Diagram: the full stack with Ganglia/Nagios at the machine level and a question mark over the Hadoop services layer]
100. 100© Cloudera, Inc. All rights reserved.
Ganglia and Nagios are not enough
• Fault tolerant distributed system masks many problems (by design!)
– Some failures are not critical – failure condition more complicated
• Lacks distributed-system know-how
– Requires bespoke setup for service slaves and masters
– Lacks details with multitenant services or short-lived jobs
• Hadoop services are logically dependent on each other
– Need to correlate metrics across different service and machines
– Need to correlate logs from different services and machines
– Young systems where Logs are changing frequently
– What about all the logs?
– Which of all these metrics do we really need?
• Some data scalability limitations, lossy over time
– What about full fidelity?
101. 101© Cloudera, Inc. All rights reserved.
OpenTSDB
• OpenTSDB (Time Series Database)
– Efficiently stores metric data into HBase
– Keeps data at full fidelity
– Keep as much data as your HBase instance
can handle.
• Free and Open Source
102. 102© Cloudera, Inc. All rights reserved.
Cloudera Manager
• Extracts hardware, OS, and Hadoop service metrics
specifically relevant to Hadoop and its related services.
– Operation Latencies
– # of disk seeks
– HDFS data written
– Java GC time
– Network IO
• Provides
– Cluster preflight checks
– Basic host checks
– Regular health checks
• Uses LevelDB for underlying metrics storage
• Provides distributed log search
• Monitors logs for known issues
• Point and click for useful utils (lsof, jstack, jmap)
• Free
103. 103© Cloudera, Inc. All rights reserved.
Tools for the Hadoop Stack
[Diagram: the full stack with each monitoring layer labeled — Ganglia/Nagios at the machine level; full-fidelity metrics: OpenTSDB*; metrics + logs: Cloudera Manager; logs: search a la Splunk]
104. 104© Cloudera, Inc. All rights reserved.
Activity
How many file descriptors do the datanodes have open?
What is the current latency of the HDFS canary?
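Outside Cloudera Manager, the same questions can be approximated per host (a sketch; the /tmp/canary path and the grep on the jps output are illustrative):

```shell
# Open file descriptors of the DataNode process (Linux /proc):
if command -v jps >/dev/null 2>&1; then
  DN_PID=$(jps | awk '/DataNode/ {print $1; exit}')
  if [ -n "$DN_PID" ]; then
    ls "/proc/$DN_PID/fd" | wc -l     # fd count for that process
  fi
fi
# HDFS canary latency by hand: time a small create/read/delete round trip.
if command -v hdfs >/dev/null 2>&1; then
  time sh -c 'echo ok | hdfs dfs -put - /tmp/canary &&
              hdfs dfs -cat /tmp/canary >/dev/null &&
              hdfs dfs -rm -skipTrash /tmp/canary'
fi
```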
105. 105© Cloudera, Inc. All rights reserved.
Troubleshooting
Managing Hadoop Clusters
Troubleshooting Hadoop Systems
Debugging Hadoop Applications
106. 106© Cloudera, Inc. All rights reserved.
The Law of Cluster Inertia
A cluster in a good state stays in a good state,
and
a cluster in a bad state stays in a bad state,
unless
acted upon by an external force.
108. 108© Cloudera, Inc. All rights reserved.
External Forces
• Failures
• Acts of God
• Users
• Admins
109. 109© Cloudera, Inc. All rights reserved.
Failures as an External Force
[Diagram: the full stack with failure points marked — a dead machine, a crashed JVM, failed hardware, a lost slave daemon, a failed job]
110. 110© Cloudera, Inc. All rights reserved.
Acts of God as an External Force
[Diagram: a WAN connected to the cluster backbone switch, which feeds several racks of machines through top-of-rack switches; a power failure knocks out an entire rack.]
111. 111© Cloudera, Inc. All rights reserved.
Users as an external force
• Use Security to protect systems from users
– Prevent and track
• Authentication – proving who you are
– LDAP, Kerberos
• Authorization – deciding what you are allowed to do
– Apache Sentry (incubating), Hadoop security, HBase security
• Audit – who and when was something done?
– Cloudera Navigator
112. 112© Cloudera, Inc. All rights reserved.
Admins as an external force
Upgrades
• Linux
• Hadoop
• Java
Misconfiguration
• Memory Mismanagement
– TT OOME
– JT OOME
– Native Threads
• Thread Mismanagement
– Fetch Failures
– Replicas
• Disk Mismanagement
– No File
– Too Many Files
113. 113© Cloudera, Inc. All rights reserved.
Troubleshooting
Managing Hadoop Clusters
Troubleshooting Hadoop Systems
Debugging Hadoop Applications
114. 114© Cloudera, Inc. All rights reserved.
Example application pipeline with strict SLAs
ingest process export
ingest process export
ingest process export
time
115. 115© Cloudera, Inc. All rights reserved.
Example Application Pipeline - Ingest
[Diagram: the Hadoop stack with the ingest path highlighted. A custom app feeds event data into HBase via event ingest (Flume, Kafka).]
116. 116© Cloudera, Inc. All rights reserved.
Example Application Pipeline - processing
[Diagram: the Hadoop stack with the processing path highlighted. A MapReduce job generates a new artifact from HBase data and writes it to HDFS.]
117. 117© Cloudera, Inc. All rights reserved.
Example Application Pipeline - export
[Diagram: the Hadoop stack with the export path highlighted. Sqoop exports the result into a relational database to serve the application.]
118. 118© Cloudera, Inc. All rights reserved.
Case study 1: slow jobs after Hadoop upgrade
Symptom:
After an upgrade, activity
on the cluster eventually
began to slow down and
the job queue
overflowed.
119. 119© Cloudera, Inc. All rights reserved.
Finding the right part of the stack
E-SPORE (from Eric Sammer’s Hadoop Operations)
• Environment
– What is different about the environment now from the last time everything worked?
• Stack
– The entire cluster also has shared dependency on data center infrastructure such as the network, DNS, and other services.
• Patterns
– Are the tasks from the same job? Are they all assigned to the same tasktracker? Do they all use a shared library that was
changed recently?
• Output
– Always check log output for exceptions but don’t assume the symptom correlates to the root cause.
• Resources
– Do local disks have enough space? Is the machine swapping? Does the network utilization look normal? Does the CPU
utilization look normal?
• Event correlation
– It’s important to know the order in which the events led to the failure.
120. 120© Cloudera, Inc. All rights reserved.
Example Application Pipeline - processing
[Diagram: the Hadoop stack with the processing path highlighted, as before. A MapReduce job generates a new artifact from HBase data and writes it to HDFS.]
121. 121© Cloudera, Inc. All rights reserved.
Case study 1: slow jobs after Hadoop upgrade
Evidence:
Isolated to Processing phase (MR).
In TT Logs, found an innocuous but
anomalous log entry about “fetch
failures.”
Many users had run into this MR
problem using different versions of
MR.
Workaround provided: remove the
problem node from the cluster.
123. 123© Cloudera, Inc. All rights reserved.
Case study 1: slow jobs after Hadoop upgrade
Root cause:
All MR versions had a
common dependency on
a particular version of
Jetty (Jetty 6.1.26).
Dev was able to
reproduce and fix the
bug in Jetty.
124. 124© Cloudera, Inc. All rights reserved.
Case study 2: slow jobs after Linux upgrade
Symptom:
After an upgrade, system
CPU usage peaked at
30% or more of the total
CPU usage.
125. 125© Cloudera, Inc. All rights reserved.
Case study 2: slow jobs after Linux upgrade
Evidence:
Used tracing tools to
isolate a majority of time
was inexplicably spent in
virtual memory calls.
http://structureddata.org/2012/06/18/linux-6-
transparent-huge-pages-and-hadoop-workloads/
127. 127© Cloudera, Inc. All rights reserved.
Case study 2: slow jobs after Linux upgrade
Root cause:
RHEL/CentOS 6.2, 6.3, and 6.4
and SLES 11 SP2 include a
feature called "Transparent
Huge Page (THP)" compaction
that interacts poorly with
Hadoop workloads.
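One way to check the current setting on a host is to read the THP defrag file in sysfs, where the kernel brackets the active choice (e.g. "always madvise [never]"). A minimal check, assuming the mainline sysfs path; RHEL/CentOS 6 expose it at /sys/kernel/mm/redhat_transparent_hugepage/defrag instead:

```python
def thp_defrag_status(path="/sys/kernel/mm/transparent_hugepage/defrag"):
    """Return 'disabled' if the active THP defrag choice is [never].

    The sysfs file lists all choices with the active one in brackets,
    e.g. "always madvise [never]".
    """
    try:
        with open(path) as f:
            return "disabled" if "[never]" in f.read() else "enabled"
    except OSError:
        return "unknown"  # file absent: different distro path or no THP

print(thp_defrag_status())
```

Disabling it until the next reboot is `echo never > <that sysfs file>` as root; persist it via rc.local or your configuration management tool.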
128. 128© Cloudera, Inc. All rights reserved.
Case study 3: slow jobs at a precise moment
Symptom:
High CPU usage and a responsive
but sluggish cluster, even for
non-Hadoop apps, e.g. MySQL.
30 customers all hit this at the
exact same time: 6/30/12 at
5pm PDT.
129. 129© Cloudera, Inc. All rights reserved.
Case study 3: slow jobs at a precise moment
Evidence:
Checked the kernel
message buffer (run
dmesg) and looked for
output confirming the
leap second injection.
Other systems had the
same problem.
131. 131© Cloudera, Inc. All rights reserved.
Case study 3: slow jobs at a precise moment
Root cause:
The Linux kernel
mishandled an added
leap second.
133. 133© Cloudera, Inc. All rights reserved.
Case study 1: slow jobs after Hadoop upgrade
After an upgrade, activity on the
cluster eventually began to slow
down and the job queue
overflowed.
In TT Logs, found an innocuous but
anomalous log entry about “fetch
failures.”
Many users had run into this MR
problem using different versions of MR.
Workaround provided: remove the
problem node from the cluster.
All MR versions had a common
dependency on a particular
version of Jetty (Jetty 6.1.26).
Dev was able to reproduce and
fix the bug in Jetty.
Symptom Evidence Root Cause
134. 134© Cloudera, Inc. All rights reserved.
Case study 2: slow jobs after Linux upgrade
After an upgrade, system CPU
usage peaked at 30% or more of
the total CPU usage.
The perf tool showed that the
majority of time was
inexplicably spent in virtual
memory calls.
RHEL/CentOS 6.2, 6.3, and 6.4
and SLES 11 SP2 include a feature
called "Transparent Huge Page (THP)"
compaction that interacts poorly with
Hadoop workloads.
Symptom Evidence Root Cause
135. 135© Cloudera, Inc. All rights reserved.
Case study 3: slow jobs at a precise moment
High CPU usage and responsive
but sluggish cluster.
30 customers all hit this at the
exact same time.
Checked the kernel message
buffer (run dmesg) and looked
for output confirming the leap
second injection.
Linux OS kernel mishandled a leap
second added on 6/30/12 at 5pm PDT.
Symptom Evidence Root Cause
136. 136© Cloudera, Inc. All rights reserved.
Lessons learned
More crucial than the
specific troubleshooting
methodology used is to
use one.
More crucial than the
specific tool used is the
type of data analyzed and
how it’s analyzed.
Capture for posterity in a
knowledge base article,
blog post, or conference
presentation.
Methodology Tools Learn from failure
138. 138© Cloudera, Inc. All rights reserved.
Apache Hadoop Operations
for Production Systems:
Enterprise Considerations
Miklos Christine
139. 139© Cloudera, Inc. All rights reserved.
Scale Considerations
Ref: http://blog.cloudera.com/blog/2015/01/how-to-deploy-apache-hadoop-clusters-like-a-boss/
140. 140© Cloudera, Inc. All rights reserved.
Scale Considerations
• HDFS
• Namenode Heap Settings
• Namenode RPC Configurations
Property Name                        Default   Recommended
dfs.namenode.servicerpc-address      N/A       8022
dfs.namenode.handler.count           10        ln(# of DNs) * 20
dfs.namenode.service.handler.count   10        ln(# of DNs) * 20
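The ln-based sizing is a rule of thumb, not an official default; a small helper that applies it (the floor of 10 matches the shipped default, and the rounding choice is ours):

```python
import math

def recommended_handler_count(num_datanodes):
    """Rule of thumb from this slide: ln(# of DataNodes) * 20.

    Never goes below the shipped default of 10; ceil is an assumption,
    since the slide does not say how to round.
    """
    if num_datanodes < 1:
        raise ValueError("need at least one DataNode")
    return max(10, math.ceil(math.log(num_datanodes) * 20))

print(recommended_handler_count(200))  # e.g. a 200-node cluster
```

The same value would be set for both dfs.namenode.handler.count and dfs.namenode.service.handler.count.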
141. 141© Cloudera, Inc. All rights reserved.
Scale Considerations
• YARN
• ResourceManager High Availability
• yarn.resourcemanager.zk-address
• Application recovery
• yarn.resourcemanager.work-preserving-recovery.enabled
• yarn.nodemanager.recovery.dir
• User cache disk space
• yarn.nodemanager.local-dirs
142. 142© Cloudera, Inc. All rights reserved.
Metrics: HDFS
• HDFS is the core of the platform.
What’s important?
• Is the Standby NN
checkpointing?
• Are the NNs garbage
collecting?
• Percentage of heap used at
steady state?
143. 143© Cloudera, Inc. All rights reserved.
Logs are your friend
• Logs are verbose but necessary
• Namenode Logs:
• 10 * 200MB log files = 2GB
• 2GB of logs span 3 hours
• 3 days of logs = ~48GB
• Retain enough logs for debugging. Plan for the worst case
• Adjust log retention as the cluster grows
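The arithmetic above generalizes into a quick capacity estimate; a sketch using the slide's observed rate (2 GB of NameNode logs every 3 hours):

```python
def log_space_gb(days, gb_per_window=2.0, window_hours=3.0):
    """Estimate log volume: GB written per observation window, scaled to `days`."""
    windows_per_day = 24 / window_hours
    return days * windows_per_day * gb_per_window

print(log_space_gb(3))  # 3 days of logs at the slide's rate
```

Feed in your own measured rate; the point is to size retention for the worst-case debugging window, not the average day.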
144. 144© Cloudera, Inc. All rights reserved.
• Just reduce the log level to save space?
• NO!
• INFO logging is important!
• Application: YARN containers write logs locally; the logs are then aggregated to HDFS.
• Ensure application log space is sufficient
• Have a tool to fetch system logs for root cause analysis
Logs are your friend
145. 145© Cloudera, Inc. All rights reserved.
Logs are your friend
• GC Logging
• -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
  -Xloggc:/var/log/hdfs/nn-hdfs.log
  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M
• Great resource:
• https://stackoverflow.com/questions/895444/java-garbage-collection-log-messages
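For a vanilla (non-Cloudera-Manager) install, these flags typically go into the NameNode's JVM options via hadoop-env.sh; a config fragment sketch (the variable name HADOOP_NAMENODE_OPTS is standard Hadoop, but paths depend on your layout):

```shell
# hadoop-env.sh (sketch): rotated GC logging for the NameNode
export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hdfs/nn-hdfs.log \
  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M"
```

In Cloudera Manager, the same flags go into the role's "Java Configuration Options" safety valve instead.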
146. 146© Cloudera, Inc. All rights reserved.
• jstack
• Use the same JDK that runs the
process
• Must be run as the process's
user
• kill -3 <PID>
• Writes a thread dump to the process's stdout
Debugging Techniques : Hung Process
148. 148© Cloudera, Inc. All rights reserved.
Debugging Techniques : LogLevel
• Set the log level without process restarts
http://namenode.cloudera.com:50070/logLevel
• Scriptable
http://namenode.cloudera.com:50070/logLevel?log=org&level=DEBUG
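Since the servlet takes the logger and level as query parameters, scripting it is just URL construction plus an HTTP GET; a sketch (the hostname and logger name are illustrative):

```python
from urllib.parse import urlencode
import urllib.request

def loglevel_url(host, port, logger, level):
    """Build a Hadoop daemon's /logLevel URL for setting a logger's level."""
    query = urlencode({"log": logger, "level": level})
    return f"http://{host}:{port}/logLevel?{query}"

url = loglevel_url("namenode.cloudera.com", 50070,
                   "org.apache.hadoop.hdfs", "DEBUG")
print(url)
# Against a live NameNode (needs network access to the cluster):
# urllib.request.urlopen(url)
```

Remember the change is in-memory only: a process restart reverts to the configured level.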
149. 149© Cloudera, Inc. All rights reserved.
Debugging Techniques : Heap Analysis
• jstat -gcutil <PID> 1s 120
• Checks for current GC activity
• jmap -histo:live <PID>
• Get a histogram of the current objects within the heap
150. 150© Cloudera, Inc. All rights reserved.
Sanity Tests
• Lots of tools available to end users
( Hive / Impala / MapReduce / HDFS Client / Cascading / Spark / … )
• Create sanity tests for each tool
• Document expected response / processing time
151. 151© Cloudera, Inc. All rights reserved.
Controlled Usage
• How to prevent bad behavior from bringing down the cluster?
• HDFS Quotas
• Yarn FairScheduler Pools
• Hive / Impala Access Control with Sentry
152. 152© Cloudera, Inc. All rights reserved.
Failure Testing
• If the NN fails, how long does it take to recover given the average #
of edits?
• If RM HA failover were to occur, would jobs continue?
• What is the mean time to recovery for HBase when a RS dies?
• What properties can be tuned to improve this?
153. 153© Cloudera, Inc. All rights reserved.
Security Considerations
• Securing communication channels within the cluster
• Kerberos
• Allows secure communication between hosts on an
untrusted network.
• Secures traffic between hosts in the cluster
• Provides authentication for users to services
• TLS
• Used to secure http interfaces
• Kerberos can be used to authenticate to these interfaces
with SPNEGO
154. 154© Cloudera, Inc. All rights reserved.
Kerberos, Authentication and Authorization
• While often conflated, these are distinct concepts
• They are usually configured together, and we would recommend this,
but it’s not an absolute requirement
• Authentication: Having a user provide and prove their identity
• Authorization: Controlling what a user can access or do
155. 155© Cloudera, Inc. All rights reserved.
Security and Authentication (cont)
• Setting up Kerberos is an exercise that’s beyond the scope of this
tutorial
• Main implementations: MIT Kerberos, Active Directory
• Typically LDAP (or AD) is used for user management
• Cloudera Manager can help you configure Kerberos for your services
156. 156© Cloudera, Inc. All rights reserved.
Authentication
• Without Kerberos, users are typically identified as whatever Linux
system user their client application runs as.
• With Kerberos, the user will obtain a kerberos ticket (typically at
login time) that will be used to identify them to the cluster services
157. 157© Cloudera, Inc. All rights reserved.
Authorization
• Even if you’re using an authentication mechanism to limit who can
connect to the various services, you probably want to control what
they can do. Without authorization, anyone can do anything
• Each service provides different authorization mechanisms. eg:
• YARN queues can be restricted to certain users
• HBase tables can be restricted to certain users (ACLs)
• The nature of cluster users will affect authorization requirements
• Are there different groups with different SLAs?
158. 158© Cloudera, Inc. All rights reserved.
Ask Me Anything on Hadoop Ops this Thursday
Date: Thursday, 10/01
Time: 2:05 – 2:45 pm
Location: 3D 05/08
159. 159© Cloudera, Inc. All rights reserved.
Takeaway
A cluster in a good state stays in
a good state, and a cluster in a
bad state stays in a bad state,
unless acted upon by an
external force.
Cloudera has seen a lot of
diverse clusters and used that
experience to build tools to help
diagnose and understand how
Hadoop operates.
Similar symptoms can lead to
different root causes. Use tools
to assist with event correlation
and pattern determination.
Anatomy of a Hadoop System Managing Hadoop Clusters Troubleshooting Hadoop
Applications
160. 160© Cloudera, Inc. All rights reserved.
Exercises
• Configs and Restarting - Phil Z
• Monitoring and Health Tests - Phil L
• Spark & MR - Miklos
• Ingest Using Apache Sqoop - Kate
161. 161© Cloudera, Inc. All rights reserved.
Configurations
• Edit the “NameNode Port”
• What will need to be restarted? (Almost everything)
• Restart it!
• Let’s find the underlying files on the FS
• /var/run/cloudera-scm-agent/...
162. 162© Cloudera, Inc. All rights reserved.
Health Tests
• Look at some basic health test results
• Number of missing blocks in HDFS (you want this to be zero!)
• Number of times the Datanode process has exited unexpectedly
• Let’s kill a Datanode
• CM automatically restarts it
• Look at our health tests again
• Let’s kill a Datanode a lot
• CM will back off if restarts seem to be going nowhere
• Now what do we see?
• Let’s get it running again
163. 163© Cloudera, Inc. All rights reserved.
Configuring for a multi-use cluster
● Log into the Cloudera Manager instance
● Go to the Yarn service
● Configure the following property to 2g
yarn.scheduler.maximum-allocation-mb
○ This determines the max size per requested container
● Restart the Yarn service
MapReduce / Spark Example
164. 164© Cloudera, Inc. All rights reserved.
MapReduce / Spark Example - 1
We will run 2 jobs on the platform.
● Log into the system using the provided key
● Run the following test pi MapReduce job
$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar \
    pi -Dmapreduce.map.memory.mb=2048 5 5000
165. 165© Cloudera, Inc. All rights reserved.
MapReduce / Spark Example - 2
Attempt to run the following:
● Log into the system using the provided key
● Run the following test pi Spark job
$ spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.yarn.executor.memoryOverhead=1g \
    --master yarn-client \
    --executor-memory 2g \
    /opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples.jar 10
166. 166© Cloudera, Inc. All rights reserved.
MapReduce / Spark Example - 2
• What happened and why?
167. 167© Cloudera, Inc. All rights reserved.
MapReduce / Spark Example - 2
Exception in thread "main" java.lang.IllegalArgumentException:
Required executor memory (2048+384 MB) is above the max
threshold (2048 MB) of this cluster!
• Where does the 384 MB come from?
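The 384 MB is Spark's YARN memory overhead floor: in Spark 1.x, spark.yarn.executor.memoryOverhead defaults to max(384 MB, 10% of executor memory), and YARN rejects any container request larger than yarn.scheduler.maximum-allocation-mb. A sketch of that arithmetic (the 10% factor matches Spark 1.x defaults; verify against your Spark version):

```python
def yarn_container_request_mb(executor_memory_mb, overhead_mb=None):
    """Total memory Spark asks YARN for per executor: heap + overhead.

    Spark 1.x default overhead: max(384 MB, 10% of executor memory).
    """
    if overhead_mb is None:
        overhead_mb = max(384, executor_memory_mb // 10)
    return executor_memory_mb + overhead_mb

# With --executor-memory 2g and yarn.scheduler.maximum-allocation-mb=2048,
# the request is 2048 + 384 = 2432 MB, which exceeds the 2048 MB max.
print(yarn_container_request_mb(2048))
```

This is why the MapReduce pi job fit (it asked for exactly 2048 MB) while the Spark job, with overhead added on top, did not.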
169. 169© Cloudera, Inc. All rights reserved.
Cloudera Live Tutorial
• “Getting Started” tutorial for Apache Hadoop:
http://<IP Address>/#/tutorial/home
OR
http://www.cloudera.com/content/cloudera/en/developers/home/developer-admin-resources/get-started-with-hadoop-tutorial/exercise-1.html
• Load relational and clickstream data into HDFS
• Use Apache Avro to serialize/prepare that data for analysis
• Create Apache Hive tables
• Query those tables using Hive or Impala
• Index the clickstream data for business users/analysts
170. 170© Cloudera, Inc. All rights reserved.
Exercise 1: Ingest Using Apache Sqoop
(1) Fetch Table Metadata
(3) Data Transfer Via Map Job
Import
(2) MR Job Submission
171. 171© Cloudera, Inc. All rights reserved.
Exercise 1: Ingest Using Apache Sqoop Explained
> sqoop import-all-tables \
    -m {{cluster_data.worker_node_hostname.length}} \
    --connect jdbc:mysql://{{cluster_data.manager_node_hostname}}:3306/retail_db \
    --username=retail_dba \
    --password=cloudera \
    --compression-codec=snappy \
    --as-parquetfile \
    --warehouse-dir=/user/hive/warehouse \
    --hive-import
172. 172© Cloudera, Inc. All rights reserved.
Join the Discussion
Get community
help or provide
feedback
cloudera.com/community