This document provides an overview of the Hadoop Distributed File System (HDFS), covering its goals, design, daemons, and the processes for reading and writing files. HDFS is designed to store very large files across clusters of commodity servers, providing high-throughput streaming access and achieving fault tolerance by replicating each data block across multiple nodes. The key components are the NameNode, which manages filesystem metadata, and the DataNodes, which store the actual data blocks. The Secondary NameNode assists the NameNode by periodically checkpointing filesystem state, merging the edit log into the filesystem image.
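As a rough illustration of how a client interacts with this architecture, the sketch below uses Hadoop's Java FileSystem API to write and then read a file. The NameNode address (hdfs://namenode:8020) and the file path are placeholders, not values from this document; in practice the address is normally picked up from core-site.xml.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; usually configured in core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/example.txt");

        // Write path: the client asks the NameNode to allocate blocks, then
        // streams the data to a pipeline of DataNodes holding each replica.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read path: the client fetches block locations from the NameNode,
        // then reads each block directly from a DataNode that holds a replica.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}
```

Note that the NameNode never sits in the data path: it only serves metadata (block locations), while the actual bytes flow between the client and the DataNodes.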