This document provides an overview of Hadoop and several related big data technologies. It begins by defining the challenges of big data in terms of the 3Vs: volume, velocity and variety. It then explains why traditional databases cannot handle data of this type and scale. The document goes on to describe how Hadoop works, using HDFS for storage and MapReduce as the programming model, and summarizes several Hadoop ecosystem projects, including YARN, Hive, Pig, HBase, ZooKeeper and Spark, that help process and analyze large datasets.
2. AGENDA
Defining the problem – 3Vs
Why traditional storage doesn’t work
How does Hadoop work?
HDFS (Hadoop 1.0 vs 2.0)
YARN (2.0) - Yet Another Resource Negotiator
MapReduce
When we don’t know how to code
Hive (Overview)
Pig (Overview)
HBase (Overview)
ZooKeeper (Overview)
Spark (Overview)
3. DEFINING THE PROBLEM – 3VS
Volume - Lots and lots of data
Datasets so large and complex that they cannot be handled by a relational database
Challenges: capture, curation, storage, search, sharing, transfer, analysis and visualization.
4. DEFINING THE PROBLEM – 3VS (CONTD.)
Velocity - Huge amounts of data generated at
incredible speed
NYSE generates about 1 TB of new trade data per day
AT&T’s anonymized Call Detail Records (CDRs) top out at around 1 GB per hour.
Variety - Differently formatted data sets from
different sources
e.g. Twitter keeps track of tweets, Facebook produces posts and likes data, and YouTube streams videos
5. WHY TRADITIONAL STORAGE DOESN’T WORK
Unstructured data is exploding; little of the data being produced is relational in nature.
No redundancy
High computational cost
Capacity limits for structured data (costly hardware)
Expensive licenses
Data type                                        Nature
XML                                              Semi-structured
Word docs, PDF files, etc.                       Unstructured
Email body                                       Unstructured
Data from enterprise systems (ERP, CRM, etc.)    Structured
9. YARN (2.0) - YET ANOTHER RESOURCE NEGOTIATOR
Resource management framework for Hadoop, introduced in Hadoop 2.0.
YARN has a ResourceManager that
Manages and allocates cluster resources
Improves performance and quality of service
10. MAPREDUCE
Programming model, implemented in Java
Works on large amounts of data
Provides redundancy & fault tolerance
Runs the code on each data node
11. MAPREDUCE (CONTD.)
Steps in a MapReduce job:
Read in lots of data
Map: extract something you care about from each
record/line.
Shuffle and sort
Reduce: aggregate, summarize, filter or transform
Write results.
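The classic word-count job illustrates these steps end to end. Below is a minimal Java sketch modelled on Hadoop's standard word-count example (input and output paths are supplied on the command line, so the concrete paths are illustrative): the map step emits (word, 1) pairs from each line, the framework shuffles and sorts them by key, and the reduce step sums the counts and writes the results.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: extract something you care about (here, each word) from every line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // emit (word, 1)
      }
    }
  }

  // Reduce: aggregate the values that arrive grouped by key after shuffle and sort
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);          // write (word, total count)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // read in lots of data
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // write results
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}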
13. HIVE (OVERVIEW)
Data warehouse infrastructure built on top of Hadoop
Compiles SQL-like queries into MapReduce jobs and runs them on the cluster.
Brings structure to unstructured data
Key Building Principles:
Structured data with rich data types (structs, lists and maps)
Directly query data in different encodings (text/binary) and file formats (flat/sequence files).
SQL as a familiar programming tool and for standard analytics
Types of applications:
Summarization: Daily/weekly aggregations
Ad hoc analysis
Data Mining
Spam detection
Many more ….
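As an illustration of the daily-summarization use case, a hypothetical HiveQL script is sketched below (the page_views table, its columns and its location are assumptions, not part of the original deck); Hive compiles the SELECT into one or more MapReduce jobs and runs them on the cluster.

-- Hypothetical external table over tab-delimited text files already in HDFS
CREATE EXTERNAL TABLE IF NOT EXISTS page_views (
  view_time TIMESTAMP,
  user_id   STRING,
  page_url  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';

-- Daily aggregation: views and distinct users per page
SELECT page_url,
       COUNT(*)                AS views,
       COUNT(DISTINCT user_id) AS unique_users
FROM   page_views
WHERE  to_date(view_time) = '2015-01-01'
GROUP BY page_url
ORDER BY views DESC
LIMIT 10;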
14. PIG (OVERVIEW)
High-level dataflow language
Has its own syntax, Pig Latin (preferable for people with a programming background)
Compiler that produces sequences of MapReduce programs.
Structure is amenable to substantial parallelization.
Key properties of Pig:
Ease of programming: Trivial to achieve parallel execution of simple, “embarrassingly parallel” data analysis tasks
Optimization opportunities: allows the user to focus on semantics rather than efficiency.
Extensibility: Users can create their own functions to do
special purpose processing.
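For comparison with the Hive example above, a hypothetical Pig Latin script for a similar aggregation is sketched below (the paths and field names are illustrative); the Pig compiler turns it into a sequence of MapReduce jobs.

-- Hypothetical dataflow: top pages by view count
views   = LOAD '/data/page_views' USING PigStorage('\t')
              AS (view_time:chararray, user_id:chararray, page_url:chararray);
by_page = GROUP views BY page_url;
counts  = FOREACH by_page GENERATE group AS page_url, COUNT(views) AS n_views;
sorted  = ORDER counts BY n_views DESC;
top10   = LIMIT sorted 10;
STORE top10 INTO '/output/top_pages';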
15. HBASE (OVERVIEW)
HBase is a distributed column-oriented data store built
on top of HDFS.
Data is logically organized into tables, rows and
columns.
HDFS is good for batch processing (scan over big files).
Not good for record lookup.
Not good for incremental addition of small batches.
Not good for updates.
HBase is designed to efficiently address the above
points
Fast record lookup
Support for record level insertion
Support for updates (not in place).
Updates are done by creating new versions of values.
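Below is a minimal sketch of record-level insertion and fast row-key lookup through the HBase Java client (the users table, the info column family and the row key are assumptions, and the table must already exist in the cluster).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Record-level insert; an update would simply write a new version of the cell
      Put put = new Put(Bytes.toBytes("user-1001"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                    Bytes.toBytes("alice@example.com"));
      table.put(put);

      // Fast record lookup by row key
      Get get = new Get(Bytes.toBytes("user-1001"));
      Result result = table.get(get);
      byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
      System.out.println(Bytes.toString(email));
    }
  }
}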
16. ZOOKEEPER (OVERVIEW)
ZooKeeper is a distributed, open source coordination service for distributed applications.
Exposes a simple set of primitives that distributed applications can build upon to implement higher-level services for synchronization, configuration maintenance, and groups and naming.
Coordination services are notoriously hard to get
right. They are prone to errors like race conditions
and deadlock.
The motivation behind ZooKeeper is to relieve distributed applications of the responsibility of implementing coordination services from scratch.
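As a concrete sketch (the ensemble address, znode path and payload are assumptions), the following Java snippet stores a piece of shared configuration in a ZooKeeper znode and reads it back; any process in the cluster can read or watch the same znode, which is the kind of higher-level service these primitives are meant to support.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
  public static void main(String[] args) throws Exception {
    // Connect to a ZooKeeper ensemble and wait for the session to be established
    CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // Create a persistent znode holding shared configuration, if it does not exist yet
    String path = "/app-config";
    if (zk.exists(path, false) == null) {
      zk.create(path, "batch.size=64".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Read the value back; other processes can also set a watch on this znode
    byte[] data = zk.getData(path, false, null);
    System.out.println(new String(data));
    zk.close();
  }
}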
17. SPARK (OVERVIEW)
Motivation: The MapReduce programming model transforms data flowing from stable storage to stable storage (disk to disk).
Acyclic data flow is a powerful abstraction, but not efficient for
applications that repeatedly reuse a working set of data.
Iterative algorithms
Interactive data mining
Spark makes working sets a first-class concept to efficiently
support these applications.
Goal:
To provide distributed memory abstractions for clusters to support
apps with working sets.
Retain the attractive properties of MapReduce:
Fault tolerance
Data locality
Scalability
Augment data flow model with “resilient distributed datasets”
(RDDs)
18. SPARK (OVERVIEW CONTD.)
Resilient distributed datasets (RDDs)
Immutable collections partitioned across the cluster that can be rebuilt if a partition is lost.
Created by transforming data in stable storage using
data flow operators (map, filter, group-by, ..)
Can be cached across parallel operations.
Parallel operations on RDDs.
Reduce, collect, count, save, …..
Restricted shared variables
Accumulators, broadcast variables.
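To make the RDD workflow concrete, here is a minimal sketch using Spark's Java API (the HDFS path and the filter terms are assumptions): an RDD is created by transforming data in stable storage, cached as a working set in cluster memory, and then reused by several parallel operations.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ErrorCount {
  public static void main(String[] args) {
    // local[*] keeps the sketch self-contained; on a cluster the master is set by spark-submit
    SparkConf conf = new SparkConf().setAppName("ErrorCount").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Build an RDD by transforming data in stable storage (illustrative HDFS path)
    JavaRDD<String> lines = sc.textFile("hdfs:///logs/app.log");
    JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));

    // Cache the working set in cluster memory so repeated queries reuse it
    errors.cache();

    // Parallel operations (actions) on the cached RDD
    long total = errors.count();
    long timeouts = errors.filter(line -> line.contains("timeout")).count();
    System.out.println("errors=" + total + ", timeouts=" + timeouts);

    sc.stop();
  }
}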
19. SPARK (OVERVIEW CONTD.)
Fast MapReduce-like engine
Uses in-memory cluster computing
Compatible with the Hadoop storage API.
Has APIs in Scala, Java and Python.
Useful for large datasets and iterative algorithms.
Up to 40x faster than MapReduce for some workloads.
Support for:
Spark SQL: SQL and HiveQL queries on Spark
MLlib: Machine learning library
GraphX: Graph processing.
Hadoop’s founder: Doug Cutting, who named it after his son’s toy elephant.
Active NameNode: In order to provide HDFS high availability, there is now an active and a standby NameNode in the architecture. The NameNode keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data itself, just the reference metadata. Client applications talk to the NameNode whenever they wish to locate a file, or add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
It is essential to look after the NameNode. Here are some recommendations from production use:
Use a good server with lots of RAM. The more RAM it has, the larger the file system it can manage (or the smaller the block size can be).
Use ECC RAM.
On Java 6u15 or later, run the server VM with compressed pointers (-XX:+UseCompressedOops) to cut the JVM heap size down.
List more than one NameNode directory in the configuration, so that multiple copies of the file system metadata are stored; as long as the directories are on separate disks, a single disk failure will not corrupt the metadata (see the configuration sketch after this list).
Configure the NameNode to store one set of transaction logs on a separate disk from the image.
Configure the NameNode to store another set of transaction logs on a network-mounted disk.
Monitor the disk space available to the NameNode. If free space is getting low, add more storage.
Do not host DataNode, JobTracker or TaskTracker services on the same system.
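A minimal hdfs-site.xml sketch of the directory recommendations above might look like the following (the paths are illustrative; both properties accept comma-separated lists, and the NameNode writes its metadata to every listed location):

<configuration>
  <!-- File system image replicated across two separate local disks -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/disk1/hdfs/name,/disk2/hdfs/name</value>
  </property>
  <!-- Transaction (edit) logs on another local disk and on a network-mounted disk -->
  <property>
    <name>dfs.namenode.edits.dir</name>
    <value>/disk3/hdfs/edits,/mnt/nfs/hdfs/edits</value>
  </property>
</configuration>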
DataNodes: A DataNode stores data in HDFS. A functional file system has more than one DataNode, with data replicated across them. On startup, a DataNode connects to the NameNode and waits until that service comes up; it then responds to requests from the NameNode for filesystem operations.
Client applications can talk directly to a DataNode once the NameNode has provided the location of the data. Similarly, MapReduce tasks farmed out to TaskTracker instances near a DataNode talk directly to the DataNode to access the files. TaskTracker instances can, and indeed should, be deployed on the same servers that host DataNode instances, so that MapReduce operations are performed close to the data.
DataNode instances can talk to each other, which is what they do when they are replicating data.
There is usually no need to use RAID storage for DataNode data, because data is designed to be replicated across multiple servers, rather than multiple disks on the same server.
An ideal configuration is for a server to have a DataNode, a TaskTracker, and enough physical disks and CPUs to give each TaskTracker slot its own CPU. This allows every TaskTracker 100% of a CPU and separate disks to read and write data.
Avoid using NFS for data storage in production systems.
NodeManager: The NM is YARN’s per-node agent and takes care of the individual compute nodes in a Hadoop cluster. This includes keeping up to date with the ResourceManager (RM), overseeing containers’ life-cycle management, monitoring the resource usage (memory, CPU) of individual containers, tracking node health, managing logs, and providing auxiliary services that may be exploited by different YARN applications.
ResourceManager: The RM is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system. It works together with the per-node NMs and the per-application ApplicationMasters (AMs).
ApplicationMasters: Responsible for negotiating resources with the ResourceManager and for working with the NodeManagers to start the containers.
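A minimal yarn-site.xml sketch of how these pieces are wired together might look like the following (the hostname and resource sizes are illustrative): NodeManagers register with the ResourceManager, advertise the memory and CPU they manage, and run the MapReduce shuffle handler as an auxiliary service.

<configuration>
  <!-- Where NodeManagers and clients find the ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm.example.com</value>
  </property>
  <!-- Resources this NodeManager offers to the cluster -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>16384</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>
  <!-- Auxiliary service used by MapReduce applications for the shuffle phase -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>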
Open Database Connectivity (ODBC) is a standard application programming interface (API) for accessing database management systems (DBMS).
HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java.
The Hive component is used for completely structured data, whereas the Pig component is used for semi-structured data. Hive is mainly used for creating reports, whereas Pig is mainly used for programming.
Ambari: A completely open source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters.