3. Objectives
• By the end of this chapter, you will be able to:
– Understand the concepts, business benefits, characteristics, and
sources of big data
– Compare big data with traditional data
4. Introduction
• What makes big data valuable?
– It is insightful, actionable, and predictive over time.
• What are the challenges of big data?
– Unimaginable size and growth; heterogeneous systems and data
– Traditional systems (e.g., RDBMS) do not scale well and are costly
• What are the solutions?
– Scale up (increase the configuration of a single system: storage, RAM, CPU)
– Scale out (use multiple commodity machines and distribute the load)
• Nodes may fail frequently (network or machine issues), and the number
of nodes keeps changing
• During analysis, take results from the different machines and merge/aggregate
them (because the same files are divided across multiple machines for parallel processing)
– Hadoop: a solution for handling and processing huge volumes of structured and
unstructured data
5. • Some of the big data problems are:
– Exponentially growing storage, a variety of huge datasets, and processing at scale
• Hadoop as a Solution
– A framework that allows us to store and process large data sets in a parallel and
distributed fashion. It has three basic components: HDFS for storage, MapReduce
for processing, and YARN for resource management (a word-count sketch follows below).
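The "divide, process in parallel, then merge/aggregate" idea above is what Hadoop's MapReduce component formalizes. Below is a minimal word-count sketch using Hadoop's Java MapReduce API; the class names and the input/output paths passed on the command line are illustrative assumptions, not from the slides.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: runs on each block of the input file, in parallel across nodes.
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, ONE);  // emit (word, 1)
      }
    }
  }

  // Reducer: merges/aggregates the partial counts from all mappers.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each mapper processes one chunk of the input wherever that chunk is stored; the reducers then merge the partial counts, which is exactly the merge/aggregate step described in the introduction.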
6. • Apache open-source software framework for reliable, scalable,
distributed computing over massive amounts of data
– Hides underlying system details and complexities from user
– Developed in Java
• Meant for heterogeneous commodity hardware
• Hadoop Distributed File System = HDFS
– Where Hadoop stores data
– A file system that spans all the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to make them into one
big file system (see the client-side sketch after this list)
• Has a large ecosystem with both open-source and proprietary Hadoop-related
projects
– HBase / ZooKeeper / Avro / etc.
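To make the "one big file system" idea concrete, here is a minimal client-side sketch using Hadoop's Java FileSystem API. The NameNode URI and the file path are hypothetical; in practice fs.defaultFS is normally picked up from core-site.xml rather than set in code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000");  // illustrative NameNode address
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/user/demo/hello.txt");      // hypothetical path
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
    }

    // Reading back: the client sees one logical file, even though its
    // blocks may live on different DataNodes in the cluster.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}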
7. • A large (and growing) Ecosystem
8. Hadoop has an ecosystem that has evolved from its four core components
(Hadoop Common, HDFS, YARN, and MapReduce). It is continuously growing to
meet the needs of big data.
10. • What Hadoop is good for:
– Massive amounts of data, processed through
parallelism
– A variety of data (structured, unstructured,
semi-structured)
– Inexpensive commodity hardware
• What Hadoop is not good for:
– Processing transactions (random access)
– Work that cannot be parallelized
– Low-latency data access
– Processing lots of small files
– Intensive calculations on little data
11. HDFS
• HDFS creates a level of abstraction over the storage resources: we see
the whole of HDFS logically as a single unit for storing big data, while
the data is actually stored across multiple systems.
• Characteristics (two of them are illustrated in the sketch after this list)
– Scalable storage for large files
– Replication
– Streaming data access
– File append
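Replication and file append are exposed directly through the Java FileSystem API. A hedged sketch, assuming a hypothetical existing file and a cluster with append support enabled:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCharacteristics {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path path = new Path("/user/demo/log.txt");  // hypothetical existing file

    // Replication: ask the NameNode to keep 3 copies of each block of this file.
    fs.setReplication(path, (short) 3);

    // File append: add records to the end of an existing file
    // (append must be enabled on the cluster for this call to succeed).
    try (FSDataOutputStream out = fs.append(path)) {
      out.write("another record\n".getBytes(StandardCharsets.UTF_8));
    }
  }
}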
12. HDFS
• Has two core components
– NameNode: the master node that holds metadata about the stored data
• Which data block is stored on which DataNode, and where the
replicas of each data block are
• Persistently stores the filesystem metadata and the
mappings of the blocks to the DataNodes on disk, as two files:
fsimage and edits
• fsimage contains a complete snapshot of the filesystem
metadata
• The edits file stores the incremental updates to the metadata
• Responsible for executing operations such as opening
and closing files; no data actually flows through the NameNode
13. HDFS
• DataNode:
– Commodity hardware in the distributed environment on
which the actual data is stored
• Data blocks are replicated across the DataNodes, and by
default the replication factor is 3
• The placement of replicas on the DataNodes is determined
by a rack-aware placement policy
• This placement policy ensures reliability and availability of
the blocks
• For a replication factor of three, one replica is placed on a
node on the local rack, the second replica is placed on a
node on a remote rack, and the third replica is placed on a
different node on that same remote rack (see the sketch below)
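To see this metadata in action, a client can ask the NameNode where a file's blocks, and the replicas of each block, actually live. A minimal sketch using the Java FileSystem API; the file path is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/user/demo/big.dat"));  // hypothetical

    // The NameNode answers from its block-to-DataNode mapping; each
    // BlockLocation lists the hosts that hold that block's replicas.
    for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          loc.getOffset(), loc.getLength(), String.join(",", loc.getHosts()));
    }
  }
}

With the default placement policy, each block should report three hosts, spread across two racks as described above.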
17. HDFS Architecture: Secondary NameNode
The edits file keeps growing in size over time as the incremental updates are stored. The
responsibility of applying these updates to the fsimage file is delegated to the Secondary
NameNode, because the NameNode may not have enough resources available while it is
performing its other operations.
20. Multiple Namenodes / Namespaces
• To scale the name service horizontally, federation uses multiple
independent NameNodes / namespaces.
• The NameNodes are federated: they are independent
and do not require coordination with each other.
• The DataNodes are used as common storage for blocks by all the
NameNodes.
• Each DataNode registers with all the NameNodes in the cluster.
• DataNodes send periodic heartbeats and block reports.
• They also handle commands from the NameNodes.
• With federated NameNodes, one million blocks (roughly 100 TB of
data) require about 1 GB of RAM in the NameNode (a worked example follows)
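A quick worked example of that rule of thumb; the cluster size below is hypothetical:

public class NameNodeMemoryEstimate {
  public static void main(String[] args) {
    // Rule of thumb from the slide: ~1 GB of NameNode RAM per 1 million
    // blocks, and ~1 million blocks per ~100 TB of data.
    long clusterDataTb = 1000;                     // hypothetical 1 PB cluster
    double blocksMillions = clusterDataTb / 100.0; // ~1M blocks per 100 TB
    double nnHeapGb = blocksMillions * 1.0;        // ~1 GB per million blocks
    System.out.printf("~%.0fM blocks -> ~%.0f GB NameNode heap%n",
        blocksMillions, nnHeapGb);                 // prints: ~10M blocks -> ~10 GB
  }
}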
22. HDFS
• In earlier versions of Hadoop/HDFS, the default block size was often
quoted as 64 MB; the current default, and the typical block size used by
HDFS, is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks,
and if possible, each chunk resides on a different DataNode
• It should be noted that Linux itself has both a logical block size (typically
4 KB) and a physical or hardware block size (typically 512 bytes).
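A short worked example of how a file is split into 128 MB blocks; the 500 MB file size is hypothetical:

public class BlockCount {
  public static void main(String[] args) {
    long blockSize = 128L * 1024 * 1024;           // 128 MB HDFS block
    long fileSize  = 500L * 1024 * 1024;           // hypothetical 500 MB file
    long fullBlocks = fileSize / blockSize;        // 3 full 128 MB blocks
    long lastBlock  = fileSize % blockSize;        // final 116 MB block
    long totalBlocks = fullBlocks + (lastBlock > 0 ? 1 : 0);
    System.out.println(totalBlocks + " blocks");   // prints: 4 blocks
    // Note: unlike a Linux filesystem block, HDFS does not pad the last
    // block; a 116 MB final block occupies only 116 MB on disk.
  }
}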