Introduction to Big Data
What’s Big Data?
 Big data is the term for a collection of data sets so large
and complex that it becomes difficult to process using on-
hand database management tools or traditional data
processing applications.
 The challenges include capture, curation, storage, search,
sharing, transfer, analysis, and visualization.
 The trend to larger data sets is due to the additional
information derivable from analysis of a single large set of
related data, as compared to separate smaller sets with
the same total amount of data, allowing correlations to be
found to "spot business trends, determine quality of
research, prevent diseases, link legal citations, combat
crime, and determine real-time roadway traffic conditions.”
Characteristics Of Big Data
Big data can be described by the following
characteristics:
 Volume
 Variety
 Velocity
 Variability
Big Data: 3V’s
Volume (Scale)
 Data Volume
 44x increase from 2009 to 2020
 From 0.8 zettabytes to 35 zettabytes
 Data volume is increasing
exponentially
Exponential increase in
collected/generated data
 Volume – The name Big Data itself is related to an enormous size. The size of data plays a crucial role in determining the value of data, and whether particular data can actually be considered Big Data at all depends on its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.
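As a quick sanity check on the volume figures quoted earlier (0.8 ZB in 2009 growing to a projected 35 ZB by 2020), the implied growth factor can be computed directly; both numbers come from the slide above, not from independent data:

```python
# Growth figures as quoted on the slide (zettabytes).
zb_2009 = 0.8
zb_2020 = 35.0

growth = zb_2020 / zb_2009
print(round(growth, 2))   # 43.75, i.e. roughly the "44x" on the slide
```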
 12+ TB of tweet data every day
 25+ TB of log data every day
 2+ billion people on the Web by the end of 2011
 30 billion RFID tags today (1.3B in 2005)
 4.6 billion camera phones worldwide
 100s of millions of GPS-enabled devices sold annually
 76 million smart meters in 2009; 200M by 2014
Maximilien Brice, © CERN
CERN’s Large Hadron Collider (LHC) generates 15 PB of data a year.
The Earthscope
• The Earthscope is the world's
largest science project. Designed
to track North America's
geological evolution, this
observatory records data over 3.8
million square miles, amassing 67
terabytes of data. It analyzes
seismic slips in the San Andreas
fault, sure, but also the plume of
magma underneath Yellowstone
and much, much more.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_scien…)
Variety (Complexity)
 Relational Data
(Tables/Transaction/Legacy Data)
 Text Data (Web)
 Semi-structured Data (XML)
 Graph Data
 Social Network, Semantic Web (RDF), …
 Streaming Data
 You can only scan the data once
 A single application can be
generating/collecting many types of data
 Big Public Data (online, weather, finance, etc.)
To extract knowledge, all these types of data need to be linked together.
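To make the linking idea concrete, here is a minimal sketch that joins records from three of the source types listed above (a relational table, semi-structured JSON, and free text) on a shared customer id; all record and field names are invented for illustration:

```python
import json

# Hypothetical records from three different sources, keyed by customer id.
relational = [(1, "Alice"), (2, "Bob")]                      # table rows
web_log = '[{"cust": 1, "page": "/home"}, {"cust": 2, "page": "/buy"}]'
tweets = {1: "loving the new product!", 2: "delivery was late"}

# Link everything under one record per customer.
linked = {}
for cust_id, name in relational:
    linked[cust_id] = {"name": name, "tweet": tweets.get(cust_id)}
for event in json.loads(web_log):                            # semi-structured
    linked[event["cust"]]["page"] = event["page"]

print(linked[1])
# {'name': 'Alice', 'tweet': 'loving the new product!', 'page': '/home'}
```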
 Variety – The next aspect of Big Data is its variety. Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.
A Single View to the Customer
The goal is to combine, per customer, data from: Social Media, Gaming, Entertainment, Banking, Finance, Our Known History, Purchases.
Velocity (Speed)
 Data is being generated fast and needs to be processed fast
 Online Data Analytics
 Late decisions → missing opportunities
 Examples
 E-Promotions: Based on your current location, your purchase history, and what you like → send promotions right now for the store next to you
 Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurements require immediate reaction
 Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet demands determines the real potential in the data.
 Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
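The healthcare-monitoring example above can be sketched as a one-pass streaming computation, in the spirit of "you can only scan the data once": a running mean is maintained incrementally and readings that deviate sharply are flagged. The threshold and sample values are invented:

```python
# One-pass anomaly flagging over a sensor stream: no readings are stored,
# only a running mean updated incrementally as each value arrives.
def monitor(stream, threshold=20.0):
    count, mean, alerts = 0, 0.0, []
    for reading in stream:
        count += 1
        mean += (reading - mean) / count     # incremental mean update
        if count > 3 and abs(reading - mean) > threshold:
            alerts.append(reading)           # abnormal: react immediately
    return alerts

print(monitor([72, 75, 71, 74, 140, 73]))    # [140]
```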
Real-time/Fast Data
 Progress and innovation are no longer hindered by the ability to collect data
 But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Mobile devices
(tracking all objects all the time)
Sensor technology and
networks
(measuring all kinds of data)
Real-Time Analytics/Decision Requirement
Examples of real-time decisions driven by customer influence behavior:
 Product recommendations that are relevant & compelling
 Friend invitations to join a game or activity that expands business
 Preventing fraud as it is occurring & preventing more proactively
 Learning why customers switch to competitors and their offers, in time to counter
 Improving the marketing effectiveness of a promotion while it is still in play
Some Make it 4V’s
 Variability – This refers to the inconsistency the data can show at times, which hampers the ability to handle and manage the data effectively.
Big Data Challenges
The major challenges associated with big data are as follows:
 Capturing data
 Curation
 Storage
 Searching
 Sharing
 Transfer
 Analysis
 Presentation
Harnessing Big Data
 OLTP: Online Transaction Processing (DBMSs)
 OLAP: Online Analytical Processing (Data Warehousing)
 RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
The Model Has Changed…
 The Model of Generating/Consuming Data has
Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming
data
What’s driving Big Data
Traditional analytics:
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
Big data analytics:
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time
The Evolution of Business Intelligence (1990’s → 2000’s → 2010’s, along the axes of speed and scale):
- 1990’s: BI reporting on OLAP & data warehouses (Business Objects, SAS, Informatica, Cognos, and other SQL reporting tools)
- 2000’s: Interactive business intelligence & in-memory RDBMS (QlikView, Tableau, HANA)
- 2010’s: Big data: batch processing & distributed data stores (Hadoop/Spark; HBase/Cassandra) and real-time, single-view systems (graph databases)
Big Data Analytics
 Big data is more real-time in
nature than traditional DW
applications
 Traditional DW architectures (e.g.
Exadata, Teradata) are not well-
suited for big data apps
 Shared-nothing, massively parallel processing, scale-out architectures are well-suited for big data apps
Big Data Technology
Distributed File System (DFS)
 A distributed file system (DFS) is a file system with data stored on a server. The data is accessed and processed as if it were stored on the local client machine.
 The DFS makes it convenient to share
information and files among users on a network in
a controlled and authorized way. The server
allows the client users to share files and store
data just as if they are storing the information
locally. However, the servers have full control
over the data, and give access control to the
clients.
 As there has been exceptional growth in network-
based computing, client/server-based applications
have brought revolutions in the process of building
distributed file systems.
 Sharing storage resources and information on a
network is one of the key elements in both local area
networks (LANs) and wide area networks (WANs).
Different technologies like DFS have been developed
to bring convenience and efficiency to sharing
resources and files on a network, as networks
themselves also evolve.
 One process involved in implementing the DFS is giving access control and storage management controls to the client systems in a centralized way.
 A DFS allows efficient and well-managed data
and storage sharing options on a network
compared to other options. Another option for
users in network-based computing is a shared
disk file system. A shared disk file system puts the
access control on the client’s systems, so that the
data is inaccessible when the client system goes
offline. DFS, however, is fault-tolerant, and the
data is accessible even if some of the network
nodes are offline.
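The fault-tolerance property described above can be sketched in a few lines: a read tries each node holding a replica and succeeds as long as at least one is reachable. Node names and the set of online nodes are hypothetical:

```python
# Toy replica map and node availability; names are invented for illustration.
replicas = {"block-7": ["nodeA", "nodeB", "nodeC"]}
online = {"nodeB", "nodeC"}          # nodeA has gone offline

def read_block(block_id):
    # Fall through the replica list until a live node is found.
    for node in replicas[block_id]:
        if node in online:
            return f"data from {node}"
    raise IOError("all replicas offline")

print(read_block("block-7"))         # data from nodeB
```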
The Google File System
 GFS shares many of the same goals as previous
distributed file systems such as performance,
scalability, reliability, and availability. However, its
design has been driven by key observations of
application workloads and technological
environment, both current and anticipated, that
reflect a marked departure from some earlier file
system design assumptions. GFS reexamined
traditional choices and explored radically different
points in the design space
DESIGN OVERVIEW
The design of GFS was guided by assumptions that offer both challenges and opportunities.
 The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
 The system stores a modest number of large
files. We expect a few million files, each typically
100 MB or larger in size. Multi-GB files are the
common case and should be managed efficiently.
Small files must be supported, but we need not
optimize for them.
 The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. The files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through the file simultaneously as it is appended.
High sustained bandwidth is more important than
low latency. Most of our target applications place a
premium on processing data in bulk at a high rate,
while few have stringent response time
requirements for an individual read or write.
Interface
GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by pathnames. GFS supports the usual operations to create, delete, open, close, read, and write files.
Moreover, GFS has snapshot and record append operations. Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append. It is useful for implementing multi-way merge results and producer-consumer queues that many clients can simultaneously append to.
Chunk server
 A chunk server stores the actual data (for example, of virtual machines and containers) and services requests to it. All data is split into chunks that can be stored in the storage cluster in multiple copies, called replicas. Replication requires at least three chunk servers to be set up in the cluster.
Architecture
 A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients. Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.
Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation.
Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. By default, three replicas are stored, though users can designate different replication levels for different regions of the file namespace.
The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
GFS Master
 Having a single master vastly simplifies the design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. However, its involvement in reads and writes must be minimized so that it does not become a bottleneck. Clients never read or write file data through the master. Instead, a client asks the master which chunkservers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.
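The read path described above can be sketched as follows, assuming the fixed 64 MB chunk size discussed below; the master's location table and chunkserver names are invented, and this is a toy model rather than the actual GFS protocol:

```python
CHUNK = 64 * 2**20                    # fixed 64 MB chunk size

# Toy master state: (path, chunk index) -> chunkserver locations.
master_map = {("/logs/web", 0): ["cs1", "cs2", "cs3"]}
cache = {}                            # client-side location cache

def locate(path, offset):
    key = (path, offset // CHUNK)     # translate byte offset to chunk index
    if key not in cache:              # only the first lookup hits the master
        cache[key] = master_map[key]
    return cache[key]                 # later reads go straight to chunkservers

print(locate("/logs/web", 10 * 2**20))   # ['cs1', 'cs2', 'cs3']
print(locate("/logs/web", 50 * 2**20))   # same chunk, served from the cache
```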
Chunk Size
 Chunk size is one of the key design parameters. GFS chose 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.
 A large chunk size offers several important advantages. First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information.
Chunk Locations
 The master does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls chunkservers for that information at startup. The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages.
 The designers initially attempted to keep chunk location information persistently at the master, but decided that it was much simpler to request the data from chunkservers at startup, and periodically thereafter. This eliminated the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. In a cluster with hundreds of servers, these events happen all too often. Another way to understand this design decision is to realize that a chunkserver has the final word over what chunks it does or does not have on its own disks. There is no point in trying to maintain a consistent view of this information on the master because errors on a chunkserver may cause chunks to vanish spontaneously (e.g., a disk may go bad and be disabled) or an operator may rename a chunkserver.
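The design decision above can be sketched as the master rebuilding its chunk-location table from the startup reports of each chunkserver, which remains the authority on its own disks; server names and chunk handles are invented:

```python
# Invented startup reports: chunkserver -> chunk handles it actually holds.
reports = {
    "cs1": ["c-01", "c-02"],
    "cs2": ["c-02", "c-03"],
}

# The master derives (rather than persists) chunk handle -> server locations.
locations = {}
for server, chunks in reports.items():
    for handle in chunks:
        locations.setdefault(handle, []).append(server)

print(sorted(locations["c-02"]))   # ['cs1', 'cs2']
```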
CONCLUSIONS
 The Google File System demonstrates the qualities essential for supporting large-scale data processing workloads on commodity hardware.
 GFS has successfully met Google’s storage needs and is widely used within Google as the storage platform for research and development as well as production data processing. It is an important tool that enables Google to continue to innovate and attack problems on the scale of the entire web.
What is Hadoop?
Hadoop Distributed File System
(HDFS)
Why should I use Hadoop?
HDFS: Key Features
Hadoop Distributed File System (HDFS )
Who uses Hadoop?
What features does Hadoop
offer?
When should you choose
Hadoop?
When should you avoid Hadoop?
Hadoop application usage
Hadoop Distributed File System (HDFS )
How HDFS works: Split Data
How HDFS works: Replication
Hadoop
Hadoop employs a master/slave architecture for both distributed storage and distributed computation. The distributed storage system is called the Hadoop Distributed File System (HDFS). On a fully configured cluster, "running Hadoop" means running a set of daemons, or resident programs, on the different servers in your network. These daemons have specific roles; some exist only on a single server, while others run across multiple servers.
The daemons include:
NameNode
DataNode
Secondary NameNode
JobTracker
TaskTracker
NameNode
The NameNode is the master of HDFS that directs the slave DataNode daemons to perform the low-level I/O tasks. It is the bookkeeper of HDFS; it keeps track of how your files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed filesystem.
The server hosting the NameNode typically doesn't store any user data or perform any computations for a MapReduce program, in order to lower the workload on the machine; the NameNode's role is memory- and I/O-intensive.
There is unfortunately a negative aspect to the importance of the NameNode: it's a single point of failure of your Hadoop cluster. For any of the other daemons, if their host fails for software or hardware reasons, the Hadoop cluster will likely continue to function smoothly, or you can quickly restart it. Not so for the NameNode.
DataNode
 Each slave machine in your cluster will host a DataNode daemon to perform the grunt work of the distributed filesystem: reading and writing HDFS blocks to actual files on the local file system.
When you want to read or write an HDFS file, the file is broken into blocks and the NameNode will tell your client which DataNode each block resides in. Your client then communicates directly with the DataNode daemons to process the local files corresponding to the blocks.
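How a file might be split into blocks and each block placed on several DataNodes can be sketched as below; the 128 MB block size is typical of modern HDFS, but the round-robin placement is a simplification of HDFS's actual rack-aware policy, and the node names are invented:

```python
BLOCK = 128 * 2**20                              # illustrative block size

def place_blocks(file_size, datanodes, replication=3):
    n_blocks = -(-file_size // BLOCK)            # ceiling division
    plan = {}
    for b in range(n_blocks):
        # Round-robin each replica onto a different DataNode.
        plan[b] = [datanodes[(b + r) % len(datanodes)]
                   for r in range(replication)]
    return plan

plan = place_blocks(300 * 2**20, ["dn1", "dn2", "dn3", "dn4"])
print(len(plan))      # 3 blocks for a 300 MB file
print(plan[0])        # ['dn1', 'dn2', 'dn3']
```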
DataNode
 Furthermore, a DataNode may communicate with other DataNodes to replicate its data blocks for redundancy. This ensures that if any one DataNode crashes or becomes inaccessible over the network, you'll still be able to read the files. DataNodes are constantly reporting to the NameNode. Upon initialization, each DataNode informs the NameNode of the blocks it's currently storing. After this mapping is complete, the DataNodes continually poll the NameNode to provide information regarding local changes, as well as to receive instructions to create, move, or delete blocks from the local disk.
Secondary NameNode (SNN)
 The SNN is an assistant daemon for monitoring the state of the cluster's HDFS. Like the NameNode, each cluster has one SNN, and it typically resides on its own machine as well. No other DataNode or TaskTracker daemons run on the same server. The SNN differs from the NameNode in that this process doesn't receive or record any real-time changes to HDFS. Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration.
As mentioned earlier, the NameNode is a single point
of failure for a Hadoop cluster, and the SNN
snapshots help minimize the downtime and loss of
data.
fsimage (filesystem image) file and edits file
 The HDFS namespace is stored by the NameNode.
The NameNode uses a transaction log called
the EditLog to persistently record every change that
occurs to file system metadata. For example, creating
a new file in HDFS causes the NameNode to insert a
record into the EditLog indicating this. Similarly,
changing the replication factor of a file causes a new
record to be inserted into the EditLog. The
NameNode uses a file in its local host OS file system
to store the EditLog. The entire file system
namespace, including the mapping of blocks to files
and file system properties, is stored in a file called
the FsImage. The FsImage is stored as a file in the
NameNode’s local file system too.
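The division of labor between the EditLog and the FsImage can be sketched as a simple journal-plus-checkpoint scheme; the operations and paths are invented, and real HDFS persists both structures to the NameNode's local file system rather than keeping them in memory:

```python
fsimage = {}           # checkpointed namespace: path -> replication factor
editlog = []           # changes recorded since the last checkpoint

def create(path, repl=3):
    # Every metadata change is journaled before it is considered durable.
    editlog.append(("create", path, repl))

def checkpoint():
    # Fold the accumulated edits into the image, then truncate the log.
    for op, path, repl in editlog:
        if op == "create":
            fsimage[path] = repl
    editlog.clear()

create("/user/a.txt")
create("/user/b.txt", repl=2)
checkpoint()
print(fsimage)         # {'/user/a.txt': 3, '/user/b.txt': 2}
print(editlog)         # [] -- empty after the checkpoint
```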
JobTracker
 Once you submit your code to your cluster, the JobTracker determines the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they're running. Should a task fail, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries.
There is only one JobTracker daemon per Hadoop cluster. It's typically run on a server as a master node of the cluster.
TaskTracker
 The computing daemons also follow a master/slave architecture: the JobTracker is the master overseeing the overall execution of a MapReduce job, and the TaskTrackers manage the execution of individual tasks on each slave node.
Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns. Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel.
One responsibility of the TaskTracker is to constantly communicate with the JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.
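The heartbeat timeout described above can be sketched as a periodic check over the last-heartbeat timestamps the JobTracker keeps per TaskTracker; tracker names, timestamps, and the 600-second timeout are illustrative:

```python
def find_dead(last_heartbeat, now, timeout=600):
    # Any tracker silent for longer than the timeout is presumed crashed,
    # and its tasks become candidates for resubmission elsewhere.
    return sorted(t for t, ts in last_heartbeat.items() if now - ts > timeout)

beats = {"tt1": 1000, "tt2": 300, "tt3": 950}   # invented timestamps (s)
print(find_dead(beats, now=1010))               # ['tt2'] -- silent too long
```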
Topology of a typical Hadoop
cluster.

Unit 1

  • 1.
  • 2.
    What’s Big Data? 2 Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on- hand database management tools or traditional data processing applications.  The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.  The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.”
  • 3.
    Characteristics Of BigData Big data can be described by the following characteristics:  Volume  Variety  Velocity  Variability
  • 4.
  • 5.
    Volume (Scale) 5  DataVolume  44x increase from 2009 2020  From 0.8 zettabytes to 35zb  Data volume is increasing exponentially Exponential increase in collected/generated data
  • 6.
     Volume –The name Big Data itself is related to a size which is enormous. Size of data plays a very crucial role in determining value out of data. Also, whether a particular data can actually be considered as a Big Data or not, is dependent upon the volume of data. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.
  • 7.
    12+ TBs of tweetdata every day 25+ TBs of log data every day ? TBs of data every day 2+ billion people on the Web by end 2011 30 billion RFID tags today (1.3B in 2005) 4.6 billion camera phones world wide 100s of millions of GPS enabled devices sold annually 76 million smart meters in 2009… 200M by 2014
  • 8.
    Maximilien Brice, ©CERN CERN’s Large Hydron Collider (LHC) generates 15 PB a
  • 9.
    The Earthscope • TheEarthscope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more. (http://www.msnbc.msn.com/id/44 363598/ns/technology_and_scien
  • 10.
    Variety (Complexity) 10  RelationalData (Tables/Transaction/Legacy Data)  Text Data (Web)  Semi-structured Data (XML)  Graph Data  Social Network, Semantic Web (RDF), …  Streaming Data  You can only scan the data once  A single application can be generating/collecting many types of data  Big Public Data (online, weather, finance, etc) To extract knowledge all these types of data need to linked together
  • 11.
     Variety –The next aspect of Big Data is its variety. Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. During earlier days, spreadsheets and databases were the only sources of data considered by most of the applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. are also being considered in the analysis applications. This variety of unstructured data poses certain issues for storage, mining and analyzing data.
  • 12.
    A Single Viewto the Customer Customer Social Media Gamin g Entertain Bankin g Financ e Our Known History Purchas e
  • 13.
    Velocity (Speed) 13  Datais begin generated fast and need to be processed fast  Online Data Analytics  Late decisions  missing opportunities  Examples  E-Promotions: Based on your current location, your purchase history, what you like  send promotions right now for store next to you  Healthcare monitoring: sensors monitoring your activities and body  any abnormal measurements require immediate reaction
  • 14.
     Velocity –The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet the demands, determines real potential in the data.  Big Data Velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, and social media sites, sensors, Mobile devices, etc. The flow of data is massive and continuous.
  • 15.
    Real-time/Fast Data 15  Theprogress and innovation is no longer hindered by the ability to collect data  But, by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion Social media and networks (all of us are generating data) Scientific instruments (collecting all sorts of data) Mobile devices (tracking all objects all the time) Sensor technology and networks (measuring all kinds of data)
  • 16.
    Real-Time Analytics/Decision Requirement Customer Influence Behavior Product Recommendations thatare Relevant & Compelling Friend Invitations to join a Game or Activity that expands business Preventing Fraud as it is Occurring & preventing more proactively Learning why Customers Switch to competitors and their offers; in time to Counter Improving the Marketing Effectiveness of a Promotion while it is still in Play
  • 17.
    Some Make it4V’s 17
  • 18.
     Variability –This refers to the inconsistency which can be shown by the data at times, thus hampering the process of being able to handle and manage the data effectively
  • 19.
    Big Data Challenges Themajor challenges associated with big data are as follows −  Capturing data  Curation  Storage  Searching  Sharing  Transfer  Analysis  Presentation
  • 20.
    Harnessing Big Data 20 OLTP: Online Transaction Processing (DBMSs)  OLAP: Online Analytical Processing (Data Warehousing)  RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
  • 21.
    The Model HasChanged… 21  The Model of Generating/Consuming Data has Changed Old Model: Few companies are generating data, all others are consuming data New Model: all of us are generating data, and all of us are consuming data
  • 22.
    What’s driving BigData 22 - Ad-hoc querying and reporting - Data mining techniques - Structured data, typical sources - Small to mid-size datasets - Optimizations and predictive analytics - Complex statistical analysis - All types of data, and many sources - Very large datasets - More of a real-time
  • 23.
    Big Data: Batch Processing& Distributed Data Store Hadoop/Spark; HBase/Cassandra BI Reporting OLAP & Dataware house Business Objects, SAS, Informatica, Cognos other SQL Reporting Tools Interactive Business Intelligence & In-memory RDBMS QliqView, Tableau, HANA Big Data: Real Time & Single View Graph Databases The Evolution of Business Intelligence 1990’s 2000’s 2010’s Speed Scale Scale Speed
  • 24.
    Big Data Analytics 24 Big data is more real-time in nature than traditional DW applications  Traditional DW architectures (e.g. Exadata, Teradata) are not well- suited for big data apps  Shared nothing, massively parallel processing, scale out architectures are well-suited for big data apps
  • 26.
  • 27.
    Distributed File System(DFS)  distributed file system (DFS) is a file system with data stored on a server. The data is accessed and processed as if it was stored on the local client machine.  The DFS makes it convenient to share information and files among users on a network in a controlled and authorized way. The server allows the client users to share files and store data just as if they are storing the information locally. However, the servers have full control over the data, and give access control to the clients.
  • 28.
     As therehas been exceptional growth in network- based computing, client/server-based applications have brought revolutions in the process of building distributed file systems.  Sharing storage resources and information on a network is one of the key elements in both local area networks (LANs) and wide area networks (WANs). Different technologies like DFS have been developed to bring convenience and efficiency to sharing resources and files on a network, as networks themselves also evolve.  One process involved in implementing the DFS is giving access control and storage management controls to the client system in a centralized way
  • 29.
     A DFSallows efficient and well-managed data and storage sharing options on a network compared to other options. Another option for users in network-based computing is a shared disk file system. A shared disk file system puts the access control on the client’s systems, so that the data is inaccessible when the client system goes offline. DFS, however, is fault-tolerant, and the data is accessible even if some of the network nodes are offline.
  • 30.
    The Google FileSystem  GFS shares many of the same goals as previous distributed file systems such as performance, scalability, reliability, and availability. However, its design has been driven by key observations of application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system design assumptions. GFS reexamined traditional choices and explored radically different points in the design space
  • 31.
    DESIGN OVERVIEW In designinga file system have been guided by assumptions that offer both challenges and opportunities.  The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis  The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.
  • 32.
     The systemmust efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producer consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through
  • 33.
    High sustained bandwidthis more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.
  • 34.
    Interface GFS provides afamiliar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by pathnames.We support the usual operations to create, delete, open, close, read, and write files. Moreover, GFS has snapshot and record append operations.Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append. It is useful for implementing multi-way merge results and producerconsumer queues that many clients can simultaneously append
  • 35.
    Chunk server  Achunk server stores actual data of virtual machines and Containers and services requests to it. All data is split into chunks and can be stored in Storage cluster in multiple copies called replicas. ... This requires at least three chunk servers to be set up in the cluster.
  • 36.
    Architecture  A GFScluster consists of a single master and multiple chunk servers and is accessed by multiple clients, Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.
  • 37.
Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation. Chunk servers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunk servers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.
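The chunk-creation step above can be sketched as follows. This is a toy illustration under stated assumptions (a simple counter for handle assignment, random replica placement); real GFS placement also considers disk utilization and rack topology:

```python
import itertools
import random

# Illustrative sketch: the master assigns each new chunk an immutable,
# globally unique 64-bit handle and picks chunkservers for its replicas.
HANDLE_COUNTER = itertools.count(1)

def create_chunk(chunkservers, replication=3):
    handle = next(HANDLE_COUNTER) & (2**64 - 1)     # 64-bit chunk handle
    replicas = random.sample(chunkservers, replication)  # distinct servers
    return handle, replicas

servers = ["cs1", "cs2", "cs3", "cs4", "cs5"]
handle, replicas = create_chunk(servers)
```

The default of three replicas matches the slide; passing a different `replication` value models the per-namespace replication levels users can designate.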
  • 38.
The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunk servers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
  • 39.
    GFS Master  Havinga single master vastly simplifies our design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. How ever,we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunk servers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.
  • 40.
    Chunk Size  Chunksize is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed.Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.  A large chunk size offers several important advantages.First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information.
  • 41.
    Chunk Locations  Themaster does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls chunkservers for that information at startup. The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages.  We initially attempted to keep chunk location information persistently at the master, but we decided that it was much simpler to request the data from chunk servers at startup,and periodically thereafter. This eliminated the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. In a cluster with hundreds of servers, these events happen all too often.Another way to understand this design decision is to realize that a chunkserver has the final word over what chunks it does or does not have on its own disks. There is no point in trying to maintain a consistent view of this information on the master because errors on a chunkserver may cause chunks to vanish spontaneously (e.g., a disk may go bad and be disabled) or an operator may rename a chunkserver.
  • 42.
    CONCLUSIONS  The GoogleFile System demonstrates the qualities essential for supporting large-scale data processing workloads on commodity hardware.  GFS has successfully met googles storage needs and is widely used within Google as the storage platform for research and development as well as production data processing. It is an important tool that enables us to continue to innovate and Attack problems on the scale of the entire web.
  • 43.
Why should I use Hadoop?
  • 47.
What features does Hadoop offer?
  • 51.
When should you choose Hadoop?
  • 52.
When should you avoid Hadoop?
  • 53.
How HDFS works: Split Data
  • 56.
How HDFS works: Replication
  • 76.
Hadoop  Hadoop employs a master/slave architecture for both distributed storage and distributed computation. The distributed storage system is called the Hadoop Distributed File System (HDFS). On a fully configured cluster, "running Hadoop" means running a set of daemons, or resident programs, on the different servers in your network. These daemons have specific roles; some exist only on one server, some exist across multiple servers.
  • 77.
The daemons include: NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker.
  • 78.
NameNode  The NameNode is the master of HDFS that directs the slave DataNode daemons to perform the low-level I/O tasks. It is the bookkeeper of HDFS: it keeps track of how your files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed filesystem. The NameNode's job is memory and I/O intensive, so the server hosting it typically doesn't store any user data or perform any computations for a MapReduce program, to lower the workload on the machine. There is unfortunately a negative aspect to the importance of the NameNode: it's a single point of failure of your Hadoop cluster. For any of the other daemons, if their host fails for software or hardware reasons, the Hadoop cluster will likely continue to function smoothly, or you can quickly restart it. Not so for the NameNode.
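The bookkeeping described above can be sketched with two mappings. This is an illustrative data-structure sketch, not HDFS source code (which uses far more compact in-memory formats): files map to an ordered list of blocks, and blocks map to the DataNodes holding them.

```python
# Minimal sketch of the NameNode's bookkeeping (illustrative only):
# which blocks make up each file, and which DataNodes hold each block.
class NameNode:
    def __init__(self):
        self.file_blocks = {}   # filename -> ordered list of block ids
        self.block_nodes = {}   # block id -> set of DataNodes holding it

    def add_file(self, name, block_ids):
        self.file_blocks[name] = list(block_ids)

    def report_block(self, datanode, block_id):
        # DataNodes report their blocks; the NameNode records the mapping.
        self.block_nodes.setdefault(block_id, set()).add(datanode)

    def locate(self, name):
        # Which DataNodes a client must contact to read each block, in order.
        return [sorted(self.block_nodes.get(b, set()))
                for b in self.file_blocks[name]]

nn = NameNode()
nn.add_file("/user/a.txt", ["blk_1", "blk_2"])
for dn in ("dn1", "dn2", "dn3"):
    nn.report_block(dn, "blk_1")
nn.report_block("dn2", "blk_2")
```

Because this entire structure lives only in the NameNode, losing that one machine loses the map from files to data, which is exactly the single-point-of-failure problem the slide warns about.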
  • 79.
    DataNode  Each slavemachine in your cluster will host a DataNode daemon to perform the grunt work of the distributed filesystem - reading and writing HDFS blocks to actual files on the local file system When you want to read or write a HDFS file, the file is broken into blocks and the NameNode will tell your client which DataNode each block resides in. Your client communicates directly with the DataNode daemons to process the local files corresponding to the blocks.
  • 80.
    DataNode  Furthermore, aDataNode may communicate with other DataNodes to replicate its data blocks for redundancy.This ensures that if any one DataNode crashes or becomes inaccessible over the network, you'll still able to read the files. DataNodes are constantly reporting to the NameNode. Upon initialization, each of the DataNodes informs the NameNode of the blocks it's currently storing. After this mapping is complete, the DataNodes continually poll the NameNode to provide information regarding local changes as well as receive instructions to create, move, or delete from the local disk.
  • 82.
Secondary NameNode (SNN)  The SNN is an assistant daemon for monitoring the state of the cluster HDFS. Like the NameNode, each cluster has one SNN, and it typically resides on its own machine as well; no other DataNode or TaskTracker daemons run on the same server. The SNN differs from the NameNode in that this process doesn't receive or record any real-time changes to HDFS. Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration. As mentioned earlier, the NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots help minimize the downtime and loss of data.
  • 83.
fsimage (filesystem image) file and edits file  The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode's local file system too.
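The recovery idea behind this pair of files is checkpoint-plus-replay: start from the FsImage and apply the EditLog's changes in order. A sketch under assumed, illustrative record formats (real HDFS uses binary formats with many more operation types):

```python
# Illustrative sketch of FsImage + EditLog recovery: replay the logged
# metadata changes on top of the checkpointed namespace.
def replay(fsimage, editlog):
    namespace = dict(fsimage)               # start from the checkpoint
    for op, path, value in editlog:         # apply each change in order
        if op == "create":
            namespace[path] = {"replication": value}
        elif op == "set_replication":       # e.g. changing a file's factor
            namespace[path]["replication"] = value
        elif op == "delete":
            del namespace[path]
    return namespace

fsimage = {"/a": {"replication": 3}}
editlog = [("create", "/b", 3),
           ("set_replication", "/a", 2),
           ("delete", "/b", None)]
state = replay(fsimage, editlog)
```

Logging each change is much cheaper than rewriting the whole FsImage on every operation, which is the design rationale for keeping the two files separate.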
  • 85.
    JobTracker  Once yousubmit your code to your cluster, the JobTracker determines the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they're running. should a task fail, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries. There is only one JobTracker daemon per Hadoop cluster. It's typically run on a server as a master node of the cluster.
  • 86.
    TaskTracker  the computingdaemons also follow a master/slave architecture: the JobTracker is the master overseeing the overall execution of a MapReduce job and the TaskTracker manage the execution of individual tasks on each slave node. Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns. Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel. One responsibility of the TaskTracker is to constantly communicate with the JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.
  • 88.
Topology of a typical Hadoop cluster.