Introduction to Big Data
What’s Big Data?
 Big data is the term for a collection of data sets so large
and complex that it becomes difficult to process using on-
hand database management tools or traditional data
processing applications.
 The challenges include capture, curation, storage, search,
sharing, transfer, analysis, and visualization.
 The trend to larger data sets is due to the additional
information derivable from analysis of a single large set of
related data, as compared to separate smaller sets with
the same total amount of data, allowing correlations to be
found to "spot business trends, determine quality of
research, prevent diseases, link legal citations, combat
crime, and determine real-time roadway traffic conditions.”
Characteristics Of Big Data
Big data can be described by the following
characteristics:
 Volume
 Variety
 Velocity
 Variability
Big Data: 3V’s
Volume (Scale)
 Data Volume
 44x increase from 2009 to 2020
 From 0.8 zettabytes to 35 zettabytes
 Data volume is increasing
exponentially
Exponential increase in
collected/generated data
 Volume – The name Big Data itself is related to an enormous size. The size of data plays a crucial role in determining the value of data, and whether particular data can actually be considered Big Data at all depends on its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.
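As a quick sanity check on the volume figures quoted earlier (0.8 ZB in 2009 growing to a projected 35 ZB by 2020), the implied growth factor can be computed directly; both numbers come from the slide above, not from independent data:

```python
# Growth figures as quoted on the slide (zettabytes).
zb_2009 = 0.8
zb_2020 = 35.0

growth = zb_2020 / zb_2009
print(round(growth, 2))   # 43.75, i.e. roughly the "44x" on the slide
```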
 12+ TB of tweet data every day
 25+ TB of log data every day
 2+ billion people on the Web by the end of 2011
 30 billion RFID tags today (1.3B in 2005)
 4.6 billion camera phones worldwide
 100s of millions of GPS-enabled devices sold annually
 76 million smart meters in 2009; 200M by 2014
Maximilien Brice, © CERN
CERN’s Large Hadron Collider (LHC) generates 15 PB of data a year.
The Earthscope
• The Earthscope is the world's
largest science project. Designed
to track North America's
geological evolution, this
observatory records data over 3.8
million square miles, amassing 67
terabytes of data. It analyzes
seismic slips in the San Andreas
fault, sure, but also the plume of
magma underneath Yellowstone
and much, much more.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_scien…)
Variety (Complexity)
 Relational Data
(Tables/Transaction/Legacy Data)
 Text Data (Web)
 Semi-structured Data (XML)
 Graph Data
 Social Network, Semantic Web (RDF), …
 Streaming Data
 You can only scan the data once
 A single application can be
generating/collecting many types of data
 Big Public Data (online, weather, finance, etc.)
To extract knowledge, all these types of data need to be linked together.
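To make the linking idea concrete, here is a minimal sketch that joins records from three of the source types listed above (a relational table, semi-structured JSON, and free text) on a shared customer id; all record and field names are invented for illustration:

```python
import json

# Hypothetical records from three different sources, keyed by customer id.
relational = [(1, "Alice"), (2, "Bob")]                      # table rows
web_log = '[{"cust": 1, "page": "/home"}, {"cust": 2, "page": "/buy"}]'
tweets = {1: "loving the new product!", 2: "delivery was late"}

# Link everything under one record per customer.
linked = {}
for cust_id, name in relational:
    linked[cust_id] = {"name": name, "tweet": tweets.get(cust_id)}
for event in json.loads(web_log):                            # semi-structured
    linked[event["cust"]]["page"] = event["page"]

print(linked[1])
# {'name': 'Alice', 'tweet': 'loving the new product!', 'page': '/home'}
```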
 Variety – The next aspect of Big Data is its variety. Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.
A Single View to the Customer
The goal is to combine, per customer, data from: Social Media, Gaming, Entertainment, Banking, Finance, Our Known History, Purchases.
Velocity (Speed)
 Data is being generated fast and needs to be processed fast
 Online Data Analytics
 Late decisions → missing opportunities
 Examples
 E-Promotions: Based on your current location, your purchase history, and what you like → send promotions right now for the store next to you
 Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurements require immediate reaction
 Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet demands determines the real potential in the data.
 Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
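The healthcare-monitoring example above can be sketched as a one-pass streaming computation, in the spirit of "you can only scan the data once": a running mean is maintained incrementally and readings that deviate sharply are flagged. The threshold and sample values are invented:

```python
# One-pass anomaly flagging over a sensor stream: no readings are stored,
# only a running mean updated incrementally as each value arrives.
def monitor(stream, threshold=20.0):
    count, mean, alerts = 0, 0.0, []
    for reading in stream:
        count += 1
        mean += (reading - mean) / count     # incremental mean update
        if count > 3 and abs(reading - mean) > threshold:
            alerts.append(reading)           # abnormal: react immediately
    return alerts

print(monitor([72, 75, 71, 74, 140, 73]))    # [140]
```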
Real-time/Fast Data
 Progress and innovation are no longer hindered by the ability to collect data
 But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Mobile devices
(tracking all objects all the time)
Sensor technology and
networks
(measuring all kinds of data)
Real-Time Analytics/Decision Requirement
Examples of real-time decisions driven by customer influence behavior:
 Product recommendations that are relevant & compelling
 Friend invitations to join a game or activity that expands business
 Preventing fraud as it is occurring & preventing more proactively
 Learning why customers switch to competitors and their offers, in time to counter
 Improving the marketing effectiveness of a promotion while it is still in play
Some Make it 4V’s
 Variability – This refers to the inconsistency the data can show at times, which hampers the ability to handle and manage the data effectively.
Big Data Challenges
The major challenges associated with big data are as follows:
 Capturing data
 Curation
 Storage
 Searching
 Sharing
 Transfer
 Analysis
 Presentation
Harnessing Big Data
 OLTP: Online Transaction Processing (DBMSs)
 OLAP: Online Analytical Processing (Data Warehousing)
 RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
The Model Has Changed…
 The Model of Generating/Consuming Data has
Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming
data
What’s driving Big Data
Traditional analytics:
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
Big data analytics:
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time
The Evolution of Business Intelligence (1990’s → 2000’s → 2010’s, along the axes of speed and scale):
- 1990’s: BI reporting on OLAP & data warehouses (Business Objects, SAS, Informatica, Cognos, and other SQL reporting tools)
- 2000’s: Interactive business intelligence & in-memory RDBMS (QlikView, Tableau, HANA)
- 2010’s: Big data: batch processing & distributed data stores (Hadoop/Spark; HBase/Cassandra) and real-time, single-view systems (graph databases)
Big Data Analytics
 Big data is more real-time in
nature than traditional DW
applications
 Traditional DW architectures (e.g.
Exadata, Teradata) are not well-
suited for big data apps
 Shared-nothing, massively parallel processing, scale-out architectures are well-suited for big data apps
Big Data Technology
Distributed File System (DFS)
 A distributed file system (DFS) is a file system with data stored on a server. The data is accessed and processed as if it were stored on the local client machine.
 The DFS makes it convenient to share
information and files among users on a network in
a controlled and authorized way. The server
allows the client users to share files and store
data just as if they are storing the information
locally. However, the servers have full control
over the data, and give access control to the
clients.
 As there has been exceptional growth in network-
based computing, client/server-based applications
have brought revolutions in the process of building
distributed file systems.
 Sharing storage resources and information on a
network is one of the key elements in both local area
networks (LANs) and wide area networks (WANs).
Different technologies like DFS have been developed
to bring convenience and efficiency to sharing
resources and files on a network, as networks
themselves also evolve.
 One process involved in implementing the DFS is giving access control and storage management controls to the client systems in a centralized way.
 A DFS allows efficient and well-managed data
and storage sharing options on a network
compared to other options. Another option for
users in network-based computing is a shared
disk file system. A shared disk file system puts the
access control on the client’s systems, so that the
data is inaccessible when the client system goes
offline. DFS, however, is fault-tolerant, and the
data is accessible even if some of the network
nodes are offline.
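The fault-tolerance property described above can be sketched in a few lines: a read tries each node holding a replica and succeeds as long as at least one is reachable. Node names and the set of online nodes are hypothetical:

```python
# Toy replica map and node availability; names are invented for illustration.
replicas = {"block-7": ["nodeA", "nodeB", "nodeC"]}
online = {"nodeB", "nodeC"}          # nodeA has gone offline

def read_block(block_id):
    # Fall through the replica list until a live node is found.
    for node in replicas[block_id]:
        if node in online:
            return f"data from {node}"
    raise IOError("all replicas offline")

print(read_block("block-7"))         # data from nodeB
```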
The Google File System
 GFS shares many of the same goals as previous
distributed file systems such as performance,
scalability, reliability, and availability. However, its
design has been driven by key observations of
application workloads and technological
environment, both current and anticipated, that
reflect a marked departure from some earlier file
system design assumptions. GFS reexamined
traditional choices and explored radically different
points in the design space
DESIGN OVERVIEW
The design of GFS was guided by assumptions that offer both challenges and opportunities.
 The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
 The system stores a modest number of large
files. We expect a few million files, each typically
100 MB or larger in size. Multi-GB files are the
common case and should be managed efficiently.
Small files must be supported, but we need not
optimize for them.
 The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. The files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through the file simultaneously as it is appended.
High sustained bandwidth is more important than
low latency. Most of our target applications place a
premium on processing data in bulk at a high rate,
while few have stringent response time
requirements for an individual read or write.
Interface
GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by pathnames. GFS supports the usual operations to create, delete, open, close, read, and write files.
Moreover, GFS has snapshot and record append operations. Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append. It is useful for implementing multi-way merge results and producer-consumer queues that many clients can simultaneously append to.
Chunk server
 A chunk server stores the actual data (for example, of virtual machines and containers) and services requests to it. All data is split into chunks that can be stored in the storage cluster in multiple copies, called replicas. Replication requires at least three chunk servers to be set up in the cluster.
Architecture
 A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients. Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.
Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation.
Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. By default, three replicas are stored, though users can designate different replication levels for different regions of the file namespace.
The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
GFS Master
 Having a single master vastly simplifies the design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. However, its involvement in reads and writes must be minimized so that it does not become a bottleneck. Clients never read or write file data through the master. Instead, a client asks the master which chunkservers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.
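The read path described above can be sketched as follows, assuming the fixed 64 MB chunk size discussed below; the master's location table and chunkserver names are invented, and this is a toy model rather than the actual GFS protocol:

```python
CHUNK = 64 * 2**20                    # fixed 64 MB chunk size

# Toy master state: (path, chunk index) -> chunkserver locations.
master_map = {("/logs/web", 0): ["cs1", "cs2", "cs3"]}
cache = {}                            # client-side location cache

def locate(path, offset):
    key = (path, offset // CHUNK)     # translate byte offset to chunk index
    if key not in cache:              # only the first lookup hits the master
        cache[key] = master_map[key]
    return cache[key]                 # later reads go straight to chunkservers

print(locate("/logs/web", 10 * 2**20))   # ['cs1', 'cs2', 'cs3']
print(locate("/logs/web", 50 * 2**20))   # same chunk, served from the cache
```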
Chunk Size
 Chunk size is one of the key design parameters. GFS chose 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.
 A large chunk size offers several important advantages. First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information.
Chunk Locations
 The master does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls chunkservers for that information at startup. The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages.
 The designers initially attempted to keep chunk location information persistently at the master, but decided that it was much simpler to request the data from chunkservers at startup, and periodically thereafter. This eliminated the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. In a cluster with hundreds of servers, these events happen all too often. Another way to understand this design decision is to realize that a chunkserver has the final word over what chunks it does or does not have on its own disks. There is no point in trying to maintain a consistent view of this information on the master because errors on a chunkserver may cause chunks to vanish spontaneously (e.g., a disk may go bad and be disabled) or an operator may rename a chunkserver.
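The design decision above can be sketched as the master rebuilding its chunk-location table from the startup reports of each chunkserver, which remains the authority on its own disks; server names and chunk handles are invented:

```python
# Invented startup reports: chunkserver -> chunk handles it actually holds.
reports = {
    "cs1": ["c-01", "c-02"],
    "cs2": ["c-02", "c-03"],
}

# The master derives (rather than persists) chunk handle -> server locations.
locations = {}
for server, chunks in reports.items():
    for handle in chunks:
        locations.setdefault(handle, []).append(server)

print(sorted(locations["c-02"]))   # ['cs1', 'cs2']
```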
CONCLUSIONS
 The Google File System demonstrates the qualities essential for supporting large-scale data processing workloads on commodity hardware.
 GFS has successfully met Google’s storage needs and is widely used within Google as the storage platform for research and development as well as production data processing. It is an important tool that enables Google to continue to innovate and attack problems on the scale of the entire web.
What is Hadoop?
Hadoop Distributed File System
(HDFS)
Why should I use Hadoop?
HDFS: Key Features
Hadoop Distributed File System (HDFS )
Who uses Hadoop?
What features does Hadoop
offer?
When should you choose
Hadoop?
When should you avoid Hadoop?
Hadoop application usage
Hadoop Distributed File System (HDFS )
How HDFS works: Split Data
How HDFS works: Replication
Hadoop
Hadoop employs a master/slave architecture for both distributed storage and distributed computation. The distributed storage system is called the Hadoop Distributed File System (HDFS). On a fully configured cluster, "running Hadoop" means running a set of daemons, or resident programs, on the different servers in your network. These daemons have specific roles; some exist only on a single server, while others run across multiple servers.
The daemons include:
NameNode
DataNode
Secondary NameNode
JobTracker
TaskTracker
NameNode
The NameNode is the master of HDFS that directs the slave DataNode daemons to perform the low-level I/O tasks. It is the bookkeeper of HDFS; it keeps track of how your files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed filesystem.
The server hosting the NameNode typically doesn't store any user data or perform any computations for a MapReduce program, in order to lower the workload on the machine; the NameNode's role is memory- and I/O-intensive.
There is unfortunately a negative aspect to the importance of the NameNode: it's a single point of failure of your Hadoop cluster. For any of the other daemons, if their host fails for software or hardware reasons, the Hadoop cluster will likely continue to function smoothly, or you can quickly restart it. Not so for the NameNode.
DataNode
 Each slave machine in your cluster will host a DataNode daemon to perform the grunt work of the distributed filesystem: reading and writing HDFS blocks to actual files on the local file system.
When you want to read or write an HDFS file, the file is broken into blocks and the NameNode will tell your client which DataNode each block resides in. Your client then communicates directly with the DataNode daemons to process the local files corresponding to the blocks.
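How a file might be split into blocks and each block placed on several DataNodes can be sketched as below; the 128 MB block size is typical of modern HDFS, but the round-robin placement is a simplification of HDFS's actual rack-aware policy, and the node names are invented:

```python
BLOCK = 128 * 2**20                              # illustrative block size

def place_blocks(file_size, datanodes, replication=3):
    n_blocks = -(-file_size // BLOCK)            # ceiling division
    plan = {}
    for b in range(n_blocks):
        # Round-robin each replica onto a different DataNode.
        plan[b] = [datanodes[(b + r) % len(datanodes)]
                   for r in range(replication)]
    return plan

plan = place_blocks(300 * 2**20, ["dn1", "dn2", "dn3", "dn4"])
print(len(plan))      # 3 blocks for a 300 MB file
print(plan[0])        # ['dn1', 'dn2', 'dn3']
```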
DataNode
 Furthermore, a DataNode may communicate with other DataNodes to replicate its data blocks for redundancy. This ensures that if any one DataNode crashes or becomes inaccessible over the network, you'll still be able to read the files. DataNodes are constantly reporting to the NameNode. Upon initialization, each DataNode informs the NameNode of the blocks it's currently storing. After this mapping is complete, the DataNodes continually poll the NameNode to provide information regarding local changes, as well as to receive instructions to create, move, or delete blocks from the local disk.
Secondary NameNode (SNN)
 The SNN is an assistant daemon for monitoring the state of the cluster's HDFS. Like the NameNode, each cluster has one SNN, and it typically resides on its own machine as well. No other DataNode or TaskTracker daemons run on the same server. The SNN differs from the NameNode in that this process doesn't receive or record any real-time changes to HDFS. Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration.
As mentioned earlier, the NameNode is a single point
of failure for a Hadoop cluster, and the SNN
snapshots help minimize the downtime and loss of
data.
fsimage (filesystem image) file and edits file
 The HDFS namespace is stored by the NameNode.
The NameNode uses a transaction log called
the EditLog to persistently record every change that
occurs to file system metadata. For example, creating
a new file in HDFS causes the NameNode to insert a
record into the EditLog indicating this. Similarly,
changing the replication factor of a file causes a new
record to be inserted into the EditLog. The
NameNode uses a file in its local host OS file system
to store the EditLog. The entire file system
namespace, including the mapping of blocks to files
and file system properties, is stored in a file called
the FsImage. The FsImage is stored as a file in the
NameNode’s local file system too.
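The division of labor between the EditLog and the FsImage can be sketched as a simple journal-plus-checkpoint scheme; the operations and paths are invented, and real HDFS persists both structures to the NameNode's local file system rather than keeping them in memory:

```python
fsimage = {}           # checkpointed namespace: path -> replication factor
editlog = []           # changes recorded since the last checkpoint

def create(path, repl=3):
    # Every metadata change is journaled before it is considered durable.
    editlog.append(("create", path, repl))

def checkpoint():
    # Fold the accumulated edits into the image, then truncate the log.
    for op, path, repl in editlog:
        if op == "create":
            fsimage[path] = repl
    editlog.clear()

create("/user/a.txt")
create("/user/b.txt", repl=2)
checkpoint()
print(fsimage)         # {'/user/a.txt': 3, '/user/b.txt': 2}
print(editlog)         # [] -- empty after the checkpoint
```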
JobTracker
 Once you submit your code to your cluster, the JobTracker determines the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they're running. Should a task fail, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries.
There is only one JobTracker daemon per Hadoop cluster. It's typically run on a server as a master node of the cluster.
TaskTracker
 The computing daemons also follow a master/slave architecture: the JobTracker is the master overseeing the overall execution of a MapReduce job, and the TaskTrackers manage the execution of individual tasks on each slave node.
Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns. Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel.
One responsibility of the TaskTracker is to constantly communicate with the JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.
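The heartbeat timeout described above can be sketched as a periodic check over the last-heartbeat timestamps the JobTracker keeps per TaskTracker; tracker names, timestamps, and the 600-second timeout are illustrative:

```python
def find_dead(last_heartbeat, now, timeout=600):
    # Any tracker silent for longer than the timeout is presumed crashed,
    # and its tasks become candidates for resubmission elsewhere.
    return sorted(t for t, ts in last_heartbeat.items() if now - ts > timeout)

beats = {"tt1": 1000, "tt2": 300, "tt3": 950}   # invented timestamps (s)
print(find_dead(beats, now=1010))               # ['tt2'] -- silent too long
```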
Topology of a typical Hadoop
cluster.

Unit 1

  • 1.
  • 2.
    What’s Big Data? 2 Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on- hand database management tools or traditional data processing applications.  The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.  The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.”
  • 3.
    Characteristics Of BigData Big data can be described by the following characteristics:  Volume  Variety  Velocity  Variability
  • 4.
  • 5.
    Volume (Scale) 5  DataVolume  44x increase from 2009 2020  From 0.8 zettabytes to 35zb  Data volume is increasing exponentially Exponential increase in collected/generated data
  • 6.
     Volume –The name Big Data itself is related to a size which is enormous. Size of data plays a very crucial role in determining value out of data. Also, whether a particular data can actually be considered as a Big Data or not, is dependent upon the volume of data. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.
  • 7.
    12+ TBs of tweetdata every day 25+ TBs of log data every day ? TBs of data every day 2+ billion people on the Web by end 2011 30 billion RFID tags today (1.3B in 2005) 4.6 billion camera phones world wide 100s of millions of GPS enabled devices sold annually 76 million smart meters in 2009… 200M by 2014
  • 8.
    Maximilien Brice, ©CERN CERN’s Large Hydron Collider (LHC) generates 15 PB a
  • 9.
    The Earthscope • TheEarthscope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more. (http://www.msnbc.msn.com/id/44 363598/ns/technology_and_scien
  • 10.
    Variety (Complexity) 10  RelationalData (Tables/Transaction/Legacy Data)  Text Data (Web)  Semi-structured Data (XML)  Graph Data  Social Network, Semantic Web (RDF), …  Streaming Data  You can only scan the data once  A single application can be generating/collecting many types of data  Big Public Data (online, weather, finance, etc) To extract knowledge all these types of data need to linked together
  • 11.
     Variety –The next aspect of Big Data is its variety. Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. During earlier days, spreadsheets and databases were the only sources of data considered by most of the applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. are also being considered in the analysis applications. This variety of unstructured data poses certain issues for storage, mining and analyzing data.
  • 12.
    A Single Viewto the Customer Customer Social Media Gamin g Entertain Bankin g Financ e Our Known History Purchas e
  • 13.
    Velocity (Speed) 13  Datais begin generated fast and need to be processed fast  Online Data Analytics  Late decisions  missing opportunities  Examples  E-Promotions: Based on your current location, your purchase history, what you like  send promotions right now for store next to you  Healthcare monitoring: sensors monitoring your activities and body  any abnormal measurements require immediate reaction
  • 14.
     Velocity –The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet the demands, determines real potential in the data.  Big Data Velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, and social media sites, sensors, Mobile devices, etc. The flow of data is massive and continuous.
  • 15.
    Real-time/Fast Data 15  Theprogress and innovation is no longer hindered by the ability to collect data  But, by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion Social media and networks (all of us are generating data) Scientific instruments (collecting all sorts of data) Mobile devices (tracking all objects all the time) Sensor technology and networks (measuring all kinds of data)
  • 16.
    Real-Time Analytics/Decision Requirement Customer Influence Behavior Product Recommendations thatare Relevant & Compelling Friend Invitations to join a Game or Activity that expands business Preventing Fraud as it is Occurring & preventing more proactively Learning why Customers Switch to competitors and their offers; in time to Counter Improving the Marketing Effectiveness of a Promotion while it is still in Play
  • 17.
    Some Make it4V’s 17
  • 18.
     Variability –This refers to the inconsistency which can be shown by the data at times, thus hampering the process of being able to handle and manage the data effectively
  • 19.
    Big Data Challenges Themajor challenges associated with big data are as follows −  Capturing data  Curation  Storage  Searching  Sharing  Transfer  Analysis  Presentation
  • 20.
    Harnessing Big Data 20 OLTP: Online Transaction Processing (DBMSs)  OLAP: Online Analytical Processing (Data Warehousing)  RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
  • 21.
    The Model HasChanged… 21  The Model of Generating/Consuming Data has Changed Old Model: Few companies are generating data, all others are consuming data New Model: all of us are generating data, and all of us are consuming data
  • 22.
    What’s driving BigData 22 - Ad-hoc querying and reporting - Data mining techniques - Structured data, typical sources - Small to mid-size datasets - Optimizations and predictive analytics - Complex statistical analysis - All types of data, and many sources - Very large datasets - More of a real-time
  • 23.
    Big Data: Batch Processing& Distributed Data Store Hadoop/Spark; HBase/Cassandra BI Reporting OLAP & Dataware house Business Objects, SAS, Informatica, Cognos other SQL Reporting Tools Interactive Business Intelligence & In-memory RDBMS QliqView, Tableau, HANA Big Data: Real Time & Single View Graph Databases The Evolution of Business Intelligence 1990’s 2000’s 2010’s Speed Scale Scale Speed
  • 24.
    Big Data Analytics 24 Big data is more real-time in nature than traditional DW applications  Traditional DW architectures (e.g. Exadata, Teradata) are not well- suited for big data apps  Shared nothing, massively parallel processing, scale out architectures are well-suited for big data apps
  • 26.
  • 27.
    Distributed File System(DFS)  distributed file system (DFS) is a file system with data stored on a server. The data is accessed and processed as if it was stored on the local client machine.  The DFS makes it convenient to share information and files among users on a network in a controlled and authorized way. The server allows the client users to share files and store data just as if they are storing the information locally. However, the servers have full control over the data, and give access control to the clients.
  • 28.
     As therehas been exceptional growth in network- based computing, client/server-based applications have brought revolutions in the process of building distributed file systems.  Sharing storage resources and information on a network is one of the key elements in both local area networks (LANs) and wide area networks (WANs). Different technologies like DFS have been developed to bring convenience and efficiency to sharing resources and files on a network, as networks themselves also evolve.  One process involved in implementing the DFS is giving access control and storage management controls to the client system in a centralized way
  • 29.
     A DFSallows efficient and well-managed data and storage sharing options on a network compared to other options. Another option for users in network-based computing is a shared disk file system. A shared disk file system puts the access control on the client’s systems, so that the data is inaccessible when the client system goes offline. DFS, however, is fault-tolerant, and the data is accessible even if some of the network nodes are offline.
  • 30.
    The Google FileSystem  GFS shares many of the same goals as previous distributed file systems such as performance, scalability, reliability, and availability. However, its design has been driven by key observations of application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system design assumptions. GFS reexamined traditional choices and explored radically different points in the design space
  • 31.
    DESIGN OVERVIEW In designinga file system have been guided by assumptions that offer both challenges and opportunities.  The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis  The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.
  • 32.
     The systemmust efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producer consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through
  • 33.
    High sustained bandwidthis more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.
  • 34.
    Interface GFS provides afamiliar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by pathnames.We support the usual operations to create, delete, open, close, read, and write files. Moreover, GFS has snapshot and record append operations.Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append. It is useful for implementing multi-way merge results and producerconsumer queues that many clients can simultaneously append
  • 35.
    Chunk server  Achunk server stores actual data of virtual machines and Containers and services requests to it. All data is split into chunks and can be stored in Storage cluster in multiple copies called replicas. ... This requires at least three chunk servers to be set up in the cluster.
  • 36.
    Architecture  A GFScluster consists of a single master and multiple chunk servers and is accessed by multiple clients, Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.
  • 37.
Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation. Chunk servers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunk servers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.
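The chunk-creation step above can be sketched as follows. This is a toy illustration under stated assumptions (a simple counter for handle assignment, random replica placement); real GFS placement also considers disk utilization and rack topology:

```python
import itertools
import random

# Illustrative sketch: the master assigns each new chunk an immutable,
# globally unique 64-bit handle and picks chunkservers for its replicas.
HANDLE_COUNTER = itertools.count(1)

def create_chunk(chunkservers, replication=3):
    handle = next(HANDLE_COUNTER) & (2**64 - 1)     # 64-bit chunk handle
    replicas = random.sample(chunkservers, replication)  # distinct servers
    return handle, replicas

servers = ["cs1", "cs2", "cs3", "cs4", "cs5"]
handle, replicas = create_chunk(servers)
```

The default of three replicas matches the slide; passing a different `replication` value models the per-namespace replication levels users can designate.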
  • 38.
The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunk servers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
  • 39.
    GFS Master  Havinga single master vastly simplifies our design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. How ever,we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunk servers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.
  • 40.
    Chunk Size  Chunksize is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed.Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.  A large chunk size offers several important advantages.First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information.
  • 41.
    Chunk Locations  Themaster does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls chunkservers for that information at startup. The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages.  We initially attempted to keep chunk location information persistently at the master, but we decided that it was much simpler to request the data from chunk servers at startup,and periodically thereafter. This eliminated the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. In a cluster with hundreds of servers, these events happen all too often.Another way to understand this design decision is to realize that a chunkserver has the final word over what chunks it does or does not have on its own disks. There is no point in trying to maintain a consistent view of this information on the master because errors on a chunkserver may cause chunks to vanish spontaneously (e.g., a disk may go bad and be disabled) or an operator may rename a chunkserver.
  • 42.
    CONCLUSIONS  The GoogleFile System demonstrates the qualities essential for supporting large-scale data processing workloads on commodity hardware.  GFS has successfully met googles storage needs and is widely used within Google as the storage platform for research and development as well as production data processing. It is an important tool that enables us to continue to innovate and Attack problems on the scale of the entire web.
  • 43.
Why should I use Hadoop?
  • 47.
What features does Hadoop offer?
  • 51.
When should you choose Hadoop?
  • 52.
When should you avoid Hadoop?
  • 53.
How HDFS works: Split Data
  • 56.
How HDFS works: Replication
  • 76.
Hadoop  Hadoop employs a master/slave architecture for both distributed storage and distributed computation. The distributed storage system is called the Hadoop Distributed File System (HDFS). On a fully configured cluster, "running Hadoop" means running a set of daemons, or resident programs, on the different servers in your network. These daemons have specific roles; some exist only on one server, some exist across multiple servers.
  • 77.
The daemons include: NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker.
  • 78.
NameNode  The NameNode is the master of HDFS that directs the slave DataNode daemons to perform the low-level I/O tasks. It is the bookkeeper of HDFS: it keeps track of how your files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed filesystem. The NameNode's job is memory and I/O intensive, so the server hosting it typically doesn't store any user data or perform any computations for a MapReduce program, to lower the workload on the machine. There is unfortunately a negative aspect to the importance of the NameNode: it's a single point of failure of your Hadoop cluster. For any of the other daemons, if their host fails for software or hardware reasons, the Hadoop cluster will likely continue to function smoothly, or you can quickly restart it. Not so for the NameNode.
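The bookkeeping described above can be sketched with two mappings. This is an illustrative data-structure sketch, not HDFS source code (which uses far more compact in-memory formats): files map to an ordered list of blocks, and blocks map to the DataNodes holding them.

```python
# Minimal sketch of the NameNode's bookkeeping (illustrative only):
# which blocks make up each file, and which DataNodes hold each block.
class NameNode:
    def __init__(self):
        self.file_blocks = {}   # filename -> ordered list of block ids
        self.block_nodes = {}   # block id -> set of DataNodes holding it

    def add_file(self, name, block_ids):
        self.file_blocks[name] = list(block_ids)

    def report_block(self, datanode, block_id):
        # DataNodes report their blocks; the NameNode records the mapping.
        self.block_nodes.setdefault(block_id, set()).add(datanode)

    def locate(self, name):
        # Which DataNodes a client must contact to read each block, in order.
        return [sorted(self.block_nodes.get(b, set()))
                for b in self.file_blocks[name]]

nn = NameNode()
nn.add_file("/user/a.txt", ["blk_1", "blk_2"])
for dn in ("dn1", "dn2", "dn3"):
    nn.report_block(dn, "blk_1")
nn.report_block("dn2", "blk_2")
```

Because this entire structure lives only in the NameNode, losing that one machine loses the map from files to data, which is exactly the single-point-of-failure problem the slide warns about.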
  • 79.
    DataNode  Each slavemachine in your cluster will host a DataNode daemon to perform the grunt work of the distributed filesystem - reading and writing HDFS blocks to actual files on the local file system When you want to read or write a HDFS file, the file is broken into blocks and the NameNode will tell your client which DataNode each block resides in. Your client communicates directly with the DataNode daemons to process the local files corresponding to the blocks.
  • 80.
    DataNode  Furthermore, aDataNode may communicate with other DataNodes to replicate its data blocks for redundancy.This ensures that if any one DataNode crashes or becomes inaccessible over the network, you'll still able to read the files. DataNodes are constantly reporting to the NameNode. Upon initialization, each of the DataNodes informs the NameNode of the blocks it's currently storing. After this mapping is complete, the DataNodes continually poll the NameNode to provide information regarding local changes as well as receive instructions to create, move, or delete from the local disk.
  • 82.
Secondary NameNode (SNN)  The SNN is an assistant daemon for monitoring the state of the cluster HDFS. Like the NameNode, each cluster has one SNN, and it typically resides on its own machine as well; no other DataNode or TaskTracker daemons run on the same server. The SNN differs from the NameNode in that this process doesn't receive or record any real-time changes to HDFS. Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration. As mentioned earlier, the NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots help minimize the downtime and loss of data.
  • 83.
fsimage (filesystem image) file and edits file  The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode's local file system too.
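The recovery idea behind this pair of files is checkpoint-plus-replay: start from the FsImage and apply the EditLog's changes in order. A sketch under assumed, illustrative record formats (real HDFS uses binary formats with many more operation types):

```python
# Illustrative sketch of FsImage + EditLog recovery: replay the logged
# metadata changes on top of the checkpointed namespace.
def replay(fsimage, editlog):
    namespace = dict(fsimage)               # start from the checkpoint
    for op, path, value in editlog:         # apply each change in order
        if op == "create":
            namespace[path] = {"replication": value}
        elif op == "set_replication":       # e.g. changing a file's factor
            namespace[path]["replication"] = value
        elif op == "delete":
            del namespace[path]
    return namespace

fsimage = {"/a": {"replication": 3}}
editlog = [("create", "/b", 3),
           ("set_replication", "/a", 2),
           ("delete", "/b", None)]
state = replay(fsimage, editlog)
```

Logging each change is much cheaper than rewriting the whole FsImage on every operation, which is the design rationale for keeping the two files separate.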
  • 85.
    JobTracker  Once yousubmit your code to your cluster, the JobTracker determines the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they're running. should a task fail, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries. There is only one JobTracker daemon per Hadoop cluster. It's typically run on a server as a master node of the cluster.
  • 86.
    TaskTracker  the computingdaemons also follow a master/slave architecture: the JobTracker is the master overseeing the overall execution of a MapReduce job and the TaskTracker manage the execution of individual tasks on each slave node. Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns. Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel. One responsibility of the TaskTracker is to constantly communicate with the JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.
  • 88.
Topology of a typical Hadoop cluster.