BIG DATA PROGRAMMING
HDFS (Hadoop Distributed File System)
Design of HDFS, HDFS concepts, benefits and challenges,
command line interface, Hadoop file system interfaces, data flow, data
ingest with Flume and Sqoop, Hadoop archives, Hadoop I/O:
compression, serialization, Avro and file-based data structures.
DESIGN OF HDFS
• HDFS is designed for storing very large files with
streaming data access patterns, running on clusters
of commodity hardware.
• It is designed for very large files. “Very large” in this context means files that are
hundreds of megabytes, gigabytes, or terabytes in size.
• It is built around the idea that the most efficient data processing pattern is a
write-once, read-many-times pattern.
DESIGN OF HDFS (CONTD.)
• It is designed for commodity hardware.
Hadoop doesn’t require expensive, highly
reliable hardware. It’s designed to run on the
commonly available hardware that can be
obtained from multiple vendors.
• HDFS is designed to carry on working
without a noticeable interruption to the user
in case of hardware failure.
FEATURES OF HDFS:
• Data replication. Replication is used to ensure that the data is always available
and to prevent data loss.
• Fault tolerance and reliability. HDFS' ability to replicate file blocks and
store them across nodes in a large cluster ensures fault tolerance and
reliability.
• High availability. As mentioned earlier, because of replication across
nodes, data is available even if the NameNode or a DataNode fails.
• Scalability. Because HDFS stores data on various nodes in the cluster, as
requirements increase, a cluster can scale to hundreds of nodes.
• High throughput. Because HDFS stores data in a distributed manner, the
data can be processed in parallel on a cluster of nodes. This, plus data
locality, cuts the processing time and enables high throughput.
• Data locality. With HDFS, computation happens on the DataNodes where
the data resides, rather than having the data move to where the
computational unit is.
HDFS: BENEFITS
• Cost effectiveness: The Data Nodes that store the data rely on
inexpensive off-the-shelf hardware, which cuts storage costs. Also,
because HDFS is open source, there's no licensing fee.
• Large data set storage: HDFS stores a variety of data of any size --
from megabytes to petabytes -- and in any format, including
structured and unstructured data.
• Fast recovery from hardware failure: HDFS is designed to detect
faults and automatically recover on its own.
• Portability: HDFS is portable across all hardware platforms, and it
is compatible with several operating systems, including Windows,
Linux and macOS.
• Streaming data access: HDFS is built for high data throughput,
which is best for access to streaming data.
HDFS CHALLENGES
• HDFS is not a good fit if we have a lot of small files.
Because the namenode holds filesystem metadata in
memory, the limit to the number of files in a filesystem is
governed by the amount of memory on the namenode
• If we have multiple writers or arbitrary file modifications, HDFS is not a
good fit. Files in HDFS are modified by a single writer at any time.
• Writes are always made at the end of the file, in append-only fashion.
There is no support for modifications at arbitrary offsets, so HDFS is
not a good fit if modifications have to be made at arbitrary offsets in
the file.
HDFS CHALLENGES (CONTD.)
• HDFS cannot provide any reliability for data that is stored on only a
single machine if that machine goes down.
• An enormous number of clients must be handled if all
the clients need the data stored on a single machine.
• Clients need to copy the data to their local machines
before they can operate on it.
• Applications that require low-latency access to data, in
the tens of milliseconds range, will not work well with
HDFS.
• Since the namenode holds filesystem metadata in
memory, the limit to the number of files in a filesystem
is governed by the amount of memory on the namenode.
CHALLENGES IN HADOOP:
• Hadoop is a cutting-edge technology
Hadoop is a new technology, and as with adopting any new
technology, finding people who know it is difficult.
• Hadoop in the Enterprise Ecosystem
Hadoop is designed to solve the Big Data problems encountered by Web
and social companies. In doing so, a lot of the features enterprises need or
want are put on the back burner. For example, HDFS does not offer native
support for security and authentication.
• Hadoop is still rough around the edges
The development and admin tools for Hadoop are still fairly new.
Companies like Cloudera, Hortonworks, MapR and Karmasphere have
been working on this issue. However, the tooling may not be as mature
as enterprises are used to (compared with, say, Oracle admin tools).
CHALLENGES IN HADOOP (CONTD.)
Hardware Cost:
Hadoop is not cheap. Hadoop runs on 'commodity' hardware, but
these are not bargain machines; they are server-grade hardware. So
standing up a reasonably large Hadoop cluster, say 100 nodes, will cost a
significant amount of money. For example, if a Hadoop node costs
$5,000, a 100-node cluster would be $500,000 for hardware.
IT and Operations costs:
A large Hadoop cluster will require support from various teams such as
network admins, IT, security admins and system admins. One also needs to
think about operational costs like data center expenses: cooling,
electricity, etc.
Map Reduce is a different programming paradigm
Solving problems using Map Reduce requires a different kind of
thinking. Engineering teams generally need additional training to take
advantage of Hadoop.
FILE SIZES, BLOCK SIZES IN HDFS
HDFS works with a main NameNode and
multiple other DataNodes, all on a commodity
hardware cluster. These nodes are organized in the
same place within the data center.
Next, each file is broken down into blocks, which are
distributed among the multiple DataNodes for storage.
To reduce the chances of data loss, blocks are
often replicated across nodes. It's a backup system
should data be lost.
FILE SIZES, BLOCK SIZES IN HDFS (CONTD.)
The NameNode is the node within the cluster
that knows what the data contains, what block it
belongs to, the block size, and where it should go.
NameNodes are also used to control access to files, including when
someone can write, read, create, remove, and replicate data across the
various DataNodes.
HDFS FILE COMMANDS
Use HDFS commands, such as those below, to operate and manage your system.
• -rm : Removes a file or directory
• -ls : Lists files with permissions and other details
• -mkdir : Creates a directory named path in HDFS
• -cat : Shows contents of the file
• -rmdir : Deletes a directory
• -put : Uploads a file or folder from a local disk to HDFS
• -rmr : Deletes the file identified by path, or a folder and its subfolders
• -get : Moves a file or folder from HDFS to a local file
• -count : Counts the number of files, number of directories, and file size
• -df : Shows free space
• -getmerge : Merges multiple files in HDFS
• -chmod : Changes file permissions
• -copyToLocal : Copies files to the local system
• -stat : Prints statistics about the file or directory
• -head : Displays the first kilobyte of a file
• -usage : Returns the help for an individual command
• -chown : Allocates a new owner and group of a file
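For illustration, a typical sequence of these commands might look like the following (the directory and file names are hypothetical, not from the slides):
hdfs dfs -mkdir /user/hadoop/demo
hdfs dfs -put localfile.txt /user/hadoop/demo
hdfs dfs -ls /user/hadoop/demo
hdfs dfs -cat /user/hadoop/demo/localfile.txt
hdfs dfs -get /user/hadoop/demo/localfile.txt ./copy.txt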
HDFS BLOCK ABSTRACTION
• The HDFS block size is usually 64 MB-128 MB and, unlike in other
file systems, a file smaller than the block size does not occupy a full
block's worth of underlying storage.
• The block size is kept large so that the time spent on disk seeks is
small compared to the time spent transferring the data.
Why do we need block abstraction :
• Files can be bigger than individual disks.
• Filesystem metadata does not need to be associated with
each and every block.
• Simplifies storage management - Easy to figure out the
number of blocks which can be stored on each disk.
• Fault tolerance and storage replication can be easily done
on a per-block basis.
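To make the block abstraction concrete, the sketch below (an illustrative example, not from the slides; the file path is hypothetical) uses the Hadoop FileSystem API to list the blocks of a file and the DataNodes holding each block:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FileStatus status = fs.getFileStatus(new Path("/path/to/bigfile.dat")); // hypothetical path
    // One BlockLocation per block: offset, length and the DataNodes holding its replicas
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + String.join(",", block.getHosts()));
    }
    fs.close();
  }
}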
DATA REPLICATION
• Replication ensures the availability of the data.
• Replication means making a copy of something, and the number of copies
made of a particular block is expressed as its Replication Factor.
• As HDFS stores the data in the form of blocks, Hadoop is also configured
to make copies of those file blocks.
• By default, the Replication Factor for Hadoop is set to 3, which can be
configured.
• We need this replication for our file blocks because, for running
Hadoop, we are using commodity hardware (inexpensive system
hardware) which can crash at any time.
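The replication factor can also be changed per file. Below is a minimal sketch using the FileSystem API (the path is hypothetical; the equivalent shell command would be hdfs dfs -setrep 3 /path/to/file.ext):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// Ask HDFS to keep 3 replicas of this file's blocks
fs.setReplication(new Path("/path/to/file.ext"), (short) 3);
fs.close();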
HDFS: READ FILES
• Step 1: The client opens the file he/she wishes to read by calling open() on the File
System Object.
• Step 2: Distributed File System(DFS) calls the name node, to determine the locations
of the first few blocks in the file. For each block, the name node returns the addresses
of the data nodes that have a copy of that block. The DFS returns an FSDataInputStream to
the client for it to read data from.
• Step 3: The client then calls read() on the stream. DFSInputStream, which has stored
the data node addresses for the first few blocks in the file, then connects to the
first (closest) data node for the first block in the file.
• Step 4: Data is streamed from the data node back to the client, which calls read()
repeatedly on the stream.
• Step 5: When the end of a block is reached, DFSInputStream closes the
connection to that data node and then finds the best data node for the next block.
• Step 6: When the client has finished reading the file, it calls close() on
the FSDataInputStream.
HDFS: WRITE FILES
Step 1: The client creates the file by calling create() on DistributedFileSystem(DFS).
Step 2: DFS makes an RPC call to the name node to create a new file in the file system’s
namespace, with no blocks associated with it. The name node performs various checks to
make sure the file doesn’t already exist and that the client has the right permissions to
create the file. If these checks pass, the name node makes a record of the new file;
otherwise, the file can't be created. The DFS returns an FSDataOutputStream for the
client to start writing data to the file.
Step 3: DFSOutputStream splits the data into packets, which it writes to an internal
queue called the data queue. The data queue is consumed by the DataStreamer, which is
responsible for asking the name node to allocate new blocks by picking a list of suitable
data nodes to store the replicas. The list of data nodes forms a pipeline. The DataStreamer
streams the packets to the first data node in the pipeline, which stores each packet
and forwards it to the second data node in the pipeline.
HDFS: WRITE FILES (CONTD.)
Step 4: Similarly, the second data node stores the packet and forwards it to the
third (and last) data node in the pipeline.
Step 5: The DFSOutputStream maintains an internal queue of packets that are
waiting to be acknowledged by data nodes, called the “ack queue”.
Step 6: When the client has finished writing data, it calls close() on the stream. This
flushes all the remaining packets to the data node pipeline and waits for
acknowledgements before contacting the name node to signal that the file is complete.
HDFS: STORE FILES
• HDFS divides files into blocks and stores each
block on a DataNode.
• Multiple DataNodes are linked to the master
node in the cluster, the NameNode.
• The master node distributes replicas of these
data blocks across the cluster. It also instructs the
user where to locate wanted information.
• Before the NameNode can help to store and
manage the data, it first needs to partition the
file into smaller, manageable data blocks.
• This process is called data block splitting.
JAVA INTERFACES TO HDFS
Java code for writing a file in HDFS (assumes conf is an org.apache.hadoop.conf.Configuration
and source is the path of the local file to copy):
FileSystem fileSystem = FileSystem.get(conf);
// Check if the file already exists
Path path = new Path("/path/to/file.ext");
if (fileSystem.exists(path)) {
System.out.println("File " + path + " already exists");
return;
}
JAVA INTERFACES TO HDFS (CONTD.)
// Create a new file and write data to it.
FSDataOutputStream out = fileSystem.create(path);
InputStream in = new BufferedInputStream(new FileInputStream(new
File(source)));
byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
out.write(b, 0, numBytes);
}
// Close all the file descriptors
in.close();
out.close();
fileSystem.close();
JAVA INTERFACES TO HDFS (CONTD.)
• // Java code for reading a file in HDFS:
FileSystem fileSystem = FileSystem.get(conf);
Path path = new Path("/path/to/file.ext");
if (!fileSystem.exists(path)) {
System.out.println("File does not exist");
return;
}
FSDataInputStream in = fileSystem.open(path);
byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
System.out.println(new String(b, 0, numBytes));
// code to manipulate the data which is read
}
in.close();
fileSystem.close();
COMMAND LINE INTERFACE
HDFS can be manipulated through a Java API or through a command-line
interface. The File System (FS) shell includes various shell-like
commands that directly interact with the Hadoop Distributed File System
(HDFS) as well as other file systems that Hadoop supports.
Below are the commands supported :
• appendToFile: Appends the content of one or more local files to a file in HDFS.
• cat: Copies source paths to stdout.
• checksum: Returns the checksum information of a file.
• chgrp : Change group association of files. The user must be the owner of
files, or else a super-user.
• chmod : Change the permissions of files. The user must be the owner of
the file, or else a super-user.
• chown: Change the owner of files. The user must be a super-user.
• copyFromLocal: Copies files from the local file system to HDFS (for example, all the
files inside a test folder on the edge node to a test folder in HDFS).
COMMAND LINE INTERFACE (CONTD.)
• copyToLocal: Copies files from HDFS to the local file system (for example, all the
files inside a test folder in HDFS to a test folder on the edge node).
• count: Count the number of directories, files and bytes under the paths
that match the specified file pattern.
• cp: Copy files from source to destination. This command allows multiple
sources as well in which case the destination must be a directory.
• createSnapshot: HDFS Snapshots are read-only point-in-time copies of the
file system. Snapshots can be taken on a subtree of the file system or the
entire file system. Some common use cases of snapshots are data backup,
protection against user errors and disaster recovery.
• deleteSnapshot: Delete a snapshot from a snapshot table directory. This
operation requires the owner privilege of the snapshottable directory.
• df: Displays free space
COMMAND LINE INTERFACE (CONTD.)
• du: Displays the sizes of files and directories contained in the given directory, or the
length of a file in case it's just a file.
• expunge: Empty the Trash.
• find: Finds all files that match the specified expression and applies selected actions
to them. If no path is specified then defaults to the current working directory. If no
expression is specified then defaults to -print.
• get: Copies files to the local file system.
• getfacl: Displays the Access Control Lists (ACLs) of files and directories. If a
directory has a default ACL, then getfacl also displays the default ACL.
• getfattr: Displays the extended attribute names and values for a file or directory.
• getmerge : Takes a source directory and a destination file as input and
concatenates files in src into the destination local file.
• help: Return usage output.
• ls: Lists files.
• lsr: Recursive version of ls.
COMMAND LINE INTERFACE (CONTD.)
• mkdir: Takes path URIs as argument and creates directories.
• moveFromLocal: Similar to put command, except that the source localsrc is deleted after it’s
copied.
• moveToLocal: Displays a “Not implemented yet” message.
• mv: Moves files from source to destination. This command allows multiple sources as well in which
case the destination needs to be a directory.
• put : Copy single src, or multiple srcs from local file system to the destination file system. Also reads
input from stdin and writes to destination file system.
• renameSnapshot : Rename a snapshot. This operation requires the owner privilege of the
snapshottable directory.
• rm : Delete files specified as args.
• rmdir : Delete a directory.
• rmr : Recursive version of delete.
• setfacl : Sets Access Control Lists (ACLs) of files and directories.
• setfattr : Sets an extended attribute name and value for a file or directory.
• setrep: Changes the replication factor of a file. If the path is a directory then the command
recursively changes the replication factor of all files under the directory tree rooted at the path.
• stat : Print statistics about the file/directory at <path> in the specified format.
• tail: Displays the last kilobyte of the file to stdout.
COMMAND LINE INTERFACE (CONTD.)
• test: Tests path properties (directory, exists, file, non-zero size, zero length); usage: hadoop fs -test -[defsz] URI.
• text: Takes a source file and outputs the file in text format. The allowed
formats are zip and TextRecordInputStream.
• touchz: Create a file of zero length.
• truncate: Truncate all files that match the specified file pattern to the
specified length.
• usage: Return the help for an individual command.
HADOOP FILE SYSTEM INTERFACES
• HDFS Interfaces :
Features of HDFS interfaces are :
• Create new file
• Upload files/folder
• Set Permission
• Copy
• Move
• Rename
• Delete
• Drag and Drop
• HDFS File viewer
DATA FLOW
• MapReduce is used to compute a huge amount of data. To handle the
incoming data in a parallel and distributed form, the data has to flow
through various phases:
• Input Reader:
The input reader reads the incoming data and splits it into data
blocks of the appropriate size (64 MB to 128 MB).
Once the input reader reads the data, it generates the corresponding key-value pairs.
The input files reside in HDFS.
DATA FLOW
• Map Function :
The map function processes the incoming key-value pairs and generates the corresponding
output key-value pairs. The map input and output types may be different from each
other.
• Partition Function :
The partition function assigns the output of each map function to the appropriate
reducer. It is given the key and the number of reducers and returns the index of the reducer.
• Shuffling and Sorting : The data is shuffled between nodes so that it moves out of the
map tasks and is ready to be processed by the reduce function. A sorting operation is
performed on the input data for the Reduce function.
• Reduce Function : The Reduce function is called for each unique key. The keys arrive
in sorted order. The values associated with a key are iterated over by the Reduce function,
which generates the corresponding output.
• Output Writer : Once the data has flowed through all the above phases, the Output Writer
executes. The role of the Output Writer is to write the Reduce output to stable storage.
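To tie these phases together, here is a minimal word-count sketch using the Hadoop MapReduce Java API (an illustrative example, not taken from the slides; the input and output paths are hypothetical):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: emit (word, 1) for every word in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context) throws java.io.IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce: sum the counts for each word (keys arrive sorted)
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws java.io.IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/user/hadoop/input"));    // hypothetical path
    FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/output")); // hypothetical path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}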
DATA INGEST: FLUME AND SQOOP
• Hadoop Data ingestion is the beginning of input data pipeline in a
data lake.
• It means taking data from various silo databases and files and
putting it into Hadoop.
• For many companies, it does turn out to be an intricate task.
• That is why they take more than a year to ingest all their data into
the Hadoop data lake.
• The reason is, as Hadoop is open-source; there are a variety of ways
you can ingest data into Hadoop.
• It gives every developer the choice of using her/his favourite tool
or language to ingest data into Hadoop.
• Developers while choosing a tool/technology stress on
performance, but this makes governance very complicated.
FLUME
Flume :
• Apache Flume is a service designed for streaming logs
into the Hadoop environment.
• Flume is a distributed and reliable service for collecting
and aggregating huge amounts of log data.
• With a simple and easy-to-use architecture based on
streaming data flows, it also has tunable reliability
mechanisms and several recovery and failover
mechanisms.
SQOOP
Sqoop :
• Apache Sqoop (SQL-to-Hadoop) is a lifesaver for
anyone who is experiencing difficulties in moving data
from the data warehouse into the Hadoop environment.
• Apache Sqoop is an effective Hadoop tool used for
importing data from RDBMS’s like MySQL, Oracle, etc. into
HBase, Hive or HDFS.
• Sqoop Hadoop can also be used for exporting data from
HDFS into RDBMS.
• Apache Sqoop is a command-line interpreter i.e. the Sqoop
commands are executed one at a time by the interpreter.
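As an illustration of Sqoop's command-line usage (the connection string, credentials file, table and directory names are hypothetical), an import and an export might look like:
sqoop import --connect jdbc:mysql://dbhost/salesdb --username dbuser --password-file /user/hadoop/.dbpass --table orders --target-dir /user/hadoop/orders -m 4
sqoop export --connect jdbc:mysql://dbhost/salesdb --username dbuser --password-file /user/hadoop/.dbpass --table order_summary --export-dir /user/hadoop/summary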
HADOOP ARCHIVES
• Hadoop Archive is a facility that packs up small files into one
compact HDFS block to avoid memory wastage of name
nodes.
• Name node stores the metadata information of the HDFS
data.
• If 1GB file is broken into 1000 pieces then namenode will
have to store metadata about all those 1000 small files.
• In that manner, namenode memory will be wasted in storing
and managing a lot of metadata.
• A HAR is created from a collection of files, and the archiving
tool runs a MapReduce job.
• This MapReduce job processes the input files in parallel
to create the archive file.
HADOOP ARCHIVES
• Hadoop is created to deal with large files, so small files are
problematic and have to be handled efficiently. When a huge number
of small input files is stored across the data nodes, metadata records
for every one of those files have to be stored in the name node, which
makes the name node inefficient.
• To handle this problem, Hadoop Archive has been created which
packs the HDFS files into archives and we can directly use these
files as input to the MR jobs.
An archive always comes with a *.har extension.
HAR Syntax :
• hadoop archive -archiveName NAME -p <parent path> <src>*
<dest>
Example :
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2
/user/zoo
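Once created, the archive can be read through the har:// filesystem scheme. An illustrative listing command, following the example above, might be:
hadoop fs -ls -R har:///user/zoo/foo.har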
HADOOP I/O: COMPRESSION
• In the Hadoop framework, where large data sets are stored and processed, we
will need storage for large files.
• These files are divided into blocks and those blocks are stored in different nodes
across the cluster so lots of I/O and network data transfer is also involved.
• In order to reduce the storage requirements and to reduce the time spent in
network transfer, we can have a look at data compression in the Hadoop
framework.
• Using data compression in Hadoop we can compress files at various steps, at all
of these steps it will help to reduce storage and quantity of data transferred.
• We can compress the input file itself.
• That will help us reduce storage space in HDFS.
• We can also configure that the output of a MapReduce job is compressed in
Hadoop.
• That helps in reducing storage space if you are archiving output or sending it to
some other application for further processing.
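As a sketch of how output compression might be enabled for a MapReduce job (the codec choice is illustrative, and it assumes a job object configured with the usual Job API, as in the word-count example earlier):
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// ... after the job has been configured as usual:
FileOutputFormat.setCompressOutput(job, true);                   // compress the job output
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class); // use the gzip codec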
SERIALIZATION
• Serialization refers to the conversion of structured objects into byte
streams for transmission over the network or permanent storage
on a disk.
• Deserialization refers to the conversion of byte streams back to
structured objects.
Serialization is mainly used in two areas of distributed data processing :
• Interprocess communication
• Permanent storage
We require I/O Serialization because :
• To process records faster (Time-bound).
• When a proper data format needs to be maintained and transmitted to
another end that has no schema support.
• When data without structure or format has to be processed in the future,
complex errors may occur.
• Serialization offers data validation over transmission.
• To maintain the proper format of data serialization, the system must have
the following four properties:
• Compact - helps in the best use of network bandwidth
• Fast - reduces the performance overhead
• Extensible - can match new requirements
• Inter-operable - not language-specific
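As a concrete illustration of serialization in Hadoop, the minimal sketch below uses Hadoop's own Writable mechanism (the value 163 is arbitrary; the snippet is meant to run inside a method that can throw IOException):
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

// Serialize an IntWritable into a byte stream ...
IntWritable value = new IntWritable(163);
ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
DataOutputStream dataOut = new DataOutputStream(bytesOut);
value.write(dataOut);
dataOut.close();
byte[] serialized = bytesOut.toByteArray();

// ... and deserialize it back into a structured object
IntWritable copy = new IntWritable();
copy.readFields(new DataInputStream(new ByteArrayInputStream(serialized)));
System.out.println(copy.get()); // prints 163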
AVRO AND FILE-BASED DATA STRUCTURES
• Apache Avro is a language-neutral data serialization system.
• Since Hadoop writable classes lack language portability, Avro
becomes quite helpful, as it deals with data formats that can be
processed by multiple languages.
• Avro is a preferred tool to serialize data in Hadoop.
• Avro has a schema-based system.
• A language-independent schema is associated with its read and
write operations.
• Avro serializes the data which has a built-in schema.
• Avro serializes the data into a compact binary format, which can
be deserialized by any application.
• Avro uses JSON format to declare the data structures.
• Presently, it supports languages such as Java, C, C++, C#, Python,
and Ruby.
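A small illustrative sketch of Avro's schema-based serialization using the Avro Java API (the "User" record schema and field values are hypothetical, and the snippet is meant to run inside a method that can throw IOException):
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

// A language-independent schema declared in JSON (hypothetical record type)
String schemaJson = "{\"type\":\"record\",\"name\":\"User\","
    + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"age\",\"type\":\"int\"}]}";
Schema schema = new Schema.Parser().parse(schemaJson);

// Build a record that conforms to the schema
GenericRecord user = new GenericData.Record(schema);
user.put("name", "Alice");
user.put("age", 30);

// Serialize the record into Avro's compact binary format
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(user, encoder);
encoder.flush();
byte[] avroBytes = out.toByteArray(); // can be deserialized by any Avro-aware application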
SECURITY IN HADOOP
• Apache Hadoop achieves security by using Kerberos.
At a high level, there are three steps that a client
must take to access a service when using Kerberos,
each of which involves a message exchange with a server.
• Authentication – The client authenticates itself to the
Authentication Server and receives a timestamped
Ticket-Granting Ticket (TGT).
• Authorization – The client uses the TGT to request a
service ticket from the Ticket Granting Server.
• Service Request – The client uses the service ticket to
authenticate itself to the server.
ADMINISTERING HADOOP
The person who administers Hadoop is called a HADOOP
ADMINISTRATOR.
Some of the common administering tasks in Hadoop are :
• Monitor health of a cluster
• Add new data nodes as needed
• Optionally turn on security
• Optionally turn on encryption
• Recommended, but optional, to turn on high availability
• Optional to turn on MapReduce Job History Tracking Server
• Fix corrupt data blocks when necessary
• Tune performance
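A few commonly used administration commands that support the tasks above (illustrative; they report cluster health, block status, and safe mode state):
hdfs dfsadmin -report                  # summary of DataNodes, capacity and remaining space
hdfs fsck / -list-corruptfileblocks    # check the filesystem and list corrupt blocks
hdfs dfsadmin -safemode get            # check whether the NameNode is in safe mode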

More Related Content

Similar to Unit-3.pptx

Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorialvinayiqbusiness
 
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for womenHadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for womenmaharajothip1
 
IRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
IRJET- A Novel Approach to Process Small HDFS Files with Apache SparkIRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
IRJET- A Novel Approach to Process Small HDFS Files with Apache SparkIRJET Journal
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configurationprabakaranbrick
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxAnkitChauhan817826
 
Hadoop - HDFS
Hadoop - HDFSHadoop - HDFS
Hadoop - HDFSKavyaGo
 
EMC Isilon Best Practices for Hadoop Data Storage
EMC Isilon Best Practices for Hadoop Data StorageEMC Isilon Best Practices for Hadoop Data Storage
EMC Isilon Best Practices for Hadoop Data StorageEMC
 
Best Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the CloudBest Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the CloudLeons Petražickis
 
Hadoop Fundamentals
Hadoop FundamentalsHadoop Fundamentals
Hadoop Fundamentalsits_skm
 
Hadoop File System.pptx
Hadoop File System.pptxHadoop File System.pptx
Hadoop File System.pptxAakashBerlia1
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Simplilearn
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop BasicsSonal Tiwari
 

Similar to Unit-3.pptx (20)

Hadoop Research
Hadoop Research Hadoop Research
Hadoop Research
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorial
 
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for womenHadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
 
IRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
IRJET- A Novel Approach to Process Small HDFS Files with Apache SparkIRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
IRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
 
Hadoop - HDFS
Hadoop - HDFSHadoop - HDFS
Hadoop - HDFS
 
Hadoop data management
Hadoop data managementHadoop data management
Hadoop data management
 
EMC Isilon Best Practices for Hadoop Data Storage
EMC Isilon Best Practices for Hadoop Data StorageEMC Isilon Best Practices for Hadoop Data Storage
EMC Isilon Best Practices for Hadoop Data Storage
 
Hdfs design
Hdfs designHdfs design
Hdfs design
 
Best Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the CloudBest Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the Cloud
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Hadoop overview.pdf
Hadoop overview.pdfHadoop overview.pdf
Hadoop overview.pdf
 
Hadoop Fundamentals
Hadoop FundamentalsHadoop Fundamentals
Hadoop Fundamentals
 
Hadoop fundamentals
Hadoop fundamentalsHadoop fundamentals
Hadoop fundamentals
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Hadoop File System.pptx
Hadoop File System.pptxHadoop File System.pptx
Hadoop File System.pptx
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
 

More from JasmineMichael1 (20)

DATA MODEL PRESENTATION UNIT I-BCA I.pptx
DATA MODEL PRESENTATION UNIT I-BCA I.pptxDATA MODEL PRESENTATION UNIT I-BCA I.pptx
DATA MODEL PRESENTATION UNIT I-BCA I.pptx
 
SWOT ANALYSIS.pptx
SWOT ANALYSIS.pptxSWOT ANALYSIS.pptx
SWOT ANALYSIS.pptx
 
UNIT IV.pptx
UNIT IV.pptxUNIT IV.pptx
UNIT IV.pptx
 
upload1.pptx
upload1.pptxupload1.pptx
upload1.pptx
 
prese1.pptx
prese1.pptxprese1.pptx
prese1.pptx
 
Presentation2.pptx
Presentation2.pptxPresentation2.pptx
Presentation2.pptx
 
Big Sheets.pptx
Big Sheets.pptxBig Sheets.pptx
Big Sheets.pptx
 
ppt excel2.pptx
ppt excel2.pptxppt excel2.pptx
ppt excel2.pptx
 
ppt excel1.pptx
ppt excel1.pptxppt excel1.pptx
ppt excel1.pptx
 
ppt excel.pptx
ppt excel.pptxppt excel.pptx
ppt excel.pptx
 
Presentation3.pptx
Presentation3.pptxPresentation3.pptx
Presentation3.pptx
 
Presentation1.pptx
Presentation1.pptxPresentation1.pptx
Presentation1.pptx
 
Presentation1.pptx
Presentation1.pptxPresentation1.pptx
Presentation1.pptx
 
Presentation1.pptx
Presentation1.pptxPresentation1.pptx
Presentation1.pptx
 
s2.pptx
s2.pptxs2.pptx
s2.pptx
 
s1.pptx
s1.pptxs1.pptx
s1.pptx
 
Presentation8.pptx
Presentation8.pptxPresentation8.pptx
Presentation8.pptx
 
Physical Storage Media (Cont.pptx
Physical Storage Media (Cont.pptxPhysical Storage Media (Cont.pptx
Physical Storage Media (Cont.pptx
 
Physical Storage Media (Cont.pptx
Physical Storage Media (Cont.pptxPhysical Storage Media (Cont.pptx
Physical Storage Media (Cont.pptx
 
Presentation1.pptx
Presentation1.pptxPresentation1.pptx
Presentation1.pptx
 

Recently uploaded

AIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.pptAIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.pptNishitharanjan Rout
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxJisc
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxmarlenawright1
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17Celine George
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - Englishneillewis46
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxDr. Ravikiran H M Gowda
 
QUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lesson
QUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lessonQUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lesson
QUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lessonhttgc7rh9c
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jisc
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdfFICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdfPondicherry University
 
PANDITA RAMABAI- Indian political thought GENDER.pptx
PANDITA RAMABAI- Indian political thought GENDER.pptxPANDITA RAMABAI- Indian political thought GENDER.pptx
PANDITA RAMABAI- Indian political thought GENDER.pptxakanksha16arora
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxPooja Bhuva
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxPooja Bhuva
 
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPSSpellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPSAnaAcapella
 
How to Add a Tool Tip to a Field in Odoo 17
How to Add a Tool Tip to a Field in Odoo 17How to Add a Tool Tip to a Field in Odoo 17
How to Add a Tool Tip to a Field in Odoo 17Celine George
 
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxannathomasp01
 
Details on CBSE Compartment Exam.pptx1111
Details on CBSE Compartment Exam.pptx1111Details on CBSE Compartment Exam.pptx1111
Details on CBSE Compartment Exam.pptx1111GangaMaiya1
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxJisc
 

Recently uploaded (20)

AIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.pptAIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.ppt
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
QUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lesson
QUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lessonQUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lesson
QUATER-1-PE-HEALTH-LC2- this is just a sample of unpacked lesson
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdfFICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
 
OS-operating systems- ch05 (CPU Scheduling) ...
OS-operating systems- ch05 (CPU Scheduling) ...OS-operating systems- ch05 (CPU Scheduling) ...
OS-operating systems- ch05 (CPU Scheduling) ...
 
PANDITA RAMABAI- Indian political thought GENDER.pptx
PANDITA RAMABAI- Indian political thought GENDER.pptxPANDITA RAMABAI- Indian political thought GENDER.pptx
PANDITA RAMABAI- Indian political thought GENDER.pptx
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
 
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPSSpellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
 
How to Add a Tool Tip to a Field in Odoo 17
How to Add a Tool Tip to a Field in Odoo 17How to Add a Tool Tip to a Field in Odoo 17
How to Add a Tool Tip to a Field in Odoo 17
 
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
 
Details on CBSE Compartment Exam.pptx1111
Details on CBSE Compartment Exam.pptx1111Details on CBSE Compartment Exam.pptx1111
Details on CBSE Compartment Exam.pptx1111
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 

Unit-3.pptx

  • 1. BIGDATA PROGRAMMING HDFS (Hadoop Distributed File System) Design of HDFS, HDFS concepts, benefits and challenges,, command line interface, Hadoop file system interfaces, data flow, data ingest with Flume and Scoop, Hadoop archives, Hadoop I/O: compression, serialization, Avro and file-based data structures. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 1
  • 2. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 2 DESIGNOF HDFS • HDFS is designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. It is designed for very large files. “Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. • It is built around the idea that the most efficient data processing pattern is a write- once, read-many-times pattern
  • 3. DESIGNOF HDFS . • It is designed for commodity hardware. Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on the commonly available hardware that can be obtained from multiple vendors. • HDFS is designed to carry on working without a noticeable interruption to the user in case of hardware failure. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 3
  • 4. FEATURES OF HDFS: • Data replication. This is used to ensure that the data is always available and prevents data loss • Fault tolerance and reliability. HDFS' ability to replicate file blocks and store them across nodes in a large cluster ensures fault tolerance and reliability. • High availability. As mentioned earlier, because of replication across notes, data is available even if the Name Node or a Data Node fails. • Scalability. Because HDFS stores data on various nodes in the cluster, as requirements increase, a cluster can scale to hundreds of nodes. • High throughput. Because HDFS stores data in a distributed manner, the data can be processed in parallel on a cluster of nodes. This, plus data locality , cut the processing time and enable high throughput. • Data locality. With HDFS, computation happens on the DataNodes where the data resides, rather than having the data move to where the computational unit is 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 4
  • 5. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 5 HDFS :BENEFITS • Cost effectiveness: The Data Nodes that store the data rely on inexpensive off-the-shelf hardware, which cuts storage costs. Also, because HDFS is open source, there's no licensing fee. • Large data set storage: HDFS stores a variety of data of any size -- from megabytes to petabytes -- and in any format, including structured and unstructured data. • Fast recovery from hardware failure: HDFS is designed to detect faults and automatically recover on its own. • Portability: HDFS is portable across all hardware platforms, and it is compatible with several operating systems, including Windows, Linux and Mac OS/X. • Streaming data access: HDFS is built for high data throughput, which is best for access to streaming data.
  • 6. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 6 HDFS CHALLENGES • HDFS is not a good fit if we have a lot of small files. Because the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode • If we have multiple writers and arbitrary file modifications, HDFS will not a good fit. Files in HDFS are modified by a single writer at any time. • Writes are always made at the end of the file, in the append-only fashion. There is no support for modifications at arbitrary offsets in the file. So, HDFS is not a good fit if modifications have to be made at arbitrary offsets in the file.
  • 7. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 7 HDFS CHALLENGES(CONTD.) • HDFS does not give any reliability if that machine goes down. • An enormous number of clients must be handled if all the clients need the data stored on a single machine. • Clients need to copy the data to their local machines before they can operate it. • Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. • Since the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode.
  • 8. CHALLENGES INHADOOP: • Hadoop is a cutting edge technology Hadoop is a new technology, and as with adopting any new technology, finding people who know the technology is difficult. • Hadoop in the Enterprise Ecosystem Hadoop is designed to solve Big Data problems encountered by Web and Social companies. In doing so a lot of the features Enterprises need or want are put on the back burner. For example, HDFS does not offer native support for security and authentication. • Hadoop is still rough around the edges The development and admin tools for Hadoop are still pretty new. Companies like Cloudera, Hortonworks, MapR and Karmasphere have been working on this issue. How ever the tooling may not be as mature as Enterprises are used to (as say, Oracle Admin, etc.) 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 8
  • 9. CHALLENGES INHADOOP(CONTD.) Hardware Cost: Hadoop is NOT cheap, Hadoop runs on 'commodity' hardware. But these are not cheapo machines, they are server grade hardware. So standing up a reasonably large Hadoop cluster, say 100 nodes, will cost a significant amount of money. For example, lets say a Hadoop node is $5000, so a 100 node cluster would be $500,000 for hardware. IT and Operations costs: A large Hadoop cluster will require support from various teams like : Network Admins, IT, Security Admins, System Admins.Also one needs to think about operational costs like Data Center expenses : cooling, electricity, etc. Map Reduce is a different programming paradigm Solving problems using Map Reduce requires a different kind of thinking. Engineering teams generally need additional training to take advantage of Hadoop 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 9
  • 10. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 10 FILE SIZES, BLOCKSIZES - INHDFS HDFS works with a main NameNode and multiple other datanodes, all on a commodity hardware cluster. These nodes are organized in the same place within the data center. Next, it's distributed storage. broken down into blocks which are among the multiple DataNodes for To reduce the chances of data loss, blocks are often replicated across nodes. It's a backup system should data be lost.
  • 11. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 11 FILE SIZES, BLOCKSIZES - IN HDFS(CONTD.) The NameNode is the node within the cluster that knows what the data contains, what block it belongs to, the block size, and where it should go. NameNodes are also used to control access to can write, read, data acrossthe files including when someone create, remove, and replicate various data notes.
  • 12. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 12 HDFS FILE COM M ANDS HDFS commands, such as the below, to operate and manage your system. Description Removes file or directory Lists files with permissions and other details Creates a directory named path in HDFS Shows contents of the file Deletes a directory Uploads a file or folder from a local disk to HDFS Deletes the file identified by path or folder and subfolders Moves file or folder from HDFS to local file Counts number of files, number of directory, and file size Shows free space Merges multiple files in HDFS Changes file permissions Copies files to the local system Prints statistics about the file or directory Displays the first kilobyte of a file Returns the help for an individual command Allocates a new owner and group of a file Command • -rm • -ls • -mkdir • -cat • -rmdir • -put • -rmr • -get • -count • -df • -getmerge • -chmod • -copyToLocal • -Stat • -head • -usage • -chown •
  • 13. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 13 HDFS BLOCKABSTRACTION • HDFS block size is usually 64MB-128MB and unlike other file systems, a file smaller than the block size does not occupy the complete block size’s worth of memory. • The block size is kept so large so that less time is made doing disk seeks as compared to the data transfer rate. Why do we need block abstraction : • Files can be bigger than individual disks. • Filesystem metadata does not need to be associated with each and every block. • Simplifies storage management - Easy to figure out the number of blocks which can be stored on each disk. • Fault tolerance and storage replication can be easily done on a per-block basis.
  • 14. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 14 DATA REPLICATION • Replication ensures the availability of the data. • Replication is - making a copy of something and the number of times we make a copy of that particular thing can be expressed as its Replication Factor. • As HDFS stores the data in the form of various blocks at the same time Hadoop is also configured to make a copy of those file blocks. • By default, the Replication Factor for Hadoop is set to 3 which can be configured. • We need this replication for our file blocks because for running Hadoop we are using commodity hardware (inexpensive system hardware) which can be crashed at any time
  • 15. DATA REPLICATION 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 15
  • 16. HDFS : READFILES 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 16
  • 17. HDFS : READFILES • Step 1: The client opens the file he/she wishes to read by calling open() on the File System Object. • Step 2: Distributed File System(DFS) calls the name node, to determine the locations of the first few blocks in the file. For each block, the name node returns the addresses of the data nodes that have a copy of that block. The DFS returns an FSDataInputStream to the client for it to read data from. •Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the info node addresses for the primary few blocks within the file, then connects to the primary (closest) data node for the primary block in the file. • Step 4: Data is streamed from the data node back to the client, which calls read() repeatedly on the stream. • Step 5: When the end of the block is reached, DFSInputStream will close the connection to the data node, then finds the best data node for the next block. • Step 6: When the client has finished reading the file, a function is called, close() on the FSDataInputStream. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 17
  • 18. HDFS: W RITE FILES 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 18
  • 19. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 19 HDFS: W RITE FILES(CONTD.) Step 1: The client creates the file by calling create() on DistributedFileSystem(DFS). Step 2: DFS makes an RPC call to the name node to create a new file in the file system’s namespace, with no blocks associated with it. The name node performs various checks to make sure the file doesn’t already exist and that the client has the right permissions to create the file. If these checks pass, the name node prepares a record of the new file; otherwise, the file can’t be created. The DFS returns an FSDataOutputStream for the client to start out writing data to the file. Step 3: DFSOutputStream splits it into packets, which it writes to an indoor queue called the info queue. The data queue is consumed by the DataStreamer, which is liable for asking the name node to allocate new blocks by picking an inventory of suitable data nodes to store the replicas. The list of data nodes forms a pipeline. The DataStreamer streams the packets to the primary data node within the pipeline, which stores each packet and forwards it to the second data node within the pipeline.
  • 20. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 20 HDFS: W RITE FILES(CONTD.) Step 4: Similarly, the second data node stores the packet and forwards it to the third (and last) data node in the pipeline. Step 5: The DFSOutputStream sustains an internal queue of packets that are waiting to be acknowledged by data nodes, called an “ack queue”. Step 6: This action sends up all the remaining packets to the data node pipeline and waits for acknowledgements before connecting to the name node to signal whether the file is complete or not.
  • 21. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 21 HDFS: STORE FILES • HDFS divides files into blocks and stores each block on a DataNode. • Multiple DataNodes are linked to the master node in the cluster, the NameNode. • The master node distributes replicas of these data blocks across the cluster. It also instructs the user where to locate wanted information. • Before the NameNode can help to store and manage the data, it first needs to partition the file into smaller, manageable data blocks. • This process is called data block splitting
  • 22. STORE FILES 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 22
  • 23. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 23 JAVA INTERFACES TOHDFS Java code for writing file in HDFS : FileSystem fileSystem = FileSystem.get(conf); // Check if the file already exists Path path = new Path("/path/to/file.ext"); if (fileSystem.exists(path)) { System.out.println("File " + dest + " already exists"); return; }
  • 24. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 24 JAVA INTERFACES TOHDFS // Create a new file and write data to it. FSDataOutputStream out = fileSystem.create(path); InputStream in = new BufferedInputStream(new FileInputStream(new File(source))); byte[] b = new byte[1024]; int numBytes = 0; while ((numBytes = in.read(b)) > 0) { out.write(b, 0, numBytes); } // Close all the file descripters in.close(); out.close(); fileSystem.close();
  • 25. JAVA INTERFACES TOHDFS • // Java code for reading file in HDFS : FileSystem fileSystem = FileSystem.get(conf); Path path = new Path("/path/to/file.ext"); if (!fileSystem.exists(path)) { System.out.println("File does not exists"); return; } FSDataInputStream in = fileSystem.open(path); int numBytes = 0; while ((numBytes = in.read(b))> 0) { System.out.prinln((char)numBytes)); // code to manipulate the data which is read } in.close(); out.close(); fileSystem.close(); 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 25
  • 26. COM M ANDLINE INTERFACE The HDFS can be manipulated through a Java API or through a command- line interface. The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports. Below are the commands supported : • appendToFile: Append the content of the text file in the HDFS. • cat: Copies source paths to stdout. • checksum: Returns the checksum information of a file. • chgrp : Change group association of files. The user must be the owner of files, or else a super-user. • chmod : Change the permissions of files. The user must be the owner of the file, or else a super-user. • chown: Change the owner of files. The user must be a super-user. • copyFromLocal: This command copies all the files inside the test folder in the edge node to the test folder in the HDFS. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 26
  • 27. COMMAND LINE INTERFACE
  • copyToLocal: Copy files from HDFS to the local file system (e.g. the edge node); similar to get, except that the destination must be a local file reference.
  • count: Count the number of directories, files and bytes under the paths that match the specified file pattern.
  • cp: Copy files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory.
  • createSnapshot: HDFS snapshots are read-only point-in-time copies of the file system. Snapshots can be taken on a subtree of the file system or on the entire file system. Some common use cases of snapshots are data backup, protection against user errors and disaster recovery.
  • deleteSnapshot: Delete a snapshot from a snapshottable directory. This operation requires the owner privilege of the snapshottable directory.
  • df: Displays free space.
  2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 27
  • 28. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 28 COMMAND LINE INTERFACE
  • du: Displays the sizes of files and directories contained in the given directory, or the length of a file in case it is just a file.
  • expunge: Empty the trash.
  • find: Finds all files that match the specified expression and applies selected actions to them. If no path is specified, it defaults to the current working directory. If no expression is specified, it defaults to -print.
  • get: Copy files to the local file system.
  • getfacl: Displays the Access Control Lists (ACLs) of files and directories. If a directory has a default ACL, getfacl also displays the default ACL.
  • getfattr: Displays the extended attribute names and values for a file or directory.
  • getmerge: Takes a source directory and a destination file as input and concatenates the files in src into the destination local file.
  • help: Return usage output.
  • ls: List files.
  • lsr: Recursive version of ls.
  • 29. COMMAND LINE INTERFACE
  • mkdir: Takes path URIs as arguments and creates directories.
  • moveFromLocal: Similar to the put command, except that the source localsrc is deleted after it is copied.
  • moveToLocal: Displays a “Not implemented yet” message.
  • mv: Moves files from source to destination. This command allows multiple sources as well, in which case the destination needs to be a directory.
  • put: Copy a single src, or multiple srcs, from the local file system to the destination file system. Also reads input from stdin and writes to the destination file system.
  • renameSnapshot: Rename a snapshot. This operation requires the owner privilege of the snapshottable directory.
  • rm: Delete files specified as args.
  • rmdir: Delete a directory.
  • rmr: Recursive version of delete.
  • setfacl: Sets Access Control Lists (ACLs) of files and directories.
  • setfattr: Sets an extended attribute name and value for a file or directory.
  • setrep: Changes the replication factor of a file. If the path is a directory, the command recursively changes the replication factor of all files under the directory tree rooted at the path.
  • stat: Print statistics about the file/directory at <path> in the specified format.
  • tail: Displays the last kilobyte of the file to stdout.
  2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 29
  • 30. COMMAND LINE INTERFACE
  • test: Usage: hadoop fs -test -[defsz] URI; tests whether the path is a directory, exists, is a file, is non-empty, or has zero length.
  • text: Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.
  • touchz: Create a file of zero length.
  • truncate: Truncate all files that match the specified file pattern to the specified length.
  • usage: Return the help for an individual command.
  A short usage sketch with a few of these commands follows.
  2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 30
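  A minimal usage sketch combining a few of the commands above; all paths are hypothetical (hadoop fs can be used interchangeably with hdfs dfs):
    hdfs dfs -mkdir -p /user/hadoop/input
    hdfs dfs -put localfile.txt /user/hadoop/input
    hdfs dfs -ls /user/hadoop/input
    hdfs dfs -cat /user/hadoop/input/localfile.txt
    hdfs dfs -setrep -w 2 /user/hadoop/input/localfile.txt
    hdfs dfs -get /user/hadoop/input/localfile.txt ./copy.txt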
  • 31. HADOOP FILE SYSTEM INTERFACES • Typical features offered by HDFS file system interfaces are: • Create new file • Upload files/folders • Set permissions • Copy • Move • Rename • Delete • Drag and drop • HDFS file viewer 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 31
  • 32. DATA FLOW • MapReduce is used to process huge amounts of data. To handle the incoming data in a parallel and distributed form, the data flows through the following phases: • Input Reader: The input reader reads the incoming data and splits it into data blocks of the appropriate size (typically 64 MB to 128 MB, matching the HDFS block size). Once the input reader has read the data, it generates the corresponding key-value pairs. The input files reside in HDFS. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 32
  • 33. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 33 DATA FLOW
  • Map Function: The map function processes the incoming key-value pairs and generates the corresponding output key-value pairs. The map input and output types may differ from each other.
  • Partition Function: The partition function assigns the output of each map function to the appropriate reducer. Given a key and value (and the number of reducers), it returns the index of the target reducer.
  • Shuffling and Sorting: The data is shuffled between nodes so that the map output is redistributed to the reducers responsible for it, and the input to each reduce function is sorted by key.
  • Reduce Function: The reduce function is invoked once for each unique key, with keys processed in sorted order. It iterates over the values associated with the key and generates the corresponding output.
  • Output Writer: Once the data has flowed through all the above phases, the output writer executes. Its role is to write the reduce output to stable storage (typically HDFS). A minimal map/reduce sketch is shown after this list.
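  The map and reduce phases above can be made concrete with a minimal Java word-count sketch (class names are illustrative; partitioning, shuffling and sorting are handled by the framework between the two classes):
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map function: for every word in an input line, emit the pair (word, 1).
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce function: called once per unique key (in sorted order); sums the counts.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }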
  • 34. DATA INGEST: FLUME AND SQOOP • Hadoop data ingestion is the beginning of the input data pipeline in a data lake. • It means taking data from various silo databases and files and putting it into Hadoop. • For many companies, it turns out to be an intricate task, which is why they can take more than a year to ingest all their data into the Hadoop data lake. • The reason is that, as Hadoop is open source, there are a variety of ways to ingest data into Hadoop. • It gives every developer the choice of using her/his favourite tool or language to ingest data into Hadoop. • While choosing a tool or technology, developers stress performance, but this variety makes governance very complicated. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 34
  • 35. FLUME Flume: • Apache Flume is a service designed for streaming logs into the Hadoop environment. • Flume is a distributed and reliable service for collecting and aggregating huge amounts of log data. • With a simple and easy-to-use architecture based on streaming data flows, it also has tunable reliability mechanisms and several recovery and failover mechanisms. A minimal agent configuration is sketched below. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 35
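  A minimal sketch of a Flume agent configuration (a properties file) that tails an application log into HDFS; the agent name "agent", the log path and the NameNode address are hypothetical:
    # Name the source, channel and sink of this agent
    agent.sources = r1
    agent.channels = c1
    agent.sinks = k1
    # Source: tail an application log file
    agent.sources.r1.type = exec
    agent.sources.r1.command = tail -F /var/log/app.log
    agent.sources.r1.channels = c1
    # Channel: buffer events in memory
    agent.channels.c1.type = memory
    agent.channels.c1.capacity = 1000
    # Sink: write events into HDFS
    agent.sinks.k1.type = hdfs
    agent.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/logs
    agent.sinks.k1.channel = c1
  The agent would then be started with the flume-ng agent command, pointing it at this properties file and at the agent name.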
  • 36. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 36 SQOOP Sqoop: • Apache Sqoop (SQL-to-Hadoop) is a lifesaver for anyone who is experiencing difficulties in moving data from a data warehouse into the Hadoop environment. • Apache Sqoop is an effective Hadoop tool used for importing data from RDBMSs such as MySQL, Oracle, etc. into HBase, Hive or HDFS. • Sqoop can also be used for exporting data from HDFS into an RDBMS. • Apache Sqoop is a command-line interpreter, i.e. the Sqoop commands are executed one at a time by the interpreter. Example import and export commands are sketched below.
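  Hedged example commands; the JDBC URL, credentials, table names and HDFS paths are hypothetical:
    # Import a MySQL table into HDFS
    sqoop import --connect jdbc:mysql://dbhost/sales --username dbuser -P --table orders --target-dir /user/hadoop/orders
    # Export data produced in HDFS back into a MySQL table
    sqoop export --connect jdbc:mysql://dbhost/sales --username dbuser -P --table order_summary --export-dir /user/hadoop/summary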
  • 37. HADOOP ARCHIVES • Hadoop Archive (HAR) is a facility that packs many small files into larger archive files so that the memory of the name node is not wasted. • The name node stores the metadata information of the HDFS data. • If 1 GB of data is stored as 1,000 small files, the name node has to store metadata about all those 1,000 files. • In that manner, name node memory is wasted in storing and managing a lot of metadata. • A HAR is created from a collection of files, and the archiving tool runs a MapReduce job. • The map tasks of this job process the input files in parallel to create the archive file. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 37
  • 38. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 38 HADOOP ARCHIVES • Hadoop is designed to deal with large files, so huge numbers of small files are problematic and need to be handled efficiently. When input is stored as many small files spread across the data nodes, metadata for every one of those files has to be kept in the name node, which makes the name node inefficient. • To handle this problem, Hadoop Archives were created, which pack the HDFS files into archives that we can use directly as input to MapReduce jobs. An archive always carries the *.har extension. HAR syntax: • hadoop archive -archiveName NAME -p <parent path> <src>* <dest> Example: hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo The archived files can then be read through the har:// URI scheme, as shown below.
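  Continuing the example above, the contents of the archive can be listed and read through the har:// URI scheme (the file name inside dir1 is hypothetical):
    hdfs dfs -ls -R har:///user/zoo/foo.har
    hdfs dfs -cat har:///user/zoo/foo.har/dir1/somefile.txt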
  • 39. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 39 HADOOP I/O: COMPRESSION • In the Hadoop framework, where large data sets are stored and processed, we need storage for large files. • These files are divided into blocks and those blocks are stored in different nodes across the cluster, so a lot of I/O and network data transfer is also involved. • In order to reduce the storage requirements and the time spent in network transfer, we can use data compression in the Hadoop framework. • Using data compression in Hadoop, we can compress files at various steps; at each of these steps it helps to reduce storage and the quantity of data transferred. • We can compress the input file itself. That helps us reduce storage space in HDFS. • We can also configure the output of a MapReduce job to be compressed. That helps in reducing storage space if the output is archived or sent to another application for further processing. A minimal configuration sketch is shown below.
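  A minimal Java sketch of enabling compression for a MapReduce job's output (the job name is illustrative, and exception handling is omitted):
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "compressed-output-job");
    // Compress the final job output written to HDFS
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    // Also compress intermediate map output to reduce shuffle traffic
    job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);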
  • 40. SERIALIZATION • Serialization refers to the conversion of structured objects into byte streams for transmission over the network or permanent storage on a disk. • Deserialization refers to the conversion of byte streams back into structured objects. Serialization is mainly used in two areas of distributed data processing: • Interprocess communication • Permanent storage We require I/O serialization because: • Records need to be processed faster (time-bound). • A proper data format needs to be maintained and transmitted even when the other end has no schema support. • If data is transmitted without a defined structure or format, complex errors may occur when it is processed later. • Serialization offers data validation over transmission. To serialize data in the proper format, the serialization system must have the following four properties: • Compact - helps in the best use of network bandwidth • Fast - reduces the performance overhead • Extensible - can match new requirements • Interoperable - not language-specific A minimal sketch of Hadoop's Writable serialization follows. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 40
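  A minimal Java sketch of Hadoop's built-in Writable serialization, round-tripping an IntWritable through a byte stream (exception handling omitted):
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import org.apache.hadoop.io.IntWritable;

    // Serialize: structured object -> byte stream
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream dataOut = new DataOutputStream(bytes);
    new IntWritable(163).write(dataOut);
    dataOut.close();

    // Deserialize: byte stream -> structured object
    IntWritable restored = new IntWritable();
    restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    System.out.println(restored.get()); // prints 163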
  • 41. AVRO AND FILE-BASED DATA STRUCTURES • Apache Avro is a language-neutral data serialization system. • Since Hadoop Writable classes lack language portability, Avro becomes quite helpful, as it deals with data formats that can be processed by multiple languages. • Avro is a preferred tool for serializing data in Hadoop. • Avro is a schema-based system: a language-independent schema is associated with its read and write operations. • Avro serializes data together with its schema, into a compact binary format that can be deserialized by any application. • Avro uses JSON to declare the data structures; a small example schema is shown below. • Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 41
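  A small example of an Avro schema declared in JSON (the record and field names are illustrative):
    {
      "type": "record",
      "name": "User",
      "namespace": "example.avro",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"}
      ]
    }
  Any supported language can use this schema to serialize User records into Avro's compact binary format and to read them back.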
  • 42. SECURITY IN HADOOP • Apache Hadoop achieves security by using Kerberos. At a high level, there are three steps that a client must take to access a service when using Kerberos, each of which involves a message exchange with a server: • Authentication - The client authenticates itself to the Authentication Server and receives a timestamped Ticket-Granting Ticket (TGT). • Authorization - The client uses the TGT to request a service ticket from the Ticket-Granting Server. • Service Request - The client uses the service ticket to authenticate itself to the server providing the service. A minimal configuration and usage sketch follows. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 42
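  As a hedged sketch, Kerberos security is typically enabled in core-site.xml and then exercised from a client shell; the realm and principal below are hypothetical:
    <!-- core-site.xml -->
    <property>
      <name>hadoop.security.authentication</name>
      <value>kerberos</value>
    </property>
    <property>
      <name>hadoop.security.authorization</name>
      <value>true</value>
    </property>
    # Client side: obtain the Ticket-Granting Ticket, then access HDFS as usual
    kinit alice@EXAMPLE.COM
    hdfs dfs -ls /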
  • 43. 2023/5/13 SHEAT CSE/Big Data/KCS061/Unit-III/BJ 43 ADMINISTERING HADOOP The person who administers Hadoop is called a Hadoop administrator. Some of the common administration tasks in Hadoop are: • Monitor the health of the cluster • Add new data nodes as needed • Optionally turn on security • Optionally turn on encryption • Recommended, but optional: turn on high availability • Optionally turn on the MapReduce Job History Server • Fix corrupt data blocks when necessary • Tune performance A few commonly used administrative commands are sketched below.
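  A few commonly used administrative commands that map to the tasks above (a sketch; outputs are omitted):
    hdfs dfsadmin -report                 # health, capacity and status of each DataNode
    hdfs dfsadmin -safemode get           # check whether the NameNode is in safe mode
    hdfs fsck / -list-corruptfileblocks   # locate corrupt or missing blocks
    hdfs balancer                         # rebalance blocks after adding new DataNodes
    mapred --daemon start historyserver   # start the MapReduce Job History Server (Hadoop 3.x)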