www.sensaran.wordpress.com
DAY 2 – GETTING STARTED WITH HDFS AND MAPREDUCE
TOPICS
 Describe the use of commodity hardware in Hadoop.
 Explain the various configurations and services of Hadoop.
 Differentiate between a regular file system and the Hadoop Distributed File System (HDFS).
 Explain the HDFS architecture and MapReduce.
TEN STEPS TO INSTALL HADOOP 2.6.0 IN
UBUNTU USING SINGLE NODE CLUSTER
Step 1 – Download the Apache Hadoop 2.6.0 release via https://hadoop.apache.org/docs/r2.6.0 and unzip the file in your home directory.
Step 2 – Download the 64-bit Linux version of Java from http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html and unzip the file in your home directory.
Step 3 – Select the Home button at the top left and press Ctrl+H to view hidden files. Then open .bashrc in edit mode and add the Hadoop and Java configuration as mentioned below.
Step 4 – core-site.xml: edit the following XML and flat files, entering your IP address as mentioned below.
Step 5 – mapred-site.xml
Step 6 – hdfs-site.xml
Step 7 – httpfs-site.xml
Step 8 – yarn-site.xml
Step 9 – Once that is complete, open the terminal, type start-all.sh, and press Enter.
Step 10 – To check that all the nodes have started properly, type jps.
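As a concrete sketch of Steps 3 and 4 (the install paths, JDK folder name, and address below are assumptions for a single-node setup; adjust them to wherever you unzipped Hadoop and Java):

```shell
# Step 3 – append to ~/.bashrc (assumed install locations)
export JAVA_HOME=$HOME/jdk1.7.0_79
export HADOOP_HOME=$HOME/hadoop-2.6.0
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# Step 4 – core-site.xml: point the default file system at this node
# (use your IP address instead of localhost on a networked machine)
#   <configuration>
#     <property>
#       <name>fs.defaultFS</name>
#       <value>hdfs://localhost:9000</value>
#     </property>
#   </configuration>
```

After sourcing .bashrc, the hadoop, start-all.sh, and jps commands used in the later steps become available on the PATH.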
HADOOP CLUSTER USING COMMODITY
HARDWARE
 Hadoop supports the concept of distributed architecture.
 The diagram represents the nodes connected and installed
with Hadoop.
 Let us see what a distributed file system is and how it works in Hadoop.
FILE SYSTEM
 A file system is the underlying structure a computer uses to
organize data on a hard disk.
 Without a file system, information placed in a storage area would
be one large body of data with no way to tell where one piece of
information stops and the next begins.
 If you are installing a new hard disk, you need to partition and
format it using a file system before you can begin storing data
or programs.
WHAT IS “DISTRIBUTED”?
 The file system is physically distributed among several machines in the same network.
 Even so, the file system is treated as one single coherent system.
 Data is replicated within the file system for availability and fault tolerance.
WHY DISTRIBUTED? WHY DATA REPLICATION?
 Resource sharing
 Combined computation speed-up
 Reliability (fault tolerant, no single point of failure)
 Location transparency and location independence
 Scaling up to any level
 Good processing capability and availability with cheap hardware
APACHE HADOOP CORE COMPONENTS
HADOOP DISTRIBUTED FS (HDFS)
 HDFS is a file system specially designed for storing huge data sets on a cluster of commodity hardware with a streaming access pattern.
 Just as Java promises "write once, run anywhere", HDFS follows the WORM model – Write Once, Read Many times: once a file is written to HDFS, it is read many times without the data being changed.
 HDFS provides interfaces for applications to move themselves closer
to where the data is located.
 HDFS supports a traditional hierarchical file organization. A user or an
application can create directories and store files inside these
directories.
 Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
FILE SYSTEM Vs HDFS

Regular file system:
 Each block of data is small; typically 512 bytes to 4 KB.
 Large data access suffers from disk I/O overhead, mainly because of multiple seek operations.

HDFS:
 Each block of data is very large; 64 MB by default (128 MB from Hadoop 2.x onward).
 Reads huge data sequentially after a single seek.
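The comparison above can be sketched with back-of-the-envelope arithmetic. The seek time and transfer rate below are illustrative assumptions (roughly 10 ms per seek, 100 MB/s sequential transfer), not measurements:

```python
def read_time_seconds(total_mb, block_mb, seek_ms=10.0, transfer_mb_per_s=100.0):
    """Rough time to read total_mb of data stored in block_mb blocks:
    one disk seek per block plus the sequential transfer time."""
    seeks = total_mb / block_mb
    return seeks * seek_ms / 1000.0 + total_mb / transfer_mb_per_s

# Reading 1 GB in 4 KB blocks vs in 64 MB blocks:
small_blocks = read_time_seconds(1024, 4 / 1024)  # dominated by seeking
large_blocks = read_time_seconds(1024, 64)        # dominated by transfer
```

With tiny blocks the seek time swamps everything; with HDFS-sized blocks the read is essentially one long sequential transfer.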
HDFS - CHARACTERISTICS
 High fault-tolerance.
 High throughput.
 Suitable for applications with large data sets.
 Suitable for applications with streaming access to file system data.
 Can be built on commodity hardware and heterogeneous platforms.
HDFS Vs GFS

Hadoop Distributed File System (HDFS) | Google File System (GFS)
Cross platform | Linux
Developed in a Java environment | Developed in a C/C++ environment
First developed at Yahoo; now an open-source framework | Developed by Google
Has a Name Node and Data Nodes | Has a Master node and Chunk servers
128 MB is the default block size (64 MB before Hadoop 2.x) | 64 MB is the default chunk size
The Name Node receives heartbeats from the Data Nodes | The Master node receives heartbeats from the Chunk servers
Commodity hardware is used | Commodity hardware is used
WORM – Write Once and Read Many times | Multiple-writer, multiple-reader model
Deleted files are renamed into a particular folder and then removed by garbage collection | Deleted files are not reclaimed immediately; they are renamed into a hidden namespace and deleted after three days if not in use
No network stack issue | Network stack issue
Journal / edit log | Operation log
Only append is possible | Random file writes are possible
LET US SEE WHAT HDFS IS AND HOW IT IS USED
WHEN TO USE HDFS
 HDFS is designed for storing very large files, with streaming data access patterns, on clusters of commodity hardware.
 What does "very large" mean? Files of hundreds of megabytes, gigabytes, or terabytes.
 Streaming data access follows the WORM – Write Once, Read Many – pattern: a dataset is typically generated or copied from a source once, then analysed many times.
 Commodity hardware: HDFS does not require expensive, highly reliable hardware.
WHEN NOT TO USE HDFS
 Lots of small files? Not advisable: the HDFS Name Node holds the metadata for every file in memory, so a huge number of small files strains it.
 Multiple writes are not allowed: a file in HDFS may be written to by a single writer; there is no support for multiple writers.
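The small-files problem can be made concrete with a common rule of thumb (an approximation, not an exact figure): each file and each block costs the Name Node roughly 150 bytes of heap.

```python
def namenode_heap_mb(num_files, blocks_per_file=1, bytes_per_object=150):
    """Approximate Name Node heap used: one metadata object per file
    plus one per block, at roughly bytes_per_object bytes each."""
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object / (1024 * 1024)

# ~10 TB stored as ten million 1 MB files vs 78,125 files of 128 MB
# (one block each) – the same data, very different metadata cost:
many_small = namenode_heap_mb(10_000_000)  # several GB of Name Node heap
few_large = namenode_heap_mb(78_125)       # a few tens of MB
```

The same volume of data costs the Name Node over a hundred times more memory when it arrives as millions of tiny files.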
HADOOP CORE SERVICES
HDFS ARCHITECTURE
FUNCTION OF HDFS COMPONENTS
 Name Node vs Data Node (master / slave).
 Job Tracker vs Task Tracker (master / slave).

NAME NODE ( SINGLE INSTANCE ) :
 The Name Node contains the file system namespace, i.e. the metadata.
 Any change in the file system or storage pattern is tracked by the Name Node; for example, if a file is deleted from HDFS, or changed or modified in any way, the Name Node records the change in its edit log.
 It initiates the Data Nodes to perform the actual actions.
 It maintains a record of how each file in HDFS is split and stored.
 It receives heartbeats and block reports from the Data Nodes; based on these, replication is managed.

SECONDARY NAME NODE ( SINGLE INSTANCE ) :
 It keeps a checkpoint of the namespace image by merging in the edit log, which speeds up Name Node recovery; note that it is a checkpointing helper, not a hot standby.
DATA NODE ( MULTIPLE INSTANCE )
 The Data-node is responsible for storing the files in HDFS.
 It manages the file blocks within the node. It sends information
to the Name Node about the files and blocks stored in that node
and responds to the Name Node for all file system operations.
 Data nodes send heartbeats to the Name Node once every 3
seconds, to report the overall health of HDFS.
 Data Nodes also enable pipelining of data, forwarding data to other nodes.
 The data nodes can talk to each other to rebalance data, move
and copy data around and keep the replication high.
DATA BLOCKS
 Each file is split into one or more blocks that are stored and replicated on Data Nodes.
 Each block is typically 64 MB or 128 MB in size.
 Each block is replicated multiple times; the default is to replicate each block three times, and replicas are stored on different nodes.
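Block splitting and replication can be worked through numerically. The 128 MB block size and replication factor 3 below are the usual defaults, but both are configurable per cluster and per file:

```python
import math

def block_layout(file_mb, block_mb=128, replication=3):
    """Number of HDFS blocks for a file and the raw storage consumed.
    The last block only occupies its actual size, so raw storage is
    roughly the file size times the replication factor."""
    blocks = math.ceil(file_mb / block_mb)
    return blocks, file_mb * replication

# A 700 MB file: 6 blocks (five full 128 MB blocks plus a 60 MB tail),
# consuming about 2100 MB of raw disk across the cluster.
blocks, raw_mb = block_layout(700)
```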
JOB TRACKER & TASK TRACKER
JOB TRACKER:
The Job Tracker is the service within Hadoop that farms out Map
Reduce tasks to specific nodes in the cluster, ideally the nodes that have
the data, or at least are in the same rack.
JOB TRACKER
 The Job Tracker will communicate with Name Node to determine
the location of the data.
 The Job Tracker locates Task Tracker nodes with available slots at or
near the data.
 The Job Tracker submits the work to the chosen Task Tracker nodes and monitors how it is carried out.
 If a Task Tracker does not submit heartbeat signals often enough, it is deemed to have failed and its work is scheduled on a different Task Tracker.
 A Task Tracker will notify the Job Tracker when a task fails.
 Task Trackers run on Data Nodes; they run the tasks and report the status of each task to the Job Tracker.
 The Job Tracker runs on Master Node aka Name Node whereas Task
Trackers run on Data Nodes.
 Mapper and Reducer tasks are executed on DataNodes administered
by Task Trackers.
 Task Trackers will be assigned Mapper and Reducer tasks to execute
by Job Tracker.
 Task Tracker will be in constant communication with the Job Tracker
signaling the progress of the task in execution
TASK TRACKER
FS IMAGE IN HADOOP
 In Hadoop, the Name Node creates the fsimage file, which stores the details of the namespace, i.e. the mapping of blocks to files and the file system properties.
 The fsimage is stored both in memory and on local disk. When Hadoop restarts, the fsimage is loaded into memory; during run time, all operations are recorded in the edit log.
 The in-memory state is always synchronized with the Name Node, so there is no need to copy the fsimage and log file from the Name Node.
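The fsimage/edit-log relationship is essentially a checkpoint plus a replayable log. The toy model below (paths, block ids, and operation names are invented for illustration) shows how replaying the edit log on top of an old fsimage reconstructs the current namespace:

```python
def replay(fsimage, edit_log):
    """Apply logged operations to a checkpointed namespace
    (a dict of path -> list of block ids) to rebuild the current state."""
    namespace = dict(fsimage)  # start from the last checkpoint
    for op, path, *args in edit_log:
        if op == "create":
            namespace[path] = args[0]   # the new file's block list
        elif op == "delete":
            namespace.pop(path, None)
    return namespace

fsimage = {"/a.txt": ["blk_1"]}
edits = [("create", "/b.txt", ["blk_2", "blk_3"]), ("delete", "/a.txt")]
current = replay(fsimage, edits)  # {'/b.txt': ['blk_2', 'blk_3']}
```

This is also why a huge edit log slows restarts, and why the secondary Name Node's periodic merge (checkpoint) helps.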
CHALLENGES
 The edit log file may grow drastically, which is challenging to manage.
 Name Node restarts take longer because a lot of changes need to be merged.
 In the case of a crash, we lose a huge amount of metadata, since the fsimage is very old.
 Therefore, restarting the Name Node takes even longer.
 The secondary Name Node is the solution to this issue: it is another machine with connectivity to the Name Node.
HDFS COMMANDS
 How to upload files/directories to HDFS:
hadoop fs -put <file or dir> <hdfs path>
 Copying a file from a local directory to an HDFS location; in the example below we copy Sample.txt to the HDFS root:
hadoop fs -put Sample.txt /
 How to list the files/directories in an HDFS location:
hadoop fs -ls /
 How to count the number of files in an HDFS folder:
hadoop fs -count /
 Remove a file or directory in HDFS
hadoop fs -rm <HDFSpath>
hadoop fs -rmr <HDFSpath>
hadoop fs -rm /testfile
hadoop fs -rmr /file_or_folder
 How to view the content of files
hadoop fs -cat <path>
hadoop fs -cat /Sample.txt
 How to display the last few lines of file in HDFS
hadoop fs -tail <path>
hadoop fs -tail /Sample.txt
 How to create a directory in HDFS
hadoop fs -mkdir <HDFS dir Name>
hadoop fs -mkdir /HDFSfolder
 How to move a file/directory within HDFS:
hadoop fs -mv <source HDFS path> <destination HDFS path>
hadoop fs -mv /samplefile /testfolder/samplefile
hadoop fs -ls /
hadoop fs -ls /testfolder
 How to download a file from HDFS to local
hadoop fs -get <hdfs path> <local path>
hadoop fs -get /samplefile .
 How to copy a file/directory within HDFS
hadoop fs -cp <hdfs path> <Hdfs files to copy mention location>
hadoop fs -cp /samplefile /copyHDFSfolder/samplefile
hadoop fs -ls /copyHDFSfolder
HDFS QUIZ
1 . What are the two major components of Hadoop cluster ?
a) Hadoop file system, TaskTracker
b) MapReduce, Hadoop Distributed file system
c) JobTracker, MapReduce
d) JobTracker, TaskTracker
2 . Which of the following services run in the Master node of Apache
Hadoop in cluster mode (fully distributed mode) ?
a) JobTracker
b) TaskTracker
c) JobTracker, MapReduce
d) JobTracker, TaskTracker
3 . Which are the single instance critical tasks?
a) NameNode, DataNode
b) NameNode, Secondary NameNode
c) JobTracker, DataNode
d) JobTracker, NameNode
4 . Which of the following services are used by MapReduce programs ?
a) JobTracker and TaskTracker
b) Secondary NameNode
c) TaskTracker
d) NameNode
5. The NameNode uses RAM for the following purpose
a) To store the file contents in HDFS
b) To store filenames, list of blocks, and other meta information
c) To store the edits log that keeps track of changes in HDFS
d) To manage distributed read and write locks on files in HDFS
6. Which of the following commands is used to start all Hadoop
services?
a) start-dfs.sh
b) stop-dfs.sh
c) start-all.sh
d) stop-all.sh
7. Which of the following files in Hadoop configuration is
responsible for maintaining information about the JobTracker?
a) core-site.xml
b) mapred-site.xml
c) hadoop-env.sh
d) slaves file
8. Which of the following commands helps to list a file or directory?
a) ls
b) ls -a
c) ls -l
d) All of the above
WHAT IS MAP REDUCE?
 MapReduce is a programming model and software framework first developed by Google (Google's MapReduce paper was published in 2004).
 It simplifies the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.
 It handles both structured and unstructured data.
MAP REDUCE
The key reason to perform mapping and reducing is to speed up the
execution of a specific process by splitting the process into a number of
tasks, thus enabling parallel work.
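The speed-up can be sketched with rough arithmetic (the figures are purely illustrative): if the map work divides evenly across workers while the reduce step runs afterwards, total time shrinks roughly as:

```python
def job_time(map_seconds, reduce_seconds, workers):
    """Rough model: map work divides across workers;
    the reduce step runs after all maps finish."""
    return map_seconds / workers + reduce_seconds

serial = job_time(600, 60, 1)     # 660 s on one worker
parallel = job_time(600, 60, 10)  # 120 s on ten workers
```

In practice the shuffle and any stragglers add overhead, so real speed-ups are somewhat smaller than this ideal.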
MAP REDUCE EXECUTION STEPS
 JobTracker and the TaskTracker are responsible for performing Map
Reduce operations.
 Generally, the JobTracker runs on the master node of the Hadoop cluster and the TaskTracker service runs on the slave nodes.
 The JobTracker service is responsible for assigning the jobs to the
DataNode. The DataNode consists of the TaskTracker which performs
the tasks that are submitted by the JobTracker and provides the
results back to the JobTracker.
MAP REDUCE ENGINE
 Map Step
– Master node takes large problem input and slices it into
smaller sub problems; distributes these to worker nodes.
– Worker node may do this again; leads to a multilevel
tree structure.
– Worker processes smaller problem and hands
back to master.
 Reduce step
– Master node takes the answers to the sub problems and
combines them in a predefined way to get the
output/answer to original problem.
MAP REDUCE STEPS
MAP REDUCE EXAMPLE – WORD COUNT
Let us take a file which has two blocks of data: BLOCK 1 has some text data and BLOCK 2 has some more text data.
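Since the slide's word-count illustration does not survive in text form, here is a minimal sketch of the same idea (the sample text is invented; this is a local simulation of the MapReduce style, not Hadoop API code): the map step emits a (word, 1) pair per word in each block, a shuffle groups the pairs by key, and the reduce step sums each group.

```python
from collections import defaultdict

def map_phase(block_text):
    """Map: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for word in block_text.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two "blocks" processed independently by the map step:
pairs = map_phase("Deer Bear River") + map_phase("Car Car River")
counts = reduce_phase(shuffle(pairs))
# counts == {'deer': 1, 'bear': 1, 'river': 2, 'car': 2}
```

In Hadoop each map runs on the Data Node holding its block, and the shuffle happens over the network between the map and reduce tasks.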
CHARACTERISTICS OF MAP REDUCE
 It handles very large-scale data: petabytes, exabytes, and so on.
 It works well on Write Once, Read Many (WORM) data.
 It allows parallelism without mutexes.
 The runtime takes care of splitting and moving data for operations.
 Operations are provisioned near the data, i.e. data locality is preferred.
 Commodity hardware and storage are leveraged.
REAL-TIME USES OF MAPREDUCE
 Simple algorithms such as grep, text indexing, and reverse indexing.
 Search-engine operations such as keyword indexing, ad rendering, and page rank.
 Enterprise analytics.
 Data-intensive computing such as sorting.
MAP REDUCE QUIZ
1 . Who executes the map job?
a) TaskTracker
b) NameNode server
c) JobTracker
d) Secondary NameNode server
2. What is the sequence of Map Reduce Job ?
a) Map, input split, reduce
b) Input split, map, reduce
c) Map and then reduce
d) Map, split, map
3 . Which one is the responsibility of the Map Reduce developer ?
a) Map, reduce
b) Reduce, input split
c) Input split, map
d) Sort, shuffle
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Onlineanilsa9823
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 

Recently uploaded (20)

Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 

Getting Started with HDFS and MapReduce

  • 2. www.sensaran.wordpress.com  Describe the use of Hadoop in commodity hardware.  Explain the various configurations and services of Hadoop.  Differentiate between regular file system and Hadoop Distributed File System (HDFS).  Explain HDFS architecture and Map Reduce TOPICS
  • 3. TEN STEPS TO INSTALL HADOOP 2.6.0 IN UBUNTU USING SINGLE NODE CLUSTER
  • 4. Step 1 – Download Apache Hadoop 2.6.0 from https://hadoop.apache.org/docs/r2.6.0 and unzip the file in the home directory. Step 2 – Download the 64-bit Linux build of the Java JDK from http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html and unzip the file in the home directory.
  • 5. Step 3 – Select the Home button at the top left and press Ctrl+H to view the hidden files. Then open .bashrc in edit mode and add the Hadoop and Java configuration as mentioned below.
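A minimal sketch of the .bashrc additions, assuming Hadoop 2.6.0 and JDK 7 were both unzipped directly into the home directory (the folder names below are illustrative and should match your actual unzipped archives):

```shell
# Illustrative .bashrc entries for a single-node setup.
# Adjust JAVA_HOME and HADOOP_HOME to the folders you actually unzipped.
export JAVA_HOME="$HOME/jdk1.7.0_80"
export HADOOP_HOME="$HOME/hadoop-2.6.0"
export PATH="$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
```

After editing, run `source ~/.bashrc` so the current shell picks up the new variables.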
  • 6. Step 4 – core-site.xml. Edit the following XML and flat files, entering your IP address as mentioned below.
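As a sketch, a minimal core-site.xml for a single-node cluster sets the default file system URI; replace localhost with your own IP address if needed (port 9000 is a common choice, not mandatory):

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```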
  • 7. Step 5 – mapred-site.xml
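A minimal mapred-site.xml for Hadoop 2.x typically tells MapReduce to run on YARN:

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```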
  • 8. Step 6 – hdfs-site.xml
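For a single-node cluster, hdfs-site.xml usually lowers the replication factor to 1, since there is only one DataNode to hold replicas:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```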
  • 9. Step 7 – httpfs-site.xml
  • 10. Step 8 – yarn-site.xml
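A minimal yarn-site.xml enables the MapReduce shuffle auxiliary service on the NodeManager:

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```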
  • 11. Step 9 – Once the configuration is complete, open the terminal, type start-all.sh, and press Enter.
  • 12. Step 10 – To check that all the nodes have started properly, type jps; it should list the running daemons (e.g. NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager).
  • 13. HADOOP CLUSTER USING COMMODITY HARDWARE  Hadoop supports the concept of distributed architecture.  The diagram represents the nodes connected and installed with Hadoop.  Let us see what a distributed file system is and how it works in Hadoop. www.sensaran.wordpress.com
  • 14. FILE SYSTEM  A file system is the underlying structure a computer uses to organize data on a hard disk.  Without a file system, information placed in a storage area would be one large body of data with no way to tell where one piece of information stops and the next begins.  If you are installing a new hard disk, you need to partition and format it using a file system before you can begin storing data or programs. www.sensaran.wordpress.com
  • 15. WHAT IS “DISTRIBUTED”?  The file system is physically distributed among several machines in the same network.  Yet the file system is treated as one single coherent system.  Data is replicated within the file system for availability and fault tolerance. www.sensaran.wordpress.com
  • 16. WHY DISTRIBUTED? WHY DATA REPLICATION?  Resource sharing  Combined computation speedup  Reliability (fault tolerance, no single point of failure)  Location transparency and location independence  Scaling up to any level  Best processing capability and availability with cheap hardware www.sensaran.wordpress.com
  • 17. 17 APACHE HADOOP CORE COMPONENTS www.sensaran.wordpress.com
  • 18. HADOOP DISTRIBUTED FS (HDFS) www.sensaran.wordpress.com  HDFS is a file system specially designed for storing huge data sets on clusters of commodity hardware with a streaming access pattern.  Just as Java is “write once, run anywhere”, Hadoop follows the WORM concept – Write Once, Read Many times: once a file has been written to HDFS, it is read many times without changing the data.
  • 19. HADOOP DISTRIBUTED FS (HDFS) www.sensaran.wordpress.com  HDFS provides interfaces for applications to move themselves closer to where the data is located.  HDFS supports a traditional hierarchical file organization: a user or an application can create directories and store files inside these directories.  Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
  • 20. FILE SYSTEM Vs HDFS www.sensaran.wordpress.com
File System | HDFS
Each block of data is small in size (typically a few kilobytes) | Each block of data is very large in size (64 MB by default; 128 MB in Hadoop 2.x)
Large data access suffers from disk I/O problems, mainly because of multiple seek operations | Reads huge data sequentially after a single seek
  • 21. HDFS - CHARACTERISTICS www.sensaran.wordpress.com  High fault-tolerance.  High throughput.  Suitable for applications with large data sets.  Suitable for applications with streaming access to file system data.  Can be built on commodity hardware and heterogeneous platforms.
  • 22. HDFS Vs GFS www.sensaran.wordpress.com
Hadoop Distributed File System (HDFS) | Google File System (GFS)
Cross platform | Linux
Developed in a Java environment | Developed in a C/C++ environment
Originally developed at Yahoo; now an open-source framework | Developed by Google
It has a Name Node and Data Nodes | It has a master node and chunk servers
128 MB is the default block size | 64 MB is the default block size
The Name Node receives heartbeats from Data Nodes | The master node receives heartbeats from chunk servers
Commodity hardware is used | Commodity hardware is used
  • 23. www.sensaran.wordpress.com
HDFS | GFS
WORM – Write Once, Read Many times | Multiple-writer, multiple-reader model
Deleted files are renamed into a particular folder and then removed via garbage collection | Deleted files are not reclaimed immediately; they are renamed into a hidden namespace and deleted after three days if not in use
No network stack issue | Network stack issue
Journal, edit log | Operational log
Only append is possible | Random file writes are possible
  • 24. www.sensaran.wordpress.com LET US SEE WHAT HDFS IS AND HOW IT IS USED
  • 25. www.sensaran.wordpress.com WHEN TO USE HDFS  HDFS is designed for storing very large files with streaming data access patterns on clusters of commodity hardware.  Very large files means files of hundreds of megabytes, gigabytes, or terabytes.  Streaming data access: the WORM – Write Once, Read Many – pattern; a dataset is typically generated or copied from a source.  Commodity hardware: it does not require expensive, highly reliable hardware.
  • 26. www.sensaran.wordpress.com WHEN NOT TO USE HDFS  Lots of small files: storing a large number of small files is not advisable, because the Name Node holds metadata for every file in memory.  Multiple writers are not allowed: a file in HDFS is written by a single writer; there is no support for multiple writers.
  • 29. www.sensaran.wordpress.com Function of HDFS Components  Name Node vs Data Node.  Job Tracker vs Task Tracker. Master / Slave
  • 30. SECONDARY NAME NODE ( SINGLE INSTANCE ) : It supports the Name Node server by keeping the namespace image up to date, periodically merging in the edit log (checkpointing).
  • 31. NAME NODE ( SINGLE INSTANCE ) :  The Name Node contains the file system namespace, i.e. metadata.  Any change in the file system or the storage pattern is tracked in the Name Node: for example, if a file is deleted from HDFS, or changed or modified in any way, the Name Node records the change in its edit log.  It initiates the Data Nodes to perform the actions.  It maintains a record of how the files in HDFS are split and stored.  It receives heartbeats and block reports from the Data Nodes; based on these, communication and replication take place.
  • 32. DATA NODE ( MULTIPLE INSTANCES )  The Data Node is responsible for storing the files in HDFS.  It manages the file blocks within the node, sends information to the Name Node about the files and blocks stored in that node, and responds to the Name Node for all file system operations.  Data Nodes send heartbeats to the Name Node once every 3 seconds to report the overall health of HDFS.  Data Nodes also enable pipelining of data and forward data to other nodes.  Data Nodes can talk to each other to rebalance data, move and copy data around, and keep replication high.
  • 33. www.sensaran.wordpress.com DATA BLOCKS  Each file is split into one or more blocks that are stored and replicated on Data Nodes.  Each block is typically 64 MB or 128 MB in size.  Each block is replicated multiple times; the default is to replicate each block three times, and replicas are stored on different nodes.
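The block arithmetic above can be sketched in a few lines of Python; the sizes used are the usual defaults (128 MB blocks in Hadoop 2.x, replication factor 3) and are purely illustrative:

```python
import math

BLOCK_SIZE_MB = 128        # HDFS default block size in Hadoop 2.x
REPLICATION_FACTOR = 3     # default number of replicas per block

def hdfs_blocks(file_size_mb):
    """Return (number of blocks, total replicas stored) for a file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * REPLICATION_FACTOR

# A 500 MB file occupies 4 blocks, stored as 12 replicas cluster-wide.
blocks, replicas = hdfs_blocks(500)
```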
  • 34. JOB TRACKER & TASK TRACKER JOB TRACKER: The Job Tracker is the service within Hadoop that farms out Map Reduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.
  • 35. JOB TRACKER www.sensaran.wordpress.com  The Job Tracker communicates with the Name Node to determine the location of the data.  The Job Tracker locates Task Tracker nodes with available slots at or near the data.  The Job Tracker submits the work to the chosen Task Tracker nodes and monitors how it progresses.  If a Task Tracker does not send heartbeat signals often enough, it is deemed to have failed and the work is scheduled on a different Task Tracker.  A Task Tracker notifies the Job Tracker when a task fails.
  • 36. www.sensaran.wordpress.com TASK TRACKER  Task Trackers run on Data Nodes; they run the tasks and report the status of each task to the Job Tracker.  The Job Tracker runs on the master node (the Name Node machine), whereas Task Trackers run on Data Nodes.  Mapper and Reducer tasks are executed on Data Nodes administered by Task Trackers.  Task Trackers are assigned Mapper and Reducer tasks to execute by the Job Tracker.  A Task Tracker is in constant communication with the Job Tracker, signaling the progress of the task in execution.
  • 38. FS IMAGE IN HADOOP?  In Hadoop, the Name Node creates the fsimage file, which stores the namespace details, i.e. the mapping of blocks to files and the file system properties.  This is stored both in memory and on local disk; when Hadoop restarts, it is loaded into memory, and during runtime all operations are recorded in the edit log.  It is always synchronized with the Name Node, so there is no need to copy the fsimage and log file from the Name Node. www.sensaran.wordpress.com
  • 39. CHALLENGES  The edit log file may grow drastically, which is challenging to manage.  Name Node restarts take longer because a lot of changes need to be merged.  In the case of a crash, we will lose a huge amount of metadata, since the fsimage is very old.  Therefore, restarting the Name Node takes even longer.  The secondary Name Node is the solution to this issue: it is another machine with connectivity to the Name Node. www.sensaran.wordpress.com
  • 40. www.sensaran.wordpress.com HDFS COMMANDS  How to upload a file/directory to HDFS hadoop fs -put <file or dir> <hdfs path>  Copying a file from a local directory to an HDFS location; in the example below we copy Sample.txt to HDFS: hadoop fs -put Sample.txt /  How to list the files/directories available at an HDFS location hadoop fs -ls /  How to count the number of files in an HDFS folder hadoop fs -count /
  • 41. HDFS COMMANDS www.sensaran.wordpress.com  Remove a file or directory in HDFS hadoop fs -rm <HDFSpath> hadoop fs -rmr <HDFSpath> hadoop fs -rm /testfile hadoop fs -rmr /file_or_folder  How to view the content of files hadoop fs -cat <path> hadoop fs -cat /Sample.txt  How to display the last few lines of file in HDFS hadoop fs -tail <path> hadoop fs -tail /Sample.txt
  • 42. www.sensaran.wordpress.com HDFS COMMANDS  How to create a directory in HDFS hadoop fs -mkdir <HDFS dir name> hadoop fs -mkdir /HDFSfolder  How to move a file/directory within HDFS hadoop fs -mv <HDFS source path> <HDFS destination path> hadoop fs -mv /samplefile /testfolder/samplefile hadoop fs -ls / hadoop fs -ls /testfolder
  • 43. www.sensaran.wordpress.com  How to download a file from HDFS to local hadoop fs -get <hdfs path> <local path> hadoop fs -get /samplefile .  How to copy a file/directory within HDFS hadoop fs -cp <hdfs path> <Hdfs files to copy mention location> hadoop fs -cp /samplefile /copyHDFSfolder/samplefile hadoop fs -ls /copyHDFSfolder HDFS COMMANDS
  • 45. www.sensaran.wordpress.com 1 . What are the two major components of Hadoop cluster ? a) Hadoop file system, TaskTracker b) MapReduce, Hadoop Distributed file system c) JobTracker, MapReduce d) JobTracker, TaskTracker
  • 46. www.sensaran.wordpress.com 2 . Which of the following services run in the Master node of Apache Hadoop in cluster mode (fully distributed mode) ? a) JobTracker b) TaskTracker c) JobTracker, MapReduce d) JobTracker, TaskTracker
  • 47. www.sensaran.wordpress.com 3 . Which are the single instance critical tasks? a) NameNode, DataNode b) NameNode, Secondary NameNode c) JobTracker, DataNode d) JobTracker, NameNode
  • 48. www.sensaran.wordpress.com 4 . Which of the following services are used by MapReduce programs ? a) JobTracker and TaskTracker b) Secondary NameNode c) TaskTracker d) NameNode
  • 49. www.sensaran.wordpress.com 5. The NameNode uses RAM for the following purpose a) To store the file contents in HDFS b) To store filenames, list of blocks, and other meta information c) To store the edits log that keeps track of changes in HDFS d) To manage distributed read and write locks on files in HDFS
  • 50. www.sensaran.wordpress.com 6. Which of the following commands is used to start all Hadoop services? a) start-dfs.sh b) stop-dfs.sh c) start-all.sh d) stop-all.sh
  • 51. www.sensaran.wordpress.com 7. Which of the following files in Hadoop configuration is responsible for maintaining information about the JobTracker? a) core-site.xml b) mapred-site.xml c) hadoop-env.sh d) slaves file
  • 52. www.sensaran.wordpress.com 8. Which of the following commands helps to list a file or directory? a) ls b) ls -a c) ls -l d) All the above
  • 54. www.sensaran.wordpress.com WHAT IS MAP REDUCE ?  MapReduce is a programming model and software framework first developed by Google (Google’s MapReduce paper was published in 2004).  It simplifies the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.  It handles both structured and unstructured data.
  • 55. www.sensaran.wordpress.com MAP REDUCE The key reason to perform mapping and reducing is to speed up the execution of a specific process by splitting the process into a number of tasks, thus enabling parallel work.
  • 57. www.sensaran.wordpress.com MAP REDUCE ENGINE  The JobTracker and the TaskTracker are responsible for performing MapReduce operations.  Generally, the JobTracker is present on the master node of Hadoop and the TaskTracker service on the slave nodes.  The JobTracker service is responsible for assigning jobs to the DataNodes. Each DataNode runs a TaskTracker, which performs the tasks submitted by the JobTracker and provides the results back to the JobTracker.
  • 58. www.sensaran.wordpress.com MAP REDUCE STEPS  Map step – The master node takes a large problem input and slices it into smaller sub-problems, distributing these to worker nodes. – A worker node may do this again, leading to a multilevel tree structure. – A worker processes its smaller problem and hands the result back to the master.  Reduce step – The master node takes the answers to the sub-problems and combines them in a predefined way to get the output/answer to the original problem.
  • 59. www.sensaran.wordpress.com MAP REDUCE EXAMPLE – WORD COUNT Let us take a file which has 2 blocks of data: BLOCK1 has some text data and BLOCK2 has some more text data.
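The word-count example can be simulated in plain Python: each block is fed through a hypothetical mapper that emits (word, 1) pairs, and a reducer sums the counts per key. This is a sketch of the programming model, not the actual Hadoop API:

```python
from collections import defaultdict

def mapper(block):
    # Map step: emit a (word, 1) pair for every word in the input split.
    for word in block.split():
        yield (word.lower(), 1)

def reducer(pairs):
    # Reduce step: sum the counts grouped by word key.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# Two "blocks" of one file, mapped independently as in HDFS splits.
block1 = "Hadoop is a framework"
block2 = "Hadoop is fault tolerant"
intermediate = list(mapper(block1)) + list(mapper(block2))
word_counts = reducer(intermediate)
```

In a real cluster the two mapper calls would run in parallel on different DataNodes, and the shuffle phase would group the pairs by key before the reducer runs.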
  • 61. www.sensaran.wordpress.com CHARACTERISTICS OF MAP REDUCE  It handles very large-scale data: petabytes, exabytes, and so on.  It works well on write once, read many (WORM) data.  It allows parallelism without mutexes.  The runtime takes care of splitting and moving data for operations.  Operations are provisioned near the data, i.e., data locality is preferred.  Commodity hardware and storage are leveraged.
  • 62. www.sensaran.wordpress.com REAL-TIME USES OF MAPREDUCE  Simple algorithms such as grep, text indexing, and reverse indexing.  Search engine operations such as keyword indexing, ad rendering, and page rank.  Enterprise analytics.  Data-intensive computing such as sorting.
  • 64. www.sensaran.wordpress.com 1 . Who executes the map job? a) TaskTracker b) NameNode server c) JobTracker d) Secondary NameNode server
  • 65. www.sensaran.wordpress.com 2. What is the sequence of Map Reduce Job ? a) Map, input split, reduce b) Input split, map, reduce c) Map and then reduce d) Map, split, map
  • 66. www.sensaran.wordpress.com 3 . Which one is the responsibility of the Map Reduce developer ? a) Map, reduce b) Reduce, input split c) Input split, map d) Sort, shuffle

Editor's Notes

  1. www.scmGalaxy.com, Author - Rajesh Kumar
  3. According to IBM