PROJECT REPORT
ON
“Hadoop Distributed File
System”
Submitted By
Mr. VARDHMAN P. KALE
Guided By
Mr. SANDEEP GORE
G.H.RAISONI COLLEGE OF ENGINEERING & MANAGEMENT
PUNE (MAHARASHTRA), India
(Affiliated to Savitribai Phule University Pune, India)
CERTIFICATE
This is to certify that the report entitled
“Hadoop Distributed File System”, which is being submitted
herewith for the award of ‘Computer Engineering’ of
Savitribai Phule University Pune, is the result of study carried out and
submitted by
MR.VARDHMAN P. KALE
Under my supervision and guidance
Place: Wagholi
Date:
Prof. S.Gore Prof. P.Gupta
(Project Guide) (H.O.D)
Table of Contents
Sr. No. Topic
1 Acknowledgement
2 Declaration
3 Introduction
3.1 Introduction to Hadoop
3.2 History of Hadoop
4 Key Technology
4.1 MapReduce
5 HDFS
Data Replication
Data Blocks
Communication Protocol
Name Node
Data Node
Properties
6 Hadoop Single Node Setup
7 Summary
7.1 Future Work
8 Bibliography
List of Diagrams
Sr. No. Diagram
1 MapReduce
2 HDFS
ACKNOWLEDGEMENT
I have put considerable effort into this project. However, it would not have been possible without
the kind support and help of many individuals and organizations. I would like to
extend my sincere thanks to all of them.
I am highly indebted to IBM for their guidance and constant supervision, as well as
for providing the necessary information regarding the project and for their support in
completing it. I would like to express my gratitude towards my parents and the
members of IBM for their kind cooperation and encouragement, which helped me in
the completion of this project.
I would like to express my special gratitude and thanks to the industry persons who gave me
their attention and time.
My thanks and appreciation also go to my colleagues who worked with me on the project and to the
people who have willingly helped me out with their abilities.
DECLARATION
My topic is the Analysis of Crime in India using Hadoop. Apache
Hadoop is a software framework that supports data-intensive distributed applications
under a free license. Hadoop was inspired by Google's MapReduce and Google File
System (GFS) papers. Hadoop, however, was designed to solve a different problem:
the fast, reliable analysis of both structured data and complex data. As a result, many
enterprises deploy Hadoop alongside their legacy IT systems, which allows them to
combine old data and new data sets in powerful new ways. The Hadoop framework
is used by major players including Yahoo and IBM, largely for applications
involving search engines and advertising. I am going to present the history,
development and current situation of this technology. The technology is now
maintained under the Apache Software Foundation, with commercial distributions such as Cloudera.
OBJECTIVES
1. Large Data Sets:
It is assumed that HDFS always needs to work with large data sets. It would be an underuse if
HDFS were deployed to process several small data sets of only a few megabytes or even a few
gigabytes each. The architecture of HDFS is designed so that it is best suited to storing and
retrieving huge amounts of data. What is required is high aggregate data bandwidth and the
scalability to spread out from a single-node cluster to a hundred- or thousand-node
cluster. The acid test is that HDFS should be able to manage tens of millions of files in a single
instance.
2. Write Once, Read Many Model:
HDFS follows a write-once, read-many approach for its files and applications. It assumes
that a file in HDFS, once written, will not be modified, though it can be accessed any number of
times (future versions of Hadoop may support appending to files). At present, HDFS
strictly allows one writer at any time. This assumption enables high-throughput data access and
also simplifies data coherency issues. A web crawler or a MapReduce application is best
suited for HDFS.
3. Streaming Data Access:
As HDFS works on the principle of ‘Write Once, Read Many’, the feature of streaming data
access is extremely important in HDFS, because HDFS is designed more for batch processing
than for interactive use. The emphasis is on high throughput of data access rather
than low latency. HDFS focuses not so much on storing the data as on how to
retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the
complete data set is more important than the time taken to fetch a single record from it.
HDFS overlooks a few POSIX requirements in order to implement streaming data access.
4. Commodity Hardware:
HDFS (Hadoop Distributed File System) assumes that the cluster(s) will run on common
hardware, that is, inexpensive, ordinary machines rather than high-availability systems. A
great feature of Hadoop is that it can be installed on average commodity hardware. We
don’t need supercomputers or high-end hardware to work with Hadoop. This leads to an overall
cost reduction to a great extent.
5. Data Replication and Fault Tolerance:
HDFS works on the assumption that hardware is bound to fail at some point or another.
Such failures disrupt the smooth and quick processing of large volumes of data. To overcome
this obstacle, HDFS divides files into large blocks of data, and each block is stored
on three nodes: two on the same rack and one on a different rack, for fault tolerance. A block
is the unit of data stored on every data node. Though the default block size is 64 MB and
the replication factor is three, these are configurable per file (a short sketch of setting them per
file appears after this list). This redundancy enables robustness, fault detection, quick recovery
and scalability, eliminates the need for RAID storage on hosts, and brings the merits of data locality.
6. High Throughput:
Throughput is the amount of work done in unit time. It describes how fast data can be
accessed from the system, and it is usually used to measure the performance of the system. In
Hadoop HDFS, when we want to perform a task or an action, the work is divided and
shared among different systems, so all the systems execute the tasks assigned to
them independently and in parallel, and the work is completed in a very short period of
time. In this way, Apache HDFS gives good throughput. By reading data in parallel, we
decrease the actual time to read data tremendously.
7. Moving Computation is better than Moving Data:
Hadoop HDFS works on the principle that a computation done by an application near the
data it operates on is much more efficient than one done far away, particularly when there are large
data sets. The major advantages are a reduction in network congestion and increased overall
throughput of the system. The assumption is that it is often better to move the computation
closer to where the data is located rather than moving the data to the application space. To
facilitate this, Apache HDFS provides interfaces for applications to relocate themselves nearer
to where the data is located.
8. File System Namespace:
A traditional hierarchical file organization is followed by HDFS, where any user or an
application can create directories and store files inside these directories.
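To make the per-file configurability mentioned in objective 5 concrete, here is a minimal, hedged Java
sketch using the public org.apache.hadoop.fs.FileSystem API. The NameNode URI, the path and the
chosen values are illustrative assumptions rather than part of the original report.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileSettingsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://localhost:54310"); // single-node setup from Chapter 5 (assumed)
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/big-input.dat"); // hypothetical path

    // Create the file with an explicit replication factor (2) and block size (128 MB),
    // overriding the cluster-wide defaults for this one file.
    FSDataOutputStream out = fs.create(file, true, 4096, (short) 2, 128L * 1024 * 1024);
    out.writeBytes("sample payload\n");
    out.close();

    // The replication factor of an existing file can also be changed later.
    fs.setReplication(file, (short) 3);
    fs.close();
  }
}

The same per-file replication change can also be made from the command line with the
hadoop fs -setrep command.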
Chapter 1
INTRODUCTION
Introduction to HADOOP
Today, we’re surrounded by data. People upload videos, take pictures on their cell
phones, text friends, update their Facebook status, leave comments around the web,
click on ads, and so forth. Machines, too, are generating and keeping more and more
data.
The exponential growth of data first presented challenges to cutting-edge
businesses such as Google, Yahoo, Amazon, and Microsoft. They needed to go
through terabytes and petabytes of data to figure out which websites were popular,
what books were in demand, and what kinds of ads appealed to people. Existing tools
were becoming inadequate to process such large data sets. Google was the first to
publicize MapReduce—a system they had used to scale their data processing needs.
This system aroused a lot of interest because many other businesses were facing
similar scaling challenges, and it wasn’t feasible for everyone to reinvent their own
proprietary tool. Doug Cutting saw an opportunity and led the charge to develop an
open source version of this MapReduce system called Hadoop. Soon after, Yahoo
and others rallied around to support this effort. Today, Hadoop is a core part of the
computing infrastructure for many web companies, such as Yahoo, Facebook,
LinkedIn, and Twitter. Many more traditional businesses, such as media and
telecom, are beginning to adopt this system too.
Hadoop is an open source framework for writing and running distributed
applications that process large amounts of data. Distributed computing is a wide and
varied field, but the key distinctions of Hadoop are that it is
• Accessible—Hadoop runs on large clusters of commodity machines or
on cloud computing services such as Amazon’s Elastic Compute Cloud
(EC2).
• Robust—Because it is intended to run on commodity hardware, Hadoop
is architected with the assumption of frequent hardware malfunctions.
It can gracefully handle most such failures.
• Scalable—Hadoop scales linearly to handle larger data by adding more
nodes to the cluster.
• Simple—Hadoop allows users to quickly write efficient parallel code.
Hadoop’s accessibility and simplicity give it an edge over writing and running
large distributed programs. Even college students can quickly and cheaply create
their own Hadoop cluster. On the other hand, its robustness and scalability make it
suitable for even the most demanding jobs at Yahoo and Facebook. These features
make Hadoop popular in both academia and industry.
Chapter 2
History of HADOOP
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely
used text search library. Hadoop has its origins in Apache Nutch, an open source web
search engine, itself a part of the Lucene project.
The Origin of the Name “Hadoop”:
• The name Hadoop is not an acronym; it’s a made-up name. The project’s
creator, Doug Cutting, explains how the name came about:
“The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and
pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids
are good at generating such. Googol is a kid’s term.”
• Subprojects and “contrib” modules in Hadoop also tend to have names
that are unrelated to their function, often with an elephant or other animal
theme (“Pig,” for example). Smaller components are given more
descriptive (and therefore more mundane) names. This is a good principle,
as it means you can generally work out what something does from its name.
For example, the jobtracker keeps track of MapReduce jobs.
• Building a web search engine from scratch was an ambitious goal, for not
only is the software required to crawl and index websites complex to write,
but it is also a challenge to run without a dedicated operations team, since
there are so many moving parts.
• It’s expensive too: Mike Cafarella and Doug Cutting estimated a system
supporting a 1-billion-page index would cost around half a million dollars
in hardware, with a monthly running cost of $30,000. Nevertheless, they
believed it was a worthy goal, as it would open up and ultimately
democratize search engine algorithms. Nutch was started in 2002, and a
working crawler and search system quickly emerged.
• However, they realized that their architecture wouldn’t scale to the billions
of pages on the Web. Help was at hand with the publication of a paper in
2003 that described the architecture of Google’s distributed filesystem,
called GFS, which was being used in production at Google. GFS, or
something like it, would solve their storage needs for the very large files
generated as a part of the web crawl and indexing process. In particular,
GFS would free up time being spent on administrative tasks such as
managing storage nodes. In 2004, they set about writing an open source
implementation, the Nutch Distributed Filesystem (NDFS).
• In 2004, Google published the paper that introduced MapReduce to the
world. Early in 2005, the Nutch developers had a working MapReduce
implementation in Nutch, and by the middle of that year all the major Nutch
algorithms had been ported to run using MapReduce and NDFS.
• NDFS and the MapReduce implementation in Nutch were applicable
beyond the realm of search, and in February 2006 they moved out of Nutch
to form an independent subproject of Lucene called Hadoop. At around the
same time, Doug Cutting joined Yahoo!, which provided a dedicated team
and the resources to turn Hadoop into a system that ran at web scale. This
was demonstrated in February 2008 when Yahoo! announced that its
production search index was being generated by a 10,000-core Hadoop
cluster.
• In January 2008, Hadoop was made its own top-level project at Apache,
confirming its success and its diverse, active community. By this time,
Hadoop was being used by many other companies as well.
Chapter 3
Key Technology
• The key technologies behind Hadoop are the MapReduce programming model
and the Hadoop Distributed File System.
• Operating on very large data sets is not practical in a serial programming
paradigm. MapReduce performs tasks in parallel to accomplish the work in less
time, which is the main aim of this technology.
• MapReduce requires a special file system: in real scenarios, the data to be
processed runs into petabytes. To store and maintain this much data on
distributed commodity hardware, the Hadoop Distributed File System was
invented. It is basically inspired by the Google File System.
3.1 MapReduce:
• MapReduce is a framework for processing highly distributable problems
across huge datasets using a large number of computers (nodes),
collectively referred to as a cluster (if all nodes use the same hardware) or
a grid (if the nodes use different hardware).
• Computational processing can occur on data stored either in a filesystem
(unstructured) or in a database (structured).
 "Map" step:The master nodetakes the input, partitions it up into smaller sub-
problems, and distributes them to worker nodes. A worker node may do this
again in turn, leading to a multilevel tree structure. The worker nodeprocesses
the smaller problem, and passes the answer back to its master node.
 "Reduce" step: The master node then collects the answers to all the sub-
problems and combines them in some way to form the output – the answer to
the problem it was originally trying to solve.
• MapReduce allows for distributed processing of the map and reduction operations.
Provided each mapping operation is independent of the others, all maps can be
performed in parallel – though in practice it is limited by the number of
independent data sources and/or the number of CPUs near each source. Similarly,
a set of 'reducers' can perform the reduction phase – provided all outputs of the
map operation that share the same key are presented to the same reducer at the
same time.
• MapReduce can be applied to significantly larger datasets than "commodity"
servers can handle – a large server farm can use MapReduce to sort a petabyte of
data in only a few hours.
Figure 3.1 MapReduce Programming Model
• The parallelism also offers some possibility of recovering from partial failure of
servers or storage during the operation: if one mapper or reducer fails, the work
can be rescheduled – assuming the input data is still available.
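The canonical illustration of this model is word counting. The sketch below follows the standard
Hadoop MapReduce WordCount pattern using the org.apache.hadoop.mapreduce API; treat it as a
minimal sketch rather than a tuned production job.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // "Map" step: each mapper reads one split of the input and emits a (word, 1) pair per word.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // "Reduce" step: all counts for the same word reach the same reducer, which sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count"); // newer releases prefer Job.getInstance(conf, "word count")
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation of map output
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not already exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, such a job would be submitted with something like
bin/hadoop jar wordcount.jar WordCount <input> <output>, where the jar and class names are, of
course, placeholders.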
Chapter 4
HDFS (Hadoop Distributed File System)
• The Hadoop Distributed File System (HDFS) is a distributed file system
designed to run on commodity hardware. It has many similarities with
existing distributed file systems. However, the differences from other
distributed file systems are significant.
• HDFS is highly fault-tolerant and is designed to be deployed on low-cost
hardware.
• HDFS provides high throughput access to application data and is suitable
for applications that have large data sets.
• HDFS relaxes a few POSIX requirements to enable streaming access to file
system data.
• HDFS was originally built as infrastructure for the Apache Nutch web
search engine project.
• HDFS is now an Apache Hadoop subproject.
• HDFS has a master/slave architecture. An HDFS cluster consists of a single
NameNode, a master server that manages the file system namespace and
regulates access to files by clients. In addition, there are a number of
DataNodes, usually one per node in the cluster, which manage storage
attached to the nodes that they run on.
• HDFS exposes a file system namespace and allows user data to be stored
in files. Internally, a file is split into one or more blocks and these blocks
are stored in a set of DataNodes.
• The NameNode executes file system namespace operations like opening,
closing, and renaming files and directories. It also determines the mapping
of blocks to DataNodes.
• The DataNodes are responsible for serving read and write requests from
the file system’s clients. The DataNodes also perform block creation,
deletion, and replication upon instruction from the NameNode.
Figure 2 HDFS Architecture
• The NameNode and DataNode are pieces of software designed to run on
commodity machines. These machines typically run a GNU/Linux
operating system (OS).
• HDFS is built using the Java language; any machine that supports Java can
run the NameNode or the DataNode software. Usage of the highly portable
Java language means that HDFS can be deployed on a wide range of
machines.
• A typical deployment has a dedicated machine that runs only the
NameNode software. Each of the other machines in the cluster runs one
instance of the DataNode software. The architecture does not preclude
running multiple DataNodes on the same machine, but in a real deployment
that is rarely the case.
• The existence of a single NameNode in a cluster greatly simplifies the
architecture of the system. The NameNode is the arbitrator and repository
for all HDFS metadata. The system is designed in such a way that user data
never flows through the NameNode; a short client-side sketch of this flow
follows these bullets.
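To make the client’s interaction with this architecture concrete, the following is a minimal, hedged
Java sketch using the public org.apache.hadoop.fs.FileSystem API. The NameNode URI matches the
single-node configuration in Chapter 5, and the path and file contents are illustrative assumptions.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://localhost:54310"); // NameNode address from Chapter 5
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt"); // hypothetical path

    // create() asks the NameNode which DataNodes should hold the blocks;
    // the bytes themselves are streamed to the DataNodes, never through the NameNode.
    FSDataOutputStream out = fs.create(file);
    out.writeBytes("Hello, HDFS\n");
    out.close();

    // open() fetches the block locations from the NameNode, then reads the data from the DataNodes.
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
    System.out.println(in.readLine()); // prints: Hello, HDFS
    in.close();
    fs.close();
  }
}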
Data Replication:
HDFS is designed to store very large files across machines in a large
cluster. Each file is a sequence of blocks. All blocks in the file except the last are of
the same size. Blocks are replicated for fault tolerance. Block size and replicas are
configurable per file.
The Namenode receives a Heartbeat and a BlockReport from each
DataNode in the cluster. BlockReport contains all the blocks on a Datanode.
Data Blocks:
HDFS supports a write-once-read-many model with reads at streaming speeds.
A typical block size is 64 MB (or even 128 MB).
A file is chopped into 64 MB chunks and stored.
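As a worked example (the numbers are illustrative, not from the original report): with a 64 MB block
size, a 200 MB file is split into four blocks – three full 64 MB blocks and one 8 MB block. With the
default replication factor of three, the cluster then stores 4 × 3 = 12 block replicas, i.e. roughly
600 MB of raw disk for 200 MB of user data.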
The Communication Protocol:
All HDFS communication protocols are layered on top of the TCP/IP protocol.
A client establishes a connection to a configurable TCP port on the NameNode
machine. It talks the ClientProtocol with the NameNode.
The DataNodes talk to the NameNode using the DataNode protocol.
An RPC abstraction wraps both the ClientProtocol and the DataNode protocol.
The NameNode is simply a server and never initiates a request; it only responds to
RPC requests issued by DataNodes or clients.
What is NameNode?
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all
files in the file system, and tracks where across the cluster the file data is kept. It does not
store the data of these files itself.
Client applications talk to the NameNode whenever they wish to locate a file, or when they
want to add/copy/move/delete a file. The NameNode responds to successful requests by
returning a list of the relevant DataNode servers where the data lives.
The NameNode is a Single Point of Failure for the HDFS Cluster. HDFS is not currently a
High Availability system. When the NameNode goes down, the file system goes offline.
There is an optional SecondaryNameNode that can be hosted on a separate machine. It only
creates checkpoints of the namespace by merging the edits file into the fsimage file and does
not provide any real redundancy. Hadoop 0.21+ has a BackupNameNode that is part of a
plan to have a highly available name service, but turning the NameNode into a truly highly
available service still requires active contributions from the community.
It is essential to look after the NameNode. Here are some recommendations from production
use:
• Use a good server with lots of RAM. The more RAM you have, the bigger the file
system can be, or the smaller the block size.
• Use ECC RAM.
• On Java 6u15 or later, run the server VM with compressed pointers
(-XX:+UseCompressedOops) to cut the JVM heap size down.
• List more than one name node directory in the configuration, so that multiple copies
of the file system metadata will be stored. As long as the directories are on separate
disks, a single disk failure will not corrupt the metadata (an illustrative configuration
snippet follows this list).
• Configure the NameNode to store one set of transaction logs on a separate disk from
the image.
• Configure the NameNode to store another set of transaction logs on a network-
mounted disk.
• Monitor the disk space available to the NameNode. If free space is getting low, add
more storage.
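As an illustration of the “more than one name node directory” recommendation, a property of the
following form could be added to hdfs-site.xml. The directory paths are hypothetical, and the
property name shown (dfs.name.dir) is the one used by the pre-0.21 releases this report describes;
newer releases call it dfs.namenode.name.dir.
<property>
<name>dfs.name.dir</name>
<value>/disk1/hdfs/name,/disk2/hdfs/name,/remote/nfs/hdfs/name</value>
</property>
Each comma-separated directory receives a full copy of the namespace image (and, by default, the
edit log), so losing any single disk does not lose the file system metadata.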
What is DataNode?
A DataNode stores data in the Hadoop file system (HDFS). A functional filesystem has more than
one DataNode, with data replicated across them.
On startup, a DataNode connects to the NameNode; spinning until that service comes up. It
then responds to requests from the NameNode for filesystem operations.
Client applications can talk directly to a DataNode once the NameNode has provided the
location of the data. Similarly, MapReduce operations farmed out to TaskTracker instances
near a DataNode talk directly to the DataNode to access the files. TaskTracker instances
can, and indeed should, be deployed on the same servers that host DataNode instances, so
that MapReduce operations are performed close to the data.
DataNode instances can talk to each other, which is what they do when they are replicating
data.
• There is usually no need to use RAID storage for DataNode data, because data is
designed to be replicated across multiple servers, rather than across multiple disks on the
same server.
• An ideal configuration is for a server to have a DataNode, a TaskTracker, and enough
physical disks and TaskTracker slots (one slot per CPU) to give
every TaskTracker 100% of a CPU and separate disks to read and write data.
• Avoid using NFS for data storage in a production system.
HDFS Properties:
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of the NameNode and DataNode help users to easily check the status of the
cluster.
• It offers streaming access to file system data.
• HDFS provides file permissions and authentication.
• Fault Tolerance – In Apache Hadoop HDFS, fault tolerance is the working strength of a
system in unfavorable conditions. HDFS is highly fault-tolerant: data is
divided into blocks, and multiple copies of the blocks are created on different machines in
the cluster. If any machine in the cluster goes down due to unfavorable conditions,
a client can easily access its data from another machine which contains the same
copy of the data blocks.
• High Availability – HDFS is a highly available file system; data gets replicated among
the nodes in the HDFS cluster by creating replicas of the blocks on the other slaves
present in the HDFS cluster. Hence, when clients want to access their data, they can
access it from the slaves which contain its blocks and which are available on
the nearest node in the cluster. At the time of failure of a node, a client can easily access
the data from other nodes.
• Data Reliability – HDFS stores data reliably by creating a replica of each and every block
present on the nodes and hence provides a fault tolerance facility.
• Scalability – HDFS stores data on multiple nodes in the cluster; when requirements
increase, we can scale the cluster. There are two scalability mechanisms available:
vertical and horizontal.
Chapter 5
HADOOP Single Node Setup
The steps involved in setting up a single-node Hadoop cluster are as follows:
1. Download the Hadoop software (the hadoop-*.tar.gz archive) from the
https://hadoop.apache.org site and ensure that the software is installed on every
node of the cluster. Installing the Hadoop software on a node simply requires
unpacking the archive on that node.
2. Create keys on the local machine so that ssh, which is required by Hadoop, does not need a
password. Use the following commands to create the key on the local machine:
$ ssh-keygen -t rsa -P ""
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
3. Modify the environment parameters in the hadoop-env.sh file. Use the following
command to change the environment parameter:
export JAVA_HOME=/path/to/jdk_home_dir
4. Modify the configuration parameters in the files given below.
Make the following changes to the configuration files under hadoop/conf:
1) core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>TEMPORARY-DIR-FOR-HADOOPDATASTORE</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
</property>
</configuration>
2) mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
</property>
</configuration>
3) hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
5. Format the Hadoop file system. From the Hadoop directory, run the following command:
bin/hadoop namenode -format
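As an optional sanity check (not part of the original steps), the daemons can then be started and the
file system exercised. The commands below assume the pre-YARN bin/ scripts shipped with the
Hadoop releases this report describes; sample.txt is a placeholder for any local file.
6. Start the Hadoop daemons and confirm that they are running:
bin/start-all.sh
jps
The jps listing should show the NameNode, DataNode, SecondaryNameNode, JobTracker and
TaskTracker processes.
7. Copy a file into HDFS and list it to verify that writes and reads work:
bin/hadoop fs -mkdir /user/demo
bin/hadoop fs -put sample.txt /user/demo/
bin/hadoop fs -ls /user/demo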
CONCLUSION
• Big Data analytics refers to the tools and practices that can be used for
transforming raw crime data into meaningful and crucial information, which
helps in forming a decision support system for the judiciary and legislature to
take steps towards keeping crime in check.
• With the ever increasing population and crime rates, certain trends must be
discovered, studied and discussed to take well informed decisions so that law
and order can be maintained properly. If the number of complaints from a
particular state is found to be very high, extra security must be provided to the
residents there by increasing police presence, quick redressal of complaints
and strict vigilance.
• Crimes against women are becoming an increasingly worrying and disturbing
problem for the government. The number of such crimes must be found,
especially the ones against young women (aged between 18 and 30 years). Extra
security must be provided so that law and order can be maintained properly
and there is a sense of safety and well-being among the citizens of the country.
FUTURE SCOPE
Hadoop is among the major big data technologies and has a vast scope in the future. Being
cost-effective, scalable and reliable, many of the world’s biggest organizations are
employing Hadoop technology to deal with their massive data for research and production.
This includes storing data on a cluster that tolerates machine or hardware failure, adding new
hardware to the nodes, and so on.
Newcomers to the IT sector often ask what the scope of Hadoop will be in the
future. It can be gauged from the fact that the amount of data available through
social networking and other means keeps increasing as the world
approaches digitalization.
This generation of massive data drives the use of Hadoop technology, which is more widely
adopted than other big data technologies. However, there are other
technologies competing with Hadoop, as it has not yet gained stability in the big data market.
It is still in the adoption phase and will take some time to become stable and lead the big data
market.
REFERENCES
[1] “What is Apache Hadoop?” https://hadoop.apache.org/, accessed: 2015-08-13.
[2] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, “Delay
scheduling: a simple technique for achieving locality and fairness in cluster scheduling,” in
Proceedings of the 5th European conference on Computer systems. ACM, 2010, pp. 265–
278.
[3] K. S. Esmaili, L. Pamies-Juarez, and A. Datta, “The CORE storage primitive: Cross-object
redundancy for efficient data repair & access in erasure coded storage,” arXiv preprint
arXiv:1302.5192, 2013.
[4] G. Ananthanarayanan, S. Agarwal, S. Kandula, A. Greenberg, I. Stoica, D. Harlan, and
E. Harris, “Scarlett: Coping with skewed content popularity in mapreduce clusters.” in
Proceedings of the Sixth Conference on Computer Systems, ser. EuroSys ’11. New York,
NY, USA: ACM, 2011, pp. 287–300. [Online]. Available:
http://doi.acm.org/10.1145/1966445.196647
[5] G. Kousiouris, G. Vafiadis, and T. Varvarigou, “Enabling proactive data management
in virtualized hadoop clusters based on predicted data activity patterns.” in P2P, Parallel,
Grid, Cloud and Internet Computing (3PGCIC), 2013 Eighth International Conference on,
Oct 2013, pp. 1–8.
[6] A. Papoulis, Signal analysis. McGraw-Hill, 1977, vol. 191.
[7] Q. Wei, B. Veeravalli, B. Gong, L. Zeng, and D. Feng, “Cdrm: A cost-effective dynamic
replication management scheme for cloud storage cluster.” in Cluster Computing
(CLUSTER), 2010 IEEE International Conference on, Sept 2010, pp. 188–196.
BIBLIOGRAPHY
[1] Jason Venner, Pro Hadoop, Apress
[2] Tom White, Hadoop: The Definitive Guide, O’Reilly
[3] Chuck Lam, Hadoop in Action, Manning
[4] www.Hadoop.apache.org
[5] http://www.tutorialshadoop.com
[6] Lecture Notes in Computer Science, 2013.