Hadoop by Kamran Khan
1. Sunderdeep Engineering College
Department of Computer Science
Session: 2017-18
Topic: Hadoop
Submitted to: Mr. Ashutosh Rao, H.O.D., Dept. of CSE
Submitted by: Kamran Khan, B.Tech III Year
2. Contents
Introduction
What is Big Data?
3 V's of Big Data
Problem & Solution
What is Hadoop?
HDFS
MapReduce
Architecture of Hadoop
Applications of Hadoop
Pros & Cons of Hadoop
Conclusion
References
3. Introduction
Apache Hadoop is an open-source, scalable, and fault-tolerant framework written in Java. It efficiently processes large volumes of data (Big Data) on a cluster of commodity hardware. Hadoop is not only a storage system but a platform for both large-scale data storage and processing.
Created by Doug Cutting and Mike Cafarella in 2005.
Doug named it after his son's toy elephant.
Apache Hadoop is now a registered trademark of the Apache Software Foundation.
5. What is Big Data?
Data that is very large in size is called Big Data. Normally we work with data of MB size (Word, Excel files) or at most GB size (movies, code), but data on the scale of petabytes, i.e. 10^15 bytes, is called Big Data. It is stated that almost 90% of today's data has been generated in the past 5 years.
8. 3 V's of Big Data
Velocity: Data is increasing at a very fast rate. It is estimated that the volume of data doubles every 2 years.
Variety: Nowadays data is not stored only in rows and columns. Data is both structured and unstructured. Log files and CCTV footage are unstructured data; data that can be saved in tables, such as a bank's transaction data, is structured data.
Volume: The amount of data we deal with is very large, on the order of petabytes.
9. So what is the problem?
Processing such large data is very difficult in a relational database.
It would take too much time and cost too much to process the data.
10. Traditional Approach
In this approach, an enterprise has a computer to store and process big data. Data is stored in an RDBMS such as Oracle Database, MS SQL Server, or DB2, and sophisticated software can be written to interact with the database, process the required data, and present it to users for analysis.
This approach works well where the volume of data can be accommodated by standard database servers, or up to the limit of the processor handling the data. But when it comes to dealing with huge amounts of data, processing them through a traditional database server is a tedious task.
11. Google's Solution!!
Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts, assigns those parts to many computers connected over the network, and collects the results to form the final result dataset.
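As a rough illustration of this divide-and-combine idea, here is a toy sketch in plain Java (not Google's implementation; all names are made up for the example) that counts words by processing each input line in parallel and then merging the partial results:

import java.util.*;
import java.util.concurrent.*;

// Toy illustration of MapReduce: split the input, process the
// parts in parallel, then combine the partial results.
public class DivideAndCombine {
    public static void main(String[] args) throws Exception {
        List<String> lines = Arrays.asList("big data", "big cluster", "data data");
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // "Map" step: count words in each line independently.
        List<Future<Map<String, Integer>>> parts = new ArrayList<>();
        for (String line : lines) {
            parts.add(pool.submit(() -> {
                Map<String, Integer> counts = new HashMap<>();
                for (String word : line.split(" "))
                    counts.merge(word, 1, Integer::sum);
                return counts;
            }));
        }

        // "Reduce" step: merge the partial counts into the final result.
        Map<String, Integer> total = new HashMap<>();
        for (Future<Map<String, Integer>> part : parts)
            part.get().forEach((w, c) -> total.merge(w, c, Integer::sum));

        pool.shutdown();
        System.out.println(total); // big=2, cluster=1, data=3 (order may vary)
    }
}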
13. What is Hadoop?
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
It is developed by the Apache Software Foundation; Hadoop 1.0 was released in 2011.
Written in Java.
14. Hadoop is open-source software that provides:
A framework
Massive storage
Processing power
15. We can solve this problem by distributed computing.
But the problems in distributed computing are:
Hardware failure
There is always a chance of hardware failure.
Combining the data after analysis
Data from all the disks has to be combined after analysis, which is a mess.
16. Hadoop came to solve all these problems.
It has two main parts:
Hadoop Distributed File System (HDFS)
Data processing framework: MapReduce
17. Hadoop Distributed File System
It ties many small, reasonably priced machines together into a single cost-effective computer cluster.
Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computation does not fail.
It automatically stores multiple copies of all data.
It provides a simplified programming model that allows users to quickly read and write files on the distributed system, as the sketch below shows.
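A minimal sketch of that read/write model, assuming the standard org.apache.hadoop.fs.FileSystem client API; the NameNode URI, port, and file path here are placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS read/write sketch; URI and paths are hypothetical.
public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the cluster through the NameNode.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Write a file; HDFS replicates its blocks across DataNodes.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back through the same simple stream interface.
        try (BufferedReader in =
                new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}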
19. NameNode in the HDFS architecture is also known as the Master node. The HDFS NameNode stores metadata, i.e. the number of data blocks, replicas, and other details. This metadata is kept in memory on the master for faster retrieval. The NameNode maintains and manages the slave nodes and assigns tasks to them. It should be deployed on reliable hardware, as it is the centerpiece of HDFS.
DataNode in the HDFS architecture is also known as a Slave. DataNodes store the actual data in HDFS and perform read and write operations as requested by clients. DataNodes can be deployed on commodity hardware.
In HDFS, when the NameNode starts, it first reads the HDFS state from an image file, the FsImage. After that, it applies the edits from the edits log file, writes the new HDFS state back to the FsImage, and then starts normal operation with an empty edits file. Because the NameNode merges the FsImage and edits files only at start-up, the edits log file can grow very large over time. A side effect of a larger edits file is that the next restart of the NameNode takes longer.
The Secondary NameNode solves this issue. It downloads the FsImage and EditLogs from the NameNode, merges the EditLogs with the FsImage (FileSystem Image), and so keeps the edits log size within a limit. It stores the modified FsImage in persistent storage, which can be used in the case of a NameNode failure. The Secondary NameNode thus performs a regular checkpoint in HDFS; the relevant tuning knobs are sketched below.
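As a sketch, assuming the Hadoop 2.x property names (these are normally set in hdfs-site.xml rather than in code), the checkpoint frequency can be tuned like this:

import org.apache.hadoop.conf.Configuration;

// Checkpoint tuning knobs (Hadoop 2.x names; values are examples).
public class CheckpointConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Checkpoint at most every hour (period is in seconds)...
        conf.setLong("dfs.namenode.checkpoint.period", 3600);
        // ...or sooner, once this many edit-log transactions accumulate.
        conf.setLong("dfs.namenode.checkpoint.txns", 1000000);
        System.out.println(conf.get("dfs.namenode.checkpoint.period"));
    }
}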
20. The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that hold the data, or at least nodes in the same rack.
Client applications submit jobs to the JobTracker (see the driver sketch after this list).
The JobTracker talks to the NameNode to determine the location of the data.
The JobTracker locates TaskTracker nodes with available slots at or near the data.
The JobTracker submits the work to the chosen TaskTracker nodes.
The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
A TaskTracker notifies the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, or it may even blacklist the TaskTracker as unreliable.
When the work is completed, the JobTracker updates its status.
Client applications can poll the JobTracker for information.
The JobTracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.
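A sketch of the client side of this flow, using the standard org.apache.hadoop.mapreduce.Job API; WordCountMapper and WordCountReducer are the classes sketched under the MapReduce slide below, and the input/output paths come from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Client-side driver: configures a job and submits it to the cluster,
// which then schedules the map/reduce tasks near the data.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // sketched below
        job.setReducerClass(WordCountReducer.class); // sketched below
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Blocks until the job finishes; clients can also poll status.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}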
21. MapReduce
MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
It comes with an associated implementation for processing and generating large data sets.
The MAP function processes a key/value pair to generate a set of intermediate key/value pairs.
The REDUCE function merges all intermediate values associated with the same intermediate key, as in the WordCount sketch below.
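A minimal sketch of these two functions, using the classic WordCount example with the standard org.apache.hadoop.mapreduce API (the class names are our own):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// MAP: one input line -> a set of intermediate (word, 1) pairs.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer it = new StringTokenizer(value.toString());
        while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            context.write(word, ONE); // emit intermediate key/value pair
        }
    }
}

// REDUCE: merge all intermediate values that share the same key.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}

For the input "big data big", the map step emits (big, 1), (data, 1), (big, 1), and the reduce step merges these into (big, 2) and (data, 1).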
26. Pros of Hadoop
Computing power
Flexibility
Fault tolerance
Low cost
Scalability
27. Cons of Hadoop
1. Integration with existing systems
Hadoop is not optimised for ease of use. Installing and integrating it with existing databases might prove difficult, especially since there is no software support provided.
2. Administration and ease of use
Hadoop requires knowledge of MapReduce, while most data practitioners use SQL. This
means significant training may be required to administer Hadoop clusters.
3. Security
Hadoop lacks the level of security functionality needed for safe enterprise deployment, especially where sensitive data is concerned.
28. Conclusion
Hadoop has been a very effective solution for companies dealing with data in petabytes.
It has solved many problems in industry related to huge data management and distributed systems.
As it is open source, it is widely adopted by companies.