2. Contents
What is Hadoop?
How big is it?
Why the need for large computing?
How does it work?
Advantages
Disadvantages
Who uses Hadoop?
Conclusion
3. What is Hadoop?
It is open-source software written in Java.
The Hadoop software library is a framework that
allows for the distributed processing of large
data sets across clusters of computers using
simple programming models.
HDFS: self-healing, high-bandwidth clustered
storage.
MapReduce: fault-tolerant distributed
processing.
Operates on both structured and unstructured
data.
4. How Big Is It?
• We have ~20,000 machines running Hadoop.
• Our largest clusters are currently 2,000 nodes.
• Several petabytes of user data (compressed, unreplicated).
• We run hundreds of thousands of jobs every month.
5. Why the Need for Large Computing?
The New York Stock Exchange generates
about one terabyte of new trade data per
day.
Facebook hosts approximately 10 billion
photos, taking up one petabyte of storage.
The Internet Archive stores around 2
petabytes of data, and is growing at a rate
of 20 terabytes per month.
9. MapReduce
Example: word count over a given set of strings.
Input:
We love India
We play tennis
Map output:
We 1
love 1
India 1
We 1
play 1
tennis 1
Reduce output:
We 2
love 1
India 1
play 1
tennis 1
The Hadoop MapReduce framework harnesses a cluster of
machines and executes user-defined MapReduce jobs across
the nodes in the cluster.
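The word-count example above can be sketched in plain Python. This is a local, single-process simulation of the map, shuffle, and reduce phases, not the actual Hadoop Java API; the function names are illustrative only.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the line.
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce phase: sum all the counts collected for one word.
    return (word, sum(counts))

def word_count(lines):
    # Shuffle/sort: group the map output by key, as the framework
    # would do between the map and reduce phases.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return [reducer(word, (count for _, count in group))
            for word, group in groupby(pairs, key=itemgetter(0))]

print(word_count(["We love India", "We play tennis"]))
# [('India', 1), ('We', 2), ('love', 1), ('play', 1), ('tennis', 1)]
```

In real Hadoop the mapper and reducer run on many nodes in parallel and the shuffle moves data over the network, but the data flow is the same as in this sketch.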
10. Advantages
Reliable shared storage.
Simple analysis system.
Distributed file system.
Tasks are independent.
Easy to handle partial failures: entire nodes
can fail and restart.
11. Disadvantages
The lack of a central data store can be frustrating.
There is still a single master, which requires care
and may limit scaling.
Managing job flow isn't trivial when
intermediate data should be kept.
13. Conclusion
Hadoop is a data-grid operating system
that provides an economically scalable
solution for storing and processing large
amounts of structured or unstructured data
over long periods of time.