ABSTRACT
Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, in which an application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are handled automatically by the framework.
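To give a feel for the file-system half of this, here is a minimal sketch of writing and reading a file through HDFS's Java client API. The class name and file path are purely illustrative, and the code assumes the cluster's Hadoop configuration files are on the classpath so that FileSystem.get can find the NameNode.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHelloWorld {
      public static void main(String[] args) throws Exception {
        // Picks up the file system address from the cluster configuration,
        // so the same code talks to HDFS on a cluster and to local disk otherwise.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hello.txt");   // illustrative path

        // Write a small file; HDFS transparently replicates its blocks across DataNodes.
        FSDataOutputStream out = fs.create(file);
        out.writeUTF("hello, hdfs");
        out.close();

        // Read it back.
        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();

        fs.close();
      }
    }

The point of the sketch is that application code only sees ordinary read and write calls; block placement, replication and recovery from node failure are handled by the framework underneath.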
Problem Statement:

The total amount of digital data in the world has exploded in recent years, driven primarily by information (or data) generated by enterprises all over the globe. In 2006 the world's data was estimated at 0.18 zettabytes, with a tenfold growth to 1.8 zettabytes forecast by 2011.

The problem is that while the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. A typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so all the data on a full drive could be read in around 300 seconds. In 2010, 1 TB drives are the standard hard disk size, but the transfer speed is only around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.
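As a quick back-of-the-envelope check of these figures (taking 1 TB as roughly $10^{6}$ MB):

\[
t_{1990} \approx \frac{1370\ \mathrm{MB}}{4.4\ \mathrm{MB/s}} \approx 311\ \mathrm{s},
\qquad
t_{2010} \approx \frac{10^{6}\ \mathrm{MB}}{100\ \mathrm{MB/s}} = 10^{4}\ \mathrm{s} \approx 2.8\ \mathrm{hours}.
\]

Dividing the 2010 read across 100 drives working in parallel gives roughly $10^{4}/100 = 100$ seconds, which is where the "about 2 minutes" figure in the next section comes from.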
Solution Proposed:
Parallelisation:

An obvious solution to this problem is parallelisation. The input data is usually large, and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. Reading 1 TB from a single hard drive may take a long time, but parallelising the read across 100 machines brings it down to about 2 minutes. Apache Hadoop is a framework for running applications on large clusters built of commodity hardware, and it transparently provides applications with both reliability and data motion.

Hadoop solves the problem of hardware failure through replication: the system keeps redundant copies of the data so that, in the event of a failure, another copy is available. This is the job of the Hadoop Distributed File System (HDFS).

The second problem, combining and analysing the data once it has been read, is solved by a simple programming model, MapReduce. This programming paradigm abstracts the problem away from raw data reads and writes into a computation over a series of keys and values. HDFS and MapReduce are the two most significant features of Hadoop.
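To make the key/value abstraction concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API: the map step emits a (word, 1) pair for every word it sees, and the reduce step sums the counts for each word. Input and output paths are taken from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map step: for every word in an input line, emit the pair (word, 1).
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce step: sum all the 1s emitted for the same word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      // Job driver: wires the mapper and reducer together and points them at HDFS paths.
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The compiled job is typically packaged in a jar and launched with something like "hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output", where the two paths (illustrative here) refer to directories in HDFS; the framework then schedules the map and reduce fragments across the cluster and re-runs any that fail.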
Who Uses Hadoop?

Data Everywhere

Examples of Hadoop in action

• In the Telecommunication industry
• In the Media
• In the Technology industry

Hadoop Architecture

• Two Main Components:
    - Distributed File System
    - MapReduce Engine
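As a rough sketch of how these two components are wired up on a cluster, the classic (Hadoop 1 style) configuration files look roughly like the following. The host names and port numbers are placeholders, not values taken from this presentation.

    core-site.xml
    <configuration>
      <property>
        <!-- Clients and DataNodes find the HDFS NameNode here -->
        <name>fs.default.name</name>
        <value>hdfs://namenode.example.com:9000</value>
      </property>
    </configuration>

    hdfs-site.xml
    <configuration>
      <property>
        <!-- Number of copies HDFS keeps of each block -->
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>

    mapred-site.xml
    <configuration>
      <property>
        <!-- MapReduce clients submit jobs to the JobTracker here -->
        <name>mapred.job.tracker</name>
        <value>jobtracker.example.com:9001</value>
      </property>
    </configuration>

The NameNode and JobTracker are the master processes for the file system and the MapReduce engine respectively; DataNodes store the file blocks and TaskTrackers run map and reduce tasks on the same machines, which is what lets the computation move to the data.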
Conclusion
Hadoop (MapReduce) is one of the most powerful frameworks for the easy development of data-intensive applications. Its objective is to help build applications that scale out to thousands of machines. Hadoop is very well suited to data-intensive background applications and is a perfect fit for our project's requirements. Apart from running applications in parallel, Hadoop provides job monitoring features similar to those of Azure. If any machine crashes, its data can be recovered from other machines, which take over its jobs automatically. When we put Hadoop in the cloud, we also see how convenient it is to set up: with a few command lines, we can allocate any number of nodes to run Hadoop, which saves a lot of time and effort. The combination of cloud and Hadoop is surely a common way to set up large-scale applications at lower cost and with greater elasticity.
Resources

http://hadoop.apache.org

http://developer.yahoo.com/hadoop

http://www.cloudera.com/resources
