Hadoop
Nguyen Thanh Hai
Portal team
August 2012
Agenda

1 − Meet Hadoop
     − History
     − Data!
     − Data Storage and Analysis
     − What Hadoop is Not

2 − The Hadoop Distributed File System
     − HDFS Concept
     − Architecture
     − Goals
     − Command Line User Interface

3 − MapReduce
     − Overview
     − How MapReduce works

4 − Practice
     − Demo
     − Discussion


Meet Hadoop

 - History

 - Data!

 - Data Storage and Analysis

 - What Hadoop is Not




History
- Hadoop got its start in Nutch. A few of its developers were attempting to
build an open source web search engine and were having trouble managing
computations running on even a handful of computers.

- Once Google published its GFS and MapReduce papers, the route became
clear: Google had devised systems to solve precisely the problems they were
having with Nutch. So two of them started, half-time, trying to re-create
these systems as a part of Nutch.

- Around that time, Yahoo! got interested and quickly put together a team.
They split the distributed computing part off from Nutch, naming it Hadoop.
With the help of Yahoo!, Hadoop soon grew into a technology that could truly
scale to the Web.

Data! We live in the data age




Data Storage and Analysis

- While the storage capacities of hard drives have increased massively over
the years, access speeds (the rate at which data can be read from drives)
have not kept up. One typical drive from 1990 could store 1,370 MB of data
and had a transfer speed of 4.4 MB/s. Over 20 years later, one-terabyte
drives are the norm, but the transfer speed is around 100 MB/s.

- At 100 MB/s it takes about 10,000 seconds (nearly three hours) to read all
the data on a single one-terabyte drive, and writing is even slower.




Data Storage and Analysis

The obvious way:

- Imagine we have 100 drives, each holding one hundredth of the data.
Working in parallel, we could read all the data in under two minutes.

- Using only one hundredth of each disk may seem wasteful, but we can store
one hundred datasets, each one terabyte in size, and provide shared access
to them.
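
A quick back-of-the-envelope check of these figures, as a minimal sketch in
Java (assuming 1 TB of data and a sequential transfer speed of 100 MB/s per
drive):

    // Rough read-time estimate: 1 TB of data at 100 MB/s per drive.
    public class ReadTime {
        public static void main(String[] args) {
            double totalMB = 1000000;               // ~1 TB expressed in MB
            double mbPerSec = 100;                  // transfer speed of a single drive
            double oneDrive = totalMB / mbPerSec;   // seconds to read everything from one drive
            double hundred = oneDrive / 100;        // seconds with 100 drives reading in parallel
            System.out.printf("1 drive:    %.0f s (~%.1f hours)%n", oneDrive, oneDrive / 3600);
            System.out.printf("100 drives: %.0f s (~%.1f minutes)%n", hundred, hundred / 60);
        }
    }

This prints roughly 10,000 s (about 2.8 hours) for a single drive and 100 s
(under two minutes) for 100 drives, matching the figures above.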




Data Storage and Analysis

The problems to solve:

- The first: as soon as you start using many pieces of hardware, the chance
that one will fail is fairly high. A common way of avoiding data loss is
replication: redundant copies of the data are kept by the system so that in
the event of failure, another copy is available.

- The second: most analysis tasks need to be able to combine the data in
some way; data read from one disk may need to be combined with data from any
of the other 99 disks. Various distributed systems allow data to be combined
from multiple sources, but doing this correctly is notoriously challenging.

With Hadoop:
Hadoop provides a reliable shared storage and analysis system: the storage
is provided by HDFS and the analysis by MapReduce.



What Hadoop is Not
- It is not a substitute for a database. Hadoop stores data in files and
does not index them. If you want to find something, you have to run a
MapReduce job that goes through all the data. This takes time, and means
that you cannot directly use Hadoop as a substitute for a database. Where
Hadoop works is where the data is too big for a database: with very large
datasets, the cost of regenerating indexes is so high that you can't easily
index changing data.

- MapReduce is not always the best algorithm. MapReduce is a profound idea:
taking a simple functional programming operation and applying it, in
parallel, to gigabytes or terabytes of data. But there is a price: for that
parallelism, each MapReduce operation must be independent of all the others.
If you need to know everything that has gone before, you have a problem.

- Hadoop and MapReduce are not a place to learn Java programming.

- Hadoop is not an ideal place to learn about networking error messages.

- Hadoop clusters are not a place to learn Unix/Linux system administration.


The Hadoop Distributed File System

  - HDFS Concept

  - Architecture

  - Goals

  - Command Line User Interface




HDFS Concept

Block:

- A disk has a block size, which is the minimum amount of data that it can
read or write. Filesystems for a single disk build on this by dealing with
data in blocks. Disk blocks are normally 512 bytes.

- HDFS, too, has the concept of a block, but it is a much larger unit: 64 MB
by default. Like in a filesystem for a single disk, files in HDFS are broken
into block-sized chunks, which are stored as independent units. Unlike in a
filesystem for a single disk, a file in HDFS that is smaller than a single
block does not occupy a full block's worth of underlying storage.
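
To see how a particular file is actually laid out in blocks, the HDFS fsck
tool can list them; a sketch (the path here is only an example):

    hadoop fsck /user/hadoop/data.log -files -blocks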




HDFS Concept

NameNode and DataNodes:

- A Hadoop cluster has two types of node operating in a master-worker
pattern: a namenode (the master) and a number of datanodes (the workers).

- The NameNode manages the filesystem namespace. It maintains the filesystem
tree and the metadata for all the files and directories in the tree. It
executes filesystem namespace operations such as opening, closing, and
renaming files and directories. It also determines the mapping of blocks to
DataNodes.

- DataNodes are the workhorses of the filesystem. They store and retrieve
blocks when they are told to (by clients or the NameNode), and they report
back to the NameNode periodically with lists of the blocks that they are
storing.
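
The NameNode's view of the cluster can be checked from the command line; for
example, the following prints capacity statistics and the status of each
registered DataNode:

    hadoop dfsadmin -report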




Architecture




HDFS Goals
- Hardware Failure: An HDFS instance may consist of hundreds or thousands of
server machines, each storing part of the filesystem's data. The fact that
there are a huge number of components and that each component has a
non-trivial probability of failure means that some component of HDFS is
always non-functional. Therefore, detection of faults and quick, automatic
recovery from them is a core architectural goal of HDFS.

- Large Data Sets: Applications that run on HDFS have large data sets. A
typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned
to support large files. It should provide high aggregate data bandwidth and
scale to hundreds of nodes in a single cluster. It should support tens of
millions of files in a single instance.

- “Moving Computation is Cheaper than Moving Data”: A computation requested
by an application is much more efficient if it is executed near the data it
operates on. This is especially true when the size of the data is huge, as
it minimizes network congestion and increases the overall throughput of the
system. The assumption is that it is often better to migrate the computation
closer to where the data is located than to move the data to where the
application is running. HDFS provides interfaces for applications to move
themselves closer to where the data is located.

Command Line User Interface
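
As a representative sketch of the shell, here are some common HDFS file
commands of this era (the paths are only examples):

    hadoop fs -mkdir /user/hadoop/input           # create a directory in HDFS
    hadoop fs -put local.txt /user/hadoop/input   # copy a local file into HDFS
    hadoop fs -ls /user/hadoop/input              # list the directory's contents
    hadoop fs -cat /user/hadoop/input/local.txt   # print a file's contents
    hadoop fs -get /user/hadoop/input/local.txt . # copy a file back to local disk
    hadoop fs -rm /user/hadoop/input/local.txt    # delete a file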




MapReduce

 - Overview

 - How MapReduce Works




Overview
- Hadoop MapReduce is a software framework for easily writing applications
that process vast amounts of data (multi-terabyte datasets) in parallel on
large clusters (thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.

- A MapReduce job usually splits the input dataset into independent chunks,
which are processed by the map tasks in a completely parallel manner. The
framework sorts the outputs of the maps, which are then input to the reduce
tasks. Typically both the input and the output of the job are stored in a
filesystem. The framework takes care of scheduling tasks, monitoring them,
and re-executing the failed tasks.

- The MapReduce framework consists of a single master JobTracker and one
worker TaskTracker per cluster node. The master is responsible for
scheduling the jobs' component tasks on the workers, monitoring them, and
re-executing the failed tasks. The workers execute the tasks as directed by
the master.
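
As a concrete illustration, below is a minimal sketch of the classic
WordCount job against the Hadoop MapReduce Java API of this era (the class
and field names follow the standard tutorial example):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map: emit (word, 1) for every word in a line of input.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts for each word. The framework has already
        // sorted and grouped the map outputs by key before this is called.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
    }

Each map task runs over one input chunk in parallel; the shuffle then
delivers all (word, count) pairs for a given word to a single reduce call,
which sums them.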




How MapReduce Works




Practice

  - Demo

  - Discussion




