HADOOP
- Nagarjuna K
-   nagarjunak@outlook.com
Why and What Is Hadoop?

A tool to process big data
What is BIG Data?
Facebook, Google+, etc.
   Whatever we do gets stored as data, or as logs


Machines, too, generate lots of data
  Cameras, mobiles, software like STAAD Pro, automated machines in
   industries, etc.


We are having an online discussion right now; even
 your reading of this presentation is being recorded
 as data.
What is BIG Data?                          ...continued

 Exponential growth of data → challenges for Google, Yahoo,
  Microsoft, Amazon

 Need to go through TBs and PBs of data to answer questions like:

    Which websites and books were popular?
    What kind of ads appeal to users?


 Existing tools became inadequate to process such large
  data sets.
Why is the data so BIG?
 Until a couple of decades ago → floppy disks

 From then on → CD/DVD drives

 Half a decade ago → hard drives (500 GB)

 Now → hard drives (1 TB) are available in abundance
Why is the data so BIG?

So WHAT?

Even the technology to read the data has taken a leap.
Why is the data so BIG?

Year   Device             Volume    Transfer speed   Time to process
1990   Optical drive      1370 MB   4.4 MB/s         ~5 minutes
2012   1 TB SATA drive    1 TB      100 MB/s         ~2.5 hrs
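
The "time to process" column is simply volume ÷ transfer speed:

  1990: 1370 MB ÷ 4.4 MB/s ≈ 311 s ≈ 5 minutes
  2012: 1 TB ≈ 1,000,000 MB ÷ 100 MB/s = 10,000 s ≈ 2.8 hours

So while capacity grew roughly 750×, transfer speed grew only about 23×: reading a full drive takes far longer than it used to.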
How to handle such BIG?

 One BIG elephant, or
 numerous small chickens?
 (That is: one powerful machine, or many commodity machines working in parallel?)
How to handle such BIG?
Concept of Torrents
  Reduce time to read by reading it from multiple sources
   simultaneously.

  Imagine if we had 100 drives, each holding one hundredth of
   the data. Working in parallel, we could read the data in less
   than two minutes.
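
  The arithmetic behind that claim, using the 2012 drive above:
  1 TB split across 100 drives is 10 GB per drive; 10 GB ≈ 10,000 MB
  ÷ 100 MB/s = 100 seconds, well under two minutes.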
How to handle such BIG? -- Issues

  How do we handle machines going up and down (failures)?

  How do we combine the data from all the machines?
Problem 1 : System’s Ups and Downs
 Commodity hardware for data storage and analysis

 Chances of failure are very high

 So, keep a redundant copy of the same data across several machines

 If one machine fails, the copy on another survives

 Google came up with a file system → GFS (Google File System), which
  implemented all these details.
GFS
 Divides data into chunks and stores them across the file system

 Can store data in the range of PBs
Problem 2 : How to combine the data?
 Data is analyzed across different machines; but how do we merge the
  results to get a meaningful outcome?

 Yes, all (or some) of the data has to travel across the network; only
  then can the merging of the data occur.

 Doing this is notoriously challenging

 Again Google → MapReduce
MapReduce
 Provides a programming model → abstracts the problem
  of disk reads and writes, transforming it into a computation
  over keys and values.

 Two phases (illustrated with a word count below)

   Map

   Reduce
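
For concreteness (word count is the classic illustration; it is not from the original slide): map turns each input record into (word, 1) pairs, and reduce sums the counts for each word.

  map(0, "the cat sat on the mat") -> [("the",1), ("cat",1), ("sat",1), ("on",1), ("the",1), ("mat",1)]
  reduce("the", [1, 1]) -> [2]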
So what is Hadoop?
An operating system?

Provides

 1. A reliable shared storage system (HDFS)

 2. An analysis system (MapReduce)
History of Hadoop
 Google was the first to build GFS and MapReduce

 They published a paper in 2004, announcing the
  brand-new technology to the world

 The technology was already well proven inside Google
  by 2004
             (Image: the MapReduce paper by Google)
History of Hadoop
 Doug Cutting saw an opportunity and led the charge
  to develop an open-source version of this
  MapReduce system, called Hadoop.

 Soon after, Yahoo and others rallied around to
 support the effort.

 Now Hadoop is a core part of:
   Facebook, Yahoo, LinkedIn, Twitter …
History of Hadoop

GFS → HDFS

MapReduce → MapReduce
HDFS                               -- A Brief
Design → streaming very large files on a commodity cluster.

1. Very Large Files
  MBs to PBs
2. Streaming
  Write-once, read-many approach
  Once huge data is placed → we tend to use the data, not modify it
  Time to read the whole dataset is what matters most
3. Commodity Cluster
  No high-end servers
  Yes, a high chance of failure (but HDFS is tolerant enough)
  Replication is done (see the configuration sketch below)
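
For illustration (this configuration sketch is not from the slides): in Apache Hadoop, the replication factor is set with the dfs.replication property in hdfs-site.xml, and 3 copies per block is the conventional default.

  <!-- hdfs-site.xml (sketch): keep 3 copies of every block -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>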
MapReduce                        -- A Brief
Large-scale data processing in parallel.

MapReduce provides:
  Automatic parallelization and distribution
  Fault-tolerance
  I/O scheduling
  Status and monitoring
Two phases in MapReduce
  Map
  Reduce
MapReduce                                     -- A Brief

 Map phase
  map (in_key, in_value) -> list(out_key, intermediate_value)
  Processes an input key/value pair
  Produces a set of intermediate pairs


 Reduce phase
  reduce (out_key, list(intermediate_value)) -> list(out_value)
  Combines all intermediate values for a particular key
  Produces a set of merged output values (usually just one)
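
A minimal word-count sketch of these two signatures in Java, assuming the org.apache.hadoop.mapreduce API (available from hadoop-0.20 onward); this is the standard textbook example, not code from the slides:

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class WordCount {

    // Map phase: (byte offset, line of text) -> (word, 1) for every word
    public static class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
          if (!token.isEmpty()) {
            word.set(token);
            context.write(word, ONE);   // emit an intermediate (word, 1) pair
          }
        }
      }
    }

    // Reduce phase: (word, [1, 1, ...]) -> (word, total count)
    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
          sum += v.get();               // combine all intermediate values for this key
        }
        context.write(key, new IntWritable(sum));  // usually one output value per key
      }
    }
  }

A driver class would normally configure the input/output paths and submit the Job; that boilerplate is omitted here.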
MapReduce   -- A Brief
Hadoop Cluster
Hadoop Ecosystem
Versions of Hadoop
We will deal with either of

  Apache hadoop-0.20
  Cloudera hadoop (cdh3)
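
As a quick aside (standard Hadoop CLI, not from the slides): you can check which distribution and version is installed with

  $ hadoop version

which prints the installed version string, confirming whether you are on Apache 0.20 or CDH3.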
Pre-Requisites
 Core Java

 Acquaintance with Linux will help.

 A Linux installation on your machine.
Thank you!
 Please email your suggestions to nagarjunak@outlook.com
