Hadoop
Big Data
• Lots of Data
• The challenges include capture, storage, search, transfer, analysis and visualization.
• Systems and enterprises generate huge amounts of data, from terabytes to petabytes of information.
Characteristics of Big Data
The 3Vs are
• Volume
• Variety
• Velocity
What is Hadoop?
• Apache Hadoop is a framework that allows for the distributed processing of large datasets across clusters of commodity computers using a simple programming model.
• It is open-source data management software.
Hadoop System Principles
• Scale out rather than scale up
• Bring code to data rather than data to code
• Deal with failures – they are common
• Abstract complexity of distributed and concurrent
applications
HDFS
The filesystem cluster is managed by three types of processes:
• Name node
• Data node
• Secondary name node
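As a quick sanity check on a single-node (pseudo-distributed) install, the stock start-dfs.sh script brings these daemons up, and the JDK's jps tool lists the running Java processes; a minimal sketch:

  $ start-dfs.sh
  $ jps   # expect NameNode, DataNode and SecondaryNameNode in the output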
Files and Blocks
• Files are split into blocks (the single unit of storage).
• Blocks are replicated across machines at load time.
• By default, each block is replicated 3 times (configurable, as sketched below).
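Both the block size and the replication factor are set in hdfs-site.xml. A minimal sketch using the stock Hadoop 2+ property names (the values shown are the usual defaults, not recommendations):

  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>3</value>           <!-- replication factor per block -->
    </property>
    <property>
      <name>dfs.blocksize</name>
      <value>134217728</value>   <!-- 128 MB per block -->
    </property>
  </configuration>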
Hadoop - MapReduce
• A model for processing large amounts of data in parallel.
• Derived from functional programming.
• Can be implemented in multiple languages.
MapReduce Model
• Imposes key-value input/output.
• Defines map and reduce functions:
map : (k1, v1) -> list(k2, v2)
reduce : (k2, list(v2)) -> list(k3, v3)
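The classic word count makes these signatures concrete: map turns (byte offset, line) into (word, 1) pairs, and reduce sums the counts per word. A minimal sketch against the org.apache.hadoop.mapreduce API (class names are illustrative):

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class WordCount {

    // map : (k1, v1) -> list(k2, v2), here (offset, line) -> list of (word, 1)
    public static class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, ONE);   // emit (k2, v2)
        }
      }
    }

    // reduce : (k2, list(v2)) -> list(k3, v3), here (word, counts) -> (word, sum)
    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);   // emit (k3, v3)
      }
    }
  }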
MapReduce Framework
• Takes care of distributed processing and coordination
• Scheduling
• Task localization with data (tasks are placed where their data lives)
• Error Handling
• Data Synchronization
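All of the above is the framework's job; user code only declares the pieces. A minimal driver sketch (assuming the WordCount mapper and reducer classes from the previous sketch):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCountDriver {
    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count");
      job.setJarByClass(WordCountDriver.class);
      job.setMapperClass(WordCount.TokenizerMapper.class);
      job.setCombinerClass(WordCount.IntSumReducer.class); // optional map-side aggregation
      job.setReducerClass(WordCount.IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
      FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
      // Scheduling, data-local task placement, retries and the shuffle/sort
      // all happen inside the framework from here on.
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }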
YARN Daemons
- Node Manager
• Manages the resources of a single node
• There is one instance per node in the cluster
- Resource Manager
• Manages resources for the cluster
• Instructs Node Managers to allocate resources
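Both daemons can be inspected with the stock yarn command-line tool; a quick sketch:

  $ yarn node -list          # Node Managers registered with the Resource Manager
  $ yarn application -list   # applications the Resource Manager is scheduling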
