Hadoop
Big Data
• Lots of Data
• The challenges include capture, storage, search, transfer, analysis and visualization.
• Systems and enterprises generate huge amounts of data, from terabytes to petabytes of information.
Characteristics of Big Data
The 3Vs are
• Volume
• Variety
• Velocity
What is Hadoop?
• Apache Hadoop is a framework that allows for the distributed processing of large datasets across clusters of commodity computers using a simple programming model.
• It is open-source data management software.
Hadoop System Principles
• Scale out rather than scale up
• Bring code to data rather than data to code
• Deal with failures – they are common
• Abstract complexity of distributed and concurrent
applications
HDFS
The filesystem cluster is managed by three types of processes:
• Name node
• Data node
• Secondary name node
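As a quick sanity check on a single-node (pseudo-distributed) install, the stock start-dfs.sh script brings these daemons up, and the JDK's jps tool lists the running Java processes; a minimal sketch:

  $ start-dfs.sh
  $ jps   # expect NameNode, DataNode and SecondaryNameNode in the output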
Files and Blocks
• Files are split into blocks (the single unit of storage).
• Blocks are replicated across machines at load time.
• By default, each block is replicated 3 times (configurable, as sketched below).
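Both the block size and the replication factor are set in hdfs-site.xml. A minimal sketch using the stock Hadoop 2+ property names (the values shown are the usual defaults, not recommendations):

  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>3</value>           <!-- replication factor per block -->
    </property>
    <property>
      <name>dfs.blocksize</name>
      <value>134217728</value>   <!-- 128 MB per block -->
    </property>
  </configuration>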
Hadoop - MapReduce
• A model for processing large amounts of data in parallel.
• Derived from functional programming.
• Can be implemented in multiple languages.
MapReduce Model
• Imposes key-value input/output.
• Defines map and reduce functions:
map : (k1, v1) -> list(k2, v2)
reduce : (k2, list(v2)) -> list(k3, v3)
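The classic word count makes these signatures concrete: map turns (byte offset, line) into (word, 1) pairs, and reduce sums the counts per word. A minimal sketch against the org.apache.hadoop.mapreduce API (class names are illustrative):

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class WordCount {

    // map : (k1, v1) -> list(k2, v2), here (offset, line) -> list of (word, 1)
    public static class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, ONE);   // emit (k2, v2)
        }
      }
    }

    // reduce : (k2, list(v2)) -> list(k3, v3), here (word, counts) -> (word, sum)
    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);   // emit (k3, v3)
      }
    }
  }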
MapReduce Framework
• Takes care of distributed processing and coordination
• Scheduling
• Task localization with data (tasks are placed where their data lives)
• Error Handling
• Data Synchronization
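All of the above is the framework's job; user code only declares the pieces. A minimal driver sketch (assuming the WordCount mapper and reducer classes from the previous sketch):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCountDriver {
    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count");
      job.setJarByClass(WordCountDriver.class);
      job.setMapperClass(WordCount.TokenizerMapper.class);
      job.setCombinerClass(WordCount.IntSumReducer.class); // optional map-side aggregation
      job.setReducerClass(WordCount.IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
      FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
      // Scheduling, data-local task placement, retries and the shuffle/sort
      // all happen inside the framework from here on.
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }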
YARN Daemons
- Node Manager
• Manages the resources of a single node
• There is one instance per node in the cluster
- Resource Manager
• Manages resources for the cluster
• Instructs Node Managers to allocate resources
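Both daemons can be inspected with the stock yarn command-line tool; a quick sketch:

  $ yarn node -list          # Node Managers registered with the Resource Manager
  $ yarn application -list   # applications the Resource Manager is scheduling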
