Simple Introduction to Hadoop

Data Processing
1. Where is the data stored?
2. Where does the compute run?

Storage
● In Memory
● On Disk
○ File System
■ Local file system - xfs, zfs, etc.
■ Distributed File System
● HDFS - Hadoop Distributed File System
● S3
● Ceph, etc.

H-Distributed-FS
Motivation:
● When data is distributed, it can be processed in parallel *
● Computation goes to the data, not data to the computation
* Some problem statements are not a fit for this.

Word Count Problem
● Single file
○ O(n) time complexity
○ 'n' = size of the file
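
The single-file case can be sketched in a few lines of Python. This is only an illustration, not part of Hadoop: the function name `count_words` and the sample text are assumptions made for the example. It reads the input once, so the work grows linearly with the size of the file.

```python
from collections import Counter

def count_words(text: str) -> Counter:
    """Count word occurrences in a single pass over the text."""
    counts = Counter()
    for line in text.splitlines():
        # Every word is visited exactly once, so the work is O(n) in the input size.
        counts.update(line.lower().split())
    return counts

# Hypothetical stand-in for the contents of one file.
sample = "the quick fox\nthe lazy dog\nthe end"
single_counts = count_words(sample)
print(single_counts.most_common(1))  # [('the', 3)]
```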

Word Count Problem
● Distributed data
○ O(m) time complexity for the parallel counting step
○ 'm' = size of the largest file
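
A rough sketch of the distributed case, assuming each string in `chunks` stands in for a file held on a different machine. Threads on one machine are used here only as a stand-in for separate nodes (in CPython they do not give real CPU parallelism), and the names `count_chunk` and `chunks` are made up for the example. Because every file is counted independently, the slowest task is the one holding the largest file.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(text: str) -> Counter:
    """Count the words of one file, independently of all other files."""
    return Counter(text.lower().split())

# Hypothetical stand-ins for files stored on three different machines.
chunks = [
    "the quick fox",
    "the lazy dog jumps over the fence",
    "the end",
]

# One worker per file: each file is counted without waiting for the others.
with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
    partial_counts = list(pool.map(count_chunk, chunks))

# The counting step is bounded by the largest file ('m' on the slide).
largest = max(len(chunk.split()) for chunk in chunks)
print(largest)  # 7
```

The per-file results in `partial_counts` are still separate at this point; combining them is the aggregation step covered later in the deck.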

Parallel Compute
● Parallel Computation on the data

Parallel Compute
● Computation goes to the data, not data to the computation.
● Each compute task works over the data local to it.
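
To make "computation goes to the data" concrete, here is a toy scheduler sketch. It is not how Hadoop's scheduler is implemented: the block names, node names, and the `assign_tasks` function are all hypothetical. The only idea it shows is that each task is placed on a node that already stores the block it needs, so no block has to be copied across the network.

```python
from __future__ import annotations

def assign_tasks(block_locations: dict[str, list[str]]) -> dict[str, str]:
    """Place each block's task on the least-loaded node that already holds the block."""
    load: dict[str, int] = {}
    assignment: dict[str, str] = {}
    for block, nodes in block_locations.items():
        # Only nodes that store the block are candidates, so the data never moves.
        node = min(nodes, key=lambda n: load.get(n, 0))
        assignment[block] = node
        load[node] = load.get(node, 0) + 1
    return assignment

# Hypothetical layout: which nodes hold a copy of which block.
block_locations = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-a", "node-c"],
    "block-3": ["node-b", "node-c"],
}
plan = assign_tasks(block_locations)
print(plan)
```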

Parallel Compute
● A final aggregation happens.
● This compute paradigm is called the "MapReduce" framework.
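
The whole paradigm can be imitated in plain Python as a minimal sketch. This is not the Hadoop MapReduce API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample `chunks` are assumptions chosen for illustration. The map step emits a `(word, 1)` pair per word, the shuffle step groups the pairs by word, and the reduce step performs the final aggregation.

```python
from __future__ import annotations
from collections import defaultdict

def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one chunk of data."""
    return [(word, 1) for word in chunk.lower().split()]

def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by their key (the word)."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped: dict[str, list[int]]) -> dict[str, int]:
    """Reduce: aggregate the values of each key into a final count."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical stand-ins for data blocks stored on different machines.
chunks = ["the quick fox", "the lazy dog", "the end"]

# In a real cluster each map_phase call would run next to its own block.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
totals = reduce_phase(shuffle(mapped))
print(totals["the"])  # 3
```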

Some Notes
● Hadoop Version 1 - has the MapReduce (MR) framework
● Hadoop Version 2 - adds YARN (resource scheduler)
● Apache Spark
○ Alternative to the MapReduce compute framework
○ Can use Apache Spark with data in HDFS
○ Can run on YARN with data in S3! (we just use YARN features
without using MR and HDFS features) :)

HDFS + MapReduce = Hadoop
By Vishnu Rao
mash213.wordpress.com
linkedin.com/in/213vishnu
