Storage + Compute
Data Processing
1. Where is the data stored?
2. Where does the compute run?
Storage
● In Memory
● On Disk
  ○ File System
    ■ Local file system - XFS, ZFS, etc.
    ■ Distributed File System
      ● HDFS - Hadoop Distributed File System
      ● S3 (an object store, used in a similar role)
      ● Ceph, etc.
HDFS (Hadoop Distributed File System)
Motivation:
● Parallel processing of data
  ○ When data is distributed, it can be processed in parallel *
● Computation goes to the data, not data to the computation
* Some problems are not a fit for this.
Word Count Problem
● Single file
  ○ O(n) time complexity, where ‘n’ = size of the file
Word Count Problem
● Distributed data
  ○ O(m) time complexity when the files are processed in parallel
  ○ ‘m’ = size of largest file
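A minimal Python sketch of the two cases above. The sample text and the way it is split into three "files" are made up for illustration, and the word totals `n` and `m` are only a stand-in for running time, not a measurement.

```python
from collections import Counter

def count_words(text):
    """Count word occurrences in one piece of text: O(size of the text)."""
    return Counter(text.split())

# Hypothetical data: one logical data set stored as three separate "files".
files = [
    "the quick brown fox",
    "jumps over the lazy dog and the cat",
    "the end",
]

# Single file: one worker visits every word, so the cost grows with n.
single_file = " ".join(files)
single_counts = count_words(single_file)
n = len(single_file.split())

# Distributed data: each file can be counted independently (and so in
# parallel); the slowest worker is the one holding the largest file.
partial_counts = [count_words(f) for f in files]
m = max(len(f.split()) for f in files)

# Merging the partial counts gives the same answer as the single-file run.
merged_counts = sum(partial_counts, Counter())

print(n, m)                            # 14 8
print(merged_counts == single_counts)  # True
```

The merge step at the end is the extra work that distribution introduces; it is small here because it touches one entry per distinct word rather than every word.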
Parallel Compute
● Parallel computation on the data
● Computation goes to the data, not data to the computation. Compute works over the data local to it.
● A final aggregation happens.
This compute paradigm is called the “MapReduce” framework.
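The three bullets above can be sketched for the word count problem as explicit map, shuffle and reduce steps. This is a toy illustration in plain Python, not Hadoop's actual API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample chunks are hypothetical, and threads stand in for the cluster nodes that would each hold one chunk.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(chunk):
    """Map: runs next to one chunk of data and emits (word, 1) pairs."""
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped_outputs):
    """Group the emitted values by key (word) across all mappers."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for word, one in pairs:
            groups[word].append(one)
    return groups

def reduce_phase(word, values):
    """Reduce: the final aggregation for a single key."""
    return word, sum(values)

def word_count(chunks):
    # Each chunk is mapped independently; threads stand in for cluster nodes.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, chunks))
    groups = shuffle(mapped)
    return dict(reduce_phase(w, v) for w, v in groups.items())

chunks = ["a b a", "b c", "a"]  # hypothetical data blocks
result = word_count(chunks)
print(result)  # {'a': 3, 'b': 2, 'c': 1}
```

Only the small (word, count) pairs move between steps; the raw chunks stay where they are, which is the "computation goes to the data" idea in miniature.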
Storage + Compute
HDFS + MapReduce = Hadoop
Some Notes
● Hadoop version 1 - has the MapReduce (MR) framework
● Hadoop version 2 - adds YARN (resource scheduler)
● Apache Spark
  ○ Alternative to the MapReduce compute framework
  ○ Can use Apache Spark with data in HDFS
  ○ Can run on YARN with data in S3! (we just use YARN features, without using MR or HDFS features) :)
By Vishnu Rao
mash213.wordpress.com
linkedin.com/in/213vishnu

Simple introduction to Hadoop