Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity machines. It stores data reliably by replicating blocks across multiple nodes and runs computations in parallel on the nodes that hold the data. Its key components are the Hadoop Distributed File System (HDFS), which manages data storage across the cluster; a job tracker, which schedules and coordinates jobs; and the MapReduce programming model, which breaks a job into map tasks that process data partitions in parallel and reduce tasks that aggregate their results.
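The map-reduce model described above can be sketched in plain Python. This is only an illustration of the programming model on a single machine, not Hadoop's actual API (real Hadoop jobs are typically written in Java against the `org.apache.hadoop.mapreduce` classes); the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative choices, and the classic word-count example stands in for an arbitrary job.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for each word in one input split."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate all counts emitted for one word."""
    return key, sum(values)

# Two "splits" standing in for blocks stored on different HDFS nodes.
splits = ["the quick brown fox", "the lazy dog the end"]

# In Hadoop, each map task would run on the node holding its split.
mapped = [pair for split in splits for pair in map_phase(split)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())

print(counts["the"])  # "the" appears 3 times across both splits
```

Because each map call sees only its own split and each reduce call sees only one key's values, the framework is free to run many of them concurrently on different nodes, which is what makes the model scale.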