Hadoop 101
A really quick overview of the concepts…
A few Terabytes of Data...
Text processing--a few hours?
But what if you have more data?
Network Storage--Petabytes!
What if you need compute power for complex algorithms?
8 cores? 16 cores? 64 cores? 512 GB of RAM?
A network of commodity computers
Run jobs on PART of the data on each computer, then
AGGREGATE the intermediate results from each computer.
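As a rough illustration of "run on PART, then AGGREGATE", here is a minimal single-process sketch. The log lines, the `split_into_parts` and `count_errors` helpers, and the choice of three computers are all made up for the example; a real cluster would run each part on a different machine.

```python
def split_into_parts(lines, num_computers):
    """Deal the lines out round-robin so each computer gets PART of the data."""
    return [lines[i::num_computers] for i in range(num_computers)]

def count_errors(part):
    """The job every computer runs on its own part: count the ERROR lines."""
    return sum(1 for line in part if "ERROR" in line)

# Hypothetical input data.
log_lines = [
    "INFO start",
    "ERROR disk full",
    "INFO retry",
    "ERROR timeout",
    "INFO done",
    "ERROR disk full",
    "INFO exit",
]

parts = split_into_parts(log_lines, 3)
# Each entry stands in for the intermediate result from one computer.
partial_counts = [count_errors(part) for part in parts]
# Aggregating is just combining the intermediate results.
total_errors = sum(partial_counts)
print(partial_counts, total_errors)
```

The point of the sketch is that no computer ever needs to see the whole data set; only the small intermediate results travel to the aggregation step.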
Let’s add a computer to manage the process of
job delegation, merging the results...
and keeping track of the results...
We also need something to keep track of what files are
where, so we know where the data is that needs to be
computed...
When you have a lot of computers, and even more hard
drives,
one thing I can guarantee...
Computers will eventually fail.
Hard drives will eventually fail.
Even whole racks will fail.
If a computer fails and you only have one copy of your
data...
You will be very, very unhappy.
So let’s store multiple copies of the data. Hard drives are
CHEAP!
If one hard drive fails... we are still OK
If one computer fails... we are still OK
Even if a whole rack fails... we are still OK
Once we find a failure, let’s have the system make new copies
to replace the ones that were lost.
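The replication idea above can be sketched in a few lines. This is a simplified stand-in, not Hadoop's actual placement policy: the cluster layout in `NODE_RACK`, the block names, and the "prefer a rack with no copy yet" rule are assumptions made for illustration.

```python
REPLICATION = 3  # assumed target; the slides later mention three copies

# Hypothetical cluster layout: node name -> rack name.
NODE_RACK = {
    "n1": "rack-a", "n2": "rack-a", "n3": "rack-a",
    "n4": "rack-b", "n5": "rack-b", "n6": "rack-b",
}

def pick_nodes(existing, live_nodes, wanted=REPLICATION):
    """Add nodes to `existing` until `wanted` copies exist, preferring new racks."""
    chosen = list(existing)
    while len(chosen) < wanted:
        used_racks = {NODE_RACK[n] for n in chosen}
        candidates = [n for n in sorted(live_nodes) if n not in chosen]
        if not candidates:
            break  # not enough live nodes; the block stays under-replicated
        # Stable sort: nodes on a rack that holds no copy yet come first.
        candidates.sort(key=lambda n: NODE_RACK[n] in used_racks)
        chosen.append(candidates[0])
    return chosen

def handle_failure(block_map, failed, live_nodes):
    """Forget the failed nodes and re-copy every affected block elsewhere."""
    for block, holders in block_map.items():
        survivors = [n for n in holders if n not in failed]
        block_map[block] = pick_nodes(survivors, live_nodes)
    return block_map

live = set(NODE_RACK)
initial = pick_nodes([], live)
block_map = {"blk-1": list(initial), "blk-2": pick_nodes(["n5"], live)}

# Simulate a whole rack failing.
failed = {n for n, rack in NODE_RACK.items() if rack == "rack-a"}
live -= failed
block_map = handle_failure(block_map, failed, live)
print(block_map)
```

Because the initial placement spreads copies over both racks, every block still has at least one surviving copy after `rack-a` disappears, and the system can rebuild the missing copies from it.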
Send the compute job to all nodes.
And let each node run it on its part of the data…
One is stuck…
We have three copies, so we can run the same task on another copy
and take whichever attempt finishes first.
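Running the same task on more than one copy and keeping the first result can be sketched with Python's standard `concurrent.futures` module. The node names, the delays, and the `run_task` helper are hypothetical; threads stand in for separate machines.

```python
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def run_task(node, delay):
    """Stand-in for a task; `delay` simulates how slow that node is."""
    time.sleep(delay)
    return node, "result for block-7"

# Hypothetical nodes that each hold a copy of the same block.
attempts = {"node-a": 0.5, "node-b": 0.01}  # node-a is the stuck one

with ThreadPoolExecutor(max_workers=len(attempts)) as pool:
    futures = [pool.submit(run_task, node, delay) for node, delay in attempts.items()]
    # Block only until the first attempt finishes.
    done, not_done = wait(futures, return_when=FIRST_COMPLETED)
    winner, result = next(iter(done)).result()
    for future in not_done:
        future.cancel()  # best effort: an attempt that is already running keeps going

print(winner, result)
```

This only works because the data exists in several places: the duplicate attempt reads its own local copy instead of waiting on the slow machine.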
Merge sorted sets based on some key…
A-E F-J K-O P-T U-Z
…and write partial results
PART-01 PART-02 PART-03 PART-04 PART-05
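A minimal sketch of the key-range idea, assuming string keys partitioned by first letter. The sample keys and the `partition_for` helper are invented for the example; `heapq.merge` from the standard library does the merging of already-sorted runs.

```python
import heapq

RANGES = ["AE", "FJ", "KO", "PT", "UZ"]  # the five key ranges from the slide

def partition_for(key):
    """Return the index of the range that the key's first letter falls in."""
    first = key[0].upper()
    for index, (low, high) in enumerate(RANGES):
        if low <= first <= high:
            return index
    raise ValueError(f"key {key!r} does not start with a letter A-Z")

# Hypothetical sorted output from two workers.
worker_outputs = [
    ["apple", "grape", "kiwi", "zucchini"],
    ["banana", "fig", "pear", "ugli"],
]

# Route each worker's sorted keys to the partition that owns them.
runs = [[[] for _ in worker_outputs] for _ in RANGES]
for w, output in enumerate(worker_outputs):
    for key in output:
        runs[partition_for(key)][w].append(key)

# Each partition merges its already-sorted runs into one sorted partial result.
parts = {
    f"PART-{i + 1:02d}": list(heapq.merge(*runs[i]))
    for i in range(len(RANGES))
}
print(parts)
```

Since the ranges do not overlap and each part is sorted, reading PART-01 through PART-05 in order yields one globally sorted result.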
Guess what? We’ve just invented Hadoop!
So let’s talk about the pieces of Hadoop.
Data nodes store and manage the data on a single “slave”
computer.
Task trackers manage the compute on that computer.
The job tracker manages the task trackers and ships code to
the compute nodes.
The name node manages distribution and replication of the
data across the data nodes.
MapReduce: Task Tracker, Job Tracker
HDFS (Hadoop Distributed File System): Data Node, Name Node
HDFS
Visual Example
Map
Shuffle
Reduce
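Since the slide images are not available here, the three phases can be illustrated with a small word count. This is a single-process sketch, not Hadoop's API: the function names `map_phase`, `shuffle_phase`, and `reduce_phase` and the input lines are made up for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]

def shuffle_phase(pairs):
    """Shuffle: group all values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

# Hypothetical input lines.
lines = ["hadoop stores data", "hadoop processes data"]

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
word_counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(word_counts)
```

On a cluster, the map calls would run next to the data on each node, the shuffle would move the grouped pairs across the network, and each reducer would handle its own range of keys.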
Putting It All Together
Hadoop 101 v2

Given at IoT Asia 2014

Published in: Data & Analytics, Technology