Hadoop

Simple. Scalable.
@markgunnels

mark@catamorphiclabs.com
Java. Clojure. Ruby.

    Cloudera Certified
posscon.org

April 15, 16, and 17
Agenda

 Overview
 Massively Large Data Sets and the problems therein
 Distributed File System
 MapReduce
 Pig
Overview
Doug Cutting

   Genius
Favorite Hadoop Story

     New York Times
4 Terabytes of Source Articles.
24 Hours.
5.5 Terabytes of PDFs.
Did it again.
$240.
Infoporn from Yahoo

 73 hours
 490 TB Shuffling
 280 TB Output
 4000 Nodes
 16 PB Disk Space
 32K Cores
 64 TB RAM
Hadoop solves...
Analyzing Massively Large Datasets
Two Problems

You have to distribute.
Data Storage

Capacity has grown faster than read speeds. Datasets won't fit on one disk. Tolerate node failure.
Data Analysis

Combine data from many machines. Tolerate node failure.
How Hadoop solves these problems.
Send Code to Data. Not Data to Code.
Data Storage

    HDFS
Name Node. Data Nodes.

   Master - Slave Relationship
Shard massive files across multiple machines. MB, GB, and TB.
Tolerant of Node Failure

Files replicated across at least 3 nodes.
HDFS behaves like a normal file system.
No true appends yet.
Demonstration.
Data Analysis

  MapReduce
Job Tracker. Task Nodes.

   Master - Slave Relationship.
map
Demonstration
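The live demo isn't captured in this transcript; the talk's examples were presumably in Clojure, so here is a minimal equivalent sketch of `map` in Python:

```python
# map applies a function to every element of a sequence.
# Hypothetical stand-in for the live demo: square each number.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))
print(squares)  # [1, 4, 9, 16]
```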
pmap
Demonstration
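Clojure's `pmap` is `map` run on a pool of worker threads. A rough Python analog (a sketch only: for CPU-bound Python work you would want processes rather than threads, because of the GIL):

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # The per-element work; pmap runs these on worker threads.
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, [1, 2, 3, 4]))
print(results)  # [1, 4, 9, 16]
```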
reduce
Demonstration
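Again as an assumed Python equivalent of the live demo, `reduce` folds a sequence into a single value with a combining function:

```python
from functools import reduce

# Fold the list into one value: ((((0 + 1) + 2) + 3) + 4).
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)
print(total)  # 10
```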
(reduce (pmap))
Demonstration.
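The `(reduce (pmap ...))` shape is MapReduce in miniature: map a function over the input partitions (in parallel), then reduce the partial results into one answer. A sequential Python sketch of the same shape, using a hypothetical word-count demo:

```python
from collections import Counter
from functools import reduce

lines = ["the quick brown fox", "the lazy dog", "the end"]

# Map phase: each line -> partial word counts (independent, so parallelizable).
partials = map(lambda line: Counter(line.split()), lines)

# Reduce phase: merge the partial counts into one result.
counts = reduce(lambda a, b: a + b, partials, Counter())
print(counts["the"])  # 3
```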
MapReduce

   Java
Nobody likes it.

       :-)
MapReduce

Ruby. Python. Unix Utilities.
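These languages plug in through Hadoop Streaming, which pipes records through any executable over stdin/stdout. A minimal word-count mapper in Python (the structure is the standard Streaming pattern; the file itself is illustrative):

```python
#!/usr/bin/env python
import sys

def mapper(stream):
    # Emit one "word<TAB>1" record per word; Hadoop Streaming sorts
    # mapper output by key before handing it to the reducer.
    for line in stream:
        for word in line.split():
            yield f"{word}\t1"

if __name__ == "__main__":
    for record in mapper(sys.stdin):
        print(record)
```

On a cluster this script would be passed to the Streaming jar via `-mapper`, with a matching `-reducer` that sums the counts per key; locally you can smoke-test it with `cat input.txt | ./mapper.py | sort`.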
MapReduce

  Clojure
Hadoop Ecosystem

Pig. ZooKeeper. Hive. Cascading.
Pig
HBase