Hadoop

Simple. Scalable.
@markgunnels

mark@catamorphiclabs.com
Java. Clojure. Ruby.

    Cloudera Certified
posscon.org

April 15, 16, and 17
Agenda

 Overview
 Massively Large Data Sets and the problems therein
 Distributed File System
 MapReduce
 Pig
Overview
Doug Cutting

   Genius
Favorite Hadoop Story

     New York Times
4 Terabytes of Source Articles.
24 Hours.
5.5 Terabytes of PDFs.
Did it again.
$240.
Infoporn from Yahoo

 73 hours
 490 TB Shuffling
 280 TB Output
 4000 Nodes
 16 PB Disk Space
 32K Cores
 64 TB RAM
Hadoop solves...
Analyzing Massively Large Datasets
Two Problems

You have to distribute.
Data Storage

Disk capacity has grown far faster than read speeds. Datasets won't fit on one disk. Must tolerate node failure.
Data Analysis

Combine data from many machines. Tolerate node failure.
How Hadoop solves these problems.
Send Code to Data. Not Data to Code.
Data Storage

    HDFS
NameNode. DataNodes.

   Master - Slave Relationship
Shard massive files across multiple machines.
MB, GB, and TB.
Tolerant of Node Failure

Files are replicated across multiple nodes (3 copies by default).
HDFS behaves like a normal file system.

No true appends yet.
Demonstration.
Data Analysis

  MapReduce
JobTracker. TaskTrackers.

   Master - Slave Relationship.
map
Demonstration
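The live demo isn't captured here, but a minimal Clojure sketch of what a `map` demonstration would show might look like this (the example values are illustrative):

```clojure
;; map applies a function to every element of a sequence,
;; returning a lazy sequence of the results.
(map inc [1 2 3 4])       ;; => (2 3 4 5)
(map #(* % %) (range 5))  ;; => (0 1 4 9 16)
```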
pmap
Demonstration
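A sketch of what a `pmap` demonstration might show: `pmap` has the same semantics as `map` but evaluates elements on multiple threads, so a slow per-element function finishes in roughly the time of one batch rather than the sum (the 100 ms sleep is an illustrative stand-in for expensive work):

```clojure
;; pmap is map's parallel sibling: same results,
;; elements computed on multiple threads.
(defn slow-inc [n]
  (Thread/sleep 100)  ;; simulate expensive per-element work
  (inc n))

;; doall forces the lazy sequences so the timings are meaningful.
(time (doall (map  slow-inc (range 8))))  ;; sequential: ~8 sleeps
(time (doall (pmap slow-inc (range 8))))  ;; parallel: far fewer
```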
reduce
Demonstration
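And a sketch of `reduce`, which folds a sequence down to a single value -- the aggregation half of the MapReduce model (the tallying example is illustrative):

```clojure
;; reduce combines a sequence into one value.
(reduce + [1 2 3 4 5])  ;; => 15

;; The accumulator can be a map -- e.g. tallying occurrences.
(reduce (fn [counts word]
          (update counts word (fnil inc 0)))
        {}
        ["the" "fox" "the"])  ;; => {"the" 2, "fox" 1}
```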
(reduce (pmap))
Demonstration.
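Putting the two together -- a `reduce` over a `pmap` -- is MapReduce in miniature: a parallel map produces partial results, and a reduce merges them. A hypothetical word-count sketch (the input lines are made up):

```clojure
(require '[clojure.string :as str])

;; "Map" phase: count words in each line, in parallel.
(defn count-words [line]
  (frequencies (str/split line #"\s+")))

(def lines ["the quick brown fox" "the lazy dog" "the fox"])

;; "Reduce" phase: merge the per-line tallies by summing counts.
(reduce (partial merge-with +)
        (pmap count-words lines))
;; => {"the" 3, "quick" 1, "brown" 1, "fox" 2, "lazy" 1, "dog" 1}
```

This is exactly the shape of a Hadoop job: mappers run the per-line counting over their shard of the data, and reducers merge the partial counts.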
MapReduce

   Java
Nobody likes it.

       :-)
MapReduce

Ruby. Python. Unix Utilities.
MapReduce

  Clojure
Hadoop Ecosystem

Pig. ZooKeeper. Hive. Cascading.
Pig
HBase

Hadoop - Simple. Scalable.