Hadoop

Simple. Scalable.
@markgunnels

mark@catamorphiclabs.com
Java. Clojure. Ruby.

    Cloudera Certified
posscon.org

April 15, 16, and 17
Agenda

 Overview
 Massively Large Data Sets and the problems therein
 Distributed File System
 MapReduce
 Pig
Overview
Doug Cutting

   Genius
Favorite Hadoop Story

     New York Times
4 Terabytes of Source Articles.
24 Hours.
5.5 Terabytes of PDFs.
Did it again.
$240.
Infoporn from Yahoo

 73 hours
 490 TB Shuffling
 280 TB Output
 4000 Nodes
 16 PB Disk Space
 32K Cores
 64 TB RAM
Hadoop solves...
Analyzing Massively Large Datasets
Two Problems

You have to distribute.
Data Storage

Disk capacity has grown far faster than read speeds. Datasets won't fit on one disk. Must tolerate node failure.
Data Analysis

Combine data from many machines. Tolerate node failure.
How Hadoop solves these problems.
Send Code to Data. Not Data to Code.
Data Storage

    HDFS
NameNode. DataNodes.

   Master - Slave Relationship
Shard massive files across multiple machines.
MB, GB, and TB.
Tolerant of Node Failure

Files are replicated across multiple nodes (3 copies by default).
HDFS behaves like a normal file system.

No true appends yet.
Demonstration.
Data Analysis

  MapReduce
JobTracker. TaskTrackers.

   Master - Slave Relationship.
map
Demonstration
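The live demo isn't captured here, but a minimal Clojure sketch of what a `map` demonstration would show might look like this (the example values are illustrative):

```clojure
;; map applies a function to every element of a sequence,
;; returning a lazy sequence of the results.
(map inc [1 2 3 4])       ;; => (2 3 4 5)
(map #(* % %) (range 5))  ;; => (0 1 4 9 16)
```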
pmap
Demonstration
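A sketch of what a `pmap` demonstration might show: `pmap` has the same semantics as `map` but evaluates elements on multiple threads, so a slow per-element function finishes in roughly the time of one batch rather than the sum (the 100 ms sleep is an illustrative stand-in for expensive work):

```clojure
;; pmap is map's parallel sibling: same results,
;; elements computed on multiple threads.
(defn slow-inc [n]
  (Thread/sleep 100)  ;; simulate expensive per-element work
  (inc n))

;; doall forces the lazy sequences so the timings are meaningful.
(time (doall (map  slow-inc (range 8))))  ;; sequential: ~8 sleeps
(time (doall (pmap slow-inc (range 8))))  ;; parallel: far fewer
```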
reduce
Demonstration
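And a sketch of `reduce`, which folds a sequence down to a single value -- the aggregation half of the MapReduce model (the tallying example is illustrative):

```clojure
;; reduce combines a sequence into one value.
(reduce + [1 2 3 4 5])  ;; => 15

;; The accumulator can be a map -- e.g. tallying occurrences.
(reduce (fn [counts word]
          (update counts word (fnil inc 0)))
        {}
        ["the" "fox" "the"])  ;; => {"the" 2, "fox" 1}
```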
(reduce (pmap))
Demonstration.
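Putting the two together -- a `reduce` over a `pmap` -- is MapReduce in miniature: a parallel map produces partial results, and a reduce merges them. A hypothetical word-count sketch (the input lines are made up):

```clojure
(require '[clojure.string :as str])

;; "Map" phase: count words in each line, in parallel.
(defn count-words [line]
  (frequencies (str/split line #"\s+")))

(def lines ["the quick brown fox" "the lazy dog" "the fox"])

;; "Reduce" phase: merge the per-line tallies by summing counts.
(reduce (partial merge-with +)
        (pmap count-words lines))
;; => {"the" 3, "quick" 1, "brown" 1, "fox" 2, "lazy" 1, "dog" 1}
```

This is exactly the shape of a Hadoop job: mappers run the per-line counting over their shard of the data, and reducers merge the partial counts.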
MapReduce

   Java
Nobody likes it.

       :-)
MapReduce

Ruby. Python. Unix Utilities.
MapReduce

  Clojure
Hadoop Ecosystem

Pig. ZooKeeper. Hive. Cascading.
Pig
HBase

Hadoop - Simple. Scalable.