Intro to Hadoop
TriHUG, July 2010


Jeff Turner
Bronto Software
Who am I?

Director of Platform Engineering at Bronto

Former Googler/FeedBurner(er)

Web Analytics background

Still working this out in therapy
What is a Hadoop?
Open source distributed computing framework built on Java

Named by Doug Cutting (Apache Lucene) after his son’s toy elephant

Main components: HDFS and MapReduce

Heavily used and sponsored by Yahoo

Also used by Facebook, Twitter, Rackspace, LinkedIn, countless others

Tremendous community and growing popularity
What does Hadoop do?
Networks nodes together to combine storage and computing power

Scales to petabytes of storage

Manages fault tolerance and data replication automagically

Excels at processing semi-structured and unstructured data

Provides framework for analyzing data in parallel (MapReduce)
What does Hadoop not do?
No random access (it’s not a database)

Not real-time (it’s batch oriented)

Doesn’t make things obvious (there’s a learning curve)
Where do we start?
1. HDFS & MapReduce

2. ???

3. Profit
Hadoop’s Filesystem (HDFS)
Hadoop Distributed File System, based on Google’s GFS whitepaper

Data stored in blocks across cluster

Hadoop manages replication, node failure, rebalancing

Namenode is the master; Datanodes are slaves

Data is stored on disk, but not accessible via the local file system; use the Hadoop API/tools
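
Not in the original deck, but to make the "use the Hadoop API" point concrete: a minimal sketch of reading an HDFS file with the Java FileSystem API. The namenode URI and file path here are illustrative, not from the talk.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point the client at the Namenode; host/port are illustrative
            conf.set("fs.default.name", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            // Opening the file asks the Namenode for block locations,
            // then streams the block data from the Datanodes
            Path file = new Path("/logs/access.log");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file)));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
            in.close();
        }
    }

The bundled hadoop fs shell commands (ls, put, get, cat) go through this same client path.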
How HDFS stores data
Hadoop Client/API talks to the Namenode

Namenode looks up block locations and returns which Datanodes have the data

Hadoop Client/API talks to the Datanodes to read the file data

This is the only way to access HDFS data

HDFS data on the local file system is stored in blocks all over the cluster

[Diagram: a Namenode holding the file-to-block map (e.g. file001 on Datanodes 1, 2, 3; file006 on Datanode 4) and four Datanodes, each storing a subset of the blocks]
About that Namenode ...
Namenode manages filesystem and file metadata; Datanodes store the actual blocks of data

Namenode keeps track of available Datanodes and file locations across the cluster

Namenode is a SPOF (single point of failure)

If you lose the Namenode metadata, Hadoop has no idea which files are in which blocks

[Diagram: one Namenode coordinating four Datanodes]
HDFS Tips & Tricks
Write Namenode metadata to multiple local devices & a remote device (NFS mount)

No RAID, use JBOD. More disks == more disk I/O

Mount disks with noatime (skip writing last accessed time on file reads)

LZO compression; saves space, speeds network transfer

Tweak and test settings with included JARs: TestDFSIO, sort example
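
A sketch of that first tip, assuming a 0.20-era hdfs-site.xml: the Namenode writes its metadata to every directory listed in dfs.name.dir, so listing two local disks plus an NFS mount keeps copies on independent devices. The paths are illustrative.

    <!-- hdfs-site.xml: example paths, adjust to your mounts -->
    <property>
      <name>dfs.name.dir</name>
      <value>/data/1/dfs/name,/data/2/dfs/name,/mnt/nfs/dfs/name</value>
    </property>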
Quick break before we move on to MapReduce
Hadoop’s MapReduce
Framework for running tasks in parallel, based on Google’s whitepaper

JobTracker is the master; it schedules tasks on nodes, monitors tasks, and retries failures

TaskTrackers are the slaves; they run specified tasks against specified bits of data on HDFS

Map/Reduce functions operate on smaller parts of the problem, distributed across multiple nodes
Oversimplified MapReduce Example
18.106.61.94 - [18/Jul/2010:07:02:42 -0400] "GET /index1 HTTP/1.1" 200 354 company.com "-" "User agent"
77.220.219.58 - [18/Jul/2010:07:02:42 -0400] "GET /index1 HTTP/1.1" 200 238 company.com "-" "User agent"
121.41.7.104 - [18/Jul/2010:07:02:42 -0400] "GET /index2 HTTP/1.1" 200 2079 company.com "-" "User agent"
42.7.64.102 - - [20/Jul/2010:07:02:42 -0400] "GET /index1 HTTP/1.1" 200 173 company.com "-" "User agent"

1. Each line of the log file is input to the map function. The map parses the line and emits a key/value pair representing the page, and that it was viewed once.

    mapper(filename, file-contents):
      for each line in file-contents:
        page = parsePage(line)
        emit(page, 1)

2. The reducer is given a key and all occurrences of values for that key. The reducer sums the values and outputs a key/value pair representing the page and a total number of views.

    reduce(key, values):
      int views = 0
      for each value in values:
        views++
      emit(key, views)

3. The result is a count of how many times each webpage has appeared in this log file.

    (index1, 3)
    (index2, 1)
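
For the curious, here is roughly what that pseudocode looks like in real Hadoop Java (the org.apache.hadoop.mapreduce API). The class names and the regex standing in for parsePage are my own illustration, not from the talk.

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PageViews {

        // Map: parse the request path out of each log line, emit (page, 1)
        public static class ViewMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final Pattern PAGE =
                    Pattern.compile("\"GET (\\S+) HTTP");
            private static final IntWritable ONE = new IntWritable(1);

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                Matcher m = PAGE.matcher(line.toString());
                if (m.find()) {
                    context.write(new Text(m.group(1)), ONE);
                }
            }
        }

        // Reduce: sum the 1s for each page to get a total view count
        public static class ViewReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text page, Iterable<IntWritable> counts,
                    Context context) throws IOException, InterruptedException {
                int views = 0;
                for (IntWritable count : counts) {
                    views += count.get();
                }
                context.write(page, new IntWritable(views));
            }
        }
    }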
Hadoop MapReduce data flow


InputFormat controls where data comes from,
breaks into InputSplits

RecordReader knows how to read InputSplit, passes
data to map function

Mappers do their thing, output intermediate data to
local disk

Hadoop shuffles and sorts the keys in the map output so all occurrences of the same key are passed to the reducer together

Reducers do their thing, send output to OutputFormat

OutputFormat controls where data goes

(chart from the Yahoo! Hadoop Tutorial: http://developer.yahoo.com/hadoop/tutorial/index.html)
Input/Output Formats

TextInputFormat - Reads text files, each line is an input

TextOutputFormat - Writes output from Hadoop to plain text

DBInputFormat - Reads JDBC sources, rows map to custom DBWritable

DBOutputFormat - Writes to JDBC sources, again using DBWritable

ColumnFamilyInputFormat - Reads rows from a Cassandra ColumnFamily
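
To show where these formats plug in, here is a hypothetical driver for the page-view job sketched earlier, wired to TextInputFormat and TextOutputFormat; the paths and class names are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class PageViewsDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "page views");
            job.setJarByClass(PageViewsDriver.class);

            // Where data comes from, and how it is read: one log line per record
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path("/logs/access"));

            job.setMapperClass(PageViews.ViewMapper.class);
            job.setReducerClass(PageViews.ViewReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Where results go: plain "page <tab> views" text files on HDFS
            job.setOutputFormatClass(TextOutputFormat.class);
            FileOutputFormat.setOutputPath(job, new Path("/output/pageviews"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }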
MapReduce Tips & Tricks
You don’t have to do it in Java; current MapReduce abstractions are
awesome

Pig, Hive - performance is close enough to native MR, with big productivity
boost

Hadoop Streaming - passes data through stdin/stdout so you can use any
language. Ruby, Python popular choices

Amazon’s Elastic MapReduce - on-demand MR jobs on EC2 instances
Hadoop at Bronto
5 node cluster, adding 8 more; each node 4x 1TB drives, 16GB memory, 8
cores

Mostly Pig scripts, some Java utility MR jobs

Jobs process raw data/mail logs; store aggregate stats in Cassandra

Ad-hoc scripts analyze internal logs for app monitoring/debugging

Using Cassandra with Hadoop (we’re rolling our own InputFormat)
Summary
Hadoop excels at big data, analytics, batch processing

Not real-time, no random access; not a database

HDFS makes it all possible: massively scalable, fault tolerant file system

MapReduce provides framework for processing data on HDFS

Pig, Hive easy to use, big productivity gain, close enough performance in
most cases
Questions?
      email: jeff.turner@bronto.com
    twitter: twitter.com/jefft

We’re hiring: http://bronto.com/company/careers


Editor's Notes

  1. Yahoo: 38K nodes, 4K node cluster; Facebook: 2K node cluster, 21 PB
  2. >60% of jobs at Yahoo are Pig