The document provides an overview of Apache Hadoop and common use cases. It describes how Hadoop is well-suited for log processing due to its ability to handle large amounts of data in parallel across commodity hardware. Specifically, it allows processing of log files to be distributed per unit of data, avoiding bottlenecks that can occur when trying to process a single large file sequentially.
Real World Use Cases: Hadoop and NoSQL in Production - Codemotion
"Real World Use Cases: Hadoop and NoSQL in Production" by Tugdual Grall.
What’s important about a technology is what you can use it to do. I’ve looked at what a number of groups are doing with Apache Hadoop and NoSQL in production, and I will relay what worked well for them and what did not. Drawing from real-world use cases, I show how people who understand these new approaches can employ them well in conjunction with traditional approaches and existing applications. Threat detection, data warehouse optimization, marketing efficiency, and a biometric database are some of the examples presented.
Building a Big Data platform with the Hadoop ecosystem - Gregg Barrett
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
-views of the Big Data ecosystem and its components
-an example of a Hadoop cluster
-considerations when selecting a Hadoop distribution
-some of the Hadoop distributions available
-a recommended Hadoop distribution
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases - BigDataExpo
When evaluating Apache Hadoop, organizations often identify dozens of use cases for Hadoop but wonder where to start. With hundreds of customer implementations of the platform, we have seen that successful organizations start small in scale and small in scope. Join us in this session as we review common deployment patterns and successful implementations that will help guide you on your journey of cost optimization and new analytics with Hadoop.
How Big Data and Hadoop Integrated into BMC Control-M at CARFAX - BMC Software
Learn how CARFAX utilized the power of Control-M to help drive big data processing via Cloudera. See why it was a no-brainer to choose Control-M to help manage workflows through Hadoop, some of the challenges faced, and the benefits the business received by using an existing, enterprise-wide workload management system instead of choosing “yet another tool.”
Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop - Hortonworks
Real-time monitoring requires a highly scalable infrastructure: a message bus, a database, distributed event processing, and a scalable analytics engine. By bringing together the leading open source projects Apache Kafka, Apache HBase, Apache Storm, and Apache Hive, the Hortonworks Data Platform offers a comprehensive real-time analysis platform. In this session, we will provide an in-depth overview of all the key technology components and demonstrate a working solution for monitoring a fleet of trucks.
Audience: Developers, Architects and System Engineers from the Hortonworks Technology Partner community.
Recording: https://hortonworks.webex.com/hortonworks/lsr.php?RCID=0278dc8aa49a9991e1ce436c71f53d30
Hadoop Reporting and Analysis - Jaspersoft - Hortonworks
Hadoop is deployed for a variety of uses, including web analytics, fraud detection, security monitoring, healthcare, environmental analysis, social media monitoring, and other purposes.
Near real-time big data analytics is a reality via a new data pattern that avoids the latency and overhead of legacy ETL: the 3 T’s of Hadoop: Transfer, Transform, and Translate. Transfer: once a Hadoop infrastructure is in place, a mandate is needed to immediately and continuously transfer all enterprise data, from external and internal sources and through different existing systems, into Hadoop. Previously, enterprise data was isolated, disconnected, and monolithically segmented. Through this T, various source data are consolidated and centralized in Hadoop almost as they are generated, in near real time. Transform: most enterprise data, when flowing into Hadoop, is transactional in nature. Analytics requires that data be transformed from record-based OLTP form to column-based OLAP form. This T is not the same T as in ETL, because we need to retain the granularity of the data feeds. The key is to transform in place within Hadoop, without further data movement from Hadoop to other legacy systems. Translate: we pre-compute or provide on-the-fly views of analytical data, exposed for consumption. We facilitate analysis and reporting, for both scheduled and ad hoc needs, that is interactive with the data for analysts and end users, integrated in and on top of Hadoop.
Slides from the joint webinar. Learn how Pivotal HAWQ, one of the world’s most advanced enterprise SQL-on-Hadoop technologies, coupled with the Hortonworks Data Platform, the only 100% open source Apache Hadoop data platform, can turbocharge your data science efforts.
Together, Pivotal HAWQ and the Hortonworks Data Platform provide businesses with a Modern Data Architecture for IT transformation.
YARN Ready: Integrating to YARN with Tez - Hortonworks
The YARN Ready webinar series helps developers integrate their applications with YARN. Tez is one vehicle to do that. We take a deep dive, including a code review, to help you get started.
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar) - Hortonworks
Whether you are new to Hadoop or have a mature cluster in production, scale will be a critical factor in your success with Hadoop. Are you ready to take the next big step as you scale out your data architecture?
Talend and Hortonworks discuss how to implement an effective big data and Hadoop strategy across your IT infrastructure. You will learn:
How to grow a pilot into production
How to scale-out architecture & systems affordably
How to leverage the flexibility of Hadoop to optimize your data integration processes
Recording: http://www.talend.com/resources/webinars/starting-small-and-scaling-big-with-hadoop
YARN webinar series: Using Scalding to write applications to Hadoop and YARN - Hortonworks
This webinar introduces Scalding for developers and covers writing applications for Hadoop and YARN with it. Guest speaker Jonathan Coveney from Twitter provides an overview, use cases, limitations, and core concepts.
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc... - Agile Testing Alliance
"Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Processing" by Sampat Kumar from Harman. The presentation was given at the #doppa17 DevOps++ Global Summit 2017. All copyrights are reserved by the author.
Learn how organizations that combine the HP Vertica Analytics Platform with Hortonworks can quickly explore and analyze a broad variety of data types, transforming them into actionable information that allows them to better understand how their customers and site visitors interact with their business, offline and online.
Hortonworks YARN Code Walk Through, January 2014 - Hortonworks
This slide deck accompanies the webinar recording "YARN Code Walk Through" from Jan. 22, 2014, available on Hortonworks.com/webinars under Past Webinars, or at:
https://hortonworks.webex.com/hortonworks/lsr.php?AT=pb&SP=EC&rID=129468197&rKey=b645044305775657
Data science holds tremendous potential for organizations to uncover new insights and drivers of revenue and profitability. Big Data has brought the promise of doing data science at scale to enterprises; however, this promise also comes with challenges for data scientists to continuously learn and collaborate. Data scientists have many tools at their disposal: notebooks like Jupyter and Apache Zeppelin, IDEs such as RStudio, languages like R, Python, and Scala, and frameworks like Apache Spark. Given all the choices, how do you best collaborate to build your model and then work through the development lifecycle to deploy it from test into production?
In this session, learn the attributes of a modern data science platform that empowers data scientists to build models using all the data in their data lake and fosters continuous learning and collaboration. We will show a demo of DSX with HDP, focusing on integration, security, and model deployment and management.
Speakers:
Sriram Srinivasan, Senior Technical Staff Member, Analytics Platform Architect, IBM
Vikram Murali, Program Director, Data Science and Machine Learning, IBM
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data - Hortonworks
Hadoop is a great platform for storing and processing massive amounts of data. Elasticsearch is the ideal solution for Searching and Visualizing the same data. Join us to learn how you can leverage the full power of both platforms to maximize the value of your Big Data.
In this webinar we'll walk you through:
How Elasticsearch fits in the Modern Data Architecture.
A demo of Elasticsearch and Hortonworks Data Platform.
Best practices for combining Elasticsearch and Hortonworks Data Platform to extract maximum insights from your data.
This talk illustrates the problem of collecting data from potential Big Data sources efficiently and at scale, followed by a survey of some of the most popular software that can be used in a pipeline for real-time data streaming and/or batch analysis.
Video: http://www.youtube.com/watch?v=BT8WvQMMaV0
Hadoop is the technology of choice for processing large data sets. At salesforce.com, we service internal and product big data use cases using a combination of Hadoop, Java MapReduce, Pig, Force.com, and machine learning algorithms. In this webinar, we will discuss an internal use case and a product use case:
Product Metrics: Internally, we measure feature usage using a combination of Hadoop, Pig, and the Force.com platform (Custom Objects and Analytics).
Community-Based Recommendations: In Chatter, our most successful people and file recommendations are built on a collaborative filtering algorithm that is implemented on Hadoop using Java MapReduce.
Enrich a 360-degree Customer View with Splunk and Apache Hadoop - Hortonworks
What if your organization could obtain a 360-degree view of the customer across offline, online, social, and mobile channels? Attend this webinar with Splunk and Hortonworks to see examples of how marketing, business, and operations analysts can reach across disparate data sets in Hadoop to spot new opportunities for up-sell and cross-sell. We'll also cover examples of how to measure buyer sentiment and changes in buyer behavior, along with best practices for using data in Hadoop with Splunk to assign customer influence scores that online, call-center, and retail branches can use to customize more compelling products and promotions.
Fundamentals of Big Data, Hadoop project design, and a case study / use case.
General planning considerations and essentials for the Hadoop ecosystem and Hadoop projects.
This provides the basis for choosing the right Hadoop implementation, integrating Hadoop technologies, driving adoption, and creating an infrastructure.
Building applications using Apache Hadoop, with Wi-Fi log analysis as a real-life example.
Building a reliable pipeline of data ingress, batch computation, and data egress with Hadoop can be a major challenge. Most folks start out with cron to manage workflows, but soon discover that it doesn't scale past a handful of jobs. There are a number of open-source workflow engines with support for Hadoop, including Azkaban (from LinkedIn), Luigi (from Spotify), and Apache Oozie. Having deployed all three of these systems in production, Joe will talk about what features and qualities are important for a workflow system.
With the rise of Apache Hadoop, a next-generation enterprise data architecture is emerging that connects the systems powering business transactions and business intelligence. Hadoop is uniquely capable of storing, aggregating, and refining multi-structured data sources into formats that fuel new business insights. Apache Hadoop is fast becoming the de facto platform for processing Big Data. Hadoop started from a relatively humble beginning as a point solution for small search systems. Its growth into an important technology for the broader enterprise community dates back to Yahoo's 2006 decision to evolve Hadoop into a system for solving its internet-scale big data problems. Eric will discuss the current state of Hadoop and what is coming from a development standpoint as Hadoop evolves to meet more workloads.
Taming the Beasts: Tools for Managing and Monitoring Distributed Syst... - yaevents
Alexander Kozlov, Cloudera Inc.
Alexander Kozlov, a senior architect at Cloudera Inc., works with large companies, many of them in the Fortune 500, on projects building systems for analyzing large volumes of data. He completed graduate studies at the Physics Department of Moscow State University and then earned a Ph.D. at Stanford. After finishing his studies and before Cloudera, he worked on statistical data analysis and related computing technologies at SGI, Hewlett-Packard, and the startup Turn.
Talk topic
Taming the Beasts: Tools for Managing and Monitoring Distributed Systems from Cloudera.
Abstract
Maintaining distributed systems consisting of thousands of computers is a complex task. Cloudera, which specializes in building distributed technologies, has developed a set of tools for centralized management of distributed Hadoop/HBase clusters. Hadoop and HBase are Apache Software Foundation projects, and their adoption for analyzing semi-structured data is accelerating worldwide. This talk covers SCM, a system for configuring, tuning, and managing Hadoop/HBase, and Activity Monitor, a system for monitoring a range of OS and Hadoop/HBase metrics, as well as how Cloudera's approach differs from existing monitoring solutions (Tivoli, xCat, Ganglia, Nagios, etc.).
Hadoop makes data storage and processing at scale available as a lower-cost, open solution. If you ever wanted to get your feet wet but found the elephant intimidating, fear no more.
We will explore several integration considerations from a Windows application perspective, like accessing HDFS content, writing streaming jobs, and using the .NET SDK, as well as HDInsight on premises or on Azure.
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R... - Dataconomy Media
What is Big Data? What is Hadoop? What is MapReduce? How do other components such as Oozie, Hue, Hive, and Impala work? What are the main Hadoop distributions? What is Spark? What are the differences between batch and streaming processing? And what are some business intelligence solutions, with a focus on business cases?
Hadoop is emerging as the preferred solution for big data analytics across unstructured data. Using real-world examples, learn how to achieve a competitive advantage by finding effective ways of analyzing new sources of unstructured and machine-generated data.
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc... - Cloudera, Inc.
"Amr Awadallah served as the VP of Engineering of Yahoo's Product Intelligence Engineering (PIE) team for a number of years. The PIE team was responsible for business intelligence and advanced data analytics across a number of Yahoo's key consumer-facing properties (search, mail, news, finance, sports, etc.). Amr will share the data architecture that PIE had implemented before Hadoop was deployed and the headaches that architecture entailed. Amr will then show how most, if not all, of these headaches were eliminated once Hadoop was deployed. Amr will illustrate how Hadoop and relational databases complement each other within the traditional business intelligence data stack, and how that enables organizations to access all their data under different operational and economic constraints."
Similar to Common and unique use cases for Apache Hadoop:
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... - SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
PHP Frameworks: I want to break free (IPC Berlin 2024) - Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Securing your Kubernetes cluster: a step-by-step guide to success! - KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Essentials of Automations: The Art of Triggers and Actions in FME - Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Climate Impact of Software Testing at Nordic Testing Days - Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Pushing the limits of ePRTC: 100ns holdover for 100 days - Adtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Enhancing Performance with Globus and the Science DMZ - Globus
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
Welcome to the first live UiPath Community Day Dubai! Join us for this unique occasion to meet our local and global UiPath Community and leaders. You will get a full view of the MEA region's automation landscape and the AI-powered automation technology capabilities of UiPath. Hosted by our local partner Marc Ellis, you will also enjoy a half-day packed with industry insights and networking with automation peers.
📕 Curious about our agenda? Wait no more!
10:00 Welcome note - UiPath Community in Dubai
Lovely Sinha, UiPath Community Chapter Leader, UiPath MVPx3, Hyper-automation Consultant, First Abu Dhabi Bank
10:20 A UiPath cross-region MEA overview
Ashraf El Zarka, VP and Managing Director MEA, UiPath
10:35 Customer Success Journey
Deepthi Deepak, Head of Intelligent Automation CoE, First Abu Dhabi Bank
11:15 The UiPath approach to GenAI with our three principles: improve accuracy, supercharge productivity, and automate more
Boris Krumrey, Global VP, Automation Innovation, UiPath
12:15 Discover how Marc Ellis leverages tech-driven solutions in recruitment and managed services.
Brendan Lingam, Director of Sales and Business Development, Marc Ellis
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf - Paige Cruz
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to the purview of ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
2. Agenda
• What is Apache Hadoop?
• Log Processing
• Catching `Osama'
• Extract Transform Load (ETL)
• Analytics in HBase
• Machine Learning
• Final Thoughts
Copyright 2011 Cloudera Inc. All rights reserved
3. Exploding Data Volumes
• Online: web-ready devices, social media
• Complex, Unstructured: digital content, smart grids
• Enterprise Relational: transactions, R&D data, operational (control) data
• The digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 "zettabytes" this year; 2,500 exabytes of new information in 2012, with the Internet as the primary driver
Source: an IDC White Paper sponsored by EMC, "As the Economy Contracts, the Digital Universe Expands," May 2009
4. Origin of Hadoop
How does an elephant sneak up on you?
[Timeline, 2002-2010: Open Source Web Crawler project created by Doug Cutting; MapReduce & GFS papers published; Open Source MapReduce & HDFS project created by Doug Cutting; Hadoop wins the Terabyte sort benchmark; runs a 4,000 node Hadoop cluster; SQL support for Hadoop launches; Cloudera releases CDH3 and Cloudera Enterprise]
5. What is Apache Hadoop?
Open Source Storage and Processing Engine: MapReduce + the Hadoop Distributed File System (HDFS)
• Consolidates Everything: move complex and relational data into a single repository
• Stores Inexpensively: keep raw data always available; use commodity hardware
• Processes at the Source: eliminate ETL bottlenecks; mine data first, govern later
6. What is Apache Hadoop?
The Standard Way Big Data Gets Done
• Hadoop is Flexible:
  • Structured, unstructured
  • Schema, no schema
  • High volume, merely terabytes
  • All kinds of analytic applications
• Hadoop is Open: 100% Apache-licensed open source
• Hadoop is Scalable: proven at petabyte scale
• Benefits:
  • Controls costs by storing data more affordably per terabyte than any other platform
  • Drives revenue by extracting value from data that was previously out of reach
7. What is Apache Hadoop?
The Importance of Being Open
• No Lock-In: investments in skills, services & hardware are preserved regardless of vendor choice
• Community Development: Hadoop & related projects are expanding at a rapid pace
• Rich Ecosystem: dozens of complementary software, hardware and services firms
8. Agenda
• What is Apache Hadoop?
• Log Processing
• Catching `Osama'
• Extract Transform Load (ETL)
• Analytics in HBase
• Machine Learning
• Final Thoughts
9. Log Processing
A Perfect Fit
• Common uses of logs:
  • Find or count events (grep):
    grep "ERROR" file
    grep -c "ERROR" file
  • Calculate metrics (performance or user behavior analysis):
    awk '{sums[$1]+=$2; counts[$1]+=1} END {for (k in counts) {print sums[k]/counts[k]}}'
  • Investigate user sessions:
    grep "USER" files … | sort | less
10. Log Processing
A Perfect Fit
• Shoot… too much data
• Homegrown parallel processing, often done on a per-file basis, 'cause it's easy
• No parallelism on a single large file
[Diagram: Task 0, Task 1, and Task 2 each process a separate access_log file]
11. Log Processing
A Perfect Fit
• MapReduce to the rescue!
• Processing is done per unit of data (see the sketch below)
[Diagram: Tasks 0-3 each read one block (0-64MB, 64-128MB, 128-192MB, 192-256MB) of a single access_log; each task is responsible for a unit of data]
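To make the block-per-task idea concrete, here is a minimal sketch, not from the deck, of a distributed grep -c over access_log in the standard org.apache.hadoop.mapreduce Java API; the "ERROR" pattern and the paths are illustrative. Each map task receives one input split, so the four blocks in the diagram above are scanned by four mappers in parallel, and a combiner plus reducer sum the per-split counts:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogGrep {

  // Each map task sees one split (e.g. one 64MB block) of the log file.
  public static class GrepMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final Text ERROR = new Text("ERROR");
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      if (line.toString().contains("ERROR")) {
        ctx.write(ERROR, ONE); // the per-split equivalent of grep -c
      }
    }
  }

  // The reducer (also used as a combiner) sums per-split counts into a total.
  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable c : counts) {
        total += c.get();
      }
      ctx.write(key, new LongWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "log grep");
    job.setJarByClass(LogGrep.class);
    job.setMapperClass(GrepMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /logs/access_log
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /tmp/grep-out
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}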
12. Log Processing
A Perfect Fit
• Network or disk are bottlenecks
• Reading 100GB of data takes:
  • 14 minutes over a 1GbE network connection
  • 22 minutes from a standard disk drive
[Diagram: grep pulls access_log across a link where bandwidth is limited]
13. Log Processing
A Perfect Fit
• Hadoop to the rescue!
• Eliminates the network bottleneck: data is on local disk
• Data is read from many, many disks in parallel
[Diagram: physical machines NodeA, NodeX, NodeY, and NodeZ each run one of Tasks 0-3 against their local block (0-64MB, 64-128MB, 128-192MB, 192-256MB)]
14. Log Processing
A Perfect Fit
• Hadoop currently scales to 4,000 nodes
  • The goal for the next release is 10,000 nodes
• Nodes typically have 12 hard drives
• A single hard drive has a throughput of about 75MB/second
• 12 hard drives * 75 MB/second * 4,000 nodes = 3,600,000 MB/second, roughly 3.4 TB/second
  • That's bytes, not bits
• That's enough bandwidth to read 1PB (1,000 TB) in 5 minutes
15. Agenda
• What is Apache Hadoop?
• Log Processing
• Catching `Osama'
• Extract Transform Load (ETL)
• Analytics in HBase
• Machine Learning
• Final Thoughts
16. Catching `Osama'
Embarrassingly Parallel
• You have a few billion images of faces with geo-tags
• Tremendous storage problem
• Tremendous processing problem:
  • Bandwidth
  • Coordination
17. Catching `Osama'
Embarrassingly Parallel
• Store the images in Hadoop
• When processing, Hadoop will read the images from local disk: thousands of local disks spread throughout the cluster
• Use a Map-only job to compare input images against the `needle' image
18. Catching `Osama'
Embarrassingly Parallel
[Diagram: images are stored in SequenceFiles; Map Task 0 and Map Task 1 each have a copy of the `needle' image and output the faces `matching' the needle (see the sketch below)]
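A minimal sketch of what such a map-only job could look like, assuming the SequenceFiles hold (image id, image bytes) pairs and that a needle.path configuration property points at the needle image in HDFS; the byte-for-byte comparison is a deliberate stand-in for a real face-matching model:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NeedleMatch {

  public static class MatchMapper extends Mapper<Text, BytesWritable, Text, NullWritable> {
    private byte[] needle;

    @Override
    protected void setup(Context ctx) throws IOException {
      // Load the `needle' image once per task; every task gets its own copy.
      Path p = new Path(ctx.getConfiguration().get("needle.path"));
      FileSystem fs = p.getFileSystem(ctx.getConfiguration());
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      IOUtils.copyBytes(fs.open(p), buf, 4096, true);
      needle = buf.toByteArray();
    }

    @Override
    protected void map(Text imageId, BytesWritable image, Context ctx)
        throws IOException, InterruptedException {
      // Stand-in comparison; a real job would score the two faces with a model.
      byte[] bytes = Arrays.copyOf(image.getBytes(), image.getLength());
      if (Arrays.equals(bytes, needle)) {
        ctx.write(imageId, NullWritable.get()); // emit ids of `matching' faces
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("needle.path", args[0]);
    Job job = Job.getInstance(conf, "needle match");
    job.setJarByClass(NeedleMatch.class);
    job.setMapperClass(MatchMapper.class);
    job.setNumReduceTasks(0); // map-only: no shuffle, no reduce phase
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[1]));
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}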
19. Agenda
• What is Apache Hadoop?
• Log Processing
• Catching `Osama'
• Extract Transform Load (ETL)
• Analytics in HBase
• Machine Learning
• Final Thoughts
20. Extract Transform Load (ETL)
Everyone is doing it
• One of the most common use cases I see is replacing ETL processes
• Hadoop is a huge sink of cheap storage and processing
• Aggregates are built in Hadoop and exported
• Apache Hive provides SQL-like querying on raw data (example below)
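As a small illustration of the Hive point, a SQL-like aggregate over raw log data can be issued from Java over JDBC. This sketch assumes a modern HiveServer2 endpoint rather than the 2011-era Hive server, and the host, table, and column names are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveAggregate {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Hypothetical HiveServer2 endpoint and database.
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://hiveserver.example.com:10000/default", "", "");
    Statement stmt = conn.createStatement();
    // Hive compiles this into distributed jobs over the raw files.
    ResultSet rs = stmt.executeQuery(
        "SELECT status, COUNT(*) AS hits FROM access_log GROUP BY status");
    while (rs.next()) {
      System.out.println(rs.getString("status") + "\t" + rs.getLong("hits"));
    }
    rs.close();
    stmt.close();
    conn.close();
  }
}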
21. Extract Transform Load (ETL)
Everyone is doing it
[Diagram: a `Real' Time System (Website) with an Online DB feeds, via ETL, the Analytical DB of a Data Warehouse, which serves Business Intelligence Applications; the ETL arrow is labeled "Much blood shed here"]
22. Extract Transform Load (ETL)
Everyone is doing it
[Diagram: the same pipeline with Hadoop in the middle; the Online DB is imported into Hadoop, and Hadoop exports to the Analytical DB of the Data Warehouse]
23. Extract Transform Load (ETL)
Everyone is doing it
[Diagram: the same pipeline, with Apache Sqoop handling both the import from the Online DB into Hadoop and the export from Hadoop to the Analytical DB (sketch below)]
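Sqoop is normally driven from the command line; the same import/export pair can also be sketched in Java through Sqoop 1's Sqoop.runTool entry point. The connection strings, table names, and HDFS paths below are purely illustrative:

import org.apache.sqoop.Sqoop;

public class WarehouseSync {
  public static void main(String[] args) {
    // Import: copy the online DB's `orders' table into HDFS with parallel mappers.
    int rc = Sqoop.runTool(new String[] {
        "import",
        "--connect", "jdbc:mysql://onlinedb.example.com/shop",
        "--table", "orders",
        "--target-dir", "/warehouse/raw/orders",
        "--num-mappers", "4"
    });
    if (rc != 0) {
      System.exit(rc);
    }

    // Export: push aggregates built in Hadoop out to the analytical DB.
    rc = Sqoop.runTool(new String[] {
        "export",
        "--connect", "jdbc:mysql://dwh.example.com/analytics",
        "--table", "daily_order_totals",
        "--export-dir", "/warehouse/aggregates/daily_order_totals"
    });
    System.exit(rc);
  }
}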
24. Agenda
• What is Apache Hadoop?
• Log Processing
• Catching `Osama'
• Extract Transform Load (ETL)
• Analytics in HBase
• Machine Learning
• Final Thoughts
25. Analytics in HBase
Scaling writes
• Analytics is often simply counting things
• Facebook chose HBase to store its massive counter infrastructure (more later)
• How might one implement a counter infrastructure in HBase?
26. Analytics in HBase
Scaling writes
• A `Like' button IMG request sends an HTTP request to Facebook servers, which increment several counters
User & Content Type Counters:
  User         | Content  | Counter
  brock@me.com | NEWS     | 5431
  brock@me.com | TECH     | 79310
  brock@me.com | SHOPPING | 59
  tom@him.com  | SPORTS   | 94214
Individual Page Counters:
  URL                      | Counter
  com.cloudera/blog/…      | 154
  com.cloudera/downloads/… | 923621
  com.cloudera/resources/… | 2138
27. Analytics in HBase
Scaling writes
• The host is reversed in the URL as part of the key:
Individual Page Counters:
  URL                      | Counter
  com.cloudera/blog/…      | 154
  com.cloudera/downloads/… | 923621
  com.cloudera/resources/… | 2138
• Data is physically stored in sorted order
• Scanning all `com.cloudera' counters results in sequential I/O (sketch below)
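A minimal sketch of one such counter write, using the HBase client API of that era (HTable and the incrementColumnValue call that slide 28 mentions); the table name, column family, and reversed-host row key are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class PageCounters {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable counters = new HTable(conf, "page_counters");

    // Reversing the host (blog.cloudera.com -> com.cloudera/blog/...) makes all
    // of a site's rows adjacent in HBase's sorted key space.
    byte[] rowKey = Bytes.toBytes("com.cloudera/blog/some-post");

    // Atomic server-side increment; returns the new counter value.
    long hits = counters.incrementColumnValue(
        rowKey,
        Bytes.toBytes("c"),    // column family holding the counters
        Bytes.toBytes("hits"), // one qualifier per counter
        1L);
    System.out.println("com.cloudera/blog/some-post = " + hits);
    counters.close();
  }
}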
28. Facebook Analytics
Scaling writes
• Real-time counters of URLs shared, links "liked", impressions generated
• 20 billion events/day (200K events/sec)
• ~30 second latency from click to count
• Heavy use of the incrementColumnValue API for consistent counters
• Tried MySQL and Cassandra, settled on HBase
http://tiny.cloudera.com/hbase-fb-analytics
29. Agenda
• What is Apache Hadoop?
• Log Processing
• Catching `Osama'
• Extract Transform Load (ETL)
• Analytics in HBase
• Machine Learning
• Final Thoughts
33. Machine Learning
Apache Mahout
• Apache Mahout implements:
  • Collaborative Filtering (sketch below)
  • Classification
  • Clustering
  • Frequent itemset
• More coming with the integration of MapReduce.Next
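Mahout's collaborative filtering runs both as Hadoop jobs and through its in-memory Taste API; below is a minimal single-machine sketch of the latter, assuming a hypothetical ratings.csv of userID,itemID,preference rows:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
  public static void main(String[] args) throws Exception {
    // ratings.csv holds userID,itemID,preference rows (the file name is illustrative).
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 recommendations for user 1.
    List<RecommendedItem> recs = recommender.recommend(1, 3);
    for (RecommendedItem item : recs) {
      System.out.println(item.getItemID() + " @ " + item.getValue());
    }
  }
}

Swapping in a different neighborhood or similarity implementation changes the recommender's behavior without touching the rest of the pipeline.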
34. Agenda
• What is Apache Hadoop?
• Log Processing
• Catching `Osama'
• Extract Transform Load (ETL)
• Analytics in HBase
• Machine Learning
• Final Thoughts
35. Final Thoughts
Use the right tool
• Other use cases:
  • OpenTSDB, an open, distributed, scalable Time Series Database (TSDB)
  • Building search indexes (the canonical use case)
  • Facebook Messaging
  • Cheap and deep storage, e.g. archiving emails for SOX compliance
  • Audit logging
• Non-use cases:
  • Data processing is handled by one beefy server
  • Data requires transactions
36. About the Presenter
• Brock Noland
• brock@cloudera.com
• http://twitter.com/brocknoland
• TC-HUG: http://tch.ug