What is Hadoop? Nov 20 2013 - IRMAC

Adam Muise – Hortonworks

WELCOME TO HADOOP

“Big Data” is the marketing term of the decade

What lurks behind the hype is the democratization of Data.

You need to deal with Data.

You’re probably not as good at that as you think.

Put it away, delete it, tweet it, compress it, shred it, wikileak-it, put it in a database, put it in SAN/NAS, put it in the cloud, hide it in tape…

You are obsessive compulsive about collecting and structuring your data.

Let’s talk challenges…

[Slide graphic: the word “Volume” repeated many times across the slide]

Storage, Management, Processing all become challenges with Data at Volume

Traditional technologies adopt a divide, drop, and conquer approach

The solution?

[Diagram: Data divided among EDW, OLTP, Analytical DB, Another EDW, and Yet Another EDW]

Ummm…you dropped something

[Diagram: EDW and Yet Another EDW, surrounded by many scattered “Data” items]

Analyzing the data usually raises more interesting questions…

…which leads to more data

Wait, you’ve seen this before.

[Diagram: Data … Sausage Factory … Data, ending in an ever larger pile of Data]

What keeps us from Data?

“Prices, Stupid passwords, and Boring Statistics.”
- Hans Rosling

http://www.youtube.com/watch?v=hVimVzgtD6w

Your data silos are lonely places.

[Diagram: separate silos of Data labeled EDW, Accounts, Customers, and Web Properties]

… Data likes to be together.

[Diagram: Data from EDW, Accounts, Customers, and Web Properties shown together]

Data likes to socialize too.

[Diagram: EDW, Accounts, Customers, and Web Properties alongside CDR, Machine Data, Facebook, Weather Data, and Twitter]

New types of data don’t quite fit into your pristine view of the world.

[Diagram: Logs and Machine Data next to “My Little Data Empire”, with question marks between them]

To resolve this, some people take hints from Lord Of The Rings...

…and create One-Schema-To-Rule-Them-All…

[Diagram: Data passing through ETL steps into an EDW with a single Schema]

…but that has its problems too.

[Diagram: Data passing through ETL steps into an EDW with a single Schema]

So what is the answer?

Enter the Hadoop.

http://www.fabulouslybroke.com/2011/05/ninja-elephants-and-other-awesome-stories/

Hadoop was created because Big IT never cut it for the Internet Properties like Google, Yahoo, Facebook, Twitter, and LinkedIn

Traditional architecture didn’t scale enough…

[Diagram: Apps, DBs, and SANs, repeated several times]

Databases become bloated and useless

Traditional architectures cost too much at that volume…

[Slide labels: $upercomputing, $/TB, $pecial Hardware]

If you could design a system that would handle this, what would it look like?

It would probably need a highly resilient, self-healing, cost-efficient, distributed file system…

[Diagram: a grid of Storage nodes]

It would probably need a completely parallel processing framework that took tasks to the data…

[Diagram: a grid of nodes, each pairing Processing with Storage]

It would probably run on commodity hardware, virtualized machines, and common OS platforms

[Diagram: a grid of nodes, each pairing Processing with Storage]

It would probably be open source so innovation could happen as quickly as possible

It would need a critical mass of users

{Processing + Storage} = {MapReduce/YARN + HDFS}

HDFS stores data in blocks and replicates those blocks

[Diagram: block1, block2, and block3 each stored three times across the Processing/Storage nodes]
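As a rough illustration of the idea on this slide, the sketch below cuts a payload into fixed-size blocks and assigns each block to several nodes. The block size, the replication factor of 3, the node names, and the round-robin placement are all illustrative assumptions; this is not the HDFS API, and real HDFS placement also takes racks into account.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the payload into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement policy, for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        # Start one node further along for each block so replicas spread out.
        placement[block_id] = [
            nodes[(block_id + k) % len(nodes)] for k in range(replication)
        ]
    return placement


blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block it held.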

If a block fails then HDFS always has the other copies and heals itself

[Diagram: the same cluster with one copy of a block marked X, and a new copy of that block created on another node]
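The self-healing behaviour can be sketched as a re-replication pass: drop the replicas that sat on a failed node, then copy each affected block to another live node until the target count is restored. The function name `heal`, the node names, and the "pick the first candidate alphabetically" rule are assumptions made for this sketch, not how HDFS is implemented. The sketch also makes the precondition explicit: healing only works while at least one replica survives.

```python
def heal(placement: dict[str, set[str]], live_nodes: set[str], replication: int = 3) -> dict[str, set[str]]:
    """Return a new placement with lost replicas re-created on live nodes."""
    healed = {}
    for block, holders in placement.items():
        survivors = holders & live_nodes
        if not survivors:
            # Nothing left to copy from: this block cannot be healed.
            raise RuntimeError(f"{block} lost: no surviving replica to copy from")
        candidates = sorted(live_nodes - survivors)
        needed = max(0, replication - len(survivors))
        healed[block] = survivors | set(candidates[:needed])
    return healed


placement = {
    "block1": {"node1", "node2", "node4"},
    "block2": {"node1", "node2", "node3"},
    "block3": {"node2", "node3", "node4"},
}
# node3 has failed; node5 is a spare that is still alive.
healed = heal(placement, live_nodes={"node1", "node2", "node4", "node5"})
```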

MapReduce is a programming paradigm that is completely parallel

[Diagram: Data flows into Mappers, then into Reducers, which output Data]

MapReduce has three phases: Map, Sort/Shuffle, Reduce

[Diagram: Key, Value pairs flow into Mappers, are sorted and shuffled by key, and flow into Reducers, which emit Key, Value pairs]
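The three phases can be sketched in plain Python with a word count, the usual teaching example. This is a single-process illustration of the paradigm only; the function names are made up for this sketch, and it does not use the Hadoop MapReduce API or distribute any work.

```python
from collections import defaultdict


def map_phase(line: str):
    """Map: emit a (key, value) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Sort/Shuffle: group all values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


lines = ["big data", "data likes data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle_phase(mapped))
```

Each call to `map_phase` looks at one record and each call to `reduce_phase` looks at one key, which is what lets a real framework run many of them at the same time on different nodes.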

MapReduce applies to a lot of data processing problems

[Diagram: Data flows into Mappers, then into Reducers, which output Data]

YARN = Yet Another Resource Negotiator

YARN abstracts resource management so you can run more than just MapReduce

[Diagram: MapReduce V2, MapReduce V?, STORM, Giraph, Tez, MPI, HBase, … and more, all running on YARN, which runs on HDFS2]

Resource Manager + Node Managers = YARN

[Diagram: a Resource Manager with its Scheduler, and several Node Managers hosting Containers and the AppMasters for Pig, Storm, and MapReduce]
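A toy model may help make the division of labour concrete: one Resource Manager decides where containers go, and each Node Manager only tracks what runs on its own machine. Everything here is an assumption for illustration, including the class shapes, the "slots" unit of capacity, and the "most free slots first" rule; it is not the YARN API or its scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    name: str
    free_slots: int  # hypothetical unit of capacity
    containers: list = field(default_factory=list)


class ResourceManager:
    """Toy scheduler: grants each container on the node with the most free slots."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, app: str, count: int) -> list[str]:
        granted = []
        for _ in range(count):
            node = max(self.nodes, key=lambda n: n.free_slots)
            if node.free_slots == 0:
                break  # cluster is full; return what could be granted
            node.free_slots -= 1
            node.containers.append(app)
            granted.append(node.name)
        return granted


rm = ResourceManager([NodeManager("nm1", 2), NodeManager("nm2", 2)])
mapreduce_nodes = rm.allocate("mapreduce", 3)
storm_nodes = rm.allocate("storm", 2)
```

The scheduler knows nothing about what "mapreduce" or "storm" do inside their containers, which is the point of the slide: resource management is separated from the processing model.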

YARN turns Hadoop into a smart phone: An App Ecosystem

hortonworks.com/yarn/

Check out the book too…
Preview at: hortonworks.com/yarn/

YARN is an essential part of a balanced breakfast in Hadoop 2.2.0

Hadoop has other open source projects…

Tez = { Generic Tasks + Pipelining }
Super Fast MapReduce

Hive = {SQL -> Tez || MapReduce}
SQL-IN-HADOOP

Pig = {PigLatin -> Tez || MapReduce}

HCatalog = {metadata* for MapReduce, Hive, Pig, HBase}
*metadata = tables, columns, partitions, types

Oozie = Job::{Task, Task, if Task, then Task, final Task}
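The shape of that job description, a sequence of tasks with one conditional step, can be mirrored in a few lines of Python. Oozie itself defines workflows in XML, so this is only a sketch of the control flow; `run_job`, the step names, and the row counts are hypothetical.

```python
def run_job(steps, context):
    """Run steps in order.

    A step is (name, action) or (name, condition, action); a conditional
    step is skipped when its condition is false.
    """
    log = []
    for step in steps:
        if len(step) == 3:
            name, condition, action = step
            if not condition(context):
                log.append(f"skipped {name}")
                continue
        else:
            name, action = step
        action(context)
        log.append(f"ran {name}")
    return log


steps = [
    ("ingest", lambda ctx: ctx.update(rows=120)),
    ("clean", lambda ctx: ctx.update(rows=ctx["rows"] - 20)),
    ("archive", lambda ctx: ctx["rows"] > 1000, lambda ctx: ctx.update(archived=True)),
    ("report", lambda ctx: ctx.update(done=True)),
]
context = {}
log = run_job(steps, context)
```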

Falcon

[Diagram: Feeds in one Hadoop cluster, with Feed Replication to a second Hadoop cluster for DR]

Flume

[Diagram: Files, JMS, Weblogs, and Events collected by Flume agents and delivered into Hadoop]

Sqoop

[Diagram: Sqoop moving data between DBs and Hadoop]

Ambari = {install, manage, monitor}

HBase = {real-time, distributed-map, big-tables}

Storm = {Complex Event Processing, Near-Real-Time, Provisioned by YARN}

Apache Hadoop

[Diagram: Storm, HDFS, YARN, Pig, MapReduce, HCatalog, Hive, HBase, Ambari, Sqoop, Falcon, Flume]

Hortonworks Data Platform

[Diagram: Storm, Pig, HDFS, YARN, MapReduce, HCatalog, Hive, HBase, Ambari, Sqoop, Falcon, Flume]

What else are we working on?

hortonworks.com/labs/

Hadoop is the new Data Operating System for the Enterprise

There is NO second place

Hortonworks
…the Bull Elephant of Hadoop Innovation

© Hortonworks Inc. 2012: DO NOT SHARE. CONTAINS HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION

