Tackling Big Data with the Elephant in the Room

WHAT’S THE PROBLEM WITH BIG DATA?
Volume, Variety, Velocity

WHAT’S THE SOLUTION TO BIG DATA?
“In pioneer days they used oxen for heavy
pulling, and when one ox couldn’t budge
a log, they didn’t try to grow a larger ox.
We shouldn’t be trying for bigger
computers, but for more systems of
computers.” – Grace Hopper

HADOOP’S SOLUTION
Hadoop Core Components
–  Hadoop Distributed File System (HDFS)
–  MapReduce
Hadoop Ecosystem
–  Sqoop, Pig, Hive, HBase, Mahout, Flume, Oozie, …

HOW DOES HDFS WORK?
1.  A large data file is split into blocks (Block #1, Block #2).
2.  Each block is stored as three copies (Block #1 ×3, Block #2 ×3).
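
The split-and-copy idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `split_into_blocks` and `place_replicas` are hypothetical helper names, the 8-byte block size is chosen to keep the output readable, and the round-robin placement is a simplification, not HDFS's actual placement policy. The replication factor of 3 matches the three copies per block shown on the slide.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the file contents into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is an assumption made for this sketch.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(b"large data file", block_size=8)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
print(blocks)     # two blocks: 8 bytes, then the remaining 7
print(placement)  # each block index maps to three different nodes
```

Because every block lives on three nodes, losing any single node still leaves two copies of each block it held.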

WHAT IS MAP-REDUCE?
Core Ideas
–  Data Locality
–  Parallelism
–  Block Independence
Three Stages
1.  Map
2.  Swap & Sort
3.  Reduce
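
The three stages can be sketched as a single-process Python function. This is a teaching sketch, not Hadoop's API: `run_map_reduce`, `map_fn`, and `reduce_fn` are hypothetical names, and everything runs on one machine, so it shows the data flow but not the data locality or parallelism.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn):
    # Stage 1: Map. Each record is processed independently of the others.
    pairs = [pair for record in records for pair in map_fn(record)]

    # Stage 2: Swap & Sort. Bring all values for the same key together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Stage 3: Reduce. Collapse each key's list of values into one result.
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}


result = run_map_reduce(
    ["the cat sat on the mat"],
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda key, values: sum(values),
)
print(result)
```

Since `map_fn` only ever sees one record and `reduce_fn` only ever sees one key, both stages could be spread across many nodes without changing the result.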

WORD COUNT MAP
Node 1
–  the cat sat on the mat
–  the aardvark sat on the
–  …
Node 2
–  the mahout drove the
–  …

WORD COUNT MAP
Node 1 (Mapper)
–  map() → the 1, cat 1, sat 1, on 1, the 1, mat 1
–  map() → the 1, aardvark 1, sat 1, on 1, the 1
–  …
Node 2 (Mapper)
–  map() → the 1, mahout 1, drove 1, the 1
–  …
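
A minimal Python sketch of the map step, assuming each map() call receives one line of text and splits it on whitespace. `map_line` and `node_inputs` are hypothetical names, and only the lines visible on the slide are used (the elided "…" lines are left out).

```python
def map_line(line):
    """Emit a (word, 1) pair for every word on one input line."""
    for word in line.split():
        yield (word, 1)


# Each node maps only the lines stored locally on that node.
node_inputs = {
    "Node 1": ["the cat sat on the mat", "the aardvark sat on the"],
    "Node 2": ["the mahout drove the"],
}

mapper_output = {
    node: [pair for line in lines for pair in map_line(line)]
    for node, lines in node_inputs.items()
}

for node, pairs in mapper_output.items():
    print(node, pairs)
```

The mapper does no counting at all: it emits a 1 for every occurrence and leaves the totals to the later stages.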

WORD COUNT SWAP & SORT
Mapper output, sorted and grouped by key on each node:
Node 1
–  aardvark 1
–  cat 1
–  mat 1
–  on 1,1
–  sat 1,1
–  the 1,1,1,1
Node 2
–  drove 1
–  mahout 1
–  the 1,1
After the swap, all values for a key sit together on one of Nodes 3, 4, and 5:
–  aardvark 1
–  cat 1
–  mat 1
–  mahout 1
–  sat 1,1
–  drove 1
–  on 1,1
–  the 1,1,1,1,1,1
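
One way to sketch the swap in Python is to hash each key to a reducer index, so that every occurrence of a word lands on the same reducer. `partition` and `swap_and_sort` are hypothetical names, and CRC32 is used here only as a stable stand-in hash; the key-to-node assignment it produces is not claimed to match the one drawn on the slide.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key; the same key always gets the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def swap_and_sort(mapper_outputs, num_reducers):
    """Route every (key, value) pair to its reducer, then sort each reducer's keys."""
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in pairs:
            reducer_inputs[partition(key, num_reducers)][key].append(value)
    return [dict(sorted(groups.items())) for groups in reducer_inputs]


node1 = [(w, 1) for w in "the cat sat on the mat the aardvark sat on the".split()]
node2 = [(w, 1) for w in "the mahout drove the".split()]

reducer_inputs = swap_and_sort([node1, node2], num_reducers=3)
for index, groups in enumerate(reducer_inputs):
    print("Reducer", index, groups)
```

Whatever the assignment turns out to be, no key is split across reducers, which is what lets each reducer finish its keys without talking to the others.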

WORD COUNT REDUCER
Reducers 0, 1, and 2 run on Nodes 3, 4, and 5. Each one sums the grouped counts for its keys:
–  aardvark 1 → aardvark 1
–  cat 1 → cat 1
–  mat 1 → mat 1
–  mahout 1 → mahout 1
–  sat 1,1 → sat 2
–  drove 1 → drove 1
–  on 1,1 → on 2
–  the 1,1,1,1,1,1 → the 6
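
The reduce step for word count is just a sum per key. A minimal Python sketch, with `reduce_counts` as a hypothetical name and three of the slide's keys as sample input:

```python
def reduce_counts(word, counts):
    """Collapse the list of 1s collected for a word into a single total."""
    return word, sum(counts)


grouped = {
    "aardvark": [1],
    "on": [1, 1],
    "the": [1, 1, 1, 1, 1, 1],
}

totals = dict(reduce_counts(word, counts) for word, counts in grouped.items())
print(totals)  # {'aardvark': 1, 'on': 2, 'the': 6}
```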

TAKE-AWAYS
Hadoop Core Components
–  Hadoop Distributed File System (HDFS)
–  MapReduce
Hadoop Ecosystem
–  Sqoop, Pig, Hive, HBase, Mahout, Flume, Oozie, …
