10 concepts the enterprise decision maker needs to understand about Hadoop

Donald Miner
Strata + Hadoop World 2016 – San Jose
March 31st, 2016

dminer@minerkasch.com
@donaldpminer

Purpose of this talk
An honest and minimal introduction to Hadoop
Why is Hadoop popular?
What does Hadoop do well and why?
What is bad about Hadoop?

#1 - Hadoop masks being a distributed system

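import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Class name and fields are assumed, following the stock WordCount example
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    // This block of code defines the behavior of the map phase
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line of text into words
        StringTokenizer itr = new StringTokenizer(value.toString());
        // Go through each word and send it
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            // "I've seen this word once!"
            context.write(word, one);
        }
    }
}
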
[1]$ hadoop fs -put hamlet.txt datz/hamlet.txt
[2]$ hadoop fs -put macbeth.txt data/macbeth.txt
[3]$ hadoop fs -mv datz/hamlet.txt data/hamlet.txt
[4]$ hadoop fs -ls data/
-rw-r--r-- 1 don don 139k 2012-01-31 23:49 /user/don/data/caesar.txt
-rw-r--r-- 1 don don 180k 2013-09-25 20:45 /user/don/data/hamlet.txt
-rw-r--r-- 1 don don 117k 2013-09-25 20:46 /user/don/data/macbeth.txt

Why is this so important?
What does it not do for me?

#2 - Hadoop scales out linearly
The amount of data, the amount of time something takes,
and the amount of hardware you have are linearly linked (usually)

Double the compute,
half the time!

Double the data,
twice the time!

Double the data,
double the compute,
the same time!

Data locality!
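
A back-of-the-envelope sketch of the rule above, assuming perfectly linear scaling (the slide itself says "usually"). The class name, method name, and the rate of 0.5 TB per node per hour are made up for illustration and are not measurements.

```java
// Toy model of linear scaling. All names and numbers are hypothetical;
// real clusters only approximate this behavior.
final class LinearScaling {
    // Estimated job time in hours: grows with data, shrinks with node count.
    static double estimateHours(double dataTb, int nodes, double tbPerNodePerHour) {
        if (nodes <= 0 || tbPerNodePerHour <= 0) {
            throw new IllegalArgumentException("nodes and throughput must be positive");
        }
        return dataTb / (nodes * tbPerNodePerHour);
    }

    public static void main(String[] args) {
        double rate = 0.5; // assumed: each node processes 0.5 TB per hour
        System.out.println(estimateHours(100, 50, rate));  // baseline: 4.0
        System.out.println(estimateHours(100, 100, rate)); // double the compute: 2.0
        System.out.println(estimateHours(200, 50, rate));  // double the data: 8.0
        System.out.println(estimateHours(200, 100, rate)); // double both: 4.0
    }
}
```

The model only holds while the work splits evenly across nodes and the data sits next to the compute, which is the point of data locality.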

#3 - Hadoop runs on commodity hardware

• Non-proprietary
• Easy to acquire (all it takes is $)
• Value (not necessarily cheap)
• Let software handle the hard problems

#4 - Hadoop handles unstructured data
Query languages like SQL assume some sort of structure
Relational databases and other databases require structure
MapReduce/Spark is just Java/Scala/Python/etc
You can do anything Java can do
HDFS just stores files
You can store anything in a file

#5 - In Hadoop, you load data first and ask questions later
BEFORE:
ETL
Years of planning
Schemas & ER Diagrams

WITH HADOOP:
LOAD DATA FIRST, ASK QUESTIONS LATER
Data is parsed/interpreted as it is loaded out of HDFS
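
A minimal sketch of "parse as you read", in plain Java with no Hadoop dependency. The log lines, the four-field layout, and every name below are hypothetical; the point is only that the raw text is stored untouched and structure is imposed when a question is asked.

```java
import java.util.List;
import java.util.Optional;

final class SchemaOnRead {
    // Hypothetical raw lines, stored exactly as they arrived (no schema at load time).
    static final List<String> RAW = List.of(
            "2016-03-31 GET /index.html 200",
            "2016-03-31 GET /missing 404",
            "not a log line at all");

    // Interpretation happens at read time; lines that do not fit are skipped.
    static Optional<Integer> statusCode(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 4) {
            return Optional.empty();
        }
        try {
            return Optional.of(Integer.parseInt(parts[3]));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    // One question asked later: how many requests ended in a 4xx or 5xx status?
    static long countErrors(List<String> lines) {
        return lines.stream()
                .map(SchemaOnRead::statusCode)
                .flatMap(Optional::stream)
                .filter(code -> code >= 400)
                .count();
    }

    public static void main(String[] args) {
        System.out.println(countErrors(RAW)); // 1
    }
}
```

A different question next month can reuse the same raw lines with a different parser, with no reload required.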

#6 - HDFS stores the data but has some major limitations
• Stores files in folders
• Nobody cares what’s in your files
• Chunks large files into blocks (~64MB-2GB)
• 3 replicas of each block
• Blocks are scattered all over the place
• Can scale to thousands of nodes and hundreds of petabytes

Limitations:
• Low IOPS
• Higher latency
• Can’t edit files
• Can’t handle small files
• Low storage efficiency (33% with 3 replicas)
• Low throughput on single files
But…
• High aggregate throughput
• Massive scale
• Software only
• Few bottlenecks
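
The block and replica arithmetic behind these bullets, as a small sketch. The 128 MB block size and the 1000 MB file are assumptions chosen for the example (the slide gives a range of roughly 64 MB to 2 GB); the class and method names are hypothetical.

```java
// Toy arithmetic for HDFS blocks and replicas; not Hadoop API code.
final class HdfsMath {
    // Number of blocks a file is chunked into (the last block may be partial).
    static long blockCount(long fileBytes, long blockBytes) {
        if (fileBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and block size positive");
        }
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    // Raw disk consumed when every block is stored `replicas` times.
    static long rawBytes(long fileBytes, int replicas) {
        return fileBytes * replicas;
    }

    // Fraction of raw disk that holds unique data.
    static double efficiency(int replicas) {
        return 1.0 / replicas;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long file = 1000 * mb;  // assumed file size
        long block = 128 * mb;  // assumed block size
        System.out.println(blockCount(file, block));                        // 8
        System.out.println(rawBytes(file, 3) / mb);                         // 3000
        System.out.println(String.format("%.0f%%", 100 * efficiency(3)));   // 33%
    }
}
```

Eight blocks times three replicas means 24 block copies spread across the cluster, which is where both the low storage efficiency and the high aggregate throughput come from.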

#7 - YARN controls everything going on and is mostly behind the scenes
• Controls the compute resources on the cluster
• Was the key new feature in Hadoop 2.0
• Abstracted resource management from MapReduce to be more general
• MapReduce became just another application
• YARN is key in enabling multiple compute engines at once

#8 - MapReduce may be getting a bad rap, but it’s still really important (but other engines are important, too)
• Analyzes raw data in HDFS where the data is
• Jobs are split into Mappers and Reducers
Mappers (you code this)
• Loads data from HDFS
• Filter, transform, parse
• Outputs (key, value) pairs
Reducers (you code this, too)
• Automatically groups by the mapper’s output key
• Aggregate, count, statistics
• Outputs to HDFS
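
The mapper and reducer split can be sketched without a cluster. This is an in-memory stand-in written for illustration, not Hadoop's API: the `run` method plays the role of the framework (group by key between the two phases), and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

final class MiniMapReduce {
    // Map phase (you code this): one line in, a list of (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(Map.entry(itr.nextToken(), 1));
        }
        return out;
    }

    // Reduce phase (you code this, too): all values for one key in, one total out.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Stand-in for the framework: run mappers, group by key, run reducers.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                        .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("to be or not to be", "to thine own self be true")));
    }
}
```

On a real cluster the grouping step is a distributed shuffle and sort across machines, but the code you write keeps this same two-function shape.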

“MapReduce is slow”
“MapReduce is hard to use”

[Figure: compute engines placed along a spectrum of real-time, large-scale analytics, and ad-hoc workloads. Successive slides show MapReduce alone, then joined by Storm/streaming, Impala/HAWQ/Stinger, and Spark, and finally only MapReduce and Spark.]
Not everyone has this problem, but it’s a really interesting problem!

#9 - Hadoop is open source
Free – money isn’t just a financial barrier, but a bureaucratic one, too
Help yourself – Hadoop is a complex system underneath and sometimes you need to figure something out for yourself
Adoption – it’s easier to adopt, so adoption is more widespread
Expansion – can be extended by anyone

#10 - The Hadoop ecosystem is constantly growing and evolving
Not only do individual Hadoop components improve…
But Hadoop overall improves with new components that do new things differently.
And they piece together into something that gets a lot of work done.

Play by Hadoop’s rules and it’ll give you what you want
