10 concepts the enterprise
decision maker needs to
understand about Hadoop
Donald Miner
Strata + Hadoop World 2016 – San Jose
March 31st, 2016
dminer@minerkasch.com
@donaldpminer
Purpose of this talk
An honest and minimal introduction to Hadoop
Why is Hadoop popular?
What does Hadoop do well and why?
What is bad about Hadoop?
#1 - Hadoop masks being a distributed system
// This block of code defines the behavior of the map phase.
// "word" (a Text) and "one" (an IntWritable holding 1) are fields of the enclosing Mapper class.
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // Split the line of text into words
    StringTokenizer itr = new StringTokenizer(value.toString());
    // Go through each word and send it
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        // "I've seen this word once!"
        context.write(word, one);
    }
}
[1]$ hadoop fs -put hamlet.txt datz/hamlet.txt
[2]$ hadoop fs -put macbeth.txt data/macbeth.txt
[3]$ hadoop fs -mv datz/hamlet.txt data/hamlet.txt
[4]$ hadoop fs -ls data/
-rw-r--r-- 1 don don 139k 2012-01-31 23:49 /user/don/data/caesar.txt
-rw-r--r-- 1 don don 180k 2013-09-25 20:45 /user/don/data/hamlet.txt
-rw-r--r-- 1 don don 117k 2013-09-25 20:46 /user/don/data/macbeth.txt
#1 - Hadoop masks being a distributed system
Why is this so important?
What does it not do for me?
#2 - Hadoop scales out linearly
The amount of data, the amount of time something takes,
and the amount of hardware you have are linearly linked (usually)
#2 - Hadoop scales out linearly
Double the compute,
Half the time!
#2 - Hadoop scales out linearly
Double the data,
twice the time!
#2 - Hadoop scales out linearly
Double the data,
Double the compute,
The same time!
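The three rules of thumb above can be checked with a back-of-the-envelope calculation. The sketch below is only an illustration: it assumes runtime is exactly proportional to data volume and inversely proportional to compute, which real clusters only approximate. The class name, method name, and the 60-minute baseline are hypothetical.

```java
// Idealized model of the linear-scaling rule of thumb.
// Assumption: runtime grows with data and shrinks with compute in exact
// proportion; real clusters only approximate this.
final class LinearScaling {
    private LinearScaling() {}

    /**
     * @param baselineMinutes runtime measured on the current data set and cluster
     * @param dataFactor      how much the data grows (2.0 = twice the data)
     * @param computeFactor   how much the cluster grows (2.0 = twice the compute)
     * @return estimated runtime in minutes under the idealized linear model
     */
    static double estimateMinutes(double baselineMinutes, double dataFactor, double computeFactor) {
        if (dataFactor <= 0 || computeFactor <= 0) {
            throw new IllegalArgumentException("factors must be positive");
        }
        return baselineMinutes * dataFactor / computeFactor;
    }

    public static void main(String[] args) {
        double baseline = 60.0; // hypothetical one-hour job
        System.out.println(estimateMinutes(baseline, 1.0, 2.0)); // double the compute: 30.0
        System.out.println(estimateMinutes(baseline, 2.0, 1.0)); // double the data: 120.0
        System.out.println(estimateMinutes(baseline, 2.0, 2.0)); // double both: 60.0
    }
}
```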
#2 - Hadoop scales out linearly
Data locality!
#2 - Hadoop scales out linearly
Why is this so important?
What does it not do for me?
#3 - Hadoop runs on commodity hardware
• Non-proprietary
• Easy to acquire (all it takes is $)
• Value (not necessarily cheap)
• Let software handle the hard problems
#3 - Hadoop runs on commodity hardware
Why is this so important?
What does it not do for me?
#4 - Hadoop handles unstructured data
Query languages like SQL assume some sort of structure
Relational databases and other databases require structure
MapReduce/Spark is just Java/Scala/Python/etc
You can do anything Java can do
HDFS just stores files
You can store anything in a file
#4 - Hadoop handles unstructured data
Why is this so important?
What does it not do for me?
#5 - In Hadoop, you load data first and ask questions later
BEFORE:
ETL
Years of planning
Schemas & ER Diagrams
#5 - In Hadoop, you load data first and ask questions later
WITH HADOOP:
LOAD DATA FIRST, ASK QUESTIONS LATER
Data is parsed/interpreted as it is read out of HDFS
#5 - In Hadoop, you load data first and ask questions later
Why is this so important?
What does it not do for me?
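As a rough illustration of "load data first, ask questions later", the sketch below keeps raw lines exactly as they arrived and only interprets them when a question is asked. It is plain Java with no Hadoop dependency; the sample records, field layout, and all names are hypothetical and stand in for files stored in HDFS.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Schema-on-read in miniature: nothing is validated or structured at load time.
final class SchemaOnRead {
    private SchemaOnRead() {}

    // Hypothetical raw data, stored as-is (including a line that fits no schema).
    static final List<String> RAW_LINES = Arrays.asList(
        "2016-03-31,san-jose,42",
        "2016-03-31,boston,17",
        "not a valid record",
        "2016-04-01,san-jose,8");

    // Question 1, decided long after loading: what is the sum of the third field?
    static int totalCount(List<String> lines) {
        int total = 0;
        for (String line : lines) {
            String[] fields = line.split(",");
            if (fields.length != 3) {
                continue; // this reading of the data does not apply to the line
            }
            try {
                total += Integer.parseInt(fields[2].trim());
            } catch (NumberFormatException e) {
                // the cost of flexibility: bad values surface at read time
            }
        }
        return total;
    }

    // Question 2, a different reading of the same bytes: which cities appear?
    static List<String> distinctCities(List<String> lines) {
        return lines.stream()
            .map(line -> line.split(","))
            .filter(fields -> fields.length == 3)
            .map(fields -> fields[1])
            .distinct()
            .sorted()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(totalCount(RAW_LINES));     // 67
        System.out.println(distinctCities(RAW_LINES)); // [boston, san-jose]
    }
}
```

Both questions parse the same stored lines in their own way, and the malformed line is only noticed when something tries to read it.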
#6 - HDFS stores the data but has some major limitations
• Stores files in folders
• Nobody cares what’s in your files
• Chunks large files into blocks (~64MB-2GB)
• 3 replicas of each block
• Blocks are scattered all over the place
• Can scale to thousands of nodes and hundreds of petabytes
FILE BLOCKS
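The block and replica bullets above imply some simple arithmetic, sketched here. The 128 MB block size and the 1 GB file are assumptions chosen for illustration (the slide only gives a range), and the class and method names are hypothetical.

```java
// Arithmetic behind "chunks large files into blocks" and "3 replicas of each block".
final class HdfsBlockMath {
    private HdfsBlockMath() {}

    static final long MB = 1024L * 1024L;

    // Number of blocks needed for a file (ceiling division; the last block may be partial).
    static long blockCount(long fileBytes, long blockBytes) {
        if (fileBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and block size positive");
        }
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Raw disk consumed across the cluster once every block is replicated.
    static long rawBytes(long fileBytes, int replicas) {
        return fileBytes * replicas;
    }

    // Fraction of raw disk that holds unique data.
    static double storageEfficiency(int replicas) {
        return 1.0 / replicas;
    }

    public static void main(String[] args) {
        long file = 1024 * MB;  // assumed 1 GB file
        long block = 128 * MB;  // assumed block size
        System.out.println(blockCount(file, block));   // 8 blocks
        System.out.println(rawBytes(file, 3) / MB);    // 3072 MB of raw disk
        System.out.println(storageEfficiency(3));      // roughly one third
    }
}
```

With three replicas, only about a third of the raw disk holds unique data, which is the price paid for fault tolerance on commodity hardware.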
#6 - HDFS stores the data but has some major limitations
Limitations:
• Low IOPS
• Higher latency
• Can’t edit files
• Can’t handle small files
• Low storage efficiency (33%)
• Low throughput on single files
• But…
• High aggregate throughput
• Massive scale
• Software only
• Few bottlenecks
#6 - HDFS stores the data but has some major limitations
Why is this so important?
What does it not do for me?
#7 - YARN controls everything going on and is
mostly behind the scenes
• Controls the compute resources on the cluster
• Was the key new feature in Hadoop 2.0
• Abstracted resource management out of MapReduce to be more general
• MapReduce became just another application
• YARN is key in enabling multiple compute engines at once
#7 - YARN controls everything going on and is
mostly behind the scenes
Why is this so important?
What does it not do for me?
#8 - MapReduce may be getting a bad rap, but it’s
still really important (but other engines are
important, too)
• Analyzes raw data in HDFS where the data is
• Jobs are split into Mappers and Reducers
Mappers (you code this)
Loads data from HDFS
Filter, transform, parse
Outputs (key, value) pairs
Reducers (you code this, too)
Automatically groups by the mapper’s output key
Aggregate, count, statistics
Outputs to HDFS
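The mapper, group-by-key, and reducer flow described above can be simulated on one machine. This is a minimal plain-Java sketch with no Hadoop dependency, not the Hadoop API: the class and method names are hypothetical, and in a real job the framework performs the grouping step across the cluster.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Single-machine imitation of the map -> group by key -> reduce flow for word count.
final class LocalWordCount {
    private LocalWordCount() {}

    // Mapper (you code this): turn one line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    // Grouping step (done for you by the framework): collect values per key.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reducer (you code this, too): aggregate all values seen for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Drive the three steps in order and return word -> count, sorted by word.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groupByKey(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {be=2, do=1, not=1, or=1, to=3}
        System.out.println(run(Arrays.asList("to be or not to be", "to do")));
    }
}
```

Only `map` and `reduce` contain problem-specific logic; everything else is plumbing, which is the part Hadoop takes over and distributes.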
#8 - MapReduce may be getting a bad rap, but it’s
still really important (but other engines are
important, too)
“MapReduce is slow”
“MapReduce is hard to use”
#8 - MapReduce may be getting a bad rap, but it’s
still really important (but other engines are
important, too)
Real-time | Large-scale analytics | Ad-hoc
MapReduce!
MapReduce! Storm/streaming
MapReduce! Storm/streaming Impala/HAWQ/Stinger
MapReduce! Storm/streaming Impala/HAWQ/Stinger Spark
MapReduce! Storm/streaming Spark
MapReduce! Spark
Not everyone has this problem, but it’s a really interesting problem!
#8 - MapReduce may be getting a bad rap, but
it’s still really important
Why is this so important?
What does it not do for me?
#9 - Hadoop is open source
Free – money isn’t just a financial barrier, but a bureaucratic one, too
Help yourself – Hadoop is a complex system underneath and sometimes
you need to figure something out for yourself
Adoption – it’s easier to adopt, so adoption is more widespread
Expansion – can be extended by anyone
#9 - Hadoop is open source
Why is this so important?
What does it not do for me?
#10 - The Hadoop ecosystem is constantly growing and evolving
Not only do individual Hadoop components improve…
But Hadoop overall improves with new components that do new things differently.
And they piece together into something that gets a lot of work done.
#10 - The Hadoop ecosystem is constantly growing and evolving
Why is this so important?
What does it not do for me?
Play by Hadoop’s rules and it’ll give you what you want


Editor's Notes

  • #5-#6 Hadoop masks being a distributed system: what it means for Hadoop to abstract away the details of distributed systems and why that’s a good thing
  • #7 Importance: get more done faster; barrier to entry. Downsides: knowing what you are doing; abstraction bleeding through.
  • #8-#12 Hadoop scales out linearly: why Hadoop’s linear scalability is a paradigm shift (but one with a few downsides)
  • #13 Importance: code stays the same as your cluster and problem grow; massively scalable. Downsides: need to do things in a linear way; it’s not always true.
  • #14-#15 Hadoop runs on commodity hardware: an honest definition of commodity hardware and why this is a good thing for enterprises
  • #16 Importance: ease of accessibility; cloud. Downsides: sometimes have a hard time leveraging fancier hardware.
  • #17 Hadoop handles unstructured data: why Hadoop is better for unstructured data than other data systems from a storage and computation perspective
  • #18 Importance: unstructured data. Downsides: cost of flexibility.
  • #19-#21 In Hadoop, you load data first and ask questions later: the differences between schema-on-read and schema-on-write and the drawbacks this represents
  • #22 Importance: solve the chicken-and-egg problem. Downsides: cost of flexibility.
  • #23-#24 HDFS stores the data but has some major limitations: an overview of HDFS (replication, not being able to edit files, and the NameNode)
  • #25 Importance: scalable data storage that works for analytics. Downsides: it’s bad storage.
  • #26 YARN controls everything going on and is mostly behind the scenes: an overview of YARN and the pitfalls of sharing resources in a distributed environment and the capacity scheduler
  • #27 Importance: YARN brings people closer to a universal distributed system without getting in the way (same path). Downsides: cost of abstraction; system complication.
  • #28-#35 MapReduce may be getting a bad rap, but it’s still really important: an overview of MapReduce (what it’s good at and bad at and why, while it isn’t used as much these days, it still plays an important role)
  • #36 Importance: MapReduce brings fault tolerance, long-running jobs, and reliability; other parts of the ecosystem work together to solve a problem. Downside: lack of a universal interface (Spark? Holy grail?).
  • #37 Hadoop is open source: what it really means for Hadoop to be open source from a practical perspective, not just a “feel good” perspective
  • #38 Importance: organic growth; competition. Downsides: ????
  • #39 The Hadoop ecosystem is constantly growing and evolving: an overview of current tools such as Spark and Kafka and a glimpse of some things on the horizon
  • #40 Importance: innovation; organization. Downsides: fractured; hard to track; lack of cohesiveness.