The Elephant in the Cloud: A Quest for the Next Generation
In this talk, I will go through the evolution of Hadoop and its ecosystem projects and will try to peer into the crystal ball to predict what may be coming down the pike. I will discuss various ways of crunching data on Hadoop (MapReduce, OpenMPI, Spark and various SQL engines) and how these tools complement each other.
Apache Hadoop is no longer just a faithful, open source, scalable implementation of two seminal papers that came out of Google 10 years ago. It has evolved into a project that provides enterprises with a reliable layer for storing massive amounts of unstructured data (HDFS) while allowing different computational frameworks to leverage those datasets.
The original computational framework (MapReduce) has evolved into a much more scalable set of general-purpose cluster management APIs collectively known as YARN. With YARN underneath, MapReduce is still there to support batch-oriented computations, but it is no longer the only game in town. With OpenMPI, Spark, and Tez rapidly becoming available, now is truly the most exciting time to be a developer in the Hadoop ecosystem. It is also the time when you don't have to be employed by Yahoo!, Facebook or eBay to have access to mind-blowing compute power. That power is a credit card and a pivotal.io account away for anybody on the planet.
I will conclude by outlining some of the ongoing work that makes Hadoop and its ecosystem projects first-class citizens in cloud environments, based on the work Pivotal engineers have done integrating Hadoop into the PivotalONE PaaS.
Apache Spark: killer or savior of Apache Hadoop? — rhatr
The Big Boss(tm) has just OKed the first Hadoop cluster in the company. You are the guy in charge of analyzing petabytes of your company's valuable data using a combination of custom MapReduce jobs and SQL-on-Hadoop solutions. All of a sudden the web is full of articles telling you that Hadoop is dead, Spark has won and you should quit while you're still ahead. But should you?
Fast, Scalable Graph Processing: Apache Giraph on YARN — DataWorks Summit
Apache Giraph performs offline, batch processing of very large graph datasets on top of a Hadoop cluster. Giraph replaces iterative MapReduce-style solutions with Bulk Synchronous Parallel graph processing using in-memory or disk-based data sets, loosely following the model of Google's Pregel. Many recent advances have left Giraph more robust, efficient, and fast, and able to accept a variety of I/O formats typical for graph data in and out of the Hadoop ecosystem. Giraph's recent port to a pure YARN platform offers increased performance, fine-grained resource control, and scalability that Giraph atop Hadoop MRv1 cannot match, while paving the way for ports to other platforms like Apache Mesos. Come see what's on the roadmap for Giraph, what Giraph on YARN means, and how Giraph is leveraging the power of YARN to become a more robust, usable, and useful platform for processing Big Graph datasets.
Search in the Apache Hadoop Ecosystem: Thoughts from the Field — Alex Moundalexis
This presentation describes the Hadoop ecosystem and gives examples of how these open source tools are combined and used to solve specific and sometimes very complex problems. Drawing upon case studies from the field, Mr. Moundalexis demonstrates that one-size, rigid traditional systems don’t fit all, but that combinations of tools in the Apache Hadoop ecosystem provide a versatile and flexible platform for integrating, finding, and analyzing information.
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster — Milind Bhandarkar
The document summarizes Milind Bhandarkar's work developing Hamster, a system for running MPI applications on Hadoop YARN. Some key points:
- Hamster allows MPI applications to run alongside Hadoop dataflow jobs on the same cluster managed by YARN. It implements an MPI runtime on top of YARN.
- Hamster's design leverages OpenMPI's strengths while allowing it to integrate with YARN. It includes an application master, node service, and scheduler component.
- Performance tests show Hamster has low overhead and scales well for large MPI jobs. It introduces only a small performance penalty compared to running MPI natively with OpenMPI.
- Example results are shown
- Hadoop was created to allow processing of large datasets in a distributed, fault-tolerant manner. It was originally developed by Doug Cutting and Mike Cafarella as part of the Apache Nutch project, inspired by the papers Google published on its distributed file system and MapReduce.
- The core of Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for distributed processing. It also includes utilities like Hadoop Common for file system access and other basic functionality.
- Hadoop's goals were to process multi-petabyte datasets across commodity hardware in a reliable, flexible and open source way. It assumes failures are expected and handles them to provide fault tolerance.
This document provides an overview of Apache Giraph, an open source system for processing large graphs distributed across clusters. It discusses how Giraph implements Google's Pregel model using Hadoop and allows processing billion-edge graphs through its bulk synchronous parallel programming model. Key points covered include Giraph's architecture, programming model based on vertices sending messages to other vertices, example applications like ranking and community detection, and improvements to performance through use of Netty for messaging.
10 concepts the enterprise decision maker needs to understand about Hadoop — Donald Miner
Way too many enterprise decision makers have clouded and uninformed views of how Hadoop works and what it does. Donald Miner offers high-level observations about Hadoop technologies and explains how Hadoop can shift the paradigms inside of an organization, based on his report Hadoop: What You Need To Know—Hadoop Basics for the Enterprise Decision Maker, forthcoming from O’Reilly Media.
After a basic introduction to Hadoop and the Hadoop ecosystem, Donald outlines 10 basic concepts you need to understand to master Hadoop:
Hadoop masks being a distributed system: what it means for Hadoop to abstract away the details of distributed systems and why that’s a good thing
Hadoop scales out linearly: why Hadoop’s linear scalability is a paradigm shift (but one with a few downsides)
Hadoop runs on commodity hardware: an honest definition of commodity hardware and why this is a good thing for enterprises
Hadoop handles unstructured data: why Hadoop is better for unstructured data than other data systems from a storage and computation perspective
In Hadoop, you load data first and ask questions later: the differences between schema-on-read and schema-on-write and the drawbacks this represents
Hadoop is open source: what it really means for Hadoop to be open source from a practical perspective, not just a “feel good” perspective
HDFS stores the data but has some major limitations: an overview of HDFS (replication, not being able to edit files, and the NameNode)
YARN controls everything going on and is mostly behind the scenes: an overview of YARN and the pitfalls of sharing resources in a distributed environment and the capacity scheduler
MapReduce may be getting a bad rap, but it’s still really important: an overview of MapReduce (what it’s good at and bad at and why, while it isn’t used as much these days, it still plays an important role)
The Hadoop ecosystem is constantly growing and evolving: an overview of current tools such as Spark and Kafka and a glimpse of some things on the horizon
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition) — Uwe Printz
Talk held at the IT-Stammtisch Darmstadt on November 8, 2013
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
The Evolution of Hadoop at Spotify - Through Failures and Pain — Rafał Wojdyła
The quickest way to learn and evolve infrastructure is by encountering obstacles and being forced to overcome limitations that keep you inches away from project goals. At Spotify, we’ve encountered many of these obstacles and frustrations as we grew our Hadoop cluster from a few machines in an office closet aggregating played song events for financial reports, to our current 900 node cluster that plays a large role in many features that you see in our application today.
Two members of Spotify’s Hadoop ‘squad’ will weave in war stories, failures, frustrations and lessons learned to describe the Hadoop/Big Data architecture at Spotify and talk about how that architecture has evolved.
We’ll talk about how and why we use a number of tools, including Apache Falcon and Apache Bigtop to test changes; Apache Crunch, Scalding and Hive w/ Tez to build features and provide analytics; and Snakebite and Luigi, two in-house tools created to overcome common frustrations.
Hivemall is a scalable machine learning library built as a collection of Hive UDFs. It allows users to perform machine learning tasks like classification, regression, recommendation, and anomaly detection using SQL queries. This provides an easy and scalable way to do machine learning without needing to code in other languages or move data outside of Hive. Hivemall implements many common algorithms as UDFs and UDTFs so that machine learning can be performed interactively on large datasets stored in Hive.
The document provides an overview of the Hadoop ecosystem. It introduces Hadoop and its core components, including MapReduce and HDFS. It describes other related projects like HBase, Pig, Hive, Mahout, Sqoop, Flume and Nutch that provide data access, algorithms, and data import capabilities to Hadoop. The document also discusses hosted Hadoop frameworks and the major Hadoop providers.
Hadoop Summit 2016 presentation.
As Yahoo continues to grow and diversify its mobile products, its platform team faces an ever-expanding, sophisticated group of analysts and product managers, who demand faster answers: to measure key engagement metrics like Daily Active Users, app installs and user retention; to run longitudinal analysis; and to re-calculate metrics over rolling time windows. All of this, quickly enough not to need a coffee break while waiting for results. The optimal solution for this use-case would have to take into account raw performance, cost, security implications and ease of data management. Could the benefits of Hive, ORC and Tez, coupled with a good data design provide the performance our customers crave? Or would it make sense to use more nascent, off-grid querying systems? This talk will examine the efficacy of using Hive for large-scale mobile analytics. We will quantify Hive performance on a traditional, shared, multi-tenant Hadoop cluster, and compare it with more specialized analytics tools on a single-tenant cluster. We will also highlight which tuning parameters yield maximum benefits, and analyze the surprisingly ineffectual ones. Finally, we will detail several enhancements made by Yahoo's Hive team (in split calculation, stripe elimination and the metadata system) to successfully boost performance.
The document discusses functional programming concepts and their application to big data problems. It provides an overview of functional programming foundations and languages. Key functional programming concepts discussed include first-class functions, pure functions, recursion, and immutability. These concepts are well-suited for data-centric applications like Hadoop MapReduce. The document also presents a case study comparing an imperative approach to a transaction processing problem to a functional approach, showing that the functional version was faster and avoided side effects.
How Apache Drives Music Recommendations At Spotify — Josh Baer
The slides go through the high-level process of generating personalized playlists for all Spotify's users, using Apache big data products extensively.
Presentation given at Apache: Big Data Europe conference on September 29th, 2015 in Budapest.
This document provides an overview of Hadoop, MapReduce, and HDFS. It discusses how Hadoop uses a cluster of commodity hardware and HDFS to reliably store and process large amounts of data in a distributed manner. MapReduce is the programming model used by Hadoop to process data in parallel across nodes. The document describes the core Hadoop modules and architecture, how HDFS stores and retrieves data blocks, and how MapReduce distributes work and aggregates results. Examples of using MapReduce for word counting and inverted indexes are also presented.
This document provides an agenda and summaries for a meetup on introducing DataFrames and R on Apache Spark. The agenda includes overviews of Apache Spark 1.3, DataFrames, R on Spark, and large scale machine learning on Spark. There will also be discussions on news items, contributions so far, what's new in Spark 1.3, more data source APIs, what DataFrames are, writing DataFrames, and DataFrames with RDDs and Parquet. Presentations will cover Spark components, an introduction to SparkR, and Spark machine learning experiences.
"Big Data" is a much-hyped term nowadays in Business Computing. However, the core concept of collaborative environments conducting experiments over large shared data repositories has existed for decades. In this talk, I will outline how recent advances in Cloud Computing, Big Data processing frameworks, and agile application development platforms enable Data Intensive Cloud Applications. I will provide a brief history of efforts in building scalable & adaptive run-time environments, and the role these runtime systems will play in new Cloud Applications. I will present a vision for cloud platforms for science, where data-intensive frameworks such as Apache Hadoop will play a key role.
This document discusses big data analytics platforms and techniques. It describes various open-source projects like Hadoop, Spark, and Mahout that can perform analytics on large datasets. It also discusses commercial analytics platforms from vendors like SAS, Alpine, and Revolution Analytics. Spark is highlighted as gaining rapid adoption for its speed and expanding machine learning capabilities. Key questions are raised about which open-source projects and commercial offerings will emerge as leaders in their categories.
The Apache Hadoop project and the Hadoop ecosystem have been designed to be extremely flexible and extensible. HDFS, YARN, and MapReduce combined have more than 1,000 configuration parameters that allow users to tune the performance of Hadoop applications and, more importantly, extend Hadoop with application-specific functionality without having to modify any of the core Hadoop code.
In this talk, I will start with simple extensions, such as writing a new InputFormat to efficiently process video files. I will then present some extensions that boost application performance, such as optimized compression codecs and pluggable shuffle implementations. With the refactoring of the MapReduce framework and the emergence of YARN as a generic resource manager for Hadoop, one can extend Hadoop further by implementing new computation paradigms.
I will discuss one such computation framework, which allows message-passing applications to run in the Hadoop cluster alongside MapReduce. I will conclude by outlining some of our ongoing work that extends HDFS by removing the namespace limitations of the current Namenode implementation.
Hivemall: Scalable Machine Learning Library for Apache Hive — DataWorks Summit
This document discusses Hivemall, a scalable machine learning library for Apache Hive. It begins with an overview of what Hivemall is and why it was created. Hivemall implements machine learning algorithms like classification, regression, and recommendation as Hive UDFs and UDTFs. It aims to make machine learning more accessible to users of Hive by eliminating the need for programming. The document then covers how Hivemall works, how it handles iterations without multiple MapReduce jobs, and experimental results comparing it to other frameworks.
This is a talk I gave at Data Science MD meetup. It was based on the talk I gave about a month before at Data Science NYC (http://www.slideshare.net/DonaldMiner/data-scienceandhadoop). I talk about data exploration, NLP, Classifiers, and recommendation systems, plus some other things. I tried to depict a realistic view of Hadoop here.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable, scalable, and distributed processing of petabytes of data. Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. Many large companies use Hadoop for applications such as log analysis, web indexing, and data mining of large datasets.
Hadoop - Looking to the Future By Arun Murthy — huguk
Hadoop - Looking to the Future
By Arun Murthy (Founder of Hortonworks, Creator of YARN)
The Apache Hadoop ecosystem began as just HDFS & MapReduce nearly 10 years ago in 2006.
Very much like the Ship of Theseus (http://en.wikipedia.org/wiki/Ship_of_Theseus), Hadoop has undergone an incredible amount of transformation: from multi-purpose YARN, to interactive SQL with Hive/Tez, to machine learning with Spark.
Much more lies ahead: whether you want sub-second SQL with Hive or use SSDs/Memory effectively in HDFS or manage Metadata-driven security policies in Ranger, the Hadoop ecosystem in the Apache Software Foundation continues to evolve to meet new challenges and use-cases.
Arun C Murthy has been involved with Apache Hadoop since the beginning of the project - nearly 10 years now. In the beginning he led MapReduce, went on to create YARN and then drove Tez & the Stinger effort to get to interactive & sub-second Hive. Recently he has been very involved in the Metadata and Governance efforts. In between he founded Hortonworks, the first public Hadoop distribution company.
Apache Spark is rapidly emerging as the prime platform for advanced analytics in Hadoop. This briefing is updated to reflect news and announcements as of July 2014.
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes — rhatr
Apache Giraph allows users to start analyzing graph relationships in big data within 45 minutes. It is an Apache Hadoop-based framework for graph processing that uses the Bulk Synchronous Parallel (BSP) model. Giraph allows for extracting graph relationships from unstructured data and iterative, exploratory analytics on large graphs distributed across a cluster. It provides a programming model and API for graph processing that leverages Hadoop and HDFS for storage and parallelism.
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata — rhatr
You’ve got your Hadoop cluster, you’ve got your petabytes of unstructured data, you run MapReduce jobs and SQL-on-Hadoop queries. Something is still missing, though. After all, we are not expected to enter SQL queries while looking for information on the web. AltaVista and Google solved it for us ages ago. Why are we still requiring SQL or Java certification from our enterprise bigdata users? In this talk, we will look into how the integration of SolrCloud into Apache Bigtop now enables building bigdata indexing solutions and ingest pipelines. We will dive into the details of integrating full-text search into the lifecycle of your bigdata management applications and exposing the power of Google-in-a-box to all enterprise users, not just a chosen few data scientists.
This document provides an overview of Hadoop architecture. It discusses how Hadoop uses MapReduce and HDFS to process and store large datasets reliably across commodity hardware. MapReduce allows distributed processing of data through mapping and reducing functions. HDFS provides a distributed file system that stores data reliably in blocks across nodes. The document outlines components like the NameNode, DataNodes and how Hadoop handles failures transparently at scale.
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.
The document discusses using Python with Hadoop frameworks. It introduces Hadoop Distributed File System (HDFS) and MapReduce, and how to use the mrjob library to write MapReduce jobs in Python. It also covers using Python with higher-level Hadoop frameworks like Pig, accessing HDFS with snakebite, and using Python clients for HBase and the PySpark API for the Spark framework. Key advantages discussed are Python's rich ecosystem and ability to access Hadoop frameworks.
The document provides an overview of Hadoop and its ecosystem. It discusses the history and architecture of Hadoop, describing how it uses distributed storage and processing to handle large datasets across clusters of commodity hardware. The key components of Hadoop include HDFS for storage, MapReduce for processing, and an ecosystem of related projects like Hive, HBase, Pig and Zookeeper that provide additional functions. Advantages are its ability to handle unlimited data storage and high speed processing, while disadvantages include lower speeds for small datasets and limitations on data storage size.
The document provides an overview of Hadoop and its ecosystem. It discusses the history and architecture of Hadoop, describing how it uses distributed storage and processing to handle large datasets across clusters of commodity hardware. The key components of Hadoop include HDFS for storage, MapReduce for processing, and additional tools like Hive, Pig, HBase, Zookeeper, Flume, Sqoop and Oozie that make up its ecosystem. Advantages are its ability to handle unlimited data storage and high speed processing, while disadvantages include lower speeds for small datasets and limitations on data storage size.
This document discusses building big data solutions using Microsoft's HDInsight platform. It provides an overview of big data and Hadoop concepts like MapReduce, HDFS, Hive and Pig. It also describes HDInsight and how it can be used to run Hadoop clusters on Azure. The document concludes by discussing some challenges with Hadoop and the broader ecosystem of technologies for big data beyond just Hadoop.
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard) — npinto
This document provides an introduction and overview of Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It outlines what Hadoop is, how its core components MapReduce and HDFS work, advantages like scalability and fault tolerance, disadvantages like complexity, and resources for getting started with Hadoop installations and programming.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable, scalable, and distributed processing of large data sets across clusters of commodity hardware. The core of Hadoop includes a storage part called HDFS for reliable data storage, and a processing part called MapReduce that processes data in parallel on a large cluster. Hadoop also includes additional projects like Hive, Pig, HBase, Zookeeper, Oozie, and Sqoop that together form a powerful data processing ecosystem.
This presentation provides an overview of Hadoop, including what it is, how it works, its architecture and components, and examples of its use. Hadoop is an open-source software platform for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable, scalable and distributed processing of large datasets through its core components - the Hadoop Distributed File System (HDFS) for storage, and MapReduce for processing.
Apache Giraph is a large-scale graph processing system built on Hadoop. It provides an iterative processing model and vertex-centric programming model for graphs that can be too large for a single machine. Giraph scales to graphs with trillions of edges by distributing computation across a Hadoop cluster. It is faster than traditional MapReduce approaches for graph algorithms and allows graphs to be processed in memory across iterations while only writing intermediate data to disk.
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2 — tcloudcomputing-tw
The presentation is designed for those interested in Hadoop technology and covers community history, current development status, service features, the distributed computing framework, and scenarios for big data development in the enterprise.
Scaling up with Hadoop and Banyan at ITRIX-2015, College of Engineering, Guindy — Rohit Kulkarni
The document discusses LatentView Analytics and provides an overview of data processing frameworks and MapReduce. It introduces LatentView Analytics, describing its services, partners, and experience. It then discusses distributed and parallel processing frameworks, providing examples like Hadoop, Spark, and Storm. It also provides a brief history of Hadoop, describing its key developments from 1999 to present day in addressing challenges of indexing, crawling, distributed processing etc. Finally, it explains the MapReduce process and provides a simple example to illustrate mapping and reducing functions.
Apache Hadoop is a popular open-source framework for storing and processing large datasets across clusters of computers. It includes Apache HDFS for distributed storage, YARN for job scheduling and resource management, and MapReduce for parallel processing. The Hortonworks Data Platform is an enterprise-grade distribution of Apache Hadoop that is fully open source.
Hortonworks' mission is to enable modern data architectures by delivering an enterprise-ready Apache Hadoop platform. They contribute the majority of code to Apache Hadoop and its related projects. Hortonworks develops the Hortonworks Data Platform (HDP), which provides core Hadoop services along with operational and data services to make Hadoop an enterprise data platform. Hortonworks aims to power data architectures by enabling Hadoop as a multi-purpose platform for batch, interactive, streaming and other workloads through projects like YARN, Tez, and improvements to Hive.
1. Elephant in the Cloud:
a quest for the next generation
Hadoop architecture
Roman Shaposhnik
Sr. Manager, Open Source Hadoop Platform @Pivotal
(Twitter: @rhatr)
2. Who’s this guy?
• Sr. Manager @Pivotal building a team of OS contributors
• Apache Software Foundation guy (VP of Apache Incubator, VP of
Apache Bigtop, committer on Hadoop, Giraph, Sqoop, etc)
• Used to be root@Cloudera
• Used to be PHB@Yahoo! (original Hadoop team)
• Used to be a hacker at Sun Microsystems (Sun Studio compilers
and tools)
11. Big Data Utility Gap (average enterprises)
• 70% of data generated by customers
• 80% of data being stored
• 3% being prepared for analysis
• 0.5% being analyzed
• <0.5% being operationalized
(3 exabytes generated per day now; 40 trillion total gigabytes in 2020, or 162 iPhones of storage for every human.)
15. HDFS: not a POSIX fs
• Huge blocks: 64MB (128MB)
• Mostly immutable files (append, truncate)
• Streaming data access
• Block replication
16. How do I use it?
$ hadoop fs -lsr /
# hadoop-fuse-dfs dfs://hadoop-hdfs /mnt
$ ls /mnt
# mount -t nfs -o vers=3,proto=tcp,nolock host:/ /mnt
$ ls /mnt
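Beyond the shell, FUSE, and NFS gateways, applications typically talk to HDFS through the Java FileSystem API. A minimal sketch of listing the root directory (the class name and path here are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsRoot {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Print the path and length of every entry under /
    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.println(status.getPath() + "\t" + status.getLen());
    }
  }
}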
18. Pivotal's Focus on Data Lakes
[Architecture diagram: a Data Lake on HDFS holding raw "untouched" data, processed data, and analytical data marts/sandboxes. In-memory parallel ingest feeds it from traditional data sources (ERP, HR, SFDC), existing EDW/datamarts, and new data sources/formats (machine data). ELT processing runs on Hadoop (MapReduce/SQL/Pig/Hive); in-memory services, data management (search engine), and BI/analytical tools sit on top, with security and control throughout. Business users: "Finally! I now have full transparency on the data with amazing speed!", "All data is now accessible!", "I can now afford Big Data!"]
21. MapReduce
• Batch oriented (long jobs; final results)
• Brings the computation to the data
• Very constrained programming model
• Embarrassingly parallel programming model
• Used to be the only game in town for compute
28. How do I use it?
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
29. How do I use it?
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
30. How do I run it?
$ hadoop jar hadoop-examples.jar wordcount input output
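For completeness, a minimal driver that wires TokenizerMapper and IntSumReducer into a job might look like this (a sketch; the WordCount enclosing class name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);        // illustrative enclosing class
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // the reducer doubles as a combiner
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}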
32. Hadoop’s childhood
• Compact (pretty much a single jar)
• Challenged in scalability and SPOFs
• Extremely batch oriented
• Hard for non-Java programmers
36. Hadoop 2.0
• HDFS 2.0
• Yet Another Resource Negotiator (YARN)
• MapReduce is just an “application” now
• Tez is another “application”
• Pivotal's Hamster (OpenMPI) is yet another one
42. Hamster
• Hadoop and MPI on the same cluster
• OpenMPI Runtime on Hadoop YARN
• Hadoop Provides: Resource Scheduling,
Process monitoring, Distributed File System
• Open MPI Provides: Process launching,
Communication, I/O forwarding
52. Apache HBase
• Small mutable records vs. HDFS files
• HFiles kept in HDFS
• Memcached for HDFS
• Built on HDFS and Zookeeper
• Modeled after Google's Bigtable
53. HBase data model
• Driven by the original Webtable use case:
  row key: com.cnn.www
    content:      → "html..."
    anchor:a.com  → "CNN"
    anchor:b.com  → "CNN.co"
54. How do I use it?
HTable table = new HTable(config, "table");
Put p = new Put(Bytes.toBytes("row"));
p.add(Bytes.toBytes("family"),
      Bytes.toBytes("qualifier"),
      Bytes.toBytes("data"));
table.put(p);
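Reading the cell back is symmetric. A minimal sketch against the same (illustrative) table, family, and qualifier names:

Get g = new Get(Bytes.toBytes("row"));
Result r = table.get(g);
// Fetch the single cell stored by the Put above
byte[] data = r.getValue(Bytes.toBytes("family"), Bytes.toBytes("qualifier"));
System.out.println(Bytes.toString(data));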
60. GemFire XD: a better HBase?
• Closed source but extremely mature
• SQL/Objects/JSON data model
• High concurrency, high update load
• Mostly selective point queries (no scans)
• Tiered storage architecture
61. YCSB Benchmark: Throughput is 2-12X
[Bar charts: throughput (ops/sec, 0-800,000) for HBase vs. GemFire XD across YCSB workloads (AU, BU, CU, D, FU, LOAD), with series for 4, 8, 12, and 16 nodes.]
64. Querying data
• MapReduce: “an assembly language”
• Apache Pig: a data manipulation DSL (now
Turing complete!)
• Apache Hive: a batch-oriented SQL on top
of Hadoop
65. How do I use Pig?
grunt> A = load './input.txt';
grunt> B = foreach A generate
           flatten(TOKENIZE((chararray)$0)) as word;
grunt> C = group B by word;
grunt> D = foreach C generate COUNT(B), group;
grunt> dump D;
66. How do I use Hive?
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
67. Can we short Oracle now?
• No indexing
• Batch oriented scheduling
• Optimization for long running queries
• Metadata management is still in flux
72. Getting data in: Flume
• Designed for collecting log data
• Flexible deployment topology
73. Sqoop: RDBMS connection
• Sqoop 1
  • A MapReduce tool
  • Must use Oozie for workflows
• Sqoop 2
  • Well, 0.99.x really
  • A standalone service
74. Spring XD
• Unified, distributed, extensible system for data ingestion, real-time analytics and data export
• Apache licensed, but not an ASF project
• A runtime service, not a library
• AKA "Oozie + Flume + Sqoop + Morphlines"
75. How do I use it?
# deployment: ./xd-singlenode
$ ./xd-shell
xd:> hadoop config fs --namenode hdfs://nn:8020
xd:> stream create --definition "time | hdfs" --name ticktock
xd:> stream destroy --name ticktock
76. Feeding the Elephant
[Ecosystem diagram built around HDFS and YARN: getting data in via Sqoop and Flume; processing with MapReduce, Tez, Pig, Hive, Giraph, Crunch, and Mahout; serving with HBase, Phoenix, and SolrCloud; coordination and workflow management with Zookeeper and Oozie. Legend: ASF projects (the above), FLOSS projects (Hadoop UI: Hue), Pivotal products (Command Center, GemFire XD, SpringXD, Hamster).]
78. What’s wrong with MR?
Source: UC Berkeley Spark project (just the image)
79. Spark innovations
• Resilient Distributed Datasets (RDDs)
• Distributed on a cluster
• Manipulated via parallel operators (map, etc.)
• Automatically rebuilt on failure
• A parallel ecosystem
• A solution to iterative and multi-stage apps
82. An alternative backend
• Shark: a Hive on Spark
• Spork: a Pig on Spark
• MLlib: machine learning on Spark
• GraphX: Graph processing on Spark
• Also featuring its own streaming engine
83. How do I use it?
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
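For comparison, a minimal sketch of the same pipeline in Spark's Java API (assuming the Spark 1.x API of this era, where flatMap takes a function returning an Iterable, and Java 8 lambdas):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class JavaWordCount {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("wordcount"));
    JavaRDD<String> file = sc.textFile("hdfs://...");
    JavaPairRDD<String, Integer> counts = file
        .flatMap(line -> Arrays.asList(line.split(" "))) // lines -> words
        .mapToPair(word -> new Tuple2<>(word, 1))        // word -> (word, 1)
        .reduceByKey((a, b) -> a + b);                   // sum per word
    counts.saveAsTextFile("hdfs://...");
    sc.stop();
  }
}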
87. Hadoop Maturity
• ETL Offload: accommodate massive data growth with existing EDW investments
• Data Lakes: unify unstructured and structured data access
• Big Data Apps: build analytic-led applications impacting top-line revenue
• Data-Driven Enterprise: app dev and operational management on HDFS
(Stages along the data architecture maturity curve.)
88. Pivotal HD on Pivotal CF
• Enterprise PaaS management system
• Flexible multi-language 'buildpack' architecture
• Deployed applications enjoy built-in services
• On-premise Hadoop as a Service
• Single-cluster deployment of Pivotal HD
• Developers instantly bind to shared Hadoop clusters
• Speeds up time-to-value
89. Pivotal Data Fabric Evolution
[Diagram: the Pivotal Data Platform on a software-defined datacenter, combining SQL services (analytic data marts), an in-memory database (operational intelligence, run-time applications), data management services (data staging platform), and streaming services (stream ingestion), with new data fabrics (in-memory grid, etc.) emerging.]