Elephant in the Cloud:
a quest for the next generation
Hadoop architecture	

Roman Shaposhnik	

Sr. Manager, Open Source Hadoop Platform @Pivotal	

(Twitter: @rhatr)
Who’s this guy?	

•  Sr. Manager @Pivotal building a team of OS contributors	

•  Apache Software Foundation guy (VP of Apache Incubator, VP of
Apache Bigtop, committer on Hadoop, Giraph, Sqoop, etc)	

•  Used to be root@Cloudera	

•  Used to be PHB@Yahoo! (original Hadoop team)	

•  Used to be a hacker at Sun Microsystems (Sun Studio compilers
and tools)
Agenda
Long, long time ago…

[Diagram: just HDFS and MapReduce (legend: ASF projects, FLOSS projects, Pivotal products)]
In a blink of an eye:

[Diagram: a sprawling ecosystem on top of HDFS and YARN: MapReduce, Tez, Pig, Hive, Sqoop, Flume, Oozie, ZooKeeper, Hue, HBase, Phoenix, SolrCloud, Crunch, Mahout, Giraph, Spark (Shark, Streaming, MLlib, GraphX), Impala, HAWQ, GemFire XD, SpringXD, MADlib, Hamster, PivotalR, Command Center, plus coordination and workflow management; legend: ASF projects, FLOSS projects, Pivotal products]
Genesis of Hadoop	

• Google papers on GFS and MapReduce	

• A subproject of Apache Nutch	

• A bet by Yahoo!
Data brings value	

• What features to add to the product	

• Data analysis must enable decisions	

• The three Vs: volume, velocity, variety
Big Data brings big value
Entering: Industrial Data
Big Data Utility Gap

• 70% of data generated by customers

• 80% of data being stored

• 3% being prepared for analysis

• 0.5% being analyzed

• <0.5% being operationalized

Average enterprises: 3 exabytes per day now, and 40 trillion total gigabytes in 2020 (or 162 iPhones of storage for every human)?
Hadoop’s childhood	

• HDFS: Hadoop Distributed Filesystem	

• MapReduce: computational framework
HDFS: not a POSIX fs

• Huge blocks: 64 MB (or 128 MB)

• Mostly immutable files (append, truncate)	

• Streaming data access	

• Block replication
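To make the block math concrete, here is a tiny plain-Java sketch (not Hadoop code; the 1 GB file is hypothetical) of what 128 MB blocks and the default 3x replication mean for a single file:

```java
public class HdfsBlockMath {
    // Number of HDFS blocks a file occupies: ceiling division by block size
    static long numBlocks(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Raw cluster storage consumed once every block is replicated
    static long rawStorage(long fileBytes, int replicationFactor) {
        return fileBytes * replicationFactor;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;     // a hypothetical 1 GB file
        long blockSize = 128L * 1024 * 1024;  // 128 MB HDFS block size

        System.out.println(numBlocks(oneGb, blockSize));  // 8 blocks
        System.out.println(rawStorage(oneGb, 3) / oneGb);  // 3 GB of raw storage
    }
}
```

Eight blocks for a 1 GB file, versus hundreds of thousands on a 4 KB-block filesystem: that is why the NameNode can keep all block metadata in memory.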
How do I use it?

$ hadoop fs -lsr /

# hadoop-fuse-dfs dfs://hadoop-hdfs /mnt
$ ls /mnt

# mount -t nfs -o vers=3,proto=tcp,nolock host:/ /mnt
$ ls /mnt
Principle #1	

HDFS is the data lake
Pivotal’s Focus on Data Lakes

[Diagram: a data lake on HDFS with in-memory parallel ingest of raw “untouched” data from traditional sources (ERP, HR, SFDC) and new data sources/formats (machine data); ELT processing with Hadoop (MapReduce/SQL/Pig/Hive) produces processed data feeding analytical data marts/sandboxes, in-memory services, BI/analytical tools, and the existing EDW/data marts, with security and control throughout. Business users: “All data is now accessible!”, “Finally! I now have full transparency on the data with amazing speed!”, “I can now afford Big Data”]
HDFS enables the stack

[Diagram: the same ecosystem picture; every component in the stack ultimately reads and writes HDFS]
Principle #2	

Apps share their
internal state
MapReduce	

• Batch oriented (long jobs; final results)	

• Brings the computation to the data	

• Very constrained programming model	

• Embarrassingly parallel programming model	

• Used to be the only game in town for compute
MapReduce Overview	

• Record = (Key, Value)	

• Key : Comparable, Serializable	

• Value: Serializable	

• Logical Phases: Input, Map, Shuffle, Reduce,
Output
Map	

• Input: (Key1, Value1)	

• Output: List(Key2, Value2)	

• Projections, Filtering, Transformation
Shuffle	

• Input: List(Key2, Value2)	

• Output	

• Sort(Partition(List(Key2, List(Value2))))	

• Provided by Hadoop; several customizations possible
Reduce	

• Input: List(Key2, List(Value2))	

• Output: List(Key3, Value3)	

• Aggregations
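The three logical phases above can be mimicked in plain Java (no Hadoop involved; this is a single-process sketch only) to make the type transitions concrete: (Key1, Value1) → List(Key2, Value2) → (Key2, List(Value2)) → (Key3, Value3):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LogicalPhases {
    // Word count expressed as Map -> Shuffle -> Reduce over in-memory lists
    static Map<String, Integer> wordCount(List<String> lines) {
        // Map: (offset, line) -> list of (word, 1)
        List<String[]> mapped = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                mapped.add(new String[] { word, "1" });
            }
        }
        // Shuffle: sort and group by key -> (word, [1, 1, ...])
        TreeMap<String, List<Integer>> shuffled = new TreeMap<>();
        for (String[] kv : mapped) {
            shuffled.computeIfAbsent(kv[0], k -> new ArrayList<>())
                    .add(Integer.parseInt(kv[1]));
        }
        // Reduce: (word, [1, 1, ...]) -> (word, sum)
        Map<String, Integer> counts = new TreeMap<>();
        shuffled.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("d a c", "a b c")));  // {a=2, b=1, c=2, d=1}
    }
}
```

In real Hadoop the map and reduce steps run on different machines and the shuffle moves data over the network; the types and the grouping semantics are the same.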
Anatomy of MapReduce

[Diagram: word count flowing HDFS → mappers → reducers → HDFS; each mapper turns its input split (e.g. “d a c”, “a b c”) into (word, 1) pairs, the shuffle groups them into (word, [1, 1, …]), and reducers emit the sums (e.g. a 3, b 1, c 2)]
MapReduce DataFlow
How do I use it?	

	

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
How do I use it?	

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
How do I run it?	

	

$ hadoop jar hadoop-examples.jar wordcount input output
Principle #3	

MapReduce is the assembly
language of Hadoop
Hadoop’s childhood	

• Compact (pretty much a single jar)	

• Challenged in scalability and SPOFs	

• Extremely batch oriented	

• Hard for non-Java programmers
Then, something happened
Hadoop 1.0

[Diagram: MapReduce sitting directly on top of HDFS]
Hadoop 2.0

[Diagram: YARN on top of HDFS, with MapReduce, Tez and Hamster running as YARN applications]
Hadoop 2.0	

• HDFS 2.0	

• Yet Another Resource Negotiator (YARN)	

• MapReduce is just an “application” now	

• Tez is another “application”	

• Pivotal’s Hamster (OpenMPI) yet another one
MapReduce 1.0

[Diagram: a single JobTracker farming task1 … taskN out to TaskTrackers co-located with HDFS storage]
YARN (AKA MR2.0)

[Diagram, two build slides: the ResourceManager takes over cluster-wide scheduling; the per-job JobTracker becomes an application master running in a container, with tasks launched alongside TaskTracker-style node daemons]
YARN	

• Yet Another Resource Negotiator	

• Resource Manager	

• Node Managers	

• Application Masters	

• Specific to paradigm, e.g. MR Application
master (aka JobTracker)
YARN: beyond MR

[Diagram: the ResourceManager scheduling non-MapReduce workloads, here MPI processes]
Hamster	

•  Hadoop and MPI on the same cluster	

•  OpenMPI Runtime on Hadoop YARN	

•  Hadoop Provides: Resource Scheduling, 
Process monitoring, Distributed File System	

•  Open MPI Provides: Process launching, 
Communication, I/O forwarding
Hamster Components	

• Hamster Application Master	

• Gang Scheduler, YARN Application
Preemption	

• Resource Isolation (lxc Containers)	

• ORTE: Hamster Runtime	

• Process launching, Wireup, Interconnect
Hamster Architecture
Hadoop 2.0

[Diagram: YARN on top of HDFS with MapReduce, Tez and Hamster, as before]
Hadoop ecosystem

[Diagram: the growing stack: HDFS and YARN underneath MapReduce, Tez, Pig, Hive, Sqoop, Flume, Oozie, ZooKeeper, Hue, HBase, Phoenix, SolrCloud, Crunch, Mahout, Giraph, Hamster, Command Center, plus coordination and workflow management]
There’s way too much stuff	

• Tracking dependencies	

• Integration testing	

• Optimizing the defaults	

• Rationalizing the behaviour
Wait! We’ve seen this!

[Diagram: GNU software layered on top of the Linux kernel]
Apache Bigtop

[Diagram: the Hadoop ecosystem (HBase, Pig, Hive) layered over Hadoop core (HDFS, YARN, MapReduce)]
Principle #4	

Apache Bigtop is how
the Hadoop distros get
defined
The ecosystem	

• Apache HBase	

• Apache Crunch, Pig, Hive and Phoenix	

• Apache Giraph	

• Apache Oozie	

• Apache Mahout	

• Apache Sqoop and Flume
Apache HBase	

• Small mutable records vs. HDFS files	

• HFiles kept in HDFS	

• Think “memcached for HDFS”

• Built on HDFS and Zookeeper	

• Modeled on Google’s Bigtable
HBase data model

• Driven by the original Webtable use case:

[Diagram: row key com.cnn.www with a content: column family holding “html...” and anchor: columns (anchor:a.com, anchor:b.com) holding cell values “CNN”, “CNN.co”]
How do I use it?	

HTable table = new HTable(config, "table");

Put p = new Put(Bytes.toBytes("row"));

p.add(Bytes.toBytes("family"),
      Bytes.toBytes("qualifier"),
      Bytes.toBytes("data"));

table.put(p);
Dataflow model

[Diagram: producers write into HBase, HBase persists to HDFS, consumers read from both]
When do I use it?	

• Serving up large amounts of data	

• Fast random access	

• Scan operations
Principle #5	

HBase: when you
need OLAP + OLTP
What if it's OLTP?

[Diagram: the ecosystem stack again; nothing in it is built for OLTP workloads]
GemFire XD

[Diagram: GemFire XD slotted into the stack alongside HBase]
GemFire XD: a better HBase?	

• Closed source but extremely mature

• SQL/Objects/JSON data model	

• High concurrency, high update load	

• Mostly selective point queries (no scans)	

• Tiered storage architecture
YCSB Benchmark: throughput is 2–12x better

[Charts: throughput (ops/sec, up to 800,000) on YCSB workloads AU, BU, CU, D, FU and LOAD at 4/8/12/16 nodes, HBase vs. GemFire XD]
YCSB Benchmark: latency is 2–20x better

[Charts: latency (μsec, up to 14,000) at 4/8/12/16 nodes, HBase vs. GemFire XD]
Principle #6	

There are always 3
implementations
Querying data	

• MapReduce: “an assembly language”	

• Apache Pig: a data manipulation DSL (now
Turing complete!)	

• Apache Hive: a batch-oriented SQL on top
of Hadoop
How do I use Pig?	

grunt> A = load './input.txt';

grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;

grunt> C = group B by word;

grunt> D = foreach C generate COUNT(B), group;
How do I use Hive?	

CREATE TABLE docs (line STRING);

LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;

CREATE TABLE word_counts AS

SELECT word, count(1) AS count FROM

(SELECT explode(split(line, '\\s')) AS word FROM docs) w

GROUP BY word

ORDER BY word;
Can we short Oracle now?	

• No indexing	

• Batch oriented scheduling	

• Optimization for long running queries	

• Metadata management is still in flux
[Close to] real-time SQL	

• Impala (inspired by Google’s F1)	

• Hive/Tez (AKA Stinger)	

• Facebook’s Presto (Hive’s lineage)	

• Pivotal’s HAWQ
HAWQ	

• Greenplum MPP database core

• True ANSI SQL support	

• HDFS storage backend	

• Parquet support is coming
Principle #7	

SQL on Hadoop
Feeding the elephant
Getting data in: Flume	

• Designed for collecting log data	

• Flexible deployment topology
Sqoop: RDBMS connectivity

• Sqoop 1	

• A MapReduce tool	

• Must use Oozie for workflows	

• Sqoop 2	

• Well, 0.99.x really	

• A standalone service
Spring XD	

• Unified, distributed, extensible system for data
ingestion, real-time analytics and data export

• Apache Licensed, not ASF	

• A runtime service, not a library	

• AKA “Oozie + Flume + Sqoop + Morphlines”
How do I use it?	

# deployment: ./xd-singlenode

$ ./xd-shell

xd:> hadoop config fs --namenode hdfs://nn:8020

xd:> stream create --definition "time | hdfs" --name ticktock

xd:> stream destroy --name ticktock
Feeding the Elephant

[Diagram: the stack with Sqoop, Flume and SpringXD highlighted as the ingestion paths]
Spark the disruptor

[Diagram: Spark, with Shark, Streaming, MLlib and GraphX, joining the stack next to MapReduce and Tez]
What’s wrong with MR?	

Source: UC Berkeley Spark project (just the image)
Spark innovations	

• Resilient Distributed Datasets (RDDs)

• Distributed on a cluster	

• Manipulated via parallel operators (map, etc.)	

• Automatically rebuilt on failure	

• A parallel ecosystem	

• A solution to iterative and multi-stage apps
RDDs	

warnings = textFile(…).filter(_.contains("warning"))
                      .map(_.split(' ')(1))

[Diagram: the resulting lineage graph: HadoopRDD (path = hdfs://…) → FilteredRDD (contains…) → MappedRDD (split…)]
Parallel operators	

• map, reduce	

• sample, filter	

• groupBy, reduceByKey	

• join, leftOuterJoin, rightOuterJoin	

• union, cross
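For Java programmers, most of these operators have rough local analogues in the JDK 8 streams API. The sketch below is an analogy on a single JVM, not Spark code (RDD operators additionally partition work across the cluster and recover from failures via lineage); the class and method names are illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class OperatorAnalogy {
    // map + reduceByKey analogue: occurrences per key, sorted for stable output
    static Map<String, Long> countByKey(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    // filter analogue: keep entries whose count exceeds a threshold
    static Map<String, Long> moreFrequentThan(Map<String, Long> counts, long threshold) {
        return counts.entrySet().stream()
                .filter(e -> e.getValue() > threshold)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> words = List.of("spark", "hadoop", "spark", "yarn", "spark");
        Map<String, Long> counts = countByKey(words);
        System.out.println(counts);                      // {hadoop=1, spark=3, yarn=1}
        System.out.println(moreFrequentThan(counts, 1)); // {spark=3}
    }
}
```

The key difference: a stream pipeline runs eagerly on one machine, while an RDD chain is a lazy, distributed, replayable computation.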
An alternative backend	

• Shark: a Hive on Spark	

• Spork: a Pig on Spark	

• MLlib: machine learning on Spark

• GraphX: Graph processing on Spark	

• Also featuring its own streaming engine
How do I use it?	

val file = spark.textFile("hdfs://...")

val counts = file.flatMap(line => line.split(" "))

.map(word => (word, 1))

.reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://...")
Principle #8	

Spark is the
technology of 2014
Where’s the cloud?
What’s new?	

• True elasticity	

• Resource partitioning	

• Security	

• Data marketplace	

• Data leaks/breaches
Hadoop Maturity

• ETL Offload: accommodate massive data growth with existing EDW investments

• Data Lakes: unify unstructured and structured data access

• Big Data Apps: build analytic-led applications impacting top-line revenue

• Data-Driven Enterprise: app dev and operational management on HDFS

[Chart: the four stages plotted against data-architecture maturity]
Pivotal HD on Pivotal CF

• Enterprise PaaS management system

• Flexible multi-language ‘buildpack’ architecture

• Deployed applications enjoy built-in services

• On-premise Hadoop as a Service

• Single-cluster deployment of Pivotal HD

• Developers instantly bind to shared Hadoop clusters

• Speeds up time-to-value
Pivotal Data Fabric Evolution

[Diagram: the Pivotal data platform on a software-defined datacenter: analytic data marts (SQL services), operational intelligence (in-memory database), run-time applications (in-memory grid), a data staging platform (data management services), stream ingestion (streaming services), plus new data fabrics, etc.]
Principle #9	

Hadoop in the Cloud
is one of many
distributed
frameworks
2014 is the year of Hadoop

[Diagram: the complete stack one last time: HDFS, YARN, MapReduce, Tez, Pig, Hive, Sqoop, Flume, Oozie, ZooKeeper, Hue, HBase, Phoenix, SolrCloud, Crunch, Mahout, Giraph, Spark (Shark, Streaming, MLlib, GraphX), Impala, HAWQ, GemFire XD, SpringXD, MADlib, Hamster, PivotalR, Command Center]
A NEW PLATFORM FOR A NEW
ERA
Credits	

• Apache Software Foundation	

• Milind Bhandarkar	

• Konstantin Boudnik	

• Robert Geiger	

• Susheel Kaushik	

• Mak Gokhale
Questions?
