Hadoop 2: Efficient multi-tenant workloads that enable the Modern Data Architecture
SCALE 12X, Los Angeles
February 23, 2014
David Kaiser
@ddkaiser
linkedin.com/in/dkaiser
facebook.com/dkaiser
dkaiser@cdk.com
dkaiser@hortonworks.com
Who Am I?
20+ years experience with Linux
3 years experience with Hadoop
Career experiences:
•  Data Warehousing
•  Geospatial Analytics
•  Open-source Solutions and Architecture
Employed at Hortonworks as a Senior Solutions Engineer

Hadoop 2: Efficient multi-tenant workloads
that enable the Modern Data Architecture
•  Abstract:
– Hadoop is about so much more than batch processing. With the recent release of Hadoop 2, there have been significant changes to how a Hadoop cluster uses resources.
– YARN, the new resource management component, allows for a more efficient mix of workloads across hardware resources, and enables new applications and new processing paradigms such as stream-processing.
– This talk will discuss the new design and components of Hadoop 2, and provide examples of Modern Data Architectures that leverage Hadoop 2.
What is This Thing?

http://hadoop.apache.org/
Misconceptions
•  Bucket brigade for large or slow data processing tasks
•  Batch processor – Another mainframe
•  Dumb/inflexible, trendy, too simple
•  Incorrect assumption that Java == SLOW
•  Incorrect assumption that Java == EVIL
Hadoop + Linux
Provides a 100% Open-Source framework for efficient
scalable data processing on commodity hardware
Hadoop – The Open-source Data Operating System
Linux – The Open-source Operating System
Commodity Hardware
Hadoop Fundamentals
•  Hadoop is a single system, across multiple Linux systems
•  Two basic capabilities of Hadoop
– Reliable, Redundant and Distributed Storage
– Distributed Computation
•  Storage: Hadoop Distributed File System (HDFS)
– Replicated, distributed filesystem
– Blocks written to underlying filesystem on multiple nodes
•  Computation
– Resource management
– Frameworks to divide workloads across collection of resources
–  Hadoop V1: MapReduce framework only
–  Hadoop V2: MapReduce, Tez, Spark, others…
HDFS: File create lifecycle
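[Figure: an HDFS CLIENT, the NameNode, and nodes in RACK1, RACK2, and RACK3. The client's FILE is split into blocks B1 and B2. Numbered steps 1–4 include Create and Complete, each block write is acknowledged (ack), and three replicas of each block are spread across the racks.]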
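The block-and-replica idea in the figure can be sketched in a few lines of Python. This is a toy model, not HDFS code: the function names, the tiny block size, and the round-robin choice of racks are all assumptions made for illustration, and the real placement policy is more involved.

```python
def split_into_blocks(data, block_size):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, racks, replication=3):
    """Assign every block to `replication` distinct racks.

    Hypothetical round-robin policy for illustration only;
    it is not the placement policy HDFS actually uses.
    """
    if replication > len(racks):
        raise ValueError("need at least as many racks as replicas")
    placement = {}
    for b in range(num_blocks):
        # Start each block on a different rack, then wrap around.
        placement[f"B{b + 1}"] = [
            racks[(b + r) % len(racks)] for r in range(replication)
        ]
    return placement


# A 150-byte "file" with a 100-byte block size gives two blocks, B1 and B2.
blocks = split_into_blocks(b"x" * 150, block_size=100)
placement = place_replicas(len(blocks), ["RACK1", "RACK2", "RACK3"])
for name, racks in placement.items():
    print(name, "->", racks)
```

Losing any single rack in this model still leaves two copies of every block, which is the property the replicated storage layer is meant to provide.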
  
Hadoop 1 Computation
•  MapReduce Framework
–  Combined both Resource Management and Application Logic in the same code
•  Limitations
–  Resource allocation units (slots) fixed per cluster
–  Difficult to use a cluster for differing or simultaneous workloads
The 1st Generation of Hadoop: Batch
HADOOP 1.0: Built for Web-Scale Batch Apps
[Figure: several Single App silos (BATCH, INTERACTIVE, ONLINE), each on its own HDFS.]
•  All other usage patterns must leverage that same infrastructure
•  Forces the creation of silos for managing mixed workloads
Hadoop MapReduce Classic
• JobTracker
– Manages cluster resources and job scheduling
• TaskTracker
– Per-node agent
– Manages tasks
MapReduce Classic: Limitations
• Scalability
– Maximum Cluster size – 4,000 nodes
– Maximum concurrent tasks – 40,000
– Coarse synchronization in JobTracker
•  Availability
– Failure kills all queued and running jobs
•  Hard partition of resources into map and reduce slots
– Low resource utilization
•  Lacks support for alternate paradigms and services
– Iterative applications implemented using MapReduce are 10x slower
Hadoop 1: Poor Utilization of Cluster Resources
Hadoop 1 JobTracker and TaskTracker used fixed-sized “slots” for resource allocation
Map tasks are waiting for the slots which are NOT currently used by reduce tasks
Hard-Coded values. Task tracker must be restarted after a change
Hadoop 2: Moving Past MapReduce
HADOOP 1.0: Single Use System (Batch Apps)
–  MapReduce (cluster resource management & data processing)
–  HDFS (redundant, reliable storage)
HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)
–  MapReduce and Others (data processing)
–  YARN (cluster resource management)
–  HDFS2 (redundant, highly-available & reliable storage)
Apache Tez as the new Primitive
HADOOP 1.0: MapReduce as Base
–  Pig (data flow), Hive (sql), Others (cascading)
–  MapReduce (cluster resource management & data processing)
–  HDFS (redundant, reliable storage)
HADOOP 2.0: Apache Tez as Base
–  Batch: MapReduce
–  Data Flow: Pig, SQL: Hive, Others (cascading), all on Tez (execution engine)
–  Real Time Stream Processing: Storm
–  Online Data Processing: HBase, Accumulo on ?? (HOYA) (continuous execution)
–  YARN (cluster resource management)
–  HDFS2 (redundant, reliable storage)
Tez – Execution Performance
•  Performance gains over Map Reduce
–  Eliminate replicated write barrier between successive computations.
–  Eliminate job launch overhead of workflow jobs.
–  Eliminate extra stage of map reads in every workflow job.
–  Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.
[Figure: Pig/Hive on MR compared with Pig/Hive on Tez.]
YARN: Taking Hadoop Beyond Batch
Store ALL DATA in one place…
Interact with that data in MULTIPLE WAYS

with Predictable Performance and Quality of Service
Applications Run Natively in Hadoop
–  BATCH (MapReduce)
–  INTERACTIVE (Tez)
–  ONLINE (HBase)
–  STREAMING (Storm, S4,…)
–  GRAPH (Giraph)
–  IN-MEMORY (Spark)
–  HPC MPI (OpenMPI)
–  OTHER (Search) (Weave…)
YARN (Cluster Resource Management)
HDFS2 (Redundant, Reliable Storage)
YARN Overview
•  Goals:
– Reduce the responsibilities of the JobTracker
–  Separate the resource management duties away from the job coordination duties
– Allow multiple simultaneous jobs
–  Enables different style and sized workloads in one cluster
•  Design:
– A separate Resource Manager
–  1 Global Resource Scheduler for the entire cluster
–  Each worker (slave) node runs a Node Manager, manages life-cycle of containers
– JobTracker is now called Application Master
–  Each Application has 1 Application Master
–  Manages application scheduling and task execution
YARN Architecture
[Figure: Client 1 and Client 2 talk to the ResourceManager (Scheduler). A grid of NodeManagers hosts the Application Masters AM 1 and AM2 along with their containers: Container 1.1, 1.2, and 1.3, and Container 2.1, 2.2, 2.3, and 2.4.]
Capacity Sharing: Concepts
• Application
– Application is a temporal job or a service submitted to YARN
– Examples
–  Map Reduce Job (job)
–  Storm topology (service)
• Container
– Basic unit of allocation
– Fine-grained resource allocation across multiple resource types (memory, cpu, disk, network, etc.)
–  container_0 = 2GB
–  container_1 = 1GB
– Replaces fixed map/reduce slots (from Hadoop 1.x)
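To illustrate how variable-sized containers differ from fixed slots, here is a minimal Python sketch. It assumes a simple first-fit placement by memory only. The `Node` class, the `allocate` function, and the node sizes are hypothetical, and the real YARN scheduler is considerably more sophisticated.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A worker node with some free memory (hypothetical model)."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)


def allocate(nodes, container_id, memory_mb):
    """Place a container on the first node with enough free memory (first-fit)."""
    for node in nodes:
        if node.free_mb >= memory_mb:
            node.free_mb -= memory_mb
            node.containers.append((container_id, memory_mb))
            return node.name
    return None  # no node can host the request right now


nodes = [Node("node1", free_mb=2560), Node("node2", free_mb=4096)]

first = allocate(nodes, "container_0", 2048)   # 2GB, as in the example above
second = allocate(nodes, "container_1", 1024)  # 1GB, as in the example above
print(first, second)
```

Because each request states its own size, the 1GB container is not forced into a 2GB-shaped slot; it simply goes wherever 1GB is free.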
  
YARN – Resource Allocation & Usage
•  ResourceRequest
–  Fine-grained resource ask to the ResourceManager
–  Ask for a specific amount of resources (memory, cpu etc.) on a specific machine or rack
–  Use special value of * for resource name for any machine
ResourceRequest fields: priority, resourceName, capability, numContainers

priority   capability       resourceName   numContainers
0          <2gb, 1 core>    host01         1
0          <2gb, 1 core>    rack0          1
0          <2gb, 1 core>    *              1
1          <4gb, 1 core>    *              1
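The ResourceRequest fields listed above can be modelled as a small Python record. This is a sketch of the data shape only: the class below is a hypothetical stand-in, not the YARN Java API, and the `matches` helper only shows how a host name, rack name, or `*` narrows where a container may run. It does not model how the scheduler reconciles the entries of one ask.

```python
from dataclasses import dataclass

ANY = "*"  # special resource name meaning "any machine"


@dataclass(frozen=True)
class ResourceRequest:
    """Hypothetical mirror of the fields shown on the slide."""
    priority: int
    resource_name: str   # host name, rack name, or "*"
    memory_mb: int       # capability: memory
    cores: int           # capability: CPU cores
    num_containers: int


# The values shown in the slide's example.
requests = [
    ResourceRequest(0, "host01", 2048, 1, 1),
    ResourceRequest(0, "rack0", 2048, 1, 1),
    ResourceRequest(0, ANY, 2048, 1, 1),
    ResourceRequest(1, ANY, 4096, 1, 1),
]


def matches(request, host, rack):
    """True if a node with this host and rack name satisfies the request's location."""
    return request.resource_name in (host, rack, ANY)


# A node named host02 in rack0 satisfies the rack-level and any-machine entries.
candidates = [r for r in requests if matches(r, host="host02", rack="rack0")]
for r in candidates:
    print(r)
```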
  
CGroup
•  Linux Kernel capability to limit, account and isolate resources
–  CPU : Controlling the prioritization of processes in the group. Think of it as a more advanced nice level
–  Memory : Allow for setting limits on RAM and swap usage
–  Disk I/O
–  Network
•  YARN currently supports CPU and Memory
List of YARN Apps
•  MapReduce (of course)
•  Apache Tez
–  Apache Hive
–  Apache Pig

•  Apache Hama - Iterative, Bulk Synchronous Parallel (BSP) engine
•  Apache Giraph - Iterative, BSP-based Graph Analysis engine
•  HBase on YARN (HOYA)
•  Apache Storm – Real-time stream processing
•  Apache Spark – Advanced DAG execution engine that supports cyclic data
flow and in-memory computing
•  Apache S4 – Real-time processing
•  Open MPI – Open source Message Passing Interface for HPC
http://wiki.apache.org/hadoop/PoweredByYarn
The YARN Book
•  “Coming Soon”
•  Expected by 2nd Quarter 2014
•  Complete coverage of YARN
Modern Data Architecture
•  Effective use of data – especially BIG Data – is enhanced when data is
co-located, enabling discovery and mining of unanticipated patterns.
•  A “Data Lake” is the growing body of all data
–  Encompassing more than a single warehouse
–  Data can continuously stream in to and out of the lake
Multi-Tenancy Requirements
Multi-Tenancy in one shared cluster
•  Multiple Business Units
•  Multiple Applications

Requirements
•  Shared Processing Capacity
•  Shared Storage Capacity
•  Data Access Security

Multi-Tenancy: Capabilities
• Group and User:
– Use of Linux and HDFS permissions to separate files and directories to create tenant boundaries – can be integrated with LDAP (or AD)
• Security
– Used to enforce tenant boundaries – can be integrated with Kerberos
• Capacity:
– Storage quota setup to manage consumption
– Capacity resource scheduler queues to balance shared processing resources between tenants – Use ACLs to define tenants
The Capacity Scheduler
FUNCTION: Capacity Sharing
•  Queues with priorities
•  ACLs for job submit permissions
FUNCTION: Capacity Enforcement
•  Max capacity per queue
•  User limits within queue
FUNCTION: Administration
•  Monitoring + Management Admin ACLs
•  Capacity-Scheduler.xml
Roadmap: Capacity Scheduling
Feature: Description
CS Pre-emption
•  Enhance SLA support
•  Re-claim capacity from tasks in queue that have been over-scheduled
Queue Hierarchy
•  Granular configuration of queues
•  Provide constraints across a set of queues
Node Labels
•  Schedule tasks on specific cluster nodes
•  Account for optimized hardware
Container Isolation
•  Stronger isolation of resources for each container, incorporating CPU
CPU Scheduling
•  Schedule and share CPU core capacity across tasks
Capacity Scheduler by example
Total Cluster capacity
• 20 slots
• 11 Mappers
• 9 Reducers
Queue : Production
• Guarantee 70% resources
• 14 slots – 8M / 6R
• Max 100%
Queue : Dev
• Guarantee 10% resources
• 2 slots – 1M / 1R
• Max 50%
Queue : Default
• Guarantee 20% resources
• 4 slots – 2M / 2R
• Max 80%
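The slot counts on this slide follow directly from the percentages. The short Python sketch below reproduces them. The helper names are made up for the example, and rounding to the nearest whole slot is an assumption that happens to match the slide's numbers.

```python
TOTAL_SLOTS, MAPPERS, REDUCERS = 20, 11, 9

# queue name -> (guaranteed %, maximum %), taken from the slide
QUEUES = {"Production": (70, 100), "Dev": (10, 50), "Default": (20, 80)}


def share(total, percent):
    """Whole number of slots for a percentage of a total (rounded to nearest)."""
    return round(total * percent / 100)


def queue_limits(name):
    guarantee, maximum = QUEUES[name]
    return {
        "guaranteed_slots": share(TOTAL_SLOTS, guarantee),
        "guaranteed_map": share(MAPPERS, guarantee),
        "guaranteed_reduce": share(REDUCERS, guarantee),
        "max_slots": share(TOTAL_SLOTS, maximum),
    }


for name in QUEUES:
    print(name, queue_limits(name))
```

The guarantee is what a queue can always claim. The maximum is how far it may grow into capacity that other queues are not using: up to 10 slots for Dev and 16 for Default in this example.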
  
Hierarchical queues
root
  Dev 10%
    Eng 20%
    Test 80%
  Default 20%
  Production 70%
    DevOps 10%
    Reserved 20%
    Prod 70%
      P0 70%
      P1 30%
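In a queue hierarchy, each percentage is relative to the parent queue, so a queue's share of the whole cluster is the product of the percentages along its path. The Python sketch below works this out for the queues on this slide. The nesting used here (Eng and Test under Dev; DevOps, Reserved, and Prod under Production; P0 and P1 under Prod) is an assumption inferred from the fact that each group sums to 100%.

```python
# parent queue -> {child queue: percent of parent}; nesting is assumed, see above
TREE = {
    "root": {"Dev": 10, "Default": 20, "Production": 70},
    "Dev": {"Eng": 20, "Test": 80},
    "Production": {"DevOps": 10, "Reserved": 20, "Prod": 70},
    "Prod": {"P0": 70, "P1": 30},
}


def absolute_capacities(tree, parent="root", parent_pct=100.0, path="root"):
    """Return {queue path: percent of the whole cluster} for every queue below `parent`."""
    result = {}
    children = tree.get(parent, {})
    if children and sum(children.values()) != 100:
        raise ValueError(f"children of {path} must sum to 100")
    for child, pct in children.items():
        child_path = f"{path}.{child}"
        child_abs = parent_pct * pct / 100
        result[child_path] = child_abs
        # Recurse so grandchildren are scaled by this child's absolute share.
        result.update(absolute_capacities(tree, child, child_abs, child_path))
    return result


capacities = absolute_capacities(TREE)
for queue_path, pct in capacities.items():
    print(f"{queue_path}: {pct:.1f}% of cluster")
```

For example, P0 is 70% of Prod, which is 70% of Production, which is 70% of the cluster, so P0 is guaranteed about 34.3% of the cluster overall.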
  
CS: Example Queue Configuration
•  Default: 10 users | Ad-hoc BI Query jobs etc. | General User SLAs
•  Dev: 4 users | Ad-hoc Data Science Only (Pig+Mahout) | Lower SLAs
•  Applications: 2 users | Batch ETL and Report Generation jobs | Production SLAs
yarn.scheduler.capacity.root.default
–  Capacity: Min: 0.10 | Max: 0.20 | User Limit: 0.8
–  ACLs: ‘Users’ group
yarn.scheduler.capacity.root.dev
–  Capacity: Min: 0.10 | Max: 0.10 | User Limit: 0.5
–  ACLs: ‘Engineering’ group
yarn.scheduler.capacity.root.production
–  Capacity: Min: 0.20 | Max: 0.70 | User Limit: 1.0
–  ACLs: ‘Applications’ group
CS: Configuration
•  yarn.scheduler.capacity.root.default.acl_administer_jobs=*
•  yarn.scheduler.capacity.root.default.acl_submit_jobs=*
•  yarn.scheduler.capacity.root.default.capacity=100
•  yarn.scheduler.capacity.root.default.maximum-capacity=100
•  yarn.scheduler.capacity.root.default.user-limit-factor=1

• http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
CS: Configuration
•  yarn.scheduler.capacity.root.default.acl_administer_jobs=Admin
•  yarn.scheduler.capacity.root.default.acl_submit_jobs=Users
•  yarn.scheduler.capacity.root.default.capacity=10
•  yarn.scheduler.capacity.root.default.maximum-capacity=20
•  yarn.scheduler.capacity.root.default.user-limit-factor=0.8
•  yarn.scheduler.capacity.root.dev.acl_administer_jobs=Engineering
•  yarn.scheduler.capacity.root.dev.acl_submit_jobs=Engineering
•  yarn.scheduler.capacity.root.dev.capacity=10
•  yarn.scheduler.capacity.root.dev.maximum-capacity=10
•  yarn.scheduler.capacity.root.dev.user-limit-factor=0.5
•  yarn.scheduler.capacity.root.production.acl_administer_jobs=Applications
•  yarn.scheduler.capacity.root.production.acl_submit_jobs=Admin
•  yarn.scheduler.capacity.root.production.capacity=20
•  yarn.scheduler.capacity.root.production.maximum-capacity=70
•  yarn.scheduler.capacity.root.production.user-limit-factor=1.0
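Settings like these are easy to sanity-check before they are loaded. The Python sketch below parses the capacity lines from this slide and applies two checks: the sibling queues' capacities should sum to 100 (a rule stated in the Capacity Scheduler documentation), and a queue's maximum-capacity should not be below its capacity. The function names are hypothetical, and the properties are treated as plain key=value text rather than the XML file the scheduler actually reads.

```python
PREFIX = "yarn.scheduler.capacity."

# Capacity lines copied from the slide above.
SLIDE_CONFIG = """\
yarn.scheduler.capacity.root.default.capacity=10
yarn.scheduler.capacity.root.default.maximum-capacity=20
yarn.scheduler.capacity.root.dev.capacity=10
yarn.scheduler.capacity.root.dev.maximum-capacity=10
yarn.scheduler.capacity.root.production.capacity=20
yarn.scheduler.capacity.root.production.maximum-capacity=70
"""


def parse_properties(text):
    """Turn key=value lines into a dict, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


def queue_settings(props):
    """Group capacity and maximum-capacity by queue path, e.g. 'root.dev'."""
    queues = {}
    for key, value in props.items():
        if not key.startswith(PREFIX):
            continue
        path, _, setting = key[len(PREFIX):].rpartition(".")
        if setting in ("capacity", "maximum-capacity"):
            queues.setdefault(path, {})[setting] = float(value)
    return queues


def check(queues, parent="root"):
    """Return a list of human-readable problems for the children of `parent`."""
    problems = []
    children = {q: s for q, s in queues.items() if q.rpartition(".")[0] == parent}
    total = sum(s.get("capacity", 0.0) for s in children.values())
    if children and total != 100:
        problems.append(f"children of {parent} sum to {total:g}, expected 100")
    for q, s in children.items():
        if s.get("maximum-capacity", 100.0) < s.get("capacity", 0.0):
            problems.append(f"{q}: maximum-capacity is below capacity")
    return problems


queues = queue_settings(parse_properties(SLIDE_CONFIG))
problems = check(queues)
print(problems or "configuration looks consistent")
```

Run against the values as listed, the check reports that the three queues sum to 40 rather than 100, so a deployable configuration would need the capacities scaled up (or further queues added) while keeping the same relative intent.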
Capacity Scheduler by example
•  Job 1 : Launch in production queue
–  Require 100 slots
–  Get 14 slots at a time
[Chart: cluster resources divided among Production, Development, Default, and Idle.]
Capacity Scheduler by example
•  Job 1 : Running in Production queue
–  Using 14 slots
•  Job 2 : Schedule in Development queue
–  Require 50 slots
–  Get 4 slots at a time
[Chart: cluster resources divided among Production, Development, Default, and Idle.]
Capacity Scheduler by example
•  Job 1 : Running in Production queue
–  98 complete, only 2 slots in use until finish
•  Job 2 : Schedule in Development queue
–  Require 50 slots
–  Still only getting 4 slots at a time
[Chart: cluster resources divided among Production, Development, Default, and Idle.]
Summary
•  YARN is the logical extension of Apache Hadoop
–  Complements HDFS, the data reservoir
•  Resource Management for the Enterprise Data Lake
–  Shared, secure, multi-tenant Hadoop

Allows for all processing in Hadoop

–  BATCH (MapReduce)
–  INTERACTIVE (Tez)
–  ONLINE (HBase)
–  STREAMING (Storm, S4,…)
–  GRAPH (Giraph)
–  IN-MEMORY (Spark)
–  HPC MPI (OpenMPI)
–  OTHER (Search) (Weave…)
YARN (Cluster Resource Management)
HDFS2 (Redundant, Reliable Storage)
Your Fastest On-ramp to Enterprise Hadoop™!

http://hortonworks.com/products/hortonworks-sandbox/
The Sandbox lets you experience Apache Hadoop from the convenience of your own laptop – no data center, no cloud and no internet connection needed!
The Hortonworks Sandbox is:
•  A free download: http://hortonworks.com/products/hortonworks-sandbox/
•  A complete, self contained virtual machine with Apache Hadoop pre-configured
•  A personal, portable and standalone Hadoop environment
•  A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop
Questions?

 

Similar to Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn

Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
Brian Enochson
 
Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduce
Uwe Printz
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
Adam Muise
 
BIG DATA: Apache Hadoop
BIG DATA: Apache HadoopBIG DATA: Apache Hadoop
BIG DATA: Apache Hadoop
Oleksiy Krotov
 
Unit IV.pdf
Unit IV.pdfUnit IV.pdf
Unit IV.pdf
KennyPratheepKumar
 
Hadoop 2 - Beyond MapReduce
Hadoop 2 - Beyond MapReduceHadoop 2 - Beyond MapReduce
Hadoop 2 - Beyond MapReduce
Uwe Printz
 
MOD-2 presentation on engineering students
MOD-2 presentation on engineering studentsMOD-2 presentation on engineering students
MOD-2 presentation on engineering students
rishavkumar1402
 
MHUG - YARN
MHUG - YARNMHUG - YARN
MHUG - YARN
Joseph Niemiec
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
York University
 
Anju
AnjuAnju
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
Xuan-Chao Huang
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014
Rajan Kanitkar
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
rebeccatho
 
Analytics 3
Analytics 3Analytics 3
Analytics 3
Srikanth Ayithy
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
Venneladonthireddy1
 
Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0
Manaranjan Pradhan
 
Scaling Storage and Computation with Hadoop
Scaling Storage and Computation with HadoopScaling Storage and Computation with Hadoop
Scaling Storage and Computation with Hadoop
yaevents
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methods
paperpublications3
 
Basic Hadoop Architecture V1 vs V2
Basic  Hadoop Architecture  V1 vs V2Basic  Hadoop Architecture  V1 vs V2
Basic Hadoop Architecture V1 vs V2
VIVEKVANAVAN
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
veeracynixit
 

Similar to Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn (20)

Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
 
Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduce
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
 
BIG DATA: Apache Hadoop
BIG DATA: Apache HadoopBIG DATA: Apache Hadoop
BIG DATA: Apache Hadoop
 
Unit IV.pdf
Unit IV.pdfUnit IV.pdf
Unit IV.pdf
 
Hadoop 2 - Beyond MapReduce
Hadoop 2 - Beyond MapReduceHadoop 2 - Beyond MapReduce
Hadoop 2 - Beyond MapReduce
 
MOD-2 presentation on engineering students
MOD-2 presentation on engineering studentsMOD-2 presentation on engineering students
MOD-2 presentation on engineering students
 
MHUG - YARN
MHUG - YARNMHUG - YARN
MHUG - YARN
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Anju
AnjuAnju
Anju
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Analytics 3
Analytics 3Analytics 3
Analytics 3
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0
 
Scaling Storage and Computation with Hadoop
Scaling Storage and Computation with HadoopScaling Storage and Computation with Hadoop
Scaling Storage and Computation with Hadoop
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methods
 
Basic Hadoop Architecture V1 vs V2
Basic  Hadoop Architecture  V1 vs V2Basic  Hadoop Architecture  V1 vs V2
Basic Hadoop Architecture V1 vs V2
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 

Recently uploaded

Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
Wouter Lemaire
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
David Brossard
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
SitimaJohn
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Webinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data WarehouseWebinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data Warehouse
Federico Razzoli
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 

Recently uploaded (20)

Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Webinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data WarehouseWebinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data Warehouse
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 

Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn

  • 1. Hadoop 2: Efficient multi-tenant workloads that enable the Modern Data Architecture SCALE 12X, Los Angeles February 23, 2014 David Kaiser @ddkaiser linkedin.com/in/dkaiser facebook.com/dkaiser dkaiser@cdk.com dkaiser@hortonworks.com
  • 2. Who Am I? 20+ years experience with Linux 3 years experience with Hadoop Career experiences: • Data Warehousing • Geospatial Analytics • Open-source Solutions and Architecture Employed at Hortonworks as a Senior Solutions Engineer David Kaiser @ddkaiser linkedin.com/in/dkaiser facebook.com/dkaiser dkaiser@cdk.com dkaiser@hortonworks.com
  • 3. Hadoop 2: Efficient multi-tenant workloads that enable the Modern Data Architecture • Abstract: – Hadoop is about so much more than batch processing. With the recent release of Hadoop 2, there have been significant changes to how a Hadoop cluster uses resources. – YARN, the new resource management component, allows for a more efficient mix of workloads across hardware resources, and enables new applications and new processing paradigms such as stream-processing. – This talk will discuss the new design and components of Hadoop 2, and provide examples of Modern Data Architectures that leverage Hadoop 2.
  • 4. What is This Thing? http://hadoop.apache.org/
  • 5. Misconceptions • Bucket brigade for large or slow data processing tasks
  • 6. Misconceptions • Bucket brigade for large or slow data processing tasks • Batch processor – Another mainframe
  • 7. Misconceptions • Bucket brigade for large or slow data processing tasks • Batch processor – Another mainframe • Dumb/inflexible, trendy, too simple
  • 8. Misconceptions • Incorrect assumption that Java == SLOW
  • 9. Misconceptions • Incorrect assumption that Java == SLOW
  • 10. Misconceptions • Incorrect assumption that Java == SLOW
  • 11. Misconceptions • Incorrect assumption that Java == EVIL
  • 12. Misconceptions • Incorrect assumption that Java == EVIL
  • 13.
  • 14. Hadoop + Linux Provides a 100% Open-Source framework for efficient scalable data processing on commodity hardware Hadoop – The Open-source Data Operating System Linux – The Open-source Operating System Commodity Hardware
  • 15. Hadoop Fundamentals • Hadoop is a single system, across multiple Linux systems • Two basic capabilities of Hadoop – Reliable, Redundant and Distributed Storage – Distributed Computation • Storage: Hadoop Distributed File System (HDFS) – Replicated, distributed filesystem – Blocks written to underlying filesystem on multiple nodes • Computation – Resource management – Frameworks to divide workloads across collection of resources – Hadoop V1: MapReduce framework only – Hadoop V2: MapReduce, Tez, Spark, others…
  • 16. HDFS: File create lifecycle [Diagram: HDFS client, NameNode, and file blocks B1 and B2 replicated across RACK1, RACK2 and RACK3; numbered steps 1 to 4 with the labels Create, ack and Complete]
  • 17. Hadoop 1 Computation • MapReduce Framework – Combined both Resource Management and Application Logic in the same code • Limitations – Resource allocation units (slots) fixed per cluster – Difficult to use a cluster for differing or simultaneous workloads
  • 18. The 1st Generation of Hadoop: Batch HADOOP 1.0 Built for Web-Scale Batch Apps [Diagram: separate single-app silos labeled BATCH, INTERACTIVE and ONLINE, each on its own HDFS] • All other usage patterns must leverage that same infrastructure • Forces the creation of silos for managing mixed workloads
  • 19. Hadoop MapReduce Classic • JobTracker – Manages cluster resources and job scheduling • TaskTracker – Per-node agent – Manage tasks
  • 20. MapReduce Classic: Limitations • Scalability – Maximum Cluster size – 4,000 nodes – Maximum concurrent tasks – 40,000 – Coarse synchronization in JobTracker • Availability – Failure kills all queued and running jobs • Hard partition of resources into map and reduce slots – Low resource utilization • Lacks support for alternate paradigms and services – Iterative applications implemented using MapReduce are 10x slower
  • 21. Hadoop 1: Poor Utilization of Cluster Resources Hadoop 1 JobTracker and TaskTracker used fixed-sized “slots” for resource allocation Map tasks are waiting for the slots which are NOT currently used by reduce tasks Hard-Coded values. Task tracker must be restarted after a change
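A rough illustration of the slot problem described on slides 20 and 21, written as a Python sketch. The slot and task counts are made-up numbers, the function names are hypothetical, and the shared pool treats every task as one equal unit, which simplifies how YARN actually sizes containers.

```python
def runnable_with_fixed_slots(pending_maps, pending_reduces, map_slots, reduce_slots):
    """Hadoop 1 style: a task can only occupy a slot of its own kind."""
    return min(pending_maps, map_slots) + min(pending_reduces, reduce_slots)


def runnable_with_shared_pool(pending_maps, pending_reduces, total_units):
    """Simplified Hadoop 2 style: any task may take any free unit of capacity."""
    return min(pending_maps + pending_reduces, total_units)


# Hypothetical node with 6 map slots and 4 reduce slots, during a map-heavy
# phase: 10 map tasks are waiting and no reduce tasks exist yet.
fixed = runnable_with_fixed_slots(10, 0, map_slots=6, reduce_slots=4)
shared = runnable_with_shared_pool(10, 0, total_units=10)

print("fixed slots run", fixed, "tasks; shared pool runs", shared, "tasks")
```

With these assumed numbers, four reduce slots sit idle under the fixed split while map tasks wait, which is the situation the slide calls poor utilization.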
  • 22. Hadoop 2: Moving Past MapReduce HADOOP 1.0, Single Use System (Batch Apps): MapReduce (cluster resource management & data processing) on HDFS (redundant, reliable storage) HADOOP 2.0, Multi Purpose Platform (Batch, Interactive, Online, Streaming, …): MapReduce and Others (data processing) on YARN (cluster resource management) on HDFS2 (redundant, highly-available & reliable storage)
  • 23. Apache Tez as the new Primitive HADOOP 1.0, MapReduce as Base: Pig (data flow), Hive (sql) and Others (cascading) on MapReduce (cluster resource management & data processing) on HDFS (redundant, reliable storage) HADOOP 2.0, Apache Tez as Base: Batch (MapReduce); Data Flow (Pig), SQL (Hive) and Others (cascading) on Tez (execution engine); Real Time Stream Processing (Storm); Online Data Processing (HBase, Accumulo); ?? (HOYA) (continuous execution); YARN (cluster resource management); HDFS2 (redundant, reliable storage)
  • 24. Tez – Execution Performance • Performance gains over Map Reduce – Eliminate replicated write barrier between successive computations. – Eliminate job launch overhead of workflow jobs. – Eliminate extra stage of map reads in every workflow job. – Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes. [Diagram: Pig/Hive on MR compared with Pig/Hive on Tez]
  • 25. YARN: Taking Hadoop Beyond Batch Store ALL DATA in one place… Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service Applications Run Natively in Hadoop: BATCH (MapReduce), INTERACTIVE (Tez), ONLINE (HBase), STREAMING (Storm, S4,…), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), OTHER (Search) (Weave…) YARN (Cluster Resource Management) HDFS2 (Redundant, Reliable Storage)
  • 26. YARN Overview • Goals: – Reduce the responsibilities of the JobTracker – Separate the resource management duties away from the job coordination duties – Allow multiple simultaneous jobs – Enables different style and sized workloads in one cluster • Design: – A separate Resource Manager – 1 Global Resource Scheduler for the entire cluster – Each worker (slave) node runs a Node Manager, manages life-cycle of containers – JobTracker is now called Application Master – Each Application has 1 Application Master – Manages application scheduling and task execution
  • 27. YARN Architecture [Diagram: Client 1 and Client 2 submit to the ResourceManager (Scheduler); NodeManagers host AM 1 with Containers 1.1, 1.2 and 1.3, and AM2 with Containers 2.1, 2.2, 2.3 and 2.4]
  • 28. Capacity Sharing: Concepts • Application – Application is a temporal job or a service submitted to YARN – Examples – Map Reduce Job (job) – Storm topology (service) • Container – Basic unit of allocation – Fine-grained resource allocation across multiple resource types (memory, cpu, disk, network, etc.) – container_0 = 2GB – container_1 = 1GB – Replaces fixed map/reduce slots (from Hadoop 1.x)
  • 29. YARN – Resource Allocation & Usage • ResourceRequest – Fine-grained resource ask to the ResourceManager – Ask for a specific amount of resources (memory, cpu etc.) on a specific machine or rack – Use special value of * for resource name for any machine ResourceRequest fields: priority, resourceName, capability, numContainers Example rows (priority | capability | resourceName | numContainers): 0 | <2gb, 1 core> | host01 | 1; 0 | <2gb, 1 core> | rack0 | 1; 0 | <2gb, 1 core> | * | 1; 1 | <4gb, 1 core> | * | 1
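The ResourceRequest fields on slide 29 can be pictured with a small Python model. This is only an illustrative sketch and not the YARN client API: the class, the `matches` and `candidates` helpers, and the grouping of the slide's flattened table values into four rows are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRequest:
    """Illustrative stand-in for the fields named on the slide."""
    priority: int
    resource_name: str  # a host name, a rack name, or "*" for any machine
    memory_gb: int
    cores: int
    num_containers: int


# Rows as read from the slide's table (the row grouping is an assumption).
requests = [
    ResourceRequest(0, "host01", 2, 1, 1),
    ResourceRequest(0, "rack0", 2, 1, 1),
    ResourceRequest(0, "*", 2, 1, 1),
    ResourceRequest(1, "*", 4, 1, 1),
]


def matches(request, host, rack):
    """True if a node with this host and rack satisfies the request's location."""
    return request.resource_name in (host, rack, "*")


def candidates(all_requests, host, rack):
    """Requests a given node could serve, lowest priority number first."""
    usable = [r for r in all_requests if matches(r, host, rack)]
    return sorted(usable, key=lambda r: r.priority)


# A node named host02 in rack0 cannot serve the host01 entry, but it can
# serve the rack-level and "any machine" entries.
for request in candidates(requests, host="host02", rack="rack0"):
    print(request)
```

The point of the sketch is the location matching: a request names a host, a rack, or `*`, and a node is a candidate when any of those fit.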
  • 30. CGroup • Linux Kernel capability to limit, account and isolate resources – CPU: Controlling the prioritization of processes in the group. Think of it as a more advanced nice level – Memory: Allow for setting limits on RAM and swap usage – Disk I/O – Network • YARN currently supports CPU and Memory
  • 31. List of YARN Apps • MapReduce (of course) • Apache Tez – Apache Hive – Apache Pig • Apache Hama - Iterative, Bulk Synchronous Parallel (BSP) engine • Apache Giraph - Iterative, BSP-based Graph Analysis engine • HBase on YARN (HOYA) • Apache Storm – Real-time stream processing • Apache Spark – Advanced DAG execution engine that supports cyclic data flow and in-memory computing • Apache S4 – Real-time processing • Open MPI – Open source Message Passing Interface for HPC http://wiki.apache.org/hadoop/PoweredByYarn
  • 32. The YARN Book • “Coming Soon” • Expected by 2nd Quarter 2014 • Complete coverage of YARN
  • 33. Modern Data Architecture • Effective use of data – especially BIG Data – is enhanced when data is co-located, enabling discovery and mining of unanticipated patterns. • A “Data Lake” is the growing body of all data – Encompassing more than a single warehouse – Data can continuously stream in to and out of the lake
  • 34. Multi-Tenancy Requirements Multi-Tenancy in one shared cluster • Multiple Business Units • Multiple Applications Requirements • Shared Processing Capacity • Shared Storage Capacity • Data Access Security
  • 35. Multi-Tenancy: Capabilities • Group and User: – Use of Linux and HDFS permissions to separate files and directories to create tenant boundaries – can be integrated with LDAP (or AD) • Security – Used to enforce tenant boundaries – can be integrated with Kerberos • Capacity: – Storage quota setup to manage consumption – Capacity resource scheduler queues to balance shared processing resources between tenants – Use ACLs to define tenants
  • 36. The Capacity Scheduler FUNCTION Capacity Sharing: • Queues with priorities • ACLs for job submit permissions FUNCTION Capacity Enforcement: • Max capacity per queue • User limits within queue FUNCTION Administration: • Monitoring + Management Admin ACLs • Capacity-Scheduler.xml
  • 37. Roadmap: Capacity Scheduling (Feature: Description) CS Pre-emption: • Enhance SLA support • Re-claim capacity from tasks in queue that have been over-scheduled Queue Hierarchy: • Granular configuration of queues • Provide constraints across a set of queues Node Labels: • Schedule tasks on specific cluster nodes • Account for optimized hardware Container Isolation: • Stronger isolation of resources for each container, incorporating CPU CPU Scheduling: • Schedule and share CPU core capacity across tasks
  • 38. Capacity Scheduler by example Total Cluster capacity • 20 slots • 11 Mappers • 9 Reducers Queue: Production • Guarantee 70% resources • 14 slots – 8M / 6R • Max 100% Queue: Dev • Guarantee 10% resources • 2 slots – 1M / 1R • Max 50% Queue: Default • Guarantee 20% resources • 4 slots – 2M / 2R • Max 80%
  • 39. Hierarchical queues [Diagram: queue tree under root with the labels Dev 10%, Eng 20%, Default 20%, Test 80%, Production 70%, DevOps 10%, Reserved 20%, Prod 70%, P0 70%, P1 30%]
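The guaranteed slot counts on slide 38 follow directly from the queue percentages. A short Python check of that arithmetic, using the slide's 20-slot cluster; the dictionary layout and the `slots` helper are illustrative names, not scheduler code.

```python
TOTAL_SLOTS = 20

# queue name -> (guaranteed percent, maximum percent), as listed on the slide
queues = {
    "production": (70, 100),
    "dev": (10, 50),
    "default": (20, 80),
}


def slots(percent, total=TOTAL_SLOTS):
    """Whole slots corresponding to a percentage of the cluster."""
    return total * percent // 100


for name, (guaranteed, maximum) in queues.items():
    print(f"{name}: guaranteed {slots(guaranteed)} slots, "
          f"may grow to {slots(maximum)} slots")
```

The guaranteed figures come out as 14, 2 and 4 slots, matching the slide, and together they account for all 20 slots.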
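With hierarchical queues, a child queue's percentage applies to its parent's share, so its share of the whole cluster is the product of the percentages along its path. The Python sketch below assumes one plausible reading of the slide 39 diagram (the parent and child relationships are inferred from the percentages adding up to 100 at each level, not stated in the transcript), and `absolute_capacities` is a hypothetical helper.

```python
# Assumed tree: each parent maps to its children and their percent of the parent.
tree = {
    "root": {"Dev": 10, "Default": 20, "Production": 70},
    "Dev": {"Eng": 20, "Test": 80},
    "Production": {"DevOps": 10, "Reserved": 20, "Prod": 70},
    "Prod": {"P0": 70, "P1": 30},
}


def absolute_capacities(queue_tree, node="root", share=100.0, out=None):
    """Percent of the whole cluster for every queue below `node`."""
    if out is None:
        out = {}
    for child, percent in queue_tree.get(node, {}).items():
        child_share = share * percent / 100
        out[child] = child_share
        absolute_capacities(queue_tree, child, child_share, out)
    return out


for queue, percent in absolute_capacities(tree).items():
    print(f"{queue}: {percent:.1f}% of the cluster")
```

Under this assumed tree, P0 is 70% of Prod, which is 70% of Production, which is 70% of the cluster, so P0 ends up with roughly a third of the cluster.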
  • 40. CS: Example Queue Configuration • Default: 10 users | Ad-hoc BI Query jobs etc. | General User SLAs • Dev: 4 users | Ad-hoc Data Science Only (Pig+Mahout) | Lower SLAs • Applications: 2 users | Batch ETL and Report Generation jobs | Production SLAs yarn.scheduler.capacity.root.default: Capacity Min: 0.10 | Max: 0.20 | User Limit: 0.8; ACLs ‘Users’ group yarn.scheduler.capacity.root.dev: Capacity Min: 0.10 | Max: 0.10 | User Limit: 0.5; ACLs ‘Engineering’ group yarn.scheduler.capacity.root.production: Capacity Min: 0.20 | Max: 0.70 | User Limit: 1.0; ACLs ‘Applications’ group
  • 41. CS: Configuration • yarn.scheduler.capacity.root.default.acl_administer_jobs=* • yarn.scheduler.capacity.root.default.acl_submit_jobs=* • yarn.scheduler.capacity.root.default.capacity=100 • yarn.scheduler.capacity.root.default.maximum-capacity=100 • yarn.scheduler.capacity.root.default.user-limit-factor=1 • http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
  • 42. CS: Configuration • yarn.scheduler.capacity.root.default.acl_administer_jobs=Admin • yarn.scheduler.capacity.root.default.acl_submit_jobs=Users • yarn.scheduler.capacity.root.default.capacity=10 • yarn.scheduler.capacity.root.default.maximum-capacity=20 • yarn.scheduler.capacity.root.default.user-limit-factor=0.8 • yarn.scheduler.capacity.root.dev.acl_administer_jobs=Engineering • yarn.scheduler.capacity.root.dev.acl_submit_jobs=Engineering • yarn.scheduler.capacity.root.dev.capacity=10 • yarn.scheduler.capacity.root.dev.maximum-capacity=10 • yarn.scheduler.capacity.root.dev.user-limit-factor=0.5 • yarn.scheduler.capacity.root.production.acl_administer_jobs=Applications • yarn.scheduler.capacity.root.production.acl_submit_jobs=Admin • yarn.scheduler.capacity.root.production.capacity=20 • yarn.scheduler.capacity.root.production.maximum-capacity=70 • yarn.scheduler.capacity.root.production.user-limit-factor=1.0
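A quick way to sanity-check a set of queue properties like those on slide 42 is to total the `capacity` values under each parent queue. The Python sketch below does that for the slide's values; the parsing helper is hypothetical and works on a plain key=value list rather than the XML file the scheduler actually reads. The Capacity Scheduler documentation linked on slide 41 describes the capacities at each level as needing to total 100, so the slide's figures, which total 40 under root, are presumably illustrative rather than a complete configuration.

```python
from collections import defaultdict

PREFIX = "yarn.scheduler.capacity."
SUFFIX = ".capacity"

# Capacity-related properties copied from slide 42.
properties = """
yarn.scheduler.capacity.root.default.capacity=10
yarn.scheduler.capacity.root.default.maximum-capacity=20
yarn.scheduler.capacity.root.dev.capacity=10
yarn.scheduler.capacity.root.dev.maximum-capacity=10
yarn.scheduler.capacity.root.production.capacity=20
yarn.scheduler.capacity.root.production.maximum-capacity=70
"""


def capacity_by_parent(text):
    """Sum the guaranteed capacity of sibling queues under each parent queue."""
    totals = defaultdict(float)
    for line in text.strip().splitlines():
        key, value = line.split("=", 1)
        # Skip maximum-capacity and any other setting; keep only "<queue>.capacity".
        if not key.endswith(SUFFIX):
            continue
        queue_path = key[len(PREFIX):-len(SUFFIX)]   # e.g. "root.default"
        parent = queue_path.rsplit(".", 1)[0]        # e.g. "root"
        totals[parent] += float(value)
    return dict(totals)


for parent, total in capacity_by_parent(properties).items():
    status = "ok" if total == 100 else "does not total 100"
    print(f"{parent}: children sum to {total:g} ({status})")
```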
  • 43. Capacity Scheduler by example • Job 1: Launch in Production queue – Requires 100 slots – Gets 14 slots at a time [Diagram: cluster resources divided among Production, Development, Default, and Idle]
  • 44. Capacity Scheduler by example • Job 1: Running in Production queue – Using 14 slots • Job 2: Schedule in Development queue – Requires 50 slots – Gets 4 slots at a time [Diagram: cluster resources divided among Production, Development, Default, and Idle]
  • 45. Capacity Scheduler by example • Job 1: Running in Production queue – 98 complete, only 2 slots in use until finish • Job 2: Schedule in Development queue – Requires 50 slots – Still only getting 4 slots at a time [Diagram: cluster resources divided among Production, Development, Default, and Idle]
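The slot arithmetic in the three example slides above can be reproduced with a short sketch. This is an illustration only, not the scheduler's actual algorithm: the function name `waves` is hypothetical, and the sketch assumes every task in a wave finishes before the next wave starts, which a real cluster does not guarantee.

```python
def waves(required_slots, slots_at_a_time):
    """Return how many slots are in use during each successive wave of a job."""
    if required_slots < 0 or slots_at_a_time <= 0:
        raise ValueError("required_slots must be >= 0 and slots_at_a_time > 0")
    sizes = []
    remaining = required_slots
    while remaining > 0:
        used = min(slots_at_a_time, remaining)  # the queue limit caps each wave
        sizes.append(used)
        remaining -= used
    return sizes


if __name__ == "__main__":
    job1 = waves(100, 14)  # Job 1 in the Production queue
    job2 = waves(50, 4)    # Job 2 in the Development queue
    print(len(job1), sum(job1[:-1]), job1[-1])  # 8 98 2
    print(len(job2), job2[-1])                  # 13 2
```

Job 1 completes 98 of its 100 slots in seven full waves and then holds only 2 slots in its final wave, which matches the "98 complete, only 2 slots in use" state on the last slide. Job 2 needs 13 waves at 4 slots each, and it stays at 4 slots even while most of the Production share sits idle.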
  • 46. Summary • YARN is the logical extension of Apache Hadoop – Complements HDFS, the data reservoir • Resource Management for the Enterprise Data Lake – Shared, secure, multi-tenant Hadoop • Allows for all processing in Hadoop: BATCH (MapReduce), INTERACTIVE (Tez), ONLINE (HBase), STREAMING (Storm, S4, …), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), OTHER (Search, Weave, …) • All of these run on YARN (Cluster Resource Management), on top of HDFS2 (Redundant, Reliable Storage)
  • 47. Your Fastest On-ramp to Enterprise Hadoop™! http://hortonworks.com/products/hortonworks-sandbox/ The Sandbox lets you experience Apache Hadoop from the convenience of your own laptop – no data center, no cloud and no internet connection needed! The Hortonworks Sandbox is: • A free download: http://hortonworks.com/products/hortonworks-sandbox/ • A complete, self-contained virtual machine with Apache Hadoop pre-configured • A personal, portable and standalone Hadoop environment • A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop
  • 48. Questions? David Kaiser @ddkaiser linkedin.com/in/dkaiser facebook.com/dkaiser dkaiser@cdk.com dkaiser@hortonworks.com