Running Hadoop as Service in AltiScale Platform

Experiences in running Hadoop As A Service
chaiken@altiscale.com = #HadoopSherpa
DAVID CHAIKEN • 21 NOVEMBER 2014

Talk Outline
Altiscale Company Introduction and Perspective
Altiscale Architecture
Use Cases: Performance, Job Analysis, Scheduling
Infinite Hadoop
Challenges to the Hadoop Community
Copyright © 2014 Altiscale, Inc.

Corporate Background
Hadoop-as-a-Service (HaaS) innovator
Company founded in 2012 (Palo Alto & Chennai)
Founding team from Yahoo
• Raymie Stata, CEO, Former CTO
• David Chaiken, CTO, Former Chief Architect
• Charles Wimmer, Head of Operations, Former SRE
Employees from Yahoo, Google, Netflix, LinkedIn, VMware and others
Top-tier investors

Altiscale Chennai
Long-term colleagues from Yahoo and before
IIT Madras Research Park (back gate of IIT-M)
Architecture, Core Development, Test (Apache Bigtop)
Control Plane agile development, 2-week sprints
Next: Test++, Customer Support, Operations

Everybody Loves Hadoop But…
Significant capital expenditure on infrastructure
Complex to manage and maintain
Time to get cluster up and running is long
Capacity planning is difficult
Skillset is difficult to recruit, train and retain
What about the cloud?

True Hadoop-as-a-Service
Altiscale is the industry’s first purpose-built, petabyte-scale Hadoop cloud
• Altiscale operates Hadoop for you
• Infrastructure optimized to run Hadoop fast and reliably
• Pay for Hadoop service, not infrastructure

We Team With You To Help Deliver Insights
Customer: Potential insights from a flood of data generated by the connected world
+
Altiscale: Our Operations Team and Hadoop Cloud help realize those insights

Customers

How We Do It
Virtual Hadoop Cluster
• Pre-configured Apps
• YARN Service
• HDFS Service
• More Apps
Data Connect: File Transfer, Kafka, Flume
Apps: Hive, Pig, Oozie
We optimize the job to complete fast and cost-effectively
Our Hadoop Helpdesk gives you access to Hadoop experts
Your data is migrated to HDFS and a virtual Hadoop cluster in our cloud
Our Hadoop Operations Team maintains the cluster and plans the job
Our team monitors and manages the job through to completion
We provide an uptime SLA so our Hadoop cloud is always available

Altiscale Architecture: Data and Control Planes

Altiscale Architecture: Customer Environments

Altiscale Architecture: O&O Hadoop Cluster

Altiscale Architecture: Host Components

Altiscale Architecture: Workbenches

Altiscale Architecture: Data Transfer

Altiscale Architecture: Portal and REST API

Altiscale Architecture: Control Plane Databases

Altiscale Architecture: Control Plane Services

Altiscale Architecture: Hadoop-Based Analysis

Hadoop as a Service Offering
1. Data is migrated to our HDFS service
2. Terminal access to Hadoop cluster and associated apps
3. Portal provides job status, billing and support information
HDFS Service
Data Connectors
Foundry Apps: Apache Mahout, Cascading, Revolution R, Kafka/Camus, Avro, Pentaho Kettle, Matlab, Spark, Sqoop, H2O
Core Apps: Apache Hive, Apache Pig, Apache Oozie, Apache HCatalog, Apache Flume, R, JDK/JRE, Python, HttpFS, FUSE, LZOP, Snappy, gzip

Challenges…

Performance Challenges…
Disks: Configuration, Controllers, Density, Cost
Network: Jumbo Packet MTU
Memory: echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
Network: When does locality matter?
Flash: When to use SSD?
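The memory setting above disables transparent huge pages. A minimal sketch of how one might check the active value from Python, assuming the sysfs file uses the usual format in which the active choice is bracketed (for example "always madvise [never]"). The function names and the fallback to the mainline kernel path are illustrative assumptions, not part of the talk.

```python
from pathlib import Path

# The slide uses the RHEL 6 path; mainline kernels use the second one.
CANDIDATE_PATHS = (
    "/sys/kernel/mm/redhat_transparent_hugepage/enabled",
    "/sys/kernel/mm/transparent_hugepage/enabled",
)


def parse_thp_setting(text):
    """Return the bracketed (active) value, or None if none is marked."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def current_thp_setting():
    """Read the first sysfs file that exists; None if neither is present."""
    for candidate in CANDIDATE_PATHS:
        path = Path(candidate)
        if path.exists():
            return parse_thp_setting(path.read_text())
    return None
```

On a node configured as in the slide, current_thp_setting() would be expected to report never.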

Customer Case Study: Analyze Query
Customer provided Hive query + data sets (100GBs to ~5 TBs)
Needed help optimizing the query
Didn’t rewrite query immediately
Wanted to characterize query performance and isolate bottlenecks first

Analyze and Tune Execution
Ran original query on the datasets in our environment:
• Two M/R Stages: Stage-1, Stage-2
Long-running reducers run out of memory
• set mapreduce.reduce.memory.mb=5120
• Reduces slots and extends reduce time
Query fails to launch Stage-2 with an out-of-memory error
• set HADOOP_HEAPSIZE=1024 on client machine
Query has 250,000 Mappers in Stage-2, which causes failure
• set mapred.max.split.size=5368709120 to reduce Mappers
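The split-size value above is 5 GiB expressed in bytes. A rough sketch of the arithmetic behind raising it, assuming the mapper count is approximately the input size divided by the maximum split size. The real count also depends on file boundaries, block size and the input format, so this is an estimate only, and estimate_mappers is a hypothetical helper.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4

# The value used on the slide: 5 GiB in bytes.
MAX_SPLIT_BYTES = 5 * GIB


def estimate_mappers(total_input_bytes, max_split_bytes):
    """Ceiling division: approximate number of input splits (one mapper each)."""
    return -(-total_input_bytes // max_split_bytes)
```

For example, under this simplification a 5 TiB input with 5 GiB splits needs on the order of a thousand mappers rather than hundreds of thousands.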

Analysis: Job Execution Characteristics
Next challenge - how to visualize job execution?
Existing Hadoop/Hive logs not sufficient for this task
Wrote internal tools
• parse job history files
• plot mapper and reducer execution

Analysis: Reduce (Stage-1) Long Tail
Single reduce task
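A long tail like this can also be flagged numerically once task durations are available. The sketch below is not the internal tooling mentioned in the talk. It assumes durations have already been extracted from the job history files into a dictionary, and the 5x-median threshold is an arbitrary illustrative choice.

```python
from statistics import median


def find_stragglers(durations, factor=5.0):
    """Return task ids whose duration exceeds factor times the median.

    durations: dict mapping task id -> run time in seconds (assumed input).
    """
    if not durations:
        return []
    typical = median(durations.values())
    return sorted(task for task, secs in durations.items()
                  if secs > factor * typical)
```

A single reducer that runs for an hour while its peers finish in about a minute would be the only id returned.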

Analysis Execution: Findings
Lone, long-running reducer in first stage of query
Analyzed input data:
• Query split input data by userId
• Bucketizing input data by userId
• One very large bucket: “invalid” userId
• Discussed “invalid” userId with customer
An error value is a common pattern!
• Need to differentiate between “don’t know and don’t care” and “don’t know and do care.”
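The skew described above can be checked before running the query by counting records per key. A minimal sketch, assuming the keys fit in memory or come from a sample; dominant_keys and the 50% threshold are illustrative choices, not taken from the talk.

```python
from collections import Counter


def dominant_keys(keys, share_threshold=0.5):
    """Return (key, share) pairs for keys holding at least share_threshold
    of all records, largest first."""
    counts = Counter(keys)
    total = sum(counts.values())
    return [(key, count / total)
            for key, count in counts.most_common()
            if count / total >= share_threshold]
```

A key such as an "invalid" placeholder that carries most of the records would show up here, and every record with that key lands on the same reducer.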

Interactive (DRAM-centric) Processing Systems
Loading data into DRAM makes processing fast!
Examples: Spark, Impala, 0xdata, …, [SAP HANA], …
Streaming systems (Storm, DataTorrent) may be similar
Need to increase YARN container memory size

Hive + Interactive: Watch Out for Container Size
Caution: larger YARN container settings for interactive jobs may not be right for batch systems like Hive
Container size: needs to combine vcores and memory:
yarn.scheduler.maximum-allocation-vcores
yarn.nodemanager.resource.cpu-vcores ...
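To see why container size has to consider both dimensions, here is a small sketch of how many containers fit on one node when both memory and vcores are enforced. Whether vcores are actually enforced depends on the scheduler's resource calculator, so treat this as an assumption; the function name and the node sizes in the usage note are hypothetical.

```python
def containers_per_node(node_mem_mb, node_vcores,
                        container_mem_mb, container_vcores):
    """Containers that fit on a node: the tighter of the two limits wins."""
    by_memory = node_mem_mb // container_mem_mb
    by_vcores = node_vcores // container_vcores
    return min(by_memory, by_vcores)
```

On a hypothetical 64 GB, 16-vcore node, small batch containers (2 GB, 1 vcore) are limited by vcores to 16, while large interactive containers (16 GB, 4 vcores) are limited to 4.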

Hive + Interactive: Watch Out for Fragmentation
Scheduling interactive systems and batch systems like Hive on the same cluster may result in fragmentation
Interactive systems may require all-or-nothing scheduling
Batch jobs with many small tasks may starve interactive jobs
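A toy model of the fragmentation problem: the cluster can have plenty of free memory in total while no single node has room for one large container. The sketch below considers memory only and ignores queues and preemption; can_place is a hypothetical helper, not a YARN API.

```python
def can_place(free_mb_per_node, container_mb, count=1):
    """True if `count` containers of `container_mb` fit right now,
    given the free memory on each node (all-or-nothing placement)."""
    slots = sum(free // container_mb for free in free_mb_per_node)
    return slots >= count
```

With eight nodes that each have 3 GB free after batch tasks were scheduled, 24 small 1 GB containers fit, but a single 8 GB interactive container does not, even though 24 GB is free overall.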

Hive + Interactive: Watch Out for Fragmentation
Solutions for fragmentation…
Reserve interactive nodes before starting batch jobs
Reduce interactive container size (if the algorithm permits)
Node labels (YARN-796) and gang scheduling (YARN-624)

Altiscale: Hadoop Storage and Compute
Altiscale’s point of view on Hadoop as a Service:
• sell HDFS in increments of 10 TB
• sell compute in increments of 10K TaskHours/Month
We market Infinite Hadoop, and provide services so that customers need not worry about cluster nodes.
But Apache Hadoop user interfaces provide a node-oriented view of clusters…

Feedback from Customers
Storage plan is normally easy to estimate
Compute plan is hard to estimate
• Customer pain point: meeting computation needs sometimes requires more peak compute capacity than the nodes required for storage provide
• Opportunity: average compute often requires fewer nodes than storage does
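The pain point and the opportunity above are two sides of the same calculation. A sketch under stated assumptions: compute demand is expressed as nodes needed per hour, the cluster is sized by storage, and compute_gap is a hypothetical helper with made-up numbers in the usage note.

```python
def compute_gap(hourly_demand_nodes, storage_nodes):
    """Compare compute demand with a cluster sized for storage.

    Returns the shortfall at peak (nodes missing) and the average number
    of nodes left idle, both clamped at zero.
    """
    peak = max(hourly_demand_nodes)
    average = sum(hourly_demand_nodes) / len(hourly_demand_nodes)
    return {
        "peak_shortfall": max(0, peak - storage_nodes),
        "average_idle": max(0.0, storage_nodes - average),
    }
```

For a demand profile of mostly 2 nodes with one hour at 30, a 10-node storage-sized cluster is 20 nodes short at peak yet leaves about 3 nodes idle on average.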

Solution: Change Altiscale’s Product!
Make “Infinite” computation available to customers
Multitenancy implementation phases, each of which includes a milestone with production deliverables
0. Automation for burn/add/remove nodes
1. Deploy Linux containers using Docker
2. Decouple compute/storage + manual bursting
3. Automation: orchestrate add/remove nodes according to allocation plan from the capacity team.
4. Optimized: predictive allocation, economic incentives
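Phase 3 can be pictured as diffing the current node allocation against the plan. The sketch below is purely illustrative and does not describe Altiscale's orchestration: it assumes both the current state and the plan are dictionaries from cluster name to node count, and plan_changes is a hypothetical name.

```python
def plan_changes(current, desired):
    """Return {cluster: delta} for every cluster whose node count must change.

    Positive delta means nodes to add, negative means nodes to remove.
    Clusters absent from either side are treated as having zero nodes.
    """
    changes = {}
    for cluster in sorted(set(current) | set(desired)):
        delta = desired.get(cluster, 0) - current.get(cluster, 0)
        if delta != 0:
            changes[cluster] = delta
    return changes
```

An orchestrator could then apply the negative deltas first to free nodes before satisfying the positive ones.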

What Customers Get
On-demand access to “Infinite” Computation
Ability to handle unexpected needs without contacting Altiscale
“Access to a $10M cluster for just $1M”
Future…
Ability to package Hadoop job environment using Docker (YARN-1964)

Challenges to the Hadoop Community
Hive + Hadoop debugging can get very complex
• Sifting through many logs and screens
• Automatic transmission versus manual transmission
Static partitioning induced by the Java Virtual Machine has benefits but also creates challenges.
Where there are difficulties, there’s opportunity:
• Better tooling, instrumentation, integration of logs/metrics
YARN still evolving into an operating system
Just starting to build real multitenancy into Hadoop.
Hadoop as a Service: aggregate and share expertise
