Streamline Hadoop DevOps with
Apache Ambari
Alejandro Fernandez
May 18, 2017
Speaker
Alejandro Fernandez
Staff Software Engineer @
Hortonworks
Apache Ambari PMC
alejandro@apache.org
WHY ARE WE HERE?
“WORKING FROM MIAMI”
What is Apache Ambari?
Apache Ambari is the open-source platform to
deploy, manage and monitor Hadoop clusters
Poll
Have you heard of Ambari before?
Have you tried it, in a sandbox or in production?
Simplify Operations - Lifecycle
Ease-of-Use: Deploy, Secure/LDAP, Smart Configs, Upgrade, Monitor, Scale/Extend/Analyze
# of JIRAs per release: 2,335 (April ’15), 1,764 (Jul-Sep ’15), 1,764 (Dec ’15-Feb ’16), 1,499 (Aug-Nov ’16), 1,688 (Mar ’17)
20.5k commits over 4.5 years by 80 committers/contributors, AND GROWING
Exciting Enterprise Features in Ambari 2.5
Core
AMBARI-18731: Scale Testing on 2500 Agents
AMBARI-18990: Self-Heal DB Inconsistencies
Alerts & Log Search
AMBARI-19257: Built-in SNMP Alert
AMBARI-16880: Simplified Log Rotation
Security
AMBARI-18650: Password Credential Store
AMBARI-18365: API Authentication Using SPNEGO
Ambari Metrics System
AMBARI-17859: New Grafana dashboards
AMBARI-15901: AMS High Availability
AMBARI-19320: HDFS TopN User and Operation Visualization
Service Features
AMBARI-2330: Service Auto-Restart
AMBARI-19275: Download All Client Configs
AMBARI-7748: Manage JournalNode HA
Deploy On Premise
Kerberos, SSL, High Availability, Stack & Version
Deploy On The Cloud
Certified environments
Sysprepped VMs
Hundreds of similar clusters
Ephemeral workloads
Deploy with Blueprints
• Systematic way of defining a cluster
• Export existing cluster into blueprint
/api/v1/clusters/:clusterName?format=blueprint
Blueprint = Configs + Topology; Cluster = Blueprint + Hosts
Create a cluster with Blueprints
{
  "configurations" : [
    {
      "hdfs-site" : {
        "dfs.datanode.data.dir" : "/hadoop/1,/hadoop/2,/hadoop/3"
      }
    }
  ],
  "host_groups" : [
    {
      "name" : "master-host",
      "components" : [
        { "name" : "NAMENODE" },
        { "name" : "RESOURCEMANAGER" },
        …
      ],
      "cardinality" : "1"
    },
    {
      "name" : "worker-host",
      "components" : [
        { "name" : "DATANODE" },
        { "name" : "NODEMANAGER" },
        …
      ],
      "cardinality" : "1+"
    }
  ],
  "Blueprints" : {
    "stack_name" : "HDP",
    "stack_version" : "2.5"
  }
}
{
"blueprint" : "my-blueprint",
"host_groups" :[
{
"name" : "master-host",
"hosts" : [
{
"fqdn" : "master001.ambari.apache.org"
}
]
},
{
"name" : "worker-host",
"hosts" : [
{
"fqdn" : "worker001.ambari.apache.org"
},
{
"fqdn" : "worker002.ambari.apache.org"
},
…
{
"fqdn" : "worker099.ambari.apache.org"
}
]
}
]
}
1. POST /api/v1/blueprints/my-blueprint 2. POST /api/v1/clusters/my-cluster
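The two POST calls above can be sketched in Python. This is a minimal sketch, not production code: the server URL is hypothetical, authentication is omitted, and the payload simply mirrors the JSON above.

```python
import json
import urllib.request

AMBARI = "http://ambari.example.com:8080"  # hypothetical Ambari Server URL

def make_blueprint():
    # Mirror of the blueprint JSON above, trimmed to two host groups
    return {
        "configurations": [
            {"hdfs-site": {"dfs.datanode.data.dir": "/hadoop/1,/hadoop/2,/hadoop/3"}}
        ],
        "host_groups": [
            {"name": "master-host",
             "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"}],
             "cardinality": "1"},
            {"name": "worker-host",
             "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}],
             "cardinality": "1+"},
        ],
        "Blueprints": {"stack_name": "HDP", "stack_version": "2.5"},
    }

def post(path, payload):
    # Ambari's REST API rejects writes without the X-Requested-By header;
    # real calls also need an Authorization header for an admin user.
    req = urllib.request.Request(
        AMBARI + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Requested-By": "ambari"},
    )
    return urllib.request.urlopen(req)

# Step 1 registers the blueprint; step 2 instantiates a cluster from it.
# Uncomment against a live server:
# post("/api/v1/blueprints/my-blueprint", make_blueprint())
# post("/api/v1/clusters/my-cluster", {
#     "blueprint": "my-blueprint",
#     "host_groups": [
#         {"name": "master-host", "hosts": [{"fqdn": "master001.ambari.apache.org"}]},
#         {"name": "worker-host", "hosts": [{"fqdn": "worker001.ambari.apache.org"}]},
#     ],
# })
```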
Blueprints for Large Scale
• Kerberos, secure out-of-the-box
• High Availability is set up at install time for NameNode, YARN, Hive, Oozie, etc.
• Host Discovery allows Ambari to automatically install services on a host when it comes online
• Stack Advisor for config recommendations
POST /api/v1/clusters/MyCluster/hosts
[
  {
    "blueprint" : "single-node-hdfs-test2",
    "host_groups" : [
      {
        "host_group" : "worker",
        "host_count" : 3,
        "host_predicate" : "Hosts/cpu_count>1"
      },
      {
        "host_group" : "super-worker",
        "host_count" : 5,
        "host_predicate" : "Hosts/cpu_count>2&Hosts/total_mem>3000000"
      }
    ]
  }
]
Blueprint Host Discovery
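The host_predicate strings above filter discovered hosts by their reported properties. As an illustration only (my own sketch, not Ambari's actual predicate engine, whose grammar is richer), this is the matching logic such a predicate expresses:

```python
def matches(host, predicate):
    """Evaluate a tiny 'Hosts/<prop><op><value>' predicate, with '&' as AND.
    Illustration of the semantics only, not Ambari's implementation."""
    for clause in predicate.split("&"):
        for op in (">=", "<=", ">", "<", "="):
            if op in clause:
                key, value = clause.split(op, 1)
                actual = host[key.split("/", 1)[-1]]  # strip the 'Hosts/' prefix
                value = float(value)
                ok = {">": actual > value, "<": actual < value,
                      ">=": actual >= value, "<=": actual <= value,
                      "=": actual == value}[op]
                if not ok:
                    return False
                break
    return True

# A host with 4 CPUs and ~8 GB RAM qualifies for both groups above
host = {"cpu_count": 4, "total_mem": 8_000_000}
```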
Service Layout
Common Services Stack Override
Custom Service
Starter Pack:
• metainfo.xml
• Python scripts: lifecycle management
• Configs: key, value, description, allow empty, password, etc.
• Templates: Jinja template with config replacement
• Role Command Order: dependency of start, stop commands
• Service Advisor: recommend/validate configs on changes
• Kerberos: principals and keytabs, configs to change when Kerberized
• Widgets: UI config knobs, sections
• Alerts: definition, type: [port, web, python script], interval
• Metrics: for Ambari Metrics System
Custom Service – metainfo.xml
<service>
<name>SAMPLESRV</name>
<displayName>New Sample Service</displayName>
<comment>A New Sample Service</comment>
<version>1.0.0</version>
<components>
<component>
<name>SAMPLESRV_MASTER</name>
<displayName>Sample Srv Master</displayName>
<category>MASTER</category>
<cardinality>1</cardinality>
<commandScript>
<script>scripts/master.py</script>
<scriptType>PYTHON</scriptType>
<timeout>600</timeout>
</commandScript>
</component>
<component>
<name>SAMPLESRV_SLAVE_OR_CLIENT</name>
<displayName>Sample Slave or Client</displayName>
<category>SLAVE | CLIENT</category>
<cardinality>0+ | 0-1 | 1 | 1+</cardinality>
<commandScript>
<script>scripts/slave_or_client.py</script>
<scriptType>PYTHON</scriptType>
<timeout>600</timeout>
</commandScript>
</component>
</components>
...
metainfo.xml
Custom Service – metainfo.xml
...
<customCommand>
<name>DECOMMISSION</name>
<commandScript>
<script>scripts/decommission.py</script>
<scriptType>PYTHON</scriptType>
<timeout>1200</timeout>
</commandScript>
</customCommand>
<dependency>
<name>HDFS/NAMENODE</name>
<scope>cluster | host</scope>
<auto-deploy>
<enabled>true | false</enabled>
</auto-deploy>
</dependency>
...
<requiredServices>
<service>HDFS</service>
</requiredServices>
metainfo.xml
Custom Service – metainfo.xml
...
<configuration-dependencies>
<config-type>service-env</config-type>
<config-type>service-site</config-type>
<config-type>hdfs-site</config-type>
</configuration-dependencies>
<osSpecifics>
<osSpecific>
<osFamily>any</osFamily>
<packages>
<package>
<name>rpm_apt_pkg_name</name>
</package>
</packages>
</osSpecific>
</osSpecifics>
metainfo.xml
Custom Service – Python Script
from resource_management import Script

class Master(Script):
    def install(self, env):
        print('Install the Sample Srv Master')
    def stop(self, env):
        print('Stop the Sample Srv Master')
    def start(self, env):
        print('Start the Sample Srv Master')
    def status(self, env):
        print('Status of the Sample Srv Master')
    def configure(self, env):
        print('Configure the Sample Srv Master')

if __name__ == "__main__":
    Master().execute()
master.py
Stack Advisor
Recommends configurations for: Kerberos, HTTPS, Zookeeper servers, memory settings, High Availability, …
Example configurations:
# Atlas Servers
atlas.rest.address = http(s)://host:port
atlas.enableTLS = true|false
atlas.server.http.port = 21000
atlas.server.https.port = 21443
Service Advisors in Ambari 3.0
• Break up the single Stack Advisor into 22 Service Advisors
• Rewrite in Java for stronger type checking and better performance
• Use Drools
Comprehensive Security
LDAP/AD: user auth, sync
Kerberos: MIT KDC, keytab management
Atlas: governance, compliance, lineage & history, data classification
Ranger: security policies, audit, authorization
Knox: perimeter security, supports LDAP/AD, security for REST/HTTP, SSL
Kerberos
Ambari manages Kerberos principals and keytabs
Works with existing MIT KDC or Active Directory
Once Kerberized, handles:
• Adding hosts
• Adding components to existing hosts
• Adding services
• Moving components to different hosts
Testing at Scale: 3000 Agents
Agent Multiplier
• Each Agent has its own hostname, home dir, log dir, PID, and ambari-agent.ini file
• Agent Multiplier can bootstrap 50 Agents per VM
• Tried Docker + Weave before; not very stable for networking
[Diagram: one VM running a single Agent vs. one VM running Agents 1-50]
Testing at Scale: 3000 Agents
Ambari Server
PERF Stack dummy services (Zookeeper, HDFS, YARN, HBASE):
• Happy: always passes
• Sleepy: always times out
• Grumpy: always fails
Testing:
 Scale (server cannot tell the difference)
 Kerberos
 Stack Advisor
 Alerts
 Rolling & Express Upgrade
 UI
Optimize for Large Scale
export AMBARI_JVM_ARGS="$AMBARI_JVM_ARGS -Xms2048m -Xmx8192m"
ambari-env.sh
ambari.properties
Setting (by cluster size: 10 / 50 / 100 / >500 hosts)
agent.threadpool.size.max: 25 / 35 / 75 / 100
alerts.cache.enabled: true
alerts.cache.size: 50000 / 100000
alerts.execution.scheduler.maxThreads: 2 / 4
• Dedicated database server with SSD
• MySQL 5.7 and DB tuning
• Purge old Ambari history: commands, alerts, BP topology, upgrades.
https://community.hortonworks.com/articles/80635/optimize-ambari-performance-for-large-clusters.html
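The tuning values above can be applied mechanically. A small sketch that picks values by host count and appends them to ambari.properties; note that the exact thresholds are my reading of the table's column breaks, and the config path assumes a default Ambari Server install:

```python
def tuning_for(host_count):
    """Return large-cluster tuning values, interpolated from the table above."""
    if host_count > 500:
        pool = 100
    elif host_count >= 100:
        pool = 75
    elif host_count >= 50:
        pool = 35
    else:
        pool = 25
    big = host_count > 500
    return {
        "agent.threadpool.size.max": pool,
        "alerts.cache.enabled": "true",
        "alerts.cache.size": 100000 if big else 50000,
        "alerts.execution.scheduler.maxThreads": 4 if big else 2,
    }

def append_properties(path, props):
    # ambari.properties is a plain key=value file; appended keys win
    with open(path, "a") as f:
        for key, value in sorted(props.items()):
            f.write("{}={}\n".format(key, value))

# Example (default install path; restart ambari-server afterwards):
# append_properties("/etc/ambari-server/conf/ambari.properties", tuning_for(600))
```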
Background: Upgrade Terminology
Manual Upgrade: the user follows instructions to upgrade the stack; incurs downtime
Rolling Upgrade: automated; upgrades one component per host at a time; preserves cluster operation and minimizes service impact
Express Upgrade: automated; runs in parallel across hosts; incurs downtime
Automated Upgrade: Rolling or Express
1. Check Prerequisites: review the prereqs to confirm your cluster configs are ready
2. Prepare: take backups of critical cluster metadata
3. Register + Install: register the HDP repository and install the target HDP version on the cluster
4. Perform Upgrade: perform the HDP upgrade; the steps depend on the upgrade method, Rolling or Express
5. Finalize: finalize the upgrade, making the target version the current version
Process: Rolling Upgrade
[Diagram: the upgrade flows through ZooKeeper, Ranger/KMS, Core Masters (HDFS NN1/NN2, YARN, HBase), Core Slaves (DataNodes), Hive, Spark, Oozie, Falcon, Kafka, Knox, Storm, Slider, Flume, Accumulo, and Clients (HDFS, YARN, MR, Tez, HBase, Pig, Hive, etc.), ending with Finalize or Downgrade]
On Failure: Retry, Ignore, or Downgrade
Process: Express Upgrade
[Diagram: stop high-level services (Spark, Storm, etc.); back up HDFS, HBase, Hive; stop low-level services (YARN, MR, HDFS, ZK); change stack + configs; then process Zookeeper, Ranger/KMS, HDFS, YARN, MapReduce2, HBase, Hive, Oozie, Knox, Storm, Slider, Flume, Falcon, Accumulo, with hosts handled in parallel; ending with Finalize or Downgrade]
On Failure: Retry, Ignore, or Downgrade
Total Time: 2:53 / 13:16 / 26:26 (scales linearly with # of hosts)
Total Time: 0:32 / 1:14 / 2:19 (scales linearly with # of batches, which default to 100 hosts at a time)
Speedup: 5.4x / 10.7x / 11.4x
Alerting Framework
Alert Type: Description. Thresholds (units)
WEB: Connects to a web URL; alert status is based on the HTTP response code. Thresholds: Response Code (n/a), Connection Timeout (seconds)
PORT: Connects to a port; alert status is based on response time. Thresholds: Response (seconds)
METRIC: Checks the value of a service metric; units vary based on the metric being checked. Thresholds: Metric Value (units vary), Connection Timeout (seconds)
AGGREGATE: Aggregates the status of another alert. Thresholds: % Affected (percentage)
SCRIPT: Executes a script to handle the alert check. Thresholds: varies
SERVER: Executes a server-side runnable class to handle the alert check. Thresholds: varies
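A SCRIPT alert is a Python module that Ambari invokes on the alert interval. The sketch below shows the shape of such a check with a hypothetical free-disk example; the execute() signature returning a (status, [text]) tuple follows Ambari's script-alert convention, but treat the details as an approximation and check your Ambari version's alert docs:

```python
import shutil

# Status strings the alert framework understands
RESULT_OK = "OK"
RESULT_WARNING = "WARNING"
RESULT_CRITICAL = "CRITICAL"

def execute(configurations=None, parameters=None, host_name=None):
    """Alert entry point: return (status, [description])."""
    usage = shutil.disk_usage("/")
    pct_free = 100.0 * usage.free / usage.total
    label = "{:.1f}% disk free on /".format(pct_free)
    if pct_free < 5:
        return (RESULT_CRITICAL, [label])
    if pct_free < 20:
        return (RESULT_WARNING, [label])
    return (RESULT_OK, [label])
```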
AMS Architecture
• Custom Sinks – HDFS, YARN, HBase, Storm,
Kafka, Flume, Accumulo
• Monitors – lightweight daemon for system metrics
• Collector – API daemon + HBase (embedded / distributed)
• Phoenix schema designed for fast reads
• Managed HBase
• Grafana support from version 2.2.2
[Diagram: HDP services and system metrics flow through Sinks and Monitors into the Metrics Collector (Collector API + Phoenix); Ambari and Grafana read from the Collector API]
AMS Distributed Collector Arch Details
[Diagram: Metrics Monitors (cluster hosts, ZooKeeper) and Metrics Sinks (HDFS, YARN, HBase, Hive, Storm, Kafka, Flume, NiFi) feed multiple Metrics Collectors, each running a Collector API, Phoenix, Aggregators, HBase Master + RS, and a Helix Participant]
AMS Features
• Simple POST API for sending metrics
• Rich GET API to fetch metrics at a specific granularity
 Point in time & series
 Top N support
 Rate support
• Performs host-level aggregation as well as time-based downsampling
• Highly tunable system
 Adjust rate of collecting/sending metrics
 Adjust granularity of data being stored
 Skip aggregation for certain metrics
 Whitelist metrics
• Metadata API that reports which metrics are being collected and which component sends them
• Abstract Sink implementation to facilitate easy integration with the Metrics Collector
• HTTPS support
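The POST API takes a JSON envelope of timestamped values. A sketch of building one follows; the collector endpoint and field names are from memory of AMS's timeline API, so verify them against your AMS version before relying on this:

```python
import time

def ams_payload(metric_name, app_id, hostname, values):
    """Build the JSON body for POST http://<collector>:6188/ws/v1/timeline/metrics.
    `values` maps epoch-millisecond timestamps to metric values."""
    return {
        "metrics": [
            {
                "metricname": metric_name,
                "appid": app_id,
                "hostname": hostname,
                "starttime": min(values),
                "metrics": {str(ts): v for ts, v in values.items()},
            }
        ]
    }

# Two samples ten seconds apart for a hypothetical custom metric
now = int(time.time() * 1000)
body = ams_payload("requests.count", "myapp", "worker001.ambari.apache.org",
                   {now: 12, now + 10_000: 15})
```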
Grafana for Ambari Metrics
Features:
• Grafana as a “Native UI” for Ambari Metrics
• Pre-built dashboards: host-level, service-level
• Supports HTTPS
Dashboards:
• System Home, Servers
• HDFS Home, NameNodes, DataNodes
• YARN Home, Applications, Job History Server
• HBase Home, Performance
AMS - Grafana Integration
Log Search
Search and index HDP logs!
Capabilities
• Rapid Search of all HDP component logs
• Search across time ranges, log levels, and for
keywords
[Diagram: Ambari, Log Search, and Solr]
Log Search
[Diagram: a Log Feeder (Java process) on each worker node applies Grok filters and ships logs, with multi-output support, to Solr Cloud with local disk storage; the Log Search UI and Ambari query Solr]
Future of Apache Ambari 3.0
• Cloud features
• Service multi-instance (e.g., two ZK quorums)
• Service multi-versions (Spark 2.0 & Spark 2.2)
• YARN assemblies & services
• Patch Upgrades: upgrade individual components in
the same stack version, e.g., just DN and RM in HDP
3.0.*.* with zero downtime
• Ambari High Availability
Resources
Contribute to Ambari:
https://cwiki.apache.org/confluence/display/AMBARI/Quick+Start+Guide
Referenced Articles:
https://community.hortonworks.com/articles/43816/how-to-createadd-the-service-stop-the-service.html
https://community.hortonworks.com/articles/80635/optimize-ambari-performance-for-large-clusters.html
Image Sources:
http://www.vacationgetaways4less.com/wp-content/gallery/miami-newport-beachside-hotel-resort-banner/miami-beach-south-beach-night-730x302.jpg
https://ak9.picdn.net/shutterstock/videos/2139614/thumb/1.jpg
Many thanks to the ASF, audience, and event
organizers.
2 mins for questions…
github.com/afernandez
alejandro@apache.org
Editor's Notes

  • #3 Apache Ambari committer and Project Management Committee (PMC) member since 2014. Co-architect of Rolling & Express Upgrade, the PERF stack, and Atlas integration.
  • #7 Single pane of glass. Services, deploy, manage, configure, lifecycle Provision on the cloud, blueprints High Availability + Wizards Metrics (dashboards) Security, using MIT Kerberos or others Alerts (SNMP, emails, etc) Host management Views framework Stack Upgrades
  • #8 Deploy: Blueprints with host discovery. Secure: Kerberos, LDAP sync. Smart Configs: stack advisor; it is painful to configure a thousand related knobs by hand. E.g., changing the ZooKeeper quorum affects several services, and changing a log folder affects Log Search. Upgrade: Rolling and Express Upgrade, get patches. Monitor: Ambari Alerts, Ambari Metrics, Log Search. Analyze, Scale, Extend: Views, Management Packs.
  • #9 We just released Ambari 2.5 in March with almost 1,800 Jiras (features and bug fixes), and have 2-3 major releases per year. 0.9 in Sep 2012, 1.5 in April 2014, 1.6 in July 2014. 2.0.0=1,688 2.1.0=1,866 2.1.1=276 2.1.2=379 2.2.0=798 2.2.1=206 2.2.2=495 2.3.0 was not used 2.4.0=2,189 2.4.1=33 2.4.2=113 2.5.0=1,523 2.5.1=241 and counting. Cadence is 2-3 major releases per year, with follow-up maintenance releases in the months after. http://jsfiddle.net/mp8rqq5x/2/
  • #10 Services: Service Auto-Start: UI for enabling which services should restart automatically if they die or host is rebooted. Download all client configs with a single click Wizard for NameNode HA, to save namespace, format new namenode, move existing JN, more than 3 JNs Security: Passwords are now stored in the credential store by default for Hive, Oozie, Ranger, and Log Search Support Kerberos token authentication Core: Perf fixes for up to 2.5k agents, PERF stack, and simulate up to 50 agents per VM. DB Consistency Checker can now fix itself during mismatches. AMS: New Grafana dashboard for Ambari Server performance (JVM, GC, top queries) HDFS TopN User and Operation Visualization – Shows most frequent operations being performed on the NameNode, and more importantly who’s performing those operations, intervals are 1, 5, 25 min sliding window. Ambari Metrics Collector now supports active-active HA to distribute the load. Alerts & Log Search: Default SNMP alert now sends traps using an Ambari-specific MIB Log Search now has settings for max backup file size and max # of backup files.
  • #12 Cloudbreak can install on AWS (EC2, S3, RDS) and Microsoft Azure. Cluster install takes 5, mostly downloading packages, installing bits, and starting services.
  • #13 Used by HDInsight (Microsoft Azure) and Hortonworks QA. Allow cluster creation or scaling to be started via the REST API prior to all/any hosts being available: as hosts register with the Ambari server, they are matched to request host groups and provisioned according to the requested topology. Allow host predicates to be specified along with host count to provide more flexibility in matching hosts to host groups; this allows for host flavors, where different host groups are matched to different host flavors. Break up the current monolithic provisioning request into a request for each host operation (for example: install on host A, start on host A, install on host B, etc.), so hosts can make progress even when another host encounters a failure. Allow a host count to be specified in the cluster creation template instead of host names. This is documented in https://issues.apache.org/jira/browse/AMBARI-6275. Install a cluster with two API calls.
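  The two API calls work roughly like this: first register the blueprint (`POST /api/v1/blueprints/my-blueprint` with blueprint JSON like the example earlier in the deck), then create the cluster by mapping concrete hosts to the blueprint's host groups. A hedged sketch of the second call's body, POSTed to `/api/v1/clusters/MyCluster` (blueprint name, cluster name, and FQDNs are placeholders; verify the template shape against the Blueprints wiki for your Ambari version):

```json
{
  "blueprint": "my-blueprint",
  "host_groups": [
    { "name": "master-host", "hosts": [ { "fqdn": "master1.example.com" } ] },
    { "name": "worker-host", "hosts": [ { "fqdn": "worker1.example.com" },
                                        { "fqdn": "worker2.example.com" } ] }
  ]
}
```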
  • #14 The blueprint contains the configs, assignment of topology to host group, stack version The creation actually assigns hosts to each host group.
  • #19 Dynamic availability. Allow host_count to be specified instead of host_names: as hosts register, they are matched to the request host groups and provisioned according to the requested topology. When specifying a host_count, a predicate can also be specified for finer-grained control. (3 terabytes, since the units are in MB.)
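  That host_count/predicate form can be sketched as below, following AMBARI-6275; the predicate expression is illustrative only, and the attribute names should be checked against the Blueprints documentation for your Ambari version:

```json
{
  "blueprint": "my-blueprint",
  "host_groups": [
    {
      "name": "worker-host",
      "host_count": 3,
      "host_predicate": "Hosts/cpu_count>1"
    }
  ]
}
```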
  • #20 Configuration files; package with Python scripts and templates; alert definitions; Kerberos configurations (principals, identities, keytabs); metadata; metric details; UI controls and widgets.
  • #21 Lucene, Cassandra
  • #26 Stack Advisor: a service can now ship its recommendations with the service itself, instead of relying on a monolithic stack advisor for the entire stack. This makes it easier to integrate custom services. In Ambari 3.0 this will be split further: each service will have its own independent advisor, written in Java + Drools.
  • #27 Split up; rewrite in Java + Drools.
  • #28 Kerberos: LDAP/AD. Services: Ranger, Atlas, Knox. Ranger: set up security policies on who can access what; authorization and audit files; plugins for other services like HDFS, Hive, Storm, etc. Atlas: lineage of data and compliance, especially in health care and financial institutions. Knox: perimeter security for HTTP and REST calls into the Hadoop services; works with SSL and Kerberos. Kerberos Key Distribution Center (KDC), so we can define service principals and keytabs.
  • #29 Can use an existing KDC (Key Distribution Center) or install one for Hadoop. Hadoop uses a rule-based system to create mappings between service principals and their related UNIX usernames.
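  As an illustration of that rule-based mapping, the `hadoop.security.auth_to_local` property in core-site.xml translates principals to local users; the realm `EXAMPLE.COM` and the principal names below are placeholders, not values Ambari generates:

```xml
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[2:$1@$0](nn@EXAMPLE.COM)s/.*/hdfs/
    RULE:[2:$1@$0](rm@EXAMPLE.COM)s/.*/yarn/
    DEFAULT
  </value>
</property>
```

  Each RULE matches a principal pattern (here, two-component principals like `nn/host@EXAMPLE.COM`) and rewrites it to a UNIX username via a sed-style substitution; DEFAULT handles everything else.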
  • #34 Express Upgrade: the fastest method to upgrade the stack, since it upgrades an entire component in batches of 100 hosts at a time. Rolling Upgrade: one component at a time per host, which can take up to 1 min each. For a 100-node cluster with…
  • #37 Can install new bits ahead of time, side-by-side with existing bits, while the cluster is running Option to downgrade
  • #38 One component per host at a time.
  • #39 Batches of 100 hosts in parallel. Must first stop on the current version, take backups, change version, and start on the newer stack
  • #45 Motivation: limited Ganglia capabilities; OpenTSDB has a GPL license and needs a Hadoop cluster. Aggregation at multiple levels (service, time scale); tested at 3,000 nodes. Fine-grained control over retention, collection intervals, aggregation. Pluggable and extensible.
  • #48 This Grafana instance is specifically for AMS, not meant to be general-purpose; if the customer is already using Grafana, this is not a replacement. Grafana will support read-only access for anonymous users, and HTTPS. Aggregates across the entire cluster; filter by host, top/bottom X, functions like avg/sum/min/max, filter by date range. Example: on one YARN cluster, 10% of jobs kept failing randomly and it took 2-3 weeks to find the cause: one host had a kernel issue, so containers failed whenever they landed on it; the YARN NodeManager dashboard would have shown that host as failed. On an HBase cluster, certain region servers showed very high load and RPC queue length; a faulty network card on one of the DataNodes was found by looking at packet round trips and blocked data packets / num ops on the DataNode dashboard.
  • #50 This is not HDP Search; it is not something that the customer has to separately license: it is an embedded Solr instance.
  • #51 Agent/collection process running on each host. Written in Java. Tails all service log files. Parses logs using Grok/regex; can merge multi-line logs, e.g., stack traces. On restart, can resume from the last read line; uses checkpoint files to maintain state. Extendable design to send logs to multiple destination types; currently can send logs to Solr and Kafka.
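  A hedged sketch of what a Log Feeder input config might look like: the file layout (`input`/`filter` arrays, a `grok` filter with message and multiline patterns) follows the input.config-*.json files shipped with Ambari Log Search, but every path, type name, and pattern below is illustrative and should be checked against the configs bundled with your version:

```json
{
  "input": [
    {
      "type": "hdfs_namenode",
      "rowtype": "service",
      "path": "/var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log"
    }
  ],
  "filter": [
    {
      "filter": "grok",
      "conditions": { "fields": { "type": ["hdfs_namenode"] } },
      "multiline_pattern": "^(%{TIMESTAMP_ISO8601:logtime})",
      "message_pattern": "(?m)^%{TIMESTAMP_ISO8601:logtime}%{SPACE}%{LOGLEVEL:level}%{SPACE}%{GREEDYDATA:log_message}"
    }
  ]
}
```

  The multiline pattern is what lets the feeder glue a stack trace onto the preceding log line: any line that does not start with a timestamp is treated as a continuation of the previous event.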