Building a Robust Cloud Foundry
HA, Security and DR
Haydon Ryan | Duncan Winn
This Talk
• High Availability (HA)
• Security
• Backing Up to Mitigate Disasters
© Copyright 2014 Pivotal. All rights reserved.
HA
High Availability Focus
Keep apps and services running in a performant,
reliable and recoverable manner with timely error
detection
1. Application Instances
2. Platform Processes
3. Platform VMs
4. Availability Zones
Keep Cloud Foundry running in a performant, reliable
and recoverable manner with timely error detection
HA Deployments
Single Foundation Deployment (one Data Center, two AZs, RDS)
vs
Dual Foundation Deployment (two Data Centers)
WHAT IF I TOLD YOU
IT’S POSSIBLE TO SANELY
STRETCH LAYER 2
Request flow (two sites shown side by side):
• User targets myapp.mycf.com
• DNS resolution via DNS Global Traffic Management (GTM)
• Each site sits behind its own NSX Boundary, with a VIP, LTM Appliances, HA Proxies and SSL Termination
Domains
• Client targets myapp.mycf.com (application domain)
CF1
• Developer: cf api targets api.runtime-cf1.com (system domain)
• Developer: cf push myapp targets cf1.com (application domain)
CF2
• Developer: cf api targets api.runtime-cf2.com (system domain)
• Developer: cf push myapp targets cf2.com (application domain)
[Diagrams: App and Services]
HA Deployments
Single Foundation Deployment (one Data Center, two AZs, RDS)
vs
Dual Foundation Deployment (two Data Centers)
Customer Requirements
• AWS with One VPC
• Specific IP Ranges
• Using their internal corporate DNS
• no ELBs or Route 53 due to security setup
• Multiple Deployments of Cloud Foundry
• Availability Requirements:
• App uptime
• Failure matrix for downtime situations
[Diagram: Bind DNS, HA Proxy ×4 (SSL Termination), CF Router ×2]
Who does the deployment need to be highly available for?
• Users
• Developers
• Operations
Any non-critical jobs?
• clock_global
• used to clean up cc jobs.
• Rely on Resurrector?
• Redeploy to a different AZ by changing the resource_pool
Critical Jobs & VMs
• haproxy
• router
• nats
• cloud controller
• uaa/login?
• doppler?
Any less-critical jobs?
• loggregator / doppler
• loggregator traffic controller
• etcd
• Jumpbox?
• bosh?
Caveats with this design
• Single points of failure?
• DNS
• Bosh
• Jumpbox
• Human interaction required in outage
• Bind DNS does not do health monitoring. Monitoring scripts were outside the scope of the engagement.
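Because Bind DNS does no health monitoring, some external check has to decide which HA Proxy addresses should stay in the DNS record. The following is a minimal Python sketch of that idea, not the monitoring script from the engagement (none was written). The function names, the example addresses and the choice of a plain TCP connect as the health signal are all assumptions; updating the Bind zone is left out entirely.

```python
import socket
from typing import Callable, Iterable, List, Tuple

Target = Tuple[str, int]  # (host, port)


def tcp_probe(target: Target, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection(target, timeout=timeout):
            return True
    except OSError:
        return False


def healthy_targets(
    targets: Iterable[Target],
    probe: Callable[[Target], bool] = tcp_probe,
) -> List[Target]:
    """Keep only the targets for which the probe reports success."""
    return [target for target in targets if probe(target)]


if __name__ == "__main__":
    # Hypothetical HA Proxy addresses; replace with the real ones.
    proxies = [("10.202.64.10", 443), ("10.202.96.10", 443)]
    print(healthy_targets(proxies))
```

A human or a scheduled job could compare the returned list with the current DNS record and react when they differ. The probe is injectable so the selection logic can be exercised without network access.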
[Diagram: deployment layout]
• VPC 10.202.64.0/19, linked by Direct Connect to the Customer Managed Interstate Data Center
• Bosh Subnet (Bosh SG): jumpbox, bosh
• RDS Subnet (RDS SG): boshdb, uaadb, ccdb, apps manager
• AZ 1 Private Subnet and AZ 2 Private Subnet (CF SG), each with: ha proxy, router, login, uaa, dea, cc, nats, health, etcd, doppler, cc worker, loggregator traffic controller
• clock appears in only one of the two AZs
• Also shown: bind dns, Customer Managed NAT, bastion
How We Deployed Services
• Proxy is a Single Point of Failure
• No Load Balancer to use
• Accepted by the customer in the failure matrix
[Diagram: App, Proxy ×3, Server ×2]
Best Practices for Services
• By default the service binding uses the first proxy address only
[Diagram: App, Load Balancer, Proxy ×2, Server ×3]
Which Deployment
• Dual Foundation Deployment (two Data Centers)
• Single Foundation, Dual AZs (one Data Center, two AZs, RDS)
• Single Foundation, Single DC (one Data Center)
Security and Networking
(on AWS)
Security
• Security is Hard
• Three main concepts
• Restrict
• Limit scope if Compromised
• Mitigate
• Feedback Loop
Restrict Users
• Individual Multi Factor Authentication
• IaaS Console/Hypervisor
• Jumpbox
• Separate accounts
• jumpbox
• bosh
• github
Restrict Packets
• IaaS
• Security Groups (Instance Level) (better)
• ACLs (Subnet Level)
• Routes
Restrict Containers
• Cloud Foundry
• Application Security Groups
• dea network properties
• (allow_networks, deny_networks)
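Application Security Groups are defined as a JSON list of egress rules, each with a protocol, a destination and ports. As a rough illustration, the Python sketch below builds and validates such a list before it is written to a file. The helper name, the example CIDR and the port are assumptions, and the sketch deliberately accepts only tcp/udp rules with a single IP or CIDR destination, which is narrower than what Cloud Foundry itself allows.

```python
import ipaddress
import json
from typing import Dict, List


def asg_rule(protocol: str, destination: str, ports: str) -> Dict[str, str]:
    """Build one Application Security Group rule after basic validation."""
    if protocol not in ("tcp", "udp"):
        raise ValueError(f"unsupported protocol in this sketch: {protocol}")
    # Raises ValueError if destination is not a valid IP address or CIDR block.
    ipaddress.ip_network(destination, strict=False)
    return {"protocol": protocol, "destination": destination, "ports": ports}


def asg_rules_json(rules: List[Dict[str, str]]) -> str:
    """Serialise the rules as the JSON document a rules file would contain."""
    return json.dumps(rules, indent=2)


if __name__ == "__main__":
    # Hypothetical rule: allow app containers to reach a database subnet on 3306.
    print(asg_rules_json([asg_rule("tcp", "10.0.16.0/24", "3306")]))
```

The resulting file would then be passed to `cf create-security-group` and bound with `cf bind-security-group`. Validating the destination up front catches a typo before it silently blocks or opens traffic.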
Pivotal Cloud Foundry for AWS 1.4
[Diagram: network layout]
• VPC 10.0.0.0/16 with a Public Subnet, a Private Subnet and an RDS Subnet
• Internet Gateway, ELB, NAT, Ops Manager
• Security groups: ELB SG, NAT SG, Ops Manager SG, Elastic Runtime SG, RDS SG
• VMs: micro, router, login, uaa, dea, cc, nats, health, etcd, doppler, cc worker, loggregator traffic controller, clock
• Databases: boshdb, uaadb, ccdb, apps manager db, autoscaling
• SG allow rules shown: vpc all; restricted ip 80, 443, 22*; ELB SG 80?, 443
• Legend: Common traffic flow, sg allow rules
• Open question: was it just DEAs that used NAT?
Limit Scope if Compromised
• Different user/pass for each component
• Strong passwords (and usernames)
• 20 Characters Long
• RANDOM
• Both Cases
• best avoid special characters
• eg: YxLIodYrUBQJrvMRYSQL
• Avoid cloud cow
http://vanmethod.deviantart.com/art/Purple-Cow-on-a-Cloud-146265642
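The password guidance above (20 characters, random, both cases, no special characters) can be sketched in Python as follows. This is one possible generator, not the tool the speakers used. The function name is an assumption, and limiting the alphabet to ASCII letters simply mirrors the example on the slide; digits could be added without introducing special characters.

```python
import secrets
import string

# Upper- and lower-case ASCII letters only, matching the slide's example.
ALPHABET = string.ascii_letters


def generate_credential(length: int = 20) -> str:
    """Return a random letters-only string containing both cases."""
    if length < 2:
        raise ValueError("length must be at least 2 to contain both cases")
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        has_lower = any(c.islower() for c in candidate)
        has_upper = any(c.isupper() for c in candidate)
        if has_lower and has_upper:
            return candidate


if __name__ == "__main__":
    # One distinct value per component, for usernames as well as passwords.
    for component in ("uaa", "ccdb", "nats"):
        print(component, generate_credential())
```

`secrets` draws from the operating system's cryptographic random source, which is the relevant property here; the `random` module is not suitable for credentials.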
Limit Scope if Compromised
[Diagram: Runner, UAA, Login, uaadb, MySQL, App Data]
Post Breach Security Measures
• Roll
• AWS Credentials
• Username and password (Manifest)
• PEMs
• Investigate:
• VM logs (stored in Splunk / CloudWatch Logs)
• Bosh and Login Audit Trail
• Isolate the VM for investigation
• Resurrector will resurrect a non-compromised VM
• Feedback:
• Incident Reports and Management Support
Paranoid Level Security for AWS
• CloudTrail
• Alerts
• Audit Logs
• Rollback
• Remove ability to delete
• S3 buckets
• subnets / VPC
• backups
• Everything else can be recovered from a backup…
Disaster Recovery
Backing Up Cloud Foundry
• Configuration
• Databases: CCDB, UAADB, Apps Manager DB, BOSH DB
• Blobstore / NFS Server
SCENARIO ONE
LOSE PCF OPS-MGR
OR
CF DEPLOYMENT
Restoring Ops Manager
1. Export Configuration
2. Create New Ops Manager
3. Import Configuration
Configuration
Backup Ops Manager
scp ubuntu@<OPS MGR HOST>:/var/tempest/workspaces/default/deployments/*yml .
Backup Deployment Manifests
Deployment Manifests in BOSH
~$ bosh deployments
bosh download manifest cf-c700aee17d9f801eb152 cfmanifest.yml
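With several deployments, the download step above is worth scripting. The Python sketch below only builds the `bosh download manifest` command lines, one per deployment, using the command form shown on the slide. The function name and backup directory are assumptions, and the deployment names must be supplied by the caller (for example, copied from `bosh deployments`); parsing that command's output is not attempted here.

```python
from pathlib import Path
from typing import List


def manifest_backup_commands(deployments: List[str], backup_dir: str) -> List[List[str]]:
    """Return one 'bosh download manifest NAME FILE' command per deployment."""
    commands = []
    for name in deployments:
        # Each manifest is saved as <backup_dir>/<deployment name>.yml.
        target = str(Path(backup_dir) / f"{name}.yml")
        commands.append(["bosh", "download", "manifest", name, target])
    return commands


if __name__ == "__main__":
    for command in manifest_backup_commands(["cf-c700aee17d9f801eb152"], "backups"):
        # Printed rather than executed; pass each list to subprocess.run to run it.
        print(" ".join(command))
```

Keeping command construction separate from execution makes it easy to review what a scheduled backup job would run before letting it touch the director.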
SCENARIO TWO
LOSE BOSH
Restoring Bosh With PCF
1. Export Configuration
2. Import Configuration
:/var/tempest/workspaces/default/deployments/micro
BOSH Director + bosh.yml
Restoring Bosh Manually
BOSH
BOSH DB
bosh.yml
pg_dump /var/vcap/store
/dev/xvda
/dev/sdb
/dev/sdf
Volume:
BOSH DB
External MySQL
Blobstore
Critical Databases
Backup Cloud Controller DB Encryption Credentials
Locate Databases Info From Deployment Manifest
bosh download manifest cf-c700aee17d9f801eb152 cfmanifest.yml
NFS / Blobstore
✦ Managing Access with ACLs
✦ Create Group Bucket Policy for “Deny DeleteBucket”
✦ Turn on versioning
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": [
        "s3:DeleteBucket",
        "s3:DeleteObjectVersion"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
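The policy above denies the two delete actions on every resource. If a narrower scope is wanted, the same idea can be limited to a single blobstore bucket. The Python sketch below generates such a variant as an IAM identity policy; the function name and bucket name are assumptions, and this is an illustration of the scoping, not the policy the speakers deployed.

```python
import json
from typing import Any, Dict


def deny_delete_policy(bucket: str) -> Dict[str, Any]:
    """Build a policy denying bucket and object-version deletion for one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # DeleteBucket is a bucket-level action, so it targets the bucket ARN.
                "Effect": "Deny",
                "Action": ["s3:DeleteBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # DeleteObjectVersion is an object-level action, so it targets the keys.
                "Effect": "Deny",
                "Action": ["s3:DeleteObjectVersion"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket name.
    print(json.dumps(deny_delete_policy("my-cf-blobstore"), indent=2))
```

With versioning turned on, denying `s3:DeleteObjectVersion` means an ordinary delete only adds a delete marker, so earlier object versions remain recoverable.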
Takeaway
Takeaways
✦ Tradeoffs: No “One Size Fits All”
✦ Service Layer
✦ Existing: Environmental Security and Networking Constraints
✦ Backup: Configuration, Databases, Blobstore (This is your CF).
KEEP
CALM
AND
CF PUSH

Cloud Foundry Summit 2015: Building a Robust Cloud Foundry (HA, Security and DR)
