Upgrading under the weight of all that state
Quinton Anderson
Context

[Diagram: canonical model. Multiple source systems feed Raw Data and Business Data through access layers behind a load balancer; annotations read '//TODO Function' and 'Ctrl-V Scaling'.]
Downstream systems
• Specialised management systems
• Reporting systems
• Product management
• Channel & product systems
• Master Data Management
Hadoop
• Leverage all data & reduce integration costs
• Comprehensive dataset – internal & external, realtime & batch, structured & unstructured
• Advanced analytics / machine learning

Group Data Warehouse
• Understand our business
• Accurate, conformed, and reconciled data
• Access layer to support BI & reporting

BI/Reporting
• User-facing tools
• Regulatory reporting
• Dishoarding
• Self-service BI for the masses
[Diagram: data flows between the layers. Edge labels: 'All data' into the customer record & insights; 'Subset of data' into Financial Data; 'User access' and 'Information for people' out of BI/Reporting; 'Price, conversation, credit dec. etc.' out to channels as closed-loop, automated 'decisions'. Swim-lanes: core information repositories, analytics applications, other systems, channels.]

Core Financial Systems and functions
• P&L
• Recon
• General Ledger
• Etc…

Decisioning
• Personalise/optimise decisions, maximise customer value
• E.g. price, credit decision, next conversation, experience
[Diagram: serving and decisioning. Systems of record (core banking, payments, www channels) publish event streams; an event processor loads customer information into Hadoop (raw data, derived data, analytic records); data is analysed & processed through a feature store, event store, machine learning and scoring; rules drive decisioning, and insights & events are captured back. Everything is wired through an integration API / service discovery layer.]
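The loop under this diagram (load, analyse, capture) is easiest to see in code. A minimal sketch, assuming Kafka event streams via the kafka-python client and an in-memory stand-in for the feature store; the topic name, broker address, and event schema are illustrative:

import json
from kafka import KafkaConsumer

# Assumed topic, broker address, and JSON event schema.
consumer = KafkaConsumer("customer-events",
                         bootstrap_servers="kafka:9092",
                         value_deserializer=lambda v: json.loads(v))

feature_store = {}  # in-memory stand-in for the real feature store

for msg in consumer:
    event = msg.value                          # customer information data loaded
    cust = event["customer_id"]                # assumed field name
    feats = feature_store.setdefault(cust, {"txn_count": 0})
    feats["txn_count"] += 1                    # data analysed & processed
    # insights & events captured here would feed scoring, rules, decisioning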
> 4000 daily batch jobs
> 6 PB of state and growing
HBase, Cassandra, HDFS, InfluxDB, Elasticsearch, Kafka, etcd, ZooKeeper
OpenStack Swift
Oracle, MySQL, Postgres
Hundreds of services
MR1, MR2, Spark, Akka
Dev, Test, Staging, Prod 1, Prod 2, Etc…
== Complexity
Imperative: Culture, Architecture

• Immutable
• Someone else's computer
• State locality
• Workload non-locality
• Flexible over optimal
• Practically, it is a closed system
• State management is my problem
• All abstractions are leaky
[Diagram: platform stack, with Repo(s) + CI/CD feeding every layer. Apps (Spark, MR, Impala, etc.; Marathon, Chronos, Cassandra, etc.) run on Mesos/YARN over Docker + Calico; beneath that, OpenStack Nova on KVM and Nova/Ironic on bare metal, then the OS, and finally firmware + hardware + tags.]
Strategies
• Outsource the problem, and tool away the resulting issues
• Delete it, tool away the resulting issues
• Be stateless, tool away the resulting issues
• Implement some patterns, incrementally optimise; tool away the resulting issues
• Excess capacity
Patterns

[Diagram: blue/green cutover. A consumer hits a router; behind it sit the old web app + DB and a new web app + DB, and traffic is switched from old to new.]

[Diagram: rolling upgrade behind an L4 HAProxy. The pool starts as Old Old Old Old; a New node joins, then nodes are replaced one at a time until the pool is New New New New New.]
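A rolling upgrade like this needs each node drained before it is touched. A minimal sketch of the drain/restore steps, assuming HAProxy's runtime admin socket is enabled; the socket path, backend name, and server names are assumptions:

import socket

HAPROXY_SOCK = "/var/run/haproxy.sock"  # assumed admin socket path

def haproxy_cmd(cmd):
    # Send one command to the HAProxy runtime API and return its reply.
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(HAPROXY_SOCK)
        s.sendall(cmd.encode() + b"\n")
        return s.recv(4096).decode()

def drain(backend, server):
    # Stop new traffic to the node before upgrading it.
    haproxy_cmd(f"set server {backend}/{server} state maint")

def restore(backend, server):
    # Return the upgraded node to the pool.
    haproxy_cmd(f"set server {backend}/{server} state ready")

# drain("web", "old1"); upgrade the node; restore("web", "old1")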
== Incrementally accept risk
In-place upgrade
• Stateful
• CAP, PACELC
• Data models
• Atomicity
• Access patterns
• Implementation approaches = ??

Upgrade duration: O(N)
# Serial, in-place upgrade: one node at a time, O(N) in cluster size.
for node in nodes:
    if info[node]['instance']:
        # Only touch nodes the cluster can afford to lose right now.
        if Status(node).run().wait() == AVAILABLE_FOR_MAINTENANCE:
            MaintenanceMode(node).run().wait()
            Upgrade(node).run().wait()
            health = HealthTests(node).run().wait()
            UpdateStatus(node, health).run().wait()
# Health check for one node, against what appears to be the Cloudera
# Manager API (self.cdh): the node is healthy only if the host and every
# role placed on it report GOOD.
all_good = True
host = self.cdh.get_host(self.host_map[self.node_name])
if host.healthSummary != 'GOOD':
    all_good = False
# Look up the host by its roles
for c in self.cdh.get_all_clusters():
    for s in c.get_all_services():
        for r in s.get_all_roles():
            h = r.hostRef
            if h.hostId == self.host_map[self.node_name]:
                if r.healthSummary != 'GOOD':
                    all_good = False
return all_good
Upgrade duration: O(log N)
nodeComputation = for {
  _          <- Status(node)
  _          <- MaintenanceMode(node)
  _          <- Upgrade(node)
  nodeResult <- HealthTests(node)
} yield nodeResult

upgrade = for {
  node <- group
  comp <- nodeComputation(node)
} yield comp.exec

groups.map(upgrade)
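The same shape in Python, as a rough sketch: keep the per-node steps sequential, but run each group's nodes concurrently, so wall-clock time scales with the number of groups rather than the number of nodes; if group sizes grow each round (e.g. double), the number of sequential rounds tends toward O(log N). The helpers are the hypothetical ones from the O(N) loop above:

from concurrent.futures import ThreadPoolExecutor

def upgrade_node(node):
    # Same serial steps as the O(N) loop, for a single node.
    MaintenanceMode(node).run().wait()
    Upgrade(node).run().wait()
    return HealthTests(node).run().wait()   # truthy result == healthy

def upgrade_groups(groups):
    # Groups run one after another; nodes within a group run in parallel.
    for group in groups:
        with ThreadPoolExecutor(max_workers=len(group)) as pool:
            results = list(pool.map(upgrade_node, group))
        if not all(results):
            raise RuntimeError("group failed health checks; stop the rollout")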
[Platform stack diagram repeated, as above.]
Workflow

[Diagram: Jenkins flow per environment. A branch raises a PR and is merged to master; branches deploy to dev, master deploys to test, and production changes go through a change plan.]
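A rough sketch of driving that flow from code, assuming the python-jenkins client and hypothetical parameterised job names:

import jenkins

server = jenkins.Jenkins("https://jenkins.example.com",
                         username="upgrade-bot", password="api-token")

def deploy(environment, git_ref):
    # Each environment maps to an assumed parameterised Jenkins job.
    server.build_job(f"deploy-{environment}", {"GIT_REF": git_ref})

deploy("dev", "feature/green-cluster")  # branch deploy
deploy("test", "master")                # master deploy, ahead of the change plan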
clusters:
  green-cluster:
    dns:
      nameservers:
        - x.x.x.x
    data_domain: *.*.*
    etcd:
      token: green-cluster
    masters:
      able:
        provision_id: 1
        lan:
          -
            mac: 0c:c4:7a:c1:2e:92
            ip: 1.1.11.151/24
            vlan: 11
            gateway: 1.1.1.1
        ironic_id: a7af76ad-6583-4209-ba5f-cf1477b6405e
        flavor: ramish-baremetal-flavor2
        image: *mesos-master-green
      theta:
        provision_id: 2
        lan:
          -
            mac: 0c:c4:7a:a9:04:0c
            ip: 1.1.11.53/24
            vlan: 11
            gateway: 1.1.1.1
        ironic_id: 8ff1fd1c-4893-11e6-a447-2f366077ca0e
        flavor: ramish-baremetal-flavor2
        image: *mesos-master-green
      tobias:
        provision_id: 3
        lan:
          -
            mac: 0c:c4:7a:a8:f6:ac
            ip: 1.11.11.52/24
            vlan: 11
            gateway: 1.1.1.1
        ironic_id: c89fdd08-232c-40fe-b965-49fc3e4dcba7
        flavor: ramish-baremetal-flavor2
        image: *mesos-master-green
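A rough sketch of consuming such a change plan with PyYAML, assuming the redacted values above (the *.*.* domain and the *mesos-master-green alias) resolve in the real file; the filename and the provision hook are assumptions:

import yaml

with open("green-cluster.yml") as f:   # assumed filename
    plan = yaml.safe_load(f)

for cluster_name, cluster in plan["clusters"].items():
    for name, master in cluster["masters"].items():
        nic = master["lan"][0]
        print(f"{cluster_name}/{name}: provision_id={master['provision_id']} "
              f"ip={nic['ip']} flavor={master['flavor']}")
        # provision(master["ironic_id"], master["flavor"], master["image"])
        # (hypothetical hook into Nova/Ironic)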
Recommendations
• Instrument as much of deployment and provisioning as you can (a sketch follows this list)
• Optimise incrementally, learn the right hard lessons
• Allow for manual intervention, but attack it aggressively
• Encourage your people to intervene
• Prevent pets
• Spend more time on testing
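On the first recommendation, a minimal sketch of what instrumenting each step can look like, using only the standard library; the step names and the logging approach are illustrative, not the talk's tooling:

import logging
import time
from contextlib import contextmanager

log = logging.getLogger("upgrade")

@contextmanager
def timed(step, node):
    # Record the duration and outcome of one provisioning/upgrade step.
    start = time.monotonic()
    try:
        yield
        log.info("%s on %s ok in %.1fs", step, node, time.monotonic() - start)
    except Exception:
        log.error("%s on %s failed after %.1fs", step, node,
                  time.monotonic() - start)
        raise

# Usage:
# with timed("upgrade", "able"):
#     Upgrade("able").run().wait()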

Moving forward under the weight of all that state