LinkedIn has multiple data centers hosting tens of thousands of servers. A large percentage of these servers host our data infrastructure, and our distributed data store, Espresso, accounts for a sizeable share of them. These servers contain various hardware components including, but not limited to, SSDs, and hardware has a tendency to fail from time to time. When hardware fails, the server must undergo maintenance, which can take a significant amount of time depending on the type of failure. This reduces capacity for that duration and poses an interesting problem: maintaining capacity in the face of multiple concurrent failures. This talk covers how LinkedIn uses Camunda, wrapped with several in-house components, to achieve hands-off capacity management via multiple workflows with asynchronous pauses and synchronisation among them. It also highlights how we achieved seamless integration with various platforms and components within LinkedIn's infrastructure, and a few best practices that helped us reach the final state.
Server fleet management using Camunda by Akhil Ahuja
2. Outline
● Introduction
● LinkedIn
● Platform scale
● Technology scale
● Espresso (My team)
● Hardware
● Failure impact, frequency and recovery
● Challenges and need for automation
● How Camunda fits
● Solution story
● Our current deployment and future
3. Camunda has them all and more
● Akhil Ahuja
https://www.linkedin.com/in/akhilvahuja/
● 5+ years at LinkedIn - first job
● Worked on 2 data teams
● Current team: Espresso - NoSQL document store
● Working to integrate Camunda for a year and a half
5. Let’s talk about Platform Impact
● +35% y/y messages sent
● 2 new sign-ups per second
● 100+ M job applications every month
● 645+ M members across 200 countries
● 4 M confirmed hires in the last 12 months
● 13,500+ online learning courses
6. Technology Scale
● 60B graph edges
● 2+ PB data in the online document store
● 1.5k services in production
● ~340k edge QPS
● 200,000 servers in production
● ~6T Kafka messages processed daily
7. Espresso
● NoSQL document store
● Bridges the gap between RDBMS and key-value stores
● Use cases
■ Profiles, Invitations, InMails, etc.
● Relevant information
○ Different DBs for different use cases
○ Each database is divided into partitions
○ 3 replicas for each partition
○ 1 server -> n partitions
○ 1 server -> at most 1 replica per partition (sketched below)
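A small sketch of those placement rules in plain Java (all names here are made up for illustration): a server can host many partitions, but at most one replica of any single partition.

    import java.util.*;

    // Placement rules as described above: each partition has 3 replicas;
    // a server may host many partitions, but never two replicas of the
    // same partition.
    class PlacementCheck {
        static final int REPLICAS_PER_PARTITION = 3;

        // partition id -> servers holding one replica each
        static boolean isValid(Map<Integer, List<String>> placement) {
            for (List<String> servers : placement.values()) {
                if (servers.size() != REPLICAS_PER_PARTITION) return false;
                // a duplicate server would mean two replicas of one
                // partition on the same box
                if (new HashSet<>(servers).size() != servers.size()) return false;
            }
            return true;
        }
    }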
8. Espresso usage at LinkedIn (as of March 2019)
● Clusters: PROD 129, CORP 2, EI 8
● Databases: PROD ~300, CORP 16, EI 433
● Servers: PROD ~16k, CORP 52, EI >200
● Data size: PROD 2.7 PB / 9 PB, CORP 6 TB / 12 TB, EI 31 TB / 62 TB
9. Espresso Hardware
● Espresso runs on SSDs
○ Only data on SSDs
○ Service logs, etc. on spinning disks
● Many other components are critical to the functioning of the software
○ Motherboard
○ Memory
○ Disk
11. Impact of hardware failure
● Availability Dip
● Software shuts down
● Loss of redundancy
● Redundancy restored automatically after a few hours
● Higher disk utilisation across the cluster
● Fewer servers serving traffic
13. Challenges with HW recovery
● Long recovery cycles
○ Large number of servers
○ Vendor dependent in some cases
● Difficult to track recovery for all nodes
○ Jira-based system to track progress
● Bringing nodes back in rotation
○ Manual
○ Depends on notification from a person
○ Missing recovery steps can cause errors
14. Requirements for automation
● Automate human action. Twice
○ Taking the node OOR (out of rotation) on failure
○ Bringing it back after ticket close
● Automate ticket close detection
● Verification
○ Safe to take server OOR
○ Right software and configs are deployed
○ Health after server is brought back
● Lots of async waits
17. Solution
● Automate a 2 part workflow
○ Detect failures and take nodes OOR
○ Monitor hosts with issues and bring them back into rotation once fixed (sketched below)
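As a rough illustration of how such a two-part flow could look in Camunda (all identifiers below are made up, not LinkedIn's actual definitions), a message start event covers part 1, and a receive task parks each instance, possibly for weeks, until the fix signal arrives:

    import org.camunda.bpm.model.bpmn.Bpmn;
    import org.camunda.bpm.model.bpmn.BpmnModelInstance;

    // Sketch of the two-part recovery flow via Camunda's fluent model
    // builder; delegate class names are placeholders.
    class RecoveryFlowSketch {
        static BpmnModelInstance build() {
            return Bpmn.createExecutableProcess("hardwareRecovery")
                .startEvent()
                    .message("HardwareFailureDetected")   // part 1: failure signal
                .serviceTask("takeOor")
                    .name("Verify it is safe, then take node OOR")
                    .camundaClass("com.example.TakeOorDelegate")
                .receiveTask("awaitFix")
                    .name("Wait for hardware fix")
                    .message("HardwareIssueResolved")     // async wait, possibly weeks
                .serviceTask("restore")
                    .name("Verify software/config, bring node back")
                    .camundaClass("com.example.RestoreDelegate")
                .endEvent()
                .done();
        }
    }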
18. Entry barriers
● Programming language
○ SREs code in Python
○ Most SRE libraries are in Python
■ Needed Java equivalent libraries
● Integrating within LinkedIn's development environment
○ Challenges with UI
■ Jetty servers
■ Gradle
○ Authentication and authorization
● Learning curve
○ New automation engine concepts
Start small; let results drive the work
19. Automate part 2
● Signals
○ Messages in Camunda context
○ For nodes with HW failures
■ Kafka message handler
■ Kafka plugged into monitoring systems
○ For resolution of HW problems
■ Jira message handler (message correlation for both handlers is sketched below)
● Perform prep steps on node
○ Leveraged Salt over REST
● Find right software and config version
● Deploy
● Terminate
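A minimal sketch of how those two handlers could signal Camunda, reusing the made-up message names from the flow sketch above; correlation is keyed on the hostname as business key:

    import org.camunda.bpm.engine.RuntimeService;

    // Sketch of the Kafka and Jira handlers signalling Camunda.
    class HostEventHandlers {
        private final RuntimeService runtimeService;

        HostEventHandlers(RuntimeService runtimeService) {
            this.runtimeService = runtimeService;
        }

        // Kafka handler: a HW-failure event starts a new instance per host
        void onHardwareFailure(String hostname) {
            runtimeService.createMessageCorrelation("HardwareFailureDetected")
                .processInstanceBusinessKey(hostname)
                .correlateStartMessage();      // starts a new instance
        }

        // Jira handler: ticket resolution wakes the waiting instance
        void onTicketResolved(String hostname) {
            runtimeService.createMessageCorrelation("HardwareIssueResolved")
                .processInstanceBusinessKey(hostname)
                .correlate();                  // resumes at the receive task
        }
    }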
22. Notifications (in addition to service code)
● Notifications for success/failure
○ Used the history event handler; registering it requires a cast to ProcessEngineConfigurationImpl:

    ((ProcessEngineConfigurationImpl) processEngine.getProcessEngineConfiguration())
        .setHistoryEventHandler(new CompositeDbHistoryEventHandler(eventsNotificationHandler));

○ Inside the handler:

    if (historyEvent.isEventOfType(HistoryEventTypes.PROCESS_INSTANCE_END)) {
        // send success notification
    } else if (historyEvent.isEventOfType(HistoryEventTypes.INCIDENT_CREATE)) {
        // send failure notification
    }
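For completeness, a sketch of what eventsNotificationHandler might look like; the Notifier interface is a made-up placeholder for an internal notification client, not LinkedIn's actual code:

    import java.util.List;
    import org.camunda.bpm.engine.impl.history.event.HistoryEvent;
    import org.camunda.bpm.engine.impl.history.event.HistoryEventTypes;
    import org.camunda.bpm.engine.impl.history.handler.HistoryEventHandler;

    // Placeholder for the internal notification client
    interface Notifier {
        void sendSuccess(String processInstanceId);
        void sendFailure(String processInstanceId);
    }

    // Sends a notification when an instance ends or an incident is created
    class EventsNotificationHandler implements HistoryEventHandler {
        private final Notifier notifier;

        EventsNotificationHandler(Notifier notifier) {
            this.notifier = notifier;
        }

        @Override
        public void handleEvent(HistoryEvent historyEvent) {
            if (historyEvent.isEventOfType(HistoryEventTypes.PROCESS_INSTANCE_END)) {
                notifier.sendSuccess(historyEvent.getProcessInstanceId());
            } else if (historyEvent.isEventOfType(HistoryEventTypes.INCIDENT_CREATE)) {
                notifier.sendFailure(historyEvent.getProcessInstanceId());
            }
        }

        @Override
        public void handleEvents(List<HistoryEvent> historyEvents) {
            historyEvents.forEach(this::handleEvent);
        }
    }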
23. Glitches and fixes
1. Symptom - Spurious retries of jobs
○ OBSERVATION: Host prep job succeeded but was retried 3 times
○ REASON: job runtime > DEFAULT_JOB_LOCK_TIME, so the executor's lock expired and the still-running job was acquired again
○ SOLUTION: raise the job executor's lock time:

    ((ProcessEngineConfigurationImpl) processEngine.getProcessEngineConfiguration())
        .getJobExecutor().setLockTimeInMillis(SOME_HIGHER_NUMBER);

○ Caveat - applies to all jobs
2. Symptom - Incorrect notifications being sent
○ OBSERVATION: Several notifications per instance run
○ REASON: HistoryEventTypes initialized the enum incorrectly before 7.8
○ It was fixed here: https://github.com/camunda/camunda-bpm-platform/commit/9248a6a9a54d5a3204f963f0d7aa86d56c32bfa9#diff-a5a3a7f2bbf22567a6ff551cf2a39089
○ I wish I had reported and fixed it myself. Maybe the next one :)
24. Glitches and fixes - contd.
● Symptom: Failure to send messages via REST API
○ OBSERVATION: ClassNotFoundException when sending messages via the REST API
○ REASON: The synchronous task tries to load delegate code on the REST server, where it is not deployed
○ SOLUTION:
■ <property name="jobExecutorDeploymentAware">true</property>
● Thanks to Generali’s presentation at CamundaCon 2018
■ Asynchronous continuations and transaction boundaries
● Symptom: Duplicate workflows per host (business key)
○ OBSERVATION: Multiple instances trying to fix one host
○ REASON: We assumed Camunda enforces business key uniqueness; it does not
○ SOLUTION:
■ Check in the message handler and drop duplicates (sketched below)
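A minimal sketch of that duplicate check, assuming a definition key of "hardwareRecovery" (made up here); in practice the check-then-start should itself run under a lock to stay race-free:

    import org.camunda.bpm.engine.RuntimeService;

    class DedupStarter {
        private final RuntimeService runtimeService;

        DedupStarter(RuntimeService runtimeService) {
            this.runtimeService = runtimeService;
        }

        // Start a recovery workflow only if no instance already carries
        // this business key; otherwise drop the duplicate message.
        void startRecoveryIfAbsent(String hostname) {
            long running = runtimeService.createProcessInstanceQuery()
                .processDefinitionKey("hardwareRecovery")
                .processInstanceBusinessKey(hostname)
                .count();
            if (running == 0) {
                runtimeService.startProcessInstanceByKey("hardwareRecovery", hostname);
            }
        }
    }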
25. Metrics
● Implemented metrics as a part of the history event handler.
● Was blissfully unaware of https://docs.camunda.org/manual/7.8/user-guide/process-engine/metrics/
● We still need real-time duration-tracking metrics. Looking forward to hearing ideas from all of you
● How about https://jitpack.io/p/stephenott/camunda-prometheus-process-engine-plugin?
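For reference, the built-in engine metrics from the docs linked above can be read through the ManagementService; a small sketch (the wrapper class is hypothetical, the metric identifiers are the documented ones):

    import org.camunda.bpm.engine.ProcessEngine;

    class BuiltInMetricsProbe {
        // Reads two of the engine's built-in counters
        static void log(ProcessEngine engine) {
            long jobsExecuted = engine.getManagementService()
                .createMetricsQuery().name("job-successful").sum();
            long activitiesStarted = engine.getManagementService()
                .createMetricsQuery().name("activity-instance-start").sum();
            System.out.println("jobs=" + jobsExecuted
                + " activities=" + activitiesStarted);
        }
    }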
28. Things changed
● Time passed: >6 months
● Scope of problem increased
○ Deployments were impacted too
● New host pool management (HPM) design
○ Detect HW failures and take nodes OOR
○ Swap in a node from spare pool
■ Restores serving capacity
■ No nodes with old version of software in cluster
○ Monitor bad hosts and add them to spares on fix
31. Part 3 in production
● We changed the initial production workflow
○ On fix
■ Bring node OOR
■ Move node from repair pool to spare pool
32. Challenges with HPM
● Heterogeneous clusters, one spare pool
○ Node selection was time-consuming
■ Needed to verify hardware compatibility
○ Solution: One pool per type of HW.
■ Disks were the only concerning component
● Limited variety
■ Other things were largely homogenous
● Throttling
○ No more than X instances per day - a security requirement
○ Exceeding the limit must be avoided even with parallel requests
○ Solution:
■ Throttling was moved out of the workflow
■ Inside a lock in the message handler (sketched below)
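A minimal sketch of that handler-side throttle, with made-up names and cap; the shared lock is what keeps the daily limit intact under parallel requests:

    import org.camunda.bpm.engine.RuntimeService;

    class SwapThrottle {
        private static final int MAX_SWAPS_PER_DAY = 10;  // the "X" from the slide
        private final Object lock = new Object();
        private final RuntimeService runtimeService;
        private int swapsToday = 0;   // assume a daily timer resets this

        SwapThrottle(RuntimeService runtimeService) {
            this.runtimeService = runtimeService;
        }

        void onSwapRequest(String hostname) {
            synchronized (lock) {     // cap holds even under parallel requests
                if (swapsToday >= MAX_SWAPS_PER_DAY) {
                    return;           // reject; caller can retry tomorrow
                }
                swapsToday++;
            }
            runtimeService.startProcessInstanceByKey("hostSwap", hostname);
        }
    }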
33. Our deployment
● Database: MySQL
○ Managed by the MySQL team
● Camunda UI
○ Deployed inside a Jetty server
○ Built using Gradle
● Workflows
○ Each built as a separate Jetty deployable
● Wrapper product - Perseus
○ Linkedin integration libraries
○ Additional engine plugins
○ Camunda as a dependency (engine bootstrap sketched below)
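For illustration, a minimal sketch of how a wrapper like Perseus might bootstrap an embedded engine against MySQL and register extra plugins; the JDBC settings and the commented plugin are placeholders, not LinkedIn's actual configuration:

    import org.camunda.bpm.engine.ProcessEngine;
    import org.camunda.bpm.engine.ProcessEngineConfiguration;
    import org.camunda.bpm.engine.impl.cfg.ProcessEngineConfigurationImpl;

    class PerseusEngineBootstrap {
        // Builds an embedded engine backed by MySQL
        static ProcessEngine build() {
            ProcessEngineConfigurationImpl config = (ProcessEngineConfigurationImpl)
                ProcessEngineConfiguration.createStandaloneProcessEngineConfiguration()
                    .setJdbcDriver("com.mysql.cj.jdbc.Driver")
                    .setJdbcUrl("jdbc:mysql://db-host:3306/camunda")
                    .setJdbcUsername("camunda")
                    .setJdbcPassword("secret")
                    .setJobExecutorActivate(true);
            // Additional engine plugins (notifications, metrics) go here:
            // config.getProcessEnginePlugins().add(new EventsNotificationPlugin());
            return config.buildProcessEngine();
        }
    }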
34. Other use cases
● Database deletes
○ Automatic flow for deleting databases
○ Takes care of downstream calls
○ Monitors requests to the database before deleting
● Move databases
○ To help balance clusters
○ Takes care of updating upstream and downstream actions
● Cluster scan
○ Monitors nodes with no HW failures
○ Restarts service on these to bring back capacity
35. Future
● PerseusAsAService (PAAS)
○ A platform for automation
■ Multiple WF definitions as a single deployable
■ Design to production within a week by reusing libraries
● Cluster setup
○ For heavy workload tasks
■ Divide tasks among servers
■ Task -> Server mapping
● Metrics
○ Real-time metrics independent of commits to MySQL
○ As an engine plugin
■ Future workflows get them by default
○ Integrate with LinkedIn monitoring systems