LinkedIn has multiple data centers hosting tens of thousands of servers. A large percentage of these servers host our data infrastructure, and our distributed data store, Espresso, accounts for a sizeable share of them. These servers contain various hardware components including, but not limited to, SSDs, and hardware has a tendency to fail from time to time. When hardware fails, the server must undergo maintenance, which can take a significant amount of time depending on the type of failure. This reduces capacity for that duration and poses an interesting problem: maintaining capacity in the face of multiple concurrent failures. This talk covers how LinkedIn uses Camunda, wrapped with several in-house components, to achieve hands-off capacity management via multiple workflows with asynchronous pauses and synchronisation among them. It also highlights how we achieved seamless integration with various platforms and components within LinkedIn's infrastructure, and a few best practices that helped us reach the final state.
Server fleet management using Camunda by Akhil Ahuja
2. Outline
● Introduction
● LinkedIn
● Platform scale
● Technology scale
● Espresso (My team)
● Hardware
● Failure impact, frequency and recovery
● Challenges and need for automation
● How Camunda fits
● Solution story
● Our current deployment and future
3. Camunda has them all and more
● Akhil Ahuja
https://www.linkedin.com/in/akhilvahuja/
● 5+ years at LinkedIn - first job
● Worked on 2 data teams
● Current team: Espresso - NoSQL document store
● Working to integrate Camunda for a year and a half
5. Let’s talk about Platform Impact
● +35% y/y messages sent
● 2 new sign-ups per second
● 100+ M job applications every month
● 645+ M members across 200 countries
● 4 M confirmed hires in the last 12 months
● 13,500+ online learning courses
6. Technology Scale
● 60B graph edges
● 2+ PB data in the online document store
● 1.5k services in production
● ~340k edge QPS
● 200,000 servers in production
● ~6T Kafka messages processed daily
7. Espresso
● NoSQL document store
● Bridges the gap between RDBMS and key-value stores
● Use cases
■ Profiles, Invitations, InMails, etc.
● Relevant information
○ Different DBs for different use cases
○ Each database is divided into partitions
○ 3 replicas for each partition
○ 1 server -> n partitions
○ 1 server -> at most 1 replica per partition (sketched below)
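A small sketch of those placement rules in plain Java (all names here are made up for illustration): a server can host many partitions, but at most one replica of any single partition.

    import java.util.*;

    // Placement rules as described above: each partition has 3 replicas;
    // a server may host many partitions, but never two replicas of the
    // same partition.
    class PlacementCheck {
        static final int REPLICAS_PER_PARTITION = 3;

        // partition id -> servers holding one replica each
        static boolean isValid(Map<Integer, List<String>> placement) {
            for (List<String> servers : placement.values()) {
                if (servers.size() != REPLICAS_PER_PARTITION) return false;
                // a duplicate server would mean two replicas of one
                // partition on the same box
                if (new HashSet<>(servers).size() != servers.size()) return false;
            }
            return true;
        }
    }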
8. Espresso usage at LinkedIn (as of March 2019)
● Clusters: PROD 129, CORP 2, EI 8
● Databases: PROD ~300, CORP 16, EI 433
● Servers: PROD ~16k, CORP 52, EI >200
● Data size: PROD 2.7 PB / 9 PB, CORP 6 TB / 12 TB, EI 31 TB / 62 TB
9. Espresso Hardware
● Espresso runs on SSDs
○ Only data on SSDs
○ Service logs, etc. on spinning disks
● Many other components are critical to the functioning of the software
○ Motherboard
○ Memory
○ Disk
11. Impact of hardware failure
● Availability Dip
● Software shuts down
● Loss of redundancy
● Redundancy restored automatically after a few hours
● Higher disk utilisation across the cluster
● Fewer servers serving traffic
13. Challenges with HW recovery
● Long recovery cycles
○ Large number of servers
○ Vendor dependent in some cases
● Difficult to track recovery for all nodes
○ Jira-based system to track progress
● Bringing nodes back in rotation
○ Manual
○ Depends on notification from a person
○ Missing recovery steps can cause errors
14. Requirements for automation
● Automate human action. Twice
○ Taking the node OOR (out of rotation) on failure
○ Bringing it back after ticket close
● Automate ticket close detection
● Verification
○ Safe to take server OOR
○ Right software and configs are deployed
○ Health after server is brought back
● Lots of async waits
17. Solution
● Automate a 2 part workflow
○ Detect failures and take nodes OOR
○ Monitor hosts with issues and bring them back into rotation once fixed (sketched below)
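As a rough illustration of how such a two-part flow could look in Camunda (all identifiers below are made up, not LinkedIn's actual definitions), a message start event covers part 1, and a receive task parks each instance, possibly for weeks, until the fix signal arrives:

    import org.camunda.bpm.model.bpmn.Bpmn;
    import org.camunda.bpm.model.bpmn.BpmnModelInstance;

    // Sketch of the two-part recovery flow via Camunda's fluent model
    // builder; delegate class names are placeholders.
    class RecoveryFlowSketch {
        static BpmnModelInstance build() {
            return Bpmn.createExecutableProcess("hardwareRecovery")
                .startEvent()
                    .message("HardwareFailureDetected")   // part 1: failure signal
                .serviceTask("takeOor")
                    .name("Verify it is safe, then take node OOR")
                    .camundaClass("com.example.TakeOorDelegate")
                .receiveTask("awaitFix")
                    .name("Wait for hardware fix")
                    .message("HardwareIssueResolved")     // async wait, possibly weeks
                .serviceTask("restore")
                    .name("Verify software/config, bring node back")
                    .camundaClass("com.example.RestoreDelegate")
                .endEvent()
                .done();
        }
    }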
18. Entry barriers
● Programming language
○ SREs code in Python
○ Most SRE libraries are in Python
■ Needed Java equivalent libraries
● Integrating within LinkedIn's development environment
○ Challenges with UI
■ Jetty servers
■ Gradle
○ Authentication and authorization
● Learning curve
○ New automation engine concepts
Start small; let results drive the work
19. Automate part 2
● Signals
○ Messages in Camunda context
○ For nodes with HW failures
■ Kafka message handler
■ Kafka plugged into monitoring systems
○ For resolution of HW problems
■ Jira message handler (message correlation for both handlers is sketched below)
● Perform prep steps on node
○ Leveraged Salt over REST
● Find right software and config version
● Deploy
● Terminate
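A minimal sketch of how those two handlers could signal Camunda, reusing the made-up message names from the flow sketch above; correlation is keyed on the hostname as business key:

    import org.camunda.bpm.engine.RuntimeService;

    // Sketch of the Kafka and Jira handlers signalling Camunda.
    class HostEventHandlers {
        private final RuntimeService runtimeService;

        HostEventHandlers(RuntimeService runtimeService) {
            this.runtimeService = runtimeService;
        }

        // Kafka handler: a HW-failure event starts a new instance per host
        void onHardwareFailure(String hostname) {
            runtimeService.createMessageCorrelation("HardwareFailureDetected")
                .processInstanceBusinessKey(hostname)
                .correlateStartMessage();      // starts a new instance
        }

        // Jira handler: ticket resolution wakes the waiting instance
        void onTicketResolved(String hostname) {
            runtimeService.createMessageCorrelation("HardwareIssueResolved")
                .processInstanceBusinessKey(hostname)
                .correlate();                  // resumes at the receive task
        }
    }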
22. Notifications (in addition to service code)
● Notifications for success/failure
○ Used the history event handler; registering it requires a cast to ProcessEngineConfigurationImpl:

    ((ProcessEngineConfigurationImpl) processEngine.getProcessEngineConfiguration())
        .setHistoryEventHandler(new CompositeDbHistoryEventHandler(eventsNotificationHandler));

○ Inside the handler:

    if (historyEvent.isEventOfType(HistoryEventTypes.PROCESS_INSTANCE_END)) {
        // send success notification
    } else if (historyEvent.isEventOfType(HistoryEventTypes.INCIDENT_CREATE)) {
        // send failure notification
    }
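For completeness, a sketch of what eventsNotificationHandler might look like; the Notifier interface is a made-up placeholder for an internal notification client, not LinkedIn's actual code:

    import java.util.List;
    import org.camunda.bpm.engine.impl.history.event.HistoryEvent;
    import org.camunda.bpm.engine.impl.history.event.HistoryEventTypes;
    import org.camunda.bpm.engine.impl.history.handler.HistoryEventHandler;

    // Placeholder for the internal notification client
    interface Notifier {
        void sendSuccess(String processInstanceId);
        void sendFailure(String processInstanceId);
    }

    // Sends a notification when an instance ends or an incident is created
    class EventsNotificationHandler implements HistoryEventHandler {
        private final Notifier notifier;

        EventsNotificationHandler(Notifier notifier) {
            this.notifier = notifier;
        }

        @Override
        public void handleEvent(HistoryEvent historyEvent) {
            if (historyEvent.isEventOfType(HistoryEventTypes.PROCESS_INSTANCE_END)) {
                notifier.sendSuccess(historyEvent.getProcessInstanceId());
            } else if (historyEvent.isEventOfType(HistoryEventTypes.INCIDENT_CREATE)) {
                notifier.sendFailure(historyEvent.getProcessInstanceId());
            }
        }

        @Override
        public void handleEvents(List<HistoryEvent> historyEvents) {
            historyEvents.forEach(this::handleEvent);
        }
    }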
23. Glitches and fixes
1. Symptom - Spurious retries of jobs
○ OBSERVATION: Host prep job succeeded but was retried 3 times
○ REASON: job runtime > DEFAULT_JOB_LOCK_TIME, so the executor's lock expired and the still-running job was acquired again
○ SOLUTION: raise the job executor's lock time:

    ((ProcessEngineConfigurationImpl) processEngine.getProcessEngineConfiguration())
        .getJobExecutor().setLockTimeInMillis(SOME_HIGHER_NUMBER);

○ Caveat - applies to all jobs
2. Symptom - Incorrect notifications being sent
○ OBSERVATION: Several notifications per instance run
○ REASON: HistoryEventTypes initialized the enum incorrectly before 7.8
○ It was fixed here: https://github.com/camunda/camunda-bpm-platform/commit/9248a6a9a54d5a3204f963f0d7aa86d56c32bfa9#diff-a5a3a7f2bbf22567a6ff551cf2a39089
○ I wish I had reported and fixed it myself. Maybe the next one :)
24. Glitches and fixes - contd.
● Symptom: Failure to send messages via REST API
○ OBSERVATION: ClassNotFoundException when sending messages via the REST API
○ REASON: The synchronous task tries to load delegate code on the REST server, where it is not deployed
○ SOLUTION:
■ <property name="jobExecutorDeploymentAware">true</property>
● Thanks to Generali’s presentation at CamundaCon 2018
■ Asynchronous continuations and transaction boundaries
● Symptom: Duplicate workflows per host (business key)
○ OBSERVATION: Multiple instances trying to fix one host
○ REASON: We assumed Camunda enforces business key uniqueness; it does not
○ SOLUTION:
■ Check in the message handler and drop duplicates (sketched below)
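A minimal sketch of that duplicate check, assuming a definition key of "hardwareRecovery" (made up here); in practice the check-then-start should itself run under a lock to stay race-free:

    import org.camunda.bpm.engine.RuntimeService;

    class DedupStarter {
        private final RuntimeService runtimeService;

        DedupStarter(RuntimeService runtimeService) {
            this.runtimeService = runtimeService;
        }

        // Start a recovery workflow only if no instance already carries
        // this business key; otherwise drop the duplicate message.
        void startRecoveryIfAbsent(String hostname) {
            long running = runtimeService.createProcessInstanceQuery()
                .processDefinitionKey("hardwareRecovery")
                .processInstanceBusinessKey(hostname)
                .count();
            if (running == 0) {
                runtimeService.startProcessInstanceByKey("hardwareRecovery", hostname);
            }
        }
    }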
25. Metrics
● Implemented metrics as a part of the history event handler.
● Was blissfully unaware of https://docs.camunda.org/manual/7.8/user-guide/process-engine/metrics/
● We still need real-time duration-tracking metrics. Looking forward to hearing ideas from all of you
● How about https://jitpack.io/p/stephenott/camunda-prometheus-process-engine-plugin?
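For reference, the built-in engine metrics from the docs linked above can be read through the ManagementService; a small sketch (the wrapper class is hypothetical, the metric identifiers are the documented ones):

    import org.camunda.bpm.engine.ProcessEngine;

    class BuiltInMetricsProbe {
        // Reads two of the engine's built-in counters
        static void log(ProcessEngine engine) {
            long jobsExecuted = engine.getManagementService()
                .createMetricsQuery().name("job-successful").sum();
            long activitiesStarted = engine.getManagementService()
                .createMetricsQuery().name("activity-instance-start").sum();
            System.out.println("jobs=" + jobsExecuted
                + " activities=" + activitiesStarted);
        }
    }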
28. Things changed
● Time passed: >6 months
● Scope of problem increased
○ Deployments were impacted too
● New host pool management (HPM) design
○ Detect HW failures and take nodes OOR
○ Swap in a node from spare pool
■ Restores serving capacity
■ No nodes with old version of software in cluster
○ Monitor bad hosts and add them to spares on fix
31. Part 3 in production
● We changed the initial production workflow
○ On fix
■ Bring node OOR
■ Move node from repair pool to spare pool
32. Challenges with HPM
● Heterogeneous clusters, one spare pool
○ Node selection was time-consuming
■ Needed to verify hardware compatibility
○ Solution: One pool per type of HW.
■ Disks were the only concerning component
● Limited variety
■ Other things were largely homogenous
● Throttling
○ No more than X instances per day - a security requirement
○ Exceeding the limit must be avoided even with parallel requests
○ Solution:
■ Throttling was moved out of the workflow
■ Inside a lock in the message handler (sketched below)
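A minimal sketch of that handler-side throttle, with made-up names and cap; the shared lock is what keeps the daily limit intact under parallel requests:

    import org.camunda.bpm.engine.RuntimeService;

    class SwapThrottle {
        private static final int MAX_SWAPS_PER_DAY = 10;  // the "X" from the slide
        private final Object lock = new Object();
        private final RuntimeService runtimeService;
        private int swapsToday = 0;   // assume a daily timer resets this

        SwapThrottle(RuntimeService runtimeService) {
            this.runtimeService = runtimeService;
        }

        void onSwapRequest(String hostname) {
            synchronized (lock) {     // cap holds even under parallel requests
                if (swapsToday >= MAX_SWAPS_PER_DAY) {
                    return;           // reject; caller can retry tomorrow
                }
                swapsToday++;
            }
            runtimeService.startProcessInstanceByKey("hostSwap", hostname);
        }
    }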
33. Our deployment
● Database: MySQL
○ Managed by the MySQL team
● Camunda UI
○ Deployed inside a Jetty server
○ Built using Gradle
● Workflows
○ Each built as a separate Jetty deployable
● Wrapper product - Perseus
○ Linkedin integration libraries
○ Additional engine plugins
○ Camunda as a dependency (engine bootstrap sketched below)
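For illustration, a minimal sketch of how a wrapper like Perseus might bootstrap an embedded engine against MySQL and register extra plugins; the JDBC settings and the commented plugin are placeholders, not LinkedIn's actual configuration:

    import org.camunda.bpm.engine.ProcessEngine;
    import org.camunda.bpm.engine.ProcessEngineConfiguration;
    import org.camunda.bpm.engine.impl.cfg.ProcessEngineConfigurationImpl;

    class PerseusEngineBootstrap {
        // Builds an embedded engine backed by MySQL
        static ProcessEngine build() {
            ProcessEngineConfigurationImpl config = (ProcessEngineConfigurationImpl)
                ProcessEngineConfiguration.createStandaloneProcessEngineConfiguration()
                    .setJdbcDriver("com.mysql.cj.jdbc.Driver")
                    .setJdbcUrl("jdbc:mysql://db-host:3306/camunda")
                    .setJdbcUsername("camunda")
                    .setJdbcPassword("secret")
                    .setJobExecutorActivate(true);
            // Additional engine plugins (notifications, metrics) go here:
            // config.getProcessEnginePlugins().add(new EventsNotificationPlugin());
            return config.buildProcessEngine();
        }
    }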
34. Other use cases
● Database deletes
○ Automatic flow for deleting databases
○ Takes care of downstream calls
○ Monitors requests to the database before deleting
● Move databases
○ To help balance clusters
○ Takes care of updating upstream and downstream actions
● Cluster scan
○ Monitors nodes with no HW failures
○ Restarts service on these to bring back capacity
35. Future
● PerseusAsAService (PAAS)
○ A platform for automation
■ Multiple WF definitions as a single deployable
■ Design to production within a week by reusing libraries
● Cluster setup
○ For heavy workload tasks
■ Divide tasks among servers
■ Task -> Server mapping
● Metrics
○ Real-time metrics independent of commits to MySQL
○ As an engine plugin
■ Future workflows get them by default
○ Integrate with LinkedIn monitoring systems