Outline
● Introduction
● LinkedIn
● Platform scale
● Technology scale
● Espresso (My team)
● Hardware
● Failure impact, frequency and recovery
● Challenges and need for automation
● How Camunda fits
● Solution story
● Our current deployment and Future
Camunda has them all and more
● Akhil Ahuja
https://www.linkedin.com/in/akhilvahuja/
● 5+ years at LinkedIn - First job
● Worked on 2 data teams
● Current team: Espresso - nosql document store
● Working to integrate Camunda for a year and a half
LinkedIn
Let’s talk about Platform Impact
+ 35%
y/y messages sent
2
New Sign-ups per second
100+ M
Job Applications every month
645+ M
Members across 200 countries
4 M
confirmed hires in last 12 months
13500+
Online Learning courses
60B
Graph Edges
2+PB
Data in the online document store
Technology Scale
1.5k
Services in production
~340k
Edge QPS
200,000
Servers in production
~6T
Kafka messages processed daily
Espresso
● NoSQL document store
● Bridges gap between RDBMS & k-v stores
● Use cases
■ Profiles, Invitations, InMails, etc.
● Relevant information
○ Different DBs for different use cases
○ Database divided in partitions
○ 3 replicas for each partition
○ 1 server -> n partitions
○ 1 server -> max 1 replica per partition
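The placement rules above (3 replicas per partition, at most 1 replica of a given partition per server) can be checked mechanically. The following is a minimal sketch, not Espresso's actual code: the class name, method name, and the map-based layout are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical checker for the two placement rules; not taken from Espresso. */
final class ReplicaPlacement {
    static final int REPLICAS_PER_PARTITION = 3;

    /**
     * Input maps a partition id to the servers hosting its replicas.
     * Valid means: every partition has exactly 3 replicas, and no server
     * appears twice for the same partition.
     */
    static boolean isValid(Map<Integer, List<String>> serversByPartition) {
        for (List<String> servers : serversByPartition.values()) {
            if (servers.size() != REPLICAS_PER_PARTITION) {
                return false; // too few or too many replicas
            }
            if (new HashSet<>(servers).size() != servers.size()) {
                return false; // one server holds two replicas of this partition
            }
        }
        return true;
    }
}
```

A server may still appear under many different partitions, which matches the "1 server -> n partitions" rule.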
Espresso usage at LinkedIn
As of March 2019
Clusters
● PROD: 129
● CORP: 2
● EI: 8
Number of servers
• PROD: ~16k servers
• CORP: 52 servers
• EI: >200 servers
Databases
● PROD: ~300
● CORP: 16
● EI: 433
Data size
• PROD: 2.7PB 9PB
• CORP: 6TB 12TB
• EI: 31TB 62TB
Espresso Hardware
● Espresso runs on SSDs
○ Only data on SSDs
○ Service logs, etc. on spinning disks
● Lots of other components are critical to functioning of the software
○ Motherboard
○ Memory
○ Disk
Impact of hardware failure
● Availability Dip
● Software shuts down
● Loss of redundancy
● Redundancy restored automatically after a few hours
● Higher disk utilisation on cluster
● Fewer servers for serving traffic
Failure frequency
Challenges with HW recovery
● Long recovery cycles
○ Large number of servers
○ Vendor dependent in some cases
● Difficult to track recovery for all nodes
○ Jira based system to track progress
● Bringing nodes back in rotation
○ Manual
○ Depends on notification from a person
○ Missing recovery steps can cause errors
Requirements for automation
● Automate human action. Twice
○ Taking the node out of rotation (OOR) on failure
○ Bringing it back after ticket close
● Automate ticket close detection
● Verification
○ Safe to take server OOR
○ Right software and configs are deployed
○ Health after server is brought back
● Lots of async waits
Features needed
Async wait
Solution
● Automate a 2 part workflow
○ Detect failures and take nodes OOR
○ Monitor hosts with issues and bring them back in rotation on fix
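One way to picture the two-part workflow is as a small state machine per host. This is an illustrative sketch only: the state and event names are hypothetical and are not taken from the BPMN models shown in the talk.

```java
/** Hypothetical host lifecycle for the two-part workflow; names are illustrative. */
final class HostLifecycle {
    enum State { IN_ROTATION, OUT_OF_ROTATION, READY_FOR_REDEPLOY }

    enum Event { FAILURE_DETECTED, TICKET_CLOSED, HEALTH_VERIFIED }

    /** Returns the next state, or throws if the event is not allowed in the current state. */
    static State next(State current, Event event) {
        switch (current) {
            case IN_ROTATION:
                // Part 1: a detected failure takes the node out of rotation.
                if (event == Event.FAILURE_DETECTED) return State.OUT_OF_ROTATION;
                break;
            case OUT_OF_ROTATION:
                // Part 2 starts once the repair ticket is closed.
                if (event == Event.TICKET_CLOSED) return State.READY_FOR_REDEPLOY;
                break;
            case READY_FOR_REDEPLOY:
                // After deploy and health verification the node serves traffic again.
                if (event == Event.HEALTH_VERIFIED) return State.IN_ROTATION;
                break;
        }
        throw new IllegalStateException(event + " is not valid in state " + current);
    }
}
```

Each transition waits on an external signal, which is where the async waits mentioned in the requirements come in.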
Entry barriers
● Programming language
○ SREs code in python
○ Most SRE libraries in python
■ Needed Java equivalent libraries
● Integrating within Linkedin development environment
○ Challenges with UI
■ Jetty servers
■ Gradle
○ Authentication and authorization
● Learning curve
○ New automation engine concepts
Start small, let results drive work
Automate part 2
● Signals
○ Messages in Camunda context
○ For nodes with HW failures
■ Kafka message handler
■ Kafka plugged into monitoring systems
○ For resolution of HW problems
■ Jira message handler
● Perform prep steps on node
○ Leveraged Salt over REST
● Find right software and config version
● Deploy
● Terminate
Initial design 1
Design 2 (Final design)
Business key: hostname
Notifications (in addition to service code)
● Notifications for success/failure
○ Used a history event handler:
((ProcessEngineConfigurationImpl) processEngine.getProcessEngineConfiguration())
    .setHistoryEventHandler(new CompositeDbHistoryEventHandler(eventsNotificationHandler));
if (historyEvent.isEventOfType(HistoryEventTypes.PROCESS_INSTANCE_END)) {
    // SEND_SUCCESS_NOTIFICATION
} else if (historyEvent.isEventOfType(HistoryEventTypes.INCIDENT_CREATE)) {
    // SEND_FAILURE_NOTIFICATION
}
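The composite handler pattern used here can be sketched without the engine on the classpath. Everything below is a stand-in written for illustration: `EventType`, `EventHandler`, `CompositeHandler`, and `NotificationHandler` are hypothetical types, not Camunda classes. The idea is that one composite forwards every event to all registered handlers, and the notification handler reacts only to the two event types it cares about.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for the two history event types checked on the slide; not Camunda's enum. */
enum EventType { PROCESS_INSTANCE_END, INCIDENT_CREATE, OTHER }

/** Stand-in for a history event handler: it receives every event that is emitted. */
interface EventHandler {
    void handle(EventType type, String hostname);
}

/** Forwards each event to every registered handler, in registration order. */
final class CompositeHandler implements EventHandler {
    private final List<EventHandler> delegates;

    CompositeHandler(List<EventHandler> delegates) {
        this.delegates = new ArrayList<>(delegates);
    }

    @Override
    public void handle(EventType type, String hostname) {
        for (EventHandler delegate : delegates) {
            delegate.handle(type, hostname);
        }
    }
}

/** Records a success or failure notification; all other event types are ignored. */
final class NotificationHandler implements EventHandler {
    final List<String> sent = new ArrayList<>();

    @Override
    public void handle(EventType type, String hostname) {
        if (type == EventType.PROCESS_INSTANCE_END) {
            sent.add("SUCCESS " + hostname);
        } else if (type == EventType.INCIDENT_CREATE) {
            sent.add("FAILURE " + hostname);
        }
    }
}
```

In the real engine the composite also keeps the default database handler, so history is still written while notifications are sent.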
Glitches and fixes
1. Symptom - Spurious retry of jobs.
○ OBSERVATION: Host prep job succeeded but was retried 3 times
○ REASON: JOB TIME > DEFAULT_JOB_LOCK_TIME
○ SOLUTION:
((ProcessEngineConfigurationImpl) processEngine.getProcessEngineConfiguration())
    .getJobExecutor().setLockTimeInMillis(SOME_HIGHER_NUMBER);
○ Caveat - applies to all jobs
2. Symptom - Incorrect notifications being sent
○ OBSERVATION: Several notifications per instance run
○ REASON: HistoryEventTypes initializes the enum incorrectly before 7.8
○ It was fixed here: https://github.com/camunda/camunda-bpm-platform/commit/9248a6a9a54d5a3204f963f0d7aa86d56c32bfa9#diff-a5a3a7f2bbf22567a6ff551cf2a39089
○ I wish I had reported and fixed it. Maybe the next one :)
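The first glitch comes down to simple arithmetic: if a job runs longer than its lock, the lock expires while the job is still running and the job can be picked up again. The sketch below illustrates that check. The class and method names are hypothetical; the 5 minute default is the job executor's documented default lock time in Camunda 7, and the margin parameter is an assumption added for illustration.

```java
/** Illustrative arithmetic for the job lock glitch; not engine code. */
final class JobLockSketch {
    /** Default job lock time in Camunda 7: 5 minutes, in milliseconds. */
    static final long DEFAULT_LOCK_TIME_MS = 5 * 60 * 1000L;

    /** True if the lock expires before the job finishes, so the job can be acquired again. */
    static boolean canBeReacquiredWhileRunning(long jobDurationMs, long lockTimeMs) {
        return jobDurationMs > lockTimeMs;
    }

    /** Lock time that covers the worst-case job duration plus a safety margin. */
    static long requiredLockTimeMs(long worstCaseJobDurationMs, long marginMs) {
        return Math.addExact(worstCaseJobDurationMs, marginMs);
    }
}
```

Because the lock time is set on the job executor, raising it affects every job, which is the caveat noted above.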
Glitches and fixes - contd.
● Symptom: Failure to send messages via REST API
○ OBSERVATION: On using REST API to send - ClassNotFoundException
○ REASON: Sync task tries to find code on REST server
○ SOLUTION:
■ <property name="jobExecutorDeploymentAware">true</property>
● Thanks to Generali’s presentation at CamundaCon 2018
■ Asynchronous continuations and transaction boundaries
● Symptom: Duplicate workflow per host (business key).
○ OBSERVATION: Multiple instances to fix one host
○ REASON: Assumption that Camunda enforces business key uniqueness
○ SOLUTION:
■ Check in message handler and drop duplicates
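Since the engine does not enforce unique business keys, the message handler has to drop duplicates itself. The following is a minimal in-memory sketch of that check, assuming the hostname is the business key. The class and method names are hypothetical, and a real handler might instead ask the engine whether an active instance with that business key already exists.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical duplicate filter keyed by hostname (the business key). */
final class HostMessageDeduplicator {
    // Concurrent set so parallel message handler threads see a consistent view.
    private final Set<String> activeHosts = ConcurrentHashMap.newKeySet();

    /** Returns true if a workflow should start; false if one is already active for this host. */
    boolean tryStart(String hostname) {
        return activeHosts.add(hostname);
    }

    /** Call when the workflow for the host ends, so a later failure can start a new one. */
    void finished(String hostname) {
        activeHosts.remove(hostname);
    }
}
```

`Set.add` on the concurrent key set is atomic, so two simultaneous failure messages for the same host cannot both start a workflow.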
Metrics
● Implemented metrics as a part of the history event handler.
● Was blissfully unaware of https://docs.camunda.org/manual/7.8/user-guide/process-engine/metrics/
● Need more metrics for duration tracking in real time. Looking forward to hearing from all of you
● How about https://jitpack.io/p/stephenott/camunda-prometheus-process-engine-plugin?
Success
Saved a month of manual work for an SRE in the first 2 quarters
Don’t forget part 1
Things changed
● Time passed: >6 months
● Scope of problem increased
○ Deployments were impacted too
● New host pool management(HPM) design
○ Detect HW failures and take nodes OOR
○ Swap in a node from spare pool
■ Restores serving capacity
■ No nodes with old version of software in cluster
○ Monitor bad hosts and add them to spares on fix
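The pool movements in the HPM design can be sketched as three collections: serving hosts, spares, and hosts under repair. This is a simplified illustration under assumed names (`HostPools`, `onHardwareFailure`, `onFixed`), not the production design.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

/** Hypothetical model of serving, spare, and repair pools. */
final class HostPools {
    private final Set<String> serving = new LinkedHashSet<>();
    private final Deque<String> spares = new ArrayDeque<>();
    private final Set<String> repair = new LinkedHashSet<>();

    HostPools(Collection<String> servingHosts, Collection<String> spareHosts) {
        serving.addAll(servingHosts);
        spares.addAll(spareHosts);
    }

    /** Moves the failed host to repair and swaps in a spare if one is available. */
    Optional<String> onHardwareFailure(String host) {
        if (!serving.remove(host)) {
            throw new IllegalArgumentException(host + " is not serving");
        }
        repair.add(host);
        String replacement = spares.pollFirst(); // null when the spare pool is empty
        if (replacement != null) {
            serving.add(replacement);
        }
        return Optional.ofNullable(replacement);
    }

    /** A fixed host joins the spare pool rather than going straight back to serving. */
    void onFixed(String host) {
        if (!repair.remove(host)) {
            throw new IllegalArgumentException(host + " is not under repair");
        }
        spares.addLast(host);
    }

    int servingCount() { return serving.size(); }
    int spareCount() { return spares.size(); }
    int repairCount() { return repair.size(); }
}
```

Swapping in a spare immediately keeps serving capacity constant while the failed host waits for repair.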
Part 1 in production
Part 2 in production
Part 3 in production
● We changed initial production workflow
○ On fix
■ Bring node OOR
■ Move node from repair pool to spare pool
Challenges with HPM
● Heterogeneous clusters, one spare pool
○ Node selection was time consuming
■ Needed to verify hardware compatibility
○ Solution: One pool per type of HW.
■ Disks were the only concerning component
● Limited variety
■ Other things were largely homogeneous
● Throttling
○ Not more than X instances per day - security
○ Needs to be avoided despite parallel requests
○ Solution:
■ Throttling was moved out of workflow
■ Inside a lock in message handler
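Throttling inside a lock in the message handler might look like the sketch below: a daily counter whose check and increment happen atomically, so parallel requests cannot exceed the cap. The class name and the choice of passing the date in explicitly are assumptions made to keep the example small and testable.

```java
import java.time.LocalDate;

/** Hypothetical daily cap on workflow starts; check and increment share one lock. */
final class DailyThrottle {
    private final int maxPerDay;
    private LocalDate currentDay; // null until the first request arrives
    private int startedToday;

    DailyThrottle(int maxPerDay) {
        if (maxPerDay < 0) {
            throw new IllegalArgumentException("maxPerDay must not be negative");
        }
        this.maxPerDay = maxPerDay;
    }

    /** Reserves one slot for the given day; returns false once the cap is reached. */
    synchronized boolean tryAcquire(LocalDate today) {
        if (!today.equals(currentDay)) {
            currentDay = today; // first request of a new day resets the counter
            startedToday = 0;
        }
        if (startedToday >= maxPerDay) {
            return false;
        }
        startedToday++;
        return true;
    }
}
```

The handler would call `tryAcquire` before starting an instance and drop or defer the message when it returns false.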
Our deployment
● Database: MySQL
○ Managed by MySQL team
● Camunda UI
○ Deployed inside a jetty server
○ Built using gradle
● Workflows
○ Each built as a separate jetty deployable
● Wrapper product - Perseus
○ LinkedIn integration libraries
○ Additional engine plugins
○ Camunda as dependency
Other use cases
● Database deletes
○ Automatic flow for deleting databases
○ Takes care of downstream calls
○ Monitors requests to database before deleting
● Move databases
○ To help balance clusters
○ Takes care of updating upstream and downstream actions
● Cluster scan
○ Monitors nodes with no HW failures
○ Restarts service on these to bring back capacity
Future
● PerseusAsAService(PAAS)
○ A platform for automation
■ Multiple WF definitions as a single deployable
■ Design to production within a week reusing libraries
● Cluster setup
○ For heavy workload tasks
■ Divide tasks among servers
■ Task -> Server mapping
● Metrics
○ Real time metrics independent of commits to MySQL
○ As an engine plugin
■ Future workflows get them by default
○ Integrate with LI monitoring systems
Server fleet management using Camunda by Akhil Ahuja

 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Recently uploaded (20)

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

Server fleet management using Camunda by Akhil Ahuja

  • 2. Outline
      ● Introduction
      ● LinkedIn
      ● Platform scale
      ● Technology scale
      ● Espresso (my team)
      ● Hardware
      ● Failure impact, frequency and recovery
      ● Challenges and need for automation
      ● How Camunda fits
      ● Solution story
      ● Our current deployment and future
  • 3. Camunda has them all and more
      ● Akhil Ahuja https://www.linkedin.com/in/akhilvahuja/
      ● 5+ years at LinkedIn (first job)
      ● Worked on 2 data teams
      ● Current team: Espresso, a NoSQL document store
      ● Working to integrate Camunda for a year and a half
  • 5. Let’s talk about Platform Impact
      ● +35% y/y messages sent
      ● 2 new sign-ups per second
      ● 100+ M job applications every month
      ● 645+ M members across 200 countries
      ● 4 M confirmed hires in last 12 months
      ● 13,500+ online learning courses
  • 6. Technology Scale
      ● 60B graph edges
      ● 2+ PB data in the online document store
      ● 1.5k services in production
      ● ~340k edge QPS
      ● 200,000 servers in production
      ● ~6T Kafka messages processed daily
  • 7. Espresso
      ● NoSQL document store
      ● Bridges gap between RDBMS & k-v stores
      ● Use cases
          ■ Profiles, Invitations, InMails, etc.
      ● Relevant information
          ○ Different DBs for different use cases
          ○ Database divided into partitions
          ○ 3 replicas for each partition
          ○ 1 server -> n partitions
          ○ 1 server -> max 1 replica per partition
  • 8. Espresso usage at LinkedIn (as of March 2019) Clusters Data size Number of servers Databases ● PROD: 129 ● CORP: 2 ● EI: 8 • PROD: ~16k servers • CORP: 52 servers • EI: >200 servers ● PROD: ~300 ● CORP: 16 ● EI: 433 • PROD: 2.7PB 9PB • CORP: 6TB 12TB • EI: 31TB 62TB
  • 9. Espresso Hardware
      ● Espresso runs on SSDs
          ○ Only data on SSDs
          ○ Service logs, etc. on spinning disks
      ● Lots of other components are critical to the functioning of the software
          ○ Motherboard
          ○ Memory
          ○ Disk
  • 11. Impact of hardware failure
      ● Availability dip
      ● Software shuts down
      ● Loss of redundancy
      ● Redundancy restored automatically after a few hours
      ● Higher disk utilisation on cluster
      ● Fewer servers for serving traffic
  • 13. Challenges with HW recovery
      ● Long recovery cycles
          ○ Large number of servers
          ○ Vendor dependent in some cases
      ● Difficult to track recovery for all nodes
          ○ Jira-based system to track progress
      ● Bringing nodes back in rotation
          ○ Manual
          ○ Depends on notification from a person
          ○ Missing recovery steps can cause errors
  • 14. Requirements for automation
      ● Automate human action, twice
          ○ Taking the node out of rotation (OOR) on failure
          ○ Bringing it back after ticket close
      ● Automate ticket close detection
      ● Verification
          ○ Safe to take server OOR
          ○ Right software and configs are deployed
          ○ Health after server is brought back
      ● Lots of async waits
  • 17. Solution
      ● Automate a 2-part workflow
          ○ Detect failures and take nodes OOR
          ○ Monitor hosts with issues and bring them back in rotation on fix
  • 18. Entry barriers
      ● Programming language
          ○ SREs code in Python
          ○ Most SRE libraries in Python
              ■ Needed Java equivalent libraries
      ● Integrating within LinkedIn development environment
          ○ Challenges with UI
              ■ Jetty servers
              ■ Gradle
          ○ Authentication and authorization
      ● Learning curve
          ○ New automation engine concepts
      Start small; let results drive work
  • 19. Automate part 2
      ● Signals
          ○ Messages in Camunda context
          ○ For nodes with HW failures
              ■ Kafka message handler
              ■ Kafka plugged into monitoring systems
          ○ For resolution of HW problems
              ■ Jira message handler
      ● Perform prep steps on node
          ○ Leveraged Salt over REST
      ● Find right software and config version
      ● Deploy
      ● Terminate
  • 21. Design 2 (Final design) Business key: hostname
  • 22. Notifications (in addition to service code)
      ● Notifications for success/failure
          ○ Used history event handler:
              ((ProcessEngineConfigurationImpl) processEngine.getProcessEngineConfiguration())
                  .setHistoryEventHandler(new CompositeDbHistoryEventHandler(eventsNotificationHandler));
              if (historyEvent.isEventOfType(HistoryEventTypes.PROCESS_INSTANCE_END)) {
                  SEND_SUCCESS_NOTIFICATION
              } else if (historyEvent.isEventOfType(HistoryEventTypes.INCIDENT_CREATE)) {
                  SEND_FAILURE_NOTIFICATION
              }
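The routing rule on this slide (process end means success, incident creation means failure, everything else is ignored) can be sketched without the engine on the classpath. This is a minimal illustration only: `EventKind`, `Notification`, and `NotificationRouter` are hypothetical stand-ins, not Camunda classes, and a real handler would receive Camunda `HistoryEvent` objects instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for the engine's history event types; not a Camunda class. */
enum EventKind { PROCESS_INSTANCE_START, PROCESS_INSTANCE_END, ACTIVITY_INSTANCE_END, INCIDENT_CREATE }

/** Hypothetical notification categories. */
enum Notification { SUCCESS, FAILURE }

final class NotificationRouter {
    /** Maps one event to at most one notification, mirroring the if/else on the slide. */
    static Optional<Notification> route(EventKind kind) {
        switch (kind) {
            case PROCESS_INSTANCE_END:
                return Optional.of(Notification.SUCCESS);
            case INCIDENT_CREATE:
                return Optional.of(Notification.FAILURE);
            default:
                return Optional.empty(); // every other event produces no notification
        }
    }

    /** Collects the notifications produced by a sequence of events, in order. */
    static List<Notification> routeAll(List<EventKind> events) {
        List<Notification> out = new ArrayList<>();
        for (EventKind e : events) {
            route(e).ifPresent(out::add);
        }
        return out;
    }
}
```

Returning an empty `Optional` for unmatched events keeps the handler from sending anything for the many intermediate events a single instance run emits.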
  • 23. Glitches and fixes
      1. Symptom - Spurious retry of jobs
          ○ OBSERVATION: Host prep job succeeded but was retried 3 times
          ○ REASON: JOB TIME > DEFAULT_JOB_LOCK_TIME
          ○ SOLUTION: ((ProcessEngineConfigurationImpl) processEngine.getProcessEngineConfiguration()).getJobExecutor().setLockTimeInMillis(SOME_HIGHER_NUMBER)
          ○ Caveat - applies to all jobs
      2. Symptom - Incorrect notifications being sent
          ○ OBSERVATION: Several notifications per instance run
          ○ REASON: HistoryEventTypes initializes the enum wrong before 7.8
          ○ It was fixed here: https://github.com/camunda/camunda-bpm-platform/commit/9248a6a9a54d5a3204f963f0d7aa86d56c32bfa9#diff-a5a3a7f2bbf22567a6ff551cf2a39089
          ○ I wish I had reported and fixed it. Maybe the next one :)
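The first glitch comes down to one comparison: if a job runs longer than the job executor's lock time, the lock expires mid-run and the job can be picked up again. A minimal sketch of that check, and of choosing a lock time with headroom, follows. The class name, the safety factor, and the 5-minute default are assumptions for illustration; check the default against the engine version in use.

```java
import java.time.Duration;

final class JobLockCheck {
    /** Assumed default lock time of 5 minutes; verify against your engine version. */
    static final Duration ASSUMED_DEFAULT_LOCK = Duration.ofMinutes(5);

    /** True when the job would still be running at the moment its lock expires. */
    static boolean lockExpiresMidRun(Duration jobDuration, Duration lockTime) {
        return jobDuration.compareTo(lockTime) > 0;
    }

    /** Lock time with headroom: longest expected job duration times a safety factor. */
    static long lockTimeMillisFor(Duration longestJob, double safetyFactor) {
        if (safetyFactor < 1.0) {
            throw new IllegalArgumentException("safetyFactor must be >= 1.0");
        }
        return (long) Math.ceil(longestJob.toMillis() * safetyFactor);
    }
}
```

Because the setting applies to every job, as the caveat on the slide notes, the value would be sized for the slowest job in the engine rather than per workflow.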
  • 24. Glitches and fixes - contd.
      ● Symptom: Failure to send messages via REST API
          ○ OBSERVATION: On using REST API to send - ClassNotFoundException
          ○ REASON: Sync task tries to find code on REST server
          ○ SOLUTION:
              ■ <property name="jobExecutorDeploymentAware">true</property>
                  ● Thanks to Generali’s presentation at CamundaCon 2018
              ■ Asynchronous continuations and transaction boundaries
      ● Symptom: Duplicate workflow per host (business key)
          ○ OBSERVATION: Multiple instances to fix one host
          ○ REASON: Assumption that Camunda enforces business key uniqueness
          ○ SOLUTION:
              ■ Check in message handler and drop duplicates
  • 25. Metrics
      ● Implemented metrics as a part of the history event handler.
      ● Was blissfully unaware of https://docs.camunda.org/manual/7.8/user-guide/process-engine/metrics/
      ● Need more metrics for duration tracking in real time. Looking forward to hearing from all of you
      ● How about https://jitpack.io/p/stephenott/camunda-prometheus-process-engine-plugin?
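Since the engine does not enforce business key uniqueness, the duplicate check has to live in the message handler. One way to sketch it is an atomic "first caller wins" gate keyed by hostname. `HostWorkflowGate` is a hypothetical name, and the in-memory set is an assumption that only holds within a single JVM; a real handler might instead ask the engine whether an instance with that business key is already running.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class HostWorkflowGate {
    // Hostnames (business keys) that currently have a workflow in flight.
    private final Set<String> active = ConcurrentHashMap.newKeySet();

    /** Returns true if the caller should start a workflow; false marks a duplicate to drop. */
    boolean tryStart(String hostname) {
        return active.add(hostname); // atomic: only one concurrent caller gets true
    }

    /** Call when the workflow for this host ends, so a later failure can start a new one. */
    void finished(String hostname) {
        active.remove(hostname);
    }
}
```

A handler would call `tryStart(hostname)` before correlating the message and silently drop the message when it returns false.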
  • 26. Success Saved a month of manual work for an SRE in the first 2 quarters
  • 28. Things changed
      ● Time passed: >6 months
      ● Scope of problem increased
          ○ Deployments were impacted too
      ● New host pool management (HPM) design
          ○ Detect HW failures and take nodes OOR
          ○ Swap in a node from spare pool
              ■ Restores serving capacity
              ■ No nodes with old version of software in cluster
          ○ Monitor bad hosts and add them to spares on fix
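The pool movements in the HPM design can be sketched as three collections: serving hosts, spares, and hosts under repair. This is a simplified model under stated assumptions: `HostPools` and its method names are hypothetical, hosts are plain strings, and the real workflow's verification and deployment steps are left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

final class HostPools {
    private final Set<String> serving = new LinkedHashSet<>();
    private final Deque<String> spares = new ArrayDeque<>();
    private final Set<String> repair = new LinkedHashSet<>();

    HostPools(Iterable<String> servingHosts, Iterable<String> spareHosts) {
        servingHosts.forEach(serving::add);
        spareHosts.forEach(spares::add);
    }

    /** Takes a failed host out of rotation and swaps in a spare, if one is available. */
    Optional<String> onHardwareFailure(String host) {
        if (!serving.remove(host)) {
            return Optional.empty(); // unknown host or already out of rotation
        }
        repair.add(host);
        String replacement = spares.poll(); // null when the spare pool is empty
        if (replacement != null) {
            serving.add(replacement);
        }
        return Optional.ofNullable(replacement);
    }

    /** Moves a repaired host into the spare pool rather than straight back to serving. */
    boolean onFixed(String host) {
        if (!repair.remove(host)) {
            return false;
        }
        spares.add(host);
        return true;
    }

    int servingCount() { return serving.size(); }
    int spareCount() { return spares.size(); }
}
```

Sending repaired hosts to the spare pool instead of back into the cluster matches the goal on the slide: a host that sat in repair may carry an old software version, so it only rejoins serving through a fresh swap-in.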
  • 29. Part 1 in production
  • 30. Part 2 in production
  • 31. Part 3 in production
      ● We changed initial production workflow
          ○ On fix
              ■ Bring node OOR
              ■ Move node from repair pool to spare pool
  • 32. Challenges with HPM
      ● Heterogeneous clusters, one spare pool
          ○ Node selection was time consuming
              ■ Needed to verify hardware compatibility
          ○ Solution: One pool per type of HW.
              ■ Disks were the only concerning component
                  ● Limited variety
              ■ Other things were largely homogeneous
      ● Throttling
          ○ Not more than X instances per day - security
          ○ Needs to be avoided despite parallel requests
          ○ Solution:
              ■ Throttling was moved out of workflow
              ■ Inside a lock in message handler
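The throttling fix, a daily cap checked inside a lock in the message handler, can be sketched as a single synchronized check-and-increment. `DailyThrottle` and `maxPerDay` (the "X" on the slide) are hypothetical names, and the sketch assumes one handler process; several handler processes would need a shared lock or counter instead.

```java
import java.time.LocalDate;

final class DailyThrottle {
    private final int maxPerDay;
    private LocalDate day;      // the day the current count belongs to
    private int startedToday;

    DailyThrottle(int maxPerDay) {
        this.maxPerDay = maxPerDay;
    }

    /**
     * Check and increment under one lock, so two parallel handlers
     * cannot both claim the last remaining slot for the day.
     */
    synchronized boolean tryAcquire(LocalDate today) {
        if (!today.equals(day)) { // first call, or the date rolled over
            day = today;
            startedToday = 0;
        }
        if (startedToday >= maxPerDay) {
            return false;
        }
        startedToday++;
        return true;
    }
}
```

Doing the check before the workflow instance is created, rather than as a step inside it, means a rejected request never starts an instance at all.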
  • 33. Our deployment
      ● Database: MySQL
          ○ Managed by MySQL team
      ● Camunda UI
          ○ Deployed inside a Jetty server
          ○ Built using Gradle
      ● Workflows
          ○ Each built as a separate Jetty deployable
      ● Wrapper product - Perseus
          ○ LinkedIn integration libraries
          ○ Additional engine plugins
          ○ Camunda as dependency
  • 34. Other use cases
      ● Database deletes
          ○ Automatic flow for deleting databases
          ○ Takes care of downstream calls
          ○ Monitors requests to database before deleting
      ● Move databases
          ○ To help balance clusters
          ○ Takes care of updating upstream and downstream actions
      ● Cluster scan
          ○ Monitors nodes with no HW failures
          ○ Restarts service on these to bring back capacity
  • 35. Future
      ● PerseusAsAService (PAAS)
          ○ A platform for automation
              ■ Multiple WF definitions as a single deployable
              ■ Design to production within a week reusing libraries
      ● Cluster setup
          ○ For heavy workload tasks
              ■ Divide tasks among servers
              ■ Task -> Server mapping
      ● Metrics
          ○ Real time metrics independent of commits to MySQL
          ○ As an engine plugin
              ■ Future workflows get them by default
          ○ Integrate with LI monitoring systems