This document provides an overview of Digio and Mantel Group, two Australian technology consulting companies. It then discusses the problem of managing GitLab runners at scale and the solution developed using AWS services such as Lambda, EventBridge, and auto-scaling groups to dynamically scale the runners based on workload. Cost estimates show the solution can run for roughly $0.19-$0.45 USD per month depending on usage. Benefits over alternative approaches such as Docker Machine are also outlined. The document concludes with some Terraform best practices and a demonstration of the GitLab runner scaling solution.
3. ● Principal Engineer at Digio
● Focus on platform engineering
● Background in development
● 10 years of AWS experience
● ~2 years each in Azure and GCP
● 12 years of Infrastructure as Code
● Passion for automating things
● 4 years of Terraform experience
● Terraform Associate certified
● Previously AWS Associate certified, but now I’m lazy
5. Digio and Mantel Group
Melbourne, Sydney, Brisbane, Perth, Adelaide, Hobart, Magnetic Island, Auckland, Queenstown
We’re an Australian-owned, principle-based, technology-led consulting business founded in Melbourne.
Digio is Australia’s premier digital services provider from concept to production, continually evolving alongside technologies and methods.
We are a dynamic business established in November 2017 that has grown to a team of over 200 across Australia and New Zealand.
We are part of the broader Mantel Group, currently comprised of 9 technology brands and a total team size of over 800. As a group we have been recognised in the AFR’s 2020 fastest-growing companies, achieved #1 Best Place to Work for 2021 and 2022 in the Great Place to Work survey, and were awarded AWS Services Partner and Migration Partner of the Year for 2022.
6. Mantel Group Brands
Working with Mantel Group enables access to expertise not only within Digio, but across all current and future brands.
A broad end-to-end capability that is vendor-agnostic, yet has deep specialisations…
[Slide graphic: per-brand capability columns, including Software Engineering (API, QA, .NET, Web, Mobile, iOS, Android), Platform Enablement, Security & Identity, Managed Services, Data & Analytics (Data Strategy, Data Engineering, Data Architecture, Analytics & BI, Advanced Analytics), Technology Strategy & Advisory, Application Modernisation and Transformation, Cloud Native, Migration, Automation & DevOps, Cloud Computing, Analytics & Machine Learning, ML Engineering, MarTech, Digital Workplace, Collaboration & Productivity, Training & Certification, Pursuit Model, Discovery Sprints, Rapid Prototyping, Service Design, UX/UI Design, Native Mobile Technology Strategy, Native Mobile Product / Design Strategy, Delivery & Method, Coaching & Training.]
16. Function URL vs EventBridge with polling
The webhook (Function URL) is:
● Faster to respond to events, as it runs near-instantly
● Zero AWS cost to enable
● Cheaper if repository / runner activity is low
● Open to abuse by third parties invoking the function without authentication
EventBridge is:
● More predictable in terms of AWS spend
● 14 million free invocations
● Slower to respond (bounded by the polling interval)
● Lower cost if GitLab project activity is high
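As a rough sketch, the polling option is just a scheduled EventBridge rule targeting the scaling Lambda. Resource names here are illustrative, and `aws_lambda_function.scaler` is assumed to be defined elsewhere in the module:

```hcl
# Hypothetical sketch: poll GitLab once a minute via EventBridge
# instead of exposing a public Lambda Function URL.
resource "aws_cloudwatch_event_rule" "poll" {
  name                = "gitlab-runner-poll"
  schedule_expression = "rate(1 minute)"
}

resource "aws_cloudwatch_event_target" "lambda" {
  rule = aws_cloudwatch_event_rule.poll.name
  arn  = aws_lambda_function.scaler.arn
}

# Allow EventBridge to invoke the function.
resource "aws_lambda_permission" "events" {
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.scaler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.poll.arn
}
```

Unlike a Function URL, only EventBridge can trigger the function here, which removes the third-party abuse concern at the cost of response latency.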
17. Scaling Out
● Make use of:
○ CloudWatch metrics and CloudWatch alarms
○ Triggers on the auto-scaling group
○ Scaling policies to determine how many instances to add
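The pieces above might fit together as follows. This is a sketch, not the module's actual code: the `GitLab/Runners` namespace, the `PendingJobs` metric, and the referenced `aws_autoscaling_group.runners` are illustrative assumptions:

```hcl
# Add one instance when the alarm fires.
resource "aws_autoscaling_policy" "scale_out" {
  name                   = "gitlab-runner-scale-out"
  autoscaling_group_name = aws_autoscaling_group.runners.name
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 1
  cooldown               = 120
}

# Fire when jobs have been queued for two consecutive minutes.
resource "aws_cloudwatch_metric_alarm" "jobs_pending" {
  alarm_name          = "gitlab-jobs-pending"
  namespace           = "GitLab/Runners"
  metric_name         = "PendingJobs"
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  alarm_actions       = [aws_autoscaling_policy.scale_out.arn]
}
```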
18. Scaling In
● Requires multiple inputs and considerations
○ Avoid churn of runners
● Scale in based on load (number of active runners and jobs in the queue)
● Make use of CloudWatch’s protection against premature transitions to the alarm state (see "Avoiding premature transitions to alarm state" in the AWS docs)
○ AWS alarms include logic to try to avoid false alarms
○ CloudWatch waits the full N evaluation periods before alarming
○ Any time the metric goes above the threshold, the alarm "timer" is effectively reset
● The tradeoff: longer idle time means additional cost
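The scale-in side mirrors the scale-out sketch: a negative scaling adjustment, and an alarm that only fires after the full run of idle datapoints. Again the names, namespace, and metric are illustrative assumptions, with `aws_autoscaling_group.runners` assumed to exist elsewhere:

```hcl
# Remove one instance; the negative value is what makes this scale in.
resource "aws_autoscaling_policy" "scale_in" {
  name                   = "gitlab-runner-scale-in"
  autoscaling_group_name = aws_autoscaling_group.runners.name
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = -1
  cooldown               = 300
}

# Fire only after 10 consecutive idle minutes; a single busy datapoint
# within the window effectively resets the "timer".
resource "aws_cloudwatch_metric_alarm" "runners_idle" {
  alarm_name          = "gitlab-runners-idle"
  namespace           = "GitLab/Runners"
  metric_name         = "ActiveJobs"
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 10
  datapoints_to_alarm = 10
  threshold           = 0
  comparison_operator = "LessThanOrEqualToThreshold"
  alarm_actions       = [aws_autoscaling_policy.scale_in.arn]
}
```

Lengthening `evaluation_periods` trades extra idle cost for less runner churn, which is the tradeoff the slide describes.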
19. Cost Estimation
Lambda
● Running the Lambda (in the ap-southeast-2 region) with:
○ x86 architecture
○ 1 request per minute
○ 2000 ms duration
○ 128 MB memory allocated
○ 512 MB ephemeral storage (default)
● With the free tier: $0.00 a month
● Without the free tier: $0.19 USD (43,800 invocations)
Runner (EC2)
● t3.medium spot instance(s) running 5 hours over the month at the average price of $0.0158/hour: $0.079 a month
● t3.medium on-demand instance(s) running 5 hours over the month at $0.0528/hour: $0.264 a month
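The $0.19 Lambda figure can be sanity-checked against the published ap-southeast-2 x86 rates (about $0.20 per million requests and $0.0000166667 per GB-second):

```
Invocations: 60/hour × 730 hours          ≈ 43,800 per month
Requests:    43,800 × $0.20 / 1,000,000   ≈ $0.009
Compute:     43,800 × 2 s × 0.125 GB      = 10,950 GB-seconds
             10,950 × $0.0000166667/GB-s  ≈ $0.183
Total:                                    ≈ $0.19 USD / month
```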
20. Cost Estimation vs Docker Machine
This solution:
● Trades off speed to respond due to runner startup time
● Likely not ideal for high-activity pipelines
● Suits small pipelines that trigger after hours
● Nice to just have it work
GitLab Runner autoscaling with Docker Machine:
● Install and register a GitLab Runner for autoscaling with Docker Machine
○ ~$10 a month for a pilot instance running 24/7
● Requires patching and maintenance, verification, and troubleshooting
● Internally we had issues with SSH access
● Overall cost becomes a lot higher
Hi, I’m Anthony Scata, and I’m going to talk about some of my experiences, lessons, and coding tips and tricks from my journey writing a module for deploying GitLab runners in a cost-effective manner. We will see how things go; I may even show some live demos.
I’ll start by saying Happy Valentine’s Day. Hopefully by saying this I can gain some good karma from my wife, who is likely sitting at home, angrily watching TV and wondering where I am. I did ask her to join us but she wasn’t keen.
As a good consultant, I cannot start a presentation without talking about where I work.
Working in a consultancy, we often have internal projects, some of which are hosted in AWS. They are not business critical, but may be a small application used by a few people, an internal project, or a solution accelerator showcasing the latest technologies to clients.
The issue is that we don’t make money from these; as a professional services consultancy, our team members are billable to clients. This means most people are very busy on client projects and can be pulled off internal work for higher-value work.
It also means people are busy, trying to work internally and just get things done; typically automation and infrastructure as code end up on the back burner. It is quite ironic that a company that works so much in the CI/CD space has very little maturity internally, but as mentioned, this isn’t how we make money. As time is tough to come by and client projects can pop up, consistency for internal projects is often an issue, and a lot of projects become orphaned with little to no support.
Engineers come onto these projects and implement something simple, rarely with time to make it better or easier, just doing what they know. Over time this leads to a large mess of reinvented solutions, especially in infrastructure as code, all bandaged together. Automation is usually not people’s expertise and, as most of us know, gets overlooked until the next person comes along, sees the dumpster fire of a setup, and continues the cycle.
If people do look at automation, it often gets expensive, both in time to set up and then to maintain. Any system left running needs to be patched, verified, validated, and monitored. We have found this to be a large sink of money, especially for projects that are rarely touched. We often float the idea of centrally managed runners, but then we hit issues with ownership, cost allocation, debugging, and general usage; it becomes very painful.
The solution needed to be low cost, easy to build, maintainable, and reproducible, so that it could be deployed into any AWS account or region.
It also needed to be high quality, so it can be reused on another project without falling over every few months. Serverless technologies provide a great approach here: there is no running system to patch or upgrade, they are usually low cost (or at least lower), and they present a small attack surface for malicious actors.
The idea is that it also works well for small projects that can scale into something larger: you don’t have to spend a lot of money on CI/CD runners, but if the project grows, you aren’t required to set up a whole new process or implementation.
So this is where it led: a Terraform module automating the process of building GitLab runners.
EC2 costs ~$15 a month, plus extra costs.
Now for some of the more technical tips and tricks I learnt along the way. These help the next person picking up the code, which, again, is one of the core issues. Build and document as if you were the one looking at this for the first time, and ask what would really help.
As with anything, architecture diagrams, and documentation as a whole, are important. Nothing says "this is a well-maintained piece of code" like documentation that is factual and thorough. Diagrams can really help paint a picture of what will be deployed. Why have consumers extract details by reading the code when they can see them at a high level? Coupled with examples of working code, this makes it easier for people to try. You want to lower the barrier to entry for any piece of software, and you may need the examples yourself, as they provide a good guide.
And lastly, explain in the documentation why decisions were made. At times we focus on what was done but not the motivation or limitation behind it. The how can mostly be seen from the code and reasoned about; the why is more abstract and less obvious. "We use this CloudWatch setting because…", "this is set to a negative value so that…". This helps the next person who thinks "why was this done? Let me change it to something that makes more sense to me", only to end up in the same situation and rabbit hole you did. Be kind to your future self and fellow engineers.
I want my code to be well documented, with those interested able to look at it if necessary, the key word being necessary. terraform-docs provides the ability to automatically generate resource, variable, input, provider, and other docs based on the code. This means less reading of the code if you are new to the module, and provides a better snapshot. Now I can see whether this works with the AWS provider version I need for another module, whether it uses a resource type my organisation does not allow, and, more importantly, which input variables I need to supply, why, and how. With the validation mentioned earlier plus the docs, a consumer shouldn’t need to view the code to see how a variable will be used, making it easier for less experienced engineers.
As of Terraform 0.13.0 you have the ability to validate variables: to check that the contents of a variable are, for example, within a certain number range, match a regex, or form a valid JSON string. Sometimes a plan does not catch these incompatibilities due to the provider; we only find them when running the apply, which is likely too late. Let’s check before the plan to ensure we have a consistent and working environment.
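A minimal sketch of the validation block described here; the variable names, bounds, and error messages are illustrative, not the module's actual inputs:

```hcl
variable "idle_minutes" {
  type        = number
  description = "How long a runner may sit idle before scaling in."

  validation {
    condition     = var.idle_minutes >= 1 && var.idle_minutes <= 60
    error_message = "idle_minutes must be between 1 and 60."
  }
}

variable "runner_tags" {
  type        = string
  description = "JSON-encoded list of tags to register the runner with."

  validation {
    # Fails at plan time rather than mid-apply.
    condition     = can(jsondecode(var.runner_tags))
    error_message = "runner_tags must be a valid JSON string."
  }
}
```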
One advantage of the variable map with optionals, as mentioned before, is that we can check multiple variable values together, for example that the min is less than the max and the desired sits somewhere in between. If the variables are defined separately, this cannot be done.
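Grouping the values into one object variable makes the cross-field check possible, since a validation condition can reference every attribute of its own variable. An illustrative sketch (note that `optional()` with a default requires Terraform 1.3+):

```hcl
variable "scaling" {
  type = object({
    min     = number
    max     = number
    desired = optional(number, 1)
  })

  validation {
    # Cross-field check: impossible if min/max/desired were
    # three separate variables.
    condition = (
      var.scaling.min <= var.scaling.desired &&
      var.scaling.desired <= var.scaling.max
    )
    error_message = "scaling.desired must sit between scaling.min and scaling.max."
  }
}
```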
This may seem minor, but it really helps others who are viewing or changing your code. Some resources may use 10 or 20+ attributes, and it can be hard to comprehend what is being used. Sorting the attributes alphabetically makes it easy for others to see where an attribute is placed and then how it is used, reducing the cognitive load of making changes and decisions: where does this go, should I put this here?
This includes the resources themselves. Although Terraform does not run in sequential order, ordering helps us humans comprehend change and find resources.
We have all seen code with hundreds or thousands of lines and thought, "oh god, not this file again". This adds extra stress and cognitive load to changes. You are much better off splitting the files by higher-level resource type, for example autoscaling or cloudwatch, and then adding locals specific to that set of resources into the file. Keeping the resources somewhat contained helps to facilitate change. This may sound contradictory to the earlier point about logical ordering, and it does depend on how many resources you are creating, but anything more than about 10 resources per file starts to get unwieldy.
The use of locals makes it easier to reuse strings or data without hard-coding them in multiple places. Sometimes people make these variables with defaults, which can be messy as it gives consumers the ability to change them. Utilise locals where possible to reduce duplication of magic strings: used once, twice, three times, extract it into a local. Locals can also be used to move the complexity of how a value is computed out of the resource definition. Resource definitions can be large enough, let alone when you add in a join, compact, concat, split, tostring, or try. Move this out, make it a local, and reference it when needed.
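A small sketch of both uses, with hypothetical names (`var.environment`, the subnet variables, and the log group are illustrative, not the module's real resources):

```hcl
locals {
  # Single source for the naming prefix instead of repeating the string.
  name_prefix = "gitlab-runner-${var.environment}"

  # The join/compact complexity lives here, not in the resource block.
  subnet_ids = compact(concat(var.private_subnet_ids, var.extra_subnet_ids))
}

resource "aws_cloudwatch_log_group" "runner" {
  name              = "/aws/ec2/${local.name_prefix}"
  retention_in_days = 30
}
```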
Try to infer values where possible rather than having the consumer pass them in. For example, the caller account: there is no need to pass in an account ID when we are deploying into that account; just grab the ID with a data source, and the same for region. This reduces the duplication of variables and the possibility that the consumer changes region and forgets to update the variable.
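The two data sources in question, with an assumed example of how the inferred values might be used to build an ARN:

```hcl
# Infer the deployment account and region instead of asking for them.
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}

locals {
  # Illustrative use: build ARNs without extra input variables.
  lambda_arn_prefix = "arn:aws:lambda:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:function"
}
```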
Thank you for listening to my presentation; I hope you gained something useful for your Terraform and Infrastructure as Code journey.