Approach: I won't re-hash AWS's standard doco, but dive into some interesting things we've found.
Cost optimisation can be a pretty dry subject, so I'll keep it short.
First up, something obvious: look for unused / forgotten resources!
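As a sketch of what that hunt can look like, here's a boto3 snippet that lists two common kinds of forgotten-but-still-billed resources: unattached EBS volumes and idle Elastic IPs. It assumes default boto3 credentials and region; nothing here is specific to our setup.

```python
# Sketch: find unattached EBS volumes and unassociated Elastic IPs,
# two common kinds of forgotten (but still billed) resources.
def unattached_volumes(volumes):
    """Pure filter: EBS volumes in the 'available' state have no attachment."""
    return [v for v in volumes if v.get("State") == "available"]

def unassociated_eips(addresses):
    """Pure filter: an Elastic IP with no AssociationId is idle (and billed)."""
    return [a for a in addresses if "AssociationId" not in a]

if __name__ == "__main__":
    import boto3  # only needed for the live API calls
    ec2 = boto3.client("ec2")
    vols = ec2.describe_volumes()["Volumes"]
    eips = ec2.describe_addresses()["Addresses"]
    for v in unattached_volumes(vols):
        print(f"Unattached volume: {v['VolumeId']} ({v['Size']} GiB)")
    for a in unassociated_eips(eips):
        print(f"Idle Elastic IP: {a['PublicIp']}")
```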
Next look at downsizing.
Downsizing one step saves 50%, or close to it. Choosing the right instance type along with right-sizing can mean downsizing two steps – saving about 75%.
Add a reservation to that and you can save close to 85%.
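The arithmetic behind those numbers is worth making explicit: each discount applies to the cost that's left after the previous one, not to the original price. A quick worked example (the percentages are the illustrative ones above, not official AWS pricing; the ~40% reservation discount is at the generous end):

```python
# Worked example of how savings compound: each discount applies to the
# cost remaining after the previous one (percentages are illustrative).
def remaining_cost(*discounts):
    cost = 1.0
    for d in discounts:
        cost *= (1.0 - d)
    return cost

# Two downsizing steps of ~50% each leave about 25% of the original cost:
two_steps = remaining_cost(0.5, 0.5)              # ~75% saved
# A ~40% reservation discount on top leaves ~15%:
with_reservation = remaining_cost(0.5, 0.5, 0.4)  # ~85% saved
print(f"saved {1 - with_reservation:.0%}")
```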
To decide if an instance is the right size you need to look at some metrics: CPU, memory and network usage.
You also need some information to predict how future load compares to current load.
T2/T3 instances fit way more use cases than we expected.
Look at a couple of weeks or more of data. Narrow spikes suggest downsizing, and that T2/T3 could be applicable.
T2/T3 are "burstable": you accrue CPU credits, then can get bursts of very high performance. Bursting has been so popular that, although the original T2s were positioned as "cheap general purpose", larger sizes have since been added.
For a T2 we often see CPU spikes consuming credits … but then find the credit balance has hardly been touched. This is an extreme example.
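That "hardly touched" check is easy to automate. A sketch using boto3 and CloudWatch's CPUCreditBalance metric – the instance ID and the 0.8 threshold are illustrative, and default credentials/region are assumed:

```python
# Sketch: pull two weeks of CPUCreditBalance for an instance and check
# whether bursting ever really dented the credit pool.
from datetime import datetime, timedelta, timezone

def hardly_touched(balances, fraction=0.8):
    """True if the lowest observed credit balance never dropped below
    `fraction` of the highest - i.e. spikes are absorbed easily and a
    smaller T2/T3 size may be worth trying. Threshold is illustrative."""
    if not balances:
        return False
    return min(balances) >= fraction * max(balances)

if __name__ == "__main__":
    import boto3  # only needed for the live CloudWatch call
    cw = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)
    stats = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUCreditBalance",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # illustrative ID
        StartTime=now - timedelta(days=14),
        EndTime=now,
        Period=3600,
        Statistics=["Minimum"],
    )
    balances = [p["Minimum"] for p in stats["Datapoints"]]
    print("hardly touched" if hardly_touched(balances) else "credits being consumed")
```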
ECU = EC2 Compute Unit. Supposedly phased out in 2014 in favour of vCPUs but still visible on the Pricing pages.
re:Invent - https://www.youtube.com/watch?v=FkMslBsVYFU&feature=youtu.be
RDS provides memory metrics out of the box, but for EC2 you need to create your own.
- No news here but often the custom metrics are neglected. It’s really worth ALWAYS deploying the CloudWatch Agent.
- top or Task Manager are only point-in-time, and don’t give enough info for DevOps agility.
AWS recommend you use the new agent instead of the older monitoring scripts (Perl) to collect metrics and logs.
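As a sketch, a minimal agent config that adds memory and disk metrics – the measurement names follow the agent's documented schema, but verify them against your agent version before deploying:

```json
{
  "metrics": {
    "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
    "metrics_collected": {
      "mem": {"measurement": ["mem_used_percent"]},
      "disk": {"measurement": ["used_percent"], "resources": ["*"]}
    }
  }
}
```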
EC2 – make the right choice of C, M or R type, depending on memory needs.
- C or M are often better replaced by T.
- We often found C assumed correct at roll-out when memory use was actually more the issue; you can downsize if you change to another type.
It can be hard to interpret your out-of-the-box networking metrics.
We found very few workloads actually putting networking under stress, though.
Instance network performance is very vague in the AWS documentation.
Someone with too much time on their hands has created https://cloudonaut.io/ec2-network-performance-cheat-sheet/
- Not official, and a random sample, but reasonable methodology.
BTW, I-type instances are ambiguously described as “high I/O performance” – to be clear, that’s disk I/O, not network.
The best way to save money is turn stuff off!
Particularly non-prod environments.
Savings are potentially even greater if you switch off automatically and only switch back on on demand.
AWS Instance Scheduler – a canned CloudFormation solution provided by AWS.
- Config lives in DynamoDB, which isn’t as convenient and isn’t attached to the resources affected.
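For simple cases, a DIY scheduled Lambda can be easier to reason about than the canned solution. A sketch: stop instances carrying a `Schedule: office-hours` tag outside office hours. The tag name and the hours are our own convention, not part of any AWS tool.

```python
# Sketch of a DIY scheduler Lambda: stop tagged instances outside
# office hours. Tag name and hours are illustrative conventions.
from datetime import datetime, timezone

def outside_office_hours(now, start_hour=8, end_hour=19):
    """True on weekends, or outside the start..end hours."""
    return now.weekday() >= 5 or not (start_hour <= now.hour < end_hour)

def handler(event, context):
    if not outside_office_hours(datetime.now(timezone.utc)):
        return
    import boto3  # only needed for the live EC2 calls
    ec2 = boto3.client("ec2")
    running = ec2.describe_instances(Filters=[
        {"Name": "tag:Schedule", "Values": ["office-hours"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    ids = [i["InstanceId"]
           for r in running["Reservations"] for i in r["Instances"]]
    if ids:
        ec2.stop_instances(InstanceIds=ids)
```

Trigger it from a CloudWatch Events / EventBridge schedule (e.g. hourly); note the hours are compared in UTC here, so adjust for your timezone.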
Workspaces can be Always-On or Auto-Stop.
Canned solution provided by AWS.
Probably well known but I wanted to mention it as it’s easy to deploy and works really well after bug fixes earlier this year.
To maximise the benefit, you need to encourage people to disconnect when not in use so the Workspace will sleep; it only takes a minute or two to resume, and there’s no data loss.
It’s worth running a nightly check for Workspaces left connected overnight.
- For Auto-Stop ones, you can even run the “stop-workspaces” operation on them.
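That nightly check can be sketched with boto3's WorkSpaces API (default credentials/region assumed): find sessions still CONNECTED and stop the Auto-Stop ones.

```python
# Sketch of the nightly check: find Workspaces still connected overnight
# and stop the Auto-Stop ones.
def stoppable(workspaces, statuses):
    """Pure logic: Auto-Stop Workspaces whose session is still CONNECTED."""
    connected = {s["WorkspaceId"] for s in statuses
                 if s.get("ConnectionState") == "CONNECTED"}
    return [w["WorkspaceId"] for w in workspaces
            if w["WorkspaceId"] in connected
            and w.get("WorkspaceProperties", {}).get("RunningMode") == "AUTO_STOP"]

if __name__ == "__main__":
    import boto3  # only needed for the live calls
    ws = boto3.client("workspaces")
    workspaces = ws.describe_workspaces()["Workspaces"]
    statuses = ws.describe_workspaces_connection_status()["WorkspacesConnectionStatus"]
    ids = stoppable(workspaces, statuses)
    if ids:
        ws.stop_workspaces(StopWorkspaceRequests=[{"WorkspaceId": i} for i in ids])
```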
You can get 1 or 3 year terms, with no/partial/all upfront – 1 year partial upfront is a good compromise.
- Saves at least 20% on smaller instances, rising to 30-35% for larger instances.
All upfront doesn’t buy much more saving, and 3 years is too long a commitment.
Also (though not in Sydney) you can purchase Scheduled Reservations, applicable at prescribed times.
It can be confusing to get your head around when instance size flexibility applies.
- It allows e.g. a t2.large reservation to be applied to two t2.medium instances instead.
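Size flexibility works on AWS's published normalization factors (small = 1, medium = 2, large = 4, and so on), which makes the t2.large → 2 × t2.medium example just a division:

```python
# Normalization factors from AWS's published size-flexibility table.
FACTORS = {"nano": 0.25, "micro": 0.5, "small": 1, "medium": 2,
           "large": 4, "xlarge": 8, "2xlarge": 16}

def covered_count(reserved_size, running_size):
    """How many running instances of one size a single reservation covers."""
    return FACTORS[reserved_size] / FACTORS[running_size]

print(covered_count("large", "medium"))  # a t2.large RI covers 2 t2.medium
```

The same ratio works the other way: two t2.medium reservations together cover one t2.large.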
Reservations can be shared across accounts if using Organizations. If using consolidated billing without Organizations, you can only share reservations made in the Payer account.
You can also now do On-Demand Capacity Reservations, independent of RI billing discounts. No time commitment – cancel any time.
Similar to EC2 reservations (years, upfront, sharing) and now with instance size flexibility too for most DB engines.
- Saves a bit more than EC2 – about 40% for Postgres/MySQL, 30% for SQL Server for example.
There are a lot more parameters to specify, and if they’re not all correct, the reservation won’t match your running instance.
- Take care when reserving!
Without tracking in JIRA, we found reservations too hard to manage and renew.
- Notifications from AWS about expiring reservations aren’t enough; they don’t necessarily go to the right people and don’t assign responsibility.
Renewal is done by re-opening the existing JIRA.
It promotes agile DevOps and cost savings if
- you can readily tell what a reservation is for
- you can tell whether an instance is targeted by a reservation
- purchasing/renewal is easy.
Created "AWS Cost Management" group for use in subscriptions.
Subscribed this group to 4 filters, 7am each day, with no email if no reservations match. The filters are:
- All AWS Reservations expired – everyone gets hassled to act on expired reservations
- My AWS Reservations expiring within 1 week
- My Open & In-progress AWS Reservations expiring within 2 weeks – you get reminded of reservations not yet submitted for approval
- All Idle AWS Reservations expiring within 4 weeks (matches status of "Done") – everyone gets hassled until someone takes ownership and sets it to Open
Reports & reviews - No surprises here!
Resource reviews – examine new resources; search the JIRA records to check whether they’re targeted by reservations.
DevOps – it’s essential to have a culture of willingness to make frequent changes, not only for cost-optimisation tweaks but also for modernising and patching for security.