Cloud Cost Optimisation – Improving Infrastructure Efficiency

Did you miss Part 1 of this multi-part series on ‘Cost Optimisation and Operational Efficiencies’?

This article focuses on infrastructure, specifically on two core building blocks that constitute the largest portion of most cloud bills: compute and storage.

Compute, in the form of servers, has been around for the last 50 years (and probably will be for the next 50). Whilst today we have containers, serverless compute and many forms of persistence stores, I will give you my thoughts on Storage (Object and Block) and VM (Virtual Machine) optimisation as these are the two largest components of most cloud bills.

I have been fortunate to work for some of Australia's largest websites and two of the major public cloud vendors. When it comes to architecture, I have seen the exceptional and the questionable.

Cloud and cost can be quite a polarising topic. Do it right, and you can run super lean, drive down the cost to serve and ride the cloud innovation train. Conversely, do it wrong by treating the public cloud like a data centre and your costs could be significantly higher than on-premises.

The objective here is to share practical advice so you can walk away with one or two meaningful cost-saving ideas that you can execute in your environment, driving costs down.

Storage

One of the largest expenditures in the cloud, aside from compute, is storage. So, how can you reduce storage costs without making code changes?

Broadly speaking, when it comes to cloud and storage, there are two paths to take, and the right path depends on how you use storage.

The public cloud provides both object and block-based storage. What's the difference between object and block-based storage? Let me explain.

When I grew up, my world was Microsoft DOS, then Windows and a variety of Linux distributions. I stored my data on a hard disk drive, like we still do today. You have a ‘C Drive’ and beyond in Windows or mount (mnt) points in Linux.

While the cloud provides this type of storage (block), it also ushered in a new method - object storage.

Object storage is relatively new when compared with more traditional block storage systems. In short, it’s storage for unstructured data that eliminates the scaling limitations of traditional file storage. Limitless scale is the reason that object storage is the storage of the cloud. All of the major public cloud services, including Amazon (Amazon S3), Google (Google Cloud Storage) and Microsoft (Azure Blob Storage), employ object storage as their primary storage method.

Looking through the lens of storage, I want to provide you with two mechanisms to reduce costs. Both are along the same lines and regardless of whether this is COTS or custom code, these mechanisms will allow you to reduce costs from your storage tier.

Object Storage

All providers have mechanisms for tiering storage, and that’s what we need to do: take advantage of lower price points for infrequently accessed data. Broadly speaking, data is either available (hot) or it's not (cold).

Amazon calls this S3 Standard-IA (Infrequent Access), Azure calls it the Cool access tier, and Google calls it Nearline storage. The terminology differs slightly between providers, but all offer a tier that is always available (millisecond access), with a lower storage cost but a higher transaction cost.

We need to move the long tail of infrequently accessed data to this storage tier. A tiering policy can automate this process.

Understanding the relationship between access frequency and cost at your chosen provider is a simple and effective way to reduce costs.

When your data is stored in a hot tier, your applications can access it immediately. While the hot tier is the best choice for data that is in active use, there is a tier in between, let’s call it the warm tier.

In this tier, you get the benefits of instant access (hot) and cost reduction (cold) but can be penalised in the form of cost when the data is accessed too frequently (that is, as if it were hot data). Thus, this warm tier is perfect for less frequently accessed data.

Amazon S3 Pricing – Storage is cheaper for Infrequent Access, but data is still available instantly

Amazon S3 Pricing requests and data access

Amazon S3 Pricing – Storage is cheaper for Infrequent Access, but access is almost double the cost of standard S3
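To make the storage-versus-access trade-off concrete, here is a minimal sketch that estimates how much data you can retrieve each month before a warm tier stops being cheaper than the hot tier. The function names and all three prices are placeholders of my own, not figures from any provider's price list, and the model assumes the hot tier has no per-GB retrieval charge.

```python
def monthly_cost(stored_gb, retrieved_gb, price_per_gb, retrieval_per_gb):
    """Monthly cost of one tier: storage plus per-GB retrieval charges."""
    return stored_gb * price_per_gb + retrieved_gb * retrieval_per_gb


def break_even_retrieval_gb(stored_gb, hot_price, warm_price, warm_retrieval):
    """GB retrieved per month at which the warm tier costs the same as the hot tier.

    Assumes the hot tier has no per-GB retrieval charge.
    """
    return stored_gb * (hot_price - warm_price) / warm_retrieval


# Placeholder prices in USD per GB (assumptions, not real list prices).
HOT, WARM, WARM_RETRIEVAL = 0.025, 0.0125, 0.01

stored = 1_000  # GB held in the tier
limit = break_even_retrieval_gb(stored, HOT, WARM, WARM_RETRIEVAL)
print(f"Warm tier is cheaper while monthly retrieval stays below {limit:.0f} GB")
```

Plug in your provider's real storage, retrieval and request prices before drawing conclusions. Per-request charges are left out here to keep the sketch short.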

Let’s apply this to a real example.

Imagine I have an application that allows you to rate pictures of cats and dogs, as that’s a common engineering example that’s used.

Dog and Cat Developer Application Example

My application has the following attributes

Our application stores 100TB of photos
Each photo is around 10MB in size
Using S3 Standard Storage will cost $2508 per month (May 2024, ap-southeast-2 (Sydney))
Our data has a long tail. Roughly 80% of data will not be accessed after 30 days and, if so, will be rarely requested.

Without any code-base changes to my application, I can apply a tiering / lifecycle policy, where older data that isn’t accessed as frequently is moved to a cheaper ‘Infrequent Access’ tier of storage but is available for immediate access when required.
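For those who prefer configuration over clicks, here is a minimal sketch of what such a policy might look like when expressed as an S3 lifecycle configuration document. The helper name, rule ID and defaults are my own illustrative choices. Note that lifecycle transitions are driven by object age (days since creation) rather than by measured access.

```python
import json


def build_lifecycle_policy(days_until_warm=30, storage_class="STANDARD_IA", prefix=""):
    """Build a lifecycle configuration that moves objects to a cheaper tier by age.

    The helper and rule ID are illustrative; the dictionary follows the shape
    of an S3 lifecycle configuration (Rules, Filter, Transitions).
    """
    return {
        "Rules": [
            {
                "ID": f"move-to-{storage_class.lower()}-after-{days_until_warm}-days",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix applies to every object
                "Transitions": [
                    {"Days": days_until_warm, "StorageClass": storage_class}
                ],
            }
        ]
    }


policy = build_lifecycle_policy()
print(json.dumps(policy, indent=2))
```

The printed JSON could be saved to a file and applied with `aws s3api put-bucket-lifecycle-configuration`, or passed to the equivalent SDK call. Test it on a non-production bucket first.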

By creating a lifecycle policy we have unlocked instant savings in our application

By creating a policy, which requires a few clicks and no code changes, I can add tiering to my storage account. In this example, I am able to save roughly 36% on storage costs, with my new monthly cost being $1604 USD.
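As a quick sanity check, the saving can be worked out from the two monthly figures quoted in this example and then annualised. This is a small sketch, and the helper name is my own.

```python
def saving(before, after):
    """Return (absolute saving, percentage saving) between two monthly costs."""
    delta = before - after
    return delta, delta / before * 100


monthly_before = 2508  # USD per month on S3 Standard, as quoted above
monthly_after = 1604   # USD per month after tiering, as quoted above

delta, percent = saving(monthly_before, monthly_after)
print(f"Monthly saving: ${delta} ({percent:.1f}%)")
print(f"Over 12 months: ${delta * 12}")
```

The same helper works for any before/after pair, which makes it handy when comparing tiering options across several buckets.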

Not bad for a few minutes of effort and imagine the impact at scale.

For more details and a great example of tiering through to archive and deletion, read more on managing your storage lifecycle by AWS.

Block storage

Cloud providers offer various types of block-based storage, from magnetic to NVMe-based SSDs, all with different performance and cost characteristics. Below is a table that illustrates the types of disks available on AWS. Your mileage will vary between providers, but every provider should offer options.

Amazon EBS Volume Types

As cloud is a PAYG (Pay As You Go) model, it is imperative that we pay for what we need, not what we procure. Whilst this is not as straightforward as it is with object storage, there are options.

My first recommendation is to understand your application's Input/Output (I/O) profile. Monitoring platforms such as Azure Monitor, Amazon CloudWatch, Google Cloud Monitoring (formerly Stackdriver) and various application performance monitoring products can track and alert on a plethora of I/O metrics.

Look out for I/O queues and high ms (millisecond) response times. Slow I/O can have an incredible impact on OLTP based systems but may be transparent to end users in highly distributed scale out systems. Know the importance of I/O on your workload.

Plotting queue depths between IO1 (Premium) and GP2 (Standard) SSD Volumes – How to use CloudWatch metrics to decide between General Purpose or Provisioned IOPS for your RDS database.

Once you understand your usage profile and application behaviour, tweak it accordingly. The beauty of the cloud is that all providers offer a means to migrate from one tier of storage to another without losing data.

Understand the cost to serve using premium storage vs cheaper storage and how this change in performance impacts your application. While it's human nature to want the most IOPS with the least latency, it may not be practical from a cost-to-serve perspective. Understand the tools (the block disk types) you have at your disposal and make a data-driven decision on which types to use.
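One way to make that decision data-driven is to take the peak IOPS you observed in monitoring, add some headroom and pick the cheapest disk type that still covers it. The sketch below does that. The catalogue entries, prices and the 20% headroom are hypothetical placeholders, not real provider figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskType:
    name: str
    max_iops: int
    monthly_cost: float  # USD for the volume size you need


def cheapest_disk(catalogue, peak_iops, headroom=1.2):
    """Return the cheapest disk type whose IOPS ceiling covers peak demand plus headroom."""
    required = peak_iops * headroom
    candidates = [d for d in catalogue if d.max_iops >= required]
    if not candidates:
        raise ValueError(f"No disk type supports {required:.0f} IOPS")
    return min(candidates, key=lambda d: d.monthly_cost)


# Hypothetical catalogue; replace with figures from your provider's price list.
CATALOGUE = [
    DiskType("magnetic", 500, 20.0),
    DiskType("standard-ssd", 3_000, 40.0),
    DiskType("premium-ssd", 16_000, 120.0),
]

choice = cheapest_disk(CATALOGUE, peak_iops=1_800)
print(f"Cheapest suitable disk type: {choice.name}")
```

IOPS is only one dimension. Throughput and latency requirements could be added as extra filters in the same way.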

Here is an example of how to switch between block storage disk types in AWS using the AWS CLI. In this example, the EBS (Elastic Block Store) volume is altered from its current volume type to gp2 (General Purpose SSD). This process can also be performed in the console or via an SDK call, and the same pattern is available in all three major cloud providers.

#!/bin/bash
# Specify the volume ID of the EBS volume
VOLUME_ID="your-volume-id"
# Modify the volume type to gp2
aws ec2 modify-volume --volume-id "$VOLUME_ID" --volume-type gp2

Virtual Machines

Virtual Machines (VMs) are ubiquitous in the cloud. You spin up a VM, a VMSS (Virtual Machine Scale Set) or an ASG (Auto Scaling Group) and your applications run.

Simple, but that is a thing of yesteryear. All cloud providers now have over 150 different permutations of Virtual Machines with new generations and architectures launching regularly (almost weekly).

Families of VMs change from one generation to another, and new families of instance types are launched regularly. The cloud is a moving target.

Here are three guiding principles when it comes to VMs.

1. Newer is Better – Upgrade Frequently

Within the same instance family, it’s almost a given that a newer generation is going to offer better price-performance than the older generation. CPUs are becoming more efficient with every generation. Today’s machines have either lower TDP (Thermal Design Power) or higher performance scores than their predecessors on almost every benchmark.

Most of this is due to advancements in fabrication technologies, with modern CPUs being manufactured on 2-to-5nm process nodes.

As an example, let’s take the general-purpose compute offering, an AWS M Series EC2 instance, and perform a cost-to-performance evaluation. The equivalents would be the D Series in Azure and the E Series in GCP.

	m5.xlarge	m7i.xlarge
CPU Cores	4	4
RAM	16GB	16GB
Processor	Intel Xeon Skylake 8175M 3.1GHz	Intel Xeon Sapphire Rapids 8488C
PassMark Single Threaded Score	1903	3096
Cost	$0.24 USD per hour	$0.25 USD per hour

Newer is better!

In this example, the newer instance is about 63% faster on the single-threaded benchmark (3096 vs 1903) and only 4% more expensive than the old generation.
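The comparison can be reproduced from the table with a few lines of arithmetic. This sketch uses only the scores and hourly prices listed above and also derives benchmark score per dollar, which is my own choice of price-performance measure.

```python
def percent_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


# Figures taken from the comparison table above.
m5 = {"score": 1903, "usd_per_hour": 0.24}
m7i = {"score": 3096, "usd_per_hour": 0.25}

speedup = percent_change(m5["score"], m7i["score"])
price_rise = percent_change(m5["usd_per_hour"], m7i["usd_per_hour"])

# Benchmark points bought per dollar-hour, old versus new generation.
value_old = m5["score"] / m5["usd_per_hour"]
value_new = m7i["score"] / m7i["usd_per_hour"]
value_gain = percent_change(value_old, value_new)

print(f"Faster by {speedup:.1f}%, dearer by {price_rise:.1f}%, "
      f"score per dollar up {value_gain:.1f}%")
```

A single-threaded benchmark is only a proxy, so it is worth repeating the exercise with a measurement from your own workload.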

So please upgrade your instances. Given that compute constitutes the majority of most cloud bills, the gains add up to a sizable chunk of spending over time, and you should find you need fewer resources to achieve the same outcome.

How you upgrade between instance families will vary, but it can range from simply stopping your VM, changing its size and starting it again in your provider’s web interface, to updating your IaC (Infrastructure as Code) scripts or adjusting your VMSS (Virtual Machine Scale Sets) / ASGs (Auto Scaling Groups).

2. ARM64 – The Efficiency King

ARM64 (aarch64) is an alternative architecture to x86 (Intel / AMD) that can deliver significant cost savings. All providers today have ARM-based systems (Ampere in Azure, Graviton in AWS, Tau in GCP). This architecture can be as much as 50% cheaper for the same performance as an x86 system.

It sounds too good to be true, doesn’t it? Perhaps. The catch here is your workload needs to be able to be run on an ARM64 architecture. The good news here is that the humble Raspberry Pi with its Broadcom SOC (System On Chip) over the years has done much of this heavy lifting. Secondly, with Apple moving to Apple Silicon this process is only accelerating.

My rule of thumb here is, if your application is open source (MySQL, Kafka, etc.) or is built on a portable runtime or language (.NET Core, Java, Go, Python, etc.), then in 99.5% of cases it will just work.

However, if your application is COTS based and more so if you are using Microsoft Windows, today in 2024 your mileage may vary.

In most providers the cost difference between ARM and x86 is significant. Microsoft claims you will obtain up to 50% better price-performance using ARM-based compute over x86-based machines. Your mileage may vary, but this is one to look at.
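It is worth being careful with the wording here, because "50% better price-performance" and "50% cheaper" are not the same thing. Assuming price-performance means work done per dollar (my interpretation, not a vendor definition), a small sketch shows how to convert one into the other.

```python
def cost_reduction_from_price_performance(gain_percent):
    """Percentage cost reduction for the same work, given a price-performance gain.

    A gain of g% means each dollar buys (1 + g/100) times as much work,
    so the same work costs 1 / (1 + g/100) of the original spend.
    """
    return (1 - 1 / (1 + gain_percent / 100)) * 100


for gain in (25, 50, 100):
    cut = cost_reduction_from_price_performance(gain)
    print(f"{gain}% better price-performance -> {cut:.1f}% lower cost for the same work")
```

Under that assumption, a 50% price-performance gain corresponds to roughly a one-third lower bill for the same work, and the gain would need to reach 100% before the bill halves.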

3. Use SPOT Compute

Do you use SPOT instances? I love them. If SPOT is not part of your compute strategy, ask yourself why not.

SPOT instances represent excess capacity. Cloud providers must have spare capacity available for any surge in customer demand. To offset the cost of idle infrastructure, each provider offers this excess capacity at a massive discount to drive usage.

Think of this like hotel rooms, if they aren’t being used, then they are just wasted.

Cloud providers plan their capacity so you don’t need to, but it means they have a lot of capacity that, at times, is sitting idle. Capitalise on this idle capacity. The SPOT market price fluctuates based on demand.

Here is an example of the Spot Instance pricing history, showing how the SPOT price moves over time.

If there is low demand, the price will be low. If there is high demand, the price will rise and may even equal that of the on-demand price.

SPOT provides great value for your workloads, but many people are only familiar with using them for development, testing, or highly scalable and embarrassingly parallel processing workloads. This is because SPOT instances can be revoked by a provider with minimal warning, typically with as little as two minutes' notice.

An eviction notice viewed in Amazon CloudWatch (you can also poll the instance metadata).
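If you go down the instance metadata route on AWS, the notice appears as a small JSON document at `http://169.254.169.254/latest/meta-data/spot/instance-action` once an interruption is scheduled. The sketch below only covers parsing that document and working out how long you have left. Fetching it (including any metadata token handling) is left out, the function name is my own, and the sample payload is made up for illustration.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(notice_json, now):
    """Parse a spot instance-action notice; return (action, seconds remaining)."""
    notice = json.loads(notice_json)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)  # the timestamp is expressed in UTC
    return notice["action"], (when - now).total_seconds()


# Made-up sample payload and clock reading, for illustration only.
sample = '{"action": "terminate", "time": "2024-05-01T08:22:00Z"}'
now = datetime(2024, 5, 1, 8, 20, 0, tzinfo=timezone.utc)

action, remaining = seconds_until_interruption(sample, now)
print(f"Instance will {action} in {remaining:.0f} seconds")
```

In practice you would poll every few seconds and use the remaining time to drain connections or checkpoint work.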

But did you know you can use SPOT in other more applicable ways? Could it be part of your steady-state workloads? I recommend using SPOT as part of your Auto Scaling Groups.

Yes, you can combine SPOT and On-Demand instances in the same Auto Scaling Group. This means your online / synchronous workloads with respect to your end-users can be hosted using SPOT compute.

The level of SPOT integration differs between providers, but in this example, we have gone from 24 cents per hour to 9.8 cents per hour, a saving of roughly 59%. Is the juice worth the squeeze? I sure think it is, and I hope you do too.
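You rarely run a group entirely on SPOT, so it helps to model the blend. Here is a minimal sketch that keeps a fixed On-Demand base and fills a share of the remaining capacity with SPOT. The function and parameter names are my own, the group size is hypothetical, and the two hourly prices are the ones from this example.

```python
import math


def blended_hourly_cost(instances, on_demand_base, spot_share_above_base,
                        on_demand_price, spot_price):
    """Hourly cost of a group that mixes On-Demand and SPOT instances."""
    base = min(instances, on_demand_base)
    above_base = instances - base
    # Round down so we never assume more SPOT capacity than the share allows.
    spot = math.floor(above_base * spot_share_above_base)
    on_demand = instances - spot
    return on_demand * on_demand_price + spot * spot_price


ON_DEMAND, SPOT = 0.24, 0.098  # USD per hour, the two prices from this example

all_on_demand = blended_hourly_cost(10, 10, 0.0, ON_DEMAND, SPOT)
mixed = blended_hourly_cost(10, 2, 1.0, ON_DEMAND, SPOT)
saving = (all_on_demand - mixed) / all_on_demand * 100
print(f"All On-Demand: ${all_on_demand:.3f}/h, mixed: ${mixed:.3f}/h, saving {saving:.1f}%")
```

Keeping a small On-Demand base trades some of the saving for a floor of capacity that cannot be revoked.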

Keep an eye on SPOT as it becomes increasingly integrated into these platforms. This tutorial offers a fantastic end-to-end example of implementing a mixture of SPOT and On-Demand compute resources into an Auto Scaling Group.

However, before you bet the house on SPOT, have a bidding strategy to reduce your risk. To cover your elastic workloads, bid higher than the market rate but lower than the on-demand price. One of the benefits of the way the SPOT market works is that even if you bid higher than the SPOT price you pay the prevailing rate. For example, if I bid $1 per hour and the prevailing rate is $0.50 per hour, I pay $0.50.

This is great: as long as I’m bidding lower than the on-demand rate and higher than the SPOT rate, I’m automatically saving money.
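The rule can be written down in a few lines. This is a sketch of the pricing logic as described above, not of any provider's API. The function name is my own and the on-demand price of $1.20 is a made-up figure for illustration.

```python
def spot_outcome(bid, spot_price, on_demand_price):
    """Return what you pay and save per hour, or None if the bid is below the SPOT price."""
    if bid < spot_price:
        return None  # bid too low: no capacity is allocated
    return {
        "paid": spot_price,  # you pay the prevailing rate, not your bid
        "saved_vs_on_demand": on_demand_price - spot_price,
    }


# Bid and SPOT price from the example above; the on-demand price is hypothetical.
result = spot_outcome(bid=1.00, spot_price=0.50, on_demand_price=1.20)
print(result)
```

Bidding above the prevailing rate costs nothing extra. It only widens the range of prices at which you keep your capacity.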

SPOT instances might not always be available, but they should be part of your VM strategy.

Summary

Public cloud offers a multitude of opportunities for builders and architects. If you’re seeking pots of gold, they are there. The public cloud provides you with a raft of new levers that you can pull, twist and manipulate to architect for the new world.

Climb the cloud maturity curve and achieve the equivalent or better outcome at a lower cost, but remain conscious of the costs associated with change.

Architecture can evolve, but it needs to make sense.

Join me in Part 3 as we get deeper into Software Architectural optimisations you can make or reach out if you have any comments.