Welcome back. This is part two in our series on controlling and optimizing costs in the cloud. If you are new, here is the first part of the series, Cost Control Analysis Realizes Valuable ROI.
We often hear from clients who are frustrated with their cloud spending. We hear this at all levels of the business: DevOps/SRE engineers, IT managers, and CFOs/CIOs all tell us that while they enjoy working with and in the cloud, they don’t love paying for the cloud.
When we ask them what the biggest issues are regarding cloud spending, there is no shortage of answers. Typically we get responses such as:
“The cloud makes it hard to track/know what you are spending.”
“I don’t know what’s costing us the most money.”
“We don’t know which services we can turn off.”
Look, we get it. The cloud makes it super easy to launch new resources, test new projects, and build out and scale infrastructure, applications, and databases very rapidly. But it requires care and feeding to keep everything operating smoothly, and that same discipline applies to tracking your cloud spending.
Managing your cloud spending does take some time and effort, but if you follow a simple formula, you can help your organization lower its cloud spending and keep ongoing costs as low as possible.
That formula is:
↪ Monitor → RightSize and Prune → Report → Discuss → Repeat ↩
Let’s dive into each of these items in more detail.
My first rule in business is that before you can improve something, you have to measure it. When it comes to monitoring your cloud costs, you want to do three basic things: set up billing alerts, tag resources, and monitor system resources.
You should monitor your costs so you are alerted when your bill exceeds some threshold. Most cloud services charge based on usage, and you want to be notified if you begin consuming 10 times more bandwidth or storage than you have over the past few weeks or months. This increased cost may be really positive for the business if it comes with a 10x or greater increase in customer demand. But if a developer mistakenly spins up the wrong type of server, or writes a function with an error that causes it to consume additional resources, you want to discover and review that as quickly as possible.
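As a rough illustration of the "10 times more than usual" check described above, here is a minimal Python sketch. The function name, the sample figures, and the default multiplier are all assumptions made for illustration; in practice you would normally rely on your provider's built-in billing alerts rather than hand-rolled code.

```python
from statistics import mean


def is_cost_spike(daily_costs, today_cost, multiplier=10.0):
    """Return True when today's cost is at least `multiplier` times the trailing average."""
    if not daily_costs:
        return False  # no baseline yet, so there is nothing to compare against
    baseline = mean(daily_costs)
    return baseline > 0 and today_cost >= multiplier * baseline


# Hypothetical daily spend, in dollars, for the last few days.
history = [12.0, 11.5, 13.0, 12.5]

print(is_cost_spike(history, 14.0))   # a normal day
print(is_cost_spike(history, 130.0))  # more than 10x the recent average
```

A relative check like this catches sudden jumps, while a fixed dollar threshold catches slow growth, so the two are usually worth combining.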
Once you have billing alerts set up, tag your resources if you haven’t already. Tagging is the process of adding key-value labels to your resources, which makes it easier to break out costs into functional areas. For example, you can add application names, business unit names, or cost center IDs to nearly all of your cloud resources. These tags can then be included in cost allocation reports and used as filters. At billing review time, this makes it easier to see which applications and groups of resources have increased cloud spending. You can also create tag policies to enforce compliance, which helps ensure all of your resources can be accurately tracked.
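To make the tagging idea concrete, the sketch below checks resources against a set of required tags and totals cost by tag. The tag keys, resource records, and function names are hypothetical; real data would come from your provider's cost allocation report or inventory API.

```python
from collections import defaultdict

# Hypothetical tag keys that every resource is expected to carry.
REQUIRED_TAGS = {"application", "business_unit", "cost_center"}


def missing_tags(resource):
    """Return the required tag keys that this resource does not have."""
    return REQUIRED_TAGS - set(resource.get("tags", {}))


def cost_by_tag(resources, tag_key):
    """Sum monthly cost per value of `tag_key`; untagged resources are grouped together."""
    totals = defaultdict(float)
    for resource in resources:
        value = resource.get("tags", {}).get(tag_key, "untagged")
        totals[value] += resource["monthly_cost"]
    return dict(totals)


# Hypothetical inventory with monthly cost in dollars.
resources = [
    {"id": "i-001", "monthly_cost": 70.0,
     "tags": {"application": "checkout", "business_unit": "retail", "cost_center": "cc-100"}},
    {"id": "i-002", "monthly_cost": 30.0, "tags": {"application": "checkout"}},
    {"id": "db-001", "monthly_cost": 200.0, "tags": {}},
]

for resource in resources:
    gaps = missing_tags(resource)
    if gaps:
        print(resource["id"], "is missing tags:", sorted(gaps))

print(cost_by_tag(resources, "application"))
```

A large "untagged" total is itself a useful signal: it shows how much of the bill cannot yet be attributed to an owner.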
Monitor Resource Utilization
Because the cloud is a pay-for-what-you-use model, you want to make sure that you aren’t being wasteful when it comes to resource allocation. You want as many resources as are needed to do the job, but no more. So set up a monitoring tool that makes it easy to check key metrics such as processor, memory, and storage utilization and bandwidth consumed.
Once you have both resource tagging and monitoring in place, it is easy to correlate costs to resource utilization. And as a bonus, you’ll have good data to have more in-depth conversations with business unit owners.
RightSize and Prune
Along with your monthly billing reports, most cloud providers make other reports available to help you identify areas where you can save money. For example, some offer an underutilized assets report. While this report can be useful, it may be too simplistic. It may, for instance, flag instances as underutilized based on CPU utilization alone. If you have servers where memory utilization is the critical metric, these reports may not offer you much insight.
If you have tagged your assets and have resource monitoring set up, you can dig deeper into the billing data. Start by reviewing your largest line items. Assuming an AWS account, look at your monthly bill and pick the top 2 or 3 line items. For many customers, EC2 and RDS are the top line items.
Review the EC2 instances, and look at your monitoring tool. Sort on your most critical metrics. For example, if those are CPU and memory, look for instances where both are low. Make a list of these resources, and record their tags. Do a little more digging to verify the low utilization isn’t just due to quiet periods of an auto-scaled application. For example, if you see, say, 10, 30, or 100 instances with the same tags, and many or all of them show low utilization, you may have found an excellent opportunity to save costs by using fewer and/or smaller instances. This is an excellent way to start a meaningful conversation with the business/application owner about saving costs.
For RDS, have a look at the size and configuration of the databases. Often I see clients create Multi-AZ instances by default. While appropriate for production workloads, Multi-AZ is rarely required for non-production workloads. Check the monitoring tab for RDS, and check the number of connections to each DB. If you see any DBs with zero connections over the past 2 weeks, follow up with the business owner. These are probably databases that can be deleted.
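The review described above can be sketched in a few lines of Python: filter for instances where both CPU and memory are low, then group by tag to see which applications are mostly idle. The thresholds, field names, and sample data are assumptions for illustration, and the averages should cover a long enough window to rule out quiet periods of an auto-scaled application.

```python
from collections import Counter


def underutilized(instances, cpu_max=20.0, mem_max=30.0):
    """Return instances whose average CPU and memory are both below the thresholds."""
    return [i for i in instances
            if i["avg_cpu"] < cpu_max and i["avg_mem"] < mem_max]


def idle_share_by_app(instances, cpu_max=20.0, mem_max=30.0):
    """Return, per application tag, the fraction of its instances that are underutilized."""
    total = Counter(i["app"] for i in instances)
    idle = Counter(i["app"] for i in underutilized(instances, cpu_max, mem_max))
    return {app: idle[app] / total[app] for app in total}


# Hypothetical utilization averages (percent) pulled from a monitoring tool.
instances = [
    {"id": "i-101", "app": "web", "avg_cpu": 5.0, "avg_mem": 12.0},
    {"id": "i-102", "app": "web", "avg_cpu": 7.0, "avg_mem": 15.0},
    {"id": "i-103", "app": "web", "avg_cpu": 4.0, "avg_mem": 10.0},
    {"id": "i-104", "app": "web", "avg_cpu": 65.0, "avg_mem": 40.0},
    {"id": "i-201", "app": "batch", "avg_cpu": 10.0, "avg_mem": 85.0},
    {"id": "i-202", "app": "batch", "avg_cpu": 55.0, "avg_mem": 70.0},
]

print(idle_share_by_app(instances))
```

Requiring both metrics to be low avoids flagging memory-bound servers, such as the hypothetical i-201 here, which a CPU-only report would mark as idle.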
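The two RDS checks above (unused databases and Multi-AZ outside production) might look like the following sketch. The record layout and names are hypothetical; the connection counts stand in for whatever your monitoring tool reports as the daily maximum number of connections.

```python
def review_databases(databases):
    """Return (name, finding) pairs for databases that deserve a follow-up conversation."""
    findings = []
    for db in databases:
        # No connections on any day in the window suggests the database is unused.
        if max(db["daily_max_connections"], default=0) == 0:
            findings.append((db["name"], "no connections in the window; candidate for deletion"))
        # Multi-AZ roughly doubles instance cost and is rarely needed outside production.
        if db["multi_az"] and db["environment"] != "production":
            findings.append((db["name"], "Multi-AZ on a non-production database"))
    return findings


# Hypothetical inventory; the last entry covers a 14-day window with zero connections.
databases = [
    {"name": "orders-prod", "environment": "production", "multi_az": True,
     "daily_max_connections": [40, 38, 51]},
    {"name": "orders-staging", "environment": "staging", "multi_az": True,
     "daily_max_connections": [3, 0, 2]},
    {"name": "legacy-test", "environment": "test", "multi_az": False,
     "daily_max_connections": [0] * 14},
]

for name, finding in review_databases(databases):
    print(f"{name}: {finding}")
```

The output is a list of talking points, not a delete list. Each finding still needs confirmation from the database owner.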
Use these conversations as a chance to educate, rather than antagonize, your colleagues. For example, if someone says, “oh we only use that database during release testing, but we can’t delete it,” let them know there are a few options to save money. Ask them if they knew RDS instances can be stopped for up to 7 days at a time. Or, if they can delete the database when it’s not being used, take a final snapshot before deleting it, and then the database can be restored from that snapshot the next time it’s needed. This is also a great time to point out how quickly resources can be spun up and down with Infrastructure as Code tools like Terraform or CloudFormation. Remind them that the benefit of the cloud is in paying only for what you need. If they need a test database to be up maybe 10% or 20% of the time, tell them that this is possible to set up, and that it can save the company 80% or more of the cost.
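The 80% figure follows directly from the uptime fraction, as the short calculation below shows. The hourly rate is a made-up number, and the sketch covers instance hours only: storage and snapshots are still billed while a database is stopped, so real savings will be somewhat lower.

```python
def monthly_compute_cost(hourly_rate, uptime_fraction, hours_in_month=730):
    """Instance-hour cost for a month, given the fraction of time the database is running."""
    return hourly_rate * hours_in_month * uptime_fraction


def savings_fraction(uptime_fraction):
    """Share of the always-on instance cost avoided by running only part of the time."""
    return 1.0 - uptime_fraction


HOURLY_RATE = 0.50  # hypothetical on-demand price in dollars per hour

always_on = monthly_compute_cost(HOURLY_RATE, 1.0)
part_time = monthly_compute_cost(HOURLY_RATE, 0.2)

print(f"always on: ${always_on:.2f}")
print(f"up 20% of the time: ${part_time:.2f}")
print(f"savings: {savings_fraction(0.2):.0%}")
```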
Now, share what you have learned, and put it into practice. For non-prod environments, consider automating account pruning. For example, for sandbox/test environments, you can run terraform destroy every evening or every weekend, and recreate the environments on the next business day.
Foster an environment of transparency and accountability by sharing the monthly bills and any custom reports you’ve created with as many people within your organization as your compliance policies allow. The goal here is to get everyone thinking about how they can help save on costs. Once people have the data, they can begin asking questions and having the right conversations with their teams. If you want, for the first few reports, add some notes or insights that show how you are using the data and which key metrics others should pay attention to, and encourage feedback and questions.
Be prepared for questions, and encourage others to ask them. Understand that for the first month or two, some teams may not like having a spotlight shone on their costs for everyone to see, but keep the conversations going, and structure them in a positive manner so everyone feels empowered to ask meaningful questions. If detailed cost reporting is a new initiative at your company, consider hosting a kick-off call where you walk key stakeholders through the reports and encourage them to ask questions. You may have to explain how the cloud charges for certain services. Also look for opportunities to do architectural reviews with teams to explore ways to improve performance and cut costs.
Now that you have monitoring, alerting, and tagging set up, and you’ve analyzed, right-sized, and pruned your resources, you’re well on your way to maintaining a cost-optimized cloud environment. Set aside a little bit of time each month to review the previous month’s bill, and look for any resources that look out of place or are underutilized. If your organization supports it, send the billing reports out via email or Slack, and encourage all teams to review and discuss them. Continue to have conversations with people across the business, and help educate them on the ways in which they can help save money.
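Returning to the scheduled teardown of sandbox environments: one possible shape for that automation is a small script, run by a scheduler, that decides whether the environment should exist right now and builds the matching Terraform command. The business hours, the directory name, and both function names are assumptions for illustration. The sketch only constructs the command and leaves running it to you.

```python
from datetime import datetime


def desired_action(now, start_hour=8, end_hour=18):
    """Return "apply" during weekday business hours and "destroy" otherwise."""
    is_weekday = now.weekday() < 5  # Monday is 0, Saturday is 5
    in_hours = start_hour <= now.hour < end_hour
    return "apply" if is_weekday and in_hours else "destroy"


def terraform_command(action, workdir):
    """Build a non-interactive Terraform command for the given working directory."""
    return ["terraform", f"-chdir={workdir}", action, "-auto-approve"]


# Hypothetical sandbox directory; a Saturday morning should result in a teardown.
action = desired_action(datetime(2024, 1, 6, 10, 0))
print(terraform_command(action, "envs/sandbox"))
```

The resulting list can be passed to `subprocess.run(..., check=True)` from a cron job or CI schedule. Because `-auto-approve` skips the confirmation prompt, point this only at environments that are safe to destroy.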