Cloud TCO – Good Housekeeping

In discussions of cloud computing and data center migrations, the question of “sticker shock” sometimes comes up. Sticker shock is real, but as we’ll show in this series of blog posts, it stems from applying the old data center (DC) and ITIL mindset to the newer cloud computing domain. Sensible housekeeping and the diligent application of best practices should allow a company to save money using cloud providers instead of traditional fixed-footprint data center deployments.

In cost-benefit discussions around cloud computing, it is easy to overlook the secondary benefits of not hosting your own infrastructure. Sure, you can run your app on a server under your desk, or even on a spare Raspberry Pi. But it’s not going to be geographically distributed, highly available, or resilient (to power outages, DDoS attacks, or any of a hundred other failure scenarios). Cloud computing today is multi-DC by default, and includes out-of-the-box monitoring, backups, and availability targets of five nines or better.

These are things that are hard to do on your own. Imagine running just two geographically-distributed servers. At the very least, that would require:

  • facilities (power, space, and cooling)
  • provisioning (operating system installation and updates, application code install and configuration)
  • load-balancing or failover between the two

For a company whose primary business isn’t infrastructure, that’s already a decent amount of work and physical real estate to manage. Add an SLO for reliability and availability, application and database backups, or performance monitoring and alerting, and you quickly realize you need to hire an expert. And lest one person hold the keys to the kingdom, you’re going to want a team to manage the infrastructure and be on call in case something fails.

But what, you ask, specifically can help save costs? Or maybe you’re shaking the monitor, shouting “Show me the money!” Fair enough. Let’s talk about some specific capabilities for resource management and avoiding cost overruns. At a high level, the promise of cloud computing is to pay for only what you use. Delivering on that promise requires planning and estimation, diligence in resource tracking, and defined policies and life cycles for de-provisioning.

Planning and estimation

There are a few standard tools for lowering compute costs. Both GCP and AWS have a notion of pre-purchasing compute resources you plan to use: on AWS these are Reserved Instances, and on GCP, committed use discounts. Additionally, both providers offer drastically reduced-cost compute from spare or underutilized capacity: Spot pricing on AWS, and preemptible VMs on GCP.

To make use of pre-purchasing, you’ll need to do capacity planning and determine your expected workload. Assuming you know roughly what your platform requires, AWS or GCP technical account managers — or Foghorn’s subject matter experts — can help you formalize that estimate and apply it to each cloud provider’s pricing structure.

With a capacity plan in hand, pre-purchases for expected use, and proper utilization of spare capacity, the cost of running virtual cloud servers can often work out to a quarter of, or even an order of magnitude less than, the published on-demand pricing.
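To make that concrete, here’s a back-of-the-envelope blended-cost sketch. The hourly rates below are hypothetical placeholders, not published prices; plug in the figures from your provider’s pricing page and your own capacity plan.

```python
# Hypothetical per-VM hourly rates (illustrative only):
ON_DEMAND_RATE = 0.10  # published on-demand price
RESERVED_RATE  = 0.06  # ~40% discount for a pre-purchase commitment
SPOT_RATE      = 0.03  # ~70% discount for interruptible spare capacity

def blended_hourly_cost(baseline_vms, burst_vms):
    """Cost per hour if the steady baseline runs on reserved capacity
    and burst load runs on spot/preemptible capacity."""
    return baseline_vms * RESERVED_RATE + burst_vms * SPOT_RATE

def on_demand_hourly_cost(baseline_vms, burst_vms):
    """Cost per hour if everything runs at the on-demand rate."""
    return (baseline_vms + burst_vms) * ON_DEMAND_RATE

# 10 steady VMs plus 20 burst VMs:
blended = blended_hourly_cost(10, 20)
naive = on_demand_hourly_cost(10, 20)
print(f"blended: ${blended:.2f}/h vs on-demand: ${naive:.2f}/h")
```

With these made-up rates, the blended approach costs $1.20/hour against $3.00/hour on-demand, a 60% saving — and the gap only widens if burst capacity dominates.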

Tracking

While it’s easy to launch a virtual server manually, giving it whatever name makes sense at that moment, your CFO won’t want to hear, a year later, that no one knows whether that FTP Test server is, in fact, critical infrastructure, or that Joe’s throwaway DB actually supports the new Marketing initiative (and has been running non-stop for the past three years). And yet, I’ve seen this exact situation multiple times.

Moving from capacity planning to deployment planning, it helps to adopt standard methods for naming, tagging, and separating environments and applications. Assuming you’re using an automation framework like Terraform (an important, though separate discussion), adopt a standard tagging structure, and construct resource names consistently. For example:

variable "tags" {
  description = "Common tags"
  type        = "map"

  default = {
    ManagedBy   = "Terraform"
    CostCenter  = "Marketing"
    Environment = "staging"
    Application = "Blog"
    Class       = "Reserved" # or On-Demand, Spare
  }
}

locals {
  app_name = "${var.tags["Environment"]}-${lower(var.tags["Application"])}"
}

Don’t take that particular example as the best choice of tags or naming. Just adopt some consistent approach and stick with it. Doing so greatly aids tracking and cost reporting down the road. And though not strictly cost-related, consistent tagging has other benefits, such as environment isolation and security auditing.
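Once a tagging standard exists, it’s straightforward to audit for drift. Here’s a minimal sketch that checks an inventory against the required tag set from the example above; the hard-coded resource list is illustrative, and in practice would come from your provider’s inventory API.

```python
# Required tag keys, matching the example tagging scheme (an assumption,
# not a standard — substitute your own).
REQUIRED_TAGS = {"ManagedBy", "CostCenter", "Environment", "Application", "Class"}

def missing_tags(resource_tags):
    """Return the set of required tag keys a resource is missing."""
    return REQUIRED_TAGS - set(resource_tags)

# Illustrative inventory; real data would come from the cloud API.
inventory = [
    {"name": "staging-blog-1",
     "tags": {"ManagedBy": "Terraform", "CostCenter": "Marketing",
              "Environment": "staging", "Application": "Blog",
              "Class": "Reserved"}},
    {"name": "ftp-test",
     "tags": {"ManagedBy": "hand"}},  # the mystery box no one can account for
]

for res in inventory:
    gaps = missing_tags(res["tags"])
    if gaps:
        print(f"{res['name']}: missing {sorted(gaps)}")
```

Run on a schedule, a report like this surfaces untagged resources before they become next year’s “is that FTP Test server critical?” conversation.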

Deprovisioning

So you’ve done the capacity planning, and you’re making use of reserved and spot instances where it makes sense. Your environments are all named and tagged consistently. What’s next? In the spirit of “pay for only what you use,” now we can go about turning off or removing unneeded resources. Try shutting off power in the data center for a few hours each day and see how well that works!

Cloud deployments can scale horizontally essentially without limit, which is great for large traffic spikes or large analytics workloads. But scaling down unused resources means that, when there’s little traffic or when batch jobs are done, you pay nothing for the remaining hours of the day. By now this should be common knowledge. Tagging, however, gives you more dimensions to work with. You might deploy periodic “janitor” tasks to delete all staging resources, or reclaim development environments every Friday night.
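The selection logic behind such a janitor task is simple once tags are in place. Here’s a sketch over an in-memory inventory (the names and ages are made up); a real job would list resources via the cloud API and then delete whatever this function returns.

```python
from datetime import datetime, timedelta, timezone

def reclaimable(resources, environment, max_age, now):
    """Names of resources in the given environment older than max_age."""
    return [r["name"] for r in resources
            if r["tags"].get("Environment") == environment
            and now - r["launched"] > max_age]

# Illustrative inventory; a real janitor would fetch this from the API.
now = datetime(2018, 9, 3, tzinfo=timezone.utc)
resources = [
    {"name": "dev-blog-7", "tags": {"Environment": "development"},
     "launched": now - timedelta(days=12)},
    {"name": "staging-blog-1", "tags": {"Environment": "staging"},
     "launched": now - timedelta(days=30)},
]

# Friday-night sweep: reclaim development machines older than a week.
print(reclaimable(resources, "development", timedelta(days=7), now))
# ['dev-blog-7']
```

Because the filter keys off the Environment tag, the same job can run with different ages per environment — aggressive for development, conservative (or never) for production.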

Other automated processes manage backups and their retention or deletion. Tagging allows for different retention and redundancy targets for production versus development environments. You might replicate production snapshots to multiple data centers or regions, whereas staging is saved in only a single region, and development not at all.
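Expressed as code, that tiered policy is just a lookup keyed on the Environment tag. The day counts and region names below are illustrative, not recommendations.

```python
# Hypothetical retention tiers keyed on the Environment tag.
RETENTION = {
    "production":  {"keep_days": 90, "regions": ["us-east1", "europe-west1"]},
    "staging":     {"keep_days": 14, "regions": ["us-east1"]},
    "development": {"keep_days": 0,  "regions": []},  # no snapshots at all
}

def policy_for(tags):
    """Retention policy for a resource; untagged falls back to the
    cheapest tier, so mystery resources don't silently accrue snapshots."""
    return RETENTION.get(tags.get("Environment"), RETENTION["development"])

print(policy_for({"Environment": "production"}))
```

A backup job that consults `policy_for` before snapshotting gives each environment exactly the redundancy it warrants — and nothing more.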

Turn off the lights on your way out

I’ve talked a fair amount here about the work you need to do to properly manage resources. And the reality is that infrastructure as code takes work. Defining service-level objectives, writing the infrastructure provisioning code, and testing that code both for deployment and disaster recovery takes diligence and time. Scheduling and automated invocation of janitorial tasks can grow into an entire internal application platform. But what if you do none of that? Are there still “out-of-the-box” benefits from cloud infrastructure?

Cloud providers commonly adopt the philosophy of “least privilege.” Among other things, this means a new cloud service account will have resource limits enabled, so you can’t accidentally take down the Eastern seaboard with a runaway data science sandbox. In Google Cloud Platform, services are disabled by default, meaning you can’t even invoke the platform services by accident. While not enabled by default, budget alerting is simple to set up, sending you notifications if your organization accidentally (or intentionally) exceeds its budget.
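The alerting model is worth internalizing even if the console sets it up for you. A sketch of the threshold logic, with made-up spend figures (the 50/90/100% steps mirror GCP’s default budget-alert thresholds):

```python
def budget_alerts(spend, budget, thresholds=(0.5, 0.9, 1.0)):
    """Return the budget fractions that month-to-date spend has crossed."""
    return [t for t in thresholds if spend >= t * budget]

# $950 spent against a $1,000 monthly budget:
print(budget_alerts(950, 1000))
# [0.5, 0.9]
```

Two notifications have fired, and the 100% alert is waiting — which is exactly the kind of early warning a physical data center never sends you.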

Finally, the cloud platforms themselves are software that you’re not managing or developing. The cloud providers all continue to refine their offerings (caught, to our benefit, in a sort of arms race against each other), meaning that new capabilities are constantly being introduced. Just the other day, in fact, AWS introduced lifecycle management for disk snapshots. As they found:

It turns out that many of our customers have invested in tools to automate the creation of snapshots, but have skimped on the retention and deletion.

Compare this model to one in which your IT team manages a physical data center (or two, or more), and where any such capabilities would come only from the hard work of your in-house infrastructure team. Or, compare this to the traditional fixed-infrastructure model, which involves long-term vendor contracts with big enterprise providers, often requiring involved RFP processes, months of negotiations, and the ensuing multi-year lock-in agreement.

Don’t get me wrong. I love a fancy steak dinner meeting at the vendor’s expense. I’ve been to some great restaurants that I wouldn’t otherwise foot the bill for myself! But that model just doesn’t scale well — in particular, it doesn’t scale down to a minimum viable product. Two-pizza teams can start on the cloud for pennies, and still scale indefinitely. Next time you’re streaming an HD movie from a certain big video distribution service, remind yourself that that entire platform runs on a well-known cloud provider.

September 3rd, 2018 | AWS, Cloud, Cloud Management, Cost Optimization, GCP, Public Cloud