To NAT or to Proxy, That is the Question…


A Better Way to Manage Internet Access for VPC Resources

Anyone who has run OpsWorks stacks with private instances relying on NAT for Internet access may have seen firsthand what danger lurks beneath the surface.  When NAT dies, those instances can no longer reach the OpsWorks endpoint, so they show as unhealthy, and if you have Auto-Healing turned on (who wouldn’t?), get ready for OpsWorks auto-destruction, even though your app was running just fine the whole time.  (Read my NAT write-up.)

What about that AWS Managed NAT Gateway?

When I started writing this post, AWS’ Managed NAT Gateway had not been released.  Such is the speed of innovation at AWS, I can’t even get my blog post out before they improve on VPC NAT infrastructure.  Even with this new and very welcome product, this blog post is still valid.  While Managed NAT Gateway solves many of the pains of running your own NAT instances, it still adheres to the same paradigms.  So, how about addressing Internet access in an entirely different way than NAT and eliminating those limitations?  Enter the proxy.  Truthfully, a proxy is more complicated to manage than NAT (especially compared to a managed gateway), but the benefits may outweigh the complexities.  We solved these complexities by packaging up a proxy as a self-contained OpsWorks stack in a CloudFormation template, enabling users to easily manage secure Internet access.  But first, a primer.  Proxy?

Proxy?

Yes, proxy.  The Foghorn Web Services (FWS) proxy is a caching proxy to the Internet.  In normal use cases, this type of proxy is designed to increase performance through caching.  In our application, we use it for extended access controls and domain-based whitelisting.  But the advantages don’t stop at security.

Everybody Gets Internet!

Transitive Relationships & VPC Peering

Many of our customers want unique VPCs for isolation, whether based on their end customers, their own segmentation preferences, or other reasons like dedicated tenancy within one VPC (but not within all of their VPCs).  In any case, once an architecture extends to more than one VPC, you naturally find a reason to connect them.  This most commonly ends up being a shared services layer for things like VPN access, CI/CD tools, source code repositories, log aggregation, and monitoring.  It would be nice if you could manage Internet access in this same shared service tier, but VPC Peering does not support transitive relationships.  If you don’t feel like reading up on that, the basic premise is that traffic can go from one VPC to another when peered, but cannot go through that peered VPC to somewhere else (like a NAT instance, a NAT Gateway, a Virtual Gateway, an Internet Gateway, etc.).

This means that a NAT service layer in a shared services VPC cannot be leveraged by peered VPCs for shared Internet access, so each VPC needs its own NAT service layer.  A proxy, however, does not behave like NAT.  Clients open an ordinary TCP connection that terminates at the proxy, and the proxy then makes its own outbound connection, so client traffic never has to route through the peered VPC to a gateway.  The FWS proxy layer residing in a shared services VPC can therefore enable Internet access for peered VPCs.  Now you can manage Internet access like any other shared service.  Proxy 1, NAT 0.

You’re Trying to go Where?

Whitelisting

With NAT we can easily apply port-based traffic control by simply modifying the security group used by NAT.  So instead of all traffic, we can limit it to just 22/80/443, for example.  But the ports used are only part of the equation.  What if we wanted to whitelist by destination domain?  Why allow access to anywhere on the Internet over 443 when all we really need is GitHub and AWS?  I doubt anyone would argue that a whitelist is easier to manage by IP addresses than by domain names.  Again, this is where FWS proxy shines over NAT.  We can easily create whitelist files and have our OpsWorks stack pull them from S3 and update the proxy during a configure lifecycle recipe.  The inverse is true as well; we can also manage blacklists with our proxy.  But since we are going for the least necessary access, we prefer whitelisting.  Consider for a minute how great it is to just drop traffic you don’t need.  Proxy 2, NAT 0.
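The whitelist itself can be as simple as a text file of domain suffixes.  As a rough sketch of the kind of check a domain whitelist implies (the file name, format, and matching logic below are illustrative, not the actual FWS implementation):

```shell
# Hypothetical whitelist file: one domain suffix per line.
cat > /tmp/whitelist.txt <<'EOF'
.github.com
.amazonaws.com
EOF

# Sketch of the decision a whitelisting proxy makes: allow a request only
# if the destination domain matches a whitelisted suffix.
domain_allowed() {
  local domain=".$1"
  while read -r suffix; do
    case "$domain" in *"$suffix") return 0 ;; esac
  done < /tmp/whitelist.txt
  return 1
}

domain_allowed api.github.com && echo allowed || echo denied   # prints "allowed"
domain_allowed example.com    && echo allowed || echo denied   # prints "denied"
```

Updating access policy then means editing one text file, rather than chasing IP addresses through security groups.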

Caching

While the main purpose of this blog post is to address the network and security benefits of a proxy over NAT and how we manage FWS proxy via OpsWorks, it doesn’t have to stop there.  Our proxy can be configured as a DNS cache and/or a web cache.  Storage is inexpensive at AWS, so why not cache some static content and speed up deployment runs?  The same is true for DNS: while long cache times on DNS results can be risky, a reasonable TTL can speed things up.  Either way, these are options, not requirements; we are simply extending the features of our Internet access service tier.  Proxy 3, NAT 0.

Who is Pulling from THAT repo? 

Easy to manage?  Surely you can’t be serious?

I am serious, and don’t call me Shirley.  We built a CloudFormation template that creates an Elastic Load Balancer, an OpsWorks stack, an FWS proxy layer, and load-based scaling instances (along with the necessary security groups and IAM roles).  It adds our custom recipes to the setup and configure lifecycle events.  Lastly, it writes the common configuration elements of the proxy to the stack’s Custom JSON, enabling users to easily manage parameters without needing to understand how Chef works.  The most common change is editing the domain whitelist for the proxy, which is managed as a simple text file stored in S3.  The configure recipe retrieves that file, overwrites the current whitelist with the new one, and reloads the proxy service.  This is significantly faster than deploying a new proxy instance (say, through an Auto Scaling group with the whitelist baked into the launch configuration).
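To give a feel for the approach, a stack's Custom JSON might look something like the following.  The key names here are illustrative, not the actual FWS template schema:

```json
{
  "fws_proxy": {
    "whitelist_s3_url": "s3://my-bucket/proxy/whitelist.txt",
    "enforce_whitelist": true
  }
}
```

A user edits the whitelist file in S3 (or a value in this JSON), reruns the configure event, and the fleet picks up the change without replacing any instances.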

But wait, there’s more…

Whitelisting is great, but sometimes you need to figure out what traffic belongs on the whitelist.  We have a true/false configuration item in the OpsWorks stack’s Custom JSON that either ignores (false) or enforces (true) the whitelist.  This, along with CloudWatch Logs integration (also part of the template), helps you determine which domains are currently being used.  You can go to the CloudWatch Logs interface and simply view the log streams; they contain all the domains being accessed, along with the HTTP result codes.  This can be used to build the whitelist, audit Internet activity, or even serve as a troubleshooting tool.
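Mining those logs for whitelist candidates is a one-liner.  The log format below is made up for illustration (the actual FWS log layout may differ), but the idea carries over to any access log whose last field is the destination:

```shell
# Hypothetical sample of proxy access-log lines, as they might appear in a
# CloudWatch Logs stream: timestamp, HTTP result code, client IP, method, destination.
cat > /tmp/access.log <<'EOF'
1457990000.123 200 10.0.1.15 CONNECT api.github.com:443
1457990001.456 200 10.0.1.22 GET http://repo.maven.apache.org/maven2/
1457990002.789 403 10.0.2.31 CONNECT telemetry.example.net:443
EOF

# Take the last field, strip any scheme and any port/path, and dedupe to get
# the unique destination hosts observed.
awk '{print $NF}' /tmp/access.log | sed -e 's|^[a-z]*://||' -e 's|[:/].*||' | sort -u
# prints:
#   api.github.com
#   repo.maven.apache.org
#   telemetry.example.net
```

Run your stack in ignore mode for a while, pull the unique hosts, review them, and flip the flag to enforce.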

But I Need Persistent Source IPs

What if some of your destination services require whitelisting your addresses?  Not a problem: we have enabled Elastic IP addresses to provide persistent public IP addressing for a dynamic proxy fleet (this is the default behavior and can be disabled to save on EIP costs).

The default configuration even accounts for all private networks supported by AWS VPC.  You don’t have to worry about which network ranges are chosen, just that they have outbound security group access to the FWS proxy ELB.
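For reference, a VPC CIDR will fall within one of the standard RFC 1918 private ranges, which is what allows a single default configuration to cover any client VPC (assuming its instances are permitted outbound access to the proxy tier):

```text
10.0.0.0/8
172.16.0.0/12
192.168.0.0/16
```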

So what’s the catch?

Like anything, you don’t know what you don’t know.  If you don’t know where on the Internet your servers need to go, be prepared for some short-term pain.  Furthermore, adopting a proxy varies based on the operating system you are using and how you build your servers.  You will need to adjust the core OS to support the proxy (if you plan to remove direct Internet access entirely).  And not only do you need to know how to configure each service to use your proxy, you also need to know where that service intends to go.  Domain-based whitelisting is a powerful security tool, but it is predicated on knowing the destination domains.  Refer to the earlier section on how we make that process easier with this solution.
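Those OS adjustments usually start with the standard proxy environment variables, which most command-line tools honor.  The endpoint below is a placeholder; substitute your own proxy's DNS name and port:

```shell
# Hypothetical endpoint; substitute your FWS proxy ELB's DNS name and port.
PROXY_URL="http://fws-proxy.internal.example:3128"

# Most CLI tools (curl, wget, pip, package-manager helpers) honor these.
export http_proxy="$PROXY_URL"
export https_proxy="$PROXY_URL"

# Keep local and instance-metadata traffic off the proxy.
export no_proxy="localhost,127.0.0.1,169.254.169.254"

# Some services need their own configuration, for example:
#   git config --global http.proxy "$PROXY_URL"
#   echo "proxy=$PROXY_URL" >> /etc/yum.conf
```

Baking these into your AMIs or bootstrap scripts is typically where the per-OS, per-service work mentioned above shows up.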

On a closing note, while a proxy may be a great substitute for NAT in certain use cases, it is still simply an alternative tool.  If the concepts discussed here are appealing, a proxy may be a better architecture for you than NAT.  But that is not to say there is no place for NAT; there certainly still is (refer to my previous post on NAT).

AWS Marketplace Cluster Perhaps?

Not interested in the user-focused, OpsWorks-managed environment?  No problem: thanks to AWS’ new Marketplace support for Clusters, Foghorn is creating a cluster solution encapsulating the core principles of this post, available to launch via the Marketplace Cluster feature set.  Stay tuned for details on that offering.

Need help?

Although this packaged service is user friendly, the key is understanding your environment and Internet usage before migrating over.  Foghorn is here to help you with your cloud initiative; don’t hesitate to give us a call!