Highly Available Network Address Translation, that friend you love to hate…


if you care about security, you should care about NAT

Who is NAT and what is HA?

As we outlined in our last blog, Amazon Web Services (AWS) introduced Virtual Private Cloud (VPC) years ago, and many advanced networking and security concepts are only available to VPC customers. The push for VPC adoption has progressed to the point that “EC2 Classic” is now a basic VPC with a public subnet, set up for you by default. Network Address Translation (NAT) is a basic concept you can associate with your home Internet router: that device takes your single public IP address and shares it with a private network of machines. In AWS, an instance in a VPC subnet whose route table includes an Internet Gateway must have an Elastic IP address or public IP address associated with it in order to reach the Internet. An instance in a subnet without an Internet Gateway route (a private subnet) needs another way to get to the Internet, and that is where NAT comes in: a NAT instance in the public subnet becomes the route to the Internet for the private subnet instances. High Availability, or HA (not so much HA-HA funny), is how you address the issue of single points of failure. If NAT is required for the availability of your workload, you need NAT to be Highly Available (HA) as well. So what now?
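To make that concrete, here is a minimal sketch of wiring a private subnet to a NAT instance with the AWS CLI. The instance and route table IDs are placeholders, and it assumes your credentials and region are already configured:

    # Disable source/destination checking so the NAT instance can forward traffic
    aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --no-source-dest-check

    # Point the private subnet's default route at the NAT instance
    aws ec2 create-route --route-table-id rtb-11111111 \
        --destination-cidr-block 0.0.0.0/0 --instance-id i-0123456789abcdef0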

I’m better off alone…

Before we get into making NAT HA (and the risks of some methods), it would seem like an easy solution to just use subnets with an Internet Gateway route in your VPC (public subnets). This way, you can give your instances public or Elastic IP addresses and everything works great. This is true, and even if we ignore security risks, this is the recommended approach for any workload that heavily uses Internet bandwidth (gets/puts to S3 immediately come to mind). That said, once you give your instance a public IP address, you are relying solely on the network ACL of that subnet and the Security Group of that instance to control access (and mitigate risk). What if someone troubleshooting decides to open the SSH ingress rule up to anyone instead of the private IP CIDR you previously had? That instance is now exposed, since it has a public IP address. This may be outside your security policy (or possibly even a compliance issue). The reason NAT is worth bothering with is that private subnet instances have no public or Elastic IP address (and even if they did, it would not matter). These instances are truly private and, as such, are more secure and more tolerant of inadvertent security group changes.
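As an illustration of how small that “troubleshooting” change is, here is a hedged sketch with the AWS CLI; the security group ID and CIDR are placeholders:

    # SSH limited to your private CIDR (the intent)
    aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
        --protocol tcp --port 22 --cidr 10.0.0.0/16

    # One "temporary" troubleshooting change later, SSH is open to the world
    aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
        --protocol tcp --port 22 --cidr 0.0.0.0/0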

you need to define your own health checks with your entire stack in mind, and not simply rely on default values

Maybe I’m better off with NAT in my life…

There are two well-documented approaches to creating Highly Available NAT. The first is an often-referenced article by Jinesh Varia documenting High Availability for Amazon VPC NAT. This is a great article, and the last paragraph is the most important component to internalize: Appendix A outlines the risk of false positives. While Jinesh refers to this as an “edge case”, I would argue that this edge case is more likely than a simple instance failure. The nat_monitor.sh script that Jinesh uses has a couple of problems. First, the script in its current state is syntactically wrong. The script makes an AWS CLI call to describe the status of the other NAT instance in the other Availability Zone, and that output is then piped to awk to extract just the value of the status (“running”, for example). The comments in the script even mention that you may need to modify this line. Never ignore comments in code; in this case, they are right. You do need to change the print value, and I believe on the current AWS CLI it’s actually print $6 that you want.
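If you would rather not count columns at all, one way to sidestep the fragile awk field entirely is to ask the AWS CLI for the state directly. This is a hedged sketch, not the script’s original line, and the instance ID is a placeholder:

    # Returns e.g. "running" or "stopped" for the other NAT instance
    NAT_STATE=$(aws ec2 describe-instances --instance-ids i-0abc1234def567890 \
        --query 'Reservations[0].Instances[0].State.Name' --output text)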

That said, the issue with this script, as Jinesh points out, is that the default values for health check monitoring may be too aggressive and generate false positive results. False positive is actually a nice way of saying HA NAT self-destruction. Once this not-so-edge case triggers, the first step of the nat_monitor.sh script is to send an AWS CLI call to stop the other NAT instance. Spoiler alert! Once this happens, both NAT instances are stopped and can’t continue on to step 2, which is to fail over the NAT route (and subsequently reboot the failed NAT instance so it will take back over its route). Highly Available becomes Highly Un-Available. No running NAT means no Internet access for private instances, and when monitoring starts failing and auto-healing kicks in, get ready for cascading instance failures, all because of a NAT health check failure. Now that pesky NAT false positive monitoring event has destroyed working instances, and someone is about to get in trouble.

If you were not running HA NAT with the nat_monitor.sh script, you would have a simple NAT instance in each Availability Zone (AZ). In the edge case that this basic Linux instance, with iptables and IP forwarding configured, should fail, you are only losing one AZ, and since you architected a highly available workload, losing one AZ does not mean you lost production service. Impaired, but available. You get your notification that NAT in AZ1 has failed; you reboot it and get on with life. It almost seems like a more available approach than running the HA nat_monitor.sh script.
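For reference, the NAT function on that basic Linux instance amounts to very little. A minimal sketch, assuming eth0 is the public-facing interface and 10.0.0.0/16 is your VPC CIDR (both placeholders):

    # Enable IP forwarding and masquerade traffic from the VPC out to the Internet
    echo 1 > /proc/sys/net/ipv4/ip_forward
    iptables -t nat -A POSTROUTING -o eth0 -s 10.0.0.0/16 -j MASQUERADE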

OK, so you still must have NAT monitoring and route takeover; what then? The script mentioned earlier is great (if you fixed the syntax), but it needs some parameter changes from the defaults. To find the right values, you need to determine health check settings that will persist across AWS network events you have no visibility into (also known as “why did my health check fail when the instances are fine?”). For your respective region, and with your respective Availability Zones, start logging pings and build the right health check values. Do not accept default values as a known quantity. Test for failure or be prepared for the consequences.
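A hedged sketch of what that baseline-building looks like in practice, assuming 10.0.2.10 is the other NAT instance’s private IP (a placeholder) and that your copy of the script uses the same tunable names as the published version:

    # Log cross-AZ ping behavior for a few days to see what "normal" looks like
    ping -c 5 -W 2 10.0.2.10 >> /var/log/nat_ping_baseline.log 2>&1

    # Then relax the monitor's defaults; these values are illustrative, not recommendations
    Num_Pings=10
    Ping_Timeout=3
    Wait_Between_Pings=5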

I don’t have time to nurture this relationship…

The other approach to HA NAT commonly used is to have an auto-scaling group of one NAT instance per AZ. A health check is used to actually auto-heal NAT should it really have a system failure, or your non-reserved instance was taken away from you, or you ignored your events page when AWS scheduled your NAT instance for restart. This approach may be better than the nat_monitor.sh script, as false positives are not an issue (at least not one resulting in AZ-to-AZ network failures). You can bake the AMI with NAT configured, or just do it on the fly with user data (which you might as well do, so you can keep the OS up to date with security patches on boot). Lastly, this instance will need to find the private subnet route table for its AZ and take over that route when it boots up (a new instance means a new instance ID, which means the previous route will be black-holed until this script runs).
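A minimal user-data sketch of that boot-time takeover, assuming the instance has an IAM role allowing ec2:ModifyInstanceAttribute and ec2:ReplaceRoute; the region and route table ID are placeholders:

    #!/bin/bash
    export AWS_DEFAULT_REGION=us-east-1
    ROUTE_TABLE_ID=rtb-22222222
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

    # Configure the NAT function itself
    echo 1 > /proc/sys/net/ipv4/ip_forward
    iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

    # Allow forwarding, then take over the private subnet's default route
    aws ec2 modify-instance-attribute --instance-id "$INSTANCE_ID" --no-source-dest-check
    aws ec2 replace-route --route-table-id "$ROUTE_TABLE_ID" \
        --destination-cidr-block 0.0.0.0/0 --instance-id "$INSTANCE_ID"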

This approach, however, does not solve the same cascading auto-healing failure scenario outlined previously. If your monitoring and auto-healing of workloads behind NAT is more aggressive than the auto-healing of NAT itself (and its route takeover), you can still find yourself with a single AZ of failing instances trying to repair themselves without Internet access. Having a very clear understanding of recovery time is crucial in defining a holistic policy on monitoring-driven auto-repair (much like auto scaling in general).

architect for your workload and its complexities, but keep it simple 

What if the magic is dead?

If you think you need to scale NAT for performance for elastic private workloads, there is a more applicable approach than those mentioned previously: a Squid proxy. Again, Jinesh has a great write-up on creating a Squid proxy farm that puts an internal Elastic Load Balancer in front of an auto-scaling (based on network in) layer of Squid proxy instances. This provides high availability, automatic scaling for performance, and durability against a single Squid instance failure across Availability Zones. Like any high-performance, highly available solution, it comes at a cost. Not only are you running a minimum of two instances (like you would with a NAT per AZ), you also have the Elastic Load Balancer (not a significant cost) and, most importantly, you are now scaling automatically. It is critical in any automatic scaling scenario that you understand your workload and how it is leveraging Internet resources (in this case) so you are not surprised by the fees associated with your solution. Not to mention we have now added a proxy service on top of our normal Linux instance. Not overly complicated, but more layers involved in the solution means more to manage.
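To show what “auto scaling based on network in” can look like, here is a hedged sketch with the AWS CLI; the group name, threshold, and alarm action are placeholders, and the real write-up wires the details up in its own way:

    # Scale-out policy for the Squid proxy auto-scaling group
    aws autoscaling put-scaling-policy --auto-scaling-group-name squid-proxy-asg \
        --policy-name squid-scale-out --scaling-adjustment 1 --adjustment-type ChangeInCapacity

    # CloudWatch alarm on NetworkIn that fires the policy (ARN comes from the command above)
    aws cloudwatch put-metric-alarm --alarm-name squid-networkin-high \
        --namespace AWS/EC2 --metric-name NetworkIn --statistic Average \
        --period 300 --evaluation-periods 2 --threshold 50000000 \
        --comparison-operator GreaterThanThreshold \
        --dimensions Name=AutoScalingGroupName,Value=squid-proxy-asg \
        --alarm-actions <policy-arn-from-previous-command>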

test your solution and have a remediation plan ready

Not everyone is looking for the same thing…

In summary, if your workload is already highly available across Availability Zones and you have placed heavily Internet-biased resources in public subnets, a simple approach to NAT is likely the best setup. Make sure you are monitoring NAT instance status and have a plan to remediate a NAT instance failure. Otherwise, let NAT do its job without failover in place and rely on the high availability of the workload, not NAT.

If you are required to have highly available NAT through health checks, ensure you have adjusted the monitoring and tested everything thoroughly. Alternatively, and as an improvement, use a third monitoring and control instance that determines NAT health and acts independently of NAT itself. Of course, don’t forget to monitor the health of the monitor!
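A hedged sketch of what that independent watchdog might look like; all IPs and IDs are placeholders, and the key design choice is that it only repoints routes and never stops a NAT instance, so it can’t self-destruct the pair:

    #!/bin/bash
    # Runs on a separate monitor instance, not on either NAT instance
    while true; do
        if ! ping -c 3 -W 2 10.0.1.10 > /dev/null 2>&1; then
            # NAT in AZ1 looks down: point its private route table at the NAT in AZ2
            aws ec2 replace-route --route-table-id rtb-11111111 \
                --destination-cidr-block 0.0.0.0/0 --instance-id i-0bbbbbbbbbbbbbbbb
        fi
        sleep 30
    done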

Lastly, if you are required to keep a heavily Internet-biased workload truly private, and you want the best performance for scaling the NAT function, use a Squid proxy layer with the right NetworkIn threshold in place for auto scaling based on your workload. And don’t forget to test, test and test again.

Need help?

Although most NAT solutions are well documented, simply following the tutorial does not always ensure a highly available, production-ready setup. You need to test for failure. Foghorn is here to help you with your cloud initiative, so don’t hesitate to give us a call!