AWS and DNS and TLS, Oh My!


Clone the companion project to follow along…

In a past series we used Terraform to build a simple LAMP stack (actually an N-tier, Linux-based, MySQL-backed web stack atop AWS… there is technically no Apache or Perl/PHP/Python if we’re being pedantic). We fixed some bugs along the way, and even extended the original design to provide more real-world networking for our private subnets.

In this article, I want to again (ab)use our simple N-tier project to explore DNS and TLS within the AWS ecosystem. Of course this is a DevOps-heavy blog, so we’ll continue the automation obsession and implement what we learn using Terraform. Combined with some refactoring that’s happened along the way, this will add the polish needed to turn our suspiciously lab-smelling proof of concept into automation potentially useful in the real world!

Concepts

Since we’ve already invested a lot of time plumbing the network, getting auto scaling and user data just right, figuring out how to spin up a database, and tying all those pieces together, here we can focus specifically on DNS and TLS. Each of those topics is often shrouded in mystery (perhaps because getting either one wrong can do very bad things to availability and security!) and could consume entire books in itself. To avoid writing a book as a blog post, I’m going to focus on AWS-specific aspects. Where helpful I’ll touch on generic concepts… but this won’t be a DNS or TLS tutorial since good resources for those already exist.

In a lot of my posts I present patterns that can be extended to meet your specific needs. Patterns are common across many industries, and often seen as a pairing of problem and solution. It’s important to remember that patterns are really more like tuples which include a third element: context! Absorbing patterns without awareness of their context leads to confusion.

Before exploring the pieces of the AWS LEGO set we need to manage DNS, I want to highlight the context… We’ll need DNS in at least two places: routing traffic to our web cluster with a friendly name, and supporting TLS certificate validation. I’ll assume you have an existing (registered) domain you control, and that it is not serving mission-critical traffic (so experimenting with it is relatively safe). In reality you might need to register a new domain, cut over an existing domain serving production traffic, etc. Each of these has unique considerations. Hopefully you can still extract useful concepts and automation snippets to help you on your journey, adjusting as needed to fit your context.

DNS

We’ve already touched on why we need DNS and our assumptions. Since we will be migrating an existing, unused domain, we follow the steps in this guide (at a high level). If you are moving a production domain, read this guide instead.

DNS on AWS Route53 leverages Hosted Zones. A Hosted Zone can be public or private. Public zones serve the Internet, while private zones are associated with VPCs (your private AWS networks). A Hosted Zone is simply a container holding DNS resources. While it looks similar, it’s not technically the same thing as a DNS domain; Hosted Zones are an AWS concept, not a DNS one.

For example, you can create a Hosted Zone without owning the associated domain. Nothing will happen until you delegate the DNS domain to the Delegation Set (a set of four AWS name servers associated with the Hosted Zone that will handle traffic for the DNS domain). You can also do interesting things with Hosted Zones, such as using Alias Records to map the zone apex directly to AWS resources. In pure DNS, by contrast, you cannot CNAME the apex. There is also a financial difference: Alias Records are free, CNAME queries are not!
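As an aside, private zones use the same resource with a vpc block. Here’s a minimal, hypothetical sketch (the domain and VPC reference are illustrative, not part of our project):

resource "aws_route53_zone" "private" {
  # Resolvable only from the associated VPC, never from the Internet
  name = "internal.example.com"

  vpc {
    vpc_id = aws_vpc.main.id # e.g. the VPC from your network module
  }
}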

This means getting a public domain working with AWS will require a number of steps, each of which we’ll walk through below:

1. Create a Hosted Zone (backed by a reusable Delegation Set) for your domain.
2. Update your registrar’s name server (NS) entries to point at the Delegation Set.
3. Create resource records within the zone (e.g. Alias Records pointing at AWS resources).

TLS

TLS is the modern, faster, more secure successor to SSL. If you are hosting content on the Internet (or anywhere, really), you probably have it already; if not, you need it. With all the great features comes added complexity: serving TLS-protected content requires managing certificates.

You have many options… You can purchase certificates from third-party Certificate Authorities (e.g. DigiCert or Thawte). You may integrate your service directly with modern alternatives such as Let’s Encrypt. Depending on your use case, you might choose to cut costs and self-sign certificates using a Private Certificate Authority (PCA). There are times when an internal CA makes sense, and AWS even has PCA support so you don’t have to carry the burden alone. If you are managing your own PCA, be sure to account for the hidden costs of the additional effort spent securing, signing, storing, revoking, etc. all of your certs.

Terraform can help in these cases, letting you inject certificate files directly into resources and configuration. Since you will have both private and public keys (certificates), you’ll also need to think about secret vaulting (using something like HashiCorp’s Vault) to keep private keys secure (obviously never committing them to version control).
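As a hedged sketch of that approach, importing an externally issued certificate into ACM looks something like this (the paths and variable are placeholders):

resource "aws_acm_certificate" "imported" {
  certificate_body  = file("${path.module}/certs/example.crt")
  certificate_chain = file("${path.module}/certs/chain.pem")
  private_key       = var.private_key_pem # sensitive! vault it, never commit it
}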

Considering all the options is one reason TLS can seem overwhelming. Luckily, AWS can simplify our lives and save us money at the same time. We’ll use AWS Certificate Manager (ACM) to manage our TLS certificates. This provides a self-service, performant, highly-secure (certainly better than we could roll ourselves) certificate service that is nicely integrated with key parts of the AWS ecosystem (read: our ALB). Unlike third-party CAs which exist to issue certs, AWS gives you certs for free since it makes money on the other resources used alongside them (ALBs, EC2 instances, etc).

To validate certificates (you wouldn’t want just anyone to be able to issue certs for your domain), ACM can use a DNS or email based workflow. DNS is highly preferred, since it can be fully automated and managed by Terraform.

Code

Now that we know our goal and what pieces we need to accomplish it, we can start translating requirements into code. First, here’s a simple picture of what we’re tying together:

Components and Traffic Flow

Looking at the boxes and solid lines, we point the DNS server (NS) entries for our domain to the Route53 Hosted Zone Delegation Set. This allows us to create resource records, which we can alias to AWS resources. Since we’re going to start leveraging TLS, our ALB now listens and serves traffic on 443/tcp. Internally, we still talk to our EC2 instances using 80/tcp (while we could easily encrypt this traffic as well, offloading TLS overhead to load balancers is a common approach and simplifies our deployment). ACM integrates with our Hosted Zone, publishing special DNS records which allow any certs issued for our domain to be validated.

From a user’s perspective, a request to our service still starts with a DNS lookup (hopefully cached), where our delegation directs resolvers to Route53’s name servers for resolution. In our case we’ll use a www record aliased to our ALB, which terminates the connection securely using the TLS certificate obtained from ACM.

Let’s build it! Creating a Hosted Zone is easy enough, but leveraging a few related resources should improve our quality of life:

resource "aws_route53_delegation_set" "main" {
  reference_name = var.env_name

  lifecycle {
    # Our registrar will point at these name servers; never destroy them
    prevent_destroy = true
  }
}

resource "aws_route53_zone" "public" {
  name              = var.web_domain
  delegation_set_id = aws_route53_delegation_set.main.id

  tags = {
    "Name" = "${var.env_name}-route53-hosted-zone"
  }
}

output "name_servers" {
  value = aws_route53_zone.public.name_servers
}

Aside from the aws_route53_zone itself (if you had an existing Hosted Zone, you could use the Data Source to get the Zone ID needed below), we also create an aws_route53_delegation_set. When you create a Hosted Zone, AWS assigns a random set of geographically diverse name servers (a DNS best practice). Since you must update your registrar with these values for Internet users to make it to your site (or for ACM to validate certs by conducting DNS lookups), you don’t want future updates to change these values and break the delegation. That’s why we use the special lifecycle block to ensure Terraform does not destroy the Delegation Set.
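For completeness, that Data Source lookup would be a short block along these lines (a sketch, assuming the zone already exists and is public):

data "aws_route53_zone" "existing" {
  name         = var.web_domain
  private_zone = false
}

# ...then reference data.aws_route53_zone.existing.zone_id wherever
# we use aws_route53_zone.public.zone_id below.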

Notice I said “should” improve our quality of life above… While this looks like an ideal representation of our desired state, it has at least one major problem. As you can read in this issue (please 👍 it!), prevent_destroy does not always work satisfactorily. Instead of simply declining to delete the protected resource, future destroys throw errors, so you can’t do useful things such as cleaning up your project without hand editing (which loses the desired benefit).

One workaround is using resource targeting, but that is a toilsome hack at best. A common solution is better segregation of managed resources. That is generally a good thing, but in this case results in another repo, Terraform module, state file, etc. that manages a single resource (might as well just use create-reusable-delegation-set!). Just be aware of this pitfall, and take heart knowing HashiCorp is working on a real solution. In the meantime, this might be a reason to think carefully about how much of your infrastructure you place under Terraform’s control.
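To make the segregation approach concrete, here’s a hedged sketch: the Delegation Set lives in its own tiny root module (with its own state), and the application stack consumes its ID without ever being able to destroy it (names here are hypothetical):

# delegation/main.tf -- a separate root module, applied once and rarely touched
resource "aws_route53_delegation_set" "main" {
  reference_name = "shared"
}

output "delegation_set_id" {
  value = aws_route53_delegation_set.main.id
}

# app stack: consume the ID via a variable (or terraform_remote_state),
# so destroying the app never touches the Delegation Set:
#
#   resource "aws_route53_zone" "public" {
#     name              = var.web_domain
#     delegation_set_id = var.delegation_set_id
#   }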

We also add an output so we can easily grab the Delegation Set’s name servers needed to update our registrar (just run terraform output name_servers after executing terraform apply). Since the process varies by registrar and often involves clicking through a UI, I won’t show that step here; simply take the provided list of name servers and update your domain registrar’s DNS servers accordingly.

As mentioned above, we’re going to use a Route53 Alias Record to point to our ALB. This is very similar to creating any other Route53 resource with Terraform, but since Alias Records always use a 60-second TTL, we must omit ttl. We also replace records with an alias block (you must have one or the other) referencing the desired resource, and take care of the zone apex since no one has time to type “www” in 2020:

locals {
  # We'll use this more below...
  fqdns   = concat([var.web_domain], var.alt_names)
}


resource "aws_route53_record" "web" {
  count   = length(local.fqdns)
  name    = element(local.fqdns, count.index)
  type    = "A"
  zone_id = aws_route53_zone.public.zone_id

  alias {
    name                   = aws_lb.alb.dns_name
    zone_id                = aws_lb.alb.zone_id
    evaluate_target_health = true
  }
}

evaluate_target_health is another advantage of Alias Records, since it will only route traffic to targeted resources if they are actually in a healthy state. Failing fast and shedding load before it consumes additional resources is a best practice (one of many useful patterns described in Michael Nygard’s Release It!).
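For reference, the target health that evaluate_target_health consults comes from the target group’s own health checks, something along these lines (a sketch; the companion project’s exact thresholds may differ):

resource "aws_lb_target_group" "tg" {
  name     = "${var.env_name}-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id # illustrative reference

  health_check {
    path                = "/"
    matcher             = "200"
    interval            = 30
    healthy_threshold   = 2 # consecutive successes before "healthy"
    unhealthy_threshold = 3 # consecutive failures before "unhealthy"
  }
}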

Now for the trickiest part… Requesting the certificate from ACM. The actual request is simple, but since we also want to leverage DNS-based validation we need to chain several Terraform resources together. Thankfully, the Terraform docs are great at covering this. In the docs they assume you only have a single hostname to contend with, so here’s a more real-world example:

resource "aws_acm_certificate" "cert" {
  domain_name               = var.web_domain
  subject_alternative_names = var.alt_names
  validation_method         = "DNS"

  tags = {
    "Name" = "${var.env_name}-acm-cert"
  }

  lifecycle {
    # Issue the replacement cert before destroying the one in use
    create_before_destroy = true
  }
}

resource "aws_route53_record" "cert_validation" {
  count   = length(local.fqdns)
  name    = element(aws_acm_certificate.cert.domain_validation_options[*].resource_record_name, count.index)
  type    = element(aws_acm_certificate.cert.domain_validation_options[*].resource_record_type, count.index)
  records = [element(aws_acm_certificate.cert.domain_validation_options[*].resource_record_value, count.index)]
  ttl     = 60
  zone_id = aws_route53_zone.public.zone_id
}

resource "aws_acm_certificate_validation" "cert" {
  certificate_arn         = aws_acm_certificate.cert.arn
  validation_record_fqdns = aws_route53_record.cert_validation[*].fqdn
}

domain_name is your Common Name (CN) for the cert, and you can include any number of SANs via the subject_alternative_names list. You can probably think of ways to improve on this, but avoiding hard-coded indexes into domain_validation_options ensures we can easily extend the alt_names list in configuration to include any number of SANs. Wildcards are supported as well. Note the use of local.fqdns to drive count; we can’t simply use domain_validation_options for that, since it’s only known at apply time.
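One forward-looking caveat: in AWS provider 3.x, domain_validation_options became a set of objects, and the count/element approach above stops working. The provider documentation recommends a for_each pattern instead, roughly like this:

resource "aws_route53_record" "cert_validation" {
  for_each = {
    for dvo in aws_acm_certificate.cert.domain_validation_options : dvo.domain_name => {
      name   = dvo.resource_record_name
      record = dvo.resource_record_value
      type   = dvo.resource_record_type
    }
  }

  allow_overwrite = true
  name            = each.value.name
  records         = [each.value.record]
  ttl             = 60
  type            = each.value.type
  zone_id         = aws_route53_zone.public.zone_id
}

resource "aws_acm_certificate_validation" "cert" {
  certificate_arn         = aws_acm_certificate.cert.arn
  validation_record_fqdns = [for record in aws_route53_record.cert_validation : record.fqdn]
}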

Almost there… We’ve got a zone, DNS records for our users, and a TLS certificate that has been validated. So far, this does nothing! We need to adjust our ALB listener to use the shiny new certificate. As part of that, we will update our port (since the TLS standard is 443/tcp vs 80/tcp which we had before) and most importantly protocol (we were leveraging the default HTTP but now want HTTPS). ALBs only support HTTP or HTTPS. If you need TCP or other protocols, time to refactor using a Network Load Balancer (this is also required if you want to associate static or elastic IP addresses).
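For contrast, a hypothetical NLB variant might look like the sketch below (names and references are illustrative; our project stays on the ALB):

resource "aws_lb" "nlb" {
  name               = "${var.env_name}-nlb"
  load_balancer_type = "network"
  subnets            = aws_subnet.public[*].id # illustrative reference
}

resource "aws_lb_listener" "tcp" {
  load_balancer_arn = aws_lb.nlb.arn
  port              = 443
  protocol          = "TCP" # raw passthrough; TLS would terminate on the instances

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.tcp_tg.arn # would also need protocol TCP
  }
}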

Let’s quickly update our aws_lb_listener:

resource "aws_lb_listener" "listener" {
  load_balancer_arn = aws_lb.alb.arn
  port              = var.lb_port
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-2016-08"
  certificate_arn   = aws_acm_certificate_validation.cert.certificate_arn

  default_action {
    target_group_arn = aws_lb_target_group.tg.arn
    type             = "forward"
  }
}

The one other thing to pay attention to here is ssl_policy, which is required when using HTTPS. These are predefined security policies provided by AWS. While 2016-08 sounds a bit dated, it was the latest policy available at the time of writing. For a full list and details on what each policy contains, refer to the official documentation.
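One common companion change (not shown above, so consider this a hedged sketch) is keeping a port 80 listener whose only job is redirecting plain HTTP to HTTPS, so legacy links and lazy typists still land on the secure site:

resource "aws_lb_listener" "redirect" {
  load_balancer_arn = aws_lb.alb.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "redirect"

    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301" # permanent redirect
    }
  }
}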

Our site is secure with an ACM TLS cert including SANs

As a final note, there is a bit of a chicken-and-egg problem when building from scratch… You need to run apply, watch the output until the Delegation Set is created, run to the Route53 console, click into your hosted zone, grab the name servers, update your registrar while the apply is still running (likely while waiting 10-15 minutes for RDS to provision), then wait on the delegation (usually quick, but it can take hours) so DNS resolution works and ACM can actually perform the requisite DNS queries to validate certs.

If you are too slow, or the delegation fails, the site will never come up, since the dependency graph looks something like ALB => listener => validated cert => DNS (no DNS, no cert, no listener!). For our simple project we’re trying to keep all the pieces organized in a single repo to make it easier to reason about. In the real world you would typically solve this by breaking out modules (and state) to manage your networks, DNS, etc. separately from the application stack.

Conclusion

By now we are starting to see a theme… For common use cases (and many uncommon!), AWS has an answer. By leveraging AWS services we can greatly reduce maintenance and overhead while increasing security and making it easier to automate provisioning. Despite a potential pitfall, Terraform also shines again… In ~50 lines we’ve added substantial functionality, easily managing DNS and TLS (both often considered opaque and persnickety) in a repeatable, auditable way.