Terraforming AWS: Part II

aws iac terraform

Clone the companion project to follow along…

In Part I of this series [/blog/posts/terraforming-aws-part-i], we worked through a lot of AWS fundamentals and bootstrapped a network, including a custom VPC, public and private subnets, and route tables. Thanks to that foundation, this and subsequent parts can focus on more condensed code and make quicker progress.

In this tutorial, we can move on to something more exciting: EC2 (Elastic Compute Cloud) and ALB (Application Load Balancer). If these are new concepts, don’t worry! We’ll give a quick intro before we dive into provisioning resources with Terraform. First, let’s recall the diagram of our simple project:

Our Goal

Concepts

Our goal is to create a couple of load balanced Linux servers. Since EC2 is AWS’ “VM as a Service” offering, a natural starting point might seem to be creating EC2 instances directly. Terraform does make that very easy (I won’t repeat steps such as provider configuration covered in Part I; refer to the sample project as needed):

resource "aws_instance" "example" {
  ami           = var.aws_ami
  instance_type = "t2.micro"
}

With just a few lines, we could spin up a VM… then simply use count or copy/paste to add as many instances as needed. As you can imagine, there are a few challenges with this approach. In this simple example, we’ve hard-coded the instance type, which reduces flexibility. We pass in an AMI (Amazon Machine Image, how EC2 instances consume operating systems) as a variable, but this would be brittle because AMI IDs vary across regions. Lastly, even though the iterative approach with count would be DRYer, it would still be static configuration requiring hand edits to scale beyond initial capacity, and it does nothing to address high availability (HA).
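For illustration only (var.web_count is a hypothetical variable, not part of the companion project), the count-based version of that approach would look something like this:

resource "aws_instance" "example" {
  # Hypothetical static scaling: changing capacity means editing web_count by hand.
  count         = var.web_count
  ami           = var.aws_ami   # AMI IDs differ per region, so this stays brittle
  instance_type = "t2.micro"    # still hard-coded
}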

To solve those challenges, we need to incorporate auto scaling and load balancing. These concepts let us grow and shrink our pool of EC2 instances and distribute traffic amongst them. Historically, load balancing consisted solely of ELB (Elastic Load Balancing), but it has evolved into NLB (Network Load Balancer) and ALB (Application Load Balancer). The original incarnation is lesser used, and often referred to as Classic Load Balancing. For newer environments, you will want ALB (Layer 7) or NLB (Layer 4), since they offer more features and better performance. Refer to the documentation for a full comparison.

With EC2 instances, AMIs and an ALB, we are almost ready to begin spinning up our simple web cluster… The one other primitive we need to understand is the Security Group. Security groups act as virtual firewalls (think iptables rules); they live within a VPC and are attached to resources such as EC2 instances and load balancers. Mentioned briefly in Part I but repeated here since it is a common source of confusion: by default AWS requires you to explicitly allow ingress traffic but allows all egress. Terraform, however, takes a more secure stance and removes the default egress rule. This means you need to explicitly configure both ingress and egress rules for any security groups managed by Terraform.

Putting it Together

To get our instances up and running, we need to select an AMI for them to run. In production, this could be a custom image built with HashiCorp’s Packer, but AWS provides a number of pre-baked images for common distributions. The problem is, these images can vary across regions. If we want to use a public image while ensuring it remains up to date and works consistently, we can leverage a data source:

data "aws_ami" "ubuntu" {
  most_recent = true

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  # Canonical
  owners = ["099720109477"]
}

This is a powerful concept… You can tweak or expand the filters as needed to get the desired result. For our simple case, this selects the latest Ubuntu 18.04 LTS AMI published by Canonical.
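If you want to see exactly which image the data source resolved to, one option (a sketch, not something the companion project requires) is to surface its attributes as outputs:

output "ubuntu_ami_id" {
  description = "ID of the most recent matching Ubuntu AMI"
  value       = data.aws_ami.ubuntu.id
}

output "ubuntu_ami_name" {
  description = "Full name of the selected AMI"
  value       = data.aws_ami.ubuntu.name
}

After an apply, terraform output will show which AMI was picked in your region.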

We said we were going to avoid provisioning explicit EC2 instances and let AWS handle that for us via auto scaling. As we saw with networking, building larger things with AWS involves bolting many smaller pieces together. The first piece we need is a Launch Configuration. In Terraform, aws_launch_configuration is configured very similarly to aws_instance, since it is effectively a template for the EC2 instances that will be managed for us. Let’s see an example:

resource "aws_security_group" "http_ingress_instance" {
  vpc_id = aws_vpc.vpc.id

  ingress {
    from_port   = var.web_port
    to_port     = var.web_port
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    "Name" = "${var.env_name}-http-ingress-instance-sg"
  }
}

resource "aws_launch_configuration" "lc" {
  # avoid static name so resource can be updated
  name_prefix     = "${var.env_name}-"
  image_id        = data.aws_ami.ubuntu.id
  instance_type   = var.web_instance_type
  security_groups = [aws_security_group.http_ingress_instance.id]
  user_data = templatefile("userdata.sh", {
    web_port    = var.web_port,
    web_message = var.web_message,
    db_endpoint = aws_db_instance.rds.endpoint,
    db_name     = aws_db_instance.rds.name,
    db_username = aws_db_instance.rds.username,
    db_status   = aws_db_instance.rds.status
  })
}

The launch configuration will attach the specified security groups to instances it provisions to control access, so we create http_ingress_instance and associate it with our VPC (without specifying vpc_id, it would be associated with the default VPC in your account). The security group simply allows traffic on var.web_port (defined in variables.tf). The egress rule with from_port and to_port set to 0 and protocol = "-1" is an idiom for allowing all ports and protocols (there is no "any" keyword in Terraform).
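For reference, the variables used here live in variables.tf. The exact definitions are in the companion project; a minimal sketch (the defaults below are illustrative assumptions) might look like:

variable "env_name" {
  description = "Prefix used when naming and tagging resources"
  type        = string
}

variable "web_port" {
  description = "Port the web servers listen on"
  type        = number
  default     = 8080
}

variable "web_instance_type" {
  description = "EC2 instance type used by the launch configuration"
  type        = string
  default     = "t2.micro"
}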

Important to note: we avoid using name for aws_launch_configuration, because our goal is high availability. Launch configurations are immutable, so any change forces Terraform to replace the resource; with a static name, the replacement would collide with the existing resource and the old one would have to be destroyed first, risking an outage. Using name_prefix, combined with the lifecycle block we’ll see later, ensures Terraform spins up new resources and migrates traffic to them before tearing down the existing ones.

image_id points to our AMI, conveniently provided by our data source. Within the OS image, a User Data script is used for configuration. You could instead bake configuration into a custom image, but user data is handy for small tweaks or for configuring public images. The templatefile function substitutes the values we provide via variables into template strings within our script:

#!/bin/bash

cat >index.html <<EOF
<html>
<head>
  <title>Welcome Page</title>
</head>
<body>
  <h1>${web_message}</h1>
  <ul>
    <li><b>RDS endpoint:</b> <pre>${db_endpoint}</pre></li>
    <li><b>Database name:</b> <pre>${db_name}</pre></li>
    <li><b>Database user:</b> <pre>${db_username}</pre></li>
    <li><b>Database password:</b> <pre>Yeah right! :-)</pre></li>
    <li><b>Database status:</b> <pre>${db_status}</pre></li>
  </ul>
</body>
</html>
EOF

nohup busybox httpd -f -p ${web_port} &

To avoid needing to access the Internet and install dependencies in our starter project (which would necessitate route table associations for our private subnets), we simply fire up BusyBox to serve our content. We’ll see where the RDS details come from in a later article.

Now that we have a working launch config, we can build an Auto Scaling Group (ASG) to leverage it. This is where the real magic begins, letting us specify minimum and maximum instance counts to control our dynamic cluster. We also need to link the ASG to the private subnets we created in Part I, letting it know where to create resources (note the * or “splat” syntax since we created several subnets spanning all AZs in our region):

resource "aws_autoscaling_group" "asg" {
  # Avoid static name so resource can be updated.
  name_prefix               = "${var.env_name}-asg-"
  min_size                  = var.web_count_min
  max_size                  = var.web_count_max
  desired_capacity          = var.web_count_min
  default_cooldown          = 60
  health_check_grace_period = 120

  launch_configuration = aws_launch_configuration.lc.name
  vpc_zone_identifier  = aws_subnet.private_subnets[*].id

  target_group_arns     = [aws_lb_target_group.tg.arn]
  health_check_type     = "ELB"
  wait_for_elb_capacity = 1

  lifecycle {
    create_before_destroy = true
  }

  tag {
    key                 = "Name"
    value               = "${var.env_name}-instance"
    propagate_at_launch = true
  }
}

Be careful with default_cooldown and health_check_grace_period: an auto scaling cooldown that is too short can make you scale too early, while a health check grace period that is too short may cause instances to be replaced while they are still bootstrapping. Either of these will increase your costs!

We mentioned lifecycle above, a meta block which ensures all instances are not torn down at once during updates. Target Groups are a new concept we still need to configure, but as the name implies, a target group simply specifies a set of targets to handle traffic. You can use the default EC2 instance health checks to decide whether to route traffic, but leveraging the ELB’s higher-level checks is more flexible and provides a better user experience. We also ensure at least one instance in the target group is passing health checks before routing traffic (wait_for_elb_capacity = 1). Let’s see the target group:

resource "aws_lb_target_group" "tg" {
  vpc_id   = aws_vpc.vpc.id
  port     = var.web_port
  protocol = "HTTP"

  health_check {
    path    = "/"
    port    = var.web_port
    matcher = "200"
  }

  lifecycle {
    create_before_destroy = true
  }

  tags = {
    "Name" = "${var.env_name}-web-lb-tg"
  }
}

Here you see the potential of custom health checks! We can tune additional settings, adjust paths and ports as needed, look for additional status codes (e.g. matcher = "200,201"), and customize thresholds (by default, three failed probes mark a target unhealthy, and three successful probes are required to bring it back into service). In our simple case (and for stateless cloud-native applications), we don’t need to worry about sticky sessions. If you do, refer to the documentation for details on the stickiness block, sketched below.
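As a sketch (the values here are illustrative, not the companion project’s settings), a more customized health check and sticky sessions would look like this inside the aws_lb_target_group resource:

  # Inside resource "aws_lb_target_group" "tg" { ... }
  health_check {
    path                = "/"
    port                = var.web_port
    matcher             = "200,201"
    healthy_threshold   = 2    # probes required to bring a target into service
    unhealthy_threshold = 5    # probes required to mark a target unhealthy
    interval            = 15   # seconds between probes
    timeout             = 5    # seconds before a probe is considered failed
  }

  stickiness {
    type            = "lb_cookie"
    cookie_duration = 3600     # seconds a client stays pinned to the same target
  }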

Now we just need to configure the ALB itself. Defining the ALB resource is simple, but we also need a listener and another security group. The security group we defined earlier is attached directly to instances; this one will allow traffic into the ALB itself. Since ALBs are often exposed to the public Internet, they do not pass any traffic unless explicitly allowed to do so (secure by default).

resource "aws_lb" "alb" {
  load_balancer_type = "application"
  internal           = false
  security_groups    = [aws_security_group.http_ingress_lb.id]
  subnets            = aws_subnet.public_subnets[*].id

  tags = {
    "Name" = "${var.env_name}-web-lb"
  }
}

resource "aws_lb_listener" "listener" {
  load_balancer_arn = aws_lb.alb.arn
  port              = 80

  default_action {
    target_group_arn = aws_lb_target_group.tg.arn
    type             = "forward"
  }
}

resource "aws_security_group" "http_ingress_lb" {
  name   = "${var.env_name}-http-ingress-lb-sg"
  vpc_id = aws_vpc.vpc.id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Using the now-familiar splat syntax, we associate our ALB with our public subnets. The listener configuration exposes port 80 and uses a default action to forward traffic to our target group. This is another powerful construct, since you can specify multiple action blocks and add listener rules for great flexibility in traffic routing.
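To give a feel for that flexibility (this is purely a sketch and not part of the companion project), a listener rule could send a specific path pattern to a different target group:

resource "aws_lb_listener_rule" "api" {
  listener_arn = aws_lb_listener.listener.arn
  priority     = 100

  action {
    type             = "forward"
    # In a real setup this would reference a separate target group for the API tier.
    target_group_arn = aws_lb_target_group.tg.arn
  }

  condition {
    path_pattern {
      values = ["/api/*"]
    }
  }
}

Requests matching /api/* would be evaluated against this rule before falling through to the listener’s default action.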

Almost There

We are getting very close to the target environment in our diagram above… We now have a dynamic and scalable cluster of EC2 instances, leveraging AMIs in a portable way with custom user data, and high availability controlled via flexible health checks.
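To actually reach the cluster once terraform apply completes, the ALB’s DNS name is the entry point. The companion project may expose it already; if not, an output along these lines does the trick:

output "alb_dns_name" {
  description = "Public DNS name of the application load balancer"
  value       = aws_lb.alb.dns_name
}

Once everything is applied (including the RDS pieces we cover next time), browsing to that hostname, or curling it, should return the welcome page served by BusyBox.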

Next time we’ll jump into DBaaS with AWS’ Relational Database Service. We’ll spin up a lightweight multi-AZ instance fitting into the free tier, but also discuss considerations for hosting production databases. Be sure to keep an eye out for Part III, where we will wrap up our simple project with a fully functional website backed by MySQL.

Thanks for reading!