Automating ECS

aws iac terraform

Clone the project repo to follow along…

So far in this series we’ve learned the fundamentals of Amazon’s Elastic Container Service, containerized a simple Node.js application, and deployed it to the cloud. In the final article of this series, we’ll eliminate the toil of building and maintaining ECS infrastructure by automating everything we’ve learned using Terraform.

Container Definition

Before diving into Terraform, the first thing we’ll need is a “container definition” to feed the aws_ecs_task_definition resource. The good news is we’ve already built this while working through the manual configuration of our task.

By simply trimming off everything in the task definition except the containerDefinitions list, you’ll have all you need. The initial slog of figuring out the task definition JSON is paying dividends! Since this just reuses a portion of an existing file, and the point of this article is the actual Terraform, I’ll simply link to the example in the project repo. Since the first article, I’ve done a bit of cleanup – removing unnecessary values (nulls, empty lists) and templating a few more things to make reuse easier.
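
For orientation, here’s a rough sketch of the shape that file takes. The field names follow the standard ECS container definition schema, and the ${...} placeholders line up with the templatefile call we’ll write shortly, but the environment variable and secret names here are illustrative and the parameter path is deliberately elided – treat the repo’s file as the source of truth:

[
  {
    "name": "${name}",
    "image": "${image}",
    "cpu": ${cpu},
    "memory": ${memory_limit},
    "memoryReservation": ${memory_reserve},
    "essential": true,
    "portMappings": [
      { "containerPort": ${port}, "protocol": "tcp" }
    ],
    "environment": [
      { "name": "ENVIRONMENT", "value": "${environment}" }
    ],
    "secrets": [
      { "name": "SECRET", "valueFrom": "arn:aws:ssm:${region}:YOUR_ACCOUNT_ID:parameter/..." }
    ],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/${name}",
        "awslogs-region": "${region}",
        "awslogs-stream-prefix": "ecs"
      }
    }
  }
]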

NOTE: To work through the entire demo deployment, you will need to modify one line in the container definition – the path to the Parameter Store secret. You will also need to create the secret itself to test retrieval: in your AWS account, go to Services > Systems Manager > Parameter Store > Create parameter. For anything sensitive, always use SecureString. If you use the same path and name, you only need to insert your AWS Account ID in the container definition; otherwise, adjust as needed.
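
If you prefer the CLI to the console, something like this creates an equivalent SecureString parameter – the path here is a placeholder, so match whatever path your container definition references:

❯ aws ssm put-parameter \
    --name "/hello-world-production/SECRET" \
    --type SecureString \
    --value "HELLO FROM PARAMETER STORE!"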

Terraforming ECS

To make this simpler for anyone to test drive, we’ll use the default VPC and subnets that come with new AWS accounts. If you’ve deleted those, you can use Terraform’s AWS VPC and subnet resources (or a module of your choice) to create space for our example.

A past series went into a lot of detail on creating network resources from scratch. Rather than rehash that, I want to cover another common scenario – using data sources to discover existing network infrastructure. Beyond the account defaults used here, you will often have shared VPCs, subnets, NAT gateways, and similar resources you can consume rather than re-creating them for each service.

data "aws_vpc" "selected" {
  default = true
}

data "aws_subnet_ids" "private" {
  vpc_id = data.aws_vpc.selected.id
}

The aws_subnet_ids data source gives us a list of subnet IDs matching the specified criteria, which we can use elsewhere in our configuration. We’ll use the private subnets to house our ECS tasks. Here, filtering on vpc_id alone gets the job done, but a common practice is to select resources by tag so the intent stays intuitive.
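
The service configuration below references local.private_subnets and local.tags. The repo wires those up roughly like this – the exact tag keys are illustrative:

locals {
  # Subnet IDs discovered via the data source above
  private_subnets = data.aws_subnet_ids.private.ids

  # Common tags applied to every resource (illustrative keys)
  tags = {
    app         = var.app_name
    environment = var.environment
  }
}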

Before we tackle ECS itself, we need to address IAM. When deploying manually, we leveraged the default ecsTaskExecutionRole and fixed it up to allow access to Parameter Store and Secrets Manager. At the time it was easy to (ab)use, but we called out the best practice of using service-specific roles. As part of our automation, let’s have Terraform manage any roles and policies for us:

resource "aws_iam_role" "app" {
  name                  = var.role_name
  description           = "ECS Task Execution Role for ${var.app_name}"
  force_detach_policies = true

  assume_role_policy = <<EOF
{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

  tags = local.tags
}

resource "aws_iam_policy" "parameter_store_ro" {
  name_prefix = "ParameterStoreRO"
  description = "Grants RO access to Parameter Store"

  policy = <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:GetParameters",
                "kms:Decrypt"
            ],
            "Resource": [
                "arn:aws:ssm:*:*:parameter/*",
                "arn:aws:kms:*:*:key/*"
            ]
        }
    ]
}
EOF
}

resource "aws_iam_role_policy_attachment" "attach_parameter_store_policy" {
  role       = aws_iam_role.app.name
  policy_arn = aws_iam_policy.parameter_store_ro.arn
}

resource "aws_iam_role_policy_attachment" "attach_aws_managed_policy" {
  role       = aws_iam_role.app.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

Service-specific roles allow more granular access control

While this is still fairly wide open (we could further limit Parameter Store access to specific paths, as sketched below), it gives you a starter recipe for automating a fully functional service. Refer to the policy examples we ran through previously if you need to grant Secrets Manager access instead.
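
For example – purely hypothetical, not in the repo as-is – a scoped variant could use the caller’s account ID to restrict access to this app’s parameter path (the KMS statement could be similarly narrowed to a specific key):

# Hypothetical tightening: RO access limited to this app's parameter path
data "aws_caller_identity" "current" {}

resource "aws_iam_policy" "parameter_store_ro_scoped" {
  name_prefix = "ParameterStoreRO"
  description = "Grants RO access to this app's Parameter Store path"

  policy = <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:GetParameters",
                "kms:Decrypt"
            ],
            "Resource": [
                "arn:aws:ssm:${var.region}:${data.aws_caller_identity.current.account_id}:parameter/${var.app_name}-${var.environment}/*",
                "arn:aws:kms:*:*:key/*"
            ]
        }
    ]
}
EOF
}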

With network details gathered and IAM squared away, we’re ready to take care of ECS. As you’ll recall from previous articles, we need to create an ECR repository our ECS tasks can access. We’ll also attach a lifecycle policy to our repository to avoid old images building up and wasting space.

resource "aws_ecr_repository" "app" {
  name                 = "${var.app_name}-${var.environment}"
  image_tag_mutability = "MUTABLE"

  image_scanning_configuration {
    scan_on_push = true
  }

  tags = local.tags
}

resource "aws_ecr_lifecycle_policy" "app" {
  repository = aws_ecr_repository.app.name

  policy = <<EOF
{
    "rules": [
        {
            "rulePriority": 1,
            "description": "Expire untagged images older than a week",
            "selection": {
                "tagStatus": "untagged",
                "countType": "sinceImagePushed",
                "countUnit": "days",
                "countNumber": 7
            },
            "action": {
                "type": "expire"
            }
        }
    ]
}
EOF
}

Since we are using Container Insights and the awslogs driver, we had to create the CloudWatch Log Group ourselves when we manually created the ECS service – otherwise the service wouldn’t start. Now we can let Terraform manage that for us.

To make the ECS-specific bits more modular, a number of variables are used. These are referenced directly by our Terraform resources and exposed within the container definition via templatefile. Aside from the service name, region, and port, the ECS CPU units, soft memory reservation, and hard memory limit are configurable. This is tunable enough for most services without overwhelming the operator with excess detail. Finding the right balance reduces cognitive load for others using your automation.
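
For reference, the task-sizing variables are declared roughly like this (abridged – see the repo for the full set):

variable "container_port" {
  description = "Port the container listens on"
  type        = number
}

variable "task_cpu" {
  description = "Fargate CPU units for the task (256 = 0.25 vCPU)"
  type        = number
  default     = 256
}

variable "task_memory_limit" {
  description = "Hard memory limit for the task, in MiB"
  type        = number
  default     = 512
}

variable "task_memory_reserve" {
  description = "Soft memory reservation for the container, in MiB"
  type        = number
  default     = 256
}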

We leverage a lot of defaults in the service configuration, but we do pull in the subnets discovered above and expose instance details. For our simple case we’ll run a single task instance, so we use instance_percent_min = 0 and instance_percent_max = 100. In the real world we could increase instance_count and adjust the percentages as needed, allowing rolling updates to avoid downtime.

resource "aws_cloudwatch_log_group" "app" {
  name              = "/ecs/${var.app_name}-${var.environment}"
  retention_in_days = 7
  tags              = local.tags
}

resource "aws_ecs_cluster" "app" {
  name = "${var.app_name}-${var.environment}"

  tags = local.tags

  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}

resource "aws_ecs_task_definition" "app" {
  family = "${var.app_name}-${var.environment}"
  container_definitions = templatefile("container-definition.json", {
    name           = "${var.app_name}-${var.environment}"
    environment    = var.environment
    image          = "${aws_ecr_repository.app.repository_url}:latest"
    region         = var.region
    port           = var.container_port
    cpu            = var.task_cpu
    memory_limit   = var.task_memory_limit
    memory_reserve = var.task_memory_reserve
  })
  task_role_arn      = aws_iam_role.app.arn
  execution_role_arn = aws_iam_role.app.arn
  network_mode       = "awsvpc"
  # The FARGATE launch type used by the service below requires this
  requires_compatibilities = ["FARGATE"]
  cpu                = var.task_cpu
  memory             = var.task_memory_limit

  depends_on = [aws_cloudwatch_log_group.app]

  tags = local.tags
}

resource "aws_ecs_service" "app" {
  name                    = "${var.app_name}-${var.environment}"
  cluster                 = aws_ecs_cluster.app.arn
  task_definition         = aws_ecs_task_definition.app.arn
  enable_ecs_managed_tags = true
  propagate_tags          = "SERVICE"
  launch_type             = "FARGATE"
  scheduling_strategy     = "REPLICA"

  desired_count                      = var.instance_count
  deployment_maximum_percent         = var.instance_percent_max
  deployment_minimum_healthy_percent = var.instance_percent_min

  network_configuration {
    subnets          = local.private_subnets
    security_groups  = [aws_security_group.app.id]
    assign_public_ip = true
  }

  lifecycle {
    ignore_changes = [desired_count]
  }

  depends_on = [aws_iam_role.app]

  tags = local.tags
}

Refer to the project repo for the fully functional Terraform… Just adjust tfvars as needed, then you can run some tests in your account (refer to the README for specifics on configuring Terraform for use with AWS)!

❯ cat hello-world.tfvars 
role_name            = "helloWorldTaskExecutionRole"
region               = "us-east-2"
environment          = "production"
app_name             = "hello-world"
container_port       = 8080
task_cpu             = 256
task_memory_limit    = 512
task_memory_reserve  = 256
instance_count       = 1
instance_percent_min = 0
instance_percent_max = 100

❯ terraform init

❯ terraform plan -var-file=hello-world.tfvars -out=plan
# ...

❯ terraform apply plan                                 
# ...
Apply complete! Resources: 13 added, 0 changed, 0 destroyed.

The state of your infrastructure has been saved to the path
below. This state is required to modify and destroy your
infrastructure, so keep it safe. To inspect the complete state
use the `terraform show` command.

State path: terraform.tfstate

Outputs:

ecr_repo = 012345678901.dkr.ecr.us-east-2.amazonaws.com/hello-world-production
iam_role_arn = arn:aws:iam::012345678901:role/helloWorldTaskExecutionRole
iam_role_name = helloWorldTaskExecutionRole

If we curl the public IP of the deployed task on our container port, we can see the secret retrieval working (via our IAM role) as it exposes a value from Parameter Store – this is obviously only for the sake of example. Never expose secrets, including in logs!

❯ http 3.21.52.50:8080
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 47
Content-Type: text/html; charset=utf-8
Date: Fri, 01 May 2020 02:49:28 GMT
ETag: W/"2f-ZLQ5cKIxGhGpXgCP1cNhiPt9ZB0"
X-Powered-By: Express

Top secret message: HELLO FROM PARAMETER STORE!

Final Thoughts

Even counting variable definitions and outputs, we’ve managed to automate away the toil of manually managing ECS-based services in just a few hundred lines. Rather than simply recreating the exact service we deployed by clicking through the AWS console, we iterated and improved security through a service-specific IAM role and used lifecycle management to further reduce toil. Beyond the initial build, this gives us a framework we can use to continue extending our service, ensures consistency as we go, enables reuse when building similar services, and acts as documentation for ourselves or future team members – all the advantages of Infrastructure as Code.

That continues in the spirit of our minimally viable example service… In the real world you would likely have additional network configuration (perhaps an ALB fronting several tasks), more containers to manage (additional services, sidecars for monitoring or security), backing services to prepare, etc. You can keep adding these things yourself, but as the complexity grows you’ll want to consider modules. Whether you consume modules from the Terraform Registry, GitHub authors you trust, or create your own… they’ll let you avoid copying and pasting code, further ensure consistency, make reuse even easier, and allow you to build increased confidence in shared components.

Hopefully this is enough to get you started toward the nirvana of running containerized services on AWS ECS. Terraform makes the initial infrastructure build and maintenance a breeze. Once your MVP is live, you can continue shipping updates with just a few commands… It’s just a matter of building a new image with your code, pushing it to ECR, and updating the ECS service to pull in the latest change. That’s too much to cram in here, but for an example of how it could work, refer to the do script.
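
Roughly – assuming the names from the tfvars above, and doing by hand what the script automates – that loop looks like:

❯ docker build -t hello-world-production .
❯ aws ecr get-login-password --region us-east-2 | \
    docker login --username AWS --password-stdin 012345678901.dkr.ecr.us-east-2.amazonaws.com
❯ docker tag hello-world-production:latest \
    012345678901.dkr.ecr.us-east-2.amazonaws.com/hello-world-production:latest
❯ docker push 012345678901.dkr.ecr.us-east-2.amazonaws.com/hello-world-production:latest
❯ aws ecs update-service --cluster hello-world-production \
    --service hello-world-production --force-new-deployment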

See you next time!

References

This is the final part of a multi-part series; jump to part one:

Thinking Inside the Box