
Deploy your Auto-Scaling Stack with AWS-CDK

How to deploy an auto-scaling group stack that scales according to the size of a SQS queue with the help of AWS-CDK

Guillaume Androz
The Startup
Published in
11 min read · May 6, 2020


AWS Cloud Development Kit

Context

Amazon Web Services (AWS) is a collection of dozens of services, from simple storage to machine learning pipelines. There are so many services that even a senior representative does not know them all. I must admit that the first time I had a look at AWS, I was completely lost, not knowing where to start.

Step 1

Of course, you need a project, something to build. It can be a web application, a serverless data pipeline automation, a machine learning inference API, a CI/CD integration so you can use large machines to execute your tests, and so on. In our case, we will build a scalable fleet of workers that have to process orders issued by a third-party service and stacked in a queue.

Step 2

This is perhaps the most difficult task: you need to design your project, propose an architecture, and explore the Amazon services to find which ones correspond best to your needs. Sometimes it is straightforward, but sometimes you could end up with several more or less complex architectures, until you find that there is a service that does almost everything you were trying to build manually. Of course, these highly abstracted services come at a cost, but using them can be more efficient than building your own.

Step 3

Finally, you have to experiment, build some prototypes, explore more services and iterate on step 2. But once you manage to build your super hyper stack, what's next?

The need for automation

I'm a big fan of automation. As an engineer, I don't like to perform the same action twice, especially if it is a low-value task, and even more so if it is done manually. So manually deploying a stack is not something I'm keen on.

But as I said in step 3, you need to experiment, and when it's time to play with AWS, experiments are usually done in the console. The AWS console is a really great tool. You can build everything you want, monitor your application, launch EC2 instances, store your data in an S3 bucket and so on, all with a few clicks. The tricky part, however, is that there are tons of parameters to set, tons of names to repeat, links to draw between your services. If your stack has few services, it can be manageable, but as soon as you want to plug in, say, four, five, twenty services, it turns into a nightmare.

The solution is found in automation

Fortunately, AWS has anticipated that people like me are lazy and take no pleasure in repeating the same parameters to build a new stack when the only thing we want is to change a single one. For that purpose, we have access to the CLI. With just one tool, you can write commands to build each of your services. The fun part is that you can now gather all the commands in a script (typically a bash script) and parameterize them, making it easier to perform multiple experiments.

While this is a really nice first approach, there are some limitations. What if something fails during the deployment of your stack? How can we retry and be sure that we start from a clean stack? What if several people are trying to deploy the same stack — how do we manage the race conditions? How do we update the stack without tons of if-conditions to control the flow?

That's why Amazon developed AWS CloudFormation. So what is AWS CloudFormation?

AWS CloudFormation provides a common language for you to model and provision AWS and third party application resources in your cloud environment.
https://aws.amazon.com/cloudformation/

What does that mean? Well, instead of a scripted, unmanaged approach to deploying a stack, you now use a declarative one. All you need is to write a YAML file, much like you would write a code script, defining all your needs, and CloudFormation provisions the stack in a secure and repeatable manner. It is a managed service that knows how to update a stack, manage errors and roll back when something goes wrong. You can now define your infrastructure as code.
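As an illustration, a minimal, hypothetical CloudFormation template declaring a single SQS queue could look like this (the queue name is made up):

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Minimal example stack with a single SQS queue
Resources:
  WorkQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: work-queue
```

Everything the stack contains is declared under Resources; CloudFormation takes care of the provisioning order, error handling and rollback.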

But there is actually an even better approach, an applicative one. You can now use your favorite programming language (or at least the most popular ones these days, i.e. JavaScript, TypeScript, Java, Python and C#) in a unified framework: the AWS Cloud Development Kit, or CDK. It is now much easier to design your stack and use code to build it, and the result is much more readable and easier to maintain. Of course, the CDK is still quite young and not everything is supported yet, but it is enough for most needs. The CDK is not a new language; it is "just" a framework, an abstraction layer that transcodes your code into a YAML file that can be used with CloudFormation. And of course, you don't have to deploy that YAML file manually — there is a command for that.

The Stack

AWS Stack

The stack we'll explore here is quite common. Imagine a todo list, a list of tasks a worker has to perform, a list whose size varies over time, and with it the workload. In AWS, this todo list has a name: the Simple Queue Service, or SQS.

AWS also has EC2 instances we can use as our workers. They can be general-purpose instances, or specific ones with lots of RAM, high computing power, or even a GPU attached. The problem is that we don't know how many workers, or instances, we need. Too few and the work takes too long; too many and we pay for instances that do not run at max capacity.

We need to automatically scale in and out according to the workload, i.e. to the size of the queue. Here again, AWS has already thought about it and proposes Auto Scaling Groups. The group can instantiate new EC2 instances, exact copies of the preceding ones; custom images, or AMIs, are especially useful for that. The scaling reacts to alarms that tell it when to scale in and out.

Alarms are raised by another service, CloudWatch, which monitors a metric for us (the SQS queue size in our case). When this metric crosses a given threshold, the alarm is triggered.

So, to summarize: a third party fills an SQS queue. The CloudWatch service monitors the length of the queue and raises alarms according to our pre-defined thresholds. An Auto Scaling Group automatically adjusts its size in reaction to the alarms. Each instance created is an exact copy of the previous ones in the same scaling group, thanks to a custom AMI.

AWS CDK

The AWS Cloud Development Kit articulates around three concepts: an Application, Stacks and Constructs.

An application is the executable part; that's where your stack(s) are actually built. A stack is the list of units, each piece of the puzzle. And finally there are the constructs, which are the actual units, the building blocks.

As stacks in the CDK are implemented through CloudFormation stacks, everything you find in CloudFormation can be implemented with the CDK. However, we also suffer from the same limitations.

The Application

The application code is straightforward. All you need is to import your stack(s) and instantiate them. We further specify an environment through the env_dev variable, as recommended by AWS. Indeed, imagine you have two or more accounts, one for development purposes and one for production. Of course, you want your stacks to be identical across all your accounts; that's where the env property is interesting, as you can reuse the same stack code.
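As a sketch, the app code could look like the following (the account ID, region, module name and stack name are placeholders; MyStack is the stack class defined in the next section):

```python
from aws_cdk import core

from my_stack import MyStack  # hypothetical module containing our Stack

app = core.App()

# Target environment, as recommended by AWS (placeholder account/region)
env_dev = core.Environment(account="123456789012", region="us-east-1")

MyStack(app, "worker-stack-dev", env=env_dev)
app.synth()
```

The same MyStack class could be instantiated a second time with an env_prod environment to get an identical stack in the production account.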

The Stack

An AWS CDK Stack is a container where you declare every unit you want to use; it is the place where you organize the puzzle. A Stack object is a base class that adheres to only a basic interface (OK, Python has no concept of an interface, but as the CDK is implemented in TypeScript, some classes need to adhere to an interface rather than inherit from another class). As such, a Stack only needs you to implement its constructor:

from aws_cdk.core import Stack, Construct

class MyStack(Stack):
    def __init__(self, scope: Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        # The declaration of Constructs goes here

The Constructs

Now the fun part, the building blocks, i.e. the Constructs. A Construct is the code implementation of a cloud component like an S3 bucket, an EC2 instance and so on. Before going into details, here is the whole code of our Stack example.

Implementation of the Stack

i- The first block is an IAM role. We need this role to allow our EC2 instances to read messages from the SQS queue and delete them once processed. We need to specify which service will assume the role with the assumed_by property, the role_name, and the policy — here a managed_policy — we want to give to the role.
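Inside the stack constructor, the role declaration could be sketched like this (the role and policy names are illustrative; a tighter custom policy would be preferable in production):

```python
from aws_cdk import aws_iam as iam

# Inside MyStack.__init__:
worker_role = iam.Role(
    self, "WorkerRole",
    # EC2 is the service that will assume the role
    assumed_by=iam.ServicePrincipal("ec2.amazonaws.com"),
    role_name="sqs-worker-role",
    managed_policies=[
        iam.ManagedPolicy.from_aws_managed_policy_name("AmazonSQSFullAccess")
    ],
)
```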

ii- Then, we declare the SQS queue, and all we need is to give it a name.
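The queue itself is a one-liner (the queue name is arbitrary):

```python
from aws_cdk import aws_sqs as sqs

# Inside MyStack.__init__:
queue = sqs.Queue(self, "WorkQueue", queue_name="work-queue")
```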

iii- The third part is one of the most important parts of our stack, because it is where we declare the metric we want to track to scale the fleet of EC2 instances. The metric is easy to declare, as we don't need a custom one: we just use a metric AWS provides for any SQS queue, the approximate_number_of_messages_visible. Once we have the metric, we build an alarm we can later use to scale the fleet. Here, we want to raise the alarm when the number of visible messages in the queue is at least 1. The auto-scaling group we'll use will have a minimum size of 0, i.e. we want the group to stay idle when there is nothing to do, and to wake up and create an instance when there are messages in the queue — at least one message. To do that, the alarm is raised when there is at least one message in the queue for one minute. We cannot decrease the resolution below one minute by default, as AWS does not allow it; it would be possible with a custom metric, but that is out of the scope of this demonstration. One detail though: CloudWatch puts the alarm in an insufficient_data state when the queue length is zero. The trick is to tell the alarm to consider the insufficient_data state as an ok state, so that no alarm is raised.
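A sketch of the metric and alarm, assuming the queue construct from the previous block (the alarm name is illustrative):

```python
from aws_cdk import core
from aws_cdk import aws_cloudwatch as cloudwatch

# Inside MyStack.__init__, after the queue declaration:
metric = queue.metric_approximate_number_of_messages_visible(
    period=core.Duration.minutes(1),  # one minute is the minimum by default
    statistic="Maximum",
)
alarm = cloudwatch.Alarm(
    self, "QueueNotEmptyAlarm",
    metric=metric,
    threshold=1,            # at least one visible message
    evaluation_periods=1,   # for one period of one minute
    # Treat the insufficient_data state (empty queue) as "not breaching"
    treat_missing_data=cloudwatch.TreatMissingData.NOT_BREACHING,
)
```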

iv- Networking. Our scaling group is part of an existing VPC, so we need to fetch it with the from_lookup function, which takes the VPC ID as an argument. We then specify that we'll use the private subnets of our VPC, and we look up the security group we already created in a previous stack. Fetching the security group works the same way as fetching the VPC: we use from_security_group_id and simply provide the SG ID.
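The lookups could be sketched as follows (the VPC and SG IDs are placeholders; note that from_lookup requires an explicit env on the stack, as discussed in the deployment section):

```python
from aws_cdk import aws_ec2 as ec2

# Inside MyStack.__init__ (IDs are placeholders):
vpc = ec2.Vpc.from_lookup(self, "Vpc", vpc_id="vpc-0123456789abcdef0")
subnets = ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE)
security_group = ec2.SecurityGroup.from_security_group_id(
    self, "WorkerSG", security_group_id="sg-0123456789abcdef0"
)
```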

v- Now the fun part, the auto-scaling group (ASG). Before creating it, we need to retrieve the AMI we will use to create the EC2 instances. We use a custom AMI previously built, so we fetch it as a generic_linux image by giving the AMI ID to the CDK function. And then we are good to create the ASG. We need to provide the role we want our EC2 instances to inherit, the VPC where the ASG is hosted, the instance type (here we simply use a t2.micro), the AMI, the key-pair name if we want to be able to SSH into the EC2 instances, the subnets and finally the ASG scaling parameters. We choose to scale our ASG from 0 to 2 instances, as we want the ASG to be idle when there is no message in the SQS queue; the desired capacity is therefore also set to 0.
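Putting it together, the ASG declaration might look like this (the AMI ID, region and key-pair name are placeholders; the role, VPC, subnets and security group come from the previous blocks):

```python
from aws_cdk import aws_autoscaling as autoscaling
from aws_cdk import aws_ec2 as ec2

# Inside MyStack.__init__:
ami = ec2.MachineImage.generic_linux({"us-east-1": "ami-0123456789abcdef0"})

asg = autoscaling.AutoScalingGroup(
    self, "WorkerASG",
    vpc=vpc,
    vpc_subnets=subnets,
    instance_type=ec2.InstanceType("t2.micro"),
    machine_image=ami,
    role=worker_role,         # the IAM role declared earlier
    key_name="my-key-pair",   # to allow SSH access
    min_capacity=0,           # stay idle when the queue is empty
    max_capacity=2,
    desired_capacity=0,
)
asg.add_security_group(security_group)
```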

The creation of an ASG also creates a default security group (at the time this article was published). However, this appears to be something of a bug, as we are not allowed to change it; we can only add another SG. This issue has been reported to the AWS CDK team. Another reported issue is the use of ASG launch configurations: AWS now pushes launch templates instead of launch configurations to create an ASG, but CloudFormation, and therefore the CDK, does not support this feature yet.

vi- And finally, the glue between our services: we need to tell the ASG to react to the alarm. For this purpose, we define a scaling_action (here a simple step action) that will adjust the number of EC2 instances to 1 when the difference between the metric and the alarm threshold is more than 1. Once the action is built, all that remains is to attach the action to the alarm. The trick here is to cast the aws_autoscaling.StepScalingAction into an aws_cloudwatch_actions.AutoScalingAction (the Python API does not manage it under the hood yet).
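A sketch of the glue code, assuming the alarm and ASG from the previous blocks:

```python
from aws_cdk import aws_autoscaling as autoscaling
from aws_cdk import aws_cloudwatch_actions as cw_actions

# Inside MyStack.__init__:
scaling_action = autoscaling.StepScalingAction(
    self, "ScaleOut",
    auto_scaling_group=asg,
    adjustment_type=autoscaling.AdjustmentType.EXACT_CAPACITY,
)
# Set the capacity to exactly 1 when (metric - threshold) >= 1
scaling_action.add_adjustment(adjustment=1, lower_bound=1)

# Cast the StepScalingAction into a CloudWatch alarm action; the
# Python API does not do this under the hood yet
alarm.add_alarm_action(cw_actions.AutoScalingAction(scaling_action))
```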

Deployment

The deployment is the exciting part. It is when we see that our stack actually builds up… or fails. Thanks to CloudFormation, if something goes wrong during deployment, everything is rolled back; that's the power of CloudFormation. The AWS CDK provides three commands to play with our stack: synth, deploy and destroy.
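In practice, these are plain CLI commands run from the project directory (the profile name here is an assumption):

```shell
cdk synth                  # generate the CloudFormation template
cdk deploy --profile dev   # build or update the stack on AWS
cdk destroy --profile dev  # tear the stack down
```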

Synth

The synth command is a kind of compiler: our Python script is translated into a YAML file that CloudFormation understands. It is also here that the SG ARN, for example, is retrieved from the existing stack.

Note: You must specify your credentials and an AWS Region to use the AWS CDK CLI. The CDK looks for credentials and region in the following order:

* Using the --profile option to cdk commands.
* Using environment variables.
* Using the default profile as set by the AWS Command Line Interface (AWS CLI).

Each Stack instance in your AWS CDK app is explicitly or implicitly associated with an environment (env). An environment is the target AWS account and AWS Region into which the stack is intended to be deployed.

If you don't specify an environment when you define a stack, the stack is said to be environment-agnostic. AWS CloudFormation templates synthesized from such a stack will try to use deploy-time resolution on environment-related attributes.

For production stacks, it is recommended to explicitly specify the environment for each stack in the app using the env property.

Furthermore, as we want to retrieve an existing VPC:

The AWS CDK distinguishes between not specifying the env property at all and specifying it using CDK_DEFAULT_ACCOUNT and CDK_DEFAULT_REGION. The former implies that the stack should synthesize an environment-agnostic template. Constructs that are defined in such a stack cannot use any information about their environment. For example, you can't write code like if (stack.region === 'us-east-1') or use framework facilities like Vpc.fromLookup (Python: from_lookup), which need to query your AWS account. These features do not work at all without an explicit environment specified; to use them, you must specify env.

Deploy

The deploy command is obvious: we actually deploy our stack to our AWS account. A YAML file is built just as with the synth command, but it is now sent to CloudFormation, which processes it. You can of course follow the deployment directly in the console and watch your stack build. If anything goes wrong (for example, the AMI does not exist), the build is rolled back and your stack is not updated/built.

Destroy

Here again, the command name says it all: we can destroy an existing stack. Note, however, that the Python script must reflect the actual state of the stack. This is obvious, because you cannot destroy something that has not been deployed yet — but the stack may have been updated since it was last deployed.

Conclusion

I hope this post will help you architect a simple queue-worker stack. The example here uses Amazon SQS and an Auto Scaling Group to better manage costs. You could reduce costs even more by using spot instances instead of on-demand instances. For other resources, I suggest you read the great article on the AWS blog that this post is based on.
