Deploy an autoscaling machine learning model inference stack to AWS with CDK

Use the power of AWS/CDK to deploy ML models in a stack that scales automatically for inference on CPU or GPU

Guillaume Androz
Towards Data Science

For several years now, data science and machine learning have been sexy, and lots of companies have embraced “artificial intelligence” as a powerful new tool to automate complex tasks as a black box. In this area, deep learning appears to be the Holy Grail for building models that detect and classify images or texts, among other cool things. Once models are trained, it is common to deploy them on a small web server and expose a simple REST API to perform inference on a given sample. This is very convenient because, whereas training a model needs lots of computation power and lots of GPUs, inference usually only needs a relatively low-power CPU to make a prediction on a single sample. You can find numerous articles and blog posts on this approach; even open source projects and companies exist to make the whole thing painless.

There are cases, however, where you need to perform inference not just on a single sample, but on a batch of samples. In that case, inference can take a lot of time on a CPU and you need a GPU to parallelize the job. Using a small web server is no longer relevant, as you don’t want to pay for an always-running machine with an attached GPU. For example, imagine you have to run predictions on large files that are uploaded to your S3 bucket. Each file is first split into small chunks on which the model makes a prediction. What you want is a stack that automatically launches when a file is ready to process, uses a machine with a GPU, and shuts the whole thing down when there is nothing left to process. As there is no serverless stack available (will there be one, one day?), we need to build a stack that scales automatically to our needs.

Architecture

The proposed architecture is the following: we receive an SQS message with the task to perform. For example, it could be the name of a file on S3 to process. Depending on a configuration parameter found in the AWS Parameter Store, this message is forwarded to one of two other SQS queues, one for a CPU-based stack and one for a GPU-based stack. An ECS cluster then uses CloudWatch alarms to trigger the scaling of two autoscaling groups, one with CPU-only instances and one with GPU-enabled instances. The result is then written to an S3 bucket.

Proposed architecture

Automation

To automate the whole thing a little more, we can use a CDK application. Just like Terraform or Kubernetes let you build a stack with code, CDK tends to be the reference for deploying stacks on AWS. You can of course use CloudFormation, but just the idea of managing a stack with very large YAML files makes me run away, whereas using a modern programming language like TypeScript, C# or Python is way more interesting. In the repository source code, you will find two classes that build the stack: DeployModelEcsMediumStackCore and DeployModelEcsMediumStack. The first one, as its name suggests, builds the core of the stack, i.e. the main SQS queue, a Lambda function attached to it, an ECS cluster and some IAM policies. The DeployModelEcsMediumStack class then builds a stack for either the CPU or the GPU architecture: an SQS queue and its associated metrics, an autoscaling group with the right AMI, instance type and scaling policies, and the ECS task with the right ECR image to retrieve. A possible wiring of these two classes is sketched below.
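
As an illustration, here is a minimal sketch of how the two classes could be instantiated in a CDK app. The class names come from the repository, but the import paths (CDK v2 style) and the props interface (core, useGpu) are assumptions for the example and may differ from the actual code.

// bin/app.ts — hypothetical wiring of the two stack classes described above
import * as cdk from 'aws-cdk-lib';
import { DeployModelEcsMediumStackCore } from '../lib/deploy-model-ecs-medium-stack-core';
import { DeployModelEcsMediumStack } from '../lib/deploy-model-ecs-medium-stack';

const app = new cdk.App();

// Core stack: main SQS queue, dispatcher Lambda, ECS cluster, IAM policies
const core = new DeployModelEcsMediumStackCore(app, 'CoreStack');

// One derived stack per hardware flavour; the props shape here is illustrative
new DeployModelEcsMediumStack(app, 'CpuStack', { core, useGpu: false });
new DeployModelEcsMediumStack(app, 'GpuStack', { core, useGpu: true });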

Cluster

First, a cluster is created with the ecs.Cluster construct. From there, we create an autoscaling group by adding a capacity provider to the cluster with the cluster.addCapacity method. We then need to create a task with the ecs.Ec2TaskDefinition construct, providing the proper ECR image retrieved with the static method ecs.ContainerImage.fromEcrRepository. The instances have to run an ECS-optimized AMI to work properly. There is no official AWS AMI that supports GPU for ECS, but you can find custom ones. We also need to pay attention to the gpuCount property and set it to 1 when we want to use a GPU. Finally, the service is created with the ecs.Ec2Service construct and attached to the cluster.
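
To make this concrete, here is a minimal sketch of those calls for the GPU flavour, assuming CDK v2; the construct names, instance type, repository name and sizes are illustrative, not the exact code from the repository.

import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecr from 'aws-cdk-lib/aws-ecr';

export class GpuInferenceStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });

    // ECS cluster that will host the inference service
    const cluster = new ecs.Cluster(this, 'Cluster', { vpc });

    // Autoscaling group of GPU instances; the AMI must be ECS-optimized.
    // Recent CDK versions expose a GPU flavour of the ECS-optimized image;
    // a custom AMI, as used in the article, could be substituted with
    // ec2.MachineImage.genericLinux(...).
    cluster.addCapacity('GpuCapacity', {
      instanceType: new ec2.InstanceType('g4dn.xlarge'),
      machineImage: ecs.EcsOptimizedImage.amazonLinux2(ecs.AmiHardwareType.GPU),
      minCapacity: 0,
      maxCapacity: 2,
    });

    // EC2 task definition that reserves one GPU for the container
    const taskDefinition = new ecs.Ec2TaskDefinition(this, 'TaskDef');
    const repository = ecr.Repository.fromRepositoryName(this, 'Repo', 'my-inference-image');
    taskDefinition.addContainer('InferenceContainer', {
      image: ecs.ContainerImage.fromEcrRepository(repository, 'latest'),
      memoryReservationMiB: 4096,
      gpuCount: 1,
    });

    // Service that runs the task on the cluster
    new ecs.Ec2Service(this, 'InferenceService', { cluster, taskDefinition });
  }
}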

Work with the stack

Once everything has been deployed, all you need to do is send a message to the main SQS queue with the command to execute, i.e. the name of the file to retrieve from S3 in my example. To decide whether to use a GPU-based instance or a CPU-only instance, we only need to change the configuration stored in the Parameter Store of Systems Manager. An example of such a message could be

aws sqs send-message --queue-url https://sqs.ca-central-1.amazonaws.com/1234567890/MediumArticleQueue --message-body '{"filename": "file_to_process.bin"}'
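
To switch between the CPU and the GPU stack, the configuration parameter can be updated with the AWS CLI. The parameter name and value below are illustrative; the actual name depends on what the stack code creates.

aws ssm put-parameter --name "/medium-article/use-gpu" --value "true" --type String --overwrite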

Queues

There are three SQS queues in the proposed architecture, but the user only needs to send a message to the main queue MediumArticleQueue. Once a message is received, a Lambda function is triggered and, depending on the configuration (a parameter in the SSM Parameter Store), the message is forwarded to the proper queue: GPUQueue for the autoscaling group managing GPU-based instances, or CPUQueue for CPU-only instances.
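
A minimal sketch of such a dispatcher function could look like the following. The parameter name, the environment variables and the use of the AWS SDK v3 are assumptions for the example, not taken from the repository.

// Hypothetical dispatcher Lambda (Node.js runtime)
import { SQSEvent } from 'aws-lambda';
import { SSM } from '@aws-sdk/client-ssm';
import { SQS } from '@aws-sdk/client-sqs';

const ssm = new SSM({});
const sqs = new SQS({});

export const handler = async (event: SQSEvent): Promise<void> => {
  // Read the CPU/GPU switch from the Parameter Store; the parameter name is assumed
  const param = await ssm.getParameter({ Name: '/medium-article/use-gpu' });
  const useGpu = param.Parameter?.Value === 'true';

  // Queue URLs are passed to the function as environment variables in this sketch
  const queueUrl = useGpu ? process.env.GPU_QUEUE_URL! : process.env.CPU_QUEUE_URL!;

  // Forward each incoming message body to the selected queue
  for (const record of event.Records) {
    await sqs.sendMessage({ QueueUrl: queueUrl, MessageBody: record.body });
  }
};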

Code

The TypeScript/Python code for this CDK stack is a bit too large to publish here, but you can find the sources in this repo. Feel free to copy or fork the code; a thumbs up or a little comment would be appreciated.

Final note

In the source code, I defined my alarms to scale the autoscaling group, but not the task count. The reason is that when adding an ECS service, we would also set its autoscaling behaviour with the ecsService.autoScaleTaskCount method. However, AWS/CDK does not properly link task scaling and instance scaling, which is the role of the capacity provider. This behaviour can be achieved when you work directly in the console, but not programmatically. There is a PR to correct it, but it was not available at the time this article was published. To prepare for this feature, I added a commented-out code section to illustrate what the code could look like once the feature is released; a sketch of that idea follows.
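
For illustration, a task-count scaling policy driven by queue depth could look roughly like this fragment, assuming ecsService and queue are the service and per-hardware SQS queue created earlier; the capacity bounds and scaling steps are illustrative.

// Hypothetical task-count scaling on queue depth (not the repository's exact code)
const scaling = ecsService.autoScaleTaskCount({ minCapacity: 0, maxCapacity: 4 });

scaling.scaleOnMetric('ScaleOnQueueDepth', {
  // approximate number of visible messages in the per-hardware queue
  metric: queue.metricApproximateNumberOfMessagesVisible(),
  scalingSteps: [
    { upper: 0, change: -1 },  // queue empty: remove a task
    { lower: 1, change: +1 },  // work waiting: add a task
  ],
});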
