Batch Processing With AWS Batch - Part 1

Posted on 21 August 2024 Reading time: 6 min read

The Problem

I was recently approached by a development team to help solve a problem with their application. They store their images on AWS S3, but over time one of their processes generated and stored large amounts of temporary images across several folders. This caused latency issues and slowed down their application until they found the code that was generating the images and removed it.

Potential Solutions

By that point, though, the images that had already been generated still needed to be removed. The first idea that popped into my head was, ‘Delete the bucket, problem solved’. Easy, right? I wish! Deleting the bucket would also delete other data the business needed. Other options came to mind, like moving the needed data into another bucket and then deleting the original. Easy enough, but not so easy if you don’t know exactly which folders within the bucket are safe to delete.

The Choice - AWS Batch

While I was thinking about possible solutions, I remembered AWS Batch from when I wrote my AWS Solutions Architect exam. In this article, I will explain what AWS Batch is and simulate the above problem while using Batch to solve it.

Batch takes care of provisioning and scaling resources, which means you don’t need to worry about the underlying infrastructure. This allows you to focus on defining the job and how it should be executed. In terms of scalability, it automatically scales up and down based on the volume and resource requirements of your jobs, supporting anything from a few jobs to hundreds of thousands.

AWS Batch can significantly reduce costs for batch processing. It dynamically allocates the least expensive compute resources available, optimizing cost without compromising performance. It integrates seamlessly with other AWS services like Amazon S3, Amazon RDS, AWS Lambda, and more. This integration allows you to build complex workflows and data pipelines efficiently.

Batch jobs can be defined by using Docker containers, which provide a consistent and reproducible environment for your applications. This makes it easier to manage dependencies and ensure consistency across different stages of the job lifecycle.
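To give you a feel for what this looks like in practice, here is a minimal, hypothetical sketch of registering a container-based job definition with the AWS CLI. The name, image, command and resource values below are placeholders for illustration only; in this series we will create the real job definition through the Management Console.

# A minimal container job definition: which image to run, what command to
# execute, and how much CPU/memory the job needs (all values are placeholders).
aws batch register-job-definition \
  --job-definition-name demo-job-definition \
  --type container \
  --container-properties '{
    "image": "amazonlinux",
    "command": ["echo", "hello from AWS Batch"],
    "resourceRequirements": [
      {"type": "VCPU", "value": "1"},
      {"type": "MEMORY", "value": "2048"}
    ]
  }'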

Problem Simulation and Solution

To showcase how AWS Batch works, I will simulate the problem: a large number of files in a specific folder of an S3 bucket needs to be deleted. Granted, there are other ways to do this, but they could either be too risky (e.g. deleting the entire bucket) or slow, for instance using an SDK to make the API calls or manually deleting the files from the Management Console. With AWS Batch, you only need to configure the compute resources and the job queue, create the job, dispatch it, and go about doing other things while the service processes the job. You can also configure Amazon EventBridge (formerly CloudWatch Events) to notify you when the job is done, so you don’t need to keep checking whether it has completed.
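As a rough sketch of how that notification could be wired up with the CLI, assuming you already have an SNS topic to publish to (the topic ARN below is a placeholder), you could create an EventBridge rule that matches Batch job state changes:

# Match AWS Batch jobs that reach a terminal state
aws events put-rule \
  --name batch-job-finished \
  --event-pattern '{
    "source": ["aws.batch"],
    "detail-type": ["Batch Job State Change"],
    "detail": {"status": ["SUCCEEDED", "FAILED"]}
  }'

# Send matching events to an SNS topic (placeholder ARN)
aws events put-targets \
  --rule batch-job-finished \
  --targets "Id"="1","Arn"="arn:aws:sns:eu-west-1:123456789012:batch-notifications"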

Generate and Store Documents

Let us start by simulating the storage of a large quantity of items in an S3 bucket. First, we need to create an S3 bucket where we will store our items. You can do that through the AWS Management Console by typing “s3” into the search box. I will, however, use the AWS CLI because it is quicker and I don’t need to log in to AWS in the browser to run the commands. If you would like to follow along, you will need the following prerequisites:

  • You need an AWS account with necessary permissions to create and manage S3 buckets, EC2 instances, and AWS Batch resources.
  • AWS CLI installed and configured with your AWS credentials.
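If you want to confirm the CLI is configured correctly before proceeding, a quick sanity check is to ask AWS who you are authenticated as:

# Returns the account ID, user ID and ARN of the credentials in use
aws sts get-caller-identity

Also note that S3 bucket names are globally unique, so if you are following along, substitute your own bucket name in the commands below. With that out of the way, we create the bucket: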
aws --profile=personal s3 mb s3://steve-large-data-bucket

The above command can be broken down as follows:

  • aws: The AWS CLI.
  • --profile=personal: I work with multiple AWS accounts; this tells the CLI to use my personal account, which I am using for this simulation. If you are following along and have only one AWS account, you don’t need this.
  • s3: The service you want to run the command on.
  • mb: The command to create a bucket (“make bucket”).
  • s3://steve-large-data-bucket: The bucket name.

If you ran the above command successfully, you should get a response like this:

make_bucket: steve-large-data-bucket

You can confirm that the bucket has been created by running the following command:

aws s3 ls

Now that we have a bucket, it is time to simulate storing a large quantity of items in it. We will achieve this with a simple script, “upload_files.sh”.

#!/bin/bash
# Generate NUM_FILES small text files and upload each one to the bucket.
BUCKET_NAME="steve-large-data-bucket"
NUM_FILES=1000

for i in $(seq 1 $NUM_FILES)
do
  FILE_NAME="file_$i.txt"
  echo "This is file number $i" > "$FILE_NAME"
  aws s3 cp "$FILE_NAME" "s3://$BUCKET_NAME/$FILE_NAME"
  rm "$FILE_NAME"   # remove the local copy once it has been uploaded
done

Hopefully the above script is self-explanatory. In summary, it generates 1000 files and uploads them to the S3 bucket, removing each local copy as it goes.

We make the script executable and run it:

chmod +x upload_files.sh
./upload_files.sh

Depending on your internet speed, this will probably take a while to finish, since each file is uploaded with its own aws s3 cp call.
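If you would rather not wait, a quicker alternative that ends up with the same data in the bucket is to generate the files locally first and upload the folder in a single command, letting the CLI parallelise the transfers:

# Generate the files into a local folder first...
mkdir -p generated_files
for i in $(seq 1 1000); do
  echo "This is file number $i" > "generated_files/file_$i.txt"
done

# ...then upload the whole folder in one call
aws s3 sync generated_files s3://steve-large-data-bucket/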

Create Custom Docker Image

To set up our job on AWS Batch, we first need to create a custom Docker image and push it to Amazon Elastic Container Registry (ECR), Amazon’s repository for Docker images. This image will be the base for our job definition.

For our custom Docker image, we can create a simple Dockerfile using the base amazonlinux image.

# Start from Amazon's base Linux image
FROM amazonlinux

# Install the AWS CLI so the job can interact with S3
RUN yum install -y aws-cli && \
    yum clean all

# If the CLI was installed as "aws2", expose it as "aws" too
RUN if [ ! -e /usr/bin/aws ]; then ln -s /usr/bin/aws2 /usr/bin/aws; fi

Once saved, we build the image:

docker build -t aws-batch-s3 .
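Before pushing it anywhere, you can optionally sanity-check the image locally to confirm the CLI made it in:

# Should print the AWS CLI version baked into the image
docker run --rm aws-batch-s3 aws --version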

Next, we push the image to ECR. Make sure you authenticate first, otherwise you will run into authentication errors. One easy way to do this is as follows:

AWS_ACCOUNT_ID=[your aws account id]
REGION=[your aws region]

aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com

When you run the last command, you should see the following response:

Login Succeeded

Now let’s tag and push our custom Docker image to ECR as follows:

docker tag aws-batch-s3:latest $AWS_ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/aws-batch-s3:latest

docker push $AWS_ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/aws-batch-s3:latest

Please note, you will need to have created a repository in Amazon ECR with a matching name, otherwise you will encounter errors.
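If you have not created it yet, one way is via the CLI; the repository name just needs to match the image name used above:

# Create the ECR repository the image will be pushed to
aws ecr create-repository --repository-name aws-batch-s3 --region $REGION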

This first part focused on setting up the environment and preparing the necessary resources in AWS; in the next part, we will use the AWS Management Console to complete the setup and see AWS Batch in action.