
I want to scale up/down as fast as AWS Lambda but also be able to allocate the vCPU (minimum 8) per task, any advice?

Submitted by: @import:stackexchange-devops·Mar 10, 2026·

Problem

Goal

So I'm looking for an AWS product or combination of products to accomplish the following:

Ability to do CPU-heavy (non-parallel) calculations (min. 8 vCPUs per node)

A timeout limit of at least 1800 seconds (AWS Lambda has a limit of 900 seconds)

Ability to automatically scale up and down to 0/1

Ability to scale up really fast (

Event-driven execution model (one task per node, node gets destroyed after the job is finished)

Ability to assign the desired number of vCPUs per "type" of event/task/job (3 predefined types)

I'll try to be as detailed as possible in my description of what the job is, the two setups I've tried and what I don't like about them.

Definitions

- ROE: Route Optimisation Engine

- VRP: Vehicle Routing Problem

A single JSON document containing all the information (stops, vehicles, time windows, start/end addresses) that needs to be optimised. VRPs can be classified into three complexities [easy, medium, hard] based on:

number of stops

restrictions per stop (time window, capacity)

number of vehicles

restrictions per vehicle (time window, maximum range, capacity, breaks)

VRPs classified as easy need fewer CPU resources and can be solved more quickly than, for example, VRPs classified as hard.

- Solution: The best solution for the VRP (the most efficient routes for the VRP)
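To make the three complexity types concrete, here is a minimal sketch of how a VRP could be bucketed and mapped to a vCPU count. The JSON field names (`stops`, `vehicles`, `restrictions`), the score formula, the thresholds and the vCPU sizes are all assumptions for illustration, not part of the actual ROE.

```python
# Hypothetical vCPU sizes per complexity type (the question asks for at least 8).
VCPUS_BY_COMPLEXITY = {"easy": 8, "medium": 16, "hard": 32}


def classify_vrp(vrp: dict) -> str:
    """Bucket a VRP as easy/medium/hard from its size and restriction count."""
    stops = vrp.get("stops", [])
    vehicles = vrp.get("vehicles", [])
    # Every restriction (time window, capacity, range, breaks) adds to the score.
    restrictions = sum(len(s.get("restrictions", [])) for s in stops)
    restrictions += sum(len(v.get("restrictions", [])) for v in vehicles)
    # Assumed scoring rule: stops times vehicles, plus one point per restriction.
    score = len(stops) * max(len(vehicles), 1) + restrictions
    if score < 100:  # assumed threshold
        return "easy"
    if score < 1000:  # assumed threshold
        return "medium"
    return "hard"


def vcpus_for(vrp: dict) -> int:
    """Look up the vCPU count for the complexity type of this VRP."""
    return VCPUS_BY_COMPLEXITY[classify_vrp(vrp)]
```

The thresholds would need tuning against measured solve times; the point is only that the type can be decided from the request JSON before any compute is allocated.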

Setup 1 - AWS Lambda

Diagram

AWS SQS

A standard AWS SQS queue holds the optimisation request messages. Every message that enters the queue triggers the ROE (AWS Lambda).

AWS Lambda

The Lambda (3008 MB memory) is triggered by the AWS SQS queue for optimisation request messages and processes all messages as they are added to the queue. The maximum timeout for any AWS Lambda function is 15 minutes.

Problems

AWS Lambda does not let us select more CPU resources, so optimising medium-complexity VRPs takes a long time.

AWS Lambda's maximum timeout of 15 minutes makes it impossible to use a Lambda for optimising high-complexity VRPs.

Solution

Check out AWS Fargate; I've just used it for a similar use case.

Setup:

Jobs are posted to an SQS queue where a scheduler lambda picks them up and launches an AWS Fargate task.
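A minimal sketch of such a scheduler Lambda, assuming one ECS task definition per complexity type. The cluster name, task definition names, container name, subnet ID, the `VRP_S3_URI` variable and the message fields (`complexity`, `input_uri`) are all hypothetical. The ECS client is injectable so the logic can be exercised without AWS; inside Lambda it falls back to a real boto3 client.

```python
import json

# Hypothetical task definitions, one per predefined job type.
TASK_DEFINITIONS = {"easy": "roe-easy", "medium": "roe-medium", "hard": "roe-hard"}


def build_run_task_request(job: dict, retries: int = 0) -> dict:
    """Build the keyword arguments for an ECS RunTask call for one job."""
    return {
        "cluster": "roe-cluster",  # assumed cluster name
        "taskDefinition": TASK_DEFINITIONS[job["complexity"]],
        "launchType": "FARGATE",
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": ["subnet-PLACEHOLDER"],  # placeholder subnet ID
                "assignPublicIp": "DISABLED",
            }
        },
        "overrides": {
            "containerOverrides": [{
                "name": "roe",  # assumed container name in the task definition
                # Data reaches the container through environment variables.
                "environment": [{"name": "VRP_S3_URI", "value": job["input_uri"]}],
            }]
        },
        # The retry count travels with the task as a tag.
        "tags": [{"key": "retries", "value": str(retries)}],
    }


def handler(event, context, ecs=None):
    """Lambda entry point for an SQS trigger; starts one task per message."""
    if ecs is None:
        import boto3  # provided by the Lambda Python runtime
        ecs = boto3.client("ecs")
    started = []
    for record in event["Records"]:
        job = json.loads(record["body"])
        response = ecs.run_task(**build_run_task_request(job))
        if response.get("failures"):
            # Raising fails the invocation, so the message is not deleted.
            raise RuntimeError(f"RunTask failed: {response['failures']}")
        started.extend(task["taskArn"] for task in response["tasks"])
    return started
```

Picking a task definition per type is one way to get a different vCPU size per job; the sizes themselves live in the task definitions rather than in this code.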

Caveats:

Lambda consumes the SQS messages, so you can't hand them over to Fargate: Lambda automatically deletes a message from the queue once the function returns successfully. Instead, you'll need to handle retries in a separate Lambda that gets notified whenever a task ends.

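Fargate supported at most 4 vCPUs per task when this was written. The limit has since been raised to 16 vCPUs per task, which covers the 8 vCPU minimum.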

You need to pass any data to the container either via environment variables or as command line arguments

Launch times are not as fast as lambda, but much better than Batch and EC2 - I typically get my containers within 30 seconds of the lambda launching them.

Dead-letter/retry handling is a bit of a nuisance, but it's not that hard to implement as long as your process is well-behaved. You can easily get the process exit status from the API once the task has finished and you can store the number of retries as a tag on the task since you'll be launching a new task for each retry.
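The retry decision described above can be sketched as a small pure function. This assumes `task` is shaped like one entry of `tasks` in an ECS DescribeTasks response requested with `include=["TAGS"]`; the `retries` tag key and the retry budget are hypothetical choices.

```python
MAX_RETRIES = 3  # hypothetical retry budget


def retry_count(task: dict) -> int:
    """Read the 'retries' tag from a task description (0 if the tag is absent)."""
    for tag in task.get("tags", []):
        if tag["key"] == "retries":
            return int(tag["value"])
    return 0


def succeeded(task: dict) -> bool:
    """A task succeeded only if every container exited with status 0."""
    containers = task.get("containers", [])
    # A container that never started has no exitCode and counts as a failure.
    return bool(containers) and all(c.get("exitCode") == 0 for c in containers)


def next_action(task: dict) -> tuple:
    """Return ('done', None), ('retry', next_count) or ('dead_letter', None)."""
    if succeeded(task):
        return ("done", None)
    attempts = retry_count(task)
    if attempts < MAX_RETRIES:
        return ("retry", attempts + 1)
    return ("dead_letter", None)
```

On `('retry', n)` the handler would launch a fresh task carrying `retries=n` as its tag; on `('dead_letter', None)` it would record the job somewhere for inspection.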

The scheduler and retry handler should be fairly trivial to write and if you need to pass larger amounts of data you can store it on S3 and send the location as an argument to your container.
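On the container side, picking up the S3 location could look like the sketch below. The `--input` flag and the `VRP_S3_URI` environment variable are hypothetical names; the code only resolves the bucket and key and leaves the download itself to whichever S3 client you use.

```python
import argparse
from urllib.parse import urlparse


def parse_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"not an S3 object URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def resolve_input(argv: list, environ: dict) -> tuple:
    """Take the input location from --input, falling back to VRP_S3_URI."""
    parser = argparse.ArgumentParser(description="ROE worker (sketch)")
    parser.add_argument("--input", default=environ.get("VRP_S3_URI"))
    args = parser.parse_args(argv)
    if args.input is None:
        parser.error("pass --input or set VRP_S3_URI")
    return parse_s3_uri(args.input)
```

In the container entry point this would be called as `resolve_input(sys.argv[1:], os.environ)`, after which the object can be fetched and handed to the solver.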

A similar option, if you don't mind paying for extra, unused capacity, is to run ECS on EC2 instances instead of Fargate and keep spare capacity in the cluster so you can spawn new containers more or less instantly. This removes the vCPU limit, but you'll need to manage the cluster size yourself (this can be done via Auto Scaling) and pay for the unused capacity.

Context

StackExchange DevOps Q#11436, answer score: 1
