
CI/CD with a GPU cluster

Submitted by: @import:stackexchange-devops·Mar 10, 2026·



Problem

In a typical continuous integration setup, you configure an environment capable of executing compilation and test batches (agent, slave, ...) coordinated by a scheduler (master, server, ...).

But what if your "client" environment is a Graphics Processing Unit (GPU) cluster used to train models in different configurations? Is there any difference, or would you simply, for example, have the cluster's head node host a Jenkins slave (or a Bamboo agent, etc.)?

Solution

Since you're talking about CI/CD, I presume you can automate the model trainings in those configurations. Let's call the scripts that do that train_model_config_A, train_model_config_B, etc.

Then you could have a wrapper script which checks an environment variable that selects the desired client environment and invokes the corresponding train_model_config_ script, ideally translating the outcome of the training (whatever that is) into one or more pass/fail results. Such a wrapper script can then be integrated into a CI/CD pipeline as a custom test step/stage (or even a build one, if it produces any artifacts you might want to archive), just like any test executed on a testbed incorporating some non-generic piece of hardware. In other words, the GPU cluster makes no real difference.
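As a rough illustration of the wrapper script described above, here is a minimal Python sketch. The environment variable name TRAIN_CONFIG, the configuration keys, and the script paths are assumptions made up for this example. It also assumes each training script signals success with a zero exit status, which may not match how your trainings report their outcome.

```python
import subprocess
import sys

# Hypothetical mapping from a configuration name to the script that trains it.
# The paths follow the train_model_config_ naming used in the answer.
TRAIN_SCRIPTS = {
    "A": "./train_model_config_A",
    "B": "./train_model_config_B",
}


def select_script(env):
    """Return the training script for the configuration named in TRAIN_CONFIG."""
    config = env.get("TRAIN_CONFIG")  # TRAIN_CONFIG is an assumed variable name
    if config not in TRAIN_SCRIPTS:
        raise ValueError(f"unknown or missing TRAIN_CONFIG: {config!r}")
    return TRAIN_SCRIPTS[config]


def run_training(env, runner=subprocess.run):
    """Run the selected script and reduce its outcome to pass (0) or fail (1)."""
    try:
        script = select_script(env)
    except ValueError as err:
        print(err, file=sys.stderr)
        return 1
    completed = runner([script])
    # Assumption: a zero exit status from the training script means "pass".
    return 0 if completed.returncode == 0 else 1
```

A CI test step could then end with `sys.exit(run_training(os.environ))` (after `import os`), so that the pipeline sees an ordinary pass/fail exit code. The `runner` parameter exists only so the logic can be exercised without a real cluster.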

You might not need to install the slave directly on the GPU cluster if the train_model_config_ scripts (and thus the wrapper script as well) can be executed on some other host and control the GPUs remotely. You can then run the slave on that other host, leaving the GPU cluster free to do only its own work.
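One possible way to drive the cluster from another host is to launch the training script over SSH, sketched below in Python. This is only an assumption about how remote control might work in your setup: the host name gpu-head.example.com is hypothetical, and the sketch presumes key-based SSH access from the CI slave to the cluster's head node. It relies on ssh exiting with the remote command's exit status.

```python
import subprocess


def build_remote_command(head_node, script):
    """Build an ssh command line that runs the training script on the head node."""
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # which suits an unattended CI job.
    return ["ssh", "-o", "BatchMode=yes", head_node, script]


def run_remote_training(head_node, script, runner=subprocess.run):
    """Return True if the remote training script exited with status 0."""
    completed = runner(build_remote_command(head_node, script))
    # ssh passes through the remote exit status, or exits with 255 on its own errors.
    return completed.returncode == 0
```

For example, `run_remote_training("gpu-head.example.com", "./train_model_config_A")` would run on the slave's host while the training itself happens on the cluster.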

Context

StackExchange DevOps Q#2002, answer score: 2
