CI/CD with a GPU cluster
Problem
With typical continuous integration environments, you configure an environment capable of executing compilation and test batches (agent, slave, ...) coordinated by a scheduler (master, server, ...).
But what if your "client" environment is a Graphics Processing Unit (GPU) cluster used to perform model training in different configurations? Is there any difference, or would you simply, for example, let the cluster's head node host a Jenkins slave (or Bamboo agent, etc.)?
Solution
Since you're talking about CI/CD, I presume you can automate the model trainings in those configurations. Let's call the scripts able to do that train_model_config_A, train_model_config_B, etc.
Then you could have a wrapper script which checks an environment variable used to select which client environment you want and invokes the corresponding train_model_config_* script, ideally translating the outcome of the training (whatever that is) into one or more pass/fail results. That wrapper script can then be integrated into a CI/CD pipeline as a custom test step/stage (or even a build one, if it produces any artifacts you might want to archive), just like any test executed on a testbed incorporating some non-generic piece of hardware. In other words, the GPU cluster makes no real difference.
You might not need to install the slave directly on the GPU cluster if the train_model_config_* scripts (and thus the wrapper script as well) can be executed on some other host and remotely control the GPUs: you can then have the slave on that other host, leaving the GPU cluster free to do only its own work.
Context
StackExchange DevOps Q#2002, answer score: 2
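For illustration, the wrapper script described in the Solution could look roughly like the sketch below. It is only a sketch under assumptions: the environment variable name TRAIN_CONFIG, the script paths, and the convention that a training script exits with 0 on success are all hypothetical and would need to match your own setup.

```python
import subprocess
import sys

# Hypothetical names: neither the variable nor the paths come from the answer.
CONFIG_ENV_VAR = "TRAIN_CONFIG"
SCRIPTS = {
    "A": "./train_model_config_A",
    "B": "./train_model_config_B",
}


def select_script(environ):
    """Return the script path for the requested configuration, or None if unknown."""
    return SCRIPTS.get(environ.get(CONFIG_ENV_VAR, ""))


def run(environ, runner=subprocess.run):
    """Run the selected training script and map its outcome to a pass/fail exit code."""
    script = select_script(environ)
    if script is None:
        print(f"{CONFIG_ENV_VAR} must be one of {sorted(SCRIPTS)}", file=sys.stderr)
        return 2  # configuration error, kept distinct from a failed training run
    result = runner([script])
    # Assumption: the training script signals success with exit code 0.
    return 0 if result.returncode == 0 else 1


# Entry point for a CI step would be: sys.exit(run(os.environ))
```

The CI server then only needs to set TRAIN_CONFIG for the job and treat a non-zero exit code as a failed stage, which is how most CI tools already interpret a shell step. If the training outcome is richer than an exit code (say, a metrics file), the mapping to pass/fail would go where the return code is checked.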