Serial Workflows

Serial workflows, such as many AI workloads, consist of a series of tasks in which each task depends on the output of the one before it, for example pre-processing, learning and post-processing.

Figure: Illustration showing how to scale a serial job.
 
When a workflow like this runs for days or weeks as a single job, you must reserve a computing resource that satisfies the requirements of every task at once, such as many CPU cores, a large amount of memory and perhaps a GPU. Expensive CPU cores, memory or GPUs then sit idle while the current task uses the other resources. This increases your costs and limits how far the workload can scale.
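To make the idle-resource argument concrete, here is a small Python sketch that compares what is reserved when the whole workflow holds one large allocation with what is reserved when each stage asks only for what it needs. All stage durations, core counts and GPU counts are hypothetical numbers chosen for illustration, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    hours: float  # wall-clock duration of the stage
    cores: int    # CPU cores the stage actually needs
    gpus: int     # GPUs the stage actually needs


# Hypothetical three-stage workflow (illustrative numbers only).
STAGES = [
    Stage("pre-processing", hours=10, cores=64, gpus=0),
    Stage("learning", hours=30, cores=8, gpus=1),
    Stage("post-processing", hours=10, cores=64, gpus=0),
]


def monolithic_reservation(stages):
    """One allocation sized for the most demanding stage, held for the whole run."""
    total_hours = sum(s.hours for s in stages)
    return {
        "core_hours": max(s.cores for s in stages) * total_hours,
        "gpu_hours": max(s.gpus for s in stages) * total_hours,
    }


def per_stage_reservation(stages):
    """Each stage reserves only what it needs, only while it runs."""
    return {
        "core_hours": sum(s.cores * s.hours for s in stages),
        "gpu_hours": sum(s.gpus * s.hours for s in stages),
    }


if __name__ == "__main__":
    print("monolithic:", monolithic_reservation(STAGES))
    print("per stage: ", per_stage_reservation(STAGES))
```

With these assumed numbers the single allocation reserves 3200 core-hours and 50 GPU-hours, while the per-stage allocation reserves 1520 core-hours and 30 GPU-hours for the same work.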

At the Norwegian Competence Centre for HPC we help customers split the workflow into individual tasks, with the hardware requirements and dependencies of each task defined at the scheduler level. CPU-hungry tasks can then be spread over more CPU resources to reduce the runtime, and expensive hardware such as GPUs is not reserved and paid for while it waits for the pre-processing task to deliver its data. The same applies while the GPU-dependent task is running: it can be scaled over more GPUs for a shorter period, and the many CPU cores and the large amount of memory needed for pre-processing can be released and reserved again when they are needed for post-processing.
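As a rough sketch of what "defined at the scheduler level" can look like, the Python example below builds one submission command per task, each with its own resource request and a dependency on the task before it. It assumes Slurm as the scheduler (the text above does not name one) and uses the `sbatch` options `--cpus-per-task`, `--mem`, `--gres`, `--dependency=afterok` and `--parsable`. The task names, script names and resource sizes are hypothetical.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    script: str        # hypothetical batch script for this task
    cpus: int
    mem: str
    gpus: int = 0
    after: tuple = ()  # names of tasks that must finish successfully first


# Hypothetical workflow: resources are requested per task, not per workflow.
WORKFLOW = [
    Task("preprocess", "preprocess.sh", cpus=64, mem="256G"),
    Task("train", "train.sh", cpus=8, mem="64G", gpus=1, after=("preprocess",)),
    Task("postprocess", "postprocess.sh", cpus=64, mem="256G", after=("train",)),
]


def sbatch_command(task, job_ids):
    """Build the sbatch argument list for one task.

    job_ids maps already-submitted task names to their Slurm job IDs.
    """
    cmd = [
        "sbatch",
        "--parsable",  # print only the job ID so it can be captured
        f"--job-name={task.name}",
        f"--cpus-per-task={task.cpus}",
        f"--mem={task.mem}",
    ]
    if task.gpus:
        cmd.append(f"--gres=gpu:{task.gpus}")
    if task.after:
        deps = ":".join(job_ids[name] for name in task.after)
        cmd.append(f"--dependency=afterok:{deps}")
    cmd.append(task.script)
    return cmd


def submit_all(tasks):
    """Submit tasks in order; requires a working Slurm installation."""
    job_ids = {}
    for task in tasks:
        result = subprocess.run(
            sbatch_command(task, job_ids),
            check=True, capture_output=True, text=True,
        )
        # --parsable prints "jobid" or "jobid;cluster"
        job_ids[task.name] = result.stdout.strip().split(";")[0]
    return job_ids
```

Calling `submit_all(WORKFLOW)` on a cluster would queue all three tasks at once. The scheduler then starts each task only after its predecessor has completed successfully, so the GPU is requested only for the training task and the large CPU and memory allocations only for pre- and post-processing.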
 
This type of optimisation applies not only when adapting a workload to HPC or a supercomputer, but also when increasing the efficiency and reducing the cost of commercial cloud resources.

Want to know more? Contact the Norwegian Competence Centre for HPC.