Score:-3

Heterogeneous Cluster Solution for R&D


I work in a research laboratory with multiple physical machines with different specifications. The machines have different CPUs (some Intel, some AMD), different RAM sizes, some have discrete GPUs, and some don't.

Our current solution is based on SSSD and Kerberos, so that users can log in to their accounts from every terminal and have access to their files. The problem is that this way, users are "tied" to one machine while they are working, resulting in sub-optimal resource allocation.

Therefore, we are looking for an alternative solution for our cluster. Our main goal is to truly unify all the machines, i.e., from the user's point of view, the cluster consists of a single machine. However, from what we gather, a solution such as Slurm is not ideal, since we do not want to rely on a job scheduler. The solution we envision goes something like this: when a user logs in, they can specify the specs they need (RAM, number of CPUs, discrete GPU, etc.), and a virtualized environment with the desired specs is created (a Docker container or a virtual machine, for instance). The user can then use that environment as a regular "computer." However, the resources for this virtual environment should be drawn from the cluster as a whole and not from a single machine. It should also be possible to share large datasets that every "virtual environment" can access, and the cluster should have an authentication and permission system.

We have been searching for clustering tools that can achieve our goal, but we are unsure which one to pick. We have looked into Mesos, OS/DB, Docker Swarm, Kubernetes, and oVirt, but we do not know whether what we want is achievable with these tools and, if so, which one is the best pick. We think that containers might be a good option for production but probably not the best choice for R&D. Can you help us out and give some pointers on what to do and where to start?

Best regards, pinxau1000

pinxau1000
*Before downvoting, note that this question was originally posted on Stack Overflow; however, we were advised to move it here.*
Nikita Kipriyanov
This is not achievable with *any* of these tools. Each individual partition, container, or VM can draw no more than the resources of a single machine. There have been projects that could make a cluster appear as a single machine with combined resources (e.g., Kerrighed); this is called a Single System Image (SSI) cluster, but even then each job (i.e., an OS process) still has to run on a single participating computer and cannot span several of them. If you want to spread your load, use job scheduling, period.
Score:2

It's not possible.

  1. Different CPUs mean the available instruction sets may differ, which makes migrating running code between CPUs a nightmare.
  2. Memory latency is measured in nanoseconds; network latency is in the tens of microseconds.

Depending on your workload, it may be possible to restructure it to run on multiple computers and communicate data between them. For some problems this is trivial: you can slice the dataset into smaller partitions and work on them in parallel. For other workloads it is difficult. But this requires modifications to the workload, not to the operating system.
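For the trivial case, a minimal sketch of that slicing could look like the following on a single machine (`./process` is a hypothetical per-chunk worker, not a real tool); spreading the chunks across several machines is exactly where a scheduler or MPI comes in:

```
# Sketch: split a dataset into 8 chunks and process them in parallel.
# "./process" is a hypothetical worker that handles one chunk.
split -n l/8 dataset.csv chunk_    # split by lines into 8 pieces (GNU coreutils)
for f in chunk_*; do
    ./process "$f" > "$f.out" &    # one background worker per chunk
done
wait                               # block until every worker finishes
cat chunk_*.out > results.csv      # merge the partial results
```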

Score:2

Agree with @NikitaKipriyanov that you cannot combine resources from multiple systems into a single image, although there have been commercial products in the past that did this by relying on InfiniBand to keep latency down (IMHO, it did not work well). Slurm can be used as a scheduler, but you can also use it for interactive jobs, in which case it acts more like a resource manager.

Each job can specify the number of CPU cores, the number and type of GPUs, the amount of memory, and so on. The scheduler then picks an appropriate, unused system and gives you a shell prompt. X11 forwarding is available if needed.
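For example, an interactive session requesting 8 cores, 32 GB of RAM, and one GPU could be started roughly like this (a sketch; the exact flags and GRES names depend on how the site is configured):

```
# Ask Slurm for an interactive shell with specific resources
srun --cpus-per-task=8 --mem=32G --gres=gpu:1 --time=04:00:00 --pty bash
# add --x11 if your Slurm build has X11 forwarding enabled
```

The shell runs on whichever node the scheduler picked, and the requested limits apply for the life of the session.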

Also, containers can be quite useful in an R&D environment. You should not throw them out because you don't see the utility but they are not the solution to this problem.

pinxau1000
Thanks for the reply. If I understand correctly, it is impossible to "unify" all the physical machines' resources as if they were a single machine. Instead, I should look for a job scheduler that selects an appropriate machine from the cluster and runs the job according to the user's specification. **Can you provide examples of tools or frameworks that can be used to achieve that goal?**
@pinxau1000 Slurm is widely used for this and is well supported. It scales easily from a few systems to thousands of servers. Heterogeneous systems are the norm, and each server (or group of servers) is defined with its capabilities. The user defines the needed resources and the scheduler picks a system that meets or exceeds the requirements. Jobs can be run interactively, scheduled to run as soon as resources are available, or set to run at a specific time. PBS and LSF are other options.
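As a rough sketch of the non-interactive side (the job name, script, and times here are made up; GRES and partition names vary per site), a batch submission would look something like:

```
#!/bin/bash
#SBATCH --job-name=example          # hypothetical job name
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00
#SBATCH --begin=now+1hour           # or a fixed time, e.g. 2025-01-01T08:00

./run_experiment.sh                 # hypothetical workload script
```

Submitted with `sbatch job.sh`, it waits in the queue until a node with matching resources is free (or until the `--begin` time), then runs unattended.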