I work in a research laboratory that has multiple physical machines of varying specifications. The machines have different CPUs (some Intel, some AMD), different amounts of RAM, and some have discrete GPUs while others don't.
Our current solution is based on SSSD and Kerberos, so users can log in to their accounts from any terminal and access their files. The problem is that, this way, a user is "tied" to a single machine while working, which results in sub-optimal resource allocation.
Therefore, we are looking for an alternative solution for our cluster. Our main goal is to truly unify all the machines, i.e., from the user's point of view, the cluster behaves like a single machine. From what we gather, a solution such as Slurm is not ideal, since we do not want to rely on a job scheduler. The solution we envision goes something like this: when a user logs in, they specify the specs they need (RAM, number of CPUs, discrete GPU, etc.), and a virtualized environment with those specs is created (a Docker container or a virtual machine, for instance). The user can then use that environment as a regular "computer." Crucially, the resources for this virtual environment should be drawn from the cluster as a whole, not from a single machine. It should also be possible to share large datasets so that they can be accessed from every virtual environment, and the cluster should have an authentication and permission system.
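To make this more concrete, here is a rough sketch of the kind of request we imagine a user (or a small wrapper script) making at login. It uses the Kubernetes Python client purely for illustration; we are not committed to Kubernetes, and the image name, namespace, resource numbers, and the "shared-datasets" volume claim are all hypothetical placeholders.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (illustrative setup only).
config.load_kube_config()

# Hypothetical per-user workspace: the user asks for 8 CPUs, 32 GiB RAM
# and one GPU, and gets a long-running container to work in.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="alice-workspace", namespace="lab"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="workspace",
                image="lab/base-env:latest",          # placeholder image
                command=["sleep", "infinity"],        # keep the environment alive
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "8", "memory": "32Gi"},
                    limits={"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "1"},
                ),
                volume_mounts=[
                    # Shared, read-only dataset mount visible to every environment.
                    client.V1VolumeMount(name="datasets", mount_path="/data", read_only=True)
                ],
            )
        ],
        volumes=[
            client.V1Volume(
                name="datasets",
                persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                    claim_name="shared-datasets"      # placeholder claim backed by shared storage
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="lab", body=pod)
```

Note that in this sketch the scheduler still places the environment on one node, so the resources come from a single machine rather than being pooled across the cluster; that gap is exactly what we are unsure how to close. The "shared-datasets" claim stands in for the large datasets we want every environment to see, however that storage would actually be provided.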
We have been searching for clustering tools that could achieve this, but we are unsure which one to pick. We have looked into Mesos, OS/DB, Docker Swarm, Kubernetes, and oVirt, but we do not know whether what we want is achievable with these tools and, if so, which one fits best. We think containers might be a good option for production, but probably not the best choice for R&D. Could you help us out and give some pointers on what to do and where to start?
Best regards, pinxau1000