Score:1

Running a cluster 24x7 at full load. Possible damages?


Let us assume that we have a pool of some 50 computers with 6 cores and 12 threads each.

If someone plans to use it for intensive astrophysics simulation using all of its logical CPUs (50*12) 24x7, how long will it be able to sustain without any physical damage? Assume simple cooling with ACs, plus the fans the CPUs come with. Can there be any performance degradation over time? If yes, what is the solution?

Please note the two main requirements:

  1. 100% CPU usage for all CPUs and,
  2. the concern is about continuous running over say, years.
Romeo Ninov
Are the servers enterprise grade? Do you have appropriate cooling and powering systems? Can the application survive one or more servers down?
Peedaruos
No, it can come from consumer class CPUs, say Intel Core (6C, 12T). As I mentioned, there is nothing special about the cooling: these CPUs come with their fans and the entire thing will be placed in an air conditioned room at, say, 20 degC. All CPUs are needed at all times.
user1751825
Consumer grade hardware will not last very long being used in this way, certainly not years. If the application is important, and it's worth running it for that long, then it's probably worth investing in proper enterprise grade workstations/servers.
user1751825
If you're hoping to save money by using consumer grade, you're perhaps not considering the cost of the electricity to run the computers and air conditioning. The purchase price may be quite insignificant compared to the running costs.
HBruijn
Cheaper hardware will potentially fail sooner and more frequently, which makes it essential to have a proper cluster/workload design that allows for failure and accommodates re-running uncompleted work packages from a failed node.
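One way to picture "re-running uncompleted work packages from a failed node" is a scheduler that puts a package back on the queue when the node running it dies. The following is only a minimal single-process sketch: `NodeFailure`, `run_packages`, `run_on_node`, and `max_attempts` are hypothetical names invented for illustration, and a real cluster would rely on its batch scheduler's own requeue and checkpoint facilities rather than anything like this.

```python
from collections import deque


class NodeFailure(Exception):
    """Hypothetical signal that the node running a package went down."""


def run_packages(packages, run_on_node, max_attempts=3):
    """Run every package, re-queueing the ones whose node failed.

    packages    -- iterable of hashable work-package identifiers
    run_on_node -- callable(package) -> result; raises NodeFailure on a node loss
    Returns (results, abandoned): a dict of finished packages and a list of
    packages that still failed after max_attempts tries.
    """
    pending = deque((pkg, 1) for pkg in packages)
    results = {}
    abandoned = []
    while pending:
        pkg, attempt = pending.popleft()
        try:
            results[pkg] = run_on_node(pkg)
        except NodeFailure:
            if attempt < max_attempts:
                # Put the uncompleted package back at the end of the queue.
                pending.append((pkg, attempt + 1))
            else:
                abandoned.append(pkg)
    return results, abandoned
```

The point of the sketch is the design choice, not the code: work is split into packages small enough that losing one node costs only the package it was running, never the whole simulation.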
HBruijn
The larger the number of compute nodes, the more likely it is that over time you will see failures and misbehaving nodes, almost regardless of the quality of your hardware. For example, if you use memory with a mean time between failures rating of 1.5 million hours (about 170 years) and you have 50 nodes with 32 memory slots each (1600 pieces of RAM), then you'll have roughly one stick of memory failing every month. In a larger cluster, with 300 nodes for example, that means a memory failure every week. That is almost unavoidable, and your workload needs to be designed to be able to cope with failures.
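The arithmetic behind those figures can be checked with a short calculation. This is a rough sketch that assumes a constant failure rate and independent failures, so the expected number of failures is simply components × hours / MTBF. The function name `expected_failures` and the 730-hour month are assumptions made for illustration; the MTBF and node counts are the ones from the comment above.

```python
HOURS_PER_YEAR = 24 * 365   # 8760
HOURS_PER_MONTH = 730       # assumption: average month, 8760 / 12
HOURS_PER_WEEK = 24 * 7     # 168


def expected_failures(mtbf_hours, n_components, window_hours):
    """Expected failures in a time window, assuming a constant,
    independent failure rate of 1 / MTBF per component."""
    return n_components * window_hours / mtbf_hours


mtbf = 1.5e6                 # hours, from the comment
small_cluster = 50 * 32      # 1600 memory modules
large_cluster = 300 * 32     # 9600 memory modules

mtbf_years = mtbf / HOURS_PER_YEAR
per_month_small = expected_failures(mtbf, small_cluster, HOURS_PER_MONTH)
per_week_large = expected_failures(mtbf, large_cluster, HOURS_PER_WEEK)

print(f"MTBF in years: {mtbf_years:.0f}")
print(f"50 nodes, failures per month: {per_month_small:.2f}")
print(f"300 nodes, failures per week: {per_week_large:.2f}")
```

Under these assumptions the 50-node cluster comes out a little under one module failure per month and the 300-node cluster at roughly one per week, which matches the order of magnitude quoted.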
djdomi
Does this answer your question? [Can you help me with my capacity planning?](https://serverfault.com/questions/384686/can-you-help-me-with-my-capacity-planning)
Score:5

how long will it be able to sustain without any physical damage?

If you buy decent production-quality servers then no, you shouldn't see any damage. In fact, there's an argument that you'd see less damage than with servers that cycle between running hot and cold, as thermal-shock can damage components more than being on all the time.

Can there be any performance degradation over time?

Not really, not on any solid-state components anyway. I suppose your PSUs might get slightly less efficient, and your fans may even degrade a bit as they get covered in dust.

Obviously no amount of planning will stop components failing mid-life, but if you design your clusters to handle that sort of thing it doesn't have to be business-impacting.

Peedaruos
Thanks, but I do not understand what you mean by 'production-quality servers'. We are talking about consumer grade CPUs. For example, I have the Intel i5 10400F or AMD Ryzen 5 5600G (6C, 12T) in mind. I am aware that several top-notch Xeon or Epyc CPUs can certainly tackle the requirements.