Low capacity jobs blocking high capacity jobs with higher prio #540

ultra-sonic · 2022-04-20T07:30:31Z

Hi Timur,

today I am reporting an issue that has been bugging us for a while and is becoming increasingly important right now.

The scenario is simple:
Render nodes all have a total capacity of 1100.
Job 1 has prio 50 and 1000 tasks that need a capacity of 500 each.
Job 1 is already started and its tasks finish asynchronously, so at most 600 capacity is free at any time, which prevents higher capacity tasks from starting.
Job 2 has prio 200 and 1 task that needs 1000 capacity, but it can never start bc there is never enough capacity left.
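
To make the scenario concrete, here is a minimal sketch of a single host with a greedy "start whatever fits" scheduler. It is not afserver's actual scheduling code; the function name, the two-tick task duration and the staggered start are assumptions chosen only to mimic tasks finishing asynchronously.

```python
def simulate(host_capacity=1100, low_cap=500, high_cap=1000,
             low_tasks=1000, ticks=100):
    """Return (tick at which the high capacity task could start, max free capacity seen).

    Hypothetical model: every Job 1 task runs for 2 ticks, and two of them
    are already running with staggered finish times when Job 2 is submitted.
    """
    running = [1, 2]              # remaining ticks of the running Job 1 tasks
    low_left = low_tasks - 2      # Job 1 tasks still waiting
    max_free = 0
    for tick in range(ticks):
        # Advance time by one tick and drop finished tasks.
        running = [r - 1 for r in running if r - 1 > 0]
        free = host_capacity - low_cap * len(running)
        max_free = max(max_free, free)
        # Job 2 has the higher prio, so it is tried first.
        if free >= high_cap:
            return tick, max_free
        # Otherwise the free capacity is refilled with Job 1 tasks.
        while free >= low_cap and low_left > 0:
            running.append(2)
            low_left -= 1
            free -= low_cap
    return None, max_free


print(simulate())             # Job 2 never starts within 100 ticks
print(simulate(low_tasks=4))  # Job 2 starts only once Job 1 has run dry
```

In this model the free capacity never exceeds 600 while Job 1 still has waiting tasks, so the prio of Job 2 never gets a chance to matter.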

I think this is something that you know about and I can imagine that you already have a solution to this, do you?

Cheers
Oli
@sebastianelsner

timurhai · 2022-04-25T08:59:09Z

Hi Oli (sorry for the delay),

Unfortunately, there is no general solution for such situations.
If your renders have 1100 capacity, 500c tasks will never allow 1000c tasks to start. And for now I do not see a simple solution where 500c tasks should "go on pause".
But we do use low capacity tasks at work. Our common render capacity is 1500. The common task capacity is 1000. Tasks that have less than 500 capacity should be very lightweight; such tasks will not take the entire farm, or they can, but only for a short period of time.
Sometimes a user has "very heavy" tasks that can't run in parallel even with light tasks. In this case the user can set the capacity to 1500 to take all of the render's capacity.

ultra-sonic · 2023-03-02T08:12:14Z

hello again... this issue is becoming increasingly important at RISE at the moment bc we are about to run more than 1 task per host by default. the scenario will look like this:
host capacity is equal to the number of cores on the host. our renderfarm consists of a wild mix of 8, 12, 32, 40, 64, 128 and 256 core machines - roughly 800 nodes in total.

i have the following renderjobs in the farm:
easy - capacity 8
medium - capacity 64
heavy - capacity 256

as described earlier, the problem is that if the heavy job is submitted after the easy and medium jobs, it will not start bc the 256 core render nodes will be busy working on, let's say, 4 tasks of the medium job. since not all 4 tasks will finish at the same time, there will never be enough capacity until all easy and medium jobs are finished.

My temp fix for this would be to limit the max. tasks on all 256 core nodes (that match the job's hostmask) as long as there are heavy jobs with status RDY. This dynamic limiting could be done via a cron job that runs every minute.
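
A rough sketch of the decision the cron job would make each minute. Everything here is hypothetical: `Host`, `Job` and `max_tasks_limits` are illustrative names, not Afanasy API, and reading the farm state and applying the limits are left out. It also assumes that a limit of 1 is enough, since a host limited to one task only accepts new work once it is empty, and ignores the hostmask check.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    cores: int      # host capacity equals the number of cores


@dataclass
class Job:
    name: str
    capacity: int   # capacity needed per task
    state: str      # e.g. "RDY", "RUN", "DON"


def max_tasks_limits(hosts, jobs, heavy_capacity=256, limit=1):
    """Map host name -> max. tasks limit to apply while a heavy job waits.

    An empty dict means no heavy job is waiting, so the cron job
    would restore the default limits.
    """
    heavy_waiting = any(
        j.state == "RDY" and j.capacity >= heavy_capacity for j in jobs
    )
    if not heavy_waiting:
        return {}
    # Limit every host that is big enough to run a heavy task.
    return {h.name: limit for h in hosts if h.cores >= heavy_capacity}


hosts = [Host("n1", 256), Host("n2", 64), Host("n3", 256)]
jobs = [Job("medium", 64, "RUN"), Job("heavy", 256, "RDY")]
print(max_tasks_limits(hosts, jobs))
```

This limits all big hosts at once, regardless of how many heavy tasks are actually waiting or what their prio is.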

I can imagine that the above temp fix could be integrated into afserver much more elegantly, but I realize that this takes some time, and maybe you can come up with a much smarter solution for this issue. Can you? 😉

One thing that afserver can do, which is not that easy to re-implement in a cron job, is limiting the max. tasks only on a specific number of hosts based on the "need" of the heavy job, bc I do want medium jobs to be scheduled on the 256 core nodes if their priority is a lot higher. If we don't take the prio into account, then low prio heavy jobs would take away resources from high prio medium jobs. Does that make sense?
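
To illustrate what "need" plus prio could mean, here is a small sketch. It is only one possible reading: `prio_margin` stands in for "a lot higher" and is an invented knob, the `Job` fields are hypothetical, and it assumes one heavy task occupies one whole 256 core host.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    capacity: int      # capacity needed per task
    priority: int
    ready_tasks: int   # tasks waiting to start


def count_hosts_to_reserve(jobs, num_big_hosts,
                           heavy_capacity=256, prio_margin=100):
    """Number of big hosts to hold back for waiting heavy tasks."""
    heavy = [j for j in jobs
             if j.capacity >= heavy_capacity and j.ready_tasks > 0]
    others = [j for j in jobs
              if j.capacity < heavy_capacity and j.ready_tasks > 0]
    top_other_prio = max((j.priority for j in others), default=0)
    # A heavy job only counts if no smaller job outranks it by the margin.
    needed = sum(j.ready_tasks for j in heavy
                 if j.priority + prio_margin > top_other_prio)
    # Never reserve more hosts than exist.
    return min(needed, num_big_hosts)


jobs = [Job("medium", 64, 60, 100), Job("heavy", 256, 50, 3)]
print(count_hosts_to_reserve(jobs, num_big_hosts=10))
```

With the medium job at prio 60 the heavy job gets 3 hosts held back; raise the medium job to prio 200 and nothing is reserved, so the high prio medium job keeps the big nodes.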

cheers
Oli

ultra-sonic · 2023-03-16T06:39:10Z

Hi Timur,
sorry to bother you again...could you think of a way to implement this?
cheers
Oli

timurhai · 2023-03-16T10:27:19Z

Hi Oliver!
Sorry, I did not write an answer.
But I am smoking this!

ultra-sonic · 2023-03-27T15:11:53Z

Hi Timur,
by "smoking this" do you mean you are thinking of a solution, or is this impossible to implement?
We already have a name for it: "The capacity dilemma" 😉

timurhai · 2023-03-29T17:59:06Z

I am thinking about the solution.

ultra-sonic mentioned this issue Apr 14, 2023

job picked up by render-node from different pool #573

Closed
