Low capacity jobs blocking high capacity jobs with higher prio #540
Comments
Hi Oli, (sorry for the delay) Unfortunately, there is no general solution for such situations.
Hello again... this issue is becoming increasingly important at RISE at the moment because we are about to run more than 1 task per host by default soon.

The scenario will look like this: I have the following renderjobs in the farm:

As described earlier, the problem is that if the heavy job is submitted after the easy and medium jobs, it will not start because the 256-core rendernodes will be busy working on, let's say, 4 tasks of the medium job. Since not all 4 tasks will finish at the same time, there will never be enough capacity until all easy and medium jobs are finished.

My temp fix for this would be to limit the max. tasks on all 256-core nodes (that match the job's hostmask) as long as there are heavy jobs with status RDY. This dynamic limiting could be done via a cron job that runs every minute.

I can imagine that the above temp fix could be integrated into afserver much more elegantly, but I realize that this takes some time, and maybe you can come up with a much smarter solution for this issue. Can you? 😉

One thing that afserver can do which is not that easy to re-implement in a cron job is limiting the max. tasks only on a specific number of hosts based on the "need" of the heavy job, because I do want medium jobs with higher prio to be scheduled on the 256-core nodes if their priority is a lot higher. If we don't take the prio into account, then low-prio heavy jobs would take away resources from high-prio medium jobs. Does that make sense?

Cheers
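To make the "limit only as many hosts as the heavy job needs" idea concrete, here is a rough Python sketch of the selection step such a cron job might run. It does not touch afserver or any Afanasy API; `HeavyJob`, `hosts_to_limit`, the host names and the capacity numbers are all made up for illustration, and "a lot higher" is simplified to a plain priority comparison.

```python
from dataclasses import dataclass


@dataclass
class HeavyJob:
    # Hypothetical record, not an Afanasy data structure.
    name: str
    priority: int
    ready_tasks: int   # number of tasks with status RDY
    capacity: int      # capacity a single task needs


def hosts_to_limit(host_capacity, heavy_jobs, max_other_priority):
    """Pick the hosts whose max. tasks should be limited so they can drain.

    host_capacity: dict of host name -> total capacity (assumed input)
    max_other_priority: highest priority among the competing lighter jobs
    """
    limited = []
    candidates = sorted(host_capacity)  # deterministic order
    # Serve the highest-priority heavy job first.
    for job in sorted(heavy_jobs, key=lambda j: -j.priority):
        if job.priority < max_other_priority:
            # Higher-prio medium jobs keep the big hosts.
            continue
        needed = job.ready_tasks
        for host in list(candidates):
            if needed == 0:
                break
            if host_capacity[host] >= job.capacity:
                limited.append(host)
                candidates.remove(host)
                needed -= 1
    return limited


hosts = {"node256a": 1100, "node256b": 1100, "node64": 400}
heavy = [HeavyJob("heavy", priority=200, ready_tasks=1, capacity=1000)]
print(hosts_to_limit(hosts, heavy, max_other_priority=50))
print(hosts_to_limit(hosts, heavy, max_other_priority=250))
```

Only one of the two big nodes is limited in the first call, because the heavy job has a single RDY task; in the second call nothing is limited, because a competing job has higher priority.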
Hi Timur,
Hi Oliver!
Hi Timur,
I am thinking about the solution.
Hi Timur,
today I am reporting an issue that has been bugging us for a while now and is becoming increasingly important right now.
The scenario is simple:
Rendernodes all have a total capacity of 1100
Job 1 has prio 50 and 1000 tasks that need a capacity of 500 each.
Job 1 has already started, and its tasks finish asynchronously, leaving just 600 capacity free at any time and preventing other, higher-capacity tasks from starting.
Job 2 has prio 200 and 1 task that needs 1000 capacity, but it can never start because there is never enough capacity left.
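The numbers above can be replayed with a small Python toy model of a single host. This is only a sketch and not how afserver schedules: the step-based timing, `LOW_DURATION`, and the `reserve` switch (stop starting new Job 1 tasks while the Job 2 task is waiting) are assumptions made for illustration.

```python
HOST_CAPACITY = 1100   # total capacity of one rendernode
LOW_CAP = 500          # capacity of one Job 1 task (prio 50)
HEAVY_CAP = 1000       # capacity of the single Job 2 task (prio 200)
LOW_DURATION = 2       # hypothetical run time of a Job 1 task, in steps


def heavy_start_step(low_queued, reserve):
    """Return the step at which the Job 2 task can start on one host."""
    # Job 1 is already running: two tasks that finish one step apart.
    running = [1, 2]
    step = 0
    while True:
        # Advance time by one step and drop the tasks that finished.
        running = [t - 1 for t in running if t > 1]
        free = HOST_CAPACITY - LOW_CAP * len(running)
        # The higher-priority job is considered first.
        if free >= HEAVY_CAP:
            return step
        # Greedy backfill with Job 1 tasks, unless the host is reserved.
        while not reserve and low_queued > 0 and free >= LOW_CAP:
            running.append(LOW_DURATION)
            low_queued -= 1
            free -= LOW_CAP
        step += 1


print(heavy_start_step(low_queued=998, reserve=False))  # greedy backfill
print(heavy_start_step(low_queued=998, reserve=True))   # host drains first
```

Without the reservation, every freed 500 is immediately refilled by Job 1, so the free capacity stays at 600 or below until Job 1's queue is empty. With it, the host drains as soon as the two running tasks finish.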
I think this is something that you know about, and I can imagine that you already have a solution to this. Do you?
Cheers
Oli
@sebastianelsner