Scheduling, "soft" realtime, and CPU affinity.

As many of you will already know, the stock Linux scheduler allows you to flag a process to be scheduled with some "soft" realtime characteristics. This can be excellent in some contexts, but it's not a very good idea if not used wisely. More on that later...

Let's go over a few simple ideas and some history regarding process scheduling. A multiprocess OS needs to manage the processes in execution so that they can share the limited hardware resources (particularly the CPU) and run in a seemingly parallel manner. Back in the days of single-core, single-processor systems (it seems like a long time ago, but most desktop computers up to 2005 or so were exactly that) nothing was really running concurrently on your CPU, but you got that impression nevertheless. Fortunately, processors run at a high enough frequency that we can interleave the execution of these processes into small slices or chunks of time during which a particular process occupies the CPU. These slices need to be big enough for the CPU to get some real work done for the process and amortize the cost of the context switch, but not so long that we cannot execute a bunch of processes in a seemingly parallel fashion. Just so you get an idea, imagine 8 concurrent processes and time slices of a quarter of a second: each process would run for 0.25 secs and then stall for another 1.75 secs. If you hate your computer now, imagine what that would be like! A typical time-slice period or quantum has been 10ms (100Hz), which lets the scheduler hand out 100 slices per second. That sounds better, doesn't it? It becomes obvious very quickly how important the scheduling algorithm is: deciding who should run next is definitely not random, and plain round-robin is simply not good enough.

Preemptive multitasking and the preemptible kernel were a huge step forward: they meant that a process could be interrupted and removed from the CPU, to continue later, even while it was in the middle of execution and actively occupying the CPU. This is what makes the time-slicing mechanism described above possible. Before this, in the old days, processes had to relinquish the CPU voluntarily, imagine that! I wouldn't trust another coder to let me borrow some of "his" (ha!) resources, would you? Thankfully, you can still yield if you wish to do so. Within the kernel not everything is preemptible; some code (e.g. interrupt handlers, the scheduler itself) must run to completion to avoid race conditions or leaving the system in an incoherent state.
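
In case you've never seen voluntary yielding in action, here's a minimal sketch in plain C using the standard sched_yield() call (nothing fancier than that, and purely illustrative):

```c
/* yield.c - minimal sketch of a process voluntarily giving up the CPU.
 * Build with: gcc -o yield yield.c
 */
#include <sched.h>
#include <stdio.h>

int main(void)
{
    int i;

    for (i = 0; i < 5; i++) {
        printf("doing a bit of work (%d)\n", i);
        /* Politely step off the CPU; the scheduler may well put us
         * right back on if nothing else of equal or higher priority
         * is runnable. */
        sched_yield();
    }
    return 0;
}
```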

Anyway, as you already know, things like priorities were introduced, and detailed book-keeping is kept for each process regarding things like the last time it executed, its I/O, etc. All of this is an attempt to be as efficient as possible when running processes, while also allowing some sort of control from userspace over how these processes are scheduled. You all know how the nice and renice commands let us set the nice level from -20 to 19, where -20 is maximum priority and 19 is minimum priority. But that may not be enough for your userspace process. What if you need some sort of guarantee about how that process executes? If you've experimented with nice, you probably already know that it doesn't always offer the spectacular results you've come to expect: a process can still be preempted by a process with a higher priority, and also by a process with the same priority. This is when realtime, soft realtime in our case, comes into play. It is an attempt to ensure that a given process will not be interrupted by any ordinary (or lower-priority) process. Realtime processes are typically scheduled in round-robin or FIFO fashion.
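
For completeness, here's a hedged little sketch of doing programmatically what nice does from the shell, via the standard setpriority()/getpriority() calls (the value 10 is just an example):

```c
/* benice.c - sketch: renice the calling process from inside the code.
 * Roughly equivalent to launching it with `nice -n 10 ./benice`.
 */
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <errno.h>

int main(void)
{
    int prio;

    /* A "who" of 0 means "the calling process"; 10 is an arbitrary,
     * fairly nice value. */
    if (setpriority(PRIO_PROCESS, 0, 10) == -1) {
        perror("setpriority");
        return 1;
    }

    /* getpriority() can legitimately return -1, so clear and check
     * errno to distinguish that from a failure. */
    errno = 0;
    prio = getpriority(PRIO_PROCESS, 0);
    if (prio == -1 && errno != 0) {
        perror("getpriority");
        return 1;
    }
    printf("now running at nice level %d\n", prio);
    return 0;
}
```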

When assigning realtime attributes to a process you must specify the scheduling policy to use. SCHED_OTHER is used for most typical/regular tasks; these tasks have a static priority of 0. SCHED_RR and SCHED_FIFO are used for time-sensitive, special tasks, and are assigned a static, non-zero priority level between 1 and 99. Of these, only SCHED_FIFO tasks run without being limited to a timeslice, and they only get preempted by higher-priority realtime tasks. Effectively, the highest-priority SCHED_FIFO tasks relinquish the CPU only when they block or yield it themselves. These are the most critical tasks in the system. SCHED_RR tasks have priority levels and privileges much like SCHED_FIFO, except that each of them is limited to a timeslice. SCHED_FIFO and SCHED_RR normally require super-user privileges.
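
To give you a concrete feel for the API, here's a hedged sketch of a process switching itself to SCHED_FIFO with the standard sched_setscheduler() call (run it as root, or it will fail with EPERM; the priority offset is just an example):

```c
/* go_rt.c - sketch: switch the calling process to SCHED_FIFO.
 * Needs root (or a suitable RLIMIT_RTPRIO) to succeed.
 */
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(void)
{
    struct sched_param sp;

    memset(&sp, 0, sizeof(sp));
    /* Stay inside the valid range for the policy instead of
     * hard-coding a number between 1 and 99. */
    sp.sched_priority = sched_get_priority_min(SCHED_FIFO) + 10;

    /* A pid of 0 means "the calling process". SCHED_RR works the same
     * way, it just adds a timeslice on top. */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
        fprintf(stderr, "sched_setscheduler: %s\n", strerror(errno));
        return 1;
    }

    printf("now SCHED_FIFO at priority %d\n", sp.sched_priority);
    /* ... time-critical work would go here ... */
    return 0;
}
```

From the shell, chrt (part of util-linux) does the same thing, e.g. chrt -f 10 ./mytask starts mytask as SCHED_FIFO at priority 10.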

As you can see, this gives the user quite a bit of power over the way processes execute. It's obviously a double-edged sword: if you assign realtime properties to a very greedy process that refuses to yield the CPU, you may end up with an unresponsive system. Which sucks, obviously, because if for any reason your userspace code enters an infinite loop or whatever, you'll freeze the system. I mean, having a userspace app freeze a system is really kind of unacceptable (as a note, there was a patchset written so that this sort of behavior could be curbed; nice read and code starting here: kerneltrap).

Now that we've covered the different scheduling policies, we can add multicore architectures and process affinities into the mix. As you all know, we can tell a process to run on a specific core. What I wanted to see is how the scheduling policies described above affect free processes (i.e. processes not pinned to any particular CPU) under different conditions of CPU load across all available cores. So I concocted a little app (I'm pretty sure there's something better out there, but I couldn't quickly find anything, so I wrote this myself); you may find it here: mkload on github. The code does precisely what the name suggests: it creates a load, and you decide how much. You can then pin these tasks to specific cores (see the affinity sketch after the questions below for one way to do the pinning). Once that's done, you can launch one more instance of mkload with no CPU affinity set, and play around with the scheduling policy or process priorities to evaluate how that process actually gets scheduled:

  • How does the process move between cores?
  • How much load can that process actually generate?
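
If you want to do the pinning yourself instead of relying on an external tool like taskset, here's a sketch of how it could be done with sched_setaffinity() (purely illustrative, not the actual mkload code):

```c
/* pin.c - sketch: pin the calling process to one core and burn CPU.
 * From the shell, `taskset -c 2 ./some_load` achieves the same pinning.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    cpu_set_t set;
    int core = (argc > 1) ? atoi(argv[1]) : 0;  /* core index, e.g. 2 */

    CPU_ZERO(&set);
    CPU_SET(core, &set);

    /* A pid of 0 means "the calling process". */
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("pinned to core %d, burning CPU...\n", core);
    for (;;)
        ;                   /* stand-in for whatever load you generate */
    return 0;
}
```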

I should be posting some actual results here, but the truth is this is all a bit hazy because this post has been in the pipeline for a long time: I saved it as a draft months ago and never got around to finishing it properly. If I actually find the results, I'll put them here, but in the meantime, just go crazy and experiment!!! What I do remember is that the behavior was not always what one would naively expect. For instance, under certain scheduling conditions the free process would end up settling down on a given core, or couple of cores, instead of moving around freely so that all tasks as a whole could generate the expected load. The reason is probably that switching cores constantly can prove detrimental (e.g. cache issues; remember that the L1 and L2 caches on Ivy Bridge, for instance, are not shared across cores). If this interests you, give it a try! The way I conducted my tests was:

  • N+1 mkload processes. N pinned, 1 "free".
  • The N pinned processes with the same load and scheduling policy.
  • The free process set to whatever load you feel appropriate, but interesting values would be 100% load or whatever remaining CPU time the N pinned processes leave.
  • Check with top, htop, or ps how the free process behaves (a small sketch for reporting the current core from inside the process follows this list).
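
If you want the process itself to report which core it's on, in addition to what top/htop/ps show from the outside, a sketch using sched_getcpu() could look like this (again, just an illustration, not part of mkload):

```c
/* whereami.c - sketch: periodically report which core we are running on.
 * The PSR column of `ps -o pid,psr,comm -p <pid>` shows the same thing
 * from the outside.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    for (;;) {
        /* sched_getcpu() returns the CPU the caller is currently on;
         * it can of course change the moment after we read it. */
        printf("pid %d on core %d\n", (int)getpid(), sched_getcpu());
        sleep(1);
    }
    return 0;
}
```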

I hope to get those results sometime in the near future, and hopefully a script to launch the experiment (processes, affinities, etc.) automatically. But if I don't, please forgive me and just have some fun!

Written on January 11, 2013