Multicore Scheduling (continued)

Partitions with Page Coloring

A key mechanism we use is page coloring. Page coloring takes advantage of the virtual memory system that translates virtual memory addresses to physical addresses. This mechanism assigns physical addresses that do not interfere with each other to different tasks running on different cores. Page coloring works in combination with characteristics of the memory hardware that divides the memory into areas that do not interfere with each other. Different mechanisms exist for cache and main memory.

Cache Partitions (Cache Coloring)

Cache is fast memory that is used in the memory system to speed up the access to frequently used data (e.g. variables). Specifically, when a variable is first accessed, it is loaded into cache so that subsequent accesses to the same variable are performed from the cache instead of from main memory much faster. However, caches are much smaller than main memory, and as the program executes it will stop using some variables. The cache system knows how frequently each variable is accessed, so when the cache must add a newly accessed variable but the cache is full, the system clears the least frequently accessed cache block to make room. Accessing that cleared variable will take additional time because it must be loaded back from the slower main memory, as it was the first time it was loaded.

Most cache hardware divides the cache into sets of cache blocks in what is known as set-associativity. Each set is restricted to be used for certain area of the physical memory and cannot be used by any other. We take advantage of this restriction and ensure that the physical memory used by one task in one core belongs to one of these regions, while the memory of another task running on a different core belongs to a different one. This is known as cache coloring, which effectively creates cache partitions. 

Memory Bank Partitions (Bank Coloring)

While cache coloring provides a great benefit to reduce the interference across cores, it's not sufficient to solve the problem. Main memory is another source of significant interference. In fact, the experimental results shown in the figure above are due to memory interference, not cache interference. Main memory is divided into regions called banks, and these banks are organized into rows and columns. When a task running in a core tries to access a memory address in main memory, this address first is analyzed to extract three pieces of information (from specific bits in the memory address): (i) the bank number, (ii) the row number, (iii) the column number. The bank memory is used to select the bank where the memory block is located. Then the memory controller loads the row from that bank into a row buffer within the bank for faster access. Finally the memory block is accessed in the column indicated by the column number from the row buffer. This is illustrated in the figure below. 

Illustrates how memory bank partitioning affects execution latency in multicore processors 
Because the memory controller is optimized to improve the number of memory accesses per second, it takes advantage of the row buffer and favors the memory accesses that go to the same row. Unfortunately, this means that when a task 1 in one core is accessing a row (already loaded in the row buffer) while a task 2 running in another core is trying to access another row in the same bank, the access from task 2 can be moved back in the memory access queue by another, more recent access from task 1 to the already-loaded row multiple times, creating an important delay for task 1.

Memory bank partitions are created by mapping the memory of the different tasks to different memory banks. In this way each task can have its own bank and row buffer, and no other task will modify that buffer or the queue of memory accesses to this bank. 

Combined Cache and Bank Partitions

Because caches and memory banking technologies were not developed together, their partitions often intersect each other. In other words, it is not possible to select a bank color independently from a cache color because the selection of a cache color may limit the number of bank colors available. This occurs because, in some processor architectures, the address bits used to select a bank and the bits used to select a cache set share some elements. To illustrate this idea, consider a memory system with four banks and four cache sets. In this case, we need two address bits to select a bank and two bits to select a cache set. If they were independent, we could select four cache colors for each selectable bank color for a total of 16 combinations. We can visualize this as a matrix (a color matrix) where rows are cache colors and columns are bank colors. However, if cache and bank colors share one bit, then in reality we only have 23=8 colors. This means that in the color matrix some of the cells will not be real. The figure below illustrates this concept.


Coordinated Cache, Bank, and Processor Allocation

We developed a coordinated approach to allocate cache and bank colors, along with processor, to tasks in order to avoid the cache/bank color conflicts. Our approach also maximizes the effectiveness of the memory partitions, by taking into account the difference between inter- and intra-core interference.

We developed memory reservations with cache and memory partitions in the Linux/RK OS.

Limited Number of Partitions

Unfortunately, the number of partitions obtainable through page coloring is limited. For instance, for an Intel i7 2600 processor, it is possible to obtain 32 cache colors and 16 bank colors. Given that, in practice, we may have a larger number of tasks (say 100), this number of partitions may prove insufficient for a real system. As a result, it is important to also enable the sharing of partitions whenever the memory bandwidth requirements of tasks allow it. However, this sharing must be done in a predictable way to ensure we can guarantee meeting task deadlines. At the same time, it is important to avoid pessimistic over-approximations, in order not to waste the processor cycles that we are trying to save in the first place. For this case, we developed an analysis algorithm that allows us to verify the timing interference of private and shared memory partitions.

Parallelized Tasks Scheduling

Beyond solving the resource-sharing problem, we also need to enable the execution of parallelized tasks. For this we have developed a global EDF scheduling algorithm for parallelized tasks with staged execution. These tasks generate jobs composed of a set of sequential stages. These stages are further composed of a set of parallel segments. These segments are allowed to run in parallel to each other, provided that all the segments from the previous stage have completed, or the task has arrived for the first segments of the first stage. Our algorithm allows us to verify the schedulability of these tasks with a global EDF scheduler. Beyond EDF, it is possible to use this algorithm with a global fixed priority scheduler with synchronous start, harmonic periods, and implicit deadlines, a common configuration used by practitioners.