Predictable Execution on GPU Caches
Increased computational requirements of embedded software in real-time domains such as automotive and avionics demand performance that cannot be offered by the traditional single-core systems used in these fields. While more powerful platforms, including multi-core CPUs and heterogeneous CPU + accelerator systems-on-chip (SoCs), have long been available on the consumer market, their adoption in safety-critical fields has been slow due to rigorous safety certification requirements. Enabling the use of heterogeneous platforms in safety-critical systems not only offers a major increase in performance, but also promises improved energy efficiency through the use of more efficient accelerators. However, it is characteristic of these platforms that the operation of one core may interfere with another due to the sharing of hardware resources. Therefore, one of the most important factors for certification is to provide guarantees on freedom from interference within the system, enabling strict guarantees that real-time tasks complete before their deadlines.
Within the HERCULES project, IIS is collaborating with research groups and companies around Europe to create a software framework for timing-predictable execution on commercial off-the-shelf (COTS) heterogeneous platforms. This framework implements the PRedictable Execution Model (PREM) on the NVIDIA platform and consists of a PREM-enabling compiler, a memory-schedule-aware hypervisor, soft and hard real-time operating systems (Linux, Erika), and low-level mechanisms to enforce the freedom-from-interference property. The work at IIS is primarily focused on the CPU/GPU compiler, as well as on mechanisms for enforcing memory operations on the GPU side.
The foundation of the PRedictable Execution Model (PREM) is the separation of memory and compute operations within programs, such that memory operations can be scheduled independently to provide freedom from interference -- i.e., only a single core is able to use the memory system at any point in time. To avoid stalling the program while the system is not permitting its memory accesses, the memory phase copies all data needed for computation into core-local private memories, such that the compute phase can execute independently of the memory system.
The current PREM-enabling GPU compiler relies on the GPU scratchpad memory (CUDA shared memory): because this memory is software managed and does not suffer from unpredictable cache replacement policies, data can be guaranteed to be available during the full PREM compute phase. However, evaluation has shown that the limited size of the scratchpad memory becomes a significant bottleneck for the effectiveness of the approach, due to frequent reloading of data. This limitation can be lifted by using the 5x larger last-level cache in place of the scratchpad memory.
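To illustrate the phase separation described above, the following is a minimal, hypothetical sketch of what a PREM-style CUDA kernel looks like after transformation (it is not output of the actual HERCULES compiler; the kernel name, tile size, and scaling operation are chosen for illustration). All data needed by a thread block is first staged into the software-managed scratchpad, so that the compute phase issues no traffic to the shared memory system.

```cuda
#define TILE 256  // illustrative tile size, bounded by scratchpad capacity

// Hypothetical PREM-transformed kernel: scale a vector by alpha,
// with explicitly separated memory, compute, and write-back phases.
__global__ void prem_scale(const float *in, float *out, float alpha, int n)
{
    __shared__ float tile[TILE];      // software-managed scratchpad (shared memory)
    int i = blockIdx.x * TILE + threadIdx.x;

    // --- Memory phase: copy all data needed for computation into the
    //     scratchpad. Only this phase accesses main memory for reads.
    if (i < n)
        tile[threadIdx.x] = in[i];
    __syncthreads();                  // phase boundary for the block

    // --- Compute phase: operates on scratchpad-resident data only,
    //     so it runs regardless of whether DRAM access is granted.
    float result = 0.0f;
    if (i < n)
        result = alpha * tile[threadIdx.x];
    __syncthreads();                  // phase boundary for the block

    // --- Write-back phase: results are flushed to main memory,
    //     again schedulable independently of the compute phase.
    if (i < n)
        out[i] = result;
}
```

The `TILE`-sized scratchpad buffer is exactly where the size limitation discussed above bites: data sets larger than the tile must be reloaded across repeated memory phases, which is what motivates investigating the larger last-level cache as an alternative staging area.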
The goal of this project is to shed light on the positive and negative effects of using hardware-managed caches with unpredictable replacement policies in place of a software-managed cache, and to evaluate this from both a performance and a predictability (i.e., freedom from interference) perspective.
- Looking for 1 Interested Master Student (Semester/Master Project)
- Supervision: Bjoern Forsberg
The exact contents of this project will be discussed in detail with interested students to account for their particular interests, and to ensure alignment with the short- and long-term goals of the HERCULES project.
- Knowledge of memory hierarchies
- Knowledge of GPU programming (CUDA)
- Interest in compiler techniques
- Interest in real-time systems (this project does not cover real-time schedulability analysis)
The course Energy-Efficient Parallel Computing Systems for Data Analytics provides a good foundation for the first three points on the list. Experience from Compiler Design and Embedded Systems is an advantage.