Fast Accelerator Context Switch for PULP
- Type: Master Thesis
- Professor: Prof. Dr. L. Benini
PULP (Parallel Ultra-Low-Power) is an open-source multi-core computing platform. It consists of an advanced microcontroller architecture with a parallel computing cluster composed of eight fully programmable 32-bit RISC-V processing elements (PEs) featuring DSP extensions that target energy-efficient digital signal processing. This computing cluster serves as an accelerator.
ControlPULP is a specialized version of PULP that focuses on predictability. It is used in the European Processor Initiative (EPI), where it plays the part of a Power Controller Subsystem (PCS) that dynamically adjusts the operating point of a High Performance Computing (HPC) processor to meet energy, power, and thermal constraints. The Power Control Firmware (PCF) running on the PCS executes a multi-task control law that involves optimal operating point computation and complex multi-input, multi-output interaction with the external world. To fulfill this task, the FreeRTOS real-time OS layer is employed to schedule the PCF tasks.
The parallel computing cluster is used as an accelerator to parallelize the Power Control Firmware (PCF) running on the PCS architecture. In the current implementation, the fabric controller, i.e. the 32-bit manager core, can offload a task to the computing cluster in one of two ways:
1. Synchronous (blocking) offload: the manager core has to wait (poll) until the computing cluster finishes the task. When this happens, the manager core regains control and can offload another computation;
2. Asynchronous (non-blocking) offload: the manager core can perform other tasks while the cluster is busy with the offloaded task. When the cluster finishes, it notifies the fabric controller through an interrupt-triggered callback.
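The two offload modes above can be sketched with a small host-side model in C. This is only an illustration under stated assumptions: `cluster_t`, `cluster_submit`, `cluster_run_and_irq`, `offload_sync`, `offload_async` and the two sample tasks are hypothetical names, not part of the ControlPULP or PULP runtime, and the cluster is simulated sequentially instead of running concurrently with the manager core.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical host-side model of the cluster; NOT the ControlPULP API. */
typedef void (*task_fn)(void *arg);
typedef void (*done_cb)(void *arg);

typedef struct {
    task_fn task;     /* task currently owned by the cluster */
    void   *task_arg;
    done_cb on_done;  /* run on the completion interrupt, may be NULL */
    void   *cb_arg;
    bool    busy;
} cluster_t;

/* Hand a task to the cluster. Fails if the cluster already holds one,
 * mirroring the "one task at a time" limitation. */
bool cluster_submit(cluster_t *c, task_fn task, void *arg,
                    done_cb cb, void *cb_arg)
{
    assert(c != NULL && task != NULL);
    if (c->busy)
        return false;
    c->task = task;
    c->task_arg = arg;
    c->on_done = cb;
    c->cb_arg = cb_arg;
    c->busy = true;
    return true;
}

/* Stand-in for the cluster executing the task and raising its
 * end-of-computation interrupt. On hardware this runs concurrently. */
void cluster_run_and_irq(cluster_t *c)
{
    if (!c->busy)
        return;
    c->task(c->task_arg);
    c->busy = false;
    if (c->on_done != NULL)
        c->on_done(c->cb_arg);  /* interrupt-triggered callback */
}

/* Synchronous offload: the manager core does not return until the
 * cluster is idle again. */
bool offload_sync(cluster_t *c, task_fn task, void *arg)
{
    if (!cluster_submit(c, task, arg, NULL, NULL))
        return false;
    while (c->busy)
        cluster_run_and_irq(c);  /* model only: hardware would just poll */
    return true;
}

/* Asynchronous offload: returns immediately; completion is reported
 * later through the callback. */
bool offload_async(cluster_t *c, task_fn task, void *arg,
                   done_cb cb, void *cb_arg)
{
    return cluster_submit(c, task, arg, cb, cb_arg);
}

/* Sample task and callback used for illustration. */
void add_one(void *arg)  { *(int *)arg += 1; }
void set_flag(void *arg) { *(bool *)arg = true; }
```

In this model, a second `offload_async` call issued while the cluster is still busy returns `false`, which is the gap the project addresses: there is no way to queue or interleave several accelerator tasks.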
The cluster can execute one task at a time, partitioned over the whole set of workers (8 cores) or a subset of it. However, neither support for multiple offloaded tasks nor the capability to switch between tasks is implemented.
Context switching is the process of saving the context, i.e. registers and other machine state, determining the next task to be scheduled, and finally restoring the context of the potentially new, different task. The computing cluster employed in ControlPULP currently lacks the optimization and flexibility needed for more real-time-oriented scenarios.
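The three steps of a context switch (save, schedule, restore) can be sketched in C on a simulated core. This is a minimal model under assumptions, not the software baseline of the project: the type and function names are hypothetical, the scheduling policy is a plain round-robin chosen for brevity, and the context holds only the 31 writable RV32I integer registers and the program counter. A real PE would carry further machine state that is not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_REGS 31  /* x1..x31 of RV32I; x0 is hardwired to zero */

/* Simplified context: integer registers and program counter only.
 * Additional machine state of a real core is NOT modeled. */
typedef struct {
    uint32_t regs[NUM_REGS];
    uint32_t pc;
} context_t;

typedef struct {
    context_t ctx;    /* saved context while the task is not running */
    int       ready;  /* non-zero if the task may be scheduled */
} task_t;

typedef struct {
    context_t core;   /* live state of one simulated core */
    task_t   *tasks;
    int       num_tasks;
    int       current;
} sched_t;

/* Round-robin: first ready task after the current one. Falls back to
 * the current task if no other task is ready. */
int pick_next(const sched_t *s)
{
    for (int i = 1; i <= s->num_tasks; i++) {
        int cand = (s->current + i) % s->num_tasks;
        if (s->tasks[cand].ready)
            return cand;
    }
    return s->current;
}

/* Perform one context switch and return the index of the task that
 * now owns the core. */
int context_switch(sched_t *s)
{
    assert(s != NULL && s->num_tasks > 0);
    s->tasks[s->current].ctx = s->core;  /* 1. save the outgoing context */
    int next = pick_next(s);             /* 2. determine the next task   */
    s->core = s->tasks[next].ctx;        /* 3. restore its context       */
    s->current = next;
    return next;
}
```

In software, steps 1 and 3 amount to a sequence of stores and loads per register, so the cost grows with the size of the context. That cost is what a hardware-assisted mechanism would try to reduce.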
The goal of this project is to add context switching and virtualization capabilities to the cluster of ControlPULP, allowing the operating system to schedule several accelerator tasks and let them run concurrently. This allows:
1. Higher utilization of the cluster compute resources. For example, accelerator tasks that wait for data from the DMA can block and let other accelerator tasks continue.
2. An intrinsic benefit when a task is not embarrassingly parallel: in the current implementation, offloading it to all the workers wastes resources (diminishing returns arising from the nature of the algorithm), while partitioning the job over a subset of workers simply leaves the remaining cores unused. Letting another task use those cores again leads to higher utilization of the cluster resources and prevents idle waiting.
3. Advanced resource partitioning to improve real-time guarantees. By using hardware partition IDs, we can restrict and regulate usage of the accelerator.
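The idea of giving each accelerator task its own subset of the eight workers can be sketched with a bitmask allocator in C. This is an illustrative software model only: `cluster_alloc_t`, `alloc_cores`, `release_cores` and `idle_cores` are hypothetical names, the first-fit policy is an assumption, and the sketch does not represent the hardware partition ID mechanism mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CORES 8  /* workers in the cluster */

/* Hypothetical allocator state: bit i set means core i is idle. */
typedef struct {
    uint8_t free_mask;
} cluster_alloc_t;

void alloc_init(cluster_alloc_t *a)
{
    a->free_mask = 0xFF;  /* all eight cores idle */
}

/* Reserve n idle cores (lowest indices first). Returns the mask of
 * granted cores, or 0 if fewer than n cores are idle. */
uint8_t alloc_cores(cluster_alloc_t *a, int n)
{
    assert(a != NULL && n > 0 && n <= NUM_CORES);
    uint8_t granted = 0;
    int found = 0;
    for (int i = 0; i < NUM_CORES && found < n; i++) {
        uint8_t bit = (uint8_t)(1u << i);
        if (a->free_mask & bit) {
            granted |= bit;
            found++;
        }
    }
    if (found < n)
        return 0;  /* not enough idle cores; nothing is reserved */
    a->free_mask &= (uint8_t)~granted;
    return granted;
}

/* Return previously granted cores to the idle pool. */
void release_cores(cluster_alloc_t *a, uint8_t mask)
{
    assert(a != NULL && (a->free_mask & mask) == 0);
    a->free_mask |= mask;
}

/* Number of cores that are currently idle. */
int idle_cores(const cluster_alloc_t *a)
{
    int count = 0;
    for (int i = 0; i < NUM_CORES; i++)
        count += (a->free_mask >> i) & 1;
    return count;
}
```

With this model, a task that scales poorly beyond two cores can be granted two workers while a second task takes four of the remaining six, so the two run side by side on disjoint cores instead of one of them leaving most of the cluster idle.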
Concretely, the project aims to:
- Implement context switching capabilities in software as a baseline.
- Identify opportunities to accelerate context switching and virtualization at the hardware level (e.g., handling task pipelining in hardware) and implement these in ControlPULP.
- Evaluate the resulting system in RTL simulation and/or on the FPGA, for which a mature implementation of ControlPULP already exists.
During the exploration, the effect of varying the number of cores in a single cluster instance can be investigated under different task scheduling policies and workloads.
Stretch goal: We intend to scale the power controller towards a multi-cluster configuration, where multiple clusters share the same L2 memory. The idea of scheduling multiple tasks to a single cluster can be extended to the multi-cluster domain. Investigating the case study developed during the project in this scenario would help to understand the effects of heterogeneous task scheduling on multiple hardware resources with real-time workloads.
- 10% Literature / architecture review
- 50% RTL implementation
- 30% Bare-metal C programming
- 10% Evaluation
- Strong interest in computer architecture
- Experience with digital design in SystemVerilog as taught in VLSI I
- Experience with low-level programming in C
 https://github.com/pulp-platform/pulp (GitHub repository)
 https://iis-git.ee.ethz.ch/giovanni.bambini/epi_pmu_ethz (Gitlab repository)