PULP in space - Fault Tolerant PULP System for Critical Space Applications
The miniaturization of increasingly complex integrated circuits (ICs) comes with a reduction of their noise margin, which makes such circuits more and more exposed to faults and failures. While in the past a certain reliability level could be achieved by discarding failing circuits after fabrication, such a technique would result in low yield and high costs if applied to modern processes.
The problem is even more pronounced in critical applications, such as those in the transport and aerospace industries, which require strict levels of reliability. Even at the altitude of commercial flights, where the atmosphere provides less shielding, electronic components are exposed to energetic cosmic rays, which increase the probability that an IC experiences a transient fault (soft error). Faster clocks, in turn, increase the probability that a signal spike induced by a cosmic ray is captured by a sequential element, taking the system into a faulty state.
Design for Reliability (DfR) is a must for such critical domains. Traditional fault-tolerant systems use massive redundancy schemes, such as Triple Modular Redundancy (TMR), to guarantee a given reliability level. In TMR, a module is replicated three times (more generally, an odd number of times), and a majority voter detects and corrects any discrepancy between the modules' outputs, as long as only a minority of the replicas is faulty.
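The voting step described above can be sketched in software. The snippet below is only an illustrative model, not part of the project's RTL: the function names `majority_vote` and `mismatch_mask`, the 8-bit example word, and the injected single-bit upset are all assumptions made for the example.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three replica outputs."""
    return (a & b) | (a & c) | (b & c)


def mismatch_mask(a: int, b: int, c: int) -> int:
    """Bits on which at least one replica disagrees with the voted value."""
    voted = majority_vote(a, b, c)
    return (a ^ voted) | (b ^ voted) | (c ^ voted)


# Hypothetical replica outputs: one replica suffers a single-bit upset.
golden = 0b1011_0010
faulty = golden ^ 0b0000_1000

voted = majority_vote(golden, faulty, golden)
flags = mismatch_mask(golden, faulty, golden)
```

Here `voted` recovers the fault-free word because two of the three replicas still agree on every bit, while `flags` marks the bit position where a discrepancy was observed. If two replicas were corrupted on the same bit, the voter would silently pick the wrong value, which is why the scheme assumes that only a minority of the replicas fails.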
In collaboration with the Federal University of Rio Grande do Norte (UFRN - Brazil) and the Brazilian National Institute for Space Research (INPE), we want to investigate a fault-tolerant low-power OpenPULP multicore cluster.
Several approaches can be explored. For example, a set of cores can run in lockstep, with an error being detected (and possibly corrected) as soon as they diverge, i.e., when their program counters or register contents differ. Alternatively, a core could run a checking task that verifies the execution of another core. The checking routines could also run only sporadically, for example at the end of each "critical function". Similar redundancy techniques can be applied to the memory side, using either Error Correcting Codes (ECC) or redundant memory banks.
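As a rough illustration of the lockstep idea, the following sketch models two toy cores that execute the same instruction stream and compares their program counter and register file after every cycle. Everything in it is hypothetical and chosen for brevity: the `Core` class, the `(dest, src, immediate)` add-only instruction format, and the `fault` tuple used to inject an upset are assumptions for the example and do not reflect the actual PULP cores.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Toy core state: a program counter and a small register file."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 4)

    def step(self, program):
        """Execute one (dest, src, immediate) add instruction."""
        dest, src, imm = program[self.pc]
        self.regs[dest] = (self.regs[src] + imm) & 0xFFFFFFFF
        self.pc += 1


def run_lockstep(program, fault=None):
    """Run two cores in lockstep; return the first diverging cycle, or None.

    fault is an optional, hypothetical (cycle, register, bit) upset that is
    injected into the second core only.
    """
    main, shadow = Core(), Core()
    for cycle in range(len(program)):
        main.step(program)
        shadow.step(program)
        if fault is not None and fault[0] == cycle:
            _, reg, bit = fault
            shadow.regs[reg] ^= 1 << bit  # flip one bit in the shadow core
        # Lockstep check: compare the architectural state of both cores.
        if (main.pc, main.regs) != (shadow.pc, shadow.regs):
            return cycle
    return None


program = [(0, 0, 5), (1, 0, 3), (2, 1, 1), (3, 2, 7)]
clean = run_lockstep(program)
upset = run_lockstep(program, fault=(2, 1, 4))
```

Without a fault the two cores never diverge and `clean` is `None`; with the injected bit flip, the comparison fails in the cycle where the upset occurs. Note that two cores are enough to detect a divergence but not to tell which core is correct; correcting the error needs either a third core with voting or a rollback to a known-good state.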
This thesis consists of three parts:
1. Architectural exploration of fault-tolerant techniques (~1 month).
2. Specification and RTL design on top of an OpenPULP cluster (~2-3 months).
3. FPGA evaluation of the implementation (~1-2 months).
If time permits and if you are interested, an ASIC implementation of the design is also feasible.
To work on this project, you will need:
- experience with at least one hardware description language (SystemVerilog, Verilog, or VHDL). Having followed the VLSI 1 / VLSI 2 courses is recommended.
- prior knowledge of hardware design and computer architecture
- Looking for 2 students for a semester project or 1 student for a master thesis
- Supervision: Matheus Cavalcante (firstname.lastname@example.org), Pirmin Vogel, Pasquale Davide Schiavone
Meetings & Presentations
The students and advisor(s) agree on weekly meetings to discuss all relevant decisions and decide on how to proceed. Of course, additional meetings can be organized to address urgent issues.
Around the middle of the project there is a design review, where senior members of the lab review your work (bring all the relevant information, such as preliminary specifications, block diagrams, synthesis reports, and testing strategy) to make sure everything is on track and to decide whether further support is necessary. They also make the final decision on whether the chip is actually manufactured (no reason to worry if the project is on track) and whether more chip area, a different package, or similar resources are provided. For more details refer to (1).
At the end of the project, you have to present and defend your work in a 15-minute presentation followed by 5 minutes of discussion as part of the IIS colloquium.
- Lorena Anghel. Les limites technologiques du silicium et tolérance aux fautes. PhD Thesis, Institut Polytechnique de Grenoble, 2001.
- Matheus Cavalcante. Conception d'une plateforme RISC-V avec le système d'exploitation temps réel Trampoline. Semester Thesis, Institut Polytechnique de Grenoble, 2017.
- OpenPULP. Available at (2)