PULP in space - Fault Tolerant PULP System for Critical Space Applications
The miniaturization of increasingly complex integrated circuits (ICs) is coupled with a reduction of their noise margin, which makes such circuits more and more exposed to faults and failures. While in the past a certain reliability level could be achieved by discarding failing circuits after fabrication, such a technique would result in low yield and high costs if applied to modern processes.
The problem is even more pronounced in critical applications, such as the transport and aerospace industries, which require strict levels of reliability. Even at the altitude of commercial flights, where the atmosphere offers less protection, electronic components are exposed to energetic cosmic rays, which increase the probability of an IC encountering a transient fault (soft error). Faster clocks also increase the probability that a signal spike caused by a cosmic ray is captured by a sequential element, putting the system into a faulty state.
Design for Reliability (DfR) is a must for such critical domains. Traditional fault-tolerant systems use massive redundancy schemes, such as Triple Modular Redundancy (TMR), to ensure a required reliability level. For example, a processor core is replicated an odd number of times, and a voting mechanism is used to detect and correct any discrepancies between the cores' outputs. While such schemes are relatively simple, they incur a high cost in terms of circuit area and power.
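The voting mechanism described above can be illustrated with a minimal software sketch. It assumes three replicas that each produce one output word, and it uses a bitwise 2-of-3 majority function. The names majority_vote and tmr_check are hypothetical and chosen only for this illustration. A real design would implement the voter in RTL.

```python
from typing import Tuple


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three redundant output words."""
    return (a & b) | (a & c) | (b & c)


def tmr_check(a: int, b: int, c: int) -> Tuple[int, bool]:
    """Return the voted word and a flag telling whether any replica disagreed."""
    voted = majority_vote(a, b, c)
    mismatch = not (a == b == c)
    return voted, mismatch


# Illustrative values (assumptions): one replica suffers a single bit flip.
golden = 0xDEADBEEF
upset = golden ^ (1 << 7)
voted, mismatch = tmr_check(golden, upset, golden)
```

With a single faulty replica the voted word equals the fault-free value and the mismatch flag is raised. Note that the voter masks any fault confined to one replica, but not faults that hit the same bit in two replicas.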
The goal of this project in collaboration with the Federal University of Rio Grande do Norte (UFRN - Brazil) and the Brazilian National Institute for Space Research (INPE) is to investigate architectural enhancements for building fault-tolerant low-power OpenPULP multicore clusters.
There are quite a few approaches at different architectural levels that can be explored. For example, a set of cores could be dynamically configured to run in lockstep, with an error being detected/corrected once they diverge (i.e., their program counters or register contents differ). Alternatively, a core could run a checking task that verifies the execution of another core. The checking routines can also be run sporadically, for example at the end of each "critical function". Moreover, flexible redundancy can be added to the cores themselves. For example, redundancy at the register-file level can be dynamically added at the cost of performance by reducing the number of available registers. Finally, redundancy techniques can also be applied to the memory side, either using Error Correcting Codes (ECC) or by using redundant memory banks.
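To make the ECC option concrete, the following is a minimal sketch of a Hamming(7,4) single-error-correcting code. The 4-bit data width and the function names hamming74_encode and hamming74_decode are assumptions made for brevity. A memory protecting 32-bit words would typically use a wider code with an additional parity bit for double-error detection, which is not shown here.

```python
from typing import Tuple


def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit codeword (bit positions 1..7)."""
    bits = [0] * 8  # index 0 is unused so that indices match positions
    # Data bits go to positions 3, 5, 6, 7 (LSB of the nibble first).
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> i) & 1 for i in range(4)]
    # Parity bits at positions 1, 2, 4 cover positions sharing that index bit.
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    return sum(bits[pos] << (pos - 1) for pos in range(1, 8))


def hamming74_decode(code: int) -> Tuple[int, int]:
    """Return (corrected nibble, syndrome); syndrome 0 means no error seen."""
    bits = [0] + [(code >> (pos - 1)) & 1 for pos in range(1, 8)]
    # XOR of the positions of all set bits is 0 for a valid codeword and
    # equals the position of the flipped bit after a single-bit error.
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    if syndrome:
        bits[syndrome] ^= 1  # correct the single-bit error
    nibble = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return nibble, syndrome


# Illustrative use: flip codeword bit position 6 and recover the data.
stored = hamming74_encode(0b1011)
recovered, syndrome = hamming74_decode(stored ^ (1 << 5))
```

This code corrects any single bit flip per codeword. Two flips in the same codeword would be miscorrected, which is why practical memory ECC usually adds double-error detection.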
For this project, the OpenPULP cluster will be used in combination with the Ibex processor core. Similar to RI5CY, Ibex implements the RV32IMC instruction set architecture and is open source. It was originally designed at ETH Zürich and the University of Bologna under the name Zero-riscy and is now maintained and developed by the not-for-profit company lowRISC C.I.C. In contrast to RI5CY, Ibex has a much simpler 2-stage pipeline and no custom hardware extensions. It is fully compliant with the RISC-V specification and comes with extensive documentation and industry-grade verification infrastructure. All of this facilitates the addition of new features, for example ones targeting core-level fault tolerance.
The project can either be done by a team of two students as a semester thesis, or by one student as a Master thesis. The project consists of three parts:
1. Familiarization with the architecture of the OpenPULP cluster and the Ibex processor core, and exploration of fault-tolerance techniques (~2 person months).
2. Specification and RTL design on top of an OpenPULP cluster with Ibex processor cores (~2 person months).
3. FPGA evaluation of the implementation (~1-2 person months).
If time permits and if you are interested, an ASIC implementation of the design is also feasible.
To work on this project, you will need:
- to have worked in the past with at least one RTL language (SystemVerilog or Verilog or VHDL). Having followed the VLSI 1 / VLSI 2 courses is recommended.
- to have prior knowledge of hardware design and computer architecture
- to be motivated to work hard on a super cool open-source project
Meetings & Presentations
The students and advisor(s) agree on weekly meetings to discuss all relevant decisions and decide on how to proceed. Of course, additional meetings can be organized to address urgent issues.
Around the middle of the project there is a design review, where senior members of the lab review your work (bring all the relevant information, such as preliminary specifications, block diagrams, synthesis reports, testing strategy, ...) to make sure everything is on track and decide whether further support is necessary. They also make the final decision on whether the chip is actually manufactured (no reason to worry if the project is on track) and whether more chip area, a different package, ... is provided. For more details refer to (1).
At the end of the project, you have to present/defend your work during a 15 min. presentation and 5 min. of discussion as part of the IIS colloquium.
- Lorena Anghel. Les limites technologiques du silicium et tolérance aux fautes. PhD Thesis, Institut Polytechnique de Grenoble, 2001.
- Matheus Cavalcante. Conception d'une plateforme RISC-V avec le système d'exploitation temps réel Trampoline. Semester Thesis, Institut Polytechnique de Grenoble, 2017.
- P.D. Schiavone, et al. "Slow and steady wins the race? A comparison of ultra-low-power RISC-V cores for Internet-of-Things applications", Proceedings of the 27th International Symposium on Power and Timing Modeling, Optimization and Simulation (PATMOS), Thessaloniki, Greece, 2017.