Implementing Configurable Dual-Core Redundancy

Overview

Status: Available

Introduction

All computing systems are vulnerable to runtime faults such as Single Event Upsets (SEUs), especially when deployed in high-radiation environments such as space. To combat this, IIS has developed a Triple Modular Redundancy (TMR) approach at the core level. It builds on the PULP cluster and adds features to enable and disable the redundancy at runtime (On-Demand Redundancy Grouping, ODRG) [1].

At ETH, we are developing our own many-core system called MemPool [2], [3]. It boasts 256 lightweight 32-bit Snitch cores [4]. They implement the RISC-V instruction set architecture (ISA), a modular and open ISA [5]. Despite its size, MemPool manages to give all 256 cores low-latency access to the shared L1 memory, with a zero-load latency of at most five cycles. Therefore, all cores can efficiently communicate, making MemPool suitable for various workloads and easy to program.

Project

While TMR works very well for fast recovery, using three cores for fault tolerance incurs a 200% overhead in area and power consumption, which is not tenable for certain use cases. The goal of this project is to implement configurable Dual Modular Redundancy (DMR) at the core level within a MemPool cluster, allowing a single tile to switch between a 4-core mode and a reliable 2-core mode. A two-core redundancy mechanism poses additional challenges compared to a three-core approach: a mismatch between the two cores reveals that a fault occurred, but not which core holds the correct state, so recovery requires more involved schemes.
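The difference between the two schemes can be illustrated with a small behavioral sketch. It is a software model only, not the project's hardware design: the function names tmr_vote and dmr_check, the 32-bit example values, and the choice of a single flipped bit are assumptions made for illustration.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three redundant outputs (hypothetical helper)."""
    return (a & b) | (a & c) | (b & c)


def dmr_check(a: int, b: int) -> bool:
    """Return True when two redundant outputs disagree (hypothetical helper)."""
    return a != b


golden = 0x0000_1234          # assumed fault-free core output
faulty = golden ^ (1 << 7)    # same output with a single bit flip in bit 7

# TMR: the two fault-free copies outvote the faulty one, so the
# correct value is recovered directly.
assert tmr_vote(golden, faulty, golden) == golden

# DMR: the comparison flags the mismatch, but the check is symmetric,
# so it cannot tell which of the two values is the correct one.
assert dmr_check(golden, faulty)
assert dmr_check(golden, faulty) == dmr_check(faulty, golden)
```

In this model, TMR masks the fault in the same step that detects it, whereas DMR only detects it, which is why a two-core scheme needs an additional recovery mechanism.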

Character

  • 20% Hardware and Software familiarization
  • 50% Hardware Design
  • 30% Reliability Evaluation

Prerequisites

  • Experience with digital design in SystemVerilog as taught in VLSI I
  • Strong interest in Computer Architecture and Reliable Computing

References

[1] M. Rogenmoser, N. Wistoff, P. Vogel, F. Gürkaynak, and L. Benini, “On-Demand Redundancy Grouping: Selectable Soft-Error Tolerance for a Multicore Cluster,” in 2022 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2022.

[2] M. Cavalcante, S. Riedel, A. Pullini, and L. Benini, “MemPool: A Shared-L1 Memory Many-Core Cluster with a Low-Latency Interconnect,” in 2021 Design, Automation, and Test in Europe Conference and Exhibition (DATE), 2021, pp. 701–706.

[3] S. Riedel and M. Cavalcante, “MemPool GitHub.” 2021.

[4] F. Zaruba, F. Schuiki, T. Hoefler, and L. Benini, “Snitch: A 10 kGE Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads,” IEEE Transactions on Computers, pp. 1–1, Feb. 2020.

[5] A. Waterman and K. Asanović, “The RISC-V Instruction Set Manual Volume I: Unprivileged ISA - Document Version 20191213,” RISC-V Foundation, 2019.