ISA extensions in the Snitch Processor for Signal Processing (M)
Striving for high image quality, even on mobile devices, has led to an increase in pixel count in smartphone cameras over the last decade. These image sensors, boasting tens of millions of pixels, create a huge amount of data that has to be processed as quickly as possible within a tight power envelope. Fortunately, this processing is highly parallelizable, but it requires specialized image signal processors (ISPs) that can exploit this high degree of parallelism and support domain-specific instructions to meet the timing and power constraints. One modern example of such an ISP is Google’s Pixel Visual Core, which contains 8 image processing units, each consisting of 256 specialized processing elements to achieve a combined performance of 3.28 .
At ETH, we are developing our own ISP called MemPool. It is built around the RISC-V instruction set architecture (ISA), a modular and open ISA. RISC-V specifies a base instruction set (RV32I), which can be extended either with ratified extensions, like the ‘M’ extension that adds multiplication and division instructions, or with custom user extensions. The processor used in MemPool is Snitch, a lightweight 32-bit RISC-V core developed at ETH Zurich. It implements the base RISC-V ISA with the Atomic extension (RV32IA). Additionally, it can support the ‘M’ and ‘F’ extensions through a custom co-processor interface. However, there is currently no support for domain-specific instructions.
Another open-source RISC-V processor is CV32E40P, which was also developed at ETH and is currently maintained by the OpenHW Group. It is a 32-bit in-order processor with 4 pipeline stages. In contrast to Snitch, it features custom DSP instructions, such as multiply-accumulate and SIMD instructions.
The goal of the project is to extend Snitch, or more generally the MemPool system, with custom instructions tailored towards image signal processing. Snitch’s co-processor interface will be used to execute these domain-specific instructions outside its pipeline.
The instructions that will be added to this core are image processing instructions, such as multiply-accumulate, clamp, or various SIMD instructions. The final goal is to implement a relevant subset of the RISC-V ‘P’ extension. The set of instructions to be added is not fixed; instead, the student should implement and evaluate different image processing algorithms and propose instructions based on the results. Using CV32E40P as a reference implementation, these instructions have to be implemented carefully, with a good trade-off between area, performance, and energy. The implementation has to be tested to verify its functionality, and its impact on image processing applications is expected to be evaluated at the end.
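The three instruction classes named above can be pinned down with a small bit-accurate reference model, which is also handy as a golden model when verifying the hardware. The following is a minimal sketch; the function names (`mac`, `clamp`, `sdot_i8`) and the exact semantics (32-bit wrap-around, four signed 8-bit lanes) are illustrative assumptions, not the mnemonics or definitions of the ‘P’ or ‘xpulp’ extensions.

```python
MASK32 = 0xFFFFFFFF  # results are kept as unsigned 32-bit register values


def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mac(acc, a, b):
    """Multiply-accumulate: acc + a * b, wrapped to 32 bits (assumed semantics)."""
    return (acc + a * b) & MASK32


def clamp(value, lo, hi):
    """Saturate a signed value to the closed range [lo, hi]."""
    return max(lo, min(hi, value))


def unpack_i8(word):
    """Split a 32-bit word into four signed 8-bit lanes (lane 0 = lowest byte)."""
    return [to_signed(word >> (8 * lane), 8) for lane in range(4)]


def sdot_i8(acc, a, b):
    """SIMD dot product: acc plus the sum of the four signed lane products."""
    total = sum(x * y for x, y in zip(unpack_i8(a), unpack_i8(b)))
    return (acc + total) & MASK32
```

A model like this could be compared against RTL simulation results operand by operand; the lane width and signedness would have to be adapted to whichever instructions are finally selected.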
To use the new instructions in software written for MemPool, they have to be added to the assembler and the Spike simulator. While these efforts are only a very small part of the project, they are an integral part of the overall work.
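Before the assembler knows a new mnemonic, an instruction can still be emitted as a raw 32-bit word. The sketch below packs the fields of the standard RISC-V R-type format and prints a `.word` directive. It assumes the new instructions would live in the `custom-0` major opcode space; that choice, the `funct7`/`funct3` values, and the helper names are placeholders for illustration, not the encodings used by the project.

```python
CUSTOM0_OPCODE = 0b0001011  # RISC-V 'custom-0' major opcode (assumed choice)


def encode_r_type(funct7, rs2, rs1, funct3, rd, opcode=CUSTOM0_OPCODE):
    """Pack the R-type fields into one 32-bit instruction word."""
    fields = (
        ("funct7", funct7, 7),
        ("rs2", rs2, 5),
        ("rs1", rs1, 5),
        ("funct3", funct3, 3),
        ("rd", rd, 5),
        ("opcode", opcode, 7),
    )
    for name, value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
    return (
        (funct7 << 25)
        | (rs2 << 20)
        | (rs1 << 15)
        | (funct3 << 12)
        | (rd << 7)
        | opcode
    )


def as_word_directive(instruction):
    """Format the instruction as an assembler `.word` directive."""
    return f".word 0x{instruction:08x}"


# Hypothetical instruction with rd = x3, rs1 = x1, rs2 = x2
print(as_word_directive(encode_r_type(funct7=0, rs2=2, rs1=1, funct3=0, rd=3)))
```

The same field layout is what a decoder entry in the simulator has to match, so keeping the encoding in one place helps keep the assembler, simulator, and hardware consistent.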
- Identify key instructions that can be used to accelerate the image-processing benchmarks;
- Port the custom DSP instructions of CV32E40P to Snitch;
- Adapt the custom instructions to align them with RISC-V’s ‘P’ extension;
- Modify the benchmarks to use the custom instructions and benchmark the system. The key performance indicators will be:
  - Hardware: Area, frequency, throughput, latency, power;
  - Benchmarks: Runtime, throughput.
- Familiarize yourself with the MemPool system. Run some existing benchmarks in RTL simulation and implement some image processing kernels. Based on these kernels, identify key instructions in CV32E40P’s ‘xpulp’ and RISC-V’s ‘P’ extensions to implement in MemPool.
- Port the relevant DSP instructions of CV32E40P to Snitch. In this phase, the focus lies on the interface between Snitch and the co-processor and not on optimizing the co-processor’s functional units.
- Adapt the custom instructions to align them with RISC-V’s ‘P’ extension. Optimize the datapath of the co-processor to align it with the requirements of MemPool.
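As an illustration of the kind of kernel the first step refers to, the following sketch applies a 3x3 filter to an 8-bit grayscale image. It is a plain reference version, not MemPool code: the function name, the list-of-rows image layout, and the `shift` normalization are assumptions made for the example, and the kernel is applied without flipping. The comments mark where multiply-accumulate and clamp operations dominate the inner loop.

```python
def filter3x3_u8(image, kernel, shift=0):
    """Apply a 3x3 integer kernel to the interior pixels of an 8-bit image.

    `image` is a list of rows with values in 0..255, `kernel` is a 3x3 list
    of integers, and the result is right-shifted by `shift` before clamping.
    Border pixels are skipped, so the output is (H-2) x (W-2).
    """
    height, width = len(image), len(image[0])
    output = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    # Nine multiply-accumulates per pixel: MAC/SIMD candidate
                    acc += kernel[ky][kx] * image[y + ky - 1][x + kx - 1]
            # Saturation to the 8-bit pixel range: clamp candidate
            row.append(max(0, min(255, acc >> shift)))
        output.append(row)
    return output


# 3x3 binomial blur; the weights sum to 16, hence shift=4
blur = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
flat = [[10] * 3 for _ in range(3)]
print(filter3x3_u8(flat, blur, shift=4))
```

Counting the operations in such a loop nest gives a first estimate of which instructions are worth adding before measuring anything in RTL simulation.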
Throughout the project, the new hardware has to be benchmarked to make informed decisions on how to move forward, and the implementations have to be carefully verified.
Throughout the project, a number of milestones have to be reached. In the following, a tentative list of expected milestones is provided, which might be modified during the first weeks of the work:
- Image processing kernels are running on MemPool (no custom instructions yet).
- The custom instructions of CV32E40P’s ‘xpulp’ extension are working in Snitch.
- All custom instructions are aligned with RISC-V’s ‘P’ extension and optimized for the MemPool architecture.
- The benchmarks are updated to make use of the custom instructions and the implementation is evaluated.
Depending on the progress and the interest of the student, the project can explore different directions. The focus can be kept on implementing the ‘P’ extension, in which case the student can implement as many instructions as possible or do a very detailed analysis of the optimal datapath for different trade-offs.
Another option is to explore the use of Stream Semantic Registers (SSRs) in our system. SSR is a lightweight, non-invasive RISC-V ISA extension that implicitly encodes memory accesses as register reads/writes, eliminating a large number of explicit load/store instructions. This extension can be used in combination with the new co-processor to increase performance even further.
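The idea of encoding memory accesses as register reads can be illustrated with a small behavioral model. This is only a sketch of the concept: the class `StreamRegister`, its constructor parameters, and the flat-list memory are hypothetical and do not model the configuration interface or timing of the actual SSR extension. A nonzero stride is assumed.

```python
class StreamRegister:
    """Behavioral stand-in for one stream register.

    Once configured with a base address, a nonzero stride, and a length,
    every read returns the next element of the stream, so the consuming
    loop needs no explicit load or address update.
    """

    def __init__(self, memory, base, stride, length):
        self.memory = memory
        self.addresses = iter(range(base, base + stride * length, stride))

    def read(self):
        # The address generation happens here, outside the compute loop
        return self.memory[next(self.addresses)]


def dot_product_streamed(memory, base_a, base_b, length):
    """Dot product whose loop body contains only the multiply-accumulate."""
    a = StreamRegister(memory, base_a, 1, length)
    b = StreamRegister(memory, base_b, 1, length)
    acc = 0
    for _ in range(length):
        acc += a.read() * b.read()
    return acc


memory = [1, 2, 3, 4, 5, 6]
print(dot_product_streamed(memory, base_a=0, base_b=3, length=3))
```

In this model the loop body reduces to a single multiply-accumulate per element, which is where a combination with the new co-processor instructions would be expected to pay off.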
Meetings & Presentations
The students and advisor(s) agree on weekly meetings to discuss all relevant decisions and decide on how to proceed. Of course, additional meetings can be organized to address urgent issues.
Around the middle of the project there is a design review, where senior members of the lab review your work (bring all the relevant information, such as preliminary specifications, block diagrams, synthesis reports, testing strategy, ...) to make sure everything is on track and decide whether further support is necessary. They also make the final decision on whether the chip is actually manufactured (no reason to worry if the project is on track) and whether more chip area, a different package, ... is provided. For more details refer to (1).
At the end of the project, you have to present/defend your work during a 15 min. presentation and 5 min. of discussion as part of the IIS Colloquium.
- Student: Sergio Mazzola
- Type: Master Thesis
- Semester: Autumn Semester 2020
- Professor: Prof. Dr. L. Benini
S. Skafisk, This is how smartphone cameras have improved over time.
J. L. Hennessy and D. A. Patterson, Computer architecture: A quantitative approach. Elsevier, 2011.
M. Cavalcante and S. Riedel, MemPool GitLab repository.
A. Waterman et al., “The RISC-V instruction set manual.” 2014.
F. Zaruba, F. Schuiki, T. Hoefler, and L. Benini, “Snitch: A 10 kGE pseudo dual-issue processor for area and energy efficient execution of floating-point intensive workloads.” 2020.
M. Gautschi et al., “Near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 10, pp. 2700–2713, 2017.
OpenHW Group, CV32E40P GitHub repository.
C. Chang, “RISC-V "P" extension proposal.” 2020.
RISC-V, “Spike RISC-V ISA simulator.” 2019.
F. Schuiki, F. Zaruba, T. Hoefler, and L. Benini, “Stream semantic registers: A lightweight RISC-V ISA extension achieving full compute utilization in single-issue cores.” 2019.