Efficient Synchronization of Manycore Systems (M/1S)
- 1 Overview
- 2 Introduction
- 3 Project Description
- 4 Milestones
- 5 Project Realization
- 6 Deliverables
- 7 References
Overview
- Type: Semester/Master Thesis
- Professor: Prof. Dr. L. Benini
- Character: 20% Software, 60% RTL design, 20% Evaluation
- Prerequisites: VLSI I, experience with C
Introduction
Parallel processing on multi-core systems can drastically reduce execution time compared to single-core systems. However, sharing resources and data requires coordination between the cores to serialize their access to critical sections. According to Amdahl’s law [1], this serialized portion limits the speedup the multi-core system can achieve. Efficient means of synchronization are therefore of great importance to multi-core systems.
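To make the effect of serialization concrete, the following minimal sketch evaluates Amdahl's law in C. The function name `amdahl_speedup` and its parameters are chosen here for illustration only; the formula assumes the parallel part scales perfectly with the number of cores.

```c
#include <assert.h>

/* Amdahl's law: speedup on `cores` cores when `serial_fraction` of the
 * single-core runtime cannot be parallelized (e.g., critical sections).
 * Assumes the remaining part scales perfectly with the core count. */
double amdahl_speedup(double serial_fraction, unsigned cores) {
    assert(cores > 0);
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / (double)cores);
}
```

Under these assumptions, a program that spends 5% of its runtime in serialized sections speeds up by less than 19x on 256 cores, which is why shrinking the serialized portion matters so much on a manycore system.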
Instruction set architectures (ISAs) define atomic instructions to allow for efficient synchronization. The RISC-V ISA [2], for example, specifies the load-reserved (LR) and store-conditional (SC) instruction pair to implement arbitrary atomic read-modify-write (RMW) operations. The LR loads a value while placing a reservation on the specified location. The thread can then modify the value and store it back with an SC. The SC only succeeds if no one has modified the data in the meantime, which guarantees that a successful LR/SC sequence is atomic. The LR/SC instruction pair allows implementing locks and is widely used in lock-free algorithms.
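The retry pattern behind such an RMW operation can be sketched in portable C11 with a compare-exchange loop, which serves here as a stand-in for LR/SC. The operation (an atomic maximum) and the name `atomic_max` are illustrative assumptions. Note one difference: a compare-exchange compares values, whereas an SC fails whenever the reservation was lost.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically replace *addr with max(*addr, value) and return the old value.
 * The loop retries until the update is applied to an unmodified snapshot. */
int32_t atomic_max(_Atomic int32_t *addr, int32_t value) {
    int32_t old = atomic_load(addr); /* snapshot of the current value */
    while (old < value &&
           !atomic_compare_exchange_weak(addr, &old, value)) {
        /* The exchange failed (similar to a failed SC): `old` now holds
         * the value currently in memory, so the condition is re-checked. */
    }
    return old;
}
```

Any other modify step (add, bit set, pointer swap) fits the same loop, which is what makes the pair suitable for arbitrary RMW operations.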
Locks are a well-known construct to guard critical sections and serialize parallel threads’ access to shared objects. However, one major drawback is their need for polling, which increases congestion in the interconnect, wastes energy, and can interfere with the thread holding the lock, slowing down its execution [3].
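The polling cost can be seen in a minimal test-and-set spin lock, sketched below with C11 atomics. The type and function names are hypothetical; the returned poll count is only there to make the wasted accesses visible.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_flag taken; /* set while some thread holds the lock */
} spin_lock;

#define SPIN_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Single acquisition attempt; returns true if the lock was obtained. */
bool spin_try_lock(spin_lock *l) {
    return !atomic_flag_test_and_set_explicit(&l->taken, memory_order_acquire);
}

/* Poll until the lock is obtained; returns the number of failed polls.
 * Every failed poll is an atomic access to the shared memory location. */
unsigned long spin_lock_acquire(spin_lock *l) {
    unsigned long failed_polls = 0;
    while (!spin_try_lock(l)) {
        failed_polls++;
    }
    return failed_polls;
}

void spin_unlock(spin_lock *l) {
    atomic_flag_clear_explicit(&l->taken, memory_order_release);
}
```

With many waiting threads, the failed polls of all waiters hit the same memory location while the holder is still inside the critical section.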
A similar polling problem arises in concurrent algorithms that exhibit producer-consumer behavior. For example, in a producer-consumer queue, the consumers have to poll an empty queue for new data, while the producers have to poll a full queue for a free slot.
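As an illustration, here is a minimal single-producer/single-consumer ring buffer in C11. All names and the capacity are assumptions made for this sketch. Both operations simply fail on a full or empty queue, so callers have no choice but to retry in a loop.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 4u /* hypothetical size; must be a power of two */

typedef struct {
    int slots[QUEUE_CAPACITY];
    atomic_size_t head; /* next index to pop, advanced by the consumer */
    atomic_size_t tail; /* next index to push, advanced by the producer */
} spsc_queue;

/* Returns false if the queue is full; the producer then has to poll. */
bool queue_try_push(spsc_queue *q, int value) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QUEUE_CAPACITY) {
        return false;
    }
    q->slots[tail % QUEUE_CAPACITY] = value;
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Returns false if the queue is empty; the consumer then has to poll. */
bool queue_try_pop(spsc_queue *q, int *value) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail) {
        return false;
    }
    *value = q->slots[head % QUEUE_CAPACITY];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

A consumer would typically wrap `queue_try_pop` in a `while` loop, repeatedly reading `tail` until the producer advances it.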
To remove the need for polling, the memory needs to notify the waiting threads when a lock becomes free or a queue’s status changes. Cache-coherent systems can leverage the coherency protocol for this, which invalidates the cache line holding the lock if its value changes. For non-coherent systems, we propose to extend RISC-V’s LR/SC instruction pair with an additional atomic instruction we call load-reserved-wait (LRwait). Like RISC-V’s LR instruction, it loads a value and places a reservation, but in addition to guaranteeing atomicity, it puts the thread to sleep and wakes it up once a different thread has changed the specified memory location. This allows a waiting core to sleep while the lock is taken and to wake up only once it is free again. To reduce contention on a free lock, only one waiting thread is woken up.
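One possible reading of this behavior is sketched below as a purely behavioral software model of a single memory word. It is not the specification, which is part of the project. The names, the queue depth, and in particular the FIFO wake-up order are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WAITERS 8 /* hypothetical depth of the wait queue */
#define NO_CORE (-1)

typedef struct {
    uint32_t value;           /* content of the memory word */
    int waiters[MAX_WAITERS]; /* IDs of sleeping cores, FIFO (assumed) */
    unsigned head;            /* index of the oldest waiter */
    unsigned count;           /* number of sleeping cores */
} lrwait_word;

/* Model of LRwait: the issuing core is registered as a waiter and sleeps. */
void model_lrwait(lrwait_word *w, int core_id) {
    assert(w->count < MAX_WAITERS);
    w->waiters[(w->head + w->count) % MAX_WAITERS] = core_id;
    w->count++;
}

/* Model of a store from another core: update the word and wake at most
 * one waiter. Returns the ID of the woken core, or NO_CORE if none sleeps. */
int model_store(lrwait_word *w, uint32_t new_value) {
    w->value = new_value;
    if (w->count == 0) {
        return NO_CORE;
    }
    int woken = w->waiters[w->head];
    w->head = (w->head + 1) % MAX_WAITERS;
    w->count--;
    return woken;
}
```

In this model, releasing a lock with `model_store` hands it to exactly one sleeping core, while the others stay asleep and generate no memory traffic.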
Naturally, such an instruction requires the support of the interconnect and the memory system. We recently developed and published the ATomic UNit (ATUN) [4], a module that implements RISC-V’s atomic instructions in a modular way. It allows adding support for atomic operations to a given platform.
Project Description
This thesis aims to implement the proposed instruction in hardware and evaluate its impact in terms of area, speed, and energy efficiency. The ATUN will be used as the starting point and extended to support the additional functionality required by LRwait. Depending on the type of thesis (semester or master thesis), the scope can range from focusing on the hardware implementation to an exploration of new atomic instructions and their use in concurrent algorithms.
Familiarization and specification of LRwait
The first phase’s goal is to become familiar with concurrent algorithms, atomic operations, and the existing hardware platform by implementing a few simple concurrent algorithms and running them on the platform. Next, the student should evaluate how these algorithms can benefit from the LRwait instruction and how the instruction has to be specified to be as efficient and versatile as possible.
In the next step, the specified instruction will be implemented in hardware. The ATUN will be extended with the new functionality with as little resource overhead as possible. In addition to the ATUN, the processors, and possibly other system components, also have to be extended to support the new instruction. The extended ATUN has to be integrated into MemPool, our manycore system with up to 256 cores.
The impact of the new instruction on concurrent algorithms has to be evaluated in terms of speed and energy consumption. It has to be evaluated which algorithms benefit from this new instruction and how well it scales. Further interesting evaluations include the resulting impact on the interconnect congestion or the latency of acquiring a lock.
Milestones
The following are the milestones that we expect to achieve throughout the project:
- Specify the new instruction and show how to use it in concurrent algorithms.
- Implement the new instruction in hardware.
- Extend the ATUN to support LRwait.
- Extend the Snitch core to support this instruction.
- Verify the changes in a full system set-up.
- Implement concurrent algorithms using the LRwait instruction.
- Benchmark and analyze the impact of the new instruction.
Should the above milestones be reached earlier than expected, and should you be motivated to do further work, we propose the following stretch goals:
- Extensive investigation of how the new instruction can be used in existing concurrent algorithms.
- Evaluate the pros and cons of different variations of such an instruction.
Project Realization
Weekly meetings will be held between the student and the assistants. The exact time and location of these meetings will be determined within the first week of the project to fit the student’s and the assistants’ schedules. These meetings will be used to evaluate the status and progress of the project. Besides these regular meetings, additional meetings can be organized to address urgent issues.
Master Thesis: The student is required to write a weekly report at the end of each week and to send it to their advisors by email. The idea of the weekly report is to briefly summarize the work, progress, and any findings made during the week, plan the actions for the next week, and discuss open questions and points. The weekly report is also an important means for the student to develop a goal-oriented approach to work.
Semester Thesis: The student is advised, but not required, to write a weekly report at the end of each week and to send it to their advisors. The idea of the weekly report is to briefly summarize the work, progress, and any findings made during the week, plan the actions for the next week, and bring up open questions and points. The weekly report is also an important means for the student to develop a goal-oriented approach to work.
HDL Code Style
Adopting a consistent code style is one of the most important steps in making your code easy to understand. If signals, processes, and modules are always named consistently, any inconsistency can be detected more easily. Moreover, if a design group shares the same naming and formatting conventions, all members immediately feel at home with each other’s code. At IIS, we use lowRISC’s style guide for SystemVerilog HDL: https://github.com/lowRISC/style-guides/.
Software Code Style
We generally suggest that you use style guides or code formatters provided by the language’s developers or community. For example, we recommend LLVM’s or Google’s code styles with clang-format for C/C++, PEP-8 and pylint for Python, and the official style guide with rustfmt for Rust.
Even in the context of a student project, keeping a precise history of changes is essential to a maintainable codebase. You may also need to collaborate with others, adopt their changes to existing code, or work on different versions of your code concurrently. For all of these purposes, we heavily use Git as a version control system at IIS. If you have no previous experience with Git, we strongly advise you to familiarize yourself with the basic Git workflow before you start your project.
Documentation is an important and often overlooked aspect of engineering. A final report has to be completed within this project.
English is the de facto common language of engineering. The final report should therefore preferably be written in English.
Any word processing software may be used for writing the reports; nevertheless, the IIS staff strongly encourages the use of LaTeX with Inkscape or any other vector drawing software (for block diagrams).
If you write the report in LaTeX, we offer an instructive, ready-to-use template, which can be forked from the Git repository at https://iis-git.ee.ethz.ch/akurth/iisreport.
The final report has to be presented at the end of the project. A digital copy needs to be handed in and remains property of the IIS. Note that this task description is part of your report and has to be attached to your final report.
There will be a 15/20 min presentation and 5 min Q&A at the end of this project in order to present your results to a wider audience. The exact date will be determined towards the end of the work.
Deliverables
In order to complete the project successfully, the following deliverables have to be submitted at the end of the work:
- Final report incl. presentation slides
- Source code and documentation for all developed software and hardware
- Test suites (software) and testbenches (hardware)
- Synthesis and implementation scripts, results, and reports
References
[1] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 6th ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2017, p. 936.
[2] A. Waterman and K. Asanović, “The RISC-V Instruction Set Manual Volume I: Unprivileged ISA - Document Version 20191213,” RISC-V Foundation, 2019.
[3] M. Herlihy and N. Shavit, “The Art of Multiprocessor Programming,” 2012.
[4] A. Kurth, S. Riedel, F. Zaruba, T. Hoefler, and L. Benini, “ATUNs: Modular and scalable support for atomic operations in a shared memory multiprocessor,” in Proceedings - Design Automation Conference, 2020, vol. 2020–July.
[5] F. Zaruba, F. Schuiki, T. Hoefler, and L. Benini, “Snitch: A 10 kGE Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads,” IEEE Transactions on Computers, pp. 1–1, Feb. 2020.