Difference between revisions of "LLVM and DaCe for Snitch (1-2S)"

From iis-projects

** [[:User:Alexandru.calotoiu | Alexandru Calotoiu]]: []
** [[:User:Paulsc | Paul Scheffler]]: []
** [[:User:Akurth | Andreas Kurth]]: []
== Character ==

Revision as of 19:11, 15 February 2021


Status: In progress


  • 70% Implementation
  • 30% Evaluation


  • C/C++, Python programming skills
  • Interest in code generation
  • Interest in or experience with the LLVM/Clang codebase
  • Ability to endure soul-crunching hardware simulation environments


The Snitch ecosystem [1] targets energy-efficient high-performance systems. It is built around the minimal RISC-V Snitch core, only 15 kGE in size, which can optionally be coupled to an FPU, a DMA, or DSP extensions in development. Snitch’s floating-point subsystem is of particular interest: it includes stream semantic registers (SSRs) [2] and the floating-point repetition (FREP) hardware loop, approaching full FPU utilization in many data-oblivious problems by reducing the control-to-compute overhead.

Currently, leveraging these powerful extensions requires assembly programming, since there is no compiler support for them. This includes setting up SSR address generators, managing stream semantics, and inserting FREP loops where appropriate.
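To make the idea of streamed operands concrete, the following is a minimal software model, not the hardware interface. It assumes an SSR address generator can be described by a base index, loop bounds, and strides (innermost loop first); the names `ssr_stream` and `dot_streamed` are hypothetical and chosen for illustration only. The loop in `dot_streamed` contains nothing but the multiply-accumulate, which is roughly the situation an FREP loop with SSR operands aims for.

```python
from itertools import product

def ssr_stream(mem, base, bounds, strides):
    """Yield elements of mem along an affine index pattern.

    bounds and strides are listed innermost loop first (an assumption
    of this toy model, not a statement about the real SSR registers).
    """
    # Iterate outermost-first so that the innermost index varies fastest.
    for idx in product(*(range(b) for b in reversed(bounds))):
        offset = sum(i * s for i, s in zip(idx, reversed(strides)))
        yield mem[base + offset]

def dot_streamed(mem, a_base, b_base, n):
    """Dot product whose loop body has no explicit loads or index math."""
    a = ssr_stream(mem, a_base, [n], [1])
    b = ssr_stream(mem, b_base, [n], [1])
    acc = 0.0
    for _ in range(n):  # stands in for an FREP-repeated multiply-add
        acc += next(a) * next(b)
    return acc

mem = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
result = dot_streamed(mem, 0, 4, 4)  # 1*5 + 2*6 + 3*7 + 4*8 = 70.0
```

A two-level pattern such as `ssr_stream(mem, 0, [2, 3], [1, 4])` would visit indices 0, 1, 4, 5, 8, 9, which is the kind of nested access a compiler pass would have to recognize and configure.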

In a simple prototype, we explored how the RISC-V LLVM backend could be extended to automatically detect loads and stores amenable to SSR replacement and to insert the necessary code. Ideally, we would have an optimized LLVM toolchain supporting and inferring code for all Snitch extensions, including more complex SSR/FREP loops and DMA transactions.

Even with good compiler support, however, efficiently programming large multicore systems such as Manticore [3] is challenging: it requires careful scheduling of both the data movement, using the programmable DMA, and the actual computation. As systems and problems grow, this can become inefficient and cumbersome to implement.

A promising angle to make this more manageable is to leverage a higher-level framework to describe computation with respect to its dataflow. This representation can then be mapped onto the target hardware in an optimal way. Frameworks like TensorFlow already do this, but target a specific domain such as Machine Learning.

We would therefore like to leverage DaCe [4] to generate code for Snitch. DaCe is a Python framework for Data-Centric Parallel Programming, shifting the fundamental view from computation to data movement. In DaCe, computations are represented as stateful dataflow multigraphs (SDFGs) modeling computation elements, data containers, and parametric dataflow dependencies. This can provide a portable representation of scientific computations without sacrificing performance, as performance engineers can focus on efficiently mapping SDFGs to their hardware instead of individual problems.
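The data-centric view can be illustrated with a deliberately simplified sketch in plain Python. This is not the DaCe API and omits states, maps, and parametric memlets; `Tasklet` and `run_dataflow` are hypothetical names used only here. The point it shows is that the execution order is derived from which data containers each computation reads and writes, not from the order in which the computations are listed.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, List

@dataclass
class Tasklet:
    name: str
    inputs: List[str]   # names of data containers read
    output: str         # name of the data container written
    func: Callable

def run_dataflow(tasklets, data):
    """Run tasklets in an order implied by their data dependencies."""
    producer = {t.output: t.name for t in tasklets}
    # A tasklet depends on whichever tasklet produces one of its inputs.
    deps = {t.name: {producer[i] for i in t.inputs if i in producer}
            for t in tasklets}
    by_name = {t.name: t for t in tasklets}
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        t = by_name[name]
        data[t.output] = t.func(*(data[i] for i in t.inputs))
    return order, data

# z = a * x + y, with the tasklets intentionally listed out of order.
tasklets = [
    Tasklet("add", ["ax", "y"], "z",
            lambda ax, y: [p + q for p, q in zip(ax, y)]),
    Tasklet("scale", ["a", "x"], "ax",
            lambda a, x: [a * v for v in x]),
]
order, data = run_dataflow(tasklets, {"a": 2, "x": [1, 2, 3], "y": [10, 20, 30]})
```

Because the dependencies are explicit, a backend is free to decide where each container lives and when it is moved, which is the degree of freedom a Snitch code generator would exploit.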

Project Description

The project has two primary goals:

  • Extending and tuning the RISC-V LLVM toolchain for Snitch; this includes supporting and inferring Xssr, Xfrep, and Xdma functionality.
  • Extending DaCe to generate efficient code for a Snitch system from SDFGs, ideally for the existing Snitch-HERO platform.


The following are the milestones that we expect to achieve throughout the project:

  • Extend the existing LLVM backend to support FREP and the DMA instructions so they can be emitted and used in assembly.
  • Define sensible intrinsics for SSR configuration, FREP loops, asynchronous DMA calls, and other useful runtime functionality, such as Core-FPU synchronization.
  • Implement simple LLVM backend passes to automatically replace traditional loop structures with FREP loops and SSR jobs where applicable.
  • Adapt the existing C++ DaCe backend to emit LLVM-compilable code for a manycore Snitch system like Snitch-HERO. Validate your implementation on simple kernels or selected Polybench [5] benchmarks with well-understood SDFGs.
  • Implement appropriate SDFG transform passes in DaCe to make use of the Snitch cluster TCDM and improve performance on the aforementioned examples.
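As a rough illustration of the last milestone, the sketch below picks a square tile size for a matrix multiplication so that all operand tiles fit into the cluster TCDM with double buffering. The 128 KiB capacity, the 8-byte element size, and the helper names `largest_matmul_tile` and `tiles` are assumptions for this example only; an actual SDFG transform would operate on the graph rather than on plain integers.

```python
from math import isqrt

def largest_matmul_tile(tcdm_bytes, elem_bytes=8, buffers=2):
    """Largest T such that buffers * 3 * T * T * elem_bytes <= tcdm_bytes.

    The factor 3 accounts for one tile each of A, B, and C; buffers=2
    models double buffering so DMA transfers can overlap with compute.
    """
    return isqrt(tcdm_bytes // (3 * buffers * elem_bytes))

def tiles(n, tile):
    """Yield (start, size) pairs that cover range(n) in chunks of at most tile."""
    for start in range(0, n, tile):
        yield start, min(tile, n - start)

TCDM_BYTES = 128 * 1024  # assumed capacity, adjust to the actual cluster
T = largest_matmul_tile(TCDM_BYTES)
row_tiles = list(tiles(100, T))
```

With these assumed numbers, `T` evaluates to 52 and a dimension of 100 is split into one tile of 52 and one of 48 rows.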

Stretch Goals

Should the above milestones be reached earlier than expected and you are motivated to do further work, we propose the following stretch goals to aim for:

  • Map, evaluate, and try to optimize further applications on Snitch using DaCe, such as the full PolyBench suite or a more elaborate scientific computing problem like LULESH [6].
  • Improve your DaCe code generation backend and SDFGs transform passes to further improve mapping to and overall performance on Snitch.
  • Further tune LLVM to Snitch: for example, make it aware of architectural latencies and forbid it from assuming memory coherency between core and FPU.
  • Explore compiler support and inference for the new indirection feature of the backwards-compatible indirection stream semantic registers [7].

Project Realization

Time Schedule

The time schedule presented in Table 1 is merely a proposal; it is primarily intended as a reference and as an estimate of the time required for each step.

Project phase                         Time estimate
Add extension instructions to LLVM    1 week
Define sensible intrinsics            2 weeks
Implement LLVM inference passes       3 weeks
Adapt C++ DaCe backend to Snitch      2 weeks
Write appropriate SDFG transforms     2 weeks
Stretch goals                         remaining time
Write report                          2 weeks
Prepare presentation                  1 week

Table 1: Proposed time schedule and investment


Weekly meetings will be held between the student and the assistants. The exact time and location of these meetings will be determined within the first week of the project to fit the student’s and the assistants’ schedules. These meetings will be used to evaluate the status and progress of the project. Besides these regular meetings, additional meetings can be organized to address urgent issues.

Weekly Digests

The student is advised, but not required, to write a weekly digest at the end of each week and to send it to their advisors. The idea of the weekly digest is to briefly summarize the work, progress, and any findings made during the week, to plan the actions for the next week, and to bring up open questions and points. The weekly digest is also an important means for the student to develop a goal-oriented attitude to work.


Documentation is an important and often overlooked aspect of engineering. A final report has to be completed within this project.

English is the de facto common language of engineering. The final report should therefore preferably be written in English.

Any form of word processing software is allowed for writing the reports; nevertheless, the use of LaTeX with Inkscape or any other vector drawing software (for block diagrams) is strongly encouraged.

Final Report

The final report has to be presented at the end of the project, and a digital copy needs to be handed in and remains the property of SPCL. Note that this task description is part of your report and has to be attached to your final report.


There will be a presentation (15 min presentation and 5 min Q&A) at the end of this project in order to present your results to a wider audience. The exact date will be determined towards the end of the work.


In order to complete the project successfully, the following deliverables have to be submitted at the end of the work:

  • Final report incl. presentation slides
  • Source code, test suites, and documentation for all developed software


[1] F. Zaruba, F. Schuiki, T. Hoefler, and L. Benini, “Snitch: A tiny pseudo dual-issue processor for area and energy efficient execution of floating-point intensive workloads,” IEEE Trans. Comput., pp. 1–1, 2020.

[2] F. Schuiki, F. Zaruba, T. Hoefler, and L. Benini, “Stream Semantic Registers: A Lightweight RISC-V ISA Extension Achieving Full Compute Utilization in Single-Issue Cores,” IEEE Trans. Comput., pp. 1–1, 2020.

[3] F. Zaruba, F. Schuiki, and L. Benini, “Manticore: A 4096-core RISC-V chiplet architecture for ultra-efficient floating-point computing.” 2020.

[4] T. Ben-Nun, J. de Fine Licht, A. N. Ziogas, T. Schneider, and T. Hoefler, “Stateful dataflow multigraphs: A data-centric model for performance portability on heterogeneous architectures.” 2020.

[5] L.-N. Pouchet, “PolyBench/C: the Polyhedral Benchmark suite.” 2010.

[6] I. Karlin, J. Keasler, and R. Neely, “LULESH 2.0 updates and changes,” Livermore, CA, LLNL-TR-641973, 2013.

[7] P. Scheffler, F. Zaruba, F. Schuiki, T. Hoefler, and L. Benini, “Indirection stream semantic register architecture for efficient sparse-dense linear algebra.” 2020.