A reduction-capable AXI XBAR for fast M-to-1 communication (1M)

Figure: AXI XBAR block diagram

Overview

Status: Available

Introduction

To realize the performance potential of many-core architectures, efficient and scalable on-chip communication is required [1]. Collective communication lies on the critical path of many applications; its criticality is evident in the dedicated collective and barrier networks employed by several supercomputers, such as Summit [2], the NYU Ultracomputer, the Cray T3D, and Blue Gene/L. Many-core architectures would likewise benefit from hardware support for collective communication, but they may not be able to afford separate, dedicated networks due to the associated routing and area costs. For this reason, several works in the literature have explored integrating collective communication support directly into the existing NoC [1, 3].

In another, ongoing project, we are exploring the cost of integrating multicast support directly into the interconnect of Occamy [4], a shared-memory many-core system. In Occamy, 216+1 cores and their tightly-coupled data memories are interconnected by a hierarchy of AXI XBARs in a partitioned global address space (PGAS). Each AXI XBAR connects a set of AXI masters to a set of AXI slaves, enabling unicast communication between any master and any slave. That work extends the AXI XBAR to support multicast transactions, i.e. a single transaction issued by one AXI master and delivered to multiple AXI slaves.

As opposed to multicast (1-to-M communication), in this thesis you will explore the cost of integrating reduction operations (M-to-1 communication) into the AXI XBAR.

A block diagram of the AXI XBAR is shown in the figure above.
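
To make the intended behavior more concrete, the sketch below contrasts a purely software reduction with the kind of support a reduction-capable XBAR could expose to software. It is only a conceptual illustration in C: the address, the helper names, and the triggering mechanism (a dedicated address range is assumed here) are hypothetical and are not part of the Occamy software stack or of any existing AXI XBAR feature.

  /*
   * Conceptual sketch only: all names (sw_reduce_contribute,
   * hw_reduce_contribute, REDUCTION_ADDR) are hypothetical.
   */
  #include <stdint.h>

  /* Software baseline: M cores accumulate into one shared location with
   * atomics, so the interconnect and the target memory see M separate
   * read-modify-write accesses. */
  static int64_t sw_accumulator;

  void sw_reduce_contribute(int64_t partial)
  {
      __atomic_fetch_add(&sw_accumulator, partial, __ATOMIC_RELAXED);
  }

  /* With hardware reduction support, each core issues a single store to a
   * designated "reduction" address; the XBAR combines the M write payloads
   * into one value and forwards a single write to the slave. How a reduction
   * is requested (address range, AXI user signals, ...) is part of the
   * design space to be explored in this thesis. */
  #define REDUCTION_ADDR ((volatile int64_t *)0x80000000u) /* hypothetical */

  void hw_reduce_contribute(int64_t partial)
  {
      *REDUCTION_ADDR = partial; /* combined in the interconnect, not in memory */
  }

In the software baseline, the reduced value materializes through M serialized atomic updates at the target memory; with interconnect support, the combining happens on the way to the slave, which is what makes M-to-1 traffic cheaper.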

Project description

Detailed task description

To break it down in more detail, you will:

  • Gain a deep understanding of the PolyBench kernels, in particular of:
    • the underlying algorithms;
    • the data movement and communication patterns;
    • the parallelism they expose, i.e. distinguish sequential vs. parallel code regions, data vs. task parallelism, etc.;
  • Understand the Occamy architecture and familiarize yourself with the software stack
  • Select a suitable subset of kernels to implement
  • Implement the kernels on Occamy
    • A) Port the original sources to run on the CVA6 host
    • B) Offload amenable code regions to the accelerator
    • C) Optimize data movement, overlapping communication and computation where possible
  • Compare the performance and energy efficiency of the implementations in A), B) and C)
    • Analyze the speedup in terms of Amdahl's law (see the formula after this list)
    • Understand and locate where the major performance losses occur
    • Compare the attained FPU utilization and performance to the architecture's peak values
    • Suggest new hardware features or ISA extensions to further improve the attained performance
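
For the Amdahl-style analysis referred to in the list above: if a fraction p of a kernel's runtime is accelerated (e.g. offloaded to N Snitch cores) while the remaining 1 - p stays serial on the CVA6 host, the achievable speedup is bounded by

  \[ S(N) = \frac{1}{(1 - p) + \frac{p}{N}} \]

For example, p = 0.9 and N = 8 give S ≈ 4.7, which is why the residual host fraction and the offload overheads can dominate overall performance even when the offloaded region itself scales well.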

Optional stretch goals

Additional stretch goals may include:

  • Study which kernels could be optimized with the SSR or FREP ISA extensions (see the sketch after this list)
  • Optionally, optimize the kernels with SSRs or FREP
  • Categorize the kernels based on their use of collective communication (multicast, reductions) and synchronization primitives (barriers, locks)
  • Compare your results to a GPU or server-class CPU implementation [11]
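
As a reference point for the SSR/FREP stretch goal, the plain-C loop below is the kind of streaming kernel those extensions target: SSRs can turn the a[i] and b[i] loads into register reads fed by hardware address generators, and FREP can repeat the floating-point multiply-accumulate without per-iteration loop overhead. This is only the scalar C reference; the actual SSR/FREP configuration is done through the Snitch runtime and compiler support and is not shown here.

  /* Scalar dot product: 2 loads + 1 FP multiply-accumulate per iteration.
   * With SSRs the explicit loads disappear from the inner loop; with FREP
   * the loop control does too. */
  double dot(const double *a, const double *b, int n)
  {
      double acc = 0.0;
      for (int i = 0; i < n; i++) {
          acc += a[i] * b[i];
      }
      return acc;
  }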

Character

  • 10% Literature/architecture review
  • 70% RTL design and verification
  • 20% Physical design exploration

Prerequisites

  • Strong interest in computer architecture
  • Experience with digital design in SystemVerilog as taught in VLSI I
  • Preferred: Experience in bare-metal or embedded C programming
  • Preferred: Experience with ASIC implementation flow as taught in VLSI II

References

[1] Supporting Efficient Collective Communication in NoCs
[2] The high-speed networks of the Summit and Sierra supercomputers
[3] Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication
[4] Occamy many-core chiplet system

[1] Manticore: A 4096-Core RISC-V Chiplet Architecture for Ultraefficient Floating-Point Computing
[2] CVA6 core Github repository
[3] The Cost of Application-Class Processing: Energy and Performance Analysis of a Linux-ready 1.7GHz 64bit RISC-V Core in 22nm FDSOI Technology
[4] Snitch core Github repository
[5] Snitch: A tiny Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads
[6] Nvidia Tegra Xavier Wikichip article
[7] Arm big.Little Wikipedia article
[8] Stream Semantic Registers: A Lightweight RISC-V ISA Extension Achieving Full Compute Utilization in Single-Issue Cores
[9] Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra
[10] PolyBench/C Website
[11] PolyBench port to HERO architecture
[12] HERO: Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Accelerators on FPGA
[13] PolyBench 4.2.1 kernel descriptions