
A reduction-capable AXI XBAR for fast M-to-1 communication (1M)



Overview

Status: Available

Introduction

Figure 1: A block diagram of the Occamy chip architecture

To realize the performance potential of many-core architectures, efficient and scalable on-chip communication is required [1]. Collective communication lies on the critical path of many applications; its criticality is evident in the dedicated collective and barrier networks employed in several supercomputers, such as Summit [2], the NYU Ultracomputer, the Cray T3D and Blue Gene/L. Likewise, many-core architectures would benefit from hardware support for collective communication, but may not be able to afford separate, dedicated networks due to routing and area costs. For this reason, several works in the literature have explored integrating collective communication support directly into the existing NoC [1, 3].

Collective communication operations are said to be "rooted" (or "asymmetric") when a specific node (the root) is either the sole origin (or producer) of data to be redistributed or the sole destination (or consumer) of data or results contributed by the nodes involved in the communication. Conversely, in "non-rooted" (or "symmetric") operations all nodes contribute and receive data [7].

Our focus is on rooted operations where the root node exchanges a single datum, i.e. multicast (when the root is a producer) and reduction (when the root is a consumer) operations.

In another work, we explored the cost of integrating multicast support directly into the interconnect of a shared-memory many-core system called Occamy [4]. In Occamy, 216+1 cores and their tightly-coupled data memories are interconnected by a hierarchy of AXI XBARs [5] in a partitioned global address space (PGAS). Each AXI XBAR interconnects a set of AXI masters with a set of AXI slaves, enabling unicast communication between any master and any slave. Our work involved extending the AXI XBAR to multicast transactions from an AXI master to multiple AXI slaves.

Reduction operations are widespread in parallel numerical applications. Reductions are also found in barrier synchronization operations, which are at the core of the BSP programming model [6], and in cache coherency protocols [1].
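For a concrete software-level picture, the snippet below (a generic OpenMP illustration, not Occamy-specific code) shows a sum reduction and a barrier; on a many-core system both ultimately map to M-to-1 traffic over the interconnect.

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      const int n = 1 << 20;
      double sum = 0.0;

      // Every thread contributes a partial sum; OpenMP combines them into
      // a single value -- an M-to-1 (reduction) communication pattern.
      #pragma omp parallel for reduction(+ : sum)
      for (int i = 0; i < n; i++) {
          sum += 1.0 / (i + 1);
      }

      // A barrier also contains a reduction at its core: all threads signal
      // their arrival (M-to-1) before execution proceeds.
      #pragma omp parallel
      {
          #pragma omp barrier
      }

      printf("sum = %f\n", sum);
      return 0;
  }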

Project description

Figure 2: A block diagram of the AXI XBAR IP

As opposed to multicast (or 1-to-M communication), in this thesis you will explore the cost of integrating reduction operations (or M-to-1 communication) into the AXI XBAR.

A block diagram of the AXI XBAR is shown in Figure 2. The XBAR provides the connectivity to route read and write requests from every master (connected to a slave port of the XBAR) to every slave (connected to a master port of the XBAR). Responses from the slaves are arbitrated and returned to the requesting masters. In our multicast work, we extended the XBAR to forward write transactions concurrently to multiple slaves, improving the XBAR's bandwidth for 1-to-many communication. This extension is useful in scenarios where multiple (consumer) cores all need the same data produced by another (producer) core. To make use of it, the producer has to initiate the data exchange. Alternatively, the same scenario could see the consumer cores requesting the data from the producer core. While achieving the same goal, this communication scheme would better be described as many-from-1: the difference between the 1-to-many and many-from-1 schemes lies in the agent that initiates the communication, respectively the producer or the consumers of the data being communicated.

Although reduction operations are traditionally known as many-to-1 communication, by a similar argument a reduction could be carried out with either a many-to-1 or a 1-from-many scheme. In the first case, the operation is initiated by the producers of the vector to be reduced; in the second, by the consumer of the reduced data. In the many-to-1 case, the producers (masters) issue write requests which are merged in the XBAR (where the reduction occurs) before a single request is forwarded to the consumer (slave); the response is then multicast back to the producers. In the 1-from-many case, the consumer (master) issues a read request which the XBAR multicasts to the producers (slaves); the responses are then merged at the XBAR (where the reduction occurs) and a single response is returned to the consumer.
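To make the two schemes more concrete, the sketch below contrasts them from the cores' point of view. It is only an illustration: the address REDUCTION_DST_ADDR, the helper names and the exact semantics are hypothetical placeholders, not part of the existing AXI XBAR or Occamy software stack; how reduction transactions are actually encoded (dedicated address ranges, AXI USER bits, ...) is part of the design space this project explores.

  #include <stdint.h>

  // Hypothetical address and helpers, for illustration only: the actual
  // mechanism for marking reduction transactions is what this project
  // has to define.
  #define REDUCTION_DST_ADDR ((uintptr_t)0x80001000u)  // made-up destination

  // Many-to-1 scheme: each producer core initiates a *write* carrying its
  // operand; the XBAR merges the colliding writes (this is where the
  // reduction happens), forwards a single write to the consumer, and
  // multicasts the response back to all producers.
  static inline void contribute_many_to_1(uint64_t my_operand) {
      volatile uint64_t *dst = (volatile uint64_t *)REDUCTION_DST_ADDR;
      *dst = my_operand;  // ordinary store; reduction semantics live in the XBAR
  }

  // 1-from-many scheme: the single consumer core initiates a *read*; the
  // XBAR multicasts the read request to all producers, merges their read
  // responses (again, the reduction happens in the XBAR) and returns one
  // reduced value to the consumer.
  static inline uint64_t collect_1_from_many(volatile const uint64_t *producers) {
      return *producers;  // ordinary load; the returned value is already reduced
  }

In both schemes the arithmetic happens inside the XBAR; the sketch only makes explicit which agent issues the AXI transaction, which is exactly what distinguishes the two approaches.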

As part of this project, you are given the task of reasoning about the two approaches and their implications on the programming flexibility and the PPA (power, performance, area) of the reduction extension.

Detailed task description

To break it down in more detail, you will:

  • Study the existing AXI XBAR, including:
    • RTL implementation
    • Verification infrastructure
    • Multicast extension
    • How it is used as a building block in Occamy's interconnect
  • Understand the implications of a many-to-1 vs. a 1-from-many implementation, in particular in terms of:
    • PPA: does one of the two schemes cost more to implement in hardware?
    • programming flexibility: can one of the two schemes be used in more applications than the other?
  • Implement a reduction extension for the AXI XBAR:
    • Plan and carry out RTL modifications
    • Extend the testbench and verification infrastructure to verify the design
    • Explore PPA overheads and correlate them with your RTL changes
    • Iterate and improve PPA of the design

Stretch goals

Additional optional stretch goals may include:

  • Integrate and benchmark the reduction extension in Occamy:
    • Possibly extend the cores to be able to issue reduction transactions
    • Extend Solder to generate XBARs with the reduction extension and proper USER signal connections
    • Develop software to compare the runtime of a reduction with and without your extension (a baseline measurement sketch follows this list)
  • Extend our OpenMP runtime to make use of your extension
  • Measure the improvement on real-world OpenMP workloads
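As a starting point for the benchmarking stretch goal, a baseline could be as simple as the OpenMP sketch below: timing a plain software reduction gives the reference number against which the hardware-assisted variant (not shown, since its programming interface is part of this project) would be compared.

  #include <stdio.h>
  #include <omp.h>

  #define N (1 << 16)

  int main(void) {
      static double data[N];
      for (int i = 0; i < N; i++) data[i] = 1.0;

      // Baseline: software reduction through the existing (unicast-only) XBAR.
      double t0 = omp_get_wtime();
      double sum = 0.0;
      #pragma omp parallel for reduction(+ : sum)
      for (int i = 0; i < N; i++) {
          sum += data[i];
      }
      double t1 = omp_get_wtime();
      printf("software reduction: sum = %.1f, time = %.6f s\n", sum, t1 - t0);

      // The variant using the reduction extension would be timed the same
      // way and compared against this baseline.
      return 0;
  }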

Character

  • 10% Literature/architecture review
  • 70% RTL design and verification
  • 20% Physical design exploration

Prerequisites

  • Strong interest in computer architecture
  • Experience with digital design in SystemVerilog as taught in VLSI I
  • Preferred: Experience in bare-metal or embedded C programming
  • Preferred: Experience with ASIC implementation flow as taught in VLSI II

References

[1] Supporting Efficient Collective Communication in NoCs
[2] The high-speed networks of the Summit and Sierra supercomputers
[3] Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication
[4] Occamy many-core chiplet system
[5] PULP platform's AXI XBAR IP documentation
[6] A Bridging Model for Parallel Computation
[7] Encyclopedia of Parallel Computing: Collective Communication entry