A reduction-capable AXI XBAR for fast M-to-1 communication (1M)

Overview

Status: Completed

  • Type: Master Thesis
  • Supervisors:
    • Luca Colagrande: colluca@iis.ee.ethz.ch
    • Thomas Benz: tbenz@iis.ee.ethz.ch
  • Student: Lorenzo Leone

Introduction

Figure 1: A block diagram of the Occamy chip architecture

To realize the performance potential of many-core architectures, efficient and scalable on-chip communication is required [1]. Collective communication lies on the critical path of many applications; its criticality is evident in the dedicated collective and barrier networks employed by several supercomputers, such as Summit [2], the NYU Ultracomputer, the Cray T3D and Blue Gene/L. Likewise, many-core architectures would benefit from hardware support for collective communication, but they may not be able to afford separate, dedicated networks due to routing and area costs. For this reason, several works in the literature have explored integrating collective communication support directly into the existing NoC [1, 3].

Collective communication operations are said to be "rooted" (or "asymmetric") when a specific node (the root) is either the sole origin (producer) of the data to be redistributed or the sole destination (consumer) of the data or results contributed by the nodes involved in the communication. Conversely, in "non-rooted" (or "symmetric") operations, all nodes both contribute and receive data [7].

Our focus is on rooted operations where the root node exchanges a single datum, i.e. multicast (when the root is a producer) and reduction (when the root is a consumer) operations.

In another work, we explored the cost of integrating multicast support directly into the interconnect of a shared-memory many-core system called Occamy [4]. In Occamy, 216+1 cores and their tightly-coupled data memories are interconnected by a hierarchy of AXI XBARs [5] in a partitioned global address space (PGAS). Each AXI XBAR interconnects a set of AXI masters with a set of AXI slaves, enabling unicast communication between any master and any slave. Our work involved extending the AXI XBAR to multicast transactions from an AXI master to multiple AXI slaves.

Reduction operations are widespread in parallel numerical applications. Reductions are also found in barrier synchronization operations, which are at the core of the BSP programming model [6], and in cache coherency protocols [1].
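As an illustration of the communication pattern, the following minimal C/OpenMP snippet performs a sum reduction across threads (compile with e.g. -fopenmp). It is only a sketch of what a reduction looks like in parallel software; it does not use any Occamy-specific API or runtime.

```c
#include <stdio.h>

int main(void) {
    enum { N = 1024 };
    double x[N], sum = 0.0;

    // Simple, known input: the expected sum is N.
    for (int i = 0; i < N; i++)
        x[i] = 1.0;

    // Each thread produces a partial sum; the runtime combines the
    // partial sums into a single result. This is the many-to-1
    // (reduction) communication pattern discussed above.
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %.1f (expected %d)\n", sum, N);
    return 0;
}
```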

Project description

Figure 2: A block diagram of the AXI XBAR IP

As opposed to multicast (1-to-M communication), in this thesis you will explore the cost of integrating reduction operations (M-to-1 communication) into the AXI XBAR.

A block diagram of the AXI XBAR is shown in Figure 2. The XBAR provides the connectivity to route read and write requests from every master (connected to a slave port of the XBAR) to every slave (connected to a master port of the XBAR). Responses from the slaves are arbitrated and returned to the requesting masters. In our multicast work, we extended the XBAR to forward write transactions concurrently to multiple slaves, improving the bandwidth of the XBAR for 1-to-many communication. This extension is useful in scenarios where multiple (consumer) cores all need the same data produced by another (producer) core. To make use of our extension, the producer has to initiate the data exchange. Alternatively, the same scenario could see the consumer cores requesting the data from the producer core; despite achieving the same goal, this communication scheme would better be called many-from-1. The difference between the 1-to-many and many-from-1 schemes lies in the agent which initiates the communication: the producer or the consumers of the data, respectively.

Although reduction operations are traditionally known as many-to-1 communication, by a similar argument we can envision carrying out a reduction with either a many-to-1 or a 1-from-many scheme. In the first case, the operation is initiated by the producers of the vector to be reduced; in the second, by the consumer of the reduced data. In the many-to-1 case, the producers (masters) issue write requests which are merged in the XBAR (this is where the reduction occurs) before being forwarded to the consumer (slave). The response is then multicast back to the producers. In the 1-from-many case, the consumer (master) issues a read request which the XBAR multicasts to the producers (slaves). The responses are then merged at the XBAR (this is where the reduction occurs) and a single response is returned to the consumer.
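To make the two schemes concrete, the following purely behavioral C sketch models only the merge step a reduction-capable XBAR would perform: in the many-to-1 case it combines the write data issued by the producers, in the 1-from-many case it combines the read responses returned by the producers. There are no AXI signals or timing here, all names are illustrative, and an integer sum is assumed as the reduction operation.

```c
#include <stdio.h>
#include <stdint.h>

#define NUM_PRODUCERS 4

// Stand-in for the merge logic of a reduction-capable XBAR (here: sum).
static uint64_t merge(const uint64_t *vals, int n) {
    uint64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += vals[i];
    return acc;
}

// Many-to-1: each producer (master) issues a write; the XBAR merges the
// write data and forwards a single write to the consumer (slave).
static uint64_t reduce_many_to_1(const uint64_t wdata[NUM_PRODUCERS]) {
    return merge(wdata, NUM_PRODUCERS);
}

// 1-from-many: the consumer (master) issues one read; the XBAR multicasts
// it to the producers (slaves) and merges their read responses into a
// single response for the consumer.
static uint64_t reduce_1_from_many(const uint64_t rdata[NUM_PRODUCERS]) {
    return merge(rdata, NUM_PRODUCERS);
}

int main(void) {
    uint64_t contributions[NUM_PRODUCERS] = {1, 2, 3, 4};
    printf("many-to-1 result:   %llu\n",
           (unsigned long long)reduce_many_to_1(contributions));
    printf("1-from-many result: %llu\n",
           (unsigned long long)reduce_1_from_many(contributions));
    return 0;
}
```

The merged value is the same in both cases; the schemes differ in which agent initiates the transaction and on which AXI channels the merge happens, which is what drives the PPA and flexibility trade-offs discussed below.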

As part of this project, you are tasked with reasoning about the two approaches and their implications for the programming flexibility and the power, performance and area (PPA) of the reduction extension.

Detailed task description

To break it down in more detail, you will:

  • Study the existing AXI XBAR, including:
    • RTL implementation
    • Verification infrastructure
    • Multicast extension
    • How it is used as a building block in Occamy's interconnect
  • Review state-of-the-art reduction implementations in NoCs
  • Understand the implications of a many-to-1 vs. a 1-from-many implementation, in particular in terms of:
    • PPA: does one of the two schemes cost more to implement in hardware?
    • Programming flexibility: can one of the two schemes be used in more applications than the other?
  • Implement a reduction extension for the AXI XBAR:
    • Choose one scheme among the many-to-1 and 1-from-many options
    • Plan and carry out RTL modifications
    • Extend the testbench and verification infrastructure to verify the design (e.g. by checking against a reference model; see the sketch after this list)
    • Explore PPA overheads and correlate them with your RTL changes
    • Iterate and improve PPA of the design
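The following is a minimal C sketch of the reference-model idea behind such a check, assuming a sum reduction over the values contributed by each master. The DUT result and its interface are purely hypothetical; the actual checks would live in the XBAR's verification environment rather than in C.

```c
#include <stdio.h>
#include <stdint.h>

// Reference model: expected result of a sum reduction over the values
// contributed by each master port.
static uint64_t golden_sum(const uint64_t *contrib, int num_masters) {
    uint64_t expected = 0;
    for (int i = 0; i < num_masters; i++)
        expected += contrib[i];
    return expected;
}

// Hypothetical checker: compare the value observed at the consumer side
// of the design under test (DUT) against the reference model.
static int check_reduction(uint64_t dut_result,
                           const uint64_t *contrib, int num_masters) {
    uint64_t expected = golden_sum(contrib, num_masters);
    if (dut_result != expected) {
        fprintf(stderr, "MISMATCH: got %llu, expected %llu\n",
                (unsigned long long)dut_result,
                (unsigned long long)expected);
        return 1;
    }
    return 0;
}

int main(void) {
    uint64_t contrib[4] = {10, 20, 30, 40};
    uint64_t dut_result = 100;  // in a real flow, observed from the simulated XBAR
    return check_reduction(dut_result, contrib, 4);
}
```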

Stretch goals

Optional stretch goals include:

  • Integrate and benchmark the reduction extension in Occamy:
    • Possibly extend the cores to be able to issue reduction transactions
    • Extend Solder to generate XBARs with the reduction extension and proper USER signal connections
    • Develop software to compare the runtime of a reduction with and without your extension
  • Extend our OpenMP runtime to make use of your extension
  • Measure how your extension improves the overall runtime of real-world OpenMP workloads

Character

  • 10% Literature/architecture review
  • 70% RTL design and verification
  • 20% Physical design exploration

Prerequisites

  • Strong interest in computer architecture
  • Experience with digital design in SystemVerilog as taught in VLSI I
  • Preferred: Experience in bare-metal or embedded C programming
  • Preferred: Experience with ASIC implementation flow as taught in VLSI II

References

[1] Supporting Efficient Collective Communication in NoCs. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6168953
[2] The high-speed networks of the Summit and Sierra supercomputers. https://ieeexplore.ieee.org/document/8961159
[3] Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication. https://dl.acm.org/doi/10.1145/2155620.2155630
[4] Occamy many-core chiplet system. https://pulp-platform.org/occamy/
[5] PULP platform's AXI XBAR IP documentation. https://github.com/pulp-platform/axi/blob/master/doc/axi_xbar.md
[6] A Bridging Model for Parallel Computation. https://dl.acm.org/doi/pdf/10.1145/79173.79181
[7] Encyclopedia of Parallel Computing: Collective Communication entry. https://link.springer.com/referenceworkentry/10.1007/978-0-387-09766-4_28
[8] Carpool: A Bufferless On-Chip Network Supporting Adaptive Multicast and Hotspot Alleviation. https://people.inf.ethz.ch/omutlu/pub/carpool-bufferless-network_ics17.pdf