PULPonFPGA: Hardware L2 Cache

From iis-projects

[[File:Pulpjuno.png|thumb|800px]]

==Intro==
 
While high-end heterogeneous systems-on-chip (SoCs) increasingly support heterogeneous uniform memory access (hUMA), their low-power counterparts targeting the embedded domain still lack basic features like virtual memory support for accelerators. Instead of simply passing virtual address pointers, explicit data management involving copies is needed to share data between the host processor and accelerators, which hampers both programmability and performance.
  
At IIS, we study the integration of programmable many-core accelerators into embedded heterogeneous SoCs. We have developed a mixed hardware/software solution to enable lightweight virtual memory support for many-core accelerators in heterogeneous embedded SoCs [1,2]. Recently, we have switched to a new evaluation platform based on the Juno ARM Development Platform [3]. This system combines a modern ARMv8 multicluster CPU with a Xilinx Virtex-7 XC7V2000T FPGA [4] capable of implementing PULP [5] with 4 to 8 clusters and a total of 32 to 64 cores.

==Short Description==
==Short Description==
The two subsystems reside on two separate chips connected through a high-bandwidth off-chip interface. To reduce the traffic on this interface, PULP can be equipped with an L2 cache memory that serves frequent accesses to the shared main memory. This also reduces the stress on our lightweight virtual memory solution. While the overall platform is of considerable complexity, your job is well defined and isolated: the goal of this project is to develop the hardware IP of this L2 cache.
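To make the cache lookup concrete, the following C sketch shows how a memory address splits into offset, index, and tag fields. The geometry used here (64 KiB capacity, 4-way set-associative, 64-byte lines) is purely hypothetical; choosing the actual parameters is part of the design work.

```c
#include <stdint.h>

/* Hypothetical L2 geometry for illustration only: 64 KiB capacity,
 * 4-way set-associative, 64-byte cache lines. */
#define LINE_BYTES  64u
#define NUM_WAYS     4u
#define CACHE_BYTES (64u * 1024u)
#define NUM_SETS    (CACHE_BYTES / (LINE_BYTES * NUM_WAYS)) /* 256 sets */

/* Byte position of the access within its cache line. */
uint32_t offset_of(uint64_t addr) { return (uint32_t)(addr % LINE_BYTES); }

/* Which set of the cache the line maps to. */
uint32_t index_of(uint64_t addr) { return (uint32_t)((addr / LINE_BYTES) % NUM_SETS); }

/* Remaining upper address bits, stored with the line and compared on lookup. */
uint64_t tag_of(uint64_t addr) { return addr / (LINE_BYTES * NUM_SETS); }
```

On a lookup, the index selects one set, the tags of its four ways are compared in parallel, and a hit additionally requires the valid bit of the matching way to be set.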
  
You can start from an existing IP block designed in a previous project. This IP is equipped with proprietary interfaces and needs to be adapted to the widely adopted AXI4 protocol [6], whose built-in cache control signals shall be used by your new IP. Besides designing the IP, your job also includes developing a set of testbenches to verify the functionality and analyze the design. The work primarily targets an implementation on the Xilinx Virtex-7 FPGA, but an ASIC back-end design can also be implemented if desired.
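Since the AXI4 AxCACHE signals will steer the cache, a short sketch of how an L2 might interpret them may help. The bit positions follow the AMBA AXI4 specification (bit 0 bufferable, bit 1 modifiable, bits 2 and 3 allocate hints), but the policy functions themselves are only one plausible mapping for illustration, not a fixed specification for this project.

```c
#include <stdbool.h>
#include <stdint.h>

/* AXI4 AxCACHE bit positions per the AMBA AXI4 specification. */
#define AXCACHE_BUFFERABLE  (1u << 0)
#define AXCACHE_MODIFIABLE  (1u << 1) /* "Cacheable" in AXI3 terminology */
#define AXCACHE_ALLOC       (1u << 2) /* allocate hint */
#define AXCACHE_OTHER_ALLOC (1u << 3) /* other-allocate hint */

/* A non-modifiable transaction (e.g. device access) must bypass the L2;
 * this is the minimal safety condition for caching it at all. */
bool l2_may_cache(uint8_t axcache) {
    return (axcache & AXCACHE_MODIFIABLE) != 0;
}

/* One plausible policy: fill a line on a miss only if the transaction is
 * modifiable and at least one allocate hint is set. */
bool l2_allocate_on_miss(uint8_t axcache) {
    return l2_may_cache(axcache) &&
           (axcache & (AXCACHE_ALLOC | AXCACHE_OTHER_ALLOC)) != 0;
}
```

For example, an AxCACHE value of 0x0 (device, non-bufferable) would bypass the cache entirely, while 0x7 (modifiable, bufferable, allocate) would both be looked up and allocated on a miss.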
 
 
 
===Status: Available===
: Looking for 1-2 Interested Master Students (Semester Project)
: Supervision: [[:User:Vogelpi|Pirmin Vogel]], [[:User:Schaffner|Michael Schaffner]], [[:User:Mandrea|Andrea Marongiu]]
  
 
===Character===
: 20% Theory, Algorithms and Simulation
: 50% VHDL, FPGA/ASIC Design
: 30% Verification
 
  
 
===Prerequisites===
: VLSI I
: VHDL/SystemVerilog, C
 
  
 
===Professor===
: Luca Benini
  
 
===References===
# P. Vogel, A. Marongiu, L. Benini, "Lightweight Virtual Memory Support for Many-Core Accelerators in Heterogeneous Embedded SoCs", ''Proceedings of the 10th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS'15)'', Amsterdam, The Netherlands, 2015. [http://dl.acm.org/citation.cfm?id=2830846 link]
# P. Vogel, A. Marongiu, L. Benini, "Lightweight Virtual Memory Support for Zero-Copy Sharing of Pointer-Rich Data Structures in Heterogeneous Embedded SoCs", ''IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 7'', 2017. [http://ieeexplore.ieee.org/document/7797491/ link]
# Juno ARM Development Platform [http://www.arm.com/products/tools/development-boards/versatile-express/juno-arm-development-platform.php link]
# Xilinx Virtex-7 XC7V2000T FPGA [http://www.xilinx.com/support/documentation/data_sheets/ds180_7Series_Overview.pdf link]
# PULP [http://iis-projects.ee.ethz.ch/index.php/PULP link]
# AMBA 4 AXI Protocol Specifications [http://www.arm.com/products/system-ip/amba-specifications.php link]
# D. J. Sorin, M. D. Hill, D. A. Wood, "A Primer on Memory Consistency and Cache Coherence", ''Morgan & Claypool Publishers'', 2011. [http://dl.acm.org/citation.cfm?id=2028905 link]
  
 
[[#top|↑ top]]
 
[[Category:Digital]]
[[Category:Available]]
[[Category:Semester Thesis]]
[[Category:Master Thesis]]
[[Category:PULP]]
[[Category:System Design]]
[[Category:Marongiu]]
[[Category:PSocrates]]
[[Category:Akurth]]
[[Category:Heterogeneous Acceleration Systems]]
  
 
