Hardware Acceleration
From iis-projects
Revision as of 17:02, 24 November 2023
Accelerators are the backbone of big data and scientific computing. While general-purpose processor architectures such as Intel's x86 provide good performance across a wide variety of applications, it is only since the advent of general-purpose GPUs that many computationally demanding tasks have become feasible. Since these GPUs support a much narrower set of operations, it is easier to optimize the architecture to make them more efficient. Such accelerators are not limited to the high-performance sector alone. In low power computing, they allow complex tasks such as computer vision or cryptography to be performed under a very tight power budget. Without a dedicated accelerator, these tasks would not be feasible.
General-Purpose Computing
While monolithic accelerators dedicated to specific tasks are usually unbeatable in terms of throughput and energy efficiency, they also have drawbacks: they are commonly integrated onto Systems-on-Chip, where they need to interact with, and are commonly programmed by, regular general-purpose processors. This separation of compute acceleration and control limits the system's flexibility and real-world performance, as communication and data exchange between processor and accelerator become major bottlenecks.
A common alternative is to accelerate problems inside general-purpose cores directly. This can be done either by writing optimized software for specific problems, or by integrating dedicated acceleration hardware directly into the processor's ISA and pipeline. The most prominent example of the latter is the often overlooked, yet ubiquitous Floating Point Unit (FPU). However, this same idea can be applied to a large variety of problems, and different architecture extensions may even work together in effective ways.
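The trade-off between off-core accelerators and ISA-integrated units can be made concrete with a toy cost model. The sketch below is purely illustrative (the cycle counts and the fixed offload cost are hypothetical, not measurements of any IIS design): a tightly coupled unit executes in the processor's pipeline with no transfer overhead, while a loosely coupled SoC accelerator pays a fixed communication cost per invocation.

```python
# Hypothetical cost model: tightly coupled (in-pipeline) vs. loosely
# coupled (off-core) acceleration. All numbers are illustrative.

def run_tightly_coupled(data, op_cycles=1):
    """Custom instruction in the pipeline: one op per element,
    no offload or data-transfer overhead."""
    return len(data) * op_cycles

def run_off_core(data, op_cycles=1, offload_cost=100):
    """SoC accelerator: same per-element compute, plus a fixed
    cost for programming the accelerator and moving data."""
    return offload_cost + len(data) * op_cycles

# For short workloads, the fixed offload cost dominates:
small = list(range(16))
print(run_tightly_coupled(small))  # -> 16
print(run_off_core(small))         # -> 116
```

The model captures why ISA extensions pay off for fine-grained operations: only once the workload is large enough to amortize the fixed communication cost does a dedicated off-core accelerator win.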
If you are looking for a project and these ideas sound interesting to you, do not hesitate to contact us!
Nils Wistoff
- e-mail: nwistoff@iis.ee.ethz.ch
- phone: +41 44 632 06 75
- office: ETZ J85
Paul Scheffler
- e-mail: paulsc@iis.ee.ethz.ch
- phone: +41 44 632 09 15
- office: ETZ J85
Computational Units
The last decade has seen explosive growth in the quest for energy-efficient architectures and systems. An era of exponentially improving computing efficiency, driven mostly by CMOS technology scaling, is coming to an end as Moore's law falters. The obstacle of the so-called thermal or power wall is fueling a push towards computing paradigms that hold energy efficiency as the ultimate figure of merit for any hardware design.
The broad term "computational units" covers a wide range of hardware accelerators for a multitude of different systems, such as floating-point units (FPUs) for processors, or dedicated accelerators for cryptography, signal processing, etc. Such computational units are housed within full systems which usually command stringent requirements in terms of performance, size, and efficiency.
Key topics of interest are energy-efficient accelerators at various extremes of the design space, covering high-performance, ultra-low-power, or minimum-area implementations, as well as the exploration of novel paradigms in computing, arithmetic, and processor architectures.
Luca Bertaccini
- lbertaccini@iis.ee.ethz.ch
- ETZ J78
Matteo Perotti
- mperotti@iis.ee.ethz.ch
- ETZ J78
Stefan Mach
- smach@iis.ee.ethz.ch
- ETZ J89
Hardware Acceleration of DNNs and QNNs
Deep Learning (DL) and Artificial Intelligence (AI) are quickly becoming dominant paradigms for all kinds of analytics, complementing or replacing traditional data science methods. Successful at-scale deployment of these algorithms requires running them directly at the data source, i.e. in the IoT end-nodes collecting the data. However, due to the extreme constraints of these devices (in terms of power, memory footprint, and area cost), performing full DL inference in-situ in low-power end-nodes requires a breakthrough in computational performance and efficiency. It is widely known that the numerical representation typically used when developing DL algorithms (single-precision floating-point) encodes a higher precision than what is actually required to achieve high quality-of-results in inference (Courbariaux et al. 2016); this fact can be exploited in the design of energy-efficient hardware for DL. For example, by using ternary weights, i.e. quantizing all network weights to {-1, 0, 1}, we can design the fundamental compute units in hardware without an HW-expensive multiplication unit. Additionally, it allows us to store the weights much more compactly on-chip.
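The multiplication-free property of ternary weights can be sketched in a few lines. The snippet below is an illustrative software model (not IIS hardware): each weight in {-1, 0, +1} selects add, skip, or subtract, exactly the operation a hardware MAC array can realize with adders and multiplexers instead of multipliers.

```python
# Illustrative sketch: a ternary-weight dot product computed without
# any multiplications. Weights are restricted to {-1, 0, +1}.

def ternary_dot(weights, activations):
    """Dot product with ternary weights using only add/subtract."""
    acc = 0
    for w, x in zip(weights, activations):
        if w == 1:
            acc += x        # add the activation
        elif w == -1:
            acc -= x        # subtract the activation
        # w == 0: skip entirely; each weight needs only ~1.6 bits
        # (log2(3)) of on-chip storage instead of 32

    return acc

# Equivalent to sum(w * x), but multiplication-free:
print(ternary_dot([1, 0, -1, 1], [3, 7, 2, 5]))  # -> 6
```

In hardware, the same selection logic replaces a full multiplier per lane with a negate-and-mux in front of an adder tree, which is where the energy and area savings come from.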
Gianna Paulin
Georg Rutishauser
Moritz Scherer
Philip Wiese
Projects Overview
Available Projects
- Extending our FPU with Internal High-Precision Accumulation (M)
- Low Precision Ara for ML
- Hardware Exploration of Shared-Exponent MiniFloats (M)
- Approximate Matrix Multiplication based Hardware Accelerator to achieve the next 10x in Energy Efficiency: Full System Integration
- Extended Verification for Ara
- Ibex: Tightly-Coupled Accelerators and ISA Extensions
- RVfplib
- Scalable Heterogeneous L1 Memory Interconnect for Smart Accelerator Coupling in Ultra-Low Power Multicores
Projects In Progress
- Fault-Tolerant Floating-Point Units (M)
- Virtual Memory Ara
- New RVV 1.0 Vector Instructions for Ara
- Big Data Analytics Benchmarks for Ara
- An all Standard-Cell Based Energy Efficient HW Accelerator for DSP and Deep Learning Applications
Completed Projects
- Integrating an Open-Source Double-Precision Floating-Point DivSqrt Unit into CVFPU (1S)
- Investigating the Cost of Special-Case Handling in Low-Precision Floating-Point Dot Product Units (1S)
- Optimizing the Pipeline in our Floating Point Architectures (1S)
- Streaming Integer Extensions for Snitch (M/1-2S)
- A Unified Compute Kernel Library for Snitch (1-2S)
- NVDLA meets PULP
- Hardware Accelerators for Lossless Quantized Deep Neural Networks
- Floating-Point Divide & Square Root Unit for Transprecision
- Low-Energy Cluster-Coupled Vector Coprocessor for Special-Purpose PULP Acceleration
- Design and Implementation of an Approximate Floating Point Unit