Ultra-low power processor design
Latest revision as of 19:01, 30 January 2014
- 1 Short Description
- 2 Detailed Task Description
- 2.1 Introduction
- 2.2 Project Description
- 2.3 Goals
- 2.4 Practical Details
- 3 Results
- 4 Links
In this project, you will contribute to a new and exciting project at the IIS on designing an extremely power-efficient multi-processor hardware platform that can achieve milliWatt/GOPS energy efficiency. Your task will be the optimization of the central operational core, which is based on the open-source OpenRISC processor. While the current OpenRISC processor is functionally correct, we believe that its energy efficiency can be improved significantly by optimizing the microarchitecture. We will use a state-of-the-art 28 nm SOI process to evaluate the performance of the processor.
- Date: Fall Semester 2014
- Students: Matthias Baer, Renzo Andri
- Supervisors: Frank K. Gurkaynak, Michael Gautschi
- 25% Theory
- 50% ASIC Design
- 25% EDA tools
Detailed Task Description
Energy Efficient Computing using Multicore Systems
Ultra-low-voltage (ULV) computing circuits, where the supply voltage is near the threshold voltage of the transistors, have emerged as an attractive approach for ultra-low-power (ULP) embedded systems. The goal is to achieve a computing platform which is superior in energy efficiency.
A high energy efficiency in terms of GOPS/mW can be achieved by splitting computation among several clusters. Each cluster comprises several simple processor cores which are operated in the ULP region.
Such a cluster is shown above. In this example, four processing elements are used to cover the computational tasks and eight tightly coupled data memories (TCDM) are used as common storage for data, which can be accessed through a low-latency interconnect arbitration unit. Inter-cluster communication and direct memory access (DMA) can be added to improve the functionality.
The main component of the cluster in the figure above is the processing element, which is typically a reduced instruction set computing (RISC) processor due to its simplicity. One possible candidate is the OpenRISC processor core, an open-source RISC processor. A stable implementation of the OpenRISC core, written in Verilog, exists and can be downloaded.
The core consists of a four-stage integer pipeline. The four stages are Instruction Fetch (IF), Instruction Decode (ID), Execute (EX) and Write Back (WB). The core supports direct-mapped data and instruction caches, instruction and data memory management units (I/DMMU), power management units, etc. All this functionality can be enabled or disabled according to the user's preferences.
While the core is functionally correct, its microarchitecture can be improved a lot. The Verilog implementation of the core showed poor performance in terms of instructions per cycle (IPC). The poor performance was mainly due to stalls during load/store operations, stalls during multiplications and poor branch execution. These problems have already been resolved and the IPC has been improved significantly.
The microarchitecture of the core is still far from perfect. One main point that has to be improved is the pipeline itself: in the current implementation, the complete pipeline is stalled instead of only the affected stage. Another area of interest is the multiplier, which is implemented with a three-cycle latency to break the critical path. While this might be optimal in an FPGA setup, in an ASIC it might be possible to perform a multiplication, which is done in the execute stage, in one or two cycles.
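To get a feeling for how the multiplier latency discussed above affects IPC, the following minimal Python sketch counts cycles for a short instruction trace. It is a simplified model and not the actual core: it assumes a single-issue, in-order pipeline in which a multiplication occupies the execute stage for `mul_latency` cycles and stalls everything behind it, and it ignores all other stall sources. The function names and the example trace are made up for illustration.

```python
def cycles_for_trace(trace, mul_latency=3, pipeline_depth=4):
    """Count cycles for a single-issue, in-order pipeline (simplified model).

    Assumption: every instruction takes one cycle in EX, except "mul",
    which holds EX for mul_latency cycles and stalls the whole pipeline.
    """
    cycles = pipeline_depth - 1  # cycles needed to fill the pipeline
    for op in trace:
        cycles += mul_latency if op == "mul" else 1
    return cycles


def ipc(trace, mul_latency=3, pipeline_depth=4):
    """Instructions per cycle for the given trace under the model above."""
    return len(trace) / cycles_for_trace(trace, mul_latency, pipeline_depth)


# Hypothetical trace with two multiplications out of eight instructions.
trace = ["add", "mul", "add", "load", "mul", "add", "add", "store"]

for latency in (3, 2, 1):
    print(latency, cycles_for_trace(trace, latency), round(ipc(trace, latency), 2))
# cycle counts: 15 for latency 3, 13 for latency 2, 11 for latency 1
```

The model only counts cycles. A shorter multiplier latency may lengthen the critical path and therefore lower the achievable clock frequency, which is exactly the trade-off that has to be evaluated in the ASIC flow.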
We want to build a clean OpenRISC core which will later be used in the cluster. Throughout this project, the focus is on the implementation and optimization of one single core.
Implementation of the OpenRISC Processor
The original version of the Verilog implementation of the OpenRISC core was already modified to improve the core in terms of IPC. This modified version will serve as the golden model for your implementation. However, it contains a lot of functionality which will never be used in the target system, and it is therefore not necessary to maintain all of it. Removing caches, IMMU, DMMU, etc. will greatly simplify the design and increase the readability of the code.
Clean and well-documented Implementation
The OpenRISC is an open-source processor and it is important that the code of the processor is clean and well documented. The pipeline and the signals which influence the pipeline have to be clear, and it has to be easy to add additional conditions to modify the behaviour of the pipeline. To obtain a good and clean design it is important to follow coding guidelines, such as separating combinational and sequential processes, using packages to define constants, adhering to naming conventions, avoiding excessive use of defines that make the code unreadable, and the like. To help you achieve good code, your supervisors will organize a code review after a few weeks of HDL coding.
Simulation of Instruction Cache and Data Cache Misses
In the future, the core is intended to be used in a multicore environment similar to the one shown in the figure above, where several cores use one or more shared memories. In such a system it is possible that more than one core tries to access the same memory at the same time. If that happens, at least one processor will have to wait (stall) until the memory becomes available. To simulate such a miss in a single-core structure, an acknowledge signal will be added to each load/store operation. This acknowledge will be driven by the testbench such that memory collisions can be simulated.
A similar modification has to be added to the instruction fetch unit of the processor to simulate instruction cache misses.
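The acknowledge mechanism described above can be sketched as a small behavioural model in Python. This is only an illustration under stated assumptions: the class `AckMemory`, its `collision_prob` parameter and the function `run_accesses` are hypothetical names, and the random acknowledge pattern stands in for whatever pattern the real testbench will apply.

```python
import random


class AckMemory:
    """Testbench-style memory model that may withhold the acknowledge."""

    def __init__(self, size, collision_prob=0.0, seed=0):
        if not 0.0 <= collision_prob < 1.0:
            raise ValueError("collision_prob must be in [0, 1)")
        self.data = [0] * size
        self.collision_prob = collision_prob
        self.rng = random.Random(seed)  # seeded so that runs are reproducible

    def ack(self):
        """Return True if the request is granted in this cycle."""
        return self.rng.random() >= self.collision_prob


def run_accesses(mem, accesses):
    """Issue accesses in order; the 'core' stalls until each one is acknowledged.

    accesses is a list of (kind, addr, value) with kind "load" or "store".
    Returns (total_cycles, stall_cycles, loaded_values).
    """
    cycles = stalls = 0
    loaded = []
    for kind, addr, value in accesses:
        while not mem.ack():  # simulated collision: wait one more cycle
            cycles += 1
            stalls += 1
        cycles += 1  # cycle in which the access is granted
        if kind == "store":
            mem.data[addr] = value
        else:
            loaded.append(mem.data[addr])
    return cycles, stalls, loaded


accesses = [("store", 0, 7), ("store", 1, 9), ("load", 0, None), ("load", 1, None)]
print(run_accesses(AckMemory(4), accesses))
print(run_accesses(AckMemory(4, collision_prob=0.5, seed=1), accesses))
```

With `collision_prob=0.0` every access is granted immediately, so the stall logic can be checked against the stall-free case. The same idea applies to the instruction fetch side by delaying the acknowledge of instruction requests.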
Design and Definition of a Test Interface
One of the most important things is the test interface. The main problems we will face are the limited number of I/O pins, the lack of flash memory, and the speed. We will not be able to attach the final chip to an external memory, because the pin count will not be sufficient for addresses, 32-bit data and 32-bit instructions. Instead, the configuration ports will be used to initialize the memory with instructions and data. After the configuration is done, the processor will start the program, and at the end it will output a signal which indicates that the computation has finished. Once this signal has been observed, the memory can be read through the same configuration ports and compared to the expected responses. Note that for power measurements of the processor it is important to know the start and end point of a program.
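The test flow described above (load memory through the configuration port, start, wait for the finished signal, read back and compare) can be sketched in Python as follows. Everything here is a hypothetical stand-in: `ChipModel`, `cfg_write`, `cfg_read`, `start` and `done` are illustrative names and not the real interface, and a Python function takes the place of the program that would run on the core.

```python
class ChipModel:
    """Hypothetical stand-in for the chip: a memory behind a configuration port."""

    def __init__(self, mem_words, program):
        self.mem = [0] * mem_words
        self._program = program  # Python callable standing in for the loaded code
        self.done = False        # models the "computation finished" output

    def cfg_write(self, addr, word):
        self.mem[addr] = word & 0xFFFFFFFF  # 32-bit words

    def cfg_read(self, addr):
        return self.mem[addr]

    def start(self):
        self.done = False
        self._program(self.mem)
        self.done = True


def run_test(chip, image, expected):
    """Load the image, run the program, and compare memory to the expected values.

    Returns a dict {addr: (actual, expected)} containing only the mismatches.
    """
    for addr, word in image.items():
        chip.cfg_write(addr, word)
    chip.start()
    if not chip.done:
        raise RuntimeError("chip did not signal completion")
    return {
        addr: (chip.cfg_read(addr), word)
        for addr, word in expected.items()
        if chip.cfg_read(addr) != word
    }


def sum_four(mem):
    """Stand-in program: add the first four words and store the result at address 4."""
    mem[4] = sum(mem[0:4]) & 0xFFFFFFFF


chip = ChipModel(8, sum_four)
mismatches = run_test(chip, {0: 1, 1: 2, 2: 3, 3: 4}, {4: 10})
print("PASS" if not mismatches else mismatches)
```

In a real measurement setup, the moments when `start` is issued and when `done` rises would mark the window over which power is measured.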
Depending on the project progress and the student's interests one or more of the following sub-tasks can be implemented:
- Implementation of Hardware Loops: So far the performance of the core has been analyzed in terms of IPC. Other important measures are the code density and the number of cycles required to run a specific program. Both measures can be improved by replacing for-loops with hardware loops. With hardware loops, the processor no longer takes care of incrementing and comparing loop variables, but uses a separate unit which takes care of these computations instead. Important information about each loop, such as entry and exit points, iteration count, etc., will be stored in a separate LUT, which can be accessed by the hardware loop execution unit. To store this information in the beginning, each loop requires a few initialization cycles. The authors have shown that the runtime of a benchmark test such as a matrix multiplication can be reduced by up to 42% by replacing all the control computation of a loop with powerful hardware loop instructions. We believe that the impact on the OpenRISC core will be similar.
- Adding User I/O to the Processor: If the processor is not embedded in a system, it is like a black box which computes something and stores the result in the data memory. It will be possible to read the memory through the test interface, but it is also possible to add some specific I/O pins which are directly connected to the processor. These pins can be inputs, outputs or even bidirectional to make better use of the limited pin resources. How these pins are used is up to the students and will be discussed once the basic functionality of the core is implemented. A few ideas are displaying outputs of the processor on an LED, adding some kind of sensor and letting the processor process the sensed data, or running software on the processor which controls an FPGA board or some other chip through the user I/O pins. Many other great ideas exist and can be implemented.
- Pipeline/Processor Optimizations: Right now the processor is implemented with four pipeline stages. While increasing the number of pipeline stages also increases the overhead of the control and forwarding paths, fewer pipeline stages could simplify the design even further, probably at the cost of a slightly longer critical path. We believe that it would be possible to combine the IF and ID stages and thereby reduce the number of pipeline stages to three. The impact on the critical path and the complexity of the design will be evaluated in this task.
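The hardware-loop sub-task in the list above can be illustrated with a small Python model of a single loop entry (start address, end address, iteration count). This is a sketch under simplifying assumptions, not a proposal for the actual unit: the class `HwLoopUnit` and its methods are hypothetical, the program counter advances by one per instruction, and only one loop (no nesting) is modelled.

```python
class HwLoopUnit:
    """One hardware-loop entry: start PC, end PC and remaining iterations."""

    def __init__(self):
        self.start = 0
        self.end = 0
        self.count = 0

    def setup(self, start, end, count):
        """Models the initialization that stores the loop information."""
        self.start, self.end, self.count = start, end, count

    def next_pc(self, pc):
        """Return the PC after `pc`, jumping back to start while iterations remain."""
        if self.count > 0 and pc == self.end:
            self.count -= 1
            if self.count > 0:
                return self.start
        return pc + 1  # assumption: PC counts instructions, not bytes


def run(program_len, unit):
    """Return the sequence of PCs executed for a program of program_len instructions."""
    pc = 0
    executed = []
    while pc < program_len:
        executed.append(pc)
        pc = unit.next_pc(pc)
    return executed


unit = HwLoopUnit()
unit.setup(start=1, end=2, count=3)  # loop body = instructions 1 and 2, run 3 times
print(run(4, unit))  # [0, 1, 2, 1, 2, 1, 2, 3]
```

Note that the executed sequence contains only loop-body instructions between the prologue and the epilogue: no increment, compare or branch instruction is fetched, which is where the savings in cycle count and code size come from.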
The goal of this thesis is to implement an OpenRISC processor in SystemVerilog or VHDL. The following tasks need to be accomplished during this project:
- Good understanding of the OpenRISC core and its implementation.
- Functionally correct and proper implementation of the OpenRISC core in VHDL or SystemVerilog, with performance similar to that of the reference design.
- Development of an environment to test the basic functionality of the core.
- Well-documented VHDL or SystemVerilog code.
- Perform a back-end design of the developed processor.
- Hand in a good report and give an interesting presentation.
- Chip gallery page (http://asic.ethz.ch/cg/2013/Or10n) for Or10n, the chip manufactured during this thesis