Weekly Reports

Most of the semester and master projects at IIS are related to other research projects. It is therefore important to regularly update the rest of the group on your progress. One way to provide such updates is to post weekly reports. If you are working on a project of the Digital Circuits and Systems Group of Prof. Benini, you will be asked to submit a weekly report by e-mail every Friday.

Please use plain ASCII text, with embedded figures if needed. Do not send your report as a PDF or .docx attachment. Attachments prevent Prof. Benini from answering interspersed in the e-mail and make it difficult to link responses to specific points in your report.

Note the following:

  • Add 'WR' to the subject line; it is also practical to add the week number and the title of your thesis.
  • Address your mail to Prof. Benini and do not forget to add your supervisors in CC.
  • Start with a 1-2 sentence summary of your project (this can be the same every week).
  • Using an itemized list, state what you have done in the current week. Report results (if any) and describe the design decisions you have taken. Add technical insight where possible: just stating "advanced the development of XXX" is not informative; it is better to explain what new functionality you added and why. Provide quantitative data and/or figures if you believe they are needed.
  • Add the plan for the coming week using bullet points as well.
  • Try to be brief (about half a page) and concise.
  • At the end, add any requests or general comments you may have (e.g. a request for a meeting).

Submit your reports on time. Late submissions reflect poorly on you.

Example Weekly Report

Subject: WR (34) - Template project
From: student <sem00h0@iis.ee.ethz.ch>
To: Luca Benini <lbenini@iis.ee.ethz.ch>
CC: advisor1 <advisor1@iis.ee.ethz.ch>, advisor2 <advisor2@iis.ee.ethz.ch>

Dear Prof. Benini,

I am currently working on extending the virtualization support for CVA6
as my Master's thesis, supervised by advisor1 and advisor2.

*Progress*:

 * I spent some time investigating the sources of delay in the pipeline
   when servicing ISRs. In particular, until last week I was still
   seeing sub-optimal results for the *context-save* and *return*
   latencies (the return latency measures the number of cycles that
   elapse from when the /sret/ instruction retires to when the
   interrupted instruction retires).
   I found out that:
     o The *context-save* stalls are due to the commit queue in the
       store buffer being full for consecutive stores without cache
       misses. With the default configuration of CVA6 (8 entries in the
       store commit queue), the stalls are therefore unavoidable on a
       full context save (16 GP caller-saved registers). However, by
       increasing the commit queue size from 8 to 16 entries, it is
       possible to save the full context without stalls, which brings
       the context-save latency down from the 33 cycles (average)
       measured last week to 20 cycles (20 instructions in total, so
       IPC = 1).
     o The high *return* latency is due to deterministic I-CACHE
       misses. The reason is that, while executing the ISR, the
       instructions after the /sret/ are fetched into the core's
       frontend and can cause cache misses. Since the /sret/ causes
       a pipeline flush, one of the cache misses is never completely
       resolved and therefore occurs every time, causing an overhead
       of ~25 cycles to kill the cache miss handling. It is possible
       to work around this in software by manually pre-heating the
       cache with the cache lines that follow the trap handler's
       /sret/, but this solution might not be feasible for other use
       cases, since the cache line could be evicted at any time.
       Alternatively, the issue could be mitigated in hardware by
       changing the branch prediction algorithm of CVA6 to take this
       problem into account, e.g. by treating /mret/ and /sret/
       instructions similarly to direct jump operations. With the
       software workaround, I can now measure return latencies of
       6 cycles (as expected, the cost of a pipeline flush in CVA6).

   To summarize, this brings the time-to-handler (TTH) latency in CLIC
   mode down to *31 cycles* when a full context save is required
   (assuming hot caches/TLBs), rather than the 44 cycles measured
   before, and reduces the overall ISR latency by shortening the return
   latency.


 * I developed and tested some new test cases with vectored interrupts.
   Since the CLIC supports vectored interrupts, it is possible to
   shorten the interrupt latency by saving only the necessary context
   on a per-interrupt basis. This also helps reduce the pressure on
   the store buffer's commit queue. Combined with compiler
   optimizations (an /interrupt/ function attribute already exists in
   newer versions of RISC-V GCC), this minimizes the number of
   instructions in an ISR, allowing small real-time ISRs to execute in
   a few cycles. As an example, a minimal ISR setting a flag variable
   in response to an interrupt can execute within 40 cycles (with a
   TTH of *15 cycles*), assuming no cache/TLB misses.

 * I explored in more depth the baseline case of interrupt latencies
   when running Bao in CLINT mode. I customized the Bao vPLIC
   emulation to target the CVA6 testbench environment. The main
   optimization consists in making the PLIC emulation depend on the
   number of interrupt sources actually implemented in the physical
   PLIC, rather than on its fixed maximum. The main source of delay
   is a function that iterates over a software data structure holding
   per-interrupt information and that is called a few times during
   the interrupt injection process. This function's execution time is
   approximately linear in the number of interrupts it emulates
   (ignoring the calling-convention overhead). Since this function's
   execution takes about 86% of the total PLIC emulation time, I
   modeled the total emulation delay as a function of the number of
   implemented interrupts, following Amdahl's law. The model is backed
   up by the experimental results, which show an overall ISR latency
   of 23K cycles (instead of the original 135K cycles) when running
   the test with only 30 interrupt sources implemented (as happens in
   the testbench I am using) out of 1024. Furthermore, I measured a
   total delay of about 10K cycles due to locking operations, which
   could be omitted on a single-core platform such as the CVA6
   testbench. However, since the PLIC is a platform-level interrupt
   controller and Bao typically executes on multi-core platforms, I
   believe it makes sense to consider this overhead, perhaps
   abstracting out the efficiency of the locking implementation.

 * I continued the seL4 porting. I integrated the CLIC support into
   the seL4 kernel, but I am still having problems when running the
   VMM, since the seL4 runtime now crashes while initializing it.

*Next week:*

 * I will run a PPA analysis of the latest version of the CLIC to
   estimate the hardware virtualization overhead.
 * I will continue the seL4 CLIC integration process.
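
The minimal flag-setting ISR mentioned in the example report above
could look roughly like the sketch below (the handler and variable
names are hypothetical). It assumes a RISC-V GCC recent enough to
provide the "interrupt" function attribute, which makes the compiler
save only the registers the handler actually clobbers and return with
the appropriate trap-return instruction:

  /* Hypothetical minimal vectored ISR: sets a flag and returns. */
  volatile int irq_flag;

  __attribute__((interrupt))
  void flag_isr(void)
  {
      irq_flag = 1;   /* record that the interrupt fired */
  }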
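
The Amdahl-style model of the vPLIC emulation delay described in the
report can be reproduced with a back-of-the-envelope computation. The
sketch below is not Bao's actual code; it only assumes, as the report
states, that roughly 86% of the baseline emulation time scales
linearly with the number of implemented interrupt sources while the
rest stays fixed:

  #include <stdio.h>

  /* Predicted latency when n of n_max interrupt sources are emulated,
     with a fraction p of the baseline time t scaling linearly in n. */
  static double predicted_latency(double t, double p, int n, int n_max)
  {
      return t * ((1.0 - p) + p * (double)n / (double)n_max);
  }

  int main(void)
  {
      /* Numbers from the report: 135K cycles with all 1024 sources,
         ~86% of the time spent iterating over per-interrupt state. */
      double cycles = predicted_latency(135e3, 0.86, 30, 1024);
      printf("%.0f cycles\n", cycles);  /* ~22K, close to the measured 23K */
      return 0;
  }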