Logo image
Efficient scaling of out-of-order processor resources
Dissertation   Open access

Efficient scaling of out-of-order processor resources

Steven James Battle
Doctor of Philosophy (Ph.D.), Drexel University
01 Jun 2015
DOI:
https://doi.org/10.17918/etd-6322
pdf
Battle_Steven_20151.83 MBDownloadView

Abstract

Low voltage integrated circuits Computer Engineering Computer Science
Rather than improving single-threaded performance, with the dawn of the multi-core era, processor microarchitects have exploited Moore's law transistor scaling by increasing core density on a chip and increasing the number of thread contexts within a core. However, single-thread performance and efficiency is still very relevant in the power-constrained multi-core era, as increasing core counts do not yield corresponding performance improvements under real thermal and thread-level constraints. This dissertation provides a detailed study of register reference count structures and its application to both conventional and non-conventional, latency tolerant, out-of-order processors. Prior work has incorporated reference counting, but without a detailed implementation or energy model. This dissertation presents a working implementation of reference count structures and shows the overheads are low and can be recouped by the techniques enabled in high-performance out-of-order processors. A study of register allocation algorithms exploits register file occupancy to reduce power consumption by dynamically resizing the register file, which is especially important in the face of wider multi-threaded processors who require larger register files. Latency tolerance has been introduced as a technique to improve single threaded performance by removing cache-miss dependent instructions from the execution pipeline until the miss returns. This dissertation introduces a microarchitecture with a predictive approach to identify long-latency loads, and reduce the energy cost and overhead of scaling the instruction window inherent in latency tolerant microarchitectures. The key features include a front-end predictive slice-out mechanism and in-order queue structure along with mechanisms to reduce the energy cost and register-file usage of executing instructions. Cycle-level simulation shows improved performance and reduced energy delay for memory-bound workloads. Both techniques scale processor resources, addressing register file inefficiency and the allocation of processor resources to instructions during low ILP regions.

Metrics

55 File views/ downloads
34 Record Views

Details

Logo image