Fault-Tolerant and Variation-Tolerant High-Performance CMP Design

Principal Investigators: Qiang Xu; Yinhe Han (ICT, CAS)

Graduate Students: Lin Huang; Li Jiang; L. Zhang (ICT, CAS)

Project Summary

Motivation

High-performance microprocessors used to achieve better throughput by pushing their operating speeds to the extreme. With technology advancement and the associated ever-increasing design complexity, however, power and thermal requirements fundamentally limit the maximum clock frequency at which microprocessors can run, and we have to look for alternative ways to achieve higher computing power. By placing multiple processor cores on a single silicon die and improving performance through parallel execution, chip multiprocessors (CMPs), also known as multicore or manycore processors (depending on the number of cores on the die), have been shown to be much more power-efficient and have therefore become increasingly popular in industry. For example, Intel demonstrated an 80-core teraflop processor prototype at the Intel Developer Forum 2006. At a special session of the 2007 Design Automation Conference, researchers projected that thousand-core processor chips would become commercially available within 5 to 10 years.

IC fabrication is an extremely complex process, and in large-scale CMPs it is very likely that some embedded cores are defective. If CMP vendors shipped only perfect ICs without any defects, the manufacturing yield would be very low, resulting in unaffordable manufacturing cost. Fortunately, for CMPs containing many homogeneous processor cores, introducing a few redundant cores on-chip can enhance yield effectively at the cost of some extra silicon area. At the same time, however, because shipped chips may contain defective cores, which can occur at any position in a mesh-connected CMP, the system often has an incomplete mesh topology. In addition, permanent circuit failures due to aging effects (e.g., electromigration in interconnects and oxide breakdown) have an increasingly adverse effect with technology scaling. These hard errors may occur in some processor cores during the lifetime of the system, again possibly leading to a broken communication topology.
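
To make the yield argument concrete, the following is a minimal sketch, not the model used in this project. It assumes that core defects are independent and that every core is defect-free with the same probability, so the number of good cores follows a binomial distribution. The function name chip_yield and the example numbers (64 required cores, 4 spares, 98% per-core yield) are hypothetical and chosen only for illustration.

```python
from math import comb


def chip_yield(total_cores: int, required_cores: int, core_yield: float) -> float:
    """Probability that at least `required_cores` of `total_cores` are defect-free.

    Assumption (not from the project text): cores fail independently and each
    core is defect-free with probability `core_yield`.
    """
    return sum(
        comb(total_cores, good)
        * core_yield ** good
        * (1.0 - core_yield) ** (total_cores - good)
        for good in range(required_cores, total_cores + 1)
    )


# Hypothetical example: a chip that must expose 64 working cores.
no_spares = chip_yield(total_cores=64, required_cores=64, core_yield=0.98)
four_spares = chip_yield(total_cores=68, required_cores=64, core_yield=0.98)

print(f"yield without spares: {no_spares:.3f}")
print(f"yield with 4 spares:  {four_spares:.3f}")
```

Under these assumed numbers, the yield without spares is roughly 27%, while four spare cores raise it above 95%. Real defect distributions are usually clustered rather than independent, so the sketch illustrates only the trend.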

The imperfect manufacturing process also causes device parameter variations, e.g., different channel lengths and threshold voltages among transistors. Because of this, even though the embedded cores in a homogeneous CMP are structurally identical, they effectively have non-uniform frequency and/or power characteristics. In addition, advanced power management schemes such as core-level dynamic voltage and frequency scaling (DVFS) and thermal throttling are likely to be incorporated in CMPs for a better performance-power tradeoff, leading to runtime performance variations among cores.

Approach

With the increasing number of embedded cores in a homogeneous manycore processor, it is preferable to employ core-level redundancy rather than microarchitecture-level redundancy to improve the processor's manufacturing yield and enhance its lifetime reliability. With this scheme, however, the interconnection network topology of the manycore processor may be modified, and different fabricated chips may have different underlying physical topologies. This is a heavy burden for programmers because a program optimized for one topology may not work well on a different one. To address this problem, we introduce the concept of a logical topology in this project. A logical topology is isomorphic with the topology of the target design, but is typically a degraded version of it. The OS and programmers always see a unified topology, regardless of which physical cores are actually used underneath and of their corresponding physical topology. This eases task dispatching and scheduling for the OS and facilitates the optimization of parallel programs.

We can construct many logical topologies for a homogeneous manycore processor with a given physical topology, since any two fault-free cores can be logical neighbors of each other. The choice of logical topology has a significant impact on the performance of the manycore processor because different logical topologies may have different communication bandwidth and latency. We therefore need to trade off the processor's performance against its reliability using optimized topology reconfiguration algorithms in this project.
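
The effect of the mapping choice can be illustrated with a small sketch. It is not the reconfiguration algorithm developed in this project. It assumes a physical 2D mesh with some faulty cores, a smaller target mesh as the logical topology, and it uses the summed Manhattan distance between the physical cores of logical neighbors as a rough stand-in for communication latency. All function names and the 3x4 example chip are hypothetical.

```python
from itertools import product


def row_major_mapping(phys_rows, phys_cols, faulty, log_rows, log_cols):
    """Naive mapping: hand out fault-free cores to logical nodes in row-major order."""
    healthy = [p for p in product(range(phys_rows), range(phys_cols)) if p not in faulty]
    logical = list(product(range(log_rows), range(log_cols)))
    if len(healthy) < len(logical):
        raise ValueError("not enough fault-free cores for the target topology")
    return dict(zip(logical, healthy))


def per_row_mapping(phys_rows, phys_cols, faulty, log_rows, log_cols):
    """Keep each logical row inside one physical row, skipping faulty cores."""
    if log_rows > phys_rows:
        raise ValueError("target topology has more rows than the physical mesh")
    mapping = {}
    for r in range(log_rows):
        healthy = [(r, c) for c in range(phys_cols) if (r, c) not in faulty]
        if len(healthy) < log_cols:
            raise ValueError(f"physical row {r} has too few fault-free cores")
        for c in range(log_cols):
            mapping[(r, c)] = healthy[c]
    return mapping


def neighbor_distance(mapping, log_rows, log_cols):
    """Sum of Manhattan distances between physical cores of logical neighbors.

    Assumption: Manhattan distance approximates hop count; real routing in an
    incomplete mesh may take longer paths.
    """
    total = 0
    for (r, c), (pr, pc) in mapping.items():
        for nr, nc in ((r + 1, c), (r, c + 1)):  # count each logical link once
            if nr < log_rows and nc < log_cols:
                qr, qc = mapping[(nr, nc)]
                total += abs(pr - qr) + abs(pc - qc)
    return total


# Hypothetical chip: 3x4 physical mesh (one spare column), one faulty core,
# and a 3x3 mesh as the unified logical topology.
faulty = {(1, 1)}
naive = row_major_mapping(3, 4, faulty, 3, 3)
row_wise = per_row_mapping(3, 4, faulty, 3, 3)

print("row-major cost:", neighbor_distance(naive, 3, 3))
print("per-row cost:  ", neighbor_distance(row_wise, 3, 3))
```

Both mappings present the same 3x3 logical mesh to software, but the per-row mapping keeps logical neighbors physically closer on this example. A reconfiguration algorithm would search over such mappings for one that minimizes a cost of this kind.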

Papers and Presentations

On Modeling the Lifetime Reliability of Homogeneous Manycore Systems, accepted for publication in Proc. IEEE Pacific Rim International Symposium on Dependable Computing (PRDC), Dec. 2008.

Defect Tolerance in Homogeneous Manycore Processors Using Core-Level Redundancy with Unified Topology, IEEE/ACM Design, Automation, and Test in Europe (DATE), March 2008.

  • Paper
  • Presentation (coming soon)