Global Multi-Threaded Instruction Scheduling: Technique and Initial Results [abstract] (CiteSeerX, PDF)
Guilherme Ottoni and David I. August
Proceedings of the Sixth Workshop on Explicitly Parallel Instruction Computer Architectures and Compiler Technology (EPIC), March 2007.

Recently, the microprocessor industry has reached hard physical and micro-architectural limits that have prevented the continuous clock-rate increase, which had been the major source of performance gains for decades. These impediments, in conjunction with the still increasing transistor counts per chip, have driven all major microprocessor manufacturers toward Chip Multiprocessor (CMP) designs. Although CMPs are able to concurrently pursue multiple threads of execution, they do not directly improve the performance of most applications, which are written in sequential languages. In effect, the move to CMPs has shifted even more the task of improving performance from the hardware to the software. In order to support more effective parallelization of sequential applications, computer architects have proposed CMPs with light-weight communication mechanisms. Despite such support, proposed multi-threaded scheduling techniques have generally demonstrated little effectiveness in extracting parallelism from general-purpose, sequential applications. We call these techniques local multi-threaded scheduling, because they basically exploit parallelism within straight-line regions of code. A key observation of this paper is that local multi-threaded techniques do not exploit the main feature of CMPs: the ability to concurrently execute instructions from different control-flow regions. In order to benefit from this powerful CMP characteristic, it is necessary to perform global multi-threaded scheduling, which simultaneously schedules instructions from different basic blocks to enable their concurrent execution. This paper presents algorithms to perform global scheduling for communication-exposed multi-threaded architectures. By global we mean that our technique simultaneously schedules instructions from an arbitrary code region. Very encouraging preliminary results, targeting a dual-core Itanium~2 model, are presented for selected MediaBench and SPEC-CPU benchmark applications.