Global Instruction Scheduling for Multi-Threaded Architectures [abstract] (PDF)
Guilherme de Lima Ottoni
Ph.D. Thesis, Department of Computer Science, Princeton University, September 2008.

Recently, the microprocessor industry has moved toward multi-core or chip multiprocessor (CMP) designs as a means of utilizing the increasing transistor counts in the face of physical and micro-architectural limitations. Despite this move, CMPs do not directly improve the performance of single-threaded codes, a characteristic of most applications. In effect, the move to CMPs has shifted even more the task of improving performance from the hardware to the software.

Since developing parallel applications has long been recognized as significantly harder than developing sequential ones, it is very desirable to have automatic tools to extract thread-level parallelism (TLP) from sequential applications. Unfortunately, automatic parallelization has only been successful in the restricted domains of scientific and data-parallel applications, which usually have regular array-based memory accesses and little control flow. In order to support parallelization of general-purpose applications, computer architects have proposed CMPs with light-weight, fine-grained (scalar) communication mechanisms. Despite such support, most existing multi-threading compilation techniques have generally demonstrated little effectiveness in extracting parallelism from non-scientific applications. The main reason for this is that such techniques are mostly restricted to extracting parallelism within local, straight-line regions of code. As observed in several limit studies, local regions of code contain limited parallelism, and control dependence analysis and multiple control units are necessary to extract most of the parallelism opportunities.

This thesis describes a general compiler framework for Global Multi-Threaded (GMT) instruction scheduling, i.e. to simultaneously schedule instructions from a global region of code to extract TLP for multi-threaded architectures. Our compiler framework is based on a Program Dependence Graph (PDG) representation, efficient graph partitioning algorithms, and novel multi-threaded code generation algorithms. Central to this framework are our multi-threaded code generation algorithms, which produce efficient code for arbitrary partitions of the PDG into threads. Based on this framework, three thread-partitioning strategies for GMT instructions scheduling are proposed. The first one, called GREMIO, extends list-scheduling to target multi-threaded architectures and to operate on global code regions. The second technique, called Decoupled Software Pipelining (DSWP), extracts pipelined TLP from arbitrary loop nests. We also propose Parallel-Stage DSWP, an extension of DSWP that allows multiple threads to concurrently execute the same stage of the pipeline. These techniques are implemented in the VELOCITY compiler and evaluated on an accurate CMP simulator built on top of validated Itanium 2 core models. The experiments show that our techniques balance applicability and scalability differently, with each technique resulting in the best speedup in different scenarios. Overall, the results demonstrate the effectiveness of the proposed compilation techniques, with significant speedups on a number of real benchmark applications written in C.