Pipelined Multithreading Transformations and Support Mechanisms [abstract] (PDF)
Ram Rangan
Ph.D. Thesis, Department of Computer Science,
Princeton University, June 2007.
Even though chip multiprocessors have emerged as the
predominant organization for future microprocessors, the multiple
on-chip cores do not directly result in improved application
performance (especially for legacy applications, which are
predominantly sequential C/C++ codes). Consequently, parallelizing
applications to execute on multiple cores is essential to their
success. Independent multithreading techniques, like DOALL extraction,
create partially or fully independent threads, which communicate
rarely, if at all. While such strategies keep high inter-thread
communication costs from impacting program performance, they cannot be
applied to parallelize general-purpose applications which are
characterized by difficult-to-break recurrences. Even though cyclic
multithreading techniques, such as DOACROSS, are more applicable, the
cyclic inter-thread dependences created by these techniques cause them
to have very low tolerance to rising inter-core latencies. To address these problems, this work introduces a pipelined
multithreading (PMT) transformation called Decoupled Software
Pipelining (DSWP). DSWP, in particular, and PMT techniques, in
general, are able to tolerate inter-core latencies, while still
handling codes with complex recurrences. They achieve this by
enforcing an acyclic communication discipline amongst threads, which
allow threads to use inter-thread queues to communicate in a pipelined
fashion. This dissertation demonstrates that DSWPed codes not only
tolerate inter-core communication costs, but also effectively tolerate
variable latency stalls in applications better than single-threaded
execution on both in-order and out-of-order issue processors with
comparable resources. It then performs a thorough analysis of the
performance scalability of automatically generated DSWPed codes and
identifies the conditions necessary to achieve peak PMT
performance. Next, the dissertation shows that even though PMT applications
tolerate inter-core latencies well, the high frequency of inter-thread
communication (once every 5 to 20 dynamic instructions) in these
codes, makes them very sensitive to the intra-thread overhead imposed
by communication operations. In order to understand the issues
surrounding inter-thread communication for PMT applications, this
dissertation undertakes a methodical exploration of the design space
of communication support options for PMT. Three new communication
mechanisms with varying cost-performance tradeoffs are introduced and
are shown to perform 38% to 200% better than the state of the
art.