From Sequential Source to Parallel Performance
The Source Code
The application is ks, a graph partitioning algorithm. The hot loop is in the function FindMaxGpAndSwap: a doubly nested linked-list traversal that performs some computation on each pair of elements and tracks a maximum.
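The real data structures and computation in ks differ; as a rough sketch of the loop's shape, with hypothetical node and function names, the hot loop looks something like:

```c
#include <stddef.h>

/* Hypothetical list node; the actual ks structures differ. */
typedef struct Node {
    struct Node *next;
    double value;
} Node;

/* Shape of the hot loop: a doubly nested linked-list traversal that
 * computes a gain for each pair of nodes and tracks the maximum. */
static double find_max_gain(Node *a_list, Node *b_list,
                            Node **best_a, Node **best_b) {
    double max_gain = -1e30;
    for (Node *a = a_list; a != NULL; a = a->next) {
        for (Node *b = b_list; b != NULL; b = b->next) {
            double gain = a->value + b->value;  /* stand-in computation */
            if (gain > max_gain) {
                max_gain = gain;
                *best_a = a;
                *best_b = b;
            }
        }
    }
    return max_gain;
}
```

The outer traversal's progress depends only on pointer chasing, while the max is a reduction; this separation is what makes the loop amenable to pipeline parallelization.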
From C to LLVM Bitcode
Here we use the LLVM tools to read the C source and emit a bitcode file containing the LLVM IR. When there are multiple C files, they are linked into a single bitcode file. The code is also profiled to obtain basic-block execution counts.
Automatic Parallelization
Here is where we actually run the parallelizing compiler. This includes building the Program Dependence Graph, running both the DSWP and PS-DSWP partitioners, and running the Multi-Threaded Code Generator, which emits multi-threaded LLVM IR that implements parallel execution. At the end, this output is linked with the queue library and the thread pool library to create a native executable.
Program Dependence Graph
During the Automatic Parallelization step, a number of graphs are generated to help the developer visualize the program. One of these is the Program Dependence Graph (PDG), which shows data and control dependences between instructions in the loop to be parallelized. Different colored edges represent different types of dependences.
Graph of Strongly Connected Components
In this graph, each node is a Strongly Connected Component (SCC) and consists of one or more nodes from the original PDG. The partitioner operates on these SCCs rather than on PDG nodes directly. Some SCCs are marked DOALL, indicating that they have no inter-iteration dependences and can be placed in a parallel stage by the PS-DSWP partitioner.
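As a concrete illustration (not code from ks), the first loop below is DOALL because each iteration touches only its own element, while the second carries a dependence through acc from one iteration to the next and therefore cannot be a parallel stage on its own:

```c
/* DOALL: iterations are independent; any subset can run in parallel. */
static void scale(double *a, int n, double k) {
    for (int i = 0; i < n; i++)
        a[i] = a[i] * k;
}

/* Not DOALL: every iteration reads the value the previous one wrote. */
static double running_max(const double *a, int n) {
    double acc = a[0];
    for (int i = 1; i < n; i++)
        if (a[i] > acc)
            acc = a[i];  /* inter-iteration (loop-carried) dependence */
    return acc;
}
```

In practice a max like this is a reduction and some systems can still parallelize it specially, but in the basic PDG view it forms a nontrivial SCC.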
Partitioning the Graph Into Stages
This graph shows the actual partition computed by PS-DSWP. The red stage is sequential, while the green stage is parallel. PS-DSWP succeeds in making the parallel stage as large as possible, which maximizes performance.
Performance Comparison
To see the baseline performance, the original C code is compiled using gcc and executed. (At this point, you may start playing the next video to visually compare the speeds of sequential vs. parallel execution.)
The parallelized code runs in about 16.5 seconds, while the sequential code took 1:14 (74 seconds), a speedup of about 4.5x on an 8-core machine. Thus, we can achieve real, significant speedups on unmodified source code.