FastForward for Concurrent Threaded Pipelines
John Giacomoni, Tipp Moseley, and Manish Vachharajani
University of Colorado Technical Report CU-CS-1023-07, January 2007.
The performance, cost, and flexibility of commodity
multi-core systems make them appealing for threaded applications.
Unfortunately, popular threading techniques require independent code
regions, use expensive synchronization primitives, and use expensive
communication mechanisms. Recently, researchers have proposed several
Concurrent Threaded Pipeline (CTP) architectures, which relax the data
independence requirement and can increase computational throughput in
proportion to the pipeline depth. Examples include Decoupled
Software Pipelining, which focuses on compiler-based extraction of
pipelines from sequential codes, and the Frame Shared Memory
architecture, which focuses specifically on network processing. CTP
architectures show great promise for threading applications, provided
that a low-overhead, high-speed blocking queue implementation is available.
This paper presents the FastForward system, a novel software-only,
low-overhead, high-speed blocking queue implementation for CTPs.
FastForward combines a domain-specific adaptation of concurrent
lock-free (CLF) queues with a clever memory organization to provide
fast, low-overhead queue operations. The key to FastForward's success
is its domain-specific optimization, based on careful tuning for
modern multi-core microarchitectures. Enqueue
and dequeue times are as low as 35 ns, 5 times faster than the next
best solution.
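
The abstract does not spell out the queue's internals, so the following is only a minimal sketch of one way a single-producer/single-consumer lock-free queue for a pipeline stage could be organized, not the report's actual code. It assumes that a NULL slot marks "empty", that the producer and consumer each keep a private index on its own cache line, and that a cache line is 64 bytes. The names spsc_queue, spsc_init, spsc_enqueue, spsc_dequeue, and QUEUE_SIZE are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_SIZE 1024 /* hypothetical capacity */

typedef struct {
    _Atomic(void *) slots[QUEUE_SIZE]; /* NULL marks an empty slot */
    _Alignas(64) size_t head;          /* touched only by the producer */
    _Alignas(64) size_t tail;          /* touched only by the consumer */
} spsc_queue;

/* Mark every slot empty and reset both private indices. */
static void spsc_init(spsc_queue *q) {
    for (size_t i = 0; i < QUEUE_SIZE; i++)
        atomic_init(&q->slots[i], NULL);
    q->head = 0;
    q->tail = 0;
}

/* Producer side: returns false if the next slot is still occupied (full). */
static bool spsc_enqueue(spsc_queue *q, void *item) {
    assert(item != NULL); /* NULL is reserved as the "empty" marker */
    if (atomic_load_explicit(&q->slots[q->head], memory_order_acquire) != NULL)
        return false;
    atomic_store_explicit(&q->slots[q->head], item, memory_order_release);
    q->head = (q->head + 1) % QUEUE_SIZE;
    return true;
}

/* Consumer side: returns false if the next slot is empty. */
static bool spsc_dequeue(spsc_queue *q, void **out) {
    void *item = atomic_load_explicit(&q->slots[q->tail], memory_order_acquire);
    if (item == NULL)
        return false;
    /* Hand the slot back to the producer by marking it empty again. */
    atomic_store_explicit(&q->slots[q->tail], NULL, memory_order_release);
    q->tail = (q->tail + 1) % QUEUE_SIZE;
    *out = item;
    return true;
}
```

In this sketch the two threads never read each other's index; they communicate only through the slot contents, which limits the cache-line traffic between cores. A blocking variant would simply retry (spin) while spsc_enqueue or spsc_dequeue returns false.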