S. C. Smith, R. F. DeMara, J. S. Yuan, M. Hagedorn, and D. Ferguson,
"NULL Convention Multiply and Accumulate Unit with Conditional Rounding, Scaling, 
and Saturation," Journal of Systems Architecture, Vol. 47, No. 12, June, 2002, 
pp. 977 - 998.

Abstract 
Approaches for maximizing throughput of self-timed multiply-accumulate units 
(MACs) are developed and assessed using the NULL Convention Logic (NCL) paradigm. 
In this class of self-timed circuits, functional correctness is independent 
of delays in circuit elements, by construction, and independent of wire 
delays, through the isochronic fork assumption [1, 2], under which wire 
delays are assumed to be much smaller than gate delays. Self-timed circuits 
therefore provide distinct advantages for System-on-a-Chip applications.
First, a number of alternative MAC algorithms are compared and contrasted in 
terms of throughput and area to determine which approach will yield the maximum 
throughput with the least area. The Modified Baugh-Wooley and Modified Booth2 
algorithms were found to meet these criteria well. 
Dual-rail non-pipelined versions of these algorithms were first designed using 
the Threshold Combinational Reduction (TCR) method [3]. The non-pipelined 
designs were then optimized for throughput using the Gate-Level Pipelining (GLP) 
method [4]. Finally, each design was simulated using Synopsys to quantify the 
advantage of the dual-rail pipelined Modified Baugh-Wooley MAC, which yielded a 
speedup of 2.5 over its initial non-pipelined version. This design also required 
20% fewer gates than the dual-rail pipelined Modified Booth2 MAC that had the 
same throughput. The resulting design employs a three-stage feed-forward 
multiply pipeline connected to a four-stage feedback multifunctional loop to 
perform a 72+32*32 MAC in 12.7 ns on average using a 0.25 µm CMOS process at 
3.3 V, thus outperforming other delay-insensitive/self-timed MACs in the literature.
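
For readers unfamiliar with the arithmetic named above, the following is a minimal behavioral sketch in Python of a Modified Baugh-Wooley signed multiply feeding a 72-bit accumulate. It illustrates only the arithmetic. It does not model the NCL gates, dual-rail signaling, or pipeline stages, nor the rounding, scaling, and saturation named in the title. The function names (baugh_wooley_multiply, mac) and the wraparound behavior of the accumulator are assumptions made for illustration and are not taken from the paper.

```python
def baugh_wooley_multiply(a: int, b: int, n: int = 32) -> int:
    """Signed n-bit x n-bit product built from Modified Baugh-Wooley partial products.

    Hypothetical helper for illustration; operands must fit in n-bit two's complement.
    """
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    if not (lo <= a <= hi and lo <= b <= hi):
        raise ValueError("operands must fit in n-bit two's complement")

    # Two's complement bit vectors, least significant bit first.
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> j) & 1 for j in range(n)]

    total = 0
    for i in range(n):
        for j in range(n):
            bit = a_bits[i] & b_bits[j]
            # Partial products involving exactly one sign bit are complemented.
            if (i == n - 1) != (j == n - 1):
                bit ^= 1
            total += bit << (i + j)

    # Correction constants: a 1 in column n and a 1 in column 2n-1.
    total += (1 << n) + (1 << (2 * n - 1))

    # Keep 2n bits and reinterpret the result as a signed value.
    total &= (1 << (2 * n)) - 1
    if total >> (2 * n - 1):
        total -= 1 << (2 * n)
    return total


def mac(acc: int, a: int, b: int, acc_bits: int = 72, n: int = 32) -> int:
    """acc + a*b with a 72-bit accumulator.

    Assumption: the sum wraps modulo 2**acc_bits; saturation is not modeled.
    """
    s = (acc + baugh_wooley_multiply(a, b, n)) & ((1 << acc_bits) - 1)
    if s >> (acc_bits - 1):
        s -= 1 << acc_bits
    return s
```

Complementing the sign-bit partial products and adding the two constants replaces the subtractions of a direct two's complement expansion, so every partial-product row can be summed with adders alone.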