Abstract
In recent years, computer applications have grown in computational complexity. The industry-wide use of performance benchmarks, such as SPECmarks, and the popularity of 3D graphics applications force processor designers to pay particular attention to the implementation of the floating-point unit (FPU). This paper presents results of the Stanford subnanosecond arithmetic processor (SNAP) research effort on the design of hardware for floating-point addition, multiplication, and division. We show that single-cycle FP addition is achievable 32% of the time using a variable-latency algorithm. For multiplication, at contemporary feature sizes (0.3 um) a binary tree is often inferior to a Wallace tree designed with an algorithmic layout approach. Further, in most cases two-bit Booth encoding of the multiplier is preferable to non-Booth encoding for partial product generation. For division, the best area-performance tradeoff appears to be achieved using functional iteration, and we present two techniques that further reduce average division latency.
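To make the multiplication claim concrete, the following is an illustrative sketch (not the paper's hardware design) of two-bit (radix-4) Booth recoding: overlapping three-bit groups of the multiplier are mapped to digits in {-2, -1, 0, +1, +2}, halving the number of partial products relative to non-Booth generation. The function names are hypothetical.

```python
def booth2_digits(y, n):
    """Radix-4 Booth recode an n-bit two's-complement multiplier
    (n even) into n/2 signed digits, each in {-2, -1, 0, +1, +2}."""
    y &= (1 << n) - 1              # view y as an n-bit pattern
    y2 = y << 1                    # append the implicit y_{-1} = 0 bit
    digits = []
    for i in range(0, n, 2):
        grp = (y2 >> i) & 0b111    # bits y_{i+1} y_i y_{i-1}
        # digit = -2*y_{i+1} + y_i + y_{i-1}
        d = (grp & 1) + ((grp >> 1) & 1) - 2 * ((grp >> 2) & 1)
        digits.append(d)
    return digits

def booth2_value(digits):
    """Reassemble the recoded value: sum of digit_i * 4**i."""
    return sum(d << (2 * i) for i, d in enumerate(digits))
```

For example, `booth2_digits(93, 8)` yields four digits instead of eight partial-product rows, and `booth2_value` recovers 93; each nonzero digit corresponds to a partial product that is 0, ±M, or ±2M, all cheap to form by shifting and negation.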
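The division result refers to functional iteration, of which Newton-Raphson reciprocal refinement is the standard example: each iteration roughly doubles the number of correct bits, so a few iterations from a small table or linear seed suffice. The sketch below uses software floating point purely to show the recurrence x_{k+1} = x_k (2 - b x_k); it is not the SNAP divider, and the seed constants and function names are illustrative assumptions (b is assumed positive).

```python
import math

def nr_reciprocal(b, iters=5):
    """Approximate 1/b (b > 0) by Newton-Raphson iteration.
    Convergence is quadratic: the error squares each step."""
    m, e = math.frexp(b)           # b = m * 2**e with m in [0.5, 1)
    x = 48/17 - 32/17 * m          # classic linear seed, ~4.5 bits
    for _ in range(iters):
        x = x * (2.0 - m * x)      # x_{k+1} = x_k * (2 - m * x_k)
    return math.ldexp(x, -e)       # undo the normalization

def nr_divide(a, b, iters=5):
    """Divide by multiplying with the refined reciprocal: q = a * (1/b)."""
    return a * nr_reciprocal(b, iters)
```

In hardware, each iteration costs two dependent multiplications, which is why average latency can be cut by starting from a more accurate seed or by terminating early when fewer iterations already meet the target precision.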