This work investigated the implementation of low-power double-precision floating-point division and square root units. Although division and square root are not very frequent operations, ignoring their implementation can degrade system performance. Moreover, although division is less frequent than addition and multiplication, its longer latency means that it dissipates a non-negligible portion of the total energy consumed in floating-point units.

Our main objectives were to reduce the energy consumption without increasing the execution time, and to study the relationship between the radix of the algorithm and the energy consumption. The energy dissipated in CMOS cells can be reduced by applying a number of techniques at different levels of abstraction. We applied already known techniques to the specific case of division and square root, and also developed some algorithm-specific modifications that reduce the energy dissipation in the units.

To evaluate the effectiveness of these techniques, we presented the implementation of four different division schemes and one combined division and square root unit. All the units were implemented with a static CMOS standard cell library. For all the radices except radix-512, we obtained an overall energy reduction of 40%, and we estimated that a reduction of about 60% could have been reached if dual-voltage gates had been available in our library. Moreover, the energy per operation is roughly the same for radix-4, 8, and 16, while the energy per cycle increases with the radix. Because the average power is proportional to the energy per cycle, the average power dissipation also increases with the radix, but to a smaller extent, because the cycle time is longer for higher radices. The use of dual voltage is more effective for simple datapaths, in which the time slack between the delays of different portions of the circuit is larger.
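The relationship between energy per operation, energy per cycle, and average power stated above can be written out explicitly. The following is only a sketch under assumed notation: the symbols $E_{op}$, $E_{cycle}$, $T_c$, $N_c$, $n$, $b$, and $c_0$ are introduced here for illustration and are not taken from the original text, and the cycle-count expression is a simplified model of a digit-recurrence unit rather than the exact count of any of the implemented units.

```latex
% Assumed notation (illustrative, not from the original text):
%   r = 2^b   radix; b result bits are retired per iteration
%   n         number of result bits that must be produced
%   c_0       overhead cycles (initialization, conversion, rounding)
%   N_c       cycles per operation,  T_c  cycle time
\begin{align}
  N_c        &\approx \left\lceil \frac{n}{b} \right\rceil + c_0 ,
              \qquad b = \log_2 r \\
  E_{op}     &= N_c \, E_{cycle} \\
  P_{ave}    &= \frac{E_{op}}{N_c \, T_c} = \frac{E_{cycle}}{T_c}
\end{align}
% If E_{op} is roughly constant across radices, then
% E_{cycle} = E_{op} / N_c grows as N_c shrinks with increasing b,
% while P_{ave} = E_{cycle} / T_c grows more slowly because T_c
% also increases with the radix.
```

Under these assumptions, a roughly constant $E_{op}$ together with a decreasing $N_c$ accounts for the growth of $E_{cycle}$ with the radix, and the longer $T_c$ of higher radices accounts for the more moderate growth of the average power.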

The results showed that the most effective techniques for reducing the energy dissipation are those applied at a higher level of design abstraction, such as modifications to the conversion and rounding algorithm, disabling inactive blocks, and the use of dual voltage.
