EFFICIENT SERIAL FLOATING-POINT CONSTANT DIVIDER

STRUCTURE OF THE FORM 2^p±1

Karthik Mahesh Varadarajan

ACIN, Technical University of Vienna, Vienna, Austria

Keywords: Constant divider circuit, Integer division, Serial division, Full-precision division, DSP, VLSI,

Computational optimization.

Abstract: Implementation specific computation modules hold the key to the success of fast DSP and Embedded

systems. Exponential encoders, dedicated multipliers, barrel shifters and accumulators are common units

available on DSPs. The family of constant divider circuits of the form 2^p±1, which are useful for image

processing, statistical processing like histograms etc., is the specific focus of this paper. This family is

largely dominated by the Residue Number System (RNS), Petry and Srinivasan algorithms and the Shuo-Yen Robert-Li algorithm. While these algorithms offer various trade-offs in terms of accuracy, memory footprint, power consumption and timing behavior, none of them is suited to processing serialized inputs or dividend inputs of a priori unknown bit length, and the circuits have to be replaced whenever the input bit length changes. The circuit size also grows enormously for large input lengths along with a

reduction in accuracy. These methods are suited only for integer division and are unsuited for extension to

floating/fixed point division. In this paper a novel constant divider algorithm is offered, which overcomes

the above mentioned limitations while handling arbitrary length, serial/ parallel data and producing full-

precision, full-accuracy, floating point capable results with constant circuit requirements and comparable

timing to state of the art methods.

1 INTRODUCTION

Software and hardware optimization processes are

essential elements towards building a cost-effective

and efficient embedded system. Dedicated

computation modules hold the key to the success of

fast DSP, GPU and Embedded systems. Modern

DSP development and deployment kits host a

number of dedicated computation units. Exponential

encoders, dedicated multipliers, barrel shifters and

accumulators are common units available on DSPs.

One specific computation module that holds

considerable importance with respect to embedded

systems designed for signal and image processing

applications is the constant division operation. The

family of constant divider circuits of the form 2^p±1

is the specific focus of this paper.

This computational module is useful for image

processing applications (such as division by image

dimensions or array lengths or extreme image

intensity values – typically represented as 2^p-1),

signal processing (such as Fourier Transform

normalization), statistical processing like histogram

estimation etc. Division by 2^p-1 also assumes significance because division by any odd integer can be converted to that format using the Euler-Fermat theorem (a power-of-two factor in the divisor reduces to a shift).
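
The conversion can be illustrated with a short Python sketch. It assumes an odd divisor d, and the helper name `to_mersenne_form` is hypothetical rather than taken from the paper. For odd d the Euler-Fermat theorem guarantees that some p exists with d dividing 2^p-1; the sketch searches for the smallest such p and returns the scale factor k = (2^p-1)/d, so that N/d = N·k/(2^p-1).

```python
from fractions import Fraction


def to_mersenne_form(d):
    """Return (p, k) with d * k == 2**p - 1, for a positive odd divisor d.

    Hypothetical helper: searches for the smallest p. The Euler-Fermat
    theorem guarantees that the loop terminates when d is odd.
    """
    if d <= 0 or d % 2 == 0:
        raise ValueError("d must be a positive odd integer")
    p = 1
    while (2**p - 1) % d != 0:
        p += 1
    return p, (2**p - 1) // d


p5, k5 = to_mersenne_form(5)   # 5 * 3 == 15 == 2**4 - 1
# Dividing by 5 equals multiplying by k5 and dividing by 2**p5 - 1.
assert Fraction(123, 5) == Fraction(123 * k5, 2**p5 - 1)
```

The pre-multiplication by k is an extra cost that this sketch does not try to optimize.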

2 RELATED WORK

The state of the art with respect to this family of

constant division by 2^p±1 includes the Residue

Number System (RNS) (Al-Besher, 1997), Petry and

Srinivasan algorithms (Srinivasan, 2007) (Petry,

1994) (Petry, 1983) and the Shuo-Yen Robert-Li

algorithm (Li, 1985). While these algorithms offer

various trade-offs in terms of accuracy, memory

footprint, power consumption and timing behavior

(extensive comparisons are presented in

(Schwarzbacher, 2000) and (Srinivasan, 2007)),

none of these methods is suited to processing serialized inputs or dividend inputs of a priori unknown bit length, and the circuits have to be replaced whenever the input bit length changes. In other

words, once a circuit has been designed for a certain


Varadarajan K..

EFFICIENT SERIAL FLOATING-POINT CONSTANT DIVIDER STRUCTURE OF THE FORM 2^p±1.

DOI: 10.5220/0003402904850490

In Proceedings of the 1st International Conference on Pervasive and Embedded Computing and Communication Systems (PECCS-2011), pages

485-490

ISBN: 978-989-8425-48-5

© 2011 SCITEPRESS (Science and Technology Publications, Lda.)

input bit stream size, the circuit has to be modified

in order to support longer bit streams. Also, these

circuits cannot produce partial or serialized results

and require the entire input data to be available

before performing the division operation. The

quotient and remainder are estimated using a shift,

add/subtract and scale paradigm resulting from

inverting the divisor in the binary format.

Alternatively, the division can be represented as a

multiplication (or shift and add) with a pre-scaling

operation. While some of the methods such as (Guei,

1985) compute the quotient bits from higher order to

lower order, methods such as (Artzy, 1974) compute

it in the reverse order. Methods such as (Petry, 1994)

do not give out an explicit remainder. Certain

methods are also restricted to exact divisions.

In addition to these constraints, the circuit size

for the implementation of most of these algorithms

also grows enormously for large input lengths along

with a reduction in accuracy and increase in

processing time (Figures 1 and 2). The addition of a

single tuple of dividend bits demands a circuit

change requiring an additional shifter and adder at

the very least. Moreover, these methods are suited

only for integer division and are unsuited for

extension to floating/fixed point division.

Figure 1: Accuracy and error deviation of different

approaches for a divide-by-3 operation (Src:

Schwarzbacher, 2000).

3 CONSTANT DIVIDER ALGORITHM

In this paper a novel constant divider algorithm is

offered, which overcomes the limitations of the

previous algorithms, while handling arbitrary length,

serial/ parallel data and producing full-precision,

full-accuracy, floating point capable results with

constant circuit requirements and comparable timing

to state of the art methods.

Figure 2: Timing behavior of different approaches

obtained on European Silicon Structures 0.7 µm CMOS

technology (Src: Schwarzbacher, 2000).

Two specific streams of processing are employed

for the cases of 2^p-1 and 2^p+1. These are described

in Figure 3 and Figure 4 respectively.

3.1 Division by 2^p-1

The pseudo-code for the case of 2^p-1 is detailed

below.

Step 1: Segment dividend into p-bit tuples

Step 2: Load T_n into register 1, T_{n-1} into register 2

Step 3: Add register 1 and 2

Step 4: If Sum >= 2^p-1
    Quotient tuple = register 1 + 1;
    Remainder tuple = Sum – (2^p-1);
Else
    Quotient tuple = register 1;
    Remainder tuple = Sum;

Step 5: Save quotient tuple in output buffer

Step 6: Load remainder tuple into register 1, next dividend tuple (T_{n-k}) into register 2

Step 7: Continue to Step 3, unless there are no dividend tuples left

Step 8: Output partial remainder as remainder (for integer division) or repeat partial remainder tuple (for floating point division)
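
The steps above can be sketched in Python as follows. This is a minimal software model, not the circuit itself. The function name `divide_by_2p_minus_1` and the list-of-integers representation of the tuples are assumptions made for illustration. The sketch also assumes that the most significant tuple is smaller than 2^p-1, so that every partial remainder stays below the divisor.

```python
def divide_by_2p_minus_1(tuples, p):
    """Divide a dividend, given as p-bit tuples (most significant first),
    by 2**p - 1. Returns (quotient_tuples, remainder).

    Assumption: tuples[0] < 2**p - 1 (zero-pad with an extra leading
    tuple otherwise), so each partial remainder is below the divisor.
    """
    divisor = (1 << p) - 1
    reg1 = tuples[0]                  # Step 2: most significant tuple
    quotient = []
    for reg2 in tuples[1:]:           # Steps 2 and 6: next dividend tuple
        total = reg1 + reg2           # Step 3
        if total >= divisor:          # Step 4
            q, r = reg1 + 1, total - divisor
        else:
            q, r = reg1, total
        quotient.append(q)            # Step 5
        reg1 = r                      # Step 6: remainder feeds register 1
    return quotient, reg1             # Step 8 (integer division)


# 679 = 0x2A7 as 4-bit tuples; 679 = 45 * 15 + 4 and 45 = 0x2D.
assert divide_by_2p_minus_1([2, 10, 7], 4) == ([2, 13], 4)
```

Only the current partial remainder and one incoming tuple are held at any time, which is what allows the dividend to arrive serially.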

For the case of division by 2^p-1, the input data

stream is divided into tuples of equal bit length ‘p’.

Zero padding is carried out, if necessary to

regularize the Most Significant Tuple (MST) to the

tuple size (p) being used. For the case of serial data

streams, this regularization may not be necessary, if

the serial input stream has been pre-synchronized for

the required tuple size. In Figure 3, it has been

assumed that for a tuple size of ‘p’ bits and input

dividend data stream length of ‘N’ bits, the number

of tuples generated is ‘n’. For the generation of the

most significant quotient tuple (represented by t=0 in

Figure 3), the MST (T_n) of the dividend and the second most significant tuple (T_{n-1}) are loaded to

registers 1 and 2 respectively. A p-bit adder

calculates the sum of the two tuples. For the case of

an overflow – i.e. the carry bit being set, it is


Figure 3: Functional units of the proposed 2^p-1 constant divider circuit.

Figure 4: Functional units of the proposed 2^p+1 constant divider circuit.

necessary to increment the contents of register 1 in

order to obtain the MST of the quotient. This is done

by connecting the activation port of the incrementor

to the carry bit. The incrementor is a functional unit

that increments the contents of a register by 1. It can

possibly be implemented using a simple adder

circuit or a counter. The MST of the quotient (Q_{n-1})

thus generated can be stored in a quotient buffer or

serialized for an output data stream. The sum output

from the p-bit adder forms the partial remainder.


Similar to the quotient tuple, the partial

remainder has to be corrected in the case of a

generated carry. A second incrementor with the

activation port tied to the carry bit output of the

adder is used to obtain the partial remainder from the

generated sum tuple. For successive time sample instances (t > 0), the partial remainder is loaded into register 1 for the calculation of successive tuples of the quotient. At these instances (t > 0), successively lower-order tuples (T_{n-2}, T_{n-3}, T_{n-4}, T_{n-5}, etc.) are progressively loaded into register 2. The adder then outputs the partial quotient and remainder for each time instant. The incrementors normalize the partial quotient and the remainder, which are fed to the output queue and to register 1, respectively, for the processing of the next sample instant. The cycle generates the quotient tuple bits for the whole (non-fractional) part in the case of both integer and floating point division.
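
The update rule can be checked with a single identity. The symbols r (partial remainder held in register 1, assumed to satisfy 0 ≤ r < 2^p-1) and t (incoming dividend tuple, 0 ≤ t ≤ 2^p-1) are introduced here for illustration only:

```latex
\[
r \cdot 2^{p} + t \;=\; r\,(2^{p}-1) + (r + t),
\qquad 0 \le r + t \le 2\,(2^{p}-1) - 1 .
\]
```

Under these assumptions the quotient tuple is r when r + t < 2^p-1 and r + 1 otherwise, and the new partial remainder is r + t or r + t – (2^p-1) respectively, which agrees with Step 4 of the pseudo-code.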

The processing is terminated when all the input

dividend tuples have been processed. In the case of

integer division, the process remainder equals the

partial remainder obtained from the second

incrementor for the last time sequence instant (t = n-1). In the case of floating point division the

fractional part of the quotient is obtained as non-

terminating, recurring tuple bits computed by

repeating the tuple bits representing the partial

remainder. The fractional tuples can be generated to

any length based on the required levels of system

accuracy and precision. This computation is less straightforward for division by 2^p+1; an alternate scheme that handles it is presented in the next section.

Computation of the fractional part is a major advantage of the scheme presented in this paper: unlike the Petry and Srinivasan algorithms or RNS, full accuracy can be maintained to any desired precision.
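
A small Python sketch of the fractional-part rule for 2^p-1 follows. The helper names `fraction_tuples_minus` and `tuples_to_fraction` are hypothetical, and the remainder is assumed to satisfy 0 ≤ R < 2^p-1. The check uses exact rational arithmetic to confirm that repeating the remainder tuple approximates R/(2^p-1) from below, with an error smaller than one unit of the last emitted tuple.

```python
from fractions import Fraction


def fraction_tuples_minus(remainder, p, count):
    """First `count` fractional p-bit tuples of remainder / (2**p - 1).

    Follows Step 8: the partial remainder tuple simply repeats.
    Assumes 0 <= remainder < 2**p - 1.
    """
    return [remainder] * count


def tuples_to_fraction(tuples, p):
    """Value of fractional tuples read as base-2**p digits after the point."""
    base = 1 << p
    return sum(Fraction(t, base ** (i + 1)) for i, t in enumerate(tuples))


p, r = 4, 4                        # e.g. 679 / 15 leaves remainder 4
approx = tuples_to_fraction(fraction_tuples_minus(r, p, 6), p)
exact = Fraction(r, (1 << p) - 1)
# The truncation error after 6 tuples is below one unit of the last tuple.
assert 0 <= exact - approx < Fraction(1, (1 << p) ** 6)
```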

3.2 Division by 2^p+1

The pseudo-code for the case of 2^p+1 is detailed

below.

Step 1: Segment dividend into p-bit tuples

Step 2: Load T_n into register 1, T_{n-1} into register 2

Step 3: Subtract register 1 from register 2

Step 4: If Difference >= 0
    Quotient tuple = register 1;
    Remainder tuple = Difference;
Else
    Quotient tuple = register 1 - 1;
    Remainder tuple = Difference + (2^p+1);

Step 5: Save quotient tuple in output buffer

Step 6: Load remainder tuple into register 1, next dividend tuple (T_{n-k}) into register 2

Step 7: Continue to Step 3, unless there are no dividend tuples left

Step 8: Output partial remainder as remainder (for integer division) or repeat the tuple pair formed by the partial remainder less 1 and its (2^p-1) complement (for floating point division)
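
The steps above can be sketched in Python in the same way. As before, this is a software model under stated assumptions. The function name `divide_by_2p_plus_1` and the list representation are hypothetical. The partial remainder can reach 2^p, so a hardware register 1 would presumably need p+1 bits; that observation belongs to this sketch and not to the paper.

```python
def divide_by_2p_plus_1(tuples, p):
    """Divide a dividend, given as p-bit tuples (most significant first),
    by 2**p + 1. Returns (quotient_tuples, remainder).
    """
    divisor = (1 << p) + 1
    reg1 = tuples[0]                  # Step 2: most significant tuple
    quotient = []
    for reg2 in tuples[1:]:           # Steps 2 and 6: next dividend tuple
        diff = reg2 - reg1            # Step 3: register 2 - register 1
        if diff >= 0:                 # Step 4
            q, r = reg1, diff
        else:                         # a borrow occurred
            q, r = reg1 - 1, diff + divisor
        quotient.append(q)            # Step 5
        reg1 = r                      # Step 6: may be as large as 2**p
    return quotient, reg1             # Step 8 (integer division)


# 679 = 0x2A7 as 4-bit tuples; 679 = 39 * 17 + 16 and 39 = 0x27.
assert divide_by_2p_plus_1([2, 10, 7], 4) == ([2, 7], 16)
```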

For the case of division by 2^p+1, the input data

stream is similarly divided into tuples of equal bit

length ‘p’. Again, zero padding is carried out, if

necessary to regularize the Most Significant Tuple

(MST) to the tuple size (p) being used. In Figure 4, it

has been assumed that for a tuple size of ‘p’ bits and

input dividend data stream length of ‘N’ bits, the

number of tuples generated is ‘n’. Similar to the case

of division by 2^p-1, for the generation of the most

significant quotient tuple (represented by t=0 in

Figure 4), the MST (T_n) of the dividend and the second most significant tuple (T_{n-1}) are loaded to

registers 1 and 2 respectively. A p-bit subtractor

calculates the difference of the two tuples (register 2

– register 1). The subtractor can be implemented

using a 2’s complement logical adder. This also

helps obtain the correct difference between the two

tuples (the value being the absolute difference

between the two tuples along with a negative sign,

indicated by the borrow/overflow bit, for the case

when the value in register 2 is smaller than that in register 1). For the case of a borrow, i.e. the borrow bit being set, it is necessary to decrement the

contents of register 1 in order to obtain the MST of

the quotient. This is done by connecting the

activation port of the decrementor to the borrow bit.

The decrementor is a functional unit that decrements

the contents of a register by 1. It can possibly be

implemented using a simple subtractor circuit or a

counter. Again, similar to the case of division by 2^p-1, the MST of the quotient (Q_{n-1}) thus generated can

be stored in a quotient buffer or serialized for an

output data stream. The difference output from the

p-bit subtractor forms the partial remainder.

The partial remainder also has to be corrected

in the case of a generated borrow. A p-bit subtractor

(or alternatively a bit reversal module) with the

activation port tied to the borrow bit output of the

subtractor is used to obtain the partial remainder

from the generated difference tuple. The subtractor computes the value of (2^p+1) – Diff. Note that the

‘Diff’ quantity used here is the absolute difference

value between the two registers and hence the need

for the additional negative sign. For successive time

sample instances (t > 0), the partial remainder is

loaded into register 1 for the calculation of


successive tuples of the quotient. As before, at these instances (t > 0), successively lower-order tuples (T_{n-2}, T_{n-3}, T_{n-4}, T_{n-5}, etc.) are progressively loaded into register 2.

The subtractor then outputs the partial quotient and

remainder for each time instant, while the

decrementor and the second subtractor normalize the

partial quotient and the remainder which are fed to

the output queue and to register 1 for the processing

of the next sample instant, respectively. The cycle

generates the quotient tuple bits for the whole (non-fractional) part in the case of both integer and

floating point division.

The processing is terminated when all the input

dividend tuples have been processed. In the case of

integer division, the process remainder equals the

partial remainder obtained from the second

subtractor for the last time sequence instant (t = n-1).

In the case of floating point division, the fractional

part of the quotient is obtained as non-terminating,

recurring tuple bits computed by repeating the tuple

bits representing the partial remainder less 1 and its (2^p-1) complement. In other words, the fractional part consists of repeating sets of 2 tuples, the first of which is one less than the partial remainder (R – 1) and the second of which is (2^p-1) – (R-1).

As before, the fractional tuples can be

generated to any length based on the required levels

of system accuracy and precision.
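
The two-tuple pattern can be checked numerically with the following Python sketch. The helper name `fraction_tuples_plus` is hypothetical. The treatment of a zero remainder (no fractional part) is an assumption added here, since the (R – 1) rule only applies for R ≥ 1.

```python
from fractions import Fraction


def fraction_tuples_plus(remainder, p, pairs):
    """Fractional p-bit tuples of remainder / (2**p + 1), as `pairs`
    repetitions of the pattern (R - 1, (2**p - 1) - (R - 1)).

    Assumes 0 <= remainder <= 2**p. A zero remainder is handled
    separately (assumption of this sketch): the fraction is zero.
    """
    if remainder == 0:
        return [0, 0] * pairs
    first = remainder - 1
    second = ((1 << p) - 1) - first
    return [first, second] * pairs


p, r = 4, 16                       # e.g. 679 / 17 leaves remainder 16
digits = fraction_tuples_plus(r, p, 3)
approx = sum(Fraction(t, 16 ** (i + 1)) for i, t in enumerate(digits))
exact = Fraction(r, (1 << p) + 1)
# The truncation error is below one unit of the last emitted tuple.
assert 0 <= exact - approx < Fraction(1, 16 ** len(digits))
```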

4 ANALYSIS

Numerically, the algorithm presented in this paper

has an analogue in (Guei, 1985). However, unlike

the (Guei, 1985) algorithm which requires the entire

input stream for the computation, our scheme

requires only two tuples at any time sequence

instant, giving out one tuple of quotient bits along

with the partial remainder, based on carry/ borrow

calculation. Thus, our algorithm is well suited for

serial processing. Moreover, the circuit requirements

are constant for varying input bit lengths.

Additionally, our algorithm is well suited for integer

as well as floating point divisions and can generate

fractional results with arbitrary accuracy/ precision.
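
The constant-state property can be illustrated with a small Python class that consumes one tuple at a time. The class name `SerialDivider` and its `push` method are hypothetical. As a variation assumed here, and not taken from the paper, register 1 starts at zero instead of being preloaded with the MST, so that every tuple is handled uniformly and the first quotient tuple emitted is a leading 0 or 1.

```python
class SerialDivider:
    """Constant-state serial divider by 2**p - 1 (illustrative only)."""

    def __init__(self, p):
        self.divisor = (1 << p) - 1
        self.remainder = 0             # contents of register 1

    def push(self, tuple_in):
        """Consume one dividend tuple and return one quotient tuple."""
        total = self.remainder + tuple_in
        if total >= self.divisor:
            q, self.remainder = self.remainder + 1, total - self.divisor
        else:
            q, self.remainder = self.remainder, total
        return q


div = SerialDivider(p=4)
quotient = [div.push(t) for t in [2, 10, 7]]   # tuples of 679, MST first
assert quotient == [0, 2, 13] and div.remainder == 4
```

The object holds only the divisor and one partial remainder, whatever the length of the input stream, which mirrors the constant circuit requirement described above.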

Since tuples can be processed serially using a

single adder and 2 incrementors (example case of

division by 2^p-1), the constant circuit can be

efficiently implemented in hardware. As noted

earlier, the requirements do not change with increase

in input bit length and the same circuit can be

replicated for operation in parallel mode, in which

case the number of such computation units will be

equal to n – 1. Also, the design provides a natural

way to trade-off speed and circuit requirements

through the possibility of using a serial mode of

operation working on multiple tuples (or parallel

tuples) at the same time.

The possibility of calculating the fractional part

of the quotient to any arbitrary length with full

accuracy supports the use of the algorithm for

constant divider circuits in DSPs and other

embedded systems.

Since the number of computational units in the pipeline is smaller than, or at least comparable to, that of other state-of-the-art methods, the computation time can also be expected to favor our approach.

5 CONCLUSIONS AND FUTURE WORK

In this paper, we have presented the design for a

constant divider circuit of the form 2^p±1.

Analyses have also been presented to demonstrate

the constant computation requirements of the

approach. The method is well suited for processing

serialized inputs and dividend inputs with a priori

unknown bit length while producing full-precision,

full-accuracy, floating point capable results. The

next step would be to implement the design using

VHDL/Verilog for simulation and testing followed

by actual implementation in VLSI for a thorough

evaluation of timing, power requirements, memory

footprint and chip area estimation. This is expected

to be followed by performance evaluation of the

circuitry in consonance with a DSP or Embedded

System or a GPU targeted at applications such as

image processing, signal processing and statistical/

mathematical computation and modeling.

REFERENCES

A. Th. Schwarzbacher, M. Brutscheck, O. Schwingel, J. B.

Foley, ‘Constant Divider Structures of the Form 2^p±1’, pp. 368-375, Irish Signals And Systems

Conference, 2000.

P. Srinivasan, F. E. Petry, ‘Constant-Division Algorithms’,

IEEE Proc. Computers and Digital Techniques, Vol.

141, No. 6, 2007 (1994).

A. Th. Schwarzbacher, P. A. Comiskey and J. B. Foley,

‘Reduction of the power consumption at the

algorithmic level of CMOS circuits’, Electronic

Systems and Devices Conference, pp. 5-8, June 1998.

B. Al-Besher, A. Bouridane, A. S. Ashur, ‘An RNS-based

Division Architecture for Constant Divisors of the


Form 2^p+1 and 2^p-1’, Irish Signals & Systems

Conference, 1997.

S. Y. R. Li, ‘Fast Constant Division Routines,’ IEEE

Transactions on Computers, Vol. C-34, No. 9,

September 1985.

F. E. Petry, P. Srinivasan, ‘Division Techniques for

Integers of the Form 2^p+1 and 2^p-1’, Int. J.

Electronics, Vol. 74, No. 5, 1993.

R. L. Sites, ‘Serial Binary Division by Ten’, IEEE

Transactions on Computers, vol. 23, no. 12, pp. 1299-

1301, 1974.

Artzy, E., Hinds, J. A., Saal, H. J., 'A fast division

technique for constant divisors', Comm. ACM,

February 1976, 19, (Z), pp. 98-101.

C. Y. Guei, Y. X. Dong, W. B. Shan, 'A fast division

technique for constant divisors 2

+/- 1)’,

Proceedings of the 1st International Conference on

Computers and Applications, 1985, pp. 715-718.

Johannes, J., Pegden, C., Petry, F., 'Decimal shifting for an

exact floating point representation', Computer and

Electrical Engineering, 1980.7, (3). pp. 149-155.

Petry, F. E., 'Two's complement extension of a parallel

binary division by ten', Electronics Letters, September

1983, 19, (18), pp. 718-720.
