Wallace trees are an efficient (time and space) binary structure for adding many summands, which can be applied to multiplication. They are based around carry save adders, which is a fancy name for arrays of full adders:

Where denote bitwise xor, and and or, respectively, and is left shift. All variables are bit vectors of the same length. Carry-save adders perform a constant-time operation which conserves the *sum* of its inputs and outputs, i.e. , but there are fewer outputs than inputs.

Wallace (link to 1964 paper) devised a cunning method for applying carry saves to multi-operand addition. The essence is, for a given layer, pass each group of three vectors to a carry save. The output vectors (along with the loose inputs extending beyond a multiple of 3) pass to the next layer, where they are reduced again. The number of summands goes as approximately , and hence within layers we have only 2 numbers left to add, which can be completed in time with a carry-lookahead adder or similar.

There is a clear recursive structure here, which can be exploited using Verilog 2001 recursive modules:

The 3-summand case is handled directly; for any more than 3 layers, we perform a Wallace reduction, and the recursively instantiate ourselves for any further reduction that is needed. N is the number of inputs to the module, and N_NEXT is the number of outputs. The “out” module port passes up the final output of the innermost module instantiation.

The loop in the n-summand case does exactly what was described above: collect bundles of 3 summands, perform a carry save addition, and pass the results through, along with any shrapnel that did not fit into a CSA. Optionally, we can add half-adders for the (N – i == 2) case, but this does not alter the depth of the recursion, and seems to increase overall gatecount.

We can leverage our multi-input adder to create an efficient multiplier:

Simply generate shifted partial products (trivial — single digit binary multiplication is just AND) and add them all up. There we go, a multiplier! The lower half of the output (which has the same size as the input) is independent of the signedness of the inputs, but the upper half is not. We can handle this by sign-extending in the partial products, and setting the weight of the highest partial product to rather than , if each respective operand is signed. (These tricks are in fact equivalent, but are easier to apply in different situations.)

It may seem wasteful to use full-width partial-products for all of our inputs, but inputs which are provably constant 0 will be efficiently trimmed during synthesis, via e.g. the constant folding pass.

We can wrap up with a quick testbench to confirm that our mulltiplier… multiplies: