Did you take into account the number of transistors needed for each method?
Maybe the methods you mentioned require fewer steps, but because they are
more complex they require more transistors!
Since the simpler method requires fewer transistors, it can be implemented
more times in parallel in the same "chip space".
So to make the comparison fair, that factor has to be taken into the
equation as well, for example:
1. "Reaction time of simple method" versus "Reaction time of complex
method".
Plus:
2. "Throughput of all sequentially called parallel implemented simple
methods" versus "Throughput of all sequentially called parallel implemented
complex methods".
For example:
The complex method has a reaction time of 1000 nanoseconds.
The simple method has a reaction time of 50000 nanoseconds.
The complex method can be implemented 10 times on chip space.
The simple method can be implemented 1000 times on chip space.
1 second = 1 000 000 000 nanoseconds:
Throughput for one complex method: 1 000 000 000 / 1 000 = 1 000 000 calls/second
Throughput for one simple method: 1 000 000 000 / 50 000 = 20 000 calls/second
Total throughput, complex method: 1 000 000 * 10 = 10 000 000 calls/second
Total throughput, simple method: 20 000 * 1 000 = 20 000 000 calls/second
This example shows the simple method could outperform the complex method.
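The arithmetic in the example above, spelled out in Python; the latency and
copy-count figures are just the hypothetical numbers from the example, not
measurements of any real hardware:

```python
# Hypothetical figures from the example above.
NS_PER_SECOND = 1_000_000_000

complex_latency_ns, complex_copies = 1_000, 10
simple_latency_ns, simple_copies = 50_000, 1_000

# Per-unit throughput times the number of units that fit on the chip.
complex_throughput = NS_PER_SECOND // complex_latency_ns * complex_copies
simple_throughput = NS_PER_SECOND // simple_latency_ns * simple_copies

# complex: 10 000 000 calls/second, simple: 20 000 000 calls/second
```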
Throughput wasn't on the table before...
Throughput makes for a much more complex question. It may in some
cases be possible to build a very small, but very slow, multiplier,
and you might get the best throughput per transistor that way. For
example, a bit-serial multiplier will be of size N, but will take
N**2 time to multiply two numbers. A full multiplier (using the
basic algorithm) will be of size N**2, but will multiply in
approximately log(N)**2 time.
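A rough Python model of the bit-serial end of that tradeoff; the clock
accounting here is a coarse assumption (n bit-cycles per partial product),
not a gate-level count:

```python
def bit_serial_multiply(a, b, n):
    # Model of a bit-serial multiplier for two n-bit inputs: one
    # partial product per bit of b, each taking about n bit-serial
    # add cycles, for roughly n**2 clocks total.
    acc = 0
    clocks = 0
    for i in range(n):
        if (b >> i) & 1:
            acc += a << i   # the hardware would shift this in one bit per clock
        clocks += n
    return acc, clocks

# 4-bit example: 13 * 11 = 143, in about 4**2 = 16 clocks.
```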
Also, since we're talking about hardware implementations, pipelining
has to be considered. The simple full multiplier can easily be
pipelined so that about log(N) multiplications can be working their
way through it at any one time.
Now as to particular implementations. Building a full Karatsuba
multiplier would be straightforward, and would have about the same
advantage in transistor count (modulo some constant) over the basic
algorithm as does the software implementation. It would even be
pipelineable. You will definitely lose something in die area, since
the resulting structure will not be anywhere near as regular as that
for a basic full multiplier, so you wouldn't be able to fit as many on
a chip.
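For reference, the software Karatsuba being compared against is tiny; a
minimal Python sketch (the 2**32 base-case cutoff is an arbitrary choice
for illustration):

```python
def karatsuba(x, y):
    # Base case: fall back to a plain multiply for small operands.
    if x < 2**32 or y < 2**32:
        return x * y
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    xh, xl = x >> half, x & mask
    yh, yl = y >> half, y & mask
    # Three recursive multiplies instead of four.
    p1 = karatsuba(xh, yh)
    p2 = karatsuba(xl, yl)
    p3 = karatsuba(xh + xl, yh + yl) - p1 - p2
    return (p1 << (2 * half)) + (p3 << half) + p2
```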
The transforms needed for FFT multiplication would require a fair bit
of analysis before I'd be willing to speculate on how hard it would be
to implement a "full FFT multiplier." But as a basic rule, the amount
of hardware you need for a "full" implementation will be roughly
proportional to the number of operations performed by the algorithm.
But the question is, would you ever choose to build a 1024x1024 bit
multiplier in hardware directly? It's possible that might be the best
approach for some application, but it doesn't solve the problem for
numbers bigger than 1024 bits, and it will be very expensive in terms
of area. Almost certainly you'd build a high-precision multiplier out of
some smaller components, like a 32x32 multiplier, storage for the work
fields, and some control logic, etc. The control logic then replaces
the software algorithm, and other than some linear improvements in
efficiency (for example, the multiplier might pipeline, you might be
able to merge several add functions into a single clock cycle), runs
in exactly the same way. And mind you, you can fit quite a complex
sequencer in the same space as a 32x32 multiplier.
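The limb-at-a-time scheme such a sequencer would drive looks like this in
software; a Python sketch assuming 32-bit limbs and schoolbook order (both
illustrative choices, not a description of any particular design):

```python
def limb_multiply(a, b, limb_bits=32):
    # Schoolbook multiply built from limb-by-limb products, the way a
    # sequencer would drive a 32x32 hardware multiplier.  The limb
    # width and storage layout are illustrative assumptions.
    mask = (1 << limb_bits) - 1

    def to_limbs(x):            # little-endian limb vector
        out = []
        while x:
            out.append(x & mask)
            x >>= limb_bits
        return out or [0]

    al, bl = to_limbs(a), to_limbs(b)
    work = [0] * (len(al) + len(bl))    # the "storage for the work fields"
    for i, ai in enumerate(al):
        carry = 0
        for j, bj in enumerate(bl):
            t = work[i + j] + ai * bj + carry   # one 32x32 multiply per step
            work[i + j] = t & mask
            carry = t >> limb_bits
        work[i + len(bl)] += carry
    result = 0
    for limb in reversed(work):
        result = (result << limb_bits) + limb
    return result
```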
In any event, it seems difficult to justify any really large hardware
multipliers. At some point, the software overhead of an extended
precision calculation becomes insignificant. If the CPU actually
provided a reasonably fast 1024x1024 multiplier, you could probably
write a 16384x16384 multiplier in interpreted Basic (using the
hardware 1024x1024 multiplier as a primitive) that would be
indistinguishable in performance from a hardware or microcode
implementation of the same thing. So why burden the hardware with
that? Which brings us to your original point. A CPU should not be
expected to solve all of the world's computation problems directly. It
should, however, provide a good set of primitives that allow the
programmer to solve those problems. Why, for example, should a
processor have a primitive to compute hyperbolic trig functions, or
directly handle complex numbers (or any of an infinity of other
complex functions)? While at some point it may make sense to include
those, for the foreseeable future, implementing those directly in
hardware would bring modest performance gains (at best), for
considerable complexity, and be applicable only to a small user
community.
As a general comment, I can't think of a single CPU that's directly
implemented arbitrary precision arithmetic. A few cases of
specialized hardware, for example the modular exponentiation unit on
Niagara, do exist, but those are not for general purpose use.