Maker Pro

Kids can do math better than x86 CPUs.

Terje Mathisen
Richard said:
[Please do not mail me a copy of your followup]

John Larkin <[email protected]> spake the secret code
But there are few problems that need more than 64 or 80 bit
floating-point math, and fewer still that need it fast.

....which makes it perfect for reconfigurable computing.

Not really.

The critical building block, as Nick has already mentioned, is the
NxN->2N multiplier (and/or MAC/FMAC), which means that you don't need a
lot of uncommitted gates, but instead a lot of multipliers and enough
(reconfigurable) routing between them to set up the required operations.
As I understand it, that's what people are doing these days. Where
previously you would have been required to develop a custom ASIC at
great NRE cost, nowadays you can develop custom hardware using a
relatively inexpensive FPGA peripheral card. Configuring the FPGA
circuitry is very much like writing software when you use something
like VHDL or Verilog.

In fact, you can even get these in a portable form factor -- there's
an FPGA card you can buy that plugs into the Nintendo GameBoy Advance
and it comes with a complete tool chain included. All for about $200
from Charmed Labs <http://www.charmedlabs.com> as the Xport 2.0 which
has 150,000 gates.

Nice. :)

Terje
 
|>
|> > A CPU should be able to do it faster in hardware than a software library...
|>
|> What makes you think this?

He is right, but not VERY right.

The standard ISAs provide poor primitives for the purpose. For multiple
precision integer arithmetic, there are only a few extra primitives
needed (e.g. NxN->2N multiplication, add with carry etc.) For what I
am doing (software binary floating-point), the ISA and HLL primitives
stink. The actual arithmetic isn't the problem, as it is the same as
for integers, but the housekeeping, normalisation, rounding etc. are
a right royal pain in the, er, neck.
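
To give a flavour of the housekeeping, here is the normalisation step
in portable C - a minimal sketch, with an illustrative name:

#include <stdint.h>

/* Shift the significand up until its top bit is set, adjusting the
   exponent to match. Portable C gives no count-leading-zeros
   primitive, so this degenerates to a loop; with a hardware CLZ it
   would be a couple of instructions. */
static void normalise(uint64_t *sig, int *exp)
{
    if (*sig == 0)
        return;                  /* zero has to be special-cased anyway */
    while (!(*sig >> 63)) {      /* top bit still clear */
        *sig <<= 1;
        --*exp;
    }
}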

My estimate from many years back, which is probably still true, is
that a decent set of primitives would enable software floating-point
to be 3-5 times slower than all-hardware (for a plausible precision).
But current ones make it a LOT slower than that.

However, when one is talking large integer arithmetic (hundreds of
digits), all one needs is a small number of extra primitives, and there
is no difficulty in doing it in software. Even without them, it is
very rarely more than 3 times slower than the best hardware could be.
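
For example, when the NxN->2N multiply isn't reachable from the
language, it has to be synthesised from half-width multiplies. A rough
portable C sketch (the function name is mine, not from any library):

#include <stdint.h>

/* Portable 64x64 -> 128 multiply built from four 32x32 -> 64
   products - the schoolbook method, one column at a time. */
static void mul64x64_128(uint64_t a, uint64_t b,
                         uint64_t *hi, uint64_t *lo)
{
    uint64_t a0 = a & 0xffffffffu, a1 = a >> 32;
    uint64_t b0 = b & 0xffffffffu, b1 = b >> 32;

    uint64_t p00 = a0 * b0;
    uint64_t p01 = a0 * b1;
    uint64_t p10 = a1 * b0;
    uint64_t p11 = a1 * b1;

    /* middle column plus the carry out of the low half */
    uint64_t mid = (p00 >> 32) + (p01 & 0xffffffffu) + (p10 & 0xffffffffu);

    *lo = (mid << 32) | (p00 & 0xffffffffu);
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}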

It should be noted that all modern CPUs I know of have these primitives,
even the lowly PIC. So bignum math is supported by hardware *already*.
The problem is that higher-level languages like C don't support these
low-level hardware features. Like, how would you detect the carry bit in C?
 
Wilco Dijkstra
It should be noted that all modern CPUs I know of have these primitives,
even the lowly PIC. So bignum math is supported by hardware *already*.
The problem is that higher-level languages like C don't support these
low-level hardware features. Like, how would you detect the carry bit in C?

Using add with carry is easy in C; various compilers can detect it:

x = y + z;
if (x < y) x++;
x += a;

Translates to (ARM assembly):

ADDS x, y, z
ADC x, x, a
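
Chained across words, the same idiom gives a full multi-precision add.
A minimal sketch (not a tuned library routine):

#include <stddef.h>
#include <stdint.h>

/* Multi-word addition using only the portable carry idiom; a compiler
   that recognises the comparisons can map them onto the carry flag. */
static void bignum_add(uint64_t *r, const uint64_t *x,
                       const uint64_t *y, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = x[i] + carry;
        uint64_t c = s < carry;      /* carry out of x[i] + carry */
        s += y[i];
        c += s < y[i];               /* carry out of + y[i] */
        r[i] = s;
        carry = c;
    }
}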

Many compilers also detect NxN -> 2N bit multiplies.
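
For example, the cast does the asking (the function name is just for
illustration):

#include <stdint.h>

/* Widening-multiply idiom: the cast tells the compiler a full
   32x32 -> 64 product is wanted, so it can emit the single hardware
   multiply rather than library code. */
uint64_t mulw(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}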

Wilco
 
Skybuck Flying
Only on the really slow implementations. While bignum addition and
subtraction have to be done digit by digit (for some definition of
digit), you do *not* want to do division and multiplication the way
you learned it in school. Karatsuba multiplication (O(n**1.6) instead
of O(n**2) for the classical algorithm) is useful for medium-sized
numbers, and FFT (O(n ln(n) ln(ln(n)))) for large numbers. Division
is usually performed by approximating a result and then refining it
with an appropriate number of iterations of Newton-Raphson (which just
needs multiplications).

Did you take into account the number of transistors needed for each method?

Maybe the methods you mentioned require fewer steps, but because they
are more complex they require more transistors!

Since the simpler method requires fewer transistors, it can be
implemented more times in parallel in the same "chip space".

So to make the comparison fair, that factor has to be taken into the
equation as well, for example:


1. "Reaction time of simple method" versus "Reaction time of complex
method".

Plus:

2. "Throughput of all sequentially called parallel implemented simple
methods" versus "Throughput of all sequentially called parallel implemented
complex methods".


For example:

The complex method has reaction time of 1000 nanoseconds.

The simple method has reaction time of 50000 nanoseconds.

The complex method can be implemented 10 times in the given chip space.

The simple method can be implemented 1000 times in the given chip space.

1 second = 1 000 000 000 nanoseconds:

Throughput for one complex method: 1 000 000 000 / 1 000 = 1 000 000

Throughput for one simple method: 1 000 000 000 / 50 000 = 20 000

Total throughput complex method: 1 000 000 * 10 = 10 000 000

Total throughput simple method: 20 000 * 1 000 = 20 000 000

This example shows the simple method could outperform the complex method.

Bye,
Skybuck.
 
Nick Maclaren
|> |>
|> > It should be noted that all modern CPUs I know of have these primitives,
|> > even the lowly PIC. So bignum math is supported by hardware *already*.
|> > The problem is that higher-level languages like C don't support these
|> > low-level hardware features. Like, how would you detect the carry bit in C?

Not that I can see. How many have a 64x64->128 multiply and a
128/64->64,64 quotient, remainder?

|> Using add with carry is easy in C; various compilers can detect it:

Some compilers may be able to, IF you use a code form that they can
recognise; but, if they don't document that and exactly what code
forms are acceptable, you are just hoping.

And, in any case, that is not "in C" - it is "in Intel C" or "in gcc"
etc. There is no way of doing it in standard or portable C, let alone
standard AND portable C.


Regards,
Nick Maclaren.
 
dalai lamah
One fine day Skybuck Flying wrote:
However, x86 CPUs today cannot do what kids, Greeks or common people can
do, and that is math with arbitrarily long numbers.

In my opinion, tasks like this don't belong in the CPU instruction set,
but at the software level. The CPU instructions should cover all (and
just) the useful tasks that can't be done with comparable efficiency by
a group of simpler instructions, for example HW multiplication between
two registers in one cycle, MAC in one cycle, and so on.
 
Ken Hagan
Maybe the methods you mentioned require fewer steps, but because they
are more complex they require more transistors!

This is nonsensical. You wouldn't implement such methods purely in
hardware since, unless you've got a way of adding transistors at run-time,
you'd be constrained in the size of number you could deal with.

So, you're necessarily looking at a software implementation, perhaps
assisted by some carefully chosen hardware primitives. As far as I know,
no-one has come up with primitives that are any better than the universal
notions of fixed-size arithmetic and carry bits.
 
Were they? Did the Greeks really have such advanced arithmetic skills?

What, you thought the Romans invented it? You ever try doing long
division with Roman Numerals, man? (OK, the other number systems in
use weren't much better!)

More realistically, while arithmetic is a Greek word, most texts trace
the algorithms taught in grade school today to the Indian
mathematician Brahmagupta. He called his method of multiplication
gomutrika, which translates to (I'm not making this up) "trajectory of
a cow's urine".

Tim.
 
Wilco Dijkstra
Nick Maclaren said:
|> |>
|> > It should be noted that all modern CPUs I know of have these primitives,
|> > even the lowly PIC. So bignum math is supported by hardware *already*.
|> > The problem is that higher-level languages like C don't support these
|> > low-level hardware features. Like, how would you detect the carry bit in C?

Not that I can see. How many have a 64x64->128 multiply and a
128/64->64,64 quotient, remainder?

Widening multiplies are quite common on CPUs; division is far less
common (especially on RISC), and the narrowing form even more so.
Luckily you don't need a division primitive for multi-precision
arithmetic.
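
(The trick: refine a reciprocal with Newton-Raphson, x' = x*(2 - d*x),
which doubles the number of correct digits per step and needs only
multiplies. A toy sketch with doubles - real bignum code runs the same
iteration at increasing precision:)

#include <stdio.h>

/* Newton-Raphson reciprocal: each step squares the relative error,
   so a handful of iterations suffice. The starting guess must be
   close enough that |1 - d*x| < 1. */
int main(void)
{
    double d = 7.0;
    double x = 0.1;                   /* rough initial guess for 1/7 */
    for (int i = 0; i < 6; i++) {
        x = x * (2.0 - d * x);
        printf("iteration %d: %.17g\n", i, x);
    }
    /* a quotient a/d is then just a*x, plus a correction step */
    return 0;
}
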
|> Using add with carry is easy in C; various compilers can detect it:

Some compilers may be able to, IF you use a code form that they can
recognise; but, if they don't document that and exactly what code
forms are acceptable, you are just hoping.

The idioms for rotate, widening multiplies, carry, multiply+accumulate,
division+modulo and so on are pretty standard. It is trivial to recognise
the various forms, e.g. a + b < a and b > a + b both test the carry flag.
And, in any case, that is not "in C" - it is "in Intel C" or "in gcc"
etc. There is no way of doing it in standard or portable C, let alone
standard AND portable C.

Actually these idioms are both standard AND portable (the code I showed
works on all compilers irrespective of integer size or whether the target
even has a carry flag!). On the other hand, inline assembler is non-standard
and not portable between different compilers for the same architecture.

Wilco
 
Nick Maclaren
|>
|> > Not that I can see. How many have a 64x64->128 multiply and a
|> > 128/64->64,64 quotient, remainder?
|>
|> Widening multiplies are quite common on CPUs; division is far less
|> common (especially on RISC), and the narrowing form even more so.
|> Luckily you don't need a division primitive for multi-precision
|> arithmetic.

No, you don't. You DO, however, need it for efficient, relatively short,
multiple precision arithmetic - such as is very common, especially for
floating-point emulation.

And you are wrong that widening multiplies are quite common for the
largest efficient integer sizes. It gains you very little if you have
to drop down a size in order to get them.

|> > |> Using add with carry is easy in C; various compilers can detect it:
|> >
|> > Some compilers may be able to, IF you use a code form that they can
|> > recognise; but, if they don't document that and exactly what code
|> > forms are acceptable, you are just hoping.
|>
|> The idioms for rotate, widening multiplies, carry, multiply+accumulate,
|> division+modulo and so on are pretty standard. It is trivial to recognise
|> the various forms, e.g. a + b < a and b > a + b both test the carry flag.

Ah. Perhaps I know rather more idioms than you do.

|> > And, in any case, that is not "in C" - it is "in Intel C" or "in gcc"
|> > etc. There is no way of doing it in standard or portable C, let alone
|> > standard AND portable C.
|>
|> Actually these idioms are both standard AND portable (the code I showed
|> works on all compilers irrespective of integer size or whether the target
|> even has a carry flag!). On the other hand, inline assembler is non-standard
|> and not portable between different compilers for the same architecture.

That code will work only for unsigned values and, even then, only if
all operands are the same precision. It will NOT generate code that
uses the carry bit at all reliably, even on Intel.

I have just checked gcc and Intel - neither do it by default, though
they might if I found the right compiler option.


Regards,
Nick Maclaren.
 
Skybuck Flying
Ken Hagan said:
This is nonsensical. You wouldn't implement such methods purely in
hardware since, unless you've got a way of adding transistors at run-time,
you'd be constrained in the size of number you could deal with.

The add algorithm needs 1 digit for A, 1 digit for B and two digits for the
result.

Maybe some more digits for the operation and that's about it.

The digits of A, the digits of B and the result digits of C can all be
streamed to and from memory.
So, you're necessarily looking at a software implementation, perhaps
assisted by some carefully chosen hardware primitives. As far as I know,
no-one has come up with primitives that are any better than the universal
notions of fixed-size arithmetic and carry bits.

Lol,

Welcome to Skybuck's World =D where the seemingly impossible becomes
possible =D

Note:

You asked about the ADD method, and that's the only thing I discussed
above; for the other three algorithms I don't know... I assume they
can be implemented with ADD.

Bye,
Skybuck.
 
Richard
[Please do not mail me a copy of your followup]

Terje Mathisen <[email protected]> spake the secret code
Not really.

The critical building block, as Nick has already mentioned, is the
NxN->2N multiplier (and/or MAC/FMAC), which means that you don't need a
lot of uncommitted gates, but instead a lot of multipliers and enough
(reconfigurable) routing between them to set up the required operations.

I don't understand your objection. You can implement multipliers in
FPGAs, plenty of people have. For arbitrary precision arithmetic you
might find a bit-serial approach easier to implement on an FPGA. That
means you can probably have several multipliers running concurrently.
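
By bit-serial I mean the classic shift-and-add scheme; in software
terms it looks like this sketch (the hardware would do one partial
product per clock):

#include <stdint.h>

/* Software model of a shift-and-add multiplier: examine one bit of b
   per step, adding a shifted copy of a when the bit is set. */
uint64_t mul_serial(uint32_t a, uint32_t b)
{
    uint64_t acc = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1)
            acc += (uint64_t)a << i;
    return acc;
}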
 
Richard
[Please do not mail me a copy of your followup]

"Ken Hagan" <[email protected]> spake the secret code
This is nonsensical. You wouldn't implement such methods purely in
hardware since, unless you've got a way of adding transistors at run-time,
you'd be constrained in the size of number you could deal with.

There's nothing that says the hardware has to process all the bits of
an arbitrary precision number in parallel. It can process them in
chunks. Of course, you might argue that this is just what the
software is doing and you'd be right, but it would still be an
implementation in hardware.
 
John Larkin
[Please do not mail me a copy of your followup]

Terje Mathisen <[email protected]> spake the secret code
Not really.

The critical building block, as Nick has already mentioned, is the
NxN->2N multiplier (and/or MAC/FMAC), which means that you don't need a
lot of uncommitted gates, but instead a lot of multipliers and enough
(reconfigurable) routing between them to set up the required operations.

I don't understand your objection. You can implement multipliers in
FPGAs, plenty of people have. For arbitrary precision arithmetic you
might find a bit-serial approach easier to implement on an FPGA. That
means you can probably have several multipliers running concurrently.

The midsize Xilinx chips have dozens of hard multipliers. And you can
make lots more, if slower, from the regular logic cells.

John
 
Thomas Womack
John Larkin said:
The midsize Xilinx chips have dozens of hard multipliers. And you can
make lots more, if slower, from the regular logic cells.

The problem is that a 17x17->34 unsigned multiplier clocked at 500MHz
is really only 1/64 as useful as a 64x64->128 multiplier clocked at
2GHz: you need 16 17x17 multiplies to emulate a 64x64, at a quarter of
the clock rate. You might be able to get that up a bit by Karatsuba
approaches, but only to 1/36 (two levels of Karatsuba take 3^2 = 9
rather than 4^2 = 16 quarter-width multipliers to emulate one of four
times the width), and I suspect the extra routing within the chip
means you slow down to significantly less than 500MHz.

Xilinx chips start at prices comparable to Core2 processors and get
absurdly expensive very quickly. A Virtex-5 LX50 board costs $900,
and that gets you 48 multipliers.

[you can get 24 multipliers on a $199 XSA Spartan-3 board, which is
probably a more appealing price point, but you'll have great trouble
running them at more than 150MHz]

Tom
 
Terje Mathisen
Richard said:
[Please do not mail me a copy of your followup]

Terje Mathisen <[email protected]> spake the secret code
Not really.

The critical building block, as Nick has already mentioned, is the
NxN->2N multiplier (and/or MAC/FMAC), which means that you don't need a
lot of uncommitted gates, but instead a lot of multipliers and enough
(reconfigurable) routing between them to set up the required operations.

I don't understand your objection. You can implement multipliers in
FPGAs, plenty of people have.

Not really.

Using general FPGA cells to implement a multiplier will be _far_ from
efficient compared with a dedicated slice of silicon.

Bignum math is either trivial (no efficiency/speed requirements), or
critical.

In the latter case FPGAs are unsuited.
For arbitrary precision arithmetic you
might find a bit-serial approach easier to implement on an FPGA. That
means you can probably have several multipliers running concurrently.

See above.

I guarantee you that this will be significantly slower than quite simple
x86 code using the basic 64x64->128 bit MUL opcode.
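
(For reference, with gcc on x86-64 that opcode is even reachable from C,
via the non-standard but widely supported unsigned __int128 - a sketch:)

#include <stdint.h>

/* gcc (and compatible compilers) on 64-bit targets compile this to
   the single 64x64 -> 128 MUL instruction. */
void mul64_full(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}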

Terje
 
Richard
[Please do not mail me a copy of your followup]

Terje Mathisen <[email protected]> spake the secret code
Using general FPGA cells to implement a multiplier will be _far_ from
efficient compared with a dedicated slice of silicon.

Nothing says that you can't hook up a dedicated multiplier.
Bignum math is either trivial (no efficiency/speed requirements), or
critical.

In the latter case FPGAs are unsuited.

I remain unconvinced just because you state it to be so.
See above.

I guarantee you that this will be significantly slower than quite simple
x86 code using the basic 64x64->128 bit MUL opcode.

OK, so prove it to me.
 
me
The problem is that a 17x17->34 unsigned multiplier clocked at 500MHz
is really only 1/64 as useful as a 64x64->128 multiplier clocked at
2GHz: you need 16 17x17 multiplies to emulate a 64x64, at a quarter of
the clock rate. You might be able to get that up a bit by Karatsuba
approaches, but only to 1/36 (two levels of Karatsuba take 3^2 = 9
rather than 4^2 = 16 quarter-width multipliers to emulate one of four
times the width), and I suspect the extra routing within the chip
means you slow down to significantly less than 500MHz.
snip

But they all do math better than children. Faster and less prone to
errors...
 
John Perry
One fine day Skybuck Flying wrote:

Some early responses started approaching the correct answer to SF's
initial statement, but then we went off into a lot of wrangling over
what is the best approach to solving the problem.

The correct response to your comment, Skybuck, is "of course it can".
In fact, CPUs have been doing exactly what 4th-graders do since the
earliest days of computing, the '60s and even before, in the only way
possible: we can't build infinitely wide-word CPUs, and kids (even
geniuses) can't keep track of infinite numbers of digits in their
heads, or even arbitrarily large numbers of digits.

Your comment, while it has sparked some interesting discussion, was
patently silly from the start.

John Perry
 