Maker Pro

Kids can do math better than x86 CPUs.

Terje Mathisen
Richard said:
[Please do not mail me a copy of your followup]

John Larkin <[email protected]> spake the secret code
But there are few problems that need more than 64 or 80 bit
floating-point math, and fewer still that need it fast.

....which makes it perfect for reconfigurable computing.

Not really.

The critical building block, as Nick has already mentioned, is the
NxN->2N multiplier (and/or MAC/FMAC), which means that you don't need a
lot of uncommitted gates, but instead a lot of multipliers and enough
(reconfigurable) routing between them to set up the required operations.
As I understand it, that's what people are doing these days. Where
previously you would have been required to develop a custom ASIC at
great NRE cost, nowadays you can develop custom hardware using a
relatively inexpensive FPGA peripheral card. Configuring the FPGA
circuitry is very much like writing software when you use something
like VHDL or Verilog.

In fact, you can even get these in a portable form factor -- there's
an FPGA card you can buy that plugs into the Nintendo GameBoy Advance
and it comes with a complete tool chain included. All for about $200
from Charmed Labs <http://www.charmedlabs.com> as the Xport 2.0 which
has 150,000 gates.

Nice. :)

Terje
 
|>
|> > A CPU should be able to do it faster in hardware than a software library...
|>
|> What makes you think this?

He is right, but not VERY right.

The standard ISAs provide poor primitives for the purpose. For multiple
precision integer arithmetic, there are only a few extra primitives
needed (e.g. NxN->2N multiplication, add with carry etc.) For what I
am doing (software binary floating-point), the ISA and HLL primitives
stink. The actual arithmetic isn't the problem, as it is the same as
for integers, but the housekeeping, normalisation, rounding etc. are
a right royal pain in the, er, neck.
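
To give a flavour of the housekeeping, here is the normalisation step
in portable C - a minimal sketch, with an illustrative name:

#include <stdint.h>

/* Shift the significand up until its top bit is set, adjusting the
   exponent to match. Portable C gives no count-leading-zeros
   primitive, so this degenerates to a loop; with a hardware CLZ it
   would be a couple of instructions. */
static void normalise(uint64_t *sig, int *exp)
{
    if (*sig == 0)
        return;                  /* zero has to be special-cased anyway */
    while (!(*sig >> 63)) {      /* top bit still clear */
        *sig <<= 1;
        --*exp;
    }
}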

My estimate from many years back, which is probably still true, is
that a decent set of primitives would enable software floating-point
to be 3-5 times slower than all-hardware (for a plausible precision).
But current ones make it a LOT slower than that.

However, when one is talking large integer arithmetic (hundreds of
digits), all one needs is a small number of extra primitives, and there
is no difficulty in doing it in software. Even without them, it is
very rarely more than 3 times slower than the best hardware could be.
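
For example, when the NxN->2N multiply isn't reachable from the
language, it has to be synthesised from half-width multiplies. A rough
portable C sketch (the function name is mine, not from any library):

#include <stdint.h>

/* Portable 64x64 -> 128 multiply built from four 32x32 -> 64
   products - the schoolbook method, one column at a time. */
static void mul64x64_128(uint64_t a, uint64_t b,
                         uint64_t *hi, uint64_t *lo)
{
    uint64_t a0 = a & 0xffffffffu, a1 = a >> 32;
    uint64_t b0 = b & 0xffffffffu, b1 = b >> 32;

    uint64_t p00 = a0 * b0;
    uint64_t p01 = a0 * b1;
    uint64_t p10 = a1 * b0;
    uint64_t p11 = a1 * b1;

    /* middle column plus the carry out of the low half */
    uint64_t mid = (p00 >> 32) + (p01 & 0xffffffffu) + (p10 & 0xffffffffu);

    *lo = (mid << 32) | (p00 & 0xffffffffu);
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}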

It should be noted that all modern CPUs I know of have these primitives,
even the lowly PIC. So bignum math is supported by hardware *already*.
The problem is that higher-level languages like C don't support these
low-level hardware features. Like, how would you detect the carry bit in C?
 
Wilco Dijkstra
It should be noted that all modern CPUs I know of have these primitives,
even the lowly PIC. So bignum math is supported by hardware *already*.
The problem is that higher-level languages like C don't support these
low-level hardware features. Like, how would you detect the carry bit in C?

Using add with carry is easy in C; various compilers can detect it:

x = y + z;
if (x < y) x++;
x += a;

Translates to (ARM assembly):

ADDS x, y, z
ADC x, x, a
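
Chained across words, the same idiom gives a full multi-precision add.
A minimal sketch (not a tuned library routine):

#include <stddef.h>
#include <stdint.h>

/* Multi-word addition using only the portable carry idiom; a compiler
   that recognises the comparisons can map them onto the carry flag. */
static void bignum_add(uint64_t *r, const uint64_t *x,
                       const uint64_t *y, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = x[i] + carry;
        uint64_t c = s < carry;      /* carry out of x[i] + carry */
        s += y[i];
        c += s < y[i];               /* carry out of + y[i] */
        r[i] = s;
        carry = c;
    }
}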

Many compilers also detect NxN -> 2N bit multiplies.
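
For example, the cast does the asking (the function name is just for
illustration):

#include <stdint.h>

/* Widening-multiply idiom: the cast tells the compiler a full
   32x32 -> 64 product is wanted, so it can emit the single hardware
   multiply rather than library code. */
uint64_t mulw(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}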

Wilco
 
Skybuck Flying
Only on the really slow implementations. While bignum addition and
subtraction have to be done digit by digit (for some definition of
digit), you do *not* want to do division and multiplication the way
you learned it in school. Karatsuba multiplication (O(n**1.6) instead
of O(n**2) for the classical algorithm) is useful for medium-sized
numbers, and FFT (O(n ln(n) ln(ln(n)))) for large numbers. Division
is usually performed by approximating a result and then refining it
with an appropriate number of iterations of Newton-Raphson (which just
needs multiplications).

Did you take into account the number of transistors needed for each method?

Maybe the methods you mentioned require fewer steps, but because they
are more complex they require more transistors!

Since the simpler method requires fewer transistors, it can be
implemented more times in parallel in the same "chip space".

So to make the comparison fair, that factor has to be taken into the
equation as well, for example:


1. "Reaction time of simple method" versus "Reaction time of complex
method".

Plus:

2. "Throughput of all sequentially called parallel implemented simple
methods" versus "Throughput of all sequentially called parallel implemented
complex methods".


For example:

The complex method has reaction time of 1000 nanoseconds.

The simple method has reaction time of 50000 nanoseconds.

The complex method can be implemented 10 times in the given chip space.

The simple method can be implemented 1000 times in the given chip space.

1 second = 1 000 000 000 nanoseconds:

Throughput for one complex method: 1 000 000 000 / 1 000 = 1 000 000

Throughput for one simple method: 1 000 000 000 / 50 000 = 20 000

Total throughput complex method: 1 000 000 * 10 = 10 000 000

Total throughput simple method: 20 000 * 1 000 = 20 000 000

This example shows the simple method could outperform the complex method.

Bye,
Skybuck.
 
Nick Maclaren
|> |>
|> > It should be noted that all modern CPUs I know of have these primitives,
|> > even the lowly PIC. So bignum math is supported by hardware *already*.
|> > The problem is that higher-level languages like C don't support these
|> > low-level hardware features. Like, how would you detect the carry bit in C?

Not that I can see. How many have a 64x64->128 multiply and a
128/64->64,64 quotient, remainder?

|> Using add with carry is easy in C; various compilers can detect it:

Some compilers may be able to, IF you use a code form that they can
recognise; but, if they don't document that and exactly what code
forms are acceptable, you are just hoping.

And, in any case, that is not "in C" - it is "in Intel C" or "in gcc"
etc. There is no way of doing it in standard or portable C, let alone
standard AND portable C.


Regards,
Nick Maclaren.
 
dalai lamah
One fine day Skybuck Flying wrote:
However, x86 CPUs today cannot do what kids, Greeks or common people can
do, and that is math with arbitrarily long numbers.

In my opinion, tasks like this don't belong in the CPU instruction set,
but at the software level. The CPU instructions should cover all (and
just) the useful tasks that can't be done with comparable efficiency by
a group of simpler instructions, for example HW multiplication between
two registers in one cycle, MAC in one cycle, and so on.
 
Ken Hagan
Maybe the methods you mentioned require fewer steps, but because they
are more complex they require more transistors!

This is nonsensical. You wouldn't implement such methods purely in
hardware since, unless you've got a way of adding transistors at run-time,
you'd be constrained in the size of number you could deal with.

So, you're necessarily looking at a software implementation, perhaps
assisted by some carefully chosen hardware primitives. As far as I know,
no-one has come up with primitives that are any better than the universal
notions of fixed-size arithmetic and carry bits.
 
Were they? Did the Greeks really have such advanced arithmetic skills?

What, you thought the Romans invented it? You ever try doing long
division with Roman Numerals, man? (OK, the other number systems in
use weren't much better!)

More realistically, while arithmetic is a Greek word, most texts trace
the algorithms taught in grade school today to the Indian
mathematician Brahmagupta. He called his method of multiplication
gomutrika, which translates to (I'm not making this up) "trajectory of
a cow's urine".

Tim.
 
Wilco Dijkstra
Nick Maclaren said:
|> |>
|> > It should be noted that all modern CPUs I know of have these primitives,
|> > even the lowly PIC. So bignum math is supported by hardware *already*.
|> > The problem is that higher-level languages like C don't support these
|> > low-level hardware features. Like, how would you detect the carry bit in C?

Not that I can see. How many have a 64x64->128 multiply and a
128/64->64,64 quotient, remainder?

Widening multiplies are quite common on CPUs; division is far less
common (especially on RISC), and the narrowing form even more so.
Luckily you don't need a division primitive for multi-precision
arithmetic.
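
(The trick: refine a reciprocal with Newton-Raphson, x' = x*(2 - d*x),
which doubles the number of correct digits per step and needs only
multiplies. A toy sketch with doubles - real bignum code runs the same
iteration at increasing precision:)

#include <stdio.h>

/* Newton-Raphson reciprocal: each step squares the relative error,
   so a handful of iterations suffice. The starting guess must be
   close enough that |1 - d*x| < 1. */
int main(void)
{
    double d = 7.0;
    double x = 0.1;                   /* rough initial guess for 1/7 */
    for (int i = 0; i < 6; i++) {
        x = x * (2.0 - d * x);
        printf("iteration %d: %.17g\n", i, x);
    }
    /* a quotient a/d is then just a*x, plus a correction step */
    return 0;
}
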
|> Using add with carry is easy in C; various compilers can detect it:

Some compilers may be able to, IF you use a code form that they can
recognise; but, if they don't document that and exactly what code
forms are acceptable, you are just hoping.

The idioms for rotate, widening multiplies, carry, multiply+accumulate,
division+modulo and so on are pretty standard. It is trivial to recognise
the various forms, e.g. a + b < a and b > a + b both test the carry flag.
And, in any case, that is not "in C" - it is "in Intel C" or "in gcc"
etc. There is no way of doing it in standard or portable C, let alone
standard AND portable C.

Actually these idioms are both standard AND portable (the code I showed
works on all compilers irrespective of integer size or whether the target
even has a carry flag!). On the other hand, inline assembler is non-standard
and not portable between different compilers for the same architecture.

Wilco
 
Nick Maclaren
|>
|> > Not that I can see. How many have a 64x64->128 multiply and a
|> > 128/64->64,64 quotient, remainder?
|>
|> Widening multiplies are quite common on CPUs; division is far less
|> common (especially on RISC), and the narrowing form even more so.
|> Luckily you don't need a division primitive for multi-precision
|> arithmetic.

No, you don't. You DO, however, need it for efficient, relatively short,
multiple precision arithmetic - such as is very common, especially for
floating-point emulation.

And you are wrong that widening multiplies are quite common for the
largest efficient integer sizes. It gains you very little if you have
to drop down a size in order to get them.

|> > |> Using add with carry is easy in C; various compilers can detect it:
|> >
|> > Some compilers may be able to, IF you use a code form that they can
|> > recognise; but, if they don't document that and exactly what code
|> > forms are acceptable, you are just hoping.
|>
|> The idioms for rotate, widening multiplies, carry, multiply+accumulate,
|> division+modulo and so on are pretty standard. It is trivial to recognise
|> the various forms, e.g. a + b < a and b > a + b both test the carry flag.

Ah. Perhaps I know rather more idioms than you do.

|> > And, in any case, that is not "in C" - it is "in Intel C" or "in gcc"
|> > etc. There is no way of doing it in standard or portable C, let alone
|> > standard AND portable C.
|>
|> Actually these idioms are both standard AND portable (the code I showed
|> works on all compilers irrespective of integer size or whether the target
|> even has a carry flag!). On the other hand, inline assembler is non-standard
|> and not portable between different compilers for the same architecture.

That code will work only for unsigned values and, even then, only if
all operands are the same precision. It will NOT generate code that
uses the carry bit at all reliably, even on Intel.

I have just checked gcc and Intel - neither do it by default, though
they might if I found the right compiler option.


Regards,
Nick Maclaren.
 
Skybuck Flying
Ken Hagan said:
This is nonsensical. You wouldn't implement such methods purely in
hardware since, unless you've got a way of adding transistors at run-time,
you'd be constrained in the size of number you could deal with.

The add algorithm needs 1 digit for A, 1 digit for B and two digits for the
result.

Maybe some more digits for the operation and that's about it.

The digits of A, the digits of B and the result digits of C can all be
streamed to and from memory.
So, you're necessarily looking at a software implementation, perhaps
assisted by some carefully chosen hardware primitives. As far as I know,
no-one has come up with primitives that are any better than the universal
notions of fixed-size arithmetic and carry bits.

Lol,

Welcome to Skybuck's World =D where the seemingly impossible becomes
possible =D

Note:

You asked about the ADD method, and that's the only thing I discussed
above; for the other three algorithms I don't know... I assume they
can be implemented with ADD.

Bye,
Skybuck.
 
Richard
[Please do not mail me a copy of your followup]

Terje Mathisen <[email protected]> spake the secret code
Not really.

The critical building block, as Nick has already mentioned, is the
NxN->2N multiplier (and/or MAC/FMAC), which means that you don't need a
lot of uncommitted gates, but instead a lot of multipliers and enough
(reconfigurable) routing between them to set up the required operations.

I don't understand your objection. You can implement multipliers in
FPGAs, plenty of people have. For arbitrary precision arithmetic you
might find a bit-serial approach easier to implement on an FPGA. That
means you can probably have several multipliers running concurrently.
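
By bit-serial I mean the classic shift-and-add scheme; in software
terms it looks like this sketch (the hardware would do one partial
product per clock):

#include <stdint.h>

/* Software model of a shift-and-add multiplier: examine one bit of b
   per step, adding a shifted copy of a when the bit is set. */
uint64_t mul_serial(uint32_t a, uint32_t b)
{
    uint64_t acc = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1)
            acc += (uint64_t)a << i;
    return acc;
}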
 
Richard
[Please do not mail me a copy of your followup]

"Ken Hagan" <[email protected]> spake the secret code
This is nonsensical. You wouldn't implement such methods purely in
hardware since, unless you've got a way of adding transistors at run-time,
you'd be constrained in the size of number you could deal with.

There's nothing that says the hardware has to process all the bits of
an arbitrary precision number in parallel. It can process them in
chunks. Of course, you might argue that this is just what the
software is doing and you'd be right, but it would still be an
implementation in hardware.
 
John Larkin
[Please do not mail me a copy of your followup]

Terje Mathisen <[email protected]> spake the secret code
Not really.

The critical building block, as Nick has already mentioned, is the
NxN->2N multiplier (and/or MAC/FMAC), which means that you don't need a
lot of uncommitted gates, but instead a lot of multipliers and enough
(reconfigurable) routing between them to set up the required operations.

I don't understand your objection. You can implement multipliers in
FPGAs, plenty of people have. For arbitrary precision arithmetic you
might find a bit-serial approach easier to implement on an FPGA. That
means you can probably have several multipliers running concurrently.

The midsize Xilinx chips have dozens of hard multipliers. And you can
make lots more, if slower, from the regular logic cells.

John
 
Thomas Womack
John Larkin said:
The midsize Xilinx chips have dozens of hard multipliers. And you can
make lots more, if slower, from the regular logic cells.

The problem is that a 17x17->34 unsigned multiplier clocked at 500MHz
is really only 1/64 as useful as a 64x64->128 multiplier clocked at
2GHz: you need 16 17x17 multiplies to emulate a 64x64, at a quarter of
the clock rate. You might be able to get that up a bit by Karatsuba
approaches, but only to 1/36 (two levels of Karatsuba take 3^2 = 9
rather than 4^2 = 16 quarter-width multipliers to emulate one of four
times the width), and I suspect the extra routing within the chip
means you slow down to significantly less than 500MHz.

Xilinx chips start at prices comparable to Core2 processors and get
absurdly expensive very quickly. A Virtex-5 LX50 board costs $900,
and that gets you 48 multipliers.

[you can get 24 multipliers on a $199 XSA Spartan-3 board, which is
probably a more appealing price point, but you'll have great trouble
running them at more than 150MHz]

Tom
 
Terje Mathisen
Richard said:
[Please do not mail me a copy of your followup]

Terje Mathisen <[email protected]> spake the secret code
Not really.

The critical building block, as Nick has already mentioned, is the
NxN->2N multiplier (and/or MAC/FMAC), which means that you don't need a
lot of uncommitted gates, but instead a lot of multipliers and enough
(reconfigurable) routing between them to set up the required operations.

I don't understand your objection. You can implement multipliers in
FPGAs, plenty of people have.

Not really.

Using general FPGA cells to implement a multiplier will be _far_ from
efficient compared with a dedicated slice of silicon.

Bignum math is either trivial (no efficiency/speed requirements), or
critical.

In the latter case FPGAs are unsuited.
For arbitrary precision arithmetic you
might find a bit-serial approach easier to implement on an FPGA. That
means you can probably have several multipliers running concurrently.

See above.

I guarantee you that this will be significantly slower than quite simple
x86 code using the basic 64x64->128 bit MUL opcode.
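
(For reference, with gcc on x86-64 that opcode is even reachable from C,
via the non-standard but widely supported unsigned __int128 - a sketch:)

#include <stdint.h>

/* gcc (and compatible compilers) on 64-bit targets compile this to
   the single 64x64 -> 128 MUL instruction. */
void mul64_full(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}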

Terje
 
Richard
[Please do not mail me a copy of your followup]

Terje Mathisen <[email protected]> spake the secret code
Using general FPGA cells to implement a multiplier will be _far_ from
efficient compared with a dedicated slice of silicon.

Nothing says that you can't hook up a dedicated multiplier.
Bignum math is either trivial (no efficiency/speed requirements), or
critical.

In the latter case FPGAs are unsuited.

I remain unconvinced just because you state it to be so.
See above.

I guarantee you that this will be significantly slower than quite simple
x86 code using the basic 64x64->128 bit MUL opcode.

OK, so prove it to me.
 
me
The problem is that a 17x17->34 unsigned multiplier clocked at 500MHz
is really only 1/64 as useful as a 64x64->128 multiplier clocked at
2GHz: you need 16 17x17 multiplies to emulate a 64x64, at a quarter of
the clock rate. You might be able to get that up a bit by Karatsuba
approaches, but only to 1/36 (two levels of Karatsuba take 3^2 = 9
rather than 4^2 = 16 quarter-width multipliers to emulate one of four
times the width), and I suspect the extra routing within the chip
means you slow down to significantly less than 500MHz.
snip

But they all do math better than children. Faster and less prone to
errors...
 
John Perry
One fine day Skybuck Flying wrote:

Some early responses started approaching the correct answer to SF's
initial statement, but then we went off into a lot of wrangling over
what is the best approach to solving the problem.

The correct response to your comment, Skybuck, is "of course it can".
In fact, CPUs have been doing exactly what 4th-graders do since the
earliest days of computing, the '60s and even before, in the only way
possible: we can't build infinitely wide-word CPUs, and kids (even
geniuses) can't keep track of infinite numbers of digits in their
heads, or even arbitrarily large numbers of digits.

Your comment, while it has sparked some interesting discussion, was
patently silly from the start.

John Perry
 