Maker Pro

Re: Intel details future Larrabee graphics chip

MooseFET

On 20 Aug 2008 08:25:57 GMT, [email protected] (Nick Maclaren) wrote:
[....]
Aside: does anyone know why the "Harvard" approach was promoted from
being a trivial but important variation of Von Neumann to being of
equal rank, starting about 20 years ago? Because it assuredly ain't
so, despite the nonsense in Wikipedia, and almost all programming
languages have used separate code and data "address spaces" since
the invention of COBOL and FORTRAN, and were/are always talked about
as using the Von Neumann model (as they do).
Regards,
Nick Maclaren.

Well, gosh, the idea came from Harvard, it must be right.

I believe that the idea existed long before it got the tag "Harvard"
put onto it.

It is part of the reason why the 8051 is a much better processor than
an 8088.
 
Dennis

Terje said:
Huh???

Have you ever heard about x86 (SWAP reg), or any of the SIMD instruction
sets, like Altivec?

Terje
Actually both x86 and PowerPC have byte-reverse instructions for up to 8
bytes. On the PowerPC the SIMD byte shuffle instruction is a bit
faster, but you need to worry about 16-byte alignment.
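
For illustration, a minimal sketch of that shuffle using the AltiVec
vec_perm intrinsic (the function name is mine, and the data must come
from a 16-byte-aligned load):

    #include <altivec.h>

    /* Reverse the 16 bytes of a vector with a single vperm. */
    vector unsigned char reverse16(vector unsigned char v)
    {
        const vector unsigned char pattern =
            {15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
        return vec_perm(v, v, pattern);
    }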
 
Bernd Paysan

Joe said:
Comes from the Harvard Mark I, which used a paper tape for instructions
and some sort of electromechanical storage for data.

That goes back to the first Zuse machine. The actual advance was von
Neumann's, because he recognized that instructions are just another form
of data. However, many computer languages haven't caught up with that
idea: they still only manipulate data and don't provide meta-programming
features.

No wonder ancient abominations like the 8051 are still popular :-(.
 
Dennis said:
Actually both x86 and PowerPC have byte-reverse instructions for up to
8 bytes. On the PowerPC the SIMD byte shuffle instruction is a bit
faster, but you need to worry about 16-byte alignment.

The PowerPC bit is news to me. First of all, there are no byte-reverse
instructions per se, only reversed loads/stores; secondly, there is no
ldbrx/stdbrx, so you can only reverse halfwords and words.

Besides, on _x86_ bswap is also limited to 16/32-bit quantities; bswapq
is x86_64 material.

P.S. There's a chance I'm missing something about doubleword byte reverse
on PPC64, so please do fill me in.
 
MooseFET

Huh???

Have you ever heard about x86 (SWAP reg),

Yes, I have heard of the SWAP and exchange instructions. Take a look
at what they do to a 32-bit field and compare that to the operation
needed to convert big- and little-endian 32-bit values. You will find
that they don't do the needed operation in one go.

or any of the SIMD instruction
 
Andrew Reilly

I.e. FFT/DCT runs better with a precalculated array of pairs of indices
to be swapped, since that avoids all the need for branch prediction
within the inner loop:

From:

    ai = bitrev(i);
    if (ai < i)
        swap(data[i], data[ai]);

To:

    i = revtable[r];
    j = revtable[r+1];
    swap(data[i], data[j]);


The fastest bit-reverse re-orderer I've managed used neither table nor
bit reverser: it computed the indices to be swapped recursively, just
following the counting sequence, but bit-backwards. Fully deterministic
(especially on machines with a return predictor). On a DSP with bit-
reverse post-increment addressing modes you usually don't need to
physically reorder the data at all: just use it in sequence, out of
order. However, I'm now of the opinion that that sort of trick is not
worth the architectural complication, since the in-place reorder
operation consumes way less than 10% of the FFT time however you do it,
usually.
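
A minimal C sketch of such a recursive reorderer (my reconstruction, not
Andrew's actual code): each recursion level fixes one bit of the forward
index and the mirrored bit of the reversed index, so the call pattern is
identical on every invocation and plays well with a return predictor.

    #include <stddef.h>

    /* Enumerate (i, bitrev(i)) pairs by recursion: 'lo' walks the bits
       of the forward index from the bottom up, 'hi' walks the reversed
       index from the top down. */
    static void brswap_rec(float *data, size_t i, size_t j,
                           size_t lo, size_t hi)
    {
        if (lo > hi) {                 /* all bit positions assigned */
            if (i < j) {               /* swap each pair only once */
                float t = data[i]; data[i] = data[j]; data[j] = t;
            }
            return;
        }
        brswap_rec(data, i, j, lo << 1, hi >> 1);           /* bit clear */
        brswap_rec(data, i | lo, j | hi, lo << 1, hi >> 1); /* bit set */
    }

    /* In-place bit-reversal reorder; n must be a power of two. */
    void bitrev_reorder(float *data, size_t n)
    {
        brswap_rec(data, 0, 0, 1, n >> 1);
    }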

Cheers,
 
MooseFET

The simplest case is when you have to swap the byte order of multi-
byte numbers. Many processors can exchange two bytes but I don't know
of any that can reverse the order of a 4 byte or 8 byte field quickly.

Oops, I must have been asleep while writing that.
 
Wilco Dijkstra

MooseFET said:
Yes, I have heard of the SWAP and exchange instructions. Take a look
at what they do to a 32-bit field and compare that to the operation
needed to convert big- and little-endian 32-bit values. You will find
that they don't do the needed operation in one go.

x86 bswap converts a 32-bit value between little and big-endian as
expected. You can use an endian swap of any size to swap larger
values by repeating it several times (and that applies to both pure
endian and odd mixed endian values).
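
To make the repetition trick concrete, a sketch (mine, not Wilco's code):
a portable 32-bit byte reverse, and a 64-bit reverse built by swapping the
two halves and byte-reversing each. Compilers typically recognize the
32-bit pattern and emit a single bswap/rev instruction.

    #include <stdint.h>

    /* Portable 32-bit byte reverse. */
    static uint32_t bswap32(uint32_t x)
    {
        return (x >> 24) | ((x >> 8) & 0x0000FF00u)
             | ((x << 8) & 0x00FF0000u) | (x << 24);
    }

    /* 64-bit byte reverse from two 32-bit swaps: exchange the halves,
       then reverse the bytes within each half. */
    static uint64_t bswap64(uint64_t x)
    {
        return ((uint64_t)bswap32((uint32_t)x) << 32)
             | bswap32((uint32_t)(x >> 32));
    }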

Wilco
 
Wilco Dijkstra

Nick Maclaren said:
|> On Wed, 20 Aug 2008 10:09:58 +0200, Morten Reistad <[email protected]>
|> wrote:
|>
|> >There are a number of algorithms I know that would have great
|> >benefit from some dedicated bit-shifting in hardware. As long as
|> >the FPGA is not too far removed from the CPU, and of acceptable
|> >speed. It usually involves scanning a few hundred to a few tens
|> >of thousand bytes and generating data.
|>
|> Not fully sure what you mean with dedicated bit shifting in hardware.
|> If you mean normal shifts and rotates most modern CPUs have that. If
|> you mean something else, please state what you want to do.

Try writing code to unpick an IEEE 754 floating-point number; once
you have done that, try doing that with an IEEE 754R decimal one :)

With the current ISAs, doing that sort of bit-munging in software
can be a hundred times as expensive as it could be done in a very
small amount of hardware.

Decoding/encoding IEEE 754 values typically takes 3-5 instructions on
ARM, so it's pretty trivial in software. 754R decimal is a lot more
complex of course, but not nearly 100 times. One can decode a 50-bit
mantissa in around 20-25 instructions on a 32-bit CPU.
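
As an illustration of that claim (a sketch, not Wilco's routine),
splitting a binary32 into its fields is a handful of shifts and masks;
the implicit leading bit is the only wrinkle:

    #include <stdint.h>
    #include <string.h>

    /* Decode an IEEE 754 binary32 into sign, biased exponent and
       mantissa. exp == 0 (zero/denormal) and exp == 255 (Inf/NaN)
       still need their own handling afterwards. */
    void decode_f32(float f, uint32_t *sign, int32_t *exp, uint32_t *mant)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);   /* well-defined type pun */
        *sign = bits >> 31;
        *exp  = (bits >> 23) & 0xFF;
        *mant = bits & 0x007FFFFF;
        if (*exp != 0 && *exp != 0xFF)
            *mant |= 1u << 23;            /* implicit leading one */
    }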
Some of the cryptographic algorithms are similar. Inverting bits
(as used in FFTs) is, too, but I don't know any algorithms where
that is a major bottleneck.

Indeed. Various architectures do implement bitreverse, but it is hardly
needed as CPUs already have the ultimate bitshuffle instruction:
the lookup table.
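
For instance (an illustrative sketch), a 256-entry byte-reverse table
plus a byte swap reverses all 32 bits:

    #include <stdint.h>

    static uint8_t rev8[256];             /* filled once at startup */

    void init_rev8(void)
    {
        for (int i = 0; i < 256; i++) {
            uint8_t r = 0;
            for (int b = 0; b < 8; b++)
                if (i & (1 << b))
                    r |= 0x80 >> b;       /* mirror bit b to bit 7-b */
            rev8[i] = r;
        }
    }

    /* Reverse 32 bits: reverse each byte via the table, swap byte order. */
    uint32_t bitrev32(uint32_t x)
    {
        return ((uint32_t)rev8[x & 0xFF] << 24)
             | ((uint32_t)rev8[(x >> 8) & 0xFF] << 16)
             | ((uint32_t)rev8[(x >> 16) & 0xFF] << 8)
             |  (uint32_t)rev8[x >> 24];
    }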

Wilco
 
Nick Maclaren

|>
|> > Try writing code to unpick an IEEE 754 floating-point number; once
|> > you have done that, try doing that with an IEEE 754R decimal one :)
|> >
|> > With the current ISAs, doing that sort of bit-munging in software
|> > can be a hundred times as expensive as it could be done in a very
|> > small amount of hardware.
|>
|> Decoding/encoding IEEE 754 values typically takes 3-5 instructions on
|> ARM, so it's pretty trivial in software. 754R decimal is a lot more
|> complex of course, but not nearly 100 times. One can decode a 50-bit
|> mantissa in around 20-25 instructions on a 32-bit CPU.

I think that you mistook what I said.

If you can decode an IEEE 754 value in 3-5 instructions, and get all
of the special cases right, then it has hardware assistance. Note
that merely breaking the number up into fields is the easy part of
the decoding. Stopping at that point isn't interesting.

And I said "a hundred times as expensive", not "100 instructions",
though it could well be 100 executed instructions. The reason that
I said it was expensive is that it will often/usually have a lot of
mispredicted branches. You are aware that there are TWO formats of
decimal, aren't you?


Regards,
Nick Maclaren.
 
Nick Maclaren

|>
|> > With the current ISAs, doing that sort of bit-munging in software
|> > can be a hundred times as expensive as it could be done in a very
|> > small amount of hardware.
|> >
|> > Some of the cryptographic algorithms are similar. Inverting bits
|> > (as used in FFTs) is, too, but I don't know any algorithms where
|> > that is a major bottleneck.
|>
|> Afaik, there are no such algorithms today, simply because you can nearly
|> always either rewrite the problem or solve it with a suitable lookup
|> table, exchanging some memory for dedicated addressing hardware.

Yes, but that's one way the factor of a hundred comes in! If you
need a couple of references to main memory, that's it in one ....
Even references to large lookup tables cause enough slowdown that
they are very marginal.

Note that, if you are timing the algorithm on its own, that often
isn't obvious. But, when it is embedded in a real application, you
have to budget for what it does to cached data in the code that is
calling it. I have been caught out that way more than once tuning
an algorithmic kernel on its own, only to discover that the slower
version ran faster in actual use!


Regards,
Nick Maclaren.
 
Wilco Dijkstra

Nick Maclaren said:
|>
|> > Try writing code to unpick an IEEE 754 floating-point number; once
|> > you have done that, try doing that with an IEEE 754R decimal one :)
|> >
|> > With the current ISAs, doing that sort of bit-munging in software
|> > can be a hundred times as expensive as it could be done in a very
|> > small amount of hardware.
|>
|> Decoding/encoding IEEE 754 values typically takes 3-5 instructions on
|> ARM, so it's pretty trivial in software. 754R decimal is a lot more
|> complex of course, but not nearly 100 times. One can decode a 50-bit
|> mantissa in around 20-25 instructions on a 32-bit CPU.

I think that you mistook what I said.

If you can decode an IEEE 754 value in 3-5 instructions, and get all
of the special cases right, then it has hardware assistance. Note
that merely breaking the number up into fields is the easy part of
the decoding. Stopping at that point isn't interesting.

We were talking about decoding bit formats, not about emulation. Zeros,
denormals, Inf or NaN don't need further work beyond decoding into sign,
exp and mantissa. The only special case is setting the leading bit, but
even that is pretty easy (only set it if exp != 0 and exp != max).

For example, my 32-bit IEEE floating point multiply emulation routine
takes 41 instructions in total. That is without any hardware assistance.

And I said "a hundred times as expensive", not "100 instructions",
though it could well be 100 executed instructions. The reason that
I said it was expensive is that it will often/usually have a lot of
mispredicted branches. You are aware that there are TWO formats of
decimal, aren't you?

Much of the decoding would be done using a few small lookup tables and
conditional moves, something modern CPUs can easily handle. The few
remaining branches are likely very predictable (e.g. if (isNaN(x)) ...).

I'm aware there are two decimal formats, but one only has to deal with
one at a time. Converting the binary format into the BCD one is indeed
very expensive, but that is not decoding.
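
To sketch the table approach for the densely packed decimal format (my
illustration, with the table contents assumed precomputed from the
IEEE 754-2008 DPD rules): each 10-bit declet encodes three digits, so a
1024-entry lookup decodes a decimal64's 50-bit continuation field declet
by declet.

    #include <stdint.h>

    /* Assumed precomputed: maps each declet to its three digits, 0..999. */
    extern const uint16_t dpd_to_bin[1024];

    /* Decode the five declets in the low 50 bits of 'cont' to a binary
       integer; the leading digit from the combination field is folded
       in by the caller. */
    uint64_t decode_declets(uint64_t cont)
    {
        uint64_t value = 0;
        for (int i = 4; i >= 0; i--) {    /* most significant declet first */
            uint32_t declet = (cont >> (10 * i)) & 0x3FF;
            value = value * 1000 + dpd_to_bin[declet];
        }
        return value;
    }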

Wilco
 
JosephKK

|> On Wed, 20 Aug 2008 10:09:58 +0200, Morten Reistad <[email protected]>
|> wrote:
|>
|> >There are a number of algorithms I know that would have great
|> >benefit from some dedicated bit-shifting in hardware. As long as
|> >the FPGA is not too far removed from the CPU, and of acceptable
|> >speed. It usually involves scanning a few hundred to a few tens
|> >of thousand bytes and generating data.
|>
|> Not fully sure what you mean with dedicated bit shifting in hardware.
|> If you mean normal shifts and rotates most modern CPUs have that. If
|> you mean something else, please state what you want to do.

Try writing code to unpick an IEEE 754 floating-point number; once
you have done that, try doing that with an IEEE 754R decimal one :)

With the current ISAs, doing that sort of bit-munging in software
can be a hundred times as expensive as it could be done in a very
small amount of hardware.

This is one of the few cases that genuinely calls for a shift to
assembler, as the HLLs get too far away from the binary representations
and hardware operations.
 
JosephKK

|>
|> > Try writing code to unpick an IEEE 754 floating-point number; once
|> > you have done that, try doing that with an IEEE 754R decimal one :)
|> >
|> > With the current ISAs, doing that sort of bit-munging in software
|> > can be a hundred times as expensive as it could be done in a very
|> > small amount of hardware.
|>
|> Decoding/encoding IEEE 754 values typically takes 3-5 instructions on
|> ARM, so it's pretty trivial in software. 754R decimal is a lot more
|> complex of course, but not nearly 100 times. One can decode a 50-bit
|> mantissa in around 20-25 instructions on a 32-bit CPU.

I think that you mistook what I said.

If you can decode an IEEE 754 value in 3-5 instructions, and get all
of the special cases right, then it has hardware assistance. Note
that merely breaking the number up into fields is the easy part of
the decoding. Stopping at that point isn't interesting.

And I said "a hundred times as expensive", not "100 instructions",
though it could well be 100 executed instructions. The reason that
I said it was expensive is that it will often/usually have a lot of
mispredicted branches. You are aware that there are TWO formats of
decimal, aren't you?


Regards,
Nick Maclaren.

I guess you never came across the XS-3 (excess-3) variant of 8421 BCD.
It doesn't work for 2421.
 
JosephKK

On 20 Aug 2008 08:25:57 GMT, [email protected] (Nick Maclaren) wrote:
[....]
Aside: does anyone know why the "Harvard" approach was promoted from
being a trivial but important variation of Von Neumann to being of
equal rank, starting about 20 years ago? Because it assuredly ain't
so, despite the nonsense in Wikipedia, and almost all programming
languages have used separate code and data "address spaces" since
the invention of COBOL and FORTRAN, and were/are always talked about
as using the Von Neumann model (as they do).
Regards,
Nick Maclaren.

Well, gosh, the idea came from Harvard, it must be right.

I believe that the idea existed long before it got the tag "Harvard"
put onto it.

It is part of the reason why the 8051 is a much better processor than
an 8088.

Though I have used both, I do not particularly like either one. I
prefer more regular instruction sets; there are plenty of them, e.g.
SPARC, 68k, Power, MIPS, etc.
 
Nick Maclaren

|>
|> >Try writing code to unpick an IEEE 754 floating-point number; once
|> >you have done that, try doing that with an IEEE 754R decimal one :)
|> >
|> >With the current ISAs, doing that sort of bit-munging in software
|> >can be a hundred times as expensive as it could be done in a very
|> >small amount of hardware.
|>
|> This is one of the few severe cases of shift to assembler, as the HLLs
|> get too far away from the binary representations and hardware
|> operations.

It doesn't help, actually, compared with a language like C. I have
done it in both assembler and C, and there's precious little difference
in either the complexity or the performance.


Regards,
Nick Maclaren.
 
Dennis

The PowerPC bit is news to me. First of all, there are no byte-reverse
instructions per se, only reversed loads/stores; secondly, there is no
ldbrx/stdbrx, so you can only reverse halfwords and words.

Besides, on _x86_ bswap is also limited to 16/32-bit quantities; bswapq
is x86_64 material.

P.S. There's a chance I'm missing something about doubleword byte reverse
on PPC64, so please do fill me in.

OK, bswapq is x86_64, but then that is probably where the extension
makes sense. Since the PowerPC is a load/store architecture (mostly) I
counted the byte-reverse load/stores.

The 8 byte reverse load/store is apparently an extension for the Cell
Broadband Engine PPE variant of the PowerPC. The following is from the
CellBE_PXCell_Handbook_v1.11_12May08_pub.pdf at

http://www-01.ibm.com/chips/techlib...0257460006FD68D?Open&S_TACT=105AGX16&S_CMP=LP

page 740. It may be easier to just search on "ibm cell alphaworks" and
look around the site.

A.2.1 New PowerPC Instructions
The PPE implements the following new instructions, relative
to version 2.02 of the PowerPC
Architecture:
• ldbrx—Load Doubleword Byte Reverse Indexed X-form
• stdbrx—Store Doubleword Byte Reverse Indexed X-form
Details follow starting on the next page.
 
Dennis said:
OK, bswapq is x86_64, but then that is probably where the extension
makes sense. Since the PowerPC is a load/store architecture (mostly) I
counted the byte reverse load/stores.

The 8 byte reverse load/store is apparently an extension for the Cell
Broadband Engine PPE variant of the PowerPC. The following is from the
CellBE_PXCell_Handbook_v1.11_12May08_pub.pdf at

I see, so I haven't missed anything while reading Power ISA 2.04. Being
PPE-only precludes me from using it; yes, it'll work on one PPC64
machine I have access to, but will bomb on the other, shame.
http://www-01.ibm.com/chips/techlib...0257460006FD68D?Open&S_TACT=105AGX16&S_CMP=LP

page 740. It may be easier to just search on "ibm cell alphaworks" and
look around the site.

A.2.1 New PowerPC Instructions
The PPE implements the following new instructions, relative
to version 2.02 of the PowerPC
Architecture:
• ldbrx—Load Doubleword Byte Reverse Indexed X-form
• stdbrx—Store Doubleword Byte Reverse Indexed X-form
Details follow starting on the next page.

Thank You.
 
Andrew Reilly

Let's see... that's Bergland & Dolan 1979, right?

No idea, sorry. Came up with it myself, but nothing like that is ever
really new. I guess you mean their article in the IEEE signal processing
algorithms book of that vintage? Not easy to find these days, it seems...

Thanks for the sqrt notion: that seems to be quite a useful approach for
reasonable (cache-sized) transforms.

Cheers,
 
Nick Maclaren

|>
|> > If you can decode an IEEE 754 value in 3-5 instructions, and get all
|> > of the special cases right, then it has hardware assistance. Note
|> > that merely breaking the number up into fields is the easy part of
|> > the decoding. Stopping at that point isn't interesting.
|>
|> Agreed, sort of:
|>
|> If you can do the same as most hw, i.e. punting at Inf/NaN/Denorm,

No, you can't - that's not according to specification!

|> then
|> the real cost is in the multi-way branch on the exponent field, with the
|> problem being the fact that Zero is quite common, so we cannot simply
|> lump it together with the other extreme exponent cases but have to
|> specialcase it: ...

Don't bet on the others being rare - it's very application-dependent.
In particular, denorms are NOT rare in many programs, and only some
architectures have a "position of first bit" opcode.
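
For illustration (not Nick's code): with a count-leading-zeros primitive,
here GCC's __builtin_clz, normalizing a denormal binary32 mantissa is a
couple of operations; without such an opcode it degenerates into a loop
or a table search.

    #include <stdint.h>

    /* Normalize a nonzero denormal mantissa (right-aligned, 23 bits)
       so its leading one lands in bit 23, adjusting the exponent. */
    static void normalize(uint32_t *mant, int32_t *exp)
    {
        int shift = __builtin_clz(*mant) - 8;   /* 31 - 23 = 8 */
        *mant <<= shift;
        *exp  -= shift;
    }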

|> > And I said "a hundred times as expensive", not "100 instructions",
|> > though it could well be 100 executed instructions. The reason that
|> > I said it was expensive is that it will often/usually have a lot of
|> > mispredicted branches. You are aware that there are TWO formats of
|> > decimal, aren't you?
|>
|> I didn't know that, but that would only be a single well-predicted
|> branch up front in each routine, right?

In general, yes. But only one of them would be less than horrible to
decode efficiently and correctly in software.

|> OK, using binary encoding makes a sw implementation quite easy! In fact,
|> it seems like a useful working format for a sw implementation.

Even that is not nice. You have a hard-to-predict branch based on
which of the two binary variants is used, plus the other tests. Also,
unless IEEE 754R was changed radically after I stopped following it,
even the binary representation supports cohorts - and decoding a
number means getting that right, too.

A decent decoder would support the densely packed decimal format, too,
which is NOT pretty in software!


Regards,
Nick Maclaren.
 