Re: Intel details future Larrabee graphics chip

Skybuck Flying · Aug 5, 2008

As the number of cores goes up the watt requirements goes up too ?

Will we need a zillion watts of power soon ?

Bye,
Skybuck.

Dirk Bruere at NeoPax · Aug 5, 2008

Skybuck said:
As the number of cores goes up the watt requirements goes up too ?

Will we need a zillion watts of power soon ?

Bye,
Skybuck.

Since the ATI Radeon™ HD 4800 series has 800 cores you work it out.

--
Dirk

http://www.transcendence.me.uk/ - Transcendence UK
http://www.theconsensus.org/ - A UK political party
http://www.onetribe.me.uk/wordpress/?cat=5 - Our podcasts on weird stuff

Chris M. Thomasson · Aug 5, 2008

John Larkin said:
Not necessarily, if the technology progresses and the clock rates are
kept reasonable. And one can always throttle down the CPUs that aren't
busy.

I saw suggestions of something like 60 cores, 240 threads in the
reasonable future.

I can see it now... A mega-core GPU chip that can dedicate 1 core per-pixel.

lol.

This has got to affect OS design.

They need to completely rethink their multi-threaded synchronization
algorihtms. I have a feeling that efficient distributed non-blocking
algorihtms, which are comfortable running under a very weak cache coherency
model will be all the rage. Getting rid of atomic RMW or StoreLoad style
memory barriers is the first step.

Dirk Bruere at NeoPax · Aug 5, 2008

Chris said:
I can see it now... A mega-core GPU chip that can dedicate 1 core
per-pixel.

Why not?
Probably configured as a systolic array
http://en.wikipedia.org/wiki/Systolic_array

--
Dirk

http://www.transcendence.me.uk/ - Transcendence UK
http://www.theconsensus.org/ - A UK political party
http://www.onetribe.me.uk/wordpress/?cat=5 - Our podcasts on weird stuff

NV55 · Aug 7, 2008

Since the ATI Radeon™ HD 4800 series has 800 cores you work it out.

Each of the 800 "cores", which are simple stream processors, in
ATI RV770
(Radeon 4800 series) are not comparable to the 16, 24, 32 or 48
cores that will be in Larrabee. Just like they're not comparable to
the 240 "cores" in Nvidia GeForce GTX 280. Though I'm not saying
you didn't realize that, just for those that might not have.

Nick Maclaren · Aug 7, 2008

|> On Tue, 5 Aug 2008 12:54:14 -0700, "Chris M. Thomasson"
|> >|> >
|> >> This has got to affect OS design.
|> >
|> >They need to completely rethink their multi-threaded synchronization
|> >algorihtms. I have a feeling that efficient distributed non-blocking
|> >algorihtms, which are comfortable running under a very weak cache coherency
|> >model will be all the rage. Getting rid of atomic RMW or StoreLoad style
|> >memory barriers is the first step.
|>
|> Run one process per CPU. Run the OS kernal, and nothing else, on one
|> CPU. Never context switch. Never swap. Never crash.

Been there - done that

That is precisely how the early SMP systems worked, and it works
for dinky little SMP systems of 4-8 cores. But the kernel becomes
the bottleneck for many workloads even on those, and it doesn't
scale to large numbers of cores. So you HAVE to multi-thread the
kernel.

SGI were (are?) the leaders, but all of HP, IBM and Sun have been
along the same path. Modern Linux is multi-threaded.

Regards,
Nick Maclaren.

Nick Maclaren · Aug 7, 2008

|>
|> >|> Run one process per CPU. Run the OS kernal, and nothing else, on one
|> >|> CPU. Never context switch. Never swap. Never crash.
|> >
|> >Been there - done that

|> >
|> >That is precisely how the early SMP systems worked, and it works
|> >for dinky little SMP systems of 4-8 cores. But the kernel becomes
|> >the bottleneck for many workloads even on those, and it doesn't
|> >scale to large numbers of cores. So you HAVE to multi-thread the
|> >kernel.
|>
|> Why? All it has to do is grant run permissions and look at the big
|> picture. It certainly wouldn't do I/O or networking or file
|> management. If memory allocation becomes a burden, it can set up four
|> (or fourteen) memory-allocation cores and let them do the crunching.
|> Why multi-thread *anything* when hundreds or thousands of CPUs are
|> available?

I don't have time to describe 40 years of experience to you, and
it is better written up in books, anyway. Microkernels of the sort
you mention were trendy a decade or two back (look up Mach), but
introduced too many bottlenecks.

In theory, the kernel doesn't have to do I/O or networking, but
have you ever used a system where they were outside it? I have.

The reason that exporting them to multiple CPUs doesn't solve the
scalability problems is that the interaction rate goes up more
than linearly with the number of CPUs. And the same problem
applies to memory management, if you are going to allow shared
memory - or even virtual shared memory, as in PGAS languages.

And so it goes. TANSTAAFL.

|> Using multicore properly will require undoing about 60 years of
|> thinking, 60 years of believing that CPUs are expensive.

Now, THAT is true.

Regards,
Nick Maclaren.

Chris M. Thomasson · Aug 7, 2008

John Larkin said:
Why? All it has to do is grant run permissions and look at the big
picture. It certainly wouldn't do I/O or networking or file
management. If memory allocation becomes a burden, it can set up four
(or fourteen) memory-allocation cores and let them do the crunching.

FWIW, I have a memory allocation algorithm which can scale because its based
on per-thread/core/node heaps:

http://groups.google.com/group/comp.arch/browse_frm/thread/24c40d42a04ee855

AFAICT, there is absolutely no need for memory-allocation cores. Each thread
can have a private heap such that local allocations do not need any
synchronization. Also, thread local deallocations of memory do not need any
sync. Local meaning that Thread A allocates memory M which is subsequently
freed by Thread A. When a threads memory pool is exhausted, it then tries to
allocate from the core local heap. If that fails, then it asks the system,
and perhaps virtual memory comes into play.

A scaleable high-level memory allocation algorithm for a super-computer
could look something like:
_____________________________________________________________
void* malloc(size_t sz) {
void* mem;

/* level 1 - thread local */
if ((! mem = Per_Thread_Try_Allocate(sz))) {

/* level 2 - core local */
if ((! mem = Per_Core_Try_Allocate(sz))) {

/* level 3 - physical chip local */
if ((! mem = Per_Chip_Try_Allocate(sz))) {

/* level 4 - node local */
if ((! mem = Per_Node_Try_Allocate(sz))) {

/* level 5 - system-wide */
if ((! mem = System_Try_Allocate(sz))) {

/* level 6 - failure */
Report_Allocation_Failure(sz);
return NULL;
}
}
}
}
}

return mem;
}
_____________________________________________________________

Level 1 does not need any atomic RMW OR membars at all.

Level 2 does not need membars, but needs atomic RMW.

Level 3 would need membars and atomic RMW.

Level 4 is same as level 3

Level 5 is worst case senerio, may need MPI...

Level 6 is total memory exhaustion! Ouch...

All local frees have same overhead while all remote frees need atomic RMW
and possibly membars.

This algorithm can scale to very large numbers of cores, chips and nodes.

Using multicore properly will require undoing about 60 years of
thinking, 60 years of believing that CPUs are expensive.

The bottleneck is the cache-coherency system. Luckily, there is years of
experience is dealing with weak cache schemes... Think RCU.

Why multi-thread *anything* when hundreds or thousands of CPUs are
available?

You don't think there is any need for communication between cores on a chip?

Chris M. Thomasson · Aug 7, 2008

Chris M. Thomasson said:
message news:[email protected]... [...]

Using multicore properly will require undoing about 60 years of
thinking, 60 years of believing that CPUs are expensive.

Click to expand...

The bottleneck is the cache-coherency system.

I meant to say:

/One/ bottleneck is the cache-coherency system.

Jan Panteltje · Aug 7, 2008

Why? All it has to do is grant run permissions and look at the big
picture. It certainly wouldn't do I/O or networking or file
management. If memory allocation becomes a burden, it can set up four
(or fourteen) memory-allocation cores and let them do the crunching.
Why multi-thread *anything* when hundreds or thousands of CPUs are
available?

Using multicore properly will require undoing about 60 years of
thinking, 60 years of believing that CPUs are expensive.

John

Ah, and this all reminds me about when 'object oriented programming' was going to
change everything.
It did lead to such language disasters as C++ (and of course MS went for it),
where the compiler writers at one time did not even know how to implement things.
Now the next big thing is 'think an object for every core' LOL.
Days of future wasted.
All the little things have to communicate and deliver data at the right time to the right place.
Sounds a bit like Intel made a bigger version of Cell.
And Cell is a beast to program (for optimum speed).
Maybe it will work for graphics, as things are sort of fixed, like to see real numbers though.
Couple of PS3s together make great rendering, there is a demo on youtube.

Nick Maclaren · Aug 7, 2008

|>
|> FWIW, I have a memory allocation algorithm which can scale because its based
|> on per-thread/core/node heaps:
|>
|> AFAICT, there is absolutely no need for memory-allocation cores. Each thread
|> can have a private heap such that local allocations do not need any
|> synchronization.

Provided that you can live with the constraints of that approach.
Most applications can, but not all.

Regards,
Nick Maclaren.

Jan Panteltje · Aug 7, 2008

Then stop thinking about optimum speed. Start thinking about a
computer system that doesn't crash, can't get viruses or trojans, is
easy to understand and use, that not even a rogue device driver can
bring down.

I already have those, they run Linux.
I give you though, that a bad behaving module can cause big problems.
Just had to reboot a couple of times to get rid of 'vloopback', wanted to interface
the Ethernet webcam with Flashplayer.
It works now: http://panteltje.com/panteltje/mcamip/#v4l_and_flash
not with the new adobe flashplayer 10 beta for Linux though....
We will almost always be one step behind I guess..

Think about how to manage a chip with 1024 CPUs. Hurry, because it
will be reality soon. We have two choices: make existing OS's
unspeakably more tangled, or start over and do something simple.

If I understood the Intel press release correctly, the API of Larabee
will be not different from a normal graphics card, that would be nice.
They create the problem, let them write the soft

Speed will be a side effect, almost by accident.

One can wonder how important speed really is for the consumer PC.
Sure, HD video, and later maybe 4096xsomething pixels will take more speed.
However for normal HD already cheap chipsets provide the power.
For HD video editing the speed can probably never be high enough...
but that is not only a graphics issue.

John, I dunno where it will go, but one thing I know:
It Will Not Become Simpler

There is a tendency to more and more complex structures in nature.
With us at the top perhaps, little one cell organisms at the bottom, molecules,
atoms, quarks, what not.
Self organising in a way, the best configurations make it - in time -
And what is time, we are but a dash in eternity.

Dirk Bruere at NeoPax · Aug 7, 2008

NV55 said:
Each of the 800 "cores", which are simple stream processors, in
ATI RV770
(Radeon 4800 series) are not comparable to the 16, 24, 32 or 48
cores that will be in Larrabee. Just like they're not comparable to
the 240 "cores" in Nvidia GeForce GTX 280. Though I'm not saying
you didn't realize that, just for those that might not have.

True, but they seem to be positioning Larrabee in the same tech segment
as video cards. Which makes sense since a SIMD system is the easiest to
program. If they want N general purpose cores doing general purpose
computing the whole thing will bog down somewhere between 16 and 32. A
lot of the R&D theory was done 30+ years ago.

Maybe they will try something radical, like an ancient data flow
architecture, but I doubt it.

--
Dirk

http://www.transcendence.me.uk/ - Transcendence UK
http://www.theconsensus.org/ - A UK political party
http://www.onetribe.me.uk/wordpress/?cat=5 - Our podcasts on weird stuff

Robert Myers · Aug 7, 2008

True, but they seem to be positioning Larrabee in the same tech segment
as video cards. Which makes sense since a SIMD system is the easiest to
program. If they want N general purpose cores doing general purpose
computing the whole thing will bog down somewhere between 16 and 32. A
lot of the R&D theory was done 30+ years ago.

Maybe they will try something radical, like an ancient data flow
architecture, but I doubt it.

"General purpose" GPU's are not really general purpose, but they
aren't doing graphics, either.

Robert.

Bernd Paysan · Aug 8, 2008

Nick said:
In theory, the kernel doesn't have to do I/O or networking, but
have you ever used a system where they were outside it? I have.

Actually, doing I/O or networking in a "main" CPU is waste of resources. Any
sane architecture (CDC 6600, mainframes) has a bunch of multi-threaded IO
processors, which you program so that the main CPU has little effort to
deal with IO.

This works well even when you do virtualization. The main CPU sends a
pointer to an IO processor program ("high-level abstraction", not the
device driver details) to the IO processor, which in turn runs the device
driver to get the data in or out. In a VM, the VM monitor has to
sanity-check the command, maybe rewrites it ("don't write to track 3 of
disk 5, write it to the 16 sectors starting at sector 8819834 in disk 1,
which is where the virtual volume of this VM sits").

The fact that in PCs the main CPU is doing IO (even down to the level of
writing to individual IO ports) is a consequence of saving CPUs - no money
for an IO processor, the 8088 can do that itself just fine. Why we'll soon
have 32 x86 cores, but still no IO processor is beyond what I can
understand.

Basically all IO in a modern PC is sending fixed- or variable-sized packets
over some sort of network - via SATA/SCSI, via USB, Firewire, or Ethernet,
etc.

Jan Panteltje · Aug 8, 2008

Actually, doing I/O or networking in a "main" CPU is waste of resources. Any
sane architecture (CDC 6600, mainframes) has a bunch of multi-threaded IO
processors, which you program so that the main CPU has little effort to
deal with IO.

This works well even when you do virtualization. The main CPU sends a
pointer to an IO processor program ("high-level abstraction", not the
device driver details) to the IO processor, which in turn runs the device
driver to get the data in or out. In a VM, the VM monitor has to
sanity-check the command, maybe rewrites it ("don't write to track 3 of
disk 5, write it to the 16 sectors starting at sector 8819834 in disk 1,
which is where the virtual volume of this VM sits").

The fact that in PCs the main CPU is doing IO (even down to the level of
writing to individual IO ports) is a consequence of saving CPUs - no money
for an IO processor, the 8088 can do that itself just fine. Why we'll soon
have 32 x86 cores, but still no IO processor is beyond what I can
understand.

Basically all IO in a modern PC is sending fixed- or variable-sized packets
over some sort of network - via SATA/SCSI, via USB, Firewire, or Ethernet,
etc.

Do not forget, since the days of 8088, and maybe CPUs running at about 13 MHz,
we now run at 3.4 GHz, 3400 / 13 = 261 x faster.
Also even faster because of better architectures.
This leaves plenty of time for a CPU to do normal IO.
And in fact the IO has been hardware supported always.
For example, although you can poll a serial port bit by bit, there is a hardware shift register,
hardware FIFO too.
Although you can construct sectors for a floppy in software bit by bit, there is a floppy controller
with write pre-compensation etc.. all in hardware.
Although you could do graphics there is a graphics card with hardware acceleration.
the first 2 are included in the chip set, maybe the graphics too.
The same thing for Ethernet, it is a dedicated chip, or included in the chip set,
taking the place of your 'IO processor'.
Same thing for hard disks, and those may even have on board encryption, all you
have to do is specify a sector number and send the sector data.

So.. no real need for a separate IO processor, in fact you likely find a processor
in all that dedicated hardware, or maybe a FPGA.

Jan Panteltje · Aug 8, 2008

That's the IBM "channel controller" concept: add complexm specialized
dma-based i/o controllers to take the load off the CPU. But if you
have hundreds of CPU's, the strategy changes.

John

Ultimately you will have to move bytes, from one CPU to the other,
or from dedicated IO to one CPU, and things have to happen at the right moment.
Results will never be available before requests......
It is a bit like Usenet: (smile), there are many 'processors' (readers. posters,
lurkers) here, some output some data at some time in response to some event,
could be a question, others read it, later, much later perhaps, see the problem?
Watched the Olympic opening, I must say the Chinese make a beautiful event.
Never got boring, the previous one was ugly and not worth looking at, but
anyways, so many LEDs? And some projection!
Seems they are ahead in many a field.
Would you not be scare to death if you were a little girl hanging 25 meters
above the floor from some steel cables.....
Chinese are brave too

Jan Panteltje · Aug 8, 2008

I think the trend is to have the cores surround a common shared cache;
a little local memory (and cache, if the local memory is slower for
some reason) per CPU wouldn't hurt.

Cache coherency is simple if you don't insist on flat-out maximum
performance. What we should insist on is flat-out unbreakable systems,
and buy better silicon to get the performance back if we need it.

I'm reading Showstopper!, the story of the development of NT. It's a
great example of why we need a different way of thinking about OS's.

Silicon is going to make that happen, finally free us of the tyranny
of CPU-as-precious-resource. A lot of programmers aren't going to like
this.

John

John Lennon:

'You know I am a dreamer'
.....
' And I hope you join us someday'

(well what I remember of it).
You should REALLY try to program a Cell processor some day.

Dunno what you have against programmers, there are programmaers who
are amazingly clever with hardware resources.
I dunno about NT and MS, but IIRC MS plucked programmers from
unis, and sort of brainwashed them then.. the result we all know.

UltimatePatriot · Aug 8, 2008

John Lennon:

'You know I am a dreamer'
....
' And I hope you join us someday'

(well what I remember of it).
You should REALLY try to program a Cell processor some day.

Dunno what you have against programmers, there are programmaers who
are amazingly clever with hardware resources.
I dunno about NT and MS, but IIRC MS plucked programmers from
unis, and sort of brainwashed them then.. the result we all know.

The Cell BE IS the current future.

VERY powerful. Ten times that of a PC in MANY areas. It will improve
too.

Dirk Bruere at NeoPax · Aug 8, 2008

John said:
John said:

On Thu, 7 Aug 2008 07:44:19 -0700, "Chris M. Thomasson"

message [...]
Using multicore properly will require undoing about 60 years of
thinking, 60 years of believing that CPUs are expensive.
The bottleneck is the cache-coherency system.
I meant to say:

/One/ bottleneck is the cache-coherency system.
I think the trend is to have the cores surround a common shared cache;
a little local memory (and cache, if the local memory is slower for
some reason) per CPU wouldn't hurt.

Click to expand...

For small N this can be made work very nicely.

Cache coherency is simple if you don't insist on flat-out maximum
performance. What we should insist on is flat-out unbreakable systems,
and buy better silicon to get the performance back if we need it.

Click to expand...

Existing cache hardware on Pentiums still isn't quite good enough. Try
probing its memory with large power of two strides and you fall over a
performance limitation caused by the cheap and cheerful way it uses
lower address bits for cache associativity. See Steven Johnsons post in
the FFT Timing thread.

I'm reading Showstopper!, the story of the development of NT. It's a
great example of why we need a different way of thinking about OS's.

Click to expand...

If it is anything like the development of OS/2 you get to see very
bright guys reinvent things from scratch that were already known in the
mini and mainframe world (sometimes with the same bugs and quirks as the
first iteration of big iron code suffered from).

Click to expand...

Yes. Everybody thought they could write from scratch a better
(whatever) than the other groups had already developed, and in a few
weeks yet. There were "two inch pipes full of piss flowing in both
directions" between graphics groups.

Code reuse is not popular among people who live to write code.

NT 3.51 was a particularly good vintage. After that bloatware set in.
CPU cycles are cheap and getting cheaper and human cycles are expensive
and getting more expensive. But that also says that we should also be
using better tools and languages to manage the hardware.

Unfortunately time to market advantage tends to produce less than robust
applications with pretty interfaces and fragile internals. You can after
all send out code patches over the Internet all too easily ;-)

Click to expand...

NT followed the classic methodology: code fast, build the OS,
test/test/test looking for bugs. I think there were 2000 known bugs in
the first developer's release. There must have been ballpark 100K bugs
created and fixed during development.

Since people buy the stuff (I would not wish Vista on my worst enemy by
the way) even with all its faults the market rules, and market forces
are never wrong...

Most of what you are claiming as advantages of separate CPUs can be
achieved just as easily with hardware support for protected user memory
and security privilige rings. It is more likely that virtualisation of
single, dual or quad cores will become common in domestic PCs.

Click to expand...

Intel was criminally negligent in not providing better hardware
protections, and Microsoft a co-criminal in not using what little was
available. Microsoft has never seen data that it didn't want to
execute. I ran PDP-11 timeshare systems that couldn't be crashed by
hostile users, and ran for months between power failures.

There was a Pentium exploit documented against some brands of Unix. eg.
http://www.ssi.gouv.fr/fr/sciences/fichiers/lti/cansecwest2006-duflot.pdf

Loads of physical CPUs just creates a different set of complexity
problems. And they are a pig to program efficiently.

Click to expand...

So program them inefficiently. Stop thinking about CPU cycles as
precious resources, and start think that users matter more. I have
personally spent far more time recovering from Windows crashes and
stupidities than I've spent waiting for compute-bound stuff to run.

If the OS runs alone on one CPU, totally hardware protected from all
other processes, totally in control, that's not complex.

As transistors get smaller and cheaper, and cores multiply into the
hundreds, the limiting resource will become power dissipation. So if
every process gets its own CPU, and idle CPUs power down, and there's
no context switching overhead, the multi-CPU system is net better off.

What else are we gonna do with 1024 cores? We'll probably see it on
Linux first.

I was doing/learning all this stuff 30 years ago.
We even developed a loosely couple multi uP system where each module had
a comms processor, and apps processor and an OS processor. Back then all
these problems had already been analysed to death, and solutions found
(if they existed). The future of Intel/MS R&D ought to be reading IEEE
papers from the 60s/70s

--
Dirk

http://www.transcendence.me.uk/ - Transcendence UK
http://www.theconsensus.org/ - A UK political party
http://www.onetribe.me.uk/wordpress/?cat=5 - Our podcasts on weird stuff

Moore's Lobby Podcast

Menu

Categories

Platforms

Content

Connect With Us

Network

Re: Intel details future Larrabee graphics chip

Re: Intel details future Larrabee graphics chip

Skybuck Flying

Dirk Bruere at NeoPax

Chris M. Thomasson

Dirk Bruere at NeoPax

NV55

Nick Maclaren

Nick Maclaren

Chris M. Thomasson

Chris M. Thomasson

Jan Panteltje

Nick Maclaren

Jan Panteltje

Dirk Bruere at NeoPax

Robert Myers

Bernd Paysan

Jan Panteltje

Jan Panteltje

Jan Panteltje

UltimatePatriot

Dirk Bruere at NeoPax

Similar threads