Maker Pro

Re: Intel details future Larrabee graphics chip

Dirk Bruere at NeoPax

John said:
Umm, excuse me, what do those words mean, "properly designed secure
operating system" ?

That's what my wife asked me once when I was stupid enough to use the
phrase "too much garlic."



There's nothing magical about lots of cores. Everybody is doing it.




As James says, don't assume malice when incompetence will do.



What may well happen is that, once hundred-core CPUs are out in the
wild, some small group of Linux kernel jocks will spin a version that
*can* have file systems, drivers, stacks, and apps assignable to
various CPUs. Then it would just be a configuration thing to assign
one CPU to run just the OS. That would be dynamite for server apps.
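Most of the plumbing for that exists already. A minimal sketch of pinning a
process to a single CPU with the Linux affinity call (the CPU number here is
arbitrary, and error handling is kept to the bare minimum):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Pin the calling process to CPU 0, leaving the other cores free for
   whatever the kernel or the admin assigns to them. */
int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(getpid(), sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to CPU 0\n");
    /* ... real work here ... */
    return 0;
}

Reserving a core exclusively for the OS (or for one process) is then mostly a
boot-time configuration matter, e.g. the isolcpus kernel parameter, rather
than new kernel code.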

Then Microsoft will scramble to catch up, as usual.

The big bottleneck has always been inter-process and/or inter-processor
communications. That has to be solved at a hardware level. Tightly
coupling the cores is only good for a max of around 16-32 cores. Unless
it's SIMD, which is relatively easy.

--
Dirk

http://www.transcendence.me.uk/ - Transcendence UK
http://www.theconsensus.org/ - A UK political party
http://www.onetribe.me.uk/wordpress/?cat=5 - Our podcasts on weird stuff
 
Chris M. Thomasson

John Larkin said:
John said:
On Thu, 7 Aug 2008 07:44:19 -0700, "Chris M. Thomasson"

message [...]
Using multicore properly will require undoing about 60 years of
thinking, 60 years of believing that CPUs are expensive.
The bottleneck is the cache-coherency system.
I meant to say:

/One/ bottleneck is the cache-coherency system.

I think the trend is to have the cores surround a common shared cache;
a little local memory (and cache, if the local memory is slower for
some reason) per CPU wouldn't hurt.

For small N this can be made to work very nicely.
Cache coherency is simple if you don't insist on flat-out maximum
performance. What we should insist on is flat-out unbreakable systems,
and buy better silicon to get the performance back if we need it.

Existing cache hardware on Pentiums still isn't quite good enough. Try
probing its memory with large power of two strides and you fall over a
performance limitation caused by the cheap and cheerful way it uses
lower address bits for cache associativity. See Steven Johnson's post in
the FFT Timing thread.
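For anyone who wants to see the effect, a rough sketch of that kind of probe
(the stride values are arbitrary; it only times the walk and makes no claims
about any particular CPU):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Walk a large buffer with a given stride and report the elapsed time.
   On caches that index purely on the low address bits, a large
   power-of-two stride maps every access to the same few cache sets and
   runs far slower than a nearby non-power-of-two stride. */
static double walk(volatile unsigned char *buf, size_t len, size_t stride)
{
    clock_t t0 = clock();
    for (int pass = 0; pass < 64; pass++)
        for (size_t i = 0; i < len; i += stride)
            buf[i]++;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    size_t len = 64u * 1024 * 1024;              /* 64 MB working set */
    unsigned char *buf = malloc(len);
    if (!buf)
        return 1;
    size_t strides[] = { 4096, 4160 };           /* power of two vs. not */
    for (int i = 0; i < 2; i++)
        printf("stride %zu: %.3f s\n", strides[i], walk(buf, len, strides[i]));
    free(buf);
    return 0;
}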
I'm reading Showstopper!, the story of the development of NT. It's a
great example of why we need a different way of thinking about OS's.

If it is anything like the development of OS/2 you get to see very
bright guys reinvent things from scratch that were already known in the
mini and mainframe world (sometimes with the same bugs and quirks as the
first iteration of big iron code suffered from).

Yes. Everybody thought they could write from scratch a better
(whatever) than the other groups had already developed, and in a few
weeks yet. There were "two inch pipes full of piss flowing in both
directions" between graphics groups.

Code reuse is not popular among people who live to write code.
NT 3.51 was a particularly good vintage. After that bloatware set in.

CPU cycles are cheap and getting cheaper and human cycles are expensive
and getting more expensive. But that also says that we should also be
using better tools and languages to manage the hardware.

Unfortunately time to market advantage tends to produce less than robust
applications with pretty interfaces and fragile internals. You can after
all send out code patches over the Internet all too easily ;-)


NT followed the classic methodology: code fast, build the OS,
test/test/test looking for bugs. I think there were 2000 known bugs in
the first developer's release. There must have been ballpark 100K bugs
created and fixed during development.

Since people buy the stuff (I would not wish Vista on my worst enemy by
the way) even with all its faults the market rules, and market forces
are never wrong...

Most of what you are claiming as advantages of separate CPUs can be
achieved just as easily with hardware support for protected user memory
and security privilige rings. It is more likely that virtualisation of
single, dual or quad cores will become common in domestic PCs.


Intel was criminally negligent in not providing better hardware
protections, and Microsoft a co-criminal in not using what little was
available. Microsoft has never seen data that it didn't want to
execute. I ran PDP-11 timeshare systems that couldn't be crashed by
hostile users, and ran for months between power failures.

There was a Pentium exploit documented against some brands of Unix. eg.
http://www.ssi.gouv.fr/fr/sciences/fichiers/lti/cansecwest2006-duflot.pdf

Loads of physical CPUs just creates a different set of complexity
problems. And they are a pig to program efficiently.

So program them inefficiently. Stop thinking about CPU cycles as
precious resources, and start thinking that users matter more. I have
personally spent far more time recovering from Windows crashes and
stupidities than I've spent waiting for compute-bound stuff to run.

If the OS runs alone on one CPU, totally hardware protected from all
other processes, totally in control, that's not complex.

As transistors get smaller and cheaper, and cores multiply into the
hundreds, the limiting resource will become power dissipation. So if
every process gets its own CPU, and idle CPUs power down, and there's
no context switching overhead, the multi-CPU system is net better off.

What else are we gonna do with 1024 cores? We'll probably see it on
Linux first.

One point:

RCU can scale to thousands of cores; Linux uses that algorithm in its kernel
today.
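For readers who haven't met RCU, a minimal sketch of the reader/updater split
using the Linux kernel primitives (the struct and its field are invented for
illustration; this shows the shape of the code rather than a buildable module):

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/errno.h>

/* Hypothetical shared configuration object, replaced wholesale by writers.
   Assumed to have been published before readers run. */
struct cfg {
    int threshold;
};

static struct cfg __rcu *global_cfg;

/* Reader side: no locks, no atomic read-modify-write, scales across cores. */
int read_threshold(void)
{
    int v;

    rcu_read_lock();
    v = rcu_dereference(global_cfg)->threshold;
    rcu_read_unlock();
    return v;
}

/* Update side: publish a new copy, wait for pre-existing readers to finish,
   then free the old copy.  Concurrent updaters must be serialized elsewhere. */
int set_threshold(int t)
{
    struct cfg *newc, *oldc;

    newc = kmalloc(sizeof(*newc), GFP_KERNEL);
    if (!newc)
        return -ENOMEM;
    newc->threshold = t;

    oldc = rcu_dereference_protected(global_cfg, 1);
    rcu_assign_pointer(global_cfg, newc);
    synchronize_rcu();              /* no reader can still see oldc after this */
    kfree(oldc);
    return 0;
}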
 
JosephKK

Not necessarily, if the technology progresses and the clock rates are
kept reasonable. And one can always throttle down the CPUs that aren't
busy.


I saw suggestions of something like 60 cores, 240 threads in the
reasonable future.

This has got to affect OS design.

John

This won't bother *nix-class OS's. They have been scaled past 10
thousand cores already. Other OS's are on their own.
 
JosephKK

I can see it now... A mega-core GPU chip that can dedicate 1 core per-pixel.

lol.

At that point you should integrate them directly into the display.
Then you could get to giga-core systems.
They need to completely rethink their multi-threaded synchronization
algorithms. I have a feeling that efficient distributed non-blocking
algorithms, which are comfortable running under a very weak cache coherency
model, will be all the rage. Getting rid of atomic RMW or StoreLoad style
memory barriers is the first step.
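As a concrete example of the direction, a single-producer/single-consumer ring
buffer in C11 atomics needs nothing stronger than acquire/release: no atomic
read-modify-write and no StoreLoad (seq_cst) fence on the fast path (a sketch;
the names and fixed size are invented):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 1024u   /* must be a power of two */

struct spsc_ring {
    void *slot[RING_SIZE];
    _Atomic size_t head;  /* written only by the producer */
    _Atomic size_t tail;  /* written only by the consumer */
};

/* Producer: write the slot, then publish it with a release store. */
bool spsc_push(struct spsc_ring *r, void *item)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return false;                               /* full */
    r->slot[head % RING_SIZE] = item;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer: acquire the producer's index, read the slot, then release it. */
bool spsc_pop(struct spsc_ring *r, void **item)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)
        return false;                               /* empty */
    *item = r->slot[tail % RING_SIZE];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}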

That reminds me of an article / paper I once read about Cache Only
Memory Architecture (COMA). Only they never did seem to be able to get it
to work, though.
 
JosephKK

Run one process per CPU. Run the OS kernel, and nothing else, on one
CPU. Never context switch. Never swap. Never crash.

John

OK. How do you deal with I/O devices, user input and hot swap?
 
JosephKK

That's the IBM "channel controller" concept: add complex, specialized
DMA-based I/O controllers to take the load off the CPU. But if you
have hundreds of CPU's, the strategy changes.

John

Why would it? The design could also use hundreds or thousands of
dedicated I/O controllers. If you want to talk about real
bottlenecks look at memory and data bus limitations.
 
Robert Myers

Why would it?  The design could also use hundreds or thousands of
dedicated I/O controllers.   If you want to talk about real
bottlenecks look at memory and data bus limitations.

mmhmm.

Bandwidth per flop is headed toward zero.

Robert.
 
JosephKK

John said:
message [...]
Using multicore properly will require undoing about 60 years of
thinking, 60 years of believing that CPUs are expensive.
The bottleneck is the cache-coherency system.
I meant to say:

/One/ bottleneck is the cache-coherency system.

I think the trend is to have the cores surround a common shared cache;
a little local memory (and cache, if the local memory is slower for
some reason) per CPU wouldn't hurt.

For small N this can be made to work very nicely.
Cache coherency is simple if you don't insist on flat-out maximum
performance. What we should insist on is flat-out unbreakable systems,
and buy better silicon to get the performance back if we need it.

Existing cache hardware on Pentiums still isn't quite good enough. Try
probing its memory with large power of two strides and you fall over a
performance limitation caused by the cheap and cheerful way it uses
lower address bits for cache associativity. See Steven Johnson's post in
the FFT Timing thread.
I'm reading Showstopper!, the story of the development of NT. It's a
great example of why we need a different way of thinking about OS's.

If it is anything like the development of OS/2 you get to see very
bright guys reinvent things from scratch that were already known in the
mini and mainframe world (sometimes with the same bugs and quirks as the
first iteration of big iron code suffered from).

NT 3.51 was a particularly good vintage. After that bloatware set in.
Silicon is going to make that happen, finally free us of the tyranny
of CPU-as-precious-resource. A lot of programmers aren't going to like
this.

CPU cycles are cheap and getting cheaper and human cycles are expensive
and getting more expensive. But that also says that we should also be
using better tools and languages to manage the hardware.

Unfortunately time to market advantage tends to produce less than robust
applications with pretty interfaces and fragile internals. You can after
all send out code patches over the Internet all too easily ;-)

Yeah, to people with broadband. Back when XP SP2 came out I was still
on dial-up; MS sent me a CD for free. Consider costs like that before
spouting.
Since people buy the stuff (I would not wish Vista on my worst enemy by
the way) even with all its faults the market rules, and market forces
are never wrong...

Most of what you are claiming as advantages of separate CPUs can be
achieved just as easily with hardware support for protected user memory
and security privilege rings. It is more likely that virtualisation of
single, dual or quad cores will become common in domestic PCs.

Why virtualize them? I can have them physically. Of course M$ PC-style
software still cannot use them efficiently. Nor can they use 64-bit
effectively, and they usually make poor use of SSE, SSE2, etc.
There was a Pentium exploit documented against some brands of Unix. eg.
http://www.ssi.gouv.fr/fr/sciences/fichiers/lti/cansecwest2006-duflot.pdf

Loads of physical CPUs just creates a different set of complexity
problems. And they are a pig to program efficiently.

Mostly due to MS-DOS and follow-on-style groupthink. We have a
generation of programmers that never learned partitioning properly.
 
JosephKK

John said:
On Thu, 7 Aug 2008 07:44:19 -0700, "Chris M. Thomasson"

message [...]
Using multicore properly will require undoing about 60 years of
thinking, 60 years of believing that CPUs are expensive.
The bottleneck is the cache-coherency system.
I meant to say:

/One/ bottleneck is the cache-coherency system.

I think the trend is to have the cores surround a common shared cache;
a little local memory (and cache, if the local memory is slower for
some reason) per CPU wouldn't hurt.

For small N this can be made to work very nicely.
Cache coherency is simple if you don't insist on flat-out maximum
performance. What we should insist on is flat-out unbreakable systems,
and buy better silicon to get the performance back if we need it.

Existing cache hardware on Pentiums still isn't quite good enough. Try
probing its memory with large power of two strides and you fall over a
performance limitation caused by the cheap and cheerful way it uses
lower address bits for cache associativity. See Steven Johnson's post in
the FFT Timing thread.
I'm reading Showstopper!, the story of the development of NT. It's a
great example of why we need a different way of thinking about OS's.

If it is anything like the development of OS/2 you get to see very
bright guys reinvent things from scratch that were already known in the
mini and mainframe world (sometimes with the same bugs and quirks as the
first iteration of big iron code suffered from).

Yes. Everybody thought they could write from scratch a better
(whatever) than the other groups had already developed, and in a few
weeks yet. There were "two inch pipes full of piss flowing in both
directions" between graphics groups.

Code reuse is not popular among people who live to write code.
NT 3.51 was a particularly good vintage. After that bloatware set in.

CPU cycles are cheap and getting cheaper and human cycles are expensive
and getting more expensive. But that also says that we should also be
using better tools and languages to manage the hardware.

Unfortunately time to market advantage tends to produce less than robust
applications with pretty interfaces and fragile internals. You can after
all send out code patches over the Internet all too easily ;-)


NT followed the classic methodology: code fast, build the OS,
test/test/test looking for bugs. I think there were 2000 known bugs in
the first developer's release. There must have been ballpark 100K bugs
created and fixed during development.

Since people buy the stuff (I would not wish Vista on my worst enemy by
the way) even with all its faults the market rules, and market forces
are never wrong...

Most of what you are claiming as advantages of separate CPUs can be
achieved just as easily with hardware support for protected user memory
and security privilege rings. It is more likely that virtualisation of
single, dual or quad cores will become common in domestic PCs.


Intel was criminally negligent in not providing better hardware
protections, and Microsoft a co-criminal in not using what little was
available. Microsoft has never seen data that it didn't want to
execute. I ran PDP-11 timeshare systems that couldn't be crashed by
hostile users, and ran for months between power failures.

There was a Pentium exploit documented against some brands of Unix. eg.
http://www.ssi.gouv.fr/fr/sciences/fichiers/lti/cansecwest2006-duflot.pdf

Loads of physical CPUs just creates a different set of complexity
problems. And they are a pig to program efficiently.

So program them inefficiently. Stop thinking about CPU cycles as
precious resources, and start thinking that users matter more. I have
personally spent far more time recovering from Windows crashes and
stupidities than I've spent waiting for compute-bound stuff to run.

I have run compute-bound stuff on a PC that took hours (about 5 1/2 hours to
run) and I wrote it myself. It was clean and efficient, just compute
bound. I tried it on a recent machine and it took about 10 minutes. Yet
the general performance of the typical PC application on the typical
PC seems to have shown no improvement for the past 10 years.
What do you think is the cause?
If the OS runs alone on one CPU, totally hardware protected from all
other processes, totally in control, that's not complex.

As transistors get smaller and cheaper, and cores multiply into the
hundreds, the limiting resource will become power dissipation. So if
every process gets its own CPU, and idle CPUs power down, and there's
no context switching overhead, the multi-CPU system is net better off.

What else are we gonna do with 1024 cores? We'll probably see it on
Linux first.

We have already seen it on Linux, in the form of parallel
supercomputers. With more cores as well.
 
JosephKK

Ah, and this all reminds me of when 'object oriented programming' was going to
change everything.
It did lead to such language disasters as C++ (and of course MS went for it),
where the compiler writers at one time did not even know how to implement things.
Now the next big thing is 'think an object for every core' LOL.
Days of future wasted.
All the little things have to communicate and deliver data at the right time to the right place.
Sounds a bit like Intel made a bigger version of Cell.
And Cell is a beast to program (for optimum speed).

Part of what many others are saying is that you no longer need optimum
performance, just good performance. Good enough is the mortal enemy
of the best. This seems to be true in all areas of endeavor.
Maybe it will work for graphics, as things are sort of fixed, like to see real numbers though.
Couple of PS3s together make great rendering, there is a demo on youtube.

There have been many "silver bullet" fixes since the 1960's.
Structured Programming, Literate Programming, several programming
languages, Rapid Prototyping, CASE, OOA / OOD, Provable Software (in
the mathematical sense), and numerous others.

Have any of them worked? No (except in a few restricted cases).
 
Dirk Bruere at NeoPax

JosephKK said:
Part of what many others are saying is that you no longer need optimum
performance, just good performance. Good enough is the mortal enemy
of the best. This seems to be true in all areas of endeavor.


There have been many "silver bullet" fixes since the 1960's.
Structured Programming, Literate Programming, several programming
languages, Rapid Prototyping, CASE, OOA / OOD, Provable Software (in
the mathematical sense), and numerous others.

Have any of them worked? No (except in a few restricted cases).

Does anyone here actually use a s/w methodology?
For the most part I do top down and bottom up. Basic outline first, then
write the peripheral drivers and low-level routines that I know I'm
going to need. It usually all meets up in the middle without a problem.

--
Dirk

http://www.transcendence.me.uk/ - Transcendence UK
http://www.theconsensus.org/ - A UK political party
http://www.onetribe.me.uk/wordpress/?cat=5 - Our podcasts on weird stuff
 
Jan Panteltje

"First, some words about the meaning of "kernel". Operating Systems
can be written so that most services are moved outside the OS core and
implemented as processes.This OS core then becomes a lot smaller, and
we call it a kernel. When this kernel only provides the basic
services, such as basic memory management ant multithreading, it is
called a microkernel or even nanokernel for the super-small ones. To
stress the difference between the

Unix-type of OS, the Unix-like core is called a monolithic kernel. A
monolithic kernel provides full process management, device
drivers,file systems, network access etc. I will here use the word
kernel in the broad sense, meaning the part of the OS supervising the
machine."


Just to rain a bit on your parade, in the *Linux* kernel,
many years ago, the concept of 'modules' was introduced.
Now device drivers are 'modules', and are, although closely connected and in the same
source package, _not_ a real part of the kernel.
(I am no Linux kernel expert, but it is absolutely possible to write a device
driver as a module, and then, while the system is running, load that module,
and unload it again.)
I sort of have the feeling that your knowledge of Linux, and the Linux kernel, is very academic, John,
and you should really compile a kernel and play with Linux a bit to get
the feel of it.

Most popular os's (Win, Linux, Unix) are big-kernel designs, to reduce
inter-process overhead. That makes them complex, buggy, and
paradoxically slow.

Unix has been around for decades and has been refined more and more; Linux and BSD are incarnations of it.

There was some old saying that went like this (correct me, hopefully somebody knows it more precisely):
"Those who criticise Unix are bound to re-invent it."
 
Jan Panteltje

Er, the discussion that John quoted above referred not to what is
compiled with the kernel but to what executes in the same protection
domain that the kernel does (as it is my impression Linux modules do).
Perhaps John is not the one who needs to develop a deeper understanding
here.

He mentioned 'monolithic', and with modules, the Linux kernel is _not_ monolithic.
You can load a device driver as a module (after you configure it to be a module
before compilation; the kernel config often gives you a choice), and
then that module will even be dynamically loaded, including other modules it depends on,
and unloaded again if that device is no longer used.
This keeps memory usage low, and prevents you from needing to reboot if you add a new driver.
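For the record, a minimal loadable module looks like this (a sketch, not a
driver; the name and messages are made up):

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Minimal example of a loadable module");

/* Runs when the module is loaded with insmod/modprobe. */
static int __init hello_init(void)
{
    printk(KERN_INFO "hello: loaded\n");
    return 0;
}

/* Runs when the module is removed with rmmod. */
static void __exit hello_exit(void)
{
    printk(KERN_INFO "hello: unloaded\n");
}

module_init(hello_init);
module_exit(hello_exit);

Build it against the running kernel's headers, insmod it, and rmmod it again,
all without a reboot, which is exactly the point being made above.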

As to 'protection domain', be aware that even if you were to run device drivers on a different core (one for each device???)
then you will still have to move the data from one core to the other for processing, and
how protected do you think that data is? It is all the illusion that 'more cores will solve everything'.
I wonder how many here actually use Linux, have compiled a kernel, written modules and applications,
and can even write in C.
I'd rather have a discussion with them than the generalised bloating about systems they never even
had hands-on experience with.
Otherwise sci.electronics.design becomes like sci.physics, a bunch of idiots with even
more idiotic theories causing so much noise that the real stuff is obscured, and your chance to learn something
is zero.
This is my personal rant: I am a Linux user, have written many applications for it, and did some work on
drivers too.
Academic bullshit I know about too: in my first year of Information Technology I found an error in the
textbook and reported it; professors do not always like to be corrected, I learned that.
There was a project that you could join, about in-depth study of operating systems, and, since I actually
wrote one, I applied for the project and was promptly rejected.
Where did those guys go? Microsoft??????
I will listen to John Larkin's theory about how safe multicore systems are after he writes a demo, or even
shows someone else's that cannot be corrupted.
Utopia does not exist.

<EOR (= End Of Rant)>
 
Chris M. Thomasson

Nick Maclaren said:
|>
|> FWIW, I have a memory allocation algorithm which can scale because it's
|> based on per-thread/core/node heaps:
|>
|> AFAICT, there is absolutely no need for memory-allocation cores. Each
|> thread can have a private heap such that local allocations do not need
|> any synchronization.

Provided that you can live with the constraints of that approach.
Most applications can, but not all.

That's a great point! It just seems that the approach could possibly be
beneficial to all sorts of applications. Could you help me out here and give
some examples of a couple of applications that simply could not tolerate the
approach at any level? When I say any level I mean allocations starting at the
lowest common denominator from its origin... This being trying the thread-local
heap, then the core-local heap, and so on and so forth...
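A very rough sketch of the per-thread-heap idea (block and arena sizes are
invented; malloc/free stands in for the shared core- or node-level heap):

#include <stdlib.h>
#include <threads.h>   /* C11: thread_local */

/* Fixed-size blocks carved from a per-thread arena.  The common case
   allocates and frees with no locks and no atomic operations at all;
   anything the arena cannot satisfy falls back to the shared allocator. */
#define BLOCK_SIZE   64
#define ARENA_BLOCKS 1024

struct block { struct block *next; };

static thread_local unsigned char arena[ARENA_BLOCKS][BLOCK_SIZE];
static thread_local struct block *free_list;
static thread_local size_t used;

void *local_alloc(size_t n)
{
    if (n > BLOCK_SIZE)
        return malloc(n);                  /* too big: shared heap */
    if (free_list) {                       /* reuse a locally freed block */
        struct block *b = free_list;
        free_list = b->next;
        return b;
    }
    if (used < ARENA_BLOCKS)               /* carve a fresh block */
        return arena[used++];
    return malloc(n);                      /* arena exhausted */
}

void local_free(void *p, size_t n)
{
    unsigned char *c = p;
    unsigned char *base = (unsigned char *)arena;

    if (n <= BLOCK_SIZE && c >= base && c < base + sizeof arena) {
        struct block *b = p;               /* back onto the local free list */
        b->next = free_list;
        free_list = b;
    } else {
        free(p);                           /* came from the shared heap */
    }
}

Freeing a block on a different thread than the one that allocated it is
exactly the case this sketch cannot handle, which is the kind of constraint
Nick is pointing at.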

I see problems. Well, with mega-core systems, the per-core memory is going
to be limited indeed! It's analogous to programming a Cell with its dedicated
per-SPE memory; something like 256 KB. When the local allocation to an SPE is
exhausted, well, DMA to the global memory is going to need to be utilized. I
know this works because I have played around with algorithms using the IBM
Cell Simulator.

http://groups.google.com/group/comp.arch/browse_frm/thread/4c97441d6704d8a1

http://groups.google.com/group/comp.arch/msg/4133f6eb8a6b5a74

programming the Cell is VERY FUN!!!!
 
AnimalMagic

Utopia does not exist.


Thanks to dopey, closed mindsets like yours.

After 3 years, folks are still trying to circumvent Sony's
hypervisor control over the graphics port on the PS3, so those of us who
run Linux on it cannot get accelerated graphics or GL performance on it.

Apparently for them, Utopia's castle walls are still standing.
 
Jan Panteltje

Thanks to dopey, closed mindsets like yours.

Hey nutcase, YOU failed to hack it !

After 3 years, folks are still trying to circumvent Sony's
hypervisor control over the graphics port on the PS3, so those of us who
run Linux on it cannot get accelerated graphics or GL performance on it.

Apparently for them, Utopia's castle walls are still standing.

Learn to wipe your own arse.
 
Jan Panteltje

On a sunny day (Sun, 10 Aug 2008 15:02:40 GMT) it happened Jan Panteltje

And for the others: Sony was to have two HDMI ports on the PS3, which
should have made for interesting experiments.

But the real PS3 only had one, so I decided to skip the Sony product
(most Sony products I have bought in the past were really bad, actually).
And Linux you can run on anything (and runs on anything); for less than
the cost of a PS3 you can assemble a good PC, so if you must run Linux,
why bother torturing yourself on a PS3? Use a real computer.

But perhaps if you are one of those gamers... well
the video modes also suck on that thing. And the power consumption is high,
not green at all, and it does not have that nice Nintendo remote.
:)))))))))))))))))))))))))))))))))))))))))
 
ChrisQ

Jan said:
John Lennon:

'You know I am a dreamer' .... ' And I hope you join us someday'

(well what I remember of it). You should REALLY try to program a Cell
processor some day.

Dunno what you have against programmers; there are programmers who
are amazingly clever with hardware resources. I dunno about NT and
MS, but IIRC MS plucked programmers from unis and sort of
brainwashed them then... the result we all know.

That's just the problem - programmers have been so good at hiding the
limitations of poorly designed hardware that the whole world thinks
that hardware must be perfect and needs no attention other than making
it go faster.

If you look at some modern i/o device architectures, it's obvious the
hardware engineers never gave a second thought about how the thing would
be programmed efficiently...

Chris (with embedded programmer hat on :-(
 
Jan Panteltje

That's just the problem - programmers have been so good at hiding the
limitations of poorly designed hardware that the whole world thinks
that hardware must be perfect and needs no attention other than making
it go faster.

If you look at some modern i/o device architectures, it's obvious the
hardware engineers never gave a second thought about how the thing would
be programmed efficiently...

Chris (with embedded programmer hat on :-(

Interesting.
For me, I have a hardware background, but also software, the two
came together with FPGA, when I wanted to implement DES as fast as possible.
I did wind up with just a bunch of gates and 1 clock cycle, so no program :)
No loops (all unfolded in hardware).
So, you need to define some boundary between hardware resources (that one used a lot of gates),
and software resources, I think.
 
Tim Williams

ChrisQ said:
That's just the problem - programmers have been so good at hiding the
limitations of poorly designed hardware

Is that like the crummy WinModems?

Tim
 