Re: ARTICLE: Memory bandwidth

From: Brian Atkins
Date: Sat Apr 14 2001 - 00:47:23 MDT

James Rogers wrote:
> On 4/13/01 6:15 PM, "Brian Atkins" wrote:
> > I didn't even read the article yet, but I have read about the P4 quite a
> > bit. I don't think it is really appropriate to draw conclusions based on
> > comparing it to P3 since they are radically different, and IMO the P4 is
> > a botched design. They were trying for something really amazing with it,
> > but a lot got gutted out of the design resulting in a very unbalanced
> > CPU. In many benchmarks for instance a slower clocked P3 would beat a P4.

OK, I have read the article now, and I stand by what I said.

> You are somewhat missing the point. The comparisons are in fact typically
> between an 800-MHz P3 and a 1.5-GHz P4, but the difference in clock speed is
> immaterial if you actually look at the benchmarks (mostly because the
> benchmarks show *why* clock speed is immaterial). In fact, the article
> squares perfectly with what I've been hearing on the hardware lists where
> people have been running their own benchmarks.
> In short:
> For fp and vector codes *where the data fits in the cache*, you generally
> get much better performance with the P4 than the P3, i.e. better than
> suggested by the clock speed difference. Since it fits in the cache, memory
> bandwidth and latency are mostly irrelevant.

Except that the P4 fp unit is so badly done that a P3 will kill it on almost
any straight fp app, UNLESS you recode to use SSE2 instead of normal (x87) fp.

> For fp and vector codes that have data sets substantially larger than the
> cache, the speed of the processor is irrelevant: a P3 will be memory limited
> and the P4 will be memory limited. The P4 has a poor memory architecture

Well, where does the "Vortex" benchmark mentioned in the article fit in?
Apparently it is larger than cache, yet the behavior demonstrated does not
scale with bandwidth. Rather, it seems to be limited in the P4 case by bad
CPU design (heavy branch misprediction penalties + filling up cache with
unneeded data) and higher-latency memory.
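
To make the latency-versus-bandwidth distinction concrete, here is a rough
sketch of a two-regime cost model. Everything in it is an assumption for
illustration: the systems "A" and "B", their bandwidth and latency figures,
and the workload sizes are hypothetical, not measurements of a P3, P4, or
Athlon, and the function names are made up for this example.

```python
# Hypothetical figures for illustration only; not measurements of real hardware.
SYSTEMS = {
    "A": {"bandwidth": 2.1e9, "latency": 100e-9},  # lower bandwidth, lower latency
    "B": {"bandwidth": 3.2e9, "latency": 120e-9},  # higher bandwidth, higher latency
}


def streaming_time(total_bytes, bandwidth):
    """Sequential sweep: misses overlap, so time is roughly bytes / bandwidth."""
    return total_bytes / bandwidth


def pointer_chase_time(num_misses, latency):
    """Dependent loads: each miss must complete before the next can start."""
    return num_misses * latency


results = {}
for name, system in SYSTEMS.items():
    stream = streaming_time(1e9, system["bandwidth"])              # sweep 1 GB
    chase = pointer_chase_time(10_000_000, system["latency"])      # 10M dependent misses
    results[name] = (stream, chase)
    print(f"System {name}: streaming {stream:.3f} s, pointer chasing {chase:.3f} s")
```

Under these assumed numbers, system B wins the streaming sweep but loses the
pointer-chasing run, which is the sort of behavior a larger-than-cache but
latency-sensitive workload could show.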

> and benchmarks as a dog compared to similarly clocked Athlon chips, due
> solely to differences in memory architecture; the Athlon has more bandwidth,

Actually, no. I believe even PC2100 DDR-SDRAM used on Athlon systems has far
lower bandwidth (2.1GB/sec) than the P4 theoretical maximum of 3.2GB/sec.
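
A minimal sketch of the arithmetic behind those two peak figures. It assumes
the P4 number refers to dual-channel PC800 RDRAM (two 16-bit channels at 800
million transfers per second), which the message does not state, and that
PC2100 means a 64-bit bus at 133.33 MHz with two transfers per clock. The
helper name is made up for this example.

```python
def peak_bandwidth_gb_s(transfers_per_sec, bus_width_bytes, channels=1):
    """Theoretical peak bandwidth in GB/s, with 1 GB = 10**9 bytes."""
    return transfers_per_sec * bus_width_bytes * channels / 1e9


# PC2100 DDR-SDRAM: 133.33 MHz clock, 2 transfers per clock, 8-byte bus.
pc2100 = peak_bandwidth_gb_s(133.33e6 * 2, 8)

# Assumed P4 setup: dual-channel PC800 RDRAM, 800 MT/s, 2-byte channels.
rdram = peak_bandwidth_gb_s(800e6, 2, channels=2)

print(f"PC2100 DDR-SDRAM:         {pc2100:.2f} GB/s")
print(f"Dual-channel PC800 RDRAM: {rdram:.2f} GB/s")
```

These are theoretical peaks only; sustained bandwidth and latency on a real
system will differ.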

> so for memory limited problems it is faster, even at slower clock speeds.

I just don't see how your argument is stacking up here. The P4 is clearly not
a lame horse because of its memory situation, unless you have an app that
requires low latency. Primarily the P4 is lame because the CPU itself just
plain sucks.

> Note that the P4 has 15-20% *worse* memory latency than the P3, so cache
> efficiency is even more important on the P4 than on the P3. What the P4
> represents is one of the first times where the inter-generational
> differences in processor performance are based almost solely on the memory
> bus chipsets, not on the processor clock.

Total BS IMO. The P4 primarily sucks due to its own totally radical differences
compared to the P3. The P4 has 20+ pipeline stages! And many other oddities
that trip it up. The memory actually is quite nice except for the latency. It
even has hardware prefetch and other goodies that the Athlon has yet to get.
Yet it can't do anything with it in many cases because the P4 is so unbalanced.
In some specific cases it screams along, and in others it is worse than a
sub-GHz P3.
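
One way to see why a very deep pipeline can waste a clock-speed advantage is
the textbook effective-CPI formula: base CPI plus (branch fraction x
misprediction rate x penalty in cycles). The sketch below uses made-up
parameters (a 10-cycle penalty at 1.0 GHz versus a 20-cycle penalty at
1.5 GHz, 20% branches, base CPI of 1.0); none of them are measured P3 or P4
values, and the function names are hypothetical.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction including branch misprediction stalls."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


def deep_pipeline_speedup(mispredict_rate):
    """Speedup of the assumed deep, fast-clocked design over the short one."""
    # Hypothetical short pipeline: 10-cycle penalty at 1.0 GHz.
    short_time = effective_cpi(1.0, 0.2, mispredict_rate, 10) / 1.0e9
    # Hypothetical deep pipeline: 20-cycle penalty at 1.5 GHz.
    deep_time = effective_cpi(1.0, 0.2, mispredict_rate, 20) / 1.5e9
    return short_time / deep_time


for rate in (0.0, 0.05, 0.10, 0.20):
    print(f"mispredict rate {rate:.2f}: speedup {deep_pipeline_speedup(rate):.2f}x")
```

In this toy model the 1.5x clock advantage shrinks as branches become harder
to predict but never drops below 1x; the model covers only the misprediction
penalty, so it cannot by itself account for cases where the slower-clocked
chip actually wins.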

> As for having a botched design, you are mostly correct -- the memory
> architecture is much worse than it has to be (addressed in the article).
> The problem is that Intel doesn't have a huge amount of room to improve it.
> For those classes of problem where everything (or most everything) fits in
> the high-speed cache, it screams. Unfortunately, we aren't talking about
> those types of problems.
> The point to all this being that if you are running an AI, which presumably
> will be churning on vast quantities of data, the clock speed doesn't matter.
> What the benchmarks between the P3 and P4 show is that memory bandwidth is
> already a serious crisis for data hungry apps. The situation on the P4 is
> so bad that they expect faster versions of the P4 to show no performance
> improvement for cache killers, whereas it used to be that you got *some*
> minor improvement with increased clock speeds even for these apps.

Here I can agree. Personally I hope they eventually move to some kind of
optical interconnect that would allow for much higher levels of bus performance.
Unfortunately there doesn't seem to be much demand for such things. Most
applications do just fine, and with upcoming hacks like large level 3 caches
being added to northbridges, the set of applications constrained by memory
bandwidth will shrink further.

Brian Atkins
Director, Singularity Institute for Artificial Intelligence
