cache sizes in graphics hardware

Hi,

I’ve got two questions:

  1. Does anybody have information about cache sizes for textures or vertices in current graphics processors? (How can I find this out with some benchmark?)

  2. How can I optimally take advantage of caching in graphics hardware (during texturing or vertex processing)?

Michael

Originally posted by mlb:
Hi,

  1. Does anybody have information about cache sizes for textures or vertices in current graphics processors? (How can I find this out with some benchmark?)

There is very little public information on GPU caches. Texture caches are small and mainly designed to avoid refetching texels for neighbouring bilinear samples. Vertex caches are typically on the order of 10-32 vertices.

Originally posted by mlb:
  2. How can I optimally take advantage of caching in graphics hardware (during texturing or vertex processing)?

Keep textures as small as possible. Use compression. Strip your geometry so that there is as much vertex reuse as possible.
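To make the vertex-reuse point concrete, here is a minimal illustrative sketch (not from the original reply; the array and function names are made up) of drawing with indexed primitives so that shared vertices can be served from the small post-transform vertex cache instead of being transformed again:

```c
/* Minimal sketch: indexed triangle strip with shared vertices.
 * Vertices 1 and 2 are referenced by both triangles of the strip, so the
 * hardware's small vertex cache can reuse the transformed results. */
#include <GL/gl.h>

static const GLfloat quad_vertices[4][3] = {
    { 0.0f, 0.0f, 0.0f },   /* 0 */
    { 1.0f, 0.0f, 0.0f },   /* 1 */
    { 0.0f, 1.0f, 0.0f },   /* 2 */
    { 1.0f, 1.0f, 0.0f },   /* 3 */
};
static const GLushort quad_indices[4] = { 0, 1, 2, 3 };

void draw_strip(void)
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, quad_vertices);
    glDrawElements(GL_TRIANGLE_STRIP, 4, GL_UNSIGNED_SHORT, quad_indices);
    glDisableClientState(GL_VERTEX_ARRAY);
}
```

With long strips or a cache-aware index ordering over a larger mesh, most vertices are fetched and transformed only once.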

Hi,

I’m working on volume rendering algorithms, i.e. 3D-texture-based slicing of bricked volume data. There are many rumors that 2D-textures allow better caching than 3D-textures. Can anybody confirm this – and try to give some explanation (or benchmarks)?

In general, I prefer 3D-textures since they are more elegant to use. I think there should be some way to do 3D-texture lookups in a more cache-friendly manner, for example by rendering triangles that are rasterized in an order that produces more cache-friendly lookups. (The shape of the triangles also influences rasterization speed, for instance.) Did anybody do experiments like this?
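For illustration, one way to experiment with this idea (a sketch under my own assumptions, not something actually benchmarked in this thread) would be to split each slice polygon into a grid of small quads, so that fragments – and therefore 3D-texture lookups – are generated tile by tile instead of in long full-width scanlines:

```c
/* Hypothetical experiment: draw an axis-aligned slice at texture depth r
 * (r in [0,1]) covering [-1,1]^2 in object space, subdivided into
 * tiles x tiles small quads. Consecutive fragments then touch a small
 * region of the 3D texture, which may stay cache-resident. */
#include <GL/gl.h>

void draw_tiled_slice(float r, int tiles)
{
    int i, j;
    glBegin(GL_QUADS);
    for (j = 0; j < tiles; ++j) {
        for (i = 0; i < tiles; ++i) {
            float s0 = (float)i / tiles,  s1 = (float)(i + 1) / tiles;
            float t0 = (float)j / tiles,  t1 = (float)(j + 1) / tiles;
            float x0 = 2.0f * s0 - 1.0f,  x1 = 2.0f * s1 - 1.0f;
            float y0 = 2.0f * t0 - 1.0f,  y1 = 2.0f * t1 - 1.0f;
            glTexCoord3f(s0, t0, r); glVertex2f(x0, y0);
            glTexCoord3f(s1, t0, r); glVertex2f(x1, y0);
            glTexCoord3f(s1, t1, r); glVertex2f(x1, y1);
            glTexCoord3f(s0, t1, r); glVertex2f(x0, y1);
        }
    }
    glEnd();
}
```

Whether this actually helps depends on the traversal order the rasterizer already uses internally, so it would have to be measured rather than assumed.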

Concerning the size of textures: It would be easy to make textures “as small as possible”, but the number of texture objects would explode in my case – consider a volume of 512^3 voxels and a brick size of 16^3 voxels: the result would be 32768 texture objects.

That raises another question: why does performance drop significantly when using large numbers of texture objects – which is exactly what making textures small leads to? This phenomenon starts around a brick size of 32^3 or less for a volume of about 512^3 voxels – which corresponds to 4096 texture objects. This suggests that there is some upper limit beyond which the management of texture objects gets less efficient. Btw., I know that binding a texture object produces some constant overhead, but it seems that this is not the only reason. Lots of speculation – any hints appreciated :wink:

I forgot to mention: I’m using current NVIDIA hardware.

Michael

Hi,
you can find some results of experiments concerning texture-cache-aware volume rendering in the paper cited below. Note that a 512^3 volume at 16 bits per voxel is already 256MB of raw data, so unless you have a 512MB graphics card it won’t fit in graphics memory alongside everything else – which may actually be the cause of the slowdown you have observed for large numbers of textures.

Visualization of large medical data sets using memory-optimized CPU and GPU algorithms. Gundolf Kiefer, Helko Lehmann, Juergen Weese. Proc. SPIE Vol. 5744, p. 677-687, Medical Imaging 2005: Visualization, Image-Guided Procedures, and Display; Robert L. Galloway, Jr., Kevin R. Cleary; Eds., Apr 2005

Greetings,
dreld

dreld,

Could you post a link to that paper?

Thanks.

P.S. The cause of the slowdown is not texture memory. It is caused purely by the large number of texture objects, even when they fit completely in texture memory.

The link is:

http://bookstore.spie.org/index.cfm?fuse…FTOKEN=65029017

SPIE charges for the full paper though :frowning:

dreld

Thanks, dreld,

it’s an interesting paper, but it does not cover the aspect of caching I had in mind. It mainly shows that a “cache adapted” bricking using 3D-textures is approximately as good as or better than other methods like 2D-multi-textures, especially for large volumes. But the “cache-awareness” is not really proven; the given numbers only show that some brick sizes perform better than others and that 2D-textures are faster than 3D-textures for small volumes. We would need some word from ATI or NVIDIA to find out whether the numbers can be explained by caching effects – they could also be explained by drivers performing better for particular brick sizes.

I had in mind other optimizations, not related to caching between main memory and graphics memory. I’m interested in GPU-internal caches and in generating an order of rasterized pixels that gives a more cache-friendly order of texture lookups. Did anybody ever try this? One could try to find out the approximate internal memory layout of 3D-textures (some kind of blocking to enable orientation-independent slicing performance) and then render triangles that approximately fit that memory layout. I know that sounds a bit strange, but one could try…

Michael

The site http://www.digit-life.com/articles2/gffx/nv40-part1-a.html
has some nice block diagrams showing the architecture of the nv40. The figure at http://www.digit-life.com/articles2/gffx/5.png shows that there is a second level texture cache that is shared by all quad processors and a first level cache for each of the quad processors. I wonder if anybody ever tried to find out the size of these caches?

Archmark, if used carefully, can help you determine texture cache sizes.
Many people have struggled with getting accurate results, most likely because they by and large failed to RTFM, so I’ll just sum up what I believe to be correct.

Radeon 8500: 4kiB texture cache
Radeon 9000/9200/9250: 2kiB texture cache
R300 and derivatives (Radeon 9500-9800Pro, X300, X600): 8kiB texture cache
NV2x, NV3x: 4kiB texture cache (see extra info below)
NV40: 16kiB texture cache IIRC (not sure, may be 8, lost the report)
Radeon X700 and up: don’t know
Geforce 7800: don’t know

On NV2x and NV3x paletted or S3TC compressed textures are fully decoded upon loading into the cache. That means that for paletted textures (which are transformed to RGBA8888) you only have 1kiB of “effective” texture cache. Depending on your settings you may or may not have 2kiB of effective cache for DXT1, because the chip can decode that to RGB565, which is a hack; this is the cause of all the quality issues with DXT1, e.g. Quake 3’s skybox.

The NV40 generation fixes this. You get the same effective cache size for compressed and uncompressed textures. This is explained by interpreting this two-level cache marketing hyperbole correctly, which is otherwise a complete hoax IMO. It does nothing for you except handling compressed textures sensibly and competitively.
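For reference, the general idea behind such a probe can be sketched roughly like this (my own illustration, not Archmark’s actual code; seconds() and draw_textured_quad() are hypothetical helpers you would have to supply): draw from textures of increasing size and look for the point where the measured rate drops – that knee roughly marks the cache size.

```c
/* Rough sketch of a texture-cache probe, NOT Archmark's code.
 * For each texture size, draw a fixed number of textured quads and time the
 * run with glFinish() around it; once the texture no longer fits in the
 * cache, the achieved rate drops. */
#include <GL/gl.h>
#include <stdio.h>

extern double seconds(void);            /* hypothetical high-resolution timer */
extern void draw_textured_quad(void);   /* hypothetical: one textured quad    */

void probe_texture_cache(void)
{
    int size;
    for (size = 4; size <= 1024; size *= 2) {        /* texture edge length   */
        /* ... create and bind an RGBA8 texture of size x size here ...       */
        const int passes = 256;
        int i;
        double t0, dt;
        glFinish();                                   /* drain pending work    */
        t0 = seconds();
        for (i = 0; i < passes; ++i)
            draw_textured_quad();
        glFinish();                                   /* wait for the GPU      */
        dt = seconds() - t0;
        printf("%4d x %4d (%8d bytes): %.2f quads/s\n",
               size, size, size * size * 4, passes / dt);
    }
}
```

Archmark itself reports texel fetches per second against the texture size in bytes, which is the format of the numbers posted below.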

Hi,

thanks, zeckensack – that’s a really interesting benchmark tool. The cache size for nv40 is 8kiB according to your tool. The numbers from my box for rgba texture caching are:

rgba_fetches/s,bytes
2672808448.000000,4
2745205504.000000,8
2723405056.000000,16
2726195200.000000,32
2737927168.000000,64
2734844928.000000,128
2735956992.000000,256
2738938880.000000,512
2704938496.000000,1024
2742886656.000000,2048
2725963264.000000,4096
2729058304.000000,8192
2427928832.000000,16384
2024621312.000000,32768
2009164928.000000,65536
2004677120.000000,131072
1997701888.000000,262144
2002703872.000000,524288
1998383744.000000,1048576

That means that the relative speed of good vs. bad caching is approximately 1.3. I wonder how to interpret these numbers… They only tell us about the cache size, but what about the maximum speed-up? Is it really only 1.3?

P.S. What about expanding your tool to support 3d-textures (with 1, 2, or 4 components) or other formats like floating point textures…

Originally posted by zeckensack:
The NV40 generation fixes this. You get the same effective cache size for compressed and uncompressed textures. This is explained by interpreting this two-level cache marketing hyperbole correctly, which is otherwise a complete hoax IMO. It does nothing for you except handling compressed textures sensibly and competitively.

So there’s no first-level cache in the nv40?

Originally posted by mlb:
So there’s no first-level cache in the nv40?
There may well be two levels, but I wouldn’t call them L1 and L2. In any case, the lowest level texture cache has no measurable impact on the outside world. You just see the large one.

And the largest texture cache level on NV40 looks like it is 8k (thx for the fresh set of numbers). Twice the amount of NV3x, and the same amount as R300. The new/other/additional cache level (compared to NV3x), if present, must be smaller.

You could call the larger one L2 and the smaller one L1, probably. Or you could call the larger one L1 and the smaller one L0, or predecode buffer, or anything you fancy. Or (and I prefer that) you could just call the large one “the cache” and leave it at that.

A texture cache has an effect that can be measured. It increases performance for smaller data sets. You can verify its effects, and you can see performance gains in the real world.
Having an extra decode buffer farther away from memory has no such effect. It’s purely an engineering detail. It may affect die size, power, manufacturability, clock speeds and what-have-you. Who knows … and who really cares …

The fact of the matter is that ATI might have the same mechanism in their chips, to help simplify the texture sampler itself, to make broadcasting texture data all over the die easier, whatever, and they just don’t care to market it as a cache. We really can’t tell from the outside of these, to us, black boxes. If we can’t tell the difference, marketing shouldn’t imply that we can, IMO.


Originally posted by mlb:
but what about the maximum speed-up? Is it really only 1.3?
Yes, in this specific case it is. The left column is fillrate. As you can see this is a rather simplistic scenario, just perfectly isotropic single-texturing. Texture memory access, even when spilling the cache, should be perfectly coherent.

If you want more real-world impressions, you could play around with extreme negative LOD bias vs LOD bias 0.0 in some games, at modest resolutions (I suggest the UT series, as it’s easy to make this setting in the .ini).
Negative LOD bias, besides making textures look crappy in motion but oh-so-sharp on screenshots, kills coherency. The texture caches thrash all the time, which IMO reproduces the performance profile that you would get without any texture cache quite well.

I’m sure you’ll see much worse than a 30% drop. Be wary of app specific “optimizations” though …
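If you would rather do this experiment in your own GL code than through a game’s .ini, a minimal sketch (my addition, not part of the original post) using the texture LOD bias control that is core since OpenGL 1.4 (EXT_texture_lod_bias on older drivers):

```c
/* Sketch: force a negative LOD bias on the active texture unit so that
 * minified textures sample overly detailed mip levels, thrashing the texture
 * cache. GL_TEXTURE_FILTER_CONTROL and GL_TEXTURE_LOD_BIAS may come from
 * <GL/glext.h> on older headers. */
#include <GL/gl.h>
#include <GL/glext.h>

void set_lod_bias(float bias)
{
    /* e.g. bias = -2.0f to thrash the cache, 0.0f for the normal profile */
    glTexEnvf(GL_TEXTURE_FILTER_CONTROL, GL_TEXTURE_LOD_BIAS, bias);
}
```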

zeckensack,

I see – the so-called 2nd-level cache seems to be the interesting part, since it is possible to benchmark its likely size and to optimize texture sizes for that cache. The 1st-level cache is most likely very small and located in the quad processor. Its effects are perhaps not easy to benchmark; they operate at the quad level, so it would help, for example, when the pixels in a quad frequently access the same texels. Or could it be that the 1st-level cache is not made for texture caching at all, but for expanding the set of registers available to the ALUs?

Michael