Hi All,
Generally, should we expect a considerable drop of FPS changing the number of samples from 2 to 4, 8 and 16 ?
Thanks,
Alberto
Yes, I believe you should expect a drop in FPS.
I'm no expert, but from what I understand, a multisampling rate of 2 means the GPU takes two coverage samples within each pixel and resolves (averages) them.
The same goes for multisampling rates of 4, 8, etc.
So the more samples you use, the more expensive the resolve and the more framebuffer bandwidth you consume, which inevitably leads to a drop in FPS.
In our software, we had to settle for a multisampling rate of 4, although we initially went for 16.
I hope that helps.
matchStickMan,
In our software, we had to settle for a multisampling rate of 4 although we initially went for 16.
I think we need to do the same; on some GPUs our program drops from 100 FPS to 1-2 FPS, and the quality improvement from 4 to 16 samples is barely noticeable.
What do you think?
Thanks.
Alberto
Cache coherency is nulled, ROPs get overwhelmed, imho.
Hi Ilian,
I don’t understand your post, please explain.
Thanks,
Alberto
Basically fillrate increases. And cache is very useful up to a certain threshold.
I.e if on a cpu-only app you work intensively over only 32kB of contiguous data (fits in L1 cache), bandwidth easily reaches 100GB/s, getting limited down only by the arithmetic ops you do. Make that 33kB, and you overstep the size of L1, arithmetic ops become less of a bottleneck. Then overstep the L2 by a good margin, make your memory-accesses more random, and your 1-cycle ops can each take 300+ cycles.
GPUs try to merge outputs from ROPs, as GDDR is with high-latency/high-granularity access. In GDDR3/4/5 datasheets I didn’t see in-stream masks for which bytes to be updated - there is only a whole-stream mask that is applied from start to finish of transfer (glColorMask indirectly creates this mask). So, when you update depths (and/or colors in CSAA) of subpixels, the previous 163232 depths have to be prefetched, masked, merged, uploaded to GDDR. (assuming 32x32 tile-size, which I deducted from some Insomniac Games reports).
Caches help tremendously up to a certain threshold. After that, the latency horrors become visible.
Metal Gear Solid 4 uses framebuffers only 1024 px wide for this reason, even though the RSX is claimed to have "enough cache to fit ANY DXT1 texture" (4096×4096 = 8 MB?). They simply found that was the threshold on that hardware, for their set of scenes.
… and the quality improvement from 4 to 16 samples is almost not noticeable.
Yeah, true. The increase in smoothness does not match the drop in FPS.
IMHO, a sampling rate of 4 is good enough.
Thanks guys, now everything is clear to me!
Alberto