As I have a need to write depth and color atomically (with depth testing active), I’ve come to the conclusion that GL_POINTS rendering is the way to go.
It looks like this:
struct Vertex
{
    float x, y, z;
};

Vertex* vbuf = NULL;

void
init_gl_stuff()
{
    <…>
    vbuf = (Vertex*)malloc(640*480*sizeof(Vertex));
    float x_step =  2.0f/640;
    float y_step = -2.0f/480;
    for (int x = 0; x < 640; ++x)
    for (int y = 0; y < 480; ++y)
    {
        vbuf[640*y+x].x = -1.0f + x_step*x + 0.5f*x_step;
        vbuf[640*y+x].y =  1.0f + y_step*y + 0.5f*y_step;
        vbuf[640*y+x].z =  0.0f;
    }
    <…>
}

void
pedestrian_blit()
{
    glClear(GL_COLOR_BUFFER_BIT|GL_DEPTH_BUFFER_BIT);
    glVertexPointer(3, GL_FLOAT, sizeof(Vertex), &(vbuf->x));
    // source color from disjoint color array, init code not shown
    glColorPointer(4, GL_UNSIGNED_BYTE, 4, shredder+offset);
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_COLOR_ARRAY);
    glDrawArrays(GL_POINTS, 0, 640*480);
}
This ‘blit’ is wrapped inside a glut display callback that also takes care of buffer swaps and does timing.
On to the issue:
This ‘blit’ replacement times in at ~16ms (640x480), for R200, R300 and Geforce 3. So far, so good.
If I plug it into a display list (which is impractical; I only did so to gauge what performance I can expect from static VBOs), the R300 tpf drops into the microsecond domain. Whoosh.
However, on my Geforce 3 the tpf increases to ~60ms. Does that mean that there’s no hw support for point rendering on Gf3? Is ‘pretesselation’ into pixel-sized quads the only way to get acceptable performance?
Or is it simply a driver glitch?
It’s hardly practical to disable VBO usage when the renderer string starts with “Geforce” … or is it?
(Dets 44.03, Geforce 3Ti200, Athlon XP 2400+, Win2k)