MapBuffers - slow writes?

I’ve fiddled around a bit with mapped VBOs and can’t manage to exceed ~185MB/s write speed. Is that normal?

Yes, I have an AGP GART driver installed and everything’s working fine.

Benchmarking loop:

	GLuint vbo=0;

	// allocate a 1MB stream-draw VBO and map it for writing
	glGenBuffersARB(1,&vbo);
	glBindBufferARB(GL_ARRAY_BUFFER_ARB,vbo);
	glBufferDataARB(GL_ARRAY_BUFFER_ARB,1<<20,NULL,GL_STREAM_DRAW_ARB);

	void* mapped_vbo=glMapBufferARB(GL_ARRAY_BUFFER_ARB,GL_WRITE_ONLY_ARB);

	// overwrite the mapped megabyte repeatedly for at least half a second
	int megs=0;
	double total_time=0.0;
	t.reset();
	do
	{
		MMX_write_one_meg(mapped_vbo);
		++megs;
		total_time=t.elapsed_seconds();
	} while ((total_time<0.5)&&(megs<2048));

	// if the unmap fails, the data was lost and the measurement is void
	if (!glUnmapBufferARB(GL_ARRAY_BUFFER_ARB))
	{
		bw_vbo_write=0.0;
	}
	else
	{
		bw_vbo_write=double(megs)*double(1<<20)/total_time;
	}

	glBindBufferARB(GL_ARRAY_BUFFER_ARB,0);
	glDeleteBuffersARB(1,&vbo);
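For comparison, the same harness against plain malloc'ed system memory (no GL involved) looks roughly like this. This is just a sketch: clock() stands in for my timer class, and memset stands in for MMX_write_one_meg.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* System-memory analogue of the VBO benchmark above: repeatedly
   overwrite one megabyte and return the achieved bandwidth in
   bytes per second. clock() replaces the timer class and memset
   replaces MMX_write_one_meg. */
double measure_memset_bandwidth(void)
{
	const size_t one_meg=(size_t)1<<20;
	unsigned char* buf=(unsigned char*)malloc(one_meg);
	if (!buf)
		return 0.0;

	int megs=0;
	double total_time=0.0;
	clock_t start=clock();
	do
	{
		memset(buf,0,one_meg);
		++megs;
		total_time=(double)(clock()-start)/CLOCKS_PER_SEC;
	} while ((total_time<0.5)&&(megs<2048));

	free(buf);
	/* guard against a zero reading from a coarse clock */
	return (total_time>0.0)
		? (double)megs*(double)one_meg/total_time
		: 0.0;
}
```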

The timer is working okay, so don’t ask.

Here’s the code I use to write:

;write one meg of junk (to uncached memory) with minimum cpu overhead
;prototype:
;void MMX_write_one_meg(void* target)
_MMX_write_one_meg:
	PUSH EAX
	PUSH EDI
	XOR EAX,EAX
	INC EAX
	SHL EAX,17          ;EAX = 2^17 qwords = 1MB/8
	PXOR mm0,mm0
	PXOR mm1,mm1
	PXOR mm2,mm2
	PXOR mm3,mm3
	MOV EDI,[ESP+12]    ;target pointer (two pushes + return address)
	LEA EDI,[EDI+8*EAX] ;point EDI one past the end of the buffer
	NEG EAX             ;count qwords up from -2^17 towards zero
.loop_128:
	MOVNTQ [EDI+8*EAX],mm0
	MOVNTQ [EDI+8*EAX+8],mm1
	MOVNTQ [EDI+8*EAX+16],mm2
	MOVNTQ [EDI+8*EAX+24],mm3
	MOVNTQ [EDI+8*EAX+32],mm0
	MOVNTQ [EDI+8*EAX+40],mm1
	MOVNTQ [EDI+8*EAX+48],mm2
	MOVNTQ [EDI+8*EAX+56],mm3
	MOVNTQ [EDI+8*EAX+64],mm0
	MOVNTQ [EDI+8*EAX+72],mm1
	MOVNTQ [EDI+8*EAX+80],mm2
	MOVNTQ [EDI+8*EAX+88],mm3
	MOVNTQ [EDI+8*EAX+96],mm0
	MOVNTQ [EDI+8*EAX+104],mm1
	MOVNTQ [EDI+8*EAX+112],mm2
	MOVNTQ [EDI+8*EAX+120],mm3
	ADD EAX,16
	JNZ .loop_128
	SFENCE              ;make the non-temporal stores globally visible
	EMMS
	POP EDI
	POP EAX
	RETN
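For reference, a portable C equivalent of the fill routine, writing the same megabyte in 32-bit stores. Unlike MOVNTQ, plain stores go through the cache, so this models the "regular x86" variant mentioned below rather than the non-temporal path:

```c
#include <stdint.h>
#include <stddef.h>

/* Portable stand-in for _MMX_write_one_meg: fill 1MB with zeros
   using 32-bit stores. These are cached writes, so this corresponds
   to the plain-x86 variant, not the MOVNTQ path. */
void x86_write_one_meg(void* target)
{
	uint32_t* p=(uint32_t*)target;
	size_t count=((size_t)1<<20)/sizeof(uint32_t);
	for (size_t i=0;i<count;++i)
		p[i]=0;
}
```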

Radeon 9500Pro, Cat 3.6, Athlon XP2400+, KT266A, up-to-date VIA Hyperion, lots of RAM, yadda, yadda.

Bonus info:
Using regular x86 stores, i.e. writing in 32-bit chunks, increases bandwidth to ~192MB/s.

There’s no difference between GL_DYNAMIC_DRAW_ARB and GL_STREAM_DRAW_ARB in this test.

More testing:
glBufferSubDataARB from malloc’ed system memory hits 370MB/s …

Uh oh …
Geforce 3, Det 44.03 (yeah, I know …)
Map, MMX => 1.935 GB/s
Map, x86 => 1.935 GB/s
BufferSubData => 471 MB/s

ATI devrel, here I come.

Preliminary conclusion:
It’s quite obvious that my Geforce 3 driver funnels the map to a staging area, or gives me AGP memory. That’s only AGP 4x, so writing more than 1GB/s straight over the bus would be a theoretical no-go …

Still, glMapBufferARB seems to be completely useless so far.

[This message has been edited by zeckensack (edited 08-14-2003).]

Yeah.
I’ve noticed that gl[Read,Draw]Pixels is REALLY slow on my machine, like around 1Mpix/s. Almost certainly a driver issue.

Originally posted by NitroGL:
Yeah.
I’ve noticed that gl[Read,Draw]Pixels is REALLY slow on my machine, like around 1Mpix/s. Almost certainly a driver issue.

Hi, Nitro.
The pixel path isn’t exactly high performance for me either, but there are ways around that.

This thread’s about something different, though: VBO access.

But it still has to go over the AGP bus, so it affects the pixel pipe too…

Originally posted by NitroGL:
But it still has to go over the AGP bus, so it affects the pixel pipe too…
Ummm, yes, in a way it does.
I was just wondering, why should anyone use glMapBufferARB when
a) ATI’s implementation performs a lot better with BufferSubData
b) NVIDIA’s implementation points me to a staging area that’ll get uploaded to the card sometime later, hogging up my memory interface again
?

I mean, avoiding a second copy, writing directly to card memory, with FSB and AGP as the only bottlenecks (and - for once - not system memory bandwidth) … wasn’t that the whole point of using MapBuffer? Wasn’t that the reason to provide this function? Wasn’t that what makes it worthwhile despite all the ‘virtual memory critical section yadda yadda’ issues it introduces?

I know I won’t be using it. Maybe in a year I’ll try again, but not now.

I was just wondering, why should anyone use glMapBufferARB when
a) ATI’s implementation performs a lot better with BufferSubData
b) NVIDIA’s implementation points me to a staging area that’ll get uploaded to the card sometime later, hogging up my memory interface again

Because neither of those have to be true. They may be true now, but they don’t have to be.

NVIDIA’s implementation points me to a staging area that’ll get uploaded to the card sometime later, hogging up my memory interface again

I think this is the way it SHOULD be. AGP memory is defined to be easy/fast to write (if you’re writing sequentially with full coverage) and easy for the card to read asynchronously whenever it needs it.

Think about it: the card can use VRAM for framebuffers and texturing, and most games are fill rate limited. Thus, using AGP to read the geometry at the point where you actually draw it is “free” bandwidth. Plus, it de-couples the CPU: the CPU blats out vertices into local system memory very quickly, and is then free to do other things.

If you’re doing non-typical-game stuff, like uploading a gig of texture data per second, this design gets in the way, but then you’d be in the very small minority…
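To illustrate the "sequential with full coverage" point, here's a sketch with a made-up vertex layout. Write-combined memory rewards one forward pass that touches every byte, and punishes strided multi-pass writes; on ordinary cached memory both functions below behave the same, the difference only matters on uncached WC memory like a mapped AGP buffer:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical interleaved vertex layout, for illustration only. */
typedef struct { float x, y, z; uint32_t color; } Vertex;

/* WC-friendly: one sequential pass, every byte covered in order. */
void fill_sequential(Vertex* v, size_t n)
{
	for (size_t i=0;i<n;++i)
	{
		v[i].x=(float)i;
		v[i].y=0.0f;
		v[i].z=0.0f;
		v[i].color=0xFFFFFFFFu;
	}
}

/* WC-hostile: several strided passes over the same range. On
   write-combined memory this forces partial write-combine buffer
   flushes; the end result is identical, only slower. */
void fill_strided(Vertex* v, size_t n)
{
	for (size_t i=0;i<n;++i) v[i].x=(float)i;
	for (size_t i=0;i<n;++i) v[i].y=0.0f;
	for (size_t i=0;i<n;++i) v[i].z=0.0f;
	for (size_t i=0;i<n;++i) v[i].color=0xFFFFFFFFu;
}
```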

As for why you’d get such slow performance on your ATI set-up, I have no idea. I’d expect ATI to do the same thing as I describe here. It’s really The Right Compromise.

Originally posted by Korval:
Because neither of those have to be true. They may be true now, but they don’t have to be.
Point taken.

I’m doing freeware stuff where ‘product cycles’ or whatever they’re called are quick and many. That’s why I’m so impatient.

jwatte,
I was thinking of respecifying a rather large array of vertex positions, where I can calculate x/y on the fly, using just the loop counter, without consuming any read bandwidth. z in some outlandish format gets streamed in, converted to float and added to the array.

It’s critical that this array gets transferred fast, because the standard system-memory vertex array it’s supposed to replace is already drawable at ~300MB/s (on both IHVs; IMO that’s transfer limited, because both fillrate and geometry horsepower are plenty). I also hoped for minimum cache footprint.

That’s the “GL_POINTS is better than glDrawPixels(GL_DEPTH_COMPONENT,<…> );” trick btw.
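In sketch form, the respecification I mean looks like this. The 16-bit z source format and the grid layout are just illustrative assumptions; the point is that x/y come from the loop counters alone, so z is the only thing that costs read bandwidth:

```c
#include <stdint.h>
#include <stddef.h>

/* Respecify a width*height grid of vertex positions: x/y are derived
   from the loop counters (no reads), z is streamed in as 16-bit values
   and converted to float. The uint16 format and [0,1] scaling are
   made up for illustration. */
void respecify_positions(float* xyz, const uint16_t* z_in,
                         size_t width, size_t height)
{
	size_t i=0;
	for (size_t y=0;y<height;++y)
	{
		for (size_t x=0;x<width;++x,++i)
		{
			xyz[3*i+0]=(float)x;
			xyz[3*i+1]=(float)y;
			xyz[3*i+2]=(float)z_in[i]/65535.0f;
		}
	}
}
```

In the real thing the destination would be the mapped (or respecified) VBO, written sequentially front to back.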

NVIDIA’s VBO transfer speed is good, uncached AGP memory is good, I just somewhat dislike the duplicated system memory usage, that’s all.

I thought this was the optimum path for the hypothetical SDRAM P4-Celeron with a decent graphics card, which many unfortunate end users have bought, where bandwidth and cache are things you really can’t afford to waste.

However, I won’t implement something that performs worse on ATI cards than doing nothing at all, as long as that remains the case.

If this whole thing gets better with a driver release, I can adjust my code fast enough. I just wanted to get this out of the way now.