Speeding up 2D Texture engine

Hello,

I recently started an OpenGL 2D game and I’m in the process of optimizing it.
It’s a really simple engine; the only thing it has to do is draw transparent images at a certain x and y.
While I got it reasonably fast, I want it to draw up to 5000 textures per frame at 60 fps. It can do this already, but only on modern hardware.
I might want to add more complexity later and I’m afraid the graphics engine just can’t keep up.
After doing some measurements I noticed it’s spending 7500 microseconds on drawing a frame and 1000 microseconds on all the other code.

I’d like some help on improving its speed.

This is the code of my engine.

Initialization:



width=600;
height=600;

glEnable(GL_TEXTURE_2D);
glDisable(GL_DEPTH_TEST);

glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glOrtho(0, width, 0, height, -1, 1);
glMatrixMode(GL_MODELVIEW);

glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);


Texture Loading:



glGenTextures(textureids);

glBindTexture(GL_TEXTURE_2D, texid);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0, GL_RGBA,
GL_UNSIGNED_BYTE, imageByteBuffer);


Actually drawing images, this part gets called 5000 times per frame.



glPushMatrix();
glBindTexture(GL_TEXTURE_2D, gltexid);
glTranslatef(x, y, 0);

glBegin(GL_QUADS);
{
	glTexCoord2f(0, 0);
	glVertex2f(0, 0);

	glTexCoord2f(0, heightratio);
	glVertex2f(0, imageheight);

	glTexCoord2f(widthratio, heightratio);
	glVertex2f(imagewidth, imageheight);

	glTexCoord2f(widthratio, 0);
	glVertex2f(imagewidth, 0);
}
glEnd();

glPopMatrix();


widthratio = imagewidth / texturewidth
heightratio = imageheight / textureheight
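To make the ratio computation concrete, here’s a minimal sketch in plain Java (the `ratio` helper is a made-up name, not part of the engine above). One pitfall worth flagging: if the dimensions are stored as Java `int`s, the division must be done in floating point, otherwise it truncates to zero for any image smaller than its texture.

```java
// Hypothetical helper illustrating the width/height ratio computation.
// The cast to float is essential: with int fields, 20 / 32 would be 0.
public class TexRatio {
    static float ratio(int imageSize, int textureSize) {
        return (float) imageSize / textureSize;
    }

    public static void main(String[] args) {
        // e.g. a 20x20 bullet image stored in a 32x32 texture
        System.out.println(ratio(20, 32)); // prints 0.625
    }
}
```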

Now, like I said. This code is not fast enough. Does anyone know how I could speed this code up?
I’ve taken a look at display lists but I do not know if they would actually improve my performance.
I’d really appreciate any help anyone could give.

The game is written in java using lwjgl to bind to OpenGL.
The OpenGL version used is 1.1. Would switching to a different version increase speed?

If you get GL 1.1 it is certainly not accelerated. Are you sure of this version?

  1. min filter linear is bad for performance if you actually have minified textures (drawn smaller than texel size). Try GL_LINEAR_MIPMAP_NEAREST for a sure win, but you will have to provide mipmaps. GL_LINEAR_MIPMAP_LINEAR will look nicer, at a small performance price.
  2. do you switch between a lot of different textures? This has a cost. Try to group several quads with the same texture, calling glBindTexture only once at the beginning.
  3. avoid immediate mode (glBegin etc.) if you want performance; instead try VBOs or other vertex array methods. You can use a single VBO for a given combination of imagewidth, imageheight, texturewidth, textureheight, or maybe use separate arrays for texture and image if there are too many combinations. See for example this tutorial: http://www.ozone3d.net/tutorials/opengl_vbo.php?lang=2
  4. question: what hardware do you run it on / expect it to run acceptably?
  5. what render resolution do you use?
  6. what texture resolutions do you use?

After doing some measurements I noticed it’s spending 7500 nanoseconds on drawing a frame and 1000 nanoseconds on all the other code.

Nanoseconds? 7500 nanoseconds is 7.5 microseconds, which is 0.0075 milliseconds.

At 60fps, you have 16.67 milliseconds available per frame. So your computation of time is wrong, your units are off, or you’re running at a perfectly acceptable speed.
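The frame-budget arithmetic above can be spelled out as a quick sanity check (plain Java; `budgetMs` is a made-up helper, not an engine function):

```java
// Frame budget check: 1000 ms / 60 frames ≈ 16.67 ms available per frame.
public class FrameBudget {
    // Milliseconds available per frame at the given frame rate.
    static double budgetMs(int fps) {
        return 1000.0 / fps;
    }

    public static void main(String[] args) {
        double renderMs = 7500 / 1000.0; // 7500 microseconds of drawing = 7.5 ms
        double otherMs  = 1000 / 1000.0; // 1000 microseconds of other code = 1 ms
        // 8.5 ms of work fits comfortably inside the 16.67 ms budget at 60 fps
        System.out.println(renderMs + otherMs < budgetMs(60)); // prints true
    }
}
```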

Actually drawing images, this part gets called 5000 times per frame.

Put multiple images in the same texture (in different locations of that texture, of course), so that you don’t have to call glBindTexture so often. When you do this, you also shouldn’t call glBegin/glEnd for each quad.

You should sort your rendering by what textures you’re using.
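A minimal sketch of that sorting idea, in plain Java without any GL calls (the `Sprite` class and `bindCalls` helper are hypothetical stand-ins): order the sprites by texture id so all quads sharing a texture are drawn back to back, and only rebind when the id actually changes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sort sprites by texture id so glBindTexture is only needed on id changes.
public class TextureSort {
    static class Sprite {
        final int texId;
        final float x, y;
        Sprite(int texId, float x, float y) { this.texId = texId; this.x = x; this.y = y; }
    }

    // Returns how many texture binds a sorted draw order would need.
    static int bindCalls(List<Sprite> sprites) {
        sprites.sort(Comparator.comparingInt(s -> s.texId));
        int binds = 0, lastTex = -1;
        for (Sprite s : sprites) {
            if (s.texId != lastTex) {
                binds++;            // this is where glBindTexture would be called
                lastTex = s.texId;
            }
            // ... emit the quad for s into the current batch ...
        }
        return binds;
    }

    public static void main(String[] args) {
        List<Sprite> sprites = new ArrayList<>();
        for (int i = 0; i < 5000; i++)
            sprites.add(new Sprite(i % 3, i, i)); // 5000 sprites, 3 distinct textures
        System.out.println(bindCalls(sprites));   // prints 3 (instead of up to 5000 binds)
    }
}
```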

If you get GL 1.1 it is certainly not accelerated. Are you sure of this version?

I’m using org.lwjgl.opengl.GL11 which is described as: "The core OpenGL1.1 API. "
I take it this is OpenGL 1.1 though I guess it could be running these “classic” OpenGL 1.1 calls in a newer version of OpenGL.

  1. do you switch between a lot of different textures? This has a cost. Try to group several quads with the same texture, calling glBindTexture only once at the beginning.

Brilliant! Just did a quick test and this should improve performance by up to 100% :slight_smile:
Most of my textures that are drawn in large quantities are indeed the same image.

Nanoseconds? 7500 nanoseconds is 7.5 microseconds, which is 0.0075 milliseconds.

Oops, yeah, I made a mistake. 7500 microseconds, not nanoseconds.

  1. question: what hardware do you run it on / expect it to run acceptably?
  2. what render resolution do you use?
  3. what texture resolutions do you use?
  1. Preferably netbooks: integrated graphics coupled with a <2 GHz processor. Though if the finished product runs on modern hardware I’ll be happy too. I’m making the game for the fun of making it anyway.

  2. Currently hardcoded to 800x600, with support for multiple color depths.

  3. There’s a few big 600x600 textures for the background, but most are tiny bullets about 20x20 pixels in size. That’s why it’s easy to fill up the screen with 5000 of them. All of these textures currently use the same code to draw them to the screen.

The game has a sort of classic look to it, so jagged edges are no problem either. I’m prioritising speed over looks as much as I know how.

I’ll be having a look at the other suggestions as well, but as I only started playing around with graphics a couple of weeks ago (I just had to google what a texel was), I’ll need some time to figure everything out. Thanks for pointing me in the right direction. :slight_smile:

mumbles I’ve heard of mipmaps before… what are they? opens google

5000 textures is a lot. Are you sure you don’t mean 5000 quads?

If it’s really 5000 textures, I don’t think mipmaps are going to do you any good at all here. Your game is 2D so mipmaps are in fact more or less irrelevant to you, and will only unnecessarily use extra video RAM.

What you really need to do is hit Google and learn about Texture Atlases. I think these are going to be of the most immediate benefit as they will enable you to combine many of the small textures you use into a single larger texture, thereby cutting down on your texture changes and enabling you to start batching your draw calls. Once you’ve got Texture Atlases implemented come back and ask about the batching part. :wink:
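The core of a texture atlas is just the texture-coordinate math: each small image becomes a pixel rectangle inside one big texture, and the s/t coordinates for its quad are computed from that rectangle. A minimal sketch, assuming a square atlas (the `uv` helper and its layout are illustrative, not a real API):

```java
// Texture-atlas coordinate math: map a pixel rectangle inside an
// atlasSize x atlasSize texture to the {s0, t0, s1, t1} values that
// would feed the four glTexCoord2f calls of a quad.
public class Atlas {
    static float[] uv(int x, int y, int w, int h, int atlasSize) {
        return new float[] {
            (float) x / atlasSize,        // left   (s0)
            (float) y / atlasSize,        // bottom (t0)
            (float) (x + w) / atlasSize,  // right  (s1)
            (float) (y + h) / atlasSize   // top    (t1)
        };
    }

    public static void main(String[] args) {
        // e.g. a 20x20 bullet placed at pixel (40, 0) inside a 256x256 atlas
        float[] r = uv(40, 0, 20, 20, 256);
        System.out.println(r[0] + " " + r[2]); // prints 0.15625 0.234375
    }
}
```

Once every bullet image lives in the same atlas, the texture never has to change between quads, which is exactly what makes batching possible.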

Sorry for being vague, again I’m new to this :slight_smile:
I have at this point only a few textures, a few of which are drawn to the screen several thousand times per frame.

Also, I just called glGenerateMipmap by importing GL30 which seems to be an OpenGL 3.0 function. So I guess that means I’m not using OpenGL 1.1 :slight_smile:

About GL_TEXTURE_MAG_FILTER and GL_TEXTURE_MIN_FILTER: I’m currently not scaling textures and I’m not planning to either. Enabling mipmapping seems to have no effect, which is to be expected if I’m not scaling. I set the values to GL_NEAREST now, which seems to be cheapest (in terms of CPU) while not using mipmaps, even though I don’t intend to scale in the first place.

Update: I just implemented grouping texture drawing to reduce the amount of glBindTexture calls and this increased performance by 85%. Thanks for the idea. ^^ It’s at 220 fps on modern hardware now. Still only 7 fps on the netbook though.

Indeed, if there is no actual need for MIN/MAG, GL_NEAREST makes the most sense.

On lower-end hardware, fillrate will surely be a limiting factor.
You can verify this by running the same program at a very reduced resolution, like 64x64: if the speed increases a lot, then the only thing you can do is avoid overdraw (try to have each screen pixel drawn only once, avoiding layering as much as possible).

I see you use blending: this has a cost too. If only some tiles actually need blending, try rendering everything else without blending, then activate blending and alpha test with a low alpha threshold (this may make fully transparent parts slightly faster on low-end hardware, to be verified), and draw the tiles needing blending.

If speed is not that great even at a very low resolution, then you have to try point 3 above. It may be interesting to quickly try a display list, to verify any possible gain, but I advise against using DLs, as they are VERY costly to recompile when rendering changes.

EDIT: you probably do not need depth buffer testing, so try to disable depth test and do not request a depth buffer.

I cannot be certain of this, but I think that although you should draw transparent objects far to near to get the blending order right, solid objects should be drawn near to far so that as much of the depth buffer is filled as early as possible; anything behind will then fail the depth test and the textures might(?) not even be sampled.

If you’re using depth test, that is.

You can now get some extra performance by batching up some of your draw calls. Issuing a separate glBegin/glEnd for every quad you draw can be extremely expensive, especially on the kind of integrated 3D chip you’ll have in your netbook. You really only need this when your texture changes. You can have as many quads as you want between a glBegin (GL_QUADS) and a glEnd, so it makes sense to take advantage of this.

You can also get rid of the glPushMatrix/glTranslate/glPopMatrix and just add the x and y values to your glVertex calls; updating a matrix 5000 times per frame is also an expensive operation.

Here’s some sample code for a “quad batcher” that will accomplish these. I’ve used the same variable names as you have so you should be able to easily relate it to your own, although mine is C++.

void QuadBatcher (void)
{
	unsigned int lasttexid = 0;
	bool quadbegun = false;

	for (int i = 0; i < 5000; i++)
	{
		// get details for this object; stored in gltexid, x, y, etc

		if (gltexid != lasttexid)
		{
			// finish the last batch
			if (quadbegun) glEnd ();

			// begin a new one
			glBindTexture (GL_TEXTURE_2D, gltexid);
			glBegin (GL_QUADS);
			quadbegun = true;

			lasttexid = gltexid;
		}

		glTexCoord2f (0, 0);
		glVertex2f (x, y);

		glTexCoord2f (0, heightratio);
		glVertex2f (x, imageheight + y);

		glTexCoord2f (widthratio, heightratio);
		glVertex2f (imagewidth + x, imageheight + y);

		glTexCoord2f (widthratio, 0);
		glVertex2f (imagewidth + x, y);
	}

	// draw anything left over
	if (quadbegun) glEnd ();
}

(This is untested so no copy/pasting please!)

I did some more work and it’s getting better.
glPushMatrix/glTranslate/glPopMatrix are now gone, thanks to mhagain. It didn’t improve speed, but I do like the code better this way.

I had a look at Vertex Arrays, Display Lists and VBOs to replace immediate mode, and decided to try out Vertex Arrays because my game barely has any stationary objects. As far as I can tell, VBOs and Display Lists increase performance when the vertices don’t change between two draws of a texture, by not requiring the same vertices to be sent twice. However, I can assume that in my game all but a very few vertices will be changing every frame.

One more thing. I’m currently using blending to draw my bullets, as my bullets are round but the texture is square; the pixels around the bullet are fully transparent. Is there a cleaner (faster) way to do this? I don’t need support for half transparency at this point.

Then disable blending, and enable alpha test. Any threshold should work.

glDisable(GL_BLEND);
glEnable(GL_ALPHA_TEST);
glAlphaFunc(GL_GREATER,0.1f);

For a round bullet, it might be interesting to provide a smaller hexagon or octagon instead of a quad with lots of lost space, trading more triangles for fewer superfluous fragment operations. Only useful if there is enough gain in the number of fragments.
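As a rough sketch of the geometry (plain Java, hypothetical helper names): a regular octagon circumscribing a circle of radius r covers the whole round bullet while cutting off the transparent corners. Its area is 8·r²·tan(π/8) ≈ 331 px² for r = 10, versus 400 px² for the full 20x20 quad, so roughly 17% fewer fragments, at the cost of 8 vertices instead of 4.

```java
// Vertices of a regular octagon that circumscribes a circle of radius r,
// i.e. the smallest octagon that still fully covers a round bullet sprite.
public class Octagon {
    // Returns 8 (x, y) pairs centered on (cx, cy).
    static float[] vertices(float cx, float cy, float r) {
        float[] out = new float[16];
        double vertexRadius = r / Math.cos(Math.PI / 8); // circumscribed octagon
        for (int i = 0; i < 8; i++) {
            double angle = Math.PI / 8 + i * Math.PI / 4; // flat edges axis-aligned
            out[2 * i]     = cx + (float) (vertexRadius * Math.cos(angle));
            out[2 * i + 1] = cy + (float) (vertexRadius * Math.sin(angle));
        }
        return out;
    }

    public static void main(String[] args) {
        float[] v = vertices(0, 0, 10);
        System.out.println(v.length / 2 + " vertices"); // prints 8 vertices
    }
}
```

Whether this wins in practice depends on the hardware: it only helps if the saved fragment work outweighs the extra vertices, which is worth measuring rather than assuming.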

Hey,

GL_ALPHA_TEST works a treat, thanks. It’s not faster at this point but probably will be in the future.

I’m really happy with it now. I went from single-digit fps up to several hundred. It even runs at 70 fps on the netbook now. My final design looks terrible, but it’s the fastest design I’ve been able to get to run so far.

I have different VBOs for vertices and texture coordinates, which is confusing… This is the code:

//Loading a texture

	
// Fills the FloatBuffer with the texture coords 6000 times for 
// support for up to 6000 bullets	
for (int i = 0; i < 6000; i++) {
	texcoords.put(0);
	texcoords.put(0);
	texcoords.put(0);
	texcoords.put(bulletimg[id].heightratio);
	texcoords.put(bulletimg[id].widthratio);
	texcoords.put(bulletimg[id].heightratio);
	texcoords.put(bulletimg[id].widthratio);
	texcoords.put(0);
}

texcoords.rewind();

//Write the coords to a VBO
ARBVertexBufferObject.glBindBufferARB(ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB,buffer[id]);
ARBVertexBufferObject.glBufferDataARB(ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB, texcoords,ARBVertexBufferObject.GL_STATIC_DRAW_ARB);



//Bullet drawing, runs once per texture.



//first bullet of this texture
vertices.put(x);
vertices.put(y);
vertices.put(x);
vertices.put(bulletimg[imgid].origheight + y);
vertices.put(bulletimg[imgid].origwidth + x);
vertices.put(bulletimg[imgid].origheight + y);
vertices.put(bulletimg[imgid].origwidth + x);
vertices.put(y);


//second bullet of this texture
vertices.put(x2);
vertices.put(y2);
vertices.put(x2);
vertices.put(bulletimg[imgid].origheight + y2);
vertices.put(bulletimg[imgid].origwidth + x2);
vertices.put(bulletimg[imgid].origheight + y2);
vertices.put(bulletimg[imgid].origwidth + x2);
vertices.put(y2);


glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_TEXTURE_COORD_ARRAY);
glBindTexture(GL_TEXTURE_2D, image[imgid].texid);
ARBVertexBufferObject.glBindBufferARB(ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB,buffer[3]);
vertices.rewind();

//buffer the vertex data to a vbo
ARBVertexBufferObject.glBufferDataARB(ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB,vertices,ARBVertexBufferObject.GL_STREAM_DRAW_ARB);
glVertexPointer(2, GL_FLOAT, 0, 0);

//bind to the vbo containing texture data
ARBVertexBufferObject.glBindBufferARB(ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB,buffer[imgid]);
glTexCoordPointer(2, GL_FLOAT, 0, 0);

//draw the bullets
glDrawArrays(GL_QUADS, 0,bulletcount * 4);
glDisableClientState(GL_VERTEX_ARRAY);
glDisableClientState(GL_TEXTURE_COORD_ARRAY);



So every texture writes its own texture coordinates (which never change) into a VBO 6000 times. This is ugly as hell, but it’s a one-time action during loading. It does eat up memory though, and I’d like to get rid of it.
The problem is this though: if the array pointed to by glVertexPointer is larger than the array pointed to by glTexCoordPointer, which it is unless I write my data to texcoords 6000 times, the quads get drawn without the texture on them.

When it wants to draw a specific bullet type to the screen, it generates the vertices for every instance of that bullet; this data changes completely every frame. It then writes all these vertices to VBO number 3, which is shared between all the bullet types; it’s only needed once anyway. It then uses VBO 3 for the vertices and the texture’s own VBO for the texture coordinates.

I decided to try it this way because the texture coordinates never change, but the vertices always change; writing those two into the same VBO made no sense to me, because changing the vertices would require resending the texture coordinates as well.

What I would like to still achieve is:

  • Even more speed
  • Not have the same floats repeated 6000 times (in texcoords)

Anyone happen to know how I could clean this up?

My advice is to not worry about the memory element of it for now. 6000 sets of texcoords is small stuff - under 200 KB per texture - so you’d be investing time and effort for no meaningful return.
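The arithmetic behind that, as a quick check (a hypothetical helper, assuming 4 vertices of 2 floats per quad as in the code above):

```java
// Size of a texcoord buffer: each quad carries 4 vertices * 2 floats * 4 bytes.
public class TexcoordMemory {
    static int bytesFor(int quads) {
        return quads * 4 /* vertices */ * 2 /* floats */ * 4 /* bytes per float */;
    }

    public static void main(String[] args) {
        System.out.println(bytesFor(6000)); // prints 192000 (about 188 KB per texture)
    }
}
```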

Glad to hear that you got it running well. :slight_smile: