glMultiDrawArrays from a buffer?

So, I’m writing a simple but high-poly game. Profiling shows that 99% of the OpenGL calls in my game are calls like the one below, and I’m convinced they account for a good bit of the lag:

glMultiDrawArrays(mode = GL_TRIANGLE_STRIP, first = [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76, 80, 84, 88, 92, 96, 100, 104, 108, 112, 116, 120, 124, 128, 132, 136, 140, 144, 148, 152, 156, 160, 164, 168, 172, 176, 180, 184, 188, 192, 196, 200, 204, 208, 212, 216, 220, 224, 228, 232, 236, 240, 244, 248, 252, 256, 260, 264, 268, 272, 276, 280, 284, 288, 292, 296, 300, 304, 308, 312, 316, 320, 324, 328, 332, 336, 340, 344, 348, 352, 356, 360, 364, 368, 372, 376, 380, 384, 388, 392, 396, 400, 404, 408, 412, 416, 420, 424, 428, 432, 436, 440, 444, 448, 452, 456, 460, 464, 468, 472, 476, 480, 484, 488, 492, 496, 500, 504, 508, 512, 516, 520, 524, 528, 532, 536, 540, 544, 548, 552, 556, 560, 564, 568, 572, 576, 580, 584, 588, 592, 596, 600, 604, 608, 612, 616, 620, 624, 628, 632, 636, 640, 644, 648, 652, 656, 660, 664, 668, 672, 676, 680, 684, 688, 692, 696, 700, 704, 708, 712, 716, 720, 724, 728, 732, 736, 740, 744, 748, 752, 756, 760, 764, 768, 772, 776, 780, 784, 788, 792, 796, 800, 804, 808, 812, 816, 820, 824, 828, 832, 836, 840, 844, 848, 852, 856, 860, 864, 868, 872, 876, 880, 884, 888, 892, 896, 900, 904, 908, 912, 916, 920, 924, 928, 932, 936, 940, 944, 948, 952, 956, 960, 964, 968, 972, 976, 980, 984, 988, 992, 996, 1000, 1004, 1008, 1012, 1016, 1020], count = [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4], drawcount = 256)

(from apitrace)

I feel like I would get significantly better performance if I didn’t send thousands of those to my GPU every frame. I looked into Indirect calls, but they seem to read from local main memory, not a buffer (like a VAO, etc).
What would be a good way to optimize this? Thanks.

[QUOTE=javaprophet;1282087]
I looked into Indirect calls, but they seem to read from local main memory, not a buffer (like a VAO, etc).
What would be a good way to optimize this? Thanks.[/QUOTE]

The glDraw*Indirect() calls have the same behaviour as their non-indirect counterparts.

In the compatibility profile, attributes are sourced from a buffer object if one was bound to GL_ARRAY_BUFFER at the time of the glVertexAttribPointer() call or from client memory otherwise. Indices are sourced from a buffer object if one is currently bound to GL_ELEMENT_ARRAY_BUFFER or from client memory otherwise. The indirect “command” structures are sourced from a buffer object if one is currently bound to GL_DRAW_INDIRECT_BUFFER or from client memory otherwise.

In the core profile, attributes, indices and commands can only be sourced from buffer objects.

Unlike the GL_ELEMENT_ARRAY_BUFFER binding, the GL_DRAW_INDIRECT_BUFFER binding is global (per-context) state; it is not stored in a VAO.
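
To make the buffer-sourced case concrete, here is a minimal sketch in C of the command records that glMultiDrawArraysIndirect reads from the buffer bound to GL_DRAW_INDIRECT_BUFFER. The struct layout is the one given in the OpenGL 4.3 specification; the helper name fill_strip_commands and its parameters are made up for illustration, and the sketch assumes every strip has the same length and the strips are packed back to back, as in the trace above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-draw record read by glMultiDrawArraysIndirect (OpenGL 4.3+). */
typedef struct {
    uint32_t count;         /* number of vertices in this draw */
    uint32_t instanceCount; /* 1 for ordinary, non-instanced drawing */
    uint32_t first;         /* index of the first vertex of this draw */
    uint32_t baseInstance;  /* 0 unless instanced attributes need an offset */
} DrawArraysIndirectCommand;

/* Hypothetical helper: fill `cmds` with `draw_count` records, one per strip
 * of `verts_per_strip` vertices, assuming the strips are stored contiguously. */
static void fill_strip_commands(DrawArraysIndirectCommand *cmds,
                                size_t draw_count,
                                uint32_t verts_per_strip)
{
    assert(cmds != NULL || draw_count == 0);
    for (size_t i = 0; i < draw_count; i++) {
        cmds[i].count = verts_per_strip;
        cmds[i].instanceCount = 1;
        cmds[i].first = (uint32_t)i * verts_per_strip;
        cmds[i].baseInstance = 0;
    }
}
```

Assuming a 4.3+ context, the array would be uploaded once with glBufferData while a buffer is bound to GL_DRAW_INDIRECT_BUFFER, and each frame would then issue glMultiDrawArraysIndirect(GL_TRIANGLE_STRIP, 0, 256, 0). With a buffer bound, the second argument is a byte offset into that buffer, and a stride of 0 means the records are tightly packed.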

It looks like you want to use glMultiDrawArrays to render parts of a vertex buffer without reorganizing the data, but that’s bad practice here. It behaves like calling glDrawArrays 256 times, just with less driver overhead. Pack your data contiguously into a single buffer and call glDrawArrays once (as GL_TRIANGLES, or with the strips joined by degenerate triangles, so that separate strips don’t get connected to each other); you’ll get much better performance.

Use indices to concatenate all of those individual triangle strips into a single draw call; you don’t need multi-draw at all for this.
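
One way to do that, sketched here in C under the assumption that every strip is an independent 4-vertex quad as in the trace above, is to expand each strip into two indexed triangles and draw everything as GL_TRIANGLES. The helper name quads_to_triangle_indices is hypothetical. The second triangle uses the order (2, 1, 3) so that it keeps the same winding the strip would have produced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: expand `quad_count` independent 4-vertex strips into
 * GL_TRIANGLES indices. `out` must have room for 6 * quad_count entries.
 * Returns the number of indices written. */
static size_t quads_to_triangle_indices(uint16_t *out, size_t quad_count)
{
    /* All 4 * quad_count vertices must be addressable with 16-bit indices. */
    assert(quad_count <= 65536 / 4);
    size_t n = 0;
    for (size_t q = 0; q < quad_count; q++) {
        uint16_t b = (uint16_t)(q * 4); /* first vertex of this quad */
        /* First triangle of the strip: v0, v1, v2. */
        out[n++] = b;
        out[n++] = (uint16_t)(b + 1);
        out[n++] = (uint16_t)(b + 2);
        /* Second triangle: v2, v1, v3 (same winding as in a strip). */
        out[n++] = (uint16_t)(b + 2);
        out[n++] = (uint16_t)(b + 1);
        out[n++] = (uint16_t)(b + 3);
    }
    return n;
}
```

With the result uploaded to the buffer bound to GL_ELEMENT_ARRAY_BUFFER, a single glDrawElements(GL_TRIANGLES, n, GL_UNSIGNED_SHORT, NULL) would then replace the whole multi-draw. This costs 6 indices per quad instead of 4 vertices, but needs no primitive restart.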

I’ve done this, and I’m trying to use Primitive Restart to separate strips.

glBindVertexArray(vao->vao);
glEnable(GL_PRIMITIVE_RESTART);
glPrimitiveRestartIndex(65535);
glDrawElements(GL_TRIANGLE_STRIP, vao->index_count, GL_UNSIGNED_SHORT, NULL);
glDisable(GL_PRIMITIVE_RESTART);

It causes a segfault in my FGLRX driver, which I assume is because primitive restart is not being turned on correctly, so the driver is reading 65535 as a real vertex index.

VAO creation code:

// Create the VAO and its buffers on first use; reuse them when overwriting.
if (!overwrite) glGenVertexArrays(1, &vao->vao);
glBindVertexArray(vao->vao);
if (!overwrite) {
	glGenBuffers(1, &vao->vbo);
	glGenBuffers(1, &vao->vib);
}
// Upload the vertex data.
glBindBuffer(GL_ARRAY_BUFFER, vao->vbo);
glBufferData(GL_ARRAY_BUFFER, count * (textures ? sizeof(struct vertex_tex) : sizeof(struct vertex)), verticies, GL_STATIC_DRAW);
// Build the index list: count vertices, plus one restart index between
// each pair of consecutive strips of `restart` vertices.
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, vao->vib);
vao->index_count = restart ? (count + (count / restart) - 1) : count;
vao->vertex_count = count;
uint16_t indicies[vao->index_count];
size_t vi = 0;
for (size_t i = 0; i < vao->index_count; i++) {
	if (restart && i > 0 && ((i + 1) % (restart + 1)) == 0) {
		indicies[i] = 65535; // restart marker after every `restart` vertices
	} else {
		indicies[i] = vi++;
	}
}
glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(uint16_t) * vao->index_count, indicies, GL_STATIC_DRAW);
// Fixed-function attribute pointers: position first, then texture coordinates.
glVertexPointer(3, GL_FLOAT, textures ? sizeof(struct vertex_tex) : sizeof(struct vertex), 0);
if (textures) glTexCoordPointer(2, GL_FLOAT, sizeof(struct vertex_tex), (void*) (sizeof(struct vertex)));
glEnableClientState(GL_VERTEX_ARRAY);
if (textures) glEnableClientState(GL_TEXTURE_COORD_ARRAY);
glBindVertexArray(0);
vao->tex = textures;

I’ve verified the contents of indicies; as far as I can see, it should work according to the spec. (Given a restart of 4, it contains vertex, vertex, vertex, vertex, 65535, vertex, vertex, vertex, vertex, etc.)

I haven’t a clue why it won’t work.
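
For checking the index list in isolation, here is a minimal standalone sketch in C of the same restart layout. It is not a diagnosis of the crash. It differs from the posted code in two ways, both of which are assumptions about what might matter: the array is heap-allocated rather than a variable-length array on the stack (which can get large at high vertex counts), and inputs where a real vertex index would collide with the restart value 65535 are rejected. The name build_restart_indices is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define RESTART_INDEX 0xFFFFu

/* Hypothetical helper: build indices 0..count-1 with RESTART_INDEX inserted
 * after every `strip_len` vertices (none after the last strip).
 * Returns a malloc'd array that the caller must free, or NULL on invalid
 * input or allocation failure. The length is stored in *out_len. */
static uint16_t *build_restart_indices(size_t count, size_t strip_len,
                                       size_t *out_len)
{
    /* count must be a whole number of strips, and no vertex index may
     * reach 65535, which is reserved for the restart marker. */
    if (count == 0 || strip_len == 0 || count % strip_len != 0 ||
        count > RESTART_INDEX)
        return NULL;

    size_t len = count + count / strip_len - 1;
    uint16_t *idx = malloc(len * sizeof *idx);
    if (idx == NULL)
        return NULL;

    size_t v = 0; /* next real vertex index */
    for (size_t i = 0; i < len; i++) {
        if ((i + 1) % (strip_len + 1) == 0)
            idx[i] = RESTART_INDEX;
        else
            idx[i] = (uint16_t)v++;
    }
    assert(v == count); /* every vertex was emitted exactly once */
    *out_len = len;
    return idx;
}
```

For count = 8 and strip_len = 4 this produces 0, 1, 2, 3, 65535, 4, 5, 6, 7, which matches the layout described above. The result can be passed to glBufferData for GL_ELEMENT_ARRAY_BUFFER and then freed.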