Batch Batch Batch

I’m currently trying to visualize a model, but the frame rate is lower than I’d like.
Before I pass the model back to the art designer to make it more batch-friendly, I have no clear idea about the relationship between ‘hardware specification’ and ‘maximum geometry input’. Here is how I would summarize my questions (assuming there are no texture changes):

  1. I understand that the CPU clock can determine the maximum batch count, but what exactly does that relationship look like?
  2. What determines the maximum number of vertices in one batch? How can I figure it out? (The art designer needs to know before optimizing.)
  3. Can I understand the relationship between batching and efficiency as: “exceeding either the hardware’s batch-count limit or its per-batch vertex limit will lower the frame rate”?

Many thanks!

Draw calls are the main limit in OpenGL and Direct3D because of how much work the drivers have to do in the background, so you quickly hit a CPU limit. This also has the disadvantage that your application itself gets less CPU time.
This is one of the main reasons AMD developed their Mantle API.

I think the practical limits in OpenGL and D3D are around 6000 draw calls per frame at 60 Hz. I’m not sure which CPUs that maps to.
There are also some extensions from NVIDIA to get better draw-call performance. But I have never had to work on draw-call-limited projects, so this is all copy/paste from my reading memory. :wink:

The maximum vertex count on current PC hardware is only limited by the raw power of the GPU. For example, my 7870 gives me 4294967296 (2^32) for GL_MAX_ELEMENTS_VERTICES, which is basically the maximum number of vertices you can index with the biggest index type. But vertex/index order can be optimized for caching, and that is something you use algorithms for, e.g.: https://home.comcast.net/~tom_forsyth/papers/fast_vert_cache_opt.html
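You can query those values on your own hardware instead of taking my numbers; a quick sketch (note these are only the recommended limits for glDrawRangeElements, not hard caps, and it assumes a current GL context plus the usual GL/cstdio headers):

[code]
GLint maxVerts = 0, maxIndices = 0;
// Recommended (not hard) limits for glDrawRangeElements on this driver.
glGetIntegerv(GL_MAX_ELEMENTS_VERTICES, &maxVerts);
glGetIntegerv(GL_MAX_ELEMENTS_INDICES,  &maxIndices);
printf("GL_MAX_ELEMENTS_VERTICES = %d\n", maxVerts);
printf("GL_MAX_ELEMENTS_INDICES  = %d\n", maxIndices);
[/code]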

How come the CPU clock can determine the maximum batch number?

The maximum vertex count is determined by the amount of memory. But that is probably not the information you seek.

There is no single answer to this question. There is a myriad of graphics cards, drivers and HW/SW combinations. If you are targeting a particular audience, the answer should be found by profiling those configurations. As long as everything stays in graphics memory, performance is good. If swapping arises, then smaller buffers will give better performance.

[QUOTE=Osbios;1257049]Draw calls are the main limit in OpenGL and Direct3D because of how much work the drivers have to do in the background, so you quickly hit a CPU limit.[/QUOTE]
It is a rather frivolous assumption that the bottleneck is always in the number of draw calls. It depends on the application design, and in most cases (with real, well-designed applications) it is not.

It is also a frivolous claim. It depends on many factors.

Yes, bindless extensions can significantly boost performance, primarily by decreasing cache misses and allowing direct access to resident buffers.

A long time has passed since Tom Forsyth published his work. GPU architecture has changed a lot. The last time I tried it (several years ago), on the Fermi architecture, I got no improvement. So, be careful.

bindless extensions significantly boost performance

Like everything else, I have not found this to be automatically true.

The last time I tried it (several years ago), on the Fermi architecture, I got no improvement

Aleksandar - Is it your experience that re-ordering your vertex list is not a high priority in performance tuning?

It is true if there are tens of thousands of draw calls. The boost is not as big as NV reported (7.5x), but there is a boost (at least about 50%). Of course, it depends on the application. If the application is not CPU bound, then there is no improvement for sure.

Indeed. Could you provide some other results? What GPU are you using, and how many vertices are there in your buffers?
If you report any improvement over standard triangle strips with primitive restart, I’ll be willing to repeat my own tests.

but there is a boost (at least about 50%)

The test I was doing was drawing 4000 draws of buffers with 10000 random two-point line segments each. Each draw generates a new set of line segments. For this I saw no improvement over glBegin/glEnd. I assume it is because I am not able to reuse the buffer. I have been trying to improve my rendering of streamed point clouds, but everything I try gives about the same render time.
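Roughly, each iteration of the test looks like this (a simplified sketch, not my exact code; fillWithRandomSegments and streamVbo are just placeholder names):

[code]
// Simplified sketch of the streaming test; assumes GL headers and <vector>.
std::vector<float> verts(10000 * 2 * 3);      // 10000 segments, 2 points each, xyz

glBindBuffer(GL_ARRAY_BUFFER, streamVbo);
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, (void*)0);    // positions come from the bound VBO

for (int draw = 0; draw < 4000; ++draw)
{
    fillWithRandomSegments(verts);            // new set of segments every draw

    // Re-specify the store each time (orphans the old data so the driver
    // does not have to wait for the previous draw to finish with it).
    glBufferData(GL_ARRAY_BUFFER, verts.size() * sizeof(float),
                 verts.data(), GL_STREAM_DRAW);

    glDrawArrays(GL_LINES, 0, 10000 * 2);
}
[/code]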

standard triangle strips with primitive restart

I have triangles from DTMs with between 100,000 and 1,000,000 vertices. I cannot create triangle strips. I was using the ATI library to optimize caching, but they are no longer supporting that library, so I was looking to write my own.
My benchmarks to date don’t seem to show any improvement on an nVidia GTX 770; that was why I was interested in your comment.

Who says that a VBO can beat immediate mode for two-point lines? :slight_smile:
Compared to regular VBOs, resident buffers should be, at least, slightly faster.
In your case the bottleneck is in the number of function calls. Resident buffers cannot solve that. Immediate mode does not have cache misses, and you have an equal number of calls. It is great that the result is even comparable with immediate mode. :slight_smile:

[QUOTE=tonyo_au;1257072]I have triangles from DTMs with between 100,000 and 1,000,000 vertices. I cannot create triangle strips. I was using the ATI library to optimize caching, but they are no longer supporting that library, so I was looking to write my own.
My benchmarks to date don’t seem to show any improvement on an nVidia GTX 770; that was why I was interested in your comment.[/QUOTE]
There are (at least) two different caches for vertices: input and post-transform. The first one is responsible for fetching attributes. Even there locality is important, but what counts more is the post-transform cache, since it prevents re-execution of the vertex shader. That’s why we are talking about index reordering. But nowadays there are a lot of processing units in a GPU. The calculation is spread over hundreds of them, each operating on a fraction of the input buffer, so some vertices have to be duplicated. The order is still important, but it doesn’t carry you as far as it used to. It is preferable to use cache-oblivious methods for index ordering, since cache sizes, algorithms and scheduling differ from vendor to vendor, from one card generation to the next, and even from driver revision to revision. Also, the cache size is now greater than it was a decade ago. That’s why I pointed out that we should not take anything for granted.
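If you want to quantify how much reordering could buy you, you can estimate the post-transform hit rate offline by simulating a small FIFO cache over your index buffer. A rough sketch (the cache size of 32 is just a guess; real hardware differs, which is exactly why cache-oblivious orderings are preferable):

[code]
// Rough ACMR (average cache miss ratio) estimate: transformed vertices per triangle.
// Simulates a FIFO cache; treat the result as a relative figure only.
#include <algorithm>
#include <deque>
#include <vector>

double estimateACMR(const std::vector<unsigned>& indices, size_t cacheSize = 32)
{
    std::deque<unsigned> cache;
    size_t misses = 0;
    for (unsigned idx : indices)
    {
        if (std::find(cache.begin(), cache.end(), idx) == cache.end())
        {
            ++misses;                      // the vertex shader would run again here
            cache.push_back(idx);
            if (cache.size() > cacheSize)
                cache.pop_front();
        }
    }
    // ~3.0 means no reuse at all; values closer to 0.5-1.0 indicate good reuse.
    return indices.empty() ? 0.0 : double(misses) / (indices.size() / 3.0);
}
[/code]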

Resident buffers

What is a good use for these buffers?

cache-oblivious methods for index ordering

I don’t quite understand what this means. Because of the way we triangulate the DTM, triangles with shared vertices are often near each other in the index buffer. Are you saying that this is good enough without doing any special processing of the index buffer?

Many thanks! Can I understand it as follows:

  1. The CPU has its own limit on how fast it can send draw commands to the GPU. For example, during a 1/60 s interval, only around 6000 draw calls can be executed, and moreover this assumes the CPU is doing nothing but sending OpenGL commands. If the CPU also has to do any per-frame processing (application logic), this interval would be much longer. In addition, the CPU time spent should be different for 3000 draw calls versus 600 draw calls, since the execution is not parallel.
  2. When it comes to the maximum vertex count, different GPUs vary in their ‘GL_MAX_ELEMENTS_VERTICES’. If we exceed that count, an OpenGL error will be raised. On the other hand, if we stay within ‘GL_MAX_ELEMENTS_VERTICES’, the processing time will depend on the number of pipelines, the GPU clock, GPU memory swapping and so on.
  3. The intention of cache optimization is to reduce the batch size and make the most use of the GPU.
  4. GPU and CPU processing can be deemed parallel, because the CPU can send the next batch while the GPU is processing the last batch. But in most cases the GPU is almost starved. So the most practical approach for a newbie’s optimization is to optimize the batches; the nVidia extension is for this purpose.

I gratefully welcome any corrections.

I got it from an nVidia PPT on the web named “Batch Batch Batch”. Can I simply think of it as: “the CPU works together with the GPU at the same time. The CPU sends OpenGL commands in sequence, and the GPU executes the commands also in sequence but processes the primitives in parallel. The bottleneck always exists on the slower side. But on modern architectures the GPU is usually much faster, and buffer swapping also has to be taken into account.”
BTW, would you be kind enough to elaborate on what a resident buffer is, and what the principle is behind the bindless extensions boosting performance?

[QUOTE=likethe265;1257122]Can I understand it as follows:

  1. The CPU has its own limit on how fast it can send draw commands to the GPU. For example, during a 1/60 s interval, only around 6000 draw calls can be executed, and moreover this assumes the CPU is doing nothing but sending OpenGL commands.[/QUOTE]

Not “nothing”. The CPU (via the driver) performs a lot of “prep work” to get ready for issuing draw calls, often termed validation. For instance: pushing needed buffers/textures onto the GPU which haven’t already been uploaded, pinning CPU-side memory buffers, activating new shader programs, uploading uniform blocks, binding textures, resolving block handle/offset pairs into GPU addresses, etc., along with performing all manner of “is this state valid” checks.

The amount of this work varies by driver, draw submission method, and the number of intervening state changes; and even with all that held constant, it consumes different amounts of time depending on the specific CPU and CPU memory speeds involved. So any “max N draw calls/frame” heuristic you might come up with is going to vary based on the system and the specifics of your content.

In any case, with smaller and smaller batches (the above factors held constant), you do hit a point where you are CPU bound submitting batches rather than GPU bound. You’d of course like to avoid this, because you could be rendering more (or more interesting) content with that time.
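For example, when a set of objects really does share the same vertex format, buffers, and shader, you can collapse many little draws into one submission. A rough sketch with glMultiDrawElements (the objects/sharedVao names are just placeholders, and it assumes all objects live in one shared vertex/index buffer):

[code]
// Collapse N per-object draws into a single call; assumes GL headers and <vector>.
// 'counts' and 'offsets' describe each object's slice of the shared index buffer.
std::vector<GLsizei>       counts(numObjects);
std::vector<const GLvoid*> offsets(numObjects);
for (int i = 0; i < numObjects; ++i)
{
    counts[i]  = objects[i].indexCount;
    offsets[i] = (const GLvoid*)(objects[i].firstIndex * sizeof(GLuint));
}

glBindVertexArray(sharedVao);
glMultiDrawElements(GL_TRIANGLES, counts.data(), GL_UNSIGNED_INT,
                    offsets.data(), numObjects);   // one call instead of N
[/code]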

  3. The intention of cache optimization is to reduce the batch size and make the most use of the GPU.

Not to reduce the batch size, no. Several types of caching were discussed above; I’ll assume you meant vertex cache optimization. The reason for optimizing that is that fewer total vertices then need to be transformed by the GPU to render your mesh.

Background: at any point in your frame, you’ll be bottlenecked on something. Previously, we were discussing what would make you CPU bound (the GPU is somewhat academic there). However, in cases where you are GPU bound (what you want), you could be bottlenecked on a few things. One of them is vertex transform rate. There’s a limit to how fast GPUs can transform vertices, and if you feed the GPU properly, you can hit that limit. If you are vertex-transform bound, and you can reduce the number of vertices the GPU has to transform to render your mesh, then you can reduce the time required to render it. That’s the point of vertex cache optimization.

  4. GPU and CPU processing can be deemed parallel, because the CPU can send the next batch while the GPU is processing the last batch. But in most cases the GPU is almost starved.

In an application written without regard to performance, perhaps. But once optimized, this is typically not the case. You figure out how to “keep the GPU busy”, without wasting too much time on state changes or irrelevant content, while still meeting your frame-rate requirements.

Resident buffers reduce the cache misses involved in setting vertex array state by using pointers (GPU addresses) instead of object IDs.
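For reference, the NVIDIA-only setup looks roughly like this (GL_NV_vertex_buffer_unified_memory plus GL_NV_shader_buffer_load; a sketch from memory with placeholder names like vbo, Vertex and vboSizeInBytes, so check the extension specs before relying on it):

[code]
// Sketch only: NVIDIA bindless vertex attributes. Verify against the extension specs.
GLuint64EXT vboAddr = 0;

glBindBuffer(GL_ARRAY_BUFFER, vbo);
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);   // keep it resident in GPU memory
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &vboAddr);

glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex));   // attrib 0: position
glEnableVertexAttribArray(0);

// The draw path now passes a raw GPU address instead of a buffer ID, so the
// driver skips the name-to-object lookup (the cache misses mentioned above).
glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, vboAddr, vboSizeInBytes);
glDrawArrays(GL_TRIANGLES, 0, vertexCount);
[/code]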

Cache-oblivious methods are “unaware” of the size and organization of the cache. They usually only assume that the probability of finding a vertex in the cache is inversely proportional to the elapsed time since it was processed. If your processing preserves the spatial locality of the vertices, it should have a high post-transform cache hit rate. Considering what you’ve said, it seems you are using an irregular grid for the DTM. What kind of visualization are you doing? AFAIK regular grids are more GPU-friendly than irregular ones. In the last decade I’ve been using only regular grids for terrain rendering.

Why do you consider the CPU single-cored? Issuing commands to a GL driver should come only from a single thread (in order to avoid synchronization problems and decreased performance), but other work should be done by (several) other threads. The problem with many draw calls is the same as with many regular function calls, plus the cache pollution imposed by state changes and ID resolution.

Of course this limit should not be exceeded, but my advice is to stay far below it in order to prevent a significant performance loss if object swapping occurs.

Nope! The point is in increasing the batch size. The greater the batch, the fewer the state changes. That was the point of the document you mentioned.

[QUOTE=likethe265;1257122]4. GPU and CPU processing can be deemed parallel, because the CPU can send the next batch while the GPU is processing the last batch. But in most cases the GPU is almost starved. So the most practical approach for a newbie’s optimization is to optimize the batches; the nVidia extension is for this purpose.[/QUOTE]Which extension? What does it mean to “optimize the batch”? The number of batches should be decreased while their size can be increased. Using bindless just increases the number of batches that can be handled in real time, by reducing the cache pollution involved in resolving IDs.

…using an irregular grid for the DTM. What kind of visualization are you doing? AFAIK regular grids are more GPU-friendly than irregular ones

We develop road design software. Our clients have both irregular and regular grid DTMs. We can create regular grid overlays from the original data, and these display much faster, as I can easily do LOD with tessellation on them. Unfortunately a lot of clients feel these are too inaccurate for their modelling. The other new trend is to use point clouds for the DTM, which is giving me more grief from a speed viewpoint :sorrow:

I’m sorry for hijacking the thread, but terrain rendering has been my preoccupation for a very long time, and I’m always up for a discussion on the topic. :wink:

Sounds interesting. How large is the terrain you have to visualize? Is it a single “plate”, or an out-of-core model?

Personally, I don’t like the tessellation shader approach, since it confines the solution to SM5 hardware only and restricts the block size to 64x64 at best. I’m using blocks at least 4x bigger.

There is no such thing as “feelings” in computer science. You could prove your approach with a screen-space error metric. Or, if they don’t believe your proofs, you could render the same scene with both approaches. :slight_smile:

I guess it is a consequence of LIDAR usage. :wink:
Well, I suppose you could convert it to a regular grid (with several MIP levels) using some appropriate tool and a high enough sampling frequency (depending on whether it is an urban area or wilderness).

The test I was doing was drawing 4000 draws of buffers with 10000 random two-point line segments each.

It seems like you are trying to visualize the point cloud directly by drawing a line segment between each pair of vertices. I suggest you use some tool (or your own program code) to convert the point cloud to a grid and then visualize it in a standard way.
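In its simplest form that conversion is just binning the points into cells, something along these lines (a naive sketch that keeps the highest sample per cell; a real tool would do proper filtering and hole filling):

[code]
// Naive point-cloud -> regular height grid binning (keeps the max height per cell).
#include <cmath>
#include <limits>
#include <vector>

struct Point { float x, y, z; };

std::vector<float> binToGrid(const std::vector<Point>& cloud,
                             float minX, float minY, float cellSize,
                             int cols, int rows)
{
    std::vector<float> grid(cols * rows, std::numeric_limits<float>::lowest());
    for (const Point& p : cloud)
    {
        int cx = (int)std::floor((p.x - minX) / cellSize);
        int cy = (int)std::floor((p.y - minY) / cellSize);
        if (cx < 0 || cy < 0 || cx >= cols || cy >= rows)
            continue;                          // point falls outside the grid extents
        float& cell = grid[cy * cols + cx];
        cell = std::max(cell, p.z);            // or average/median, depending on the data
    }
    return grid;                               // empty cells stay at 'lowest' and need filling
}
[/code]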

How large is the terrain you have to visualize

It varies a lot, from a simple intersection to a DTM covering 100-150 km of highway and its surrounds at a resolution of 250 mm - hence irregular DTMs, as not all the data is at this fine a resolution. We also dynamically merge DTMs to reflect various cuts, like a new road through a hill.

There is no such thing as “feelings” in computer science

You obviously haven’t dealt with engineers a lot. These are people who demand collision detection between a manhole and a pipe down to 100 mm, then dig the hole with a back-end loader ;)

consequence of LIDAR usage

This is not my direct responsibility, but with a lot of pre-processing we are displaying 4 billion points in real time with just glBegin/glEnd and clever LOD. I am just looking at other ways of managing the data in OpenGL to see how much of the pre-processing we can remove and still get real-time rendering.

I don’t like the tessellation shader approach

This was just quick to implement and gave a big boost to performance. I will be revisiting it later in the year, when one of the other programmers has finished a new data structure for regular grids.

Of course this limit should not be exceeded, but my advice is to stay far below it in order to prevent a significant performance loss if object swapping occurs.

This is especially true for AMD cards. While they report a GL_MAX_VERTICES of 2^32-1 and a GL_MAX_ELEMENTS_VERTICES of 2^24-1 (16M), if you use multiple vertex buffers with sizes greater than roughly 100K vertices, you’ll start to see performance degradation. In an extreme example, I had a 20M-point, 88M-element-array model that took almost 60 seconds to draw on a FirePro W8000 if I didn’t dice the model up into small vertex array chunks first (an Nvidia Quadro K5000 took 225 ms). I was using element arrays of 300K elements, but vertex arrays of full size.

Once I’d diced the vertex buffers up into 100K-vertex chunks (requiring ~200 draw calls and VAO changes), the AMD W8000 drew it in 62 ms (Nvidia K5000, ~100 ms). This was also mirrored in the consumer cards I tested (Radeon 6950, GeForce 670). Both AMD and Nvidia drivers performed better with many small batches, but the nearly 1000x performance increase on the AMD card suggests that their driver is much more sensitive to buffer sizes than Nvidia’s (which was 2.25x faster).
Once I’d diced the vertex buffers up into 100K vertex chunks (requiring ~200 draw calls and VAO changes), the AMD W8000 drew it in 62ms (Nvidia K5000, ~100ms). This was also mirrored in the consumer cards I tested (Radeon 6950, GEForce 670). Both AMD and Nvidia drivers performed better with many small batches, but the nearly 1000x performance increase on the AMD card suggests that their driver is much more sensitive to buffer sizes than Nvidia (which was 2.25x faster).