Thread: Batch Batch Batch

  1. #11 - Newbie
    Quote Originally Posted by Aleksandar
    How come the CPU clock can determine the maximum batch number?
    I got it from an nVidia PPT on the web named "Batch Batch Batch". Can I simply think of it as: "the CPU works together with the GPU at the same time. The CPU sends OpenGL commands in sequence, the GPU executes the commands also in sequence but processes the primitives in parallel. The bottleneck always exists on the slower side. But on modern architectures the GPU is usually much faster, and the buffer swap will also be taken into account."
    BTW, would you be kind enough to elaborate on what a resident buffer is, and on the principle by which the bindless extension boosts performance?

  2. #12 - Dark Photon (Senior Member, OpenGL Guru)
    Quote Originally Posted by likethe265
    Can I understand it as:
    1. The CPU has its own limit on how many calls it can send to the GPU over time. For example, during a 1/60 s interval, only around 6000 draw calls can be executed, and even that is with the CPU doing nothing but sending OpenGL commands.
    Not "nothing". The CPU (via the driver) performs a lot of "prep work" to get ready for issuing draw calls, often termed validation. For instance, pushing needed buffers/textures onto the GPU which haven't already been uploaded, pinning CPU-side memory buffers, activating new shader programs, uploading uniform blocks, binding textures, resolving block handle/offset pairs into GPU addresses, etc. along with performing all manor of "is this state valid" checks.

    The amount of this work varies by driver, draw submission method, and the number of intervening state changes. Even with all that held constant, it consumes different amounts of time depending on the specific CPU and CPU memory speeds involved, so any "max N draw calls/frame" heuristic you might come up with is going to vary based on the system and the specifics of your example.

    In any case, with smaller and smaller batches (with the above factors held constant), you do hit a point where you are CPU bound submitting batches and not GPU bound. You'd of course like to avoid this case because you could be rendering more or more interesting content with that time.
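    For illustration, here is a rough sketch of the usual mitigation: sort the draws by their expensive state so the driver only has to re-validate when something actually changes. The Draw struct and drawScene() are hypothetical names, and GL context/loader setup and error checking are omitted.

    #include <algorithm>
    #include <vector>
    // GL headers / function loader assumed to be set up elsewhere.

    struct Draw {
        GLuint  program;     // shader program
        GLuint  texture;     // texture bound to unit 0
        GLuint  vao;         // vertex array object
        GLsizei indexCount;  // number of indices in this batch
    };

    void drawScene(std::vector<Draw>& draws)
    {
        // Group draws that share expensive state so it is set once, not once per draw.
        std::sort(draws.begin(), draws.end(), [](const Draw& a, const Draw& b) {
            if (a.program != b.program) return a.program < b.program;
            if (a.texture != b.texture) return a.texture < b.texture;
            return a.vao < b.vao;
        });

        GLuint prog = 0, tex = 0, vao = 0;
        for (const Draw& d : draws) {
            if (d.program != prog) { glUseProgram(d.program);                 prog = d.program; }
            if (d.texture != tex)  { glBindTexture(GL_TEXTURE_2D, d.texture); tex  = d.texture; }
            if (d.vao != vao)      { glBindVertexArray(d.vao);                vao  = d.vao; }
            glDrawElements(GL_TRIANGLES, d.indexCount, GL_UNSIGNED_INT, nullptr);
        }
    }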

    3. The intention of cache optimization is to reduce the batch size and make the most use of the GPU.
    Not to reduce the batch size, no. Several types of caching were discussed above; I'll assume you meant vertex cache optimization. The reason for optimizing that is that fewer total vertices need to be transformed by the GPU to render your mesh.

    Background: At any point in your frame, you'll be bottlenecked on something. Previously, we were discussing what would make you CPU bound (the GPU is somewhat academic here). However, in cases where you are GPU bound (what you want), you could be bottlenecked on a few things. One of these things is vertex transform rate. There's a limit to how fast GPUs can transform vertices, and if you feed the GPU properly, you can hit this limit. If you are vertex transform bound, and if you can reduce the number of vertices the GPU has to transform to render your mesh, then you can speed up the time required to render your mesh. That's the point of vertex cache optimization.
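    As an illustration of what vertex cache optimization is working against, here is a minimal sketch (my own, assuming a FIFO cache of 32 entries; real post-transform caches differ) that estimates the average cache miss ratio for a given index order, i.e. how many vertices the GPU would have to transform per triangle:

    #include <cstddef>
    #include <deque>
    #include <vector>

    // ACMR = transformed (missed) vertices per triangle; lower is better, 0.5 is the ideal.
    double averageCacheMissRatio(const std::vector<unsigned>& indices, std::size_t cacheSize = 32)
    {
        if (indices.size() < 3) return 0.0;
        std::deque<unsigned> cache;   // FIFO model of the post-transform cache
        std::size_t misses = 0;

        for (unsigned idx : indices) {
            bool hit = false;
            for (unsigned c : cache)
                if (c == idx) { hit = true; break; }
            if (!hit) {
                ++misses;                                 // this vertex has to be (re)transformed
                cache.push_back(idx);
                if (cache.size() > cacheSize) cache.pop_front();
            }
        }
        return static_cast<double>(misses) / (indices.size() / 3.0);
    }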

    4. GPU and CPU processing can be deemed parallel because the CPU can send the next batch while the GPU is processing the last batch. But in most cases the GPU is left hungry.
    In an application written without regard to performance, perhaps. But once optimized this is typically not the case. You do figure out how to "keep the GPU busy" without wasting too much time on state changes or irrelevant content while still meeting your frame rate requirements.

  3. #13 - Aleksandar (Senior Member, OpenGL Pro)
    Quote Originally Posted by tonyo_au
    What is a good use for these buffers?
    Resident buffers reduce the cache misses involved in setting vertex array state by using pointers (GPU addresses) instead of object IDs.
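    Roughly, the bindless path (GL_NV_vertex_buffer_unified_memory plus GL_NV_shader_buffer_load) looks like the sketch below. It is written from memory of the extension specs, so treat it as an outline rather than a verified snippet; the wrapper function and parameter names are mine, and the loader and error handling are omitted.

    void drawWithBindlessVertexBuffer(GLuint vbo, GLsizeiptr bufferSizeBytes, GLsizei vertexCount)
    {
        GLuint64EXT vboAddr = 0;

        // Done once, up front: make the buffer resident and cache its GPU address.
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
        glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &vboAddr);

        // At draw time: feed attribute 0 from the raw address; no buffer bind, no ID lookup.
        glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
        glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float));
        glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, vboAddr, bufferSizeBytes);
        glDrawArrays(GL_TRIANGLES, 0, vertexCount);
    }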

    Quote Originally Posted by tonyo_au
    I don't quite understand what this means. Because of the way we triangulate the DTM, triangles with shared vertices are often near each other in the index buffer. Is what you are saying that this is good enough without doing any special processing of the index buffer?
    Cache-oblivious methods are "unaware" of the size and organization of the cache. They usually only assume that the probability of finding a vertex in the cache is inversely proportional to the time elapsed since it was processed. If your processing preserves the spatial locality of the vertices, it should have a high post-transform vertex cache hit rate. Considering what you've said, it seems you are using an irregular grid for the DTM. What kind of visualization are you doing? AFAIK regular grids are more GPU friendly than irregular ones. For the last decade I've been using only regular grids for terrain rendering.
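    As a sketch of why a regular grid is cache friendly (illustrative code, names are mine): indexing the height field row by row means consecutive triangles keep reusing vertices from the current and previous row, so the post-transform cache hit rate stays high without any reordering pass.

    #include <cstdint>
    #include <vector>

    // Two triangles per grid cell, emitted row by row (winding depends on your convention).
    std::vector<std::uint32_t> gridIndices(std::uint32_t rows, std::uint32_t cols)
    {
        std::vector<std::uint32_t> idx;
        if (rows < 2 || cols < 2) return idx;
        idx.reserve(std::size_t(rows - 1) * (cols - 1) * 6);

        for (std::uint32_t r = 0; r + 1 < rows; ++r) {
            for (std::uint32_t c = 0; c + 1 < cols; ++c) {
                std::uint32_t i0 = r * cols + c;   // this row
                std::uint32_t i1 = i0 + 1;
                std::uint32_t i2 = i0 + cols;      // next row
                std::uint32_t i3 = i2 + 1;
                idx.insert(idx.end(), { i0, i2, i1, i1, i2, i3 });
            }
        }
        return idx;
    }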

    Quote Originally Posted by likethe265
    1. The CPU has its own limit on how many calls it can send to the GPU over time. For example, during a 1/60 s interval, only around 6000 draw calls can be executed, and even that is with the CPU doing nothing but sending OpenGL commands. If the CPU also has to do any per-frame processing (application logic), this interval would be much longer. In addition, the CPU time should differ between processing 3000 draw calls and 600 draw calls, since the execution is not parallel.
    Why do you consider the CPU single-cored? Issuing commands to the GL driver should come from a single thread only (to avoid synchronization problems and a loss of performance), but other work should be done on (several) other threads. The overhead of many draw calls is the same as that of many regular function calls, plus the cache pollution imposed by state changes and ID resolution.
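    A minimal sketch of that threading pattern (the DrawItem struct and names are assumptions; the GL calls are left as comments): worker threads build plain-data draw items, and only the render thread that owns the context ever touches GL.

    #include <mutex>
    #include <vector>

    struct DrawItem { unsigned vao, program, indexCount; };   // plain data; no GL calls here

    std::mutex            gQueueMutex;
    std::vector<DrawItem> gPending;     // filled by worker threads (culling, sorting, ...)

    void workerThread()                 // any number of these may run concurrently
    {
        DrawItem item{};                // built from scene traversal / culling on the CPU
        std::lock_guard<std::mutex> lock(gQueueMutex);
        gPending.push_back(item);
    }

    void renderThread()                 // the single thread that owns the GL context
    {
        std::vector<DrawItem> local;
        {
            std::lock_guard<std::mutex> lock(gQueueMutex);
            local.swap(gPending);       // take the queued work, release the lock quickly
        }
        for (const DrawItem& d : local) {
            (void)d;                    // the only place GL is called, e.g.:
                                        // glUseProgram(d.program); glBindVertexArray(d.vao);
                                        // glDrawElements(GL_TRIANGLES, d.indexCount, GL_UNSIGNED_INT, nullptr);
        }
    }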

    Quote Originally Posted by likethe265
    2. When it comes to the maximum vertex count, different GPUs vary in their 'GL_MAX_ELEMENTS_VERTICES'. If we exceed that count, an OpenGL error will be caused. On the other hand, if we stay within 'GL_MAX_ELEMENTS_VERTICES', the processing time will depend on the number of pipelines, GPU clock, GPU memory swapping and so on.
    Of course this limit should not be exceeded, but my advice is to stay far below it in order to prevent a significant performance loss if object swapping occurs.
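    A hedged sketch of that advice (the wrapper function, totalIndexCount/totalVertexCount and the divisor are my assumptions): query the advisory limits with glGetIntegerv and keep each glDrawRangeElements call well below them. A real splitter would also narrow the start/end vertex range per chunk, which is glossed over here.

    #include <algorithm>
    // GL headers / function loader assumed; an indexed mesh is assumed to be bound already.

    void drawIndexedInChunks(GLsizei totalIndexCount, GLuint totalVertexCount)
    {
        GLint maxVertices = 0, maxIndices = 0;
        glGetIntegerv(GL_MAX_ELEMENTS_VERTICES, &maxVertices);
        glGetIntegerv(GL_MAX_ELEMENTS_INDICES,  &maxIndices);

        GLsizei chunk = maxIndices / 4;          // stay well below the advertised limit
        for (GLsizei first = 0; first < totalIndexCount; first += chunk) {
            GLsizei count = std::min(chunk, totalIndexCount - first);
            glDrawRangeElements(GL_TRIANGLES,
                                0, totalVertexCount - 1,   // simplification: a per-chunk range would be tighter
                                count, GL_UNSIGNED_INT,
                                (const void*)(first * sizeof(GLuint)));
        }
    }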

    Quote Originally Posted by likethe265
    3. The intention of cache optimization is to reduce the batch size and make the most use of the GPU.
    Nope! The point is to increase the batch size. The larger the batches, the fewer the state changes. That was the point of the document you mentioned.
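    One illustrative way to get fewer, larger batches (the PackedMesh bookkeeping struct is hypothetical, and building the packed buffers is not shown): pack meshes that share the same program, textures and vertex format into one VBO/IBO, then submit them with a single glMultiDrawElements instead of one glDrawElements each.

    #include <cstddef>
    #include <vector>
    // GL headers / function loader assumed to be set up elsewhere.

    struct PackedMesh { GLsizei indexCount; GLsizei firstIndex; };   // where each mesh lives in the shared IBO

    void drawPacked(GLuint sharedVao, GLuint sharedProgram, const std::vector<PackedMesh>& meshes)
    {
        std::vector<GLsizei>     counts;
        std::vector<const void*> offsets;
        for (const PackedMesh& m : meshes) {
            counts.push_back(m.indexCount);
            offsets.push_back((const void*)(std::size_t(m.firstIndex) * sizeof(GLuint)));
        }

        glBindVertexArray(sharedVao);     // one set of state changes for the whole group
        glUseProgram(sharedProgram);
        glMultiDrawElements(GL_TRIANGLES, counts.data(), GL_UNSIGNED_INT,
                            offsets.data(), (GLsizei)counts.size());
    }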

    Quote Originally Posted by likethe265
    4. GPU and CPU processing can be deemed parallel because the CPU can send the next batch while the GPU is processing the last batch. But in most cases the GPU is left hungry. So the most feasible optimization for a newbie is to optimize the batches; the nVidia extension is for this purpose.
    Which extension? What does it mean to "optimize the batch"? The number of batches should be decreased while their size can be increased. Bindless just increases the number of batches that can be handled in real time by reducing the cache pollution involved in resolving IDs.

  4. #14 - tonyo_au (Senior Member, OpenGL Pro)
    ...using an irregular grid for the DTM. What kind of visualization are you doing? AFAIK regular grids are more GPU friendly than irregular ones
    We develop road design software. Our clients have both irregular and regular grid DTMs. We can create regular grid overlays from the original data, and these display much faster as I can easily do LOD with tessellation on them. Unfortunately, a lot of clients feel these are too inaccurate for their modelling. The other new trend is to use point clouds for the DTM, which is giving me more grief from a speed viewpoint.

  5. #15 - Aleksandar (Senior Member, OpenGL Pro)
    I'm sorry for hijacking the thread, but terrain rendering has been my preoccupation for a very long time, and I'm always up for a discussion on the topic.
    Quote Originally Posted by tonyo_au
    We develop road design software.

    Sounds interesting. How large is the terrain you have to visualize? Is it a single "plate", or an out-of-core model?

    Quote Originally Posted by tonyo_au
    We can create regular grid overlays from the original data, and these display much faster as I can easily do LOD with tessellation on them.
    Personally, I don't like the tessellation shader approach, since it confines the solution to SM5 hardware only and restricts the block size to 64x64 at best. I'm using blocks at least 4x bigger.

    Quote Originally Posted by tonyo_au
    Unfortunately a lot of clients feel these are too inaccurate for their modelling.

    There is no such thing as "feelings" in computer science. You could justify your approach with a screen-space error metric. Or, if they don't believe your proofs, you could render the same scene with both approaches.

    Quote Originally Posted by tonyo_au
    The other new trend is to use point clouds for the DTM which is giving me more grief from a speed view point
    I guess it is a consequence of LIDAR usage.
    Well, I suppose you could convert it to a regular grid (with several MIP layers) using some appropriate tool and a high enough sampling frequency (depending on whether it is an urban area or wilderness).
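    A rough sketch of that conversion (the Point struct, names and mean-height policy are my assumptions): bin every point into a cell at the chosen resolution and keep the average height per cell; coarser MIP layers can then be built by averaging 2x2 blocks of the result.

    #include <cstddef>
    #include <vector>

    struct Point { float x, y, z; };

    // Returns one height per cell, row-major; empty cells stay 0 (fill/interpolate as needed).
    std::vector<float> rasterizeToGrid(const std::vector<Point>& pts,
                                       float minX, float minY, float cellSize,
                                       int cols, int rows)
    {
        std::vector<float> sum(std::size_t(cols) * rows, 0.0f);
        std::vector<int>   cnt(std::size_t(cols) * rows, 0);

        for (const Point& p : pts) {
            int cx = int((p.x - minX) / cellSize);
            int cy = int((p.y - minY) / cellSize);
            if (cx < 0 || cx >= cols || cy < 0 || cy >= rows) continue;
            sum[std::size_t(cy) * cols + cx] += p.z;
            cnt[std::size_t(cy) * cols + cx] += 1;
        }
        for (std::size_t i = 0; i < sum.size(); ++i)
            if (cnt[i]) sum[i] /= cnt[i];
        return sum;
    }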

    The test I was doing was drawing 4000 draws of buffers with 10000 random 2-point line segments.
    It seems like you are trying to visualize the point cloud directly by drawing a line segment between each pair of vertices. I'm suggesting you use some tool (or your own code) to convert the point cloud to a grid and then visualize it in the standard way.

  6. #16 - tonyo_au (Senior Member, OpenGL Pro)
    How large is the terrain you have to visualize

    It varies a lot, from a simple intersection to a DTM covering 100-150 km of highway and surrounds at a resolution of 250 mm - hence irregular DTMs, as not all the data is at this fine a resolution. We also dynamically merge DTMs to reflect various cuts, like a new road through a hill.
    There is no such a thing as "feelings" in computer science

    You obviously haven't dealt with engineers a lot. These are people who demand collision detection between a manhole and a pipe down to 100 mm and then dig the hole with a back-end loader.

    consequence of LIDAR usage
    This is not my direct responsibility, but with a lot of pre-processing we are displaying 4 billion points in real time with just glBegin/glEnd and clever LOD. I am just looking at other ways of managing the data in OpenGL to see how much of the pre-processing we can remove and still get real-time rendering.
    I don't like a tessellation shader approach

    This was just quick to implement and gave a big boost to performance. I will be revisiting this later in the year when one of the other programmers has finished a new data structure for regular grids.

  7. #17 - malexander (Member, Regular Contributor)
    Of course this limit should not be exceeded, but my advice is to stay far below it in order to prevent a significant performance loss if object swapping occurs.
    This is especially true for AMD cards. While they report a GL_MAX_VERTICES of 2^32-1, and GL_MAX_ELEMENTS_VERTICES of 2^24-1 (16M), if you use multiple vertex buffers with sizes greater than roughly 100K vertices, you'll start to see performance degradation. In an extreme example, I had a 20M point, 88M element array model that took almost 60 seconds to draw on a FirePro W8000 if I didn't dice the model up into small vertex array chunks first (an Nvidia Quadro K5000 took 225ms). I was using element arrays of 300K elements, but vertex arrays of full size.

    Once I'd diced the vertex buffers up into 100K vertex chunks (requiring ~200 draw calls and VAO changes), the AMD W8000 drew it in 62ms (Nvidia K5000, ~100ms). This was also mirrored in the consumer cards I tested (Radeon 6950, GeForce 670). Both AMD and Nvidia drivers performed better with many small batches, but the nearly 1000x performance increase on the AMD card suggests that their driver is much more sensitive to buffer sizes than Nvidia's (which was 2.25x faster).
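    For what it's worth, the dicing described above can be sketched roughly like this (my own illustration, not malexander's code; the Vertex layout and the 100K default are assumptions): walk the triangles, start a new chunk whenever the current one would exceed the vertex budget, and remap indices to be chunk-local. Each chunk then gets its own VBO/IBO/VAO and its own draw call.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Vertex { float px, py, pz, nx, ny, nz; };   // assumed layout

    struct Chunk {
        std::vector<Vertex>        vertices;   // at most maxVerts entries
        std::vector<std::uint32_t> indices;    // chunk-local indices
    };

    std::vector<Chunk> diceMesh(const std::vector<Vertex>& verts,
                                const std::vector<std::uint32_t>& indices,
                                std::size_t maxVerts = 100000)
    {
        std::vector<Chunk> chunks(1);
        std::vector<std::uint32_t> remap(verts.size(), UINT32_MAX);   // old index -> chunk-local index

        for (std::size_t i = 0; i + 2 < indices.size(); i += 3) {
            if (chunks.back().vertices.size() + 3 > maxVerts) {       // conservative: a triangle adds <= 3 vertices
                chunks.emplace_back();
                std::fill(remap.begin(), remap.end(), UINT32_MAX);
            }
            Chunk& c = chunks.back();
            for (int k = 0; k < 3; ++k) {
                std::uint32_t old = indices[i + k];
                if (remap[old] == UINT32_MAX) {                       // first use of this vertex in the chunk
                    remap[old] = std::uint32_t(c.vertices.size());
                    c.vertices.push_back(verts[old]);
                }
                c.indices.push_back(remap[old]);
            }
        }
        return chunks;   // upload each chunk as its own VBO/IBO/VAO and draw it separately
    }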
