Part of the Khronos Group
OpenGL.org


Thread: Batch Batch Batch

  1. #1
    Newbie Newbie
    Join Date
    Dec 2013
    Posts
    3

    Batch Batch Batch

    I'm trying to visualize a model, but the frame rate is lower than expected.
    Before I pass the model back to the art designer to make it more batch-friendly, I'd like to understand the relationship between hardware specifications and maximum geometry input. Here is how I summarize it (assuming there are no texture changes):
    1. I understand that the CPU clock can determine the maximum batch count, but what does that relationship look like exactly?
    2. What determines the maximum number of vertices in one batch? How can I figure it out? (The art designer needs to know before optimizing.)
    3. Is the relationship between batching and efficiency something like "exceeding the hardware limit on either batch count or vertices per batch will lower the frame rate"?

    Many thanks!

  2. #2
    Intern Newbie
    Join Date
    May 2013
    Posts
    41
    Draw calls are the main limit in OpenGL and Direct3D, because of how much work the drivers have to do in the background before hitting a CPU limit. This also has the disadvantage that your application gets less CPU time.
    This is one of the main reasons AMD is developing its Mantle API.

    I think the practical limits in OpenGL and D3D are around 6000 draw calls per frame at 60 Hz. I'm not sure which CPUs that maps to.
    There are also some extensions from NVIDIA that give better draw-call performance, but I have never had to work on a draw-call-limited project, so this is all from memory of what I've read.


    The maximum vertex count on current PC hardware is limited only by the raw power of the GPU. For example, my 7870 reports 4294967296 (2^32) for GL_MAX_ELEMENTS_VERTICES, which is basically the maximum number of vertices you can address with the largest index type. But vertex/index order can be optimized for caching, and that is something you use algorithms for, e.g.: https://home.comcast.net/~tom_forsyt...cache_opt.html

  3. #3
    Senior Member OpenGL Pro Aleksandar's Avatar
    Join Date
    Jul 2009
    Posts
    1,072
    Quote Originally Posted by likethe265 View Post
    1. I understand that the CPU clock can determine the maximum batch count, but what does that relationship look like exactly?
    How come the CPU clock can determine the maximum batch number?

    Quote Originally Posted by likethe265 View Post
    2. What determines the maximum number of vertices in one batch? How can I figure it out? (The art designer needs to know before optimizing.)
    The maximum vertex count is determined by the amount of memory. But that is probably not the information you seek.

    Quote Originally Posted by likethe265 View Post
    3. Is the relationship between batching and efficiency something like "exceeding the hardware limit on either batch count or vertices per batch will lower the frame rate"?
    There is no single answer to this question. There are myriad graphics cards, drivers, and HW/SW combinations. If you are targeting a particular audience, the answer should be found by profiling those configurations. As long as everything stays in graphics memory, performance is good; if swapping arises, smaller buffers will yield better performance.

    Quote Originally Posted by Osbios View Post
    Draw calls are the main limit in OpenGL and Direct3D, because of how much work the drivers have to do in the background before hitting a CPU limit.
    It is a rather frivolous assumption that the bottleneck is always the number of draw calls. It depends on the application design, and in most cases (with real, well-designed applications) it is not.

    Quote Originally Posted by Osbios View Post
    I think the practical limits in OpenGL and D3D are around 6000 draw calls per frame at 60 Hz.
    That is also a frivolous claim. It depends on many factors.

    Quote Originally Posted by Osbios View Post
    There are also some extensions from NVIDIA that give better draw-call performance.
    Yes, bindless extensions can significantly boost performance, primarily by decreasing cache misses and by allowing direct access to resident buffers.

    Quote Originally Posted by Osbios View Post
    But vertex/index order can be optimized for caching, and that is something you use algorithms for, e.g.: https://home.comcast.net/~tom_forsyt...cache_opt.html
    A long time has passed since Tom Forsyth published his work, and GPU architecture has changed a lot. The last time I tried it (several years ago, on the Fermi architecture) I got no improvement. So be careful.

  4. #4
    Senior Member OpenGL Pro
    Join Date
    Jan 2012
    Location
    Australia
    Posts
    1,104
    bindless extensions significantly boost performance

    Like everything else, I have not found this to be automatically true.

  5. #5
    Senior Member OpenGL Pro
    Join Date
    Jan 2012
    Location
    Australia
    Posts
    1,104
    Last time I tried it (several years ago) on Fermi architecture I got no improvements

    Aleksandar - is it your experience that re-ordering the vertex list is not a high priority in performance tuning?

  6. #6
    Senior Member OpenGL Pro Aleksandar's Avatar
    Join Date
    Jul 2009
    Posts
    1,072
    Quote Originally Posted by tonyo_au View Post
    Like everything else, I have not found this to be automatically true.
    It is true if there are tens of thousands of draw calls. The boost is not as large as NV reported (7.5x), but there is a boost (at least about 50%). Of course, it depends on the application. If the application is not CPU-bound, then there is certainly no improvement.

    Quote Originally Posted by tonyo_au View Post
    Aleksandar - is it your experience that re-ordering the vertex list is not a high priority in performance tuning?
    Indeed. Could you provide some other results? What GPU are you using, and how many vertices are there in your buffers?
    If you report any improvement over standard triangle strips with primitive restart, I'll be willing to repeat my own tests.

  7. #7
    Senior Member OpenGL Pro
    Join Date
    Jan 2012
    Location
    Australia
    Posts
    1,104
    but there is a boost (at least about 50%)


    The test I was doing drew 4000 batches of buffers, each with 10000 random two-point line segments. Each draw generates a new set of line segments. For this I saw no improvement over glBegin/glEnd. I assume it is because I am not able to reuse the buffer. I have been trying to improve my rendering of streamed point clouds, but everything I try gives about the same render time.

    standard triangle strips with primitive restart

    I have triangles from DTMs with between 100,000 and 1,000,000 vertices. I cannot create triangle strips. I was using the ATI library to optimize caching, but they are no longer supporting that library, so I was looking to write my own.
    My benchmarks to date don't seem to show any improvement on an NVIDIA GTX 770; that was why I was interested in your comment.

  8. #8
    Senior Member OpenGL Pro Aleksandar's Avatar
    Join Date
    Jul 2009
    Posts
    1,072
    Quote Originally Posted by tonyo_au View Post
    The test I was doing drew 4000 batches of buffers, each with 10000 random two-point line segments. Each draw generates a new set of line segments. For this I saw no improvement over glBegin/glEnd.
    Who says that a VBO can beat immediate mode for two-point lines?
    Compared to regular VBOs, resident buffers should be at least slightly faster.
    In your case the bottleneck is the number of function calls, and resident buffers cannot solve that. Immediate mode does not have cache misses, and you have an equal number of calls either way. It is great that the result is even comparable with immediate mode.

    Quote Originally Posted by tonyo_au View Post
    I have triangles from DTMs with between 100,000 and 1,000,000 vertices. I cannot create triangle strips. I was using the ATI library to optimize caching, but they are no longer supporting that library, so I was looking to write my own.
    My benchmarks to date don't seem to show any improvement on an NVIDIA GTX 770; that was why I was interested in your comment.
    There are (at least) two different caches for vertices: input and post-transform. The first is responsible for fetching attributes. Even there, locality is important, but what counts more is the post-transform cache, since it prevents re-execution of the vertex shader. That is why we talk about index reordering. But nowadays there are a lot of processing units in a GPU, and the calculation is spread over hundreds of them. Each operates on a fraction of the input buffer, so some vertices have to be duplicated. The order is still important, but it is not the whole story. It is preferable to use cache-oblivious methods for index ordering, since cache sizes, algorithms, and scheduling differ from vendor to vendor, from one card generation to the next, and even from one driver revision to the next. Also, caches are now larger than they were a decade ago. That is why I pointed out that we should not take anything for granted.

  9. #9
    Senior Member OpenGL Pro
    Join Date
    Jan 2012
    Location
    Australia
    Posts
    1,104
    Resident buffers

    What is a good use for these buffers?

    cache-oblivious methods for index ordering

    I don't quite understand what this means. Because of the way we triangulate the DTM, triangles with shared vertices are often near each other in the index buffer. Are you saying that this is good enough without doing any special processing of the index buffer?

  10. #10
    Newbie Newbie
    Join Date
    Dec 2013
    Posts
    3
    Many thanks! Can I understand it as:
    1. The CPU has its own limit on how many draw commands it can send to the GPU in a given time. For example, during a 1/60 s interval only around 6000 draw calls can be executed, and that assumes the CPU does nothing but send OpenGL commands. If the CPU also has to do per-frame work (application logic), this interval becomes much longer. In addition, CPU time should differ between processing 3000 draw calls and 600 draw calls, since execution is not parallel.
    2. When it comes to the maximum vertex count, different GPUs vary in their GL_MAX_ELEMENTS_VERTICES. If we exceed that count, an OpenGL error will be raised. Within the limit, processing time depends on the number of pipelines, the GPU clock, GPU memory swapping, and so on.
    3. The intention of cache optimization is to reduce the batch size and make the most use of the GPU.
    4. GPU and CPU processing can be considered parallel, because the CPU can send the next batch while the GPU is processing the previous one. But in most cases the GPU is almost starved, so the most practical optimization for a newbie is to optimize batching; the NVIDIA extensions are for this purpose.

    I gratefully welcome any correction.
