Part of the Khronos Group
OpenGL.org


Thread: VBOs strangely slow?

  1. #21
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,948

    Re: VBOs strangely slow?

    Have you tried explicit synchronization with NV_fence/ARB_sync and using GL_UNSYNCHRONIZED with glMapBufferRange?

  2. #22
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    3,126

    Re: VBOs strangely slow?

    Quote Originally Posted by Alfonse Reinheart
    Have you tried explicit synchronization with NV_fence/ARB_sync and using GL_UNSYNCHRONIZED with glMapBufferRange?
    No, sure hadn't. What do you envision here?

    Thought the whole purpose of MapBuffer NULL / UNSYNCHRONIZED is so the GPU can have multiple buffers in flight for the same buffer handle, and thus pipeline the buffer uploads, avoiding stalls. So I'm not seeing where fences come in.

    I also didn't test a technique that has been touted here for buffer upload speed-up (since this is such a trivial test app), and that's mapping the buffer in a foreground thread, taking the potentially multi-ms hit of the memcpy in a background thread, and then unmapping in the foreground thread, with ring-buffer work queues between the threads. But that's typically only useful if you've got other (typically GL) work to do in the foreground thread. This little test app's just gonna wait on the memcpy to unmap anyway because it has nothing better to do.

  3. #23
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,948

    Re: VBOs strangely slow?

    Quote Originally Posted by Dark Photon
    Thought the whole purpose of MapBuffer NULL / UNSYNCHRONIZED is so the GPU can have multiple buffers in flight for the same buffer handle, and thus pipeline the buffer uploads, avoiding stalls. So I'm not seeing where fences come in.
    GL_UNSYNCHRONIZED is not the same as GL_INVALIDATE.

    GL_INVALIDATE tells the implementation, "I don't care what was in the buffer before; I just want some memory!"

    GL_UNSYNCHRONIZED says, "I don't care that you may currently be using the buffer, and that my attempt to modify it while in use can have horrible consequences. I will take responsibility for making sure the buffer is not in use when I modify it, so give me a pointer already!"

    They're both solutions to the same basic problem (I rendered with a buffer last frame, and I want to change it and use it this frame), but with different needs. GL_INVALIDATE/glBufferData(NULL) is ultimately giving you two buffer objects: the one that's currently in use and the one you're writing to. GL_UNSYNCHRONIZED is all about using only one piece of memory to avoid the synchronization.

    The idea is that you fill up a buffer object, do something with it, and then set a fence. If you want to change the buffer, you check your fence. If the fence has not passed yet, you go do something else (and therefore this only works when you have "something else" that you could be doing). When the fence has passed, you can now fill the buffer.
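
    The fill / use / fence / check loop described above can be sketched without a GL context. This is only a model of the bookkeeping, not driver behaviour: FakeGpu and FencedBuffer are hypothetical names, and the "fence" is just a serial number. In real code the corresponding calls would presumably be glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0) to set the fence, glClientWaitSync with a zero timeout to poll it, and glMapBufferRange with GL_MAP_UNSYNCHRONIZED_BIT to get the pointer.

```cpp
#include <cassert>
#include <cstdint>

// GL-free stand-in for the GPU: commands complete in submission order,
// and a fence is the serial number of the last command issued before it.
struct FakeGpu {
    std::uint64_t submitted = 0;  // commands issued by the CPU so far
    std::uint64_t completed = 0;  // commands the "GPU" has finished

    std::uint64_t submit_draw() { return ++submitted; }
    // Stand-in for setting a fence after the commands issued so far.
    std::uint64_t insert_fence() const { return submitted; }
    // Stand-in for a non-blocking fence poll (zero timeout).
    bool fence_passed(std::uint64_t fence) const { return completed >= fence; }
    // Test hook: pretend the GPU retired one more command.
    void finish_one() { if (completed < submitted) ++completed; }
};

// One piece of memory, rewritten in place. The application, not the driver,
// guarantees the buffer is idle before writing (the UNSYNCHRONIZED contract).
struct FencedBuffer {
    std::uint64_t fence = 0;  // 0 means "never drawn from", so always passed

    // true: safe to map unsynchronized and write now.
    // false: go do something else and ask again later.
    bool try_begin_write(const FakeGpu& gpu) const {
        return gpu.fence_passed(fence);
    }

    // Call right after issuing the draw that reads this buffer.
    void mark_used(FakeGpu& gpu) {
        gpu.submit_draw();
        fence = gpu.insert_fence();
    }
};
```

    The point of returning false instead of blocking is the one made above: the approach only pays off when the application has other work to do while the fence is still pending.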

  4. #24
    Member Regular Contributor
    Join Date
    Apr 2006
    Location
    Irvine CA
    Posts
    299

    Re: VBOs strangely slow?

    GL_UNSYNCHRONIZED can allow for idioms where the client is generating a large number of small batches dynamically; it makes it much more efficient to stack them up one after another within a smaller number of larger VBOs. For example, you could have a 4MB VBO and map/write/unmap/draw several hundred times using that storage before ever having to orphan or fence, if you are processing kilobyte-ish batches of data.

    In this regard it's closer to the D3D NO_OVERWRITE hint. "Yes, I know I just wrote 512 bytes of stuff at offset 0, and maybe it hasn't been processed yet - I would like to go back in and write 1280 bytes of new stuff starting at offset 512 now in the same buffer... and I'd rather not have to wait." And so you repeat until you hit the end of the buffer - no hazards, no risks.

    Concurrency goes up, especially on a multi-threaded driver, when you can use the cheap operation more frequently than the expensive one (unsync map = cheap ... orphaning = less cheap).

    When this style makes sense (depends on your app), you can cut way down on the driver memory management work, since it just sees one particular size of buffer being orphaned / recycled, and those events are much less frequent than maps and unmaps.

    Ideally, you reach a steady state where the driver is round-robining between a few physical buffers of that one large size, allocations stop happening, and the driver need not care if you are blasting rand()-sized batches in various numbers into that storage.

    The key idea is really that careful fencing and sync efforts are only needed in the absence of orphaning and in cases where you are going back to *rewrite* some storage that may be pending drawing, like trying to do some sort of sub-section update to a mesh, say.

    The flip side of that is that you can do high performance dynamically generated batches of mixed sizes with no fences at all, and with low driver overhead, if you constrain your access patterns to only write/use any given segment of a buffer exactly once before the buffer is orphaned. This is a familiar pattern from the D3D playbook.
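
    A minimal sketch of the write-once-then-orphan access pattern described in this post, again with no GL calls so it can run anywhere. StreamCursor and Reservation are hypothetical names; the sketch only tracks offsets, ignores alignment, and assumes the caller does the actual work: orphan (for example glBufferData with a NULL pointer) when orphan_first is true, then map the returned range with glMapBufferRange and GL_MAP_UNSYNCHRONIZED_BIT.

```cpp
#include <cassert>
#include <cstddef>

// Result of asking for space in the streaming buffer.
struct Reservation {
    std::size_t offset;  // byte offset to map, write, and draw from
    bool orphan_first;   // true: orphan the buffer before mapping at offset 0
};

// Append-only cursor over one large buffer. Every byte range is handed out
// exactly once between orphans, so no fence is needed for correctness.
class StreamCursor {
public:
    explicit StreamCursor(std::size_t capacity) : capacity_(capacity) {}

    Reservation reserve(std::size_t bytes) {
        assert(bytes <= capacity_);  // a batch must fit in the buffer
        bool orphan = false;
        if (bytes > capacity_ - head_) {
            // Not enough room left: start over in fresh storage.
            head_ = 0;
            orphan = true;
        }
        Reservation r{head_, orphan};
        head_ += bytes;
        return r;
    }

private:
    std::size_t capacity_;
    std::size_t head_ = 0;  // first byte not yet handed out
};
```

    With a 4MB capacity and kilobyte-sized batches, reserve() would report orphan_first only once every few thousand calls, which is the "cheap operation far more often than the expensive one" ratio the post is after.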

  5. #25
    Junior Member Newbie
    Join Date
    Feb 2010
    Posts
    14

    Re: VBOs strangely slow?

    I'm afraid all these details are too much for this poor systems programmer. I'll play the ouija board, figure out code that works well on my own system, and not worry too much about other systems. Still..

    What I came up with in the end for the actual application (code here, but there's way too much of it) is to use two VBOs, for the variable data, which I switch between once per frame (using glMapBufferRange to invalidate if available, glBufferData otherwise), and a static_draw VBO for the quite static vertex grid. This works well enough; it's as fast as the ncurses output mode, which means about twice the speed of any other mode even counting immutable overhead.
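
    The per-frame switch between two VBOs, with a fallback when glMapBufferRange is missing, might be modelled roughly like this. It is not the application's actual code: DoubleBufferedVbo, InvalidatePath, and choose_invalidate_path are hypothetical names, and the buffer ids are whatever glGenBuffers returned to the caller.

```cpp
#include <cassert>

// Which call is used to discard the old contents before writing.
enum class InvalidatePath {
    MapBufferRange,  // glMapBufferRange with an invalidate flag
    BufferDataNull   // glBufferData with a NULL data pointer
};

// Prefer glMapBufferRange when the implementation offers it.
inline InvalidatePath choose_invalidate_path(bool has_map_buffer_range) {
    return has_map_buffer_range ? InvalidatePath::MapBufferRange
                                : InvalidatePath::BufferDataNull;
}

// Two buffer objects for the variable data, alternated once per frame so
// the one being written is never the one drawn from in the previous frame.
struct DoubleBufferedVbo {
    DoubleBufferedVbo(unsigned first, unsigned second)
        : ids{first, second}, write_index(0) {}

    // Flip to the other buffer and return its id for this frame's upload.
    unsigned next_frame() {
        write_index ^= 1;
        return ids[write_index];
    }

    unsigned ids[2];  // buffer object names supplied by the caller
    int write_index;  // index of the buffer written this frame
};
```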

    If you really want to see the actual code.. uh, the important functions would be swap_pbos in graphics.cpp, and render_shader/init_gl (shader branch, latter) in enabler_sdl.cpp, but I would suggest you stay away. For one thing, the code's embarrassing and impenetrable.

    I've also got ARB_sync in there, on the theory that blocking in SDL_GL_SwapBuffers is a very bad thing and I can't figure out a better way to limit framerates to what my (8600M) gpu can handle.

    But now you're saying display lists are likely to be faster? And the drivers will use multiple VBOs as appropriate if I just invalidate before mapping? Are those also true for ATI cards?

    Also, is there an ATI equivalent of bindless graphics?

  6. #26
    Junior Member Regular Contributor
    Join Date
    Nov 2009
    Location
    France
    Posts
    114

    Re: VBOs strangely slow?

    "But now you're saying display lists are likely to be faster?"
    -> Internally, the driver will convert the display list to a VBO. The main issue with display lists is that it is hard to predict when an implementation can optimize, because there are many corner cases in OpenGL...

    "And the drivers will use multiple VBOs as appropriate if I just invalidate before mapping? Are those also true for ATI cards?"
    -> yes. the implementation will reallocate a buffer and avoid any unnecessary synchronization overhead.

    "Also, is there an ATI equivalent of bindless graphics?"
    -> you can use vertex_array_object.
    Pierre B.
    AMD Fellow

  7. #27
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    3,126

    Re: VBOs strangely slow?

    Quote Originally Posted by Dark Photon
    Quote Originally Posted by Alfonse Reinheart
    Have you tried ... GL_UNSYNCHRONIZED with glMapBufferRange?
    ... Thought the whole purpose of MapBuffer NULL / UNSYNCHRONIZED is so the GPU can have multiple buffers in flight for the same buffer handle
    My apologies. I tested/meant INVALIDATE, but Alfonse said UNSYNCHRONIZED, and I merely copied and missed the distinction.

    And thanks Rob and Alfonse for the detailed responses! I learned a few things, and I'm sure I'm not alone.

  8. #28
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    3,126

    Re: VBOs strangely slow?

    Quote Originally Posted by Pierre Boudier
    "Also, is there an ATI equivalent of bindless graphics?"
    -> you can use vertex_array_object.
    But on NVidia, avoid using VAOs on top of bindless. Yes, it works, but in my experience, you'll pay a little perf for doing that (but test on your setup to be sure).

    Presumably bindless gives you the VAO speed-up, and without (I assume) a bazillion little VAOs floating around in the GL driver.

  9. #29
    Junior Member Newbie
    Join Date
    Feb 2010
    Posts
    14

    Re: VBOs strangely slow?

    Naturally, trying to use display lists ran into the problem that my vertex shader uses gl_VertexID, which appears not to be set when executing display lists.

    Is there a reasonable alternative? Some way of setting a per-vertex or per-primitive counter?

  10. #30
    Advanced Member Frequent Contributor
    Join Date
    Apr 2003
    Posts
    661

    Re: VBOs strangely slow?

    It's time for some new whitepapers from ATI/nVidia on how to deal with updating VBOs/UBOs/PBOs quickly. Clear up some myths and get the facts straight. I'm tired of guessing.
