The time between issuing a command (calling an OpenGL function) and the GPU completing execution of that command may be long; it can be several frames. Between those two points, the command is “pending”. If you try to modify something (e.g. a texture) that is used by a pending command, the driver may just block until it has finished executing the command. So if you create a texture, draw something using that texture, modify it, draw something else, modify it again, draw something else, …, and you don’t want any part of that process to stall, the driver has to store every version of that data required by a pending command.
In some cases, the driver may automatically allocate storage for some of the intermediate versions. In other cases, it will just wait until any pending commands which were using the thing that you’re trying to modify have completed, which could take a long time. If you don’t want that to happen, then one way is to simply never modify anything; just create a new object and use that instead.
Not stalling the GPU boils down to sending it commands at least as fast as it is executing them. So you may need to avoid stalling the CPU in order to avoid stalling the GPU.
Yes. The texture won’t necessarily be updated on the GPU for some time after the call returns, but any commands which are enqueued after the glTextureSubImage2D command won’t be executed until after the texture has been updated. Also, if the data source is a PBO, any modification to the PBO’s contents needs to wait until the texture has been updated from the PBO.
If you replace the entire PBO with glBufferData(), the driver may choose to “orphan” the existing data store and allocate a new one. The data will immediately be copied from client memory into the new store; the old store will be freed automatically once any pending commands (e.g. glTexSubImage2D) using that data have completed. If you replace only a portion with glBufferSubData(), orphaning is unlikely; the driver will wait until pending commands using the modified region have completed before glBufferSubData() returns. Similarly, if you map the buffer for writing, the driver should wait until pending commands have completed. With persistent mappings, you have to handle synchronisation yourself.
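As a rough illustration of the whole-buffer replacement described above, here’s a minimal C++ sketch. The `GlFns` table, the `k`-prefixed constants and `upload_via_orphaned_pbo` are names made up for this example; the table just stands in for however your program loads GL entry points, so that the sketch compiles without GL headers. It assumes tightly packed RGBA8 pixels and a PBO and texture that already exist. Whether the old store actually gets orphaned is still up to the driver.

```cpp
#include <cstddef>

// Stand-ins so the sketch compiles without GL headers. In a real program,
// delete these and use the definitions from your GL header/loader instead.
using GLenum = unsigned int;
using GLuint = unsigned int;
using GLint = int;
using GLsizei = int;
using GLsizeiptr = std::ptrdiff_t;
constexpr GLenum kPixelUnpackBuffer = 0x88EC;  // GL_PIXEL_UNPACK_BUFFER
constexpr GLenum kStreamDraw        = 0x88E0;  // GL_STREAM_DRAW
constexpr GLenum kTexture2D         = 0x0DE1;  // GL_TEXTURE_2D
constexpr GLenum kRGBA              = 0x1908;  // GL_RGBA
constexpr GLenum kUnsignedByte      = 0x1401;  // GL_UNSIGNED_BYTE

// Hypothetical table of entry points (glBindBuffer, glBufferData, ...),
// assumed to be filled in by the application's own loader.
struct GlFns {
    void (*BindBuffer)(GLenum target, GLuint buffer);
    void (*BufferData)(GLenum target, GLsizeiptr size, const void* data,
                       GLenum usage);
    void (*BindTexture)(GLenum target, GLuint texture);
    void (*TexSubImage2D)(GLenum target, GLint level, GLint x, GLint y,
                          GLsizei w, GLsizei h, GLenum format, GLenum type,
                          const void* pixels);
};

// Replace the whole PBO with new pixels, then update the texture from it.
void upload_via_orphaned_pbo(const GlFns& gl, GLuint pbo, GLuint tex,
                             GLsizei w, GLsizei h, const void* rgba) {
    const GLsizeiptr bytes = GLsizeiptr(w) * h * 4;  // tightly packed RGBA8

    gl.BindBuffer(kPixelUnpackBuffer, pbo);
    // The entire data store is replaced, so the driver may orphan the old
    // store rather than wait for pending commands that still read from it.
    gl.BufferData(kPixelUnpackBuffer, bytes, rgba, kStreamDraw);

    gl.BindTexture(kTexture2D, tex);
    // With a PBO bound, the last argument is a byte offset into the buffer,
    // not a client pointer; nullptr means offset 0.
    gl.TexSubImage2D(kTexture2D, 0, 0, 0, w, h, kRGBA, kUnsignedByte, nullptr);

    gl.BindBuffer(kPixelUnpackBuffer, 0);  // restore client-memory uploads
}
```

Calling this once per frame with the same size keeps every update a whole-buffer replacement, which is the case where orphaning is plausible; switching to glBufferSubData() for partial updates would bring back the wait described above.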
Copies from client memory to GPU memory are synchronous: the function doesn’t return until the driver has finished reading the client data. Copies from GPU memory to GPU memory are appended to the command queue. The driver doesn’t wait for the copy to complete before the corresponding function returns, but commands enqueued by subsequent functions won’t start until the copy has completed (unless the driver can determine that the order doesn’t matter). If you’re using PBOs, it’s the glBufferData(), glBufferSubData(), glGetBufferSubData() or glMapBufferRange() functions which will introduce CPU-GPU synchronisation issues, not the glTex[Sub]Image() functions.
If you’re using multiple threads, you may not need a fence; you can just let the uploading thread block until the upload has completed. A fence is more useful with a single thread where you’re trying to determine whether the GPU has finished using some data, so that you can modify it without blocking.
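For the single-threaded case, a fence can be polled with a zero timeout so that the check itself never blocks. Here’s a minimal C++ sketch of that idea. `GlSyncFns`, `Slot`, `mark_in_use`, `is_free` and `pick_free_slot` are hypothetical names for this example, and the function-pointer table only stands in for your own GL loader (the real calls are glFenceSync, glClientWaitSync and glDeleteSync). A “slot” is whatever you want to reuse: a texture, a PBO, or a region of a buffer.

```cpp
// Stand-ins so the sketch compiles without GL headers. In a real program,
// delete these and use the definitions from your GL header/loader instead.
using GLenum = unsigned int;
using GLbitfield = unsigned int;
using GLuint64 = unsigned long long;
using GLsync = void*;
constexpr GLenum kSyncGpuCommandsComplete = 0x9117;  // GL_SYNC_GPU_COMMANDS_COMPLETE
constexpr GLbitfield kSyncFlushCommandsBit = 0x1;    // GL_SYNC_FLUSH_COMMANDS_BIT
constexpr GLenum kAlreadySignaled    = 0x911A;       // GL_ALREADY_SIGNALED
constexpr GLenum kConditionSatisfied = 0x911C;       // GL_CONDITION_SATISFIED

// Hypothetical table of entry points, filled in by the application's loader.
struct GlSyncFns {
    GLsync (*FenceSync)(GLenum condition, GLbitfield flags);
    GLenum (*ClientWaitSync)(GLsync sync, GLbitfield flags, GLuint64 timeout);
    void (*DeleteSync)(GLsync sync);
};

// One reusable resource plus the fence guarding its last use.
struct Slot {
    GLsync fence = nullptr;  // nullptr: no pending commands known
};

// Call right after issuing the last command that reads from the slot.
void mark_in_use(const GlSyncFns& gl, Slot& s) {
    if (s.fence) gl.DeleteSync(s.fence);
    s.fence = gl.FenceSync(kSyncGpuCommandsComplete, 0);
}

// Non-blocking check: true once the GPU has passed the slot's fence.
bool is_free(const GlSyncFns& gl, Slot& s) {
    if (!s.fence) return true;
    // Timeout 0 means "just poll". The flush bit makes sure the fence has
    // actually been submitted, so it can eventually signal.
    const GLenum r = gl.ClientWaitSync(s.fence, kSyncFlushCommandsBit, 0);
    if (r != kAlreadySignaled && r != kConditionSatisfied)
        return false;  // still pending (a failed wait is also treated as busy)
    gl.DeleteSync(s.fence);
    s.fence = nullptr;
    return true;
}

// Returns the index of the first slot that can be modified without
// blocking, or -1 if every slot is still in use by pending commands.
int pick_free_slot(const GlSyncFns& gl, Slot* slots, int count) {
    for (int i = 0; i < count; ++i)
        if (is_free(gl, slots[i])) return i;
    return -1;
}
```

If `pick_free_slot` returns -1, everything is still in use by the GPU; at that point you can either create another object and use that instead, or accept a blocking wait.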
By default, OpenGL behaves as if everything executes immediately. Any deferral (backgrounding) is transparent. Functions will wait if they need to wait. If you’re copying data back to client memory, that has to wait until the data is available. If you’re copying data from client memory, either the driver has to copy that to temporary storage, or whatever you’re overwriting must be “done with”, or the function will wait until that is the case.