Better GPU program execution handling.
Currently OpenGL still works with individual draw calls.
It is as if the GPU cannot store a whole program and needs to be fed high-level functions call by call.
This can keep performance low and introduces an artificial barrier for complex operations.
Instead of sending calls to the GPU one by one, the following model could be used:
1 Shaders get compiled for the GPU.
2 The program, or pieces of the program, gets handed to the GPU.
3 The CPU says which part of the program on the GPU to execute and where to wait for synchronization, instead of doing this per call.
4 The GPU executes the pieces of the program.
In this model both the CPU and the GPU have the program that each needs to execute in their own memory, ready to go.
The CPU does not say which draw call to execute but which part of a program to execute; a part can be a bunch of draw calls.
The GPU executes the parts and then notifies the CPU when it has finished and can do the next part.
For optimal performance the GPU could maybe keep a table that maps each synchronization point ID to the work to execute, which could be multiple parts per synchronization point. Each part would contain a program execution starting point, a program execution length, and the number of times it needs to be looped. Each processor would also need a table of which other processor(s) to notify after each part is done, for each synchronization point.
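The two tables described above could be sketched roughly like this. It is only a minimal illustration in Python, not a real driver interface: the names Part, SYNC_TABLE, NOTIFY_TABLE and run_sync_point are made up for this example, and the offsets and processor names are placeholder values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Part:
    start: int           # program execution starting point (hypothetical offset)
    length: int          # program execution length
    loop_count: int = 1  # number of times the part is looped

# Hypothetical GPU-side table: synchronization point ID -> parts to execute.
SYNC_TABLE = {
    1: [Part(start=0, length=64), Part(start=64, length=32, loop_count=3)],
    2: [Part(start=96, length=16)],
}

# Hypothetical notification table: (sync ID, part index) -> processors to notify.
NOTIFY_TABLE = {
    (1, 1): ["CPU"],
    (2, 0): ["CPU", "audio"],
}

def run_sync_point(sync_id, execute, notify):
    """Execute every part of one synchronization point, notifying as parts finish."""
    for index, part in enumerate(SYNC_TABLE[sync_id]):
        for _ in range(part.loop_count):
            execute(part.start, part.length)
        for target in NOTIFY_TABLE.get((sync_id, index), []):
            notify(target, sync_id)

# Stand-ins for the hardware: record what would be executed and who is notified.
executed, notified = [], []
run_sync_point(
    1,
    execute=lambda start, length: executed.append((start, length)),
    notify=lambda target, sync_id: notified.append((target, sync_id)),
)
print(executed)
print(notified)
```

The CPU only names synchronization point 1; the looping and the notification afterwards are resolved from the tables on the executing processor.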
This should work much better than doing things per call.
Programmers can minimize the amount of waiting and synchronization required during runtime.
A function to tell the driver to upload a shader to the GPU and prepare it for execution would be necessary for the best combination of performance and programmability. The default would be to do this as early as possible, just after context creation.
To be able to change the executing program on the fly, it must be possible to compile the new shaders for the GPU, upload them to the GPU, and then say where in the program to check for and switch to the new shader. (The application programmer must insert a function call to mark this point; if there is no such call, the driver must not allow program changes while the program is executing.)
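The switching rule could look something like the sketch below. It is a plain Python simulation under assumptions of mine: GpuProgram, swap_points, request_shader_change and the on_part_start hook are hypothetical names, and shaders are just strings here.

```python
class GpuProgram:
    """Toy model of a program resident on the GPU (all names are hypothetical)."""

    def __init__(self, parts, swap_points, shader):
        self.parts = parts
        self.swap_points = set(swap_points)  # part indices where a switch is allowed
        self.active_shader = shader
        self.pending_shader = None
        self.running = False

    def request_shader_change(self, compiled_shader):
        # Without a marked switch point the driver refuses changes while running.
        if self.running and not self.swap_points:
            raise RuntimeError("no switch point marked; program is executing")
        self.pending_shader = compiled_shader

    def run(self, on_part_start=None):
        log = []
        self.running = True
        for index, part in enumerate(self.parts):
            if on_part_start is not None:
                on_part_start(index)  # lets the application act mid-run
            # Only check for a new shader at a point the programmer marked.
            if index in self.swap_points and self.pending_shader is not None:
                self.active_shader, self.pending_shader = self.pending_shader, None
            log.append((part, self.active_shader))
        self.running = False
        return log

program = GpuProgram(["shadow", "geometry", "post"], swap_points=[2], shader="v1")

def upload_new_shader(index):
    if index == 1:  # new shader arrives while "geometry" is about to run
        program.request_shader_change("v2")

swap_log = program.run(on_part_start=upload_new_shader)
print(swap_log)
```

The new shader is uploaded during part 1 but only takes effect at the marked point before part 2, so a part never changes shader halfway through.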
Of course, other APIs that use a similar old-fashioned execution model with fully capable processors need to be changed as well. Fixing these deficiencies with the new execution model would certainly help other processors too, for example audio processors.
It would be optimal to let the processors also notify each other when they have done their part of a program, without CPU communication in between. This way you can avoid extra synchronization delays caused by processorA > CPU > processorB and do this instead: processorA > CPU, processorB
e.g.
processorA does part 1 then notifies processorB
processorB does part 2 then notifies CPU
CPU does part 3 then notifies processorA or processorB to start part 4
processorA or processorB does part 4 then notifies CPU
CPU does part 5 and notifies processorA and processorB to both start their share of part 6
processorA and processorB do their share of part 6 and after completion both notify the CPU independently from each other
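The example sequence can be written down as a small table and replayed. This is a rough single-threaded Python sketch, not real concurrency: SCHEDULE, notifications and hop_count are names invented here, and processorA is picked arbitrarily for part 4.

```python
# part number -> (processors doing the part, processors notified afterwards)
SCHEDULE = {
    1: (["processorA"], ["processorB"]),
    2: (["processorB"], ["CPU"]),
    3: (["CPU"], ["processorA"]),  # assumption: the CPU picks A; B is equally valid
    4: (["processorA"], ["CPU"]),
    5: (["CPU"], ["processorA", "processorB"]),
    6: (["processorA", "processorB"], ["CPU"]),
}

def notifications(schedule):
    """List every (part, sender, receiver) notification in part order."""
    result = []
    for part in sorted(schedule):
        workers, targets = schedule[part]
        for worker in workers:
            for target in targets:
                result.append((part, worker, target))
    return result

def hop_count(messages, direct):
    """Count transfers; without direct links a non-CPU pair is relayed via the CPU."""
    total = 0
    for _, sender, receiver in messages:
        relayed = not direct and "CPU" not in (sender, receiver)
        total += 2 if relayed else 1
    return total

messages = notifications(SCHEDULE)
for part, sender, receiver in messages:
    print(f"part {part}: {sender} notifies {receiver}")
print("hops with direct notification:", hop_count(messages, direct=True))
print("hops with CPU relay:", hop_count(messages, direct=False))
```

In this small schedule only the part 1 notification goes between two non-CPU processors, so it is the only one that saves a hop by skipping the CPU.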