Register calling convention

The existing API uses the stdcall calling convention because this is the standard for Windows.
However, modern high performance compilers such as those from Borland usually default to using the Register calling convention because it is faster.

Stdcall, cdecl, safecall and pascal conventions need to create a stack frame and write all of the parameters to it every time you call the function, and add several extra instructions to manipulate the BP/SP registers during the call and again at the return.
They also require the parameters to be read from memory when used (using an offset address calculation) when they are probably already in registers.

The register convention passes the first 3 parameters directly, so unless the call has 4+ parameters the only thing written to the stack is the return address.
This makes the stack much smaller, limits cache pollution, reduces memory accesses and executes faster.

As a high-performance API, shouldn’t OpenGL 3.0 calls be as fast and efficient as possible?

This is not limited to the Intel implementations; most other processors are also faster with the register calling convention.

The OpenGL API is meant to be used from many languages, so it needs a calling convention that is supported by most of them. The convention used by the OS API is the obvious choice. As far as I know, the register calling convention differs in both the number of registers and the order in which they are used, even between various C compilers.
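
To illustrate how compilers disagree here, a minimal C sketch follows. The macro and function names (DEMO_STDCALL, DEMO_REGCALL, add3_std, add3_reg, check_conventions_agree) are hypothetical and not taken from any real header. It assumes that on 32-bit x86 MSVC's __fastcall passes the first two integer arguments in ECX and EDX, while GCC's regparm(3) attribute uses EAX, EDX and ECX, which is the order Borland's register convention uses. On any other target the macros expand to nothing and both functions fall back to the platform default.

```c
#include <assert.h>

/* Hypothetical macros: pick a stack-based and a register-based
   convention where the compiler offers one, else use the default. */
#if defined(_MSC_VER) && defined(_M_IX86)
#  define DEMO_STDCALL __stdcall
#  define DEMO_REGCALL __fastcall                  /* 2 args in ECX, EDX */
#elif defined(__GNUC__) && defined(__i386__)
#  define DEMO_STDCALL __attribute__((stdcall))
#  define DEMO_REGCALL __attribute__((regparm(3))) /* 3 args in EAX, EDX, ECX */
#else
#  define DEMO_STDCALL                             /* single platform convention */
#  define DEMO_REGCALL
#endif

/* Same source, different calling conventions on 32-bit x86. */
int DEMO_STDCALL add3_std(int a, int b, int c) { return a + b + c; }
int DEMO_REGCALL add3_reg(int a, int b, int c) { return a + b + c; }

/* The result is identical; only the way arguments travel differs. */
void check_conventions_agree(void)
{
    assert(add3_std(1, 2, 3) == add3_reg(1, 2, 3));
}
```

The two "register" branches are not binary compatible with each other, which is the practical problem for an API that has to be callable from many compilers and languages.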

Additionally you will hit performance limits long before you can make so many OGL calls/s that the calling convention has any performance effect.

That’s purely an implementation detail and nothing OpenGL should be concerned about.

Wait, Borland is still around?

www.borland.com Still going strong with new Vista compatible compilers for Delphi, C# & C++.

This would be one of those useless optimisations. API functions are not called often enough for the calling convention to matter, and if they are, you are doing something wrong. How many cycles would you win per function call? Two? Three? You should optimise critical sections of the code and leave the rest alone. Also, this would be a mess for compilers. Have you ever tried writing a pascal interface unit for a C++ library compiled with “register”?

P.S. Win64 uses some kind of universal calling convention that uses registers to pass parameters.

Zengar: How many cycles would you win per function call? Two? Three?
26 clock cycles, 14 instruction bytes, 8 memory accesses.

Komat: As far as I know, the register calling convention differs in both the number of registers and the order in which they are used, even between various C compilers.

Yes, if there is no standard then I agree that it’s not worth considering.

Zengar: Win64 uses some kind of universal calling convention that uses registers to pass parameters
Interesting, do you have a link to any info on this?
Even if M$ does set a standard “universal calling convention” for Win64 and force everyone to support it, once the first 3.0 application is written it will be too late to change the calling convention.

26 clock cycles, 14 instruction bytes, 8 memory accesses.
I wonder where you get those numbers…

Let us say we have three parameters, all of which have to be loaded from memory. So we have PUSH A; PUSH B; PUSH C; CALL f vs. MOV EAX, A; MOV EDX, B; MOV ECX, C; CALL f. For Athlon 64 - I have the docs right here - PUSH and MOV have basically the same latency of 3. If I understand it right, three MOVs will execute simultaneously, needing 3 clock cycles together. Three PUSHes are possible in 7 clock cycles. So you are saving nothing. Or have I made a mistake somewhere?

Interesting, do you have a link to any info on this?
The link won’t show up, but you can google for “Win64 calling convention”; the first hit has a link to an MSDN article.

Originally posted by Simon Arbon:
26 clock cycles, 14 instruction bytes, 8 memory accesses.
26 clock cycles. OK, let’s say you have 10,000 API calls every frame. Let’s say your app runs at 60fps. Let’s say your CPU is 3GHz. So we’re spending about 26 * 10,000 * 60 / 3,000,000,000 = 0.0052. So we’re talking about 0.5% of the total CPU time.
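
The estimate above can be written out as a small C helper. This is only a sketch: the function name call_overhead_fraction and its parameter names are made up for illustration, and the 26 cycles per call is the figure claimed earlier in the thread, not a measured value.

```c
#include <assert.h>

/* Fraction of one CPU core spent purely on call overhead.
   cycles_per_call   : extra cycles attributed to the calling convention
   calls_per_frame   : API calls issued each frame
   frames_per_second : target frame rate
   cpu_hz            : clock rate in cycles per second */
double call_overhead_fraction(double cycles_per_call,
                              double calls_per_frame,
                              double frames_per_second,
                              double cpu_hz)
{
    assert(cpu_hz > 0.0);
    return cycles_per_call * calls_per_frame * frames_per_second / cpu_hz;
}
```

Calling call_overhead_fraction(26, 10000, 60, 3e9) gives roughly 0.0052, i.e. about 0.5% of the CPU time.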

Zengar: I wonder where you get those numbers…
I am assuming that the calling routine calculated the 3 parameters and the compiler was smart enough to have allocated them to the correct registers for the call.
function Add3( A,B,C: integer ): integer; stdcall;
begin
Result := A+B+C; {Add 3 numbers together}
end;

Register
Add EDX,ECX
Add EAX,EDX
Ret

Call Add3Reg

StdCall
Push EBP
Mov EBP,ESP
Mov EAX,[EBP+$08]
Add EAX,[EBP+$0C]
Add EAX,[EBP+$10]
Pop EBP
Ret $000C

Push EDI
Push ESI
Push EBX
Call Add3std

As you can see, most of the extra cycles are in building & destroying the stack frame.
The clock counts are approximate as they vary wildly between different processors.

Humus: So we’re talking about 0.5% of the total CPU time.
Or 5% on a 300MHz processor. Unfortunately not every 8 year old can afford a brand new computer, and I’m writing a G rated game that has to satisfy a wide age group. I already have to turn off features on slower machines as I have run out of CPU clock cycles.

The function will still have to create a stack frame if it uses local variables (and most API calls will). Your example uses an overly simple function that should be inlined anyway. What if the function needs EAX to do some preprocessing before it can use the argument? Then it will have to push EAX anyway. The register calling convention is only useful when the called function is simple and small. Note that with AMD64 it is different, as we have more general-purpose registers.

P.S. If you are writing such a game, you won’t have 10,000 API calls per frame… basically, by your logic about 50% of the CPU time will be spent only on “calling” API functions, rendering all OSes useless.

Zengar: The function will still have to create a stack frame if it uses local variables
Oops, sorry, I forgot that most compilers work that way.
The one I use (a proprietary Object Pascal compiler) only puts local variables in a stack frame for recursive calls.
Usually it allocates extra space in the object for local method variables or uses a global block that is shared by subroutines that are in separate branches of the call tree.

P.S. If you are writing such a game, you won’t have 10,000 API calls per frame… basically, by your logic about 50% of the CPU time will be spent only on “calling” API functions, rendering all OSes useless.
I would scale the application so it does 10,000 calls on a 3GHz processor and 1,000 on a 300MHz processor.
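
To see what that scaling does to the overhead, here is a rough C sketch that works in whole cycles per frame. The function names cycles_per_frame and overhead_basis_points are hypothetical, and it again assumes the 26 cycles per call claimed earlier and a 60fps target.

```c
#include <assert.h>

/* Cycles one core has available during a single frame. */
long long cycles_per_frame(long long cpu_hz, int fps)
{
    assert(fps > 0);
    return cpu_hz / fps;
}

/* Call overhead per frame in basis points (hundredths of a percent)
   of the frame's cycle budget. */
long long overhead_basis_points(long long calls, int cycles_per_call,
                                long long cpu_hz, int fps)
{
    return calls * cycles_per_call * 10000 / cycles_per_frame(cpu_hz, fps);
}
```

With these assumptions, 10,000 calls at 300MHz come out at 520 basis points (5.2%), while 1,000 calls at 300MHz and 10,000 calls at 3GHz both come out at 52 basis points (0.52%), so scaling the call count with the clock rate keeps the relative overhead the same.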

The general consensus seems to be to stick with StdCall for 32-bit programs, but what’s happening with Win64? M$ might require a separate 64-bit version of the API that uses its “universal calling convention”.

If I understood it correctly, all applications on Win64 are required to use the universal calling convention (this should also simplify debugging). The 64-bit version of the API has been available for years already, just as Win64 has.