win2K and slow opengl?!?

07-28-2002, 07:02 AM

I just set up my PC for dual boot between 98 and Win2K. I have been working on an OpenGL app in 98 for a while, and when I run the same app in Win2K the frame rate sinks a lot. Being dual boot the hardware is the same, I set the video resolutions (and depth) the same for both OS's, I installed the latest NVIDIA drivers on both (I'm using a GF2 Pro), and I checked to make sure I was getting the same pixel format on both. I ran some tests and here are the results:

Win 98 (Full pipeline 115K poly): 14 f/s
Win 2K (Full pipeline 115K poly): 8 f/s

Then I tried removing my call to glDrawArrays to see if the code leading up to the rendering was slow. Here are the results:

Win 98 (No render 115K poly): 149 f/s
Win 2K (No render 115K poly): 166 f/s

That's strange... I'm using the same compiled exe and the same level files, so I wouldn't think this could be a caching issue (both would see it, right?).

Has anyone seen anything like this before? Any ideas?

Thanks... :)


07-28-2002, 08:22 AM
How are your AGP drivers on W2K?

We usually find the same or better performance on Win2k than on 16-bit Windows. Of course, that could be due to overlap with other parts of our program.

07-28-2002, 09:55 AM
Well, I've run into the same problem here before too. I personally count Win2K as the more stable but slower system... Win98 and its siblings are exactly the other way round: relatively fast, but crash-prone.

It could be very many things... mainly, as far as I know, the memory management is a bit slower on Win2K... better, but slower. The drivers could certainly also be one of the reasons... but so far, everybody I know who runs both systems has told me that everything is a bit slower on Win2K... I don't think OpenGL or your program is responsible for this.


07-28-2002, 03:33 PM

I think you're correct, I ran a few NVIDIA demos on both OS's and 98 was always faster. Oh well...

I hate to sound paranoid, but is this some kind of Microsoft conspiracy?



07-29-2002, 01:24 AM
Hehe, well, I don't think that Microsoft deliberately made Win2K slower. But its kernel is simply so different that it can easily account for a 10% difference like this. After all, Win NT / Win2K was never intended for playing games, but for workstation or server use. I think you know what I'm trying to say.

You should port it to Linux and check your FPS there ;). It will surely take you a week until you have everything set up, but your FPS should be quite a bit higher than on Win98 ;).

The OS makes the difference :), you really shouldn't worry about Win2K... people are used to things being a little slower on that OS, hehe. But if it's also far slower on WinXP... then you should begin to worry, as that OS... unfortunately... will surely become more and more the standard. :)


07-29-2002, 02:52 AM
john_at_kbs_is, both Win98 *and* Win2K look very slow. Are you doing anything fancy? Lots of spotlights, etc?

I mean

Win 98 (Full pipeline 115K poly): 14 f/s ~= 1.6 million polys per second.
Win 2K (Full pipeline 115K poly): 8 f/s ~= 900,000 polys per second.

I routinely get close to 20 million polys per second [single textured, 1 directional light, not *quite* so many polys per frame], with a GeForce2 GTS and a PIII 750. I feel like I'm banging on the same drum again, but are you sure that you're getting a hardware accelerated pixel format? Please see my posts at http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/006961.html (near the bottom)

Both your problems may in fact be the same. Follow the given advice and see what you find out. I would hope you can achieve rendering throughput an order of magnitude greater than what you have been getting thus far (unless you are doing lots of fancy stuff).


Nicolas Lelong
07-29-2002, 07:54 AM
Hi everyone,

I wanted to add a question of mine about a recurring problem I have, which may be similar.

Whenever my system (Win2K + GeForce2 Pro) is too heavily loaded with 3D, any GL applications I launch crawl!

For example, if I have 3dsMAX running with a big scene in it and I launch my GL app, it crawls as if it were doing software rendering. If I close MAX and relaunch my app, everything is fine!... I tried testing the PFD to check if it is accelerated, and it keeps telling me it is...

Any ideas would be greatly welcome!

07-29-2002, 08:06 AM

Well, I didn't think I had a problem with 98. You say you're getting 20M polys? Are you running VARs, lists, or arrays? Are you using tri-strips, and on average how many tris are in one strip?

I hate comparing frame rates because it's never an apples-to-apples comparison. :)

I will check my pixel format, I made sure that they were the same, but I never checked the generic flag.

By the way, the results included 2 layers per poly, so the polys per second were more like 1.8M and 3.2M. Also, one of the layers is slower than the other: my vertex struct is 64 bytes, and in the slow layer I access data from the second 32 bytes, causing a second cache-line fetch per vertex. But this will happen on both OS's.

Can you post the info on your app and I'll check my pixel format.



07-29-2002, 08:09 AM

On top of the questions I just asked:
what data are you sending to GL (coords, texcoords, colors, normals)?



07-29-2002, 09:38 AM
John, here's some info from an old opengl app that I just ran.

122368 (tri-striped) polys per frame compiled into a display list
22-23M polys per second
[I'll let you work out the framerate]
1 256x256 texture only
1 directional light
1 normal per vertex
1 set of 2d texture coords per vertex
Don't think that vertex colours are defined.
Not sure of the length of the strips. Sorry. Running on Win2k.

It's basically a simple procedural landscape with some trees. Have got the latest nvidia drivers and vsync turned off (obviously!)

Maybe your 3.2M polys per second isn't that far off if you're not tri-stripping and have vsync turned on. I would expect (perhaps?) 1/3 of the performance if you're just rendering discrete triangles, and then you'll take another hit for vsync... I would definitely make sure that you're getting an accelerated pixel format though. Two people this week have complained about poor OpenGL performance. I downloaded *both* their code and found that they were both rendering in software on one of my machines and in hardware on the other. They had some uninitialised garbage values in the pixel format descriptor. Zeroing the structure before placing values into it fixed that.

Have you tried using glGetString? It will tell you whether you're running in software or hardware.

Another thing: I have always found that OpenGL runs slightly *faster* on NT and 2K than on 98. Direct3D is the exact opposite though - ever so slightly slower on 2K than on 98. Just my $0.02.

Please keep us posted...

07-29-2002, 09:40 AM
Nicolas Lelong: It is so slow because YOU HAVE SOFTWARE RENDERING then. You can only run one application using the 3D card at a time; the second one gets software rendering automatically. There are some cards... for professionals... which can host 5 and more applications at once, but a consumer card... and I think that's what you are using... only one.


07-29-2002, 09:55 AM
Vsync, hmmm... that I've never checked. I do use tri-strips, but I average only about 5 tris per strip (still cuts verts down by ~50%). Also, if I run only the fast layer I get 33 frames (single layer); that's like 3.8M.

Oh, as for glGetString, I do that when the app fires up. It always returns NVIDIA info and extensions. If I'm running in software, will it return Microsoft info?

I need to test the app some more when I get home tonight...



07-29-2002, 11:10 AM
@ BlackJack:

Come on, that can't be true. I just tried it, because I was quite sure you must be wrong. I ran the same app twice, and the framerate was just about half of the framerate I get when there's just one app running. And running gmax (in OpenGL mode) in the background doesn't change the framerate at all. Where did you get this information from?

07-29-2002, 04:49 PM

Did some checking and here is what I got:

- VSync was off on both OS's.
- The pixel format is cool and is not returning the generic format.

OK, I've been using this cool little app called wcpuid.exe to check my system's AGP stats. It told me both OS's were on AGP 2x. I have a VIA Apollo chipset (the 4x drivers never worked), but I tried the new drivers on Win2K and pow! AGP 4x! Now the app runs the same as in 98. So I installed the new drivers in 98 and pow! Wcpuid tells me it's now running 4x, but wait, the app isn't any faster. I think the goofy old drivers were running 4x but reporting 2x?

Now both OS's are running the same, even with the NVIDIA demos!

The question: is the 3.2M polys I'm getting now good? I'm using vertex arrays with glDrawArrays (not indexed). For layer one I'm sending coords (3 x float) and texcoords (2 x float), that's 20 bytes total, and for layer 2 I'm sending the same plus color (3 x float). For both layers that's 52 bytes. The data is always sent from system memory, and I'm only running PC133 memory.

kevinhoque, I don't know much about compiled display lists. Is it possible that some of the data is stored in video memory? Are you using indexed arrays?

I think at this point I've maxed out my memory bus... :)

Let me know what you guys think…

Thanks for the help…


Nicolas Lelong
07-29-2002, 10:44 PM

I was aware that software rendering was hiding somewhere - but the problem does not occur every time I have 2 3D apps launched. It only occurs when one (or more) uses a quite considerable amount of geometry (textures?).

I agree that there must be some kind of limitation somewhere. In fact, I wanted to know which limit it is. Obviously, it is not (only) the number of applications.


07-29-2002, 11:16 PM
If I understand it well, you are not using display lists, nor the VAR (NVIDIA) or VAO (ATI) extension - just plain OpenGL vertex arrays. And you get 3.2 MTris/sec? Sounds perfectly normal to me. If you think about it, the driver has to copy and transfer all your data to the video card every frame - what do you expect? Use display lists, or vendor-specific extensions, to speed your rendering up, and you'll easily multiply your framerate by 3 or 4.


07-30-2002, 05:06 AM
Ya, the number I got last night was:

(polys * polys_to_vert_tristrip_ratio * 52 bytes (my data) * frames) = min_data_across_mem_bus;

(115200 * 1.5 * 52 * 14) = 125.7984M bytes;

and I'm only running PC133. Not bad.

I tried using VARs and they are very useful; however, the amount of data that I have won't fit into video mem, and trying to copy vertex data into the buffer every so often is really slow. I've considered placing a subset of all the objects into video mem, but at any given point in time only maybe 2% of the data will get used in the current frame. Plus, using 16M of video mem will limit the amount of mem the card can use for caching textures. I'd hate to pull the same texture across the AGP bus twice in the same frame.

Stupid question: AGP mem is a chunk (or chunks) of my system mem, but the AGP bus bypasses the CPU, right? So storing data in AGP mem will still be limited to the PC133 speed, right? Just making sure I understand what's going on. There is nothing more dangerous in this field than having the wrong idea of how your hardware works... :)



07-30-2002, 06:15 AM
John, most definitely my geometry is being stored either on the card (likely) or in agp mem.

Can't you use NV_vertex_array_range and NV_fence if you are using a great deal of memory? What about allocating your vertex arrays in AGP mem? Surely you'll have more of this than video RAM? Not sure if there are any constraints here. I have not used VAR myself (yet), although I have used the equivalents in D3D. As to how AGP mem works - I don't know. But here's a link that might help

According to some NVIDIA docs that are kicking about, AGP mem can very often be considered just as fast as video RAM. [Although this may only be the case on systems with fast DDR or RAMBUS RAM - something they don't mention, so that's more than likely.] But as Ysaneya says, if you're just using vanilla vertex arrays then 3.2M polys/sec is probably about right...


07-30-2002, 07:44 AM
Sometimes drivers are very important. I had some hangs in my app, and changing the NVIDIA drivers made the hangs disappear...

08-02-2002, 08:41 AM
Cool, well I think I'm going to try to implement VARs.



08-02-2002, 09:17 AM
> I do use tri strips, but I average only
> about 5 tri per strip (still cuts verts
> down by 50%).

There is a fair bit of overhead per call issued to the driver (although the amount of overhead varies depending on whether you change the modelview, enable states, etc).

Also, if you issue the same triangle strip as a triangle index list, the verts will cache just as well as in the strip, and the number of transformed verts will be the same, so it really doesn't cost measurably more.

Chances are that if you're currently using strips of 5 triangles, bunching all your strips into one big triangle list instead may be faster.

08-02-2002, 10:10 AM
I've looked into this a few times, and I'm not sure that it would help me much. I only looked into NVIDIA's caching, and I think it only caches a handful of verts (around 4-10?). My strips are not laid out such that verts repeat frequently enough to help; I estimated that at best I would see a 1%-2% repeat likelihood. Normally any increase is worth a quick change, but here I'm looking at a major overhaul. :(

So indexing is out…

Thanks anyway…


08-02-2002, 12:20 PM

Think again. If you're drawing a strip like this:

1 3 5 7

2 4 6 8

Then the stripification of that would look like:

1 2 3 4 5 6 7 8

The triangle list would look something like:

1 2 3 2 4 3 3 4 5 4 6 5 5 6 7 6 8 7

Note that the cache utilization here is 66%, which means that your actual vertex transform throughput will be exactly the same as for the strip. However, when you batch multiple strips into one triangle list, you will save on driver call overhead.

08-02-2002, 12:38 PM
I do see what you are saying; however, the 66% utilization only brings me back up to my current performance. My chances of hitting any of verts 1-8 are not very good right now - back to my 1%-2%. So my performance would be 67%-68%, where right now it is the equivalent of 66%. Besides, if the tris received a bad sort I wouldn't be guaranteed the original 66%.


08-02-2002, 08:16 PM
You're saying you're using several tristrips, each of which is very small.

I'm telling you to use a single, big triangle list.

Assuming the vertex cost is the same, making a single large buffer call is typically more efficient than many small buffer calls, unless "big" goes beyond some per-card limit, which is in the thousands for even the most restrictive card.

Now, if each tri strip needs its own modelview and texture state, then you might want to start thinking about pre-transform and texture sheeting to be able to pack it into a single triangle list to make it go faster.

AGP 4X is a gigabyte per second, give or take. PC133 can (just barely) be provoked to go that fast, in a single direction.

[This message has been edited by jwatte (edited 08-02-2002).]

08-03-2002, 10:03 AM
Originally posted by jwatte:

Note that the cache utilization here is 66%.

How did you calculate 66%? Or is this a given number that you are quoting?


08-03-2002, 01:52 PM

I do agree with you that the call overhead stinks, but until it becomes my major bottleneck I can't justify a major overhaul.


08-04-2002, 07:45 AM

If you look at the triangle list, each triangle is drawn with two old verts and one previously un-issued vert. Thus there's a 66% cache hit rate, except for the first triangle (assuming the vertex cache is at least 4 elements deep :-).

08-05-2002, 05:01 AM
I know this is a little off the original topic, but:

If my current bottleneck is the memory bus, and I'm going to use VARs to render parts of my level faster, I'm assuming that the CPU and the AGP bus will fight for use of the system memory. That means that if the VAR data is stored in AGP memory, I will still be limited to the system memory speed and not the AGP bus speed. Is this correct, or can the AGP bus and the CPU access the memory at the same time? That would effectively double the system memory speed if used properly. I've read a lot on Intel's site about AGP; however, a lot of that info is based on evenly distributing work between the CPU and GPU, which is not the problem right now.



08-07-2002, 10:38 AM
Yes, more stupid questions... :)

I've been reading a lot of AGP info on the Intel site, and the more I read the more confused I get. AGP 4x is said to have around a 1 GB/s peak transfer rate, but my memory is PC133 - a 133M peak transfer (right?). So is the blazing speed something seen only by people with faster memory buses, or can the AGP bus magically access the memory faster?

Also, if my system memory is the bottleneck, then the AGP bus and the CPU will still be fighting for the same 133M, right?


08-07-2002, 12:02 PM
PC133 memory has a peak throughput (unidirectional) of about 1 GB/second, so AGP 4x is well matched to it. It's 8 bytes per clock (64 bits wide) at a 133 MHz clock.

AGP memory access will indeed contend with the CPU for memory bus bandwidth. However, AGP DMA is sufficiently loose in its specification that the north bridge can make much more efficient use of the available bandwidth than if you were doing PCI DMA. Also, unless your algorithm is somehow degenerate, your CPU will slurp in a cache chunk, chew on it for a while, then slurp in the next one. While it's chewing, AGP can get at the memory "for free".

If you're not using AGP memory, but still using vertex arrays, then it's likely that the driver will copy data out of your system memory buffer, and into its own AGP buffer, from which it will then issue the geometry. This is likely to suck.

If you really have a fully streaming bottleneck, and have done everything you can to batch your processing to avoid unnecessary bus turnarounds and partial cache line evictions, then the only thing you can do to increase performance much is to move your vertex data to VRAM. Note that it'll compete with texture data on the card at that point.

08-07-2002, 12:39 PM

Thanks for the response. I almost have the code in place to start testing VARs with my app. After reading your post, I think I should see a significant performance boost just from AGP storage. I've been very concerned with starving my video card of video memory, because my app switches textures more frequently than I would like. I'm hoping that the driver will continue to do a good job of caching the textures. :)

Thanks for the help…