Quadro bug



dukey
08-18-2016, 12:51 PM
I have a program that streams 3D data over the network and then renders it. I have an inexplicable problem where, only on Quadro cards, the network connection drops as soon as I render a few frames of data. I've tried everything imaginable to reproduce this bug on my GTX card: pausing the network traffic, simulating high CPU load, changing the way I receive the data.

On a Quadro card it consistently breaks. If I don't call RenderData(), the network stream works flawlessly.

It doesn't sound driver related. However, when I googled the error message "WSAECONNABORTED and recv", amazingly the 3rd result on Google was someone with an identical issue:

http://stackoverflow.com/questions/3369913/how-to-investigate-client-side-wsaeconnaborted-happening-very-often-only-on-mach
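
For what it's worth, the receive path is basically a plain blocking loop, along the lines of the sketch below (simplified, not my actual code). The failure shows up as recv() returning SOCKET_ERROR with WSAGetLastError() reporting 10053 (WSAECONNABORTED):

// Simplified shape of the receive path: a blocking recv() loop that logs the
// Winsock error code when the stream dies, so WSAECONNABORTED (10053) can be
// told apart from a normal close.
#include <winsock2.h>
#include <cstdio>
#pragma comment(lib, "ws2_32.lib")

bool ReceiveChunk(SOCKET s, char* buf, int len)
{
    int total = 0;
    while (total < len)
    {
        int n = recv(s, buf + total, len - total, 0);
        if (n > 0)
        {
            total += n;
        }
        else if (n == 0)
        {
            std::printf("peer closed the connection\n");
            return false;
        }
        else
        {
            std::printf("recv failed, WSAGetLastError() = %d\n", WSAGetLastError());
            return false;   // 10053 == WSAECONNABORTED
        }
    }
    return true;
}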

Dark Photon
08-18-2016, 07:40 PM
That's pretty strange. It sure sounds like some odd interaction between the NVidia Quadro driver and the network driver. Possibly the Quadro driver is disabling interrupts for too long, using too much buffer space, or otherwise starving out the network driver somehow.

Overall, there are a few things you might try to investigate this: 1) monitoring your app and the driver, 2) modifying your app, and 3) modifying the NVidia driver config.

1) I'd use Resource Monitor, Process Explorer, or whatever you can get your hands on to keep an eye on CPU/memory consumption, both for your application and for the NVidia driver and your network driver (or possibly the kernel in general). Hopefully that'll provide some clues. Do you see high CPU? Do you see memory growth?
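
If you also want numbers from inside the app itself, a couple of Win32 calls can log your own process's CPU time and working set once per frame. A rough Windows-only sketch (the output format is just illustrative):

// Rough per-frame self-monitoring sketch (Windows only): logs the process's
// accumulated CPU time and current working set so any growth shows up in the log.
#include <windows.h>
#include <psapi.h>
#include <cstdio>
#pragma comment(lib, "psapi.lib")

void LogProcessStats()
{
    FILETIME creationTime, exitTime, kernelTime, userTime;
    PROCESS_MEMORY_COUNTERS pmc = { sizeof(pmc) };

    if (GetProcessTimes(GetCurrentProcess(), &creationTime, &exitTime, &kernelTime, &userTime) &&
        GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
    {
        ULONGLONG kernel100ns = (ULONGLONG(kernelTime.dwHighDateTime) << 32) | kernelTime.dwLowDateTime;
        ULONGLONG user100ns   = (ULONGLONG(userTime.dwHighDateTime)   << 32) | userTime.dwLowDateTime;
        std::printf("cpu: %.2fs kernel, %.2fs user, working set: %zu KB\n",
                    kernel100ns / 1e7, user100ns / 1e7, pmc.WorkingSetSize / 1024);
    }
}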

2) I would try modifying your GL app to see if you can isolate what your app is doing that instigates this problem (if anything). For instance, are you free-running? If so, enable VSync (Sync-to-VBlank), and time your frame loop to verify that you're getting it. Try pumping up the SwapInterval to > 1. Try disabling most of your draw submission code and just doing a Clear and Swap of your window, then gradually re-enable sub-portions of your frame draw code. Possibly slow things down by putting a glFinish() after Swap, and possibly after certain strategic points during your frame. Try reducing the window resolution and/or hiding the window altogether.
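
For instance, the VSync + glFinish part of that experiment might look something like this on Windows (a WGL-specific sketch; the extension lookup, loop, and timing are illustrative, not a drop-in for your app):

// Sketch of the vsync / glFinish experiment (WGL). Enables sync-to-vblank,
// draws a minimal frame, forces the driver to drain with glFinish(), and
// times each frame to confirm the swap interval is actually being honored.
#include <windows.h>
#include <GL/gl.h>
#include <cstdio>
#pragma comment(lib, "opengl32.lib")

typedef BOOL (WINAPI *PFNWGLSWAPINTERVALEXTPROC)(int interval);

void RunFrameExperiment(HDC hdc)
{
    PFNWGLSWAPINTERVALEXTPROC wglSwapIntervalEXT =
        (PFNWGLSWAPINTERVALEXTPROC)wglGetProcAddress("wglSwapIntervalEXT");
    if (wglSwapIntervalEXT)
        wglSwapIntervalEXT(1);              // try 2 to slow the app down further

    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);

    for (;;)                                // the app's normal frame loop
    {
        LARGE_INTEGER t0, t1;
        QueryPerformanceCounter(&t0);

        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        // ... re-enable pieces of the normal draw code here, one at a time ...
        SwapBuffers(hdc);
        glFinish();                         // drain the driver before the next network read

        QueryPerformanceCounter(&t1);
        std::printf("frame: %.2f ms\n",
                    (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart);
    }
}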

3) The other thing I would try is modifying your NVidia driver config (or the environment in which it operates) to see if you can make the network problem go away. For instance, to reduce memory consumption/bandwidth, try reducing the size of your virtual screen / desktop. If you have multiple monitors, cut down to one. If one, reduce the resolution. Try forcing the driver to run single-threaded (if that's an option on Windows). Modify how your system handles IRQ routing and APIC (see below). I realize you're on Windows, but here are a few sections of the NVidia Linux driver README.txt that caught my eye when I scanned it w.r.t. your problem.



________________________________________________________________________________

Chapter 8. Common Problems
________________________________________________________________________________

This section provides solutions to common problems associated with the NVIDIA
Linux x86_64 Driver.


Q. My X server fails to start, and my X log file contains the error:

(EE) NVIDIA(0): The NVIDIA kernel module does not appear to
(EE) NVIDIA(0): be receiving interrupts generated by the NVIDIA graphics
(EE) NVIDIA(0): device PCI:x:x:x. Please see the COMMON PROBLEMS
(EE) NVIDIA(0): section in the README for additional information.


A. This can be caused by a variety of problems, such as PCI IRQ routing
errors, I/O APIC problems, conflicts with other devices sharing the IRQ (or
their drivers), or MSI compatibility problems.

If possible, configure your system such that your graphics card does not
share its IRQ with other devices (try moving the graphics card to another
slot if applicable, unload/disable the driver(s) for the device(s) sharing
the card's IRQ, or remove/disable the device(s)).

Depending on the nature of the problem, one of (or a combination of) these
kernel parameters might also help:

Parameter       Behavior
--------------  ---------------------------------------------------
pci=noacpi      don't use ACPI for PCI IRQ routing
pci=biosirq     use PCI BIOS calls to retrieve the IRQ routing table
noapic          don't use I/O APICs present in the system
acpi=off        disable ACPI


The problem may also be caused by MSI compatibility problems. See "MSI
Interrupts" for details.
...

Q. My X server fails to start, and my X log file contains the error:

(EE) NVIDIA(0): The interrupt for NVIDIA graphics device PCI:x:x:x
(EE) NVIDIA(0): appears to be edge-triggered. Please see the COMMON
(EE) NVIDIA(0): PROBLEMS section in the README for additional information.


A. An edge-triggered interrupt means that the kernel has programmed the
interrupt as edge-triggered rather than level-triggered in the Advanced
Programmable Interrupt Controller (APIC). Edge-triggered interrupts are not
intended to be used for sharing an interrupt line between multiple devices;
level-triggered interrupts are the intended trigger for such usage. When
using edge-triggered interrupts, it is common for device drivers using that
interrupt line to stop receiving interrupts. This would appear to the end
user as those devices no longer working, and potentially as a full system
hang. These problems tend to be more common when multiple devices are
sharing that interrupt line.

This occurs when ACPI is not used to program interrupt routing in the APIC.
It may also occur when ACPI is disabled, or fails to initialize. In these
cases, the Linux kernel falls back to tables provided by the system BIOS.
In some cases the system BIOS assumes ACPI will be used for routing
interrupts and configures these tables to incorrectly label all interrupts
as edge-triggered. The current interrupt configuration can be found in
/proc/interrupts.

Available workarounds include: updating to a newer system BIOS, a more
recent Linux kernel with ACPI enabled, or passing the 'noapic' option to
the kernel to force interrupt routing through the traditional Programmable
Interrupt Controller (PIC). The Linux kernel also provides an interrupt
polling mechanism you can use to attempt to work around this problem. This
mechanism can be enabled by passing the 'irqpoll' option to the kernel.

Currently, the NVIDIA driver will attempt to detect edge triggered
interrupts and X will purposely fail to start (to avoid stability issues).
This behavior can be overridden by setting the "NVreg_RMEdgeIntrCheck"
NVIDIA Linux kernel module parameter. This parameter defaults to "1", which
enables the edge triggered interrupt detection. Set this parameter to "0"
to disable this detection.
...

Driver fails to initialize when MSI interrupts are enabled

The Linux NVIDIA driver uses Message Signaled Interrupts (MSI) by default.
This provides compatibility and scalability benefits, mainly due to the
avoidance of IRQ sharing.

Some systems have been seen to have problems supporting MSI, while working
fine with virtual wire interrupts. These problems manifest as an inability
to start X with the NVIDIA driver, or CUDA initialization failures. The
NVIDIA driver will then report an error indicating that the NVIDIA kernel
module does not appear to be receiving interrupts generated by the GPU.

Problems have also been seen with suspend/resume while MSI is enabled. All
known problems have been fixed, but if you observe problems with
suspend/resume that you did not see with previous drivers, disabling MSI
may help you.

NVIDIA is working on a long-term solution to improve the driver's out of
the box compatibility with system configurations that do not fully support
MSI.

MSI interrupts can be disabled via the NVIDIA kernel module parameter
"NVreg_EnableMSI=0". This can be set on the command line when loading the
module, or more appropriately via your distribution's kernel module
configuration files (such as those under /etc/modprobe.d/).
...

Q. OpenGL applications leak significant amounts of memory on my system!

A. If your kernel is making use of the -rmap VM, the system may be leaking
memory due to a memory management optimization introduced in -rmap14a. The
-rmap VM has been adopted by several popular distributions, and the memory
leak is known to be present in some of those distribution kernels; it has
been fixed in -rmap15e.

If you suspect that your system is affected, try upgrading your kernel or
contact your distribution's vendor for assistance.

dukey
08-19-2016, 03:28 AM
Hi Dark Photon, thanks for the reply.

That's pretty strange. It sure sounds like some odd interaction between the NVidia Quadro driver and the network driver. Possibly the Quadro driver is disabling interrupts for too long, using too much buffer space, or otherwise starving out the network driver somehow.


That was my guess. Perhaps it was somehow flushing the TCP buffer. Normally, receiving TCP data is not time dependent; it's UDP that just throws packets away if the buffer is full or you don't receive them fast enough.

I found a workaround: setting the driver profile to Dynamic Streaming seems to fix it. I don't know if it actually fixes it or just minimises the problem so that I don't notice it. It also doesn't happen if I have a debugger attached, so I'm guessing it's some sort of timing issue. I'm not even sure what could possibly break the TCP connection.

The documentation for recv says this:

Winsock may need to wait for a network event before the call can complete. Winsock performs an alertable wait in this situation, which can be interrupted by an asynchronous procedure call (APC) scheduled on the same thread.

Wonder if the driver could be doing that.
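
If it is, keeping the blocking recv() off the thread that calls into the GL driver might sidestep it entirely. Something along these lines, where ReceiveChunk() is the kind of loop from my first post and RenderData()/kFrameBytes stand in for the real renderer and frame size:

// Sketch: a dedicated receive thread feeds complete frames into a queue, so an
// APC (or anything else the driver schedules on the render thread) can never
// interrupt the blocking recv(). ReceiveChunk/RenderData/kFrameBytes are
// placeholders for the real functions and sizes.
#include <winsock2.h>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <vector>

bool ReceiveChunk(SOCKET s, char* buf, int len);   // blocking recv() loop, as sketched earlier
void RenderData(const char* data, size_t size);    // placeholder for the real renderer
const int kFrameBytes = 64 * 1024;                 // illustrative frame size

std::mutex g_mutex;
std::condition_variable g_cv;
std::queue<std::vector<char>> g_frames;

void ReceiveLoop(SOCKET s)                         // runs on its own thread
{
    std::vector<char> frame(kFrameBytes);
    while (ReceiveChunk(s, frame.data(), (int)frame.size()))
    {
        std::lock_guard<std::mutex> lock(g_mutex);
        g_frames.push(frame);
        g_cv.notify_one();
    }
}

void RenderLoop()                                  // runs on the thread that owns the GL context
{
    for (;;)
    {
        std::unique_lock<std::mutex> lock(g_mutex);
        g_cv.wait(lock, [] { return !g_frames.empty(); });
        std::vector<char> frame = std::move(g_frames.front());
        g_frames.pop();
        lock.unlock();
        RenderData(frame.data(), frame.size());
    }
}

The receive thread would be started with std::thread(ReceiveLoop, sock), while RenderLoop() stays on the thread that owns the GL context.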

Dark Photon
08-19-2016, 06:23 PM
I found a workaround: setting the driver profile to Dynamic Streaming seems to fix it.

Interesting. Good find! Websearching that option (out of curiosity) reveals quite a few posts recommending that folks flip their Quadros to that mode (or Visual Simulation) to solve performance problems and crash issues.

One of those hits has a familiar description:

https://devtalk.nvidia.com/default/topic/917151/opengl/running-openscenegraph-opengl-previews-on-quadro-cards-heavy-cpu-peaks/


Q: In recent months, several customers have reported that their CPUs stall heavily when using our software.
...
On said system, the symptoms are the following:

- When running our software with [OpenGL] previews enabled, the CPU usage sporadically spikes at 100 % on all cores for about 1 to 10 seconds.
- During this time, our software as well as most other processes are unresponsive.
- Apart from these peaks, the CPU usage is < 10 %.

...
When the [Quadro] graphics card in the above-mentioned test system is replaced with a GeForce GTX 670 (driver 361.43), the problem disappears.

...

A: If your OpenGL application generates a lot of dynamic load and/or needs to maintain some continuous baseline performance, please first try whether the profile "Workstation - Dynamic Streaming" changes the behavior. It's explicitly meant to address these use cases.