Seg fault in Nvidia libGL file



prsman
12-22-2017, 09:36 AM
Hi everyone. I am posting here because I can't find a solution to a problem I'm having. I am on a Mint 18.1 Xfce desktop with an Nvidia 7300GS video card, driver 304.135.
I have checked all the config/system files on my system and can't find anything wrong with them. The problem is that two applications crash right after they open. After each crash, my syslog file shows a segmentation fault in libGL.so.304.135. I have no other problems with the display, and other OpenGL apps work fine.
I have switched to a different desktop (same OS, though) and reinstalled the driver. FYI, the libGL.so.304.135 file is from Nvidia.
Is it a card problem, or what am I missing? Thanks for reading.

PS: I don't have this problem using the nouveau driver. This is a 64-bit system.

lrwxrwxrwx 1 root root 10 Mar 17 2017 /usr/lib32/nvidia-304/libGL.so -> libGL.so.1
lrwxrwxrwx 1 root root 16 Mar 17 2017 /usr/lib32/nvidia-304/libGL.so.1 -> libGL.so.304.135
-rw-r--r-- 1 root root 833540 Mar 17 2017 /usr/lib32/nvidia-304/libGL.so.304.135
lrwxrwxrwx 1 root root 14 Aug 10 15:55 /usr/lib/i386-linux-gnu/mesa/libGL.so.1 -> libGL.so.1.2.0
-rw-r--r-- 1 root root 453128 Aug 10 15:55 /usr/lib/i386-linux-gnu/mesa/libGL.so.1.2.0
lrwxrwxrwx 1 root root 10 Mar 17 2017 /usr/lib/nvidia-304/libGL.so -> libGL.so.1
lrwxrwxrwx 1 root root 16 Mar 17 2017 /usr/lib/nvidia-304/libGL.so.1 -> libGL.so.304.135
-rw-r--r-- 1 root root 1076560 Mar 17 2017 /usr/lib/nvidia-304/libGL.so.304.135
lrwxrwxrwx 1 root root 14 Aug 10 15:51 /usr/lib/x86_64-linux-gnu/mesa/libGL.so.1 -> libGL.so.1.2.0
-rw-r--r-- 1 root root 463424 Aug 10 15:52 /usr/lib/x86_64-linux-gnu/mesa/libGL.so.1.2.0

Dark Photon
12-22-2017, 08:32 PM
We're going to need more info to be able to help you much.

Looks like you've got two sets of GL libraries on your system: NVidia's and Mesa's. That's often a recipe for problems like you're seeing.

NVidia:


lrwxrwxrwx 1 root root 10 Mar 17 2017 /usr/lib32/nvidia-304/libGL.so -> libGL.so.1
lrwxrwxrwx 1 root root 16 Mar 17 2017 /usr/lib32/nvidia-304/libGL.so.1 -> libGL.so.304.135
-rw-r--r-- 1 root root 833540 Mar 17 2017 /usr/lib32/nvidia-304/libGL.so.304.135
lrwxrwxrwx 1 root root 10 Mar 17 2017 /usr/lib/nvidia-304/libGL.so -> libGL.so.1
lrwxrwxrwx 1 root root 16 Mar 17 2017 /usr/lib/nvidia-304/libGL.so.1 -> libGL.so.304.135
-rw-r--r-- 1 root root 1076560 Mar 17 2017 /usr/lib/nvidia-304/libGL.so.304.135


Mesa:


lrwxrwxrwx 1 root root 14 Aug 10 15:55 /usr/lib/i386-linux-gnu/mesa/libGL.so.1 -> libGL.so.1.2.0
-rw-r--r-- 1 root root 453128 Aug 10 15:55 /usr/lib/i386-linux-gnu/mesa/libGL.so.1.2.0
lrwxrwxrwx 1 root root 14 Aug 10 15:51 /usr/lib/x86_64-linux-gnu/mesa/libGL.so.1 -> libGL.so.1.2.0
-rw-r--r-- 1 root root 463424 Aug 10 15:52 /usr/lib/x86_64-linux-gnu/mesa/libGL.so.1.2.0


First, have you gone through the correct steps to disable nouveau so that your NVidia drivers get full-and-clear control of the GPU (websearch: Mint 18.1 switch from nouveau to nvidia drivers)? Have you installed the NVidia drivers properly? If you install them from their run script, it handles searching your system for other potentially conflicting libraries and removing them.

Some things to check on your system:

lsmod | egrep -i 'nouveau|nvidia'
glxinfo | egrep 'OpenGL|glx'

On an OpenGL program that works fine, try running:

ldd PROGNAME | grep GL

for instance:

ldd `which glxgears` | grep GL

Now for an OpenGL program which doesn't work fine, try running the same. Do you see a difference in which GL it's linking with?

You can try running strace or valgrind on a program which doesn't work to see if you can get a line on what it's trying to do when it crashes.
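
The ldd comparison above can be scripted so the two results sit side by side. This is only a minimal sketch: it assumes a glibc-style ldd whose lines look like "libGL.so.1 => /some/path (0x...)", and the helper name gl_paths and the program name YOUR_FAILING_PROGRAM are placeholders made up for illustration.

```shell
#!/bin/sh
# gl_paths: read ldd output on stdin and print the path that each libGL*
# entry resolved to. (Hypothetical helper name, not part of any tool.)
gl_paths() {
    awk '$1 ~ /^libGL/ && $2 == "=>" { print $3 }'
}

# Compare a known-good program with a failing one. Both names are
# placeholders; substitute the programs on your own system.
for prog in glxgears YOUR_FAILING_PROGRAM; do
    path=$(command -v "$prog") || { echo "$prog: not found"; continue; }
    echo "== $path"
    ldd "$path" 2>/dev/null | gl_paths
done
```

If the two programs print different paths, they are binding to different GL libraries. If a program prints nothing, it has no direct link-time dependency on libGL, which does not rule out an indirect or runtime one.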

prsman
12-23-2017, 12:34 PM
Thanks for the reply, Dark Photon. The Nvidia driver is installed with Mint's Driver Manager, not the .run file. I will do the checks you suggested and report back.
I had a feeling it might be a conflict, but I don't know enough about Linux to "drill down" into a problem. All the checks I did show nouveau is not installed. Thanks to you I have a path to drill down on this problem. Will report back. Thanks

(edit) Well this is weird. The lsmod command for nouveau returns nothing, nvidia returns nvidia. The command glxinfo | egrep returns Nvidia glx, not mesa. Using ldd Program | grep GL returns nothing. I guess the two apps don't use GL.

I changed out the video card for a newer one, installed the driver with Mint's Driver Manager and get the same seg fault in the newer Nvidia libGL file.

If the ldd command returns nothing for the failing apps, which implies they don't use GL, and the crash persists with a new card and driver, is mesa conflicting?


kernel: [ 4106.429222] python[3731]: segfault at 8 ip 00007f933155e82d sp 00007ffc5aecea70 error 4 in libGL.so.340.102[7f93314b0000+c7000]


kernel: [ 4220.778952] mintinstall[3824]: segfault at 8 ip 00007fd5bd0bc82d sp 00007ffd93932620 error 4 in libGL.so.340.102[7fd5bd00e000+c7000]
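
For anyone reading these two syslog lines: "at 8" is the address that was accessed, "ip" is the instruction pointer, and the bracketed pair after the library name is the base address and size of the mapping the instruction lies in. Subtracting the base from ip gives the offset of the faulting instruction inside libGL. The sketch below does that arithmetic; the helper name fault_offset is made up for illustration.

```shell
#!/bin/sh
# fault_offset IP BASE: print how far into the mapped library the faulting
# instruction lies. Both arguments are hex digits without a 0x prefix, as
# they appear in the kernel log line. (Helper name is illustrative.)
fault_offset() {
    printf '0x%x\n' $(( 0x$1 - 0x$2 ))
}

# Values copied from the two syslog lines above.
fault_offset 00007f933155e82d 7f93314b0000   # python
fault_offset 00007fd5bd0bc82d 7fd5bd00e000   # mintinstall
```

Both calls print 0xae82d, so the two apps fault at the same instruction in libGL.so.340.102. On x86, error code 4 means a user-mode read of a page that is not mapped, and an access at address 8 is what a read through a null pointer plus a small offset typically looks like, though the log alone cannot confirm that.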

Dark Photon
12-23-2017, 07:25 PM
(edit) Well this is weird. The lsmod command for nouveau returns nothing, nvidia returns nvidia.

That's good. It suggests your nouveau support was probably disabled properly when the NVidia driver was installed.


The command glxinfo | egrep returns Nvidia glx, not mesa.

That's good again!


Using ldd Program | grep GL returns nothing. I guess the two apps don't use GL.

That's pretty fishy, and suggests they may not be using OpenGL after all.

Try it with glxinfo and/or glxgears to confirm that you do see dynamic dependencies on libGL.


I changed out the video card for a newer one, installed the driver with Mint's Driver Manager and get the same seg fault in the newer Nvidia libGL file.

If the ldd command returns nothing for the failing apps, which implies they don't use GL, and the crash persists with a new card and driver, is mesa conflicting?

Weird. Well, the ldd thing does suggest that those executables do not have a "direct link-time dependency" on OpenGL. However, that doesn't mean that they don't have an "indirect link-time dependency" on OpenGL (for instance, program -> somelibrary -> libGL.so). You might do an ldd on its dependencies. Or more generally, you might run "env LD_DEBUG=all your_program". That'll give you very verbose information about the decisions the dynamic linker is making about which dynamic libraries to pull in and from where. Grepping the output for libGL might be revealing.
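
A minimal sketch of that LD_DEBUG check, assuming a glibc system (LD_DEBUG is a glibc loader feature). It uses LD_DEBUG=libs, which is less verbose than "all" and reports only the library search. The helper name gl_loads is made up for illustration.

```shell
#!/bin/sh
# gl_loads CMD [ARGS...]: run CMD with the glibc loader's debug output
# enabled and keep only the lines that mention libGL.so. LD_DEBUG writes
# to stderr, hence the 2>&1. (Helper name is illustrative.)
gl_loads() {
    env LD_DEBUG=libs "$@" 2>&1 | grep 'libGL\.so'
}

# Example with a program that is expected to link against OpenGL.
if command -v glxinfo >/dev/null 2>&1; then
    gl_loads glxinfo | head -n 20
fi
```

Because LD_DEBUG is an ordinary environment variable, child processes inherit it unless something in between resets the environment, so the output can also cover libraries loaded by children of the program you start.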

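Alternatively, it could be that your program loads libGL.so into memory at runtime after startup via dlopen() / dlsym(). Running "strace your_program 2>&1 | grep libGL" might help confirm/refute whether they're doing that (strace won't show dlopen() itself, since it's a library call rather than a system call, but it will show the resulting open of libGL.so).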
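
Since a dlopen() shows up in strace as an open()/openat() of the library file, the check can be narrowed to file-related system calls and filtered down to successful opens of libGL. A minimal sketch, with a made-up helper name gl_opens and YOUR_PROGRAM as a placeholder:

```shell
#!/bin/sh
# gl_opens: read strace output on stdin and print each libGL file that was
# opened successfully. Lines whose result is "= -1 ..." are failed attempts
# (usually search-path misses) and are dropped. (Illustrative helper name.)
gl_opens() {
    grep -E 'open(at)?\(.*libGL\.so' \
        | grep -v '= -1 ' \
        | sed 's/^[^"]*"\([^"]*\)".*/\1/' \
        | sort -u
}

# Usage (YOUR_PROGRAM is a placeholder). -f follows child processes and
# -e trace=file limits the trace to system calls that take a file name:
#   strace -f -e trace=file YOUR_PROGRAM 2>&1 | gl_opens
```

An empty result with -f in place would suggest libGL is not being opened by the traced process tree at all, for example because the crashing process is started by something outside it.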

In any case, somehow they're pulling in libGL.

Too soon to say whether this has anything to do with conflicting Mesa libs.

Dark Photon
12-23-2017, 07:30 PM
kernel: [ 4106.429222] python[3731]: segfault at 8 ip 00007f933155e82d sp 00007ffc5aecea70 error 4 in libGL.so.340.102[7f93314b0000+c7000]


I just noticed the python in your error. Do these crashes you're getting only occur when you're using Python scripts that make use of OpenGL?

If not, I'd redirect your debugging to built executable images which link with libGL that crash in the NVidia driver.

If there aren't any, you may just have a problem with your Python OpenGL (PyOpenGL?) bindings finding the correct GL library.

prsman
12-24-2017, 01:19 PM
Hi Dark Photon, thanks for sticking with this.


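Alternatively, it could be that your program loads libGL.so into memory at runtime after startup via dlopen() / dlsym(). Running "strace your_program 2>&1 | grep libGL" might help confirm/refute whether they're doing that (strace won't show dlopen() itself, since it's a library call rather than a system call, but it will show the resulting open of libGL.so).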


I did not see in the strace where it calls for libGL. Running ldd /usr/bin/mintinstall returns: not a dynamic executable. I may be running this command wrong.

you might run "env LD_DEBUG=all your_program"


My Linux chops are not enough to know how to run this.
Both apps that crash are python scripts.
From the strace, one where the app fails:


--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=5086, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
close(3) = 0
wait4(5086, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 5086
getuid() = 1000
rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7f55f106a4b0}, {0x53d3a0, [], SA_RESTORER, 0x7f55f1410390}, 8) = 0
rt_sigaction(SIGQUIT, {SIG_IGN, [], SA_RESTORER, 0x7f55f106a4b0}, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
clone(child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD, parent_tidptr=0x7ffc6da1794c) = 5088
wait4(5088, OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)
Segmentation fault
[{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 5088
rt_sigaction(SIGINT, {0x53d3a0, [], SA_RESTORER, 0x7f55f106a4b0}, NULL, 8) = 0
rt_sigaction(SIGQUIT, {SIG_DFL, [], SA_RESTORER, 0x7f55f106a4b0}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=5088, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f55f1410390}, {0x53d3a0, [], SA_RESTORER, 0x7f55f106a4b0}, 8) = 0
brk(0x1833000) = 0x1833000
brk(0x1831000) = 0x1831000
exit_group(0) = ?
+++ exited with 0 +++

In line 9 after the wait, is where I have to enter my password, it opens and crashes and I get the seg fault in syslog.


--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=5450, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
close(3) = 0
wait4(5450, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 5450
getuid() = 1000
rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7f5c6ec584b0}, {0x53d3a0, [], SA_RESTORER, 0x7f5c6effe390}, 8) = 0
rt_sigaction(SIGQUIT, {SIG_IGN, [], SA_RESTORER, 0x7f5c6ec584b0}, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
clone(child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD, parent_tidptr=0x7ffde08f80fc) = 5452
wait4(5452, OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)
Killed
[{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 5452
rt_sigaction(SIGINT, {0x53d3a0, [], SA_RESTORER, 0x7f5c6ec584b0}, NULL, 8) = 0
rt_sigaction(SIGQUIT, {SIG_DFL, [], SA_RESTORER, 0x7f5c6ec584b0}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=5452, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f5c6effe390}, {0x53d3a0, [], SA_RESTORER, 0x7f5c6ec584b0}, 8) = 0
brk(0x2561000) = 0x2561000
brk(0x255f000) = 0x255f000
exit_group(0) = ?
+++ exited with 0 +++


In line 9, same as above but after the password the app opens, stays open until I close it. No seg fault in syslog.
Could it be a problem with the python OpenGL bindings? Thanks

Dark Photon
12-24-2017, 02:06 PM
you might run "env LD_DEBUG=all your_program"

My Linux chops are not enough to know how to run this.

Don't psych yourself out. It's easy. For example, instead of:

glxinfo
run:

env LD_DEBUG=all glxinfo

However, I don't think you need to do this now. Your info says this problem is happening down in a child process, not in the linking of the main process you're launching.


Both apps that crash are python scripts.

Not only that, but (from your output), both are trying to run some Java stuff, and totally unsuccessfully it appears. This is getting weird.

If apps like glxinfo and glxgears work fine, but only these Python scripts crash like this, then that suggests that there is a problem with the way these Python scripts are trying to use OpenGL, or the environment in which they're being run hasn't been setup properly.


In line 9 after the wait, is where I have to enter my password, it opens and crashes and I get the seg fault in syslog.

Password? What, is this trying to do a remote login to some machine, or to change user to root or something?


Could it be a problem with the python OpenGL bindings? Thanks

It's sounding likely that it's at least something specific to what these Python scripts are doing, possibly how the Python bindings are trying to load/use OpenGL (or Java). How's Java involved in this?

The fact that they do segfault in libGL.so.340.102 (NVidia GL library) suggests they're at least in the NVidia GL library and not one of those other GL libs you have on your system. But who knows what they're doing wrong that makes them crash.

Your straces suggest that it's segfaulting in a child process, which in the last case appears that it might be a Java process. Is that Java process what's trying to use OpenGL?

You may be able to get more clues about what's going on by running with "strace -f" rather than just "strace", which will trace down into child processes as well.
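
A "strace -f" log gets large quickly, so it helps to find the crashing process first and read only its lines. A minimal sketch, assuming the log was written with "strace -f -o FILE", in which case each line starts with the PID it belongs to. The helper name segv_pids and the file name trace.log are made up for illustration.

```shell
#!/bin/sh
# segv_pids: read a log written by "strace -f -o FILE ..." on stdin and
# print the PID of every process that received SIGSEGV. The signal is
# logged as a line of the form "PID  --- SIGSEGV {...} ---".
segv_pids() {
    awk '/--- SIGSEGV / { print $1 }' | sort -u
}

# Usage (YOUR_PROGRAM and trace.log are placeholders):
#   strace -f -o trace.log YOUR_PROGRAM
#   segv_pids < trace.log
#   grep '^PID ' trace.log | tail -n 40   # replace PID with a number printed above
```

The last few dozen lines for that PID show what the process was doing just before the signal arrived.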

However, I really think you're going to need to find a way to simplify the test case you've got. Let's get Java out of the picture, and possibly even Python too. Alternatively, you should probably seek help from the folks that wrote this Python/Java/GL app.

So far based on the evidence (and how stable I know NVidia's GL drivers are), this is looking like you've got some app/process that's mis-using the NVidia GL drivers, and that you just happen to crash in there because the app code is buggy or is encountering a use case it wasn't coded for.

prsman
12-24-2017, 03:41 PM
Hi Dark Photon, thanks again. The app needs my password because it can make changes to the system (install programs). I don't know how Java is involved, but the other failing app gives a similar Java message. Will run the strace -f and see what I find.


So far based on the evidence (and how stable I know NVidia's GL drivers are), this is looking like you've got some app/process that's mis-using the NVidia GL drivers, and that you just happen to crash in there because the app code is buggy or is encountering a use case it wasn't coded for.

Then we can agree this is not an OpenGL problem per se, but something involving OpenGL. If I don't find something from strace I will close this post. Thank you for your help.

Dark Photon
12-25-2017, 03:14 PM
Another thing you can try: see if you can catch the crash in gdb. When it crashes, dump the stack to see what the program was doing when it crashed (stack trace):


gdb YOUR_PROGRAM
set args ARGS_TO_YOUR_PROGRAM
run
<wait for crash>
where
quit
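
One thing to watch for: if YOUR_PROGRAM is a script rather than a compiled binary, gdb cannot run it directly and has to be started on the interpreter, with the script passed as an argument. A minimal sketch, where the helper name interp_of and the names YOUR_SCRIPT and ARGS are placeholders made up for illustration:

```shell
#!/bin/sh
# interp_of SCRIPT: print the interpreter named on the script's "#!" line,
# or nothing if the file does not start with one. (Illustrative helper.)
interp_of() {
    sed -n '1s/^#![[:space:]]*//p' "$1"
}

# Usage (YOUR_SCRIPT and ARGS are placeholders). -batch makes gdb exit when
# it is done, each -ex runs one gdb command, and --args passes the rest of
# the line to the program being debugged:
#   gdb -batch -ex run -ex bt --args $(interp_of YOUR_SCRIPT) YOUR_SCRIPT ARGS
#
# If the crash happens in a child process, adding
#   -ex 'set follow-fork-mode child'
# before "-ex run" tells gdb to follow the child after a fork.
```

Here "bt" is the same backtrace command as "where" in the steps above.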

prsman
12-26-2017, 04:45 PM
Hi Dark Photon, well, strace -f gave me a file of 12 megs and 256K lines. I read it all and couldn't find anything about libGL. Then I installed a program called apport, which is a crash reporter.
I am having trouble reading the core dump. Using the command less on the crash file I see this:


7f9b134d3000-7f9b13509000 r-xp 00000000 08:01 1182746 /usr/lib/x86_64-linux-gnu/mesa-egl/libEGL.so.1.0.0
7f9b13509000-7f9b13708000 ---p 00036000 08:01 1182746 /usr/lib/x86_64-linux-gnu/mesa-egl/libEGL.so.1.0.0
7f9b13708000-7f9b1370a000 r--p 00035000 08:01 1182746 /usr/lib/x86_64-linux-gnu/mesa-egl/libEGL.so.1.0.0
7f9b1370a000-7f9b1370c000 rw-p 00037000 08:01 1182746 /usr/lib/x86_64-linux-gnu/mesa-egl/libEGL.so.1.0.0
7f9b1370c000-7f9b137d3000 r-xp 00000000 08:01 3542199 /usr/lib/nvidia-340/libGL.so.340.102
7f9b137d3000-7f9b13802000 rwxp 000c7000 08:01 3542199 /usr/lib/nvidia-340/libGL.so.340.102
7f9b13802000-7f9b1381e000 r-xp 000f6000 08:01 3542199 /usr/lib/nvidia-340/libGL.so.340.102
7f9b1381e000-7f9b13a1d000 ---p 00112000 08:01 3542199 /usr/lib/nvidia-340/libGL.so.340.102
7f9b13a1d000-7f9b13a42000 rw-p 00111000 08:01 3542199 /usr/lib/nvidia-340/libGL.so.340.102

So the file is being called I think. I will run your gdb command. Hopefully that will tell us something. Thanks again.
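
Lines in this format (the same as /proc/PID/maps) can be reduced to a list of which GL-related libraries a process has mapped. A minimal sketch, with a made-up helper name mapped_gl:

```shell
#!/bin/sh
# mapped_gl: read /proc/PID/maps-style lines on stdin and print each distinct
# libGL or libEGL file that is mapped. Field 6 of a maps line is the path.
# (Illustrative helper name.)
mapped_gl() {
    awk '$6 ~ /\/lib(GL|EGL)\.so/ && !seen[$6]++ { print $6 }'
}

# Usage on a live process (PID is a placeholder):
#   mapped_gl < /proc/PID/maps
```

Run on the excerpt above, it prints the Mesa libEGL.so.1.0.0 path and the NVidia libGL.so.340.102 path, so both are mapped in the same process. Whether that mix has anything to do with the crash is not established here.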

(EDIT) I ran the gdb command and when it crashed, I typed where and got: no stack? The apport program generated a crash file, and in it is the same
as in the quotes above for the other failing app. Interestingly, the crash reports are called _PROGRAM.py.0.crash. In each program directory is a file called Program.py.
I will try the gdb program again in case I screwed up.

prsman
12-26-2017, 05:35 PM
Found this file from apport program. May help.


ERROR: apport (pid 4086) Tue Dec 26 16:01:59 2017: called for pid 4048, signal 11, core limit 0, dump mode 1
ERROR: apport (pid 4086) Tue Dec 26 16:01:59 2017: script: /usr/lib/linuxmint/mintinstall/mintinstall.py, interpreted by /usr/bin/python2.7 (command line "/usr/bin/python2 /usr/lib/linuxmint/mintinstall/mintinstall.py")
ERROR: apport (pid 4086) Tue Dec 26 16:01:59 2017: is_closing_session(): no DBUS_SESSION_BUS_ADDRESS in environment
ERROR: apport (pid 4086) Tue Dec 26 16:03:05 2017: wrote report /var/crash/_usr_lib_linuxmint_mintinstall_mintinstall.py.0.crash