Debugging segfaults from logs to gdb

How to find out what went wrong from a kernel segfault message


So this week, after a version upgrade of GraphicsMagick, we got some segfaults on our servers. Nothing terrible: about twelve segfaults in a 24-hour period. The only information was a line in /var/log/kernel.log:

Feb 22 13:28:27 serverXX kernel: [1953364.275653] gm[16356]: segfault at 0 ip 00007fd137bd41e0 sp 00007fff5770dcd0 error 6 in libGraphicsMagick.so.3.7.0[7fd1379b9000+29d000]

No core dumps, since ulimit -c is set to zero. What can we do to at least get an idea of what is happening?

Well, luckily I build the packages for our internal use, so I had the build directory available with the unstripped binaries. With that it’s trivial to use the GNU Debugger (gdb) and find out what is going on.

First, notice that the segfault is happening in a shared lib, which is a complication in itself. When the segfault happens in the code of a non-shared binary, the ip (instruction pointer) value points directly at the instruction in that binary. Here it points into a shared lib that is dynamically linked into the gm binary and mapped at an address chosen at load time, so the ip does not match the addresses inside the library file.

To find the instruction, then, subtract the base address at which the library is mapped (it’s the 7fd1379b9000 part after the lib’s name in the segfault message; the 29d000 after the + is the size of the mapping) from the ip:

0x00007fd137bd41e0 - 0x7fd1379b9000 = 0x21B1E0
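
The subtraction can also be scripted, which helps when there are several of these lines to go through. This is a minimal sketch in Python, assuming the message layout shown above; SEGFAULT_RE and library_offset are names made up for this example, not part of any tool.

```python
import re

# The kernel line from this post, verbatim.
LINE = ("Feb 22 13:28:27 serverXX kernel: [1953364.275653] gm[16356]: "
        "segfault at 0 ip 00007fd137bd41e0 sp 00007fff5770dcd0 error 6 "
        "in libGraphicsMagick.so.3.7.0[7fd1379b9000+29d000]")

# Hypothetical pattern, written only for the layout of the line above.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def library_offset(line):
    """Return (object name, ip - base) for a segfault line."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line in the expected layout")
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    size = int(m.group("size"), 16)
    # Sanity check: the ip must fall inside the reported mapping.
    if not base <= ip < base + size:
        raise ValueError("ip is outside the reported mapping")
    return m.group("obj"), ip - base


obj, offset = library_offset(LINE)
print(obj, hex(offset))  # libGraphicsMagick.so.3.7.0 0x21b1e0
```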

Finally, load the library into GDB and check what lives at that address, provided you have an unstripped object (you can get one from the -dbg packages on Debian/Ubuntu):

(gdb) info symbol 0x21B1E0
WriteOnePNGImage + 13648 in section .text
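
The same query can be run without an interactive session using gdb's -batch and -ex options. A small sketch of that, assuming gdb is installed; the library path is a placeholder (I am not showing my real build tree here) and gdb_info_symbol_cmd is a helper invented for this example.

```python
import os
import shutil
import subprocess


def gdb_info_symbol_cmd(lib_path, offset):
    """Build a non-interactive gdb command that resolves offset in lib_path."""
    return ["gdb", "-batch", "-ex", f"info symbol {offset:#x}", lib_path]


# Placeholder path: point it at your own unstripped copy of the library.
UNSTRIPPED_LIB = "/path/to/build/libGraphicsMagick.so.3.7.0"

cmd = gdb_info_symbol_cmd(UNSTRIPPED_LIB, 0x21B1E0)
print(" ".join(cmd))

# Only run gdb when it is installed and the file really exists.
if shutil.which("gdb") and os.path.exists(UNSTRIPPED_LIB):
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    print(result.stdout.strip())
```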

There’s the culprit. You can also get some information out of the stripped library using nm. Remember that nm shows nothing for a stripped shared lib unless you use the -D option, which lists the dynamic symbols (only part of the output is shown):

root@XXX:~# nm -D /usr/lib/libGraphicsMagick.so.3.7.0
0000000000104650 T AccessCacheViewPixels
0000000000104700 T AccessDefaultCacheView
00000000000ea030 T AccessDefinition
00000000001061e0 T AccessImmutableIndexes
0000000000106170 T AccessMutableIndexes
[...]
0000000000210d50 T RegisterJP2Image
0000000000213170 T RegisterPNGImage
0000000000210cf0 T UnregisterJP2Image
0000000000213110 T UnregisterPNGImage

You can see that there are some PNG related symbols around the address 0x21xxxx. If you check the code for GraphicsMagick's PNG support you will see that RegisterPNGImage is where WritePNGImage gets registered as the PNG encoder, so the crash sits in the PNG writing code.
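
Eyeballing the nm listing can be replaced by a nearest-symbol lookup. The sketch below uses only the lines of nm output shown above, so the answer is just the closest symbol in that excerpt; parse_nm and nearest_symbol are hypothetical helpers written for this example.

```python
# The excerpt of `nm -D` output shown in this post (not the full listing).
NM_OUTPUT = """\
0000000000104650 T AccessCacheViewPixels
0000000000104700 T AccessDefaultCacheView
00000000000ea030 T AccessDefinition
00000000001061e0 T AccessImmutableIndexes
0000000000106170 T AccessMutableIndexes
[...]
0000000000210d50 T RegisterJP2Image
0000000000213170 T RegisterPNGImage
0000000000210cf0 T UnregisterJP2Image
0000000000213110 T UnregisterPNGImage
"""


def parse_nm(text):
    """Return sorted (address, name) pairs for text-section symbols."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skips "[...]" and undefined symbols without an address
        value, kind, name = parts
        if kind in ("T", "t"):
            symbols.append((int(value, 16), name))
    return sorted(symbols)


def nearest_symbol(symbols, offset):
    """Return the symbol with the highest address that is <= offset."""
    candidates = [s for s in symbols if s[0] <= offset]
    return max(candidates) if candidates else None


addr, name = nearest_symbol(parse_nm(NM_OUTPUT), 0x21B1E0)
print(f"{name}+{0x21B1E0 - addr:#x}")  # RegisterPNGImage+0x8070
```

Treat the result as a hint about the neighbourhood only. The dynamic symbol table lists exported symbols, so a function that is not exported never shows up in nm -D; gdb on the unstripped library named WriteOnePNGImage, which is not in this excerpt.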

In this case I correlated the logs and found that the request that caused the segfault completed without problems and the PNG image was generated correctly. My conclusion is that the segfault happens in some non-crucial part of those functions, but without more data there is not much more I can do to pinpoint the exact problem.
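
The kernel line itself gives a little more to go on: "segfault at 0" is the address that was accessed and "error 6" is the page fault error code. Here is a rough sketch of decoding it, assuming an x86-64 host (the addresses in the log suggest one) and the usual x86 error code bits; describe_fault is a helper made up for this example.

```python
# x86 page fault error code bits, as printed in the "error N" field (hex).
PF_PROT = 0x1    # 0: page not present, 1: protection violation
PF_WRITE = 0x2   # 0: read access, 1: write access
PF_USER = 0x4    # 0: kernel mode, 1: user mode
PF_INSTR = 0x10  # 1: the fault was an instruction fetch


def describe_fault(fault_addr, error_code):
    """Describe a segfault from its address and x86 error code."""
    mode = "user-mode" if error_code & PF_USER else "kernel-mode"
    if error_code & PF_INSTR:
        access = "instruction fetch"
    elif error_code & PF_WRITE:
        access = "write"
    else:
        access = "read"
    cause = "protection violation" if error_code & PF_PROT else "page not present"
    return f"{mode} {access} at {fault_addr:#x} ({cause})"


# Values from the log line: "segfault at 0 ... error 6".
print(describe_fault(int("0", 16), int("6", 16)))
# user-mode write at 0x0 (page not present)
```

Under those assumptions the line reads as a write to address 0 from user space, which looks like a write through a NULL pointer somewhere inside WriteOnePNGImage.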

gdb, nm and ldd are powerful tools when debugging or doing a postmortem on a segfault. With a core dump, and maybe some more information, it would be easier to find out exactly what is going on.

cya!

 