Debugging beyond the debugger
Published on May 8, 2019
Most programming – and sysadmin – problems can be debugged in a fairly straightforward manner using logs, print statements, educated guesses, or an actual debugger. Sometimes, though, the problem is more elusive. There’s a wider box of tricks that can be employed in these cases, but I’ve not managed to find a nice overview of them, so here’s mine. I’m mainly focusing on Linux and similar systems, but there tend to be alternatives available for other operating systems or VMs if you seek them out.
Networking
tcpdump
tcpdump prints out descriptions of packets on a network interface. You can apply filters to limit which packets are displayed, choose to dump the entire contents of each packet, and so forth. Typical usage might look something like:
# tcpdump -nSi eth0 port 80
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
16:03:35.577781 IP6 2001:db8::1.54742 > 2001:db8::2.80: Flags [S], seq 2815779044, win 64800, options [mss 1440,sackOK,TS val 2378811665 ecr 0,nop,wscale 7], length 0
16:03:35.586853 IP6 2001:db8::2.80 > 2001:db8::1.54742: Flags [S.], seq 1522609102, ack 2815779045, win 28560, options [mss 1440,sackOK,TS val 3063610173 ecr 2378811665,nop,wscale 7], length 0
16:03:35.586877 IP6 2001:db8::1.54742 > 2001:db8::2.80: Flags [.], ack 1522609103, win 507, options [nop,nop,TS val 2378811674 ecr 3063610173], length 0
16:03:35.620678 IP6 2001:db8::1.54742 > 2001:db8::2.80: Flags [P.], seq 2815779045:2815779399, ack 1522609103, win 507, options [nop,nop,TS val 2378811708 ecr 3063610173], length 354: HTTP: GET / HTTP/1.1
Here you can see the start of a plain text HTTP request: the three-way handshake as the TCP connection is established, followed by a GET request. Even if the data is encrypted, as it will be in most cases, it’s often useful to see the “shape” of the transmissions – did the client start sending data when it connected, did the server ever respond, etc.
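If you want to dig into a capture in more detail later, you can write the raw packets to a file and read them back with tcpdump (or open the file in Wireshark). A quick sketch, assuming the same eth0 interface and an arbitrary output file name:
# tcpdump -ni eth0 -w capture.pcap port 443
# tcpdump -nr capture.pcap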
Daniel Miessler has a good tutorial on tcpdump if you’re not familiar with it and don’t want to jump straight into the man page.
… with Docker
Docker sets up a separate network namespace for each container. To see the traffic across the interfaces of a single container you can nsenter the container’s network namespace:
# nsenter -t $(docker inspect --format '{{.State.Pid}}' my_container) -n tcpdump -nS port 80
This retrieves the PID for the container, and tells nsenter to enter the network (-n) namespace from the given target (-t) PID, and then run the given command (in this case tcpdump ...).
openssl s_client / s_server
When a connection is using TLS it’s often useful to try connecting to the server to see what certificate it presents, which algorithms it negotiates, and so forth. OpenSSL offers two useful subcommands which can help with this: s_client for connecting as a client, and s_server for listening for connections.
For example, using s_client to connect to google.com on the standard HTTPS port shows us details about the server cert and its verification status:
$ openssl s_client -connect google.com:443
CONNECTED(00000003)
depth=2 OU = GlobalSign Root CA - R2, O = GlobalSign, CN = GlobalSign
verify return:1
depth=1 C = US, O = Google Trust Services, CN = Google Internet Authority G3
verify return:1
depth=0 C = US, ST = California, L = Mountain View, O = Google LLC, CN = *.google.com
verify return:1
---
Certificate chain
0 s:C = US, ST = California, L = Mountain View, O = Google LLC, CN = *.google.com
i:C = US, O = Google Trust Services, CN = Google Internet Authority G3
1 s:C = US, O = Google Trust Services, CN = Google Internet Authority G3
i:OU = GlobalSign Root CA - R2, O = GlobalSign, CN = GlobalSign
---
# ...
Whereas connecting to my webserver and providing an unknown host in the SNI field results in SSL alert 112 (“The server name sent was not recognized”), and no server certificate is sent:
$ openssl s_client -connect chameth.com:443 -servername example.com
CONNECTED(00000003)
140384831313024:error:14094458:SSL routines:ssl3_read_bytes:tlsv1 unrecognized name:../ssl/record/rec_layer_s3.c:1536:SSL alert number 112
---
no peer certificate available
---
# ...
Often if you hit this kind of alert in an application the exact error will be lost somewhere in the many layers between the SSL library and the logs, so being able to directly connect and test can help diagnose a lot of issues.
Once a connection is established you can read and write plain text and it will be encrypted and decrypted automatically.
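s_server works the other way around: it listens for TLS connections, which is handy when you want to see what a client is doing. A minimal sketch – the certificate and key file names here are just placeholders, and -www makes it respond to HTTP requests with a status page:
$ openssl s_server -accept 4433 -cert server.crt -key server.key -www
$ openssl s_client -connect localhost:4433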
Java apps
If a Java app is involved in the connection, you can enable a lot of built-in debugging with a simple JVM property: javax.net.debug. You can tweak what exactly gets logged, but the easiest thing to do is just set the property to all and you’ll see information about certificate chains, verification, and packet dumps:
$ java -Djavax.net.debug=all -jar ....
# ...
found key for : duke
chain [0] = [
[
Version: V1
Subject: CN=Duke, OU=Java Software, O="Sun Microsystems, Inc.",
L=Cupertino, ST=CA, C=US
# ...
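all is very verbose; the property accepts more targeted values too. For example, limiting the output to handshake details (the ssl:handshake option is documented alongside the others on Oracle’s site):
$ java -Djavax.net.debug=ssl:handshake -jar ....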
More information about Java’s debugging options is available on docs.oracle.com.
Thread and core dumps
Higher-level languages frequently provide an interactive way to dump the current execution state of all of their threads (a “thread dump”). This is useful for spotting deadlocks and some types of race conditions, and as a quick and dirty method of investigating hangs or excessive CPU usage.
With both Java and Go applications you can send a QUIT signal to have a thread dump printed out; Go applications will quit after doing so, Java ones will carry on running. In most terminals you can hit Ctrl and \ to send a QUIT signal.
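If the process isn’t attached to your terminal you can send the signal directly with kill; for example, with a hypothetical PID of 8321:
$ kill -QUIT 8321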
For Java you can also use the jstack tool from the JDK to dump threads by PID; this can be useful if the application is running in the background or has its stdout redirected:
$ jstack 8321
Attaching to process ID 8321, please wait...
Debugger attached successfully.
Client compiler detected.
Thread t@5: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=-1107318896 (Interpreted frame)
- java.lang.Object.wait(long) @bci=0 (Interpreted frame)
- java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=116 (Interpreted frame)
- java.lang.ref.ReferenceQueue.remove() @bci=2, line=132 (Interpreted frame)
- java.lang.ref.Finalizer$FinalizerThread.run() @bci=3, line=159 (Interpreted frame)
# ...
A core dump provides more complete information about the state of a process, but is often more complex to interpret. The gcore utility from GDB will create a core dump of a process with a given PID. You can then generally load the core file using your normal debugger, depending on the language in question.
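For example, dumping a hypothetical process 8321 and then loading the result in GDB (the binary path is purely illustrative; gcore appends the PID to the output prefix you give it):
$ gcore -o /tmp/myapp 8321
$ gdb /usr/bin/myapp /tmp/myapp.8321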
System calls
strace is the Swiss Army knife for seeing what a process is doing. It details each system call made by a program (you can filter them down, of course). For example:
$ strace -e read curl https://google.com/
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 \236\0\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P!\0\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200l\2\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20Q\0\0\0\0\0\0"..., 832) = 832
# ...
read(3, "\0\0\0\0\0\0\0\4\25\345\366\302\273sE6\365wI\225\321|\3435Z\362\216\372\215\251aO"..., 253) = 253
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="https://www.google.com/">here</A>.
</BODY></HTML>
read(3, "\27\3\3\0!", 5) = 5
# ...
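You can also attach to a process that’s already running and narrow the trace to a class of calls; a sketch, again with a made-up PID:
$ strace -f -p 8321 -e trace=network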
Brendan Gregg has a nice guide on strace and alternatives.
… with Docker
When the application is running in Docker you can usually just strace it from the host with the correct PID (from e.g. docker inspect --format '{{.State.Pid}}' my_container). Sometimes you may need to trace the startup of an application though, which is a bit trickier. Instead you can run a new container using the same PID namespace as your target, and the permissions needed to strace:
$ docker run --rm -it --pid=container:my_container \
--net=container:my_container \
--cap-add sys_admin \
--cap-add sys_ptrace \
alpine
From within the new container you can install strace, and trace any running program within the target container using strace -p as normal. To start a new program you need access to the target container’s file system, which you can get to via /proc/1/root (PID 1 being the main process that Docker started in the target container).
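For example, from a shell in the debug container started above you might install strace from alpine’s repositories, attach to the target’s main process, or – more speculatively, assuming the debug container is running as root and the binary name is made up – start a new program from the target’s filesystem by chrooting into /proc/1/root:
# apk add strace
# strace -fp 1
# strace -f chroot /proc/1/root /usr/bin/myapp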
Files
Sometimes the problem might relate to file access. There are a couple of straightforward – but nonetheless useful – tools which might help here. inotifywait uses the Linux inotify subsystem to watch files or directories for operations. For example:
$ inotifywait -mr site/content
Setting up watches. Beware: since -r was given, this may take a while!
Watches established.
site/content/post/ MODIFY 2019-05-08-debugging-beyond-the-debugger.md
site/content/post/ OPEN 2019-05-08-debugging-beyond-the-debugger.md
site/content/post/ MODIFY 2019-05-08-debugging-beyond-the-debugger.md
site/content/post/ MODIFY 2019-05-08-debugging-beyond-the-debugger.md
site/content/post/ CLOSE_WRITE,CLOSE 2019-05-08-debugging-beyond-the-debugger.md
# ...
Here the -m switch makes inotifywait monitor the files forever (instead of exiting after the first event, which is the normal behaviour) and -r makes it recurse into the directory and monitor each file and subdirectory in there.
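If you only care about certain events you can filter them with -e; for example, to only report files that have finished being written:
$ inotifywait -mr -e close_write site/content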
If you want to see what processes currently have a file open, fuser is the go-to tool. For example:
$ fuser -v /
USER PID ACCESS COMMAND
/: root kernel mount /
chris 2961 .rc.. systemd
chris 2986 .r... gdm-x-session
chris 2994 .r... dbus-daemon
chris 3001 .r... gnome-session-b
# ...
Honourable mentions
These aren’t really debugging tools, but I feel they’re worth mentioning as they often feature somewhere along the debugging-of-weird-problems journey.
I’ve seen some weird and wonderful problems happen because a disk is full, so a quick df early on in the debugging process never hurts. Some apps may hang, some may corrupt their config, some may fall over and die; sometimes the manner in which they fail doesn’t obviously point to a disk space issue.
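The human-readable summary is enough to spot a full filesystem at a glance:
$ df -h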
Another issue that comes up now and then – especially inside VMs or other environments that don’t have a decent amount of “noise” happening – is entropy exhaustion. A quick look at /proc/sys/kernel/random/entropy_avail should be enough to confirm that everything is ticking along nicely. If it’s exceedingly low then you may find that anything involving random number generation stalls (TLS connections, for example).
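Checking it is just a matter of reading that file; on typical kernels of this era the pool tops out at 4096 bits, so a value stuck down in the double digits is a bad sign:
$ cat /proc/sys/kernel/random/entropy_avail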