Always close the response body! Running out of file descriptors in Golang

Update 10-07-2020: One thing I love about programming is you make mistakes, you learn, you iterate. So in that spirit… don’t do what I suggested below! In my final code sample I defer closing the response body. First, this is unnecessary. There’s no work that needs to be done between error checking and closing the body so just do it synchronously. Second, don’t defer in a loop unless you’re absolutely sure you know what you’re doing! As pointed out in Inanc Gumus’s handy blog post and Erin’s amazing Monzo incident thread, defers will get pushed onto the function’s stack and could exhaust it. This is a serious risk in my example because I have no way of knowing how many connections I’m trying to close. So anyways, “Always close the response body!”… but avoid defer in a for loop.

In Go you cannot just fire and forget a function call. Failing to handle the returned values can cause your program to fail. It is quite easy, for instance, to leak file descriptors. This is very, very bad for long-running or high-throughput processes. Check for leaks.

Consider a web server built to provide a REST API to some other tool. Our example is for a message broker, but this could be any software you want to proxy or apply a standard interface to. Users send CRUD requests to the server and a client library does the heavy lifting. The following code is being called when the server receives a DELETE request for an app:

func (s *Server) Disconnect(req Request) error {

  ...

  resp, err := s.client.DeleteUser(req.User)
  if err != nil {
    return err
  }
  defer resp.Body.Close()

  connections, err := s.client.ListConnections()
  if err == nil {
    for _, connection := range connections {
      if connection.User == req.User {
        s.client.CloseConnection(connection.Name)
      }
    }
  }

  ...

  return nil
}

Each app has an associated user in the broker. Over the lifetime of that user they may have created any number of independent connections (e.g. message queues). We want to delete the associated user. We then check for orphaned connections and close them one by one.

The problem is that we’ve forgotten to consider the return type of CloseConnection. ListConnections returns a struct generated in the client library. But CloseConnection just forwards the HTTP response from the broker. The documentation for “net/http” states:

“If the returned error is nil, the Response will contain a non-nil Body which the user is expected to close. If the Body is not both read to EOF and closed, the Client’s underlying RoundTripper (typically Transport) may not be able to re-use a persistent TCP connection to the server for a subsequent “keep-alive” request.”

Response.Body implements io.Reader so that the “response body is streamed on demand as the Body field is read”. This way the response can be piped to a decoder without buffering very long messages in memory. But it also means the caller has to decide explicitly when it is finished with the stream. Some other languages (e.g. Ruby) read the entire message into memory by default and use a different syntax for streaming.

Every call to CloseConnection opens a file descriptor. If the response body is not closed, that file descriptor will never be closed. In fact, it will be stuck in a state called CLOSE_WAIT indefinitely. The file descriptor limit varies between kernels and can be altered but a common default for a shell (ulimit -n) is 1024. If the processes spawned by that shell exceed its FD limit, any attempt to open another one - in this case to open another TCP socket - will fail with an EMFILE (Too many open files) error. And that’s basically you done. The process can be restarted. The ulimit can be set higher. But the problem will persist until the source of the leak is fixed. We need to handle the response and error of CloseConnection and close the response body:

  resp, err := s.client.CloseConnection(connection.Name)
  if err != nil {
    return err
  }

  defer resp.Body.Close()

This bug is a slow burn. Unless your CI includes some kind of very high throughput perf test, you’re unlikely to run out of file descriptors before you ship. Without a properly configured long-running environment, the only place this will manifest itself is in prod. This is bad enough for a server you maintain. In the very worst case you’ll have an incident, get paged in the middle of the night to restart the server and maybe bump the ulimit so that you can fix it in the morning. If you ship it as a binary your only option will be to ship a new release.

Beyond trying your very hardest to not make this mistake (ever), consider the failure modes of the application you’re trying to build. Test for them where possible. If you’re building a web server, this should include testing for file descriptor leaks. Check lsof for TCP connections in CLOSE_WAIT after your end to end test run. Or if you’re feeling minimalist, you can check /proc/net/tcp for connections from your server’s port (in hexadecimal) that have connection state 8 (CLOSE_WAIT).
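As a sketch of that minimalist check, a Linux-only helper (countCloseWait is a hypothetical name) can parse /proc/net/tcp directly. It matches the peer’s port in the rem_address column, on the assumption that the leaked sockets are client connections to the broker:

```go
package main

import (
	"fmt"
	"io/ioutil"
	"strings"
)

// countCloseWait parses /proc/net/tcp (Linux only) and counts sockets
// in state 08 (CLOSE_WAIT) whose remote address ends in the given hex
// port, e.g. "1F90" for a broker listening on 8080.
func countCloseWait(hexPort string) (int, error) {
	data, err := ioutil.ReadFile("/proc/net/tcp")
	if err != nil {
		return 0, err
	}
	lines := strings.Split(string(data), "\n")
	count := 0
	for _, line := range lines[1:] { // skip the header row
		fields := strings.Fields(line)
		if len(fields) < 4 {
			continue
		}
		// fields[2] is rem_address as hex IP:PORT, fields[3] is state.
		if strings.HasSuffix(fields[2], ":"+hexPort) && fields[3] == "08" {
			count++
		}
	}
	return count, nil
}

func main() {
	n, err := countCloseWait("1F90")
	fmt.Println(n, err)
}
```

A count that only ever grows over the course of an end to end test run is the signature of this leak.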