Always close the response body! Running out of file descriptors in Golang
Update 10-07-2020: One thing I love about programming is that you make mistakes, you learn, you iterate. So in that spirit… don’t do what I suggested below! In my final code sample I defer closing the response body. First, this is unnecessary: there’s no work that needs to be done between the error check and closing the body, so just close it synchronously. Second, don’t defer in a loop unless you’re absolutely sure you know what you’re doing! As pointed out in Inanc Gumus’s handy blog post and Erin’s amazing Monzo incident thread, defers get pushed onto the function’s stack and don’t run until the function returns, so they can exhaust it. This is a serious risk in my example because I have no way of knowing how many connections I’m trying to close. So anyway, “Always close the response body!”… but avoid defer in a for loop.
In Go you cannot just fire and forget a function call. Failing to handle its return values can make your program fail in surprising ways. It is quite easy, for instance, to leak file descriptors, which is very, very bad for long-running or high-throughput processes. Check for leaks.
Consider a web server built to provide a REST API to some other tool. Our example is for a message broker, but this could be any software you want to proxy or apply a standard interface to. Users send CRUD requests to the server and a client library does the heavy lifting. The following code is being called when the server receives a DELETE request for an app:
func (s *Server) Disconnect(req Request) error {
	...
	resp, err := s.client.DeleteUser(req.User)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	connections, err := s.client.ListConnections()
	if err == nil {
		for _, connection := range connections {
			if connection.User == req.User {
				s.client.CloseConnection(connection.Name)
			}
		}
	}
	...
	return nil
}
Each app has an associated user in the broker. Over the lifetime of that user they may have created any number of independent connections (e.g. message queues). We want to delete the associated user. We then check for orphaned connections and close them one by one.
The problem is that we’ve forgotten to consider the return values of CloseConnection. ListConnections returns a struct generated in the client library, but CloseConnection just forwards the HTTP response from the broker.
The documentation for “net/http” states:
“If the returned error is nil, the Response will contain a non-nil Body which the user is expected to close. If the Body is not both read to EOF and closed, the Client’s underlying RoundTripper (typically Transport) may not be able to re-use a persistent TCP connection to the server for a subsequent “keep-alive” request.”
Response.Body implements an io.Reader so that the “response body is streamed on demand as the Body field is read”. This way the response can be piped to a decoder, avoiding buffering very long messages in memory. But it also means the client has to decide explicitly when it has finished with the stream. Some other languages (e.g. Ruby) read the entire message into memory by default and use a different syntax for streaming.
Every call to CloseConnection opens a file descriptor. If the response body is not closed, that file descriptor will never be released. In fact, the connection will be stuck in a state called CLOSE_WAIT indefinitely. The file descriptor limit varies between kernels and can be altered, but a common default for a shell (ulimit -n) is 1024. If the processes spawned by that shell exceed its FD limit, any attempt to open another one - in this case to open another TCP socket - will fail with an EMFILE (Too many open files) error. And that’s basically you done. The process can be restarted. The ulimit can be set higher. But the problem will persist until the source of the leak is fixed. We need to handle the response and error of CloseConnection and close the response body:
resp, err := s.client.CloseConnection(connection.Name)
if err != nil {
return err
}
defer resp.Body.Close()
This bug is a slow burn. Unless your CI includes some kind of very high-throughput perf test, you’re unlikely to run out of file descriptors before you ship. Without a properly configured long-running environment, the only place this will manifest itself is in prod. This is bad enough for a server you maintain. In the very worst case you’ll have an incident, get paged in the middle of the night to restart the server, and maybe bump the ulimit so that you can fix it in the morning. If you ship it as a binary, your only option will be to ship a new release.
Beyond trying your very hardest to not make this mistake (ever), consider the failure modes of the application you’re trying to build. Test for them where possible. If you’re building a web server, this should include testing for file descriptor leaks. Check lsof for TCP connections in CLOSE_WAIT after your end-to-end test run. Or, if you’re feeling minimalist, you can check /proc/net/tcp for connections from your server’s port (in hexadecimal) that have connection state 8 (CLOSE_WAIT).