
Goroutine Leak #614

Open · antoniomika opened this issue Apr 8, 2025 · 5 comments

@antoniomika

Hello!

We're using souin over at pico (repo) for our caching system. Lately, we've been noticing goroutine leaks from souin over time. It doesn't look terribly bad, but we're worried about the long-term implications.

For example, here's a graph from two different instances of pgs w/ souin middleware:

[image: goroutine counts over time for two pgs instances]

It does look like this is associated with some memory leak, but it's hard to tell by how much, and I'd prefer not to throw in more confounders. Here's how memory looks, though (in MB). First, resident memory:

[image: resident memory in MB]

And next virtual:

[image: virtual memory in MB]

I went ahead and grabbed a dump of the goroutine stacks via a SIGABRT on the process. Those logs can be found here.

I've traced it back to the middleware, specifically on the error channel here:

	go func(vr *http.Request, cw *CustomWriter) {
		prometheus.Increment(prometheus.NoCachedResponseCounter)
		errorCacheCh <- s.Upstream(cw, vr, next, requestCc, cachedKey, uri)
	}(req, customWriter)
	select {
	case <-req.Context().Done():
		switch req.Context().Err() {
		case baseCtx.DeadlineExceeded:
			rw.Header().Set("Cache-Status", cacheName+"; fwd=bypass; detail=DEADLINE-EXCEEDED")
			customWriter.Rw.WriteHeader(http.StatusGatewayTimeout)
			_, _ = customWriter.Rw.Write([]byte("Internal server error"))
			s.Configuration.GetLogger().Infof("Internal server error on endpoint %s: %v", req.URL, s.Storers)
			return baseCtx.DeadlineExceeded
		case baseCtx.Canceled:
			return baseCtx.Canceled
		default:
			return nil
		}
	case v := <-errorCacheCh:
		switch v {
		case nil:
			_, _ = customWriter.Send()
		case Upstream50xError:
			_, _ = customWriter.Send()
			return nil
		}
		return v
	}
}

My best guess is that the request context completes before the upstream request finishes. The error channel is then never read, so the goroutine blocks forever on the send. We probably want a way to cancel the upstream request when the context resolves, plus a defer to close that channel; we'd need to pay special attention to what happens to a send on a closed channel, though.
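
For illustration, even just making the channel buffered would keep that goroutine from blocking on the send. This is only a sketch against the snippet above (not souin's actual code), and it doesn't cancel the upstream work, it just lets the goroutine finish:

	// With a buffer of 1, the goroutine can always complete its send, even if the
	// select has already returned because the request context finished first.
	errorCacheCh := make(chan error, 1)

	go func(vr *http.Request, cw *CustomWriter) {
		prometheus.Increment(prometheus.NoCachedResponseCounter)
		errorCacheCh <- s.Upstream(cw, vr, next, requestCc, cachedKey, uri) // never blocks
	}(req, customWriter)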

I'd provide a PR, but I'm not super familiar with the souin codebase or with what issues a change like this might pose. Let me know if there's anything I can help with, though!

@darkweak (Owner) commented Apr 8, 2025

Hey @antoniomika, what a nice investigation, thank you so much! I don't remember why I never added a close for that channel.

@darkweak (Owner) commented Apr 8, 2025

I think we can write something like this:

	errorCacheCh := make(chan error)
	defer close(errorCacheCh)

	go func(vr *http.Request, cw *CustomWriter) {
		defer func() {
			if r := recover(); r != nil {
				s.Configuration.GetLogger().Infof("recovered due to closed errorCacheCh chan, the request context has finished prematurely %s", req.URL)
			}
		}()
		prometheus.Increment(prometheus.NoCachedResponseCounter)
		errorCacheCh <- s.Upstream(cw, vr, next, requestCc, cachedKey, uri)
	}(req, customWriter)

instead of the current:

	errorCacheCh := make(chan error)
	go func(vr *http.Request, cw *CustomWriter) {
		prometheus.Increment(prometheus.NoCachedResponseCounter)
		errorCacheCh <- s.Upstream(cw, vr, next, requestCc, cachedKey, uri)
	}(req, customWriter)

WDYT?

@antoniomika (Author)

@darkweak that should work, but shouldn't we be canceling the upstream request?

The other reason we're looking into this is that we're a bit worried about souin causing a large number of upstream requests when the cache is cleared (or when some other issue occurs). Again, I'm not too familiar with the codebase, so I'm not sure whether that's even something we should be worried about.

It's hard for us to tell what's happening: since we're using the middleware, we wouldn't see an uptick in requests at our load balancer to check, so we're trying to debug by looking at ancillary services.

@darkweak (Owner) commented Apr 9, 2025

but shouldn't we be canceling the upstream request?

We should.

souin causing a large number of upstream requests when cache is cleared

What do you mean? Do you want the cache to serve the content as stale?

@antoniomika (Author)

Kk, I was thinking the upstream request would take a context that we could cancel to stop it. I imagine that'd be a larger refactor though. I should have more time over the weekend to look into that.
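
Roughly what I had in mind, as a sketch only (reusing the names from the snippet in my first comment, and assuming everything under Upstream honors the request's context):

	// Derive a cancellable context from the incoming request and hand a request
	// carrying it to the upstream call; cancel once this handler is done.
	upstreamCtx, cancel := baseCtx.WithCancel(req.Context())
	defer cancel()

	go func(vr *http.Request, cw *CustomWriter) {
		prometheus.Increment(prometheus.NoCachedResponseCounter)
		// Cancelling upstreamCtx should abort the upstream round trip, as long as
		// the handlers below Upstream respect vr.Context().
		errorCacheCh <- s.Upstream(cw, vr, next, requestCc, cachedKey, uri)
	}(req.WithContext(upstreamCtx), customWriter)

Combined with closing (or buffering) errorCacheCh so the send can't block, that should cover both ends of the leak.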

As for the second issue, it's a bit more complicated to explain but I'll try to include as much detail as I can.

We use souin with the caching middleware (https://github.com/picosh/pico/blob/21ef05522522d6a53fb21bd6d07e2e0f831a8a8e/pkg/apps/pgs/web.go#L41). When a user uploads an item to pgs, we then clear the cache for that item (https://github.com/picosh/pico/blob/21ef05522522d6a53fb21bd6d07e2e0f831a8a8e/pkg/apps/pgs/web.go#L306).

This all seems to work well, but when a large number of files are uploaded at once, we see a large spike in GET requests to our object store instance. This is semi-expected, since rsync uses range requests for checksumming, but the jump is much larger than we'd normally expect.

The other thought I had was that souin then triggers an upstream request for the cached item we've purged, but I haven't been able to validate that. We have Prometheus enabled, so if there are any metrics you can think of for us to check, I'd be happy to.

This is all to say that I don't really think this is a souin issue so much as a problem with how we've implemented things. I just wanted to cover my bases, and when I saw the goroutine leak, I figured it'd be a good time to look into it.
