
Goroutine Leak #614

Open · antoniomika opened this issue Apr 8, 2025 · 5 comments

@antoniomika

Hello!

We're using souin over at pico (repo) for our caching system. Lately, we've been noticing goroutine leaks from souin over time. It doesn't look terribly bad, but we're worried about the long-term implications.

For example, here's a graph from two different instances of pgs w/ souin middleware:

[image: goroutine counts over time for two pgs instances]

It does look like this is associated with some memory leak, but it's hard to tell by how much, and I'd prefer not to throw in more confounders. Here's how memory looks, though (in MB). First, resident memory:

[image: resident memory in MB]

And next virtual:

[image: virtual memory in MB]

I went ahead and grabbed a dump of the goroutine stacks via a SIGABRT on the process. Those logs can be found here.

I've traced it back to the middleware, specifically on the error channel here:

	go func(vr *http.Request, cw *CustomWriter) {
		prometheus.Increment(prometheus.NoCachedResponseCounter)
		errorCacheCh <- s.Upstream(cw, vr, next, requestCc, cachedKey, uri)
	}(req, customWriter)
	select {
	case <-req.Context().Done():
		switch req.Context().Err() {
		case baseCtx.DeadlineExceeded:
			rw.Header().Set("Cache-Status", cacheName+"; fwd=bypass; detail=DEADLINE-EXCEEDED")
			customWriter.Rw.WriteHeader(http.StatusGatewayTimeout)
			_, _ = customWriter.Rw.Write([]byte("Internal server error"))
			s.Configuration.GetLogger().Infof("Internal server error on endpoint %s: %v", req.URL, s.Storers)
			return baseCtx.DeadlineExceeded
		case baseCtx.Canceled:
			return baseCtx.Canceled
		default:
			return nil
		}
	case v := <-errorCacheCh:
		switch v {
		case nil:
			_, _ = customWriter.Send()
		case Upstream50xError:
			_, _ = customWriter.Send()
			return nil
		}
		return v
	}
}

My best guess is that the request context completes before the upstream request finishes. The error channel is then never read, so the goroutine blocks forever on the send. We probably want a way to cancel the upstream request when the context resolves, plus a defer to close that channel; we'd need to pay special attention to what happens to a send on a closed channel, though.
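
For illustration, even just making the channel buffered would keep that goroutine from blocking on the send. This is only a sketch against the snippet above (not souin's actual code), and it doesn't cancel the upstream work, it just lets the goroutine finish:

	// With a buffer of 1, the goroutine can always complete its send, even if the
	// select has already returned because the request context finished first.
	errorCacheCh := make(chan error, 1)

	go func(vr *http.Request, cw *CustomWriter) {
		prometheus.Increment(prometheus.NoCachedResponseCounter)
		errorCacheCh <- s.Upstream(cw, vr, next, requestCc, cachedKey, uri) // never blocks
	}(req, customWriter)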

I'd provide a PR, but I'm not super familiar with the souin codebase or with what issues a change like this might pose. Let me know if there's anything I can help with, though!

@darkweak (Owner) commented Apr 8, 2025

Hey @antoniomika, what a nice investigation, thank you so much! I don't remember why I never added a close for that channel.

@darkweak (Owner) commented Apr 8, 2025

I think we can write something like this:

	errorCacheCh := make(chan error)
	defer close(errorCacheCh)

	go func(vr *http.Request, cw *CustomWriter) {
		defer func() {
			if r := recover(); r != nil {
				s.Configuration.GetLogger().Infof("recovered due to closed errorCacheCh chan, the request context has finished prematurely %s", req.URL)
			}
		}()
		prometheus.Increment(prometheus.NoCachedResponseCounter)
		errorCacheCh <- s.Upstream(cw, vr, next, requestCc, cachedKey, uri)
	}(req, customWriter)

instead of the current:

	errorCacheCh := make(chan error)
	go func(vr *http.Request, cw *CustomWriter) {
		prometheus.Increment(prometheus.NoCachedResponseCounter)
		errorCacheCh <- s.Upstream(cw, vr, next, requestCc, cachedKey, uri)
	}(req, customWriter)

WDYT?

@antoniomika (Author)

@darkweak that should work, but shouldn't we be canceling the upstream request?

The other reason we're looking into this is that we're a bit worried about souin causing a large number of upstream requests when the cache is cleared (or when some other issue occurs). Again, I'm not too familiar with the codebase, so I'm not sure whether that's even something we should be worried about.

It's hard for us to tell what's happening: since we're using the middleware, we wouldn't see an uptick in requests at our load balancer to check, so we're trying to debug by looking at ancillary services.

@darkweak (Owner) commented Apr 9, 2025

but shouldn't we be canceling the upstream request?

We should.

souin causing a large number of upstream requests when cache is cleared

What do you mean? Do you want the cache to serve the content as stale?

@antoniomika (Author)

Kk, I was thinking the upstream request would take a context that we could cancel to stop it. I imagine that'd be a larger refactor though. I should have more time over the weekend to look into that.
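
Roughly what I had in mind, as a sketch only (reusing the names from the snippet in my first comment, and assuming everything under Upstream honors the request's context):

	// Derive a cancellable context from the incoming request and hand a request
	// carrying it to the upstream call; cancel once this handler is done.
	upstreamCtx, cancel := baseCtx.WithCancel(req.Context())
	defer cancel()

	go func(vr *http.Request, cw *CustomWriter) {
		prometheus.Increment(prometheus.NoCachedResponseCounter)
		// Cancelling upstreamCtx should abort the upstream round trip, as long as
		// the handlers below Upstream respect vr.Context().
		errorCacheCh <- s.Upstream(cw, vr, next, requestCc, cachedKey, uri)
	}(req.WithContext(upstreamCtx), customWriter)

Combined with closing (or buffering) errorCacheCh so the send can't block, that should cover both ends of the leak.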

As for the second issue, it's a bit more complicated to explain but I'll try to include as much detail as I can.

We use souin with the caching middleware (https://github.com/picosh/pico/blob/21ef05522522d6a53fb21bd6d07e2e0f831a8a8e/pkg/apps/pgs/web.go#L41). When a user uploads an item to pgs, we then clear the cache for that item (https://github.com/picosh/pico/blob/21ef05522522d6a53fb21bd6d07e2e0f831a8a8e/pkg/apps/pgs/web.go#L306).

This all seems to work well, but when a large number of files are uploaded at once, we see a large spike in GET requests to our object store instance. This is semi-expected, since rsync uses range requests for checksumming, but the jump is much larger than we'd normally expect.

The other thought I had was that souin then triggers an upstream request for the cached item we've purged, but I haven't been able to validate that. We have Prometheus enabled, so if there are any metrics you can think of for us to check, I'd be happy to.

This is all to say that I don't really think this is a souin issue so much as a problem with how we've implemented things. I just wanted to cover my bases, and when I saw the goroutine leak, I figured it'd be a good time to look into it.
