-
Notifications
You must be signed in to change notification settings - Fork 922
panic when a Context is cancelled #614
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I jammed a |
EDIT: please ignore this comment, see comment below. We're seeing a similar issue. Our code basically looks like this: tx, err := db.BeginTx(ctx, nil)
if err != nil {
return err
}
defer tx.Rollback()
rows, err := tx.QueryContext(ctx, sql, args...)
// do stuff with rows ... And we see this panic:
Specifically the diff --git a/vendor/github.com/lib/pq/conn.go b/vendor/github.com/lib/pq/conn.go
index 3b322bc..398eca1 100644
--- a/vendor/github.com/lib/pq/conn.go
+++ b/vendor/github.com/lib/pq/conn.go
@@ -946,6 +946,7 @@ func (cn *conn) recvMessage(r *readBuf) (byte, error) {
n := int(binary.BigEndian.Uint32(x[1:])) - 4
var y []byte
if n <= len(cn.scratch) {
+ fmt.Printf("n=%d len=%d\n", n, len(cn.scratch))
y = cn.scratch[:n]
} else {
y = make([]byte, n) When I prefix my example code above with Hopefully this information will be useful! If I find myself with a bit more time, I'll also try to contribute a patch, but I'm not sure if this will happen at any point soon. |
i ran into this again. the stack trace seems to alternate between the one I posted about (pointing at conn.go:1403) and the one that @felixge posted.
the calling code looks something like this. func DoStuff(arg string) error {
ctx, cancel := context.WithTimeout(context.Background(), timeout)
defer cancel()
tx, err := p.db.BeginTx(ctx, nil)
if err != nil {
return err
}
var thing MyStruct
if err := tx.Stmt(p.selectForUpdate).QueryRow(sessionID).Scan(&thing.Foo, &thing.Bar, ...., &thing.Baz); err != nil {
if err != sql.ErrTxDone {
if rollbackErr := db.Rollback(); rollbackErr != nil {
log.Printf("warning: rollback failed! %s", rollbackErr)
}
}
return err
}
// more here ....
} by tuning the timeout, i can get the errors to reliably happen in the first i spent a while trying to replicate on a very simple test case using |
I attempted to reproduce your bug but was not able to. Here's what I tried:
Some of these error (as expected), but none of them panic'd. Until we can get a reproducible test case, it's hard to know if this is your bug (race condition, multiple go routines using the same connection, not closing something) or a lib/pq bug. |
@mjibson thanks for responding!
I've got a similar attempt at a repro here and also couldn't get a panic to occur.
totally understood. the panic-ing code is using prepared statements on a this feels real subtle. |
i've got a new and different panic trace. it points into
|
@mjibson I would guess neither of us could repro with a simple example because we didn't explicitly use a
(I doctored the trace a little bit to not have employer info posted publicly. If that's an issue, let me know). |
@mjibson i had time to sit down and fix my repro. I can reliably reproduce the panic running this with https://gist.github.com/blinsay/044f39176eeab7f3020d1b527df5bdd7 the race detector seems to be pointing to db.BeginTx in the stdlib where |
I am also experiencing this bug. I'm using BeginTx, in my case I it panics every time. Even using recover() it still crashes my process. |
I'm not 100% sure, but this looks like a bug in the standard library to me. Rows.Next() seems to think it can get away with only holding Rows.closemu, and Tx.rollback() thinks holding a lock on driverConn prevents concurrent access. Seems like someone has some explaining to do. |
@johto the docs don't seem to specify anything about whether or not sql.Tx and driver.Tx need to be safe for concurrent use. it's unclear to me whether the default should be safe or unsafe - it seems like the Go authors and driver authors have been assuming different things. I have an open stdlib issue to see if we can get some clarification: |
I know its not likely to change but has anyone tried this with Go 1.9? If not I'll try to reproduce later this week or next. |
I just reproduced it with the same repro on Go 1.9. EDIT: panic trace is below. here's some sample race detector output.
|
Trying to summarize this issue. Is it fair to say the Go standard library documentation isn't clear for driver writers? But it may be possible for driver writers to fix this issue or the Go team or both? I'm hoping this doesn't get stuck in a situation where the pq team and the Go team expect the other to address the issue and nothing is done. I use context with all database actions so I don't get into a situation where I may get blocked indefinitely. |
It's pretty clear to me that if the documentation doesn't mandate concurrency safety from the drivers, the sql package should make no assumptions.
I stopped contributing before contexts were added, but back when I was still active we certainly assumed we should never use the driver interfaces concurrently. This could very well be unintentional. |
FYI this seems like it was indeed a bug with the stdlib, @johto was right about the assumption that the driver interface should never get used concurrently. There's a fix in review right now! With the patch applied, it seems like the only thing left for lib/pq to handle is that the driver frequently seems to be returning sql.ErrNoRows instead of a context cancellation error or a Postgres query cancellation error. |
hi, does this issue still valid? But I can't reproduce it with Go 1.10.4 and it already almost 1 hour. The command go run -race repro.go -reps 5 -timeout-ms 100 -jitter-ms 5 |
Well then I guess we can close this? |
Thanks for closing! I haven't seen it since reporting the issue. |
I've been using
SELECT FOR UPDATE
in a transaction and wanted to put a timeout on the transaction so that callers wouldn't get blocked. I've been seeing aruntime error: slice bounds out of range
panic inrows.Next
that smells like some invalid data being returned.The panic starts here and only occurs if I get the context to cancel at exaaactly the right time. Adding print statements or changing my context timeout changes how reliably I can reproduce it.
I'm running against Postgres 9.6.2 and have lib/pq at 2704adc.
The text was updated successfully, but these errors were encountered: