Avoid unnecessarily disabling CUDA graphs (ggml-org#7302)

agray3 · teleprint-me · commit dda1347da2d4 · 2024-05-16T22:55:32.000-04:00
As discussed in PR ggml-org#6766, CUDA graphs were being disabled in the presence of long prompts. This fixes the issue by preventing the consecutive-update counter from incrementing unnecessarily for tokens for which CUDA graphs are disabled due to batch size > 1.
diff --git a/ggml-cuda.cu b/ggml-cuda.cu
@@ -2558,7 +2558,7 @@ GGML_CALL static enum ggml_status ggml_backend_cuda_graph_compute(ggml_backend_t
         }
 
         // Disable CUDA graphs (from the next token) if the use-case is demanding too many consecutive graph updates.
-        if (cuda_graph_update_required) {
+        if (use_cuda_graph && cuda_graph_update_required) {
             cuda_ctx->cuda_graph->number_consecutive_updates++;
         } else {
             cuda_ctx->cuda_graph->number_consecutive_updates = 0;
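
For illustration only, the effect of the one-line change can be modelled outside of ggml. The sketch below is not the ggml-cuda.cu implementation: `GraphState`, `step`, `graphs_survive_prompt`, the `patched` flag, the threshold value and the batch size of 512 are all hypothetical names and numbers chosen for this example. It also assumes that every evaluation with batch size > 1 reports that a graph update is required, which is the situation described in the commit message.

```cpp
// Hypothetical stand-in for the per-context CUDA graph bookkeeping.
// Only the fields relevant to this change are modelled.
struct GraphState {
    int  number_consecutive_updates = 0;
    bool disabled_too_many_updates  = false;
};

// Assumed threshold, picked for illustration; not taken from the diff.
constexpr int kMaxConsecutiveUpdates = 4;

// Models one graph evaluation.
//   batch_size > 1  : prompt processing, CUDA graphs are not used for this evaluation
//   update_required : the graph would have needed an update
//   patched         : false = condition before this commit, true = condition after it
void step(GraphState & st, int batch_size, bool update_required, bool patched) {
    const bool use_cuda_graph = !st.disabled_too_many_updates && batch_size == 1;

    // Before: if (cuda_graph_update_required)
    // After:  if (use_cuda_graph && cuda_graph_update_required)
    const bool count_update = patched ? (use_cuda_graph && update_required)
                                      : update_required;

    if (count_update) {
        st.number_consecutive_updates++;
    } else {
        st.number_consecutive_updates = 0;
    }

    // Too many consecutive updates: stop using CUDA graphs from the next token on.
    if (st.number_consecutive_updates >= kMaxConsecutiveUpdates) {
        st.disabled_too_many_updates = true;
    }
}

// Runs `prompt_evals` batched prompt evaluations and reports whether CUDA graphs
// are still available for the single-token generation phase that follows.
bool graphs_survive_prompt(int prompt_evals, bool patched) {
    GraphState st;
    for (int i = 0; i < prompt_evals; ++i) {
        step(st, /*batch_size=*/512, /*update_required=*/true, patched);
    }
    return !st.disabled_too_many_updates;
}
```

Under these assumptions, `graphs_survive_prompt(10, false)` returns false because the counter reaches the threshold during prompt processing, while `graphs_survive_prompt(10, true)` returns true because evaluations that do not use CUDA graphs reset the counter instead of incrementing it.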
