[SYCL][Host Task] Bad performance of consecutively submitted host tasks onto an in-order queue #18500

Open
Nuullll opened this issue May 16, 2025 · 2 comments
Labels
performance Performance related issues

Comments

@Nuullll
Contributor

Nuullll commented May 16, 2025

Describe the bug

When submitting consecutive host tasks to an in-order queue without an explicit wait(), the execution time of each host task explodes as the number of submissions increases.

To reproduce

Reproducing code

// test.cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <thread>
#include <chrono>

int main(int argc, char *argv[]) {
    sycl::queue queue(sycl::property::queue::in_order{});

    std::cout << "Using device: " << queue.get_device().get_info<sycl::info::device::name>() << "\n";

    int repeat = 10000;
    if (argc > 1) {
        repeat = std::stoi(std::string(argv[1]));
    }
    int data = 0;

    std::cout << "Submitting " << repeat << " host tasks...\n";
    auto start_time = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < repeat; i++) {
        std::this_thread::sleep_for(std::chrono::microseconds(500));
        auto e = queue.submit([&](sycl::handler &cgh) {
            cgh.host_task([&]() {
                // Simulate some work on the host
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
                data++;
            });
        });
#ifdef WAIT
        e.wait();
#endif
    }

    queue.wait();
    auto end_time = std::chrono::high_resolution_clock::now();
    std::cout << "Total execution time: " << std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count() << " ms\n";
    if (data != repeat) {
        std::cerr << "Error: data mismatch! Expected " << repeat << ", got " << data << "\n";
        return 1;
    }

    return 0;
}

Compile

Compile the code w/ and w/o an explicit wait for each submission.

clang++ -fsycl test.cpp -o nowait.out
clang++ -fsycl test.cpp -DWAIT -o wait.out

Run

Pass the number of consecutive submissions (repeat) as the first argument.

./nowait.out 3000
./wait.out 3000

Results for different repeat

Total time in ms

| repeat     | 10 | 100 | 1000 | 3000  | 10000  |
|------------|----|-----|------|-------|--------|
| wait.out   | 16 | 162 | 1617 | 4853  | 16184  |
| nowait.out | 11 | 106 | 1396 | 12996 | 519977 |

Avg time in ms

| repeat     | 10  | 100  | 1000  | 3000  | 10000   |
|------------|-----|------|-------|-------|---------|
| wait.out   | 1.6 | 1.62 | 1.617 | 1.618 | 1.6184  |
| nowait.out | 1.1 | 1.06 | 1.396 | 4.332 | 51.9977 |

Expected behavior

Even w/o explicit wait() for each submission (onto an in-order queue), the average execution time of each host task should be around 1ms. The 50x slowdown when repeat==10000 is not expected.

Environment

  • OS: Linux
  • Target device and vendor: host
  • DPC++ version: 7987a43
  • Dependencies version: Not relevant

Additional context

No response

@Nuullll Nuullll added the performance Performance related issues label May 16, 2025
@Nuullll
Contributor Author

Nuullll commented May 16, 2025

Initial analysis indicates that we're tracking unnecessary dependencies in MBlockedUsers:

void addBlockedUserUnique(const EventImplPtr &NewUser) {
  if (std::find(MBlockedUsers.begin(), MBlockedUsers.end(), NewUser) !=
      MBlockedUsers.end())
    return;
  MBlockedUsers.push_back(NewUser);
}

For example, suppose we have 4 tasks A, B, C, D that are submitted sequentially to an in-order queue w/o waiting, and that each task runs long enough that no task completes before all four are submitted.

We currently track all direct and indirect blocking relations between enqueued commands. After D is submitted, the MBlockedUsers of each task looks like:

A.MBlockedUsers = {B, C, D}
B.MBlockedUsers = {C, D}
C.MBlockedUsers = {D}
D.MBlockedUsers = {}

It appears redundant to track {C, D} as blocked users of A, since these tasks are already blocked by B. This redundancy in MBlockedUsers leads to a significant increase in notification time when a task completes, as seen in the following code sections:

Scheduler::getInstance().NotifyHostTaskCompletion(MThisCmd);

Scheduler::enqueueUnblockedCommands(Cmd->MBlockedUsers, Lock, ToCleanUp);

The complexity of notifications is directly proportional to the number of entries in MBlockedUsers.

I'll try to draft a PR to fix this.

Nuullll added a commit to Nuullll/llvm that referenced this issue May 16, 2025
…n time explosion for long dependency chains

This commit addresses a performance issue observed when submitting
consecutive host tasks to an in-order queue without explicit `wait()`.
The execution time of each host task was found to increase significantly
as the number of submissions grew:
intel#18500.

The root cause was identified as the unnecessary tracking of indirect
blocking dependencies in `MBlockedUsers`. Previously, all direct and
indirect blocking relations between enqueued commands were tracked,
causing a significant increase in notification time upon task
completion. For example, in a sequence of tasks `A, B, C, D`,
`A.MBlockedUsers` would redundantly include `{C, D}`, even though these
tasks are already blocked by `B`.

To resolve this, the `enqueueCommand` function in the Scheduler was
enhanced to include a `RecursionDepth` parameter. This change prevents
excessive growth in the size of `Cmd->MBlockedUsers` in long dependency
chains by tracking only direct blocking dependencies, thereby reducing
notification time upon command completion.
@Nuullll
Contributor Author

Nuullll commented May 16, 2025

#18501 fixes this:

Results for different repeat

Total time in ms

| repeat              | 10 | 100 | 1000 | 3000  | 10000  |
|---------------------|----|-----|------|-------|--------|
| wait.out            | 16 | 162 | 1617 | 4853  | 16184  |
| nowait.out          | 11 | 106 | 1396 | 12996 | 519977 |
| wait.out + #18501   | 16 | 162 | 1615 | 4847  | 16154  |
| nowait.out + #18501 | 11 | 106 | 1103 | 3162  | 11164  |

Avg time in ms

| repeat              | 10  | 100  | 1000  | 3000  | 10000   |
|---------------------|-----|------|-------|-------|---------|
| wait.out            | 1.6 | 1.62 | 1.617 | 1.618 | 1.6184  |
| nowait.out          | 1.1 | 1.06 | 1.396 | 4.332 | 51.9977 |
| wait.out + #18501   | 1.6 | 1.62 | 1.615 | 1.616 | 1.6154  |
| nowait.out + #18501 | 1.1 | 1.06 | 1.103 | 1.054 | 1.1164  |

With the fix, consecutive submissions w/o waiting consistently perform better than explicit waiting for the given test case.
