[SYCL][Host Task] Bad performance of consecutively submitted host tasks onto an in-order queue #18500

Open
Nuullll opened this issue May 16, 2025 · 2 comments
Labels
performance Performance related issues

Comments

@Nuullll
Contributor

Nuullll commented May 16, 2025

Describe the bug

When submitting consecutive host tasks to an in-order queue without an explicit wait(), the execution time of each host task explodes as the number of submissions increases.

To reproduce

Reproducing code

// test.cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <thread>
#include <chrono>

int main(int argc, char *argv[]) {
    sycl::queue queue(sycl::property::queue::in_order{});

    std::cout << "Using device: " << queue.get_device().get_info<sycl::info::device::name>() << "\n";

    int repeat = 10000;
    if (argc > 1) {
        repeat = std::stoi(std::string(argv[1]));
    }
    int data = 0;

    std::cout << "Submitting " << repeat << " host tasks...\n";
    auto start_time = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < repeat; i++) {
        std::this_thread::sleep_for(std::chrono::microseconds(500));
        auto e = queue.submit([&](sycl::handler &cgh) {
            cgh.host_task([&]() {
                // Simulate some work on the host
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
                data++;
            });
        });
#ifdef WAIT
        e.wait();
#endif
    }

    queue.wait();
    auto end_time = std::chrono::high_resolution_clock::now();
    std::cout << "Total execution time: " << std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count() << " ms\n";
    if (data != repeat) {
        std::cerr << "Error: data mismatch! Expected " << repeat << ", got " << data << "\n";
        return 1;
    }

    return 0;
}

Compile

Compile the code w/ and w/o an explicit wait for each submission.

clang++ -fsycl test.cpp -o nowait.out
clang++ -fsycl test.cpp -DWAIT -o wait.out

Run

Pass the number of consecutive submissions (repeat) as the first argument.

./nowait.out 3000
./wait.out 3000

Results for different repeat

Total time in ms

| repeat     | 10 | 100 | 1000 | 3000  | 10000  |
|------------|----|-----|------|-------|--------|
| wait.out   | 16 | 162 | 1617 | 4853  | 16184  |
| nowait.out | 11 | 106 | 1396 | 12996 | 519977 |

Avg time in ms

| repeat     | 10  | 100  | 1000  | 3000  | 10000   |
|------------|-----|------|-------|-------|---------|
| wait.out   | 1.6 | 1.62 | 1.617 | 1.618 | 1.6184  |
| nowait.out | 1.1 | 1.06 | 1.396 | 4.332 | 51.9977 |

Expected behavior

Even w/o explicit wait() for each submission (onto an in-order queue), the average execution time of each host task should be around 1ms. The 50x slowdown when repeat==10000 is not expected.

Environment

  • OS: Linux
  • Target device and vendor: host
  • DPC++ version: 7987a43
  • Dependencies version: Not relevant

Additional context

No response

@Nuullll Nuullll added the performance Performance related issues label May 16, 2025
@Nuullll
Contributor Author

Nuullll commented May 16, 2025

Initial analysis indicates that we're tracking unnecessary dependencies in MBlockedUsers:

void addBlockedUserUnique(const EventImplPtr &NewUser) {
  if (std::find(MBlockedUsers.begin(), MBlockedUsers.end(), NewUser) !=
      MBlockedUsers.end())
    return;
  MBlockedUsers.push_back(NewUser);
}

For example, suppose we have 4 tasks A, B, C, D that are submitted sequentially to an in-order queue w/o waiting, and that each task runs long enough that no task completes before all four are submitted.

We currently track all direct and indirect blocking relations between enqueued commands. After D is submitted, the MBlockedUsers of each task looks like:

A.MBlockedUsers = {B, C, D}
B.MBlockedUsers = {C, D}
C.MBlockedUsers = {D}
D.MBlockedUsers = {}

It appears redundant to track {C, D} as blocked users of A, since these tasks are already blocked by B. This redundancy in MBlockedUsers leads to a significant increase in notification time when a task completes, as seen in the following code sections:

Scheduler::getInstance().NotifyHostTaskCompletion(MThisCmd);

Scheduler::enqueueUnblockedCommands(Cmd->MBlockedUsers, Lock, ToCleanUp);

The complexity of notifications is directly proportional to the number of entries in MBlockedUsers.

I'll try to draft a PR to fix this.

Nuullll added a commit to Nuullll/llvm that referenced this issue May 16, 2025
…n time explosion for long dependency chains

This commit addresses a performance issue observed when submitting
consecutive host tasks to an in-order queue without explicit `wait()`.
The execution time of each host task was found to increase significantly
as the number of submissions grew:
intel#18500.

The root cause was identified as the unnecessary tracking of indirect
blocking dependencies in `MBlockedUsers`. Previously, all direct and
indirect blocking relations between enqueued commands were tracked,
causing a significant increase in notification time upon task
completion. For example, in a sequence of tasks `A, B, C, D`,
`A.MBlockedUsers` would redundantly include `{C, D}`, even though these
tasks are already blocked by `B`.

To resolve this, the `enqueueCommand` function in the Scheduler was
enhanced to include a `RecursionDepth` parameter. This change prevents
excessive growth in the size of `Cmd->MBlockedUsers` in long dependency
chains by tracking only direct blocking dependencies, thereby reducing
notification time upon command completion.
@Nuullll
Contributor Author

Nuullll commented May 16, 2025

#18501 fixes this:

Results for different repeat

Total time in ms

| repeat              | 10 | 100 | 1000 | 3000  | 10000  |
|---------------------|----|-----|------|-------|--------|
| wait.out            | 16 | 162 | 1617 | 4853  | 16184  |
| nowait.out          | 11 | 106 | 1396 | 12996 | 519977 |
| wait.out + #18501   | 16 | 162 | 1615 | 4847  | 16154  |
| nowait.out + #18501 | 11 | 106 | 1103 | 3162  | 11164  |

Avg time in ms

| repeat              | 10  | 100  | 1000  | 3000  | 10000   |
|---------------------|-----|------|-------|-------|---------|
| wait.out            | 1.6 | 1.62 | 1.617 | 1.618 | 1.6184  |
| nowait.out          | 1.1 | 1.06 | 1.396 | 4.332 | 51.9977 |
| wait.out + #18501   | 1.6 | 1.62 | 1.615 | 1.616 | 1.6154  |
| nowait.out + #18501 | 1.1 | 1.06 | 1.103 | 1.054 | 1.1164  |

With the fix, consecutive submissions w/o waiting consistently perform better than explicit waiting for the given test case.
