Skip to content

Regroup CI doctests and adjust unit-tests recipe #1996

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 6, 2025

Conversation

EliahKagan
Copy link
Member

@EliahKagan EliahKagan commented May 6, 2025

This distributes work across CI jobs more evenly. In particular, it speeds up test-fast on windows-latest (which had become the slowest of all CI jobs, including other Windows jobs). To do so, it leverages the idea that the doctests are important enough that we should probably continue automatically running them on Windows, but not so important that it particularly matters which of reasonably normal Windows environments and targets they run for.

Those changes entail modifying the justfile, so I've included the further overlapping change of adding --no-fail-fast to the commands in the unit-tests recipe there. This benefits CI, and I think does not hurt local runs either.

This also makes it so that the doctest-running command in just unit-tests runs more than zero doctests, which I think is the more intuitive effect. (That is, it runs all doctests in the workspace, rather than just all the doctests in the gitoxide crate, of which there currently aren't any.)

Substantial further details, including about what this is trying to improve, and rationale, are in the commit message.

The diff, as shown on GitHub, is somewhat confusing in justfile. Except for one line that is moved and changed, all changed lines are actually changed only by having --no-fail-fast added to the end.

This seeks to make improvements in four overlapping areas:

- The CI `test-fast` job for `windows-latest` had been taking the
  longest, and integrating PRs would be more efficient if it could
  be sped up. If it didn't have to build and run doctests, then it
  would run markedly faster. `test-fast` runs doctests because...

- The `unit-tests` recipe in the `justfile`, which is one of the
  recipes the "full" CI `test` job runs via the `ci-test` recipe,
  runs a number of `nextest` commands on individual crates (with
  `-p`, except for testing the top-level `gitoxide` crate, and not
  with `--workspace`). It also ran doctests, but only on the
  `gitoxide` top-level crate. But the `gitoxide` crate currently
  has no doctests, while some `gix-*` crates do.

- On CI, we usually prefer `--no-fail-fast`. But it is not always
  used, because the commands in the `unit-tests` justfile recipe do
  not use it. `--no-fail-fast` is not always preferred when running
  tests locally, but...

- Both locally and on CI, in most cases that a test fails in a
  commmand in the `unit-tests` justfile recipe, the effect is that
  tests have run and results have been reported for multiple
  `nextest` commands, yet not all the tests specified in the most
  recent `nextest` command to run. Thus, omitting `--no-fail-fast`
  may not have the most intuitive effect in `just unit-tests`, even
  when run locally (even if the user would omit `--no-fail-fast` in
  individual `cargo nextest` runs carried out manually).

This commit makes the following changes:

1. Add `--no-fail-fast` to each of the commands in the `unit-tests`
   recipe in the `justfile`: the numerous `cargo nextest` commands,
   as well as the `cargo test` command used to run doctests.

2. Add `--workspace` to the `cargo test` command used to run
   doctests in the `unit-tests` recipe in the `justfile`, and move
   it to the end of the recipe.

3. Not to be confused with that `cargo test` command, move the
   other `cargo test` command used to run doctests in the `ci.yml`
   workflow (which alredy passed `--workspace`, as its purpose was
   to run all doctests in all crates) from the `tests-fast` job
   definition into the `test-32bit-windows-size` job, and rename
   that latter job `test-32bit-windows-size-doc` accordingly.

The rationale for (3) may not be obvious. The idea is:

- Running the doctests on only one Unix-like system should be
  enough, so long as they are run for all crates in the workspace.
  So the change in the `unit-tests` recipe in the `justfile` makes
  it so the CI `test` job (which includes a `unit-tests` run)
  covers doctests sufficiently, *except* for Windows.

- Although we should probably keep running doctests regularly on
  Windows, removing it from `test-fast`, including on Windows, is
  the simplest way to make the Windows `test-fast` job run faster.
  (It also makes the job definition clearer, since some of the
  other steps relate to each other more closely than they do to the
  step that ran the doctests.)

- It should be sufficient to run the doctests in any Windows
  environment. And it is best to avoid adding a new Windows job
  just for this, since various other Windows jobs might be added
  sometime soon (such as for ARM64, native Windows containers, the
  Git for Windows SDK, MinGit, BusyBox MinGit, and possibly others;
  some of these may be possible to combine, but likely a few more
  Windows jobs may be introduced for these, so avoiding adding
  extra Windows jobs now may make it easier to avoid having too
  many Windows jobs, in terms of queuing, GHA cache usage, energy
  usage, and other resources). So if this can be added to another
  Windows job without causing problems, that is preferable.

- The Windows job that took the least amount of time, usually by
  several minutes, was the `test-32bit-windows-size` job.

It is hope that this keeps the benefits of GitoxideLabs#1556, GitoxideLabs#1559, and GitoxideLabs#1654,
while improving CI testing performance most of the time.
@EliahKagan EliahKagan marked this pull request as ready for review May 6, 2025 04:25
@EliahKagan EliahKagan merged commit ca8d64c into GitoxideLabs:main May 6, 2025
22 checks passed
@EliahKagan EliahKagan deleted the run-ci/regroup branch May 6, 2025 04:25
@EliahKagan
Copy link
Member Author

EliahKagan commented May 6, 2025

It's hard to predict what effect caching dependencies (with rust-cache) will and won't have on the performance of future runs. I think this makes workflow runs overall faster, since it did in multiple runs in my fork, but we'll find out. Further adjustments might be in order.

@Byron
Copy link
Member

Byron commented May 6, 2025

Thanks so much!

The CI run took 17 minutes, which is less than the 30 minutes I thought it would take by now. Overall wallclock time is 1h27m, which seems close to what I remember though.

@EliahKagan
Copy link
Member Author

EliahKagan commented May 7, 2025

Thanks so much!

No problem!

The CI run took 17 minutes

Hopefully it will be even less in most runs. I think most runs took about that long in from the start of the first job to the end of the last job, before the changes here.

which is less than the 30 minutes I thought it would take by now.

Even though Windows is currently the slowest platform for running the tests, and the overall trend is to do more things on Windows, as well as to have more test cases and more fixtures across all platforms... nonetheless I think the overall trend of total time from the first job starting to the last job ending should be a gradual decrease, because the runner machines get gradually faster, and because we try not to pack too much into an individual job. For an example of the latter that is unrelated to Windows, the test job was split into test and test-journey in 711c2f5 (#1674).

So if it looks like the time is approaching 30 minutes, I would consider that to be a bug, unless there is a strong reason for it. (Please feel free to ping me anytime about performance degradations on CI, as well as anything else you think may be of interest.)

Overall wallclock time is 1h27m, which seems close to what I remember though.

I think by this you mean the sum of the wall-clock durations of all the jobs, i.e. the time they would take if they were not run concurrently, which is longer than the time they take when measured on a single clock. If so, it makes sense to me that this has not much improved, at least not by anything in this PR. This PR redistributes the work across jobs more evenly. But it probably doesn't decrease the total work. It does stop running doctests on macOS in test-fast. But it starts running all of them on Ubuntu in test. Also, moving the doctest run for Windows from the test-fast job to the 32-bit Windows job might actually make things take longer in total, depending on various small factors.

I think it may make sense to try to decrease this total time--especially since the likely addition of new jobs, such as Windows jobs along the "various other Windows jobs" examples given in ee55ded, may increase it if their contribution is not in some way offset by speed improvements elsewhere. But that's not part of what this PR specifically attempted.

(That is also something I think should be considered together with other resources. I think the total combined running time may be less important than total energy usage, which is somewhat feasible to estimate; queuing delays and runner availability; GitHub Actions cache storage; workflow maintainability; and the ability to adapt the workflow to changes such as new runners or, in case it should ever be needed, different CI solutions.)

@Byron
Copy link
Member

Byron commented May 7, 2025

(Please feel free to ping me anytime about performance degradations on CI, as well as anything else you think may be of interest.)

Thank you, appreciated!

I think by this you mean the sum of the wall-clock durations of all the jobs, i.e. the time they would take if they were not run concurrently, which is longer than the time they take when measured on a single clock.

Yes, this is what I meant - I should have written total runtime. And even though seeing that decrease is good for the environment, it's nothing we should necessarily optimise for as wallclock time is the thing that should be reduced (just not at all cost). And it's true that the total runtime doesn't directly correlate to energy usage either.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants