Skip to content

CI/BLD: Restrict ci/code_checks.sh to tracked repo files #36386

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

plammens
Copy link
Contributor

@plammens plammens commented Sep 15, 2020

Previously, some of the checks in code_checks.sh ran unrestricted on all the
contents of the repository root (recursively), so that if any files extraneous
to the repo were present (e.g. a virtual environment directory, or generated source files), they were
checked too, potentially causing many false positives when a developer runs
./ci/code_checks.sh locally to check that the code is ready to be put in a PR.

The checker invocations that were already scoped (i.e. they were already
restricted, in one way or another, to the actual pandas code, e.g. by
restricting the search to the pandas subfolder) have been left as-is,
while those that weren't are now given an explicit list of files that are
tracked in the repo.

@plammens plammens force-pushed the restrict-ci-code-checks-to-tracked branch from 6b3bddb to 590b54c Compare September 15, 2020 19:57
@dsaxton dsaxton added the CI Continuous Integration label Sep 16, 2020
@WillAyd
Copy link
Member

WillAyd commented Sep 16, 2020

I don't think it's worth adding complexity here. This script is intended for our CI processes. For local development you can use pre-commit:

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#pre-commit

@plammens
Copy link
Contributor Author

plammens commented Sep 17, 2020

I don't think it's worth adding complexity here. This script is intended for our CI processes. For local development you can use pre-commit:

The point is to use ci/code_checks.sh locally precisely because it's the script used by CI. From personal experience (and not just from the pandas repo), there are many times in which one manually checks black, isort, flake8, etc., and they all pass, but when you push the commits, the CI actually fails; this is a waste of both the developer's time and the CI system's resources. This happens because the CI script is much more complex than a simple black or flake8 or whatever, and is configured in a very specific way. The only way to ensure that your commits will pass CI is to run the exact same checks that the CI will run—and this can only be ensured if you run the CI script locally.

Moreover, why should we require more effort from developers to set up their own local checks? The "homemade" local checks will probably be very different from what is running on CI and thus they will generate a lot of false negatives (false check passes), as said above, and perhaps even some false positives (false check fails). We can easily kill two birds with one stone here. After all, what we're testing on CI is exactly the set of conditions we want the codebase to satisfy, isn't it?

Finally, yes, this is adding a tiny bit of complexity, but it will also benefit the CI on its own: it helps make the CI process less brittle. Under no circumstances should the CI check files that are not part of the Git repo. For example, suppose that in the future a change is made to the CI process such that before the code checks are run, some files are generated first; as it is right now, code_checks.sh would also check the latter and probably report errors. To make a concrete example, suppose the documentation build is checked before the code checks are run: the code checks will fail because code_checks.sh will find issues in the generated .rst files (I found this out from personal experience).

@plammens plammens marked this pull request as ready for review September 17, 2020 02:10
@plammens plammens changed the title BLD: Restrict ci/code_checks.sh to tracked repo files CI/BLD: Restrict ci/code_checks.sh to tracked repo files Sep 17, 2020
@WillAyd
Copy link
Member

WillAyd commented Sep 17, 2020

This happens because the CI script is much more complex than a simple black or flake8 or whatever, and is configured in a very specific way.

These in particular are managed through the configuration file, so they won't differ from being run in pre-commit which has the added bonus of being cross platform

@MarcoGorelli
Copy link
Member

Thanks @plammens for the PR, although just for the record

there are many times in which one manually checks black, isort, flake8, etc., and they all pass, but when you push the commits, the CI actually fails

in ~ 1 year of contributing to pandas, this has never happened to me.

Sometimes tests pass locally but fail during CI, but I've found black, isort, flake8 to pass/fail reliably

@plammens
Copy link
Contributor Author

Admittedly, the examples I made were pretty terrible 🙂; it's true that black, isort and flake8 are consistent in local vs CI. It's more about custom scripts like validate_docstrings.py, validate_unwanted_patterns.py, the invgrep pattern searches, doctests and unit tests.

And maybe I expressed myself incorrectly: the problem is not checks that are run in the same exact configuration locally and in CI which produce different results (these are almost exclusively doctests and unit tests, and this PR doesn't fix that), the problem is running differently configured checks or running different checks (i.e. less checks) than those on CI, which is prone to happening if we require the developer to transcribe every single check as a precommit hook or whatever works for them locally. A quick Ctrl + F tells me that there are about 74 distinct checks being made by the code_checks.sh script. Sure, I could go ahead and replicate all of these in precommit hooks in my local environment, or run them manually, but that would be a significant waste of time and I still would make mistakes.

A hypothetical (but maybe not-so-hypothetical 😉) example of what happens to me:

  • I check black and flake8, but I forget to check isort. CI fails on isort.
  • I check black, flake8 and isort. CI fails on validate_docstrings.
  • I check black, flake8, isort, and validate_docstrings.py. CI fails on validate_unwanted_patterns.py.
  • I check black, flake8, isort, validate_docstrings.py and validate_unwanted_patterns.py. CI fails on one of the many invgreps.
  • And so on... (I hope you don't think I write such terrible code that all of these checks fail, I'm just making an illustrative example 🙂.)

The obvious solution is to have some form of automation that runs all the necessary checks. But this already exists: it's code_checks.sh! Why should any developer spend time replicating that script in a way that works with their local setup?

That's why I believe it's beneficial to use a "centralized" checking script available to all developers that does the exact same checks as CI.

And again, if you're not convinced by the "easier local checks" argument, the argument that this improves the CI process on its own still stands:

Finally, yes, this is adding a tiny bit of complexity, but it will also benefit the CI on its own: it helps make the CI process less brittle. Under no circumstances should the CI check files that are not part of the Git repo. For example, suppose that in the future a change is made to the CI process such that before the code checks are run, some files are generated first; as it is right now, code_checks.sh would also check the latter and probably report errors. To make a concrete example, suppose the documentation build is checked before the code checks are run: the code checks will fail because code_checks.sh will find issues in the generated .rst files (I found this out from personal experience).

I still don't understand what's the downside to these changes 🤔. If you don't want to touch the CI script, here are some alternative ideas, just to throw them out:

  • What about providing a pre-commit configuration? The problem with this is that we'd have to provide hooks for every check, including custom scripts and bash functions like invgrep, which doesn't sound very viable to me.
  • Providing a separate script for local checking? The problem here is having to keep it in sync with the CI script. And again, since the point is to run the same checks as CI, I'd just use the CI script directly.
  • Same as above, but instead of having separate local and CI scripts, configure all checks in a declarative fashion in setup.cfg or similar, and then use a generator script that would generate both the CI script and the local script? This solves the problem of keeping both of them in sync, but it is quite complex. (And, once again, I'd still prefer just running the CI script directly.)
  • To address the issue of being cross-platform mentioned by @WillAyd, what about using Python scripts for the CI scripts instead of bash scripts? Then one could run the same scripts locally on any platform. (I don't have much experience with CI, so please do tell me if I'm being daft 🙂.)

By the way, the reason I used code_checks.sh to check my changes (and thus noticed these issues) is because it is mentioned in the Code standards section of the pandas development guide:

There is a tool in pandas to help contributors verify their changes before contributing them to the project:

 ./ci/code_checks.sh

The script verifies the linting of code files, it looks for common mistake patterns (like missing spaces around sphinx directives that make the documentation not being rendered properly) and it also validates the doctests. It is possible to run the checks independently by using the parameters lint, patterns and doctests (e.g. ./ci/code_checks.sh lint).

@jbrockmendel
Copy link
Member

there are many times in which one manually checks black, isort, flake8, etc., and they all pass, but when you push the commits, the CI actually fails

I agree with @plammens on this one (havent looked at the PR itself, so im agreeing with this specific statement). I regularly get into a situation in which manually running flake8 passes but then the pre-commit flake8 produces a bunch of spurious complaints.

Side-note: I recently added a "check" to the makefile that duplicates some of the checks in code_checks.sh. That's my bad, should be changed to call code_checks directly to keep the checks in sync.

@jbrockmendel
Copy link
Member

@plammens can you merge master

Extract common code for checking a single file path.
The previous behaviour filtered out too many paths: any subdirectory
whose relative path *contained* any of the ignored paths (which could be
arbitrary strings) would be ignored. E.g., if PATHS_TO_IGNORE contained
"foo", all of "./foo", "./spam/foo", "./spam/foo/eggs", "./barfoobaz",
"./spam/foo.py" would get filtered out.

On the other hand, individual files that *did* appear in the PAHTS_TO_IGNORE
were *not* ignored.

Now the behaviour should be a bit more robust. Ignored file pahts can be
specified as relative paths or absolute paths (since they are all passed
through os.path.abspath); any files below a subdirectory included
in PATHS_TO_IGNORE will be filtered out, and so will any files which are
explicitly mentioned in PATHS_TO_IGNORE.
@plammens plammens force-pushed the restrict-ci-code-checks-to-tracked branch from 1985b5b to 90302ca Compare September 23, 2020 21:01
This flag controls whether individual files explicitly passed as arguments
should override the --excluded-file-paths rule.
Previously, some of the checks in code_checks.sh ran unrestricted on all the
contents of the repository root (recursively), so that if any files extraneous
to the repo were present (e.g. a virtual environment directory), they were
checked too, potentially causing many false positives when a developer runs
./ci/code_checks.sh .

The checker invocations that were already scoped (i.e. they were already
restricted, in one way or another, to the actual pandas code, e.g. by
restricting the search to the `pandas` subfolder) have been left as-is,
while those that weren't are now given an explicit list of files that are
tracked in the repo.
@plammens plammens force-pushed the restrict-ci-code-checks-to-tracked branch from 90302ca to 8611fe6 Compare September 23, 2020 21:43
@MarcoGorelli
Copy link
Member

MarcoGorelli commented Sep 24, 2020

I regularly get into a situation in which manually running flake8 passes but then the pre-commit flake8 produces a bunch of spurious complaints.

Is this still happening after #36412 ?

EDIT

Nevermind, they're still not pinned to the same version, sorry for the noise

@MarcoGorelli
Copy link
Member

Hi @plammens

Sorry for the delay. I've been busy, but also PRs which change multiple things aren't the easiest to review. There's a couple of proposed changes which I think we take and merge quickly if you open separate PRs for them:

Then this PR can be left to just discuss running the checks on tracked files

@plammens
Copy link
Contributor Author

  • defining the function if_gh_actions, which cleans things up

Opened #37110 for this.

Will do this soon. (If I understand correctly, I should undo the changes to the .svg and .html files in #36588 and add the trailing-whitespace pre-commit hook?)

@MarcoGorelli
Copy link
Member

Will do this soon. (If I understand correctly, I should undo the changes to the .svg and .html files in #36588 and add the trailing-whitespace pre-commit hook?)

That would be good, thanks!

@WillAyd
Copy link
Member

WillAyd commented Oct 23, 2020

Is this PR still needed or is everything that we need in pre-commit now?

@MarcoGorelli
Copy link
Member

Is this PR still needed or is everything that we need in pre-commit now?

Not everything is in pre-commit yet, but things are making their way there and I'd very much be inclined with moving as much as possible over (then they'll be cross-platform and'll provide faster feedback to devs).

It would also allow us to reduce the complexity of scripts/validate_unwanted_patterns.py, as pre-commit runs hooks with each file individually rather than recursively on a directory, i.e.

python scripts/validate_unwanted_patterns.py file1.py file2.py ... filen.py

rather than

python scripts/validate_unwanted_patterns.py pandas

Anyway, massive thanks @plammens for having brought up the issue, and if you'd like to help with moving checks over to pre-commit, that'd be welcome!

@plammens
Copy link
Contributor Author

plammens commented Nov 3, 2020

Closing this as it has been superseded by pre-commit configurations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Continuous Integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: exclude directories from ci/code_checks.sh
5 participants