CI/BLD: Restrict ci/code_checks.sh to tracked repo files #36386

plammens · 2020-09-15T19:05:11Z

closes CI: exclude directories from ci/code_checks.sh #36368
passes git diff upstream/master -u -- "*.py" | flake8 --diff

Previously, some of the checks in code_checks.sh ran unrestricted on all the
contents of the repository root (recursively), so that if any files extraneous
to the repo were present (e.g. a virtual environment directory, or generated source files), they were
checked too, potentially causing many false positives when a developer runs
./ci/code_checks.sh locally to check that the code is ready to be put in a PR.

The checker invocations that were already scoped (i.e. they were already
restricted, in one way or another, to the actual pandas code, e.g. by
restricting the search to the pandas subfolder) have been left as-is,
while those that weren't are now given an explicit list of files that are
tracked in the repo.
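As a rough illustration of the "explicit list of tracked files" idea, here is a minimal Python sketch. The PR itself changes the bash script; the helper names `parse_ls_files` and `tracked_files` are hypothetical and only show the approach of asking Git for the file list (`git ls-files -z`) instead of walking the repository root.

```python
import subprocess
from typing import List, Sequence


def parse_ls_files(raw: bytes) -> List[str]:
    """Split NUL-delimited `git ls-files -z` output into a list of paths."""
    return [part.decode("utf-8") for part in raw.split(b"\0") if part]


def tracked_files(patterns: Sequence[str] = ()) -> List[str]:
    """Return files tracked by Git, optionally limited to pathspecs like '*.py'."""
    # -z separates entries with NUL bytes, so paths containing spaces survive.
    cmd = ["git", "ls-files", "-z", "--", *patterns]
    result = subprocess.run(cmd, check=True, stdout=subprocess.PIPE)
    return parse_ls_files(result.stdout)
```

Calling `tracked_files(["*.py"])` from the repository root would then give a checker only the Python files Git knows about, so a local virtual environment or generated files are never passed in.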

WillAyd · 2020-09-16T19:35:53Z

I don't think it's worth adding complexity here. This script is intended for our CI processes. For local development you can use pre-commit:

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#pre-commit

plammens · 2020-09-17T00:21:57Z

> I don't think it's worth adding complexity here. This script is intended for our CI processes. For local development you can use pre-commit:

The point is to use ci/code_checks.sh locally precisely because it's the script used by CI. From personal experience (and not just from the pandas repo), there are many times in which one manually checks black, isort, flake8, etc., and they all pass, but when you push the commits, the CI actually fails; this is a waste of both the developer's time and the CI system's resources. This happens because the CI script is much more complex than a simple black or flake8 or whatever, and is configured in a very specific way. The only way to ensure that your commits will pass CI is to run the exact same checks that the CI will run—and this can only be ensured if you run the CI script locally.

Moreover, why should we require more effort from developers to set up their own local checks? The "homemade" local checks will probably be very different from what is running on CI and thus they will generate a lot of false negatives (false check passes), as said above, and perhaps even some false positives (false check fails). We can easily kill two birds with one stone here. After all, what we're testing on CI is exactly the set of conditions we want the codebase to satisfy, isn't it?

Finally, yes, this is adding a tiny bit of complexity, but it will also benefit the CI on its own: it helps make the CI process less brittle. Under no circumstances should the CI check files that are not part of the Git repo. For example, suppose that in the future a change is made to the CI process such that before the code checks are run, some files are generated first; as it is right now, code_checks.sh would also check the latter and probably report errors. To make a concrete example, suppose the documentation build is checked before the code checks are run: the code checks will fail because code_checks.sh will find issues in the generated .rst files (I found this out from personal experience).

WillAyd · 2020-09-17T22:36:24Z

> This happens because the CI script is much more complex than a simple black or flake8 or whatever, and is configured in a very specific way.

These in particular are managed through the configuration file, so they won't behave differently when run in pre-commit, which has the added bonus of being cross-platform.

MarcoGorelli · 2020-09-18T08:13:41Z

Thanks @plammens for the PR, although just for the record

> there are many times in which one manually checks black, isort, flake8, etc., and they all pass, but when you push the commits, the CI actually fails

in ~ 1 year of contributing to pandas, this has never happened to me.

Sometimes tests pass locally but fail during CI, but I've found black, isort, flake8 to pass/fail reliably

plammens · 2020-09-18T13:05:38Z

Admittedly, the examples I gave were pretty terrible 🙂; it's true that black, isort and flake8 behave consistently locally and on CI. It's more about custom scripts like validate_docstrings.py, validate_unwanted_patterns.py, the invgrep pattern searches, doctests and unit tests.

And maybe I expressed myself poorly: the problem is not that checks run in exactly the same configuration locally and in CI produce different results (these are almost exclusively doctests and unit tests, and this PR doesn't fix that). The problem is running differently configured checks, or different checks (i.e. fewer checks), than those on CI, which is prone to happen if we require each developer to transcribe every single check as a pre-commit hook or whatever works for them locally. A quick Ctrl + F tells me that there are about 74 distinct checks being made by the code_checks.sh script. Sure, I could go ahead and replicate all of these in pre-commit hooks in my local environment, or run them manually, but that would be a significant waste of time and I would still make mistakes.

A hypothetical (but maybe not-so-hypothetical 😉) example of what happens to me:

1. I check black and flake8, but I forget to check isort. CI fails on isort.
2. I check black, flake8 and isort. CI fails on validate_docstrings.
3. I check black, flake8, isort, and validate_docstrings.py. CI fails on validate_unwanted_patterns.py.
4. I check black, flake8, isort, validate_docstrings.py and validate_unwanted_patterns.py. CI fails on one of the many invgreps.
And so on... (I hope you don't think I write such terrible code that all of these checks fail, I'm just making an illustrative example 🙂.)

The obvious solution is to have some form of automation that runs all the necessary checks. But this already exists: it's code_checks.sh! Why should any developer spend time replicating that script in a way that works with their local setup?

That's why I believe it's beneficial to use a "centralized" checking script available to all developers that does the exact same checks as CI.

And again, if you're not convinced by the "easier local checks" argument, the argument that this improves the CI process on its own still stands:

> Finally, yes, this is adding a tiny bit of complexity, but it will also benefit the CI on its own: it helps make the CI process less brittle. Under no circumstances should the CI check files that are not part of the Git repo. For example, suppose that in the future a change is made to the CI process such that before the code checks are run, some files are generated first; as it is right now, code_checks.sh would also check the latter and probably report errors. To make a concrete example, suppose the documentation build is checked before the code checks are run: the code checks will fail because code_checks.sh will find issues in the generated .rst files (I found this out from personal experience).

I still don't understand what the downside to these changes is 🤔. If you don't want to touch the CI script, here are some alternative ideas, just to throw them out:

- What about providing a pre-commit configuration? The problem with this is that we'd have to provide hooks for every check, including custom scripts and bash functions like invgrep, which doesn't sound very viable to me.
- Providing a separate script for local checking? The problem here is having to keep it in sync with the CI script. And again, since the point is to run the same checks as CI, I'd just use the CI script directly.
- Same as above, but instead of having separate local and CI scripts, configure all checks in a declarative fashion in setup.cfg or similar, and then use a generator script that would generate both the CI script and the local script? This solves the problem of keeping both of them in sync, but it is quite complex. (And, once again, I'd still prefer just running the CI script directly.)
- To address the issue of being cross-platform mentioned by @WillAyd, what about using Python scripts for the CI scripts instead of bash scripts? Then one could run the same scripts locally on any platform. (I don't have much experience with CI, so please do tell me if I'm being daft 🙂.)
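To make the "declarative checks plus generator" idea above a little more concrete, here is a small sketch under stated assumptions. Nothing like this exists in pandas as far as this thread shows: the `Check` type, the `CHECKS` list, and both functions are hypothetical. The same list of checks is either rendered into a bash script for CI or run directly from Python on any platform.

```python
import shlex
import subprocess
from typing import List, NamedTuple, Sequence


class Check(NamedTuple):
    name: str
    command: List[str]


# Illustrative entries only; the real script contains many more checks.
CHECKS = [
    Check("flake8", ["flake8", "pandas"]),
    Check("isort", ["isort", "--check-only", "pandas"]),
]


def render_shell_script(checks: Sequence[Check]) -> str:
    """Render the checks as a bash script that exits non-zero if any check fails."""
    lines = ["#!/bin/bash", "RET=0"]
    for check in checks:
        cmd = " ".join(shlex.quote(part) for part in check.command)
        lines.append("echo " + shlex.quote(f"Running {check.name}"))
        lines.append(f"{cmd} || RET=1")
    lines.append("exit $RET")
    return "\n".join(lines) + "\n"


def run_checks(checks: Sequence[Check]) -> List[str]:
    """Run each check locally and return the names of the checks that failed."""
    failed = []
    for check in checks:
        if subprocess.run(check.command).returncode != 0:
            failed.append(check.name)
    return failed
```

Because both outputs come from one list, the CI script and the local runner cannot drift apart, which is the point of the third idea; the cost is the extra generator layer mentioned there.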

By the way, the reason I used code_checks.sh to check my changes (and thus noticed these issues) is that it is mentioned in the Code standards section of the pandas development guide:

> There is a tool in pandas to help contributors verify their changes before contributing them to the project:
> ./ci/code_checks.sh
> The script verifies the linting of code files, it looks for common mistake patterns (like missing spaces around sphinx directives that make the documentation not being rendered properly) and it also validates the doctests. It is possible to run the checks independently by using the parameters lint, patterns and doctests (e.g. ./ci/code_checks.sh lint).

jbrockmendel · 2020-09-23T20:09:19Z

> there are many times in which one manually checks black, isort, flake8, etc., and they all pass, but when you push the commits, the CI actually fails

I agree with @plammens on this one (I haven't looked at the PR itself, so I'm agreeing with this specific statement). I regularly get into a situation in which manually running flake8 passes but then the pre-commit flake8 produces a bunch of spurious complaints.

Side-note: I recently added a "check" to the makefile that duplicates some of the checks in code_checks.sh. That's my bad, should be changed to call code_checks directly to keep the checks in sync.

jbrockmendel · 2020-09-23T20:09:29Z

@plammens can you merge master

Extract common code for checking a single file path.

The previous behaviour filtered out too many paths: any subdirectory whose relative path *contained* any of the ignored paths (which could be arbitrary strings) would be ignored. E.g., if PATHS_TO_IGNORE contained "foo", all of "./foo", "./spam/foo", "./spam/foo/eggs", "./barfoobaz", "./spam/foo.py" would get filtered out. On the other hand, individual files that *did* appear in PATHS_TO_IGNORE were *not* ignored.

Now the behaviour should be a bit more robust. Ignored file paths can be specified as relative paths or absolute paths (since they are all passed through os.path.abspath); any files below a subdirectory included in PATHS_TO_IGNORE will be filtered out, and so will any files which are explicitly mentioned in PATHS_TO_IGNORE.
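The rule described above can be sketched as follows. This is not the code from the commit, only a minimal reading of the description; the function name `is_ignored` is an assumption.

```python
import os
from typing import Iterable


def is_ignored(path: str, paths_to_ignore: Iterable[str]) -> bool:
    """True if `path` is an ignored entry or lies below an ignored directory."""
    target = os.path.abspath(path)
    for entry in paths_to_ignore:
        ignored = os.path.abspath(entry)
        # An exact match covers files (and directories) that are listed explicitly.
        if target == ignored:
            return True
        # Appending the separator keeps "barfoobaz" and "foo.py" from
        # being treated as if they were inside "foo".
        if target.startswith(ignored.rstrip(os.sep) + os.sep):
            return True
    return False
```

With `["foo"]` as the ignore list, `./foo` and `./foo/eggs.py` are ignored, while `./spam/foo`, `./barfoobaz` and `./spam/foo.py` are not, which is the difference from the substring matching described in the first paragraph.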

This flag controls whether individual files explicitly passed as arguments should override the --excluded-file-paths rule.
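A hedged sketch of how such a flag might behave. The option names `--excluded-file-paths` and `--no-override` come from the commit messages, but the comma-separated format, the default, and the direction of the flag (exclusions win only when `--no-override` is given) are assumptions made for illustration.

```python
import argparse
import os
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of the override rule.")
    parser.add_argument("paths", nargs="+", help="Files passed explicitly.")
    parser.add_argument(
        "--excluded-file-paths",
        default="",
        help="Comma-separated paths to exclude (format assumed).",
    )
    parser.add_argument(
        "--no-override",
        action="store_true",
        help="Apply the exclusions even to files passed explicitly.",
    )
    return parser


def select_files(args: argparse.Namespace) -> List[str]:
    """Return the explicitly passed files that should actually be checked."""
    excluded = {
        os.path.abspath(p) for p in args.excluded_file_paths.split(",") if p
    }
    selected = []
    for path in args.paths:
        is_excluded = os.path.abspath(path) in excluded
        # Assumed default: an explicitly named file overrides the exclusion
        # list. With --no-override, the exclusion list wins instead.
        if is_excluded and args.no_override:
            continue
        selected.append(path)
    return selected
```

For example, `a.py b.py --excluded-file-paths b.py` would still check both files, while adding `--no-override` would drop `b.py`.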

MarcoGorelli · 2020-09-24T07:52:34Z

> I regularly get into a situation in which manually running flake8 passes but then the pre-commit flake8 produces a bunch of spurious complaints.

Is this still happening after #36412 ?

EDIT

Never mind, they're still not pinned to the same version, sorry for the noise.

MarcoGorelli · 2020-10-12T13:18:43Z

Hi @plammens

Sorry for the delay. I've been busy, but also PRs which change multiple things aren't the easiest to review. There are a couple of proposed changes which I think we can take and merge quickly if you open separate PRs for them:

- defining the function if_gh_actions, which cleans things up
- removing trailing whitespace from .c files (for which you opened CLN: clean up new detected trailing whitespace #36588 as a precursor) - here, I'd say that we can continue excluding svg and html. In addition, I think that the trailing-whitespace hook could be added to .pre-commit-config.yaml - could you amend CLN: clean up new detected trailing whitespace #36588 to include this?

Then this PR can be left to just discuss running the checks on tracked files

plammens · 2020-10-13T23:16:47Z

> defining the function if_gh_actions, which cleans things up

Opened #37110 for this.

> removing trailing whitespace from .c files (for which you opened CLN: clean up new detected trailing whitespace #36588 as a precursor) - here, I'd say that we can continue excluding svg and html. In addition, I think that the trailing-whitespace hook could be added to .pre-commit-config.yaml - could you amend CLN: clean up new detected trailing whitespace #36588 to include this?

Will do this soon. (If I understand correctly, I should undo the changes to the .svg and .html files in #36588 and add the trailing-whitespace pre-commit hook?)

MarcoGorelli · 2020-10-14T07:35:28Z

> Will do this soon. (If I understand correctly, I should undo the changes to the .svg and .html files in #36588 and add the trailing-whitespace pre-commit hook?)

That would be good, thanks!

WillAyd · 2020-10-23T21:32:00Z

Is this PR still needed or is everything that we need in pre-commit now?

MarcoGorelli · 2020-10-24T07:18:08Z

> Is this PR still needed or is everything that we need in pre-commit now?

Not everything is in pre-commit yet, but things are making their way there, and I'd very much be inclined to move as much as possible over (then the checks will be cross-platform and will provide faster feedback to devs).

It would also allow us to reduce the complexity of scripts/validate_unwanted_patterns.py, as pre-commit runs hooks with each file individually rather than recursively on a directory, i.e.

python scripts/validate_unwanted_patterns.py file1.py file2.py ... filen.py

rather than

python scripts/validate_unwanted_patterns.py pandas
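To illustrate the difference between the two invocations, here is a minimal sketch of a command-line entry point that accepts either form. It is not the actual interface of validate_unwanted_patterns.py; `iter_python_files` and `main` are hypothetical names. The directory branch is the part that could be dropped if only the per-file form were needed.

```python
import argparse
import os
from typing import Iterator, List, Optional, Sequence


def iter_python_files(paths: Sequence[str]) -> Iterator[str]:
    """Yield .py files from a mix of file and directory arguments."""
    for path in paths:
        if os.path.isdir(path):
            # Directory form: walk recursively, as `... pandas` would need.
            for root, _dirs, files in os.walk(path):
                for name in sorted(files):
                    if name.endswith(".py"):
                        yield os.path.join(root, name)
        else:
            # File form: a path passed explicitly is taken as-is.
            yield path


def main(argv: Optional[Sequence[str]] = None) -> List[str]:
    parser = argparse.ArgumentParser()
    parser.add_argument("paths", nargs="+", help="Files or directories to check.")
    args = parser.parse_args(argv)
    return list(iter_python_files(args.paths))
```

When every argument is already a file, the function reduces to returning the arguments unchanged, which is why the per-file form allows a simpler script.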

Anyway, massive thanks @plammens for having brought up the issue, and if you'd like to help with moving checks over to pre-commit, that'd be welcome!

plammens · 2020-11-03T14:42:09Z

Closing this as it has been superseded by pre-commit configurations.

plammens force-pushed the restrict-ci-code-checks-to-tracked branch from 6b3bddb to 590b54c Compare September 15, 2020 19:57

dsaxton added the CI Continuous Integration label Sep 16, 2020

plammens marked this pull request as ready for review September 17, 2020 02:10

plammens changed the title ~~BLD: Restrict ci/code_checks.sh to tracked repo files~~ CI/BLD: Restrict ci/code_checks.sh to tracked repo files Sep 17, 2020

jbrockmendel reviewed Sep 23, 2020

web/pandas/_templates/layout.html

plammens added 3 commits September 23, 2020 21:56

BLD: refactor validate_unwanted_patterns.py

39c0c0e

Extract common code for checking a single file path.

BLD: allow multiple path arguments in validate_unwanted_patterns.py

b3d98f6

plammens force-pushed the restrict-ci-code-checks-to-tracked branch from 1985b5b to 90302ca Compare September 23, 2020 21:01

plammens added 5 commits September 23, 2020 22:42

BLD: add verbose option to validate_unwanted_patterns.py

d2dc5f7

BLD: add --no-override flag to validate_unwanted_patterns.py

633abc4

This flag controls whether individual files explicitly passed as arguments should override the --excluded-file-paths rule.

CLN: clean up new detected trailing whitespace

21feb52

BLD: refactor code_checks.sh to avoid duplication due to GH Actions c…

8611fe6

…heck

plammens force-pushed the restrict-ci-code-checks-to-tracked branch from 90302ca to 8611fe6 Compare September 23, 2020 21:43

plammens mentioned this pull request Sep 23, 2020

CLN: clean up new detected trailing whitespace #36588

Merged

2 tasks

plammens mentioned this pull request Oct 13, 2020

BLD: extract GH Actions check function to avoid duplication in code_checks.sh #37110

Closed

2 tasks

MarcoGorelli mentioned this pull request Oct 19, 2020

CI move non-standard-import checks over to pre-commit #37240

Merged

plammens closed this Nov 3, 2020

mroeschke mentioned this pull request May 2, 2023

CI: exclude directories from ci/code_checks.sh #36368

Closed
