docs: add fp-finder util sub command documentation #208

Draft: wants to merge 1 commit into base: main

S0obi commented Apr 26, 2025

Proposed changes

Add a brief mention in the docs of the new `fp-finder` subcommand for crs-toolchain.

The `util` command includes sub-commands that are used from time to time and do not fit nicely into any of the other groups. Currently, the available sub-commands are:

* `renumber-tests`: Used to simplify maintenance of the regression tests. Since every test has a consecutive number within its file, adding or removing tests can disrupt numbering. `renumber-tests` will renumber all tests within each test file consecutively.
* `fp-finder`: Takes a file as input and outputs a filtered, alphabetically sorted list of unique words that are not present in the English dictionary. This can help in identifying potential false positives by focusing on unusual or unknown words.

Contributor

Suggested change
- * `fp-finder`: Takes a file as input and outputs a filtered, alphabetically sorted list of unique words that are not present in the English dictionary. This can help in identifying potential false positives by focusing on unusual or unknown words.
+ * `fp-finder`: Takes a file as input and produces a filtered, alphabetically sorted list of unique words that are not present in the English dictionary (WordNet). This can help in identifying potential false positives by focusing on unusual or unknown words.

Author

Are you sure about the WordNet part? The dictionary (https://github.com/dwyl/english-words) doesn't seem to mention anything related to it.

Contributor

Oh... When we refactored the original util, we settled on using WordNet, which is a well-known and well-documented source. I had simply assumed that the repository you were working with supplied the same word list. I quickly checked and there are a couple of Go packages that provide an interface to the WordNet database. I'd really like to continue using WordNet: one, because it's well known, and two, because I want to ensure consistency when running the tool. Could I trouble you to modify your implementation to use WordNet?

Author

Hmm... I reread coreruleset/crs-toolchain#181 and it seems there was a bit of a misunderstanding :)

Can you point me to the library that you think would make sense to use? I took a quick look at https://github.com/fluhus/gostuff/blob/master/nlp/wordnet/parser.go and it doesn't seem really equivalent to the wn command (that's why I proposed to just run the CLI command from crs-toolchain). Honestly, I was not really satisfied with anything I found. Feedback is welcome!

Contributor

This one doesn't look too bad: https://pkg.go.dev/github.com/lloyd/wnram.
It will require downloading the WN database, storing it in the cache dir (I imagine), and then parsing it from there.

That makes sense since the DB is ~80MB; we wouldn't want that in the binary.

DB available here: https://wordnet.princeton.edu/download
Needs proper attribution (like the original tool had).

Author

Created coreruleset/crs-toolchain#229 for working on this.
