Skip to content

Remove possibly illegal test data #31146

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jan 21, 2020

Conversation

rebecca-palmer
Copy link
Contributor

Much of the test_html data looks like it was saved from real web pages, and not all of them look like the kind that are under a free software license. (I couldn't actually check because only one of the three identifies the source site, and that site raises a security warning.)

This removes the questionable data, and adds tests that try to do something equivalent on known-free data.

If I am mistaken and these pages are under a free license, please instead document that.

This data was added in:
1d573b4 computer_sales_page (HP), for MultiIndex header and empty string handling in headers (should give empty not NaN)
e22fe1b nyse_wsj, for thousands separator
92aa277 macau, for MultiIndex header and thousands separator

@jreback jreback added the IO HTML read_html, to_html, Styler.apply, Styler.applymap label Jan 20, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback jreback added this to the 1.1 milestone Jan 20, 2020
@WillAyd
Copy link
Member

WillAyd commented Jan 20, 2020

Looks like you just need to remove the Index import from pandas/tests/io/test_html.py

@rebecca-palmer
Copy link
Contributor Author

From the checks bot:

Check import format using isort
##[error]Skipped 2 files
Check import format using isort DONE
##[error]Process completed with exit code 1.

No other errors, so not sure what that's supposed to mean?

The build bots fail with: Git checkout failed with exit code: 128

@jreback
Copy link
Contributor

jreback commented Jan 20, 2020

@rebecca-palmer you should install the pre-commit hooks to help you on PRs, they automatically do black,isort,mypy, do pip instasll pre-commit

@simonjayhawkins simonjayhawkins merged commit 964400d into pandas-dev:master Jan 21, 2020
@simonjayhawkins
Copy link
Member

Thanks @rebecca-palmer

assert not any(s.isna().any() for _, s in df.items())

@pytest.mark.slow
def test_thousands_macau_index_col(self, datapath, request):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there no replacement for this test? It highlight behaviour we where hoping to fix in #29622 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I seem to have misinterpreted this issue as "blank column names should not load as NaN", after observing that empty headers loaded as "Unnamed: n" while empty dtype=str body cells loaded as NaN. (I hence added "assert "Unnamed" in result.columns[-1]" to check for that.)

I now suspect this was because I was testing this in bs4 4.8.

wikipedia_states has empty body cells, so re-adding this check should be easy. (Possibly relevant to the "what caused this" discussion: it also has nested tables but they aren't the one being read.)

@simonjayhawkins
Copy link
Member

@meeseeksdev backport to 1.0.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO HTML read_html, to_html, Styler.apply, Styler.applymap
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants