Remove possibly illegal test data #31146

rebecca-palmer · 2020-01-20T08:05:33Z

Much of the test_html data looks like it was saved from real web pages, and not all of them look like the kind that are under a free software license. (I couldn't actually check because only one of the three identifies the source site, and that site raises a security warning.)

This removes the questionable data, and adds tests that try to do something equivalent on known-free data.

If I am mistaken and these pages are under a free license, please instead document that.

This data was added in:
1d573b4 computer_sales_page (HP), for MultiIndex header and empty string handling in headers (should give empty not NaN)
e22fe1b nyse_wsj, for thousands separator
92aa277 macau, for MultiIndex header and thousands separator
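
For readers unfamiliar with what these files exercised, here is a minimal sketch of an equivalent check on hand-written data, so no third-party page is involved. The table contents and the helper name `read_population_table` are made up for illustration and are not the tests added in this PR. It assumes pandas plus an HTML parser backend (lxml, or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Hand-written table (hypothetical data), so there is no licensing question.
# Two header rows in <thead> should produce MultiIndex columns.
HTML = """
<table>
  <thead>
    <tr><th>Region</th><th colspan="2">Population</th></tr>
    <tr><th>Name</th><th>Urban</th><th>Rural</th></tr>
  </thead>
  <tbody>
    <tr><td>North</td><td>1,234,567</td><td>7,890</td></tr>
    <tr><td>South</td><td>89,012</td><td>123</td></tr>
  </tbody>
</table>
"""


def read_population_table(html: str) -> pd.DataFrame:
    """Parse the first table, treating ',' as a thousands separator."""
    # thousands="," is already the read_html default; it is passed
    # explicitly here to make the behaviour under test visible.
    return pd.read_html(StringIO(html), thousands=",")[0]


df = read_population_table(HTML)

# For contrast: with thousands=None the comma-grouped cells stay as strings.
raw = pd.read_html(StringIO(HTML), thousands=None)[0]

print(df.columns.tolist())
print(df[("Population", "Urban")].tolist())
print(raw[("Population", "Urban")].tolist())
```

Keeping the fixture inline like this avoids shipping a saved web page at all, at the cost of not covering the quirks of real-world markup.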

jreback

lgtm.

https://github.com/pandas-dev/pandas/pull/31146/checks?check_run_id=398501596

linting issue, ping on green.

pandas/tests/io/test_html.py

WillAyd · 2020-01-20T20:20:54Z

Looks like you just need to remove the Index import from pandas/tests/io/test_html.py

rebecca-palmer · 2020-01-20T22:41:32Z

From the checks bot:

Check import format using isort
##[error]Skipped 2 files
Check import format using isort DONE
##[error]Process completed with exit code 1.

No other errors, so not sure what that's supposed to mean?

The build bots fail with: Git checkout failed with exit code: 128

jreback · 2020-01-20T22:48:06Z

@rebecca-palmer you should install the pre-commit hooks to help you on PRs; they automatically run black, isort, and mypy. Do pip install pre-commit

simonjayhawkins · 2020-01-21T10:19:42Z

Thanks @rebecca-palmer

alimcmaster1 · 2020-02-02T15:50:51Z

pandas/tests/io/test_html.py

-        assert not any(s.isna().any() for _, s in df.items())
-
-    @pytest.mark.slow
-    def test_thousands_macau_index_col(self, datapath, request):


Was there no replacement for this test? It highlights behaviour we were hoping to fix in #29622.

rebecca-palmer · 2020-02-02

I seem to have misinterpreted this issue as "blank column names should not load as NaN", after observing that empty headers loaded as "Unnamed: n" while empty dtype=str body cells loaded as NaN. (I hence added 'assert "Unnamed" in result.columns[-1]' to check for that.)

I now suspect this was because I was testing this in bs4 4.8.

wikipedia_states has empty body cells, so re-adding this check should be easy. (Possibly relevant to the "what caused this" discussion: it also has nested tables but they aren't the one being read.)
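
The distinction drawn above, between an empty header cell and an empty body cell, can be sketched on a tiny hand-written table. The table below is hypothetical and is not the wikipedia_states fixture. The behaviour shown is what current pandas does with its default settings. As the comment above notes, results observed at the time may have depended on the bs4 version in use.

```python
from io import StringIO

import pandas as pd

# Hypothetical table: the second header cell and one body cell are empty.
HTML = """
<table>
  <thead>
    <tr><th>State</th><th></th></tr>
  </thead>
  <tbody>
    <tr><td>Alaska</td><td>x</td></tr>
    <tr><td>Texas</td><td></td></tr>
  </tbody>
</table>
"""

# Default settings: the empty header cell becomes an "Unnamed: n" column
# label, and the empty body cell is read as a missing value (NaN).
df = pd.read_html(StringIO(HTML))[0]

# With keep_default_na=False the empty string is no longer treated as
# missing, so the empty body cell is kept as "".
kept = pd.read_html(StringIO(HTML), keep_default_na=False)[0]

print(df.columns.tolist())
print(df.iloc[1, 1], repr(kept.iloc[1, 1]))
```

A re-added check for empty body cells would therefore need to state which of these two outcomes it expects.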

simonjayhawkins · 2020-05-04T19:57:17Z

@meeseeksdev backport to 1.0.x

rebecca-palmer added 2 commits January 19, 2020 15:25

TST: remove possible copyright violations

bba6223

TST: Add thousands separator, blank header cell, MultiIndex read_html…

9aa9ead

… tests Regain test coverage after the previous commit's removals

jreback added the IO HTML label Jan 20, 2020

jreback approved these changes Jan 20, 2020

jreback added this to the 1.1 milestone Jan 20, 2020

WillAyd reviewed Jan 20, 2020

pandas/tests/io/test_html.py

TST: remove unused import

06cf966

linting

4c0292a

simonjayhawkins merged commit 964400d into pandas-dev:master Jan 21, 2020

alimcmaster1 reviewed Feb 2, 2020

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request May 4, 2020

Backport PR pandas-dev#31146: Remove possibly illegal test data

94348f0

meeseeksmachine mentioned this pull request May 4, 2020

Backport PR #31146 on branch 1.0.x (Remove possibly illegal test data) #33976

Merged

simonjayhawkins pushed a commit that referenced this pull request May 4, 2020

Backport PR #31146: Remove possibly illegal test data (#33976)

ac21311

Co-authored-by: rebecca-palmer <[email protected]>

simonjayhawkins modified the milestones: 1.1, 1.0.4 May 26, 2020
