Skip to content

Raise FileNotFoundError in read_json if input looks like file path but file is missing #46718

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jun 7, 2022
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.5.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -432,6 +432,7 @@ Other API changes
<https://developers.googleblog.com/2022/02/making-oauth-flows-safer.html?m=1#disallowed-oob>`_.
The ``auth_local_webserver = False`` option is planned to stop working in
October 2022. (:issue:`46312`)
- :func:`read_json` now raises ``FileNotFoundError`` (previously ``ValueError``) when input is a string ending in ``.json``, ``.json.gz``, ``.json.bz2``, etc. but no such file exists. (:issue:`29102`)
-

.. ---------------------------------------------------------------------------
Expand Down
12 changes: 12 additions & 0 deletions pandas/io/json/_json.py
Original file line number Diff line number Diff line change
Expand Up @@ -52,6 +52,7 @@

from pandas.io.common import (
IOHandles,
_extension_to_compression,
file_exists,
get_handle,
is_fsspec_url,
Expand Down Expand Up @@ -698,6 +699,9 @@ def _get_data_from_filepath(self, filepath_or_buffer):

This method turns (1) into (2) to simplify the rest of the processing.
It returns input types (2) and (3) unchanged.

It raises FileNotFoundError if the input is a string ending in
one of .json, .json.gz, .json.bz2, etc. but no such file exists.
"""
# if it is a string but the file does not exist, it might be a JSON string
filepath_or_buffer = stringify_path(filepath_or_buffer)
Expand All @@ -716,6 +720,14 @@ def _get_data_from_filepath(self, filepath_or_buffer):
errors=self.encoding_errors,
)
filepath_or_buffer = self.handles.handle
elif (
isinstance(filepath_or_buffer, str)
and filepath_or_buffer.lower().endswith(
(".json",) + tuple(f".json{c}" for c in _extension_to_compression)
)
and not file_exists(filepath_or_buffer)
):
raise FileNotFoundError(f"File {filepath_or_buffer} does not exist")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can investigate if this logic can be moved according to #46718 (comment) (such that not just json files raise this) that would be great

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the safe assumptions to make here about how the file path ends? Is there sth like _extension_to_compression but for file formats?

# Only for write methods
if "r" not in mode and is_path:
    check_parent_directory(str(handle))

_supported_formats = ["json", "csv", "xls", "xlsx", ...]
_allowed_extensions = tuple(itertools.chain(
    (x,) + tuple(f"{x}{c}" for c in _extension_to_compression)
    for x in _supported_formats
))
if (
    is_path
    and handle.lower().endswith(_allowed_extensions)
    and not file_exists(handle)
):
    raise FileNotFoundError(f"File {handle} does not exist")

Copy link
Member

@twoertwein twoertwein Jun 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After read_json knows that it is a file, I think you can just raise inside get_handle when the file doesn't exist:

if "r" not in mode and is_path and not file_exists(handle):
    raise ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean this, right?

if "r" in mode and is_path and not file_exists(handle):
    raise ...

Copy link
Contributor Author

@janosh janosh Jun 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually read_json doesn't even call get_handle since it's behind a bunch of or cases none of which are true:

if (
    not isinstance(filepath_or_buffer, str)
    or is_url(filepath_or_buffer)
    or is_fsspec_url(filepath_or_buffer)
    or file_exists(filepath_or_buffer)
):
    self.handles = get_handle(...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there were attempts at creating an efficient way of checking whether a str is a legit JSON string. I assume these attempts were unsuccessful - we check the reverse. If that's the case, then moving the check into get_handle, will not work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's the case, then moving the check into get_handle, will not work.

So leave PR as is?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So leave PR as is?

Probably.


return filepath_or_buffer

Expand Down
14 changes: 14 additions & 0 deletions pandas/tests/io/json/test_pandas.py
Original file line number Diff line number Diff line change
Expand Up @@ -1566,6 +1566,20 @@ def test_read_json_with_url_value(self, url):
expected = DataFrame({"url": [url]})
tm.assert_frame_equal(result, expected)

@pytest.mark.parametrize(
"compression",
["", ".gz", ".bz2", ".tar"],
)
def test_read_json_with_very_long_file_path(self, compression):
# GH 46718
long_json_path = f'{"a" * 1000}.json{compression}'
with pytest.raises(
FileNotFoundError, match=f"File {long_json_path} does not exist"
):
# path too long for Windows is handled in file_exists() but raises in
# _get_data_from_filepath()
read_json(long_json_path)

@pytest.mark.parametrize(
"date_format,key", [("epoch", 86400000), ("iso", "P1DT0H0M0S")]
)
Expand Down
4 changes: 2 additions & 2 deletions pandas/tests/io/test_common.py
Original file line number Diff line number Diff line change
Expand Up @@ -187,7 +187,7 @@ def test_iterator(self):
(pd.read_hdf, "tables", FileNotFoundError, "h5"),
(pd.read_stata, "os", FileNotFoundError, "dta"),
(pd.read_sas, "os", FileNotFoundError, "sas7bdat"),
(pd.read_json, "os", ValueError, "json"),
(pd.read_json, "os", FileNotFoundError, "json"),
(pd.read_pickle, "os", FileNotFoundError, "pickle"),
],
)
Expand Down Expand Up @@ -253,7 +253,7 @@ def test_write_missing_parent_directory(self, method, module, error_class, fn_ex
(pd.read_hdf, "tables", FileNotFoundError, "h5"),
(pd.read_stata, "os", FileNotFoundError, "dta"),
(pd.read_sas, "os", FileNotFoundError, "sas7bdat"),
(pd.read_json, "os", ValueError, "json"),
(pd.read_json, "os", FileNotFoundError, "json"),
(pd.read_pickle, "os", FileNotFoundError, "pickle"),
],
)
Expand Down