
GH2550 revisited #4830


Closed · yt87 opened this issue Jan 20, 2021 · 2 comments · Fixed by #8923


yt87 commented Jan 20, 2021

Is your feature request related to a problem? Please describe.
I am retrieving files from AWS: https://registry.opendata.aws/wrf-se-alaska-snap/. An example:

import s3fs
import xarray as xr

s3 = s3fs.S3FileSystem(anon=True)
s3path = 's3://wrf-se-ak-ar5/gfdl/hist/daily/1980/WRFDS_1980-01-0[12].nc'
# glob the matching S3 keys and open each as a file-like object
remote_files = s3.glob(s3path)
fileset = [s3.open(file) for file in remote_files]

ds = xr.open_mfdataset(fileset, concat_dim='Time', decode_cf=False)
ds

The data files for 1980 are missing the time coordinate, so the above code fails. The time could be obtained by parsing the file name; however, in the current implementation the source attribute is available only when the fileset consists of strings or Paths.
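For context, a minimal workaround sketch under the current behaviour (assuming, like the fix function below, that each file has a length-1 Time dimension and follows the WRFDS_YYYY-MM-DD.nc naming): open each remote file individually, parse the valid time from its S3 key, assign the Time coordinate by hand, and concatenate.

import os
from datetime import datetime

import s3fs
import xarray as xr

s3 = s3fs.S3FileSystem(anon=True)
s3path = 's3://wrf-se-ak-ar5/gfdl/hist/daily/1980/WRFDS_1980-01-0[12].nc'

datasets = []
for name in s3.glob(s3path):
    # the S3 key ends in WRFDS_1980-01-01.nc, WRFDS_1980-01-02.nc, ...
    vtime = datetime.strptime(os.path.basename(name), 'WRFDS_%Y-%m-%d.nc')
    ds = xr.open_dataset(s3.open(name), decode_cf=False)
    datasets.append(ds.assign_coords(Time=[vtime]))

ds = xr.concat(datasets, dim='Time')

This is the boilerplate that the proposal below would fold into preprocess.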

Describe the solution you'd like
I would suggest returning to the original suggestion in #2550: pass filename_or_object as an argument to the preprocess function, but with the necessary signature inspection. Here is my attempt (code in open_mfdataset):

open_kwargs = dict(
    engine=engine, chunks=chunks or {}, lock=lock, autoclose=autoclose, **kwargs
)

if preprocess is not None:
    # Get the number of required (non-default) arguments of preprocess
    from inspect import signature

    parms = signature(preprocess).parameters
    num_preprocess_args = len([p for p in parms.values() if p.default == p.empty])
    if num_preprocess_args not in (1, 2):
        raise ValueError('preprocess accepts only 1 or 2 arguments')

if parallel:
    import dask

    # wrap the open_dataset, getattr, and preprocess with delayed
    open_ = dask.delayed(open_dataset)
    getattr_ = dask.delayed(getattr)
    if preprocess is not None:
        preprocess = dask.delayed(preprocess)
else:
    open_ = open_dataset
    getattr_ = getattr

datasets = [open_(p, **open_kwargs) for p in paths]
file_objs = [getattr_(ds, "_file_obj") for ds in datasets]
if preprocess is not None:
    if num_preprocess_args == 1:
        datasets = [preprocess(ds) for ds in datasets]
    else:
        # pass the corresponding filename_or_object as the second argument
        datasets = [preprocess(ds, p) for (ds, p) in zip(datasets, paths)]

With this, I can define the function fix as follows:

import os
from datetime import datetime

def fix(ds, source):
    vtime = datetime.strptime(os.path.basename(source.path), 'WRFDS_%Y-%m-%d.nc')
    return ds.assign_coords(Time=[vtime])

ds = xr.open_mfdataset(fileset, preprocess=fix, concat_dim='Time', decode_cf=False)

This is backward compatible; with functools.partial, preprocess can effectively accept any number of arguments:

from functools import partial
from pathlib import Path

import xarray as xr

def fix1(ds):
    print('fix1')
    return ds

def fix2(ds, file):
    print('fix2:', file.as_uri())
    return ds

def fix3(ds, file, arg):
    print('fix3:', file.as_uri(), arg)
    return ds

fileset = [Path('/home/george/Downloads/WRFDS_1988-04-23.nc'),
           Path('/home/george/Downloads/WRFDS_1988-04-24.nc')
          ]
ds = xr.open_mfdataset(fileset, preprocess=fix1, concat_dim='Time', parallel=True)
ds = xr.open_mfdataset(fileset, preprocess=fix2, concat_dim='Time')
ds = xr.open_mfdataset(fileset, preprocess=partial(fix3, arg='additional argument'),
                       concat_dim='Time')
This prints

fix1
fix1
fix2: file:///home/george/Downloads/WRFDS_1988-04-23.nc
fix2: file:///home/george/Downloads/WRFDS_1988-04-24.nc
fix3: file:///home/george/Downloads/WRFDS_1988-04-23.nc additional argument
fix3: file:///home/george/Downloads/WRFDS_1988-04-24.nc additional argument

Describe alternatives you've considered
The simple solution would be to make xarray s3fs-aware. IMHO this is not particularly elegant: either a check for an attribute or an import within a try/except block would be needed.
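As a rough sketch of the attribute check, a hypothetical helper (the name _get_source and the assumption that s3fs/fsspec file objects expose a path attribute are mine) could derive a source string without importing s3fs:

def _get_source(filename_or_obj):
    # Hypothetical helper, duck-typed: s3fs/fsspec file-like objects are
    # assumed to expose a ``path`` attribute; strings and pathlib.Path
    # objects fall through to str().
    path = getattr(filename_or_obj, "path", None)
    if path is not None:
        return str(path)
    return str(filename_or_obj)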

dcherian (Contributor) commented:

the source attribute is available only when the fileset consists of strings or Paths.

Is it possible to fix this instead?

yt87 (Author) commented Jan 25, 2021

One could always set source to str(filename_or_object). In this case:

import s3fs

s3 = s3fs.S3FileSystem(anon=True)
s3path = 's3://wrf-se-ak-ar5/gfdl/hist/daily/1980/WRFDS_1980-01-02.nc'
fileset = s3.open(s3path)
fileset
fileset.path

prints

<File-like object S3FileSystem, wrf-se-ak-ar5/gfdl/hist/daily/1980/WRFDS_1980-01-02.nc>

'wrf-se-ak-ar5/gfdl/hist/daily/1980/WRFDS_1980-01-02.nc'

It is easy to parse the above fileset representation, but there is no guarantee that some other external file representation will be amenable to parsing.

If the fix is only for s3fs, getting the path attribute is more elegant; however, this would require xarray to be aware of the module.
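For comparison, a sketch of that module-aware variant via an import inside try/except (treating the s3fs.S3File import path as an assumption):

try:
    from s3fs import S3File  # assumed class name, not verified here
except ImportError:
    S3File = ()  # isinstance against an empty tuple is always False

def _get_source(filename_or_obj):
    # Same hypothetical helper as above, but tied to the s3fs module
    if isinstance(filename_or_obj, S3File):
        return str(filename_or_obj.path)
    return str(filename_or_obj)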
