I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd
import fastparquet as fp
from os import path as os_path

df = pd.DataFrame({'ab': [1, 2, 3], 'a': ['a', 'b', 'c']})
file = os_path.expanduser('~/Documents/code/data/test.parquet')

# Case 1: write: pandas/pyarrow || read: pandas/pyarrow
# OK   : 'ab' column is read back with type 'category'
# NOOK : does not naturally overwrite (file names are different);
#        as a consequence, the df that is read back contains twice the data.
# OK   : with snappy not installed, naturally does not compress data.
df.to_parquet(file, partition_cols=['ab'])
df.to_parquet(file, partition_cols=['ab'])
df_rec_pd1 = pd.read_parquet(file)
print(df_rec_pd1['ab'])

# Case 2: write: pandas/fastparquet || read: pandas/fastparquet
# OK   : 'ab' column is read back with type 'category'
# OK   : naturally overwrites (same file names)
# NOOK : with snappy not installed, it does not naturally fall back to
#        uncompressed data; the 'compression' keyword is required to
#        specify 'uncompressed'.
df.to_parquet(file, partition_cols=['ab'], engine='fastparquet', compression='uncompressed')
df.to_parquet(file, partition_cols=['ab'], engine='fastparquet', compression='uncompressed')
df_rec_pd2 = pd.read_parquet(file, engine='fastparquet')
print(df_rec_pd2['ab'])

# Case 3: write: pandas/fastparquet || read: pandas/pyarrow
# OK   : 'ab' column is read back with type 'category'
# OK   : naturally overwrites (same file names)
# NOOK : with snappy not installed, it does not naturally fall back to
#        uncompressed data; the 'compression' keyword is required to
#        specify 'uncompressed'.
df.to_parquet(file, partition_cols=['ab'], engine='fastparquet', compression='uncompressed')
df.to_parquet(file, partition_cols=['ab'], engine='fastparquet', compression='uncompressed')
df_rec_pd3 = pd.read_parquet(file)
print(df_rec_pd3['ab'])

# Case 4: write: pandas/pyarrow || read: pandas/fastparquet
# NOOK : reading does not work. pyarrow does not generate the common metadata
#        file which fastparquet is looking for.
df.to_parquet(file, partition_cols=['ab'])
df.to_parquet(file, partition_cols=['ab'])
df_rec_pd4 = pd.read_parquet(file, engine='fastparquet')
print(df_rec_pd4['ab'])

# Case 5: write: fastparquet || read: pandas/pyarrow
# OK   : 'ab' column is read back with type 'category'
# OK   : naturally overwrites (same file names)
fp.write(file, df, file_scheme='hive', partition_on=['ab'], compression='BROTLI')
fp.write(file, df, file_scheme='hive', partition_on=['ab'], compression='BROTLI')
df_rec_fp_pd = pd.read_parquet(file)

# Case 6: write: fastparquet || read: fastparquet
# OK   : nothing to say, perfect :)
fp.write(file, df, file_scheme='hive', partition_on=['ab'])
fp.write(file, df, file_scheme='hive', partition_on=['ab'])
df_rec_fp = fp.ParquetFile(file).to_pandas()

# Case 7: write: pandas/pyarrow || read: fastparquet
# NOOK : reading does not work. It seems that even if snappy is not available
#        at writing, it is recorded in the metadata as the compression that has
#        been used. As fastparquet does not find it, it raises an error.
df.to_parquet(file)
df_rec_pd_fp = fp.ParquetFile(file).to_pandas()

# Case 8: write: pandas/pyarrow || read: fastparquet
# NOOK : bug already reported: https://github.com/pandas-dev/pandas/issues/39480
#        It seems pandas/pyarrow is not writing categories of int as categories.
df.to_parquet(file, compression='BROTLI')
df_rec_pd_fp = fp.ParquetFile(file).to_pandas()
Problem description
Hi. I report here 4 different problems as far as I can see (a 5th related one having been reported earlier in #39480). Trying to synthesize:
when using partition_cols in to_parquet, the way it is managed by fastparquet and pyarrow differs. Most notably, fastparquet writes common metadata in the root directory, while pyarrow... I don't know what it does, but at least it does not write this 'common metadata' file. This triggers the bug in case 4 (writing with pyarrow, reading with fastparquet): fastparquet does not find the common metadata file and refuses to read the data.
when 'snappy' is not installed, no error is raised by pyarrow at the writing step. But when the data is read back with fastparquet, the reader believes it has been compressed with 'snappy' and complains about not finding it. This raises the error in case 7.
when 'snappy' is not installed, could the writer and reader simply/naturally fall back to 'uncompressed' for both the writing and reading steps? With fastparquet, the user currently has to set the compression parameter to 'uncompressed', as in cases 2 & 3 (a minimal fallback sketch is given right after this list).
when using partition_cols with pyarrow, the names of the parquet files are always different. Hence, writing the same dataset twice results in duplicated data (fastparquet, on the contrary, always uses the same file names, which ensures natural overwriting), as in case 1.
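A minimal workaround sketch for the compression point, assuming it is acceptable to detect 'snappy' availability by hand (the try/except and the compression variable are mine, for illustration only; they are not a pandas feature):

try:
    import snappy  # provided by the python-snappy package
    compression = 'snappy'
except ImportError:
    compression = 'uncompressed'  # fall back when snappy is missing
df.to_parquet(file, partition_cols=['ab'], engine='fastparquet', compression=compression)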
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 9d598a5
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-41-generic
Version : #46~20.04.1-Ubuntu SMP Mon Jan 18 17:52:23 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8
pandas : 1.2.1
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.1.post20201107
Cython : 0.29.21
pytest : 6.1.1
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.3
fastparquet : 0.5.0
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : 2.0.0
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.20
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.51.2
the default compression in fastparquet is in fact none. When called from pandas, though, it's snappy (because they wanted to choose one of the two defaults to go with), causing the problem. I don't believe arrow uses python-snappy, so it can compress without it. Note that the newest fastparquet now has a hard dependency on cramjam, which includes snappy, so it will always be available.
indeed pyarrow does not write the metadata by default, but I believe it can. Fastparquet always does. As of the latest release, fastparquet can be passed a directory and it will find the data files just as pyarrow does. Previously, you had to use glob and pass a list of data files.
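If one does want pyarrow to write the shared metadata files, something along these lines should work (a sketch only, based on my reading of the pyarrow docs; dataset_dir is just the same partitioned-dataset directory as in the example above):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(df)
dataset_dir = file  # reuse the dataset directory from the example above
collected = []  # pyarrow appends one FileMetaData object per written file
pq.write_to_dataset(table, root_path=dataset_dir, partition_cols=['ab'],
                    metadata_collector=collected)
# schema-only file at the dataset root
pq.write_metadata(table.schema, os_path.join(dataset_dir, '_common_metadata'))
# combined footer metadata of all written data files
pq.write_metadata(table.schema, os_path.join(dataset_dir, '_metadata'),
                  metadata_collector=collected)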
in dask we had a lot of discussion about overwriting. It seems best to explicitly delete the contents of a directory before writing to it, unless "append" is specified. Dask now allows a filename template (perhaps still in PR) to specify the names of the data files; but this is not done in the backend libraries themselves. There's probably no pressing need for pandas to implement this.
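For the overwriting point, a minimal sketch of the "explicitly delete the contents of the directory before writing" approach (plain Python around the example above; nothing here is a pandas feature):

import shutil

shutil.rmtree(file, ignore_errors=True)   # drop any previously written dataset directory
df.to_parquet(file, partition_cols=['ab'])  # pyarrow then writes into a clean directory, so no stale files remain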