load_dataset for text files not working #622
Can you give us more information on your OS and pip environment (`pip list`)? |
@thomwolf Sure. I'll try downgrading to 3.7 now, even though Arrow says they support >=3.5.

Linux (Ubuntu 18.04) - Python 3.8
Package - Version
certifi 2020.6.20

Windows 10 - Python 3.8
Package - Version
certifi 2020.6.20 |
Downgrading to 3.7 does not help. Here is a dummy text file:
A temporary workaround for the "text" type is:

```python
dataset = Dataset.from_dict({"text": Path(dataset_f).read_text().splitlines()})
``` |
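For reference, a self-contained version of this workaround with the imports it needs; the file name here is a hypothetical placeholder:

```python
from pathlib import Path
from datasets import Dataset

dataset_f = "train.txt"  # hypothetical path: plain text, one sample per line

# Build a one-column dataset straight from the file's lines,
# bypassing the "text" loading script and its CSV reader entirely
dataset = Dataset.from_dict({"text": Path(dataset_f).read_text().splitlines()})
```

Note that this reads the whole file into memory, which is why it fails for the 9-10 GB corpus mentioned later in the thread.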
@banunitte Please do not post screenshots in the future but copy-paste your code and the errors. That allows others to copy-and-paste your code and test it. You may also want to provide the Python version that you are using. |
I have the exact same problem in Windows 10, Python 3.8. |
I have the same problem on Linux: the script crashes with a CSV error. This may be caused by CRLF line endings; after changing CRLF to LF, the problem was solved. |
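A minimal sketch of that CRLF-to-LF conversion (file names are hypothetical):

```python
# Convert Windows (CRLF) line endings to Unix (LF) ones
with open("train.txt", "rb") as f:
    data = f.read()
with open("train_lf.txt", "wb") as f:
    f.write(data.replace(b"\r\n", b"\n"))
```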
I pushed a fix for this. Not sure about the Windows one yet |
To complete what @lhoestq is saying: to use the new version of the text script you need to specify `script_version='master'`:

```python
dataset = load_dataset('text', script_version='master', data_files=XXX)
```

We do versioning by default, i.e. your version of the datasets lib will use the script with the same version by default (i.e. only the script matching your installed release). |
```
Traceback (most recent call last):
  File "main.py", line 281, in <module>
    main()
  File "main.py", line 190, in main
    train_data, test_data = data_factory(
  File "main.py", line 129, in data_factory
    train_data = load_dataset('text',
  File "/home/me/Downloads/datasets/src/datasets/load.py", line 608, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/me/Downloads/datasets/src/datasets/builder.py", line 468, in download_and_prepare
    self._download_and_prepare(
  File "/home/me/Downloads/datasets/src/datasets/builder.py", line 546, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/me/Downloads/datasets/src/datasets/builder.py", line 888, in _prepare_split
    for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
  File "/home/me/.local/lib/python3.8/site-packages/tqdm/std.py", line 1130, in __iter__
    for obj in iterable:
  File "/home/me/.cache/huggingface/modules/datasets_modules/datasets/text/512f465342e4f4cd07a8791428a629c043bb89d55ad7817cbf7fcc649178b014/text.py", line 103, in _generate_tables
    pa_table = pac.read_csv(
  File "pyarrow/_csv.pyx", line 617, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 123, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: CSV parse error: Expected 1 columns, got 2
```

Unfortunately I am still getting this issue on Linux. I installed datasets from source and set script_version to master. |
Since #644 it should now work on Windows @ScottishFold007
Same for you @BramVanroy. Not sure about the one on Linux though |
Linux here: I was using the 0.4.0 nlp library's load_dataset to load a 9-10 GB text dataset without exhausting the RAM. However, today I got the CSV error message mentioned in this issue. After installing the new datasets library from source and specifying script_version='master' I'm still getting this same error message. Furthermore, I cannot use the dictionary "trick" to load the dataset, since the system kills the process due to an out-of-memory problem. Is there any other solution to this error? Thank you in advance. |
Hi @raruidol, I'm not sure why you're having the CSV error on Linux. |
The crash message shows up when loading the dataset:
And this is the exact message:
And these are the pip packages I have atm and their versions:
|
I tested on Google Colab, which is also Linux, using this code:

```bash
wget https://raw.githubusercontent.com/abisee/cnn-dailymail/master/url_lists/all_train.txt
```

```python
from datasets import load_dataset
d = load_dataset("text", data_files="all_train.txt", script_version='master')
```

and I don't get this issue.

Could you test on your side if these lines work @raruidol? Also cc @Skyy93, as it seems you have the same issue. Whether it works or not, it should help to find where this bug comes from and fix it :) Thank you in advance! |
Update: I also tested the above code in a docker container from jupyter/minimal-notebook (based on Ubuntu) and am still not able to reproduce. |
It looks like your text input file works without any problem. I have been doing some experiments this morning with my input files and I'm almost certain that the crash is caused by some unexpected pattern in the files. However, I've not been able to spot the main cause of it. What I find strange is that this same corpus was loaded by the nlp 0.4.0 library without any problem... Where can I find the code where you structure the input text data in order to use it with pyarrow? |
Under the hood it does:

```python
import pyarrow as pa
import pyarrow.csv

# Use the CSV reader from pyarrow with one column for text files.
# To force the one-column setting, we set as delimiter an arbitrary
# control character that should not appear in text files, such as
# \b (backspace) or \v (vertical tab).
parse_options = pa.csv.ParseOptions(
    delimiter="\b",
    quote_char=False,
    double_quote=False,
    escape_char=False,
    newlines_in_values=False,
    ignore_empty_lines=False,
)
read_options = pa.csv.ReadOptions(use_threads=True, column_names=["text"])
pa_table = pa.csv.read_csv("all_train.txt", read_options=read_options, parse_options=parse_options)
```

Note that we changed the parse options with datasets 1.0. |
Could you try with `\a` instead of `\b`? |
I was just exploring whether the crash was happening in every shard or not, and which shards were generating the error message. With \b I got the following list of shards crashing:
I also tried with \a and the list decreased but there were still several crashes:
Which means that it is quite possible that the assumption that some unexpected pattern in the files is causing the crashes is true. If I am able to reach any conclusion I will post it here ASAP. |
Hmmm, I was expecting it to work with \a; not sure why those characters appear in your text files though |
Hi @lhoestq, is there any input length restriction that was not there before the update of the nlp library? |
No, we never set any input length restriction on our side (maybe Arrow does, but I don't think so) |
@lhoestq Can you ever be certain that a delimiter character is not present in a plain text file? In other formats (e.g. CSV), rules are set for what is and isn't allowed, so that a file actually constitutes a CSV file. In a text file you basically have "anything goes", so I don't think you can ever be entirely sure that the chosen delimiter does not exist in the text file, or am I wrong? If I understand correctly, you choose a delimiter that we hope does not exist in the file, so that when the CSV parser starts splitting into columns, it will only ever create one column? Why can't we use a newline character though? |
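To make that concern concrete, here is a small check (file name hypothetical) that scans a file for the candidate delimiter characters; if any of them occur, the one-column trick breaks on that file:

```python
# Check whether candidate delimiter characters actually occur in a
# text file; if any do, the one-column CSV trick will fail on it
candidates = {"\b": "backspace", "\a": "bell", "\v": "vertical tab"}
with open("train.txt", encoding="utf-8") as f:
    text = f.read()
for char, name in candidates.items():
    count = text.count(char)
    print(f"{name}: {'found ' + str(count) + ' times' if count else 'not found'}")
```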
Okay, I have split the crashing shards into individual sentences, and some examples of the inputs that are causing the crashes are the following ones:

4. DE L’ORGANITZACIÓ ESTAMENTAL A L’ORGANITZACIÓ EN CLASSES A mesura que es desenvolupava un sistema econòmic capitalista i naixia una classe burgesa cada vegada més preparada per a substituir els dirigents de les velles monarquies absolutistes, es qüestionava l’abundància de béns amortitzats, que com s’ha dit estaven fora del mercat i no pagaven tributs, pels perjudicis que ocasionaven a les finances públiques i a l’economia en general. Aquest estat d’opinió revolucionari va desembocar en un conjunt de mesures pràctiques de caràcter liberal. D’una banda, les que intentaven desposseir les mans mortes del domini de béns acumulats, procés que acostumem a denominar desamortització, i que no és més que la nacionalització i venda d’aquests béns eclesiàstics o civils en subhasta pública al millor postor. D’altra banda, les que redimien o reduïen els censos i delmes o aixecaven les prohibicions de venda, és a dir, les vinculacions. La desamortització, que va afectar béns dels ordes religiosos, dels pobles i d’algunes corporacions civils, no va ser un camí fàcil, perquè costava i costa trobar algú que sigui indiferent a la pèrdua de béns, drets i privilegis. I té una gran transcendència, va privar els antics estaments de les Espanyes, clero i pobles —la noblesa en queda al marge—, de la força econòmica que els donaven bona part de les seves terres i, en última instància, va preparar el terreny per a la substitució de la vella societat estamental per la nova societat classista. En aquesta societat, en teoria, les agrupacions socials són obertes, no tenen cap estatut jurídic privilegiat i estan definides per la possessió o no d’uns béns econòmics que són lliurement alienables. A les Espanyes la transformació va afectar poc l’aristocràcia latifundista, allà on n’hi havia. Aquesta situació va afavorir, en part, la persistència de la vella cultura de la societat estamental en determinats ambients, i això ha influït decisivament en la manca de democràcia que caracteritza la majoria de règims polítics que s’han anat succeint. Una manera de pensar que sempre sura en un moment o altre, i que de fet no acaba de desaparèixer del tot.

5. INICI DE LA DESAMORTITZACIÓ A LES ESPANYES Durant el segle xviii, dins d’aquesta visió lliberal, va agafar força en alguns cercles de les Espanyes el corrent d’opinió contrari a les mans mortes. Durant el regnat de Carles III, s’arbitraren les primeres mesures desamortitzadores proposades per alguns ministres il·lustrats. Aquestes disposicions foren modestes i poc eficaces, no van aturar l’acumulació de terres per part dels estaments que constituïen les mans mortes i varen afectar principalment béns dels pobles. L’Església no va ser tocada, excepte en el cas de 110 la revolució liberal, perquè, encara que havia perdut els seus drets jurisdiccionals, havia conservat la majoria de terres i fins i tot les havia incrementat amb d’altres que procedien de la desamortització. En la nova situació, les mans mortes del bosc públic eren l’Estat, que no cerca mai l’autofinançament de les despeses de gestió; els diners que manquin ja els posarà l’Estat.

9. DEFENSA I INTENTS DE RECUPERACIÓ DELS BÉNS COMUNALS DESAMORTITZATS El procés de centralització no era senzill, perquè, d’una banda, la nova organització apartava de la gestió moltes corporacions locals i molts veïns que l’havien portada des de l’edat mitjana, i, de l’altra, era difícil de coordinar la nova silvicultura amb moltes pràctiques forestals i drets tradicionals, com la pastura, fer llenya o tallar un arbre aquí i un altre allà quan tenia el gruix suficient, les pràctiques que s’havien fet sempre. Les primeres passes de la nova organització centralitzada varen tenir moltes dificultats en aquells indrets en què els terrenys municipals i comunals tenien un paper important en l’economia local. La desobediència a determinades normes imposades varen prendre formes diferents. Algunes institucions, com, per exemple, la Diputació de Lleida, varen retardar la tramitació d’alguns expedients i varen evitar la venda de béns municipals. Molts pobles permeteren deixar que els veïns continuessin amb les seves pràctiques tradicionals, d’altres varen boicotejar les subhastes d’aprofitaments. L’Estat va reaccionar encomanant a la Guàrdia Civil el compliment de les noves directrius. Imposar el nou règim va costar a l’Administració un grapat d’anys, però de mica en mica, amb molta, molta guarderia i gens de negociació, ho va aconseguir. La nova gestió estatal dels béns municipals va deixar, com hem comentat, molta gent sense uns recursos necessaris per a la supervivència, sobre tot en àrees on predominaven les grans propietats, i on els pagesos sense terra treballaven de jornalers temporers. Això va afavorir que, a bona part de les Espanyes, les primeres lluites camperoles de la segona meitat del segle xix defensessin la recuperació dels comunals desamortitzats; per a molts aquella expropiació i venda dirigida pels governs monàrquics era la causa de molta misèria. D’altres, més radicalitzats, varen entendre que l’eliminació de la propietat col·lectiva i la gestió estatal dels boscos no desamortitzats suposava una usurpació pura i dura. En les zones més afectades per la desamortització això va donar lloc a un imaginari centrat en la defensa del comunal. La Segona República va arribar en una conjuntura econòmica de crisi, generada pel crac del 1929. Al camp, aquesta situació va produir una forta caiguda dels preus dels productes agraris i un increment important de l’atur. QUADERNS AGRARIS 42 (juny 2017), p. 105-126

I think that the main difference between the crashing samples and the rest is their length. Therefore, couldn't the length be causing the error messages? I hope that with these samples you can identify what is causing the crashes, considering that the 0.4.0 nlp library was loading them properly. |
So we're using the CSV reader to read text files because Arrow doesn't have a text reader. So we have two options: keep the CSV reader and find a delimiter that never appears in the files, or implement a dedicated text reader.
As long as the text file follows some sane encoding, it wouldn't make sense for it to contain characters such as the bell character. However, I agree it can happen.
Exactly. And Arrow doesn't allow the newline character as a delimiter, unfortunately. |
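A small probe of that claim, assuming current pyarrow behavior (the exact error type may vary by version):

```python
import pyarrow as pa
import pyarrow.csv

# Probe Arrow's behaviour when newline is used as the CSV delimiter;
# per the comment above, this is rejected
with open("probe.txt", "w") as f:
    f.write("one\ntwo\n")

try:
    opts = pa.csv.ParseOptions(delimiter="\n")
    pa.csv.read_csv("probe.txt", parse_options=opts)
except Exception as e:
    print(type(e).__name__, e)
```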
Thanks for digging into it! Characters like \a or \b are not shown when printing the text, so as it is I can't tell if it contains unexpected characters.
To check that, you could try to run:

```python
import pyarrow as pa
import pyarrow.csv

open("dummy.txt", "w").write((("a" * 10_000) + "\n") * 4)  # 4 lines of 10,000 'a'

parse_options = pa.csv.ParseOptions(
    delimiter="\b",
    quote_char=False,
    double_quote=False,
    escape_char=False,
    newlines_in_values=False,
    ignore_empty_lines=False,
)
read_options = pa.csv.ReadOptions(use_threads=True, column_names=["text"])
pa_table = pa.csv.read_csv("dummy.txt", read_options=read_options, parse_options=parse_options)
```

On my side it runs without error, though. |
That's true, it was my error printing the text that way. Maybe as a workaround, I can force all my input samples to have "\b" at the end? |
I don't think it would work, since we only want one column and "\b" is set to be the delimiter between two columns, so it will raise the same issue again: pyarrow would think that there is more than one column if the delimiter is found somewhere. Anyway, I'll work on a new text reader if we don't find the right workaround for this delimiter issue. |
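For illustration, a minimal reproduction of that failure mode (file name hypothetical), using the same parse options as the reader shown earlier:

```python
import pyarrow as pa
import pyarrow.csv

# Write a file in which the "impossible" delimiter actually occurs
with open("bad.txt", "w") as f:
    f.write("hello\bworld\nplain line\n")

parse_options = pa.csv.ParseOptions(
    delimiter="\b",
    quote_char=False,
    double_quote=False,
    escape_char=False,
    newlines_in_values=False,
    ignore_empty_lines=False,
)
read_options = pa.csv.ReadOptions(use_threads=True, column_names=["text"])

# The line containing \b is parsed as two columns, so this raises:
# pyarrow.lib.ArrowInvalid: CSV parse error: Expected 1 columns, got 2
pa.csv.read_csv("bad.txt", read_options=read_options, parse_options=parse_options)
```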
I just merged a new text reader based on pandas. Until we do a new release you can experiment with it using:

```python
from datasets import load_dataset
d = load_dataset("text", data_files=..., script_version="master")
``` |
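A rough sketch of the idea behind a pandas-based text reader, not the actual implementation that was merged (file name hypothetical):

```python
import pandas as pd

# Read a line-delimited text file into a single "text" column,
# with no CSV delimiter involved at all
with open("train.txt", encoding="utf-8") as f:
    df = pd.DataFrame({"text": f.read().splitlines()})
```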
Thank you @lhoestq, I have tried again with the new text reader and there is still some error. Depending on how I load the data, I have spotted two different crashes. When I try to load the full-size corpus text file I get the following output:
However, when loading the sharded version, the error is different:
|
I had the same error but managed to fix it by adding … Have a look at Stack Overflow. EDIT: |
Indeed, good catch! You can expect a patch release by tomorrow |
Setting … Also, I'm not able to reproduce your issue on macOS with files containing \r\n as end of lines. |
I found a way to implement it without a third-party lib and without separator/delimiter logic. I'd love to have your feedback on the PR @Skyy93, hopefully this is the final iteration of the text dataset :) Let me know if it works on your side! Until there's a new release you can test it with:

```python
from datasets import load_dataset
d = load_dataset("text", data_files=..., script_version="master")
``` |
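A sketch of what such a delimiter-free reader can look like; this is an assumed illustration, not the code from the actual PR:

```python
import pyarrow as pa

# Read lines in batches with plain Python and build Arrow tables
# directly, so no CSV parsing (and no delimiter) is involved
def generate_tables(path, batch_size=10_000):
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line.rstrip("\n"))
            if len(batch) == batch_size:
                yield pa.table({"text": batch})
                batch = []
        if batch:
            yield pa.table({"text": batch})
```

Each yielded table has a single "text" column, mirroring the one-column behaviour of the old CSV-based reader.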
Looks good! Thank you for your support. |
The same problem happens with "csv":

```python
dataset = load_dataset("csv", data_files="custom_data.csv", delimiter="\t", column_names=["title", "text"], script_version="master")
``` |
Could you open a new issue for CSV, please? |
I think this issue (loading text files) is now solved. Closing it, please open a new issue with full details to continue or start another discussion on this topic. |
Trying the following snippet, I get different problems on Linux and Windows.
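A minimal snippet consistent with this description (the file name is assumed):

```python
from datasets import load_dataset

# data_files passed as a plain string, although the documented
# signature is Union[Dict, List]
dataset = load_dataset("text", data_files="train.txt")
```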
(PS: this example shows that you can use a string as input for data_files, but the signature is Union[Dict, List].)

The problem on Linux is that the script crashes with a CSV error (even though it isn't a CSV file). On Windows the script just seems to freeze or get stuck after loading the config file.
Linux stack trace:
Windows just seems to get stuck. Even with a tiny dataset of 10 lines, it has been stuck for 15 minutes already at this message: