fMRIprep / nipype error while checking node hash #3009

Closed
jeroenvanbaar opened this issue Aug 28, 2019 · 16 comments


jeroenvanbaar commented Aug 28, 2019

Summary

When running fmriprep 1.5.0rc2 in a Singularity container, I get the following nipype.workflow warning, which asks me to open an issue here. I don't get this error when I run fmriprep 1.4.1rc1.

The warning/error:

190828-18:05:54,262 nipype.workflow WARNING:
	 Error while checking node hash, forcing re-run. Although this error may not prevent the workflow from running, it could indicate a major problem. Please report a new issue at https://github.com/nipy/nipype/issues adding the following information:

	Node: fmriprep_wf.single_subject_001_wf.func_preproc_ses_1_task_wordReading_run_4_wf.bold_reference_wf.enhance_and_skullstrip_bold_wf.map_brainmask
	Interface: niworkflows.interfaces.fixes.FixHeaderApplyTransforms
	Traceback:
Traceback (most recent call last):

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/base.py", line 337, in _local_hash_check
    cached, updated = self.procs[jobid].is_cached()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 297, in is_cached
    hashed_inputs, hashvalue = self._get_hashval()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 494, in _get_hashval
    self._get_inputs()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 521, in _get_inputs
    outputs = _load_resultfile(results_file)[0].outputs

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/utils.py", line 300, in load_resultfile
    result = loadpkl(results_file)

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/utils/filemanip.py", line 686, in loadpkl
    pkl_file = pklopen(infile.name, 'rb')

  File "/usr/local/miniconda/lib/python3.7/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)

  File "/usr/local/miniconda/lib/python3.7/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')

FileNotFoundError: [Errno 2] No such file or directory: 'result_norm.pklz'

Script/Workflow details

My Singularity command:

singularity run --cleanenv -B /gpfs_home/jvanbaar/data/jvanbaar/my_project_folder:/p1,/gpfs/scratch/jvanbaar:/s1 fmriprep-1.5.0rc2.simg /p1/sourcedata /p1/derivatives participant -w /s1 --participant-label sub-001 --fs-license-file /p1/freesurfer.txt --fs-no-reconall --output-spaces MNI152NLin2009cAsym --resource-monitor --write-graph --use-syn-sdc --ignore fieldmaps --n_cpus 4 --mem_mb 32000

Happy to provide more information.

@oesteban

You seem to be reusing /gpfs/scratch/jvanbaar as your work directory. Can you test whether the following command works for you?

singularity run --cleanenv -B /gpfs_home/jvanbaar/data/jvanbaar/my_project_folder:/p1,/gpfs/scratch/jvanbaar:/s1 fmriprep-1.5.0rc2.simg /p1/sourcedata /p1/derivatives participant -w /s1/work-001 --participant-label sub-001 --fs-license-file /p1/freesurfer.txt --fs-no-reconall --output-spaces MNI152NLin2009cAsym --resource-monitor --write-graph --use-syn-sdc --ignore fieldmaps --n_cpus 4 --mem_mb 32000


jeroenvanbaar commented Aug 28, 2019

Thanks for the quick response! I'm now using /s1/sub-001 as the work directory, but I'm still getting the same error. The traceback is different, though:

190828-19:03:41,432 nipype.workflow WARNING:
	 Error while checking node hash, forcing re-run. Although this error may not prevent the workflow from running, it could indicate a major problem. Please report a new issue at https://github.com/nipy/nipype/issues adding the following information:

	Node: fmriprep_wf.single_subject_001_wf.func_preproc_ses_1_task_wordReading_run_1_wf.bold_reference_wf.gen_ref
	Interface: niworkflows.interfaces.registration.EstimateReferenceImage
	Traceback:
Traceback (most recent call last):

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/base.py", line 337, in _local_hash_check
    cached, updated = self.procs[jobid].is_cached()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 297, in is_cached
    hashed_inputs, hashvalue = self._get_hashval()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 494, in _get_hashval
    self._get_inputs()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 521, in _get_inputs
    outputs = _load_resultfile(results_file)[0].outputs

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/utils.py", line 311, in load_resultfile
    if resolve and result.outputs:

AttributeError: 'NoneType' object has no attribute 'outputs'

@oesteban

Okay, this is not the same error. Are you trying to parallelize over subjects by running several fmriprep processes separately?


jeroenvanbaar commented Aug 29, 2019

Yes, this is when running 10 separate fmriprep processes in parallel on separate HPC nodes (10 subjects). I defined a different scratch folder for each process.

@oesteban

Can you post your submission script?

@jeroenvanbaar

Here's my Slurm submission script:

#!/bin/bash

#SBATCH -J fMRIprep

#SBATCH -c 4
#SBATCH --mem=32G
#SBATCH --account=my_account_name
#SBATCH -t 24:00:00

#SBATCH --array=1,2,4,5,6,7,8,10,11,13
#SBATCH -e fmriprep_logs/fmriprep_sub-%a.err
#SBATCH -o fmriprep_logs/fmriprep_sub-%a.out

#SBATCH --mail-type=ALL
#SBATCH --mail-user=my_email_address
 
echo "Starting fMRIprep for subject "$SLURM_ARRAY_TASK_ID

subject_dir=$(printf "sub-%03d" $SLURM_ARRAY_TASK_ID)
scratch_dir=$(printf "/s1/sub-%03d" $SLURM_ARRAY_TASK_ID)

echo "Scratch directory: "$scratch_dir

singularity run --cleanenv -B /gpfs_home/jvanbaar/data/jvanbaar/my_project_folder:/p1,/gpfs/scratch/jvanbaar:/s1 fmriprep-1.5.0rc2.simg /p1/sourcedata /p1/derivatives participant -w $scratch_dir --participant-label $subject_dir --fs-license-file /p1/freesurfer.txt --fs-no-reconall --output-spaces MNI152NLin2009cAsym --resource-monitor --write-graph --use-syn-sdc --ignore fieldmaps --n_cpus 4 --mem_mb 32000


mattcieslak commented Sep 3, 2019

I am also getting this error when running a workflow in nipype 1.2.1 that has MapNodes and uses multiprocessing. It happens about 40% of the time and occurs randomly; if I re-run the workflow multiple times, it eventually succeeds. So it's not specific to fmriprep; I think it's nipype 1.2.1.

Edit: I'm getting this error both with Singularity 3.3.0 and with Docker.
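
For illustration, the failing setups look roughly like the sketch below: a MapNode over a list of inputs, executed with the MultiProc plugin. The interface and values are placeholders, not the actual workflow that crashes.

# Illustrative sketch only: a MapNode run under the MultiProc plugin,
# the general shape of the workflows that intermittently hit this warning.
from nipype.pipeline.engine import Workflow, Node, MapNode
from nipype.interfaces.utility import Function, IdentityInterface

def double(x):
    return 2 * x

inputnode = Node(IdentityInterface(fields=['values']), name='inputnode')
inputnode.inputs.values = list(range(20))

doubler = MapNode(
    Function(input_names=['x'], output_names=['y'], function=double),
    iterfield=['x'],
    name='doubler',
)

wf = Workflow(name='mapnode_repro', base_dir='/tmp/mapnode_repro')
wf.connect(inputnode, 'values', doubler, 'x')

if __name__ == '__main__':
    wf.run(plugin='MultiProc', plugin_args={'n_procs': 4})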


oesteban commented Sep 3, 2019

@mattcieslak could you write a failing example for Nipype that we can use to debug this?


oesteban commented Sep 4, 2019

Okay, I can see this is currently happening in fMRIPrep's master - I'll take this one.


oesteban commented Sep 4, 2019

@jeroenvanbaar can you provide full logs from fMRIPrep?

@mattcieslak I've opened the issue referenced above for the case of MapNodes.


oesteban commented Sep 4, 2019

It seems you can ignore this warning; it does not appear to stop the workflow:

190828-19:03:41,432 nipype.workflow WARNING:
	 Error while checking node hash, forcing re-run. Although this error may not prevent the workflow from running, it could indicate a major problem. Please report a new issue at https://github.com/nipy/nipype/issues adding the following information:

	Node: fmriprep_wf.single_subject_001_wf.func_preproc_ses_1_task_wordReading_run_1_wf.bold_reference_wf.gen_ref
	Interface: niworkflows.interfaces.registration.EstimateReferenceImage
	Traceback:
Traceback (most recent call last):

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/base.py", line 337, in _local_hash_check
    cached, updated = self.procs[jobid].is_cached()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 297, in is_cached
    hashed_inputs, hashvalue = self._get_hashval()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 494, in _get_hashval
    self._get_inputs()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 521, in _get_inputs
    outputs = _load_resultfile(results_file)[0].outputs

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/utils.py", line 311, in load_resultfile
    if resolve and result.outputs:

AttributeError: 'NoneType' object has no attribute 'outputs'

190828-19:03:41,434 nipype.workflow INFO:
	 [Node] Setting-up "fmriprep_wf.single_subject_001_wf.func_preproc_ses_1_task_wordReading_run_1_wf.bold_reference_wf.gen_ref" in "/s1/sub-001/fmriprep_wf/single_subject_001_wf/func_preproc_ses_1_task_wordReading_run_1_wf/bold_reference_wf/gen_ref".
190828-19:03:41,442 nipype.workflow INFO:
	 [Node] Cached "fmriprep_wf.single_subject_001_wf.func_preproc_ses_1_task_wordReading_run_1_wf.bold_reference_wf.gen_ref" - collecting precomputed outputs
190828-19:03:41,442 nipype.workflow INFO:
	 [Node] "fmriprep_wf.single_subject_001_wf.func_preproc_ses_1_task_wordReading_run_1_wf.bold_reference_wf.gen_ref" found cached.

Nonetheless, we'll try to find out why this warning is being overused.

@oesteban
Copy link
Contributor

oesteban commented Sep 10, 2019

Another one. This time it happened after the interface was run:

190910-02:31:11,300 nipype.utils ERROR:
	 No metadata was found in the pkl file. Make sure you are currently using the same Nipype version from the generated pkl.
190910-02:31:11,346 nipype.workflow ERROR:
	 Node add_dvars_header failed to run on host 1064e7e45ec5.
190910-02:31:11,394 nipype.workflow ERROR:
	 Saving crash info to /out/fmriprep/sub-01/log/20190910-021645_86b7c962-680e-4c02-a110-e8aef9529849/crash-20190910-023111-UID1001-add_dvars_header-8b4860c1-2a0c-4a70-a879-961f5dbdc4d8.txt
Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/legacymultiproc.py", line 381, in _send_procs_to_workers
    self.procs[jobid].run(updatehash=updatehash)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 488, in run
    self, report_type='postexec', is_mapnode=isinstance(self, MapNode))
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/utils.py", line 152, in write_report
    result = node.result  # Locally cache result
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 198, in result
    op.join(self.output_dir(), 'result_%s.pklz' % self.name))[0]
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/utils.py", line 311, in load_resultfile
    if resolve and result.outputs:
AttributeError: 'NoneType' object has no attribute 'outputs'

190910-02:31:11,473 nipype.workflow INFO:
	 [Job 238] Completed (fmriprep_wf.single_subject_01_wf.func_preproc_task_mixedgamblestask_run_02_wf.bold_confounds_wf.add_dvars_header).

EDIT: Added the link, and checked that simply rerunning the build (i.e., without clearing the cache) worked. This might be some sort of synchronization issue.

oesteban added a commit to oesteban/nipype that referenced this issue Sep 10, 2019
Prevents nipy#3009 and nipy#3014 from happening - although this might not solve those issues,
this patch will help find their origin by making ``load_resultfile`` more strict (and
letting it raise exceptions).

The try .. except structure is moved to the only place it was being used within the Node code.

satra commented Sep 10, 2019

@oesteban - something seems to have fundamentally changed for these errors to start popping up, which makes me worry a bit.

Can we run it with the config options stop_on_first_crash = True and an increased timeout (although normally the timeout should not affect things locally)?

Also, the call to result.outputs happens after this check:

    if not results_file.exists():
        return result, aggregate, attribute_error

This also suggests a write-timing / results-availability issue, i.e., the plugin has returned control without finishing writing a results file. I didn't think that could happen for MultiProc, but now that we are using concurrent.futures, it can easily happen, especially on a system like CircleCI, where execution speeds can be limited.

How about we try out SoftFileLock from filelock (which we are using in Pydra) when saving and loading result files? If this works, it may help in a few areas.
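
For concreteness, a minimal sketch of that idea (not nipype's current implementation): take a SoftFileLock on a sidecar .lock file around every read and write of a result_*.pklz, so a reader never sees a half-written file. The wrapper names, the lock path, and the timeout are placeholders.

# Sketch of the SoftFileLock idea, not nipype's actual code: serialize
# access to result files so readers never see a partially written pickle.
from filelock import SoftFileLock
from nipype.utils.filemanip import loadpkl, savepkl

def save_result_locked(results_file, result, timeout=60):
    # The ".lock" sidecar path is an arbitrary choice for this sketch.
    with SoftFileLock(str(results_file) + '.lock', timeout=timeout):
        savepkl(str(results_file), result)

def load_result_locked(results_file, timeout=60):
    with SoftFileLock(str(results_file) + '.lock', timeout=timeout):
        return loadpkl(str(results_file))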

oesteban added a commit to oesteban/nipype that referenced this issue Sep 10, 2019
Minimize the access to the ``result`` property when writing
pre/post-execution reports.

This modification should particularly preempt nipy#3009 (comment)
@oesteban

something seems to have fundamentally changed for these errors to start popping up, which makes me worry a bit.

We're on the same page :(

Can we run it with the config options stop_on_first_crash = True and an increased timeout (although normally the timeout should not affect things locally)?

Yes, I think setting stop_on_first_crash = True is a good idea for fMRIPrep anyway.
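
For reference, something along these lines (set before the workflow is built) would do it; both keys are standard nipype execution options, and the timeout value below is just a placeholder:

# Stop the whole workflow at the first crashed node, and give result files
# more time to appear on disk before the plugin gives up on them.
from nipype import config

config.update_config({
    'execution': {
        'stop_on_first_crash': True,
        'job_finished_timeout': 30,  # seconds; placeholder value
    }
})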

This also suggests a write-timing / results-availability issue, i.e., the plugin has returned control without finishing writing a results file. I didn't think that could happen for MultiProc, but now that we are using concurrent.futures, it can easily happen, especially on a system like CircleCI, where execution speeds can be limited.

This happened with LegacyMultiProc, so there must be something else going on. But yes, I've been looking into this and I can only think of two possibilities: write-timing / results availability, or the results object getting mutated elsewhere (which seems impossible, by the way).

How about we try out SoftFileLock from filelock (which we are using in Pydra) when saving and loading result files? If this works, it may help in a few areas.

Sounds good, of course.

@effigies added this to the 1.2.3 milestone Sep 11, 2019
@oesteban

This has been addressed with #3024. I'll leave #3014 open as the changes we've been including might not have solved that one.

@oesteban

Interestingly, this error (warning) on regular nodes seems to be really short-lived:

191010-06:42:30,318 nipype.utils ERROR:
         No metadata was found in the pkl file. Make sure you are currently using the same Nipype version from the generated pkl.
191010-06:42:30,324 nipype.workflow WARNING:
         Error while checking node hash, forcing re-run. Although this error may not prevent the workflow from running, it could indicate a major problem. Please report a new issue
at https://github.com/nipy/nipype/issues adding the following information:

        Node: fmriprep_wf.single_subject_NDARINV00CY2MDM_wf.func_preproc_ses_baselineYear1Arm1_task_rest_run_04_wf.bold_reg_wf.bbreg_wf.bbregister
        Interface: niworkflows.interfaces.freesurfer.PatchedBBRegisterRPT
        Traceback:
Traceback (most recent call last):

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/base.py", line 337, in _local_hash_check
    cached, updated = self.procs[jobid].is_cached()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 297, in is_cached
    hashed_inputs, hashvalue = self._get_hashval()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 494, in _get_hashval
    self._get_inputs()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 521, in _get_inputs
    outputs = _load_resultfile(results_file)[0].outputs

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/utils.py", line 311, in load_resultfile
    if resolve and result.outputs:

AttributeError: 'NoneType' object has no attribute 'outputs'

191010-06:42:30,325 nipype.workflow INFO:
         [Node] Setting-up "fmriprep_wf.single_subject_NDARINV00CY2MDM_wf.func_preproc_ses_baselineYear1Arm1_task_rest_run_04_wf.bold_reg_wf.bbreg_wf.bbregister" in "/scratch/03763/oesteban/fmriprep-work/sub-NDARINV00CY2MDM/fmriprep_wf/single_subject_NDARINV00CY2MDM_wf/func_preproc_ses_baselineYear1Arm1_task_rest_run_04_wf/bold_reg_wf/bbreg_wf/bbregister".
191010-06:42:30,353 nipype.workflow INFO:
         [Job 615] Cached (fmriprep_wf.single_subject_NDARINV00CY2MDM_wf.func_preproc_ses_baselineYear1Arm1_task_rest_run_03_wf.bold_reg_wf.bbreg_wf.mri_coreg).
191010-06:42:30,378 nipype.workflow INFO:
         [Node] Cached "fmriprep_wf.single_subject_NDARINV00CY2MDM_wf.func_preproc_ses_baselineYear1Arm1_task_rest_run_04_wf.bold_reg_wf.bbreg_wf.bbregister" - collecting precomputed outputs
191010-06:42:30,378 nipype.workflow INFO:
         [Node] "fmriprep_wf.single_subject_NDARINV00CY2MDM_wf.func_preproc_ses_baselineYear1Arm1_task_rest_run_04_wf.bold_reg_wf.bbreg_wf.bbregister" found cached.

When the node is checked by the execution plugin, the warning is issued while trying to read the outputs of a prior node feeding into the bbregister node. A few seconds later, the result files for those inputs are checked again and the read succeeds: the node is found cached.
