
Error checking MapNode hash with fMRIPrep #3014

Closed
oesteban opened this issue Sep 4, 2019 · 22 comments · Fixed by #3089

oesteban commented Sep 4, 2019

Summary

190904-01:20:18,695 nipype.workflow ERROR:
	 Node t1_conform failed to run on host 890949213e2e.
190904-01:20:18,696 nipype.workflow ERROR:
	 Saving crash info to /tmp/ds054/derivatives/smriprep/sub-100185/log/20190904-011955_7143d03c-6f71-442e-b03b-868465d5d7f5/crash-20190904-012018-UID1001-t1_conform-79d9866d-bc88-4848-9512-c54ff701db1b.txt
Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/multiproc.py", line 269, in _send_procs_to_workers
    num_subnodes = self.procs[jobid].num_subnodes()
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 1204, in num_subnodes
    self._get_inputs()
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 1219, in _get_inputs
    super(MapNode, self)._get_inputs()
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 521, in _get_inputs
    outputs = _load_resultfile(results_file)[0].outputs
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/utils.py", line 311, in load_resultfile
    if resolve and result.outputs:
AttributeError: 'NoneType' object has no attribute 'outputs'

https://circleci.com/gh/oesteban/smriprep/502

Related to #3009 (in that case, it occurred for a regular Node).

@oesteban oesteban self-assigned this Sep 4, 2019
@oesteban oesteban added the bug label Sep 4, 2019
oesteban added a commit to oesteban/smriprep that referenced this issue Sep 4, 2019
Created nipy/nipype#3014 to keep track of this one.
oesteban commented Sep 4, 2019

Cleaning up the cache solved the problem.

@satra - I believe this (and #3009) is an artifact of working-directory degradation, but I'm worried that it could have some other source, since it happens while _get_inputs is run. Gathering inputs from previous nodes should work, since their result files have already been checked (and should exist, in the case of #3009). In this case, with MapNodes, I think the outputs of the prior node got aggregated for some obscure reason.

Before I go ahead and start a refactor that could end up being a rat's nest, I wanted to touch base with you. Do you want to allocate some time to chat (cc/ @effigies)?

satra commented Sep 4, 2019

let's chat. i would like to understand whether this is a new bug introduced by a commit, or whether it comes from mixing versions of caches.

effigies commented Sep 5, 2019

I wonder if this is the same problem as #2694. IIRC, that was also impossible to reproduce because a clean working directory didn't exhibit the problem, and there was no clear path toward intentionally producing a broken one.

oesteban commented Sep 5, 2019

I believe that last month's refactor should have addressed that, precisely.

My impression is that nipype is just being overly noisy: it seems that the first time the cache is checked in the master thread (early in that check there is a call to _get_inputs()), the result file of some preceding node is not found yet. In the second pass of the cache check (once the node has already been submitted to the executor, so the check is done by, e.g., the child process or a new cluster node), _get_inputs() does find the result file.

This synchronization issue worries me a little, as I would expect the result files of preceding nodes to be available already at the first call to _get_inputs().

#2694 is related to the interpretation of the result file of the current node (vs. prior nodes, which is the case here).

satra commented Sep 5, 2019

@oesteban - this really depends on the file system and the executor.

  1. should never be an issue using multiproc/concurrentfuture, since the filesystem is local and any filesystem cache does not have to be written out to a server. if the issue is happening with local executors, it is likely a different reason.

  2. this is often the case in many job schedulers with NFS clients/servers, which is why there is a timeout configuration in nipype to explicitly deal with delayed dirty cache writes across compute nodes. a user can change this config option for such scenarios.
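
For reference, a minimal sketch of bumping that timeout from Python (assuming the option satra refers to is job_finished_timeout under the execution section; the value is in seconds):

    from nipype import config

    # Give NFS-backed work directories more time to flush result files
    # before a node is considered crashed (value in seconds).
    config.set('execution', 'job_finished_timeout', '65')

The same setting can also live in a nipype.cfg file or be applied per workflow through its config dictionary.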

oesteban commented Sep 5, 2019

Right, that's why I believe this is mostly noise and things aren't looking bad. However, not being able to come up with a clear explanation for what is happening bugs me a bit (mostly because of point 1 in your comment).

oesteban commented Sep 5, 2019

Please note that the errors happen within _get_inputs(), which means that the failing cache is not the result file of the current node, but that of some prior node. We will need to improve the cache-checking logic so that _get_inputs() is not called this early, and first check whether the hashfile and the result file are present. Right now, the cache check is as follows:

  1. Does current node's folder exist?
  2. Run _get_inputs() which propagates results from prior nodes connected to the inputs.
  3. Check the hashfile (requires the actual inputs to calculate hash)

A useful optimization that I'm sure would minimize some of these issues is a new step between steps 1 and 2 to check whether the result file and a hashfile-looking file exist. However, that optimization should happen after we clear up this weirdness, because on local filesystems _get_inputs() should not fail.
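
A rough sketch of the proposed ordering (illustrative pseudo-code, not the actual Node implementation; the result-file and hashfile names follow nipype's usual conventions but are shown only to make the idea concrete):

    from pathlib import Path

    def is_cached(node_dir, node_name):
        # 1. Does the current node's folder exist?
        node_dir = Path(node_dir)
        if not node_dir.exists():
            return False

        # Proposed step between 1 and 2: bail out early if there is no
        # result file or no hashfile-looking file, before touching any
        # upstream results.
        results_file = node_dir / ('result_%s.pklz' % node_name)
        hashfiles = list(node_dir.glob('_0x*.json'))
        if not results_file.exists() or not hashfiles:
            return False

        # 2. _get_inputs(): propagate results from prior nodes connected
        #    to the inputs (this is where the reported errors pop up).
        # 3. Recompute the hash from the actual inputs and compare it
        #    against the stored hashfile.
        ...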

oesteban added a commit to oesteban/nipype that referenced this issue Sep 5, 2019
oesteban added a commit to oesteban/nipype that referenced this issue Sep 6, 2019
oesteban added a commit to oesteban/nipype that referenced this issue Sep 10, 2019
Prevents nipy#3009 and nipy#3014 from happening - although this might not solve those issues,
this patch will help find their origin by making ``load_resultfile`` more strict (and
letting it raise exceptions).

The try .. except structure is moved to the only place it was being used within the Node code.
oesteban added a commit to oesteban/nipype that referenced this issue Sep 11, 2019
oesteban added a commit to oesteban/nipype that referenced this issue Sep 11, 2019
Generating the hashvalue at the cache-check stage, when outputs are not
ready because the node's directory does not exist (or no result file is
in there), leads to nipy#3014.

This PR preempts those problems by delaying the hashval calculation.
oesteban added a commit to oesteban/nipype that referenced this issue Sep 11, 2019
@oesteban oesteban added this to the 1.2.3 milestone Sep 11, 2019
@effigies

Is there a path forward for testing this, or is it mostly a "wait and see if this keeps cropping up after the next release" sort of issue?

oesteban commented Sep 12, 2019

I'm afraid we are in a "wait and see if this keeps cropping up after the next release" sort of situation. There might be a path forward, though, based on the fact that a handful of usual suspects keep falling for this, so it doesn't seem to be a general problem of all MapNodes.

However, this is harder to work around than #3009 because, for MapNodes, we really need to propagate the inputs from prior interfaces at this point in order to know how many subnodes will be spawned.
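
To make that constraint concrete, a toy example (the interface is arbitrary; the point is only that the number of subnodes equals the length of the iterfield input, which inside a workflow is only known after the upstream result file has been loaded):

    from nipype import MapNode
    from nipype.interfaces.utility import Function

    def double(x):
        return 2 * x

    mn = MapNode(
        Function(input_names=['x'], output_names=['y'], function=double),
        iterfield=['x'], name='double')

    # The scheduler cannot know how many subnodes to spawn until the
    # iterfield input has a concrete value.
    mn.inputs.x = [1, 2, 3]  # -> three subnodes: _double0, _double1, _double2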

@effigies effigies modified the milestones: 1.2.3, 1.3.0 Sep 12, 2019
@effigies

Okay. Just going to move it out of 1.2.3 then.

@stilley2

Hi all,
I'm seeing a problem that I think is related to this issue. If I use the multiproc plugin with commit 01de656, I get errors like the following:

     Error while checking node hash, forcing re-run. Although this error may not prevent the workflow from running, it could indicate a major problem. Please report a new issue at https://github.com/nipy/nipype/issues adding the following information:

    Node: report.page_sub-XXXX.reportlets
    Interface: nipype.interfaces.utility.base.Merge
    Traceback:
Traceback (most recent call last):

  File "/home/stilley2/lib/nipype/nipype/pipeline/plugins/base.py", line 337, in _local_hash_check
    cached, updated = self.procs[jobid].is_cached()

  File "/home/stilley2/lib/nipype/nipype/pipeline/engine/nodes.py", line 297, in is_cached
    hashed_inputs, hashvalue = self._get_hashval()

  File "/home/stilley2/lib/nipype/nipype/pipeline/engine/nodes.py", line 494, in _get_hashval
    self._get_inputs()

  File "/home/stilley2/lib/nipype/nipype/pipeline/engine/nodes.py", line 521, in _get_inputs
    outputs = _load_resultfile(results_file).outputs

  File "/home/stilley2/lib/nipype/nipype/pipeline/engine/utils.py", line 301, in load_resultfile
    result = loadpkl(results_file)

  File "/home/stilley2/lib/nipype/nipype/utils/filemanip.py", line 689, in loadpkl
    with pklopen(infile.name, 'rb') as pkl_file:

  File "/usr/lib64/python3.7/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)

  File "/usr/lib64/python3.7/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')

FileNotFoundError: [Errno 2] No such file or directory: 'result_compare1.pklz'

This happens pretty consistently. Also, occasionally the pipeline will freeze. If I update to the next commit (686aef9), things seem to work. However, I'm not sure whether that is just hiding the problem. I am doing everything on a local file system. If you all think this is unrelated, I can open a new issue.

Thanks!

@oesteban

Hi @stilley2, looking at your traceback, this totally looks like #3009. This one has been left open after #3026 was merged because it only affects MapNodes, as @mattcieslak reported in #3009 (comment).

However, this WARNING should not be fatal - your processing should have continued, and if it failed, that should have happened for some other reason. You are right that that commit basically prevents this WARNING from showing up in most cases, by checking the work directory more carefully when the node looks cached.

Could you try to isolate the condition that causes the pipeline to freeze? How did you check that it was blocked?

@stilley2

I just did some digging. The pipeline hangs while acquiring the lockfile in filemanip.loadpkl. I assume I don't see the error in later commits because loadpkl isn't called as frequently, or something like that. I think adding a timeout to the lock is a good idea, though. Out of curiosity, how is it that we try to load result files before they're ready when checking the node hash, but don't run into this problem when we actually run the nodes?
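
For illustration only, a sketch of what a bounded wait on that lock could look like with the filelock package (the lockfile name and the surrounding unpickling code are assumptions about what loadpkl does; Timeout is the exception filelock raises when the wait expires):

    from filelock import SoftFileLock, Timeout

    lock = SoftFileLock(str(infile) + '.lock')
    try:
        # Fail loudly after 60 seconds instead of hanging the pipeline.
        with lock.acquire(timeout=60):
            with open(infile, 'rb') as pkl_file:
                ...  # unpickle as loadpkl normally would
    except Timeout:
        raise IOError('Could not acquire lock on %s after 60s' % infile)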

satra commented Sep 24, 2019

@stilley2 - thanks for this. i am pretty positive i introduced an unwanted race condition via the locking mechanism. i'll take a look later today.

@stilley2

I think I found what might be the cause of a lot of these issues. In loadpkl the directory is changed via the indirectory context manager:

    with indirectory(infile.parent):

However, between this line and any actual file loading (or, presumably, lock activity), the current directory can be changed again by the _async_callback function in multiproc.py:

    def _async_callback(self, args):

This can lead to the warning about result files not being found, and might be responsible for the lockfile issues as well.

@stilley2

If anyone wants to prove this to themselves, just add the following lines right after the indirectory context manager

    import os  # already imported in filemanip.py; included for completeness
    from time import sleep

    sleep(10)  # give another thread a chance to chdir away in the meantime
    if os.getcwd() != str(infile.parent):  # compare like types; infile.parent is a Path
        raise RuntimeError('directory switched')

and run a pipeline with the MultiProc plugin.
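
Outside of nipype, the underlying hazard can be shown in a few self-contained lines (a toy, not nipype code): the working directory is global to the process, so a chdir on one thread is visible to every other thread.

    import os
    import tempfile
    import threading
    import time

    def reader(target_dir):
        # Emulates loadpkl: enter a directory, then do some "slow" work.
        os.chdir(target_dir)
        time.sleep(0.5)
        print('reader expected to be in:', target_dir)
        print('process cwd actually is :', os.getcwd())

    def callback(other_dir):
        # Emulates _async_callback: switches the process-wide cwd.
        time.sleep(0.1)
        os.chdir(other_dir)

    dir_a, dir_b = tempfile.mkdtemp(), tempfile.mkdtemp()
    t1 = threading.Thread(target=reader, args=(dir_a,))
    t2 = threading.Thread(target=callback, args=(dir_b,))
    t1.start(); t2.start(); t1.join(); t2.join()
    # The reader ends up reporting dir_b even though it never left dir_a itself.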


oesteban added a commit to oesteban/nipype that referenced this issue Oct 11, 2019
…g inputs

This PR attempts to alleviate nipy#3014 by opening the result file of a
source node only once when that node feeds into several inputs of the
node collecting inputs.
Before these changes, a call to ``_load_results`` was issued for every
input field that needed to collect its inputs from a past node.
Now, all the inputs coming from the same node are put together and the
``_load_results`` function is called just once.

The PR also modifies the way the ``AttributeError``s (nipy#3014) are handled,
to make it easier to spot whether an error raised while loading results
occurred when gathering the inputs of a node-to-be-run or elsewhere.
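
Conceptually, the change boils down to grouping the connections by their source node before touching the filesystem; a rough sketch (the variable names and the ``_load_results`` call are illustrative, not the exact Node code):

    from collections import defaultdict

    # connections: (source_node, source_output, target_input) per edge
    by_source = defaultdict(list)
    for src, src_out, dst_in in connections:
        by_source[src].append((src_out, dst_in))

    for src, fields in by_source.items():
        results = _load_results(src)  # result file opened once per source node
        for src_out, dst_in in fields:
            setattr(node.inputs, dst_in, getattr(results.outputs, src_out))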

oesteban commented Nov 2, 2019

Interestingly, it seems MultiProc checks on mapnodes twice:

191101-16:56:56,106 nipype.workflow INFO:
         [Node] Setting-up "_t1w_conform_xfm0" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm0".
191101-16:56:56,105 nipype.utils INFO:
         Running nipype version 1.3.0-rc1.post-dev+g2e5436d31 (latest: 1.2.3)
191101-16:56:56,108 nipype.workflow INFO:
         [Node] Running "_t1w_conform_xfm0" ("niworkflows.interfaces.freesurfer.PatchedLTAConvert"), a CommandLine Interface with command:
lta_convert --inlta identity.nofile --outlta /home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm0/out.lta --src /mnt/data/datasets/THP002/sub-THP0005/ses-THP0005JHU1/anat/sub-THP0005_ses-THP0005JHU1_run-02_T1w.nii.gz --trg /home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform/mapflow/_t1w_conform0/sub-THP0005_ses-THP0005JHU1_run-02_T1w_ras.nii.gz
191101-16:56:56,159 nipype.utils INFO:
         Running nipype version 1.3.0-rc1.post-dev+g2e5436d31 (latest: 1.2.3)
191101-16:56:56,236 nipype.workflow INFO:
         [Node] Finished "_t1w_conform_xfm0".
191101-16:56:56,266 nipype.workflow INFO:
         [Node] Setting-up "_t1w_conform_xfm1" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm1".
191101-16:56:56,269 nipype.workflow INFO:
         [Node] Running "_t1w_conform_xfm1" ("niworkflows.interfaces.freesurfer.PatchedLTAConvert"), a CommandLine Interface with command:
lta_convert --inlta identity.nofile --outlta /home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm1/out.lta --src /mnt/data/datasets/THP002/sub-THP0005/ses-THP0005MGH1/anat/sub-THP0005_ses-THP0005MGH1_T1w.nii.gz --trg /home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform/mapflow/_t1w_conform1/sub-THP0005_ses-THP0005MGH1_T1w_ras.nii.gz
191101-16:56:56,409 nipype.workflow INFO:
         [Node] Finished "_t1w_conform_xfm1".
191101-16:56:57,337 nipype.workflow INFO:
         [Node] Setting-up "dmriprep_wf.single_subject_THP0005_wf.anat_preproc_wf.anat_template_wf.t1w_conform_xfm" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm".
191101-16:56:57,340 nipype.workflow INFO:
         [Node] Setting-up "_t1w_conform_xfm0" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm0".
191101-16:56:57,341 nipype.workflow INFO:
         [Node] Cached "_t1w_conform_xfm0" - collecting precomputed outputs
191101-16:56:57,341 nipype.workflow INFO:
         [Node] "_t1w_conform_xfm0" found cached.
191101-16:56:57,342 nipype.workflow INFO:
         [Node] Setting-up "_t1w_conform_xfm1" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm1".
191101-16:56:57,342 nipype.workflow INFO:
         [Node] Cached "_t1w_conform_xfm1" - collecting precomputed outputs
191101-16:56:57,343 nipype.workflow INFO:
         [Node] "_t1w_conform_xfm1" found cached.
191101-16:56:57,344 nipype.workflow INFO:
         [Node] Finished "dmriprep_wf.single_subject_THP0005_wf.anat_preproc_wf.anat_template_wf.t1w_conform_xfm".

You'll first see one of these:

191101-16:56:56,266 nipype.workflow INFO:
         [Node] Setting-up "_t1w_conform_xfm1" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm1".

and about a second later:

191101-16:56:57,342 nipype.workflow INFO:
         [Node] Setting-up "_t1w_conform_xfm1" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm1".

This is happening running sMRIPrep from dMRIPrep on a local installation of nipype on the branch of #3075.

This is on a fresh work directory. If I reuse the old workdir, then MultiProc preempts running it again (it is checked only once).

satra commented Nov 2, 2019

mapnodes have always done that for any distributed plugin. they first insert all the subnodes, then insert the parent node as a dependency of these subnodes. when the parent node re-executes, it checks all the subnodes once more, but at this point they shouldn't run because they have been cached.
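
In pseudo-code, the scheduling described above amounts to something like this (illustrative only, not the actual plugin code):

    # When a MapNode is submitted under a distributed plugin:
    subnodes = mapnode.expand()       # _name0, _name1, ... one per iterfield entry
    for sub in subnodes:
        graph.add_node(sub)           # each subnode is scheduled independently
        graph.add_edge(sub, mapnode)  # the parent depends on all of its subnodes

    # When the parent MapNode finally runs, it revisits every subnode;
    # by then each one is already cached, so it only collects outputs.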

oesteban commented Nov 2, 2019

Okay, that's reassuring in the sense that we did not introduce this during a refactor - but my expectation about having found a potential lead is now gone :P

@satra satra mentioned this issue Nov 5, 2019
satra commented Nov 7, 2019

@oesteban - do you know if #3089 resolved this?
