
Error checking MapNode hash with fMRIPrep #3014

Closed
oesteban opened this issue Sep 4, 2019 · 22 comments · Fixed by #3089

oesteban commented Sep 4, 2019

Summary

190904-01:20:18,695 nipype.workflow ERROR:
	 Node t1_conform failed to run on host 890949213e2e.
190904-01:20:18,696 nipype.workflow ERROR:
	 Saving crash info to /tmp/ds054/derivatives/smriprep/sub-100185/log/20190904-011955_7143d03c-6f71-442e-b03b-868465d5d7f5/crash-20190904-012018-UID1001-t1_conform-79d9866d-bc88-4848-9512-c54ff701db1b.txt
Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/multiproc.py", line 269, in _send_procs_to_workers
    num_subnodes = self.procs[jobid].num_subnodes()
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 1204, in num_subnodes
    self._get_inputs()
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 1219, in _get_inputs
    super(MapNode, self)._get_inputs()
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 521, in _get_inputs
    outputs = _load_resultfile(results_file)[0].outputs
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/utils.py", line 311, in load_resultfile
    if resolve and result.outputs:
AttributeError: 'NoneType' object has no attribute 'outputs'

https://circleci.com/gh/oesteban/smriprep/502

Related to #3009 (in that case, it occurred for a regular Node).

@oesteban oesteban self-assigned this Sep 4, 2019
@oesteban oesteban added the bug label Sep 4, 2019
oesteban added a commit to oesteban/smriprep that referenced this issue Sep 4, 2019
Created nipy/nipype#3014 to keep track of this one.
oesteban commented Sep 4, 2019

Cleaning up the cache solved the problem.

@satra - I believe this (and #3009) is an artifact of working-directory degradation, but I'm worried that it could have some other source, since it happens while _get_inputs is run. Gathering inputs from previous nodes should work, since their result files have already been checked (and should exist, in the case of #3009). In this case, with MapNodes, I think the outputs of the prior node got aggregated for some obscure reason.

Before I go ahead and start a refactor that could end up being a rat's nest, I wanted to touch base with you. Do you want to allocate some time to chat (cc/ @effigies)?

satra commented Sep 4, 2019

let's chat. i would like to understand whether this is a new bug introduced by a commit, or whether it comes from mixing versions of caches.

effigies commented Sep 5, 2019

I wonder if this is the same problem as #2694. IIRC, that was also impossible to reproduce because a clean working directory didn't exhibit the problem, and there was no clear path toward intentionally producing a broken one.

oesteban commented Sep 5, 2019

I believe that last month's refactor should have addressed that, precisely.

My impression is that nipype is just being overly noisy: it seems that the first time the cache is checked in the master thread (early in that check there is a call to _get_inputs()), the result file of some preceding node is not found yet. In the second pass of the cache check (once the node has already been submitted to the executor, so the check is done by, e.g., the child process or a new cluster node), _get_inputs() does find the result file.

This synchronization issue worries me a little, as I would expect the result files of preceding nodes to be available already at the first call to _get_inputs().

#2694 is related to the interpretation of the result file of the current node (vs. prior nodes, which is the case here).

satra commented Sep 5, 2019

@oesteban - this really depends on the file system and the executor.

  1. should never be an issue using multiproc/concurrentfuture, since the filesystem is local and any filesystem cache does not have to be written out to a server. if the issue is happening with local executors, it is likely a different reason.

  2. this is often the case in many job schedulers with NFS clients/servers, which is why there is a timeout configuration in nipype to explicitly deal with delayed dirty cache writes across compute nodes. a user can change this config option for such scenarios.
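
For reference, a minimal sketch of bumping that timeout from Python (assuming the option satra refers to is job_finished_timeout under the execution section; the value is in seconds):

    from nipype import config

    # Give NFS-backed work directories more time to flush result files
    # before a node is considered crashed (value in seconds).
    config.set('execution', 'job_finished_timeout', '65')

The same setting can also live in a nipype.cfg file or be applied per workflow through its config dictionary.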

oesteban commented Sep 5, 2019

Right, that's why I believe this is mostly noise and things aren't looking bad. However, not being able to come up with a clear explanation for what is happening bugs me a bit (mostly because of point 1 in your comment).

oesteban commented Sep 5, 2019

Please note that the errors happen within _get_inputs(), which means that the failing cache is not the result file of the current node, but that of some prior node. We will need to improve the cache-checking logic so that _get_inputs() is not called this early, and first check whether the hashfile and the result file are present. Right now, the cache check is as follows:

  1. Does current node's folder exist?
  2. Run _get_inputs() which propagates results from prior nodes connected to the inputs.
  3. Check the hashfile (requires the actual inputs to calculate hash)

A useful optimization that I'm sure would minimize some of these issues is a new step between steps 1 and 2 to check whether the result file and a hashfile-looking file exist. However, that optimization should happen after we clear up this weirdness, because on local filesystems _get_inputs() should not fail.
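
A rough sketch of the proposed ordering (illustrative pseudo-code, not the actual Node implementation; the result-file and hashfile names follow nipype's usual conventions but are shown only to make the idea concrete):

    from pathlib import Path

    def is_cached(node_dir, node_name):
        # 1. Does the current node's folder exist?
        node_dir = Path(node_dir)
        if not node_dir.exists():
            return False

        # Proposed step between 1 and 2: bail out early if there is no
        # result file or no hashfile-looking file, before touching any
        # upstream results.
        results_file = node_dir / ('result_%s.pklz' % node_name)
        hashfiles = list(node_dir.glob('_0x*.json'))
        if not results_file.exists() or not hashfiles:
            return False

        # 2. _get_inputs(): propagate results from prior nodes connected
        #    to the inputs (this is where the reported errors pop up).
        # 3. Recompute the hash from the actual inputs and compare it
        #    against the stored hashfile.
        ...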

oesteban added a commit to oesteban/nipype that referenced this issue Sep 5, 2019
oesteban added a commit to oesteban/nipype that referenced this issue Sep 6, 2019
oesteban added a commit to oesteban/nipype that referenced this issue Sep 10, 2019
Prevents nipy#3009 and nipy#3014 from happening - although this might not solve those issues,
this patch will help find their origin by making ``load_resultfile`` more strict (and
letting it raise exceptions).

The try .. except structure is moved to the only place it was being used within the Node code.
oesteban added a commit to oesteban/nipype that referenced this issue Sep 11, 2019
oesteban added a commit to oesteban/nipype that referenced this issue Sep 11, 2019
Generating the hashvalue at the cache-check stage, when outputs are not
ready because the node's directory does not exist (or no result file is
in there), leads to nipy#3014.

This PR preempts those problems by delaying the hashval calculation.
oesteban added a commit to oesteban/nipype that referenced this issue Sep 11, 2019
@oesteban oesteban added this to the 1.2.3 milestone Sep 11, 2019
@effigies

Is there a path forward for testing this, or is it mostly a "wait and see if this keeps cropping up after the next release" sort of issue?

oesteban commented Sep 12, 2019

I'm afraid we are in a "wait and see if this keeps cropping up after the next release" sort of situation. There might be a path forward, though, based on the fact that a handful of usual suspects keep falling for this, so it doesn't seem to be a general problem of all MapNodes.

However, this is harder to work around than #3009 because, for MapNodes, we really need to propagate the inputs from prior interfaces at this point in order to know how many subnodes will be spawned.
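
To make that constraint concrete, a toy example (the interface is arbitrary; the point is only that the number of subnodes equals the length of the iterfield input, which inside a workflow is only known after the upstream result file has been loaded):

    from nipype import MapNode
    from nipype.interfaces.utility import Function

    def double(x):
        return 2 * x

    mn = MapNode(
        Function(input_names=['x'], output_names=['y'], function=double),
        iterfield=['x'], name='double')

    # The scheduler cannot know how many subnodes to spawn until the
    # iterfield input has a concrete value.
    mn.inputs.x = [1, 2, 3]  # -> three subnodes: _double0, _double1, _double2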

@effigies effigies modified the milestones: 1.2.3, 1.3.0 Sep 12, 2019
@effigies

Okay. Just going to move it out of 1.2.3 then.

@stilley2

Hi all,
I'm seeing a problem that I think is related to this issue. If I use the multiproc plugin with commit 01de656, I get errors like the following:

     Error while checking node hash, forcing re-run. Although this error may not prevent the workflow from running, it could indicate a major problem. Please report a new issue at https://github.com/nipy/nipype/issues adding the following information:

    Node: report.page_sub-XXXX.reportlets
    Interface: nipype.interfaces.utility.base.Merge
    Traceback:
Traceback (most recent call last):

  File "/home/stilley2/lib/nipype/nipype/pipeline/plugins/base.py", line 337, in _local_hash_check
    cached, updated = self.procs[jobid].is_cached()

  File "/home/stilley2/lib/nipype/nipype/pipeline/engine/nodes.py", line 297, in is_cached
    hashed_inputs, hashvalue = self._get_hashval()

  File "/home/stilley2/lib/nipype/nipype/pipeline/engine/nodes.py", line 494, in _get_hashval
    self._get_inputs()

  File "/home/stilley2/lib/nipype/nipype/pipeline/engine/nodes.py", line 521, in _get_inputs
    outputs = _load_resultfile(results_file).outputs

  File "/home/stilley2/lib/nipype/nipype/pipeline/engine/utils.py", line 301, in load_resultfile
    result = loadpkl(results_file)

  File "/home/stilley2/lib/nipype/nipype/utils/filemanip.py", line 689, in loadpkl
    with pklopen(infile.name, 'rb') as pkl_file:

  File "/usr/lib64/python3.7/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)

  File "/usr/lib64/python3.7/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')

FileNotFoundError: [Errno 2] No such file or directory: 'result_compare1.pklz'

This happens pretty consistently. Also, occasionally the pipeline will freeze. If I update to the next commit (686aef9), things seem to work. However, I'm not sure whether that is just hiding the problem. I am doing everything on a local file system. If you all think this is unrelated, I can open a new issue.

Thanks!

@oesteban

Hi @stilley2, looking at your traceback, this totally looks like #3009. This one has been left open after #3026 was merged because it only affects MapNodes, as @mattcieslak reported in #3009 (comment).

However, this WARNING should not be fatal - your processing should have continued, and if it failed, that should have happened for some other reason. You are right that that commit basically prevents this WARNING from showing up in most cases, by checking the work directory more carefully when the node looks cached.

Could you try to isolate the condition that causes the pipeline to freeze? How did you check that it was blocked?

@stilley2

I just did some digging. The pipeline hangs while acquiring the lockfile in filemanip.loadpkl. I assume I don't see the error in later commits because loadpkl isn't called as frequently, or something like that. I think adding a timeout to the lock is a good idea, though. Out of curiosity, how is it that we try to load result files before they're ready when checking the node hash, but don't run into this problem when we actually run the nodes?
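
For illustration only, a sketch of what a bounded wait on that lock could look like with the filelock package (the lockfile name and the surrounding unpickling code are assumptions about what loadpkl does; Timeout is the exception filelock raises when the wait expires):

    from filelock import SoftFileLock, Timeout

    lock = SoftFileLock(str(infile) + '.lock')
    try:
        # Fail loudly after 60 seconds instead of hanging the pipeline.
        with lock.acquire(timeout=60):
            with open(infile, 'rb') as pkl_file:
                ...  # unpickle as loadpkl normally would
    except Timeout:
        raise IOError('Could not acquire lock on %s after 60s' % infile)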

satra commented Sep 24, 2019

@stilley2 - thanks for this. i am pretty positive i introduced an unwanted race condition via the locking mechanism. i'll take a look later today.

@stilley2

I think I found what might be the cause of a lot of these issues. In loadpkl the directory is changed via the indirectory context manager:

    with indirectory(infile.parent):

However, between this line and any actual file loading (or, presumably, lock activity), the current directory can be changed again by the _async_callback function in multiproc.py:

    def _async_callback(self, args):

This can lead to the warning about result files not being found, and might be responsible for the lockfile issues as well.

@stilley2

If anyone wants to prove this to themselves, just add the following lines right after the indirectory context manager

    import os  # already imported in filemanip.py; included for completeness
    from time import sleep

    sleep(10)  # give another thread a chance to chdir away in the meantime
    if os.getcwd() != str(infile.parent):  # compare like types; infile.parent is a Path
        raise RuntimeError('directory switched')

and run a pipeline with the MultiProc plugin.
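
Outside of nipype, the underlying hazard can be shown in a few self-contained lines (a toy, not nipype code): the working directory is global to the process, so a chdir on one thread is visible to every other thread.

    import os
    import tempfile
    import threading
    import time

    def reader(target_dir):
        # Emulates loadpkl: enter a directory, then do some "slow" work.
        os.chdir(target_dir)
        time.sleep(0.5)
        print('reader expected to be in:', target_dir)
        print('process cwd actually is :', os.getcwd())

    def callback(other_dir):
        # Emulates _async_callback: switches the process-wide cwd.
        time.sleep(0.1)
        os.chdir(other_dir)

    dir_a, dir_b = tempfile.mkdtemp(), tempfile.mkdtemp()
    t1 = threading.Thread(target=reader, args=(dir_a,))
    t2 = threading.Thread(target=callback, args=(dir_b,))
    t1.start(); t2.start(); t1.join(); t2.join()
    # The reader ends up reporting dir_b even though it never left dir_a itself.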


oesteban added a commit to oesteban/nipype that referenced this issue Oct 11, 2019
…g inputs

This PR attempts to alleviate nipy#3014 by opening the result file of a
source node only once when that node feeds into several inputs of the
node collecting inputs.
Before these changes, a call to ``_load_results`` was issued for every
input field that needed to collect its inputs from a past node.
Now, all the inputs coming from the same node are put together and the
``_load_results`` function is called just once.

The PR also modifies the way the ``AttributeError``s (nipy#3014) are handled,
to make it easier to spot whether an error raised while loading results
occurred when gathering the inputs of a node-to-be-run or elsewhere.
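
Conceptually, the change boils down to grouping the connections by their source node before touching the filesystem; a rough sketch (the variable names and the ``_load_results`` call are illustrative, not the exact Node code):

    from collections import defaultdict

    # connections: (source_node, source_output, target_input) per edge
    by_source = defaultdict(list)
    for src, src_out, dst_in in connections:
        by_source[src].append((src_out, dst_in))

    for src, fields in by_source.items():
        results = _load_results(src)  # result file opened once per source node
        for src_out, dst_in in fields:
            setattr(node.inputs, dst_in, getattr(results.outputs, src_out))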

oesteban commented Nov 2, 2019

Interestingly, it seems MultiProc checks on mapnodes twice:

191101-16:56:56,106 nipype.workflow INFO:
         [Node] Setting-up "_t1w_conform_xfm0" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm0".
191101-16:56:56,105 nipype.utils INFO:
         Running nipype version 1.3.0-rc1.post-dev+g2e5436d31 (latest: 1.2.3)
191101-16:56:56,108 nipype.workflow INFO:
         [Node] Running "_t1w_conform_xfm0" ("niworkflows.interfaces.freesurfer.PatchedLTAConvert"), a CommandLine Interface with command:
lta_convert --inlta identity.nofile --outlta /home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm0/out.lta --src /mnt/data/datasets/THP002/sub-THP0005/ses-THP0005JHU1/anat/sub-THP0005_ses-THP0005JHU1_run-02_T1w.nii.gz --trg /home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform/mapflow/_t1w_conform0/sub-THP0005_ses-THP0005JHU1_run-02_T1w_ras.nii.gz
191101-16:56:56,159 nipype.utils INFO:
         Running nipype version 1.3.0-rc1.post-dev+g2e5436d31 (latest: 1.2.3)
191101-16:56:56,236 nipype.workflow INFO:
         [Node] Finished "_t1w_conform_xfm0".
191101-16:56:56,266 nipype.workflow INFO:
         [Node] Setting-up "_t1w_conform_xfm1" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm1".
191101-16:56:56,269 nipype.workflow INFO:
         [Node] Running "_t1w_conform_xfm1" ("niworkflows.interfaces.freesurfer.PatchedLTAConvert"), a CommandLine Interface with command:
lta_convert --inlta identity.nofile --outlta /home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm1/out.lta --src /mnt/data/datasets/THP002/sub-THP0005/ses-THP0005MGH1/anat/sub-THP0005_ses-THP0005MGH1_T1w.nii.gz --trg /home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform/mapflow/_t1w_conform1/sub-THP0005_ses-THP0005MGH1_T1w_ras.nii.gz
191101-16:56:56,409 nipype.workflow INFO:
         [Node] Finished "_t1w_conform_xfm1".
191101-16:56:57,337 nipype.workflow INFO:
         [Node] Setting-up "dmriprep_wf.single_subject_THP0005_wf.anat_preproc_wf.anat_template_wf.t1w_conform_xfm" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm".
191101-16:56:57,340 nipype.workflow INFO:
         [Node] Setting-up "_t1w_conform_xfm0" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm0".
191101-16:56:57,341 nipype.workflow INFO:
         [Node] Cached "_t1w_conform_xfm0" - collecting precomputed outputs
191101-16:56:57,341 nipype.workflow INFO:
         [Node] "_t1w_conform_xfm0" found cached.
191101-16:56:57,342 nipype.workflow INFO:
         [Node] Setting-up "_t1w_conform_xfm1" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm1".
191101-16:56:57,342 nipype.workflow INFO:
         [Node] Cached "_t1w_conform_xfm1" - collecting precomputed outputs
191101-16:56:57,343 nipype.workflow INFO:
         [Node] "_t1w_conform_xfm1" found cached.
191101-16:56:57,344 nipype.workflow INFO:
         [Node] Finished "dmriprep_wf.single_subject_THP0005_wf.anat_preproc_wf.anat_template_wf.t1w_conform_xfm".

You'll first see one of these:

191101-16:56:56,266 nipype.workflow INFO:
         [Node] Setting-up "_t1w_conform_xfm1" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm1".

and about a second later:

191101-16:56:57,342 nipype.workflow INFO:
         [Node] Setting-up "_t1w_conform_xfm1" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm1".

This is happening running sMRIPrep from dMRIPrep on a local installation of nipype on the branch of #3075.

This is on a fresh work directory. If I reuse the old workdir, then MultiProc preempts running it again (it is checked only once).

satra commented Nov 2, 2019

mapnodes have always done that for any distributed plugin. they first insert all the subnodes, then insert the parent node as a dependency of these subnodes. when the parent node re-executes, it checks all the subnodes once more, but at this point they shouldn't run because they have been cached.
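
In pseudo-code, the scheduling described above amounts to something like this (illustrative only, not the actual plugin code):

    # When a MapNode is submitted under a distributed plugin:
    subnodes = mapnode.expand()       # _name0, _name1, ... one per iterfield entry
    for sub in subnodes:
        graph.add_node(sub)           # each subnode is scheduled independently
        graph.add_edge(sub, mapnode)  # the parent depends on all of its subnodes

    # When the parent MapNode finally runs, it revisits every subnode;
    # by then each one is already cached, so it only collects outputs.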

oesteban commented Nov 2, 2019

Okay, that's reassuring in the sense that we did not introduce this during a refactor - but my expectation about having found a potential lead is now gone :P

@satra satra mentioned this issue Nov 5, 2019
satra commented Nov 7, 2019

@oesteban - do you know if #3089 resolved this?
