Error checking MapNode hash with fMRIPrep #3014
Created nipy/nipype#3014 to keep track of this one.
Cleaning up the cache solved the problem. @satra - I believe this (and #3009) are artifacts of working directory degradation, but I'm worried that they could have some other source given the circumstances under which they happen. Before I go ahead and start a refactor that could end up being a rats' nest, we wanted to touch base with you. Do you want to allocate some time to chat (cc/ @effigies)?
let's chat. i would like to understand if this is something that is a new bug that was created by a commit, or is because of mixing versions of caches.
I wonder if this is the same problem as #2694. IIRC, that was also impossible to reproduce because a clean working directory didn't exhibit the problem, and there was no clear path toward intentionally producing a broken one.
I believe that last month's refactor should have addressed that, precisely. My impression is that nipype is overly noisy: the first time the cache is checked, an early call is made before the result files of preceding nodes are in place. This synchronization issue worries me a little, as I would think those result files should already be available by that first call. #2694 is related to the interpretation of the result file of the current node (vs. prior nodes, which is the case here).
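For readers skimming the thread, the check under discussion boils down to something like the sketch below. This is not nipype's actual code: the hashfile naming convention and the helper names are assumptions.

```python
# A very rough sketch, not nipype's implementation; the '_0x<md5>.json'
# hashfile naming and the helper names are assumptions.
import hashlib
import json
import os

def compute_inputs_hash(inputs):
    """Hash the (already propagated) inputs; upstream result files must have
    been readable for these values to be correct."""
    payload = json.dumps(inputs, sort_keys=True).encode()
    return hashlib.md5(payload).hexdigest()

def node_appears_cached(node_dir, inputs_hash):
    """The node counts as cached only if a matching hashfile is on disk."""
    return os.path.exists(os.path.join(node_dir, '_0x%s.json' % inputs_hash))

# If upstream results were not readable yet, 'in_file' would hold a stale or
# undefined value here, and the hash (hence the cache decision) would be off.
print(node_appears_cached('/tmp/work/mynode', compute_inputs_hash({'in_file': 'a.nii'})))
```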
@oesteban - this really depends on the file system and the executor.
Right, that's why I believe this is mostly noise and things aren't looking bad. However, not being able to come up with a clear explanation for what is happening bugs me a little (mostly because of point 1 in your comment).
Please note that the errors happen within
A useful optimization that I'm sure would minimize some of these issues is a new step between steps 1 and 2 to check whether both the result file and a hashfile-looking file exist. However, that optimization should happen after we clear up this weirdness, because in local filesystems
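The shortcut could look roughly like this (a sketch only; the ``result_<name>.pklz`` and ``_0x*.json`` patterns are assumptions about the on-disk layout rather than guaranteed names):

```python
from pathlib import Path

def looks_cached(node_dir, node_name):
    """Cheap existence check before any hash computation or unpickling."""
    d = Path(node_dir)
    if not d.is_dir():
        return False
    has_result = (d / ('result_%s.pklz' % node_name)).exists()  # assumed name
    has_hashfile = any(d.glob('_0x*.json'))                     # assumed pattern
    return has_result and has_hashfile
```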
Prevents nipy#3009 and nipy#3014 from happening - although this might not solve those issues, this patch will help find their origin by making ``load_resultfile`` stricter (and letting it raise exceptions). The try .. except structure is moved to the only place it was being used within the Node code.
Generating the hashvalue at the cache-check stage, when outputs are not ready because the node's directory does not exist (or no result file is in there), leads to nipy#3014. This PR preempts those problems by delaying the hashval calculation.
Is there a path forward for testing this, or is it mostly a "wait and see if this keeps cropping up after the next release" sort of issue?
I'm afraid we are in a "wait and see if this keeps cropping up after the next release" situation. There might be a path forward based on the fact that a handful of usual suspects keep falling for this, so it doesn't seem to be a general problem of all MapNodes. However, this is harder to work around than #3009, because for MapNodes we really need to propagate the inputs from prior interfaces at this point to know how many nodes are spawned.
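To make the "how many nodes are spawned" point concrete, here is a toy example (made-up names, not from fMRIPrep): a MapNode expands into one subnode per element of its iterfield input, so that input must be resolved, typically from an upstream node's result file, before the expansion and the per-subnode hash checks can happen.

```python
# Toy MapNode: the number of spawned subnodes equals the length of the
# iterfield input. With upstream-fed inputs that length is unknown until the
# upstream result file has been loaded.
from nipype.interfaces.utility import Function
from nipype.pipeline.engine import MapNode

def double(x):
    return 2 * x

doubler = MapNode(
    Function(input_names=['x'], output_names=['out'], function=double),
    iterfield=['x'],
    name='doubler',
)
doubler.inputs.x = [1, 2, 3]   # three subnodes will be spawned, one per element

res = doubler.run()
print(res.outputs.out)         # [2, 4, 6]
```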
Okay. Just going to move it out of 1.2.3 then.
Hi all,
This happens pretty consistently. Also, occasionally the pipeline will freeze. If I update to the next commit (686aef9) things seem to work. However, I'm not sure if that is just hiding the problem. I am doing everything on a local file system. If you all think this is unrelated I can open a new issue. Thanks!
Hi @stilley2, looking at your traceback, this totally looks like #3009. This one has been left open after #3026 was merged because it only affects MapNodes, as @mattcieslak reported in #3009 (comment). However, this WARNING should not be fatal - your processing should have continued, and if it failed, that should have happened for some other reason. You are right that that commit basically prevents this WARNING from showing up in most cases by checking the work directory more carefully when the node looks cached. Could you try to isolate the condition that causes the pipeline to freeze? How did you check that it was blocked?
I just did some digging. The pipeline hangs at getting the lockfile in filemanip.loadpkl. I assume I don't see the error in later commits because loadpkl isn't called as frequently, or something along those lines. I think adding a timeout to the lock is a good idea though. Out of curiosity, how is it that we're trying to load result files before they're ready when checking the node hash, but not running into this problem when we actually run the nodes?
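The timeout idea would be something along these lines (just a sketch using the filelock package; the lock-file path and the 60-second limit are arbitrary choices, and this is not the actual loadpkl code):

```python
# Sketch only: acquire the result-file lock with a timeout instead of
# blocking forever. Lock path and timeout are illustrative.
from filelock import SoftFileLock, Timeout

lock = SoftFileLock('result_node.pklz.lock')
try:
    with lock.acquire(timeout=60):
        pass  # load the pickled result file here
except Timeout:
    raise RuntimeError('could not acquire the result-file lock within 60s')
```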
@stilley2 - thanks for this. i am pretty positive i introduced an unwanted race condition via the locking mechanism. i'll take a look later today.
I think I found what might be the cause of a lot of these issues. In loadpkl the directory is changed via the indirectory context manager (nipype/nipype/utils/filemanip.py, line 710 at 56a0335); see also nipype/nipype/pipeline/plugins/multiproc.py, line 148 at 56a0335.
If anyone wants to prove this to themselves, just add the following lines right after the indirectory context manager and run a pipeline with the MultiProc plugin:

import os
from time import sleep

sleep(10)
if os.getcwd() != str(infile.parent):
    raise RuntimeError('directory switched')
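The underlying hazard can also be shown outside nipype; the standalone snippet below uses a thread to stand in for whatever concurrently changes the directory, and a simplified stand-in for the indirectory context manager (paths are throwaway temp directories):

```python
# The working directory is per-process state, so any concurrent actor in the
# same process can change it out from under a chdir-based context manager.
import os
import tempfile
import threading
import time
from contextlib import contextmanager

@contextmanager
def indirectory(path):
    """Simplified stand-in for nipype's indirectory context manager."""
    prev = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(prev)

workdir = os.path.realpath(tempfile.mkdtemp())
elsewhere = os.path.realpath(tempfile.mkdtemp())

def intruder():
    time.sleep(0.5)
    os.chdir(elsewhere)   # concurrent actor switches the shared cwd

threading.Thread(target=intruder).start()
with indirectory(workdir):
    time.sleep(1)
    if os.getcwd() != workdir:
        print('directory switched')  # relative paths would now resolve wrongly
```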
To keep close: https://neurostars.org/t/fmriprep-1-5-0-errors/5122/5
…g inputs

This PR attempts to alleviate nipy#3014 by opening the result file of a source node only once when that node feeds into several inputs of the node collecting inputs. Before these changes, a call to ``_load_results`` was issued for every input field that needed to collect its value from a past node. Now, all the inputs coming from the same node are grouped together and ``_load_results`` is called just once. The PR also modifies how the ``AttributeError``s (nipy#3014) are handled, to make it easier to spot whether an error while loading results arises when gathering the inputs of a node-to-be-run or elsewhere.
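The grouping can be pictured with a toy example (the data, connection tuples, and the ``load_results`` helper below are made up for illustration; only the grouping pattern mirrors the PR's idea):

```python
# Load each upstream result once per source node rather than once per field.
from collections import defaultdict

fake_results = {  # pretend per-node outputs that would live in result files
    'nodeA': {'out_file': 'a.nii.gz', 'mask': 'a_mask.nii.gz'},
    'nodeB': {'params': 'b_params.txt'},
}
load_count = 0

def load_results(source):  # stand-in for ``_load_results``
    global load_count
    load_count += 1
    return fake_results[source]

# Each connection: (source node, source field, destination field).
connections = [
    ('nodeA', 'out_file', 'in_file'),
    ('nodeA', 'mask', 'mask_file'),
    ('nodeB', 'params', 'transforms'),
]

by_source = defaultdict(list)
for src, src_field, dst_field in connections:
    by_source[src].append((src_field, dst_field))

inputs = {}
for src, fields in by_source.items():
    outputs = load_results(src)          # one load per source node
    for src_field, dst_field in fields:
        inputs[dst_field] = outputs[src_field]

print(load_count)  # 2 loads for 3 connections
```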
Interestingly, it seems MultiProc checks on mapnodes twice:
You'll first see one of them:
and about a second later:
This is happening running sMRIPrep from dMRIPrep, on a local installation of nipype on the branch of #3075. This is on a fresh work directory. If I reuse the old workdir, then MultiProc preempts running it again (it is checked only once).
mapnodes have always done that for any distributed plugin. they first insert all the subnodes, then insert the parent node as a dependency of these subnodes. when the parent node re-executes, it checks all the subnodes once more, but at this point they shouldn't run because they have been cached.
Okay, that's reassuring in the sense that we did not introduce this during a refactor - but my hope of having found a potential lead is now gone :P
Summary
https://circleci.com/gh/oesteban/smriprep/502
Related to #3009 (in that case, it occurred for a regular Node).