
Parallelism of some nodes across processes. #821


Closed
oesteban opened this issue Nov 10, 2017 · 20 comments

@oesteban
Member

When running subjects in parallel, certain nodes (such as fsdir) are run with the same directory as their base_dir, which introduces races.

It'd be nice to have a locking mechanism for these nodes.

FileNotFoundError: [Errno 2] No such file or directory: '/scratch/users/oesteban/fmriprep-phase1/work/ds000007/fmriprep_wf/fsdir/_0x6f97333768f8175f647591004cc0cd21_unfinished.json' -> '/scratch/users/oesteban/fmriprep-phase1/work/ds000007/fmriprep_wf/fsdir/_0x6f97333768f8175f647591004cc0cd21.json'
@oesteban added the bug, impact: low, task, and optimization labels on Nov 10, 2017
@oesteban
Member Author

Or, add parameters to the node name, so that base_dir is not shared.
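
For illustration only (IdentityInterface stands in for whatever interface fsdir actually uses, and subject_id is assumed to be in scope), that could look like:

from nipype.interfaces import utility as niu
from nipype.pipeline import engine as pe

# Sketch: parameterizing the node name gives every subject its own working
# directory under base_dir, so concurrent per-subject runs stop colliding.
# IdentityInterface is only a placeholder for the real fsdir interface.
subject_id = 'sub-01'
fsdir = pe.Node(niu.IdentityInterface(fields=['subjects_dir']),
                name='fsdir_%s' % subject_id)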

@effigies
Member

I wonder if this would be better in nipype, where a running node locks its working directory with, e.g., fasteners. By the time we get to code we control, nipype has already decided that the node hasn't already been run. While we may be able to figure something out here, the time to lock seems to be at the point of that decision.
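
As a rough sketch of that idea (not actual nipype code; `node` stands for an existing nipype Node, and the lock-file name is made up), the check-and-run step could be wrapped like this:

import os
import fasteners

# Sketch: take an inter-process file lock inside the node's working directory
# before nipype decides whether the node still needs to run, and hold it for
# the duration of the run. `node` is assumed to be an existing nipype Node.
workdir = node.output_dir()
os.makedirs(workdir, exist_ok=True)
with fasteners.InterProcessLock(os.path.join(workdir, '.nodelock')):
    result = node.run()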

@oesteban
Member Author

Yes, that would be awesome. WDYT @satra?

@effigies
Member

Here's a quick implementation of a DirectoryBasedLock. It will only work on filesystems that explicitly emulate local filesystem atomic semantics.

Since nipype can't provide guarantees with any single lock (the contexts vary too widely), we could make it optional and have users provide a lock that implements the following protocol:

class LockDir(object):
    """Lock protocol: usable directly or as a context manager."""

    def __init__(self, outdir):
        self.outdir = outdir

    def acquire(self, *args, **kwargs):
        raise NotImplementedError

    def release(self):
        raise NotImplementedError

    # Context-manager protocol; defined on the class because Python looks up
    # dunder methods on the type, not on the instance.
    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, exc_type, exc_value, tb):
        self.release()

Then a node that needs to be protected could be instantiated:

node = pe.Node(Interface(), name='node', lock=DirectoryBasedLock)

The default lock would be a no-op, and it would be up to users to provide a lock with sufficient mutual-exclusion guarantees for their environment.
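
As an illustration only (these are not the DirectoryBasedLock above, just sketches that implement the acquire/release protocol), the no-op default and a lock that relies on mkdir atomicity could look roughly like this:

import errno
import os
import time


class NoOpLock(object):
    """Default lock: satisfies the protocol but provides no mutual exclusion."""

    def __init__(self, outdir):
        self.outdir = outdir

    def acquire(self, *args, **kwargs):
        return True

    def release(self):
        pass


class MkdirLock(object):
    """Sketch of a directory-based lock: os.mkdir is atomic on local POSIX
    filesystems, so whichever process creates ``<outdir>.lock`` first wins."""

    def __init__(self, outdir):
        self.lockdir = outdir.rstrip('/') + '.lock'

    def acquire(self, timeout=60, poll=1):
        start = time.time()
        while True:
            try:
                os.mkdir(self.lockdir)
                return True
            except OSError as exc:
                if exc.errno != errno.EEXIST:
                    raise
                if time.time() - start > timeout:
                    return False
                time.sleep(poll)

    def release(self):
        os.rmdir(self.lockdir)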

@effigies
Member

Come to think of it, with a protocol like that, we could set up a little TCP service that does nothing but sit and listen for requests with directory names, and let callers know if they get the lock. So depending on filesystem properties would become entirely optional.
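
A minimal sketch of such a service (entirely hypothetical: a one-line text protocol, a single-threaded server, no persistence or crash recovery) might be:

import socketserver

HELD = set()  # directory names currently locked


class LockHandler(socketserver.StreamRequestHandler):
    """Answer 'LOCK <dir>' with OK or BUSY, and 'UNLOCK <dir>' with OK."""

    def handle(self):
        verb, _, path = self.rfile.readline().decode().strip().partition(' ')
        if verb == 'LOCK':
            if path in HELD:
                self.wfile.write(b'BUSY\n')
            else:
                HELD.add(path)
                self.wfile.write(b'OK\n')
        elif verb == 'UNLOCK':
            HELD.discard(path)
            self.wfile.write(b'OK\n')


if __name__ == '__main__':
    # The default TCPServer handles one request at a time, so the
    # check-and-set on HELD needs no extra synchronization.
    server = socketserver.TCPServer(('0.0.0.0', 9999), LockHandler)
    server.serve_forever()

Callers that get BUSY would simply retry, so no assumptions about the shared filesystem are needed.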

@oesteban
Member Author

Another node suffering from this: mri_coreg. The crash report reads:

Node: fmriprep_wf.single_subject_MSC08_wf.func_preproc_ses_func10_task_memorywords_wf.bold_reg_wf.bbreg_wf.mri_coreg
Working directory: /scratch/users/oesteban/fmriprep-phase1/work/ds000224/fmriprep_wf/single_subject_MSC08_wf/func_preproc_ses_func10_task_memorywords_wf/bold_reg_wf/bbreg_wf/mri_coreg

Node inputs:

args = <undefined>
brute_force_limit = <undefined>
brute_force_samples = <undefined>
compress_report = auto
conform_reference = <undefined>
dof = 9
environ = {'SUBJECTS_DIR': '/opt/freesurfer/subjects'}
ftol = 0.0001
generate_report = True
ignore_exception = False
initial_rotation = <undefined>
initial_scale = <undefined>
initial_shear = <undefined>
initial_translation = <undefined>
linmintol = 0.01
max_iters = <undefined>
no_brute_force = <undefined>
no_coord_dithering = <undefined>
no_cras0 = <undefined>
no_intensity_dithering = <undefined>
no_smooth = <undefined>
num_threads = 8
out_lta_file = True
out_params_file = <undefined>
out_reg_file = <undefined>
out_report = report.svg
ref_fwhm = <undefined>
reference_file = <undefined>
reference_mask = <undefined>
saturation_threshold = <undefined>
sep = [4]
source_file = /scratch/users/oesteban/fmriprep-phase1/work/ds000224/fmriprep_wf/single_subject_MSC08_wf/func_preproc_ses_func10_task_memorywords_wf/nonlinear_sdc_wf/skullstrip_bold_wf/apply_mask/ants_susceptibility_Warped_masked.nii.gz
source_mask = <undefined>
source_oob = <undefined>
subject_id = sub-MSC08
subjects_dir = /oak/stanford/groups/russpold/data/openfmri/derivatives/ds000224/freesurfer
terminal_output = <undefined>

Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.6/site-packages/niworkflows/nipype/pipeline/plugins/multiproc.py", line 51, in run_node
    result['result'] = node.run(updatehash=updatehash)
  File "/usr/local/miniconda/lib/python3.6/site-packages/niworkflows/nipype/pipeline/engine/nodes.py", line 407, in run
    self._run_interface()
  File "/usr/local/miniconda/lib/python3.6/site-packages/niworkflows/nipype/pipeline/engine/nodes.py", line 515, in _run_interface
    old_cwd = os.getcwd()
FileNotFoundError: [Errno 2] No such file or directory

@effigies
Member

Why would this node be run on two different compute nodes? It looks like a directory got deleted out from under a running process.

@oesteban
Member Author

Yep, you are right. I reached that conclusion too but missed updating this issue. This FileNotFoundError also happened to the update_metadata node (just in case that sparks some ideas).

@oesteban
Member Author

This problem is back:

Node: fmriprep_wf.single_subject_13_wf.func_preproc_task_dis_run_02_wf.bold_reg_wf.bbreg_wf.bbregister
Working directory: /scratch/users/oesteban/fmriprep-phase2/work/ds000212/fmriprep_wf/single_subject_13_wf/func_preproc_task_dis_run_02_wf/bold_reg_wf/bbreg_wf/bbregister

Node inputs:

args = <undefined>
compress_report = auto
contrast_type = t2
dof = 9
environ = {'SUBJECTS_DIR': '/opt/freesurfer/subjects'}
epi_mask = <undefined>
fsldof = <undefined>
generate_report = True
ignore_exception = False
init = <undefined>
init_cost_file = <undefined>
init_reg_file = /scratch/users/oesteban/fmriprep-phase2/work/ds000212/fmriprep_wf/single_subject_13_wf/func_preproc_task_dis_run_02_wf/bold_reg_wf/bbreg_wf/mri_coreg/registration.lta
intermediate_file = <undefined>
out_fsl_file = <undefined>
out_lta_file = True
out_reg_file = <undefined>
out_report = report.svg
reg_frame = <undefined>
reg_middle_frame = <undefined>
registered_file = True
source_file = /scratch/users/oesteban/fmriprep-phase2/work/ds000212/fmriprep_wf/single_subject_13_wf/func_preproc_task_dis_run_02_wf/nonlinear_sdc_wf/skullstrip_bold_wf/apply_mask/ants_susceptibility_Warped_masked.nii.gz
spm_nifti = <undefined>
subject_id = sub-13
subjects_dir = /oak/stanford/groups/russpold/data/openfmri/derivatives/ds000212/freesurfer
terminal_output = <undefined>

Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.6/site-packages/niworkflows/nipype/pipeline/plugins/multiproc.py", line 51, in run_node
    result['result'] = node.run(updatehash=updatehash)
  File "/usr/local/miniconda/lib/python3.6/site-packages/niworkflows/nipype/pipeline/engine/nodes.py", line 407, in run
    self._run_interface()
  File "/usr/local/miniconda/lib/python3.6/site-packages/niworkflows/nipype/pipeline/engine/nodes.py", line 515, in _run_interface
    old_cwd = os.getcwd()
FileNotFoundError: [Errno 2] No such file or directory

I'll try to debug this, but it may be necessary to escalate to nipype.

@effigies
Member

So, are you assuming that one process is erasing the node out from under the other? Do you have two nodes running the same BOLD files at the same time?

@oesteban
Member Author

Yes - I'm sure I only have one instance of fmriprep for a given subject. So these nodes should not be deleted while executing.

@effigies
Member

effigies commented Dec 1, 2017

So this doesn't seem to be a parallelism of nodes across processes issue, does it? I agree it looks like a bug. Just not related to the fsdir issue.

@oesteban
Member Author

oesteban commented Dec 1, 2017

Correct. I'll update the issue description and title, or create a new one.

@oesteban
Member Author

oesteban commented Dec 1, 2017

Created, I think this is not an issue anymore. Let's keep an eye on Chris' PR to nipype.

@oesteban oesteban closed this as completed Dec 1, 2017
@effigies
Member

effigies commented Dec 4, 2017

AFAIK this is still an issue. Even if nipy/nipype#2278 goes in, we'll still need to create a lock that works reasonably reliably and test it.

@effigies effigies reopened this Dec 4, 2017
@satra

satra commented Dec 4, 2017

@effigies - for multiproc, one can easily create a local lock, and portalocker (which we already have) would be fine. I'm mostly worried about locks across compute nodes. Also, different clusters with different filesystems enable different kinds of locking mechanisms.
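
For illustration, a local lock of that kind with portalocker (the lock-file path and the `node` variable are placeholders) might look like:

import portalocker

# Sketch: an exclusive lock on a file next to the shared working directory.
# Whether this actually excludes processes on *other* machines depends on the
# filesystem's flock/fcntl semantics, which is exactly the caveat above.
with portalocker.Lock('/scratch/work/fsdir.lock', timeout=300):
    result = node.run()  # `node` stands for the nipype node to protect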

@effigies
Member

effigies commented Dec 4, 2017

Yes, this issue is specifically about locks across compute nodes.

@stale

stale bot commented Mar 12, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 12, 2019
@stale stale bot closed this as completed Apr 1, 2019
@dmd

dmd commented Sep 17, 2019

I'm getting what I believe is this issue, in 1.5.0:

$ cat /data/ddrucker/testing/derivatives/fmriprep/sub-acj/log/20190917-102450_234554e0-3572-4773-9070-09ebb07bea01/crash-20190917-103025-ddrucker-bids_info-5eccfd72-fb37-442c-82d0-d2127cde5582.txt
Node: fmriprep_wf.single_subject_acj_wf.bids_info
Working directory: /data/ddrucker/testing/fmriprep-work/fmriprep_wf/single_subject_acj_wf/bids_info

Node inputs:

bids_dir = /data/ddrucker/testing
bids_validate = False
in_file = /data/ddrucker/testing/sub-acj/anat/sub-acj_T1w.nii.gz

Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/legacymultiproc.py", line 69, in run_node
    result['result'] = node.run(updatehash=updatehash)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 473, in run
    result = self._run_interface(execute=True)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 564, in _run_interface
    return self._run_command(execute)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 649, in _run_command
    result = self._interface.run(cwd=outdir)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/interfaces/base/core.py", line 376, in run
    runtime = self._run_interface(runtime)
  File "/usr/local/miniconda/lib/python3.7/site-packages/niworkflows/interfaces/bids.py", line 165, in _run_interface
    self.inputs.bids_validate)
  File "/usr/local/miniconda/lib/python3.7/site-packages/niworkflows/utils/bids.py", line 213, in _init_layout
    layout = BIDSLayout(str(bids_dir), validate=validate)
  File "/usr/local/miniconda/lib/python3.7/site-packages/bids/layout/layout.py", line 212, in __init__
    indexer.index_metadata()
  File "/usr/local/miniconda/lib/python3.7/site-packages/bids/layout/index.py", line 207, in index_metadata
    with open(bf.path, 'r') as handle:
FileNotFoundError: [Errno 2] No such file or directory: '/data/ddrucker/testing/fmriprep-work/fmriprep_wf/single_subject_acj_wf/func_preproc_task_cue_wf/bold_split/_0xca3dc0d25055795133385b837b4d35b0_unfinished.json'

What should I do?

@effigies
Member

effigies commented Sep 17, 2019

@dmd I think this is a different but related issue. Could you open a new issue, please?
