OOM - Segmentation fault (not ulimit, not cgroups, not max-space, not exhausted RAM) #54692
I can't reproduce the segfault in v22.7.0:

```js
// repro.js
const bufs = []
let i = 0
while (true) {
  ++i
  bufs.push(Array.from({ length: 10 * 1024 * 1024 }, () => Math.random().toString()))
  // console.log(i)
}
```

```console
$ node --max-old-space-size=32000 --trace-gc repro.js
[144303:0x6bb7000] 88 ms: Scavenge 85.7 (87.0) -> 85.0 (88.0) MB, pooled: 0 MB, 13.39 / 0.00 ms (average mu = 1.000, current mu = 1.000) allocation failure;
[144303:0x6bb7000] 113 ms: Scavenge 87.4 (89.7) -> 86.7 (92.2) MB, pooled: 0 MB, 2.53 / 0.00 ms (average mu = 1.000, current mu = 1.000) allocation failure;
[144303:0x6bb7000] 170 ms: Scavenge 92.8 (96.0) -> 91.1 (96.0) MB, pooled: 0 MB, 1.89 / 0.00 ms (average mu = 1.000, current mu = 1.000) allocation failure;
[144303:0x6bb7000] 223 ms: Scavenge 97.1 (100.5) -> 95.3 (100.5) MB, pooled: 0 MB, 1.39 / 0.00 ms (average mu = 1.000, current mu = 1.000) allocation failure;
[144303:0x6bb7000] 282 ms: Scavenge 101.4 (104.7) -> 99.6 (104.7) MB, pooled: 0 MB, 1.74 / 0.00 ms (average mu = 1.000, current mu = 1.000) allocation failure;
[144303:0x6bb7000] 334 ms: Scavenge (interleaved) 105.6 (109.2) -> 103.9 (109.2) MB, pooled: 0 MB, 1.72 / 0.00 ms (average mu = 1.000, current mu = 1.000) allocation failure;
[144303:0x6bb7000] 397 ms: Mark-Compact 104.0 (109.2) -> 103.9 (109.0) MB, pooled: 0 MB, 61.02 / 0.00 ms (+ 0.2 ms in 0 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 71 ms) (average mu = 0.846, current mu = 0.846) finalize incremental marking via stack guard; GC in old space requested
[... similar messages ...]
```

Additionally, please specify a valid Node.js version to make this easier to reproduce.
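A quick way to confirm what heap limit V8 actually applied in a given environment (e.g. whether `--max-old-space-size=32000` is being honoured) is a minimal sketch like this; the file name is arbitrary:

```js
// check-heap-limit.js (name is arbitrary): print the heap limit V8 is actually using.
// heap_size_limit should be roughly max-old-space-size plus some overhead.
const v8 = require('node:v8');

const { heap_size_limit, total_available_size } = v8.getHeapStatistics();
console.log(`heap_size_limit:      ${(heap_size_limit / 1024 / 1024).toFixed(0)} MB`);
console.log(`total_available_size: ${(total_available_size / 1024 / 1024).toFixed(0)} MB`);
```

Running it with the same flags (`node --max-old-space-size=32000 check-heap-limit.js`) on both the working machine and the VM should show whether the flag takes effect in both places.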
Thank you.
Maybe this isn't an issue with Node.js, but rather how the VM's memory is managed? AFAICT the program will only segfault when
If you want, I can provide a VM for you to see. I know it's due to this VM context, but since I can't reproduce the segfault from a C program, I don't understand how the system context is kicking a JS app out.
I'm not sure there is much that can be done in this regard. Could a collaborator transfer this to
Thank you for your answers, I'll do that.
I can reproduce this with Docker CE on Linux/x86_64 using the image `glcr.b-data.ch/jupyterlab/python/base:3.12.8-devtools`:

```console
docker run --runtime runc --rm -ti glcr.b-data.ch/jupyterlab/python/base:3.12.8-devtools bash
nano index.js
node --max-old-space-size=32000 --trace-gc index.js
```

I understand why this happens with Deno in limited environments, but I have no idea why it happens with Node.js in a fairly unlimited one. I wonder what (hidden? Docker?) limit causes this. Environment:

```console
node --version
uname -a
prlimit
```
I am most certainly running into an OOM at ~20 GB when building code-server for Linux/RISC-V (64-bit) using the unofficial Linux/RISC-V Node.js binaries and Docker emulation with QEMU.
As I cannot reproduce this with Docker Desktop for Mac on Apple Silicon, there must be some memory limitation in Docker CE or Debian. I will open a discussion at https://github.com/moby/moby/discussions and point to my reproduction using Docker CE on Linux/x86_64.
EDIT: The advice below is probably not the cause for you, but it might help you identify where the difference is coming from.
You are possibly affected by this. Docker Desktop for Mac will not show such a high number for the `nofile` limit. It might not be this limit specifically, but it has been known to cause various services running in containers to regress in performance or allocate large amounts of memory due to excessive file descriptors.

You can force the container itself to run with lower limits to see if that resolves the issue. For a compose file:

```yaml
# Add this to your DMS service settings, it will reset the soft limit to 1024
ulimits:
  nofile:
    soft: 1024
```

For `docker run`:

```console
# Soft limit:
$ docker run --ulimit nofile=1024:524288 --rm -it alpine ash -c 'ulimit -Sn'
1024
# Hard limit:
$ docker run --ulimit nofile=1024:524288 --rm -it alpine ash -c 'ulimit -Hn'
524288
```

For context, the soft limit is how many file descriptors a process may have. Each process has its own individual count; it is not a cumulative limit across processes. That limit and others can be configured in the main Docker daemon config plus systemd drop-in overrides.

If it's not that, then look at what the systemd config is for both Docker Engine and containerd:
For Docker Engine v25, In both cases, since neither project wanted to set a default For changes to
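If it helps to compare environments, a small sketch (assuming Linux, where `/proc/self/limits` is available; the file name is arbitrary) that prints the limits the Node.js process itself sees:

```js
// show-limits.js (name is arbitrary): print the resource limits applied to this process.
// Linux only - /proc/self/limits lists soft and hard limits, including "Max open files".
const fs = require('node:fs');

const limits = fs.readFileSync('/proc/self/limits', 'utf8');
const interesting = limits
  .split('\n')
  .filter((line) => /^Limit|open files|address space|data size|processes/i.test(line));
console.log(interesting.join('\n'));
```

Running it on the host and inside the container (e.g. via `docker run`) makes it easy to spot which limit differs between the two.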
@polarathene It [Docker Desktop for Mac on Apple Silicon] shows the same numbers for `docker run --rm -ti debian prlimit`.
I will give it a try.
@polarathene When I use
However, this does not explain why a segmentation fault occurs when the heap reaches ~20 GB.
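To narrow down which counter is actually near ~20 GB at the moment of the crash, one option is a sketch like this, which folds the repro's allocation loop together with periodic memory logging (the loop shape is taken from the repro earlier in the thread; the logging cadence and file name are arbitrary):

```js
// trace-memory.js (name is arbitrary): repro loop plus periodic memory logging,
// so the last line printed before the crash shows which counter is near ~20 GB.
const v8 = require('node:v8');

const gb = (n) => (n / 1024 ** 3).toFixed(2);
const bufs = [];
let i = 0;
while (true) {
  bufs.push(Array.from({ length: 10 * 1024 * 1024 }, () => Math.random().toString()));
  if (++i % 5 === 0) {
    const { rss, heapUsed, external } = process.memoryUsage();
    const { heap_size_limit } = v8.getHeapStatistics();
    console.error(
      `iter=${i} rss=${gb(rss)}GB heapUsed=${gb(heapUsed)}GB ` +
        `external=${gb(external)}GB heapLimit=${gb(heap_size_limit)}GB`
    );
  }
}
```

If `heapUsed` is still far below `heapLimit` when the process dies, that would point away from `--max-old-space-size` and towards something outside V8's own heap accounting.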
Follow-up response from: moby/moby#49945 (reply in thread)
You might want to instead experiment with this comment from your first referenced issue. It's been a long time since I had that issue, but years ago the Linux defaults for that tunable were often low enough that it was very easy, as a developer, to trigger the errors being cited there. I'm specifically referring to

That tunable is for the kernel, so it should be whatever your host system has set, unless your container runtime has modified it (which does happen; a common modification is setting the sysctl

The other potential cause is Debian. IIRC the systemd v240 change that implemented the

Debian, unlike other distros, chose to patch that change out of systemd to keep the old behaviour, which I believe was due to their own patch for PAM. Perhaps they've still got the PAM and systemd related patches being carried, so it's very possible that contributes to your experience if you're unable to replicate it in other distros like Fedora / ArchLinux / openSUSE.
If lowering the FD limit prevented that from occurring, it's likely that the higher limit either hit another bottleneck that exhausted a resource as described above, or, as per my earlier comment, introduced a regression in the memory allocation required. You'd have to investigate that further if you want to track it down; the easiest approach is to switch out components, such as the Docker host distro, given that you're using Debian.

As for the 2nd reference, there is very little context on their choice of limit there. It is very likely they chose it for reasons similar to Docker's ("works for me" problem solving) or containerd's (copied what Docker did). If you go through my history tracking from the Docker PR, you'll see there is very little information on understanding a correct value, and no real discussion about soft vs hard limits going on IIRC; the focus was on resolving an issue and moving forward quickly due to limited bandwidth/budget, as is common with projects 😅 (and it was not as problematic until the systemd v240 change).
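The specific tunable referenced above didn't survive in the quoted text, so purely as a sketch of how to compare host vs. container values: `vm.overcommit_memory` is the knob already mentioned in this issue, while `vm.max_map_count` is included only as an assumed example and may not be the one meant.

```js
// check-sysctls.js (name is arbitrary): print kernel tunables to compare host vs. container.
// vm.overcommit_memory is already mentioned in this issue; vm.max_map_count is only an
// assumed example of the kind of tunable worth comparing - it may not be the one meant above.
const fs = require('node:fs');

for (const name of ['vm/overcommit_memory', 'vm/overcommit_ratio', 'vm/max_map_count']) {
  try {
    const value = fs.readFileSync(`/proc/sys/${name}`, 'utf8').trim();
    console.log(`${name.replace('/', '.')} = ${value}`);
  } catch (err) {
    console.log(`${name.replace('/', '.')}: not readable (${err.code})`);
  }
}
```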
Version
v16.20.2, v20.17.0, v22.7.0
Platform
Subsystem
No response
What steps will reproduce the bug?
The code just has to reach the OOM point.
How often does it reproduce? Is there a required condition?
On Outscale VMs
What is the expected behavior? Why is that the expected behavior?
OOM at max-old-space-size
What do you see instead?
OOM when heap reaches ~20G
Additional information
The code works as expected on my own computer and crashes when max-old-space-size is reached...
But on cloud VMs (from Outscale) it always goes OOM around 20G.
I checked ulimits and cgroups (even when cgroups kills a process with the OOM reaper, it doesn't throw a segfault) and found nothing...
I tried setting a fixed 50G value in ulimits to see whether "unlimited" hides a low default value, and it's the same.
I tried /proc/sys/vm/overcommit_memory with values 0, 1 and 2, and it's the same.
I tried to recompile Node.js on the VM.... Same....
I exhausted ChatGPT's ideas....
I thought maybe this is a host limit applied to the processes of my VM by the cloud provider, so I tried this:
But this can reach the VM RAM limit (64G or 128G) without any problem.
Same for the `stress` command.... So I'm running out of ideas...
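The exact test used above isn't shown in the report; a hypothetical Node.js equivalent that allocates raw `Buffer`s outside the V8 heap (so `--max-old-space-size` does not cap them) would look something like this sketch:

```js
// buffer-fill.js (hypothetical): commit memory outside the V8 heap until the VM runs out.
// Buffer memory is not subject to --max-old-space-size, so this exercises raw RAM/overcommit.
const chunks = [];
const GiB = 1024 ** 3;
let total = 0;
while (true) {
  // Buffer.alloc zero-fills, so the pages are actually committed, not just reserved.
  chunks.push(Buffer.alloc(GiB));
  total += 1;
  console.log(`allocated ${total} GiB`);
}
```

If this reaches the full 64G/128G while the JS-object repro dies around 20 GB, that would suggest the ceiling is specific to how V8 heap memory is allocated rather than a plain cap on total process memory.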
I hope someone here has a clue about what is happening....
Thank you.