OOM - Segmentation fault (not ulimit, not cgroups, not max-space, not exhausted RAM) #54692

Closed
riverego opened this issue Sep 1, 2024 · 13 comments
Labels
memory: Issues and PRs related to the memory management or memory footprint.
wrong repo: Issues that should be opened in another repository.

Comments

@riverego

riverego commented Sep 1, 2024

Version

v16.20.2, v20.17.0, v22.7.0

Platform

Linux ip-10-8-1-229 6.1.0-23-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.99-1 (2024-07-15) x86_64 GNU/Linux

The same happens on Ubuntu and Debian 11.

Subsystem

No response

What steps will reproduce the bug?

const bufs = []
let i = 0
while (true) {
  ++i
  bufs.push(Array.from({ length: 10*1024 * 1024 }, () => Math.random().toString()))
  // console.log(i)
}

The code just has to reach the OOM point.

node --max-old-space-size=32000 --trace-gc index.js
[12808:0x6f27120]   146468 ms: Scavenge 19279.2 (19571.3) -> 19263.9 (19571.3) MB, 50.10 / 0.00 ms  (average mu = 0.831, current mu = 0.831) allocation failure;
[12808:0x6f27120]   146787 ms: Scavenge 19317.6 (19610.3) -> 19302.1 (19610.5) MB, 35.85 / 0.00 ms  (average mu = 0.831, current mu = 0.831) allocation failure;
Segmentation fault
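
A variant of the same loop with heap statistics printed each iteration (just a diagnostic sketch; v8.getHeapStatistics() is part of Node's built-in v8 module) makes it easier to see how far the heap actually gets before the crash:

const v8 = require('node:v8')

const bufs = []
let i = 0
while (true) {
  ++i
  bufs.push(Array.from({ length: 10 * 1024 * 1024 }, () => Math.random().toString()))
  // Print used heap vs. the configured limit so the crash point is visible.
  const { used_heap_size, heap_size_limit } = v8.getHeapStatistics()
  console.error(`iteration ${i}: ${(used_heap_size / 2 ** 30).toFixed(1)} GiB used / ${(heap_size_limit / 2 ** 30).toFixed(1)} GiB limit`)
}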

How often does it reproduce? Is there a required condition?

On Outscale VMs

What is the expected behavior? Why is that the expected behavior?

An OOM crash when the heap reaches --max-old-space-size.

What do you see instead?

An OOM (segmentation fault) when the heap reaches only ~20 GB.

Additional information

The code works as expected on my own computer and crashes when --max-old-space-size is reached...
But on Outscale cloud VMs it always goes OOM at around 20 GB.

$ cat /proc/<pid>/limits
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             257180               257180               processes
Max open files            1048576              1048576              files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       257180               257180               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

I checked ulimits and cgroups (even when cgroups kills a process with the OOM reaper, it doesn't throw a segfault) and found nothing...

I tried setting a fixed value of 50 GB with ulimit, in case "unlimited" hides a low default value; it's the same.
I tried /proc/sys/vm/overcommit_memory with the values 0, 1 and 2; it's the same.
I tried recompiling Node.js on the VM... same...
I have exhausted ChatGPT's ideas...
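
To rule out a limit that applies only to the Node.js process itself, a quick sketch like this (standard Linux procfs paths; the cgroup lookup assumes cgroup v2) can dump the relevant limits from inside the running process:

// limits.js - print the limits that apply to *this* process
const { readFileSync } = require('node:fs')

const read = (p) => { try { return readFileSync(p, 'utf8').trim() } catch { return 'n/a' } }

console.log(read('/proc/self/limits'))
console.log('vm.overcommit_memory =', read('/proc/sys/vm/overcommit_memory'))
// Resolve this process's cgroup and print its memory limit ("max" means unlimited).
const cgroup = read('/proc/self/cgroup').split('\n').pop().split('::')[1] || '/'
console.log('cgroup memory.max =', read(`/sys/fs/cgroup${cgroup}/memory.max`))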

I thought maybe this is a host limit applied by the cloud provider to the processes of my VM, so I tried this:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void){
        /* Probe available memory: step size is 1 GiB, starting at 25 GiB. */
        size_t oneGiB = 1024 * 1048576UL;
        size_t maxMem = 25 * oneGiB;
        void *memPointer = NULL;
        do{
                if(memPointer != NULL){
                        printf("Max Tested Memory = %zu\n", maxMem);
                        /* Touch every byte so the pages are actually committed. */
                        memset(memPointer, 0, maxMem);
                        free(memPointer);
                }
                maxMem += oneGiB;
                memPointer = malloc(maxMem);
        }while(memPointer != NULL);
        maxMem -= oneGiB;
        printf("Max Usable Memory approx = %zu\n", maxMem);

        memPointer = malloc(maxMem);
        if(memPointer != NULL){
                memset(memPointer, 1, maxMem);
                sleep(30);
        }

        return 0;
}

But this program can reach the VM's RAM limit (64 GB or 128 GB) without any problem.
Same for the stress command...

So I'm running out of ideas...
I hope someone here has a clue about what is happening....

Thank you.

@avivkeller
Member

avivkeller commented Sep 1, 2024

I can't reproduce the segfault in v22.7.0:

// repro.js
const bufs = []
let i = 0
while (true) {
  ++i
  bufs.push(Array.from({ length: 10*1024 * 1024 }, () => Math.random().toString()))
  // console.log(i)
}
$ node --max-old-space-size=32000 --trace-gc repro.js 
[144303:0x6bb7000]       88 ms: Scavenge 85.7 (87.0) -> 85.0 (88.0) MB, pooled: 0 MB, 13.39 / 0.00 ms  (average mu = 1.000, current mu = 1.000) allocation failure; 
[144303:0x6bb7000]      113 ms: Scavenge 87.4 (89.7) -> 86.7 (92.2) MB, pooled: 0 MB, 2.53 / 0.00 ms  (average mu = 1.000, current mu = 1.000) allocation failure; 
[144303:0x6bb7000]      170 ms: Scavenge 92.8 (96.0) -> 91.1 (96.0) MB, pooled: 0 MB, 1.89 / 0.00 ms  (average mu = 1.000, current mu = 1.000) allocation failure; 
[144303:0x6bb7000]      223 ms: Scavenge 97.1 (100.5) -> 95.3 (100.5) MB, pooled: 0 MB, 1.39 / 0.00 ms  (average mu = 1.000, current mu = 1.000) allocation failure; 
[144303:0x6bb7000]      282 ms: Scavenge 101.4 (104.7) -> 99.6 (104.7) MB, pooled: 0 MB, 1.74 / 0.00 ms  (average mu = 1.000, current mu = 1.000) allocation failure; 
[144303:0x6bb7000]      334 ms: Scavenge (interleaved) 105.6 (109.2) -> 103.9 (109.2) MB, pooled: 0 MB, 1.72 / 0.00 ms  (average mu = 1.000, current mu = 1.000) allocation failure; 
[144303:0x6bb7000]      397 ms: Mark-Compact 104.0 (109.2) -> 103.9 (109.0) MB, pooled: 0 MB, 61.02 / 0.00 ms  (+ 0.2 ms in 0 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 71 ms) (average mu = 0.846, current mu = 0.846) finalize incremental marking via stack guard; GC in old space requested
[... similar messages ...]

Additionally, please specify a valid Node.js version to make this easier to reproduce.

@avivkeller avivkeller added the memory Issues and PRs related to the memory management or memory footprint. label Sep 1, 2024
@riverego
Author

riverego commented Sep 1, 2024

Thank you.
Yes, I know; on my computer I don't have the issue...
It's only on Outscale VMs

@avivkeller
Member

It's only on Outscale VMs

Maybe this isn't an issue with Node.js, but rather how the VM's memory is managed? AFAICT the program will only segfault when --max-old-space-size is reached.

@riverego
Author

riverego commented Sep 1, 2024

If you want, I can provide a VM for you to look at.

I know it's due to this VM context, but since I can't reproduce the segfault with a C program, I don't understand how the system context is killing a JS app.

@avivkeller avivkeller added the wrong repo Issues that should be opened in another repository. label Sep 1, 2024
@avivkeller
Member

I know it's due to this VM context

I'm not sure there is much that can be done in this regard. Could a collaborator transfer this to nodejs/help?

@riverego
Author

riverego commented Sep 1, 2024

Thank you for your answers, I'll do that.

@benz0li

benz0li commented Mar 7, 2025

I can reproduce this with Docker CE on Linux/x86_64 [1] using the image glcr.b-data.ch/jupyterlab/python/base:3.12.8-devtools:

docker run --runtime runc --rm -ti glcr.b-data.ch/jupyterlab/python/base:3.12.8-devtools bash
nano index.js
node --max-old-space-size=32000 --trace-gc index.js
[...]
[37:0x43468000]   154290 ms: Scavenge 19276.9 (19569.0) -> 19261.4 (19569.0) MB, 36.75 / 0.00 ms  (average mu = 0.833, current mu = 0.834) allocation failure; 
[37:0x43468000]   154642 ms: Scavenge 19315.4 (19608.3) -> 19300.0 (19608.5) MB, 22.29 / 0.00 ms  (average mu = 0.833, current mu = 0.834) allocation failure; 
Segmentation fault (core dumped)

I understand why this happens with Deno [2] in limited environments, but I have no idea why it happens with Node.js in a fairly unlimited one.

I wonder what (hidden? Docker?) limit causes this.


node --version
v20.18.1

uname -a
Linux 1b3dd9ed848f 6.1.0-31-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.128-1 (2025-02-07) x86_64 GNU/Linux

prlimit
RESOURCE   DESCRIPTION                             SOFT      HARD UNITS
AS         address space limit                unlimited unlimited bytes
CORE       max core file size                 unlimited unlimited bytes
CPU        CPU time                           unlimited unlimited seconds
DATA       max data size                      unlimited unlimited bytes
FSIZE      max file size                      unlimited unlimited bytes
LOCKS      max number of file locks held      unlimited unlimited locks
MEMLOCK    max locked-in-memory address space   8388608   8388608 bytes
MSGQUEUE   max bytes in POSIX mqueues            819200    819200 bytes
NICE       max nice prio allowed to raise             0         0 
NOFILE     max number of open files             1048576   1048576 files
NPROC      max number of processes            unlimited unlimited processes
RSS        max resident set size              unlimited unlimited bytes
RTPRIO     max real-time priority                     0         0 
RTTIME     timeout for real-time tasks        unlimited unlimited microsecs
SIGPENDING max number of pending signals        1029509   1029509 signals
STACK      max stack size                       8388608 unlimited bytes

Footnotes

  1. Can also be reproduced with Docker CE on Linux/AArch64. Cannot be reproduced with Docker Desktop for Mac on Apple Silicon (macOS/arm64, aka AArch64).

  2. Both Deno and Node.js use the V8 JavaScript engine

@benz0li

benz0li commented Mar 16, 2025

I am most certainly running into an OOM at ~20 GB when building code-server for Linux/RISC-V (64-bit) using the unofficial Linux/RISC-V Node.js binaries and Docker emulation with QEMU (tonistiigi/binfmt:qemu-v8.1.5 [1]): https://gitlab.b-data.ch/coder/code-server/-/jobs/157888

Cross reference:

Footnotes

  1. A build using tonistiigi/binfmt:qemu-v9.2.2 is currently ongoing, staying just under 20 GB of memory usage.

@benz0li

benz0li commented May 9, 2025

As I cannot reproduce this with Docker Desktop for Mac on Apple Silicon, there must be some memory limitation in Docker CE or Debian.

I will open a discussion at https://github.com/moby/moby/discussions and point to my reproduction using Docker CE on Linux/x86_64.

@polarathene
Copy link

EDIT: The advice below is probably not the cause for you, but it might help you identify where the difference is coming from.


Max open files            1048576              1048576              files

You are possibly affected by this. The Docker Desktop for Mac will not show such a high number for ulimit -Sn if I am right?

It might not be this limit specifically, but it has been known to cause various services running in containers to regress in performance or allocate large amounts of memory due to excessive file descriptors (because of LimitNOFILE=infinity). Normally this is a problem on other Docker hosts where the limit is over a billion; when it's over a million, as on Debian, it's still a regression, but it shouldn't be significant.

You can force the container itself to run with lower limits to see if that resolves the issue?

For compose.yaml, use the ulimits setting:

# Add this to your service's settings; it resets the soft limit to 1024
ulimits:
  nofile:
    soft: 1024

For docker run, use --ulimit option:

# Soft limit:
$ docker run --ulimit nofile=1024:524288 --rm -it alpine ash -c 'ulimit -Sn'
1024

# Hard limit:
$ docker run --ulimit nofile=1024:524288 --rm -it alpine ash -c 'ulimit -Hn'
524288

For context, the soft limit is how many file descriptors a process may have. Each process has its own individual count; it is not a cumulative limit across processes.
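
If you want to see how many descriptors a given process actually holds at runtime (plain procfs, nothing Docker-specific; <pid> is whatever process you're inspecting):

$ ls /proc/<pid>/fd | wc -l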

That limit and others can be configured in the main Docker daemon config + systemd drop-in overrides for docker.service + containerd.service as detailed here.
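
For example, a systemd drop-in along these lines (purely illustrative, reusing the 1024:524288 values suggested above) pins the limits the daemon runs with:

# /etc/systemd/system/docker.service.d/override.conf
[Service]
LimitNOFILE=1024:524288

followed by systemctl daemon-reload and a restart of the docker service. The same kind of drop-in applies to containerd.service.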


If it's not that, then look at what systemd config is for both Docker Engine and containerd:

For Docker Engine v25, LimitNOFILE=infinity was removed (as can be seen from the link to docker.service above). For containerd with containerd.service, LimitNOFILE=infinity was also removed, but the change did not land until the 2.0 release. Docker Engine (presently v28) still uses containerd 1.x, so if your FD limits are still high with Docker Engine 25+, it's probably due to that.

In both cases, since neither project wanted to set a default of LimitNOFILE=1024:524288 as I suggested, a host with a systemd release prior to v240 (which added the new hard limit) will instead have the kernel defaults of 1024:4096, which can be too low for more demanding software.
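
To see what your host is actually applying (standard systemctl queries; the unit names assume a stock Docker install):

$ systemctl show -p DefaultLimitNOFILE
$ systemctl show docker.service containerd.service -p LimitNOFILE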

For changes to docker.service, if you can identify problems with the current settings, you can refer to these issues to share your findings and how to resolve them:

@benz0li

benz0li commented May 16, 2025

You are possibly affected by this. The Docker Desktop for Mac will not show such a high number for ulimit -Sn if I am right?

@polarathene It [Docker Desktop for Mac on Apple Silicon] shows the same numbers for NOFILE:

docker run --rm -ti debian prlimit
RESOURCE   DESCRIPTION                             SOFT      HARD UNITS
AS         address space limit                unlimited unlimited bytes
CORE       max core file size                         0 unlimited bytes
CPU        CPU time                           unlimited unlimited seconds
DATA       max data size                      unlimited unlimited bytes
FSIZE      max file size                      unlimited unlimited bytes
LOCKS      max number of file locks held      unlimited unlimited locks
MEMLOCK    max locked-in-memory address space unlimited unlimited bytes
MSGQUEUE   max bytes in POSIX mqueues            819200    819200 bytes
NICE       max nice prio allowed to raise             0         0 
NOFILE     max number of open files             1048576   1048576 files
NPROC      max number of processes            unlimited unlimited processes
RSS        max resident set size              unlimited unlimited bytes
RTPRIO     max real-time priority                     0         0 
RTTIME     timeout for real-time tasks        unlimited unlimited microsecs
SIGPENDING max number of pending signals         192348    192348 signals
STACK      max stack size                       8388608 unlimited bytes

You can force the container itself to run with lower limits to see if that resolves the issue?

I will give it a try.

@benz0li

benz0li commented May 17, 2025

You can force the container itself to run with lower limits to see if that resolves the issue?

I will give it a try.

@polarathene When I use --ulimit nofile=65536:65536, the code-server build peaks at 16.5 GB and succeeds.

However, this does not explain why a segmentation fault occurs when the heap reaches ~20 GB.

Cross references:

@polarathene

Follow-up response from: moby/moby#49945 (reply in thread)


Cross references:

You might want to instead experiment with this comment from your first referenced issue.

It's been a long time since I had that issue, but years ago the Linux defaults for that tunable were often so low that it was very easy for a developer to trigger the errors cited there.

I'm specifically referring to sysctl fs.inotify.max_user_watches, as I believe that was the culprit. Adjusting your FD limits shouldn't be necessary, other than reducing the soft limit to what you'd have on a regular host outside of a container (1024).

That tunable belongs to the kernel, so it should be whatever your host system has set, unless your container runtime has modified it (which does happen; a common modification is setting sysctl net.ipv4.ip_unprivileged_port_start=0 instead of requiring the CAP_NET_BIND_SERVICE capability for a non-root user to bind ports below 1024).
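
A quick way to compare (ordinary commands, nothing project-specific) is to read the tunable on the host and from inside a container:

$ sysctl fs.inotify.max_user_watches
$ docker run --rm debian cat /proc/sys/fs/inotify/max_user_watches

If it turns out to be low, it can be raised with, for example, sysctl -w fs.inotify.max_user_watches=524288 (or persisted via /etc/sysctl.d/).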

The other potential cause comes from Debian. IIRC the systemd v240 change that introduced the 1024:524288 FD limits also adjusted a related kernel setting for the maximum number of files you can have open (it's covered in one of my prior links, likely the moby PR where I detail the historical context).

Debian, unlike other distros, chose to patch out that systemd change to keep the old behaviour, which I believe was due to their own PAM patch. They may still be carrying the PAM- and systemd-related patches, so it's very possible that contributes to your experience if you're unable to reproduce the issue in other distros like Fedora / Arch Linux / openSUSE.

However, this does not explain why a segmentation fault occurs when the heap reaches ~20 GB.

If lowering the FD limit prevented that from occurring, it's likely that the higher limit either hit another bottleneck that exhausted a resource, as described above, or, as per my earlier comment, introduced a regression in how much memory had to be allocated.

You'd have to investigate further if you want to track it down; the easiest approach is to switch out components, such as the Docker host distro, given that you're using Debian.


As for the 2nd reference, there is very little context on their choice of limit there. It is very likely they arrived at it the way Docker did ("works for me" problem solving) or containerd did (copying what Docker did). If you go through my history tracking from the Docker PR, you'll see there is very little information on what a correct value should be, and IIRC no real discussion about soft vs hard limits; the focus was on resolving an issue and moving forward quickly due to limited bandwidth/budget, as is common with projects 😅 (and it was not as problematic until the systemd v240 change).
