Score:2

Redhat phantom out of memory issues

bl flag

We have a server that runs various headless applications: Java processes that stream data, daily Python scripts, etc. From time to time some of our applications get out-of-memory errors.

The problem we have is the monitoring shows there is plenty of ram. We upped it from 128GB to 192GB and it hasn't solved the problem. Our monitoring takes a reading every 20 seconds and shows minimum available memory of 132GB over the last 2 days. But we had some applications fail with out of memory errors this morning. Is it possible to get OOM with plenty of ram available?

EDIT: Answers to questions from David

  • Yes, the 192GB is just the RAM allocated to the OS. It is a VM.
  • The monitoring reads free/available RAM for the OS; we don't have any per-process monitoring.
  • Most Java processes don't specify memory requirements on the CLI (e.g. -Xmx etc.)
  • The exception is `Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread`

I would add that multiple processes fail at the same time. To me this indicates it's not an issue with the processes themselves but something to do with the system. Some of the apps that fail just do the same thing all day every day, which is to process a fairly consistent stream of data. It's not as if they could be flooded with a large number of requests.

cn flag
Memory fragmentation?
sa flag
"unable to create new native thread" is a less-than-usual kind of OOM error. Is it possible that your application leaks threads and you ran into a PID limit? Can you monitor the number of running threads? Linux treats threads as "processes lite", and they can be seen like processes in many monitoring tools.
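A minimal way to check this, assuming a Linux system where each thread appears as an entry under `/proc/<pid>/task` (the PID below is a placeholder; the shell's own PID is used just for illustration):

```shell
# Total number of threads on the system (each data row of ps -eLf is one thread)
ps -eLf | tail -n +2 | wc -l

# Thread count of a single process; substitute the suspect PID for $$
APP_PID=$$
ls /proc/$APP_PID/task | wc -l
```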
Philipp Wendler avatar
in flag
Duplicate of https://stackoverflow.com/q/16789288/396730? If you only get OOM errors with this message in your Java application and not other errors or somewhere else, this is more related to Java than to the system.
Score:7
ph flag

When you say "we upped it from 128GB to 192GB and it hasn't solved the problem" what do you mean? The JVM heap space? The RHEL VM? Also what do you mean by "our monitoring takes a reading?" Is your monitoring looking at Java heap memory or system memory?

Is it possible to get OOM with plenty of ram available?

Sure. The most common cause is that "plenty of RAM is available", but not of the right kind. E.g. you have RAM on the server, but the Java process isn't configured to use it. Or you have RAM available in the Java heap, but the Java application needs stack memory instead of heap memory. Or PermGen/metaspace memory. Or off-heap memory.

There are some other edge cases where you can get an OOM error even with the above, but those are pretty rare. Most likely it is that you are adding the wrong kind of memory.

If I were to debug this, my first steps would be:

  • What exactly is the OOM error and where are you seeing it?
  • Looking at the JVM startup flags (and potentially the config of the application, depending on what kind of application it is).
  • Enabling GC logging in the application.
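To make the last two bullets concrete, here is a sketch of the kinds of flags to look for (or add) on the `java` command line. All sizes, paths, and the jar name are placeholder assumptions; `-Xlog:gc*` is the JDK 9+ syntax, on JDK 8 use `-Xloggc:<file> -XX:+PrintGCDetails` instead:

```shell
# All values below are placeholders -- tune for your application.
# -Xmx                    : max heap
# -Xss                    : stack size per thread
# -XX:MaxMetaspaceSize    : class metadata (PermGen before JDK 8)
# -XX:MaxDirectMemorySize : off-heap direct buffers
# -Xlog:gc*               : GC logging destination
java \
  -Xmx8g \
  -Xss1m \
  -XX:MaxMetaspaceSize=512m \
  -XX:MaxDirectMemorySize=1g \
  -Xlog:gc*:file=/var/log/app-gc.log \
  -jar app.jar
```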

EDIT IN RESPONSE TO STACK TRACE:

Well, it looks like my "there are some other edge cases" comment was prophetic. I agree with Philipp Wendler's comment that this is a duplicate of https://stackoverflow.com/q/16789288/396730 . You aren't actually running out of memory, you are running out of threads.

You can look here: https://access.redhat.com/solutions/1420363 for how to increase the number of threads (short version: update /proc/sys/kernel/threads-max). But as is discussed in the linked Stack Overflow post, you probably need to fix your application rather than just bump the limit. Any application using more than the default maximum number of threads is probably leaking threads. (And if it isn't, it's definitely being wasteful with threads.) Especially if, as you say, they aren't being flooded with requests.
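Before bumping anything, it's worth reading the current values. A quick sketch (the sysctl name is real; the commented-out new value is a placeholder):

```shell
# Kernel-wide cap on threads (the knob from the Red Hat article)
cat /proc/sys/kernel/threads-max

# Kernel-wide cap on PIDs; every thread consumes a PID
cat /proc/sys/kernel/pid_max

# Per-user process/thread limit for the current shell
ulimit -u

# To raise threads-max temporarily (as root; the value is a placeholder):
# sysctl -w kernel.threads-max=2000000
```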

Guntram Blohm avatar
in flag
To debug this, I'd create a cron job with content similar to `echo $(date +"%Y-%m-%d %H:%M"; ps -eLf | wc -l) >> /var/log/processcount.log` (beware: you need to escape the `%` in `crontab`, or put this into a script and run the script). I'd expect the second column (number of processes including threads) to rise constantly until you get the OOM error. After verifying this, run `ps -eLf` alone to check which process keeps creating but not removing threads. And then check that Java application for `new Thread()` without a corresponding `thread.join()` after the thread is done.
David Ogren avatar
ph flag
I agree with Guntram, but I suspect that just looking at the `ps` output once may be enough: you'll see which process is the thread hog. Then you can do a `kill -3` or `jstack` to figure out what all of those threads are doing.
MikeKulls avatar
bl flag
Thanks guys, this is really helpful info. I ran `ps -eLf | wc -l` and it returned around 3000. threads-max is set at 1546067, so we're a long way off but maybe something is going crazy at some point. I have added it to cron and will see how it goes over a day.
MikeKulls avatar
bl flag
When I look at `ulimit -u` I get unlimited for the user most processes run as (username of mapr) and 773033 for root (why it's set at that I'm not sure). For my own user account I get 4096 but nothing runs under my account. Maybe for some reason the mapr user is getting 4096?
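Rather than guessing which limit a process actually inherits, you can read the kernel's view of a running process directly. A sketch (the PID is a placeholder; the current shell is used for illustration -- threads count against the "Max processes" line):

```shell
PID=$$   # placeholder: substitute the PID of one of the mapr processes
grep 'Max processes' /proc/$PID/limits
```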
MikeKulls avatar
bl flag
I believe I have found a solution. There are multiple processes leaking threads. The big mystery: why do they fail at the same time? I suspect this is because they are very similar code, streaming jobs copy/pasted from each other, and are leaking at the same rate. When they fail they get restarted at the same time. I am still investigating; the problem doesn't happen very often.
Score:0
bl flag

I thought I would add some of the commands I used to investigate the problem. I added these to cron to run every minute.

```shell
# log total count of threads to a file
echo $(date +"%Y-%m-%d %H:%M"; ps -eLf | wc -l) >> /somepath/threadcount_`date '+%Y-%m-%d'`.log

# log the processes using the most threads
ps -eLf | awk '{print $2}' | grep -v PID | uniq -c | sort -nr | head -10 | awk '{print $2,$1}' > /somepath/threadhogs_`date '+%Y-%m-%d_%H-%M-%S'`.log

# send output of top to a file, sorted by memory usage
top -b -n 1 -o RES > /somepath/top_`date '+%Y-%m-%d_%H-%M-%S'`.log
```
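Once the first log has accumulated some samples, here is a quick sketch for spotting monotonic growth (the filename is a placeholder; the assumed format is `date time count`, as produced by the command above):

```shell
# Print the change in thread count between consecutive samples;
# a long run of non-negative deltas suggests a leak.
awk 'NR > 1 { print $3 - prev } { prev = $3 }' threadcount.log
```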