Troubleshooting the “Too Many Open Files” Error
If you have a large or active Alfresco installation, or you have a large number of interactive users, your system may be at risk. Learn what conditions may cause the error “Too many open files,” how to correct this problem, and how to solve connection leaks.
Why should you care about the number of open files?
Since Alfresco 5.0 started using Solr instead of Lucene, the “Too many open files” error is now less common. However, this nasty error still shows up as there are some factors that could significantly increase the number of open handles. Large and active Alfresco installations are especially vulnerable. You should care if you have one or more of the following conditions:
- A large number of concurrent users
- A large number of uploads or transformations
- Solr running in the same Tomcat as the repository
- A large number of network connections
- Custom code that opens network connections or files
- Alfresco is in a “death spiral,” which occurs when the system is too slow and overloaded—causing transactions to time out and be tried again and again—flooding the server and the database with open transactions.
Some of those conditions may not be enough to cause a “Too many open files” error by themselves, while others may cause it single-handedly. In any case, each of them, to a different degree, increases the chances of exceeding the open files limit.
Did you notice that network connections also count as open files? When the number of open handles for files or connections exceeds the limits set for a Linux server, Alfresco will show the error “Too many open files” in catalina.out. When I refer to open files, I am also talking about open connections.
Interestingly, it is very rare to see this error with Alfresco on Windows. The handles limit on Windows for processes running locally is 16,744,448. Although the Linux limits may look like a problem, they are actually a safety measure: they are meant to prevent a single process from gobbling up all the resources.
Note: This article focuses on Alfresco, but the concepts apply to other applications.
What are those limits?
The default settings in most flavors of Linux are as low as 1,024 handles, which is much less than even a departmental Alfresco system may need. Alfresco used to recommend a soft limit of 4,096 and a hard limit of 65,536 handles in the documentation for Alfresco 4.2.
You can check the limits in your system using the following commands:
| Command | What it prints |
|---|---|
| ulimit -Sn | The soft open files limit. This is the limit that is actually enforced for a session or process, and non-root users can raise it up to the hard limit. When a process reaches it, further attempts to open a file or connection fail with the error “Too many open files”; Linux does not kill the process. |
| ulimit -Hn | The hard open files limit. This is the ceiling for the soft limit, and only root can raise it. |
| cat /proc/sys/fs/file-max | The maximum number of open handles for all the sessions and processes running in the operating system. |
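Keep in mind that ulimit reports the limits of the shell you type it in, which are not necessarily the limits of the Alfresco process that is already running. The following is a minimal sketch of how to read the limits of a specific process, assuming a Linux /proc filesystem. The function name process_nofile_limits is my own, and the PID you pass in is whatever your Tomcat process happens to be.

```shell
#!/usr/bin/env bash

# Print the soft and hard open-files limits of a running process.
# /proc/<pid>/limits contains a row like:
#   Max open files            1024                 4096                 files
process_nofile_limits() {
    awk '/^Max open files/ { print "soft=" $4, "hard=" $5 }' "/proc/$1/limits"
}

# Example: limits of the current shell ($$ is the shell's own PID).
process_nofile_limits "$$"
```

Replace "$$" with the PID of the Alfresco process to see the limits that Tomcat is really running with. Reading another user's /proc entry may require running the command as that user or as root.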
If the limit is too low, the “Too many open files” error is produced and it may be necessary to increase the limits allowed by Linux. The typical limits are:
- Default for most OS distributions: soft=1024, hard=65536
- Alfresco’s recommendation: soft=4096, hard=65536
- Large Alfresco installations: soft=16196, hard=131072
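As a rough sketch of how such values could be applied, the snippet below raises the soft limit for the current shell, capped at the hard limit, so that a process started from that shell inherits it. The function name raise_soft_nofile is hypothetical. The limits.conf lines in the comments assume that Alfresco runs under a user called "alfresco" and that your distribution applies pam_limits to that user's sessions; check both before relying on them.

```shell
#!/usr/bin/env bash

# To make the limits permanent, entries like these are commonly added to
# /etc/security/limits.conf (the user name "alfresco" is an assumption):
#   alfresco  soft  nofile  4096
#   alfresco  hard  nofile  65536

# Set the soft open-files limit of the current shell to the requested value,
# without exceeding the hard limit (which only root can raise).
raise_soft_nofile() {
    local target="$1"
    local hard
    hard="$(ulimit -Hn)"
    # The hard limit may be reported as the literal word "unlimited".
    if [ "$hard" != "unlimited" ] && [ "$target" -gt "$hard" ]; then
        target="$hard"
    fi
    ulimit -Sn "$target"
}

raise_soft_nofile 4096
ulimit -Sn    # show the soft limit now in effect for this shell
```

The new soft limit only affects this shell and the processes started from it afterwards, so Alfresco has to be restarted from the same session (or after the permanent change) to pick it up.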
What if I still get the error after increasing the limit?
If you continue seeing the error after increasing the limits, it is likely that some of the code is leaking handles (i.e., some custom code is not closing all the files or network connections that it has opened). The solution, of course, is to find the leaking code. However, if you have a large code base, it may be difficult to find what is leaving those files or connections open.
The steps below show how to monitor the number of open files. Monitoring is a very useful tool that you can use along with the logs to find the module that could be causing the problem.
These instructions assume the following:
- You will run the script on Linux, on the same production server that runs Alfresco
- Java is available on the server (either the JDK or JRE)
Here are the steps:
- Download the file-leak-detector agent which is an open source Java agent. You can find the download link here: http://file-leak-detector.kohsuke.org/
- The downloaded file will include the version information (for example, the file that I got was named “file-leak-detector-1.11-jar-with-dependencies.jar”). You may want to rename the file to “file-leak-detector.jar” as the following steps assume that you renamed the file.
- Get the process ID (PID) of the Alfresco process. For example, you can use the command: ps aux | grep Bootstrap
- Attach the file-leak-detector agent to the Alfresco process using a command similar to the example shown below. The agent has a mini-http server that we can activate and assign to any available port. Just for the sake of example, let’s assume that the process ID was 12345 and we are going to enable its http server on port 19999:
- Create a script like the following to monitor the number of open files (I named my script “monitor_file_descriptors.sh”):
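The last two steps can be sketched as follows. This is not necessarily the script I originally used; it is a minimal version that counts the entries in /proc/<pid>/fd, and the argument order (samples, interval in seconds, PID) as well as the function names are assumptions of this sketch. The attach command in the header comment follows the agent's documented usage as I understand it (PID first, then options such as http=PORT), so verify it against the version you downloaded.

```shell
#!/usr/bin/env bash
# monitor_file_descriptors.sh SAMPLES INTERVAL PID
#
# Before monitoring, attach the agent once (example PID 12345, port 19999):
#   java -jar file-leak-detector.jar 12345 http=19999

# Count the open file descriptors (files and sockets) of a process.
count_open_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Print one timestamped count per sample.
monitor_open_fds() {
    local samples="$1" interval="$2" pid="$3" i
    for (( i = 1; i <= samples; i++ )); do
        printf '%s pid=%s open_fds=%s\n' \
            "$(date '+%Y-%m-%d %H:%M:%S')" "$pid" "$(count_open_fds "$pid")"
        # Wait between samples, but not after the last one.
        if (( i < samples )); then
            sleep "$interval"
        fi
    done
}

# Run the monitor only when all three arguments were given.
if [ "$#" -ge 3 ]; then
    monitor_open_fds "$1" "$2" "$3"
fi
```

Run the script as the user that owns the Alfresco process (or as root); otherwise /proc/<pid>/fd is not readable and the count will show as 0.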
To execute the script, your command line should look like the following in order to monitor the open files one hundred times, every three seconds, and send the output to both the terminal and the file open_files_log.txt:
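A command along these lines matches that description. It assumes that monitor_file_descriptors.sh takes the number of samples, the interval in seconds, and the PID as its three arguments; that argument order is an assumption of this sketch, so adjust it to your own script.

```shell
#!/usr/bin/env bash

SAMPLES=100     # how many times to sample
INTERVAL=3      # seconds between samples
PID=12345       # example Alfresco process ID

# tee prints every line on the terminal and also writes it to the log file.
# The check in front skips the run if the script is missing or not executable.
[ -x ./monitor_file_descriptors.sh ] &&
    ./monitor_file_descriptors.sh "$SAMPLES" "$INTERVAL" "$PID" | tee open_files_log.txt

echo "Expected run time: $(( SAMPLES * INTERVAL )) seconds"
```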
The script will run for five minutes (100 * 3 = 300 seconds). I would suggest running the script every minute for a week (that would be about 10,000 times every 60 seconds), as shown in the following example:
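A week has 7 * 24 * 60 = 10,080 minutes, so a week-long run could look like the sketch below. It again assumes that the script takes the number of samples, the interval in seconds, and the PID as arguments (my assumption, not a fixed interface), and it uses nohup so that the monitoring keeps running after you log out.

```shell
#!/usr/bin/env bash

INTERVAL=60                   # one sample per minute
SAMPLES=$(( 7 * 24 * 60 ))    # number of minutes in one week
PID=12345                     # example Alfresco process ID

# Run in the background, detached from the terminal, logging to a file.
if [ -x ./monitor_file_descriptors.sh ]; then
    nohup ./monitor_file_descriptors.sh "$SAMPLES" "$INTERVAL" "$PID" \
        > open_files_log.txt 2>&1 &
fi
```

If Alfresco is restarted during the week, its PID changes and the monitoring has to be started again with the new PID.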
Naturally, you can also monitor the number of open files using a browser and the URL http://alfresco_host:19999/
If you have a large or active Alfresco installation, or you have a large number of interactive users, your system may be at risk. We explained the conditions that may cause the error “Too many open files,” how to correct the problem, and how to debug “connection leaks.”