Solr is Alfresco’s repository search engine. The powers of this search engine almost feel magical. Throw some search terms at it, and it will come back with the most relevant documents in an instant. Or seconds. Or… minutes? What’s going on?
It is common for Solr to get slower as more documents are indexed. After all, the larger the index, the more time it is going to take to navigate the index to find a document. However, Solr should not be unbearably slow.
Solr stores the location of every item in the index, which is kept in index segment files. As the Alfresco repository grows larger, so do the index files. In a large repository, index segments may be several gigabytes in size.
To find the documents related to a word, Solr has to search in the index files. As you know, reading from disk is a slow operation, so Solr tries to keep as much data as possible in its memory caches (despite “disk” becoming an obsolete term—the storage medium of choice for Solr is SSD—please bear with us).
In an ideal world, the whole set of index files would fit in the server’s memory. However, there are practical limitations to the amount of memory that can be made available to the Solr application.
Although Alfresco’s documentation keeps growing and improving, there are still some things that are missing—or have not been properly updated. For that reason, installations performed “by the book” may perform poorly or run out of memory.
The lack of some documentation details has led many implementers to make mistakes when deploying Solr. One of the most common mistakes is allocating too little or too much memory to Solr. Going to either extreme may mean poor performance, strange behavior, or crashing systems. Are you making these mistakes?
Mistake 1: Too Little Memory
Alfresco’s incomplete memory calculation
Alfresco provides a procedure to calculate the memory for Solr caches for version 5.1 (the Solr configuration section was completely removed from the documentation of Alfresco Content Services 5.2 and later). The title of the article, “Calculate the memory needed for Solr nodes,” suggests an easy method to estimate the memory requirements, but in reality it is just a baseline for JVM memory. It does not cover all of the Solr caches, the memory needed to run Solr itself, the Lucene field cache, or the OS memory that is not allocated to the Java heap. Alfresco decided to stop publishing the calculation for later versions because there were just too many factors to consider.
Alfresco’s formulas for memory calculation will not provide enough memory to run Solr! The formulas calculate the size of some data structures and caches, but not all the memory objects that Solr requires.
Do this: Assign more memory than calculated
Different versions of Alfresco have different caches so you can’t just take the formula at face value. You have to investigate the caches in the solrcore.properties file and adjust the math accordingly. Find the location of your solrcore.properties file in Alfresco’s documentation. If needed, you can reduce the size of the caches to save memory.
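As a starting point for that investigation, the following minimal sketch lists every property whose name mentions a cache. The property names in the sample text are illustrative assumptions only; the names and values in your own solrcore.properties depend on your Alfresco version.

```python
def cache_settings(properties_text):
    """Return {key: value} for every property whose name mentions 'cache'."""
    settings = {}
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments (properties files use '#' or '!').
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep and "cache" in key.lower():
            settings[key.strip()] = value.strip()
    return settings


# Illustrative sample only: these keys and values are assumptions.
sample_properties = """
# cache sizes
solr.filterCache.size=64
solr.queryResultCache.size=1024
some.other.setting=true
"""

for name, value in cache_settings(sample_properties).items():
    print(f"{name} = {value}")
```

To inspect a real installation, read the file with open(path).read() and pass the text to cache_settings, then compare the listed caches with the ones covered by the documented formula.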
It is common to give 50% to 80% more memory to Solr than you had calculated using the documentation.
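The arithmetic behind that rule of thumb can be sketched as follows. The function name and the 10 GB input are hypothetical; only the 50% and 80% figures come from the text above.

```python
def padded_heap_range(calculated_gb, low=0.50, high=0.80):
    """Return (min_gb, max_gb) after adding 50% to 80% headroom
    on top of the figure obtained from the documented formula."""
    if calculated_gb <= 0:
        raise ValueError("calculated_gb must be positive")
    return calculated_gb * (1 + low), calculated_gb * (1 + high)


# Hypothetical example: the documented formula produced 10 GB.
low_gb, high_gb = padded_heap_range(10)
print(f"Plan for roughly {low_gb:.1f} to {high_gb:.1f} GB")
```

Treat the result as a first guess to be validated by monitoring, not as a guarantee.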
How do we know when Solr is running out of memory?
Suppose that you calculated Solr’s requirements, and then some. How do you know if you are giving Solr enough memory to run? What if you start seeing OutOfMemory exceptions?
Sometimes it is obvious that Solr needs more memory:
- The log (solr.log) shows OutOfMemory exceptions or “GC overhead limit exceeded” errors, or…
- Sometimes there are no errors in the logs, but indexing new documents becomes painfully slow. Searching can also be very slow, and Solr may crash or behave erratically.
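A quick way to check for the first symptom is to scan the log for the two error messages. This is a minimal sketch: the sample log lines are invented for illustration, and the exact layout of your solr.log depends on your logging configuration.

```python
MEMORY_ERROR_MARKERS = ("OutOfMemoryError", "GC overhead limit exceeded")


def find_memory_errors(lines):
    """Return (line_number, text) for each line that contains
    one of the memory-related error markers."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in MEMORY_ERROR_MARKERS):
            hits.append((number, line.rstrip()))
    return hits


# Invented sample lines; real log entries will look different.
sample_log = [
    "INFO  request handled",
    "ERROR java.lang.OutOfMemoryError: GC overhead limit exceeded",
    "ERROR java.lang.OutOfMemoryError: Java heap space",
]

for number, text in find_memory_errors(sample_log):
    print(f"line {number}: {text}")
```

For a real log, pass an open file object instead of the sample list, since the function accepts any iterable of lines.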
If in doubt, it is better to monitor Solr memory:
- Use a JMX client (such as JConsole or VisualVM) or a similar tool that shows the internal memory details of the JVM.
- Monitor the virtual memory usage and the total memory used by Solr in the server. In my presentation at DevCon 2019, I discussed some of the tools that can be used to perform this kind of monitoring.
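On Linux, the total memory used by the Solr process can also be read from /proc/<pid>/status, where VmRSS is the resident memory and VmSwap is the amount swapped out. The sketch below parses that text; the sample values are invented, and this is only one of several possible approaches.

```python
def parse_proc_status(text):
    """Extract VmRSS and VmSwap (in kB) from the text of /proc/<pid>/status."""
    wanted = {"VmRSS", "VmSwap"}
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            # The value looks like "  8388608 kB"; keep the number only.
            values[key] = int(rest.split()[0])
    return values


# Invented sample of the relevant part of /proc/<pid>/status.
sample_status = "Name:\tjava\nVmRSS:\t 8388608 kB\nVmSwap:\t       0 kB\n"

usage = parse_proc_status(sample_status)
print(f"resident: {usage['VmRSS'] / 1024 ** 2:.1f} GB, swapped: {usage['VmSwap']} kB")
```

A VmSwap value that keeps growing suggests that the server as a whole is short of memory, even if the heap itself looks healthy.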
Mistake 2: Too Much Memory
I imagine Solr as a hungry octopus that feeds from repository data. It may start as a tiny creature, but as the index files grow, it can become a monster. Give an octopus plenty of water, and it can move quickly. Pull an octopus out of the water, and the results will be… less spectacular. Give Solr lots of memory, and watch it shine.
In an ideal world, we would keep this creature in a pool with plenty of water to let it move swiftly. Likewise, it would be ideal if our Solr server could swim in plenty of memory, so it could navigate its index files without having to retrieve data from disk so often. That would mean that the whole index would have to fit within Solr’s memory.
Index files need to be read, split, sorted, and rebuilt after new data comes into an index. Unsurprisingly, Solr requires large amounts of space on disk and benefits from fast storage. Having enough memory to let Solr swim freely in memory is not unthinkable considering that the costs have been going down for decades. The problem is that, with Java, memory can be hard to manage at some point.
Memory Garbage Collection Can “Stop The World”
Java’s memory management can be funny. Its goal is to provide memory to store new objects. The process to dispose of unused objects and make room for new objects is known as Garbage Collection (GC). Think of it like a restaurant that gives the clean tables to newly arrived customers, without worrying about cleaning the dirty tables left by previous customers. At some point, the restaurant runs out of clean tables, and then Java sounds the alarm to stop serving customers and puts every employee to work cleaning all the unoccupied tables. Even diners have to freeze during the cleanup! Only when all the tables have been cleaned can the normal business of the restaurant resume.
Such an approach works well when the number of tables in the restaurant is small. However, if the restaurant had thousands of tables, the manager would have to close the restaurant for weeks, ruining the business.
The actual GC algorithms used by Java are more complicated than my analogy, but it gives you a feeling of how potentially disruptive the process is. Fortunately, Java has been evolving, providing several algorithms to perform GC, some more appropriate for different kinds of applications. Analogously to our imaginary restaurant, the algorithm suggested by Alfresco (UseConcMarkSweepGC, also known as CMS, and UseParNewGC) may work well when the memory used by Solr is relatively small. However, when the heap is larger than 20 GB or 30 GB, we start to observe “stop the world pauses” (STWP) that may last seconds, or even a full minute. As the server indexes more data, the pauses may occur more frequently.
Long STWPs are potentially dangerous, especially when Alfresco is running on the same web application server (Tomcat). They may cause problems, including bringing down the server (and other servers) in extreme cases. Luckily, there are better alternatives.
Memory has to be managed carefully. Give too much to Solr, and the system may stop with long “stop the world” pauses that may even bring down the server if you don’t use the right configuration. Abundant memory can become too much of a good thing.
G1GC to the rescue!
The G1GC algorithm divides the heap into many small regions and collects the regions that contain the most garbage first, so it rarely has to process the whole heap in a single pause. Although STWPs may still occur, they are usually much shorter. G1GC is a great solution when the heap size is very large. However, there are a few things to consider when using this algorithm:
- I estimate that the algorithm imposes a penalty of about 10% to 15%, which means you will have to increase the size of the heap accordingly.
- G1GC may require some tuning and monitoring to make sure that you get the desired results.
- Be sure to have an updated version of Java 8, as there were significant bugs in the earliest releases.
- The traditional CMS GC algorithm can actually be better in some cases, but it requires time-consuming tuning of the multiple options provided by Java. Also, CMS has been deprecated since Java 9 and was removed in Java 14.
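The heap adjustment from the first point can be sketched as a small helper that pads the heap and builds the matching JVM options. The helper name, the 20 GB input, and the 200 ms pause target are assumptions for illustration, not settings recommended by Alfresco; the penalty defaults to the upper end of the 10% to 15% estimate.

```python
import math


def g1_jvm_options(current_heap_gb, penalty_pct=15, pause_target_ms=200):
    """Pad a heap size by the estimated G1GC penalty and return JVM options.

    Integer percentages avoid floating-point rounding surprises."""
    heap_gb = math.ceil(current_heap_gb * (100 + penalty_pct) / 100)
    return [
        f"-Xmx{heap_gb}g",                          # maximum heap size
        "-XX:+UseG1GC",                             # select the G1 collector
        f"-XX:MaxGCPauseMillis={pause_target_ms}",  # pause goal, not a guarantee
    ]


# Hypothetical example: a heap that was 20 GB before switching to G1GC.
print(" ".join(g1_jvm_options(20)))
```

The pause target is only a goal that the collector tries to meet, which is one reason the tuning and monitoring mentioned above remain necessary.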
Leaving enough space for memory mapping: there’s life outside the heap
Knowing that Solr will perform better with a larger heap, many engineers allocate the biggest heap (the Xmx argument) that the server can accommodate, using up to 90% of the server’s RAM, which can produce significant slowness.
Solr 4 and later take advantage of Java’s memory mapping capabilities. This technique relies on the Operating System’s (OS) virtual memory manager: the index files are mapped into the address space of the Solr process, and the OS keeps the most recently used parts of those files in RAM outside the Java heap. Reading the index this way is much faster than using file channels. It is the same machinery that the OS uses to swap unused blocks of memory out to disk when the server runs low on memory, freeing RAM for other active programs.
If you run the ‘top’ command on Linux (or use the Task Manager in Windows), you may notice that the Alfresco Solr process is using more memory than expected. That extra memory is the mapped index data, which lives outside the heap. If you are familiar with Java’s sandboxing philosophy, you may find this surprising.
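The idea of memory mapping is not specific to Java. As an illustration only, the sketch below uses Python’s standard mmap module to read a slice of a file through a mapping; the temporary file is a stand-in for an index segment and has nothing to do with Solr’s real file format.

```python
import mmap
import os
import tempfile


def read_via_mmap(path, offset, length):
    """Read bytes from a file through a read-only memory mapping.

    The OS loads only the pages that are touched and keeps them
    in its own cache, outside the memory managed by the program."""
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


# Create a small stand-in for an index segment file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"header" + b"x" * 4096 + b"term:alfresco")
    segment_path = tmp.name

data = read_via_mmap(segment_path, 6 + 4096, 13)
os.remove(segment_path)
print(data)
```

Because the mapped pages belong to the OS rather than to the program, they count toward the memory shown for the process without ever appearing in the application’s own heap.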
As memory mapping uses memory out of the heap, it’s crucial to leave enough space. In Linux, the Java heap should take no more than 75% of the server’s memory. With Windows, the heap should not use more than two-thirds of the server’s memory. On the other hand, Alfresco recommends using only 25% of the server’s memory for the JVM. You will probably have to experiment between those limits to find the best performance.
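The limits above translate into a range of heap sizes to experiment with. This is a minimal sketch of that arithmetic; the function name and the 64 GB server are hypothetical.

```python
# Upper heap limits quoted in the text, stored as (numerator, denominator).
UPPER_LIMITS = {"linux": (3, 4), "windows": (2, 3)}
# Lower bound: the 25% figure recommended by Alfresco.
LOWER_LIMIT = (1, 4)


def heap_bounds_gb(ram_gb, os_name):
    """Return (low_gb, high_gb): the range of heap sizes worth testing."""
    num, den = UPPER_LIMITS[os_name.lower()]
    low = ram_gb * LOWER_LIMIT[0] / LOWER_LIMIT[1]
    high = ram_gb * num / den
    return low, high


# Hypothetical example: a Linux server with 64 GB of RAM.
low_gb, high_gb = heap_bounds_gb(64, "linux")
print(f"Try heap sizes between {low_gb:.0f} GB and {high_gb:.0f} GB")
```

Whatever remains above the chosen heap size is what the OS can use for memory mapping, buffers, and caches.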
Do not starve Solr! If the Java heap uses almost all the memory, there will not be enough space for memory mapping, OS buffers, and caches. If the memory out of the heap gets full, swapping will occur, reducing Solr’s performance to a crawl.
Deploying Alfresco Solr with the wrong memory size can lead to poor performance, unexpected behaviors, and even server and cluster crashes. Although we wish that estimating Solr’s memory were an exact science, it requires trial and error, and remains a bit of an art. In this article, we covered a better approach to calculating the minimum memory size, and the potential pitfalls of large amounts of memory.
I can’t thank Bindu Wavell from Zia Consulting, and Alex Strachan and Angel Borroy from Alfresco, enough for sharing their knowledge about this topic. All of them are well-recognized experts, and I am humbled to have had their help.
- Abbot, Eric; “G1GC Fundamentals: Lessons from Taming Garbage Collection,” https://product.hubspot.com/blog/g1gc-fundamentals-lessons-from-taming-garbage-collection
- Alfresco One 5.1.5; “Calculate the memory needed for Solr nodes,” https://docs.alfresco.com/5.1/concepts/solrnodes-memory.html
- Alfresco Content Services 5.2; “JVM settings,” https://docs.alfresco.com/6.2/concepts/jvm-settings.html
- Angel Borroy from Alfresco kindly provided the following resources:
  - These are our latest thoughts on JVM settings: https://issues.alfresco.com/jira/browse/SEARCH-2066
  - Alfresco Search Services is using the following parameters by default: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.6.5/solr/bin/solr#L1813
  - Those default parameters are based on the following recommendations: https://cwiki.apache.org/confluence/display/solr/ShawnHeisey
- Beckwith, Monica; “Garbage First Garbage Collector Tuning,” https://www.oracle.com/technical-resources/articles/java/g1gc.html
- Colorado, Luis; “Performance Tools of the Trade,” https://www.slideshare.net/LuisColoradoSCJP/alfresco-devcon-2019-performance-tools-of-the-trade
- Hoffman, Chris; “What Is the Windows Page File, and Should You Disable It?” https://www.howtogeek.com/126430/htg-explains-what-is-the-windows-page-file-and-should-you-disable-it/
- Lakshmanan, Ram; “CMS Deprecated. Next Steps?” https://dzone.com/articles/cms-deprecated-next-steps