Batch Class Optimization
Typically, the slowest plugins for a batch class will be:
- OCR (optical character recognition using Recostar or Nuance)
- Image Processing (OCR input, page images, thumbnail images)
- Custom Scripts (possibly)
You may well have different results based on your custom workflow and input document types.
Remember, for any performance optimization, focus on the slow steps first. If a plugin takes 1% of the total processing time, there is very little point in making it twice as fast: you have only saved 0.5% of the total. If another plugin takes 40% of the processing time, making it even a little faster gives a bigger benefit, and making it twice as fast cuts total processing time by a full 20%. So pay attention to where you can make the largest impact first.
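The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names and the plugin timings are hypothetical, and you would substitute the per-plugin timings measured for your own batch class.

```python
def time_saved(share_of_total, speedup):
    """Fraction of total batch time saved when a plugin that accounts for
    `share_of_total` of the run becomes `speedup` times faster."""
    if not 0.0 <= share_of_total <= 1.0:
        raise ValueError("share_of_total must be between 0 and 1")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return share_of_total * (1.0 - 1.0 / speedup)


def rank_by_share(timings_seconds):
    """Return (plugin, share of total time) pairs, slowest plugin first."""
    total = sum(timings_seconds.values())
    shares = {name: secs / total for name, secs in timings_seconds.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical timings, for illustration only; use your own measurements.
    timings = {
        "OCR": 120.0,
        "Create Thumbnails": 3.0,
        "Custom Script": 27.0,
        "Other": 150.0,
    }
    for name, share in rank_by_share(timings):
        saved = time_saved(share, 2)
        print(f"{name}: {share:.1%} of total, 2x faster saves {saved:.1%}")
```

Ranking plugins by their share of total time before touching anything makes it obvious which ones are worth the effort.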
Remove Unused Plugins
Remove any plugins that are switched to “off.” The overhead of an unused plugin is small, but removing it keeps attention on the plugins that are actually part of the workflow. It also makes the workflow much clearer to future developers working on the batch class.
Remove Unnecessary Plugins
If you have only black and white images, the Image Process Create OCR Input plugin can be removed. You will first need to edit the plugin dependencies using the batch class management control panel.
If you are building a fully automated workflow where documents never stop in Review or Validation, then you can remove the Image Process Create Thumbnails and Image Process Create Display Image plugins as these images are only needed for the user interface.
Optimize Slow Plugins, if Possible
Many Ephesoft plugins use third-party tools for document processing. For example, the Import Multipage Files plugin converts incoming documents to single page TIFFs. For incoming PDF files, Ephesoft uses Ghostscript or Recostar/Nuance (Windows/Linux) for image conversion and for incoming TIFFs it uses ImageMagick or GraphicsMagick. If these plugins are slow, you can experiment with changing the selected tool to see if there is a significant performance change for your documents. There is no hard and fast rule that one tool is always faster.
For image processing plugins (Create_Display_Image, Create_Thumbnails), GraphicsMagick seems to give a speed boost over ImageMagick in most cases. We recommend turning on GraphicsMagick for these plugins. To do this, edit the file “<install dir>\Ephesoft\Application\WEB-INF\classes\META-INF\dmca-imagemagick\imagemagick.properties” and change the line:
You will need to restart the Ephesoft service after making this change.
For the Import Multipage Files plugin, you can use the Batch Class Management UI to change Image Conversion Process from ImageMagick to GraphicsMagick.
In addition, for the same plugin, the Optimized TIFF switch generally keeps the TIFF files small, which benefits the downstream image processing plugins. Using ImageMagick without this switch can turn a 100KB TIFF into a 1MB TIFF.
With script plugins it is very helpful to add profiling code where necessary. If the script plugin is slow, the profiling data will show how long each step is taking. For example, if the custom script is calling a web service or running a database query, add code to log the response time. For cases like this, a possible optimization could be to cache the results of a web service call or database query if the results don’t change (e.g., a list of vendors or suppliers from an external system for invoice processing).
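The profiling and caching pattern described above might look like the following sketch. It is written in Python purely to illustrate the idea and would need to be ported to the language your script plugin uses. `fetch_vendors_from_erp` is a hypothetical stand-in for a slow web service call or database query, and the logger name is an assumption.

```python
import functools
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("script_profile")  # hypothetical logger name


@contextmanager
def timed(step):
    """Log how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        log.info("%s took %.1f ms", step, elapsed_ms)


def fetch_vendors_from_erp():
    """Hypothetical stand-in for a slow external lookup."""
    time.sleep(0.05)  # simulate network or database latency
    return ("Acme Corp", "Globex", "Initech")


@functools.lru_cache(maxsize=1)
def cached_vendors():
    """Fetch the vendor list once and reuse it for later calls."""
    with timed("vendor lookup"):
        return fetch_vendors_from_erp()


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    cached_vendors()  # slow: calls the external system and logs the timing
    cached_vendors()  # fast: served from the in-process cache
```

A cache like this is only safe when the data really does not change during the period it is held. If the vendor list can be updated, the cache needs an expiry or an explicit refresh.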
In general, OCR is one of the slowest plugins because it is very computationally intensive, so often there is little that can be done to speed up this step. In addition, your Ephesoft license limits the number of OCR processes and threads, which can cause blocking. In rare cases where Ephesoft is being used just as a data entry/indexing user interface, the OCR plugin can be removed. In some cases it is possible to reduce the area on each page that is OCRed. For instance, if you are receiving only invoices which have all the data in the header, then it may be possible to OCR only the top half of each page. Please consult with your implementation team or reseller before you attempt this sort of change, as it may impact your classification or extraction results.
For data extraction, the Fixed Form extraction plugin makes an additional call to the OCR engine for each page with data to be extracted. This will add to the overall processing time so try to use Fixed Form extraction only when necessary—such as for handwritten fields or checkboxes.
For Key-Value (KV) extraction, the execution time increases with each additional rule, so try to minimize the number of rules. Sometimes two different rules can be combined by adjusting the regular expressions (regex). For example, suppose you have two rules that are identical except for the Key regex: “Invoice Number” in one and “Invoice No.” in the other. You can combine the rules by using a regex like “Invoice (Number|No)” which should, in general, give better performance.
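To check that a combined Key regex still matches everything the separate rules matched, a quick test outside Ephesoft can help. The sketch below uses Python's `re` module and made-up sample strings. The alternation syntax `(A|B)` is common to most regex engines, but confirm the final pattern inside your batch class.

```python
import re

# The two original Key patterns, one per KV rule.
separate_rules = [re.compile(r"Invoice Number"), re.compile(r"Invoice No\.")]

# The single combined pattern suggested in the text.
combined_rule = re.compile(r"Invoice (Number|No)")


def matches_any(patterns, text):
    """True if at least one of the patterns is found in the text."""
    return any(p.search(text) for p in patterns)


# Made-up sample lines, for illustration only.
samples = [
    "Invoice Number: 10482",
    "Invoice No. 10482",
    "Purchase Order 7731",
]

for line in samples:
    old = matches_any(separate_rules, line)
    new = bool(combined_rule.search(line))
    print(f"{line!r}: separate={old} combined={new}")
```

Note that “Invoice (Number|No)” is slightly looser than the two originals: it also matches text such as “Invoice Note”. If that matters for your documents, a tighter alternative such as “Invoice (Number|No\.)” may be worth testing.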
From the previous article you know how to run performance tests on your batch classes. So, as you modify or remove plugins, you are able to repeat the testing and quantify any performance changes. This is an essential step to confirm your changes are acting as expected.
- Using faster CPUs will generally give better performance. Sometimes just spending an extra $1000 on faster CPUs will give a significant performance boost and is probably the best bang for your buck available. One website we have found with independent CPU performance statistics is cpubenchmark.net.
- Using a faster disk (SSD) will also give a performance boost due to the I/O impact of the intermediate files generated during Ephesoft processing.
- If using a network storage device, carefully test your I/O throughput to make sure it is not a bottleneck. If possible, do an apples-to-apples test to compare performance with local disk vs network storage.
- Carefully consider the impact of server clustering. If you don’t need to cluster for SLA reasons, then for performance you will be better off with a single server (e.g., an 8 core server will almost always be faster than a 2 x 4 core cluster).
- Hyperthreading is a CPU technology that can make a single-core CPU appear as a 2-core CPU to the operating system (OS). Servers are often shipped this way by the manufacturer. However, a server with 2 physical cores that is hyperthreaded to appear as a 4-core server to the OS is not as fast as a server with 4 physical cores. Because Ephesoft licensing is on a per-core basis, make sure your server has hyperthreading turned off for the best performance.
- Ephesoft creates many short-lived processes and utilizes temporary files during its processing. To some antivirus software, this can appear to be suspicious behavior. The antivirus software may delay Ephesoft by locking these processes and files. The delays can add up to a significant amount of overall processing time. If possible, work with your system administrator to create exceptions which will instruct the antivirus software to trust Ephesoft and its behavior.
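For the local disk versus network storage comparison mentioned in the list above, a rough apples-to-apples check can be as simple as timing the same write and read against each location. The sketch below is a minimal illustration, not a substitute for a proper benchmark tool. The function name and default sizes are assumptions, and the directories to test are passed on the command line.

```python
import os
import sys
import tempfile
import time


def write_read_throughput(directory, size_mb=64, chunk_kb=1024):
    """Write then read a temporary file in `directory`.

    Returns (write_mb_per_s, read_mb_per_s). The read figure may be
    inflated by the operating system's file cache.
    """
    chunk = os.urandom(chunk_kb * 1024)
    chunks = max(1, (size_mb * 1024) // chunk_kb)
    total_mb = chunks * chunk_kb / 1024.0
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(chunks):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # make sure the data actually reaches storage
        write_s = time.perf_counter() - start

        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(len(chunk)):
                pass
        read_s = time.perf_counter() - start
    finally:
        os.remove(path)
    return total_mb / write_s, total_mb / read_s


if __name__ == "__main__":
    # Pass the directories to compare, e.g. a local folder and a network share.
    for target in sys.argv[1:]:
        w, r = write_read_throughput(target)
        print(f"{target}: write {w:.0f} MB/s, read {r:.0f} MB/s")
```

Run it several times against each location and at a quiet time, since a single run can be skewed by caching or by other activity on the server or network.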