Ephesoft’s KV Extraction plugin provides a powerful way to extract values based on their position relative to keywords. It’s the core of what allows Transact to avoid the need for templates. However, this plugin has seen few changes over the years. With the release of 2020.1.02, Ephesoft added options that allow Intelligent Character Recognition (ICR) extraction without needing to dive into RecoStar Design Studio. This also opens the door for better ICR engines in the future. In this post, we will explore the new options and their real-world applications.
Transact’s KV Extraction was previously designed to work with machine-printed values. If you wanted to extract handwritten values, you needed to use Fixed Form Extraction. Now, an option exists to extract handwriting from the same familiar KV interface as before, allowing for more flexibility than the Fixed Form engine alone. Additionally, there’s an option to use the traditional machine print KV extraction first, then fall back to the ICR engine if confidence is too low. This allows documents filled in either electronically or by hand to use the same extraction rules. Only the handwritten samples are sent to the ICR engine. This saves time if a paid ICR API is added later.
If a document can be received with either machine-printed or handwritten values, the Extraction Type option can be changed to KV + Hand/Machine Print. Transact then first tries extraction via the traditional KV Extraction algorithm and, if the confidence is below a threshold, switches to ICR extraction.
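The fallback described above can be sketched in a few lines. This is an illustration of the decision logic only, not Transact's implementation: the `Extraction` class, the `Engine` signature, the stub engines, and the threshold of 70 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Extraction:
    value: str
    confidence: float  # 0-100, as reported by the engine
    engine: str        # label for which engine produced the value


# Hypothetical engine signature: takes a region name, returns an Extraction.
Engine = Callable[[str], Extraction]


def extract_with_fallback(region: str, machine_print: Engine, icr: Engine,
                          threshold: float) -> Extraction:
    """Try machine-print extraction first; call ICR only if confidence is low."""
    first = machine_print(region)
    if first.confidence >= threshold:
        return first  # the ICR engine is never invoked for this region
    return icr(region)


# Stub engines for illustration only; real engines would read the image.
def fake_machine_print(region: str) -> Extraction:
    confidence = 95.0 if region == "typed" else 20.0
    return Extraction("12345", confidence, "machine_print")


def fake_icr(region: str) -> Extraction:
    return Extraction("12345", 80.0, "icr")


typed = extract_with_fallback("typed", fake_machine_print, fake_icr, threshold=70.0)
handwritten = extract_with_fallback("handwritten", fake_machine_print, fake_icr, threshold=70.0)
print(typed.engine, handwritten.engine)  # machine_print icr
```

Because the ICR call sits behind the confidence check, only low-confidence regions ever reach the second engine.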
The Auto-Resize K/V Areas option could also be handy for documents that are received from sources with different dimensions. For instance, a document could be received by email or by mail, or it could have different scan options selected on the submitter’s personal scanner versus the business’s scanner. When this option is checked and the Key (green) box has been drawn exactly over the key text, incoming documents will be resized so that the key text’s size matches the sample’s. This prevents the value offset and dimensions from being incorrect, which helps on forms where small differences in dimensions can cause incorrect or extra data to be extracted.
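To make the resizing idea concrete, here is a rough sketch of the arithmetic. It assumes the scale is derived from the ratio of key text heights, which is a guess at the principle rather than a description of Transact's internals; the `Box` layout, function names, and pixel values are hypothetical.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def scale_factor(sample_key: Box, incoming_key: Box) -> float:
    """Factor that makes the incoming key text as tall as the sample's."""
    incoming_height = incoming_key[3]
    if incoming_height <= 0:
        raise ValueError("incoming key box must have a positive height")
    return sample_key[3] / incoming_height


def resized_page(page_size: Tuple[int, int], factor: float) -> Tuple[int, int]:
    """Page dimensions after applying the scale factor."""
    width, height = page_size
    return (round(width * factor), round(height * factor))


# Hypothetical boxes: the incoming scan is half the resolution of the sample.
sample_key = (100, 200, 120, 40)
incoming_key = (50, 100, 60, 20)

factor = scale_factor(sample_key, incoming_key)
print(factor)                             # 2.0
print(resized_page((850, 1100), factor))  # (1700, 2200)
```

Once the page is scaled this way, a value box drawn relative to the key on the sample lands in the same relative position on the incoming page.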
Lastly, the Value Type option can be switched between Alphanumeric and Numeric. I saw dramatic increases in accuracy for number extraction when this was set to Numeric. In my testing, RecoStar isn’t great on handwritten strings unless the handwriting is very neat, and it struggles when alphanumeric and symbol characters are mixed. Even so, the results will often be better than the traditional machine print results. Other ICR engines will be a welcome addition for handwriting extraction. Zia has an integration with Vidado, which offers exceptional freeform handwriting recognition.
Checkbox detection has been available in Fixed Form Extraction in Transact for a long time, but that method is time-consuming, and learning to use RecoStar Design Studio isn’t for the faint of heart. With the addition of checkbox detection to KV Extraction, it will be much more cost-effective and flexible to add checkbox detection to more documents. For those who have done checkbox detection in RecoStar, note that the KV Extraction method uses a simple pixel count (dark versus light pixel percentage within the drawn value box). RecoStar, on the other hand, attempts to detect the box itself and then determines the pixel count within that box.
When Checkbox Detection is chosen as the extraction method, you’ll see a Pixel Density % option appear. This option defaults to 100, so you’ll want to greatly lower this value. When you perform a test on the KV Extraction setup screen, the confidence that’s returned is the percentage of pixels that are dark within your current value box. You can use this to help determine a good threshold for checked versus unchecked boxes.
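As an illustration of what a pixel density check involves, here is a minimal sketch. The grayscale cutoff of 128, the `>=` comparison, and the function names are assumptions made for the example, not documented Transact behavior.

```python
from typing import List


def dark_pixel_percentage(pixels: List[List[int]], dark_below: int = 128) -> float:
    """Percentage of pixels darker than the cutoff (0 = black, 255 = white)."""
    total = sum(len(row) for row in pixels)
    if total == 0:
        return 0.0
    dark = sum(1 for row in pixels for value in row if value < dark_below)
    return 100.0 * dark / total


def is_checked(pixels: List[List[int]], threshold_percent: float) -> bool:
    """Treat the box as checked when enough of its pixels are dark."""
    return dark_pixel_percentage(pixels) >= threshold_percent


# Tiny made-up value boxes: 8 pixels each.
unchecked_box = [[255, 255, 255, 255],
                 [255, 255, 255, 255]]
checked_box = [[255, 0, 0, 255],
               [255, 255, 0, 255]]

print(dark_pixel_percentage(unchecked_box))  # 0.0
print(dark_pixel_percentage(checked_box))    # 37.5
print(is_checked(checked_box, threshold_percent=20.0))  # True
```

Nothing in this calculation locates the checkbox outline, so any printed border inside the drawn box counts toward the dark pixel total.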
One slight annoyance with this method is that the Value regular expression needs to be manually set to .+. Maybe Ephesoft will make this automatic in future versions, but for now make sure not to overlook that field.
I’ve seen issues with documents whose checkboxes can be different sizes. One customer’s form is identical in all ways except for the checkbox size. I’m not sure if this is due to differences in rendering by PDF editors when filled out, or if the form was modified with different sized checkboxes at some point so that two variations are now in circulation. Whatever the case, the smaller checkbox returns a pixel count of 8% when checked, and the larger checkbox returns a pixel count of 14% when unchecked. Because the checked value is lower than the unchecked one, no single threshold will work for both variations, and since the forms are identical besides the checkboxes, I haven’t found a way to classify them differently.
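The problem is easy to see with a small helper that looks for a separating threshold. This is a sketch with a hypothetical function name; the only measurements used are the 8% and 14% figures above.

```python
from typing import Optional, Sequence


def find_threshold(checked: Sequence[float],
                   unchecked: Sequence[float]) -> Optional[float]:
    """Midpoint threshold separating the two groups, or None if they collide."""
    lowest_checked = min(checked)
    highest_unchecked = max(unchecked)
    if highest_unchecked >= lowest_checked:
        return None  # some unchecked box is at least as dark as a checked one
    return (lowest_checked + highest_unchecked) / 2


# Small checkbox, checked: 8%. Large checkbox, unchecked: 14%.
print(find_threshold(checked=[8.0], unchecked=[14.0]))  # None
```

Any threshold low enough to catch the small checked box at 8% also flags the large unchecked box at 14%.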
More investigation will be required to find a working solution. Custom code may be required to programmatically determine the checkbox size and set the true/false value based upon the determined checkbox size. Note that this situation is difficult to handle in the traditional fixed-form extraction in RecoStar Design Studio.
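One possible shape for that custom code is a per-size threshold. This is a sketch only: the size cutoff and both thresholds are placeholder values, the function is hypothetical, and measuring the checkbox size from the image (the hard part) is left out.

```python
def is_checked_by_size(density_percent: float, box_side_px: int,
                       size_cutoff_px: int, small_threshold: float,
                       large_threshold: float) -> bool:
    """Pick the density threshold from the measured checkbox size."""
    if box_side_px < size_cutoff_px:
        threshold = small_threshold
    else:
        threshold = large_threshold
    return density_percent >= threshold


# Placeholder settings; real values would come from testing both form variations.
SIZE_CUTOFF_PX = 30
SMALL_THRESHOLD = 5.0
LARGE_THRESHOLD = 20.0

# Small checkbox at 8% density, large checkbox at 14% density.
print(is_checked_by_size(8.0, 20, SIZE_CUTOFF_PX, SMALL_THRESHOLD, LARGE_THRESHOLD))   # True
print(is_checked_by_size(14.0, 40, SIZE_CUTOFF_PX, SMALL_THRESHOLD, LARGE_THRESHOLD))  # False
```

With separate thresholds, the 8% checked and 14% unchecked readings no longer conflict, provided the box size can be measured reliably.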
Like checkboxes, Transact can determine if a field contains a signature. Unlike checkboxes, this doesn’t seem to use a basic pixel count but rather a somewhat more sophisticated algorithm. In my testing, I moved the value box over actual signatures, blank fields, names, addresses, and machine print. The algorithm seems to look for handwritten letters. When the box was placed over machine print, handwritten numbers, or blank areas, false was returned. When it was placed over signatures or handwritten text with at least one letter, true was returned. This should be a good option when looking for human signatures. However, if documents can be filled in electronically and have machine print in the signature field, it may be best to use the checkbox detection feature and determine if it was signed based on the pixel percentage.