Leveraging Regular Expressions in Ephesoft

Recently, I was working on an Ephesoft project that required classification and extraction of wordy legal mortgage documents. The extraction consisted of many fields related to mortgage note documents, fields like payment and principal amounts, interest rates, addresses, maturity and payment dates, and borrower names. In the case of key/value and paragraph extraction in Ephesoft, the quality of your regular expressions is pivotal, and as the extraction effort went on, my expressions started getting more and more complicated but also much more effective.

Regular Expressions

First, let’s back up and talk about who uses regular expressions (also called regex by some enlightened few), what they are, when they’re used, why they’re used, and how they’re used.

Who Uses Regex?

The people who use regular expressions are typically people who are looking to match or pull out a pattern from some sort of group of characters or text. If you’ve ever run a grep command to grab lines from Linux command output, you are the person I am talking about.

What is Regex?

A regular expression is a string of characters that lays out a search pattern. A good way to think of regular expressions is like a big sorting machine that looks for plastic or other materials at recycling plants. A bunch of recycling materials (text) go through a series of tests (regex matching patterns) that isolate specific materials (characters you are looking for). The result is that the original material is sorted into piles that match one or more of the tests (the regex match results) and another pile of everything else that did not match (non-matching results).

When/Where/Why/How Should I Use Regex?

Regular expressions are best used when you’ve encountered a large amount of text that you need to search, or pull something out of. As I mentioned above, grep commands are a perfect example. By piping a command that outputs a large amount of text through a grep command, we can filter for only the lines we care for rather than having to search by scrolling through what could be hundreds of lines of text. If you’re looking to learn about regex or are looking for a good place to test your expressions, I recommend regexr.
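To make the grep comparison concrete, here is a minimal sketch of the same filtering idea in Python, using the standard `re` module. The `grep` helper name and the sample output lines are made up for illustration.

```python
import re

def grep(pattern, lines):
    """Return only the lines that contain a match for the pattern."""
    compiled = re.compile(pattern)
    return [line for line in lines if compiled.search(line)]

# Hypothetical command output, invented for this example.
output = [
    "INFO  service started",
    "ERROR disk quota exceeded",
    "INFO  request handled in 12 ms",
    "ERROR connection reset",
]

# Keep only the lines that start with the word ERROR.
errors = grep(r"^ERROR\b", output)
```

Everything that does not match simply falls into the "everything else" pile and is dropped.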

Leveraging Regex in Ephesoft

Ephesoft provides three main ways for you to leverage regular expressions. For classification, you can use the key-value page processing plugin. For extraction you have the KV extraction plugin and the paragraph extraction plugin.

All three of these strategies need two regular expressions: a key and a value. For extraction, the key is typically some text that stays consistent from document to document, whereas the value varies from one document to the next and therefore needs a more flexible regular expression.

For example, if you’re looking to extract the address of a business on an invoice, the key expression could be the word “Address” and the value expression would match the type of address you’re looking for.
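The key/value idea can be sketched in plain Python. This is only an illustration of the concept, not how Ephesoft implements its plugins: it assumes the value sits on the same line as the key, and the `extract_value` helper, the sample invoice text, and the simplified value pattern are all hypothetical.

```python
import re

def extract_value(text, key_pattern, value_pattern):
    """Find the key, then return the first value match that follows it on the same line."""
    for line in text.splitlines():
        key = re.search(key_pattern, line)
        if key is None:
            continue
        # Only look at the text to the right of the key.
        value = re.search(value_pattern, line[key.end():])
        if value is not None:
            return value.group(0)
    return None

# Made-up invoice text for this example.
invoice = (
    "Invoice No: 10045\n"
    "Address: 1234 South Baker St. Normandy, LA 12345\n"
    "Total: $2,346.12"
)

# Simplified value pattern: a house number, some text, then a five-digit zip code.
address = extract_value(invoice, r"Address", r"\d{2,}.{3,50}\s\d{5}")
```

The key expression stays fixed while the value expression carries all of the flexibility, which is the split described above.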

It’s also worth noting that Ephesoft comes with a library of commonly used regular expressions out of the box as well as a wonderful regex builder. The library contains expressions that will match phone numbers, names, email addresses, etc. You may notice that some of the regexes below are already found in the OOTB library. I found it was necessary to create our own expressions to home in on more specific results and add in some flexibility for inconsistent OCR results, among other things.

Five Regular Expressions Built for a Recent Ephesoft Project

When it comes to extracting data from lengthy legal mortgage documents, dollar values, interest rates, addresses, and dates are very common requirements that show up in a very unstructured and inconsistent manner. Because of this, the extraction quality really hinges on having some solid regular expressions for our extraction rules in Ephesoft. Here are some examples:

Interest Rates

\d{1,2}\.\d{3,5}\%

One of the simplest expressions I use, this regex looks for numbers with one to two digits followed by a decimal, three to five digits, and a percentage symbol. Matching values include 2.112% or 11.8999%.

Let’s Break it Down
Chunk 1 – Integers

This chunk looks for the integer part of the interest rate. Because it’d be extraordinarily rare to see an interest rate with three digits before the decimal point, this chunk only looks for one or two digits that precede a decimal point. Of the rate 11.8999%, this chunk will match 11.

Chunk 2 – Decimal

The simplest of all the chunks: this one only looks for a dot between chunks 1 and 3.

Chunk 3 – Decimal values

This chunk is looking for three to five digits that follow a dot and precede a percent symbol. The legal documents we had for development almost always had three decimal digits and some went up to five making three to five the best option for this chunk. Of the rate 11.8999%, this chunk will match 8999.

Chunk 4 – Percent symbol

Just as simple as chunk 2—this chunk is looking for a simple percent symbol that follows three to five decimal digits.
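The four chunks can be checked with a short Python sketch. The named groups are there only to label the chunks for inspection (inside Ephesoft the plain expression above is the one to use), and the sample sentence is made up. Python's `re` module is used here, so small differences from Ephesoft's own regex engine are possible.

```python
import re

# Named groups are used here only to label the four chunks for inspection.
INTEREST_RATE_CHUNKS = re.compile(
    r"(?P<integer>\d{1,2})"    # chunk 1: one or two digits
    r"(?P<dot>\.)"             # chunk 2: the decimal point
    r"(?P<decimals>\d{3,5})"   # chunk 3: three to five decimal digits
    r"(?P<percent>\%)"         # chunk 4: the percent symbol
)

# Made-up sentence in the style of a mortgage note.
text = "Interest accrues at a yearly rate of 11.8999%, reduced from 12.250%."

first = INTEREST_RATE_CHUNKS.search(text)
chunks = first.groupdict()
all_rates = [m.group(0) for m in INTEREST_RATE_CHUNKS.finditer(text)]

# One decimal digit is too few for chunk 3, so this rate is not matched.
too_short = INTEREST_RATE_CHUNKS.search("a rate of 4.5% per year")

# The pattern has no boundary check, so it can match inside a longer number.
inside_longer_number = INTEREST_RATE_CHUNKS.search("123.456%").group(0)
```

The last line is worth keeping in mind: the expression will happily pick 23.456% out of 123.456%, so a leading boundary may be worth adding if three-digit values can show up in your documents.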

Dollar Amounts

For values like payment and principal amounts, we needed some solid regular expressions that would be capable of pulling out amounts typically between $100 and tens of millions of dollars. In some cases, due to potentially unreliable results from the OCR engine, we needed these expressions to be a bit more flexible. Here’s what we used:

A Strict Option

\$(?:\d{1,3},?)?(?:\d{3},?)?(?:\d{3})(?:\.\d{2})

This regex will only match values that start with a dollar sign ($), end with cents after a decimal point, and are at least $100 and less than $1 billion. The commas are optional, but if there is no dollar sign or no decimal point followed by cents, this regex will not match, which means that when it does match, we can weight it highly in Ephesoft. For example: $512,392,126.98, $392,126.98, $2,346.12, and $100.00 will all match.

Before we go any further, it’s important for me to mention why we are using non-capturing groups in these cases. Ephesoft displays the text matched by a normal capturing group at the beginning of the extraction result. For example, if we have a regex that looks something like the following:

([a-z]\d[a-z])+

It will match text similar to a1ab2bc3c, with group 1 being c3c (a repeated capturing group keeps only its last repetition). Ephesoft would return an extraction result of c3c a1ab2bc3c. To prevent this extra bit of text and to just return a1ab2bc3c, we’d use a non-capturing group:

(?:[a-z]\d[a-z])+
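The difference between the two groups is easy to see in a few lines of Python. This sketch only shows what the regex engine itself reports (Python's `re` module here); how Ephesoft displays the result is as described above.

```python
import re

text = "a1ab2bc3c"

capturing = re.fullmatch(r"([a-z]\d[a-z])+", text)
non_capturing = re.fullmatch(r"(?:[a-z]\d[a-z])+", text)

whole_match = capturing.group(0)        # the entire matched text
last_repetition = capturing.group(1)    # a repeated group keeps only its final repetition

# The non-capturing version matches the same text but exposes no groups at all.
same_text = non_capturing.group(0)
group_count = non_capturing.re.groups
```

Both expressions match exactly the same text; the only difference is whether the engine remembers a separate group alongside the overall match.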

Let’s Break it Down

Because numbers are typically built up from right to left (even though we read them from left to right), we’ll start with chunk 5.

Chunk 5 – Cents

Of the value $512,392,126.98, this chunk will match .98

Probably the simplest of the chunks, this one looks for any two digits preceded by a decimal point (a period character). If this pattern cannot be satisfied, we will not find any values.

Chunk 4 – Hundreds

Of the value $512,392,126.98, this chunk will match 126.

This chunk is looking for any three digits that are followed by the decimal point and cents from chunk 5. When designing this regex, I figured we’d be better off looking for values with a minimum of $100 because the likelihood of someone making mortgage payments under $100 is very low and the potential for false matches in Ephesoft is too great if looking for values under $100. Like chunk 5, if we cannot find a match with this chunk, we will not grab any values.

Chunk 3 – If millions, three digits of thousands

Of the value $512,392,126.98, this chunk will match 392,.

Chunk 3 says, “Hey, if chunk 4 is found, and we found a value for millions in the second chunk, grab the three digits of thousands.” The only time this chunk will find a match is if the value is in the millions. This chunk is also optional.

Chunk 2 – One, two, or three digits of thousands or millions

Of the value $512,392,126.98, this chunk will match 512,.

This chunk says, “If the value is greater than $999 or $999,999, I should match the next one to three digits that come before the comma.” This group is optional, as denoted by the ? quantifier.

Chunk 1 – The dollar sign

Lastly, the dollar sign must be matched at the beginning of the value. If not, we cannot verify the number found is, in fact, a dollar value.
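Here is a small Python sketch that exercises the strict expression and shows which chunk grabs which piece. The second pattern is the same expression with each chunk placed in a named group purely for inspection; the `chunk1` through `chunk5` names and the non-matching samples are my own additions.

```python
import re

STRICT_DOLLAR = re.compile(r"\$(?:\d{1,3},?)?(?:\d{3},?)?(?:\d{3})(?:\.\d{2})")

# Same pattern with each chunk in a named group, purely to inspect the pieces.
STRICT_DOLLAR_CHUNKS = re.compile(
    r"(?P<chunk1>\$)"            # the dollar sign
    r"(?P<chunk2>\d{1,3},?)?"    # leading one to three digits
    r"(?P<chunk3>\d{3},?)?"      # thousands, only when the value is in the millions
    r"(?P<chunk4>\d{3})"         # hundreds
    r"(?P<chunk5>\.\d{2})"       # cents
)

samples = [
    "$512,392,126.98", "$392,126.98", "$2,346.12", "$100.00",  # expected to match
    "$99.99", "512,392.00", "$1,250",                          # expected to fail
]
results = {s: STRICT_DOLLAR.fullmatch(s) is not None for s in samples}

millions = STRICT_DOLLAR_CHUNKS.fullmatch("$512,392,126.98").groupdict()
thousands = STRICT_DOLLAR_CHUNKS.fullmatch("$2,346.12").groupdict()
```

For $2,346.12 the optional chunk 3 stays empty, which lines up with the note above that it only participates when the value is in the millions.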

A Looser Option

\$?\s?(?:\d{1,3}\s?)?(?:[,\.]?\s?\d{3}\s?)?(?:[,\.]?\s?\d{3}\s?)(?:[,\.]?\s?\d{2})?

This version of the original regex is built very similarly, though it allows a much wider variety of values to match. All of the values that match the first regex will also match this regex. For example, $23,235.23 and $987,123.99 will match, in addition to values like:

  • 23, 235 . 23
  • 12.134 , 12

The purpose of this regex is to use it within a lower weighted rule within Ephesoft and give the formatter an opportunity to clean up the value rather than not extracting it at all.
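To illustrate the "match loosely, then clean up" idea, here is a rough Python sketch. The `clean_amount` function is a hypothetical stand-in for a formatter and is not Ephesoft's; it assumes that a final separator followed by exactly two digits marks the cents.

```python
import re

LOOSE_DOLLAR = re.compile(
    r"\$?\s?(?:\d{1,3}\s?)?(?:[,\.]?\s?\d{3}\s?)?(?:[,\.]?\s?\d{3}\s?)(?:[,\.]?\s?\d{2})?"
)

def clean_amount(raw):
    """Hypothetical formatter: turn a loose match into a plain decimal string."""
    compact = re.sub(r"[\s$]", "", raw)           # drop whitespace and the dollar sign
    cents = re.search(r"[,.](\d{2})$", compact)   # last separator followed by exactly two digits
    if cents:
        whole, fraction = compact[:cents.start()], cents.group(1)
    else:
        whole, fraction = compact, "00"
    # Remove the remaining thousands separators, whichever character OCR produced.
    return re.sub(r"[,.]", "", whole) + "." + fraction

loose_samples = ["$23,235.23", "$987,123.99", "23, 235 . 23", "12.134 , 12"]
all_match = all(LOOSE_DOLLAR.fullmatch(s) for s in loose_samples)
cleaned = [clean_amount(s) for s in loose_samples]

# The loose pattern also accepts a bare three-digit number.
bare_number = LOOSE_DOLLAR.fullmatch("123") is not None
```

That last check is the reason for the lower weighting: with every piece except the three digits optional, the loose expression will also match plain numbers that are not dollar amounts at all.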

US Addresses

\d{2,}.{3,50}(?:Alabama|Alaska|Arizona|Arkansas|California|Colorado|Connecticut|Delaware|Florida|Georgia|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Mississippi|Missouri|Montana|Nebraska|Nevada|New\sHampshire|New\sJersey|New\sMexico|New\sYork|(?:North|South)\sCarolina|(?:North|South)\sDakota|Ohio|Oklahoma|Oregon|Pennsylvania|Rhode\sIsland|Tennessee|Texas|Utah|Vermont|Virginia|Washington|West\sVirginia|Wisconsin|Wyoming|(?:A[LKSZR])|(?:C[AOT])|(?:D[EC])|(?:F[ML])|(?:G[AU])|(?:HI)|(?:I[DLNA])|(?:K[SY])|(?:LA)|(?:M[EHDAINSOT])|(?:N[EVHJMYCD])|(?:MP)|(?:O[HKR])|(?:P[WAR])|(?:RI)|(?:S[CD])|(?:T[NX])|(?:UT)|(?:V[TIA])|(?:W[AVIY]))\s\d{5}(?:-\d{4})?

Though it’s the longest regex of the lot, it is far from the most complex. This regex is designed to match with any US address that contains a house number, a street name (optionally an apartment number or city up to 50 characters), a US state, and a five-digit zip code (with an optional four-digit suffix). Note that address parsing and extraction can be much more complex, both in the US and internationally. While this example is useful in practice, it will not handle every possible variation. There are whole products just for parsing addresses into their components. Check out this blog from one such company here.

One thing that makes this regex so long is that the state can appear either as an abbreviation (e.g., CA) or as the full name (e.g., California). This regex will match the following values:

1234 South Baker St. Normandy, Louisiana 12345

29 James Gooden Ave, APT 546 Longwood, DE 54327-0234

Let’s Break it Down
Chunk 1 – House numbers

This chunk matches the house numbers in the address, written as at least two digits. Of the address 1234 South Baker St. Normandy, Louisiana 12345, this chunk will match the 1234.

Chunk 2 – Street, apartment number, etc.

Notice that this chunk specifies between three and 50 of ANY character type. This is because of how inconsistently streets, apartment numbers, and cities appear in addresses. I figured that as long as I was able to grab chunk 1 (house number) and chunk 3 (a proper state), I wanted everything in between, up to 50 characters. Of the address 1234 South Baker St. Normandy, Louisiana 12345, this chunk will match South Baker St. Normandy,.

Chunk 3 – State

As mentioned above, this chunk of the regex takes up the most space due to all of the alternation (“or”) logic in place. This chunk is looking for a direct text match that will match any state either in long form or abbreviated. Of the address 1234 South Baker St. Normandy, Louisiana 12345, this chunk will match Louisiana.

Chunk 4 – Zip code

This chunk is looking for one mandatory piece and one optional piece. The first piece is five digits representing a normal, five-digit zip code. During testing, I noticed that almost half of the addresses on legal documents contain and use the additional four digits on the end that are preceded by a dash, so that was added to the final chunk. Of the address 29 James Gooden Ave, APT 546 Longwood, DE 54327-0234, this chunk will match 54327-0234.
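Because the address expression is so long, one way to keep it readable when testing is to assemble it from its four chunks. The sketch below does that in Python, with the state names and abbreviation patterns copied from the expression above and joined into a single non-capturing alternation. The variable names and the `alternation` helper are my own, and the assembled pattern is a restructured equivalent for testing, not the exact string you would paste into Ephesoft.

```python
import re

# Full state names and abbreviation patterns copied from the expression above.
STATE_NAMES = [
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
    "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
    "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine",
    "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi",
    "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey",
    "New Mexico", "New York", "North Carolina", "South Carolina",
    "North Dakota", "South Dakota", "Ohio", "Oklahoma", "Oregon",
    "Pennsylvania", "Rhode Island", "Tennessee", "Texas", "Utah", "Vermont",
    "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming",
]
STATE_ABBREVIATIONS = [
    "A[LKSZR]", "C[AOT]", "D[EC]", "F[ML]", "G[AU]", "HI", "I[DLNA]", "K[SY]",
    "LA", "M[EHDAINSOT]", "N[EVHJMYCD]", "MP", "O[HKR]", "P[WAR]", "RI",
    "S[CD]", "T[NX]", "UT", "V[TIA]", "W[AVIY]",
]

def alternation(parts):
    """Join the alternatives with |, writing spaces as \\s like the original."""
    return "|".join(part.replace(" ", r"\s") for part in parts)

US_ADDRESS = re.compile(
    r"\d{2,}"                 # chunk 1: house number, at least two digits
    r".{3,50}"                # chunk 2: street, apartment, city (any characters)
    r"(?:" + alternation(STATE_NAMES + STATE_ABBREVIATIONS) + r")"  # chunk 3: state
    r"\s\d{5}(?:-\d{4})?"     # chunk 4: zip code with optional four-digit suffix
)

full_name = "1234 South Baker St. Normandy, Louisiana 12345"
abbreviated = "29 James Gooden Ave, APT 546 Longwood, DE 54327-0234"

full_name_match = US_ADDRESS.search(full_name).group(0)
abbreviated_match = US_ADDRESS.search(abbreviated).group(0)

# Without a zip code after the state, nothing is extracted.
missing_zip = US_ADDRESS.search("1234 South Baker St. Normandy, Louisiana")
```

Keeping the states in a list also makes it easy to trim the expression down if a project only ever sees addresses from a handful of states.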

Dates

Verbose Dates

(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s(?:\d{1,2}(?:st|nd|rd|th)?)\s?(?:(?:[\s,.\-\/])\s?)?(?:(?:19[7-9]\d|20\d{2}))

On legal documents, I noticed a lot of verbose dates being thrown around. These dates include months, numerical days, and a full, four-digit year. This regex matches the following values:

February 13, 1984

Oct 2nd 2011

Let’s Break it Down
Chunk 1 – Month

This chunk looks for an exact match for month on a date. Notice that the month can be fully spelled out like “January,” or can be the three-letter abbreviation, “Jan.” Of the date Oct 2nd, 2001, this chunk will match Oct.

Chunk 2 – Day

This chunk has two parts, and the first is mandatory. It looks for a one- or two-digit number that follows the first chunk. The second part is optional and looks for an ordinal indicator (st, nd, rd, or th) following the day in the date. Of the date Oct 2nd, 2001, this chunk will match 2nd.

Chunk 3 – Punctuation

This chunk is responsible for grabbing the punctuation that happens to come after the day in the date. This chunk is highly variable from document to document, so there are a lot of potential characters we can grab here.

Chunk 4 – Year

Lastly, the fourth chunk of this regex is responsible for the year. In the case of the project I was working on, we didn’t have to worry about any dates before 1970, so this chunk only looks for four-digit years from 1970 through 2099. Of the date Oct 2nd, 2001, this chunk will match 2001.
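A quick Python sketch of the verbose date expression in action follows. The `to_date` helper is a hypothetical formatter that turns a matched string into a `datetime.date`; it is my own addition, and it assumes the text it receives has already matched the expression.

```python
import re
from datetime import date

VERBOSE_DATE = re.compile(
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?"
    r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)"
    r"\s(?:\d{1,2}(?:st|nd|rd|th)?)\s?(?:(?:[\s,.\-\/])\s?)?(?:(?:19[7-9]\d|20\d{2}))"
)

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def to_date(matched_text):
    """Hypothetical formatter: convert an already-matched verbose date to a date object."""
    month = MONTHS.index(matched_text[:3]) + 1                 # first three letters name the month
    day = int(re.search(r"\d{1,2}", matched_text).group(0))    # first run of digits is the day
    year = int(matched_text[-4:])                              # last four characters are the year
    return date(year, month, day)

# Made-up sentence for this example.
text = "This note is dated February 13, 1984 and was amended Oct 2nd 2011."
found = [m.group(0) for m in VERBOSE_DATE.finditer(text)]
parsed = [to_date(value) for value in found]

# Years before 1970 fall outside chunk 4, so this date is not matched.
too_early = VERBOSE_DATE.search("March 3, 1965")
```

Note that the expression does not validate the day itself, so something like Feb 31st 2011 would still match and would need to be rejected later (here `to_date` would raise a `ValueError`).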

Numerical Dates

\d{2}/\d{2}/\d{4}

Probably the easiest regex of the dates: this expression looks for two digits and a slash, followed by two digits and a slash, finishing up with four digits (the common MM/DD/YYYY or DD/MM/YYYY date format). The slashes can be swapped out for dashes or dots if needed.
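As a sketch of the "swap the slashes" idea, the second pattern below replaces each slash with a character class that accepts a slash, dash, or dot. This variant is my own suggestion rather than an expression from the project, and it deliberately avoids a capturing group, which means it will also accept mixed separators.

```python
import re

NUMERIC_DATE = re.compile(r"\d{2}/\d{2}/\d{4}")

# Assumed variant: each separator may be a slash, dash, or dot.
FLEXIBLE_NUMERIC_DATE = re.compile(r"\d{2}[/.\-]\d{2}[/.\-]\d{4}")

slash_date = NUMERIC_DATE.fullmatch("02/13/1984") is not None
single_digit_month = NUMERIC_DATE.fullmatch("2/13/1984") is not None

flexible_samples = ["02/13/1984", "02-13-1984", "13.02.1984", "02-13/1984"]
flexible_results = [FLEXIBLE_NUMERIC_DATE.fullmatch(s) is not None for s in flexible_samples]
```

If single-digit months or days show up in your documents, widening `\d{2}` to `\d{1,2}` is the usual adjustment.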

Legal Dates

(?:first|1(?:st)?)(?: day)? of (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)(?:(?:[\s,.\-\/])\D?)?(?:(?:19[7-9]\d|20\d{2}))

This pattern matches a date that was seen very rarely on documents and is more like a phrase. This date format moves the day to the beginning, in the form of “The first day of,” for example. Since the day is moved to the front, a separate day chunk after the month is no longer needed. This expression matches the following values:

First day of October, 2001

1st day of Nov, 2001

Let’s Break it Down
Chunk 1 – Day

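This chunk looks for an exact match on either “first” or “1st,” followed optionally by the word “day” and by “of,” which is mandatory. Note that the expression spells “first” in lowercase, so matching a capitalized “First” relies on the match being case-insensitive. For example, of the date First day of October, 2001, this chunk will match First day of.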

Chunks 2–4 (month, punctuation, and year) are similar to the corresponding chunks of the verbose date expression.
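The legal date expression can be tried out with the Python sketch below. Since the expression spells "first" in lowercase, the sketch compiles it twice, once as written and once with `re.IGNORECASE`, to show the difference. Whether Ephesoft matches case-insensitively is not something this sketch can tell you, so treat the flag as an assumption.

```python
import re

LEGAL_DATE_PATTERN = (
    r"(?:first|1(?:st)?)(?: day)? of "
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?"
    r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)"
    r"(?:(?:[\s,.\-\/])\D?)?(?:(?:19[7-9]\d|20\d{2}))"
)

case_sensitive = re.compile(LEGAL_DATE_PATTERN)
case_insensitive = re.compile(LEGAL_DATE_PATTERN, re.IGNORECASE)  # assumed matching mode

capitalized = "First day of October, 2001"
numeric = "1st day of Nov, 2001"

capitalized_strict = case_sensitive.search(capitalized)               # lowercase "first" only
capitalized_relaxed = case_insensitive.fullmatch(capitalized) is not None
numeric_strict = case_sensitive.fullmatch(numeric) is not None

# Chunk 1 has no leading boundary, so it also matches the tail of "21st".
partial = case_sensitive.search("21st day of May, 2001").group(0)
```

The last check shows a side effect of the first chunk: because `1(?:st)?` can start anywhere, a phrase such as "21st day of May, 2001" is picked up as the first of May, so it may be worth guarding against a preceding digit.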

Wrap Up

Well, there you have it. Regular expressions are extremely valuable tools that can help you in a number of different situations. Whether you are trying to match and pull out a date, a US address, or even dollar values, regexes can be as loose or as strict as you care to make them. And when leveraged with a K/V rule within Ephesoft, you can see the real power of a well-thought-out regular expression.

To learn more, or for help with your current Ephesoft implementation, contact us today.
