Regular Expressions

A regular expression (or RE) is a pattern describing a certain set of text strings. Regular expressions can be considered a type of advanced "Wildcard Search." Their name comes from the mathematical theory on which they are based.

In Bible Analyzer, regular expressions can be used for Bible text searches.

Regular expressions can contain both special and ordinary characters. Most ordinary characters, like A, a, or 0, are the simplest regular expressions; they simply match themselves. You can also combine ordinary characters, so last matches the string last in 'last' or 'blast'.

Examples:
Lord will match Lord, lord and LORD (unless Case Sensitive is selected)
or will match or, Lord, error, and any other word with the string or.
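The two examples above can be tried outside Bible Analyzer with a short Python sketch. It assumes the program's search behaves like Python's re module with case ignored unless Case Sensitive is selected; the sample sentence and variable names are made up for illustration.

```python
import re

# Made-up sample text, not a quotation.
text = "The LORD said unto my Lord, this is no error."

# Case-insensitive search (assumed to mirror Case Sensitive being off).
lord_hits = re.findall(r"Lord", text, flags=re.IGNORECASE)

# "or" is found anywhere, including inside longer words.
or_words = [w for w in re.findall(r"\w+", text)
            if re.search(r"or", w, flags=re.IGNORECASE)]

print(lord_hits)  # ['LORD', 'Lord']
print(or_words)   # ['LORD', 'Lord', 'error']
```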

Special Characters

Because we want to do more than simply search for literal pieces of text, we need to reserve certain characters for special use. Some characters, like "|" or "(", are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted. If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash. If you want to match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign will have a special meaning. There are more examples below.
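As a rough illustration of escaping, the Python sketch below compares the escaped and unescaped forms of the 1+1=2 example. It assumes an engine that follows Python's re syntax; the sample text is invented.

```python
import re

# Invented sample text containing both "1+1=2" and "111=2".
text = "We know that 1+1=2, but 111=2 is wrong."

# Unescaped: + means "one or more of the preceding 1",
# so the pattern finds "111=2" and misses "1+1=2".
unescaped = re.findall(r"1+1=2", text)

# Escaped: \+ stands for a literal plus sign.
escaped = re.findall(r"1\+1=2", text)

# re.escape adds the needed backslashes automatically.
auto_pattern = re.escape("1+1=2")
auto = re.findall(auto_pattern, text)

print(unescaped)  # ['111=2']
print(escaped)    # ['1+1=2']
print(auto)       # ['1+1=2']
```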

With a character class, also called a character set, you can tell the regex engine to match only one out of several characters. Simply place the characters you want to match between square brackets. If you want to match an a or an e, use [ae]. You could use this in gr[ae]y to match either gray or grey. This is very useful if you do not know whether the document you are searching through is written in American or British English, or if a word is often misspelled, for example sep[ae]r[ae]te or li[cs]en[cs]e.

You can use a hyphen inside a character class to specify a range of characters. [0-9] matches a single digit between 0 and 9. You can combine ranges and single characters in one class: [0-9a-zA-Z] matches a single digit or letter.

Negated Character Classes
Typing a caret after the opening square bracket will negate the character class. The result is that the character class will match any character that is not in the character class. Unlike the dot, negated character classes also match (invisible) line break characters.

It is important to remember that a negated character class still must match a character. q[^u] does not mean: "a q not followed by a u". It means: "a q followed by a character that is not a u". It will not match the q in the string Iraq. It will match the q and the space after the q in Iraq is a country. Indeed: the space will be part of the overall match, because it is the "character that is not a u" that is matched by the negated character class in the above regexp.

Examples:
gr[ae]y will match gray or grey.
lo[uv]e will match love or loue (AV1611 spelling).
sep[ae]r[ae]te will find seperate, separate, seperete, and separete.
lo[^u]e will find love but not loue
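The character class examples can be checked with a small Python sketch, assuming the search engine follows Python's re syntax. The word list is illustrative; note that lo[^u]e also accepts non-letters such as the 2 in lo2e.

```python
import re

# Illustrative word list.
words = ["gray", "grey", "griy", "love", "loue", "lo2e"]

# [ae] accepts exactly one character: a or e.
gray_hits = [w for w in words if re.fullmatch(r"gr[ae]y", w)]

# [^u] accepts exactly one character that is not u.
not_u_hits = [w for w in words if re.fullmatch(r"lo[^u]e", w)]

# A negated class must still consume a character, so the final q
# in "Iraq" does not match, while "q " (q plus space) does.
at_end = re.search(r"q[^u]", "Iraq")
mid_sentence = re.search(r"q[^u]", "Iraq is a country")

print(gray_hits)             # ['gray', 'grey']
print(not_u_hits)            # ['love', 'lo2e']
print(at_end)                # None
print(repr(mid_sentence.group()))  # 'q '
```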

You can use alternation to match a single regular expression out of several possible regular expressions. It differs from a character class in that REs with more than one character can be used. For example, cat|dog will find cat or dog.

Examples:
lord|God will find all verses with either Lord or God
mercy|grace|lo[uv]e will match mercy, grace, love or loue
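A minimal Python sketch of alternation, assuming Python's re syntax and case-insensitive matching; the sample text is invented and is not a quotation.

```python
import re

# Invented sample text.
text = "Mercy and grace and loue, but not lore."

# Each alternative is a complete RE; the last one uses a character class.
pattern = re.compile(r"mercy|grace|lo[uv]e", re.IGNORECASE)
found = pattern.findall(text)

print(found)  # ['Mercy', 'grace', 'loue']
```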

The dot (.) matches any single character except a newline. Because it is so broad, it can lead to unintended matches.

Examples:
lo.e will find love and loue, but also loqe or lo2e.
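To see how much broader the dot is than a character class, the Python sketch below runs both over the same illustrative list of strings. It assumes Python's re behaviour, where the dot does not match a newline by default.

```python
import re

# Illustrative candidates, including one with an embedded newline.
candidates = ["love", "loue", "loqe", "lo2e", "loe", "lo\ne"]

# The dot accepts any one character except a newline.
dot_hits = [c for c in candidates if re.fullmatch(r"lo.e", c)]

# A character class restricts the third character to u or v.
class_hits = [c for c in candidates if re.fullmatch(r"lo[uv]e", c)]

print(dot_hits)    # ['love', 'loue', 'loqe', 'lo2e']
print(class_hits)  # ['love', 'loue']
```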

The caret (^) and the dollar sign ($) anchor the location of the search: ^ matches only at the start of the string, and $ only at the end.

Examples:

^christ will match Christ in, 'Christ hath redeemed us...' (Gal. 3:13), but not Christ in, 'Paul, a servant of Jesus Christ, called...' (Rom. 1:1).

christ\W?$ (don't forget to allow for the punctuation, see below) will find Christ in, 'Be ye followers of me, even as I also am of Christ.' (1 Cor. 11:1), but not Christ in Gal. 3:13.
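The anchor examples can be reproduced with the Python sketch below. It assumes each verse is searched as a separate string and that case is ignored; the verse fragments are the ones quoted above, and the variable names are illustrative.

```python
import re

# Verse fragments as quoted in the examples above.
gal = "Christ hath redeemed us"
rom = "Paul, a servant of Jesus Christ, called"
cor = "Be ye followers of me, even as I also am of Christ."

starts_with = re.compile(r"^christ", re.IGNORECASE)
# \W? allows one optional punctuation character before the end.
ends_with = re.compile(r"christ\W?$", re.IGNORECASE)

start_results = [bool(starts_with.search(v)) for v in (gal, rom, cor)]
end_results = [bool(ends_with.search(v)) for v in (gal, rom, cor)]

print(start_results)  # [True, False, False]
print(end_results)    # [False, False, True]
```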

The question mark (?) causes the resulting RE to match 0 or 1 repetitions of the preceding RE; in other words, it makes the preceding token in the regular expression optional. For example, colou?r matches both colour and color. You can make several tokens optional by grouping them together using round brackets, and placing the question mark after the closing bracket.

Examples:
Lords? will find both Lord and Lords
Right(eousness)? will match Right and Righteousness.
To match a literal ? use \?
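A short Python sketch of the optional-token examples, assuming Python's re syntax. The sample sentence is invented, and the group is written as (?:...) only so that findall returns the whole match rather than the group contents.

```python
import re

# Invented sample text.
text = "The Lord of Lords is Right; his Righteousness endures. Why?"

# s? makes the final s optional.
lords = re.findall(r"Lords?", text)

# (?:eousness)? makes the whole group optional.
right = re.findall(r"Right(?:eousness)?", text)

# \? matches a literal question mark.
literal_q = re.findall(r"\?", text)

print(lords)      # ['Lord', 'Lords']
print(right)      # ['Right', 'Righteousness']
print(literal_q)  # ['?']
```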

The star (or asterisk) causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. The plus causes the resulting RE to match 1 or more repetitions of the preceding RE.

There is also an additional repetition operator that allows you to specify how many times a token can be repeated. The syntax is {min,max}, where min is a non-negative integer indicating the minimum number of matches, and max is an integer equal to or greater than min indicating the maximum number of matches. If the comma is present but max is omitted, the maximum number of matches is infinite. So {0,} is the same as *, and {1,} is the same as +. Omitting both the comma and max tells the engine to repeat the token exactly min times.

You could use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999. Notice the use of the word boundaries.

Examples:
ab* will match 'a', 'ab', or 'abbbbb...' (until something other than 'b' is encountered).
ab+ will match 'ab' or 'abbbbb...' It will not match just 'a'.
(lord ){3} will find lord lord lord followed by a space (remember to add the space and any possible punctuation for whole words). Without the round brackets, lord {3} would repeat only the space.
To match a literal * or +, put a backslash before it (\* or \+).
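The repetition operators can be compared side by side in Python, assuming Python's re syntax. The sample strings are illustrative. Braces apply only to the token directly before them, so a group is used to repeat a whole word.

```python
import re

samples = ["a", "ab", "abbbb"]

# * allows zero or more b's; + requires at least one.
star = [bool(re.fullmatch(r"ab*", s)) for s in samples]
plus = [bool(re.fullmatch(r"ab+", s)) for s in samples]

# {3} and {2,4} count repetitions of the preceding [0-9].
numbers = "7 42 100 1000 9999 12345 123456"
four_digit = re.findall(r"\b[1-9][0-9]{3}\b", numbers)
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", numbers)

# The group (?:lord ) is repeated three times as a unit.
repeated = re.search(r"(?:lord ){3}", "lord lord lord God")

print(star)            # [True, True, True]
print(plus)            # [False, True, True]
print(four_digit)      # ['1000', '9999']
print(three_to_five)   # ['100', '1000', '9999', '12345']
print(repr(repeated.group()))  # 'lord lord lord '
```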

The *, +, and ? quantifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding "?" after the quantifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using <.*?> in the previous expression will match only '<H1>'.
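The difference between greedy and non-greedy matching is easy to see in a two-line Python comparison, assuming Python's re syntax:

```python
import re

html = "<H1>title</H1>"

# Greedy: .* runs to the last > in the string.
greedy = re.match(r"<.*>", html).group()

# Non-greedy: .*? stops at the first > it can.
lazy = re.match(r"<.*?>", html).group()

print(greedy)  # <H1>title</H1>
print(lazy)    # <H1>
```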

By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together. This allows you to apply a regex operator, e.g. a repetition operator, to the entire group.

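(?=...) matches if ... matches next, but does not consume any of the string. This is a lookahead assertion.

Examples:
Jesus (?=Christ) will match Jesus only if it's followed by Christ (use Jesus\W+(?=Christ) to deal with any possible punctuation, e.g. Jesus, Christ).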

(?!...) matches if ... doesn't match next. This is a negative lookahead assertion.

Examples:
Jesus (?!Christ) will match Jesus only if it's not followed by Christ.
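Both lookahead forms can be tried with the Python sketch below, which assumes Python's re syntax. The sample sentence is invented. The text inside the lookahead is checked but is not part of the match.

```python
import re

# Invented sample text with three occurrences of "Jesus".
text = "Jesus Christ, Jesus wept, and Jesus answered."

# Positive lookahead: "Jesus " only where "Christ" comes next.
followed = [m.start() for m in re.finditer(r"Jesus (?=Christ)", text)]

# Negative lookahead: "Jesus " only where "Christ" does not come next.
not_followed = [m.start() for m in re.finditer(r"Jesus (?!Christ)", text)]

# The lookahead text is not consumed, so the match is just "Jesus ".
matched = re.search(r"Jesus (?=Christ)", text).group()

print(len(followed))      # 1
print(len(not_followed))  # 2
print(repr(matched))      # 'Jesus '
```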

\b matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character.
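A small Python sketch of whole-word searching with \b, assuming Python's re syntax; the sample sentence is invented.

```python
import re

# Invented sample text.
text = "The lord, the landlord and the lordship."

# Without boundaries, lord is also found inside longer words.
loose = re.findall(r"lord", text)

# With \b on both sides, only the standalone word is found.
whole = re.findall(r"\blord\b", text)

print(len(loose))  # 3
print(len(whole))  # 1
```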

With regular expressions you can describe almost any text pattern, including a pattern that matches two words near each other. This pattern consists of three parts: the first word, a certain number of unspecified words, and the second word.

An unspecified word can be matched with the shorthand character class \w+. The spaces and other characters between the words can be matched with \W+ (uppercase W this time).

The complete regular expression becomes \bword1\W+(?:\w+\W+){1,6}?word2\b . The quantifier {1,6}? makes the regex require at least one word between "word1" and "word2", and allow at most six words.

If the words may also occur in reverse order, we need to specify the opposite pattern as well:

\b(?:word1\W+(?:\w+\W+){1,6}?word2|word2\W+(?:\w+\W+){1,6}?word1)\b

If you want to find any pair of two words out of a list of words, you can use:

\b(word1|word2|word3)(?:\W+\w+){1,6}?\W+(word1|word2|word3)\b

This regex will also find a word near itself, e.g. it will match word2 near word2.

Examples:
\bJesus\W+(?:\w+\W+){1,6}?Christ\b will find Jesus and Christ in order, separated by at least one word and no more than six.
\b(Lord|Jesus|Christ)(?:\W+\w+){1,6}?\W+(Lord|Jesus|Christ)\b will match Lord, Jesus or Christ followed, after at least one and no more than six intervening words, by a second instance of Lord, Jesus or Christ.
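The proximity pattern can be wrapped in a small helper for experimenting. This is only a sketch assuming Python's re syntax; the function name near, its max_words parameter and the sample strings are hypothetical and are not part of Bible Analyzer.

```python
import re

def near(word1, word2, max_words=6):
    """Build a pattern for word1, then 1..max_words words, then word2."""
    w1, w2 = re.escape(word1), re.escape(word2)
    return re.compile(
        r"\b%s\W+(?:\w+\W+){1,%d}?%s\b" % (w1, max_words, w2),
        re.IGNORECASE,
    )

pattern = near("Jesus", "Christ")

# Invented test strings: two, zero and seven intervening words.
close = "Jesus is the Christ"
adjacent = "Jesus Christ"
far = "Jesus one two three four five six seven Christ"

results = [bool(pattern.search(s)) for s in (close, adjacent, far)]
print(results)  # [True, False, False]
```

Because the quantifier starts at 1, the two words directly next to each other do not match; changing {1,6} to {0,6} would allow that case as well.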

For more information about regular expressions, see this excellent website: http://www.regular-expressions.info/reference.html