Regular Expressions in Bible Analyzer
A regular expression (RE, or regex) is a pattern that describes a set of text strings.
REs can be thought of as an advanced kind of "Wildcard Search." Their name comes from
the mathematical theory on which they are based.
In Bible Analyzer, regular expressions can be used for Bible text searches.
Regular expressions can contain both special and ordinary characters. Most ordinary
characters, like A, a, or 0, are the simplest regular expressions; they simply match
themselves. You can also combine ordinary characters, so last matches the string last in 'last' or 'blast'.
Examples:
Lord will match Lord, LORD, and lord (unless Case Sensitive is selected)
or will match or, Lord, error, and any other word with the string or.
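The two examples above can be tried outside the program. The following is a minimal sketch that assumes Bible Analyzer's search behaves like Python's re module (this page does not name the engine) and that leaving Case Sensitive unselected corresponds to re.IGNORECASE. The word list is made up for illustration.

```python
import re

# Sample words chosen for illustration; this is not Bible Analyzer output.
words = ["Lord", "LORD", "error", "word", "God"]

# Case-insensitive search, assumed to mirror an unselected "Case Sensitive" box.
lord_hits = [w for w in words if re.search("Lord", w, re.IGNORECASE)]

# "or" matches anywhere inside a word, not only as a whole word.
or_hits = [w for w in words if re.search("or", w, re.IGNORECASE)]

print(lord_hits)  # ['Lord', 'LORD']
print(or_hits)    # ['Lord', 'LORD', 'error', 'word']
```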
Special Characters
Because we want to do more than simply search for literal pieces of text, we need to
reserve certain characters for special use. Some characters, like "|" or "(", are special.
Special characters either stand for classes of ordinary characters, or affect how the
regular expressions around them are interpreted. If you want to use any of these
characters as a literal in a regex, you need to escape them with a backslash. If you want
to match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign will have a special
meaning. There are more examples below.
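The effect of escaping can be checked with a short sketch. It assumes Python's re module as a stand-in for the search engine; re.escape is a Python convenience and is not claimed to be a Bible Analyzer feature.

```python
import re

text = "1+1=2"

# Unescaped, + means "one or more of the preceding 1", so the pattern
# describes strings such as "11=2" and fails on the literal text.
unescaped = re.search(r"1+1=2", text)

# Escaped, \+ matches a literal plus sign.
escaped = re.search(r"1\+1=2", text)

# re.escape adds the backslashes for you when the whole string is literal.
auto_pattern = re.escape(text)

print(unescaped)        # None
print(escaped.group())  # 1+1=2
print(re.fullmatch(auto_pattern, text) is not None)  # True
```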
'[ ]' — Character Classes or Character Sets
With a character class, also called character set, you can tell the regex engine to match
only one out of several characters. Simply place the characters you want to match
between square brackets. If you want to match an a or an e, use [ae]. You could use this
in gr[ae]y to match either gray or grey. Very useful if you do not know whether the
document you are searching through is written in American or British English. For
example, sep[ae]r[ae]te or li[cs]en[cs]e.
You can use a hyphen inside a character class to specify a range of characters. [0-9]
matches a single digit between 0 and 9. You can combine ranges and single characters:
[0-9a-zA-Z] matches any single digit or unaccented letter.
Negated Character Classes
Typing a caret after the opening square bracket will negate the character class. The
result is that the character class will match any character that is not in the character
class. Unlike the dot, negated character classes also match (invisible) line break
characters.
It is important to remember that a negated character class still must match a character. q[^u] does not mean: "a q not followed by a u". It means: "a q followed by a character
that is not a u". It will not match the q in the string Iraq. It will match the q and the space
after the q in Iraq is a country. Indeed: the space will be part of the overall match,
because it is the "character that is not a u" that is matched by the negated character
class in the above regexp.
Examples:
gr[ae]y will match gray or grey.
lo[uv]e will match love or loue (AV1611 spelling).
sep[ae]r[ae]te will find seperate, separate, seperete, and separete.
lo[^u]e will find love but not loue
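The character class examples, including the Iraq case described above, can be reproduced with the sketch below. It assumes Python's re module behaves like the program's search; the word list is illustrative only.

```python
import re

words = ["gray", "grey", "love", "loue"]

gray_hits = [w for w in words if re.search(r"gr[ae]y", w)]
love_hits = [w for w in words if re.search(r"lo[uv]e", w)]
not_u_hits = [w for w in words if re.search(r"lo[^u]e", w)]

# q[^u] needs a real character after the q, so a q at the very end fails.
q_at_end = re.search(r"q[^u]", "Iraq")
q_then_space = re.search(r"q[^u]", "Iraq is a country")

print(gray_hits)                    # ['gray', 'grey']
print(love_hits)                    # ['love', 'loue']
print(not_u_hits)                   # ['love']
print(q_at_end)                     # None
print(repr(q_then_space.group()))   # 'q ' (the space is part of the match)
```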
' | ' — Vertical Bar or Pipe (Alternation)
You can use alternation to match a single regular expression out of several possible
regular expressions. It differs from a character class in that each alternative can be
more than one character long. For example, cat|dog will find cat or dog.
Examples:
lord|God will find all verses with either Lord or God
mercy|grace|lo[uv]e will match mercy, grace, love or loue
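A small sketch of alternation, assuming Python's re module and case-insensitive matching; the sample phrases are invented for illustration and are not quotations.

```python
import re

# Illustrative phrases, not verse text.
phrase_one = "the LORD God"
phrase_two = "mercy and truth, grace and loue"

either = re.findall(r"lord|God", phrase_one, re.IGNORECASE)

# Alternatives may themselves contain other constructs, here a character class.
three_way = re.findall(r"mercy|grace|lo[uv]e", phrase_two, re.IGNORECASE)

print(either)     # ['LORD', 'God']
print(three_way)  # ['mercy', 'grace', 'loue']
```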
'.' — Dot (Any Character)
The dot matches any character except a newline. However, because of its broad
capabilities, it can lead to unintended matches.
Examples:
lo.e will find love and loue, but also loqe or lo2e.
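The sketch below contrasts the dot with a character class on the same candidates. It assumes Python's re module, where the dot skips newlines unless a special flag is set; the candidate strings are illustrative.

```python
import re

candidates = ["love", "loue", "loqe", "lo2e", "lo\ne"]

# The dot accepts any single character except a newline.
dot_hits = [c for c in candidates if re.search(r"lo.e", c)]

# A character class is narrower and avoids the unintended matches.
class_hits = [c for c in candidates if re.search(r"lo[uv]e", c)]

print(dot_hits)    # ['love', 'loue', 'loqe', 'lo2e']
print(class_hits)  # ['love', 'loue']
```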
'^' '$' — Caret, Dollar Sign (Location Anchors)
These two characters anchor the location of the match: ^ matches only at the start of
the string, and $ only at the end. In the examples below, the string is the text of a
single verse.
Examples:
^christ will match Christ in, 'Christ hath redeemed us...' (Gal. 3:13), but not Christ in,
'Paul, a servant of Jesus Christ, called...' (Rom. 1:1).
christ\W?$ (don't forget to allow for the punctuation, see below) will find Christ in, 'Be
ye followers of me, even as I also am of Christ.' (1 Cor. 11:1), but not Christ in Gal. 3:13.
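The anchor examples can be sketched as follows, assuming Python's re module, case-insensitive matching, and one verse per searched string. The strings are only the fragments quoted above, not complete verses.

```python
import re

# Fragments as quoted in the examples above (not full verses).
gal = "Christ hath redeemed us"
rom = "Paul, a servant of Jesus Christ, called"
cor = "Be ye followers of me, even as I also am of Christ."

at_start = re.compile(r"^christ", re.IGNORECASE)
at_end = re.compile(r"christ\W?$", re.IGNORECASE)  # \W? allows the final period

print(bool(at_start.search(gal)))  # True
print(bool(at_start.search(rom)))  # False
print(bool(at_end.search(cor)))    # True
print(bool(at_end.search(gal)))    # False
```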
'?' — Question Mark (Optional Items)
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. Thus the
question mark makes the preceding token in the regular expression optional.
E.g.: colou?r matches both colour and color. You can make several tokens optional by
grouping them together using round brackets, and placing the question mark after the
closing bracket.
Examples:
Lords? will find both Lord and Lords
Right(eousness)? will match Right and Righteousness.
To match a literal ? use \?
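A sketch of the optional-item examples, assuming Python's re module. re.fullmatch is used so that only the exact word forms count; the word list and the final question are illustrative.

```python
import re

words = ["Lord", "Lords", "Right", "Righteousness", "color", "colour"]

lord_forms = [w for w in words if re.fullmatch(r"Lords?", w)]
right_forms = [w for w in words if re.fullmatch(r"Right(eousness)?", w)]
colour_forms = [w for w in words if re.fullmatch(r"colou?r", w)]

# A backslash turns ? back into an ordinary character.
literal_question = re.search(r"thou\?", "Whither goest thou?")

print(lord_forms)                    # ['Lord', 'Lords']
print(right_forms)                   # ['Right', 'Righteousness']
print(colour_forms)                  # ['color', 'colour']
print(literal_question is not None)  # True
```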
'*' '+' '{}' — Star, Plus, and Curly Braces (Repetition)
The star (or asterisk) causes the resulting RE to match 0 or more repetitions of the
preceding RE, as many repetitions as are possible. The plus causes the resulting RE to
match 1 or more repetitions of the preceding RE.
There is also an additional repetition operator that allows you to specify how many times
a token can be repeated. The syntax is {min,max}, where min is a non-negative integer
number indicating the minimum number of matches, and max is an integer equal to or
greater than min indicating the maximum number of matches. If the comma is present
but max is omitted, the maximum number of matches is infinite. So {0,} is the same as *,
and {1,} is the same as +. Omitting both the comma and max tells the engine to repeat
the token exactly min times.
You could use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999. Notice the use of the word
boundaries.
Examples:
ab* will match 'a', 'ab', or 'abbbbb...' (until something other than 'b' is encountered).
ab+ will match 'ab' or 'abbbbb...' It will not match just 'a'.
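(lord ){3} will find lord lord lord (remember to add the space and any possible
punctuation for whole words). The round brackets are needed because a quantifier
applies only to the token immediately before it: lord {3} would look for lord followed
by three spaces.
To match a literal * or +, put a backslash before it (\* or \+).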
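The repetition operators can be compared side by side. This sketch assumes Python's re module; the number lists and the repeated-word string are illustrative.

```python
import re

samples = ["a", "ab", "abbb"]
star_hits = [s for s in samples if re.fullmatch(r"ab*", s)]
plus_hits = [s for s in samples if re.fullmatch(r"ab+", s)]

# {3} means exactly three digits after the leading non-zero digit.
four_digit = re.findall(r"\b[1-9][0-9]{3}\b", "7 42 1000 9999 12345")

# {2,4} means between two and four digits after the leading digit.
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", "7 42 100 99999 123456")

# Grouping makes {3} apply to the whole word plus its trailing space.
triple = re.search(r"(lord ){3}", "lord lord lord God")

print(star_hits)             # ['a', 'ab', 'abbb']
print(plus_hits)             # ['ab', 'abbb']
print(four_digit)            # ['1000', '9999']
print(three_to_five)         # ['100', '99999']
print(repr(triple.group()))  # 'lord lord lord '
```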
*?, +?, ?? — Dealing With Greediness
The *, +, and ? quantifiers are all greedy; they match as much text as possible.
Sometimes this behaviour isn't desired: if the RE <.*> is matched against
'<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding "?" after
the quantifier makes it perform the match in non-greedy or minimal fashion; as few
characters as possible will be matched. Using <.*?> in the previous example will match
only '<H1>'.
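The difference between greedy and non-greedy matching shows up directly in a short sketch, assuming Python's re module:

```python
import re

html = "<H1>title</H1>"

# Greedy: .* runs to the last > it can reach.
greedy = re.search(r"<.*>", html).group()

# Non-greedy: .*? stops at the first > that completes the match.
minimal = re.search(r"<.*?>", html).group()

print(greedy)   # <H1>title</H1>
print(minimal)  # <H1>
```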
'( )' — Round Brackets (Grouping)
By placing part of a regular expression inside round brackets or parentheses, you can
group that part of the regular expression together. This allows you to apply a regex
operator, e.g. a repetition operator, to the entire group.
Examples:
Right(eousness)? will match Right and Righteousness.
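To see what grouping changes, the sketch below applies the same quantifier with and without round brackets. It assumes Python's re module; group(1) is Python's way of reading back the bracketed part and is not claimed to be exposed by Bible Analyzer.

```python
import re

# With brackets, + repeats the whole pair "ab".
grouped = re.fullmatch(r"(ab)+", "ababab") is not None

# Without brackets, + repeats only the "b".
ungrouped = re.fullmatch(r"ab+", "ababab") is not None

# The optional group either takes part in the match or it does not.
long_form = re.search(r"Right(eousness)?", "Righteousness")
short_form = re.search(r"Right(eousness)?", "Right")

print(grouped)              # True
print(ungrouped)            # False
print(long_form.group(1))   # eousness
print(short_form.group(1))  # None
```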
(?=...) — Positive Lookahead
Matches if ... matches next, but doesn't consume any of the string. This is called a
lookahead assertion.
Examples:
Jesus (?=Christ) will match Jesus only if it's followed by Christ (use Jesus\W+(?=Christ) to deal with any possible punctuation, e.g. Jesus, Christ, where both a comma and a space separate the two words).
(?!...) — Negative Lookahead
Matches if ... doesn't match next. This is a negative lookahead assertion.
Examples:
Jesus (?!Christ) will match Jesus only if it's not followed by Christ.
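Both lookahead forms can be compared on one string. This is a sketch assuming Python's re module; the sample phrase is invented for illustration.

```python
import re

# Illustrative phrase, not a quotation.
text = "Jesus Christ and Jesus said"

followed = re.compile(r"Jesus (?=Christ)")
not_followed = re.compile(r"Jesus (?!Christ)")

followed_at = [m.start() for m in followed.finditer(text)]
not_followed_at = [m.start() for m in not_followed.finditer(text)]

# The lookahead checks for Christ but leaves it out of the match itself.
matched_text = followed.search(text).group()

print(followed_at)         # [0]
print(not_followed_at)     # [17]
print(repr(matched_text))  # 'Jesus '
```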
Special Sequences
These shortcuts can be used in place of Character Classes and other characters.
\A
Matches only at the start of the string.
\b
Matches the empty string, but only at the beginning or end of a word. A word is defined
as a sequence of alphanumeric or underscore characters (the characters matched by \w),
so the end of a word is indicated by whitespace or any other non-word character.
\B
Matches the empty string, but only when it is not at the beginning or end of a word.
\d
Matches any decimal digit; this is equivalent to the set [0-9].
\D
Matches any non-digit character; this is equivalent to the set [^0-9].
\s
Matches any whitespace character; this is equivalent to the set [ \t\n\r\f\v].
\S
Matches any non-whitespace character; this is equivalent to the set [^ \t\n\r\f\v].
\w
Matches any alphanumeric character; this is equivalent to the set [a-zA-Z0-9_].
\W
Matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_].
\Z
Matches only at the end of the string.
\\
Matches a literal backslash.
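Several of these sequences are exercised in the sketch below, which assumes Python's re module. Note that in Python 3 the classes \d, \w and \s also cover non-ASCII characters unless re.ASCII is given; whether Bible Analyzer does the same is not stated here. The sample strings are illustrative.

```python
import re

sample = "In 1611, the  AV\twas printed."

digits = re.findall(r"\d+", sample)
words = re.findall(r"\w+", sample)
pieces = re.split(r"\s+", "the  AV\twas")

# \b requires a word edge on both sides, so Lord inside Lords is rejected.
whole_word = re.search(r"\bLord\b", "the Lord, thy God") is not None
inside_word = re.search(r"\bLord\b", "Lords") is not None

# \B is the opposite: "or" must sit strictly inside a word.
inner_or = re.search(r"\Bor\B", "Lord") is not None

starts = re.search(r"\AIn", sample) is not None
ends = re.search(r"printed\W\Z", sample) is not None
backslash = re.search(r"\\", "a\\b") is not None

print(digits)  # ['1611']
print(words)   # ['In', '1611', 'the', 'AV', 'was', 'printed']
print(pieces)  # ['the', 'AV', 'was']
print(whole_word, inside_word, inner_or)  # True False True
print(starts, ends, backslash)            # True True True
```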
Words Near Each Other
With regular expressions you can describe almost any text pattern, including a pattern
that matches two words near each other. This pattern consists of three parts: the first
word, a certain number of unspecified words, and the second word.
An unspecified word can be matched with the shorthand character class \w+. The
spaces and other characters between the words can be matched with \W+ (uppercase W
this time).
The complete regular expression becomes \bword1\W+(?:\w+\W+){1,6}?word2\b . The
quantifier {1,6}? makes the regex require at least one word between "word1" and
"word2", and allow at most six words.
If the words may also occur in reverse order, we need to specify the opposite pattern as
well:
\b(?:word1\W+(?:\w+\W+){1,6}?word2|word2\W+(?:\w+\W+){1,6}?word1)\b
If you want to find any pair of two words out of a list of words, you can use:
\b(word1|word2|word3)(?:\W+\w+){1,6}?\W+(word1|word2|word3)\b. This regex will
also find a word near itself, e.g. it will match word2 near word2.
Examples:
· \bJesus\W+(?:\w+\W+){1,6}?Christ\b will find Jesus and Christ in order separated
by at least one word and no more than six.
· \b(Lord|Jesus|Christ)(?:\W+\w+){1,6}?\W+(Lord|Jesus|Christ)\b will match Lord, Jesus or Christ separated by at least one word and no more than six before a
second instance of Lord, Jesus or Christ.
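The either-order pattern above can be built for any two words with a small helper. The function name near and its max_between parameter are hypothetical, introduced only for this sketch; it assumes Python's re module and case-insensitive matching, and the test phrases are illustrative.

```python
import re

def near(word1, word2, max_between=6):
    """Hypothetical helper: compile the either-order proximity pattern.

    Requires at least one and at most max_between words between the two.
    """
    w1, w2 = re.escape(word1), re.escape(word2)
    gap = r"\W+(?:\w+\W+){1,%d}?" % max_between
    pattern = r"\b(?:%s%s%s|%s%s%s)\b" % (w1, gap, w2, w2, gap, w1)
    return re.compile(pattern, re.IGNORECASE)

finder = near("Jesus", "Christ")

in_order = finder.search("Jesus, who is called Christ") is not None
reversed_order = finder.search("Christ is the name of Jesus") is not None
adjacent = finder.search("Jesus Christ") is not None           # no word between
too_far = finder.search("Jesus a b c d e f g Christ") is not None  # seven between

print(in_order, reversed_order)  # True True
print(adjacent, too_far)         # False False
```

Adjacent words are rejected because the pattern demands at least one intervening word; changing {1,6} to {0,6} would accept them.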