Version 1.9 Build 1556





Regular Expressions

Regular expressions are the primary means of manipulating and matching strings in Glish. Glish's regular expressions are based on code from Perl (version 5.004_04). If you are comfortable with Perl's regular expressions, you should have little problem with regular expressions in Glish. This section describes Glish's regular expressions and highlights the few differences between regular expressions in Glish and regular expressions in Perl.


Ranges

Regular expressions describe patterns of characters. The general form of a regular expression is:
    m/[_0-9a-zA-Z]+/
The m indicates that this regular expression is intended for matching (s would indicate substitution, covered next). The characters between the slashes (/) indicate the pattern of characters to match. This example matches a run of alpha-numeric characters (including the underscore). The square brackets are used to describe ranges of characters. Often, more than one range is included within a pair of square brackets; in this case, four character ranges are included:
    _          0-9          a-z          A-Z
The underscore is a degenerate range which includes only one character. The range 0-9 includes the characters which make up numbers, and the ranges a-z and A-Z include all of the lower and upper case letters. These are the characters which are permitted in the variable names in Glish. [_0-9a-zA-Z] only matches a single alpha-numeric character, though. The plus sign following the range indicates 1 or more occurrences of the character or group preceding it, in this case our alpha-numeric range. So [_0-9a-zA-Z]+ will match one or more occurrences, but it will match these characters anywhere in the string. To indicate that the pattern should be matched at the start or the end of the string, ^ and $ (respectively) are used:
    m/^[_0-9a-zA-Z]+$/
This indicates that the string that is being matched should be alpha-numeric from start to end.

Character ranges can also be used to specify anything but the characters in the range. This is done by putting a ^ as the first character in the range, e.g.

    m/^[^_0-9a-zA-Z]+$/
would match a string which contained no alpha-numeric or underscore characters.
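
The effect of anchoring and of negating a range can be tried out in any Perl-style regular expression engine. The sketch below uses Python's re module purely as a stand-in for Glish's m/.../ form; it is not Glish code, and the helper name matches is made up for illustration.

```python
import re

# Python's re module stands in for Glish's m/.../ form (an assumption
# made for illustration; this is not Glish code).
WORD_ANYWHERE = re.compile(r"[_0-9a-zA-Z]+")      # unanchored
WORD_ONLY = re.compile(r"^[_0-9a-zA-Z]+$")        # anchored at both ends
NO_WORD_CHARS = re.compile(r"^[^_0-9a-zA-Z]+$")   # negated range, anchored


def matches(pattern, text):
    """Hypothetical helper: True if the pattern matches somewhere in text."""
    return pattern.search(text) is not None


print(matches(WORD_ANYWHERE, "a-b"))   # True: "a" alone is enough
print(matches(WORD_ONLY, "a-b"))       # False: "-" is outside the range
print(matches(WORD_ONLY, "my_var1"))   # True
print(matches(NO_WORD_CHARS, "-+*"))   # True
print(matches(NO_WORD_CHARS, "a-b"))   # False
```

The unanchored pattern succeeds as soon as any one permitted character is found, which is why the anchors are needed to validate a whole string.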

The most often used element of regular expressions, the period, also falls into the category of ranges. In a regular expression, the period matches any character except a newline. So:

    m/./
would match the same thing as:
    m/[^\n]/
A period inside of a range is just a period, but elsewhere in a regular expression the period must be escaped, i.e. ``\.'', to match a literal period rather than any non-newline character. (See § 4.9.5 for other escape sequences.)


Substitution

The portions of the string which match the regular expression can be substituted for a different string. The syntax is much the same as matching with a regular expression. For example:
    s/[_0-9a-zA-Z]+/ALNUM/
would substitute ALNUM for the first occurrence of an alpha-numeric substring. If you wished to substitute ALNUM for every occurrence of an alpha-numeric, you would add the g flag:
    s/[_0-9a-zA-Z]+/ALNUM/g
This would cause each occurrence to be substituted. This flag is known as the global flag because it causes the regular expression to be applied everywhere in the string. The regular expression is applied as many times as possible.
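
As a rough illustration of the difference the g flag makes, here is the same substitution in Python's re module, used only as an analogue of Glish's s/.../.../ form. The count argument belongs to Python, not Glish, and the sample string is invented.

```python
import re

ALNUM = re.compile(r"[_0-9a-zA-Z]+")
text = "foo + bar_2"  # invented sample input

# count=1 mimics s/[_0-9a-zA-Z]+/ALNUM/ (first occurrence only).
first_only = ALNUM.sub("ALNUM", text, count=1)
# No count mimics s/[_0-9a-zA-Z]+/ALNUM/g (every occurrence).
everywhere = ALNUM.sub("ALNUM", text)

print(first_only)   # ALNUM + bar_2
print(everywhere)   # ALNUM + ALNUM
```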

Along with the global flag there is one other flag, i. This flag causes case insensitive matching:

    m/foo/i
The i flag will cause this regular expression to match either upper or lower case letters, e.g. foo, Foo or FoO. A regular expression can also be made case insensitive by adding ``(?i)'' to the beginning of it. This would be equivalent to the previous example:
    m/(?i)foo/


Grouping and Alternation

Parentheses are used to group a series of match alternatives, and the alternatives are separated by vertical bars:
    m/eat (green eggs|ham)/
This regular expression would match either eat green eggs or eat ham. Another important quality of parentheses in regular expressions is that the portion of the string which is matched by the regular expression inside of the parentheses is saved for later use. For example:
    s/eat (green eggs|ham)/I don't like $1/g
would substitute each occurrence of eat green eggs or eat ham with I don't like green eggs or I don't like ham. The matched parentheses are bound to the substitution variables, i.e. $1, $2, $3, etc., in the order that they occur. This allows portions of the matched string to be used in the substitution.

Parentheses are used wherever values must be preserved for later use. For example:

    s/^([0-9]+) ([a-z]+)/$2 $1/g
would reverse the order of a number followed by a lower-case name occurring at the beginning of a string. These matched portions of the string are also available in the Glish script after the match is done using the match variable, $m (see § 4.9.7).
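
The two substitutions above can be sketched in Python's re module, which is used here only as an approximation of the Glish behavior. Python writes the substitution variables as \1 and \2 rather than $1 and $2, and the sample strings are invented.

```python
import re

# \1 and \2 in Python's replacement string play the role of $1 and $2.
swapped = re.sub(r"^([0-9]+) ([a-z]+)", r"\2 \1", "42 apples left", count=1)
print(swapped)  # apples 42 left

disliked = re.sub(r"eat (green eggs|ham)", r"I don't like \1",
                  "eat ham, eat green eggs")
print(disliked)  # I don't like ham, I don't like green eggs

# The captured text can also be read from the match object afterwards,
# loosely comparable to inspecting $m in Glish.
found = re.search(r"eat (green eggs|ham)", "please eat ham")
print(found.group(1))  # ham
```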

If you need to group alternatives but do not need to use the matched portion later, you can use:

    s/eat (?:green eggs|ham)/yuck!/g
This example would substitute yuck! for each occurrence of eat green eggs or eat ham. No strings are preserved when the opening parenthesis is ``(?:'' instead of just ``(''. These non-saving parentheses are often useful in regular expressions where nested parentheses are involved, because they make it clearer to the reader which of the nested parentheses are bound to which substitution variables.

In the previous examples, grouping was only required because of the initial string ``eat ''. The last example could be rewritten equivalently as:

    s/eat green eggs|eat ham/yuck!/g


Repetition

In the examples above, you have already seen one example of repetition, i.e. the use of + to indicate one or more occurrences. However, there are a few other possibilities; these are shown in Table 4.1.
Table 4.1: Repetition Operators
    Operator    Occurrences
    +           one or more
    *           zero or more
    ?           zero or one
    {x}         exactly x
    {x,}        x or more
    {x,y}       at least x but no more than y


Any of these repetition operators can be used with character ranges, as above, or with groups. For example, this:
    s/(?:eat green eggs|eat ham)+/yuck!/
will replace a single sequence of one or more occurrences of eat green eggs or eat ham with yuck!.
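
A small sketch of repetition applied to a group and to a range, again using Python's re module as a stand-in for Glish. The input strings and the is_short_number helper are invented for illustration.

```python
import re

# Repetition applied to a non-saving group: one run of adjacent
# alternatives is replaced as a unit (count=1 mimics the missing g flag).
text = "eat hameat ham and eat ham"
replaced = re.sub(r"(?:eat green eggs|eat ham)+", "yuck!", text, count=1)
print(replaced)  # yuck! and eat ham


def is_short_number(s):
    """Hypothetical helper: True if s is two to four digits, via {x,y}."""
    return re.search(r"^[0-9]{2,4}$", s) is not None


print(is_short_number("123"))    # True
print(is_short_number("12345"))  # False
```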


Escape Sequences

The backslash character (\) can be used to escape any character. This means that the character stands for itself instead of its special meaning, e.g. ``\.'' is used to match a period in a regular expression instead of any character (the standard meaning of ``.''). There are, however, a number of escaped characters which have special meaning. Table 4.2 lists the basic escape sequences which can be used in both strings and regular expressions.
Table 4.2: Standard Escapes
    Character   Match
    \n          new line
    \t          tab
    \r          carriage return
    \f          form feed
    \v          vertical space
    \e          escape character
    \a          bell


In addition to these standard escape characters, there is a set of escape sequences which can be used to simplify regular expressions. These escapes are equivalent to longer regular expressions (see Table 4.3). These characters are available to make regular expressions easier to read.

Table 4.3: Regular Expression Escapes
    Character   Match            Equivalent Form
    \w          word             [_a-zA-Z0-9]
    \W          non-word         [^_a-zA-Z0-9]
    \s          whitespace       [\ \t\n\r\f]
    \S          non-whitespace   [^\ \t\n\r\f]
    \d          digit            [0-9]
    \D          non-digit        [^0-9]



Greedy and Lazy Matching

Generally, regular expressions do greedy matching; they match as many characters as possible. All of the examples presented so far do greedy matching. The regular expressions in Glish (and of course Perl) can do lazy matching as well as greedy matching. With lazy matching, as few characters as possible are matched while still having the whole regular expression match. The lazy versions of the greedy operators are obtained by adding a question mark; see Table 4.4.
Table 4.4: Lazy Repetition Operators
    Operator    Occurrences
    +?          one or more
    *?          zero or more
    ??          zero or one
    {x}?        exactly x
    {x,}?       x or more
    {x,y}?      at least x but no more than y


Here is a lazy example to illustrate the difference:

    s/.*?enum\s+(\S+).*\n?$/$1/
This example is attempting to match enum followed by a name, i.e. ``\S+''; this might be done when scanning C++ source code. The lazy match of any characters at the start of the regular expression means that all of those initial characters at the beginning of the string will be matched and discarded as part of the substitution. If a greedy match is used instead:
    s/.*enum\s+(\S+).*\n?$/$1/
it will generally work, but if there is more than one enum on the line, the enums at the beginning of the string would be stripped off as part of the initial ``.*'' and lost, with only the last enum being matched. Often, several lines of greedy regular expressions can be replaced with a one-line lazy regular expression.
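
The case of several enums on one line can be checked with a short sketch. It uses Python's re module as an analogue of the two Glish substitutions above, with \1 in place of $1 and an invented input line.

```python
import re

line = "enum first { A }; enum second { B };"  # invented input

# Lazy leading ".*?" stops at the first "enum"; greedy ".*" runs on to
# the last one. count=1 mimics the absence of the g flag.
lazy = re.sub(r".*?enum\s+(\S+).*\n?$", r"\1", line, count=1)
greedy = re.sub(r".*enum\s+(\S+).*\n?$", r"\1", line, count=1)

print(lazy)    # first
print(greedy)  # second
```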


Application

Thus far, all of the discussion has been in general terms. This section describes how regular expressions are treated inside of Glish and how to apply these regular expressions to strings.

~ Operator

Regular expressions are treated as first class variables in Glish. This means that they can be created, assigned, and passed to functions; they are just like any other value. A regular expression is created like:

    x := s/.*?enum\s+(\S+).*\n?$/$1/
After this, x is a regular expression, and it has type regex. The is_regex() function is used to check to see if a variable is a regular expression or not. Regular expressions are applied to strings using the regular expression application operator ~ as follows:
    '/* best */ enum foo { A=1, B, C };' ~ x
The result of this application is foo; $m also equals foo because of the parentheses.

The last example illustrated how regular expressions can be assigned to variables. Assignment isn't required, though; regular expressions can be used in place:

    'eat green eggs' ~ s/eat (green eggs|ham)/I don't like $1/g
The result here is I don't like green eggs.

The ~ operator is also used to apply a regular expression which does a match. If the match has a global flag, an integer is returned indicating how many successful matches were made. With no global flag, a boolean is returned indicating if a match was made. So:

    x := 'green'
    x ~ m/r/
yields T because 'green' contains the letter r,
    x ~ m/r/g
yields 1 because the global flag is used, and
    x ~ m/e/g
yields 2 because 'green' contains two letter e's. Both of these types of results are useful. The boolean result can be useful for masking off strings which match particular criteria, and the integer result is useful for counting the number of occurrences of a particular substring.
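
The boolean and integer results can be imitated in Python, which has no single operator like ~ and instead offers separate functions for searching and for finding all matches. This is a sketch of the idea only, and the list of words is invented.

```python
import re

x = "green"

has_r = re.search(r"r", x) is not None   # like x ~ m/r/
count_r = len(re.findall(r"r", x))       # like x ~ m/r/g
count_e = len(re.findall(r"e", x))       # like x ~ m/e/g
print(has_r, count_r, count_e)  # True 1 2

# A boolean per element can serve as a mask over several strings.
words = ["green", "blue", "red"]
mask = [re.search(r"r", w) is not None for w in words]
print(mask)  # [True, False, True]
```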

This operator can also be used as a unary operator. In this case, the regular expression is applied to the variable ``_''. This example:

    _ := 'green eggs'
    ~ s/e/x/g
yields grxxn xggs; each e in the underscore variable, _, is replaced with x. (See § 8.1.2 for more information and examples.)

Regular Expression Vectors

Regular expressions can also be put into a vector:
    y := [ s/^\s+//, s/\s+$//, s/\s{2,}/ /g ]
and a vector of regular expressions can be applied to a string; the regular expressions are applied one after the other. In this example:
    '   this    string   ' ~ y
y strips off leading and trailing whitespace, and it replaces each run of two or more whitespace characters in the middle of the string with a single space. When multiple regular expressions are applied, the match variable $m accumulates all of the matched parentheses values and stores them either as a vector or a two dimensional array.
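
Applying a vector of substitutions one after the other amounts to a simple loop. The sketch below shows that idea in Python rather than Glish; the STEPS list and the apply_all helper are hypothetical names, and count=0 is Python's way of saying "replace every occurrence".

```python
import re

# Each entry is (pattern, replacement, count); count=0 means "replace all".
STEPS = [
    (r"^\s+", "", 1),      # like s/^\s+//
    (r"\s+$", "", 1),      # like s/\s+$//
    (r"\s{2,}", " ", 0),   # like s/\s{2,}/ /g
]


def apply_all(text, steps):
    """Apply each substitution in order, feeding the result forward."""
    for pattern, replacement, count in steps:
        text = re.sub(pattern, replacement, text, count=count)
    return text


cleaned = apply_all("   this    string   ", STEPS)
print(repr(cleaned))  # 'this string'
```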

=~ and !~ Operators

With the ~ operator, the string that is being operated on is not modified. The regular expression is applied, and a string is returned in the case of substitution while a boolean or integer is returned in the case of a match. The =~ operator, however, modifies the string in place. For example:

    x := 'green'
    x =~ s/e/x/g
yields 2, the number of substitutions made. After the application of the regular expression, x equals 'grxxn'. The result of this substitution is the number of matches if the global flag is used or a boolean indicating if any matches were made if no global flag is used. When a match is done (instead of a substitution), =~ behaves the same as ~.
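
Python strings cannot be modified in place, but re.subn returns both the new string and the number of substitutions, which gives a close analogue of what =~ reports. This is a sketch in Python, not Glish.

```python
import re

x = "green"

# Rebinding x plays the role of the in-place update done by =~;
# n is the substitution count that =~ would yield with the g flag.
x, n = re.subn(r"e", "x", x)
print(x, n)  # grxxn 2
```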

There is also a !~ operator, which is exactly the same as the logical negation of =~. For example, this:

    x := 'blue'
    print ! (x =~ m/r/), x !~ m/r/
prints two Ts because the string x doesn't have an r in it. The two forms used in the print statement are equivalent. Table 4.5 lists the results for each of the combinations of application operators and regular expression forms.
Table 4.5: Regular Expression Application Operator Results
    Operator    m//        m//g       s//        s//g
    ~           boolean    integer    string     string
    =~          boolean    integer    boolean    integer
    !~          boolean    boolean    boolean    boolean


These operators can also be used as unary operators. In this case, the regular expression is applied to the variable ``_''. Here:

    _ := 'green eggs'
    =~ s/e/x/g
the result is 3 since three substitutions were made, and the underscore variable, _, contains grxxn xggs. (See § 8.1.2 for more information and examples.)

Splitting Strings

Regular expression substitution can be used to split strings. This is done using the special substitution variable $$. Wherever this substitution variable is inserted in the substitution string the string is split. Here's an example:

    x := 'little string'
    x =~ s/\s+/$$/
    x =~ s/i/X$$X/
In this case, x starts out as a string of length one, the first substitution changes it to a string of length two by splitting it at the space between the words, and finally the last substitution splits the string at each i. So after this, x has length four, and it contains the string ``lX Xttle strX Xng''. As this example shows, a regular expression applied to an array of strings gets applied to each element of the array.
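
Python has no equivalent of $$, so the sketch below imitates it with a sentinel character: the substitution inserts the sentinel, and the result is then split on it. The SPLIT constant and the substitute_and_split helper are hypothetical, and the sentinel is assumed never to occur in the data.

```python
import re

SPLIT = "\x1f"  # hypothetical stand-in for $$; assumed absent from the data


def substitute_and_split(elements, pattern, replacement):
    """Substitute the first match in each element, then split on SPLIT."""
    result = []
    for element in elements:
        replaced = re.sub(pattern, replacement, element, count=1)
        result.extend(replaced.split(SPLIT))
    return result


x = ["little string"]
x = substitute_and_split(x, r"\s+", SPLIT)            # like s/\s+/$$/
x = substitute_and_split(x, r"i", "X" + SPLIT + "X")  # like s/i/X$$X/
print(x)  # ['lX', 'Xttle', 'strX', 'Xng']
```

The second call runs over both elements produced by the first, which is how the length grows from one to two to four.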


Lookahead

It is possible to do lookahead in regular expressions. This is used to match a portion of a string based on what follows. This example:
    'foo bar\txyz' ~ s/.*?(\w+)(?=\t).*/$1/
yields bar. The lookahead is introduced with ``(?=''. This regular expression is looking for a word followed by a tab. The lookahead portion of the regular expression is zero length; it doesn't match any actual characters in the string but rather matches or not based on future characters.

The last example illustrates positive lookahead since it is looking for the existence of certain characters. It is also possible to look for the absence of certain characters:

    'foo bar\txyz' ~ s/.*?(\w+)(?!\t).*/$1/
This example is looking for a word followed by anything but a tab; the result in this case is foo. Negative lookahead is introduced by ``(?!''.
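
Both lookahead forms use the same syntax in Python's re module, so the two results can be reproduced there. The sketch below is a Python analogue with \1 in place of $1, not Glish code.

```python
import re

text = "foo bar\txyz"

# Positive lookahead: the captured word must be followed by a tab.
positive = re.sub(r".*?(\w+)(?=\t).*", r"\1", text, count=1)
# Negative lookahead: the captured word must not be followed by a tab.
negative = re.sub(r".*?(\w+)(?!\t).*", r"\1", text, count=1)

print(positive)  # bar
print(negative)  # foo
```

In the negative case the first word already qualifies, because foo is followed by a space, so the lazy ``.*?'' never needs to consume anything.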


Comments

Finally, if your regular expressions are complicated enough, you may wish to add comments to them. This too is possible:
    'foo bar\txyz' ~ s/.*?(\w+)(?#neg lookahead)(?!\t).*/$1/
As in the previous example, the result here is foo. The comment is introduced with ``(?#'', and it is ignored when the regular expression is evaluated.


Please send questions or comments about AIPS++ to aips2-request@nrao.edu.
Copyright © 1995-2000 Associated Universities Inc., Washington, D.C.

2006-10-15