Lugaru's Epsilon Programmer's Editor

Character Classes

In place of any letter, you can specify a character class. A character class consists of a sequence of characters between square brackets. For example, the character class [adef] stands for any of the following characters: "a", "d", "e", or "f".

In place of a letter in a character class, you can specify a range of characters using a hyphen: the character class [a-m] stands for the characters "a" through "m", inclusive. The class [ae-gr] stands for the characters "a", "e", "f", "g", or "r". The class [a-zA-Z0-9] stands for any alphanumeric character.

To specify the complement of a character class, put a caret as the first character in the class. Using the above examples, the class [^a-m] stands for any character other than "a" through "m", and the class [^a-zA-Z0-9] stands for any non-alphanumeric character. Inside a character class, only ^, -, and ] have special meaning. All other characters stand for themselves, including plus, star, and question mark.
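As a rough illustration, the simple classes above mean the same thing in Python's re module, which is used here only as a stand-in. Python's engine is not Epsilon's, and the sample string is invented for the example.

```python
import re

# Python's re module is a stand-in here, not Epsilon's engine; these simple
# bracket classes happen to mean the same thing in both.
sample = "a d g m r z 7 + ?"

listed = re.findall(r"[adef]", sample)        # any of a, d, e, f
ranged = re.findall(r"[ae-gr]", sample)       # a, e through g, or r
not_a_to_m = re.findall(r"[^a-m ]", sample)   # complement; space is excluded too
literal_ops = re.findall(r"[+*?]", sample)    # +, * and ? stand for themselves

print(listed, ranged, not_a_to_m, literal_ops)
```

The space is added to the complemented class only so that the separators in the sample string do not show up in the result.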

If you need to put a right square bracket character in a character class, put it immediately after the opening left square bracket, or in the case of an inverted character class, immediately after the caret. For example, the class []x] stands for the characters "]" or "x", and the class [^]x] stands for any character other than "]" or "x".

To include a literal hyphen in a character class, put it first in the class, after any leading ^ or ]. For example, the pattern [^]-q] matches any character except "]", "-", or "q".
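The placement rules for ^, ] and - can be sketched as a small parser. The helper names parse_class and class_matches are hypothetical, and the code follows only the rules stated above; it is not Epsilon's actual implementation. Treating a hyphen that cannot form a range as a literal is an assumption. Note that Python's own re module reads [^]-q] differently (as a range from "]" to "q"), which is why the rules are spelled out by hand here.

```python
def parse_class(spec):
    """Return (negated, members) for a bracket class such as "[^]-q]".

    Hypothetical helper following the placement rules described in the text;
    not Epsilon's actual parser.
    """
    if len(spec) < 3 or spec[0] != "[" or spec[-1] != "]":
        raise ValueError("expected a class of the form [...]")
    body = spec[1:-1]
    i = 0
    negated = body.startswith("^")
    if negated:
        i += 1
    members = set()
    # A "]" right after "[" or "[^" is a literal right bracket.
    if body[i:i + 1] == "]":
        members.add("]")
        i += 1
    # A "-" that comes first (ignoring ^ and ]) is a literal hyphen.
    if body[i:i + 1] == "-":
        members.add("-")
        i += 1
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            # A range such as e-g: add every character from e through g.
            lo, hi = ord(body[i]), ord(body[i + 2])
            members.update(chr(code) for code in range(lo, hi + 1))
            i += 3
        else:
            # Assumption: anything that cannot form a range is a literal.
            members.add(body[i])
            i += 1
    return negated, members


def class_matches(spec, ch):
    """True if the single character ch belongs to the class spec."""
    negated, members = parse_class(spec)
    return (ch in members) != negated


print(parse_class("[]x]"))
print(class_matches("[^]-q]", "-"), class_matches("[^]-q]", "a"))
```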

Any regular expression you can write with character classes you can also write without character classes. For example, [abc] matches the same text as a|b|c. But character classes sometimes let you write much shorter regular expressions.

The period character (outside a character class) represents any character except a <Newline>. For example, the pattern a.c matches any three-character sequence on a single line where the first character is "a" and the last is "c".
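Python's re module treats the period the same way by default (any character except a newline), so it can serve as a quick illustration. This is a sketch in a different engine, with an invented sample string.

```python
import re

# "a\nc" spans a line break, and "ac" is only two characters long,
# so neither of them matches a.c.
text = "abc a-c a\nc ac"

matches = re.findall(r"a.c", text)
print(matches)  # ['abc', 'a-c']
```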

You can also specify a character class using a variant of the angle bracket syntax described in the previous section for entering special characters. The expression <Comma|Period|Question> represents any one of those three punctuation characters. The expression <a-z|A-Z|?> represents either a letter or a question mark, the same as the pattern [a-zA-Z]|<?>. The expression <^Newline> represents any character except newline, just as the period character by itself does.

You can also use a few character class names that match some common sets of characters.

 Class       Meaning
 <digit>     A digit, 0 to 9.
 <alpha>     A letter, according to isalpha().
 <alphanum>  A letter or a digit.
 <word>      A letter, a digit, or the _ character.
 <hspace>    The same as <Space|Tab>.
 <wspace>    The same as <Space|Tab|Newline>.
 <ascii>     An ASCII character, one with a code below 128.
 <any>       Any character, including <Newline>.
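One way to read the table is as a set of predicates on a single character. The sketch below is an approximation: the dictionary name CLASS_PREDICATES is hypothetical, and Python's str.isalpha() stands in for the C isalpha() function the table mentions, so the two may disagree on non-ASCII letters.

```python
def is_digit(c):
    return c in "0123456789"

def is_alpha(c):
    # Assumption: str.isalpha() stands in for C's isalpha().
    return c.isalpha()

def is_alphanum(c):
    return is_digit(c) or is_alpha(c)

# Hypothetical table mapping each class name to a test on one character.
CLASS_PREDICATES = {
    "digit": is_digit,
    "alpha": is_alpha,
    "alphanum": is_alphanum,
    "word": lambda c: is_alphanum(c) or c == "_",
    "hspace": lambda c: c in " \t",
    "wspace": lambda c: c in " \t\n",
    "ascii": lambda c: ord(c) < 128,
    "any": lambda c: True,
}

for name in ("alphanum", "word"):
    print(name, CLASS_PREDICATES[name]("_"))
```

The loop at the end shows the one difference between <alphanum> and <word>: only the latter accepts the underscore.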

You can match all characters with a particular Unicode property, using the syntax <p:hex-digit>. After the p: part, you can put the name of a binary property as in p:ASCIIHexDigit, a script name as in p:Cyrillic, or a category name as in p:Zs or p:L. Or you can put the name of an enumerated property, an equal sign, and a value for that property, like p:block=Dingbats or p:Line_break=Alphabetic. Case isn't significant in these names, and certain characters like hyphen and underscore are ignored in property names.
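To make the category names and the name-matching rule concrete, here is a small sketch using Python's standard unicodedata module. The function names are hypothetical, only hyphen and underscore are stripped (the text says "certain characters like" these, so the full list is an assumption), and only general categories such as Zs or L are covered, not scripts, blocks, or other properties.

```python
import unicodedata

def normalize_property_name(name):
    # Ignore case, hyphens and underscores, as the text describes.
    return name.lower().replace("-", "").replace("_", "")

def in_category(ch, category):
    # "L" covers every letter category (Lu, Ll, ...); "Zs" is space separators.
    return unicodedata.category(ch).startswith(category)

print(normalize_property_name("Line_break") == normalize_property_name("line-Break"))
print(in_category(" ", "Zs"), in_category("\t", "Zs"))
```

A tab is a control character (category Cc), so it is not in Zs even though it is whitespace.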

You can combine character classes using addition, subtraction, or intersection. Addition means a matching character can be in either of two classes, as in <alpha|digit> to match either alphabetic characters or digits. Intersection means a matching character must be a member of both classes, as in <p:HexDigit&p:numeric-type=decimal>, which matches characters with the HexDigit binary Unicode property that also have a Numeric-Type property of Decimal. Subtraction means a matching character must be a member of one class but not another, as in <p:currency-symbol&!dollar sign&!cent sign> which matches all characters with the Currency-Symbol property except for the dollar sign and cent sign characters.

More precisely, inside the angle brackets you can put one or more character "rules", each separated from the next by either a vertical bar | to add the rules together or an ampersand & to intersect them. Any rule may have a ! before it to invert that one rule, or you can put a ^ just after the opening < to invert the entire expression and match its complement.
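Addition, intersection, and inversion correspond to "or", "and", and "not" on per-character tests. The sketch below models each rule as a Python predicate. All helper names are hypothetical, and the Unicode general category Sc stands in for the Currency-Symbol property, which is an assumption about how that property is defined.

```python
import unicodedata

def union(*rules):
    # Like joining rules with |: a character may satisfy any of them.
    return lambda c: any(rule(c) for rule in rules)

def intersect(*rules):
    # Like joining rules with &: a character must satisfy all of them.
    return lambda c: all(rule(c) for rule in rules)

def invert(rule):
    # Like putting ! before a rule.
    return lambda c: not rule(c)

def char(ch):
    # A rule that names one specific character.
    return lambda c: c == ch

def is_alpha(c):
    return c.isalpha()

def is_digit(c):
    return c in "0123456789"

def is_currency(c):
    # Assumption: general category Sc approximates p:currency-symbol.
    return unicodedata.category(c) == "Sc"

# Roughly <alpha|digit>
alpha_or_digit = union(is_alpha, is_digit)

# Roughly <p:currency-symbol&!dollar sign&!cent sign>
other_currency = intersect(is_currency, invert(char("$")), invert(char("\u00a2")))

print(alpha_or_digit("x"), alpha_or_digit("_"))
print(other_currency("\u20ac"), other_currency("$"))
```

Subtraction needs no operator of its own: it is intersection with an inverted rule, which is exactly how the currency example is written.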

Each character rule may be a character specification or a range, a character class name from the table above, or a Unicode property specification using the p: syntax above. A range is two character specifications with a hyphen between them. A character specification is either the name of a character, # followed by the numeric code for a character, or the character itself (for any character except >, |, -, or <Nul>).

Separately, Epsilon recognizes the syntax <h:0d 0a 45> as a shorthand to search for a series of characters by their hexadecimal codes. This example is equivalent to the pattern <#0x0d><#0x0a><#0x45>.
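The equivalence can be checked mechanically. The sketch below rewrites the shorthand into the longer form and also shows the characters it denotes. Both function names are hypothetical, and the code assumes the hex codes are separated by whitespace, as in the example.

```python
def expand_hex_shorthand(pattern):
    """Rewrite "<h:0d 0a 45>" as "<#0x0d><#0x0a><#0x45>"."""
    if not (pattern.startswith("<h:") and pattern.endswith(">")):
        raise ValueError("expected a pattern of the form <h:...>")
    codes = pattern[3:-1].split()
    return "".join(f"<#0x{code}>" for code in codes)

def hex_shorthand_text(pattern):
    """Return the literal characters that the shorthand searches for."""
    if not (pattern.startswith("<h:") and pattern.endswith(">")):
        raise ValueError("expected a pattern of the form <h:...>")
    codes = pattern[3:-1].split()
    return "".join(chr(int(code, 16)) for code in codes)

print(expand_hex_shorthand("<h:0d 0a 45>"))      # <#0x0d><#0x0a><#0x45>
print(repr(hex_shorthand_text("<h:0d 0a 45>")))  # '\r\nE'
```

So this pattern searches for a carriage return, a line feed, and the letter "E".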





Lugaru Copyright (C) 1984, 2012 Lugaru Software Ltd. All Rights Reserved.