What is Regex?

Regex (or regexp) stands for Regular Expression.

Regular expressions are supported in most programming languages, including Java, JavaScript, PHP, Perl, and Python. They are powerful for searching and manipulating text strings, particularly when processing text files. Think of a regex as a pattern that expresses what the user is looking for.

A single regex line can replace several lines of program code.
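To make that claim concrete, here is a small sketch in Python that collects every run of digits from a string twice: once with a hand-written loop and once with a single pattern. The sample text and both function names are made up for illustration.

```python
import re

text = "Order 66 shipped 3 items on day 12"

# Hand-written version: scan character by character and collect digit runs.
def digit_runs_manual(s):
    runs, current = [], ""
    for ch in s:
        if ch in "0123456789":
            current += ch
        elif current:
            runs.append(current)
            current = ""
    if current:
        runs.append(current)
    return runs

# Regex version: one pattern does the same job.
def digit_runs_regex(s):
    return re.findall(r"[0-9]+", s)

print(digit_runs_manual(text))  # ['66', '3', '12']
print(digit_runs_regex(text))   # ['66', '3', '12']
```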

Syntax Summary:

Character:

All characters match themselves, except the special regex characters listed in the next section.

            regex x matches "x"
            regex 9 matches "9"
            regex = matches "="
            regex @ matches "@"

Special Regex Characters:

These characters have special meaning in regex: ., +, *, ?, ^, $, (, ), [, ], {, }, |, \

Escape Sequences (\char):

To match a character that has a special meaning in regex, prefix it with a backslash (\) to form an escape sequence.

            regex \. matches "."
            regex \+ matches "+"
            regex \( matches "("
            regex \\ matches "\"
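The effect of the backslash is easy to check in Python. The sketch below uses a made-up sample string; `re.escape` is a standard-library helper that adds the backslashes for you when you want to match a piece of text literally.

```python
import re

price_text = "Total: 3+4 (approx.)"

# Unescaped, "+" is an occurrence indicator: "3+4" means "one or more 3s, then a 4".
unescaped = re.search(r"3+4", price_text)

# Escaped, "\+" matches a literal plus sign.
escaped = re.search(r"3\+4", price_text)

# re.escape builds the escaped pattern from a literal string.
auto = re.search(re.escape("(approx.)"), price_text)

print(unescaped)        # None
print(escaped.group())  # 3+4
print(auto.group())     # (approx.)
```

Writing patterns as raw strings (the `r"..."` prefix) keeps Python from interpreting the backslashes before the regex engine sees them.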

Regex also recognizes common escape sequences.

            regex \n for newline
            regex \t for tab
            regex \r for carriage return
            regex \nnn for an octal number of up to 3 digits*
            regex \xhh for a two-digit hex code
            regex \uhhhh for a 4-digit Unicode code point
            regex \Uhhhhhhhh for an 8-digit Unicode code point (support varies by engine)

*The octal numeral system, or oct for short, is the base-8 number system and uses the digits 0 to 7.
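As a quick check of these escapes, the sketch below runs a few of them through Python's `re` module. The sample string is an assumption chosen for illustration, and other engines may differ in which of these forms they accept.

```python
import re

sample = "A\tB\nC"  # "A", a tab, "B", a newline, "C"

# \t and \n in the pattern match the tab and newline characters.
whole = re.fullmatch(r"A\tB\nC", sample)

# Hex 41, Unicode 0041 and octal 101 all denote the character "A".
by_hex = re.findall(r"\x41", sample)
by_unicode = re.findall(r"\u0041", sample)
by_octal = re.findall(r"\101", sample)

print(whole is not None)  # True
print(by_hex, by_unicode, by_octal)  # ['A'] ['A'] ['A']
```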

A sequence of Characters (or String):

Strings can be matched by combining a sequence of characters (called sub-expressions).

        Saturday matches "Saturday"
        *Matching is case-sensitive by default; it can be made case-insensitive via a modifier.

OR operator (|)

                four|4 matches "four" or "4"
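A short Python sketch of the OR operator, using a made-up sentence. It also shows why the pattern should be written without spaces around the bar: a space inside a regex is a literal character that must be present in the input.

```python
import re

sentence = "I have 4 cats and four dogs"

# Either alternative may match at each position.
found = re.findall(r"four|4", sentence)

# With spaces, the alternatives become "four " and " 4", so a bare "4" fails.
with_spaces = re.search(r"four | 4", "4")

print(found)        # ['4', 'four']
print(with_spaces)  # None
```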

Character class (or Bracket List):

            [...]: accepts ANY ONE of the characters within the square brackets.
            [aeiou] matches "a", "e", "i", "o", or "u".

            [.-.] (Range Expression): accepts ANY ONE of the characters in the range.
            [0-9] matches any digit.
            [A-Za-z] matches any uppercase or lowercase letter.

            [^...]: accepts any one character that is NOT in the list.
            [^0-9] matches any non-digit.

Only these four characters require an escape sequence inside the bracket list: ^, -, ], \
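The bracket lists above can be tried out in a few lines of Python. The sample text is an assumption picked so that every class finds something, including an escaped hyphen inside the brackets.

```python
import re

text = "Regex 101: a-z!"

vowels = re.findall(r"[aeiou]", text)       # any one of the listed characters
digits = re.findall(r"[0-9]", text)         # any one character in the range 0-9
letters = re.findall(r"[A-Za-z]", text)     # any uppercase or lowercase letter
non_digits = "".join(re.findall(r"[^0-9]", text))  # everything that is not a digit
hyphen = re.findall(r"[a\-z]", text)        # escaped "-" is a literal, not a range

print(vowels)      # ['e', 'e', 'a']
print(digits)      # ['1', '0', '1']
print(letters)     # ['R', 'e', 'g', 'e', 'x', 'a', 'z']
print(non_digits)  # Regex : a-z!
print(hyphen)      # ['a', '-', 'z']
```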

Occurrence Indicators:

+: one or more (1+), e.g., [0-9]+ matches one or more digits such as '123', '000'.
*: zero or more (0+), e.g., [0-9]* matches zero or more digits. It accepts everything [0-9]+ accepts, plus the empty string.
?: zero or one (optional), e.g., [+-]? matches an optional "+", "-", or an empty string.
{m,n}: m to n (both inclusive)
{m}: exactly m times
{m,}: m or more (m+)
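Here is a small Python sketch that exercises each indicator. `fully_matches` is a hypothetical helper written for this example; it relies on `re.fullmatch`, which succeeds only when the whole string fits the pattern.

```python
import re

def fully_matches(pattern, text):
    """Return True when the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# (pattern, input, what we expect from the rules above)
checks = [
    (r"[0-9]+", "123", True),        # one or more digits
    (r"[0-9]+", "", False),          # "+" needs at least one digit
    (r"[0-9]*", "", True),           # "*" also accepts the empty string
    (r"[+-]?[0-9]+", "-42", True),   # optional sign present
    (r"[+-]?[0-9]+", "42", True),    # optional sign absent
    (r"[0-9]{2,4}", "12345", False), # five digits is more than n = 4
    (r"[0-9]{3}", "123", True),      # exactly three digits
    (r"[0-9]{2,}", "1", False),      # fewer than m = 2 digits
]

for pattern, text, expected in checks:
    result = fully_matches(pattern, text)
    print(f"{pattern!r:16} {text!r:10} -> {result}")
```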

Position Anchors:

Anchors do not match a character but a position, such as start-of-line, end-of-line, start-of-word, and end-of-word.

^, $: start-of-line and end-of-line respectively. E.g., ^[0-9]+$ matches a numeric string.
\b: boundary of word, i.e., start-of-word or end-of-word. E.g., \bcat\b matches the word "cat" in the input string.
\B: inverse of \b, i.e., non-start-of-word or non-end-of-word.
\<, \>: start-of-word and end-of-word respectively, similar to \b. E.g., \<cat\> matches the word "cat" in the input string. These two are only available in some engines (for example GNU grep).
\A, \Z: start-of-input and end-of-input respectively.
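The anchors can be seen at work in the Python sketch below. The sample sentence is made up for illustration. Python's `re` module does not support `\<` and `\>`, so the example uses `\b` for word boundaries.

```python
import re

# ^ and $ pin the pattern to the start and the end of the string.
is_numeric = re.compile(r"^[0-9]+$")
numeric_ok = is_numeric.search("2024") is not None
numeric_bad = is_numeric.search("20a24") is not None

sentence = "cat concat cats cat."

# \b requires a word boundary on both sides, so only the standalone "cat"s match.
whole_words = re.findall(r"\bcat\b", sentence)

# \B requires the opposite: "cat" must continue a word, as in "concat".
inside_words = re.findall(r"\Bcat", sentence)

# \A and \Z anchor to the very start and very end of the input.
at_start = re.search(r"\Acat", sentence)
at_end = re.search(r"cat\Z", sentence)  # fails: the input ends with "."

print(numeric_ok, numeric_bad)  # True False
print(whole_words)              # ['cat', 'cat']
print(inside_words)             # ['cat']
print(at_start is not None, at_end is not None)  # True False
```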

Flags:

  • g (global) extends the search to find all matches for a given expression inside a string, instead of stopping at the first match.
  • m (multi-line) when enabled, ^ and $ match the start and end of each line, instead of the start and end of the whole string.
  • i (insensitive) makes the whole expression case-insensitive (for instance, /aBc/i would match AbC).
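The single-letter flags above follow the /pattern/flags notation used by JavaScript and tools such as Regex101. As a rough Python equivalent, the sketch below uses `re.IGNORECASE` and `re.MULTILINE`; Python has no g flag, so the choice between `re.search` (first match) and `re.findall` (all matches) plays that role. The sample text is an assumption.

```python
import re

text = "abc\nABC\naBc"

# Without "g": stop at the first match.
first_only = re.search(r"abc", text).group()

# "g": collect every match (still case-sensitive here).
all_sensitive = re.findall(r"abc", text)

# "g" + "i": every match, ignoring case.
all_insensitive = re.findall(r"abc", text, re.IGNORECASE)

# "i" alone: ^ matches only at the start of the whole string.
string_start = re.findall(r"^a", text, re.IGNORECASE)

# "i" + "m": ^ also matches at the start of every line.
line_starts = re.findall(r"^a", text, re.IGNORECASE | re.MULTILINE)

print(first_only)       # abc
print(all_sensitive)    # ['abc']
print(all_insensitive)  # ['abc', 'ABC', 'aBc']
print(string_start)     # ['a']
print(line_starts)      # ['a', 'A', 'a']
```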

Let's have some practice! 😉

  • [0-9]: Yes, you are right! It matches a single digit from 0 to 9: '0', '1', '2', '3', ..., '9'.

  • [0-9]+: With this expression, you can match '000', '123', or '123434'. If the user enters '12edt', the expression matches only '12'.

  • Does the string '123.34' match the regex \$[0-9]+\.?[0-9]+ ? No, it doesn't. Why?

      - \$ means the match must begin with a literal dollar sign.
      - [0-9]+ matches one or more digits, here '123'.
      - \.? matches a dot, which is optional because of the ? (this allows decimal numbers).
      - [0-9]+ matches the remaining digits, here '34'.

In the end, our string '123.34' would match only if it started with a dollar sign.

'$123.34' is the string that matches.
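You can verify the practice answers yourself. The sketch below checks them with Python's `re` module; the variable names are made up for this example.

```python
import re

# The price pattern from the exercise, with the dot escaped.
price = re.compile(r"\$[0-9]+\.?[0-9]+")

without_dollar = price.search("123.34")   # no dollar sign anywhere, so no match
with_dollar = price.fullmatch("$123.34")  # the whole string fits the pattern

# [0-9]+ on mixed input picks up only the leading digits.
digits_only = re.findall(r"[0-9]+", "12edt")

print(without_dollar)       # None
print(with_dollar.group())  # $123.34
print(digits_only)          # ['12']
```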

To learn more about regex, the awesome Regex101 tool will help you understand and create your own regular expressions.

If you have any comments, please don't hesitate to let me know so we can learn more together.
