What Is a Regular Expression?

  • Computer science can be viewed as a study of formal languages
  • Examples of formal languages include regular, context-free, context-sensitive, …, Turing-complete, etc.
  • Regular expressions (respectively, context-free grammars or CFGs) are algebraic representations of regular (respectively, context-free) languages
  • Regular expressions and CFGs are used extensively in pattern matching, web sanitization, defining the input syntax of programs, and syntax of programming languages

Regular expressions (Regex) are a powerful tool for various kinds of string manipulation.
They are a domain-specific language (DSL) that is available as a library in most modern programming languages, not just Python.
They are useful for two main tasks:

  • verifying that strings match a pattern (for instance, that a string has the format of an email address),
  • performing substitutions in a string (such as changing all American spellings to British ones).

Why Regular Expressions in Python?

Regex is particularly useful for data cleaning, because much of that work consists of string manipulation.

Basic Regex in Python

This article introduces the basics of regex in Python.

Import re

Regex support in Python comes from the standard library module re.

import re

Declare Pattern

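Regular expressions use many backslashes, and Python also treats the backslash as an escape character in ordinary string literals. To avoid confusion between the two, write patterns as raw strings, in the form r"expression".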

pattern = r"<expression>"
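As a small illustration of why raw strings matter, the sketch below compares the same pattern written as an ordinary string and as a raw string. The sample text and variable names are made up for this example. In an ordinary string literal, "\b" is the backspace character, so the pattern never reaches the regex engine as intended.

```python
import re

# In an ordinary string literal, "\b" is a single backspace character.
plain = "\bcat\b"
# In a raw string the backslash is kept, so re sees the \b token (word boundary).
raw = r"\bcat\b"

plain_result = re.search(plain, "a cat")  # looks for backspace + "cat" + backspace
raw_result = re.search(raw, "a cat")      # looks for the whole word "cat"

print(len(plain), len(raw))   # 5 7
print(plain_result)           # None
print(raw_result.group())     # cat
```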

Some basic functions of re

After you’ve defined a regular expression, you can use several functions to check whether a string matches the pattern.

  • re.match(pattern, string) - checks whether the pattern matches at the beginning of the string. Returns a match object, or None if there is no match.
  • re.search(pattern, string) - finds the first match of the pattern anywhere in the string. Returns a match object, or None if there is no match.
  • re.findall(pattern, string) - returns a list of all non-overlapping substrings that match the pattern.
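The three functions can be compared side by side. This is a minimal sketch with a made-up sample string; the pattern [0-9]+ means "one or more digits". Note that re.match and re.search return a match object or None rather than True or False, but both results work directly in an if test.

```python
import re

text = "order 66 shipped, order 77 pending"
pattern = r"[0-9]+"

at_start = re.match(pattern, text)    # None: the text does not begin with a digit
anywhere = re.search(pattern, text)   # match object for the first run of digits
all_hits = re.findall(pattern, text)  # every non-overlapping match, as strings

print(at_start)          # None
print(anywhere.group())  # 66
print(anywhere.span())   # (6, 8)
print(all_hits)          # ['66', '77']

# A match object is truthy and None is falsy.
if re.search(pattern, text):
    print("the text contains a number")
```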

Search and Replace

  • re.sub(pattern, repl, string, count=0) - replaces occurrences of the pattern in string with repl. All occurrences are replaced unless count is given, in which case at most count replacements are made. Returns the modified string.
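A short sketch of re.sub, using a made-up sentence and the American-to-British spelling idea mentioned earlier. The default count=0 means there is no limit on the number of replacements.

```python
import re

text = "The color of the harbor, the color of the sky"

# Replace every occurrence of the pattern.
every = re.sub(r"color", "colour", text)
# Replace only the first occurrence.
first_only = re.sub(r"color", "colour", text, count=1)

print(every)       # The colour of the harbor, the colour of the sky
print(first_only)  # The colour of the harbor, the color of the sky
```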

Metacharacters

Metacharacters allow you to create regular expressions that represent concepts like “one or more repetitions of a vowel”.

Basic Metacharacters

  • . - Matches any character except line breaks
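  • ^ - Matches the beginning of the string, or the beginning of each line if the multiline flag (re.MULTILINE in Python) is enabled
  • $ - Matches the end of the string (or just before a newline at the very end of the string), or the end of each line if the multiline flag (re.MULTILINE in Python) is enabled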
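The behaviour of these three metacharacters can be checked on a small multi-line string. This is an illustrative sketch; the sample text and variable names are invented for the example.

```python
import re

text = "cat\ncot\ncut"  # three lines

any_char = re.findall(r"c.t", text)                  # . matches a, o and u
start_only = re.findall(r"^c.t", text)               # ^ matches only at the string start
each_line = re.findall(r"^c.t", text, re.MULTILINE)  # ^ now matches at every line start
end_only = re.findall(r"c.t$", text)                 # $ matches only at the string end
across_newline = re.search(r"c.t", "c\nt")           # . does not match a line break

print(any_char)        # ['cat', 'cot', 'cut']
print(start_only)      # ['cat']
print(each_line)       # ['cat', 'cot', 'cut']
print(end_only)        # ['cut']
print(across_newline)  # None
```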

Quantifiers

  • * - Quantifier. Matches 0 or more of the preceding token
  • + - Quantifier. Matches 1 or more of the preceding token
  • ? - Quantifier. Matches 0 or 1 of the preceding token
  • {1,999} - Quantifier. Matches between 1 and 999 of the preceding token. In general, {m,n} matches between m and n repetitions

In every case, the “preceding token” can be a single character, a class, or a group of characters in parentheses.
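To see how the quantifiers differ, the sketch below tests each one against the same small word list. The helper matching() is a hypothetical function written only for this example; it uses re.fullmatch so that a word counts only if the whole word fits the pattern.

```python
import re

words = ["ct", "cat", "caat", "caaaat"]

def matching(pattern):
    # Hypothetical helper: keep the words that the pattern matches in full.
    return [w for w in words if re.fullmatch(pattern, w)]

star = matching(r"ca*t")        # zero or more "a"
plus = matching(r"ca+t")        # one or more "a"
optional = matching(r"ca?t")    # zero or one "a"
bounded = matching(r"ca{2,4}t") # between two and four "a"
grouped = matching(r"(ca)+t")   # quantifier applied to the group "ca"

print(star)      # ['ct', 'cat', 'caat', 'caaaat']
print(plus)      # ['cat', 'caat', 'caaaat']
print(optional)  # ['ct', 'cat']
print(bounded)   # ['caat', 'caaaat']
print(grouped)   # ['cat']
```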

Booleans

  • | - Alternation. Acts like a boolean OR. Matches the expression before or after the |.
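A brief sketch of alternation, with made-up test words. Without parentheses, | splits the whole expression into alternatives; inside parentheses it only chooses between the options in that group.

```python
import re

whole = r"gray|grey"    # either the full word "gray" or the full word "grey"
grouped = r"gr(a|e)y"   # "gr", then "a" or "e", then "y"

results = {}
for word in ["gray", "grey", "groy"]:
    results[word] = (
        bool(re.fullmatch(whole, word)),
        bool(re.fullmatch(grouped, word)),
    )
    print(word, results[word])

# gray (True, True)
# grey (True, True)
# groy (False, False)
```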

Character Set

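Also known as character classes.
A character set provides a way to match exactly one character out of a specific set of characters.
A character class is created by putting the characters it matches inside square brackets.
Character classes can also match ranges of characters. Multiple ranges can be included in one class.

Most metacharacters lose their special meaning within character classes, so [.+*] matches a literal ., + or *. The exceptions are ^ as the first character (negation), - between two characters (a range), the closing ] and the backslash.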

  • [] - Matches any one character in the set.
  • [a-z] - Matches any lowercase alphabetic character.
  • [D-P] - Matches any uppercase character from D to P.
  • [A-Za-z] - Matches a letter of any case.
  • [0-9] - Matches any digit.
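The classes listed above can be tried on a single sample string. This is a minimal sketch; the text and variable names are invented for illustration. Each class matches exactly one character, so a quantifier is needed to match a longer run.

```python
import re

text = "Flight BA2490 departs Gate D7 at 09:45"

digits = re.findall(r"[0-9]", text)             # single digits
letter_digit = re.findall(r"[A-Z][0-9]", text)  # uppercase letter followed by a digit
words = re.findall(r"[A-Za-z]+", text)          # runs of letters of any case
d_to_p = re.findall(r"[D-P]", text)             # uppercase letters from D to P
literals = re.findall(r"[.+*]", "1.5+2*3")      # . + * are literal inside a class

print(digits)        # ['2', '4', '9', '0', '7', '0', '9', '4', '5']
print(letter_digit)  # ['A2', 'D7']
print(words)         # ['Flight', 'BA', 'departs', 'Gate', 'D', 'at']
print(d_to_p)        # ['F', 'G', 'D']
print(literals)      # ['.', '+', '*']
```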

Negated set

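A negated set is an inverted character class. The ^ must be the first character inside the square brackets; outside the brackets it is the start-of-string anchor instead.

  • [^] - Matches any one character that is not in the set.
  • [^A-Z] - Matches any one character that is not an uppercase letter from A to Z.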
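A short sketch of negated sets on made-up strings, including a typical data-cleaning use (stripping everything that is not a letter or digit). The last line shows that ^ outside the brackets is an anchor, not a negation.

```python
import re

text = "Hello, World 42!"

not_upper = re.findall(r"[^A-Z]", "AbCd")      # characters that are not A-Z
cleaned = re.sub(r"[^A-Za-z0-9]", "", text)    # drop everything except letters and digits
anchored = re.findall(r"^[A-Z]", text)         # ^ outside brackets: start of string

print(not_upper)  # ['b', 'd']
print(cleaned)    # HelloWorld42
print(anchored)   # ['H']
```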

Capturing Group

A group can be created by surrounding part of a regular expression with parentheses.

Capturing groups are often combined with metacharacters, for example to apply a quantifier or an alternation to the whole group.

  • () - Groups multiple tokens together and creates a capture group for extracting a substring or using a backreference.
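The following sketch shows both uses of a capture group: extracting substrings from a match object, and referring back to a group with a backreference (\1, \2, ...). The date format and the sample sentences are assumptions chosen for illustration.

```python
import re

date_pattern = r"([0-9]{4})-([0-9]{2})-([0-9]{2})"
text = "Released on 2021-03-15."

m = re.search(date_pattern, text)
print(m.group(0))  # 2021-03-15  (the whole match)
print(m.group(1))  # 2021        (first group)
print(m.groups())  # ('2021', '03', '15')

# Backreferences in the replacement string: reorder year-month-day to day/month/year.
swapped = re.sub(date_pattern, r"\3/\2/\1", text)
print(swapped)     # Released on 15/03/2021.

# Backreference inside the pattern: \1 must repeat what group 1 matched.
# With one group in the pattern, findall returns the text of that group.
doubled = re.findall(r"\b([a-z]+) \1\b", "it was the the best")
print(doubled)     # ['the']
```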

Special Sequences

There are various special sequences you can use in regular expressions. They are written as a backslash followed by another character.

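  • \d - Matches any digit character (0-9; for str patterns in Python 3 this also includes other Unicode decimal digits unless the re.ASCII flag is set)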

  • \s - Matches any whitespace character (spaces, tabs, line breaks)

  • \w - Matches any word character (alphanumeric & underscore)

  • \D - Matches any non-digit character (NOT 0-9)

  • \S - Matches any non-whitespace character (NOT spaces, tabs, line breaks)

  • \W - Matches any non-word character (NOT alphanumeric & underscore)

  • \A - Matches the beginning of a string

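  • \Z - Matches the end of a string

  • \b - Matches the empty string at a word boundary: between a \w and a \W character, or between a \w character and the start or end of the string

  • \B - Matches the empty string anywhere that is not a word boundary.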
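A few of these sequences in action, as an illustrative sketch with invented sample strings. Raw strings are used throughout so that the backslashes reach the regex engine unchanged.

```python
import re

text = "Call 555-0199 or email bob_smith at 9am"

numbers = re.findall(r"\d+", text)                 # runs of digits
tokens = re.findall(r"\w+", text)                  # runs of word characters
pieces = re.split(r"\s+", "a  b\tc\nd")            # split on any whitespace
digits_only = re.sub(r"\D", "", "555-0199")        # delete every non-digit
whole_word = re.findall(r"\bcat\b", "cat concat cats cat")  # "cat" as a whole word
inside_word = re.findall(r"\Bcat", "cat concat cats")       # "cat" not at a word start
string_start = re.findall(r"\Ab", "b\nb", re.MULTILINE)     # \A ignores the multiline flag
line_start = re.findall(r"^b", "b\nb", re.MULTILINE)        # ^ does not

print(numbers)       # ['555', '0199', '9']
print(tokens)        # ['Call', '555', '0199', 'or', 'email', 'bob_smith', 'at', '9am']
print(pieces)        # ['a', 'b', 'c', 'd']
print(digits_only)   # 5550199
print(whole_word)    # ['cat', 'cat']
print(inside_word)   # ['cat']
print(string_start)  # ['b']
print(line_start)    # ['b', 'b']
```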

Useful Info:

I Hate Regex

Regex Explainer - RegExr: Learn, Build, & Test RegEx

Regex Explainer - Regular expression visualizer using railroad diagrams

Online regex tester - Regex101

Regex Crossword

Sololearn - Regex

Regex Cheatsheet

Cleaning Data in Python (Regular Expressions)