What is a Regular Expression?

Computer science can be viewed as the study of formal languages.

Examples of formal languages include regular, context-free, context-sensitive, …, and Turing-complete languages.

Regular expressions (respectively, context-free grammars or CFGs) are algebraic representations of regular (respectively, context-free) languages.

Regular expressions and CFGs are used extensively in pattern matching, web sanitization, and defining the input syntax of programs and the syntax of programming languages.

Regular expressions (Regex) are a powerful tool for various kinds of string manipulation.
They are a domain-specific language (DSL) that is available as a library in most modern programming languages, not just Python.
They are useful for two main tasks:

verifying that strings match a pattern (for instance, that a string has the format of an email address),
performing substitutions in a string (such as changing all American spellings to British ones).

Why Regular Expressions in Python?

Regular expressions are particularly useful for data cleaning, since much of that work consists of string manipulation.

Basic Regex in Python

This article introduces the basics of regex in Python.

Import re

Regex support in Python comes from the re module, which is part of the standard library.

import re

Declare Pattern

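To avoid confusion over backslash escapes, write patterns as raw strings, in the form r"expression". In a raw string, Python leaves backslashes untouched, so they reach the regex engine exactly as written.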

pattern = r"<expression>"
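
As a small illustration of why raw strings matter, the sketch below compares the same pattern written as a normal string and as a raw string. The variable names and the sample text are made up for this example.

```python
import re

# In a normal string, "\b" is the backspace character (one character).
# In a raw string, r"\b" is a backslash followed by "b" (two characters),
# which the regex engine reads as a word boundary.
normal = "\bcat\b"
raw = r"\bcat\b"

print(len(normal))  # 5
print(len(raw))     # 7

with_raw = re.findall(raw, "cat concat cat")
with_normal = re.findall(normal, "cat concat cat")

print(with_raw)     # ['cat', 'cat']
print(with_normal)  # []
```

The normal-string version silently looks for literal backspace characters, which is why it finds nothing.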

Some basic functions of re

Once you have defined a pattern, you can use the following functions to check whether a string matches it.

re.match(pattern, string) - checks whether the pattern matches at the beginning of the string. Returns a Match object, or None if there is no match.
re.search(pattern, string) - finds the first match of the pattern anywhere in the string. Returns a Match object, or None if there is no match.
re.findall(pattern, string) - returns a list of all non-overlapping substrings that match the pattern.
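
The three functions can be compared side by side. This is a minimal sketch with a made-up pattern and sample text; it also shows that re.match and re.search give back a Match object (or None) rather than True or False.

```python
import re

pattern = r"spam"
text = "eggs spam sausage spam"

# re.match only tries at the start of the string, so it returns None here.
at_start = re.match(pattern, text)
# re.search scans the string and returns a Match object for the first hit.
anywhere = re.search(pattern, text)
# re.findall collects every non-overlapping match as a list of strings.
all_hits = re.findall(pattern, text)

print(at_start)          # None
print(anywhere.group())  # spam
print(anywhere.span())   # (5, 9)
print(all_hits)          # ['spam', 'spam']

# A Match object is truthy and None is falsy, so the result works in a condition.
if re.search(pattern, text):
    print("found")
```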

Search and Replace

re.sub(pattern, repl, string, count=0) - replaces occurrences of the pattern in string with repl. All occurrences are replaced unless count is given, in which case at most count occurrences are replaced. Returns the modified string.
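
A short sketch of re.sub, using the American-to-British spelling idea mentioned earlier. The sample sentence is invented for the example.

```python
import re

text = "The color of the harbor, the color of the sea"

# count=0 (the default) means there is no limit on replacements.
all_replaced = re.sub(r"color", "colour", text)
# count=1 replaces only the first occurrence.
first_only = re.sub(r"color", "colour", text, count=1)

print(all_replaced)  # The colour of the harbor, the colour of the sea
print(first_only)    # The colour of the harbor, the color of the sea
```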

Metacharacters

Metacharacters are characters with a special meaning in a pattern. They allow you to write regular expressions that represent concepts like “one or more repetitions of a vowel”.

Basic Metacharacters

. - Matches any character except a newline (unless the re.DOTALL flag is set)
^ - Matches the beginning of the string, or the beginning of each line if the re.MULTILINE flag is enabled
$ - Matches the end of the string (or just before a trailing newline), or the end of each line if the re.MULTILINE flag is enabled
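
The following sketch tries these three metacharacters on a small three-line string. The sample text is made up; re.MULTILINE is the name of the multiline flag in Python's re module.

```python
import re

text = "cat\ncot\ncut"

# "." matches any single character except a newline.
dots = re.findall(r"c.t", text)
# Without a flag, "^" anchors only at the very start of the string.
start_only = re.findall(r"^c.t", text)
# With re.MULTILINE, "^" and "$" also anchor at the start and end of each line.
each_line = re.findall(r"^c.t$", text, re.MULTILINE)
# Without the flag, "$" anchors only at the end of the string.
end_only = re.findall(r"c.t$", text)

print(dots)        # ['cat', 'cot', 'cut']
print(start_only)  # ['cat']
print(each_line)   # ['cat', 'cot', 'cut']
print(end_only)    # ['cut']
```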

Quantifiers

* - Quantifier. Matches 0 or more of the preceding token
+ - Quantifier. Matches 1 or more of the preceding token
? - Quantifier. Matches 0 or 1 of the preceding token
{1,999} - Quantifier. Matches between 1 and 999 of the preceding token

In each case, the “preceding token” can be a single character, a class, or a group of characters in parentheses.
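
To see how the quantifiers differ, the sketch below runs each one against the same kind of input. The sample words are invented, and {2,3} is used in place of {1,999} only to keep the output short.

```python
import re

words = "gd god good goood"

# "*" allows zero or more "o"s, so "gd" also matches.
star = re.findall(r"go*d", words)
# "+" requires at least one "o".
plus = re.findall(r"go+d", words)
# "?" makes the "u" optional.
optional = re.findall(r"colou?r", "color colour")
# "{2,3}" requires two or three "o"s.
bounded = re.findall(r"go{2,3}d", words)
# The preceding token can be a group in parentheses: one or more "ha".
grouped = re.search(r"(ha)+", "say hahaha").group()

print(star)      # ['gd', 'god', 'good', 'goood']
print(plus)      # ['god', 'good', 'goood']
print(optional)  # ['color', 'colour']
print(bounded)   # ['good', 'goood']
print(grouped)   # hahaha
```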

Booleans

| - Alternation. Acts like a boolean OR. Matches the expression before or after the |.
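
A brief sketch of alternation, with made-up sample words. Parentheses are used in the second pattern to limit how far the | reaches.

```python
import re

# Without parentheses, "|" separates the whole expressions on either side.
either = re.findall(r"cat|dog", "cat dog bird")
print(either)  # ['cat', 'dog']

# With parentheses, only "a" or "e" alternates: this matches "gray" or "grey".
results = {}
for word in ["gray", "grey", "groy"]:
    results[word] = re.match(r"gr(a|e)y", word) is not None

print(results)  # {'gray': True, 'grey': True, 'groy': False}
```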

Character Set

Also known as character classes.
A character set provides a way to match any one character from a specific set of characters.
A character class is created by putting the characters it matches inside square brackets.
Character classes can also match ranges of characters, such as [a-z]. Multiple ranges can be included in one class, as in [A-Za-z0-9].
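
The sketch below shows a plain character class, a single range, and a class with several ranges. The sample strings are invented; the first pattern is the “one or more repetitions of a vowel” idea mentioned earlier.

```python
import re

# [aeiou]+ matches one or more vowels in a row.
vowels = re.findall(r"[aeiou]+", "queue beautiful sky")
# [0-9] is a range covering the digits 0 through 9.
digits = re.findall(r"[0-9]+", "Room 42B, Floor 7")
# Several ranges can share one class: upper-case letters, lower-case letters, digits.
words = re.findall(r"[A-Za-z0-9]+", "Room 42B, Floor 7")

print(vowels)  # ['ueue', 'eau', 'i', 'u']
print(digits)  # ['42', '7']
print(words)   # ['Room', '42B', 'Floor', '7']
```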