Regex Python Cheat Sheet

This article is a cheat sheet for regular expressions (regex) in Python. It covers the basic concepts and patterns and gives a rundown of the most common functions and features of the re module.

 

Special Characters

Regular expressions use a combination of special characters and literal characters to define search patterns. Here are some of the most common special characters:

  • .: Matches any single character except a newline.
  • ^: Matches the start of the string.
  • $: Matches the end of the string, or just before a newline at the very end of the string.
  • *: Matches zero or more repetitions of the preceding element (a character, set, or group).
  • +: Matches one or more repetitions of the preceding element.
  • ?: Matches zero or one repetition of the preceding element.
  • {m,n}: Matches the preceding element at least m and at most n times.
  • [...]: A character set, matching any one of the characters inside the brackets.
  • [^...]: A negated character set, matching any one character not inside the brackets.
  • |: Alternation; matches either the expression before or the expression after the |.
  • (...): Defines a capturing group, which treats the enclosed pattern as a single unit and records the text it matched.
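
To make the list above concrete, here is a small sketch that exercises several of these characters. The sample strings and variable names are made up for illustration.

```python
import re

# '.' matches any character except a newline, so 'c\nt' is skipped.
dot = re.findall(r'c.t', "cat cut c\nt")            # ['cat', 'cut']

# '?' makes the preceding 'u' optional.
optional_u = re.findall(r'colou?r', "color colour colr")  # ['color', 'colour']

# A character set and its negated form.
vowels = re.findall(r'[aeiou]', "regex")            # ['e', 'e']
non_vowels = re.findall(r'[^aeiou]', "regex")       # ['r', 'g', 'x']

# Alternation inside a group; {2,3} repeats the whole group, not one character.
repeated = re.search(r'(ab|cd){2,3}', "xxabcdabyy").group()  # 'abcdab'

# Anchors: '^' ties the match to the start, '$' to the end.
starts = bool(re.search(r'^cat', "catalog"))        # True
ends = bool(re.search(r'cat$', "catalog"))          # False
```

Note that the quantifier in the `repeated` example applies to the group as a whole, which is why "preceding element" is a more accurate description than "preceding character".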

 

Basic Patterns

Here are some basic patterns used in regular expressions:

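  • \d: Matches any digit. For str patterns in Python 3 this includes all Unicode decimal digits; with the re.ASCII flag it matches only 0-9.
  • \D: Matches any non-digit character.
  • \s: Matches any whitespace character (space, tab, newline, etc.).
  • \S: Matches any non-whitespace character.
  • \w: Matches any word character (letters, digits, or underscores). For str patterns this includes Unicode letters and digits unless re.ASCII is set.
  • \W: Matches any non-word character.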
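
As a quick illustration of these shorthand classes, the sketch below runs each one against a made-up sample string. The last two lines show how the re.ASCII flag changes what \d accepts, using the Arabic-Indic digit three (U+0663) as an arbitrary example of a non-ASCII digit.

```python
import re

text = "Order_42 shipped\ton 2023-04-30"

digits = re.findall(r'\d+', text)      # ['42', '2023', '04', '30']
words = re.findall(r'\w+', text)       # ['Order_42', 'shipped', 'on', '2023', '04', '30']
whitespace = re.findall(r'\s', text)   # [' ', '\t', ' ']
non_word = re.findall(r'\W', text)     # [' ', '\t', ' ', '-', '-']

# For str patterns, \d also matches non-ASCII decimal digits by default.
unicode_digit = re.findall(r'\d', "\u0663")          # one match
ascii_only = re.findall(r'\d', "\u0663", re.ASCII)   # []
```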

 

Python Regex Library

Python's built-in regular expression library, the re module, provides a variety of functions for working with regular expressions. Here are some of the most commonly used functions:

 

re.compile()

This function compiles a regular expression pattern into a regex object. This can help improve performance when using the same pattern multiple times.

import re

pattern = re.compile(r'\d+')
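
A compiled pattern is mostly useful when it is applied many times. The sketch below reuses one pattern object across several strings; the `lines` list is invented for the example. Keep in mind that the re module also caches recently used patterns internally, so the speedup from compiling by hand is often modest.

```python
import re

# Compile once, then reuse the same pattern object for every line.
number = re.compile(r'\d+')

lines = ["3 cats", "no digits here", "2 dogs and 10 birds"]
counts = [number.findall(line) for line in lines]
# [['3'], [], ['2', '10']]

# The original pattern string stays available on the compiled object.
source = number.pattern   # '\\d+'
```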

 

re.search()

The re.search() function scans the string for the first location where the pattern matches and returns a match object for it. If no match is found, it returns None.

import re

pattern = re.compile(r'\d+')
text = "The year is 2023."
result = pattern.search(text)

if result:
    print("Match found:", result.group())
else:
    print("No match found.")

 

re.match()

The re.match() function checks if the regular expression pattern matches at the beginning of the string. It returns a match object if a match is found, and None otherwise.

import re

pattern = re.compile(r'\d+')
text = "2023 is the current year."
result = pattern.match(text)

if result:
    print("Match found:", result.group())
else:
    print("No match found.")
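
The difference between re.match() and re.search() is easy to miss, so here is a side-by-side sketch on the same string. It also includes re.fullmatch(), which is not covered above but is part of the re module and requires the whole string to match.

```python
import re

text = "The year is 2023."

# match() only looks at the start of the string, so it fails here.
at_start = re.match(r'\d+', text)            # None

# search() scans forward and finds the digits later in the string.
anywhere = re.search(r'\d+', text).group()   # '2023'

# fullmatch() succeeds only if the entire string matches the pattern.
whole_ok = re.fullmatch(r'\d+', "2023") is not None    # True
whole_bad = re.fullmatch(r'\d+', "2023.") is not None  # False
```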

 

re.findall()

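The re.findall() function returns all non-overlapping matches of the pattern in the string as a list of strings. If the pattern contains capturing groups, the list holds the group contents instead of the full matches.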

import re

pattern = re.compile(r'\d+')
text = "There are 3 cats and 2 dogs."
result = pattern.findall(text)
print("Matches found:", result)  # Matches found: ['3', '2']
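
The return value of re.findall() changes shape when the pattern contains capturing groups. The following sketch, using a made-up key=value string, shows the three cases.

```python
import re

text = "x=1, y=22"

# No groups: a list of full matches.
no_groups = re.findall(r'\w=\d+', text)       # ['x=1', 'y=22']

# One group: a list of that group's contents.
one_group = re.findall(r'\w=(\d+)', text)     # ['1', '22']

# Several groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(\w)=(\d+)', text)  # [('x', '1'), ('y', '22')]
```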

 

re.finditer()

The re.finditer() function returns an iterator yielding match objects for all non-overlapping matches of the pattern in the string.

import re

pattern = re.compile(r'\d+')
text = "There are 3 cats and 2 dogs."
result = pattern.finditer(text)

for match in result:
    print("Match found:", match.group())
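
Because re.finditer() yields match objects rather than plain strings, it is the natural choice when you also need the position of each match. A minimal sketch, with an illustrative sample string:

```python
import re

text = "There are 3 cats and 12 dogs."

# Collect the matched text together with its start and end index.
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\d+', text)]
# [('3', 10, 11), ('12', 21, 23)]

# The indices can be used to slice the original string.
slices = [text[start:end] for _, start, end in positions]   # ['3', '12']
```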

 

re.sub()

The re.sub() function replaces all occurrences of the pattern in the string with the specified replacement string.

import re

pattern = re.compile(r'\d+')
text = "There are 3 cats and 2 dogs."
result = pattern.sub("X", text)
print("Modified text:", result)  # Modified text: There are X cats and X dogs.
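
The replacement passed to re.sub() does not have to be a fixed string. The sketch below shows a few variations that the re module supports: backreferences to groups, a function as the replacement, the count argument, and re.subn(), which also reports how many replacements were made.

```python
import re

text = "There are 3 cats and 2 dogs."

# Backreferences: \1 and \2 refer to the first and second group.
swapped = re.sub(r'(\d+) (\w+)', r'\2 x\1', text)
# 'There are cats x3 and dogs x2.'

# A function receives each match object and returns the replacement text.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), text)
# 'There are 6 cats and 4 dogs.'

# count limits the number of replacements.
first_only = re.sub(r'\d+', 'X', text, count=1)
# 'There are X cats and 2 dogs.'

# subn() returns the new string and the number of replacements.
replaced, n = re.subn(r'\d+', 'X', text)
# ('There are X cats and X dogs.', 2)
```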

 

re.split()

The re.split() function splits the string by occurrences of the pattern.

import re

pattern = re.compile(r'\d+')
text = "There are 3 cats and 2 dogs."
result = pattern.split(text)
print("Split text:", result)  # Split text: ['There are ', ' cats and ', ' dogs.']
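
Two details of re.split() are worth a short sketch: a capturing group in the pattern keeps the separators in the result, and maxsplit limits how many splits are made. The input strings here are invented for the example.

```python
import re

text = "a1b22c"

# Plain split: the separators are dropped.
parts = re.split(r'\d+', text)                  # ['a', 'b', 'c']

# With a capturing group, the separators are kept in the list.
with_seps = re.split(r'(\d+)', text)            # ['a', '1', 'b', '22', 'c']

# maxsplit=1 stops after the first split.
once = re.split(r'\d+', text, maxsplit=1)       # ['a', 'b22c']

# Splitting on several delimiters at once, with optional trailing spaces.
fields = re.split(r'[,;]\s*', "x, y;z")         # ['x', 'y', 'z']
```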

 

Regex Flags

Regex flags modify the behavior of the regex functions. Some commonly used flags include:

  • re.IGNORECASE (or re.I): Performs case-insensitive matching.
  • re.MULTILINE (or re.M): Allows ^ and $ to match at the start and end of each line in the string, rather than only at the start and end of the entire string.
  • re.DOTALL (or re.S): Makes the . special character match any character, including newline characters.
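
Here is a short sketch of how each flag changes the result on the same two-line string. Flags are passed as an extra argument and can be combined with the | operator; the sample text is made up for illustration.

```python
import re

text = "first line\nSecond LINE"

# IGNORECASE: 'line' also matches 'LINE'.
any_case = re.findall(r'line', text, re.IGNORECASE)      # ['line', 'LINE']

# Without MULTILINE, ^ only matches at the very start of the string.
first_word = re.findall(r'^\w+', text)                   # ['first']
line_starts = re.findall(r'^\w+', text, re.MULTILINE)    # ['first', 'Second']

# Without DOTALL, '.' cannot cross the newline, so there is no match.
no_dotall = re.search(r'first.*LINE', text)              # None
dotall = re.search(r'first.*LINE', text, re.DOTALL).group()
# 'first line\nSecond LINE'

# Flags can be combined with |.
combined = re.findall(r'^second', text, re.I | re.M)     # ['Second']
```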

 

Groups and Capturing

 

Named Groups

Named groups allow you to reference matched text by name instead of by position. You can create named groups using the (?P<name>...) syntax.

import re

pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
text = "The date is 2023-04-30."
result = pattern.search(text)

if result:
    print("Year:", result.group("year"))
    print("Month:", result.group("month"))
    print("Day:", result.group("day"))
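
Named groups can be used in a few more places than result.group(). The sketch below reuses the same date pattern to show groupdict(), the \g<name> syntax in a replacement string, and the (?P=name) backreference inside a pattern. The sample sentences are invented.

```python
import re

pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
text = "The date is 2023-04-30."
result = pattern.search(text)

# All named groups as a dictionary.
parts = result.groupdict()
# {'year': '2023', 'month': '04', 'day': '30'}

# Named groups are still numbered, so positional access keeps working.
same = result.group(1) == result.group("year")   # True

# \g<name> refers to a named group in a replacement string.
reordered = pattern.sub(r'\g<day>/\g<month>/\g<year>', text)
# 'The date is 30/04/2023.'

# (?P=name) matches the same text the named group matched earlier.
repeated_word = re.search(r'\b(?P<word>\w+) (?P=word)\b', "it is is fine").group("word")
# 'is'
```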

 

Lookaround Assertions

Lookaround assertions are a powerful feature in regular expressions that allow you to check for a pattern without consuming any characters.

 

Lookahead

Positive lookahead (?=...) asserts that the pattern inside the lookahead matches immediately after the current position, but does not consume any characters. Negative lookahead (?!...) asserts that the pattern inside the lookahead does not match at that position.

import re

pattern = re.compile(r'\d+(?=\D)')
text = "There are 3 cats and 2 dogs."
result = pattern.findall(text)
print("Matches found:", result)  # Matches found: ['3', '2']
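
Negative lookahead has a common pitfall: the regex engine can backtrack to a shorter match that satisfies the assertion. The sketch below, using made-up CSS-like values, shows a positive lookahead, the naive negative version, and one possible way to guard against the backtracking.

```python
import re

text = "10px 20em 30px"

# Positive lookahead: numbers that are followed by 'px'.
px_values = re.findall(r'\d+(?=px)', text)        # ['10', '30']

# Naive negative lookahead: '10' is rejected, but the engine backtracks
# to '1', which is followed by '0' rather than 'px', so '1' is accepted.
naive = re.findall(r'\d+(?!px)', text)            # ['1', '20', '3']

# Anchoring with \b and also forbidding a following digit
# prevents the partial matches.
not_px = re.findall(r'\b\d+(?!px|\d)', text)      # ['20']
```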

 

Lookbehind

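Positive lookbehind (?<=...) asserts that the pattern inside the lookbehind matches immediately before the current position, without consuming any characters. Negative lookbehind (?<!...) asserts that the pattern inside the lookbehind does not match immediately before the current position. In Python's re module, the pattern inside a lookbehind must match strings of a fixed length.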

import re

pattern = re.compile(r'(?<=\D)\d+')
text = "There are 3 cats and 2 dogs."
result = pattern.findall(text)
print("Matches found:", result)  # Matches found: ['3', '2']
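
As a further sketch, the example below uses lookbehind to pick out numbers depending on whether a dollar sign precedes them. The sample text is invented. It also shows that the re module rejects a lookbehind whose length is not fixed.

```python
import re

text = "Cost: $45, discount 10, tax $5"

# Positive lookbehind: numbers directly preceded by '$'.
dollar_amounts = re.findall(r'(?<=\$)\d+', text)      # ['45', '5']

# Negative lookbehind: numbers not preceded by '$'. Also excluding a
# preceding digit stops the engine from matching the '5' inside '45'.
plain_numbers = re.findall(r'(?<![$\d])\d+', text)    # ['10']

# A variable-length lookbehind such as (?<=\D+) is not allowed.
try:
    re.compile(r'(?<=\D+)\d+')
    fixed_width_required = False
except re.error:
    fixed_width_required = True
```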

 

Conclusion

Regular expressions are a powerful tool for working with text in Python. This cheat sheet covered the basics of regex syntax, the most commonly used functions in the re module, and some more advanced techniques, such as named groups and lookaround assertions.

With these building blocks, you can search, extract, replace, and split text in Python more concisely.
