Regex in Python
Contents
Regex in Python¶
The regex implementation in Python comes with a unique set of commands for facilitating matches, searches, substitutions, and other operations.
Regular expressions in Python should be defined as raw strings types, i.e.
ptrn = r"some patern"
Table of Contents¶
The re.Match object¶
The re.Match object contains the result of many regex methods. It has member functions for extracting groups with group and groups, methods for finding the index of the matches (returning tupples of starts and ends) with the regs property.
It also contains the matched string in the string property.
Flags and pre-compilation¶
You can pre-compile regex expressions using re.compile(expr, *flags), where the flags are defined in the library. Commonly used are
re.MULTILINEfor matching over many new line charactersre.GLOBALfor matching many occurancesre.IGNORECASEre.LOCALEfor making common escapes follow locale
Functions¶
The python regex library has a variety of specific functions, which all have their best use cases:
Splitting strings with split¶
We can call a similar method to the str.split builtin from the regex library, to split a string by regex pattern
re.split(ptrn, data)
which returns a list of substrings, with the matched split pattern removed.
Searching with search¶
This method returns the first occurance of the pattern, and thus can be useful (and cost effective) when trying to determine if the pattern exists within a string
if re.search(ptrn, data):
...
This method returns a re.Match instance. The generalisation of this method is findall.
Extracting groups with findall¶
For some regex pattern ptrn, we can extract the matches from data using
matches = re.findall(ptrn, data)
In this case, matches is a list containing all matches groups.
finditer¶
Like findall except retuns an iterator of non-overlapping matches. Empty matches are included, just as with findall.
Replacing with sub and subn¶
You can replace a pattern with a substring using
re.sub(ptrn, replacement, string, count=0)
If replacement is a string, it is escaped by default. If it is a callable, then the result must be a string. This method returns the new string, with no side-effects.
A variant to this is subn, which has an identical finger print, but also returns a tuple containing the new string and the number of substitutions made.