Mar 24, 2014 python regex

Notes and tips on Python’s regular expression library.

Regex Module

import re

There is also an alternative third-party module, “regex” (https://pypi.python.org/pypi/regex), which is intended to be a superior replacement for re.

Usage

match

re.match("[a-z]","something")
	match always starts at the beginning of the string (not of each line, even with re.M), so a leading "^" is implied.

re.search("^[a-z]","something")
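	Similar to Perl's regex search: the match may start anywhere in the string, so add "^" if it must start at the beginning.
	re.search(r"\w+","___  anyword") # OK: matches "___", since \w includes the underscore
	
	re.search("^[^a-z]+$","__a") # no match (None): the entire string must contain no lower-case letters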
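The difference between match and search can be checked with a small sketch. The sample text and variable names below are made up for illustration only.

```python
import re

text = "first line\nsecond line"

# match() only tries the pattern at position 0 of the string.
match_first = re.match(r"first", text)     # match object
match_second = re.match(r"second", text)   # None

# search() scans forward and finds the pattern anywhere.
search_second = re.search(r"second", text)  # match object

# Even with re.M, match() does not try at the start of later lines,
# whereas search() with "^" does.
match_multiline = re.match(r"^second", text, re.M)    # None
search_multiline = re.search(r"^second", text, re.M)  # match object

print(match_first is not None, match_second is None)
print(search_second.start(), search_multiline.group(0))
```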

Once match/search returns a match object, use group() to get the matched text.

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.group('first_name')
'Malcolm'
>>> m.group(1)
'Malcolm'
>>> m.groupdict()
{'first_name': 'Malcolm', 'last_name': 'Reynolds'}

# group(0) returns the entire matched string, not just the parts inside parentheses ( )
# group(1) returns the 1st parenthesized group, group(2) the 2nd, and so on

Example from the Python docs

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0)       # The entire match
'Isaac Newton'
>>> m.group(1)       # The first parenthesized subgroup.
'Isaac'
>>> m.group(2)       # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')
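Besides group(), a match object also exposes groups() and the positions of each group. A short sketch on the same string as above (the variable names are only for illustration):

```python
import re

m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")

all_groups = m.groups()     # every parenthesized group as a tuple
whole_span = m.span()       # (start, end) of the entire match
second_span = m.span(2)     # (start, end) of the second group
rest = m.string[m.end():]   # the part of the input after the match

print(all_groups, whole_span, second_span, rest)
```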

Example

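mystr = re.match(".*?(run).*?",line).groups()[0]
# mystr is always "run": this extracts the captured group from a line containing "run".
# If line does not contain "run", re.match returns None and .groups() raises AttributeError.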
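Because match and search return None when nothing matches, it is usually safer to check the result before calling group(). A minimal sketch, where first_group is a hypothetical helper name and not part of the re module:

```python
import re

def first_group(pattern, line, default=None):
    """Return the first captured group, or `default` when there is no match."""
    m = re.search(pattern, line)
    if m is None:
        return default
    return m.group(1)

print(first_group(r"(run)", "a quick run in the park"))
print(first_group(r"(run)", "a quick walk"))
```

Using search here avoids the leading ".*?" that match needs in order to skip ahead to "run".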

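Use re.compile(pat) when the same pattern is applied many times; the compiled pattern object can be reused. Its match() and search() methods also accept an optional start position (the 1 in the examples below).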

# from python doc
import re
re.compile("a").match("ba", 1)           # succeeds
re.compile("^a").search("ba", 1)         # fails; 'a' not at start
re.compile("^a").search("\na", 1)        # fails; 'a' not at start
re.compile("^a", re.M).search("\na", 1)  # succeeds
re.compile("^a", re.M).search("ba", 1)   # fails; no preceding \n
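A typical use of compile is to build the pattern once and apply it in a loop. The log lines and names below are invented for illustration:

```python
import re

# Hypothetical input lines, used only for this example.
lines = ["error: disk full", "info: started", "error: timeout"]

# Compile once, reuse for every line.
error_re = re.compile(r"^error: (.+)$")

messages = []
for line in lines:
    m = error_re.match(line)
    if m:
        messages.append(m.group(1))

# The optional second argument is the position where matching starts.
pos_match = re.compile(r"a").match("ba", 1)

print(messages, pos_match.span())
```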

Use re.finditer() to iterate over every match.

for match in re.finditer(pattern, string):
    print(match.group())  # runs once for each regex match
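As a concrete sketch, finditer yields one match object per non-overlapping match, so both the text and its position are available. The sample string is made up:

```python
import re

text = "cat 12, dog 7, bird 103"

# Collect (matched text, start, end) for every run of digits.
found = [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\d+", text)]

for number, start, end in found:
    print(number, start, end)
```

If only the matched strings are needed, re.findall(r"\d+", text) returns them directly as a list.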

re.split: advanced split

import re
re.split("REGEX of delimiters", "TEXT ....")

re.split(r"\W+", "TEXT...")  # split on runs of non-word characters
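A few behaviours of re.split are worth seeing on real input: the maxsplit argument, capturing groups in the delimiter, and the empty strings produced when the text starts or ends with a delimiter. The sample strings are only for illustration:

```python
import re

text = "one, two;three   four"

# Split on any run of commas, semicolons, or whitespace.
parts = re.split(r"[,;\s]+", text)

# maxsplit limits the number of splits; the rest is left intact.
limited = re.split(r"[,;\s]+", text, maxsplit=1)

# A capturing group in the pattern keeps the delimiters in the result.
with_delims = re.split(r"([,;])", "a,b;c")

# Delimiters at the edges produce empty strings at the ends.
edges = re.split(r"\W+", "...hello world!")

print(parts, limited, with_delims, edges)
```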

Dealing with “-” (dash)

To keep “-” from acting as a separator, split on a negated character class ([^...]) and add \- inside it.

re.split(r'[^\w\-]+', "e-mail")    # ==> ['e-mail']
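The effect is easier to see on a longer string. In this sketch the sentence is invented, and re.findall is shown as one possible alternative, not the only way to do it:

```python
import re

sentence = "Send an e-mail to the co-author, please."

# Plain \W+ treats the dash as a separator and breaks hyphenated words.
plain = re.split(r"\W+", sentence)

# Adding \- to the negated class keeps hyphenated words together.
keep_dash = re.split(r"[^\w\-]+", sentence)

# The trailing "." leaves an empty string at the end; drop it if unwanted.
words = [w for w in keep_dash if w]

# Alternative: match the words themselves instead of the separators.
found = re.findall(r"[\w\-]+", sentence)

print(plain, keep_dash, words, found)
```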