Python Automation Cookbook
上QQ阅读APP看书,第一时间看更新

How it works...

The re.search function matches a pattern, no matter its position in the string. As explained previously, this will return None if the pattern is not found, or a match object.

The following special characters are used:

  • ^: Marks the start of the string
  • $: Marks the end of the string
  • \b: Marks the start or end of a word
  • \S: Marks any character that's not a whitespace, including special characters

More special characters are shown in the next recipe.

In step 6 in the How to do it... section, the r'[0123456789-]+' pattern is composed of two parts. The first one is between square brackets, and matches any single character between 0 and 9 (any number) and the dash (-) character. The + sign after that means that this character can be present one or more times. This is called a quantifier in regexes. This makes a match on any combination of numbers and dashes, no matter how long it is.

Step 7 again uses the + sign to match as many characters as necessary before the @ and again after it. In this case, the character match is \S, which matches any non-whitespace character.

Please note that the naive pattern for emails described here is very naive, as it will match invalid emails such as john@smith@test.com. A better regex for most uses is r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)". You can go to http://emailregex.com/ for find it and links to more information. 

Note that parsing a valid email including corner cases is actually a difficult and challenging problem. The previous regex should be fine for most uses covered in this book, but in a general framework project such as Django, email validation is a very long and very unreadable regex.

The resulting matching object returns the position where the matched pattern starts and ends (using the start and end methods), as shown in step 5, which splits the string into matched parts, showing the distinction between the two matching patterns. 

The difference displayed in step 5 is a very common one. Trying to capture GP can end up capturing eg gplant and ba gpipe! Similarly, things\b won't capture things. Be sure to test and make the proper adjustments, such as capturing \bGP\b for just the word GP.

The specific matched pattern can be retrieved by calling group(), as shown in step 6. Note that the result will always be a string. It can be further processed using any of the methods that we've previously seen, such as by splitting the phone number into groups by dashes, for example:

>>> match = re.search(r'[0123456789-]+', 'the phone number is 1234-567-890')
>>> [int(n) for n in match.group().split('-')]
[1234, 567, 890]