Python Automation Cookbook
上QQ阅读APP看书,第一时间看更新

Introducing regular expressions

A regular expression, or regex, is a pattern to match text. In other words, it allows us to define an abstract string (typically the definition of a structured kind of text) to check with other strings to see if they match or not.

It is better to describe them with an example. Think of defining a pattern of text as a word that starts with an uppercase A and contains only lowercase Ns and As after that. The word Anna matches it, but Bob, Alice, and James does not. The words Aaan, Ana, Annnn, and Aaaan will also be matches, but ANNA won't.

If this sounds complicated, that's because it is. Regexes can be notoriously complicated because they may be incredibly intricate and difficult to follow. But they are very useful, because they allow us to perform incredibly powerful pattern matching.

Some common uses of regexes are as follow:

  • Validating input data: For example, that a phone number is only numbers, dashes, and brackets.
  • String parsing: Retrieve data from structured strings, such as logs or URLs. This is similar to what's described in the previous recipe.
  • Scrapping: Find the occurrences of something in a long text. For example, find all emails in a web page.
  • Replacement: Find and replace a word or words with others. For example, replace the owner with John Smith.
"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."
                                                                                                                                                             – Jamie Zawinski

Regular expressions are at their best when they are kept very simple. In general, if there is a specific tool to do it, prefer it over regexes. A very clear example of this is HTML parsing; check Chapter 3, Building Your First Web Scraping Application, for better tools to achieve this.

Some text editors allow us to search using regexes as well. While most are editors aimed at writing code, such as Vim, BBEdit, or Notepad++, they're also present in more general tools, such as MS Office, Open Office, or Google Documents. But be careful, as the particular syntax may be slightly different.