Introducing regular expressions
A regular expression, or regex, is a pattern to match text. In other words, it allows us to define an abstract string (typically the definition of a structured kind of text) and check other strings against it to see whether they match.
It is better to describe them with an example. Think of defining a pattern of text as a word that starts with an uppercase A and contains only lowercase n's and a's after that. The word Anna matches it, but Bob, Alice, and James do not. The words Aaan, Ana, Annnn, and Aaaan will also be matches, but ANNA won't.
If this sounds complicated, that's because it is. Regexes are notorious for becoming intricate and difficult to follow. But they are very useful, because they allow us to perform powerful pattern matching in very little code.
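The pattern described above can be sketched in Python with the standard `re` module. This is only an illustration: the regex `A[an]+` is one assumed way to write the pattern (it requires at least one lowercase letter after the A, which the description leaves open), and the names `PATTERN`, `words`, and `matches` are made up for the example.

```python
import re

# Assumed pattern: an uppercase A followed by one or more lowercase a's or n's.
PATTERN = re.compile(r"A[an]+")

words = ["Anna", "Bob", "Alice", "James", "Aaan", "Ana", "Annnn", "Aaaan", "ANNA"]

# fullmatch() only succeeds if the whole word fits the pattern,
# so "Alice" is rejected even though it starts with "A".
matches = [word for word in words if PATTERN.fullmatch(word)]

print(matches)  # ['Anna', 'Aaan', 'Ana', 'Annnn', 'Aaaan']
```

Using `fullmatch` rather than `search` is a deliberate choice here: `search` would also accept any longer text that merely contains a matching fragment.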
Some common uses of regexes are as follows:
- Validating input data: For example, checking that a phone number contains only digits, dashes, and brackets.
- String parsing: Retrieving data from structured strings, such as logs or URLs. This is similar to what's described in the previous recipe.
- Scraping: Finding the occurrences of something in a long text. For example, finding all emails in a web page.
- Replacement: Finding and replacing a word or words with others. For example, replacing the owner with John Smith.
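Each of these uses can be sketched in a few lines with Python's `re` module. The patterns below are deliberately simplified assumptions, not production-grade validators: "brackets" is read as parentheses, the log format `[LEVEL] message` is invented for the example, and the email pattern only covers common address shapes. All function names are hypothetical.

```python
import re

# Validation: the whole string may contain only digits, dashes, and parentheses.
PHONE = re.compile(r"[0-9()\-]+")

def is_valid_phone(text):
    return PHONE.fullmatch(text) is not None

# Parsing: pull named fields out of an assumed "[LEVEL] message" log line.
LOG_LINE = re.compile(r"\[(?P<level>[A-Z]+)\] (?P<message>.*)")

def parse_log(line):
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None

# Scraping: find every email-like string (simplified pattern).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def find_emails(text):
    return EMAIL.findall(text)

# Replacement: swap every occurrence of "the owner" for a given name.
def replace_owner(text, name):
    return re.sub(r"\bthe owner\b", name, text)

print(is_valid_phone("(555)-123-4567"))
print(parse_log("[ERROR] Disk full"))
print(find_emails("Contact alice@example.com or bob.smith@mail.example.org."))
print(replace_owner("Please contact the owner.", "John Smith"))
```

Note that the validation example only checks which characters are present, exactly as the bullet describes; it does not verify that the digits form a plausible phone number.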
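An old computer joke goes:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
– Jamie Zawinski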
Regular expressions are at their best when they are kept very simple. In general, if a dedicated tool exists for the task, prefer it over regexes. A very clear example of this is HTML parsing; check Chapter 3, Building Your First Web Scraping Application, for better tools to achieve this.