Python Automation Cookbook
上QQ阅读APP看书,第一时间看更新

There's more...

Some other useful operations that can be performed on strings are as follows:

  • Strings can be sliced like any other list. This means that 'word'[0:2] will return 'wo'.
  • Use .splitlines() to separate lines by newline character.
  • There are .upper() and .lower() methods, which return a copy with all the characters set to uppercase or lowercase. Their use is very similar to .title():
>>> 'UPPERCASE'.lower()
'uppercase'
  • For easy replacements (for example, change all A to B or change mine to ours), use .replace(). This method is useful for very simple cases, but replacements can get tricky easily. Be careful with the order of replacements to avoid collisions and case sensitivity issues. Note the wrong replacement in the following example:
>>> 'One ring to rule them all, one ring to find them, One ring to bring them all and in the darkness bind them.'.replace('ring', 'necklace')
'One necklace to rule them all, one necklace to find them, One necklace to bnecklace them all and in the darkness bind them.'

This is similar to the issues we'll see with regular expressions matching unexpected parts of your code.

There are more examples to follow later. Refer to the regular expressions recipes for more information.

If you work with multiple languages, or with any kind of non-English input, it is very useful to learn the basics of Unicode and encodings. In a nutshell, given the vast amount of characters in all the different languages in the world, including alphabets not related to the Latin one, such as Chinese or Arabic, there's a standard to try and cover all of them so that computers can properly understand them. Python 3 greatly improved this situation, making the strings internal objects to deal with all of those characters. The encoding that Python uses, and the most common and compatible one, is currently UTF-8.

Dealing with encodings is still relevant when reading from external files that can be encoded in different encodings (for example, CP-1252 or windows-1252, which is a common encoding produced by legacy Microsoft systems, or ISO 8859-15, which is the industry standard).