How it works...
Each of the steps performs a specific transformation of the text:
- The first one splits the text on the default separators: whitespace and newlines. The result is a list of individual words, with no line breaks or repeated spaces between them.
- To replace the digits, we go through every character of each word. If a character is a digit, an 'X' is returned in its place. This is done with two list comprehensions, one that runs over the list of words and another that runs over the characters of each word, replacing a character only if it's a digit: ['X' if w.isdigit() else w for w in word]. Note that the characters of each word are joined together again afterward.
- Each of the words is encoded into an ASCII byte sequence and decoded back again into the Python string type. Note the use of the errors parameter to force the replacement of characters that ASCII cannot represent; each of them becomes a ?.
The difference between strings and bytes is not very intuitive at first, especially if you never have to worry about multiple languages or encoding transformations. In Python 3, there's a strong separation between strings (the internal Python representation of text) and bytes, so the two cannot be mixed freely and not every tool applicable to strings is available on bytes objects. Unless you have a good idea of why you need a bytes object, always work with Python strings. If you need to perform transformations like the one in this task, encode and decode in the same line so that you keep your objects in the comfortable realm of Python strings. If you are interested in learning more about encodings, you can check out this brief article (https://eli.thegreenplace.net/2012/01/30/the-bytesstr-dichotomy-in-python-3) and this other longer and more detailed one (http://www.diveintopython3.net/strings.html).
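The first three steps, including the one-line encode and decode round trip, can be sketched roughly as follows. The sample text and the variable names are assumptions chosen for illustration; they are not the recipe's own input.

```python
# Hypothetical sample text (not the recipe's input). \u00d1 is the letter Ñ.
text = "Revenue grew  7.47%\nat CASTA\u00d1ACORP in 2018."

# Step 1: split on any run of whitespace, including newlines.
words = text.split()

# Step 2: replace every digit with 'X', character by character,
# then join the characters of each word back together.
redacted = [''.join('X' if c.isdigit() else c for c in word) for word in words]

# Step 3: encode to ASCII bytes and decode straight back to str in one
# expression. errors='replace' turns each unencodable character into '?'.
ascii_words = [word.encode('ascii', errors='replace').decode('ascii')
               for word in redacted]

print(ascii_words)
# ['Revenue', 'grew', 'X.XX%', 'at', 'CASTA?ACORP', 'in', 'XXXX.']
```

Because the encode and decode calls are chained, no bytes object is ever bound to a name, and every element of ascii_words is an ordinary Python string.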
- This step first adds an extra newline character (the \n character) to every word ending with a period. This marks the different paragraphs. After that, it creates a line and adds the words one by one. If an extra word would make the line go over 80 characters, it finishes the line and starts a new one. If the line already ends with a newline, it finishes it and starts another one as well. Note that an extra space is added to separate the words.
- Finally, each of the lines is converted to title case (the first letter of each word is uppercased) and all the lines are joined with newlines.
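The line-building and title-casing steps can be sketched along these lines. The word list, the LINE_SIZE name, and the final flush of the last line are assumptions made for this example; the description above does not say how the last line is handled.

```python
# Hypothetical output of the earlier steps: a flat list of words.
words = ['THE', 'BOARD', 'IS', 'OPTIMISTIC.', 'PROFIT', 'ROSE', 'BY', 'X.X%.']

LINE_SIZE = 80  # maximum line length described in the text

# Mark paragraph ends: append '\n' to every word ending with a period.
marked = [word + '\n' if word.endswith('.') else word for word in words]

lines = []
line = ''
for word in marked:
    # Close the current line if it already ends a paragraph, or if
    # adding a space plus this word would exceed LINE_SIZE.
    if line.endswith('\n') or len(line) + len(word) + 1 > LINE_SIZE:
        lines.append(line)
        line = ''
    # The extra space separates consecutive words.
    line = line + ' ' + word

# Assumption: flush the last line so that no words are dropped.
if line:
    lines.append(line)

# Title-case each line and join all the lines with newlines.
result = '\n'.join(item.title() for item in lines)
print(result)
```

With this input, result contains two short paragraphs separated by a blank line. Each line starts with a space, because the separator is added before every word, including the first one on a line.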