Feature Engineering Made Easy
上QQ阅读APP看书,第一时间看更新

What is feature engineering?

Finally, the title of the book.

Yes, folks, feature engineering will be the topic of this book. We will be focusing on the process of cleaning and organizing data for the purposes of machine learning pipelines. We will also go beyond these concepts and look at more complex transformations of data in the forms of mathematical formulas and neural understanding, but we are getting ahead of ourselves. Let’s start a high level.

Feature engineering is the process of transforming data into features that better represent the underlying problem, resulting in improved machine learning performance.

To break this definition down a bit further, let's look at precisely what feature engineering entails:

  • Process of transforming data: Note that we are not specifying raw data, unfiltered data, and so on. Feature engineering can be applied to data at any stage. Oftentimes, we will be applying feature engineering techniques to data that is already processed in the eyes of the data distributor. It is also important to mention that the data that we will be working with will usually be in a tabular format. The data will be organized into rows (observations) and columns (attributes). There will be times when we will start with data at its most raw form, such as in the examples of the server logs mentioned previously, but for the most part, we will deal with data already somewhat cleaned and organized.
  • Features: The word features will obviously be used a lot in this book. At its most basic level, a feature is an attribute of data that is meaningful to the machine learning process. Many times we will be diagnosing tabular data and identifying which columns are features and which are merely attributes.
  • Better represent the underlying problem: The data that we will be working with will always serve to represent a specific problem in a specific domain. It is important to ensure that while we are performing these techniques, we do not lose sight of the bigger picture. We want to transform data so that it better represents the bigger problem at hand.
  • Resulting in improved machine learning performance: Feature engineering exists as a single part of the process of data science. As we saw, it is an important and oftentimes undervalued part. The eventual goal of feature engineering is to obtain data that our learning algorithms will be able to extract patterns from and use in order to obtain better results. We will talk in depth about machine learning metrics and results later on in this book, but for now, know that we perform feature engineering not only to obtain cleaner data, but to eventually use that data in our machine learning pipelines.

We know what you’re thinking, why should I spend my time reading about a process that people say they do not enjoy doing? We believe that many people do not enjoy the process of feature engineering because they often do not have the benefits of understanding the results of the work that they do.

Most companies employ both data engineers and machine learning engineers. The data engineers are primarily concerned with the preparation and transformation of the data, while the machine learning engineers usually have a working knowledge of learning algorithms and how to mine patterns from already cleaned data.

Their jobs are often separate but intertwined and iterative. The data engineers will present a dataset for the machine learning engineers, which they will claim they cannot get good results from, and ask the Data Engineers to try to transform the data further, and so on, and so forth. This process can not only be monotonous and repetitive, it can also hurt the bigger picture.

Without having knowledge of both feature and machine learning engineering, the entire process might not be as effective as it could be. That’s where this book comes in. We will be talking about feature engineering and how it relates directly to machine learning. It will be a results-driven approach where we will deem techniques as helpful if, and only if, they can lead to a boost in performance. It is worth now ping a bit into the basics of data, the structure of data, and machine learning, to ensure standardization of terminology.