Feature Generation: what it is and how to do it

In the real world, data collection is often loosely controlled, so the resulting datasets tend to be noisy, unreliable, redundant, and incomplete. This makes data pre-processing an integral stage in the machine learning pipeline. In our previous blogs, we introduced how to improve data quality and how to handle imbalanced data. In this article, we will focus specifically on Feature Generation.

Feature Generation

Feature Generation (also known as feature construction, feature extraction or feature engineering) is the process of transforming existing features into new features that relate more strongly to the target. This can mean mapping a single feature to a new one with a function such as log, or combining two or more features into a new one using operations such as multiplication or addition.
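As a rough illustration of the two cases described above, here is a minimal Python sketch. The column names (income, width, height) and the helper add_generated_features are hypothetical and chosen only for the example; log1p is used instead of a plain log as an assumption, so that a value of zero does not raise an error.

```python
import math

def add_generated_features(row):
    """Return a copy of `row` with a few generated features added.

    The input keys (income, width, height) are hypothetical examples.
    """
    new_row = dict(row)
    # Map a single feature to a new one with a function (log of 1 + x).
    new_row["log_income"] = math.log1p(row["income"])
    # Create new features from two existing ones.
    new_row["area"] = row["width"] * row["height"]      # multiplication
    new_row["size_sum"] = row["width"] + row["height"]  # addition
    return new_row

example = add_generated_features({"income": 0.0, "width": 2.0, "height": 3.0})
print(example)
```

Whether any of these generated columns actually helps depends on the dataset, so they are best treated as candidates to be validated rather than guaranteed improvements.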

Figure 1. Feature Generation | Image by Jonas Meier

Feature Generation can improve model performance when there is a feature interaction. Two or more features interact if their combined effect is greater or less than the sum of their individual effects. It is possible to create interactions between three or more features, but this tends to result in diminishing returns.
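The definition of an interaction can be made concrete with a small sketch. The function interaction_effect and the two toy response functions below are assumptions made for illustration: the sketch measures how far the combined effect of two features is from the sum of their individual effects, relative to a baseline of zero.

```python
def interaction_effect(f, a, b, base_a=0.0, base_b=0.0):
    """Combined effect of (a, b) minus the sum of the individual effects."""
    baseline = f(base_a, base_b)
    effect_a = f(a, base_b) - baseline   # change only the first feature
    effect_b = f(base_a, b) - baseline   # change only the second feature
    combined = f(a, b) - baseline        # change both features together
    return combined - (effect_a + effect_b)

def additive(x1, x2):
    # Toy target with no interaction between x1 and x2.
    return 2.0 * x1 + 3.0 * x2

def with_interaction(x1, x2):
    # Same toy target plus a product term, so the features interact.
    return 2.0 * x1 + 3.0 * x2 + 4.0 * x1 * x2

print(interaction_effect(additive, 1.0, 1.0))          # 0.0
print(interaction_effect(with_interaction, 1.0, 1.0))  # 4.0
```

In the second case, a generated feature such as x1 * x2 would let even a simple linear model capture the effect that the individual features miss.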

Feature Generation is often overlooked as it is assumed that the model will learn any relevant relationships between features to predict the target variable. However, the generation of new flexible features is important as it allows us to use less complex models that are faster to run and easier to understand and maintain.

Feature Selection

Figure 2. Difference between feature selection and feature extraction | Image by Abhishek Singh

Examples of Feature Generation techniques

Often the relationship between the dependent and independent variables is assumed to be linear, but this is not always the case. Some feature combinations cannot be represented by a linear model. In such cases, a new feature can be created from a polynomial combination of the numeric features in a dataset, such as their squares and pairwise products. New features can also be created using trigonometric combinations.
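A minimal sketch of both ideas follows, using only the Python standard library. The helpers polynomial_features and trig_features are hypothetical names, and applying sine and cosine to each feature separately is just one assumed reading of "trigonometric combinations".

```python
import math
from itertools import combinations_with_replacement

def polynomial_features(values, degree=2):
    """All products of the inputs from degree 1 up to `degree` (no bias term)."""
    features = []
    for d in range(1, degree + 1):
        # Each index combination picks the inputs to multiply together,
        # e.g. (0, 0) -> x0 * x0 and (0, 1) -> x0 * x1.
        for combo in combinations_with_replacement(range(len(values)), d):
            features.append(math.prod(values[i] for i in combo))
    return features

def trig_features(values):
    """Sine and cosine of each input value (an assumed, simple choice)."""
    return [fn(v) for v in values for fn in (math.sin, math.cos)]

print(polynomial_features([2.0, 3.0]))  # [2.0, 3.0, 4.0, 6.0, 9.0]
print(trig_features([0.0]))             # [0.0, 1.0]
```

The number of polynomial features grows quickly with the number of inputs and the degree, so in practice the generated features usually need to be filtered afterwards.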

Manual vs Automated feature generation

Optimise feature generation with EvoML

EvoML can automatically transform, select and generate the most suitable features depending on the characteristics of the dataset. Our data scientists have integrated all the common feature generation methods as well as each method’s best practices (what types of dataset the method is most suitable for) into EvoML. Given a dataset, EvoML automatically tries different combinations of feature generation methods and selects the best ones. Furthermore, EvoML gives users the flexibility to choose which methods they prefer, so that they can easily customise it to their needs.

Figure 3. Data feature analysis on EvoML platform

Conclusion

About the Author

Passionate about bridging the link between research in AI and real-world problems. I love to travel around the world and collect different experiences.

About TurinTech

TurinTech — AI. Optimised.
