In the real world, data collection is loosely controlled, and the resulting datasets are often noisy, unreliable, redundant, and incomplete. This makes data pre-processing an integral stage of the machine learning pipeline. In our previous blogs, we introduced how to improve data quality and how to handle imbalanced data. In this article, we focus specifically on Feature Generation.
Before we get into the details, let’s review what a feature is. A feature (or column) represents a measurable piece of data such as name, age or gender. It is the basic building block of a dataset. The quality of a feature can vary significantly and has a large effect on model performance. We can improve the quality of a dataset’s features in the pre-processing stage using processes like Feature Generation and Feature Selection.
Feature Generation (also known as feature construction, feature extraction or feature engineering) is the process of transforming features into new features that better relate to the target. This can involve mapping a feature into a new feature using a function like log, or creating a new feature from one or multiple features using multiplication or addition.
Feature Generation can improve model performance when there is a feature interaction. Two or more features interact if their combined effect on the target is greater or less than the sum of their individual effects. Interactions can be built from three or more features, but this tends to yield diminishing returns.
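To make the idea of an interaction concrete, here is a minimal sketch on a toy dataset of our own (not from a real project). The target is the product of two features, so neither feature on its own is linearly related to the target, while the generated product feature is. The helper `pearson` and the variable names are illustrative assumptions.

```python
import math
from itertools import product


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy data: every combination of two features on a symmetric grid.
grid = [-2, -1, 1, 2]
rows = list(product(grid, grid))
x1 = [a for a, _ in rows]
x2 = [b for _, b in rows]

# The target depends only on the interaction of x1 and x2.
y = [a * b for a, b in rows]

# Generated feature: the product of the two original features.
x1_times_x2 = [a * b for a, b in zip(x1, x2)]

corr_x1 = pearson(x1, y)               # individual effect of x1
corr_x2 = pearson(x2, y)               # individual effect of x2
corr_product = pearson(x1_times_x2, y)  # effect of the generated feature
```

On this toy grid the individual correlations are zero and the product feature correlates perfectly with the target, which is the situation where a simple linear model benefits most from a generated feature.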
Feature Generation is often overlooked as it is assumed that the model will learn any relevant relationships between features to predict the target variable. However, the generation of new flexible features is important as it allows us to use less complex models that are faster to run and easier to understand and maintain.
That said, not all generated features are relevant, and too many features may adversely affect model performance. As the number of features increases, it becomes more difficult for the model to learn the mapping between features and target (this is known as the curse of dimensionality). It is therefore important to select the most useful features through Feature Selection, which we will introduce in our next blog.
Examples of Feature Generation techniques
A transformation is a mapping that is used to transform a feature into a new feature. The right transformation depends on the type and structure of the data, the data size and the goal. This can involve transforming a single feature using standard functions such as log, square, power, exponential or reciprocal, or combining two or more features using operators such as addition, division or multiplication.
Often the relationship between the dependent and independent variables is assumed to be linear, but this is not always the case. There are feature combinations that cannot be represented by a linear system. A new feature can be created based on a polynomial combination of numeric features in a dataset. Moreover, new features can be created using trigonometric combinations.
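As a rough illustration of these operators, the sketch below applies a few of them to a single record. The column names `income` and `debt`, the function `transform_row`, and the choice to return 0.0 when dividing by zero are all assumptions made for this example.

```python
import math


def transform_row(row):
    """Return the original row plus a few generated features."""
    income = row["income"]
    debt = row["debt"]
    return {
        **row,
        # Single-feature transformations.
        "log_income": math.log1p(income),  # log(1 + x), defined at x = 0
        "sqrt_income": math.sqrt(income),
        "income_squared": income ** 2,
        # Falling back to 0.0 for a zero income is an arbitrary choice.
        "reciprocal_income": 1.0 / income if income else 0.0,
        # Two-feature combinations.
        "debt_to_income": debt / income if income else 0.0,
        "income_minus_debt": income - debt,
    }


example = transform_row({"income": 40000.0, "debt": 10000.0})
```

Which of these is useful depends on the data: a log transform is a common choice for heavily skewed positive values, while a ratio such as `debt_to_income` encodes a relationship that a linear model cannot build from the two raw columns.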
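The sketch below shows one way to build polynomial and trigonometric features using only the Python standard library. The function names are our own, and using sine and cosine to encode a cyclical value such as the hour of the day is just one example of a trigonometric feature, chosen here as an assumption.

```python
import math
from itertools import combinations_with_replacement


def polynomial_features(values, degree=2):
    """All products of the input values from degree 1 up to `degree`."""
    features = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(values, d):
            features.append(math.prod(combo))
    return features


def cyclical_features(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


# For inputs (x1, x2) the terms are ordered: x1, x2, x1*x1, x1*x2, x2*x2.
poly = polynomial_features([2.0, 3.0], degree=2)

# Hour 6 on a 24-hour clock sits a quarter of the way around the circle.
sin_6h, cos_6h = cyclical_features(6, period=24)
```

With the cyclical encoding, 23:00 and 00:00 end up close together, which the raw hour values 23 and 0 do not. Note that the number of polynomial terms grows quickly with the number of input features and the degree.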
Manual vs Automated feature generation
Feature Generation has traditionally been an ad-hoc manual process that depends on domain knowledge, intuition, data exploration and creativity. However, this process is dataset-dependent, time-consuming, tedious and subjective, and it does not scale. Automated Feature Generation generates features automatically using a framework; these features can then be filtered using Feature Selection to avoid feature explosion. Below you can find some popular open source libraries for automated feature engineering:
- Featuretools: an open source Python framework for automated feature engineering
- Optuna: a hyperparameter optimization framework
- Feature-engine: a Python library for feature engineering for machine learning
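The generate-then-filter idea can be sketched in a few lines of plain Python. This is a simplified illustration under our own assumptions, not how Featuretools, Feature-engine or EvoML work internally: candidates are limited to squares, logs and pairwise products, and they are ranked by absolute Pearson correlation with the target. All function and column names are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)


def generate_candidates(columns):
    """Build candidate features from a dict of column name -> values."""
    candidates = dict(columns)
    for name, values in columns.items():
        candidates[f"{name}^2"] = [v * v for v in values]
        if all(v > 0 for v in values):  # log needs positive inputs
            candidates[f"log({name})"] = [math.log(v) for v in values]
    for (a, va), (b, vb) in combinations(columns.items(), 2):
        candidates[f"{a}*{b}"] = [p * q for p, q in zip(va, vb)]
    return candidates


def select_top_k(candidates, target, k):
    """Keep the k candidates most correlated (in absolute value) with the target."""
    ranked = sorted(
        candidates,
        key=lambda name: abs(pearson(candidates[name], target)),
        reverse=True,
    )
    return ranked[:k]


# Toy data: the target is the product of the two raw columns.
columns = {
    "width": [1.0, 2.0, 3.0, 4.0, 5.0],
    "height": [5.0, 3.0, 4.0, 1.0, 2.0],
}
area = [w * h for w, h in zip(columns["width"], columns["height"])]

candidates = generate_candidates(columns)
best = select_top_k(candidates, area, k=1)
```

Here two raw columns already produce seven candidates, and the filter keeps only the product feature. Real frameworks use far richer generation rules and more robust selection criteria, but the overall shape of the loop is similar.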
Optimise feature generation with EvoML
However, open source libraries may not provide the customisation you need for your unique data science projects. With EvoML, you can customise the automated feature generation process and get better features, and therefore better model results, faster.
EvoML can automatically transform, select and generate the most suitable features depending on the characteristics of the dataset. Our data scientists have integrated all the common feature generation methods as well as each method’s best practices (what types of dataset the method is most suitable for) into EvoML. Given a dataset, EvoML automatically tries different combinations of feature generation methods and selects the best ones. Furthermore, EvoML gives users the flexibility to choose which methods they prefer, so that they can easily customise it to their needs.
Feature Generation involves creating new features which can improve model accuracy. In addition to manual processes, there are several frameworks that can be used to automatically generate new features that expose the underlying problem space. These features can subsequently be filtered using Feature Selection, to ensure that only a subset of the most important features is used. This process effectively reduces model complexity and improves model accuracy as well as interpretability.
About the Author
Dr Manal Adham | TurinTech Research Team
Passionate about bridging the link between research in AI and real-world problems. I love to travel around the world and collect different experiences.
TurinTech is the leader in Artificial Intelligence Optimisation. TurinTech empowers businesses to build efficient and scalable AI by automating the whole data science lifecycle with multi-objective optimisation. TurinTech enables organisations to drive AI transformation with minimum human effort, at scale and at speed.
TurinTech — AI. Optimised.