Machine Learning: Your Ultimate Feature Selection Guide Part 2 - Select the Real Best

Written by teenl0ve | Published 2024/01/31
Tech Story Tags: ml-engineering | machine-learning-tutorials | ai-applications | artificial-intelligence | machine-learning-uses | machine-learning-ai | machine-learning | ai

TL;DR: Part 2 of our feature selection series discusses the intricacies of wrapper and embedded methods, their computational aspects, and practical applications. Stay tuned for Part 3 on AutoML solutions.

Introduction

Welcome to the second chapter in our series on feature selection methods in machine learning. In the first chapter, we laid the foundation by exploring filter methods, which offer a cost-effective and efficient approach for reducing the number of features in a dataset. Now, we will delve deeper into the world of feature selection by focusing on two other critical methods: wrapper methods and embedded methods.

Wrapper Methods

Wrapper methods represent a more complex and computationally intensive approach compared to filter methods. They use predictive models to score feature subsets, thereby creating a more tailored and potentially more effective set of features for a specific model. This chapter will explore how wrapper methods operate, their advantages, and the trade-offs involved, particularly in terms of computational cost and the risk of overfitting.

Embedded Methods

Embedded methods, on the other hand, integrate feature selection into the model-building process. These methods are intrinsic to certain algorithms that inherently perform feature selection while learning the model parameters. Embedded methods offer a balanced approach, combining the strengths of both filter and wrapper methods. We will examine various embedded techniques, discussing how they seamlessly incorporate feature selection into model training and how they compare to the other methods in terms of efficiency and effectiveness.

As we progress through this chapter, we aim to provide a comprehensive understanding of wrapper and embedded methods, complete with practical examples and insights. This knowledge will empower you to make informed decisions when selecting the appropriate feature selection technique for your machine-learning projects. Stay tuned as we unravel the complexities and nuances of these advanced feature selection methods.

Wrapper Methods

The essence of this category of methods is that the model is run on different subsets of features from the original training dataset. The subset that achieves the best performance on the training set is then selected and evaluated on the test set (the test set does not participate in choosing the optimal subset).

There are two approaches in this class of methods:

  • Forward selection
  • Backward elimination

The first method starts with an empty subset, to which features are gradually added (the optimal inclusion is chosen at each step). The second starts with a subset equal to the original set of features, from which features are gradually removed (retraining the classifier at each step).

Forward Selection

  • SequentialFeatureSelector (SFS) initially starts without features and finds the one that maximizes the cross-validation score.
  • Once the first feature is selected, SFS repeats the process, adding a new feature to the previously selected one.
  • The procedure continues until the desired number of features is reached, determined by the n_features_to_select parameter. If the user does not specify the exact number to select or a quality improvement threshold, the algorithm automatically selects half of the existing set of features.

Below is a function for the feature-inclusion process based on the OLS linear regression model from the statsmodels library, where the criterion for feature selection is the statistical significance indicator (p-value) obtained from that model. As in the first chapter, we will use the Kaggle dataset from “Titanic - Machine Learning from Disaster” (https://www.kaggle.com/competitions/titanic/data).
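
A minimal sketch of such a function (assuming train is a pandas DataFrame of features and train_y is the target series) could look like this:

import pandas as pd
import statsmodels.api as sm

def forward_selection(data, target, significance_level=0.05):
    selected_features = []
    remaining_features = data.columns.tolist()
    while remaining_features:
        # p-value of each candidate feature when added to the current subset
        p_values = pd.Series(dtype=float)
        for feature in remaining_features:
            model = sm.OLS(target, sm.add_constant(data[selected_features + [feature]])).fit()
            p_values[feature] = model.pvalues[feature]
        if p_values.min() < significance_level:
            best_feature = p_values.idxmin()
            selected_features.append(best_feature)
            remaining_features.remove(best_feature)
        else:
            break
    return selected_features

# Example of usage
forward_selection(train, train_y)

The sklearn library implements the same forward-inclusion logic in SequentialFeatureSelector, which accepts any estimator (here, logistic regression):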

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
sfs = SequentialFeatureSelector(LR, n_features_to_select=5)
sfs.fit(train, train_y)

sfs.get_support()

sfs.get_feature_names_out()

Output:

The mlxtend library also implements a SequentialFeatureSelector function, which accepts not only the model but also the metric used for selection (scoring), as well as the forward=True/False parameter, which controls whether features are added or removed. (A similar parameter exists in sklearn's SequentialFeatureSelector, where it is called direction and takes the values 'forward'/'backward'.)

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
# Sequential Forward Selection(sfs)
sfs = SFS(LogisticRegression(),
          k_features=11,
          forward=True,
          floating=False,
          scoring='accuracy',
          cv=0)

sfs.fit(train, train_y)
sfs.k_feature_names_

Output:

Each of these examples yields a filtered list of features that, according to the models used, cannot be improved in terms of the quality metric by adding any of the remaining features.

Backward Elimination

The same process, but in the opposite direction: it starts with all features and then begins to remove them from the set.

Again, let's consider a custom-written function using the OLS model.

import statsmodels.api as sm

def backward_elimination(data, target, significance_level=0.05):
    features = data.columns.tolist()
    while len(features) > 0:
        # Fit OLS on the current feature set and inspect p-values (skipping the constant)
        features_with_constant = sm.add_constant(data[features])
        p_values = sm.OLS(target, features_with_constant).fit().pvalues[1:]
        max_p_value = p_values.max()
        if max_p_value >= significance_level:
            # Drop the least significant feature and repeat
            excluded_feature = p_values.idxmax()
            features.remove(excluded_feature)
        else:
            break
    return features

# Example of usage
backward_elimination(train, train_y)

Output:

Again, we use the SequentialFeatureSelector function from sklearn, this time with direction='backward' (the default value of this parameter is 'forward').

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
sfs = SequentialFeatureSelector(LR, n_features_to_select=11, direction='backward')
sfs.fit(train, train_y)

sfs.get_support()

sfs.get_feature_names_out()

Output:

from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# Sequential Backward Selection (SBS)
sbs = SFS(LogisticRegression(),
          k_features=11,
          forward=False,
          floating=False,
          cv=0)
sbs.fit(train, train_y)
sbs.k_feature_names_

Output:

Just as with forward inclusion, we obtain a list of features as the output. Excluding any further feature from this list during model training would lead to a deterioration in the quality metric.

Summing Up Wrapper Methods

In our exploration of wrapper methods, a key aspect to consider is their high computational cost. Both forward selection and backward elimination are resource-intensive processes, primarily because they involve running the model multiple times for various combinations of features. This iterative process, while thorough, demands significant computational power, especially for datasets with a large number of features.

To mitigate these challenges, several strategies can be employed:

  1. Random Permutations: Instead of exhaustively evaluating all possible combinations of features, random permutations can be used to evaluate the impact of adding or eliminating not just the best or worst feature at each step, but also the second-best, third-best, and so forth. This allows a broader exploration of the feature space. Here's how we can implement this:

    from sklearn.model_selection import cross_val_score
    import numpy as np
    import random
    
    def stepwise_random_permutation(model, X, y, forward=True, num_steps=5, num_permutations=30):
        best_score = 0
        selected_features = set()
        remaining_features = set(X.columns)
    
        for step in range(num_steps):
            step_scores = []
            for _ in range(num_permutations):
                if forward:
                    if not remaining_features:
                        break
                    # Randomly pick a candidate feature to add
                    feature_to_add = random.choice(list(remaining_features))
                    current_features = list(selected_features) + [feature_to_add]
                else:
                    if not selected_features:
                        break
                    # Randomly pick a candidate feature to remove
                    feature_to_remove = random.choice(list(selected_features))
                    current_features = list(selected_features - {feature_to_remove})
    
                score = cross_val_score(model, X[current_features], y, cv=5).mean()
                step_scores.append((score, feature_to_add if forward else feature_to_remove))
    
            if not step_scores:
                break
    
            # Apply the best-scoring move found in this step
            step_scores.sort(reverse=True)
            best_feature = step_scores[0][1]
            if forward:
                selected_features.add(best_feature)
                remaining_features.remove(best_feature)
            else:
                selected_features.remove(best_feature)
    
            if step_scores[0][0] > best_score:
                best_score = step_scores[0][0]
    
        return selected_features, best_score
    

    This function will iteratively add or remove features based on their contribution to the model's performance, not necessarily choosing the best or worst feature at each step but exploring a range of top features to determine the optimal combination.
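
    For example, a hypothetical call reusing the train and train_y objects from earlier:

    from sklearn.linear_model import LogisticRegression

    selected, score = stepwise_random_permutation(LogisticRegression(max_iter=1000), train, train_y,
                                                  forward=True, num_steps=5, num_permutations=30)
    print(selected, score)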

  2. Genetic Algorithms: These algorithms mimic the process of natural selection to identify the most effective feature subsets. By using operations like mutation, crossover, and selection, genetic algorithms can efficiently navigate the search space for optimal or near-optimal solutions. This method provides a balance between exhaustive search and random selection, potentially reducing computation time while still identifying strong feature combinations.

    The function below uses a genetic algorithm approach for feature selection. It is a more advanced method and might require additional libraries like deap.

    import random
    
    from deap import base, creator, tools, algorithms
    import numpy as np
    from sklearn.model_selection import cross_val_score
    
    def genetic_algorithm_feature_selection(model, X, y, n_gen=10, size_pop=50, prob_mut=0.2):
        # Each individual is a binary mask over the columns of X
        creator.create("FitnessMax", base.Fitness, weights=(1.0,))
        creator.create("Individual", list, fitness=creator.FitnessMax)
        
        toolbox = base.Toolbox()
        toolbox.register("attr_bool", random.randint, 0, 1)
        toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_bool, n=X.shape[1])
        toolbox.register("population", tools.initRepeat, list, toolbox.individual)
    
        def evalOneMax(individual):
            # Fitness = mean cross-validation score of the model on the selected columns
            features = [i for i, x in enumerate(individual) if x == 1]
            if len(features) == 0:
                return 0,
            score = cross_val_score(model, X.iloc[:, features], y, cv=5).mean()
            return score,
    
        toolbox.register("evaluate", evalOneMax)
        toolbox.register("mate", tools.cxTwoPoint)
        toolbox.register("mutate", tools.mutFlipBit, indpb=prob_mut)
        toolbox.register("select", tools.selBest)
    
        pop = toolbox.population(n=size_pop)
        hof = tools.HallOfFame(1)
        stats = tools.Statistics(lambda ind: ind.fitness.values)
        stats.register("avg", np.mean)
        stats.register("min", np.min)
        stats.register("max", np.max)
    
        # Run the evolutionary loop and keep the best individual found
        algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.2, ngen=n_gen, stats=stats, halloffame=hof)
    
        return hof[0]
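
    A hypothetical call (reusing the train and train_y objects used earlier); the result is a binary mask over the columns of train:

    from sklearn.linear_model import LogisticRegression

    best_individual = genetic_algorithm_feature_selection(LogisticRegression(max_iter=1000), train, train_y)
    selected_features = train.columns[[bool(bit) for bit in best_individual]].tolist()
    print(selected_features)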
    
  3. Sequential Runs of Forward/Backward Elimination: In practice, a more feasible approach is often to run forward or backward elimination sequentially. This process involves adding (or removing) one feature at a time based on its impact on the model's performance. While this method does not guarantee an absolute optimal set of features, it tends to identify a highly effective subset in a more computationally manageable way.

    from mlxtend.feature_selection import SequentialFeatureSelector as SFS
    from sklearn.base import clone
    
    def alternating_sequential_feature_selector(model, X, y, iterations=5, scoring='accuracy', cv=5):
        current_features = set()
        best_score = 0
        best_features = None
    
        for i in range(iterations):
            forward = i % 2 == 0  # Alternate between forward and backward
            
            sfs = SFS(clone(model), 
                      k_features='best', 
                      forward=forward, 
                      scoring=scoring, 
                      cv=cv)
    
            sfs = sfs.fit(X[list(current_features) + list(X.columns.difference(current_features))], y)
            
            # Update current features and score
            new_features = set(X.columns[list(sfs.k_feature_idx_)])
            new_score = sfs.k_score_
            
            if new_score > best_score:
                best_features = new_features
                best_score = new_score
    
            current_features = new_features
    
        return best_features, best_score
    

    In this function, the iterations parameter controls how many times the SFS is run with alternating directions. The model is cloned at each iteration to ensure that a fresh instance is used for each run of SFS. This function alternates between adding features (forward selection) and removing them (backward elimination), seeking to refine the set of features iteratively.

    This method can lead to a more nuanced feature selection process, potentially balancing the inclusion of important features and the removal of redundant ones across multiple iterations.
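
    A hypothetical call, again reusing train and train_y:

    from sklearn.linear_model import LogisticRegression

    best_feats, best_score = alternating_sequential_feature_selector(
        LogisticRegression(max_iter=1000), train, train_y, iterations=4)
    print(best_feats, best_score)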

To sum up, all these approaches acknowledge the trade-off between computational cost and the quality of the selected feature set. By applying these strategies, it becomes possible to utilize wrapper methods more effectively, even on larger datasets, while maintaining a balance between computational efficiency and feature set optimality. In the following sections, we will delve deeper into the practical application of these strategies and explore how they can be implemented to enhance the feature selection process in machine learning.

Embedded Methods

Finally, embedded methods do not separate feature selection from model training; instead, selection is performed within the model fitting process itself. These methods require less computation than wrappers, although more than filters.

Examples of embedded methods:

  • Regularization
    • Ridge
    • Lasso
  • Feature importance
    • Tree-Based
    • RFE (Recursive Feature Elimination)
    • Shapley values

Regularization

Let's start with regularization methods. There are various types of regularization, but the main principle is common: without regularization, a classifier builds a model that adjusts as closely as possible to all points of the training data set.

For example, if the algorithm is linear regression, the coefficients of the polynomial that approximates the dependency between the features and the target variable are selected. If mean squared error serves as the quality measure for the selected coefficients, the parameters are chosen so that the total squared deviation of the predicted points from the real ones is minimal.

The idea of regularization is to build an algorithm that minimizes not only the error but also the number of used variables.

Here, I will only present the formulas for the regularized objectives without deriving them.
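
For a linear model fitted with mean squared error, the standard Ridge (L2) and Lasso (L1) objectives take the following form:

$$L_{Ridge}(\beta) = \sum_{i=1}^{n} \left(y_i - x_i^{T}\beta\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^{2}$$

$$L_{Lasso}(\beta) = \sum_{i=1}^{n} \left(y_i - x_i^{T}\beta\right)^2 + \lambda \sum_{j=1}^{p} \left|\beta_j\right|$$

Here the first term is the usual squared error, and the second is the regularizing operator weighted by lambda.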

During training, the sizes of the coefficients will be proportional to the importance of the corresponding variables, and near-zero values will appear for the variables that contribute least to reducing the error.

The regularization parameter lambda adjusts the contribution of the regularizing term to the total sum. It can be used to set the priority: model accuracy or a minimal number of used variables.

Lasso is similar to Ridge in everything except the regularizing operator: it uses the sum of the absolute values of the coefficients rather than the sum of their squares. Despite the slight difference, the properties differ. In Ridge, as lambda increases, all coefficients approach zero but usually do not become exactly zero. In Lasso, as lambda increases, more coefficients become exactly zero and stop contributing to the model altogether. Thus, actual feature selection occurs: more significant features retain non-zero coefficients, while less significant ones become zero.

Thus, regularization is a kind of penalty for the excessive complexity of the model, which allows protection against overfitting in the presence of redundant features.

In the example below, we run both regressions at once, obtain the coefficients (weights for each feature) from the trained models, and then construct graphs to visualize the values of these coefficients.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression

# Ridge (L2 penalty)
clf = LogisticRegression(penalty='l2')
clf.fit(train, train_y)
coefs_ridge = pd.DataFrame(clf.coef_.T, index=train.columns, columns=['coef'])

# Lasso (L1 penalty)
clf = LogisticRegression(penalty='l1', solver='liblinear')
clf.fit(train, train_y)
coefs_lasso = pd.DataFrame(clf.coef_.T, index=train.columns, columns=['coef'])

fig, ax = plt.subplots(1, 2, figsize=(16, 8))
plt.rcParams.update({'font.size': 8})

sns.barplot(data=coefs_ridge.sort_values(
    'coef', ascending=False).reset_index(), x='coef', y='index', ax=ax[0])
ax[0].set_xlabel('Ridge Coef')
sns.barplot(data=coefs_lasso.sort_values(
    'coef', ascending=False).reset_index(), x='coef', y='index', ax=ax[1])
ax[1].set_xlabel('Lasso Coef')

As we can observe, Lasso regularization indeed zeroed out a large portion of the feature weights, so we can confidently keep the remaining ones as the selected features. With Ridge regularization we also get a clear picture, with the only difference being that we need to choose a threshold on the coefficient magnitudes to cut off features.
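
One simple way to turn these coefficients into an explicit feature subset is to threshold their absolute values. A sketch, assuming the coefs_ridge and coefs_lasso frames built above (the Ridge cutoff is a hypothetical value that should be tuned):

# Keep features with non-zero Lasso coefficients
selected_lasso = coefs_lasso[coefs_lasso['coef'].abs() > 0].index.tolist()

# For Ridge, cut off features below a chosen coefficient magnitude
ridge_threshold = 0.1  # hypothetical cutoff, tune for your data
selected_ridge = coefs_ridge[coefs_ridge['coef'].abs() > ridge_threshold].index.tolist()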

Feature Importance

Feature importance is a metric that shows the "importance" of each feature for training a specific model. Higher importance means that a particular feature will have a greater influence on the model used to predict a specific target variable.

Let's take a real-world example for better understanding. Suppose you need to buy a new home near your workplace. When buying a house, you may consider various factors. The most important factor in decision-making might be the location of the property, so you are likely to search for homes that are close to your workplace. Feature importance works in a similar way; it ranks features based on the impact they have on the model's predictions.

Tree Based Models

Feature importance is calculated as the reduction in the amount of unexplained information at a node (a branching point that divides a tree leaf into two other leaves), weighted by the probability of reaching that node. The probability of a node can be calculated by dividing the number of samples that reach the node by the total number of samples. The higher the value, the more important the feature.

Mathematically, this can be described as follows.
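
One common formulation is the impurity-based importance used, for example, by scikit-learn's tree models. For a node t that splits on feature j,

$$ni_t = \frac{N_t}{N}\left(I_t - \frac{N_{t_{left}}}{N_t} I_{t_{left}} - \frac{N_{t_{right}}}{N_t} I_{t_{right}}\right)$$

where N is the total number of samples, N_t is the number of samples reaching node t, and I is the node impurity (e.g., Gini or entropy). The importance of feature j is then the normalized sum of these contributions over all nodes that split on j:

$$FI_j = \frac{\sum_{t\,:\,\text{split on } j} ni_t}{\sum_{t} ni_t}$$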

For example, let's consider a regular classifier based on decision trees (DecisionTreeClassifier) and obtain the feature importance values using the built-in .feature_importances_ attribute after training.

from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import matplotlib.pyplot as plt

model = DecisionTreeClassifier()
# fit the model
model.fit(train, train_y)
# get importance
importance = model.feature_importances_
fi_tree = pd.DataFrame(importance.T, index=train.columns, columns=['importance'])
plt.figure(dpi=100, figsize=(8, 8))
plt.rcParams.update({'font.size': 9})
# plot_scores is a custom plotting helper (defined outside this snippet)
plot_scores(fi_tree.importance, "Feature Importance")

Output:

On the graph, we can observe the distribution of feature importance; the greater a feature's importance, the more it is expected to contribute to the final model.

Recursive Feature Elimination (RFE)

One of the examples of such methods is Recursive Feature Elimination (RFE). As the name suggests, it belongs to the class of algorithms that progressively exclude features from the overall pool. This method requires choosing a classifier that will be used to assess the features, for example, linear regression.

For instance, linear regression is first trained on all features, and the feature with the smallest absolute coefficient is discarded. In the next step, training is performed on all remaining features, and so on.

Although the exclusion method better tracks the relationships between features, it is much more computationally expensive. Moreover, all methods of this category require significantly more computation than filtering methods. Additionally, in the case of a large number of features and a small size of the training data set, these methods are at risk of overfitting.

The step-by-step process of RFE:

  1. First, RFE trains a user-defined model on the entire set of features. The user model can be any supervised learning model with a fit method that provides information about feature importance.

  2. It then computes the importance of each feature from the trained model. Importance can be obtained either through any model attribute (e.g., coef_, feature_importances_), or via a user-defined metric.

  3. The method then discards the feature with the least feature importance indicator and repeats steps 1 and 2 with a reduced number of features.

  4. Step 3 is repeated until the number of features reaches the user-specified count.

The difference between RFE and SFS (Sequential Feature Selection) is that SFS does not require the calculation of feature importance indicators. SFS can also perform both forward and backward selection, while RFE can only perform backward elimination. Furthermore, SFS is usually slower than RFE. For example, during backward elimination, transitioning from n features to n-1 using k-fold cross-validation requires training n * k models, whereas for RFE, only k trainings are needed.

As an example, consider the RFE function from the sklearn library, passing the logistic regression model into it (RFE(LR, n_features_to_select=10)):

from sklearn.feature_selection import RFE

# select the 10 most informative features
rfe = RFE(LR, n_features_to_select=10)
rfe = rfe.fit(train, train_y)
rfe.get_feature_names_out()

Output:

As a result, we get a list of features (10 in total, since we set this threshold ourselves) that have shown the best results in terms of feature importance across the elimination iterations.

Shapley Values (Shapley Additive Explanations)

Game theory includes a category of cooperative games, in which groups of players can combine their efforts, forming coalitions to achieve the best outcome. The optimal distribution of the payoff among the players can then be determined using the Shapley vector. It represents a distribution in which each player's payoff equals their average contribution to the overall welfare under a certain coalition formation mechanism (more strictly, from Wikipedia: the expected value of their contribution to the corresponding coalitions):
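
In its standard form, the Shapley value of player i, for a set of players N and a value function v, is

$$\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\left(v(S \cup \{i\}) - v(S)\right)$$

where the sum runs over all coalitions S that do not contain player i.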

Applying the above principles of game theory to the interpretation of ML models, the following assumptions can be made:

  • The outcome of supervised learning (based on a given example) is a game;

  • the payoff is the difference between the expected outcome on all available examples and the result obtained on a given example;

  • players' contributions to the game are the influence of each feature value on the payoff, i.e., the result.

Then, when calculating the Shapley vector, it is necessary to form coalitions from a limited set of features. However, not every ML model allows simply removing a feature without retraining the model from scratch. Therefore, to form coalitions, features are usually not removed but replaced with random values from a "background" data set. The averaged result of the model with random feature values is equivalent to the result of the model where this feature is completely absent.

The practical implementation of this approach is represented by the special SHAP library (Shapley Additive exPlanations), which is supported for "tree ensemble" models in XGBoost, LightGBM, CatBoost, scikit-learn, and pyspark.

For example, we train an xgboost model and use the Shap library to illustrate the distributions of Shapley vector values.

import shap
import xgboost as xgb

model_xgb = xgb.XGBClassifier(
    n_estimators=100, max_depth=3, learning_rate=0.1, booster='gbtree',
    subsample=1, colsample_bytree=1, colsample_bylevel=1, colsample_bynode=1,
    gamma=0, min_child_weight=1, reg_alpha=0, reg_lambda=1,
    objective='binary:logistic', random_state=0)
                            
model_xgb.fit(train, train_y)


shap_values = shap.TreeExplainer(model_xgb).shap_values(train)
shap.summary_plot(shap_values, train, show=True, color_bar=True)

Output:

The resulting graph is interpreted as follows:

  • values to the left of the central vertical line push the model's prediction toward the negative class (0), and values to the right toward the positive class (1); the further left a point lies, the more confident the model is in assigning that observation to class 0, and conversely, the further right, to class 1;

  • the thickness of the line is directly proportional to the number of observation points;

  • the redder the points, the higher the feature value at that point.

For example, in our data the variable Sex takes two values: {"male": 1, "female": 0}. In the picture, we can see that a significant portion (quite large, judging by the thickness of the line) of the points with value 1 (redder means a higher feature value) lies to the left of the vertical line. This means the feature affects the model as follows: the model confidently classifies men as not survived and women as survived (the target variable Survived is 1/0, where 1 means survived).

Another possible usage method is calculating the average Shapley value:

explainer = shap.Explainer(model_xgb, train)
shap_values2 = explainer(train)
shap.plots.bar(shap_values2)

This graph allows us to look at the average absolute Shapley values for each feature, which can be interpreted in the same way as Feature Importance.
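
To turn this ranking into an explicit selection, a simple sketch (assuming shap_values is the 2-D array computed above) is to keep the features with the largest mean absolute Shapley values:

import numpy as np
import pandas as pd

# Rank features by mean absolute SHAP value and keep, for example, the top 10
mean_abs_shap = pd.Series(np.abs(shap_values).mean(axis=0), index=train.columns)
top_shap_features = mean_abs_shap.sort_values(ascending=False).head(10).index.tolist()
print(top_shap_features)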

Conclusion

As we conclude the second installment of our series on feature selection methods in machine learning, we have traversed the intricacies of wrapper and embedded methods. Our journey began with an in-depth examination of wrapper methods, which, while computationally intensive, offer a tailored selection process, potentially enhancing model performance. We then ventured into the realm of embedded methods, which elegantly integrate feature selection within the model training process, striking a harmony between efficiency and effectiveness.

Throughout this chapter, we've armed ourselves with practical insights and examples that illuminate the strengths and potential pitfalls of each method. From the careful orchestration of features in wrapper methods to the inherent selection capabilities of embedded methods, we now possess a clearer vision for choosing the right technique for our projects.

Stay tuned for our next chapter, where we will demystify AutoML's role in feature selection, providing a gateway to efficient model building and empowering us to harness the full potential of our data. Join us as we step into the future of machine learning, where feature selection becomes less of an art and more of a science, all thanks to the advancements in AutoML.


Written by teenl0ve | Data Science expert with desire to help companies advance by applying AI for process improvements.
Published by HackerNoon on 2024/01/31