Feature Selection is a crucial step in feature engineering within the Machine Learning life cycle. It focuses on identifying the most impactful predictors (fields or attributes) from a dataset's original set of features. By filtering the data down to only the most relevant features, it plays a critical role in solving the underlying problem.
Feature selection, an integral part of feature engineering, works alongside Feature Extraction. While Feature Selection involves selecting a subset of existing features, Feature Extraction involves creating new features. Feature selection aims to significantly enhance model performance, reduce computational costs, and improve interpretability.
Feature selection reduces overfitting by eliminating redundant, irrelevant, or noisy data, leading to higher model accuracy and efficiency.
This article discusses how automated feature selection can be implemented in practice.
Why feature selection is important in ML
Machine learning models learn from data, and the quality and relevance of that data significantly impact their performance. Too many redundant or unimportant features can cause overfitting, where the model memorizes the peculiarities of the training data instead of learning generalizable patterns, so removing them reduces model complexity. Equally, too few features may result in an underfit model, which fails to capture the essential patterns in the dataset and produces poor results.
An automated feature selection method helps identify and retain the most informative features, improving the model's performance, efficiency, accuracy, interpretability, and explainability.
Automated Feature Selection
Automated Feature Selection uses algorithms to select the most relevant features in a given dataset and enhance the model performance by reducing overfitting and minimizing computational costs. Unlike manual feature selection, automated methods are faster, handle large datasets more efficiently, and are less subjective.
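To make this concrete, here is a minimal sketch of how a filter-style selector can be wired into a scikit-learn Pipeline so that feature selection happens automatically during model training. The choice of f_classif, k=5, and a random forest classifier is an illustrative assumption rather than a prescribed setup, and the file name matches the winequality.csv dataset used later in this article.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("winequality.csv")
X = df.drop("quality", axis=1)   # predictors
y = df["quality"]                # target

pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=5)),           # keep the 5 highest-scoring features
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),  # any estimator can follow the selector
])

# The selector is re-fit inside each fold, so the evaluation is not biased by leakage
scores = cross_val_score(pipeline, X, y, cv=5)
print("Cross-validated accuracy:", scores.mean())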
Filter Methods
Filter methods evaluate the statistical relationship between each feature and the target variable, independently of any machine learning model. Common techniques include the correlation coefficient, the chi-square test, mutual information, and the variance threshold. Below are sample Python implementations of each, using libraries such as scikit-learn and pandas.
Correlation Coefficient
This method measures how strongly each feature correlates with the other features and with the target, and is usually explored visually with a correlation heatmap. Let's examine it using the winequality.csv dataset, where the target is the wine's quality and the remaining columns are the predictors. During feature selection, features with a low correlation to the target can be dropped.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the wine quality dataset
df_temperature = pd.read_csv('winequality.csv', encoding='utf-8')
df_temperature.head()
Output: the first rows of the dataset, with columns fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality.
# Compute pairwise correlations and visualize them as a heatmap
correlation_matrix = df_temperature.corr()
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.show()
Output: correlation heatmap of all features.
Observations from the heatmap
Feature Relationships
- Alcohol and Quality: There is a moderate positive correlation (approximately 0.48) between “alcohol” and “quality,” suggesting that higher alcohol content might be associated with higher wine quality ratings.
- Density and Fixed Acidity: There’s a strong positive correlation (around 0.67) between “density” and “fixed acidity,” indicating that denser wines tend to have higher acidity.
- pH and Fixed Acidity: There is a strong negative correlation (around –0.68) between “pH” and “fixed acidity,” meaning wines with higher acidity tend to have lower pH values.
- Total Sulfur Dioxide and Free Sulfur Dioxide: These two features are highly correlated (around 0.67), which is expected since total sulfur dioxide includes free sulfur dioxide.
Features with high correlations with each other, like “fixed acidity” and “density” or “free sulfur dioxide” and “total sulfur dioxide,” may contain redundant information. We might consider using one of each correlated pair in feature selection to avoid multicollinearity.
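As a rough sketch of how that redundancy check can be automated, the snippet below flags one feature from each pair whose absolute pairwise correlation exceeds a cutoff; the 0.65 cutoff is an illustrative assumption, not a value from the original analysis.
import numpy as np

# Absolute pairwise correlations among the predictors
corr = df_temperature.drop("quality", axis=1).corr().abs()
# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Flag a feature if it is highly correlated with any earlier feature
to_drop = [col for col in upper.columns if (upper[col] > 0.65).any()]
print("Candidate features to drop:", to_drop)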
Quality Relationships:
- Quality and Volatile Acidity: There is a negative correlation (around -0.39) between “quality” and “volatile acidity,” suggesting that higher volatile acidity might be associated with lower wine quality.
- Quality and Alcohol: As mentioned, “alcohol” has a moderately positive correlation with “quality,” potentially making it an essential feature in predicting wine quality.
Final Understanding: Overall, this heatmap provides insights into which features have strong relationships with each other and the target feature (“quality” in this case). It can guide feature selection and engineering decisions on what should be chosen and ignored in the data preprocessing stage of a machine learning workflow.
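The same heatmap logic can be turned into a simple automated filter. As a sketch (the 0.2 cutoff is an assumption for illustration), we can keep only the features whose absolute correlation with quality exceeds a chosen threshold.
# Keep features whose absolute correlation with the target exceeds a cutoff
target_corr = correlation_matrix["quality"].drop("quality").abs()
selected_by_corr = target_corr[target_corr > 0.2].index.tolist()
print("Features correlated with quality above 0.2:", selected_by_corr)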
Chi-Square Test
The Chi-Square Test is a statistical test of association between categorical (or non-negative, count-like) features and a categorical target. It compares observed frequencies against expected frequencies and identifies features that have a significant relationship with the target variable. In scikit-learn, chi2 requires non-negative feature values, which the wine measurements (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, and so on) satisfy.
from sklearn.feature_selection import SelectKBest, chi2

# Separate the predictors from the target
X = df_temperature.drop("quality", axis=1)
y = df_temperature["quality"]

# Select the top k features based on chi-squared scores
chi2_selector = SelectKBest(score_func=chi2, k=5)
X_kbest = chi2_selector.fit_transform(X, y)

# Get the selected feature names
selected_features = X.columns[chi2_selector.get_support()]
print("Selected features:", selected_features)
Output
Selected features: Index(['volatile acidity', 'citric acid', 'free sulfur dioxide',
       'total sulfur dioxide', 'alcohol'],
      dtype='object')
Final Observation and Understanding: The output indicates that the top 5 features chosen by the chi-squared test are (their underlying scores are inspected in the sketch after this list):
- volatile acidity
- citric acid
- free sulfur dioxide
- total sulfur dioxide
- alcohol
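To see why these five features win, we can inspect the chi-squared scores and p-values that SelectKBest computed for every feature; this short sketch simply reuses the fitted chi2_selector from above.
import pandas as pd

# Score and p-value for each feature, highest chi-squared score first
chi2_scores = pd.DataFrame(
    {"chi2_score": chi2_selector.scores_, "p_value": chi2_selector.pvalues_},
    index=X.columns,
).sort_values("chi2_score", ascending=False)
print(chi2_scores)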
Mutual Information
Mutual information measures the dependency between each feature in the dataset and the target (quality); a high mutual information score suggests the feature is relevant. For a classification task, scikit-learn's mutual_info_classif estimates this dependency for every feature, and the scores can then be used to rank and select features.
from sklearn.feature_selection import mutual_info_classif
import pandas as pd

# Mutual information scores for classification
mi_scores = mutual_info_classif(X, y)
mi_series = pd.Series(mi_scores, index=X.columns)
mi_series = mi_series.sort_values(ascending=False)
print("Top features based on mutual information:", mi_series.head())
Output
Top features based on mutual information:
alcohol 0.165384
volatile acidity 0.148159
sulphates 0.124302
density 0.089712
total sulfur dioxide 0.078592
Final Observation and Understanding:
- A higher mutual information score indicates a stronger dependency, meaning the feature is more informative for predicting the target. Alcohol has the highest score (0.165384), showing the strongest dependency with the target variable quality among all features.
- The next most informative features are volatile acidity, sulphates, density, and total sulfur dioxide, with slightly lower scores.
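If we want the mutual-information ranking to drive the selection automatically rather than just being printed, the same scores can be plugged into SelectKBest; k=5 here is an illustrative assumption, and the scores vary slightly between runs because mutual_info_classif uses a randomized nearest-neighbor estimator.
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Select the top k features ranked by mutual information with the target
mi_selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_mi = mi_selector.fit_transform(X, y)
print("Selected features:", list(X.columns[mi_selector.get_support()]))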
Variance Threshold
The variance threshold method removes features whose variance falls below a given threshold, on the assumption that features with very little variation across samples carry little useful information for the model.
from sklearn.feature_selection import VarianceThreshold

# Remove all features with variance below a threshold
variance_selector = VarianceThreshold(threshold=0.1)
X_high_variance = variance_selector.fit_transform(X)

# Get the selected feature names
selected_features = X.columns[variance_selector.get_support()]
print("Features with high variance:", selected_features)
Output
Features with high variance: Index(['fixed acidity', 'residual sugar', 'free sulfur dioxide',
       'total sulfur dioxide', 'alcohol'],
      dtype='object')
Final Observation and Understanding:
This indicates that these features have a variance above the threshold (0.1) and are therefore retained for further analysis or modeling.
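Note that variance depends on each feature's units, so the 0.1 cutoff is only meaningful for unscaled data. As a quick sketch, we can inspect the per-feature variances the fitted selector computed to see exactly which columns fell below the threshold.
import pandas as pd

# Per-feature variances computed by the fitted selector, smallest first
variances = pd.Series(variance_selector.variances_, index=X.columns).sort_values()
print(variances)
print("Dropped (variance <= 0.1):", list(variances[variances <= 0.1].index))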
Conclusion
We have discussed why automated feature selection is critical in Machine Learning workflows, especially when handling datasets with numerous features, and demonstrated it on the winequality.csv dataset.
Applying each technique to the sample dataset, we identified the most relevant features, which enhances model performance, interpretability, and computational efficiency.
Techniques like the correlation coefficient offer insights into relationships among features, helping to identify highly correlated variables. The Chi-Square Test evaluates the association between features and the target (quality), enabling us to retain those with a significant relationship to it ('volatile acidity', 'citric acid', 'free sulfur dioxide', 'total sulfur dioxide', 'alcohol').
The mutual information method quantifies the dependency between features and the target, emphasizing the features that provide the most predictive value (alcohol, volatile acidity, sulphates, density, and total sulfur dioxide).
Finally, the Variance Threshold eliminates low-variance features that add little information, further simplifying the model; in our run it retained 'fixed acidity', 'residual sugar', 'free sulfur dioxide', 'total sulfur dioxide', and 'alcohol'.
Each technique brings unique strengths to the feature selection process, and choosing the appropriate method based on dataset characteristics can significantly impact the model’s success in real-world applications.
By leveraging these automated feature selection techniques, data scientists can streamline the model-building process, reduce overfitting, and improve model accuracy and efficiency.