Image designed by the author – Shanthababu
Introduction
Every ML Engineer and Data Scientist must understand the significance of “Hyperparameter Tuning (HPs-T)” while selecting the right machine/deep learning model and improving the performance of the model(s).
Put simply, model selection is a major exercise in every machine learning project, and it depends heavily on choosing a suitable set of hyperparameters, which are indispensable for training a model. Hyperparameters are settings of the selected model that cannot be learned from the data; they must be provided before the model enters the training stage. Ultimately, the performance of a machine learning model improves with a better choice of hyperparameters and of the tuning and selection technique. The main intention of this article is to make you aware of hyperparameter tuning.
Hyperparameter tuning basically means tweaking these settings of the model, and it is often a prolonged process.
Before going into detail, let's ask ourselves some useful questions about hyperparameter tuning; I am sure this will help you a lot with this magic word. I have experienced this personally and explain it here.
What are Hyperparameters? How Do They Differ from Model Parameters?
As we know, some parameters are learned internally from the given dataset and are then used for making predictions, classifications, and so on. These are the so-called Model Parameters. They vary with the nature of the data, and we cannot control them directly because they depend on the data, like 'm' and 'c' in the linear equation y = mx + c, whose values are the coefficients learned from the given dataset.
Another set of parameters is used to control the behaviour of the model/algorithm and can be adjusted in order to obtain an improved model with optimal performance. These are the so-called Hyperparameters.
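To make the distinction concrete, here is a minimal sketch. The synthetic data (a line with slope 3 and intercept 2) and the choice of Ridge are assumptions made purely for illustration: alpha is set by us before training, while coef_ and intercept_ are learned from the data.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data following y = 3x + 2 plus a little noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.1, size=200)

# alpha is a hyperparameter: we choose it before training starts.
model = Ridge(alpha=1.0)
model.fit(X, y)

# coef_ and intercept_ are model parameters: they are learned from the data.
print("hyperparameter alpha:", model.alpha)
print("learned slope m:", model.coef_[0])
print("learned intercept c:", model.intercept_)
```

The learned slope and intercept should land close to the values used to generate the data, while alpha stays exactly what we passed in.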
Even the best algorithm will sparkle only with your best choice of hyperparameters.
ML Life Cycle
If you ask me what hyperparameters are in simple words, the one-word answer is Configuration.
Without thinking too much, I can say a quick example of a hyperparameter is the “Train-Test Split Ratio (80-20)” in our simple linear regression model.
Image designed by the author – Shanthababu
YES! Now I can see that you're really starting to feel what HPs could be and how they would optimize the model. That's why I said earlier, in easy language, that these are configuration values.
Let me give one more example: you can compare this with selecting the font and its size for better readability and clarity while you document your content to be perfect and precise.
Coming back to machine learning, recall Ridge Regression (L2 Regularization) and Lasso Regression (L1 Regularization). In the regularization term we have lambda (λ), the penalty factor, which helps us get a smooth fit instead of an irregular one.
This term is used to push the coefficient (β) values towards zero in terms of magnitude. For more details, please refer to my earlier article https://www.analyticsvidhya.com/blog/2021/11/study-of-regularization-techniques-of-linear-model-and-its-roles/. This λ is nothing but a hyperparameter.
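The effect of the penalty factor can be sketched in a few lines. The data below is synthetic and the alpha values are arbitrary assumptions; in scikit-learn, the alpha argument of Ridge plays the role of λ. As alpha grows, the overall size of the learned coefficients shrinks towards zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
true_beta = np.array([4.0, -3.0, 2.0, 0.0, 1.0])
y = X @ true_beta + rng.normal(scale=0.5, size=100)

# Fit the same model with increasing penalty strength and record
# the overall size (L2 norm) of the learned coefficients each time.
alphas = [0.01, 1.0, 100.0, 10000.0]
norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X, y)
    norms.append(float(np.linalg.norm(ridge.coef_)))
    print(f"alpha={alpha:>8}: ||beta|| = {norms[-1]:.3f}")
```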
Image designed by the author – Shanthababu
For better clarity and understanding, here is one more classical representation for you.
Image designed by the author – Shanthababu
From the above equation, you can get a better view of what MODEL PARAMETERS and HYPERPARAMETERS are.
Hyperparameters are supplied as arguments (key-value pairs) to the model algorithm when it is initialized, and their values are picked by the data scientist, who builds the model iteratively.
Hyperparameter Space
As we know, there is a list of HPs for any selected algorithm, and our job is to figure out the best combination of HPs and get optimal results by tweaking them strategically. The set of all possible combinations is called the Hyperparameter Space. Somewhere in this space lies the combination that gives the best results, but finding it is not so easy; we have to search throughout the space. Here, every combination of selected HP values is treated as a separate “MODEL” and has to be evaluated on its own. For this reason, there are two generic approaches to search the HP space effectively: GridSearchCV and RandomizedSearchCV. Here CV denotes Cross-Validation.
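To get a feel for how quickly the space grows, here is a small sketch that enumerates a hypothetical space for a decision tree with scikit-learn's ParameterGrid. The particular hyperparameters and values are assumptions chosen only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# A hypothetical hyperparameter space for a decision tree.
space = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 10],
    "min_samples_leaf": [1, 5],
}

# ParameterGrid enumerates every combination in the space.
grid = list(ParameterGrid(space))
print(len(grid))   # 2 * 3 * 2 = 12 candidate "models"
print(grid[0])
```

Adding one more hyperparameter with four values would already push this to 48 candidates, which is why the search strategy matters.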
Image designed by the author – Shanthababu
Before applying the above-mentioned search options to the data/model, we must split the data into 3 different sets: training, validation, and test. I can hear your inner voice: we are already splitting the dataset into Train and Test, so why one more split? There is a valid reason: it prevents “DATA LEAKAGE” during training, validating, and testing. Remember, we shouldn't touch the test dataset until the final evaluation, before the model moves into production deployment.
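One common way to get three sets is to call train_test_split twice. The sketch below uses toy data and a 60/20/20 ratio, both of which are assumptions for illustration rather than a recommendation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 4 features (illustrative only).
X = np.arange(400).reshape(100, 4)
y = np.arange(100) % 2

# First hold out the test set, then carve a validation set out of the rest.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

print(len(X_train), len(X_val), len(X_test))
```

The validation set is used to compare hyperparameter choices, and the test set is kept aside for the final check.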
Data Leakage
Well! Let's quickly understand what data leakage is in ML. It mainly happens when some of the recommended best practices are not followed during the Data Science/Machine Learning life cycle. So what is the issue? The model is trained, then tested successfully with near-perfect accuracy, and is planned to move into production. At this moment, ALL IS WELL.
Still, when actual/real-time data is applied to this model in the production environment, you get poor scores. By this time, you may wonder why this happened and how to fix it. It all comes down to how we handled the data around the training and testing split. During training, the model had knowledge of the data it is trying to predict, so the test scores were too optimistic, and this results in inaccurate and bad predictions after the model is deployed into production.
Causes of Data Leakage
- Data Pre-processing
- The major root cause is doing all EDA and pre-processing before splitting the dataset into train and test
- Normalizing or rescaling the full dataset in one go
- Computing the min/max values of a feature on the full dataset
- Handling missing values before setting aside the train and test sets
- Removing outliers and anomalies on the full dataset
- Applying a standard scaler, other scaling, or a normal-distribution transform on the full dataset
Image designed by the author – Shanthababu
The bottom line is that we should avoid doing anything to our training dataset that involves knowledge of the test dataset, so that our model will perform in production as a generalised model.
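A simple way to follow this rule is to split first and put the pre-processing inside a scikit-learn pipeline. This is a minimal sketch on synthetic data; the dataset and the choice of StandardScaler with LogisticRegression are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (illustrative only).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Split FIRST, so nothing is computed from the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The pipeline learns the scaling statistics from the training data only
# and re-applies those same statistics when predicting on the test data.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

scaler = pipe.named_steps["standardscaler"]
print("means learned from train only:", scaler.mean_.round(2))
print("test accuracy:", round(pipe.score(X_test, y_test), 3))
```

Because the scaler lives inside the pipeline, the same object can later be handed to a cross-validated search without the scaling statistics ever seeing the held-out fold.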
Next, we will go through the hyperparameters available across various algorithms, how we can set them, and how they impact the model.
Steps to Perform Hyperparameter Tuning
- Select the right type of model.
- Review the list of parameters of the model and build the HP space
- Choose a method for searching the hyperparameter space
- Apply a cross-validation scheme
- Assess the model score to evaluate the model
Image designed by the author – Shanthababu
Now, it is time to discuss a few hyperparameters and their influence on the model.
Train, Test Split Estimator: With this, we set the test and train sizes for the given dataset, along with random_state, which seeds the shuffling so that the same split is generated every time. If we omit it, a different seed is used on each run and we get different train and test sets, which makes tracing the model during evaluation a bit complex and leads to unpredictable behaviour. In short, random_state provides the seed for the random number generator in order to stabilize the results.
train_test_split( X, y, test_size=0.4, random_state=0)
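The role of random_state is easy to check on a toy array. The data here is made up for illustration; the point is only that the same seed reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same random_state -> identical split on every call.
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.4, random_state=0)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.4, random_state=0)
print(np.array_equal(a_test, b_test))   # True

# A different seed generally gives a different split.
c_train, c_test, _, _ = train_test_split(X, y, test_size=0.4, random_state=1)
print(np.array_equal(a_test, c_test))
```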
Logistic Regression Classifier: The parameter C in the Logistic Regression Classifier is the inverse of the regularization parameter λ, that is, C = 1/λ, so a larger C means weaker regularization.
LogisticRegression(C=1000.0, random_state=0)
KNN (k-Nearest Neighbors) Classifier: As we know, the k-nearest neighbours algorithm (KNN) is a non-parametric method used for regression and classification problems. It is predominantly used for classification, where the key hyperparameters are the number of neighbours and the power parameter.
KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
– n_neighbors is the number of neighbors
– p is the power parameter of the Minkowski metric
If p = 1, it is equivalent to manhattan_distance
If p = 2, it is equivalent to euclidean_distance
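The two special cases of p can be verified by hand. The helper below is a hypothetical function written for this illustration (it is not part of scikit-learn); it computes the Minkowski distance between two points.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two points."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a, b = (0, 0), (3, 4)
print(minkowski(a, b, p=1))  # 7.0 (Manhattan: |3| + |4|)
print(minkowski(a, b, p=2))  # 5.0 (Euclidean: sqrt(9 + 16))
```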
Support Vector Machine Classifier
SVC(kernel='linear', C=1.0, random_state=0)
– kernel specifies the kernel type to be used in the algorithm:
kernel = 'linear' for Linear Classification
kernel = 'rbf' for Non-Linear Classification
– C is the penalty parameter of the error term
– random_state is the seed of the pseudo-random number generator
Decision Tree Classifier
Here, the criterion is the function to measure the quality of a split, max_depth is the maximum depth of the tree, and random_state is the seed used by the random number generator.
DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
Lasso Regression
Lasso(alpha=0.1)
– alpha is the regularization parameter.
Principal Component Analysis
PCA(n_components = 4)
Perceptron Classifier
Perceptron(max_iter=40, eta0=0.1, random_state=0)
– max_iter is the maximum number of iterations (older scikit-learn releases called this n_iter),
– eta0 is the learning rate,
– random_state is the seed of the random number generator.
Influence on Models
Overall, hyperparameters influence the factors below while you design your model. Please remember this.
- Linear Model
  - What degree of polynomial features should we use?
- Decision Tree
  - What is the maximum allowed depth?
  - What is the minimum number of samples required at a leaf node in the decision tree?
- Random Forest
  - How many trees should we include?
- Neural Network
  - How many neurons should we keep in a layer?
  - How many layers should we use?
- Gradient Descent
  - What learning rate should we use?
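Several of the questions above map directly onto constructor arguments in scikit-learn. As a quick sketch, get_params() lists every hyperparameter an estimator accepts; the two estimators below are picked only as examples.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# get_params() returns every hyperparameter an estimator accepts,
# together with its current (here: default) value.
tree_params = DecisionTreeClassifier().get_params()
forest_params = RandomForestClassifier().get_params()

# Maximum depth and minimum samples per leaf of a decision tree.
print(tree_params["max_depth"], tree_params["min_samples_leaf"])
# Number of trees in a random forest.
print(forest_params["n_estimators"])
```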
So, once we start introducing hyperparameters into our model, the overall architecture of the model looks like the one below.
Image designed by the author – Shanthababu
Hyperparameter Optimization Techniques
In the ML world, many hyperparameter optimization techniques are available.
- Manual Search
- Random Search
- Grid Search
- Halving
  - Grid Search
  - Randomized Search
- Automated Hyperparameter tuning
  - Bayesian Optimization
  - Genetic Algorithms
- Artificial Neural Networks Tuning
- HyperOpt-Sklearn
- Bayes Search
Image designed by the author – Shanthababu
Note: When we implement hyperparameter optimization techniques, we also need cross-validation in the flow, so that we do not miss the best combinations that work on both training and test data.
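As a reminder of what cross-validation does, here is a minimal K-Fold sketch. The synthetic data, the 5 folds, and the decision tree settings are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (illustrative only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 5-fold CV: train on 4 folds, validate on the 5th, rotate, then average.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(scores.round(3))
print("mean accuracy:", round(scores.mean(), 3))
```

The search classes discussed below run this kind of loop internally for every hyperparameter combination they try.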
Manual Search: The name is self-explanatory. The data scientist experiments with different combinations of hyperparameters and their values for the selected model, trains each one, picks the model with the best performance, tests it, and moves on to production deployment. Of course, you are absolutely right to think that this method consumes immense effort.
Let’s try this with a simple dataset.
The dataframe is ready after loading the CSV and the required libraries for further operations.
The train and test split is done after identifying the target (dependent) variable and the feature (independent) variables.
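# Train Test Split
from sklearn.model_selection import train_test_split
#df = df.drop(['name','origin','model_year'], axis=1)
y = df['class']                  # target column
X = df.drop(['class'], axis=1)   # all remaining columns are features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=30)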
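Since we’re planning a manual search, I am creating 5 sets of hyperparameters for DecisionTreeClassifier and fitting a model for each.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# sets of hyperparameters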
params_1 = {'criterion': 'gini', 'splitter': 'best', 'max_depth': 50}
params_2 = {'criterion': 'entropy', 'splitter': 'random', 'max_depth': 70}
params_3 = {'criterion': 'gini', 'splitter': 'random', 'max_depth': 60}
params_4 = {'criterion': 'entropy', 'splitter': 'best', 'max_depth': 80}
params_5 = {'criterion': 'gini', 'splitter': 'best', 'max_depth': 40}
# Separate models
model_1 = DecisionTreeClassifier(**params_1)
model_2 = DecisionTreeClassifier(**params_2)
model_3 = DecisionTreeClassifier(**params_3)
model_4 = DecisionTreeClassifier(**params_4)
model_5 = DecisionTreeClassifier(**params_5)
model_1.fit(X_train, y_train)
model_2.fit(X_train, y_train)
model_3.fit(X_train, y_train)
model_4.fit(X_train, y_train)
model_5.fit(X_train, y_train)
# Prediction sets (each model predicts on the same test split)
preds_1 = model_1.predict(X_test)
preds_2 = model_2.predict(X_test)
preds_3 = model_3.predict(X_test)
preds_4 = model_4.predict(X_test)
preds_5 = model_5.predict(X_test)
print(f'Accuracy on Model 1: {round(accuracy_score(y_test, preds_1), 3)}')
print(f'Accuracy on Model 2: {round(accuracy_score(y_test, preds_2), 3)}')
print(f'Accuracy on Model 3: {round(accuracy_score(y_test, preds_3), 3)}')
print(f'Accuracy on Model 4: {round(accuracy_score(y_test, preds_4), 3)}')
print(f'Accuracy on Model 5: {round(accuracy_score(y_test, preds_5), 3)}')
Output (exact values vary from run to run, since the trees are built without a fixed random_state)
Accuracy on Model 1: 0.693
Accuracy on Model 2: 0.693
Accuracy on Model 3: 0.693
Accuracy on Model 4: 0.736
Accuracy on Model 5: 0.688
Look at the accuracy and how it differs with the different parameter sets we passed. But this is a tedious job: chasing a large number of permutations and combinations to find the best one. I hope you can understand the pain and the code-management burden.
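Part of the code-management pain can be reduced with a plain loop over the parameter sets. This is only a sketch: the synthetic data stands in for the CSV used above, and random_state=0 on the trees is an added assumption to make the run repeatable. It mirrors the manual search by scoring on the held-out split; in practice the comparison is better done on a validation set or with cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the article's CSV (illustrative only).
X, y = make_classification(n_samples=500, n_features=8, random_state=30)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=30)

param_sets = [
    {"criterion": "gini", "splitter": "best", "max_depth": 50},
    {"criterion": "entropy", "splitter": "random", "max_depth": 70},
    {"criterion": "gini", "splitter": "random", "max_depth": 60},
]

# One loop replaces the repeated model_N / preds_N variables.
results = []
for params in param_sets:
    model = DecisionTreeClassifier(random_state=0, **params)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    results.append((acc, params))
    print(f"{params} -> accuracy {acc:.3f}")

# Keep the parameter set with the highest accuracy.
best_acc, best_params = max(results, key=lambda r: r[0])
print("best:", best_params, round(best_acc, 3))
```

Even with the loop, we still have to write out every combination by hand, which is exactly what the search classes in the next sections automate.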
Grid-Search: To implement Grid-Search, Scikit-Learn provides a class called GridSearchCV. The computational time can be long, but it reduces the manual effort by avoiding ‘n’ lines of code. The class itself performs the search and returns the best-performing model and its score. A model is built for each combination of the given hyperparameter values, and each one is evaluated internally and ranked across the given cross-validation folds.
Let’s implement this with the given dataset.
Getting a KNeighborsClassifier object for my operation.
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
Fitting my KNN object on the training split
knn_clf.fit(X_train, y_train)
Output
KNeighborsClassifier()
Importing other required libraries
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
Preparing the grid of hyperparameters: 8 values of n_neighbors and 4 different algorithms
param_grid = {'n_neighbors': list(range(1, 9)), 'algorithm': ('auto', 'ball_tree', 'kd_tree', 'brute')}
Defining the number of folds for GridSearchCV and fitting it on the training split
gs = GridSearchCV(knn_clf, param_grid, cv=10)
gs.fit(X_train, y_train)
Output
GridSearchCV(cv=10, estimator=KNeighborsClassifier(),param_grid={'algorithm': ('auto', 'ball_tree', 'kd_tree', 'brute'),'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8]})
Let's print all the combinations that were tried: 4 algorithms × 8 values of n_neighbors.
gs.cv_results_['params']
Output (32 combinations)
[{'algorithm': 'auto', 'n_neighbors': 1}, {'algorithm': 'auto', 'n_neighbors': 2}, {'algorithm': 'auto', 'n_neighbors': 3}, {'algorithm': 'auto', 'n_neighbors': 4}, {'algorithm': 'auto', 'n_neighbors': 5}, {'algorithm': 'auto', 'n_neighbors': 6}, {'algorithm': 'auto', 'n_neighbors': 7}, {'algorithm': 'auto', 'n_neighbors': 8}, {'algorithm': 'ball_tree', 'n_neighbors': 1}, {'algorithm': 'ball_tree', 'n_neighbors': 2}, {'algorithm': 'ball_tree', 'n_neighbors': 3}, {'algorithm': 'ball_tree', 'n_neighbors': 4}, {'algorithm': 'ball_tree', 'n_neighbors': 5}, {'algorithm': 'ball_tree', 'n_neighbors': 6}, {'algorithm': 'ball_tree', 'n_neighbors': 7}, {'algorithm': 'ball_tree', 'n_neighbors': 8}, {'algorithm': 'kd_tree', 'n_neighbors': 1}, {'algorithm': 'kd_tree', 'n_neighbors': 2}, {'algorithm': 'kd_tree', 'n_neighbors': 3}, {'algorithm': 'kd_tree', 'n_neighbors': 4}, {'algorithm': 'kd_tree', 'n_neighbors': 5}, {'algorithm': 'kd_tree', 'n_neighbors': 6}, {'algorithm': 'kd_tree', 'n_neighbors': 7}, {'algorithm': 'kd_tree', 'n_neighbors': 8}, {'algorithm': 'brute', 'n_neighbors': 1}, {'algorithm': 'brute', 'n_neighbors': 2}, {'algorithm': 'brute', 'n_neighbors': 3}, {'algorithm': 'brute', 'n_neighbors': 4}, {'algorithm': 'brute', 'n_neighbors': 5}, {'algorithm': 'brute', 'n_neighbors': 6}, {'algorithm': 'brute', 'n_neighbors': 7}, {'algorithm': 'brute', 'n_neighbors': 8}]
Let’s get the best parameter from the list
gs.best_params_
Output
{'algorithm': 'auto', 'n_neighbors': 6}
As per the cross-validation process, we take the mean score across the folds for each combination
gs.cv_results_['mean_test_score']
Output
array([0.68134172, 0.71701607, 0.71331237, 0.71509434, 0.72075472, 0.73944794, 0.72085954, 0.73392732, 0.68134172, 0.71701607, 0.71331237, 0.71509434, 0.72075472, 0.73944794, 0.72085954, 0.73392732, 0.68134172, 0.71701607, 0.71331237, 0.71509434, 0.72075472, 0.73944794, 0.72085954, 0.73392732, 0.68134172, 0.71701607, 0.71331237, 0.71509434, 0.72075472, 0.73944794, 0.72085954, 0.73392732])
That's fine, but which one is the best accuracy in the above list? This is simple: we already found that the best parameters are {'algorithm': 'auto', 'n_neighbors': 6}. Comparing the list of 32 combinations with the list of accuracies, the answer is 0.73944794, which is the highest value in the list and the BEST mean cross-validated accuracy on the training data.
Accuracy of the best model on the test set
print(gs.score(X_test,y_test))
Output
0.70129870
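To see how best_params_, best_score_, and cv_results_ relate to each other, here is a self-contained sketch of the same flow. The synthetic data from make_classification is an assumption standing in for the CSV, so the numbers will differ from the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data stands in for the article's CSV (illustrative only).
X, y = make_classification(n_samples=400, n_features=8, random_state=30)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=30)

# Define the grid BEFORE building the search object.
param_grid = {"n_neighbors": list(range(1, 9)),
              "algorithm": ("auto", "ball_tree", "kd_tree", "brute")}
gs = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10)
gs.fit(X_train, y_train)

means = gs.cv_results_["mean_test_score"]
# best_index_ points at the winning row of cv_results_.
print(gs.cv_results_["params"][gs.best_index_], gs.best_params_)
print(means[gs.best_index_], gs.best_score_)
# score() uses the refitted best model on data it has not seen.
print(gs.score(X_test, y_test))
```

Note that best_score_ is a cross-validated mean computed on the training split, while the last line is a separate check on the test split, so the two numbers are not expected to match.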
Random Search: The Grid Search discussed above increases the computational cost, so GS is sometimes considered inefficient, since it attempts all the combinations of the given hyperparameters. Randomized Search, in contrast, trains models only on randomly sampled combinations of hyperparameters. Obviously, the number of models trained is smaller than in grid search.
In simple terms, in Random Search we train and test our model on random combinations drawn from the given grid of hyperparameters.
Getting the RandomForestClassifier object for my operation.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint as sp_randint
Building the RandomForestClassifier object
# build a RandomForestClassifier
clf = RandomForestClassifier(n_estimators=50)
Specifying the list of parameters and distributions
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}
Defining the sample, distributions and cross-validation
samples = 8
# number of random samples
randomCV = RandomizedSearchCV(clf, param_distributions=param_dist, n_iter=samples,cv=3)
All parameters are set; let’s fit the model.
randomCV.fit(X, y)
print(randomCV.best_params_)
Output
{'bootstrap': False, 'criterion': 'gini', 'max_depth': 3, 'max_features': 3, 'min_samples_leaf': 7, 'min_samples_split': 8}
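As per the cross-validation process, we take the mean score across the folds for each sampled combination
randomCV.cv_results_['mean_test_score']
Output (nan marks a sampled combination whose fit failed, for example a max_features value larger than the number of columns in X)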
array([0.73828125, 0.69010417, 0.7578125 , 0.75911458, 0.73828125, nan, nan, 0.7421875 ])
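Accuracy on the test set. Note that randomCV was fitted on the full X and y above, which include the test rows, so this score is optimistic (exactly the data leakage described earlier); fitting on X_train and y_train instead gives an honest estimate.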
print(randomCV.score(X_test,y_test))
Output
0.8744588744588745
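For comparison, here is a sketch of a leakage-free variant of the same search. It is written under a few assumptions: synthetic data replaces the CSV, max_features is capped at the number of features so that no sampled combination fails, and random_state is fixed for repeatability. The search is fitted on the training rows only, so the final score is a genuine held-out estimate.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data stands in for the article's CSV (illustrative only).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

n_features = X_train.shape[1]
param_dist = {
    "max_depth": [3, None],
    # randint's upper bound is exclusive, so this samples 1..n_features.
    "max_features": randint(1, n_features + 1),
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 11),
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_distributions=param_dist, n_iter=8, cv=3, random_state=0)
search.fit(X_train, y_train)   # training rows only

print(search.best_params_)
print(search.cv_results_["mean_test_score"])
print(search.score(X_test, y_test))
```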
You may have a question now: which technique is best to go with? The straight answer is RandomizedSearchCV. Let’s see why.
Comparison Study of GridSearchCV and RandomizedSearchCV
GridSearchCV | RandomizedSearchCV |
Grid is well-defined | Grid is not well-defined |
Discrete values for HP-params | Continuous values and statistical distributions |
Defined size for hyperparameter space | No such restriction |
Picks the best combination from the HP space | Picks samples from the HP space |
Samples are not created | Samples are created, as specified by the range and n_iter |
Lower performance than RandomizedSearchCV | Better performance and results |
Guided flow to search for the best combination | As the name says, no guidance |
The pictorial representation below will give you a better understanding of GridSearchCV and RandomizedSearchCV.
Image designed by the author – Shanthababu
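The "continuous values and statistical distributions" row of the comparison can be illustrated with a short sketch. The estimator, the log-uniform range for C, and the synthetic data are all assumptions for illustration; the point is that RandomizedSearchCV draws n_iter values from a distribution instead of walking a fixed list.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data stands in for a real dataset (illustrative only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# C is drawn from a continuous log-uniform distribution, not a fixed list.
param_dist = {"C": loguniform(1e-3, 1e3)}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_dist, n_iter=10, cv=5, random_state=0)
search.fit(X, y)

# Exactly n_iter candidates are evaluated, each with its own sampled C.
sampled_C = [p["C"] for p in search.cv_results_["params"]]
print(len(sampled_C))
print(search.best_params_)
```

A grid would need us to pick the candidate values of C up front, whereas here the budget (n_iter) is fixed and the values are spread across several orders of magnitude.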
Conclusion
Guys! So far we have discussed hyperparameters in detail from a Machine Learning point of view. Please remember a few things before we go:
- Each model has its own set of hyperparameters, so we have to choose them carefully and tweak them during hyperparameter tuning, that is, while building the HP space.
- Not all hyperparameters are equally important, and there are no fixed rules for this. Try to use continuous values instead of discrete values where possible.
- Make sure to use K-Fold cross-validation during hyperparameter tuning to improve the tuning and the coverage of the hyperparameter space.
- Go with the better combination of hyperparameters and build strong results.
I trust this article helps you understand the concepts and the ways to implement them.
Thanks for your time; we will connect again on different topics. Until then, bye! Cheers! – Shanthababu