Hyperparameter optimization by nested cross-validation
In this example we will see how to evaluate a model by cross-validation while coupling it with a hyperparameter optimization (HPO) step, i.e., a nested cross-validation. The main idea is that, for each fold of the outer cross-validation, we use the training data of that fold to find the hyperparameters that maximize (or minimize) a given objective function. Usually this objective corresponds to a classification or regression metric (depending on the problem) estimated by an inner cross-validation, hence the name nested cross-validation. Once a model has been fitted with the optimized hyperparameters, we apply it to the test data of the outer fold and thus estimate its performance on that fold.
Below is a schema of the nested cross-validation (the diagram will only render in JupyterLab versions > 4.1 and Jupyter > 7.1):
graph TD
A[Dataset] -->|Outer Split| B[Outer Fold 1]
A[Dataset] -->|Outer Split| C[Outer Fold 2]
A[Dataset] -->|Outer Split| D[Outer Fold 3]
A[Dataset] -->|Outer Split| E[Outer Fold 4]
A[Dataset] -->|Outer Split| F[Outer Fold 5]
B[Outer Fold 1] --> |Inner Split - train| BA[Training data]
B[Outer Fold 1] --> |Inner Split - test| BB[Test data]
BA[Training data] --> |Inner CV| BAA[Cross-validation with HPO]
BAA[Cross-validation with HPO] --> |Hyperparameter combination| BAAA[Best model]
BAAA[Best model] --> |Predictions| BB[Test data]
C[Outer Fold 2] --> |Inner CV| CA[Cross-validation with HPO] --> |Predictions| CAA[Test data]
D[Outer Fold 3] --> |Inner CV| CB[Cross-validation with HPO] --> |Predictions| CBA[Test data]
E[Outer Fold 4] --> |Inner CV| CC[Cross-validation with HPO] --> |Predictions| CCA[Test data]
F[Outer Fold 5] --> |Inner CV| CD[Cross-validation with HPO] --> |Predictions| CDA[Test data]
BB[Test data] --> DA[Aggregate predictions]
CAA[Test data] --> DA[Aggregate predictions]
CBA[Test data] --> DA[Aggregate predictions]
CCA[Test data] --> DA[Aggregate predictions]
CDA[Test data] --> DA[Aggregate predictions]
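For reference, the same nesting can be sketched with plain scikit-learn, placing a GridSearchCV (inner loop) inside cross_val_score (outer loop). This is only an illustrative sketch: the parameter grid and random seeds below are arbitrary and unrelated to the search space optimized later in this example.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# inner loop: tune the hyperparameters by 5-fold CV on the training data of each outer fold
inner_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [3, 5, 10], 'ccp_alpha': [0.0, 0.01]},  # illustrative grid
    scoring='balanced_accuracy',
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

# outer loop: evaluate the tuned model on each held-out fold
outer_scores = cross_val_score(
    inner_search, X_demo, y_demo,
    scoring='balanced_accuracy',
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)
print(round(outer_scores.mean(), 3))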
For this example we will use the Breast Cancer Wisconsin (Diagnostic) dataset. The following is a description of the dataset:
Overview
The Breast Cancer Wisconsin (Diagnostic) dataset is a widely used dataset for binary classification tasks in the field of machine learning. It consists of features that describe the characteristics of cell nuclei present in breast cancer biopsies. The goal is to predict whether a given sample is benign or malignant based on these characteristics.
Dataset Characteristics
Number of Instances: 569
Number of Features: 30 numeric, predictive attributes
Class Distribution:
Malignant: 212 samples
Benign: 357 samples
Attribute Information
The dataset includes 10 real-valued features for each cell nucleus, which are computed for each of the 569 instances. These features are:
Radius (mean of distances from center to points on the perimeter)
Texture (standard deviation of gray-scale values)
Perimeter
Area
Smoothness (local variation in radius lengths)
Compactness (perimeter^2 / area - 1.0)
Concavity (severity of concave portions of the contour)
Concave points (number of concave portions of the contour)
Symmetry
Fractal dimension (“coastline approximation” - 1)
For each of these features, the dataset provides the mean, standard error, and worst (largest) value. This results in a total of 30 features per instance (10 features * 3 measures = 30).
Target Variable
The target variable is binary and indicates the diagnosis of the breast tissue:
0: Malignant
1: Benign
The dataset is publicly available from the UCI Machine Learning Repository and was originally curated by Dr. William H. Wolberg from the University of Wisconsin.
import optuna
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
# for a simpler use, we load the different submodules of the library
# - the gojo.core module contains all the subroutines used to evaluate the models
# - the gojo.interfaces module provides a standardized way to interact with the different elements of gojo.core
# - the gojo.util module implements some utilities
# - the gojo.plotting module implements different visualization tools
from gojo import core
from gojo import interfaces
from gojo import util
from gojo import plotting
C:\Users\fgarcia\anaconda3\envs\mlv0\lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
# load the dataset
breast_cancer = datasets.load_breast_cancer()
X, y = breast_cancer.data, breast_cancer.target
X.shape, y.shape
((569, 30), (569,))
Initial model evaluation
Let's start by defining a simple decision-tree model and evaluating its performance by cross-validation to get a first idea of the performance we can expect. To do this, we will use the default parameters of the model and evaluate it with five-fold cross-validation.
# define the model and the cross validation schema
model = interfaces.SklearnModelWrapper(model_class=DecisionTreeClassifier)
cv_obj = util.splitter.getCrossValObj(cv=5, repeats=1, stratified=True)
model
SklearnModelWrapper(
base_model='sklearn.tree._classes.DecisionTreeClassifier',
model_params={},
predict_proba=False,
supress_warnings=False
)
# evaluate the model using the cross-validation schema defined in the previous cell keeping the
# training set predictions to evaluate the possible overfitting
cv_report = core.evalCrossVal(
X=X, y=y,
model=model,
cv=cv_obj,
save_train_preds=True,
)
cv_report
Performing cross-validation...: 5it [00:00, 172.43it/s]
<gojo.core.report.CVReport at 0x15fffee8d60>
# calculate performance metrics
performance = cv_report.getScores(core.getDefaultMetrics('binary_classification'), supress_warnings=True)
pd.concat([
pd.DataFrame(performance['test'].mean(), columns=['Performance (test)']).drop(index=['n_fold']).round(decimals=3),
pd.DataFrame(performance['train'].mean(), columns=['Performance (train)']).drop(index=['n_fold']).round(decimals=3)
], axis=1)
| | Performance (test) | Performance (train) |
|---|---|---|
| accuracy | 0.921 | 1.0 |
| balanced_accuracy | 0.920 | 1.0 |
| precision | 0.948 | 1.0 |
| recall | 0.924 | 1.0 |
| sensitivity | 0.924 | 1.0 |
| specificity | 0.915 | 1.0 |
| negative_predictive_value | 0.882 | 1.0 |
| f1_score | 0.936 | 1.0 |
| auc | 0.920 | 1.0 |
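Averages alone can hide fold-to-fold variability. Assuming performance['test'] contains one row per fold (as implied by the .mean() call above), the standard deviation across folds can be reported alongside the mean:

# mean and standard deviation of the test metrics across the five folds
pd.concat([
    pd.DataFrame(performance['test'].mean(), columns=['Mean (test)']),
    pd.DataFrame(performance['test'].std(), columns=['Std (test)'])
], axis=1).drop(index=['n_fold']).round(decimals=3)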
Now, let's use the gojo.plotting module to display the confusion matrices.
cv_report.getTrainPredictions().head(5)
| n_fold | indices | pred_labels | true_labels |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| | 2 | 0 | 0 |
| | 3 | 0 | 0 |
| | 4 | 0 | 0 |
| | 6 | 0 | 0 |
plotting.confusionMatrix(
df=cv_report.getTestPredictions(),
y_pred='pred_labels',
y_true='true_labels',
average='n_fold',
labels=['Malignant', 'Benign'],
title='Predictions on the test data'
)

plotting.confusionMatrix(
df=cv_report.getTrainPredictions(),
y_pred='pred_labels',
y_true='true_labels',
average='n_fold',
labels=['Malignant', 'Benign'],
title='Predictions on the train data'
)

Hyperparameter optimization
Next we will optimize the same model as before by means of a nested cross-validation as described above (outer 5-fold cross-validation and inner 5-fold cross-validation). This can be easily accomplished using the gojo.core.evalCrossValNestedHPO function. This function relies on the samplers defined in the optuna library, since that is the library it uses internally to optimize the hyperparameters.
First of all, let's define the search space (a.k.a. parameter grid) to be explored. In this case we will explore the following hyperparameters:
Maximum tree depth
Minimum number of samples (defined as proportion) to make a split
Minimum number of samples (defined as proportion) that can be in a terminal node.
Number of features considered for a split.
Pruning parameter (ccp_alpha)
Note that here some parameters are sampled considering a categorical distribution (suggest_categorical), others a continuous distribution (suggest_float), and others a continuous distribution with integers (suggest_int).
search_space = {
'max_depth' : ('suggest_int', (4, 15)),
'min_samples_split': ('suggest_float', (0.01, 0.2)),
'min_samples_leaf' : ('suggest_float', (0.01, 0.2)),
'max_features' : ('suggest_categorical', [('sqrt', 'log2', None)]),
'ccp_alpha' : ('suggest_float', (0.00, 0.5)),
}
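Each entry in the dictionary pairs the name of an optuna trial method with its arguments, so the specification above presumably expands into direct trial.suggest_* calls inside the objective function built by gojo. The sketch below shows what that expansion would look like in plain optuna; the objective function itself is only a placeholder, not the one gojo constructs.

def illustrative_objective(trial):
    # direct optuna equivalents of the search space defined above
    params = {
        'max_depth'        : trial.suggest_int('max_depth', 4, 15),
        'min_samples_split': trial.suggest_float('min_samples_split', 0.01, 0.2),
        'min_samples_leaf' : trial.suggest_float('min_samples_leaf', 0.01, 0.2),
        'max_features'     : trial.suggest_categorical('max_features', ('sqrt', 'log2', None)),
        'ccp_alpha'        : trial.suggest_float('ccp_alpha', 0.00, 0.5),
    }
    # ... fit a DecisionTreeClassifier(**params) by inner cross-validation
    # and return the objective value to be optimized
    return 0.0  # placeholder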
Subsequently, we perform hyperparameter optimization using five-fold cross-validation for both the inner and the outer loop, and the optuna Tree-structured Parzen Estimator (TPE) sampler to explore the search space. The TPE sampler will draw the first 100 hyperparameter combinations at random, and the remaining 300 from the distribution of values learned from those initial random trials and refined in subsequent iterations. As the objective metric of the hyperparameter optimization we will maximize the balanced accuracy obtained on the inner cross-validation test sets.
# perform the HPO to optimize the model hyperparameters
hpo_cv_report = core.evalCrossValNestedHPO(
X=X,
y=y,
model=model,
search_space=search_space,
outer_cv=util.getCrossValObj(
cv=5, repeats=1, stratified=True, loocv=False, random_state=1997),
inner_cv=util.getCrossValObj(
cv=5, repeats=1, stratified=True, loocv=False, random_state=1997),
hpo_sampler=optuna.samplers.TPESampler(n_startup_trials=100),
hpo_n_trials=400,
minimization=False,
metrics=core.getDefaultMetrics('binary_classification', bin_threshold=0.5),
objective_metric='balanced_accuracy',
verbose=1,
save_train_preds=True,
save_models=False,
n_jobs=25
)
Performing cross-validation...: 5it [02:20, 28.06s/it]
The report returned by the evalCrossValNestedHPO function is of exactly the same type as the previous one (although, as we will see below, it contains more information). Let's look at the confusion matrix obtained with this model and compare it with the one obtained previously.
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
plt.subplots_adjust(wspace=0.5, hspace=None)
plotting.confusionMatrix(
df=cv_report.getTestPredictions(),
y_pred='pred_labels',
y_true='true_labels',
average='n_fold',
labels=['Malignant', 'Benign'],
title='Cross-validation',
ax=axes[0],
show=False
)
plotting.confusionMatrix(
df=hpo_cv_report.getTestPredictions(),
y_pred='pred_labels',
y_true='true_labels',
average='n_fold',
labels=['Malignant', 'Benign'],
title='Cross-validation (HPO)',
ax=axes[1],
show=False
)
plt.show()

We see that the performance obtained with HPO is slightly worse than that obtained without optimizing the hyperparameters; this is largely because the problem addressed is quite simple in nature. However, let's check the level of overfitting (recall that without HPO the model learned the training data perfectly).
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
plt.subplots_adjust(wspace=0.5, hspace=None)
plotting.confusionMatrix(
df=hpo_cv_report.getTestPredictions(),
y_pred='pred_labels',
y_true='true_labels',
average='n_fold',
labels=['Malignant', 'Benign'],
title='Cross-validation (HPO) - Test',
ax=axes[0],
show=False
)
plotting.confusionMatrix(
df=hpo_cv_report.getTrainPredictions(),
y_pred='pred_labels',
y_true='true_labels',
average='n_fold',
labels=['Malignant', 'Benign'],
title='Cross-validation (HPO) - Train',
ax=axes[1],
show=False
)
plt.show()

Now we see that the model no longer shows the extreme overfitting observed before, which may make it behave better in a production environment.
On the other hand, the results of the HPO (the history) are stored in the metadata attribute of the obtained report:
hpo_cv_report.metadata.keys()
dict_keys(['hpo_history', 'hpo_best_params', 'op_instance_args'])
Using the key hpo_best_params we can access the best combination of hyperparameters obtained within each outer cross-validation fold. This allows us to see how the different hyperparameters behave across folds.
pd.DataFrame(hpo_cv_report.metadata['hpo_best_params'])
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| max_depth | 12 | 12.000000 | 5 | 14 | 14 |
| min_samples_split | 0.135302 | 0.059464 | 0.168597 | 0.139786 | 0.021478 |
| min_samples_leaf | 0.128331 | 0.026129 | 0.035262 | 0.035723 | 0.016587 |
| max_features | log2 | NaN | sqrt | sqrt | sqrt |
| ccp_alpha | 0.061524 | 0.011771 | 0.026528 | 0.029911 | 0.001725 |
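A quick way to check how stable the selected hyperparameters are across the outer folds is to summarize the numeric ones and count the categorical choices. A short sketch, assuming hpo_best_params maps each fold to its best parameter dictionary (as the table above suggests):

best_params = pd.DataFrame(hpo_cv_report.metadata['hpo_best_params']).T

# spread of the numeric hyperparameters across the five outer folds
print(best_params[['max_depth', 'min_samples_split', 'min_samples_leaf', 'ccp_alpha']]
      .astype(float).describe().round(3))

# frequency of each categorical choice (NaN corresponds to max_features=None)
print(best_params['max_features'].value_counts(dropna=False))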
Let's also look at the convergence of the HPO.
hpo_cv_report.metadata['hpo_history'][0].head(5)
| | number | value | datetime_start | datetime_complete | duration | params_ccp_alpha | params_max_depth | params_max_features | params_min_samples_leaf | params_min_samples_split | state |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | -0.913699 | 2024-06-12 15:28:52.894498 | 2024-06-12 15:28:54.360391 | 0 days 00:00:01.465893 | 0.258801 | 6 | log2 | 0.025152 | 0.179437 | COMPLETE |
| 1 | 1 | -0.500000 | 2024-06-12 15:28:52.895498 | 2024-06-12 15:28:54.454688 | 0 days 00:00:01.559190 | 0.490937 | 8 | None | 0.084618 | 0.185654 | COMPLETE |
| 2 | 2 | -0.912698 | 2024-06-12 15:28:52.897541 | 2024-06-12 15:28:54.302652 | 0 days 00:00:01.405111 | 0.198169 | 13 | log2 | 0.114808 | 0.015214 | COMPLETE |
| 3 | 3 | -0.897909 | 2024-06-12 15:28:52.898561 | 2024-06-12 15:28:54.429326 | 0 days 00:00:01.530765 | 0.024611 | 13 | sqrt | 0.084600 | 0.101999 | COMPLETE |
| 4 | 4 | -0.881339 | 2024-06-12 15:28:52.899591 | 2024-06-12 15:28:54.472548 | 0 days 00:00:01.572957 | 0.196477 | 5 | log2 | 0.107105 | 0.024681 | COMPLETE |
# plot the HPO convergence for each fold
fig, ax = plt.subplots(figsize=(8, 3))
for n_fold in hpo_cv_report.metadata['hpo_history'].keys():
ax.plot(
hpo_cv_report.metadata['hpo_history'][n_fold]['value'].values,
label='Fold %d' % n_fold
)
for pos in ['right', 'left', 'bottom', 'top']:
ax.spines[pos].set_visible(False)
ax.grid(alpha=0.3, color='grey')
ax.legend(loc='upper right')
ax.set_xlabel('Iteration')
ax.set_ylabel('Objective')
ax.set_title('HPO convergence')
plt.show()

Alternatively, we can use some of the library's auxiliary plotting functions to produce the same graph.
plotting.linePlot(
*list(hpo_cv_report.metadata['hpo_history'].values()),
x='number',
y='value',
labels=['Fold %d' % (i+1) for i in range(5)],
figsize=(8, 4),
title='HPO convergence',
)

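Because the raw objective values are noisy (and, judging by the history above, stored as the negated balanced accuracy so that optuna minimizes them), a running-best curve is sometimes easier to read. A minimal sketch reusing the same hpo_history metadata with matplotlib:

# running best (lowest) objective value per fold; lower means a better balanced accuracy here
fig, ax = plt.subplots(figsize=(8, 3))
for n_fold, history in hpo_cv_report.metadata['hpo_history'].items():
    ax.plot(history['value'].cummin().values, label='Fold %d' % n_fold)
ax.legend(loc='upper right')
ax.set_xlabel('Iteration')
ax.set_ylabel('Best objective so far')
ax.set_title('HPO convergence (running best)')
plt.show()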