Advanced use

This notebook is intended for users who want to extend the module with their own functionality while reusing the subroutines already implemented in the gojo library.

import numpy as np
from sklearn import datasets

# For the tests we will use the same test dataset as in Example 1.
# load test dataset (Wine)
wine_dt = datasets.load_wine()

# create the target variable. Classification problem: class 1 vs rest
# to see the target names you can use wine_dt['target_names']
y = (wine_dt['target'] == 1).astype(int)
X = wine_dt['data']

X.shape, y.shape

((178, 13), (178,))

Definition of your own transformations (gojo.interfaces.Transform)

To define your own transformations you can make use of the gojo.interfaces.Transform class. Let’s see how to define our own transformations using an example.

In this example we will implement a very naive feature selection strategy: try different random combinations of variables (the number of variables in each combination is set by n_vars and the number of iterations by n_iters) and keep the combination that performs best. To evaluate the quality of the selected variables we will use the cross-validated F1 score of sklearn's GaussianNB model.

For more information use: help(interfaces.Transform)
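
As a rough sense of scale for n_iters, the sketch below counts how many distinct subsets exist for the values used later in this notebook (13 features, n_vars=5, n_iters=500) and estimates how many of them random sampling is expected to visit. It assumes each draw is uniform over subsets, which holds for a shuffle-and-take-first strategy. With these values there are 1287 possible subsets, so the random search covers a substantial part of the space but not all of it.

```python
from math import comb

n_features = 13   # columns in the Wine feature matrix (X.shape[1])
n_vars = 5        # subset size used later in this notebook
n_iters = 500     # number of random draws used later in this notebook

# Number of distinct feature subsets of size n_vars
n_subsets = comb(n_features, n_vars)

# Expected number of distinct subsets seen after n_iters uniform draws
# with replacement (assumption: every subset is equally likely per draw)
expected_distinct = n_subsets * (1 - (1 - 1 / n_subsets) ** n_iters)

print(f"possible subsets: {n_subsets}")
print(f"expected distinct subsets tried: {expected_distinct:.0f}")
```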

from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

from gojo import interfaces
from gojo import core
from gojo import util

class RandomPermutationSelection(interfaces.Transform):
    def __init__(self, n_vars: int, n_iters: int, random_state: int = None):
        super().__init__()    # IMPORTANT. Don't forget to call the superclass constructor

        self.n_vars = n_vars
        self.n_iters = n_iters
        self.random_state = random_state
        self.selected_features = None

    def fit(self, X: np.ndarray, y: np.ndarray, **_):

        # fix the random seed
        np.random.seed(self.random_state)

        # create a selection array
        findex = np.arange(X.shape[1])

        # iterate over random feature sets
        best_fset = None
        best_score = -np.inf
        for _ in range(self.n_iters):
            binary_mask = np.zeros(shape=X.shape[1])

            # random shuffle of findex
            np.random.shuffle(findex)

            # get selected features
            binary_mask[findex[:self.n_vars]] = 1
            sel_features = np.where(binary_mask == 1)[0]

            # test model performance
            cv_score = cross_val_score(
                GaussianNB(),
                X=X[:, sel_features],
                y=y,
                scoring='f1')
            avg_cv_score = np.mean(cv_score)

            # save features
            if avg_cv_score > best_score:
                best_score = avg_cv_score
                best_fset = sel_features

        self.selected_features = best_fset

    def transform(self, X: np.ndarray, **_):
        assert self.selected_features is not None, 'Unfitted transform'
        return X[:, self.selected_features]

    def reset(self):
        self.selected_features = None
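
The fit method above seeds NumPy's global random state, which also affects any other code that draws from np.random afterwards. One possible alternative, sketched here as an assumption rather than something gojo requires, is to draw each candidate subset from a local generator created with np.random.default_rng. The helper name sample_feature_subset is hypothetical and the seed value is arbitrary.

```python
import numpy as np


def sample_feature_subset(rng: np.random.Generator, n_features: int, n_vars: int) -> np.ndarray:
    """Return n_vars distinct feature indices in ascending order.

    Hypothetical helper: equivalent in spirit to shuffling the index
    array and keeping the first n_vars entries.
    """
    return np.sort(rng.choice(n_features, size=n_vars, replace=False))


# A local generator keeps the global np.random state untouched
rng = np.random.default_rng(1997)
subset = sample_feature_subset(rng, n_features=13, n_vars=5)
print(subset)
```

Inside fit, such a generator could be created once from self.random_state and reused for every iteration, so that repeated fits with the same seed explore the same subsets.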

fselector = RandomPermutationSelection(
    n_vars=5, n_iters=500)

fselector.fit(X, y)

fselector.selected_features

array([ 0,  4,  9, 10, 12], dtype=int64)

fselector_copy = fselector.copy()   # test the copy method
fselector.reset()                   # reset the transform
fselector.selected_features, fselector_copy.selected_features

(None, array([ 0,  4,  9, 10, 12], dtype=int64))

Now that we have implemented our custom transformation, we will use it inside a cross-validation loop, saving the fitted transformations so that we can inspect the features selected in each fold. We will use the same model and approach as in the notebook Example 1. Model evaluation by cross validation.ipynb

# model definition
model = interfaces.SklearnModelWrapper(
    model_class=SVC,
    kernel='poly', degree=1, coef0=0.0,
    cache_size=1000, class_weight=None
)

# cross-validation definition
cv_obj = util.splitter.getCrossValObj(cv=5, repeats=1, stratified=True)


# z-score scaling
zscores_scaler = interfaces.SKLearnTransformWrapper(transform_class=StandardScaler)

# put all transformations in a list (they will be applied sequentially)
transformations = [zscores_scaler, fselector]

cv_report = core.evalCrossVal(
    X=X,
    y=y,
    model=model,
    cv=cv_obj,
    save_train_preds=True,
    save_models=True,
    save_transforms=True,
    transforms=transformations,
    n_jobs=5
)

Performing cross-validation...: 5it [00:00, 363.73it/s]

performance = cv_report.getScores(
    core.getDefaultMetrics('binary_classification')
)
performance['test']

	accuracy	balanced_accuracy	precision	recall	sensitivity	specificity	negative_predictive_value	f1_score	auc	n_fold
0	0.916667	0.905844	0.923077	0.857143	0.857143	0.954545	0.913043	0.888889	0.905844	0
1	0.916667	0.892857	1.000000	0.785714	0.785714	1.000000	0.880000	0.880000	0.892857	1
2	0.916667	0.909524	0.928571	0.866667	0.866667	0.952381	0.909091	0.896552	0.909524	2
3	0.914286	0.904762	0.923077	0.857143	0.857143	0.952381	0.909091	0.888889	0.904762	3
4	0.942857	0.952381	0.875000	1.000000	1.000000	0.904762	1.000000	0.933333	0.952381	4

performance['test'].mean().loc['f1_score']

0.8975325670498083

fitted_transforms = cv_report.getFittedTransforms()
fitted_transforms

{0: [SKLearnTransformWrapper(
      base_transform='sklearn.preprocessing._data.StandardScaler',
      transform_params={}
  ),
  <__main__.RandomPermutationSelection at 0x27156732f50>],
 1: [SKLearnTransformWrapper(
      base_transform='sklearn.preprocessing._data.StandardScaler',
      transform_params={}
  ),
  <__main__.RandomPermutationSelection at 0x27156731780>],
 2: [SKLearnTransformWrapper(
      base_transform='sklearn.preprocessing._data.StandardScaler',
      transform_params={}
  ),
  <__main__.RandomPermutationSelection at 0x27156732110>],
 3: [SKLearnTransformWrapper(
      base_transform='sklearn.preprocessing._data.StandardScaler',
      transform_params={}
  ),
  <__main__.RandomPermutationSelection at 0x27156732260>],
 4: [SKLearnTransformWrapper(
      base_transform='sklearn.preprocessing._data.StandardScaler',
      transform_params={}
  ),
  <__main__.RandomPermutationSelection at 0x27156732020>]}

Let's explore the features selected in each fold

for n_fold, transform in fitted_transforms.items():
    print('Selected features in fold %d: %r' % (n_fold, list(transform[1].selected_features)))

Selected features in fold 0: [0, 2, 4, 9, 10]
Selected features in fold 1: [0, 2, 7, 9, 10]
Selected features in fold 2: [4, 6, 9, 10, 12]
Selected features in fold 3: [0, 2, 4, 8, 9]
Selected features in fold 4: [0, 2, 5, 9, 10]

We can see that this feature selection strategy, although naive, tends to select similar features across folds: feature 9 is selected in every fold, and features 0, 2 and 10 in four of the five.
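
The overlap between folds can be tallied directly. The following is a minimal sketch in which the per-fold selections are copied by hand from the printed output above; in the notebook itself they could instead be read from the selected_features attribute of each fitted transform.

```python
from collections import Counter

# Selected feature indices per fold, copied from the output above
selected_per_fold = {
    0: [0, 2, 4, 9, 10],
    1: [0, 2, 7, 9, 10],
    2: [4, 6, 9, 10, 12],
    3: [0, 2, 4, 8, 9],
    4: [0, 2, 5, 9, 10],
}

# Count in how many folds each feature index was selected
counts = Counter(f for feats in selected_per_fold.values() for f in feats)
n_folds = len(selected_per_fold)

# Most frequently selected features first, ties broken by feature index
for feature, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"feature {feature:2d}: selected in {count}/{n_folds} folds")
```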