Advanced use
This notebook is intended for users who want to extend the module with their own functionality while reusing the subroutines already implemented in the gojo library.
import numpy as np
from sklearn import datasets
# We will use the same test dataset as in Example 1.
# load test dataset (Wine)
wine_dt = datasets.load_wine()
# create the target variable. Classification problem: class 1 vs. rest
# to see the target names you can use wine_dt['target_names']
y = (wine_dt['target'] == 1).astype(int)
X = wine_dt['data']
X.shape, y.shape
((178, 13), (178,))
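Before going further, it can help to check how balanced the binary target is. The snippet below is a minimal sketch that rebuilds the same target and counts the samples per label; the variable name `class_counts` is introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets

wine_dt = datasets.load_wine()

# names of the three original classes
print(list(wine_dt['target_names']))

# binary target: class 1 vs. the rest (same definition as above)
y = (wine_dt['target'] == 1).astype(int)

# number of samples per binary label (index 0 = rest, index 1 = class 1)
class_counts = np.bincount(y)
print(class_counts)
```

The positive class is the minority here, which is one reason to look at metrics such as F1 or balanced accuracy rather than accuracy alone.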
Definition of your own transformations (gojo.interfaces.Transform)
To define your own transformations you can subclass gojo.interfaces.Transform. Let's see how with an example.
In this example we will implement a very naive feature selection strategy: draw random combinations of variables (n_vars variables per combination, n_iters combinations in total) and keep the combination that performs best. To evaluate each candidate set of variables we will use the cross-validated F1 score of the GaussianNB model from sklearn.
For more information use: help(interfaces.Transform)
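To get a feel for how much of the search space this random strategy explores, the following sketch counts the possible subsets and estimates the fraction visited. It assumes the 13 features of the Wine data and the values n_vars=5 and n_iters=500 that are used later in this notebook, and it assumes each iteration draws a subset uniformly and independently; the variable names are illustrative.

```python
import math

n_features = 13  # number of columns in the Wine data
n_vars = 5       # size of each candidate subset
n_iters = 500    # number of random draws

# number of distinct subsets of size n_vars
n_subsets = math.comb(n_features, n_vars)

# probability that a given subset is drawn at least once in n_iters
# independent uniform draws; by linearity of expectation this is also
# the expected fraction of distinct subsets that are visited
expected_coverage = 1 - (1 - 1 / n_subsets) ** n_iters

print(n_subsets, round(expected_coverage, 3))
```

With these settings the search is expected to visit only part of the possible combinations, so the result is a good subset found by chance rather than a guaranteed optimum.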
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from gojo import interfaces
from gojo import core
from gojo import util
class RandomPermutationSelection(interfaces.Transform):
    def __init__(self, n_vars: int, n_iters: int, random_state: int = None):
        super().__init__()    # IMPORTANT. Don't forget to call the superclass constructor

        self.n_vars = n_vars
        self.n_iters = n_iters
        self.random_state = random_state
        self.selected_features = None

    def fit(self, X: np.ndarray, y: np.ndarray, **_):
        # fix the random seed
        np.random.seed(self.random_state)

        # create a selection array
        findex = np.arange(X.shape[1])

        # iterate over random feature sets
        best_fset = None
        best_score = -np.inf
        for _ in range(self.n_iters):
            binary_mask = np.zeros(shape=X.shape[1])

            # random shuffle of findex
            np.random.shuffle(findex)

            # get selected features
            binary_mask[findex[:self.n_vars]] = 1
            sel_features = np.where(binary_mask == 1)[0]

            # test model performance
            cv_score = cross_val_score(
                GaussianNB(),
                X=X[:, sel_features],
                y=y,
                scoring='f1')
            avg_cv_score = np.mean(cv_score)

            # save features
            if avg_cv_score > best_score:
                best_score = avg_cv_score
                best_fset = sel_features

        self.selected_features = best_fset

    def transform(self, X: np.ndarray, **_):
        assert self.selected_features is not None, 'Unfitted transform'
        return X[:, self.selected_features]

    def reset(self):
        self.selected_features = None
fselector = RandomPermutationSelection(
    n_vars=5, n_iters=500)
fselector.fit(X, y)
fselector.selected_features
array([ 0, 4, 9, 10, 12], dtype=int64)
fselector_copy = fselector.copy() # test the copy method
fselector.reset() # reset the transform
fselector.selected_features, fselector_copy.selected_features
(None, array([ 0, 4, 9, 10, 12], dtype=int64))
Now that we have implemented our custom transformation, we will plug it into a cross-validation loop and save the fitted transformations so that we can explore the features selected in each fold. We will use the same model and approach as in the notebook Example 1. Model evaluation by cross validation.ipynb
# model definition
model = interfaces.SklearnModelWrapper(
    model_class=SVC,
    kernel='poly', degree=1, coef0=0.0,
    cache_size=1000, class_weight=None
)
# cross-validation definition
cv_obj = util.splitter.getCrossValObj(cv=5, repeats=1, stratified=True)
# z-score scaling
zscores_scaler = interfaces.SKLearnTransformWrapper(transform_class=StandardScaler)
# put all transformation in a list (they will be applied sequentially)
transformations = [zscores_scaler, fselector]
cv_report = core.evalCrossVal(
    X=X,
    y=y,
    model=model,
    cv=cv_obj,
    save_train_preds=True,
    save_models=True,
    save_transforms=True,
    transforms=transformations,
    n_jobs=5
)
Performing cross-validation...: 5it [00:00, 363.73it/s]
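For readers more familiar with scikit-learn, the idea of fitting every transformation on the training part of each fold only can be sketched with a Pipeline. This is an analogy under stated assumptions, not a description of how gojo is implemented: it swaps the custom random selection for sklearn's SelectKBest with f_classif, and the shuffling and random_state of the splitter are choices made here for illustration.

```python
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine_dt = datasets.load_wine()
X = wine_dt['data']
y = (wine_dt['target'] == 1).astype(int)

# steps are applied sequentially; each one is re-fitted on the training
# portion of every fold, so no information leaks from the test portion
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(score_func=f_classif, k=5)),  # stand-in selector (assumption)
    ('svm', SVC(kernel='poly', degree=1, coef0=0.0)),
])

# stratified 5-fold split (shuffle and seed are illustrative choices)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
f1_scores = cross_val_score(pipeline, X, y, cv=cv, scoring='f1')
print(f1_scores.shape)
```

The scores from this sketch are not expected to match the gojo results, since both the selector and the fold assignment differ.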
performance = cv_report.getScores(
    core.getDefaultMetrics('binary_classification')
)
performance['test']
| | accuracy | balanced_accuracy | precision | recall | sensitivity | specificity | negative_predictive_value | f1_score | auc | n_fold |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.916667 | 0.905844 | 0.923077 | 0.857143 | 0.857143 | 0.954545 | 0.913043 | 0.888889 | 0.905844 | 0 |
1 | 0.916667 | 0.892857 | 1.000000 | 0.785714 | 0.785714 | 1.000000 | 0.880000 | 0.880000 | 0.892857 | 1 |
2 | 0.916667 | 0.909524 | 0.928571 | 0.866667 | 0.866667 | 0.952381 | 0.909091 | 0.896552 | 0.909524 | 2 |
3 | 0.914286 | 0.904762 | 0.923077 | 0.857143 | 0.857143 | 0.952381 | 0.909091 | 0.888889 | 0.904762 | 3 |
4 | 0.942857 | 0.952381 | 0.875000 | 1.000000 | 1.000000 | 0.904762 | 1.000000 | 0.933333 | 0.952381 | 4 |
performance['test'].mean().loc['f1_score']
0.8975325670498083
fitted_transforms = cv_report.getFittedTransforms()
fitted_transforms
{0: [SKLearnTransformWrapper(
base_transform='sklearn.preprocessing._data.StandardScaler',
transform_params={}
),
<__main__.RandomPermutationSelection at 0x27156732f50>],
1: [SKLearnTransformWrapper(
base_transform='sklearn.preprocessing._data.StandardScaler',
transform_params={}
),
<__main__.RandomPermutationSelection at 0x27156731780>],
2: [SKLearnTransformWrapper(
base_transform='sklearn.preprocessing._data.StandardScaler',
transform_params={}
),
<__main__.RandomPermutationSelection at 0x27156732110>],
3: [SKLearnTransformWrapper(
base_transform='sklearn.preprocessing._data.StandardScaler',
transform_params={}
),
<__main__.RandomPermutationSelection at 0x27156732260>],
4: [SKLearnTransformWrapper(
base_transform='sklearn.preprocessing._data.StandardScaler',
transform_params={}
),
<__main__.RandomPermutationSelection at 0x27156732020>]}
Let's explore the features selected in each fold.
for n_fold, transform in fitted_transforms.items():
    print('Selected features in fold %d: %r' % (n_fold, list(transform[1].selected_features)))
Selected features in fold 0: [0, 2, 4, 9, 10]
Selected features in fold 1: [0, 2, 7, 9, 10]
Selected features in fold 2: [4, 6, 9, 10, 12]
Selected features in fold 3: [0, 2, 4, 8, 9]
Selected features in fold 4: [0, 2, 5, 9, 10]
Although naive, this feature selection is fairly stable across folds: feature 9 is selected in all five folds, and features 0, 2, and 10 in four of the five.
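The stability of the selection can be quantified by counting how often each feature index appears across folds. The sketch below is self-contained and uses the selections printed above, copied into a plain dictionary; the names `selected_per_fold` and `feature_counts` are illustrative.

```python
from collections import Counter

# selected feature indices per fold, copied from the output above
selected_per_fold = {
    0: [0, 2, 4, 9, 10],
    1: [0, 2, 7, 9, 10],
    2: [4, 6, 9, 10, 12],
    3: [0, 2, 4, 8, 9],
    4: [0, 2, 5, 9, 10],
}

# count in how many folds each feature index was selected
feature_counts = Counter(
    feature for features in selected_per_fold.values() for feature in features
)

# report features from most to least frequently selected
n_folds = len(selected_per_fold)
for feature, count in sorted(feature_counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print('feature %d: selected in %d/%d folds' % (feature, count, n_folds))
```

In a live session the same dictionary could be built from the fitted transforms instead of being typed by hand.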