A Guide to Cross-Validation in Python

Yes, we’ll code 5 different techniques here!

Cross-Validation is one of the most effective ways of evaluating model performance. It checks that the model fits the data well and guards against Overfitting. It is the process of assessing how the results of a statistical analysis will generalize to an independent dataset.

Overfitting occurs when the model fits the training set perfectly but performs poorly on the test set. This means that the model learns the training data well but does not generalize to unseen samples.

There are different Cross-Validation techniques, such as:

  1. K-Fold Cross-Validation
  2. Stratified K-Fold Cross-Validation
  3. Hold-Out based Validation
  4. Leave-One-Out Cross-Validation
  5. Group K-Fold Cross-Validation

The general idea behind Cross-Validation is that we divide the Training Data into a few parts. We choose a few of these parts to train the model and the rest to test it. The different Cross-Validation techniques differ in how we partition the data.

  1. K-Fold Cross-Validation


K-Fold CV (Source - Internet)

We split the data into k equal parts; at each split, one part serves as the test set and the remaining parts as the train set, and we repeat this for all k splits. This approach minimizes wastage of data, which makes it beneficial when the number of samples is small.

A code snippet to initiate K-Fold Cross-Validation,

# import model_selection module of scikit-learn
from sklearn import model_selection

# initiate the k-fold class from model_selection module
kf = model_selection.KFold(n_splits=5)

# fill the new kfold column
for fold, (trn_, val_) in enumerate(kf.split(X=df)):
    df.loc[val_, 'kfold'] = fold
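Once the kfold column is filled, training proceeds fold by fold. Below is a minimal sketch of that loop, assuming df is a pandas DataFrame whose label lives in a hypothetical target column, and using LogisticRegression purely as a placeholder model:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

for fold in range(5):
    # rows outside the current fold form the training set
    train_df = df[df.kfold != fold]
    valid_df = df[df.kfold == fold]

    # 'target' is an assumed column name; everything else is treated as a feature
    X_train = train_df.drop(columns=['target', 'kfold'])
    y_train = train_df['target']
    X_valid = valid_df.drop(columns=['target', 'kfold'])
    y_valid = valid_df['target']

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print(fold, accuracy_score(y_valid, model.predict(X_valid)))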

2. Stratified K-Fold Cross-Validation


Stratified K-Fold (Source - Internet)

Stratified K-Fold creates K folds from the data such that each fold preserves the percentage of samples for each class.

For example, in a Binary Classification problem where the classes are skewed in a ratio of 90:10, a Stratified K-Fold would create folds maintaining this ratio, unlike K-Fold Validation.

This Cross-Validation technique is meant for Classification problems; for Regression, however, we first have to bin the target into a number of ranges and then apply Stratified K-Fold on those bins (a sketch of this binning step follows the code below).

To find an appropriate number of bins, we follow Sturges' rule,

Number of Bins = 1 + log2(N)

Where N is the number of samples in your dataset. For example, with N = 1,000 samples this gives 1 + log2(1000) ≈ 11 bins.

The code for Stratified K-Fold is similar; the only change is that we also provide the target variable whose class percentages we want to preserve,

# import model_selection module of scikit-learn
from sklearn import model_selection

# fetch targets
y = df.target.values

# initiate the kfold class from model_selection module
kf = model_selection.StratifiedKFold(n_splits=5)

# fill the new kfold column
for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
    df.loc[v_, 'kfold'] = f
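For the Regression case mentioned above, the only extra step is binning the continuous target before stratifying. A minimal sketch, assuming df has a continuous column named target, with the number of bins taken from Sturges' rule:

import numpy as np
import pandas as pd
from sklearn import model_selection

# Sturges' rule: number of bins = 1 + log2(N)
num_bins = int(np.floor(1 + np.log2(len(df))))

# bin the continuous target into discrete classes
df['bins'] = pd.cut(df['target'], bins=num_bins, labels=False)

# stratify on the bins instead of the raw target
kf = model_selection.StratifiedKFold(n_splits=5)
for f, (t_, v_) in enumerate(kf.split(X=df, y=df['bins'].values)):
    df.loc[v_, 'kfold'] = f

# the bins column is only needed for splitting and can be dropped afterwards
df = df.drop(columns=['bins'])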

3. Hold-Out Based Validation


Hold-Out Based CV (Source - Internet)

This is the most common type of Cross-Validation. Here, we split the dataset into a Training and a Test Set, generally in a 70:30 or 80:20 ratio. The model is trained on the Training data and performance is evaluated on the Test data. Since a model generally gets better the more data it sees, this approach has a drawback: it withholds a chunk of data from the model during training.

Hold-Out Cross-Validation is also used in Time-Series models, where the most recent data serves as the Validation Set. For instance, if we have data from the years 2005–2010, we hold out the data from 2010 for Validation and use the 2005–2009 data for Training.

We can use a scikit-learn function to split the data or do a manual split for problems like the one discussed above.

# import model_selection module of scikit-learn
from sklearn import model_selection

# holding out 40% of the data for testing (evaluating)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.4, random_state=0
)
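For the Time-Series example above, the hold-out split is usually done manually on the date rather than randomly. A minimal sketch, assuming df has hypothetical year and target columns:

# train on 2005-2009, hold out the most recent year (2010) for validation
train_df = df[df['year'] <= 2009]
valid_df = df[df['year'] == 2010]

X_train, y_train = train_df.drop(columns=['target']), train_df['target']
X_valid, y_valid = valid_df.drop(columns=['target']), valid_df['target']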

4. Leave-One-Out Cross-Validation


LOOCV (Source - Internet)

Leave-One-Out Cross Validation is an extreme case of K-Fold Cross Validation where k is the number of samples in the data. This technique is computationally very expensive and should only be used with small-sized datasets. Since this technique fits a lot of models, it is robust in terms of its evaluation estimates.

We can use the K-Fold validation code to create n folds, where n is the number of samples in the data. Alternatively, we can leverage scikit-learn's LeaveOneOut() class,

# import LeaveOneOut from scikit-learn's model_selection module
from sklearn.model_selection import LeaveOneOut

# instantiate the LeaveOneOut() object
loo = LeaveOneOut()

# number of splits equals the number of samples in X
loo.get_n_splits(X)

# get the list of Train-Test sets
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
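As a shorter alternative, the same Leave-One-Out loop can be driven by scikit-learn's cross_val_score. A minimal sketch, using LinearRegression purely as a placeholder estimator:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# fits one model per sample; scores holds one value per left-out sample
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring='neg_mean_absolute_error')
print(-scores.mean())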

5. Group K-Fold Cross-Validation


Group K-Fold (Source - Internet)

GroupKFold is a variation of k-fold which ensures that the same group is not represented in both testing/validation and training sets.

For example, if your data includes multiple rows for each customer (but it still makes sense to train on individual transactions/rows), and your production use-case involves making predictions for new customers, then testing on rows from customers that also have rows in your training set may be optimistically biased.

GroupKFold makes it possible to detect this kind of overfitting situation.

A snippet for applying it (here, grouping rows by patient ID),

# import GroupKFold from scikit-learn's model_selection module
from sklearn.model_selection import GroupKFold

# creating GroupKFold object
n_splits = 5
gkf = GroupKFold(n_splits=n_splits)

# creating group folds; groups_by_patient_id_list holds one group label per row of train_df
result = []
for train_idx, val_idx in gkf.split(train_df, y_labels, groups=groups_by_patient_id_list):
    train_fold = train_df.iloc[train_idx]
    val_fold = train_df.iloc[val_idx]
    result.append((train_fold, val_fold))
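If the grouping key lives in a DataFrame column, as in the customer example above, the groups argument can be built straight from that column. A small sketch, assuming a hypothetical customer_id column in train_df:

from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)

# every row of the same customer gets the same group label
groups = train_df['customer_id'].values

for train_idx, val_idx in gkf.split(train_df, y_labels, groups=groups):
    # a customer never appears in both the training and the validation fold
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])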

Cross-Validation is an important technique when it comes to the statistical evaluation of our models. Choosing the right technique can improve both the accuracy and robustness of the model, and estimating performance metrics with these techniques helps create models that perform better on unseen data.

There are more such Cross-Validation techniques too, but we leave that to you to explore and find out!

If you have come this far hit a clap and follow me for more!

Also, cheers to Abhishek Thakur for his amazing book Approaching (Almost) Any Machine Learning Problem for covering cross-validation in detail, and to the scikit-learn website for its amazing Cross-Validation visualizations.