class: left, middle, inverse, title-slide

# Machine Learning in R: Workshop Series
## Imbalanced Learning

### Simon Schölzel
### Research Team Berens
### 2020-11-24 (updated: 2020-11-24)

---

name: agenda

## Agenda

**1 Learning Objectives**

**2 Introduction to Imbalanced Learning**

**3 Techniques for Addressing Class Imbalance**
> 3.1 Sampling Strategies
>> 3.1.1 Random Oversampling
3.1.2 Synthetic Minority Over-Sampling Technique (SMOTE)
3.1.3 Borderline SMOTE
3.1.4 Random Undersampling
3.1.5 Informed NearMiss Undersampling

> 3.2 Case Weighting
3.3 Excursus: Evaluation of Classification Models
3.4 Tinkering with Classification Cutoffs

**4 Cost-Sensitive Learning**

---

## 1 Learning Objectives 💡

This workshop serves as a gentle introduction to the field of imbalanced learning. You will not only learn about the peculiarities and implications of working with imbalanced data sets, but also how to address class imbalance within your machine learning pipeline.

More specifically, after this workshop you will
- be able to identify an imbalanced data set and know about its implications for modeling,<br><br>
- carry a toolbox of techniques for addressing class imbalance at various stages in your machine learning pipeline (e.g., data collection, resampling, model estimation or model evaluation),<br><br>
- have internalized basic (random under- and oversampling) and more advanced resampling techniques (SMOTE, Borderline SMOTE, NearMiss),<br><br>
- know how to distinguish alternative routes to handling class imbalance, such as imbalanced learning or cost-sensitive learning.

---

## 2 Introduction to Imbalanced Learning

> An imbalance occurs when one or more classes have very low proportions in the training data as compared to the other classes. ~ [Kuhn, M./Johnson, K. (2013), p. 419](#references)

--

As can be inferred from this quote, the presence of class imbalance depends on the properties of your data set. It is defined by the relation between two types of classes:

- **Majority class:** The class (or set of classes) which accounts for the vast majority of the samples.
- **Minority class:** The underrepresented class which is characterized by relatively few samples.

--

The relation between the number of samples in the majority vs. minority class is called the **class distribution**. It is often expressed in terms of ratios (e.g., 100:1) or percentages (e.g., 4% vs. 96%); a short code sketch for inspecting a class distribution follows below.

<br><br><br>
<img src="img/class-imb.png" width="50%" height="50%" style="display: block; margin: auto auto auto 0;" />

.pull-right[.footnote[
*Note: In this workshop we focus on class imbalance in the context of classification. Most of the ideas, however, transfer analogously to regression problems, where outliers reflect the minority class.*

*Source: [Allison Horst](https://github.com/allisonhorst/stats-illustrations)*
]]

???
- Class distribution for imbalanced data sets: skewed.
- In most practical applications, such as fraud detection, churn prediction or bankruptcy prediction, you are particularly interested in correctly predicting the *minority class*.

---

background-image: url(https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/lter_penguins.png)
background-position: 95% 5%
background-size: 15%
layout: false

## 2 Introduction to Imbalanced Learning

.pull-left[

**Implications of an imbalanced learning regime:** There is only very little information in the minority class for a model to learn from. This may lead to

- an overemphasis of the majority class,
- poorly calibrated prediction models,
- a focus on detrimental model candidates.
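As a quick aside on the class distribution introduced above, the following is a minimal, hypothetical sketch (not the original workshop code) that re-creates the artificial binary target described in the presenter notes (Adelie penguins, flipper length of 200+ mm as the positive class) and inspects how imbalanced it is:

```r
# Hypothetical sketch: inspect the class distribution of an imbalanced target.
library(dplyr)
library(palmerpenguins)

penguins_bin <- penguins %>%
  filter(species == "Adelie", !is.na(flipper_length_mm)) %>%
  mutate(flipper_long = factor(if_else(flipper_length_mm >= 200, "yes", "no")))

# class counts and shares: majority vs. minority class
penguins_bin %>%
  count(flipper_long) %>%
  mutate(share = n / sum(n))
```

The resulting counts and shares immediately show which class is the minority class and how skewed the class distribution is.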
<img src="index_files/figure-html/unnamed-chunk-3-1.png" width="576" style="display: block; margin: auto;" />
]

--

.pull-right[

Let's consider the following confusion matrix which belongs to a classification model that was trained on the penguin data:

```
> Truth
> Prediction   0   1
>          0 137   9
>          1   0   0
```

A *naive classifier* which always predicts the negative class will be correct 93.8% of the time (137 / 146). 😲
]

???
Example:
- We look only at penguins of the Adelie species.
- Plotted by bill length and bill depth.
- Artificial example: classify penguins by a flipper length of 200+ mm (positive class).
- Here: overlapping distributions.

93.8% = predictive accuracy
- no indication about the type of error
- can be devastating in some areas of application (e.g., cancer diagnosis)
- and if we next look at the confusion matrix we can clearly see a bias towards the majority class

---

## Recap: The Generic Machine Learning Pipeline

<img src="https://ml-ops.org/img/ml-engineering.jpg" width="55%" height="55%" style="float:right; padding:10px" />

**.yellow[Data Pipeline]**
> 1 Data ingestion
2 Data exploration (EDA)
3 Data wrangling
4 Data partitioning / resampling

**.red[ML Pipeline]**
> 5 Model training
6 Model evaluation / testing
7 Model packaging

**.blue[Software & Code Pipeline]**
> 8 Model deployment
9 Model performance monitoring
10 Model performance logging

.footnote[*Source: [ml-ops.org](https://ml-ops.org/content/end-to-end-ml-workflow)*]

???
Different steps within our pipeline where we can address class imbalance:
- data collection
- resampling
- model training / estimation
- model evaluation

---

## 3 Techniques for Addressing Class Imbalance
### 3.1 Sampling Strategies (*Data Collection* / *Resampling*)

The upcoming techniques address the issue of class imbalance by implementing sampling strategies that prevent imbalances, either a priori (prior to *data collection*) or post hoc (after *data collection*).

--

**A priori:** Collect data in a way that directly ensures a uniform class distribution.

--

**Post hoc:** Set up your resampling approach (e.g., cross-validation, bootstrapping) in a way that ensures a uniform class distribution.
- Oversampling (also: upsampling)
- Undersampling (also: downsampling)
- Hybrid approaches (mixture of over- and undersampling)

.footnote[
*Note: These resampling strategies are model-agnostic as class imbalance is addressed on the data level. Eventually, model performance must be assessed on an imbalanced test set to obtain a realistic estimate of the model's performance.*
]

---

## 3.1 Sampling Strategies (*Resampling*)
### 3.1.1 Random Oversampling

.pull-left[
<img src="index_files/figure-html/unnamed-chunk-6-1.png" width="576" style="display: block; margin: auto;" />
]

.pull-right[

**Random oversampling:** Take all cases from the majority class in the training set and complement them by drawing random samples with replacement from the minority class.

- The training set size is determined by the size of the majority class.
- The classifier may extract information from one sample several times.
- Duplication of potentially uninformative samples and risk of overfitting.
]

.footnote[
*Note: This plot uses `geom_jitter()` from the `ggplot2` package. Thus, the generated samples are slightly shifted.
In reality, they have the exact same values as the three original positive samples.*
]

---

## 3.1 Sampling Strategies (*Resampling*)
### 3.1.2 Synthetic Minority Over-Sampling Technique (SMOTE)

.pull-left[
<img src="index_files/figure-html/unnamed-chunk-7-1.png" width="576" style="display: block; margin: auto;" />
]

.pull-right[

**SMOTE:** Take the minority class and generate additional synthetic samples by using *k*-NN. [[2](#references)]

- The classifier can leverage new samples which are similar but not equal to the original ones.
- SMOTE enriches the minority class feature space by filling it (instead of extending it).

**Algorithm:**
1. Randomly select a minority class sample as well as its *k* nearest neighbors.
2. Randomly create a synthetic sample along the vector between the minority class sample and each nearest neighbor.
3. Repeat until the desired class distribution is reached.
]

???
- *k*-NN: *k*-nearest-neighbors algorithm, where *k* is the number of nearest neighbors considered.
- Step 1: the nearest neighbors are also taken from the minority class.
- Step 2: with four minority samples, each sample has three nearest neighbors; create three synthetic samples and then randomly select a new minority sample again.

---

## 3.1 Sampling Strategies (*Resampling*)
### 3.1.3 Borderline SMOTE

.pull-left[
<img src="index_files/figure-html/unnamed-chunk-8-1.png" width="576" style="display: block; margin: auto;" />
]

.pull-right[

**Three types of minority class samples:**
- *Safe* (all nearest neighbors belong to the minority class)
- *Lost* (all nearest neighbors belong to the majority class)
- *Borderline* (more than 50% of the nearest neighbors belong to the majority class)

**Borderline SMOTE:** Take the *borderline samples* in the minority class and generate additional synthetic samples by using *k*-NN to improve model accuracy in uncertain regions. [[3](#references)]

- **Borderline SMOTE1:** Create synthetic samples only from minority class nearest neighbors (i.e. filling the minority class feature space).
- **Borderline SMOTE2:** Create synthetic samples from all nearest neighbors (i.e. extending the minority class feature space).
]

???
Borderline SMOTE focuses on uncertain samples (safe and lost samples are certain).

---

## 3.1 Sampling Strategies (*Resampling*)
### 3.1.4 Random Undersampling

.pull-left[
<img src="index_files/figure-html/unnamed-chunk-9-1.png" width="576" style="display: block; margin: auto;" />
]

.pull-right[

**Random undersampling:** Take all cases from the minority class in the training set and complement them by drawing random samples without replacement from the majority class.

- The training set size is determined by the size of the minority class.
- The classifier extracts information only from a subset of all available samples.
- Deletion of potentially informative samples.
]

---

## 3.1 Sampling Strategies (*Resampling*)
### 3.1.5 Informed NearMiss Undersampling

.pull-left[
<img src="index_files/figure-html/unnamed-chunk-10-1.png" width="576" style="display: block; margin: auto;" />
]

.pull-right[

**Informed NearMiss undersampling:** Take only informative samples from the majority class, i.e. samples which are expected to be close to the empirical decision boundary, to improve accuracy in uncertain regions. [[4](#references)]

- **NearMiss-1:** Select majority class samples with minimum average distance to the *k* nearest neighbors from the minority class.
- **NearMiss-2:** Select majority class samples with minimum average distance to the *k* farthest neighbors from the minority class.
- **NearMiss-3:** For each minority class sample, select the *k* majority class samples with the minimum distance to it.
]

???
- Again builds on *k*-NN.
- NearMiss-3: ensures a surrounding of each minority sample and focuses on samples close to the decision boundary.

---

## 3 Techniques for Addressing Class Imbalance
### 3.1 Sampling Strategies (*Resampling*)

<img src="https://camo.githubusercontent.com/4da51d5ade52abe55b211efa50de0bc26cf025e5/68747470733a2f2f74686973686f6c6c6f7765617274682e66696c65732e776f726470726573732e636f6d2f323031322f30372f7468656d69732e6a7067" width="25%" height="25%" style="float:right; padding:10px" />

**Additional remarks:**

- The effectiveness of the different sampling techniques depends heavily on the underlying data as well as the applied model.<br><br>
- In practice, it is often sufficient to eliminate severe imbalance and keep a slight imbalance (e.g., 60:40), i.e. a 1:1 ratio as advocated here is not always required.<br><br>
- You may consider the ratio (as well as the number of nearest neighbors) as potential hyperparameters in your modeling workflow.<br><br>
- Even though the examples center around a 2-dimensional feature space, the techniques easily generalize to higher-dimensional feature spaces.<br><br>
- The `themis` package is a good place to start when facing class imbalance in your modeling pipeline (see the short recipe sketch further below).

---

layout: false
class: center, middle

# 5-Minute Break<br><br>☕ 🍩

---

## 3 Techniques for Addressing Class Imbalance
### 3.2 Case Weighting (*Model Estimation*)

.pull-left[
<img src="index_files/figure-html/unnamed-chunk-12-1.png" width="576" style="display: block; margin: auto;" />
]

.pull-right[

Assign higher *case weights* to samples of the minority class when training the classifier.

For some model families, this is equivalent to generating duplicate samples with identical feature values.
]

.footnote[
*Note: In some sense this approach is very similar to random oversampling as discussed previously.*
]

---

## 3.3 Excursus: Evaluation of Classification Models
### 3.3.1 Confusion Matrix

.pull-left[
After having successfully trained a classification model, we are generally interested in its out-of-sample performance.

Let's again look at the **confusion matrix** mentioned above:
]

.pull-right[
```
> Truth
> Prediction   0   1
>          0 137   9
>          1   0   0
```
]

--

**True Positives:** `\(TP=\)` 0
Correctly predicted positive cases (positive cases predicted as positive)

**False Positives:** `\(FP=\)` 0
Incorrectly predicted negative cases (negative cases predicted as positive)

**True Negatives:** `\(TN=\)` 137
Correctly predicted negative cases (negative cases predicted as negative)

**False Negatives:** `\(FN=\)` 9
Incorrectly predicted positive cases (positive cases predicted as negative)

---

## 3.3 Excursus: Evaluation of Classification Models
### 3.3.1 Confusion Matrix

.pull-left[
After having successfully trained a classification model, we are generally interested in its out-of-sample performance.
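Before turning to evaluation in detail, here is a short aside that ties the sampling strategies from section 3.1 and the case weighting from section 3.2 to concrete R tooling. This is a hedged sketch, not the original workshop code: the `flipper_long` target, the chosen ratios and the weight of 5 are illustrative assumptions.

```r
# Hypothetical sketch: resampling steps from the themis package inside a recipe.
library(dplyr)
library(tidyr)
library(recipes)
library(themis)
library(palmerpenguins)

train_df <- penguins %>%
  filter(species == "Adelie") %>%
  transmute(
    flipper_long = factor(if_else(flipper_length_mm >= 200, "yes", "no")),
    bill_length_mm, bill_depth_mm
  ) %>%
  drop_na()

rec <- recipe(flipper_long ~ ., data = train_df) %>%
  step_upsample(flipper_long, over_ratio = 1)                        # 3.1.1 random oversampling
  # step_smote(flipper_long, over_ratio = 1, neighbors = 5)          # 3.1.2 SMOTE
  # step_bsmote(flipper_long, neighbors = 5, all_neighbors = FALSE)  # 3.1.3 Borderline SMOTE
  # step_downsample(flipper_long, under_ratio = 1)                   # 3.1.4 random undersampling
  # step_nearmiss(flipper_long, under_ratio = 1, neighbors = 5)      # 3.1.5 NearMiss

# The sampling steps are skipped when baking new data (skip = TRUE by default),
# so the test set keeps its original, imbalanced class distribution.
rec %>% prep() %>% bake(new_data = NULL) %>% count(flipper_long)

# Case weighting (section 3.2): many fitting functions accept observation
# weights; here minority-class rows get a higher (made-up) weight of 5.
w <- if_else(train_df$flipper_long == "yes", 5, 1)
fit_w <- glm(flipper_long ~ ., data = train_df, family = binomial(), weights = w)
```

In line with the remarks above, ratios such as `over_ratio`/`under_ratio` and the number of `neighbors` could also be treated as tunable hyperparameters within a larger tidymodels workflow.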
Let's again look at the **confusion matrix** mentioned above:
]

.pull-right[
```
> Truth
> Prediction   0   1
>          0 137   9
>          1   0   0
```
]

**Sensitivity:** `\(sens=TP/(TP+FN)=\)` 0 / (0 + 9) = 0
The proportion of positives that are correctly classified (also called *True Positive Rate (TPR)* or *Recall*)

**Specificity:** `\(spec=TN/(FP+TN)=\)` 137 / (0 + 137) = 1
The proportion of negatives that are correctly classified

--

This discrepancy between *sensitivity* and *specificity* highlights the trade-off inherent to the two metrics: Optimizing for the correct prediction of the positive (negative) class results in a higher chance of FP (FN). Hence, choosing a classification rule is essentially a trade-off between the two types of classification error.

--

🤔 **What if we simply shift the cutoff (e.g., 0.5) for classifying a sample into the positive class?**

???
- Shift: in this case, lower the threshold to make it more likely to predict a sample as positive.
- Here, the underlying model is a logistic regression which gives us predicted probabilities.
- Other classifiers (e.g., SVM) may give us other outputs that help us to discriminate between classes.

---

## 3.3 Excursus: Evaluation of Classification Models
### 3.3.2 Receiver Operating Characteristic (ROC)-Curve

.pull-left[
<img src="index_files/figure-html/unnamed-chunk-15-1.png" width="576" style="display: block; margin: auto;" />
]

.pull-right[

**ROC-curve:** Step function created by gradually lowering the cutoff ( `\(c\)` ) for classifying a sample as positive from `\(+\infty\)` to `\(-\infty\)`. It aggregates all possible confusion matrices and creates a performance profile of the classifier. [[5](#references)]

**A:** Our classifier ( `\(sens=\)` 0, `\(spec=\)` 1) which never issues positive predictions ( `\(c=\)` 0.5 ).<br><br>
**B:** The opposite classifier which only issues positive predictions.<br><br>
**C:** The perfect classifier which always predicts the class correctly.
]

???
- A ROC-curve can be created for any classifier that outputs some kind of ranking.
- Each point on the ROC-curve corresponds to a single confusion matrix.
- More nuanced view than plain vanilla accuracy.

---

## 3.3 Excursus: Evaluation of Classification Models
### 3.3.2 Receiver Operating Characteristic (ROC)-Curve

.pull-left[
<img src="index_files/figure-html/unnamed-chunk-16-1.png" width="576" style="display: block; margin: auto;" />
]

.pull-right[

**D:** A conservative classifier which only issues positive predictions with sufficient evidence (hence producing few TP and FP).<br><br>
**E:** A liberal classifier which issues positive predictions more liberally (hence producing many TP but also many FP).<br><br>
**F:** A random classifier which issues positive predictions according to a coin toss.<br><br>
**G:** Our classifier ( `\(sens=\)` 0.8889, `\(spec=\)` 0.7007) with an optimized cutoff ( `\(c=\)` 0.0699 ).<br><br>
**Lower right triangle:** A classifier that performs worse than random.
]

???
- Random classifier: to move from F towards the top-left corner, the classifier must exploit some of the information hidden in the data.
- Worse-than-random classifier: technically, such a classifier does not exist as you could simply invert the predictions (e.g., FP become TN).
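---

## 3.3 Excursus: Evaluation of Classification Models
### Code Sketch: Confusion Matrix Metrics, ROC and Cutoffs in R

The metrics and the ROC-curve above can be reproduced along the following lines. This is a hedged sketch that refits a plain logistic regression as a stand-in for the workshop's classifier; the `flipper_long` target, the column names and the chosen cutoff logic are illustrative assumptions rather than the original code.

```r
# Hypothetical sketch: sensitivity, specificity, ROC-curve and cutoff selection
# with the yardstick package.
library(dplyr)
library(tidyr)
library(ggplot2)
library(yardstick)
library(palmerpenguins)

df <- penguins %>%
  filter(species == "Adelie") %>%
  transmute(
    flipper_long = factor(if_else(flipper_length_mm >= 200, "yes", "no"),
                          levels = c("no", "yes")),
    bill_length_mm, bill_depth_mm
  ) %>%
  drop_na()

# stand-in model: logistic regression on two features
fit <- glm(flipper_long ~ bill_length_mm + bill_depth_mm,
           data = df, family = binomial())

preds <- df %>%
  mutate(
    .pred_yes   = predict(fit, type = "response"),   # P(flipper_long == "yes")
    .pred_class = factor(if_else(.pred_yes >= 0.5, "yes", "no"),
                         levels = c("no", "yes"))
  )

# "yes" is the second factor level, hence event_level = "second"
conf_mat(preds, truth = flipper_long, estimate = .pred_class)
sens(preds, truth = flipper_long, estimate = .pred_class, event_level = "second")
spec(preds, truth = flipper_long, estimate = .pred_class, event_level = "second")

# ROC-curve (one row per candidate cutoff) and ROC-AUC
roc_df <- roc_curve(preds, truth = flipper_long, .pred_yes, event_level = "second")
autoplot(roc_df)
roc_auc(preds, truth = flipper_long, .pred_yes, event_level = "second")

# cutoff closest to the top-left corner (sens = spec = 1) of the ROC-plot
roc_df %>%
  mutate(dist = sqrt((1 - sensitivity)^2 + (1 - specificity)^2)) %>%
  slice_min(dist, n = 1)
```

The last pipe implements the "shortest distance to the optimum" rule discussed in section 3.4; the `probably` package offers related post-processing helpers for threshold analysis.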
---

## 3.3 Excursus: Evaluation of Classification Models
### 3.3.2 Receiver Operating Characteristic (ROC)-Curve

.pull-left[
<img src="index_files/figure-html/unnamed-chunk-17-1.png" width="576" style="display: block; margin: auto;" />
]

.pull-right[

**Properties of the cutoff `\(c\)` :** In general, the cutoff does not have to operate on predicted probabilities! Any (possibly uncalibrated) score that can be used to rank the predictions is sufficient.

**Optimal strategy:** Choose the cutoff so that the classifier is closest to the top-left corner of the ROC-plot.

- This approach does not alter the classifier!
- It simply shifts samples between the cells of the confusion matrix.

🤔 **What if we compare different classifiers with intersecting ROC-curves?**
]

???
- Changing cutoffs is also referred to as post-processing.

---

## 3.3 Excursus: Evaluation of Classification Models
### 3.3.3 Area under the ROC-Curve (ROC-AUC)

.pull-left[
<img src="index_files/figure-html/unnamed-chunk-18-1.png" width="576" style="display: block; margin: auto;" />
]

.pull-right[

**ROC-AUC:** The area under the ROC-curve (i.e. its integral), a summary measure that allows identifying the overall best classifier independently of the cutoff `\(c\)` .

- Lies on the `\([0;1]\)`-interval (technically, `\([0.5;1]\)`).
- The probability that the classifier assigns a higher score to a randomly chosen positive sample than to a randomly chosen negative sample (*discriminatory power*).

**Here:** `\(AUC(C2)=\)` 0.7382 `\(>\)` `\(AUC(C1)=\)` 0.7247

**Multi-class ROC-AUC:** With `\(n>\)` 2 classes the problem gets more complex. Hence, alternative approaches are required:
- One vs. rest ROC-curves
- Weighted one vs. rest ROC-AUC [[5](#references)]
- Generalized ROC-AUC [[6](#references)]
]

???
- ROC-AUC is a scalar value.
- Multi-class ROC: confusion matrix of dimension n x n with multiple error types (not only TP, TN, FP and FN).
- Weighted one vs. rest ROC-AUC: weighted by class prevalence.
- The generalized ROC-AUC builds on pairwise ROC-AUCs.

---

## 3 Techniques for Addressing Class Imbalance
### 3.4 Tinkering with Classification Cutoffs (*Model Evaluation*)

As previously discussed, one approach for tackling class imbalance is to simply trade off *sensitivity* and *specificity* to strike an optimal balance between the two. Usually, by varying the cutoff `\(c\)` the increase in one of the two metrics comes at the expense of the other.

**Possible approaches:**
- Optimize the sensitivity-specificity relation according to some pre-defined target.
- Optimize the classification rule by selecting a cutoff that produces the shortest distance to the optimum in the ROC-plot.

---

## 4 Cost-Sensitive Learning

A related (and partly overlapping) learning paradigm to imbalanced learning is the notion of **cost-sensitive learning**. In contrast to the previous techniques, which mainly focus on rebalancing class distributions, cost-sensitive learning explicitly incorporates cost information when building machine learning systems. [[7](#references)]

**Relative misclassification cost:** Different misclassification errors incur different costs, i.e. misclassifying minority class samples (FN) is usually relatively more costly than misclassifying majority class samples (FP).

.pull-left[

**Approaches to cost-sensitive learning:**

- Incorporate cost information into your loss function by specifying a *cost matrix* (see the short sketch below).<br><br>
- Use *robust algorithms* that can account for different misclassification costs, e.g., by assigning higher cost to FN.
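To illustrate the first bullet, here is a minimal, hypothetical sketch of how a 2x2 cost matrix translates into a cost-sensitive classification rule; the 5:1 cost ratio and the probability vector are made up for illustration.

```r
# Hypothetical sketch: deriving a cost-sensitive cutoff from a 2x2 cost matrix.
# With zero cost for correct predictions, the expected cost of predicting
# "negative" is p * C_FN and of predicting "positive" is (1 - p) * C_FP, so the
# cost-minimizing rule predicts the positive class whenever
# p > C_FP / (C_FP + C_FN).
cost_fp <- 1   # cost of a false positive (illustrative value)
cost_fn <- 5   # cost of a false negative (illustrative value)

cutoff <- cost_fp / (cost_fp + cost_fn)
cutoff  # 0.1667, i.e. far more liberal than the default 0.5

# applied to a vector of predicted positive-class probabilities, e.g. the
# (hypothetical) .pred_yes column from the evaluation sketch above:
p_pos <- c(0.05, 0.12, 0.35, 0.80)
ifelse(p_pos > cutoff, "yes", "no")

# Alternatively, some implementations accept a cost/loss matrix directly when
# fitting, e.g. rpart::rpart(..., parms = list(loss = L)) for decision trees.
```

This is the "shift the cutoff" idea from section 3.4 restated in explicit cost terms: the more expensive a false negative, the lower the probability threshold for predicting the positive class.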
]

.pull-right[.center[
<img src="./img/cost-matrix.PNG" width="100%" height="100%" />

*Source: Multi-class cost matrix adapted from [[7](#references)].*
]]

???
Cost: money, time, health (depending on the domain).

---

name: references

## References

[1]: **Kuhn, M./Johnson, K. (2013):** Applied Predictive Modeling. Chapter 16 (Remedies for Severe Class Imbalance). Springer: New York 2013.

[2]: **Chawla, N. V./Bowyer, K. W./Hall, L. O./Kegelmeyer, W. P. (2002):** SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, Vol. 16, 2002, pp. 321‑357.

[3]: **Han, H./Wang, W.-Y./Mao, B.-H. (2005):** Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In: Hutchison, D. et al. (eds.): Advances in Intelligent Computing. Berlin/Heidelberg: Springer 2005, pp. 878‑887.

[4]: **Mani, I./Zhang, I. (2003):** kNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction. In: Proceedings of the Workshop on Learning from Imbalanced Datasets, 2003.

[5]: **Fawcett, T. (2006):** An Introduction to ROC Analysis. Pattern Recognition Letters, Vol. 27, 2006, No. 8, pp. 861‑874.

[6]: **Hand, D. J./Till, R. J. (2001):** A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning, Vol. 45, 2001, No. 2, pp. 171‑186.

[7]: **He, H./Garcia, E. A. (2009):** Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, Vol. 21, 2009, No. 9, pp. 1263‑1284.