# Several Common Sampling Methods and How They Work

## 1. Naive Random Sampling

• Random oversampling: randomly select examples from the minority class, with replacement, and add them to the training dataset;
• Random undersampling: randomly select examples from the majority class and remove them from the training dataset;
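The two strategies can be sketched from scratch with NumPy to show the mechanics (the function names and the choice to balance to equal class counts are illustrative, not a library API):

```python
import numpy as np

def random_oversample(X, y, minority_label, rng=None):
    """Duplicate randomly chosen minority samples (with replacement)
    until both classes have the same count."""
    rng = np.random.default_rng(rng)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

def random_undersample(X, y, minority_label, rng=None):
    """Randomly drop majority samples (without replacement)
    until both classes have the same count."""
    rng = np.random.default_rng(rng)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    kept = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([kept, minority])
    return X[idx], y[idx]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)          # 8 majority vs 2 minority
X_over, y_over = random_oversample(X, y, minority_label=1, rng=0)
X_under, y_under = random_undersample(X, y, minority_label=1, rng=0)
print(np.bincount(y_over))   # [8 8] -- minority duplicated up
print(np.bincount(y_under))  # [2 2] -- majority trimmed down
```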

## 2. Random Oversampling

```python
from imblearn.over_sampling import RandomOverSampler

# define oversampling strategy
oversample = RandomOverSampler(sampling_strategy='minority')
```

```python
from imblearn.over_sampling import RandomOverSampler

# define oversampling strategy (minority is grown to 50% of the majority count)
oversample = RandomOverSampler(sampling_strategy=0.5)
# fit and apply the transform
X_over, y_over = oversample.fit_resample(X, y)
```

```python
# example of evaluating a decision tree with random oversampling
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
# define dataset
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)
# define pipeline
steps = [('over', RandomOverSampler()), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='f1_micro', cv=cv, n_jobs=-1)
score = mean(scores)
print('F1 Score: %.3f' % score)
```

## 3. Random Undersampling

```python
from imblearn.under_sampling import RandomUnderSampler

# define undersample strategy
undersample = RandomUnderSampler(sampling_strategy='majority')
```

```python
from imblearn.under_sampling import RandomUnderSampler

# define undersample strategy (majority is cut until minority/majority = 0.5)
undersample = RandomUnderSampler(sampling_strategy=0.5)
# fit and apply the transform
X_under, y_under = undersample.fit_resample(X, y)
```

## 4. Combining Random Oversampling and Undersampling

```python
# example of evaluating a model with random oversampling and undersampling
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
# define dataset
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)
# define pipeline
over = RandomOverSampler(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('o', over), ('u', under), ('m', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='f1_micro', cv=cv, n_jobs=-1)
score = mean(scores)
print('F1 Score: %.3f' % score)
```

## 5. Other Oversampling Variants

### 5.1 SMOTE

RandomOverSampler oversamples by duplicating some of the original minority-class samples, whereas SMOTE generates new samples by interpolation.

`Synthesizing new examples from existing ones` is a form of `data augmentation` for the minority class, known as the Synthetic Minority Oversampling Technique, or SMOTE for short.

SMOTE works by selecting examples that are close in feature space, drawing a line between them, and creating a new sample at a point along that line.

1. SMOTE first randomly selects a minority-class instance a and finds its k nearest minority-class neighbors.
2. A synthetic instance is then created by randomly choosing one of the k nearest neighbors b and connecting a and b to form a line segment in feature space; the synthetic instance is generated as a convex combination of the two selected instances a and b.
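The two steps above can be sketched with NumPy; this is a toy illustration of the interpolation step only (`smote_sample` is a hypothetical helper, not the imbalanced-learn API):

```python
import numpy as np

def smote_sample(minority_X, k=2, rng=None):
    """Pick a random minority point a, one of its k nearest minority
    neighbors b, and return a + gap * (b - a) with gap in [0, 1)."""
    rng = np.random.default_rng(rng)
    i = rng.integers(len(minority_X))
    a = minority_X[i]
    # distances from a to every other minority point
    d = np.linalg.norm(minority_X - a, axis=1)
    d[i] = np.inf                      # exclude a itself
    neighbors = np.argsort(d)[:k]      # indices of the k nearest minority neighbors
    b = minority_X[rng.choice(neighbors)]
    gap = rng.random()                 # random position on the segment a-b
    return a + gap * (b - a)           # convex combination of a and b

minority_X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
new_point = smote_sample(minority_X, k=2, rng=42)
print(new_point)  # lies on a segment between two nearby minority points
```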

`It is recommended to first use random undersampling to trim the number of majority-class examples, then use SMOTE to oversample the minority class toward a balanced class distribution. The combination of SMOTE and undersampling performs better than plain undersampling.` (The original SMOTE paper suggests combining SMOTE with random undersampling of the majority class.)

```python
# Oversample and plot imbalanced dataset with SMOTE
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from matplotlib import pyplot
from numpy import where
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)
# transform the dataset
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
```

```python
# Oversample with SMOTE and random undersample for imbalanced dataset
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from matplotlib import pyplot
from numpy import where
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)
# define pipeline
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)
# transform the dataset
X, y = pipeline.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
```

```python
# decision tree on imbalanced dataset with SMOTE oversampling and random undersampling
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# define pipeline
model = DecisionTreeClassifier()
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('over', over), ('under', under), ('model', model)]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % mean(scores))
```

```python
# grid search k value for SMOTE oversampling for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# values to evaluate
k_values = [1, 2, 3, 4, 5, 6, 7]
for k in k_values:
    # define pipeline
    model = DecisionTreeClassifier()
    over = SMOTE(sampling_strategy=0.1, k_neighbors=k)
    under = RandomUnderSampler(sampling_strategy=0.5)
    steps = [('over', over), ('under', under), ('model', model)]
    pipeline = Pipeline(steps=steps)
    # evaluate pipeline
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    score = mean(scores)
    print('> k=%d, Mean ROC AUC: %.3f' % (k, score))
```

### 5.2 Borderline-SMOTE

SMOTE is rather indiscriminate: it randomly selects a minority-class sample a, finds its k nearest neighbors, randomly picks one of them as b, connects a and b, and places the new sample at a point on the segment ab. This can easily produce wrongly placed samples: synthetic points that fall inside the majority-class region.

Rather than blindly generating new synthetic examples for the minority class, Borderline-SMOTE creates synthetic examples only along the decision boundary between the two classes.

```python
# borderline-SMOTE for imbalanced dataset
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE
from matplotlib import pyplot
from numpy import where
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)
# transform the dataset
oversample = BorderlineSMOTE()
X, y = oversample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
```

### 5.3 Borderline-SMOTE SVM

```python
# borderline-SMOTE with SVM for imbalanced dataset
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SVMSMOTE
from matplotlib import pyplot
from numpy import where
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)
# transform the dataset
oversample = SVMSMOTE()
X, y = oversample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
```

### 5.4 ADASYN

ADASYN is based on the idea of adaptively generating minority samples according to their distribution: `more synthetic data is generated for minority-class samples that are harder to learn` than for those that are easier to learn.

```python
# Oversample and plot imbalanced dataset with ADASYN
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN
from matplotlib import pyplot
from numpy import where
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)
# transform the dataset
oversample = ADASYN()
X, y = oversample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
```

## 6. Caveats When Resampling

1. If the entire dataset is resampled, the model is never tested on data whose class distribution resembles the real use case: both the training and test sets may end up balanced, whereas the model should be evaluated on imbalanced data to assess its potential bias;

2. The resampling procedure may use information about samples in the dataset to generate or select new ones, so information from samples that will later serve as test samples can leak into training.


Original author: xihuishaw
Original address: https://www.cnblogs.com/xihuishaw1995/p/16356280.html