# 2 Introduction to Random Forest (RF)

1. Draw n samples from the dataset with replacement (bootstrap) to form a training set.
2. Grow a decision tree from the bootstrapped training set. At each node:
   - randomly select d features without replacement;
   - split the samples on each of these d features and keep the best splitting feature (judged by Gini index, gain ratio, or information gain).
3. Repeat steps 1–2 a total of k times; k is the number of decision trees in the random forest.
4. Use the trained random forest to predict the test samples, and decide the final prediction by majority vote.
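The four steps above can be sketched directly, using sklearn's `DecisionTreeClassifier` as the base learner. This is a minimal illustration, not the actual implementation inside `RandomForestClassifier`; the function names `fit_forest`/`predict_forest` and the defaults `k=25`, `d="sqrt"` are made up for this sketch:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, k=25, d="sqrt", seed=0):
    """Steps 1-3: grow k trees, each on a bootstrap sample of size n,
    with d randomly chosen candidate features at every split."""
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    trees = []
    for _ in range(k):
        idx = rng.randint(0, n, size=n)      # step 1: sample n rows with replacement
        tree = DecisionTreeClassifier(
            criterion="gini",                # step 2: Gini index as the split criterion
            max_features=d,                  # d random features considered per node
            random_state=rng.randint(2**31 - 1),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Step 4: majority vote across the k trees."""
    votes = np.array([t.predict(X) for t in trees])   # shape (k, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```

Each tree only sees its own bootstrap sample and a random subset of features per split, which is what de-correlates the trees and makes the majority vote effective.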

# 3 Evaluating Feature Importance

The Gini index of node $q$ in tree $i$ is

$$GI_q^{(i)}=\sum_{c=1}^{|C|}\sum_{c' \neq c} p_{qc}^{(i)} p_{qc'}^{(i)}=1-\sum_{c=1}^{|C|}\left(p_{qc}^{(i)}\right)^2 \tag{3-1}$$

where $|C|$ is the number of classes and $p_{qc}^{(i)}$ is the proportion of class $c$ among the samples in node $q$. The importance of feature $j$ at node $q$, i.e. the change in Gini index before and after node $q$ is split on feature $j$, is

$$VIM_{jq}^{(Gini)(i)}=GI_q^{(i)}-GI_l^{(i)}-GI_r^{(i)} \tag{3-2}$$

where $GI_l^{(i)}$ and $GI_r^{(i)}$ are the Gini indices of the left and right child nodes produced by the split. Letting $Q$ be the set of nodes in tree $i$ that split on feature $j$, the importance of feature $j$ in tree $i$ is

$$VIM_{j}^{(Gini)(i)}=\sum_{q \in Q}VIM_{jq}^{(Gini)(i)} \tag{3-3}$$

Summing over all $I$ trees in the forest gives

$$VIM_j^{(Gini)}=\sum_{i=1}^{I}VIM_{j}^{(Gini)(i)} \tag{3-4}$$

Finally, the scores are normalized so that the importances of all $J$ features sum to 1:

$$VIM_j^{(Gini)}=\dfrac{VIM_j^{(Gini)}}{\sum_{j'=1}^{J} VIM_{j'}^{(Gini)}} \tag{3-5}$$
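Equations (3-1) and (3-2) can be checked numerically. A small sketch with toy class counts (the counts and function names are invented for illustration); note that, exactly as written in Eq. (3-2), the children's Gini indices are subtracted without weighting by sample fraction:

```python
import numpy as np

def gini_index(class_counts):
    """Eq. (3-1): GI_q = 1 - sum_c p_qc^2, with p_qc the class proportions."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def split_importance(parent, left, right):
    """Eq. (3-2): Gini decrease of a split on feature j at node q."""
    return gini_index(parent) - gini_index(left) - gini_index(right)

# A perfectly balanced two-class node has the maximum Gini index 0.5;
# a split that separates the two classes completely removes all impurity.
print(gini_index([5, 5]))                        # 0.5
print(split_importance([5, 5], [5, 0], [0, 5]))  # 0.5
```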

# 4 A Worked Example

```python
import pandas as pd

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
df = pd.read_csv(url, header=None)  # the file has no header row
df.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
              'Alcalinity of ash', 'Magnesium', 'Total phenols',
              'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
              'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
```


```python
import numpy as np

np.unique(df['Class label'])
```

```
array([1, 2, 3], dtype=int64)
```


```python
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
Class label                     178 non-null int64
Alcohol                         178 non-null float64
Malic acid                      178 non-null float64
Ash                             178 non-null float64
Alcalinity of ash               178 non-null float64
Magnesium                       178 non-null int64
Total phenols                   178 non-null float64
Flavanoids                      178 non-null float64
Nonflavanoid phenols            178 non-null float64
Proanthocyanins                 178 non-null float64
Color intensity                 178 non-null float64
Hue                             178 non-null float64
OD280/OD315 of diluted wines    178 non-null float64
Proline                         178 non-null int64
dtypes: float64(11), int64(3)
memory usage: 19.5 KB
```


```python
# sklearn.cross_validation was removed in newer scikit-learn; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

x, y = df.iloc[:, 1:].values, df.iloc[:, 0].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
feat_labels = df.columns[1:]
forest = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
forest.fit(x_train, y_train)
```


```python
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]  # indices of features, most important first
for f in range(x_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))
```


```
 1) Color intensity                0.182483
 2) Proline                        0.158610
 3) Flavanoids                     0.150948
 4) OD280/OD315 of diluted wines   0.131987
 5) Alcohol                        0.106589
 6) Hue                            0.078243
 7) Total phenols                  0.060718
 8) Alcalinity of ash              0.032033
 9) Malic acid                     0.025400
10) Proanthocyanins                0.022351
11) Magnesium                      0.022078
12) Nonflavanoid phenols           0.014645
13) Ash                            0.013916
```
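As a quick check, sklearn's `feature_importances_` are already normalized in the sense of Eq. (3-5): the thirteen printed scores sum to 1 (up to rounding in the printed digits).

```python
# Scores copied from the output above; their sum confirms the
# normalization of Eq. (3-5).
scores = [0.182483, 0.158610, 0.150948, 0.131987, 0.106589, 0.078243,
          0.060718, 0.032033, 0.025400, 0.022351, 0.022078, 0.014645,
          0.013916]
print(round(sum(scores), 4))  # 1.0
```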


```python
threshold = 0.15
x_selected = x_train[:, importances > threshold]  # keep only the most important features
x_selected.shape
```

```
(124, 3)
```
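The same thresholding can also be expressed with sklearn's `SelectFromModel` transformer. A sketch using sklearn's bundled copy of the same UCI wine data (`load_wine`, so it runs without the download) and a smaller forest than above, so the selected feature count may differ slightly from the manual mask:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Same wine data as the article, shipped with sklearn
x, y = load_wine(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
selector = SelectFromModel(forest, threshold=0.15)  # keep importances >= 0.15
x_selected = selector.fit_transform(x_train, y_train)
print(x_selected.shape)
```

`SelectFromModel` fits the forest internally and exposes the mask via `get_support()`, which is convenient when the selection step is part of a `Pipeline`.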



Original author: zjuPeco
Original address: https://blog.csdn.net/zjuPeco/article/details/77371645