Sklearn (scikit-learn) is the most popular machine learning toolkit for Python. It implements data preprocessing, classification, regression, dimensionality reduction, feature selection, feature extraction, model evaluation, and other machine learning functionality. Training a machine learning model in Sklearn is straightforward: most models implement two methods, fit and predict, for model training and model prediction respectively. The goal of this tutorial is to demonstrate the use of Sklearn through a few simple examples. It is translated from: http://amueller.github.io/sklearn_tutorial/ .
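Most estimators share this same two-step API. As a minimal schematic sketch of the pattern (LogisticRegression is used here purely as an example):

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

# return_X_y requires scikit-learn >= 0.18
X, y = load_digits(return_X_y=True)
model = LogisticRegression()    # choose a model and set its hyperparameters
model.fit(X, y)                 # fit: learn model parameters from the data
print(model.predict(X[:5]))     # predict: apply the trained model to data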

1 Data Representation

In Sklearn, data is represented as numpy arrays or scipy sparse matrices. Let's load some example data:

In [1]:
from sklearn.datasets import load_digits
digits = load_digits()
In [2]:
print("images shape: %s" % str(digits.images.shape))
print("targets shape: %s" % str(digits.target.shape))
images shape: (1797, 8, 8)
targets shape: (1797,)
In [7]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.matshow(digits.images[0], cmap=plt.cm.Greys)
Out[7]:
<matplotlib.image.AxesImage at 0x7fedf6b0b7d0>
In [8]:
digits.target
Out[8]:
array([0, 1, 2, ..., 8, 9, 8])

2 Data Preparation

In [9]:
X = digits.data.reshape(-1, 64)
print(X.shape)
(1797, 64)
In [10]:
y = digits.target
print(y.shape)
(1797,)

Our dataset contains 1797 samples; each sample is an 8x8 image, which we flatten into a 64-dimensional vector. The X.shape attribute returns the number of samples and the number of features, in the format (n_samples, n_features).
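Note that digits.data already stores the flattened images, so the reshape above is effectively a no-op; a quick sanity check (sketch):

import numpy as np
# Each 8x8 image, flattened row by row, equals the corresponding row of digits.data.
print(np.allclose(digits.images.reshape(-1, 64), digits.data))  # True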

In [11]:
print(X)
[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ..., 
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]

3 Dimensionality Reduction and Visualization

It is usually worth taking a look at the data first. The simplest way is to reduce it to two dimensions and plot it. We use Principal Component Analysis (PCA), a simple dimensionality reduction method well suited to visualization. In Sklearn, PCA is implemented in the sklearn.decomposition module.

In [12]:
from sklearn.decomposition import PCA

Initialize the model and set the dimensionality reduction parameter, fixing the number of components at 2.

In [13]:
pca = PCA(n_components=2)

Fit the dimensionality reduction model by passing it the data.

In [15]:
pca.fit(X)
Out[15]:
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

Use the fitted model to transform the data into the lower-dimensional space.

In [17]:
X_pca = pca.transform(X)
X_pca.shape
Out[17]:
(1797, 2)

Visualize the reduced two-dimensional data, using the original class labels to color the points.

In [19]:
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y);
In [20]:
print(pca.mean_.shape)
print(pca.components_.shape)
(64,)
(2, 64)
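These two attributes are all that transform uses: it centers the data with mean_ and projects it onto components_. A quick check of this, assuming whiten=False (the default):

import numpy as np
# Manual projection: subtract the mean, then dot with the principal axes.
print(np.allclose(X_pca, np.dot(X - pca.mean_, pca.components_.T)))  # True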
In [21]:
fig, ax = plt.subplots(1, 3)
ax[0].matshow(pca.mean_.reshape(8, 8), cmap=plt.cm.Greys)
ax[1].matshow(pca.components_[0, :].reshape(8, 8), cmap=plt.cm.Greys)
ax[2].matshow(pca.components_[1, :].reshape(8, 8), cmap=plt.cm.Greys)
Out[21]:
<matplotlib.image.AxesImage at 0x7fedf2df8450>

Now let's try the effect of another dimensionality reduction method, Isomap.

In [23]:
from sklearn.manifold import Isomap
In [24]:
isomap = Isomap(n_components=2, n_neighbors=20)
In [25]:
isomap.fit(X)
Out[25]:
Isomap(eigen_solver='auto', max_iter=None, n_components=2, n_jobs=1,
    n_neighbors=20, neighbors_algorithm='auto', path_method='auto', tol=0)
In [26]:
X_isomap = isomap.transform(X)
X_isomap.shape
Out[26]:
(1797, 2)
In [27]:
plt.scatter(X_isomap[:, 0], X_isomap[:, 1], c=y)
Out[27]:
<matplotlib.collections.PathCollection at 0x7fedf01c3a10>

4 Classification

To evaluate model performance, we first split the dataset into a training set and a test set. In Sklearn this is easy to do; by default, train_test_split holds out 25% of the samples as the test set.

In [30]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
In [32]:
print("X_train shape: %s" % repr(X_train.shape))
print("y_train shape: %s" % repr(y_train.shape))
print("X_test shape: %s" % repr(X_test.shape))
print("y_test shape: %s" % repr(y_test.shape))
X_train shape: (1347, 64)
y_train shape: (1347,)
X_test shape: (450, 64)
y_test shape: (450,)

Linear Support Vector Machine

In [33]:
from sklearn.svm import LinearSVC

Initialize the model.

In [34]:
svm = LinearSVC()

Train the linear model on the labeled training samples.

In [35]:
svm.fit(X_train, y_train)
Out[35]:
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

Apply the model: for classification models, predictions are made by calling the predict method.

In [36]:
svm.predict(X_train)
Out[36]:
array([2, 8, 9, ..., 7, 7, 8])

Evaluate the model's performance.

In [37]:
svm.score(X_train, y_train)
Out[37]:
0.99331848552338531
In [38]:
svm.score(X_test, y_test)
Out[38]:
0.93111111111111111
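For classifiers, score returns the mean accuracy; the same number can be computed by hand via sklearn.metrics (a sketch):

from sklearn.metrics import accuracy_score
# Equivalent to svm.score(X_test, y_test) above.
print(accuracy_score(y_test, svm.predict(X_test)))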

Random Forest

In [39]:
from sklearn.ensemble import RandomForestClassifier

A random forest builds many randomized decision trees and predicts by averaging their outputs (see the check following the scores below).

In [40]:
rf = RandomForestClassifier()
In [41]:
rf.fit(X_train, y_train)
Out[41]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
In [42]:
rf.score(X_train, y_train)
Out[42]:
1.0
In [43]:
rf.score(X_test, y_test)
Out[43]:
0.94666666666666666
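The averaging mentioned above can be checked directly: the forest's predict_proba should match (up to floating point) the mean of the individual trees' class-probability estimates. A sketch:

import numpy as np
# Average the per-tree probability estimates by hand and compare.
tree_probs = np.mean([t.predict_proba(X_test) for t in rf.estimators_], axis=0)
print(np.allclose(tree_probs, rf.predict_proba(X_test)))  # True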

On our data, the random forest classifies better than the simple linear support vector machine.

5 Model Selection and Evaluation

Evaluate model performance with cross-validation.

In [46]:
import numpy as np
from sklearn.model_selection import cross_val_score
scores = cross_val_score(rf, X_train, y_train, cv=5)
print("scores: %s  mean: %f  std: %f" % (str(scores), np.mean(scores), np.std(scores)))
scores: [ 0.93065693  0.93357934  0.94464945  0.94007491  0.96212121]  mean: 0.942216  std: 0.011090
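Under the hood, cross_val_score clones the estimator and fits one copy per fold; for classifiers the default splitting strategy is stratified. A minimal sketch of the equivalent manual loop (scores will differ slightly because the forest is randomized):

from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in cv.split(X_train, y_train):
    model = clone(rf)  # a fresh, unfitted copy with the same hyperparameters
    model.fit(X_train[train_idx], y_train[train_idx])
    manual_scores.append(model.score(X_train[val_idx], y_train[val_idx]))
print(manual_scores)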

Improve the model by increasing the number of decision trees.

In [47]:
rf2 = RandomForestClassifier(n_estimators=50)
scores = cross_val_score(rf2, X_train, y_train, cv=5)
print("scores: %s  mean: %f  std: %f" % (str(scores), np.mean(scores), np.std(scores)))
scores: [ 0.96715328  0.97416974  0.9704797   0.97378277  0.96590909]  mean: 0.970299  std: 0.003356

Important model hyperparameters can be tuned with a grid search.

In [49]:
from sklearn.model_selection import GridSearchCV

For the support vector machine, the only important parameter is C. Here we search powers of ten from 0.001 to 1000.

In [51]:
param_grid = {'C': 10. ** np.arange(-3, 4)}
grid_search = GridSearchCV(svm, param_grid=param_grid, cv=3, verbose=3, return_train_score=True)
In [52]:
grid_search.fit(X_train, y_train)
Fitting 3 folds for each of 7 candidates, totalling 21 fits
[CV] C=0.001 .........................................................
[CV] .................... C=0.001, score=0.951327433628, total=   0.0s
[CV] C=0.001 .........................................................
[CV] .................... C=0.001, score=0.966592427617, total=   0.0s
[CV] C=0.001 .........................................................
[CV] .................... C=0.001, score=0.970852017937, total=   0.0s
[CV] C=0.01 ..........................................................
[CV] ..................... C=0.01, score=0.951327433628, total=   0.1s
[CV] C=0.01 ..........................................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s
[CV] ..................... C=0.01, score=0.957683741648, total=   0.1s
[CV] C=0.01 ..........................................................
[CV] ..................... C=0.01, score=0.959641255605, total=   0.1s
[CV] C=0.1 ...........................................................
[CV] ...................... C=0.1, score=0.953539823009, total=   0.1s
[CV] C=0.1 ...........................................................
[CV] ...................... C=0.1, score=0.948775055679, total=   0.1s
[CV] C=0.1 ...........................................................
[CV] ...................... C=0.1, score=0.955156950673, total=   0.1s
[CV] C=1.0 ...........................................................
[CV] ...................... C=1.0, score=0.944690265487, total=   0.1s
[CV] C=1.0 ...........................................................
[CV] ...................... C=1.0, score=0.942093541203, total=   0.1s
[CV] C=1.0 ...........................................................
[CV] ...................... C=1.0, score=0.934977578475, total=   0.1s
[CV] C=10.0 ..........................................................
[CV] ..................... C=10.0, score=0.944690265487, total=   0.1s
[CV] C=10.0 ..........................................................
[CV] ..................... C=10.0, score=0.944320712695, total=   0.1s
[CV] C=10.0 ..........................................................
[CV] ..................... C=10.0, score=0.941704035874, total=   0.1s
[CV] C=100.0 .........................................................
[CV] .................... C=100.0, score=0.944690265487, total=   0.1s
[CV] C=100.0 .........................................................
[CV] .................... C=100.0, score=0.937639198218, total=   0.1s
[CV] C=100.0 .........................................................
[CV] .................... C=100.0, score=0.923766816143, total=   0.1s
[CV] C=1000.0 ........................................................
[CV] ................... C=1000.0, score=0.944690265487, total=   0.1s
[CV] C=1000.0 ........................................................
[CV] ................... C=1000.0, score=0.944320712695, total=   0.1s
[CV] C=1000.0 ........................................................
[CV] ................... C=1000.0, score=0.934977578475, total=   0.1s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    1.5s finished
Out[52]:
GridSearchCV(cv=3, error_score='raise',
       estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': array([  1.00000e-03,   1.00000e-02,   1.00000e-01,   1.00000e+00,
         1.00000e+01,   1.00000e+02,   1.00000e+03])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=3)
In [53]:
print(grid_search.best_params_)
print(grid_search.best_score_)
{'C': 0.001}
0.96288047513
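Because refit=True, grid_search has already been refit on the full training set using the best C, so it can be used directly as a model, for example:

# Evaluate the refit best model on the held-out test set.
print(grid_search.score(X_test, y_test))
print(grid_search.best_estimator_)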
In [63]:
plt.plot(grid_search.cv_results_['mean_test_score'], label="test accuracy")
plt.plot(grid_search.cv_results_['mean_train_score'], label="train accuracy")
plt.xticks(np.arange(7), param_grid['C'])
plt.xlabel("C")
plt.ylabel("Accuracy")
plt.legend(loc='best')
Out[63]:
<matplotlib.legend.Legend at 0x7fede9a28dd0>
In [61]:
grid_search.cv_results_['mean_test_score']
Out[61]:
array([ 0.96288048,  0.95619896,  0.95248701,  0.94060876,  0.94357832,
        0.93541203,  0.94135115])