A Digit Recognition Model Based on Principal Component Analysis and Support Vector Machines

    This article was written in a Jupyter Notebook and converted to a WeChat article mostly by hand, so some glitches are inevitable; feel free to leave a message in the official-account backend. The source file is on GitHub: https://github.com/czfshine/DMML/blob/master/learn/pca%2Bsvm.ipynb
    A Digit Recognition Model Based on Principal Component Analysis and Support Vector Machines
    Problem statement: given an image, recognize the digit in it.
    A simple classification problem.
    Loading the data
    Import the usual libraries.
    import os
    import random
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib import ticker
    from sklearn import linear_model, svm, preprocessing, decomposition, model_selection
    import seaborn as sns
    %matplotlib inline
    First, let's take a look at our training data.
    DATAPATH="D:/dataset/Digit/"
    print(os.listdir(DATAPATH))
    train_file=DATAPATH+"train.csv"
    train = pd.read_csv(train_file)
    print(train.shape)
    train
    ['sample_submission.csv', 'test.csv', 'train.csv']
    (42000, 785)
    [DataFrame preview: a label column followed by pixel0 … pixel783, almost all zeros at the shown positions]
    42000 rows × 785 columns
    As you can see, the data is a 42000 × 785 matrix. Each row is one training sample: the first column is the digit it represents, and the following 784 (28 × 28) columns are the grayscale values (0-255) of the individual pixels.
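    As a quick sanity check (this snippet is an addition, not part of the original notebook), we can confirm that the ten digit classes are roughly balanced:
    # count how many training samples there are of each digit
    print(train['label'].value_counts().sort_index())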
    Let's look at a few of the images at random.
    j = 0
    for i in [0, 1, 2, 124, 1234, 42, 233, 8888, 9999]:
        j += 1
        plt.subplot(3, 3, j)
        # drop the label, reshape the 784 pixels to 28x28, scale to [0, 1]
        plt.imshow(np.array(train.iloc[i].values[1:]).reshape((28, 28)) / 255.0)
    plt.show()

    Data preprocessing
    Extracting the labels
    Y = train.loc[:,'label':'label'].values.flatten()
    Y[0:20]
    array([1, 0, 1, 4, 0, 0, 7, 3, 5, 3, 8, 9, 1, 3, 3, 1, 2, 0, 7, 5], dtype=int64)
    Normalization
    X = preprocessing.normalize(train.loc[:,'pixel0':].values)
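    For reference, preprocessing.normalize scales each row (each image) to unit L2 norm by default. A minimal standalone sketch of what that means (not part of the original notebook):
    import numpy as np
    from sklearn import preprocessing

    row = np.array([[3.0, 4.0]])          # a single two-pixel "image"
    print(preprocessing.normalize(row))   # [[0.6 0.8]] -- the row divided by its L2 norm, 5
    print(np.linalg.norm(preprocessing.normalize(row)))  # 1.0, unit norm per row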
    Split the dataset 8:2.
    (X, X_test, y, y_test) = model_selection.train_test_split(X, Y, test_size=0.2, random_state=0)
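    With test_size=0.2 the 42,000 samples are split into 33,600 for training and 8,400 for validation; a quick shape check (not in the original) confirms it:
    print(X.shape, X_test.shape)  # expected: (33600, 784) (8400, 784)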
    Building the model

    est = svm.SVC(C=1.61, gamma=1.66,cache_size=4096)
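    svm.SVC uses the RBF kernel by default: K(x, x') = exp(-gamma * ||x - x'||^2), where gamma controls how far a single sample's influence reaches and C trades margin width against training error. A minimal sketch of the kernel itself (an illustration, not the library's internals):
    import numpy as np

    def rbf_kernel(x1, x2, gamma):
        """Similarity of two samples under the RBF kernel used by svm.SVC."""
        return np.exp(-gamma * np.sum((x1 - x2) ** 2))

    # nearby samples score close to 1, distant ones close to 0
    print(rbf_kernel(np.zeros(784), np.full(784, 0.01), gamma=1.66))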
    Training the model
    TRAINCOUNT = 10000  # samples used for training; keep it small while experimenting, set it to the full training-set size for the real run
    #TRAINCOUNT = train.shape[0]
    est.fit(X[:TRAINCOUNT], y[:TRAINCOUNT])
    Validating the model
    est.score(X_test, y_test)
    0.9698809523809524
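    To see which digits get confused with which, one can look at the confusion matrix (an optional check, not in the original notebook):
    from sklearn import metrics

    # rows are the true digits, columns the predicted digits
    print(metrics.confusion_matrix(y_test, est.predict(X_test)))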
    The test set
    test_file=DATAPATH+"test.csv"
    test = pd.read_csv(test_file)
    T=preprocessing.normalize(test.loc[:,'pixel0':].values)
    Predicting on the test set
    ans = est.predict(T)
    ans
    array([2, 0, 9, ..., 3, 9, 2], dtype=int64)
    Formatting and writing the output file
    ids = np.reshape(np.arange(1, ans.shape[0] + 1), (-1, 1))
    answer_column = np.reshape(ans, (-1, 1))
    answer_matrix = np.append(ids, answer_column, axis=1)
    df = pd.DataFrame(answer_matrix, columns=["ImageId", "Label"])
    df.to_csv("simplesvm.csv", sep=",", index=False)
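    A quick way to double-check the submission file format (a hypothetical extra step, not in the original):
    # expect two columns, ImageId and Label, with ImageId starting at 1
    print(pd.read_csv("simplesvm.csv").head())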
    This scores 0.96714 on the leaderboard.
    Optimization
    Before optimizing, let's first identify the problems with the code above.
    First, we only trained on the first 10,000 samples. That was done to speed up training; you can train on all of the training data (including the validation split), and the final model will probably score a little better. Note, though, that according to the official scikit-learn documentation the fit time of the SVC implementation scales at least quadratically with the number of samples (roughly between O(n_samples² · n_features) and O(n_samples³ · n_features)), so training on the full set is expensive.
    Second, the SVM model was defined with two magic numbers, C=1.61 and gamma=1.66; perhaps changing them would improve the model's performance?
    Principal component analysis
    To speed up training we can also attack the number of features: we use the first 50 principal components as the new feature data (i.e., reduce the dimensionality from 784 down to 50).
    pca = decomposition.PCA(n_components=50, whiten=True)
    X=preprocessing.normalize(train.loc[:,'pixel0':].values)
    X=pca.fit(X).transform(X)
    pd.DataFrame(X[:3])  # show the first three rows after PCA
    [DataFrame preview of the transformed feature values]
    3 rows × 50 columns
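    How much of the original variance do the 50 retained components keep? pca.explained_variance_ratio_ answers that (a check that was not run in the original notebook, so the exact number is not recorded here):
    # fraction of total variance captured by the 50 retained components
    print(pca.explained_variance_ratio_.sum())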
    Re-split the dataset 8:2.
    (X, X_test, y, y_test) = model_selection.train_test_split(X, Y, test_size=0.2, random_state=0)
    Train and validate once more.
    est = svm.SVC(C=1.61, gamma=1.66,cache_size=4096)
    est.fit(X[:TRAINCOUNT], y[:TRAINCOUNT])
    est.score(X_test, y_test)
    0.1144047619047619
    As you can see, the result is not good. But if we change gamma and C to 0.00774 and 4.17531:
    est = svm.SVC(C=4.17531, gamma=0.00774,cache_size=4096)
    est.fit(X[:TRAINCOUNT], y[:TRAINCOUNT])
    est.score(X_test, y_test)
    0.9657142857142857
    As you can see, the choice of parameters has a huge impact on performance. So how do we find good parameters?
    The answer: try them one by one.
    First, define the search ranges.
    C_range = np.logspace(-6, 6, 30)
    gamma_range = np.logspace(-6, 1, 10)
    C_range,gamma_range
    (array([1.00000000e-06, 2.59294380e-06, 6.72335754e-06, 1.74332882e-05,
            4.52035366e-05, 1.17210230e-04, 3.03919538e-04, 7.88046282e-04,
            2.04335972e-03, 5.29831691e-03, 1.37382380e-02, 3.56224789e-02,
            9.23670857e-02, 2.39502662e-01, 6.21016942e-01, 1.61026203e+00,
            4.17531894e+00, 1.08263673e+01, 2.80721620e+01, 7.27895384e+01,
            1.88739182e+02, 4.89390092e+02, 1.26896100e+03, 3.29034456e+03,
            8.53167852e+03, 2.21221629e+04, 5.73615251e+04, 1.48735211e+05,
            3.85662042e+05, 1.00000000e+06]),
     array([1.00000000e-06, 5.99484250e-06, 3.59381366e-05, 2.15443469e-04,
            1.29154967e-03, 7.74263683e-03, 4.64158883e-02, 2.78255940e-01,
            1.66810054e+00, 1.00000000e+01]))
    Define the model whose parameters we will search over.
    svc = svm.SVC(kernel="rbf",cache_size=4096)
    The searcher: it combines C_range and gamma_range into a grid of candidate parameters, trains a model on the training data for each combination, evaluates it on the validation data, and finds the combination with the best validation result.
    clf = model_selection.GridSearchCV(estimator=svc,
        cv=model_selection.ShuffleSplit(n_splits=1, test_size=0.2, random_state=0),
        param_grid=dict(C=C_range, gamma=gamma_range), n_jobs=-1, verbose=1)
    The code below trains 30 × 10 = 300 models in total, so don't feed it too much data.
    SEARCHMAX=1000
    clf.fit(X[:SEARCHMAX], y[:SEARCHMAX])
    Let's look at the results.
    import matplotlib.cm as cm
    from matplotlib.colors import Normalize
    plt.imshow(clf.cv_results_["mean_test_score"].reshape(30,10),
    cmap=cm.hot,norm=Normalize())

    The x-axis is the gamma parameter and the y-axis is the C parameter; the top-left corner is the combination of the two smallest values.
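    To make the heat map self-explanatory, one could label the axes, a hypothetical embellishment of the plot above:
    plt.imshow(clf.cv_results_["mean_test_score"].reshape(30, 10),
               cmap=cm.hot, norm=Normalize())
    plt.xlabel("gamma index (into gamma_range)")
    plt.ylabel("C index (into C_range)")
    plt.colorbar(label="mean validation accuracy")
    plt.show()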
    Let's see what the best parameter combination looks like.
    clf.best_params_
    {'C': 4.175318936560401, 'gamma': 0.007742636826811269}
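    GridSearchCV also records the validation score achieved by that best combination (an extra lookup, not shown in the original run):
    print(clf.best_score_)  # mean validation accuracy of the best (C, gamma) pair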
    Exactly the numbers we used above :) We can run a second, finer search, but first we have to choose the new parameter ranges.
    First, look at how the validation accuracy behaves as a function of C at gamma=0.007742636826811269, and as a function of gamma at C=4.175318936560401.
    plt.plot(clf.cv_results_["mean_test_score"].reshape(30, 10)[:, 5])   # accuracy vs C at gamma = gamma_range[5]
    plt.show()
    plt.plot(clf.cv_results_["mean_test_score"].reshape(30, 10)[16, :])  # accuracy vs gamma at C = C_range[16]


    You can see that the accuracy starts climbing after C[12] and the curve is roughly tanh-shaped, so next time we start from C[12] and use a linear range. For gamma, the peak lies between the 4th and 8th values, so we keep a log scale.
    The new parameter lists are as follows.
    C_range2 = np.linspace(C_range[12], C_range[17], 20, endpoint=True)
    # note: np.log is the natural logarithm, so logspace receives natural-log exponents here
    # (np.log10 would reproduce the original spacing); the outputs below reflect this code as run
    gamma_range2 = np.logspace(np.log(gamma_range[5]), np.log(gamma_range[8]), 20)
    (C_range2, gamma_range2)
    (array([ 0.09236709,  0.65731447,  1.22226185,  1.78720923,  2.35215661,
             2.91710399,  3.48205138,  4.04699876,  4.61194614,  5.17689352,
             5.7418409 ,  6.30678828,  6.87173567,  7.43668305,  8.00163043,
             8.56657781,  9.13152519,  9.69647258, 10.26141996, 10.82636734]),
     array([1.37716833e-05, 2.64095277e-05, 5.06447279e-05, 9.71198157e-05,
            1.86243643e-04, 3.57153627e-04, 6.84902377e-04, 1.31341594e-03,
            2.51869679e-03, 4.83002632e-03, 9.26239089e-03, 1.77621982e-02,
            3.40620138e-02, 6.53196620e-02, 1.25261479e-01, 2.40210034e-01,
            4.60643293e-01, 8.83361283e-01, 1.69399439e+00, 3.24852023e+00]))
    Train again.
    svc = svm.SVC(kernel="rbf",cache_size=4096)
    clf = model_selection.GridSearchCV(estimator=svc,
        cv=model_selection.ShuffleSplit(n_splits=1, test_size=0.2, random_state=0),
        param_grid=dict(C=C_range2, gamma=gamma_range2), n_jobs=-1, verbose=1)
    SEARCHMAX=1000
    clf.fit(X[:SEARCHMAX], y[:SEARCHMAX])
    plt.imshow(clf.cv_results_["mean_test_score"].reshape(20,20),
    cmap=cm.hot,norm=Normalize())

    The heat map is now roughly centred, so our parameter ranges are reasonable. Let's look at this round's best parameters.
    clf.best_params_
    {'C': 1.222261849194718, 'gamma': 0.01776219824452757}
    Both numbers changed a lot from the previous round. But C does not affect performance much here, and gamma is spaced exponentially, so on a log scale the difference is small.
    Train the model with the new parameters and validate again.
    est = svm.SVC(C=1.7872092309327074, gamma=0.01776219824452757,cache_size=4096)
    est.fit(X[:TRAINCOUNT], y[:TRAINCOUNT])
    est.score(X_test, y_test)
    0.9696428571428571
    That's a 0.4% improvement, hmm...
    The parameter search above used 1,000 samples and took 50-odd seconds; now let's try a different amount of data.
    svc = svm.SVC(kernel="rbf",cache_size=4096)
    clf = model_selection.GridSearchCV(estimator=svc, cv=model_selection.ShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    , param_grid=dict(C=C_range2, gamma=gamma_range2), n_jobs=-1,verbose=1)
    SEARCHMAX=2000
    clf.fit(X[:SEARCHMAX], y[:SEARCHMAX])
    plt.imshow(clf.cv_results_["mean_test_score"].reshape(20,20),
    cmap=cm.hot,norm=Normalize())
    clf.best_params_
    {'C': 4.611946139622656, 'gamma': 0.009262390888788635}


    The optimal parameters have changed.
    Train once more, then submit and see.
    est = svm.SVC(C=4.0469987578846665, gamma=0.009262390888788635,cache_size=4096,verbose=True)
    est.fit(X, y)
    est.score(X_test, y_test)
    0.9788095238095238
    This time, when loading the test data we must apply the same preprocessing as to the training data, otherwise the predictions will be wrong.
    test_file=DATAPATH+"test.csv"
    test = pd.read_csv(test_file)
    T=preprocessing.normalize(test.values)
    T=pca.transform(T)
    ans=est.predict(T)
    ans
    array([2, 0, 9, ..., 3, 9, 2], dtype=int64)
    ids = np.reshape(np.arange(1, ans.shape[0] + 1), (-1, 1))
    answer_column = np.reshape(ans, (-1, 1))
    answer_matrix = np.append(ids, answer_column, axis=1)
    df = pd.DataFrame(answer_matrix, columns=["ImageId", "Label"])
    df.to_csv("pcasvm.csv", sep=",", index=False)
    Leaderboard score: 0.97914.