Losing Money on Stocks All the Way: Can AI Help Us Predict Stock Prices?

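"AI Stock Price Prediction in Ten Minutes" is a deep learning practice project. It takes several years of a stock's candlestick (K-line) price history, together with sentiment analysis of news coverage about the company, as its dataset, trains a machine learning model on that data, and then uses the model to predict the stock price. The project tries several algorithms (linear regression, a neural network, and a random forest) and compares their results. Running the project requires basic Python programming skills; understanding the code requires some background in machine learning.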

How Individuals Invest in the Stock Market

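Before writing an AI program, we need to look at how people decide where to invest. Readers who have traded stocks will pick this up faster. The goal of investing is to make a profit, so before deciding which stock to buy we look up information about the company, search for recent and older news about it, browse stock-trading forums, and check what is being said about it on Weibo. If the company's prospects look bright (lots of positive coverage), the return on investing in its stock may be higher.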

A stock's candlestick (K-line) chart

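Investing in stocks also requires reading various kinds of data, such as candlestick charts. Sometimes a stock has been falling steadily and then starts to show upward momentum; that may be the best time to buy, because the stock is quite likely to rebound from its bottom. From this analysis, it is clear which data we need to train such a machine learning model:
1. Stock price data
2. Sentiment data about the stock (company)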

Fetching Historical Data and Basic Processing

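Data is essential to machine learning. Without suitable data we cannot train a model, and so cannot make predictions with it. This project needs two kinds of data: (1) stock price data and (2) sentiment data. We analyze the stock price data with Pandas and process the sentiment data with NLTK (Natural Language Toolkit).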

For an introduction to Pandas, see my earlier tutorial: "Machine Learning from Scratch 8: Learn Pandas in Five Minutes".

First, we import the required Python packages.

import numpy as np
import pandas as pd
import unicodedata
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

from treeinterpreter import treeinterpreter as ti
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

Next we read the historical stock price data, process it, and produce a Pandas DataFrame.

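# Load the pickled DataFrame of daily prices and article text.
df_stocks = pd.read_pickle('data/pickled_ten_year_filtered_data.pkl')
# Convert the adjusted close to an integer price (the method is apply, lowercase).
df_stocks['prices'] = df_stocks['adj close'].apply(np.int64)
df_stocks = df_stocks[['prices', 'articles']]
# Strip leading '.' and '-' characters from the article text.
df_stocks['articles'] = df_stocks['articles'].map(lambda x: x.lstrip('.-'))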

Note: the data here is a Python object that has been serialized (pickled) to a file. Use

print(df_stocks)

to inspect the df_stocks DataFrame. The output is as follows:

            prices                                           articles
2007-01-01   12469   What Sticks from '06. Somalia Orders Islamist...
2007-01-02   12472   Heart Health: Vitamin Does Not Prevent Death ...
2007-01-03   12474   google Answer to Filling Jobs Is an Algorithm...
2007-01-04   12480   Helping Make the Shift From Combat to Commerc...
2007-01-05   12398   Rise in Ethanol Raises Concerns About Corn as...
2007-01-06   12406   A Status Quo Secretary General. Best Buy and ...
2007-01-07   12414   THE COMMON APPLICATION; Typo.com. Jumbo Bonus...
...            ...              ...
2016-12-31   19762  Terrorist Attack at Nightclub in Istanbul Kill...

[3653 rows x 2 columns]

Process finished with exit code 0

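As you can see, we have successfully loaded the stock prices and the text of the related articles. The next step is to analyze the sentiment data together with the price data. First we copy the prices Series out of df_stocks into a separate DataFrame, because we want to analyze the price data without modifying the original DataFrame. After that we add several new Series, which will be filled in by running NLTK sentiment analysis on the articles.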

df = df_stocks[['prices']].copy()

df["compound"] = ''  # composite score
df["neg"] = ''  # negative score
df["neu"] = ''  # neutral score
df["pos"] = ''  # positive score

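We use NLTK's sentiment intensity analyzer to score each day's articles and write the scores into the new DataFrame df. The neg Series holds the negative score of the news, neu the neutral score, pos the positive score, and compound the composite score (neg, neu, and pos combined).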

nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()
# Iterate over the dates directly; DataFrame.iteritems() was removed in pandas 2.0.
for date in df_stocks.index:
    try:
        # Normalize Unicode before scoring the day's articles.
        sentence = unicodedata.normalize('NFKD', df_stocks.loc[date, 'articles'])
        ss = sid.polarity_scores(sentence)
        df.at[date, 'compound'] = ss['compound']
        df.at[date, 'neg'] = ss['neg']
        df.at[date, 'neu'] = ss['neu']
        df.at[date, 'pos'] = ss['pos']
    except TypeError:
        # The article text is not a string: print the offending row and move on.
        print(df_stocks.loc[date, 'articles'])
        print(date)

print(df)

The output is as follows:

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:...nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
            prices compound    neg    neu    pos
2007-01-01   12469  -0.9814  0.159  0.749  0.093
2007-01-02   12472  -0.8179  0.114  0.787  0.099
2007-01-03   12474  -0.9993  0.198  0.737  0.065
...          ...          ...     ...     ...      ...
2016-12-28   19833   0.2869  0.128  0.763  0.108
2016-12-29   19819  -0.9789  0.138  0.764  0.097
2016-12-30   19762   -0.995  0.168  0.734  0.098
2016-12-31   19762  -0.2869  0.173  0.665  0.161

[3653 rows x 5 columns]

Process finished with exit code 0

With the output above, we have successfully obtained sentiment scores for the historical articles.
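To make the compound column easier to interpret, here is a minimal sketch of how, to my understanding, VADER squashes the summed word valences of a text into the range (-1, 1). The function name `normalize` and the constant `alpha=15` follow what I believe the VADER implementation uses; treat both as assumptions rather than as the library's documented API.

```python
import math


def normalize(score, alpha=15.0):
    """Map an unbounded valence sum into (-1, 1).

    Assumption: this mirrors the normalization VADER applies to obtain
    the `compound` score; alpha=15 is assumed to be its default.
    """
    return score / math.sqrt(score * score + alpha)


# Larger absolute valence sums approach -1 or 1 but never reach them.
for total in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(total, round(normalize(total), 4))
```

This is why a long article with many negative words, like most rows in the output above, ends up with a compound score close to -1.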

Splitting the Dataset

From the output above, the data starts on January 1, 2007 and ends on December 31, 2016. We split it into a training set and a test set at a ratio of 8:2.

train_start_date = '2007-01-01'
train_end_date = '2014-12-31'
test_start_date = '2015-01-01'
test_end_date = '2016-12-31'
# DataFrame.ix was removed in pandas 1.0; .loc does the same label-based slicing.
train = df.loc[train_start_date:train_end_date]
test = df.loc[test_start_date:test_end_date]
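As a quick check on the 8:2 ratio, the sketch below derives the split dates from the date range alone, using only the standard library. The helper name `chronological_split` is hypothetical and not part of the project.

```python
from datetime import date, timedelta


def chronological_split(start, end, train_fraction=0.8):
    """Return (train_end, test_start, n_train, n_test) for a daily date range."""
    n_days = (end - start).days + 1           # inclusive number of days
    n_train = int(n_days * train_fraction)    # earliest days go to training
    train_end = start + timedelta(days=n_train - 1)
    test_start = train_end + timedelta(days=1)
    return train_end, test_start, n_train, n_days - n_train


train_end, test_start, n_train, n_test = chronological_split(
    date(2007, 1, 1), date(2016, 12, 31))
print(train_end, test_start, n_train, n_test)
```

For this date range, an 80% cut falls exactly on the year boundary between 2014 and 2015, which matches the dates hard-coded above. Splitting by time rather than at random keeps future data out of the training set.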

After splitting df, we build a list of sentiment scores for each date and fill it with the training-set and test-set data.

# Feature matrix for the training set: one [neg, pos] row per date.
sentiment_score_list = []
for date in train.index:
    sentiment_score = np.asarray([df.loc[date, 'neg'], df.loc[date, 'pos']])
    sentiment_score_list.append(sentiment_score)
numpy_df_train = np.asarray(sentiment_score_list)

# Feature matrix for the test set, built the same way.
sentiment_score_list = []
for date in test.index:
    sentiment_score = np.asarray([df.loc[date, 'neg'], df.loc[date, 'pos']])
    sentiment_score_list.append(sentiment_score)
numpy_df_test = np.asarray(sentiment_score_list)
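The per-date loops can also be written as a single column selection. The sketch below shows the idea on a toy DataFrame; the first three rows reuse scores from the output earlier in this article, while the fourth row and the name `demo` are made up for illustration. The sentiment columns are cast to object dtype to mimic columns that were initialized with empty strings.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: a daily index with prices and two sentiment columns.
idx = pd.date_range("2007-01-01", periods=4)
demo = pd.DataFrame(
    {
        "prices": [12469, 12472, 12474, 12480],
        "neg": [0.159, 0.114, 0.198, 0.100],
        "pos": [0.093, 0.099, 0.065, 0.200],
    },
    index=idx,
).astype({"neg": object, "pos": object})

# Label-based slicing is inclusive on both ends; to_numpy(dtype=float)
# converts the object columns into a plain float matrix.
features = demo.loc["2007-01-01":"2007-01-03", ["neg", "pos"]].to_numpy(dtype=float)
print(features.shape)
print(features)
```

The result has one row per date and one column per sentiment score, the same layout the loops produce.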

Since the target our program predicts is the stock price, the y labels are the prices.

y_train = pd.DataFrame(train['prices'])
y_test = pd.DataFrame(test['prices'])

Predicting the Stock Price with a Random Forest

We use the random forest implementation that ships with scikit-learn to predict the stock price.

rf = RandomForestRegressor()
rf.fit(numpy_df_train, y_train)

#print(rf.feature_importances_)
prediction, bias, contributions = ti.predict(rf, numpy_df_test)
print(prediction)

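Once output appears on the console and looks correct, the random forest prediction has run successfully. To see more directly how far the predictions deviate from reality, we plot them with Matplotlib.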

# Matplotlib
idx = pd.date_range(test_start_date, test_end_date)
predictions_df = pd.DataFrame(data=prediction[0:731], index=idx, columns=['prices'])
print(predictions_df)
predictions_plot = predictions_df.plot()

fig = y_test.plot(ax=predictions_plot).get_figure()

# The column is named 'prices', so that is the key to rename for the legend.
ax = predictions_df.rename(columns={"prices": "Predicted Price"}).plot(title='Random Forest Predict Stock Price')
ax.set_xlabel("Date")
ax.set_ylabel("Price")
fig = y_test.rename(columns={"prices": "Actual Price"}).plot(ax=ax).get_figure()
fig.savefig("RF_noSmoothing.png")

The code above plots the unsmoothed random forest predictions of the stock price and saves the figure as "RF_noSmoothing.png".
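Besides a plot, the deviation between predicted and actual prices can be summarized as a number. The sketch below is not part of the original project: it computes the mean absolute error and root mean squared error in plain Python, and the function names and price values are invented for illustration.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Invented example values, not taken from the dataset.
actual = [100, 102, 101, 105]
predicted = [98, 103, 104, 101]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(mae, rmse)
```

Both measures are in the same units as the price, so they can be compared directly across the three models.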

Visualization of the prediction results

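In the figure above, the blue line is the predicted price and the orange line is the actual price. The prediction clearly deviates a great deal from reality, so the data needs further processing: we add a constant to the predicted prices so that they line up with the closing prices at the start of the test period.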

temp_date = test_start_date
average_last_5_days_test = 0
total_days = 10  # despite the "5_days" variable names, the averages cover 10 days
for i in range(total_days):
    average_last_5_days_test += test.loc[temp_date, 'prices']
    temp_date = datetime.strptime(temp_date, "%Y-%m-%d").date()
    difference = temp_date + timedelta(days=1)
    temp_date = difference.strftime('%Y-%m-%d')
average_last_5_days_test = average_last_5_days_test / total_days
print(average_last_5_days_test)

temp_date = test_start_date
average_upcoming_5_days_predicted = 0
for i in range(total_days):
    average_upcoming_5_days_predicted += predictions_df.loc[temp_date, 'prices']
    temp_date = datetime.strptime(temp_date, "%Y-%m-%d").date()
    difference = temp_date + timedelta(days=1)
    temp_date = difference.strftime('%Y-%m-%d')
    print(temp_date)
average_upcoming_5_days_predicted = average_upcoming_5_days_predicted / total_days
print(average_upcoming_5_days_predicted)
difference_test_predicted_prices = average_last_5_days_test - average_upcoming_5_days_predicted
print(difference_test_predicted_prices)

predictions_df['prices'] = predictions_df['prices'] + difference_test_predicted_prices
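Later sections call a helper named offset_value(test_start_date, test, predictions_df_list) whose definition is not shown in this article. The sketch below is one possible implementation, reconstructed from the inline code above; the `total_days` parameter and the use of a DatetimeIndex are assumptions.

```python
import pandas as pd


def offset_value(test_start_date, test, predictions_df, total_days=10):
    """Mean actual price minus mean predicted price over the first days.

    Assumes both DataFrames have a daily DatetimeIndex and a 'prices' column.
    """
    start = pd.Timestamp(test_start_date)
    end = start + pd.Timedelta(days=total_days - 1)
    actual_mean = test.loc[start:end, "prices"].mean()
    predicted_mean = predictions_df.loc[start:end, "prices"].mean()
    return actual_mean - predicted_mean


# Toy data: actual prices rise from 100 to 111, predictions are flat at 90.
idx = pd.date_range("2015-01-01", periods=12)
toy_test = pd.DataFrame({"prices": [float(p) for p in range(100, 112)]}, index=idx)
toy_pred = pd.DataFrame({"prices": [90.0] * 12}, index=idx)

offset = offset_value("2015-01-01", toy_test, toy_pred)
aligned = toy_pred["prices"] + offset
print(offset, aligned.iloc[0])
```

Note that the shift is computed from actual test-period prices, so the aligned curve is not a pure out-of-sample forecast.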

We plot the corrected predictions with Matplotlib again.

# RF plot aligned
ax = predictions_df.rename(columns={"prices": "predicted_price"}).plot(title='Random Forest Predict Stock Price Aligned')
ax.set_xlabel("Dates")
ax.set_ylabel("Stock Prices")
fig = y_test.rename(columns={"prices": "actual_price"}).plot(ax = ax).get_figure()
fig.savefig("RF_aligned.png")

Corrected prediction line versus the actual line

After correcting the predictions, the predicted line begins to move toward the actual line, but it jitters up and down too much and needs smoothing. For smoothing we use Pandas' EWMA (Exponentially Weighted Moving Average) method.

# Pandas EWMA
# The older pd.ewma(...) function and the freq argument of ewm() are no longer
# available in current pandas, so only span-based smoothing is used here.
predictions_df['ewm'] = predictions_df["prices"].ewm(
    span=60, min_periods=0, adjust=True, ignore_na=False).mean()

predictions_df['actual_value'] = test['prices']
predictions_df['actual_value_ewm'] = predictions_df["actual_value"].ewm(
    span=60, min_periods=0, adjust=True, ignore_na=False).mean()
predictions_df.columns = ['predicted_price', 'average_predicted_price', 'actual_price', 'average_actual_price']
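To show what the smoothing computes, here is a minimal pure-Python sketch of an exponentially weighted mean with adjust=True semantics, where the smoothing factor is alpha = 2 / (span + 1). The function name `ewm_mean` is hypothetical; it is meant to illustrate the formula, not to replace the Pandas call.

```python
def ewm_mean(values, span):
    """Exponentially weighted mean with weights (1 - alpha)**i, normalized.

    At step t the result is
        sum_i (1 - alpha)**i * x[t - i] / sum_i (1 - alpha)**i,  i = 0..t.
    """
    alpha = 2.0 / (span + 1.0)
    numerator = 0.0
    denominator = 0.0
    result = []
    for x in values:
        # Older observations decay by (1 - alpha) at every step.
        numerator = x + (1.0 - alpha) * numerator
        denominator = 1.0 + (1.0 - alpha) * denominator
        result.append(numerator / denominator)
    return result


smoothed = ewm_mean([1.0, 2.0, 3.0], span=3)
print(smoothed)
```

A larger span gives a smaller alpha, so older prices keep more weight and the curve becomes smoother but slower to react.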

We plot the results of the random forest prediction once more.

# RF smoothed
predictions_plot = predictions_df.plot(title='Random Forest Predict Stock Price Aligned and Smoothed')
predictions_plot.set_xlabel("Dates")
predictions_plot.set_ylabel("Stock Prices")
fig = predictions_plot.get_figure()
fig.savefig("RF_smoothed.png")

Stock trend predicted by the random forest

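As we can see, the random forest does not fit the stock's price curve well. In the figure above, the green and red lines are the actual prices, while the smoothed prediction in orange even runs opposite to the actual price toward the end. Let us plot only the smoothed actual and predicted lines.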

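# Plot only the smoothed actual and predicted price lines
# (column names must match the lowercase names assigned earlier).
predictions_df_average = predictions_df[['average_predicted_price', 'average_actual_price']]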
predictions_plot = predictions_df_average.plot(title='Random Forest Predict Stock Price Aligned and Smoothed')
predictions_plot.set_xlabel("Dates")
predictions_plot.set_ylabel("Prices")
fig = predictions_plot.get_figure()
fig.savefig("RF_smoothed_and_actual_price.png")

Predicted trend versus actual trend

Clearly the random forest does not predict as well as hoped. Next we try the most common model, linear regression.

Predicting the Stock Price with Linear Regression

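Linear regression is efficient. In my "Machine Learning from Scratch" series, the article "Machine Learning from Scratch 10: Basic Usage of TensorFlow" introduces TensorFlow using linear regression as its example. The process of predicting with the linear regression model is therefore not repeated here.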

def LR_prediction():
    years = [2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]
    prediction_list = []
    for year in years:
        # Split into training and test sets
        train_start_date = str(year) + '-01-01'
        train_end_date = str(year) + '-10-31'
        test_start_date = str(year) + '-11-01'
        test_end_date = str(year) + '-12-31'
        # DataFrame.ix was removed in pandas 1.0; use .loc instead.
        train = df.loc[train_start_date: train_end_date]
        test = df.loc[test_start_date:test_end_date]

        # Build the sentiment feature matrices
        # (DataFrame.iteritems() was removed in pandas 2.0, so iterate over the index).
        sentiment_score_list = []
        for date in train.index:
            sentiment_score = np.asarray(
                [df.loc[date, 'compound'], df.loc[date, 'neg'], df.loc[date, 'neu'], df.loc[date, 'pos']])
            sentiment_score_list.append(sentiment_score)
        numpy_df_train = np.asarray(sentiment_score_list)

        sentiment_score_list = []
        for date in test.index:
            sentiment_score = np.asarray(
                [df.loc[date, 'compound'], df.loc[date, 'neg'], df.loc[date, 'neu'], df.loc[date, 'pos']])
            sentiment_score_list.append(sentiment_score)
        numpy_df_test = np.asarray(sentiment_score_list)

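        # "Linear regression" model. Note: LogisticRegression is a classifier, so
        # each distinct integer price is treated as a class label here.
        lr = LogisticRegression()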
        lr.fit(numpy_df_train, train['prices'])

        prediction = lr.predict(numpy_df_test)
        prediction_list.append(prediction)
        idx = pd.date_range(test_start_date, test_end_date)
        predictions_df_list = pd.DataFrame(data=prediction[0:], index=idx, columns=['prices'])

        difference_test_predicted_prices = offset_value(test_start_date, test, predictions_df_list)
        # Align
        predictions_df_list['prices'] = predictions_df_list['prices'] + difference_test_predicted_prices

        # Smooth (the freq argument of ewm() is not available in current pandas)
        predictions_df_list['ewm'] = predictions_df_list["prices"].ewm(span=10).mean()
        predictions_df_list['actual_value'] = test['prices']
        predictions_df_list['actual_value_ewma'] = predictions_df_list["actual_value"].ewm(span=10).mean()
        # Rename the Series
        predictions_df_list.columns = ['predicted_price', 'average_predicted_price', 'actual_price',
                                       'average_actual_price']
        predictions_df_list.plot()
        predictions_df_list_average = predictions_df_list[['average_predicted_price', 'average_actual_price']]
        predictions_df_list_average.plot()

        # Plot only the smoothed actual and predicted price lines
        predictions_plot = predictions_df_list_average.plot(title='Linear Regression Predict Stock Price Aligned and Smoothed')
        predictions_plot.set_xlabel("Dates")
        predictions_plot.set_ylabel("Prices")
        fig = predictions_plot.get_figure()
        fig.savefig("LR_smoothed_and_actual_price.png")

        plt.show()

Linear regression prediction results

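Looking at all of the output figures (the long time span is predicted and plotted in segments), the linear regression predictions are even somewhat better than the random forest's, but they still offer little practical guidance.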
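The function above is described as linear regression, although the model it instantiates is scikit-learn's LogisticRegression, which is a classifier. For comparison, here is a minimal sketch of ordinary least squares, the fit that linear regression usually refers to, using NumPy on synthetic data. The data, coefficients, and variable names are invented for illustration and this is not the author's code.

```python
import numpy as np

# Synthetic features (two columns) and a target that is exactly linear in them.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(50, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 5.0

# Append a column of ones so the last coefficient acts as the intercept.
X_design = np.column_stack([X, np.ones(len(X))])

# Solve min ||X_design @ coef - y||^2 in the least-squares sense.
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(coef)

# Predict a continuous value for a new feature row.
new_row = np.array([0.5, 0.5, 1.0])
print(new_row @ coef)
```

Unlike a classifier, this produces continuous outputs, so predictions are not restricted to price values that appeared in the training set.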

Predicting the Stock Price with a Neural Network

Neural networks are covered in my "Machine Learning from Scratch" series. Below is the code that predicts the stock price with scikit-learn's MLP (multilayer perceptron):

def MLP_prediction():
    years = [2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]
    prediction_list = []
    for year in years:
        # Split into training and test sets
        train_start_date = str(year) + '-01-01'
        train_end_date = str(year) + '-10-31'
        test_start_date = str(year) + '-11-01'
        test_end_date = str(year) + '-12-31'
        # DataFrame.ix was removed in pandas 1.0; use .loc instead.
        train = df.loc[train_start_date: train_end_date]
        test = df.loc[test_start_date:test_end_date]

        # Build the sentiment feature matrices
        # (DataFrame.iteritems() was removed in pandas 2.0, so iterate over the index).
        sentiment_score_list = []
        for date in train.index:
            sentiment_score = np.asarray(
                [df.loc[date, 'compound'], df.loc[date, 'neg'], df.loc[date, 'neu'], df.loc[date, 'pos']])
            sentiment_score_list.append(sentiment_score)
        numpy_df_train = np.asarray(sentiment_score_list)

        sentiment_score_list = []
        for date in test.index:
            sentiment_score = np.asarray(
                [df.loc[date, 'compound'], df.loc[date, 'neg'], df.loc[date, 'neu'], df.loc[date, 'pos']])
            sentiment_score_list.append(sentiment_score)
        numpy_df_test = np.asarray(sentiment_score_list)

        # Create the MLP model (MLPClassifier treats each integer price as a class label)
        mlpc = MLPClassifier(hidden_layer_sizes=(100, 200, 100), activation='relu',
                             solver='lbfgs', alpha=0.005, learning_rate_init=0.001, shuffle=False)  # span = 20 # best 1
        mlpc.fit(numpy_df_train, train['prices'])
        prediction = mlpc.predict(numpy_df_test)

        prediction_list.append(prediction)
        idx = pd.date_range(test_start_date, test_end_date)
        predictions_df_list = pd.DataFrame(data=prediction[0:], index=idx, columns=['prices'])

        difference_test_predicted_prices = offset_value(test_start_date, test, predictions_df_list)
        predictions_df_list['prices'] = predictions_df_list['prices'] + difference_test_predicted_prices

        # Smooth (the freq argument of ewm() is not available in current pandas)
        predictions_df_list['ewma'] = predictions_df_list["prices"].ewm(span=20).mean()
        predictions_df_list['actual_value'] = test['prices']
        predictions_df_list['actual_value_ewma'] = predictions_df_list["actual_value"].ewm(span=20).mean()

        predictions_df_list.columns = ['predicted_price', 'average_predicted_price', 'actual_price',
                                       'average_actual_price']
        predictions_df_list.plot()
        predictions_df_list_average = predictions_df_list[['average_predicted_price', 'average_actual_price']]
        predictions_df_list_average.plot()

        plt.show()
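Both LR_prediction and MLP_prediction rebuild the same per-year windows: January through October for training, November and December for testing. As a small sketch, the date strings could come from one shared generator; the name `yearly_windows` is hypothetical and not part of the project.

```python
def yearly_windows(years):
    """Yield (train_start, train_end, test_start, test_end) date strings per year."""
    for year in years:
        yield (
            f"{year}-01-01",  # training starts on January 1
            f"{year}-10-31",  # training ends on October 31
            f"{year}-11-01",  # testing starts on November 1
            f"{year}-12-31",  # testing ends on December 31
        )


windows = list(yearly_windows(range(2007, 2017)))
for train_start, train_end, test_start, test_end in windows[:2]:
    print(train_start, train_end, test_start, test_end)
```

Each tuple can be unpacked at the top of the loop body in either function, which keeps the two functions using identical windows.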