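The idea of a chatbot is hardly unfamiliar. Perhaps you have flirted with Siri out of sheer boredom, or chatted away with Xiao Ai (小愛同學) in an idle moment; either way, we have to admit that artificial intelligence has worked its way deep into our lives. There is no shortage of bots on the market that offer third-party APIs: Microsoft XiaoIce, Turing Robot, Tencent Chat, Qingyunke and so on. If we want, we can plug one into an app or a web application at any time. But how are these applications actually implemented underneath? And without network access, could we do what the American series Westworld depicts, where a robot needs only a locally stored "mind sphere" to talk with humans? If being a mere "package caller" (someone who only calls other people's libraries) is not enough for you, join us on this trip: for the first time, we will use the deep learning libraries Keras/TensorFlow to build a local chatbot of our own that depends on no third-party API and no network.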
First, install the dependencies:
pip3 install Tensorflow
pip3 install Keras
pip3 install nltk
pip3 install pandas
Then create the script test_bot.py and import the libraries we need:
import nltk
import ssl
from nltk.stem.lancaster import LancasterStemmer
stemmer = LancasterStemmer()
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
import pandas as pd
import pickle
import random
There is a pitfall here: the natural language toolkit NLTK reports an error:
Resource punkt not found
Normally, adding one line that calls the downloader is enough:
import nltk
nltk.download('punkt')
However, because of network restrictions, it is often hard to fetch the file through the Python downloader, so we take a detour and download the archive manually:
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip
After unzipping, place the punkt folder under nltk_data\tokenizers in your user directory, for example:
C:\Users\liuyue\nltk_data\tokenizers\punkt
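The manual step above can also be scripted. The sketch below is only an illustration using the standard library: the helper name `install_punkt` is hypothetical, and it assumes the downloaded archive contains a top-level `punkt` folder, which NLTK looks for under `nltk_data/tokenizers`.

```python
import zipfile
from pathlib import Path

def install_punkt(zip_path, nltk_data_dir=None):
    """Unpack a manually downloaded punkt.zip into <nltk_data>/tokenizers.

    Hypothetical helper, not part of NLTK. Returns the tokenizers directory.
    """
    # Default to ~/nltk_data, one of the locations NLTK searches for resources.
    base = Path(nltk_data_dir) if nltk_data_dir else Path.home() / "nltk_data"
    target = base / "tokenizers"
    target.mkdir(parents=True, exist_ok=True)
    # The archive is assumed to hold a single top-level "punkt" folder.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

Calling `install_punkt("punkt.zip")` from the folder that holds the download would then leave the files in `~/nltk_data/tokenizers/punkt`.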
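OK, back to the main topic. The biggest challenge in building a chatbot is classifying what the user typed and recognizing the person's actual intent (this can be solved with classical machine learning, but that is too complicated and I got lazy, so I use deep learning with Keras). The second challenge is keeping context, that is, analyzing and tracking the conversation. For context handling we usually do not need to classify the user's intent; we can simply take the user's input as the answer to the chatbot's question. So here the Keras deep learning library is used only to build the classification model.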
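The chatbot's intents and the patterns it needs to learn are defined in one plain variable. No multi-terabyte corpus is required. We all know that anyone who plays with bots without a corpus in hand gets laughed at, but our goal is only to build one specific chatbot for one specific context. The classification model is therefore built on a small vocabulary, and it will recognize only the small set of patterns supplied for training.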
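Put plainly, machine learning means repeatedly teaching a machine to do one or a few things correctly. During training you keep demonstrating what the right behavior is, and you hope the machine learns to generalize from the examples. This time we are not teaching it many things, just one, to test how it responds. Isn't that a bit like training your pet dog at home? Except that the dog cannot chat with you.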
Here is a simple example of the intents variable; if you like, you can extend it as far as you want with a corpus:
intents = {"intents": [
{"tag": "打招呼",
"patterns": ["你好", "您好", "請問", "有人嗎", "師傅","不好意思","美女","帥哥","靚妹","hi"],
"responses": ["您好", "又是您啊", "吃了么您內","您有事嗎"],
"context": [""]
},
{"tag": "告別",
"patterns": ["再見", "拜拜", "88", "回見", "回頭見"],
"responses": ["再見", "一路順風", "下次見", "拜拜了您內"],
"context": [""]
},
]
}
As you can see, I have inserted two context tags, 打招呼 (greeting) and 告別 (farewell), each with user input patterns and bot responses.
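Everything downstream relies on each intent having a tag, patterns and responses, so a quick structural check can catch typos as the variable grows. A minimal sketch, assuming a hypothetical helper `validate_intents` and a shortened sample copy of the data:

```python
REQUIRED_KEYS = ("tag", "patterns", "responses")

def validate_intents(data):
    """Return a list of problems found in an intents structure (empty if none)."""
    problems = []
    seen_tags = set()
    for position, intent in enumerate(data.get("intents", [])):
        # Every intent needs a non-empty tag, pattern list and response list.
        for key in REQUIRED_KEYS:
            if not intent.get(key):
                problems.append(f"intent #{position}: missing or empty '{key}'")
        tag = intent.get("tag")
        if tag in seen_tags:
            problems.append(f"intent #{position}: duplicate tag '{tag}'")
        seen_tags.add(tag)
    return problems

# Shortened sample in the same shape as the variable above
sample = {"intents": [
    {"tag": "打招呼", "patterns": ["你好", "您好"], "responses": ["您好"], "context": [""]},
    {"tag": "告別", "patterns": ["再見", "拜拜"], "responses": ["再見"], "context": [""]},
]}
print(validate_intents(sample))   # []
```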
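Before training the classification model, we need to build the vocabulary from the patterns. Each word is stemmed to a common root, which helps match more variations of user input.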
words = []
classes = []
documents = []
ignore_words = ['?']  # assumption: this list is not defined in the original listing
for intent in intents['intents']:
    for pattern in intent['patterns']:
        # tokenize each word in the sentence
        w = nltk.word_tokenize(pattern)
        # add to our words list
        words.extend(w)
        # add to documents in our corpus
        documents.append((w, intent['tag']))
        # add to our classes list
        if intent['tag'] not in classes:
            classes.append(intent['tag'])
# stem and lowercase each word, drop ignored tokens, then deduplicate
words = [stemmer.stem(w.lower()) for w in words if w not in ignore_words]
words = sorted(list(set(words)))
classes = sorted(list(set(classes)))
print (len(classes), "語境", classes)
print (len(words), "詞數", words)
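Output (this run appears to predate the "hi" pattern in the listing above, so it reports 14 words rather than 15):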
2 語境 ['告別', '打招呼']
14 詞數 ['88', '不好意思', '你好', '再見', '回頭見', '回見', '帥哥', '師傅', '您好', '拜拜', '有人嗎', '美女', '請問', '靚妹']
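Training does not operate on the words themselves, because words mean nothing to a machine. This is also a trap many Chinese word-segmentation libraries fall into: the machine has no idea whether your input is English or Chinese. All we need is to turn each English word or Chinese term into a bag of words, an array of 0s and 1s. The array's length equals the vocabulary size, and a position is set to 1 when the vocabulary word at that position occurs in the current pattern.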
# create our training data
training = []
# create an empty array for our output
output_empty = [0] * len(classes)
# training set, bag of words for each sentence
for doc in documents:
    # initialize our bag of words
    bag = []
    pattern_words = doc[0]
    pattern_words = [stemmer.stem(word.lower()) for word in pattern_words]
    for w in words:
        bag.append(1 if w in pattern_words else 0)
    # one-hot label: 1 at the index of this pattern's tag
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1
    training.append([bag, output_row])
random.shuffle(training)
# dtype=object because each row pairs two lists of different lengths
training = np.array(training, dtype=object)
train_x = list(training[:,0])
train_y = list(training[:,1])
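To make the encoding concrete, here is a small pure-Python sketch of the same idea on a toy vocabulary. The helper names `to_bag` and `one_hot` are illustrative only, and the vocabulary is a four-word subset of the one printed earlier.

```python
def to_bag(tokens, vocabulary):
    """Return a 0/1 list: 1 where the vocabulary word occurs in tokens."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

def one_hot(label, labels):
    """Return a 0/1 list with a single 1 at the label's index."""
    return [1 if candidate == label else 0 for candidate in labels]

vocabulary = ["88", "你好", "再見", "您好"]
labels = ["告別", "打招呼"]

print(to_bag(["你好"], vocabulary))   # [0, 1, 0, 0]
print(one_hot("打招呼", labels))       # [0, 1]
```

Each training row pairs one such bag (the input) with one such one-hot list (the expected output).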
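Now we start training. The model is built with Keras and has three layers. The dataset is small, and the classification output is a multi-class array, which is what lets us identify the encoded intent. Softmax activation produces that output: an array of per-class probabilities which, for a confident prediction, approaches a one-hot form such as [1,0,0,...,0].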
model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))
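# Note: lr= and decay= are the older Keras argument names; recent releases use learning_rate= and may reject decay=.
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)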
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
model.fit(np.array(train_x), np.array(train_y), epochs=200, batch_size=5, verbose=1)
This runs training for 200 epochs with a batch size of 5. Because my sample set is small, 100 epochs would also do; that is not the point here.
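As a side note on the output layer used above: softmax turns the last layer's raw scores into probabilities that sum to 1, and categorical cross-entropy penalizes a low probability on the true class. A hand-rolled sketch with made-up scores (not values from this model):

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability; the result is unchanged.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probabilities):
    """Cross-entropy loss for a single one-hot encoded sample."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities) if t)

scores = [2.0, 0.5]   # made-up raw outputs for the two classes
probs = softmax(scores)
print(probs)
print(categorical_crossentropy([1, 0], probs))
```

The loss shrinks toward 0 as the probability of the correct class approaches 1, which is the trend visible in a training log.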
Training starts:
Epoch 1/200
14/14 [==============================] - 0s 32ms/step - loss: 0.7305 - acc: 0.5000
Epoch 2/200
14/14 [==============================] - 0s 391us/step - loss: 0.7458 - acc: 0.4286
Epoch 3/200
14/14 [==============================] - 0s 390us/step - loss: 0.7086 - acc: 0.3571
Epoch 4/200
14/14 [==============================] - 0s 395us/step - loss: 0.6941 - acc: 0.6429
Epoch 5/200
14/14 [==============================] - 0s 426us/step - loss: 0.6358 - acc: 0.7143
Epoch 6/200
14/14 [==============================] - 0s 356us/step - loss: 0.6287 - acc: 0.5714
Epoch 7/200
14/14 [==============================] - 0s 366us/step - loss: 0.6457 - acc: 0.6429
Epoch 8/200
14/14 [==============================] - 0s 899us/step - loss: 0.6336 - acc: 0.6429
Epoch 9/200
14/14 [==============================] - 0s 464us/step - loss: 0.5815 - acc: 0.6429
Epoch 10/200
14/14 [==============================] - 0s 408us/step - loss: 0.5895 - acc: 0.6429
Epoch 11/200
14/14 [==============================] - 0s 548us/step - loss: 0.6050 - acc: 0.6429
Epoch 12/200
14/14 [==============================] - 0s 468us/step - loss: 0.6254 - acc: 0.6429
Epoch 13/200
14/14 [==============================] - 0s 388us/step - loss: 0.4990 - acc: 0.7857
Epoch 14/200
14/14 [==============================] - 0s 392us/step - loss: 0.5880 - acc: 0.7143
Epoch 15/200
14/14 [==============================] - 0s 370us/step - loss: 0.5118 - acc: 0.8571
Epoch 16/200
14/14 [==============================] - 0s 457us/step - loss: 0.5579 - acc: 0.7143
Epoch 17/200
14/14 [==============================] - 0s 432us/step - loss: 0.4535 - acc: 0.7857
Epoch 18/200
14/14 [==============================] - 0s 357us/step - loss: 0.4367 - acc: 0.7857
Epoch 19/200
14/14 [==============================] - 0s 384us/step - loss: 0.4751 - acc: 0.7857
Epoch 20/200
14/14 [==============================] - 0s 346us/step - loss: 0.4404 - acc: 0.9286
Epoch 21/200
14/14 [==============================] - 0s 500us/step - loss: 0.4325 - acc: 0.8571
Epoch 22/200
14/14 [==============================] - 0s 400us/step - loss: 0.4104 - acc: 0.9286
Epoch 23/200
14/14 [==============================] - 0s 738us/step - loss: 0.4296 - acc: 0.7857
Epoch 24/200
14/14 [==============================] - 0s 387us/step - loss: 0.3706 - acc: 0.9286
Epoch 25/200
14/14 [==============================] - 0s 430us/step - loss: 0.4213 - acc: 0.8571
Epoch 26/200
14/14 [==============================] - 0s 351us/step - loss: 0.2867 - acc: 1.0000
Epoch 27/200
14/14 [==============================] - 0s 3ms/step - loss: 0.2903 - acc: 1.0000
Epoch 28/200
14/14 [==============================] - 0s 366us/step - loss: 0.3010 - acc: 0.9286
Epoch 29/200
14/14 [==============================] - 0s 404us/step - loss: 0.2466 - acc: 0.9286
Epoch 30/200
14/14 [==============================] - 0s 428us/step - loss: 0.3035 - acc: 0.7857
Epoch 31/200
14/14 [==============================] - 0s 407us/step - loss: 0.2075 - acc: 1.0000
Epoch 32/200
14/14 [==============================] - 0s 457us/step - loss: 0.2167 - acc: 0.9286
Epoch 33/200
14/14 [==============================] - 0s 613us/step - loss: 0.1266 - acc: 1.0000
Epoch 34/200
14/14 [==============================] - 0s 534us/step - loss: 0.2906 - acc: 0.9286
Epoch 35/200
14/14 [==============================] - 0s 463us/step - loss: 0.2560 - acc: 0.9286
Epoch 36/200
14/14 [==============================] - 0s 500us/step - loss: 0.1686 - acc: 1.0000
Epoch 37/200
14/14 [==============================] - 0s 387us/step - loss: 0.0922 - acc: 1.0000
Epoch 38/200
14/14 [==============================] - 0s 430us/step - loss: 0.1620 - acc: 1.0000
Epoch 39/200
14/14 [==============================] - 0s 371us/step - loss: 0.1104 - acc: 1.0000
Epoch 40/200
14/14 [==============================] - 0s 488us/step - loss: 0.1330 - acc: 1.0000
Epoch 41/200
14/14 [==============================] - 0s 381us/step - loss: 0.1322 - acc: 1.0000
Epoch 42/200
14/14 [==============================] - 0s 462us/step - loss: 0.0575 - acc: 1.0000
Epoch 43/200
14/14 [==============================] - 0s 1ms/step - loss: 0.1137 - acc: 1.0000
Epoch 44/200
14/14 [==============================] - 0s 450us/step - loss: 0.0245 - acc: 1.0000
Epoch 45/200
14/14 [==============================] - 0s 470us/step - loss: 0.1824 - acc: 1.0000
Epoch 46/200
14/14 [==============================] - 0s 444us/step - loss: 0.0822 - acc: 1.0000
Epoch 47/200
14/14 [==============================] - 0s 436us/step - loss: 0.0939 - acc: 1.0000
Epoch 48/200
14/14 [==============================] - 0s 396us/step - loss: 0.0288 - acc: 1.0000
Epoch 49/200
14/14 [==============================] - 0s 580us/step - loss: 0.1367 - acc: 0.9286
Epoch 50/200
14/14 [==============================] - 0s 351us/step - loss: 0.0363 - acc: 1.0000
Epoch 51/200
14/14 [==============================] - 0s 379us/step - loss: 0.0272 - acc: 1.0000
Epoch 52/200
14/14 [==============================] - 0s 358us/step - loss: 0.0712 - acc: 1.0000
Epoch 53/200
14/14 [==============================] - 0s 4ms/step - loss: 0.0426 - acc: 1.0000
Epoch 54/200
14/14 [==============================] - 0s 370us/step - loss: 0.0430 - acc: 1.0000
Epoch 55/200
14/14 [==============================] - 0s 368us/step - loss: 0.0292 - acc: 1.0000
Epoch 56/200
14/14 [==============================] - 0s 494us/step - loss: 0.0777 - acc: 1.0000
Epoch 57/200
14/14 [==============================] - 0s 356us/step - loss: 0.0496 - acc: 1.0000
Epoch 58/200
14/14 [==============================] - 0s 427us/step - loss: 0.1485 - acc: 1.0000
Epoch 59/200
14/14 [==============================] - 0s 381us/step - loss: 0.1006 - acc: 1.0000
Epoch 60/200
14/14 [==============================] - 0s 421us/step - loss: 0.0183 - acc: 1.0000
Epoch 61/200
14/14 [==============================] - 0s 344us/step - loss: 0.0788 - acc: 0.9286
Epoch 62/200
14/14 [==============================] - 0s 529us/step - loss: 0.0176 - acc: 1.0000
OK. After 200 epochs the model is trained. Now we define helper functions for the bag-of-words conversion:
def clean_up_sentence(sentence):
    # tokenize the pattern - split words into array
    sentence_words = nltk.word_tokenize(sentence)
    # stem each word - create short form for word
    sentence_words = [stemmer.stem(word.lower()) for word in sentence_words]
    return sentence_words
def bow(sentence, words, show_details=True):
    # tokenize the pattern
    sentence_words = clean_up_sentence(sentence)
    # bag of words - matrix of N words, vocabulary matrix
    bag = [0]*len(words)
    for s in sentence_words:
        for i,w in enumerate(words):
            if w == s:
                # assign 1 if current word is in the vocabulary position
                bag[i] = 1
                if show_details:
                    print ("found in bag: %s" % w)
    return(np.array(bag))
Let's test whether a word lands in the bag:
p = bow("你好", words)
print (p)
Output:
found in bag: 你好
[0 0 1 0 0 0 0 0 0 0 0 0 0 0]
Clearly the match succeeded: the word is in the bag.
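Before packaging the model, we can use model.predict to classify user input and return the user's intent according to the computed probabilities (several intents may be returned, sorted by descending probability):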
def classify_local(sentence):
    ERROR_THRESHOLD = 0.25
    # generate probabilities from the model
    input_data = pd.DataFrame([bow(sentence, words)], dtype=float, index=['input'])
    results = model.predict([input_data])[0]
    # filter out predictions below a threshold, and provide intent index
    results = [[i,r] for i,r in enumerate(results) if r>ERROR_THRESHOLD]
    # sort by strength of probability
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append((classes[r[0]], str(r[1])))
    # return tuple of intent and probability
    return return_list
Test it:
print(classify_local('您好'))
Output:
found in bag: 您好
[('打招呼', '0.999913')]
Another test:
print(classify_local('88'))
Output:
found in bag: 88
[('告別', '0.9995449')]
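Perfect: the two inputs are matched to the 打招呼 (greeting) and 告別 (farewell) context tags respectively. If you like, test a few more inputs and refine the model.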
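The filtering and sorting inside classify_local can be tried without a model. A small sketch with hypothetical probabilities (the helper name `rank_intents` and all numbers are made up for illustration):

```python
ERROR_THRESHOLD = 0.25

def rank_intents(probabilities, labels, threshold=ERROR_THRESHOLD):
    """Keep classes whose probability exceeds the threshold, strongest first."""
    kept = [(labels[i], p) for i, p in enumerate(probabilities) if p > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept

print(rank_intents([0.30, 0.70], ["告別", "打招呼"]))
# [('打招呼', 0.7), ('告別', 0.3)]

# With five equally likely classes nothing clears the threshold.
print(rank_intents([0.2] * 5, ["a", "b", "c", "d", "e"]))
# []
```

The second call shows that the result can be empty once there are more than a few classes, so code that reads the first element should check for that case.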
Once testing is done, we can save the trained model so that it does not have to be retrained before every call:
model.save("./v3u.h5")
The classification model is written to the current directory as v3u.h5. Keep it safe; we will use it shortly.
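The .h5 file stores only the network; the `words` and `classes` lists are still needed to encode input and decode predictions. One possible way to persist them is sketched here with the standard `pickle` module. The helper names and the file name `v3u_vocab.pkl` are assumptions, not part of the original script.

```python
import pickle

def save_vocab(path, words, classes):
    """Write the vocabulary and class labels to one pickle file."""
    with open(path, "wb") as handle:
        pickle.dump({"words": words, "classes": classes}, handle)

def load_vocab(path):
    """Read them back as a (words, classes) tuple."""
    with open(path, "rb") as handle:
        data = pickle.load(handle)
    return data["words"], data["classes"]
```

After training one would call `save_vocab("v3u_vocab.pkl", words, classes)`, and a separate process could restore both lists with `words, classes = load_vocab("v3u_vocab.pkl")`.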
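Next, let's build an API for the chatbot. Here we use FastAPI, a framework that is very popular right now. After placing the model file in the project directory, write main.py: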
import random
import uvicorn
import pandas as pd
from fastapi import FastAPI
from keras.models import model_from_json,load_model
# assumption: bow, words, classes and intents are reused from test_bot.py (the original main.py leaves them undefined);
# note that importing that script reruns everything in it, training included
from test_bot import bow, words, classes, intents
app = FastAPI()
# load the trained model once at startup so that classify_local can see it
model = load_model("./v3u.h5")
def classify_local(sentence):
    ERROR_THRESHOLD = 0.25
    # generate probabilities from the model
    input_data = pd.DataFrame([bow(sentence, words)], dtype=float, index=['input'])
    results = model.predict([input_data])[0]
    # filter out predictions below a threshold, and provide intent index
    results = [[i,r] for i,r in enumerate(results) if r>ERROR_THRESHOLD]
    # sort by strength of probability
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append((classes[r[0]], str(r[1])))
    # return tuple of intent and probability
    return return_list
@app.get('/')
async def root(word: str = None):
    a = ""
    # guard against a missing parameter or no intent above the threshold
    wordlist = classify_local(word) if word else []
    if wordlist:
        for intent in intents['intents']:
            if intent['tag'] == wordlist[0][0]:
                a = random.choice(intent['responses'])
    return {'message':a}
if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000)
These lines:
from keras.models import model_from_json,load_model
model = load_model("./v3u.h5")
load the model we trained earlier. Then start the service:
uvicorn main:app --reload
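The effect: a request such as http://127.0.0.1:8000/?word=你好 returns a JSON object whose message field holds one of the greeting responses, chosen at random.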
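To query the running service from Python, something like the following could be used. It relies only on the standard library; the helper names `build_url` and `ask_bot` are hypothetical, and the address is the one passed to uvicorn.run in main.py.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000/"

def build_url(word, base_url=BASE_URL):
    """Percent-encode the word (UTF-8) into the `word` query parameter."""
    return base_url + "?" + urlencode({"word": word})

def ask_bot(word, base_url=BASE_URL):
    """Send one message to the API and return the reply text."""
    with urlopen(build_url(word, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))["message"]

print(build_url("你好"))   # http://127.0.0.1:8000/?word=%E4%BD%A0%E5%A5%BD
```

With the server started, `ask_bot("你好")` should return one of the greeting responses; it needs the service to be reachable, so it is not called in the snippet itself.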
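Conclusion: without question, technology changes lives. A chatbot lets us hear a sweet, lilting voice even with no fair companion at our side, and I believe that before long a smiling, elegantly dressed "Ex Machina" (機械姬) will be able to keep us company under the cool breeze and the bright moon.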