We all know that neural networks can replicate functions of the human brain when performing certain tasks. Their applications in computer vision and natural language generation have been especially striking.

This article introduces one such application and shows how to use a hybrid network of CNNs and RNNs (LSTM) to generate captions (descriptions) for images. The dataset we use is the popular Flickr 8k image dataset, which is a benchmark dataset for this task.

Note: we will split the dataset into 7k images for training and 1k for testing.
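The article does not show how the split is made, so the following is only a minimal sketch of one way to do it. It assumes the image ids are available as a list of file names without extension; the helper name `split_image_ids`, the seed and the toy ids are illustrative and not part of the original code. The resulting names `train_data_images` and `test_data_images` match the variables used later when the captions are loaded.

```python
import random

def split_image_ids(image_ids, n_test, seed=42):
    """Shuffle the ids reproducibly and hold out n_test of them for testing.

    n_test must be a positive integer smaller than len(image_ids).
    """
    ids = sorted(image_ids)            # fixed starting order, independent of input order
    random.Random(seed).shuffle(ids)   # local RNG, so the global random state is untouched
    return ids[:-n_test], ids[-n_test:]

# toy ids standing in for the Flickr 8k file names (hypothetical values)
example_ids = [f"img_{i:04d}" for i in range(80)]
train_data_images, test_data_images = split_image_ids(example_ids, n_test=10)
print(len(train_data_images), len(test_data_images))
```

For the real dataset, `n_test` would be chosen so that roughly 1k images are held out.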

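We will first discuss the different components (layers) of our hybrid neural network and what each one does. Along the way, we will also walk through a practical implementation of the hybrid network using Tensorflow, Keras and Python.

Overall structure of the neural network

Let's look at the overall architecture of the neural network we will use to generate captions.

Simply put, this neural network has 3 main components (sub-networks), each with a specific task: a convolutional network (for extracting features from images), an LSTM (for generating text) and a decoder (for merging the two networks).

Now let's discuss each component in detail and see how it works.

Image feature extractor

To generate features from images, we will use a convolutional neural network with only a small modification. Let's first look at a convolutional neural network used for image recognition.

A typical CNN classification model has two sub-networks

Feature Learning Network: the network responsible for generating feature maps from the image (a stack of multiple convolution and pooling layers).

Classification Network: the fully connected deep neural network responsible for classifying the image (multiple dense layers followed by a single output layer).

Since we are only interested in extracting features from images, not in classifying them, we keep only the Feature Learning part of the CNN. That is how we extract features from images.

The following code can be used to extract features from any set of images: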

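import os

import numpy as np
import tensorflow as tf
from keras.preprocessing import image
# assumption: preprocess_input is borrowed from ResNet50 here; any consistent input scaling would do
from keras.applications.resnet50 import preprocess_input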

# function to extract features from image
def extract_image_features():
    
    model = tf.keras.models.Sequential()
    
    # adding first layers of convolution and pooling layers to network
    model.add(tf.keras.layers.Conv2D(filters=64, kernel_size=(3,3), input_shape=(90,90,3), padding="VALID", activation="relu"))
    model.add(tf.keras.layers.Conv2D(filters=64, kernel_size=(3,3), activation="relu"))
    model.add(tf.keras.layers.MaxPool2D(pool_size=2, strides=2))
    
    # adding second layers of convolution and pooling layers to network
    model.add(tf.keras.layers.Conv2D(filters=32, kernel_size=(3,3), padding="VALID", activation="relu"))
    model.add(tf.keras.layers.Conv2D(filters=32, kernel_size=(3,3), activation="relu"))
    model.add(tf.keras.layers.AveragePooling2D(pool_size=2, strides=1))
    
    # flattening the output using flatten layer, since the input to neural net has to be flat
    model.add(tf.keras.layers.Flatten())
    
    # model summary
    model.summary()
    
    return model

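# build the model once, then run every image in image_path through it
# (image_path is the directory that holds the dataset images)
feature_model = extract_image_features()

for file in os.listdir(image_path):
    path = os.path.join(image_path, file)
    img = image.load_img(path, target_size=(90, 90))
    img_data = image.img_to_array(img)
    img_data = np.expand_dims(img_data, axis=0)
    img_data = preprocess_input(img_data)

    # predict must be called on the model, not on the function that builds it
    feature = feature_model.predict(img_data)
    feature = np.reshape(feature, feature.shape[1])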

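Anyone can use the code above to build their own image feature extractor, but there is a catch…

The model above is too simple to capture every important detail in our set of images, so it would hurt the performance of the overall model. Making the model more complex (many dense layers with a large number of neurons) is also challenging without access to high-performance GPUs and systems.

To solve this problem, Tensorflow ships very popular pre-trained CNN models (VGG-16, ResNet50 and others, developed by scientists at various universities and organizations) that can be used to extract features from images. Remember to remove the output layer from the model before using it for feature extraction.

The code below shows how to use these pre-trained models in Tensorflow to extract features from images.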

import os

import numpy as np
import tensorflow as tf
from keras.preprocessing import image
from keras.applications.resnet50 import ResNet50
from keras.applications.resnet50 import preprocess_input
from keras.models import Model

# load the ResNet50 model without its classification head
feature_extractor = ResNet50(weights='imagenet', include_top=False)
feature_extractor_new = Model(feature_extractor.input, feature_extractor.layers[-2].output)
feature_extractor_new.summary()

# maps each image name (without extension) to its flattened feature vector
image_feature_dictionary = {}

for file in os.listdir(image_path):
    path = os.path.join(image_path, file)
    img = image.load_img(path, target_size=(90, 90))
    img_data = image.img_to_array(img)
    img_data = np.expand_dims(img_data, axis=0)
    img_data = preprocess_input(img_data)

    feature = feature_extractor_new.predict(img_data)
    feature_reshaped = np.array(feature).flatten()
    image_feature_dictionary[file.split(".")[0]] = feature_reshaped

As you can see below, once the code above has run, each image feature is simply a numpy array of shape (18432,).

image_feature_dictionary[list(image_feature_dictionary.keys())[0]].shape
 (18432,)
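The figure 18432 follows from the input size. As a rough sketch, assuming that ResNet50 halves the spatial side five times (rounding up each time) and ends with 2048 channels, a 90x90 input shrinks to a 3x3x2048 feature map. The function name `resnet50_feature_length` is made up for this illustration.

```python
def resnet50_feature_length(input_size, channels=2048, downsampling_stages=5):
    """Estimate the flattened feature length for a square input image."""
    side = input_size
    for _ in range(downsampling_stages):
        side = (side + 1) // 2   # each stride-2 stage halves the side, rounding up
    return side * side * channels

print(resnet50_feature_length(90))   # 90 -> 45 -> 23 -> 12 -> 6 -> 3, so 3 * 3 * 2048 = 18432
```

A larger `target_size` would therefore produce a much longer feature vector, and the input layer of the captioning model would have to change with it.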

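Next, we will develop the LSTM network (RNN) that generates captions for the images.

LSTM for generating captions

Text generation is one of the most popular applications of LSTM networks. An LSTM cell (the basic building block of an LSTM network) produces its output based on the output of the previous time step: it retains that earlier output (its memory) and uses the memory to generate (predict) the next output in the sequence.

In our dataset, every image comes with 5 captions, which gives 40k captions in total.

Let's look at our dataset: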

1. A child in a pink dress is climbing up a set of stairs in an entry way.

2. A girl going into a wooden building.

3. A little girl climbing into a wooden playhouse.

4. A little girl climbing the stairs to her playhouse.

5. A little girl in a pink dress going into a wooden cabin.

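As you can see, all of the captions describe the picture well. Our task now is to design an RNN that can replicate this for any similar set of images.

Coming back to the original task, we first have to look at how an LSTM network generates text. To an LSTM network, a caption is nothing more than a long sequence of individual words (encoded as numbers) put together. Using this information, it tries to predict the next word in the sequence from the preceding words (its memory).

In our case, since captions can have variable length, we first need to mark the start and end of each caption. Let's see what <start> and <end> mean.

First, we add <start> and <end> to every caption in the dataset. Before creating the final vocabulary, we tokenize every caption in the training dataset. To train our model, we remove from the vocabulary every word that occurs fewer than 10 times. This step is added to improve the general performance of the model and to keep it from overfitting the training dataset.

The code is as follows: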

# loading captions from captions file
import pandas as pd

# loading captions.txt
captions = pd.read_csv('/kaggle/input/flickr8k/captions.txt', sep=",")
captions = captions.rename(columns=lambda x: x.strip().lower())
captions['image'] = captions['image'].apply(lambda x: x.split(".")[0])
captions = captions[['image', 'caption']]
# adding <start> and <end> to every caption
captions['caption'] = "<start> " + captions['caption'] + " <end>"

# in case we have any missing caption/blank caption drop it
print(captions.shape)
captions = captions.dropna()
print(captions.shape)

# training and testing image captions split
train_image_captions = {}
test_image_captions = {}

# list for storing every caption
all_captions = []

# storing training data
# (the loop variable is called image_id so it does not shadow the keras image module)
for image_id in train_data_images:
    tempDf = captions[captions['image'] == image_id]
    list_of_captions = tempDf['caption'].tolist()
    train_image_captions[image_id] = list_of_captions
    all_captions.append(list_of_captions)

# store testing data
for image_id in test_data_images:
    tempDf = captions[captions['image'] == image_id]
    list_of_captions = tempDf['caption'].tolist()
    test_image_captions[image_id] = list_of_captions
    all_captions.append(list_of_captions)

print("Data Statistics")
print(f"Training Images Captions {len(train_image_captions.keys())}")
print(f"Testing Images Captions {len(test_image_captions.keys())}")

The code above produces the output below

train_image_captions[list(train_image_captions.keys())[150]]
['<start> A brown dog chases a tattered ball around the yard . <end>',
 '<start> A brown dog is chasing a tattered soccer ball across a low cut field . <end>',
 '<start> Large brown dog playing with a white soccer ball in the grass . <end>',
 '<start> Tan dog chasing a ball . <end>',
 '<start> The tan dog is chasing a ball . <end>']

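Once the captions are loaded, we first tokenize everything using spaCy and the Tokenizer class (from keras.preprocessing.text).

Tokenization means breaking a sentence into separate words while removing special characters and lowercasing everything. The result is a corpus of the meaningful words (tokens) in the sentences, which we can encode further before using it as input to the model.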
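To make the idea concrete, here is a deliberately simplified sketch that uses only a regular expression. It is a stand-in for illustration, not the spaCy pipeline the article uses, and the helper name `simple_tokenize` is hypothetical; spaCy handles contractions, hyphens and similar cases differently.

```python
import re

def simple_tokenize(caption):
    """Lowercase a caption and keep only runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", caption.lower())

print(simple_tokenize("<start> Tan dog chasing a ball . <end>"))
# ['start', 'tan', 'dog', 'chasing', 'a', 'ball', 'end']
```

Note that the angle brackets around the markers disappear, which is why the vocabulary later contains plain `start` and `end` entries.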

import spacy
# 'en' was a shortcut in older spaCy releases; recent versions need the full
# package name (assumption: en_core_web_sm is installed)
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'ner'])

# tokenize every caption, remove punctuation, lowercase everything
for key, value in train_image_captions.items():
    ls = []
    for v in value:
        doc = nlp(v)
        new_v = " "
        for token in doc:
            if not token.is_punct:
                # skip whitespace-only tokens (spaces and newlines)
                if token.text not in [" ", "\n", "\n\n"]:
                    new_v = new_v + " " + token.text.lower()
        
        new_v = new_v.strip()
        ls.append(new_v)
    train_image_captions[key] = ls
    

# create a vocabulary of all the unique words present in captions
# flatten the list
all_captions = [caption for list_of_captions in all_captions for caption in list_of_captions]

# use spacy to convert to lowercase and reject any special characters
# (the loop variable is named caption so it does not overwrite the captions DataFrame)
tokens = []
for caption in all_captions:
    doc = nlp(caption)
    for token in doc:
        if not token.is_punct:
            if token.text not in [" ", "\n", "\n\n"]:
                tokens.append(token.text.lower())

# get tokens with frequency less than 10
import collections
word_count_dict = collections.Counter(tokens)
# a set keeps the membership test below fast
reject_words = set()
for key, value in word_count_dict.items():
    if value < 10:
        reject_words.add(key)

# the angle brackets around start/end are not real words either
reject_words.add("<")
reject_words.add(">")

# remove tokens that are in reject words
tokens = [x for x in tokens if x not in reject_words]

# convert the token to equivalent index using Tokenizer class of Keras
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(tokens)

The code above produces a dictionary in which every token is encoded as an integer, along with the reverse mapping. Sample output is shown below

tokenizer.word_index
{'a': 1,
 'end': 2,
 'start': 3,
 'in': 4,
 'the': 5,
 'on': 6,
 'is': 7,
 'and': 8,
 'dog': 9,
 'with': 10,
 'man': 11,
 'of': 12,
 'two': 13,
 'black': 14,
 'white': 15,
 'boy': 16,
 'woman': 17,
 'girl': 18,
 'wearing': 19,
 'are': 20,
 'brown': 21.....}

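After this, we need to find the length of the vocabulary and the length of the longest caption. Let's see why these two values matter when creating the model.

Vocabulary length: this is simply the number of unique words in our corpus. The number of neurons in the output layer equals the vocabulary length + 1 (the + 1 accounts for the extra blank index 0 introduced by padding the sequences), because at every iteration we need the model to generate a new word from the corpus.

Maximum caption length: in our dataset, captions have variable length even for the same image. Let's look at this in more detail.

As the sample captions show, every caption has a different length, so we cannot feed them to our LSTM model as they are. To solve this, we pad every caption to the length of the longest caption.

After padding, every sequence carries a run of extra 0s that brings its length up to that of the longest sequence.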
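A small pure-Python sketch of the padding step is shown here. It mimics the default behaviour of Keras `pad_sequences` (zeros added at the front, over-long sequences cut from the front); the helper name `pad_to_length` and the example numbers are illustrative only.

```python
def pad_to_length(sequence, max_length, pad_value=0):
    """Left-pad an encoded caption with pad_value up to max_length (max_length > 0)."""
    trimmed = sequence[-max_length:]   # keep the last max_length items if it is too long
    return [pad_value] * (max_length - len(trimmed)) + trimmed

print(pad_to_length([3, 21, 9, 2], 6))   # [0, 0, 3, 21, 9, 2]
print(pad_to_length([3, 21, 9, 2], 3))   # [21, 9, 2]
```

Because index 0 is reserved for this padding, no real word is ever encoded as 0.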

# compute length of vocabulary and maximum length of a caption (for padding)
vocab_len = len(tokenizer.word_counts) + 1
print(f"Vocabulary length - {vocab_len}")

max_caption_len = max([len(x.split(" ")) for x in all_captions])
print(f"Maximum length of caption - {max_caption_len}")

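Next, we create the training dataset for the model, with its inputs and outputs. For our problem we have two inputs and one output. To make this easier to follow, let's look at it in more detail.

For each image we have

Image features (X1): a Numpy array of shape (18432,), extracted with the ResNet50 model.

Input sequence (X2): this needs more explanation. Each caption is just a sequence, and our model tries to predict the next best element in that sequence. So for each caption we start with the first element of the sequence, and the corresponding output for it is the next element. In the next iteration, the output of the previous iteration is appended to the input of the previous iteration (the memory) to form the new input, and this continues until we reach the end of the sequence.

Output (y): the next word in the sequence.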
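The way one caption expands into several (input, output) pairs can be sketched as follows. The encoded caption uses the indices from the sample `word_index` shown earlier (start=3, brown=21, dog=9, end=2); the helper name `make_training_pairs` is hypothetical.

```python
def make_training_pairs(sequence):
    """Return (prefix, next_word) pairs for every position after the first."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]

encoded_caption = [3, 21, 9, 2]   # "start brown dog end"
for prefix, next_word in make_training_pairs(encoded_caption):
    print(prefix, "->", next_word)
# [3] -> 21
# [3, 21] -> 9
# [3, 21, 9] -> 2
```

In the real pipeline, each prefix is then padded to the maximum caption length, each next word is one-hot encoded, and the same image feature vector is repeated for every pair.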

The following code implements the logic above for creating the training dataset:

import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# generator function to generate inputs for model
def create_trianing_data(captions, images, tokenizer, max_caption_length, vocab_len, photos_per_batch):
    
    X1, X2, y = list(), list(), list()
    n=0
   
    # loop through every image
    while 1:
        for key, cap in captions.items():
            n+=1
            # retrieve the photo feature
            image = images[key]
            
            for c in cap:
                # encode the sequence
                sequence = [tokenizer.word_index[word] for word in c.split(' ') if word in tokenizer.word_index]
                
                # split one sequence into multiple X, y pairs
                
                for i in range(1, len(sequence)):
                    # creating input, output
                    inp, out = sequence[:i], sequence[i]
                    # padding input                     
                    input_seq = pad_sequences([inp], maxlen=max_caption_length)[0]
                    # encode output sequence
                    output_seq = to_categorical([out], num_classes=vocab_len)[0]
                    # store
                    X1.append(image)
                    X2.append(input_seq)
                    y.append(output_seq)
                    
            # yield the batch data
            if n==photos_per_batch:
                yield ([np.array(X1), np.array(X2)], np.array(y))
                X1, X2, y = list(), list(), list()
                n=0

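Merging the two sub-networks

Now that we have developed both sub-networks (the image feature extractor and the LSTM for generating captions), let's combine them to create our final model.

For any new image (which must be similar to the images used in training), the model generates a caption based on what it learned while training on a similar set of images and captions.

The following code creates the final model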

import keras

def create_model(max_caption_length, vocab_length):
    
    # sub network for handling the image feature part
    input_layer1 = keras.Input(shape=(18432,))
    feature1 = keras.layers.Dropout(0.2)(input_layer1)
    feature2 = keras.layers.Dense(max_caption_length*4, activation='relu')(feature1)
    feature3 = keras.layers.Dense(max_caption_length*4, activation='relu')(feature2)
    feature4 = keras.layers.Dense(max_caption_length*4, activation='relu')(feature3)
    feature5 = keras.layers.Dense(max_caption_length*4, activation='relu')(feature4)
    
    # sub network for handling the text generation part
    input_layer2 = keras.Input(shape=(max_caption_length,))
    # input_length is omitted: the sequence length is already fixed by input_layer2
    cap_layer1 = keras.layers.Embedding(vocab_length, 300)(input_layer2)
    cap_layer2 = keras.layers.Dropout(0.2)(cap_layer1)
    cap_layer3 = keras.layers.LSTM(max_caption_length*4, activation='relu', return_sequences=True)(cap_layer2)
    cap_layer4 = keras.layers.LSTM(max_caption_length*4, activation='relu', return_sequences=True)(cap_layer3)
    cap_layer5 = keras.layers.LSTM(max_caption_length*4, activation='relu', return_sequences=True)(cap_layer4)
    cap_layer6 = keras.layers.LSTM(max_caption_length*4, activation='relu')(cap_layer5)
    
    # merging the two sub network
    decoder1 = keras.layers.add([feature5, cap_layer6])
    decoder2 = keras.layers.Dense(256, activation='relu')(decoder1)
    decoder3 = keras.layers.Dense(256, activation='relu')(decoder2)
    
    # output is the next word in sequence
    output_layer = keras.layers.Dense(vocab_length, activation='softmax')(decoder3)
    model = keras.models.Model(inputs=[input_layer1, input_layer2], outputs=output_layer)
    
    model.summary()

    return model

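Before compiling the model, we need to add weights to the embedding layer. This is done by creating a word embedding (a representation of the token in a high-dimensional vector space) for every token in the corpus (vocabulary). Several very popular word embedding models can be used for this purpose (GloVe, Gensim embedding models and so on).

We will use spaCy's built-in "en_core_web_lg" model to create the vector representation of the tokens (each token is represented as a numpy array of shape (300,)).

The following code creates the word embeddings and adds them to the embedding layer of our model.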

# create word embeddings
import spacy
nlp = spacy.load('en_core_web_lg')

# create word embeddings
embedding_dimension = 300
embedding_matrix = np.zeros((vocab_len, embedding_dimension))

# travel through every word in vocabulary and get its corresponding vector
for word, index in tokenizer.word_index.items():

    doc = nlp(word)
    embedding_vector = np.array(doc.vector)
    embedding_matrix[index] = embedding_vector
    
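# build the model, then load the vectors into its embedding layer and freeze it
# (the layer is looked up by type, since its position in model.layers is not guaranteed)
predictive_model = create_model(max_caption_len, vocab_len)
embedding_layer = next(layer for layer in predictive_model.layers
                       if isinstance(layer, keras.layers.Embedding))
embedding_layer.set_weights([embedding_matrix])
embedding_layer.trainable = False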

Now that everything is in place, we just need to compile and train our model.

Note: because of the complexity of the task, training this network takes a very long time (it needs a large number of epochs).

# get training data
train_data = create_trianing_data(train_image_captions, train_image_features, tokenizer, max_caption_len, vocab_len, 32)

# predictive_model is the model returned by create_model(max_caption_len, vocab_len),
# with the embedding weights already loaded into it
steps_per_epochs = len(train_image_captions)//32

# compile and train the model
# (fit accepts generators; fit_generator is no longer available in recent TensorFlow releases)
predictive_model.compile(optimizer='adam', loss='categorical_crossentropy')
predictive_model.fit(train_data, epochs=100, steps_per_epoch=steps_per_epochs)

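To generate a new caption, we first need to convert an image into a numpy array with the same dimension as the images of the training dataset (18432) and use it as input to the model.

During sequence generation, we stop as soon as we encounter the end marker in the output.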
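The stopping rule can be illustrated without a trained network. In this sketch, `fake_predict_next` is a hypothetical stand-in that returns canned words instead of calling the model, and `greedy_decode` is an illustrative name; only the control flow is meant to mirror the real generation loop.

```python
def greedy_decode(predict_next, start_token, end_token, max_length):
    """Append one predicted word at a time until end_token or max_length is reached."""
    words = [start_token]
    for _ in range(max_length):
        next_word = predict_next(words)
        if next_word == end_token:
            break
        words.append(next_word)
    return " ".join(words[1:])   # drop the start marker

# hypothetical stand-in for the trained model: replays a fixed caption
canned_words = ["a", "dog", "runs", "end"]

def fake_predict_next(words):
    return canned_words[len(words) - 1]

print(greedy_decode(fake_predict_next, "start", "end", max_length=10))   # a dog runs
```

The `max_length` cap matters in practice: a poorly trained model may never emit the end marker, and the loop would otherwise run forever.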

import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
%matplotlib inline

# method for generating captions
def generate_captions(model, image, word_index, max_caption_length, index_word):
    
    # the tokenizer drops "<" and ">", so the markers are stored as "start" and "end"
    input_text = 'start'
    
    # keep generating words till we have encountered the end marker
    for i in range(max_caption_length):
        seq = [word_index[w] for w in input_text.split() if w in word_index]
        seq = pad_sequences([seq], maxlen=max_caption_length)
        prediction = model.predict([image, seq], verbose=0)
        prediction = np.argmax(prediction)
        # index 0 is reserved for padding and has no word attached to it
        word = index_word.get(prediction)
        if word is None:
            break
        input_text += ' ' + word
        if word == 'end':
            break
    
    # remove the start and end markers from the output and return a string
    output = [w for w in input_text.split()[1:] if w != 'end']
    output = ' '.join(output)
    return output

# traverse through testing images to generate captions
count = 0
for key, value in test_image_features.items():
    test_image = test_image_features[key]
    test_image = np.expand_dims(test_image, axis=0)
    final_caption = generate_captions(predictive_model, test_image, tokenizer.word_index, max_caption_len, tokenizer.index_word)
    
    plt.figure(figsize=(7,7))
    image = Image.open(image_path + "//" + key + ".jpg")
    plt.imshow(image)
    plt.title(final_caption)
    
    count = count + 1
    if count == 3:
        break

Now let's examine the output of the model