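TensorFlow has grown into the de facto machine learning (ML) platform, popular in both industry and research. The demand and support for TensorFlow have given rise to a large number of OSS libraries, tools, and frameworks around training and serving ML models. TensorFlow Serving is a project built to handle the inference side of serving ML models in a distributed production environment.

Today we will focus on techniques for improving latency by optimizing both the prediction server and the client. Model predictions are usually "online" operations (on the critical application request path), so our primary optimization goal is to handle a high volume of requests with latency as low as possible.

First, a quick overview of TensorFlow Serving.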

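What is TensorFlow Serving?

TensorFlow Serving provides a flexible server architecture designed to deploy and serve ML models. Once a model has been trained and is ready to be used for prediction, TensorFlow Serving requires it to be exported to a Servable-compatible format.

A Servable is the central abstraction that wraps TensorFlow objects. For example, a model can be represented as one or more Servables. Servables are therefore the underlying objects that clients use to perform computation such as inference. The size of a Servable matters: smaller models use less memory and less storage, and they load faster. Servables expect models to be in SavedModel format so that they can be loaded and served with the Predict API.

TensorFlow Serving puts the core serving components together to build a gRPC/HTTP server that can serve multiple ML models (or multiple versions), provides monitoring components, and has a configurable architecture.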

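TensorFlow Serving with Docker

Let's get baseline prediction latency metrics using standard TensorFlow Serving (no CPU optimizations).

First, pull the latest serving image from the TensorFlow Docker Hub repository: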

docker pull tensorflow/serving:latest

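For the purposes of this post, all containers run on a 4-core, 15 GB Ubuntu 16.04 host.

Exporting a TensorFlow model to SavedModel format

When a model is trained with TensorFlow, the output can be saved as variable checkpoints (files on disk). Inference can be run directly by restoring the model checkpoints or from the converted frozen graph (a binary).

To serve these models with TensorFlow Serving, the frozen graph has to be exported to SavedModel format. The TensorFlow documentation provides examples of exporting trained models to SavedModel format.

We will use a deep residual network (ResNet) model, which can classify images from the 1000-class ImageNet dataset. Download the pre-trained ResNet-50 v2 model (https://github.com/tensorflow/models/tree/master/official/resnet#pre-trained-model), specifically the channels_last (NHWC) convolution SavedModel, which is generally better suited to CPUs.

Copy the ResNet model directory into the following structure:

TensorFlow Serving expects models to be in a numerically ordered directory structure to manage model versioning. In this case, the directory 1/ corresponds to model version 1, which contains the model architecture saved_model.pb along with a snapshot of the model weights (variables).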
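Before mounting the directory into the container, it can help to confirm that the layout matches what is described above. The following is a minimal sketch, not part of the original setup: the helper name list_model_versions is hypothetical, and it only checks that each numeric subdirectory contains a saved_model.pb file.

```python
import os


def list_model_versions(model_dir):
    """Return the numeric version subdirectories that contain a saved_model.pb."""
    versions = []
    for name in os.listdir(model_dir):
        version_dir = os.path.join(model_dir, name)
        # Only purely numeric directory names are treated as model versions.
        if name.isdecimal() and os.path.isfile(os.path.join(version_dir, "saved_model.pb")):
            versions.append(int(name))
    return sorted(versions)


# Example usage (the path is an assumption based on the mount used in this post):
# print(list_model_versions("models"))
```

An empty result would suggest that the model files were copied one level too high or too low in the directory tree.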

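Loading and serving the SavedModel

The following command starts the TensorFlow Serving model server in a Docker container. To load the SavedModel, the model's host directory has to be mounted into the expected container directory.

docker run -d -p 9000:8500 \
 -v $(pwd)/models:/models/resnet -e MODEL_NAME=resnet \
 -t tensorflow/serving:latest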

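Inspecting the container logs shows that the ModelServer is running and ready to serve inference requests for the resnet model on the gRPC and HTTP endpoints: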

I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: resnet version: 1}
I tensorflow_serving/model_servers/server.cc:286] Running gRPC ModelServer at 0.0.0.0:8500 ... 
I tensorflow_serving/model_servers/server.cc:302] Exporting HTTP/REST API at:localhost:8501 ...
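Besides reading the logs, readiness can also be checked through the REST endpoint that the last log line mentions. The sketch below is an illustration under assumptions: it relies on the model status route /v1/models/<name> of the TensorFlow Serving REST API, and it requires port 8501 to be published from the container (for example with an extra -p 8501:8501), which the docker run command above does not do. The function names are hypothetical.

```python
import json
import urllib.request


def model_status_url(host, port, model_name):
    """Build the REST URL that reports the status of a served model."""
    return "http://{}:{}/v1/models/{}".format(host, port, model_name)


def fetch_model_status(url, timeout=5.0):
    """Fetch and decode the JSON status document (needs a reachable server)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example usage, assuming the REST port is reachable on localhost:
# print(fetch_model_status(model_status_url("localhost", 8501, "resnet")))
```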

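Prediction client

TensorFlow Serving defines its API service schema as protocol buffers (protobufs). The gRPC client implementation of the Predict API is packaged as the tensorflow_serving.apis Python package. We will also need the tensorflow Python package for utility functions.

Let's install the dependencies to create a simple client:

virtualenv .env && source .env/bin/activate && \
 pip install numpy grpcio opencv-python tensorflow tensorflow-serving-api

The ResNet-50 v2 model expects floating-point Tensor inputs in a channels_last (NHWC) formatted data structure. The input image is therefore read with opencv-python, which loads it into a numpy array (height x width x channels), and is then cast to the float32 data type. The script below creates the prediction client stub, loads the JPEG image data into a numpy array, converts it to a tensor proto, and makes the gRPC prediction request: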

#!/usr/bin/env python
from __future__ import print_function
import argparse
import numpy as np
import time
# start the timer before the heavy imports below so their load time counts toward the total
tt = time.time()
import cv2
import tensorflow as tf
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2
parser = argparse.ArgumentParser(description='inception grpc client flags.')
parser.add_argument('--host', default='0.0.0.0', help='inception serving host')
parser.add_argument('--port', default='9000', help='inception serving port')
parser.add_argument('--image', default='', help='path to JPEG image file')
FLAGS = parser.parse_args()
def main(): 
 # create prediction service client stub
 channel = implementations.insecure_channel(FLAGS.host, int(FLAGS.port))
 stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)
 
 # create request
 request = predict_pb2.PredictRequest()
 request.model_spec.name = 'resnet'
 request.model_spec.signature_name = 'serving_default'
 
 # read image into numpy array
 img = cv2.imread(FLAGS.image).astype(np.float32)
 
 # convert to tensor proto and make request
 # shape is in NHWC (num_samples x height x width x channels) format
 tensor = tf.contrib.util.make_tensor_proto(img, shape=[1]+list(img.shape))
 request.inputs['input'].CopyFrom(tensor)
 resp = stub.Predict(request, 30.0)
 
 print('total time: {}s'.format(time.time() - tt))
 
if __name__ == '__main__':
 main()

Running the client with an input JPEG image produces output like the following:

python tf_serving_client.py --image=images/pupper.jpg

total time: 2.56152906418s

The output tensor contains the prediction result as an integer value along with the feature probabilities.
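To make the probabilities easier to read, the most likely classes can be ranked on the client. This is a small sketch that works on a plain list of floats. The helper name top_k is hypothetical, and the output key shown in the usage comment ('probabilities') is an assumption about the model signature rather than something stated in this post.

```python
def top_k(probabilities, k=5):
    """Return the k (class_index, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Example usage with a Predict response (output key is an assumption):
# probs = list(resp.outputs['probabilities'].float_val)
# for class_index, prob in top_k(probs, k=3):
#     print(class_index, prob)
```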

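A latency of more than 2.5 seconds for a single request is not acceptable. It is not entirely unexpected, however: the default TensorFlow Serving binary targets the broadest range of hardware to cover most use cases. You may have noticed the following in the standard TensorFlow Serving container logs: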

I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA

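This indicates that the TensorFlow Serving binary is running on a CPU platform it was not optimized for.

Building a CPU-optimized serving binary

The TensorFlow documentation recommends compiling TensorFlow from source with all the optimizations available for the CPU of the host platform the binary will run on. The TensorFlow build options expose several flags to enable building for platform-specific CPU instruction sets:

In this example, we will use the 1.13 release branch of TensorFlow Serving: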

USER=$1 
TAG=$2 
TF_SERVING_VERSION_GIT_BRANCH="r1.13" 
git clone --branch="$TF_SERVING_VERSION_GIT_BRANCH" https://github.com/tensorflow/serving

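The TensorFlow Serving development image uses Bazel as its build tool. Build targets for processor-specific CPU instruction sets can be specified as follows: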

TF_SERVING_BUILD_OPTIONS="--copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-msse4.1 --copt=-msse4.2"

If memory is a constraint, the consumption of the memory-intensive build process can be limited with the --local_resources=2048,.5,1.0 flag.

Build the serving image using the development image as a base:

#!/bin/bash
USER=$1
TAG=$2
TF_SERVING_VERSION_GIT_BRANCH="r1.13"
git clone --branch="${TF_SERVING_VERSION_GIT_BRANCH}" https://github.com/tensorflow/serving
TF_SERVING_BUILD_OPTIONS="--copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-msse4.1 --copt=-msse4.2"
# build the development image with the CPU-specific build options
cd serving && \
 docker build --pull -t $USER/tensorflow-serving-devel:$TAG \
 --build-arg TF_SERVING_VERSION_GIT_BRANCH="${TF_SERVING_VERSION_GIT_BRANCH}" \
 --build-arg TF_SERVING_BUILD_OPTIONS="${TF_SERVING_BUILD_OPTIONS}" \
 -f tensorflow_serving/tools/docker/Dockerfile.devel .
# build the serving image on top of the development image (still inside serving/)
docker build -t $USER/tensorflow-serving:$TAG \
 --build-arg TF_SERVING_BUILD_IMAGE=$USER/tensorflow-serving-devel:$TAG \
 -f tensorflow_serving/tools/docker/Dockerfile .

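The ModelServer can be configured with TensorFlow-specific flags to enable session parallelism. The following options configure two thread pools for parallel execution:

intra_op_parallelism_threads

Controls the maximum number of threads used to parallelize the execution of a single operation.
Used to parallelize operations whose sub-operations are inherently independent.

inter_op_parallelism_threads

Controls the maximum number of threads used to execute independent, different operations in parallel.
Applies to operations in the TensorFlow graph that are independent of each other and can therefore run on different threads.

Both options default to 0, which means the system picks an appropriate number, usually one thread per available CPU core.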
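If you prefer to set the values explicitly rather than rely on the default, one simple starting point is to match both pools to the number of cores on the host. The sketch below builds the two model server flags used in this post. The helper name parallelism_flags is hypothetical, and one thread per core is an assumption to tune from, not a recommendation from the TensorFlow Serving documentation.

```python
import os


def parallelism_flags(intra_op=None, inter_op=None):
    """Build the model server parallelism flags, defaulting to one thread per core."""
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    intra = cores if intra_op is None else intra_op
    inter = cores if inter_op is None else inter_op
    return [
        "--tensorflow_intra_op_parallelism={}".format(intra),
        "--tensorflow_inter_op_parallelism={}".format(inter),
    ]


# Example usage: print the flags to append to the docker run command.
# print(" ".join(parallelism_flags()))
```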

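Next, start the serving container as before, this time with the Docker image built from source and with the TensorFlow-specific CPU optimization flags:

docker run -d -p 9000:8500 \
 -v $(pwd)/models:/models/resnet -e MODEL_NAME=resnet \
 -t $USER/tensorflow-serving:$TAG \
 --tensorflow_intra_op_parallelism=4 \
 --tensorflow_inter_op_parallelism=4

The container logs should no longer show the CPU guard warning. Without changing any code, running the same prediction request reduces the prediction latency by roughly 36% (from 2.56 s to 1.64 s):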

python tf_serving_client.py --image=images/pupper.jpg

total time: 1.64234706879s
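The reduction can be checked from the two timings printed in this post. The helper name below is hypothetical; the numbers are the baseline and optimized total times shown above.

```python
def latency_reduction_pct(baseline_s, optimized_s):
    """Relative latency reduction, in percent, from baseline to optimized."""
    return (baseline_s - optimized_s) / baseline_s * 100.0


# Baseline and CPU-optimized timings from the two runs above.
print(round(latency_reduction_pct(2.56152906418, 1.64234706879), 1))  # prints 35.9
```

A single pair of runs is a noisy measurement, so the figure is best read as an approximate improvement.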

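Improving the speed of the prediction client

The server side has been optimized for its CPU platform, but a prediction latency of more than 1 second still seems too high.

Loading the tensorflow_serving and tensorflow libraries carries a high latency cost. Each call to tf.contrib.util.make_tensor_proto also adds unnecessary latency overhead.

We do not actually need the tensorflow or tensorflow_serving packages to make prediction requests.

As noted earlier, the TensorFlow prediction APIs are defined as protobufs. The two external dependencies can therefore be replaced by generating the necessary tensorflow and tensorflow_serving protobuf Python stubs. This avoids pulling in the entire TensorFlow library on the client itself.

First, get rid of the tensorflow and tensorflow_serving dependencies and add the grpcio-tools package.

pip uninstall tensorflow tensorflow-serving-api && \
 pip install grpcio-tools==1.0.0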

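Clone the tensorflow/tensorflow and tensorflow/serving repositories and copy the following protobuf files into the client project:

Copy the above protobuf files into a protos/ directory, preserving their original paths:

For simplicity, prediction_service.proto can be reduced to implement only the Predict RPC. This avoids pulling in the nested dependencies of the other RPCs defined in the service. Here is an example of a simplified prediction_service.proto (https://gist.github.com/masroorhasan/8e728917ca23328895499179f4575bb8).

Generate the gRPC Python implementations with grpc.tools.protoc using the following command: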

PROTOC_OUT=protos/ 
PROTOS=$(find . | grep ".proto$") 
for p in $PROTOS; do 
 python -m grpc.tools.protoc -I . --python_out=$PROTOC_OUT --grpc_python_out=$PROTOC_OUT $p
done

The entire tensorflow_serving module can now be removed:

from tensorflow_serving.apis import predict_pb2 
from tensorflow_serving.apis import prediction_service_pb2

and replaced with the generated protobufs from protos/tensorflow_serving/apis:

from protos.tensorflow_serving.apis import predict_pb2 
from protos.tensorflow_serving.apis import prediction_service_pb2

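The TensorFlow library is imported in order to use the helper function make_tensor_proto, which wraps a Python/numpy object as a TensorProto object.

We can therefore replace the following dependency and code snippet: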

import tensorflow as tf 
...
tensor = tf.contrib.util.make_tensor_proto(features) 
request.inputs['inputs'].CopyFrom(tensor)

with protobuf imports and code that builds the TensorProto object directly:

from protos.tensorflow.core.framework import tensor_pb2 
from protos.tensorflow.core.framework import tensor_shape_pb2 
from protos.tensorflow.core.framework import types_pb2 
...
# ensure NHWC shape and build tensor proto
tensor_shape = [1]+list(img.shape) 
dims = [tensor_shape_pb2.TensorShapeProto.Dim(size=dim) for dim in tensor_shape] 
tensor_shape = tensor_shape_pb2.TensorShapeProto(dim=dims) 
tensor = tensor_pb2.TensorProto( 
 dtype=types_pb2.DT_FLOAT,
 tensor_shape=tensor_shape,
 float_val=list(img.reshape(-1)))
request.inputs['inputs'].CopyFrom(tensor)

The full Python script is available here (https://gist.github.com/masroorhasan/0e73a7fc7bb2558c65933338d8194130). Run the updated inception client, which sends a prediction request to the optimized TensorFlow Serving:

python tf_inception_grpc_client.py --image=images/pupper.jpg

total time: 0.58314920859s

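The following chart shows the latency of prediction requests against standard TensorFlow Serving, optimized TensorFlow Serving, and the optimized client over 10 runs:

Average latency decreased by about 70.4% from standard TensorFlow Serving to the optimized version.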
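Averages over repeated runs like the ones above can be collected with a small timing helper. This is a sketch under assumptions: the name measure_latency is hypothetical, and it times a callable inside one Python process, so unlike the total time printed by the client script it does not include interpreter start-up or library import cost.

```python
import statistics
import time


def measure_latency(fn, runs=10):
    """Call fn() `runs` times and return (mean, stdev) of the wall-clock seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    # stdev needs at least two samples, so fall back to 0.0 for a single run
    spread = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return statistics.mean(samples), spread


# Example usage with the stub and request from the client script:
# mean_s, stdev_s = measure_latency(lambda: stub.Predict(request, 30.0), runs=10)
```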

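Optimizing prediction throughput

TensorFlow Serving can also be configured for high-throughput processing. Optimizing for throughput is usually done for "offline" batch processing, where tight latency bounds are not a strict requirement.

Server-side batching

The trade-off between latency and throughput is governed by the supported batching parameters.

Batching is enabled by setting the --enable_batching and --batching_parameters_file flags. The batching parameters can be set as defined by SessionBundleConfig (https://github.com/tensorflow/serving/blob/d77c9768e33e1207ac8757cff56b9ed9a53f8765/tensorflow_serving/servables/tensorflow/session_bundle_config.proto). For CPU-only systems, consider setting num_batch_threads to the number of available cores.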
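As an illustration, the contents of a batching parameters file can be generated from Python. This is a sketch, not a tuned configuration: the helper name is hypothetical, the numeric defaults are placeholders, and the text layout (each field wrapped as "name { value: N }") reflects my reading of the BatchingParameters message in the linked proto, so check it against the version of TensorFlow Serving you run.

```python
import os


def batching_parameters(max_batch_size=32, batch_timeout_micros=1000,
                        max_enqueued_batches=100, num_batch_threads=None):
    """Render batching parameters in protobuf text format (illustrative values)."""
    if num_batch_threads is None:
        # Follow the advice above: one batch thread per available core.
        num_batch_threads = os.cpu_count() or 1
    fields = [
        ("max_batch_size", max_batch_size),
        ("batch_timeout_micros", batch_timeout_micros),
        ("max_enqueued_batches", max_enqueued_batches),
        ("num_batch_threads", num_batch_threads),
    ]
    return "".join("{} {{ value: {} }}\n".format(name, value) for name, value in fields)


# Example usage: write the file passed to --batching_parameters_file.
# with open("batching_parameters.txt", "w") as f:
#     f.write(batching_parameters())
```

Larger max_batch_size and batch_timeout_micros values generally favor throughput, while smaller ones favor latency.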

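Once a full batch is reached on the server side, the inference requests are merged internally into a single large request (tensor), and one TensorFlow Session is run on the merged request. Running a batch of requests in a single session is where CPU/GPU parallelism can really be leveraged.

Some use cases to consider when batch processing with TensorFlow Serving:

Using asynchronous client requests to populate batches on the server side
Speeding up batching by placing model graph components on the CPU/GPU
Interleaving prediction requests when serving multiple models from the same server
Batching is highly recommended for "offline" high-volume inference processing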
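The first use case above, issuing client requests concurrently so that the server has enough in-flight requests to form batches, can be sketched with a thread pool. The helper name predict_concurrently is hypothetical and the worker count is a placeholder. The predict function is passed in as a parameter, so the sketch makes no assumption about the stub beyond it being callable from several threads.

```python
from concurrent.futures import ThreadPoolExecutor


def predict_concurrently(predict_fn, requests, max_workers=8):
    """Send all requests concurrently and return the results in request order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input requests
        return list(pool.map(predict_fn, requests))


# Example usage with the gRPC stub from the client script:
# responses = predict_concurrently(lambda r: stub.Predict(r, 30.0), requests)
```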

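Client-side batching

Batching on the client side groups multiple inputs together to make a single request.

Since the ResNet model expects input in NHWC format (the first dimension being the number of inputs), we can aggregate multiple input images into a single RPC request:

...
# read every image in the directory into a float32 array
batch = []
for jpeg in os.listdir(FLAGS.images_path):
 path = os.path.join(FLAGS.images_path, jpeg)
 img = cv2.imread(path).astype(np.float32)
 batch.append(img)
...
# stack into one array of shape (num_images, height, width, channels)
batch_np = np.array(batch).astype(np.float32)
dims = [tensor_shape_pb2.TensorShapeProto.Dim(size=dim) for dim in batch_np.shape]
t_shape = tensor_shape_pb2.TensorShapeProto(dim=dims)
tensor = tensor_pb2.TensorProto(
 dtype=types_pb2.DT_FLOAT,
 tensor_shape=t_shape,
 float_val=list(batch_np.reshape(-1)))
request.inputs['inputs'].CopyFrom(tensor)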
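When there are many images, sending them all in one request may not be practical, so the inputs can be split into fixed-size groups with one request per group. The sketch below shows only the grouping step. The helper name chunked and the batch size are hypothetical, and it assumes that the images within a group share the same height and width so that they stack into a single NHWC array.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Example usage: one request per group of up to 16 image paths.
# for group in chunked(image_paths, 16):
#     ...build the tensor for `group` and call stub.Predict...
```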