TensorRT(4)-Profiling and 16-bit Inference
The previous sections used LeNet as an example to walk through the basic TensorRT workflow: parsing a Caffe model with TensorRT's NvCaffeParser tool and the lower-level C++ API, building a TensorRT engine, and deploying it.

This section uses GoogLeNet as an example to show TensorRT's optimization features.
The sample is located at /usr/src/tensorrt/samples/sampleGoogleNet.
It demonstrates TensorRT's layer-based profiling as well as how to use half2mode and FP16.
1 Key Concepts
First, a few concepts:

- Profiling a network: measuring the execution time of every layer in the network, which makes it easy to see the time difference between running with and without TensorRT.
- FP16: FP32 is 32-bit single-precision floating point; FP16 is 16-bit half-precision floating point. FP16 uses less memory and reduces inference time.
- Half2Mode: an execution mode of TensorRT in which internal tensors interleave 16-bit values from adjacent pairs of images, and which is the fastest mode of operation for batch sizes greater than one. This comes down to a memory-layout choice (sequential versus interleaved storage, a topic from computer organization that I don't claim to fully understand); interleaving improves effective memory bandwidth. A small index-arithmetic sketch follows this list. (The original post included figures of the 2D and 3D cases here.)
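To make the interleaving concrete, here is a rough index-arithmetic sketch of my own (an illustration of the layout described above, not TensorRT's actual internal code): a planar NCHW layout versus a layout where the 16-bit values of two adjacent images in the batch are stored next to each other, so a single 32-bit load (CUDA's half2) fetches the same element of both images at once.

#include <cstddef>

// Offset of element (n, c, h, w) in an ordinary planar NCHW tensor
// (n = image index in the batch, c = channel, h/w = spatial position).
size_t offsetNCHW(size_t n, size_t c, size_t h, size_t w,
                  size_t C, size_t H, size_t W)
{
    return ((n * C + c) * H + h) * W + w;
}

// Offset in a pair-interleaved layout: images are grouped in pairs along the
// batch dimension, and the two FP16 values of a pair sit side by side in memory.
size_t offsetInterleavedPairs(size_t n, size_t c, size_t h, size_t w,
                              size_t C, size_t H, size_t W)
{
    size_t pair = n / 2;   // which image pair this element belongs to
    return (((pair * C + c) * H + h) * W + w) * 2 + (n % 2);
}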
2 How It's Done
2.1 Configuring the builder
According to the TensorRT 3.0 documentation, merely replacing float32 data with float16 does not by itself bring much of a performance gain. The real speedup comes from half2mode, i.e. the interleaved storage mode described above.
How do you use half2mode?
- First, initialize the network object with float16 precision. This is done by passing DataType::kHALF when the NvCaffeParser tool parses the Caffe model:

const IBlobNameToTensor *blobNameToTensor =
    parser->parse(locateFile(deployFile).c_str(),
                  locateFile(modelFile).c_str(),
                  *network,
                  DataType::kHALF);
- Then configure the builder to use half2mode, which takes a single call (see the combined sketch right after this list):

builder->setFp16Mode(true);
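Putting the two steps together with a hardware-support check, here is a sketch assembled from the full sample in section 3 (if your TensorRT version does not have setFp16Mode, older releases exposed the same switch as builder->setHalf2Mode(true)):

// Sketch (based on the full sample below): enable FP16 + half2mode only when
// the GPU reports native fast-FP16 support.
bool useFp16 = builder->platformHasFastFp16();
DataType modelDataType = useFp16 ? DataType::kHALF : DataType::kFLOAT;

const IBlobNameToTensor* blobNameToTensor =
    parser->parse(locateFile(deployFile).c_str(),
                  locateFile(modelFile).c_str(),
                  *network,
                  modelDataType);

if (useFp16)
    builder->setFp16Mode(true);   // older releases: builder->setHalf2Mode(true)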
2.2 Profiling
To profile a network, implement the IProfiler interface and attach the profiler to the execution context:
context.profiler = &gProfiler;
Profiling does not support asynchronous execution, only synchronous execution, so TensorRT's synchronous execute() function must be used when running the timed iterations:
for (int i = 0; i < TIMING_ITERATIONS; i++)
    context.execute(batchSize, buffers);
During execution, each layer invokes the profiler callback, which records that layer's execution time. A minimal example follows.
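For illustration, a minimal IProfiler can be as small as the sketch below; the Profiler used in the full sample (section 3) does the same thing but accumulates each layer's time across all iterations before printing an average.

#include <cstdio>
#include "NvInfer.h"

using namespace nvinfer1;

// Minimal profiler: TensorRT calls reportLayerTime() once per layer for every
// synchronous execute() call, passing the layer name and its time in milliseconds.
struct SimpleProfiler : public IProfiler
{
    virtual void reportLayerTime(const char* layerName, float ms)
    {
        printf("%-40.40s %4.3f ms\n", layerName, ms);
    }
} gSimpleProfiler;

// Attach it before running inference, e.g. as the full sample does:
//   context->setProfiler(&gSimpleProfiler);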
Because TensorRT applies layer fusion and tensor fusion optimizations, some layers are merged inside TensorRT.
For example, layers such as inception_5a/3x3 and inception_5a/relu_3x3 from the original network are fused into a single inception_5a/3x3 + inception_5a/relu_3x3 layer, so per-layer times are reported for the fused layers. As a result, the structure of the TensorRT-optimized network no longer corresponds one-to-one with the original network.
3 The Official Sample
The sample is located at /usr/src/tensorrt/samples/sampleGoogleNet.
It demonstrates TensorRT's layer-based profiling, half2mode, and FP16. Compared with the MNIST sample described earlier it only adds a few interfaces and changes some parameters, but the complete code is listed here anyway, even though it takes up some space.
#include <assert.h>
#include <fstream>
#include <sstream>
#include <iostream>
#include <cmath>
#include <algorithm>
#include <sys/stat.h>
#include <time.h>
#include <cuda_runtime_api.h>

#include "NvInfer.h"
#include "NvCaffeParser.h"
#include "common.h"

static Logger gLogger;

using namespace nvinfer1;
using namespace nvcaffeparser1;

// stuff we know about the network and the caffe input/output blobs
static const int BATCH_SIZE = 4;
static const int TIMING_ITERATIONS = 1000;

const char* INPUT_BLOB_NAME = "data";
const char* OUTPUT_BLOB_NAME = "prob";

std::string locateFile(const std::string& input)
{
    std::vector<std::string> dirs{"data/samples/googlenet/", "data/googlenet/"};
    return locateFile(input, dirs);
}

// Profiler class, derived from IProfiler
struct Profiler : public IProfiler
{
    typedef std::pair<std::string, float> Record;
    std::vector<Record> mProfile;

    // store each layer's execution time in the vector
    virtual void reportLayerTime(const char* layerName, float ms)
    {
        // find_if returns an iterator to the first record whose name matches layerName
        auto record = std::find_if(mProfile.begin(), mProfile.end(),
                                   [&](const Record& r){ return r.first == layerName; });
        // if this is a new layer, push it into the vector
        if (record == mProfile.end())
            mProfile.push_back(std::make_pair(layerName, ms));
        // if the layer is already in the vector, accumulate its time: the network runs
        // TIMING_ITERATIONS times, so each layer is reported repeatedly and the times must be summed
        else
            record->second += ms;
    }

    // print each layer's time, divided by the total number of iterations
    void printLayerTimes()
    {
        float totalTime = 0;
        for (size_t i = 0; i < mProfile.size(); i++)
        {
            printf("%-40.40s %4.3fms\n", mProfile[i].first.c_str(), mProfile[i].second / TIMING_ITERATIONS);
            totalTime += mProfile[i].second;
        }
        printf("Time over all layers: %4.3f\n", totalTime / TIMING_ITERATIONS);
    }

} gProfiler;

void caffeToTRTModel(const std::string& deployFile,            // name for caffe prototxt
                     const std::string& modelFile,             // name for model
                     const std::vector<std::string>& outputs,  // network outputs
                     unsigned int maxBatchSize,                // batch size - NB must be at least as large as the batch we want to run with
                     IHostMemory *&trtModelStream)
{
    // create API root class - must span the lifetime of the engine usage
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();

    // parse the caffe model to populate the network, then set the outputs
    ICaffeParser* parser = createCaffeParser();

    // check whether the current GPU supports FP16 precision
    bool useFp16 = builder->platformHasFastFp16();

    // create a 16-bit model if it's natively supported
    DataType modelDataType = useFp16 ? DataType::kHALF : DataType::kFLOAT;

    const IBlobNameToTensor *blobNameToTensor =
        parser->parse(locateFile(deployFile).c_str(),  // caffe deploy file
                      locateFile(modelFile).c_str(),   // caffe model file
                      *network,                        // network definition that the parser will populate
                      modelDataType);
    assert(blobNameToTensor != nullptr);

    // the caffe file has no notion of outputs, so we need to manually say which tensors the engine should generate
    for (auto& s : outputs)
        network->markOutput(*blobNameToTensor->find(s.c_str()));

    // Build the engine
    builder->setMaxBatchSize(maxBatchSize);
    builder->setMaxWorkspaceSize(16 << 20);

    // enable half2mode: set up the network for paired-fp16 format if available
    if (useFp16)
        builder->setFp16Mode(true);

    ICudaEngine* engine = builder->buildCudaEngine(*network);
    assert(engine);

    // we don't need the network any more, and we can destroy the parser
    network->destroy();
    parser->destroy();

    // serialize the engine, then close everything down
    trtModelStream = engine->serialize();
    engine->destroy();
    builder->destroy();
    shutdownProtobufLibrary();
}

void timeInference(ICudaEngine* engine, int batchSize)
{
    // input and output buffer pointers that we pass to the engine - the engine requires exactly ICudaEngine::getNbBindings()
    // of these, but in this case we know that there is exactly one input and one output.
    assert(engine->getNbBindings() == 2);
    void* buffers[2];

    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // note that indices are guaranteed to be less than ICudaEngine::getNbBindings()
    int inputIndex = engine->getBindingIndex(INPUT_BLOB_NAME),
        outputIndex = engine->getBindingIndex(OUTPUT_BLOB_NAME);

    // allocate GPU buffers: query the input and output dimensions automatically
    Dims3 inputDims = static_cast<Dims3&&>(engine->getBindingDimensions(inputIndex)),
          outputDims = static_cast<Dims3&&>(engine->getBindingDimensions(outputIndex));
    size_t inputSize = batchSize * inputDims.d[0] * inputDims.d[1] * inputDims.d[2] * sizeof(float);
    size_t outputSize = batchSize * outputDims.d[0] * outputDims.d[1] * outputDims.d[2] * sizeof(float);

    CHECK(cudaMalloc(&buffers[inputIndex], inputSize));
    CHECK(cudaMalloc(&buffers[outputIndex], outputSize));

    IExecutionContext* context = engine->createExecutionContext();

    // set the profiler
    context->setProfiler(&gProfiler);

    // zero the input buffer
    CHECK(cudaMemset(buffers[inputIndex], 0, inputSize));

    for (int i = 0; i < TIMING_ITERATIONS; i++)
        context->execute(batchSize, buffers);

    // release the context and buffers
    context->destroy();
    CHECK(cudaFree(buffers[inputIndex]));
    CHECK(cudaFree(buffers[outputIndex]));
}

int main(int argc, char** argv)
{
    std::cout << "Building and running a GPU inference engine for GoogleNet, N=4..." << std::endl;

    // parse the caffe model and the mean file
    IHostMemory *trtModelStream{nullptr};
    caffeToTRTModel("googlenet.prototxt", "googlenet.caffemodel",
                    std::vector<std::string>{OUTPUT_BLOB_NAME}, BATCH_SIZE, trtModelStream);
    assert(trtModelStream != nullptr);

    // create an engine
    IRuntime* infer = createInferRuntime(gLogger);
    assert(infer != nullptr);
    ICudaEngine* engine = infer->deserializeCudaEngine(trtModelStream->data(), trtModelStream->size(), nullptr);
    assert(engine != nullptr);

    printf("Bindings after deserializing:\n");
    for (int bi = 0; bi < engine->getNbBindings(); bi++)
    {
        if (engine->bindingIsInput(bi) == true)
            printf("Binding %d (%s): Input.\n", bi, engine->getBindingName(bi));
        else
            printf("Binding %d (%s): Output.\n", bi, engine->getBindingName(bi));
    }

    // run inference with null data to time network performance
    timeInference(engine, BATCH_SIZE);

    engine->destroy();
    infer->destroy();
    trtModelStream->destroy();

    // print the profiling results
    gProfiler.printLayerTimes();

    std::cout << "Done." << std::endl;
    return 0;
}
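To reproduce the run below: the TensorRT samples are typically built by running make under /usr/src/tensorrt/samples, which places the sample_googlenet binary in TensorRT's bin directory; the exact paths depend on how TensorRT was installed.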
4 Analysis of the Results
TensorRT's profiling output (batch = 4, iterations = 1000, GPU = GTX 1080 Ti):
myself@admin:~/workspace/study/tensorrt/bin$ ./sample_googlenet
Building and running a GPU inference engine for GoogleNet, N=4...
Bindings after deserializing:
Binding 0 (data): Input.
Binding 1 (prob): Output.
conv1/7x7_s2 + conv1/relu_7x7                0.128ms
pool1/3x3_s2                                 0.054ms
pool1/norm1                                  0.031ms
conv2/3x3_reduce + conv2/relu_3x3_reduce     0.029ms
conv2/3x3 + conv2/relu_3x3                   0.193ms
conv2/norm2                                  0.084ms
pool2/3x3_s2                                 0.045ms
inception_3a/1x1 + inception_3a/relu_1x1     0.040ms
inception_3a/3x3 + inception_3a/relu_3x3     0.062ms
inception_3a/5x5 + inception_3a/relu_5x5     0.044ms
inception_3a/pool                            0.020ms
inception_3a/pool_proj + inception_3a/re     0.031ms
inception_3a/1x1 copy                        0.008ms
inception_3b/1x1 + inception_3b/relu_1x1     0.075ms
inception_3b/3x3 + inception_3b/relu_3x3     0.109ms
inception_3b/5x5 + inception_3b/relu_5x5     0.086ms
inception_3b/pool                            0.026ms
inception_3b/pool_proj + inception_3b/re     0.040ms
inception_3b/1x1 copy                        0.012ms
pool3/3x3_s2                                 0.032ms
inception_4a/1x1 + inception_4a/relu_1x1     0.056ms
inception_4a/3x3 + inception_4a/relu_3x3     0.034ms
inception_4a/5x5 + inception_4a/relu_5x5     0.044ms
inception_4a/pool                            0.014ms
inception_4a/pool_proj + inception_4a/re     0.048ms
inception_4a/1x1 copy                        0.007ms
inception_4b/1x1 + inception_4b/relu_1x1     0.059ms
inception_4b/3x3 + inception_4b/relu_3x3     0.037ms
inception_4b/5x5 + inception_4b/relu_5x5     0.059ms
inception_4b/pool                            0.014ms
inception_4b/pool_proj + inception_4b/re     0.051ms
inception_4b/1x1 copy                        0.006ms
inception_4c/1x1 + inception_4c/relu_1x1     0.059ms
inception_4c/3x3 + inception_4c/relu_3x3     0.052ms
inception_4c/5x5 + inception_4c/relu_5x5     0.061ms
inception_4c/pool                            0.014ms
inception_4c/pool_proj + inception_4c/re     0.051ms
inception_4c/1x1 copy                        0.006ms
inception_4d/1x1 + inception_4d/relu_1x1     0.059ms
inception_4d/3x3 + inception_4d/relu_3x3     0.057ms
inception_4d/5x5 + inception_4d/relu_5x5     0.072ms
inception_4d/pool                            0.014ms
inception_4d/pool_proj + inception_4d/re     0.051ms
inception_4d/1x1 copy                        0.005ms
inception_4e/1x1 + inception_4e/relu_1x1     0.063ms
inception_4e/3x3 + inception_4e/relu_3x3     0.063ms
inception_4e/5x5 + inception_4e/relu_5x5     0.071ms
inception_4e/pool                            0.014ms
inception_4e/pool_proj + inception_4e/re     0.052ms
inception_4e/1x1 copy                        0.008ms
pool4/3x3_s2                                 0.014ms
inception_5a/1x1 + inception_5a/relu_1x1     0.079ms
inception_5a/3x3 + inception_5a/relu_3x3     0.040ms
inception_5a/5x5 + inception_5a/relu_5x5     0.071ms
inception_5a/pool                            0.009ms
inception_5a/pool_proj + inception_5a/re     0.072ms
inception_5a/1x1 copy                        0.004ms
inception_5b/1x1 + inception_5b/relu_1x1     0.075ms
inception_5b/3x3 + inception_5b/relu_3x3     0.046ms
inception_5b/5x5 + inception_5b/relu_5x5     0.097ms
inception_5b/pool                            0.009ms
inception_5b/pool_proj + inception_5b/re     0.072ms
inception_5b/1x1 copy                        0.005ms
pool5/7x7_s1                                 0.012ms
loss3/classifier                             0.019ms
prob                                         0.007ms
Time over all layers: 2.978
Done.
This is very fast: a single forward pass of the whole network takes only about 3 ms for a batch of four images (roughly 0.75 ms per image).
Now compare with the profiling of GoogLeNet without TensorRT. This run uses the model files bundled with the Caffe code, profiled with Caffe's own time command.
The batch size in deploy.prototxt was changed to 4. Without TensorRT optimization the run is much slower, so only 50 iterations were used, which is still enough to make the point. Also, since there is no TensorRT optimization, the original network has no layer fusion, and Caffe's time command measures both the forward and the backward pass, so the output is long; most of it has been cut below, keeping only the inception_5* layers.
myself@admin:~/caffe$ ./build/tools/caffe time --model=models/bvlc_googlenet/deploy.prototxt --iterations=50
Average time per layer:
data forward: 0.00454 ms.
data backward: 0.00204 ms.
... (a long stretch omitted) ...
inception_5a/1x1 forward: 4.43762 ms.
inception_5a/1x1 backward: 1.5149 ms.
inception_5a/relu_1x1 forward: 0.10942 ms.
inception_5a/relu_1x1 backward: 0.00126 ms.
inception_5a/3x3_reduce forward: 2.88932 ms.
inception_5a/3x3_reduce backward: 1.17394 ms.
inception_5a/relu_3x3_reduce forward: 0.0859 ms.
inception_5a/relu_3x3_reduce backward: 0.0012 ms.
inception_5a/3x3 forward: 9.88662 ms.
inception_5a/3x3 backward: 3.98626 ms.
inception_5a/relu_3x3 forward: 0.22092 ms.
inception_5a/relu_3x3 backward: 0.00116 ms.
inception_5a/5x5_reduce forward: 0.90482 ms.
inception_5a/5x5_reduce backward: 0.66332 ms.
inception_5a/relu_5x5_reduce forward: 0.01554 ms.
inception_5a/relu_5x5_reduce backward: 0.00128 ms.
inception_5a/5x5 forward: 2.50424 ms.
inception_5a/5x5 backward: 1.49614 ms.
inception_5a/relu_5x5 forward: 0.05624 ms.
inception_5a/relu_5x5 backward: 0.00108 ms.
inception_5a/pool forward: 10.9052 ms.
inception_5a/pool backward: 0.00168 ms.
inception_5a/pool_proj forward: 2.41494 ms.
inception_5a/pool_proj backward: 1.23424 ms.
inception_5a/relu_pool_proj forward: 0.05614 ms.
inception_5a/relu_pool_proj backward: 0.00124 ms.
inception_5a/output forward: 0.20292 ms.
inception_5a/output backward: 0.01312 ms.
inception_5a/output_inception_5a/output_0_split forward: 0.00384 ms.
inception_5a/output_inception_5a/output_0_split backward: 0.00156 ms.
inception_5b/1x1 forward: 6.4108 ms.
inception_5b/1x1 backward: 2.19984 ms.
inception_5b/relu_1x1 forward: 0.16204 ms.
inception_5b/relu_1x1 backward: 0.00146 ms.
inception_5b/3x3_reduce forward: 3.16198 ms.
inception_5b/3x3_reduce backward: 1.70668 ms.
inception_5b/relu_3x3_reduce forward: 0.08388 ms.
inception_5b/relu_3x3_reduce backward: 0.00146 ms.
inception_5b/3x3 forward: 13.2323 ms.
inception_5b/3x3 backward: 5.93336 ms.
inception_5b/relu_3x3 forward: 0.16636 ms.
inception_5b/relu_3x3 backward: 0.00118 ms.
inception_5b/5x5_reduce forward: 1.01018 ms.
inception_5b/5x5_reduce backward: 0.82398 ms.
inception_5b/relu_5x5_reduce forward: 0.02294 ms.
inception_5b/relu_5x5_reduce backward: 0.00118 ms.
inception_5b/5x5 forward: 4.08472 ms.
inception_5b/5x5 backward: 2.8564 ms.
inception_5b/relu_5x5 forward: 0.05658 ms.
inception_5b/relu_5x5 backward: 0.0011 ms.
inception_5b/pool forward: 10.9437 ms.
inception_5b/pool backward: 0.00116 ms.
inception_5b/pool_proj forward: 2.21102 ms.
inception_5b/pool_proj backward: 2.23458 ms.
inception_5b/relu_pool_proj forward: 0.05634 ms.
inception_5b/relu_pool_proj backward: 0.00124 ms.
inception_5b/output forward: 0.26758 ms.
inception_5b/output backward: 0.01492 ms.
pool5/7x7_s1 forward: 2.37076 ms.
pool5/7x7_s1 backward: 0.00188 ms.
pool5/drop_7x7_s1 forward: 0.06108 ms.
pool5/drop_7x7_s1 backward: 0.00134 ms.
loss3/classifier forward: 2.74434 ms.
loss3/classifier backward: 2.75442 ms.
prob forward: 0.28054 ms.
prob backward: 0.06392 ms.
Average Forward pass: 1046.79 ms.
Average Backward pass: 676.121 ms.
Average Forward-Backward: 1723.54 ms.
Total Time: 86177 ms.
*** Benchmark ends ***
First, the total time for a single forward pass:
Without TensorRT optimization, GoogLeNet takes 1046.79 ms; with TensorRT optimization it takes 2.98 ms.
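That works out to 2.98 / 1046.79 ≈ 0.3%, i.e. roughly a 350× reduction in forward-pass time for this batch-4 setup.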
Next, compare a few individual layers:
- inception_5b/1x1 + inception_5b/relu_1x1
Before optimization:
inception_5b/1x1 forward: 6.4108 ms.
inception_5b/relu_1x1 forward: 0.16204 ms.
Total: 6.57 ms
After optimization:
inception_5b/1x1 + inception_5b/relu_1x1 0.075ms
Total: 0.075 ms
- inception_5b/3x3 + inception_5b/relu_3x3
Before optimization:
inception_5b/3x3 forward: 13.2323 ms.
inception_5b/relu_3x3 forward: 0.16636 ms.
Total: 13.40 ms
After optimization:
inception_5b/3x3 + inception_5b/relu_3x3 0.046ms
Total: 0.046 ms
- inception_5b/5x5 + inception_5b/relu_5x5
Before optimization:
inception_5b/5x5 forward: 4.08472 ms.
inception_5b/relu_5x5 forward: 0.05658 ms.
Total: 4.14 ms
After optimization:
inception_5b/5x5 + inception_5b/relu_5x5 0.097ms
Total: 0.097 ms
- In addition, these layers:
Before optimization:
inception_5b/pool forward: 10.9437 ms.
inception_5b/pool_proj forward: 2.21102 ms.
inception_5b/relu_pool_proj forward: 0.05634 ms.
inception_5b/output forward: 0.26758 ms.
pool5/7x7_s1 forward: 2.37076 ms.
After optimization:
inception_5b/pool 0.009ms
inception_5b/pool_proj + inception_5b/re 0.072ms
inception_5b/1x1 copy 0.005ms
pool5/7x7_s1 0.012ms
Before optimization the 3×3 convolution takes even longer than the 5×5 convolution because the 3×3 branch has far more channels than the 5×5 branch; after TensorRT optimization the gap largely disappears, and the 5×5 convolution even becomes the slower of the two.
TensorRT really does cut the forward-pass time dramatically: a single forward pass takes only about 0.3% of the unoptimized time. That said, this is just a classification network built from standard stacked convolutions. For networks with more complex structures, such as detection networks or networks that use non-standard convolutions like dilated or deformable convolutions, the speedup will presumably not be this large. Still, judging from the benchmark numbers NVIDIA has published, the gains remain substantial.