DL | YOLOv3: Translation and Interpretation of the Paper "YOLOv3: An Incremental Improvement"
Contents
Translation and Interpretation of the YOLOv3 Paper
Abstract
1. Introduction
2. The Deal
Paper link: https://arxiv.org/pdf/1804.02767.pdf
Translation and Interpretation of the YOLOv3 Paper
Abstract
We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that’s pretty swell. It’s a little bigger than last time but more accurate. It’s still fast though, don’t worry. At 320 × 320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 AP50 in 51 ms on a Titan X, compared to 57.5 AP50 in 198 ms by RetinaNet, similar performance but 3.8× faster. As always, all the code is online at https://pjreddie.com/yolo/.
1. Introduction
Sometimes you just kinda phone it in for a year, you know? I didn’t do a whole lot of research this year. Spent a lot of time on Twitter. Played around with GANs a little. I had a little momentum left over from last year [12] [1]; I managed to make some improvements to YOLO. But, honestly, nothing like super interesting, just a bunch of small changes that make it better. I also helped out with other people’s research a little. Actually, that’s what brings us here today. We have a camera-ready deadline [4] and we need to cite some of the random updates I made to YOLO but we don’t have a source. So get ready for a TECH REPORT! The great thing about tech reports is that they don’t need intros, y’all know why we’re here. So the end of this introduction will signpost for the rest of the paper. First we’ll tell you what the deal is with YOLOv3. Then we’ll tell you how we do. We’ll also tell you about some things we tried that didn’t work. Finally we’ll contemplate what this all means.
2. The Deal
So here’s the deal with YOLOv3: We mostly took good ideas from other people. We also trained a new classifier network that’s better than the other ones. We’ll just take you through the whole system from scratch so you can understand it all.
Figure 1. We adapt this figure from the Focal Loss paper [9]. YOLOv3 runs significantly faster than other detection methods with comparable performance. Times from either an M40 or Titan X, they are basically the same GPU.
2.1. Bounding Box Prediction
Following YOLO9000 our system predicts bounding boxes using dimension clusters as anchor boxes [15]. The network predicts 4 coordinates for each bounding box, tx, ty, tw, th. If the cell is offset from the top left corner of the image by (cx, cy) and the bounding box prior has width and height pw, ph, then the predictions correspond to:

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^tw
bh = ph · e^th
During training we use sum of squared error loss. If the ground truth for some coordinate prediction is t̂* our gradient is the ground truth value (computed from the ground truth box) minus our prediction: t̂* − t*. This ground truth value can be easily computed by inverting the equations above.
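As a minimal Python sketch of the two directions (helper names here are illustrative, not taken from the Darknet source), decoding applies the equations above and encoding inverts them to produce the t̂ targets; training then regresses the raw t values toward those targets with sum-of-squared-error loss:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(t, cell_offset, prior_wh):
    """Apply the box equations: raw (tx, ty, tw, th) -> (bx, by, bw, bh)."""
    tx, ty, tw, th = t
    cx, cy = cell_offset      # offset of the grid cell from the image's top-left corner
    pw, ph = prior_wh         # width/height of the bounding-box prior (anchor)
    return np.array([sigmoid(tx) + cx,
                     sigmoid(ty) + cy,
                     pw * np.exp(tw),
                     ph * np.exp(th)])

def encode_target(gt_box, cell_offset, prior_wh):
    """Invert the equations to get the regression targets t-hat for a ground-truth box."""
    bx, by, bw, bh = gt_box
    cx, cy = cell_offset
    pw, ph = prior_wh
    logit = lambda p: np.log(p / (1.0 - p))   # inverse of sigmoid
    return np.array([logit(bx - cx), logit(by - cy),
                     np.log(bw / pw), np.log(bh / ph)])
```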
Figure 2. Bounding boxes with dimension priors and location prediction. We predict the width and height of the box as offsets from cluster centroids. We predict the center coordinates of the box relative to the location of filter application using a sigmoid function. This figure blatantly self-plagiarized from [15].
YOLOv3 predicts an objectness score for each bounding box using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold we ignore the prediction, following [17]. We use the threshold of .5. Unlike [17] our system only assigns one bounding box prior for each ground truth object. If a bounding box prior is not assigned to a ground truth object it incurs no loss for coordinate or class predictions, only objectness.
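A rough sketch of that assignment rule, using a hypothetical helper that builds the per-prior objectness target and loss mask for one ground-truth object (this is not the original Darknet code):

```python
def objectness_targets(ious, best_prior, ignore_thresh=0.5):
    """Objectness target and loss mask for one ground-truth object.
    ious[i] is the IoU between bounding-box prior i and the ground-truth box."""
    targets = [0.0] * len(ious)
    mask    = [1.0] * len(ious)        # 1 = this prior contributes to the objectness loss
    for i, iou in enumerate(ious):
        if i == best_prior:
            targets[i] = 1.0           # only the single best-overlapping prior gets 1
        elif iou > ignore_thresh:
            mask[i] = 0.0              # good-but-not-best overlap: the prediction is ignored
    return targets, mask

print(objectness_targets([0.1, 0.6, 0.9], best_prior=2))
# ([0.0, 0.0, 1.0], [1.0, 0.0, 1.0])
```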
2.2. Class Prediction
Each box predicts the classes the bounding box may contain using multilabel classification. We do not use a softmax as we have found it is unnecessary for good performance, instead we simply use independent logistic classifiers. During training we use binary cross-entropy loss for the class predictions. This formulation helps when we move to more complex domains like the Open Images Dataset [7]. In this dataset there are many overlapping labels (i.e. Woman and Person). Using a softmax imposes the assumption that each box has exactly one class which is often not the case. A multilabel approach better models the data.
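In PyTorch terms, a sketch of the multilabel class head (dummy shapes and labels; not the original implementation):

```python
import torch
import torch.nn as nn

num_classes = 80
class_logits = torch.randn(8, num_classes)   # dummy logits for 8 predicted boxes
labels = torch.zeros(8, num_classes)
labels[0, 0] = 1.0
labels[0, 1] = 1.0                           # multilabel: box 0 carries two overlapping labels

# Independent logistic classifiers: one sigmoid per class with binary cross-entropy,
# rather than a softmax that forces exactly one class per box.
loss = nn.BCEWithLogitsLoss()(class_logits, labels)
print(loss.item())
```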
2.3. Predictions Across Scales
YOLOv3 predicts boxes at 3 different scales. Our system extracts features from those scales using a similar concept to feature pyramid networks [8]. From our base feature extractor we add several convolutional layers. The last of these predicts a 3-d tensor encoding bounding box, objectness, and class predictions. In our experiments with COCO [10] we predict 3 boxes at each scale so the tensor is N × N × [3 ∗ (4 + 1 + 80)] for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.
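For illustration, the detection tensor at one scale can be emitted by a 1 × 1 convolution; the 512 input channels and the 13 × 13 grid below are placeholders rather than the real layer sizes:

```python
import torch
import torch.nn as nn

num_anchors, num_classes = 3, 80
# 1x1 convolution producing 3 * (4 + 1 + 80) = 255 output channels per grid cell
head = nn.Conv2d(512, num_anchors * (4 + 1 + num_classes), kernel_size=1)

features = torch.randn(1, 512, 13, 13)                 # toy feature map at one scale
out = head(features)                                   # (1, 255, 13, 13)
# reorganize into N x N x [3 * (4 + 1 + 80)], one 85-vector per anchor per cell
out = out.view(1, num_anchors, 4 + 1 + num_classes, 13, 13).permute(0, 3, 4, 1, 2)
print(out.shape)                                       # torch.Size([1, 13, 13, 3, 85])
```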
Next we take the feature map from 2 layers previous and upsample it by 2×. We also take a feature map from earlier in the network and merge it with our upsampled features using concatenation. This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. We then add a few more convolutional layers to process this combined feature map, and eventually predict a similar tensor, although now twice the size. We perform the same design one more time to predict boxes for the final scale. Thus our predictions for the 3rd scale benefit from all the prior computation as well as fine-grained features from early on in the network. We still use k-means clustering to determine our bounding box priors. We just sort of chose 9 clusters and 3 scales arbitrarily and then divide up the clusters evenly across scales. On the COCO dataset the 9 clusters were: (10 × 13), (16 × 30), (33 × 23), (30 × 61), (62 × 45), (59 × 119), (116 × 90), (156 × 198), (373 × 326). A sketch of the merge step is given below.
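A minimal sketch of the upsample-and-concatenate step (channel counts are illustrative, not the actual Darknet-53 widths), together with the anchor split across the three scales:

```python
import torch
import torch.nn.functional as F

# Toy feature maps: 'deep' is taken from 2 layers back, 'earlier' from a higher-resolution
# point earlier in the network.
deep    = torch.randn(1, 256, 13, 13)
earlier = torch.randn(1, 512, 26, 26)

up     = F.interpolate(deep, scale_factor=2, mode="nearest")   # 13x13 -> 26x26
merged = torch.cat([up, earlier], dim=1)                       # concatenate along channels
print(merged.shape)                                            # torch.Size([1, 768, 26, 26])

# The 9 k-means bounding-box priors (w, h) on COCO, divided 3 per scale:
anchors = [(10, 13), (16, 30), (33, 23),        # smallest priors, finest grid
           (30, 61), (62, 45), (59, 119),
           (116, 90), (156, 198), (373, 326)]   # largest priors, coarsest grid
anchors_per_scale = [anchors[0:3], anchors[3:6], anchors[6:9]]
```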
2.4. Feature Extractor
We use a new network for performing feature extraction. Our new network is a hybrid approach between the network used in YOLOv2, Darknet-19, and that newfangled residual network stuff. Our network uses successive 3 × 3 and 1 × 1 convolutional layers but now has some shortcut connections as well and is significantly larger. It has 53 convolutional layers so we call it.... wait for it..... Darknet-53!
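A sketch of the repeated building block, assuming the layer widths and LeakyReLU slope used in common Darknet-53 reimplementations (not verified against the original cfg):

```python
import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    """Residual block in the Darknet-53 style: 1x1 bottleneck, 3x3 conv, shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.block(x)   # shortcut connection

x = torch.randn(1, 64, 52, 52)
print(DarknetResidual(64)(x).shape)   # torch.Size([1, 64, 52, 52])
```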
This new network is much more powerful than Darknet-19 but still more efficient than ResNet-101 or ResNet-152. Here are some ImageNet results:
Table 2. Comparison of backbones. Accuracy, billions of operations, billion floating point operations per second, and FPS for various networks.
Each network is trained with identical settings and tested at 256 × 256, single crop accuracy. Run times are measured on a Titan X at 256 × 256. Thus Darknet-53 performs on par with state-of-the-art classifiers but with fewer floating point operations and more speed. Darknet-53 is better than ResNet-101 and 1.5× faster. Darknet-53 has similar performance to ResNet-152 and is 2× faster. Darknet-53 also achieves the highest measured floating point operations per second. This means the network structure better utilizes the GPU, making it more efficient to evaluate and thus faster. That’s mostly because ResNets have just way too many layers and aren’t very efficient.
2.5. Training
We still train on full images with no hard negative mining or any of that stuff. We use multi-scale training, lots of data augmentation, batch normalization, all the standard stuff. We use the Darknet neural network framework for training and testing [14].
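A tiny sketch of the multi-scale part, assuming the resolution range used in the published Darknet configs (320–608 in steps of the 32-pixel network stride):

```python
import random

# Multi-scale training: every few batches the input resolution is re-drawn
# as a multiple of the network stride.
def pick_input_size(stride=32, low=320, high=608):
    return random.randrange(low, high + stride, stride)

print(pick_input_size())   # e.g. 416
```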