RGB + depth image semantic segmentation: paper reading notes (ICRA 2021)

2022-07-21 05:05:00 【Blue feather birds】

Paper: Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis

The main contributions of this paper are as follows:
Adding the depth image improves the segmentation mIoU compared with using RGB alone.
The network is designed so that it can be run with TensorRT, which in turn speeds up segmentation on the NX board, for example in robot scenarios with limited computing power and battery capacity.
An improved ResNet-based encoder and decoder reduce the amount of computation and improve efficiency.

The semantic segmentation in this paper takes RGB images and depth images as input and mainly targets indoor scenes.
The RGB image and the depth map enter two separate branches in the encoder and are then fused. The structure is as follows:
[Figure 1: network architecture]

It looks very complicated, but it is not so complicated once it is broken down.

Fusing features at more than one stage improves the segmentation result, so the depth features and the RGB features are combined by an RGB-D Fusion step at the intermediate stages as well.
The encoder uses ResNet as its backbone. To improve efficiency, it does not use dilated (atrous) convolution as DeepLabv3 does; it uses strided convolution instead.
At the end of the encoder, the feature map is 32 times smaller than the input image.

The backbone is ResNet34, but every 3x3 convolution is replaced by a block in which the 3x3 convolution is factorized into a 3x1 and a 1x3 convolution with a ReLU in between.
This module is called the Non-Bottleneck-1D-Block. It is reported to shorten inference time and improve performance.
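
To make the factorization concrete, here is a minimal PyTorch sketch of such a block. Only the 3x1 / ReLU / 1x3 decomposition comes from the notes above; the residual connection, the use of two factorized convolutions per block, and the BatchNorm placement are assumptions borrowed from the usual ResNet basic-block layout, and the class name is only a label for this example.

```python
import torch
from torch import nn


class NonBottleneck1D(nn.Module):
    """Sketch of a residual block whose 3x3 convs are split into 3x1 and 1x3."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            # first factorized "3x3": 3x1, ReLU, 1x3
            nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0), bias=False),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1), bias=False),
            nn.BatchNorm2d(channels),  # assumption: BatchNorm after each pair
            nn.ReLU(),
            # second factorized "3x3"
            nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0), bias=False),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1), bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # assumption: identity shortcut as in a ResNet basic block
        return self.act(x + self.body(x))


def conv_weight_count(module: nn.Module) -> int:
    """Number of convolution weights, ignoring BatchNorm parameters."""
    return sum(m.weight.numel() for m in module.modules() if isinstance(m, nn.Conv2d))


block = NonBottleneck1D(64)
features = torch.randn(1, 64, 15, 20)
print(block(features).shape)      # spatial size and channels are preserved
print(conv_weight_count(block))   # 4 * 3 * 64 * 64 weights
print(2 * 9 * 64 * 64)            # two plain 3x3 convs would need 1.5x as many
```

Each factorized pair needs 6·C² weights instead of the 9·C² of a plain 3x3 convolution, which is where the saving in computation comes from.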

The RGB-D Fusion uses Squeeze-and-Excitation modules. The details are shown in the green module in the lower left corner of the figure.
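
A rough PyTorch sketch of such a fusion step follows. It assumes that each modality is reweighted channel-wise by its own Squeeze-and-Excitation module and that the two results are then summed; the reduction ratio of 16 and the class names are illustrative choices, not values taken from the paper.

```python
import torch
from torch import nn


class SqueezeExcite(nn.Module):
    """Channel attention: global average (squeeze), two 1x1 convs (excite)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.excite = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        squeezed = x.mean(dim=(2, 3), keepdim=True)  # (N, C, 1, 1)
        return x * self.excite(squeezed)             # per-channel gate in (0, 1)


class RGBDFusion(nn.Module):
    """Assumed fusion: reweight both modalities, then add depth into RGB."""

    def __init__(self, channels: int):
        super().__init__()
        self.se_rgb = SqueezeExcite(channels)
        self.se_depth = SqueezeExcite(channels)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        return self.se_rgb(rgb) + self.se_depth(depth)


fusion = RGBDFusion(64)
rgb_feat = torch.randn(2, 64, 30, 40)
depth_feat = torch.randn(2, 64, 30, 40)
print(fusion(rgb_feat, depth_feat).shape)  # same shape as either input
```

The fused tensor has the same shape as the RGB features, so it can be fed straight into the next stage of the RGB branch.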

The context module addresses ResNet's limited receptive field by combining features at different scales, similar to the Pyramid Pooling Module in PSPNet.
Its average pooling is modified: because TensorRT only supports pooling with a fixed size, adaptive pooling is replaced with pooling sizes that depend on the input resolution.
Therefore, the pooling sizes differ depending on the dataset.
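
As a worked example of what "pooling size depends on the input resolution" means, the sketch below derives fixed kernel sizes from the encoder output size. The input resolutions and the bin counts are hypothetical values chosen so that the numbers divide evenly; they are not the settings reported in the paper.

```python
def encoder_output_size(input_hw, factor=32):
    """Spatial size of the encoder output for a given input size."""
    height, width = input_hw
    return height // factor, width // factor


def fixed_pooling_kernels(feature_hw, bins):
    """Kernel size (used as stride too) that yields a bins x bins grid."""
    height, width = feature_hw
    return {b: (height // b, width // b) for b in bins}


# Hypothetical dataset A: 480x640 input -> 15x20 features
feat_a = encoder_output_size((480, 640))
print(feat_a)                                  # (15, 20)
print(fixed_pooling_kernels(feat_a, (1, 5)))   # {1: (15, 20), 5: (3, 4)}

# Hypothetical dataset B: 512x1024 input -> 16x32 features
feat_b = encoder_output_size((512, 1024))
print(feat_b)                                    # (16, 32)
print(fixed_pooling_kernels(feat_b, (1, 2, 4)))  # {1: (16, 32), 2: (8, 16), 4: (4, 8)}
```

Because the kernel sizes are plain constants once the input resolution is known, the pooling layers no longer need adaptive pooling at inference time.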

Decoder
There are 3 decoder modules.
Transposed convolution is not used for upsampling, because it is computationally expensive and produces gridding artifacts, as shown in the figure below:
[Figure: gridding artifacts]
The paper uses a learned upsampling method instead (the dark green block in Figure 1):
nearest-neighbor (NN) upsampling increases the resolution, followed by a 3x3 depthwise convolution that combines adjacent features.
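
The learned upsampling described here can be sketched in a few lines of PyTorch. The factor of 2 and the class name are assumptions for illustration; how the depthwise weights are initialized is not covered in these notes and is left at PyTorch's default.

```python
import torch
from torch import nn


class LearnedUpsample2x(nn.Module):
    """Nearest-neighbor upsampling followed by a 3x3 depthwise convolution."""

    def __init__(self, channels: int):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        # groups=channels makes the convolution depthwise: one 3x3 filter per channel
        self.depthwise = nn.Conv2d(
            channels, channels, kernel_size=3, padding=1, groups=channels
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.depthwise(self.upsample(x))


up = LearnedUpsample2x(8)
coarse = torch.randn(1, 8, 15, 20)
print(up(coarse).shape)            # height and width are doubled
print(up.depthwise.weight.shape)   # one 3x3 kernel per channel
```

The depthwise convolution adds only 9 weights per channel, so it is far cheaper than a transposed convolution over all channel pairs.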

However, some details are still missing after upsampling, because they were lost during downsampling in the encoder.
So the authors add skip connections that feed encoder features into the decoder at the same resolution.
A 1x1 convolution is used to make the channel counts match.
Doing so restores some of the detail.
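
A minimal sketch of such a skip connection is shown below. The notes only say that encoder features are connected to the decoder after a 1x1 convolution; combining them by element-wise addition is an assumption made here, and the channel counts are arbitrary example values.

```python
import torch
from torch import nn


class SkipFusion(nn.Module):
    """Project encoder features with a 1x1 conv, then merge into the decoder."""

    def __init__(self, encoder_channels: int, decoder_channels: int):
        super().__init__()
        self.project = nn.Conv2d(encoder_channels, decoder_channels, kernel_size=1)

    def forward(self, decoder_feat: torch.Tensor, encoder_feat: torch.Tensor) -> torch.Tensor:
        # assumption: the merge is an element-wise addition
        return decoder_feat + self.project(encoder_feat)


skip = SkipFusion(encoder_channels=256, decoder_channels=128)
decoder_feat = torch.randn(1, 128, 30, 40)
encoder_feat = torch.randn(1, 256, 30, 40)   # same resolution, different channels
print(skip(decoder_feat, encoder_feat).shape)
```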

Once the feature map has been recovered to a size 4x smaller than the input image, a 3x3 convolution layer is applied, and then 2 more upsampling layers restore the resolution of the input image.
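
The resolution bookkeeping can be checked with a little arithmetic. The sketch below assumes that each of the 3 decoder modules and each of the 2 final layers upsamples by a factor of 2, which is what is needed to get from 1/32 to 1/4 and then to full resolution; the 480x640 input size is just an example.

```python
def resolution_trace(input_hw, decoder_modules=3, final_upsamples=2):
    """List the feature-map size after the encoder and after each upsampling."""
    height, width = input_hw
    trace = [("input", (height, width))]
    height, width = height // 32, width // 32
    trace.append(("encoder output", (height, width)))
    for i in range(decoder_modules):
        height, width = height * 2, width * 2
        trace.append((f"decoder module {i + 1}", (height, width)))
    for i in range(final_upsamples):
        height, width = height * 2, width * 2
        trace.append((f"final upsampling {i + 1}", (height, width)))
    return trace


for name, size in resolution_trace((480, 640)):
    print(f"{name}: {size[0]}x{size[1]}")
# 480x640 -> 15x20 -> 30x40 -> 60x80 -> 120x160 -> 240x320 -> 480x640
```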

Usually the loss is computed by comparing only the final result with the ground truth. To avoid relying on the final result alone, every decoder module also outputs a prediction,
and a loss is computed against the ground truth scaled to the corresponding resolution. In this way the loss is computed at multiple scales.
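
One way to write such a multi-scale loss in PyTorch is sketched below. Cross-entropy, nearest-neighbor downscaling of the labels, and equal weighting of all scales are assumptions of this example; the notes do not say how the individual losses are weighted.

```python
import torch
import torch.nn.functional as F


def multi_scale_loss(predictions, target):
    """Sum of cross-entropy losses over predictions at several resolutions.

    predictions: list of logits, each of shape (N, C, h_i, w_i)
    target:      class ids of shape (N, H, W), dtype int64
    """
    total = torch.zeros(())
    for logits in predictions:
        size = logits.shape[-2:]
        # nearest-neighbor scaling keeps the labels valid integer class ids
        scaled = F.interpolate(target[:, None].float(), size=size, mode="nearest")
        total = total + F.cross_entropy(logits, scaled[:, 0].long())
    return total


num_classes = 40                      # example value
target = torch.randint(0, num_classes, (2, 480, 640))
predictions = [
    torch.randn(2, num_classes, 480, 640),   # final output
    torch.randn(2, num_classes, 120, 160),   # side outputs of decoder modules
    torch.randn(2, num_classes, 60, 80),
    torch.randn(2, num_classes, 30, 40),
]
print(multi_scale_loss(predictions, target))
```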

Parameters

Training runs for 500 epochs with a batch size of 8.
SGD optimizer: momentum = 0.9, learning rate chosen from {0.00125, 0.0025, 0.005, 0.01, 0.02, 0.04}
Adam: learning rate chosen from {0.0001, 0.0004}
Weight decay is 0.0001.
The learning rate is adjusted with PyTorch's one-cycle learning rate scheduler.
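
A possible setup of the SGD variant with PyTorch's `OneCycleLR` is sketched below. The peak learning rate of 0.01 is one value picked from the set above, `steps_per_epoch` is a made-up number that depends on dataset size, and the tiny convolution only stands in for the real network. Passing `cycle_momentum=False` is a choice made here so that the momentum stays at 0.9, since the scheduler cycles momentum by default.

```python
import torch
from torch import nn

model = nn.Conv2d(3, 8, kernel_size=3)    # stand-in for the real network
epochs = 500
steps_per_epoch = 10                      # hypothetical: dataset size / batch size

optimizer = torch.optim.SGD(
    model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4
)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.01,
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
    cycle_momentum=False,                 # keep momentum fixed at 0.9
)

learning_rates = []
for _ in range(epochs * steps_per_epoch):
    # a real loop would run forward, loss.backward() here
    optimizer.step()
    learning_rates.append(optimizer.param_groups[0]["lr"])
    scheduler.step()                      # one scheduler step per batch

print(learning_rates[0])     # starts at max_lr / 25 with the default div_factor
print(max(learning_rates))   # rises to about max_lr, then anneals down
```

Note that the scheduler is stepped once per batch, not once per epoch.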

Frame rate on the AGX:
[Figure: frame-rate results]