
Learning to Incorporate Structure Knowledge for Image Inpainting

2022-07-22 09:21:00 【yijun009】


Motivation
Methods
Experiment
reference

Original paper: link.

Motivation

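Image inpainting aims to fill damaged or unwanted image regions with plausible, richly detailed content. Existing methods can be roughly divided into traditional and deep-learning-based approaches. Traditional methods use low-level features (such as color and texture descriptors) together with priors (such as smoothness and image statistics) or auxiliary data (such as external image databases) to repair images with hand-crafted procedures.
Without a high-level understanding of image content and structure, traditional methods usually struggle to generate semantically meaningful content, especially when a large part of the image is missing or damaged.
Deep-learning-based methods can understand image content by automatically capturing its intrinsic hierarchical representations, and they generate high-level semantic features to synthesize the missing content, outperforming traditional methods on image inpainting.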

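Context Encoder (Pathak et al., 2016) was the first deep-learning inpainting method to use an Encoder-Decoder structure trained with an adversarial strategy. Its results are semantically plausible but lack detail and show visually noticeable artifacts. Many improvements built on it followed in 2017 and 2018.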

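Recently, Nazeri et al. (2019) proposed using explicit image structure knowledge for inpainting. Their two-stage model chains an edge generator and an image generator in series. Xiong et al. (2019) proposed a similar model that replaces the edge generator with a contour generator, which is better suited to scenes with salient objects. Both work fairly well.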

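These two models show that structure knowledge (such as edges and contours) matters for generating plausible, richly detailed images. However, the two-stage strategy may have several limitations: 1) using two generators requires more parameters; 2) because of the series-coupled architecture, inference is easily hurt by unreasonable structure preconditions (that is, if the structure from the first stage is poor, the second stage is likely to be poor too); 3) without explicit structure guidance in the loss function, the model may fail to synthesize enough structural information, since that information can be weakened or forgotten when the network is very deep or the structure is sparse.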

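Based on these insights, we propose a multi-task framework to better incorporate structure knowledge. A shared generator simultaneously produces the completed image and the corresponding structure knowledge. The benefit runs both ways: Nazeri et al. (2019) and Xiong et al. (2019) showed that structure priors help image completion, and conversely, complete structure is more likely to be recovered from a relatively complete image than from a corrupted one.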

We summarize the main contributions as follows:

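We propose a multi-task learning framework that incorporates image structure knowledge to assist image inpainting.
We propose a structure embedding scheme that explicitly provides structure preconditions for inpainting, together with an attention mechanism that uses similar patches in the image to refine the generated structure and content.
We propose a novel pyramid structure loss for structure learning and embedding. Extensive experiments were conducted to evaluate the performance of the method.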

Methods

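Structure knowledge: gradient maps computed with the Sobel operator on three channels in two directions, six maps in total, plus a contour map computed with the Canny operator.
Multi-task framework: earlier schemes that used structure knowledge were two-stage; this paper generates the structure knowledge and the image simultaneously, so that each assists the other.
Attention mechanism: not the usual attention, but one combined with non-local attention that uses similar patches in the image to refine the generated structure and content.
Pyramid structure loss: in this paper it uses gradient maps at two scales.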
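Those are the main points of the paper. Let's first look at the ablation study to see which component helps most.
[Figure: ablation study results]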
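As a rough illustration of the structure knowledge described above, the six Sobel gradient maps (three channels, two directions) can be sketched in plain NumPy. The helper names `filter2d_same` and `sobel_gradient_maps` and the edge-replicating padding are assumptions made for this sketch, not taken from the paper; the Canny contour map is left out and would normally come from an image-processing library.

```python
import numpy as np

# Standard 3x3 Sobel kernels for the horizontal (x) and vertical (y) directions.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T


def filter2d_same(channel, kernel):
    """Cross-correlate a 2-D array with a small odd-sized kernel, keeping the size.

    Borders are handled by replicating edge pixels (an assumption of this sketch).
    """
    kh, kw = kernel.shape
    padded = np.pad(channel, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    h, w = channel.shape
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def sobel_gradient_maps(image):
    """image: (H, W, 3) float array -> (H, W, 6) array of x/y gradients per channel."""
    maps = []
    for c in range(image.shape[2]):
        maps.append(filter2d_same(image[:, :, c], SOBEL_X))
        maps.append(filter2d_same(image[:, :, c], SOBEL_Y))
    return np.stack(maps, axis=-1)


# Usage: a horizontal ramp has a constant x-gradient and no y-gradient.
ramp = np.tile(np.arange(5, dtype=np.float64), (5, 1))
image = np.stack([ramp] * 3, axis=-1)
grads = sobel_gradient_maps(image)
```

Channels 0, 2 and 4 of `grads` hold the x-direction gradients and channels 1, 3 and 5 the y-direction gradients; that ordering is just a choice made here.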

Framework:

[Figure: overall framework]
Inputs and outputs:
$\left(\mathbf{I}_{\text {pred}}, \mathbf{C}_{\text {pred}}^{(s)}\right)=G(\hat{\mathbf{I}}, \hat{\mathbf{C}}, \hat{\mathbf{E}}, \mathbf{M})$
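From left to right, the seven symbols denote: the output image, the output gradient maps at each scale $s$, the network (generator), the masked input image, the masked gradient map, the masked contour map, and the mask. (Both regular and irregular masks are used.)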
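To make the input side of the formula concrete, here is a minimal sketch of how the four inputs might be assembled. The post does not say how they are combined, so the channel-wise concatenation, the mask convention (1 inside the hole, 0 elsewhere), and the function name `build_generator_input` are all assumptions for illustration only.

```python
import numpy as np


def build_generator_input(image, grad, edge, mask):
    """Assemble a hypothetical generator input.

    image: (H, W, 3) ground-truth image
    grad:  (H, W, 6) Sobel gradient maps
    edge:  (H, W, 1) binary contour map
    mask:  (H, W, 1) with 1 inside the hole and 0 elsewhere (assumed convention)
    """
    keep = 1.0 - mask                      # 1 where pixels are known
    masked_image = image * keep            # corresponds to I-hat
    masked_grad = grad * keep              # corresponds to C-hat
    masked_edge = edge * keep              # corresponds to E-hat
    # Assumed: inputs are stacked along the channel axis (3 + 6 + 1 + 1 = 11).
    return np.concatenate([masked_image, masked_grad, masked_edge, mask], axis=-1)


# Usage with dummy data and a square (regular) hole.
h = w = 8
mask = np.zeros((h, w, 1))
mask[2:5, 2:5, 0] = 1.0
net_in = build_generator_input(np.ones((h, w, 3)), np.ones((h, w, 6)),
                               np.ones((h, w, 1)), mask)
```

An irregular mask would be handled the same way; only the contents of `mask` change.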

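The overall model is a GAN: the generator is an Encoder-Decoder and the discriminator is a PatchGAN discriminator.

The encoder follows Nazeri et al. (2019): it downsamples twice and is followed by 8 residual dense blocks. The gradient maps and the output image share this one encoder, since at this point it is only extracting features.

The decoder contains the newly designed attention mechanism and the structure embedding module, which assist each other.

In the lower-right dashed box of the framework figure, the gray gradient maps are the two generated scales, each with its own loss. At the upper right, the output of the attention layer is concatenated with the gradient maps at the different scales to generate the output image. The attention module and the structure embedding module are described in more detail next.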

Attention Layer

The input feature map is split into patches, and the similarity between patches is measured:
Given an input feature map, we first extract the feature patches and calculate the cosine similarity $s_{i, j}$ of each pair of the patches:
$s_{i, j}=\left\langle\frac{p_{i}}{\left\|p_{i}\right\|_{2}}, \frac{p_{j}}{\left\|p_{j}\right\|_{2}}\right\rangle$
where $p_{i}$ and $p_{j}$ are the $i$-th and $j$-th patch of the input feature map $\mathbf{x}$ respectively.
A softmax then turns these similarities into scores:
$\hat{s}_{i, j}=\frac{e^{s_{i, j}}}{\sum_{j=1}^{m} e^{s_{i, j}}}$
The scores are then used to weight the original feature-map patches, giving the response:
Supposing a total of $m$ patches are extracted, the response of a position $o_{i}$ in the output feature map is calculated as the weighted sum of the patch features:
$o_{i}=\sum_{j=1}^{m} \hat{s}_{i, j} p_{j}$
Finally, the response is added back in residual form:

In particular, as shown in Figure 4, we formulate all the operations into convolution forms, and make it a residual block which thus can be seamlessly embedded into our architecture:
$\mathbf{y}=\mathbf{x}+\gamma \mathbf{o}$
where $\mathbf{y}$ is the residual output, $\gamma$ is a learnable scale parameter.
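The paper does not say how $\gamma$ is learned; looking at the code later should make that clear.
After rereading the paper more carefully, I now understand how the non-local attention achieves local transfer. Pay attention to the subscripts in the formulas.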
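The attention equations above can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: it assumes the feature map has already been cut into $m$ flattened patches (one per row), whereas the paper expresses everything as convolutions, and it treats $\gamma$ as a plain scalar argument rather than a trained parameter. The function name `patch_attention` is made up for this sketch.

```python
import numpy as np


def patch_attention(patches, gamma=1.0, eps=1e-8):
    """Non-local style attention over patches.

    patches: (m, d) array, one flattened feature patch p_i per row.
    Returns (y, s_hat), where y = x + gamma * o.
    """
    # Cosine similarity s_ij = <p_i / ||p_i||, p_j / ||p_j||>.
    norms = np.linalg.norm(patches, axis=1, keepdims=True)
    unit = patches / np.maximum(norms, eps)
    s = unit @ unit.T                                   # (m, m)

    # Softmax over j for each i (max subtracted for numerical stability).
    w = np.exp(s - s.max(axis=1, keepdims=True))
    s_hat = w / w.sum(axis=1, keepdims=True)

    # Response o_i = sum_j s_hat_ij * p_j, then the residual connection.
    o = s_hat @ patches
    return patches + gamma * o, s_hat


# Usage: patches 0 and 1 point in the same direction, patch 2 is orthogonal.
patches = np.array([[1.0, 0.0],
                    [2.0, 0.0],
                    [0.0, 3.0]])
y, s_hat = patch_attention(patches, gamma=0.5)
y_identity, _ = patch_attention(patches, gamma=0.0)
```

Note the subscripts: row $i$ of `s_hat` weights every patch $j$, so each output position borrows content from the patches most similar to it. With `gamma=0.0` the block reduces to the identity, which is why a residual form like this can be dropped into an existing network without disturbing it at the start.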

Structure Embedding Layer

It is a standard residual dense block.

Pyramid Structure Loss

In addition to the L1 loss, a regularization term is added here.
$\mathcal{L}_{\text {structure}}=\sum_{s=1}^{n_{s}}\left[\left\|\mathbf{C}_{\text {pred}}^{(s)}-\mathbf{C}^{(s)}\right\|_{1}+\beta \mathcal{L}_{\text {edge}}^{(s)}\right]$
where $\mathcal{L}_{\text {edge}}^{(s)}$ denotes the regularization term, $\beta$ the corresponding coefficient and $n_{s}$ the number of total scales. To implement the regularization on the edge structure, we first use a Gaussian filter $g$ to convolve the binary ground truth edge map $\mathbf{E}^{(s)}$ to create a weighted edge mask as:
$\mathbf{M}_{E}^{(s)}=g * \mathbf{E}^{(s)}$
Then, we compute the edge regularization loss as:
$\mathcal{L}_{\text {edge}}^{(s)}=\left\|\mathbf{C}_{\text {pred}}^{(s)}-\mathbf{C}^{(s)}\right\|_{1} \odot \mathbf{M}_{E}^{(s)}$
where the weighted edge mask is used to extract the edge information from the gradient map. Using such an edge mask not only considers the positions of the binary edges but also exerts constraints on their nearby locations, thereby highlighting and intensifying the edge structure. In our implementation, a Gaussian filter with size $10 \times 10$ and standard deviation 1 is used.
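The pyramid structure loss can be sketched in NumPy as below. Several choices here are assumptions rather than details from the paper: the $\odot$ is read as weighting the per-pixel absolute error by $\mathbf{M}_{E}^{(s)}$ before reduction, the L1 terms are averaged rather than summed, zero padding is used at the borders, and `beta=1.0` is a placeholder value. The function names are hypothetical.

```python
import numpy as np


def gaussian_kernel(size=10, sigma=1.0):
    """Normalized 2-D Gaussian kernel (10x10 with sigma 1, as stated in the text)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()


def filter2d_same(x, kernel):
    """Filter a 2-D array with zero padding, keeping its size.

    The Gaussian kernel is symmetric, so correlation equals convolution here.
    """
    kh, kw = kernel.shape
    pt, pl = kh // 2, kw // 2
    padded = np.pad(x, ((pt, kh - 1 - pt), (pl, kw - 1 - pl)), mode="constant")
    h, w = x.shape
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def pyramid_structure_loss(c_pred_scales, c_true_scales, edge_scales, beta=1.0):
    """Sum over scales of L1(C_pred, C) + beta * edge-weighted L1.

    Each list holds one array per scale: gradients (H, W, C), edges (H, W).
    """
    kernel = gaussian_kernel()
    total = 0.0
    for c_pred, c_true, edge in zip(c_pred_scales, c_true_scales, edge_scales):
        abs_err = np.abs(c_pred - c_true)
        edge_mask = filter2d_same(edge, kernel)          # M_E = g * E
        l1 = abs_err.mean()
        l_edge = (abs_err * edge_mask[..., None]).mean()
        total += l1 + beta * l_edge
    return total


# Usage: one scale, constant error of 1, a single edge pixel in the middle.
c_true = np.zeros((16, 16, 6))
c_pred = np.ones((16, 16, 6))
edge = np.zeros((16, 16))
edge[8, 8] = 1.0
loss_with_edge = pyramid_structure_loss([c_pred], [c_true], [edge])
loss_no_edge = pyramid_structure_loss([c_pred], [c_true], [np.zeros((16, 16))])
loss_perfect = pyramid_structure_loss([c_true], [c_true], [edge])
```

Because the Gaussian spreads each edge pixel over its neighborhood, errors near an edge are penalized as well as errors exactly on it, which matches the stated intent of the weighted edge mask.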