SieveNet: A Unified Framework for Robust Image-Based Virtual Try-On
- two-stage spatial transformer - to capture fine details in the geometric warping stage
- conditional segmentation mask generation module - to prevent garment textures from bleeding onto skin and other areas.
- perceptual geometric matching loss - to improve warping output
- duelling triplet loss strategy - to improve output from the translation network
Inputs:
- $I_p$: try-on cloth (product) image
- $I_{priors}$: 19-channel pose and body-shape map
- $I_m$: target model image
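As a point of reference, here is a minimal sketch of how such a 19-channel prior could be assembled, assuming it consists of 18 pose-keypoint Gaussian heatmaps plus a 1-channel body-shape silhouette (the exact split and the Gaussian rendering are assumptions, not stated in these notes):

```python
import torch

def make_priors(keypoints, body_shape, h=256, w=192, sigma=3.0):
    """Assemble the 19-channel prior: 18 pose-keypoint Gaussian
    heatmaps stacked with a 1-channel body-shape silhouette.

    keypoints:  (18, 2) tensor of (x, y) pixel coordinates,
                with (-1, -1) for undetected joints.
    body_shape: (h, w) binary silhouette tensor.
    """
    ys = torch.arange(h, dtype=torch.float32).view(h, 1).expand(h, w)
    xs = torch.arange(w, dtype=torch.float32).view(1, w).expand(h, w)
    heatmaps = []
    for x, y in keypoints:
        if x < 0:  # undetected joint -> empty channel
            heatmaps.append(torch.zeros(h, w))
        else:
            heatmaps.append(
                torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2)))
    return torch.cat([torch.stack(heatmaps), body_shape[None]], dim=0)  # (19, h, w)
```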
Coarse2Fine Warping
Goal: warp $I_p$ so that it aligns with the pose and body shape of the person in $I_m$.
Warping is performed with a spatial transformer network (STN).
Tackling Occlusion and Pose-variation
The authors argue that accurate warping must account for the following two factors:
- Large variations in shape or pose between the try-on cloth image and the corresponding regions in the model image.
- Occlusions in the model image. For example, the long hair of a person may occlude part of the garment near the top.
STN-based warping proceeds in two stages (a sketch follows below):
Stage 1 (coarse) takes $I_p$ and $I_{priors}$ as input and regresses transformation parameters $\theta$, which are used to warp $I_p$ into $I_{stn}^0$.
Stage 2 (fine) takes $I_{stn}^0$ and $I_{priors}$ as input and regresses a residual $\Delta\theta$; warping $I_p$ with $\theta+\Delta\theta$ produces $I_{stn}^1$.
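A minimal PyTorch sketch of this two-stage scheme, using a plain affine STN for brevity (the paper's warper regresses richer transformation parameters; the network sizes and names here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WarpStage(nn.Module):
    """Regresses 2x3 affine warp parameters from its input channels."""
    def __init__(self, in_ch, residual=False):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 6),
        )
        # stage 1 starts at the identity transform; the residual stage
        # starts at zero so that theta + d_theta == theta initially
        bias = torch.zeros(6) if residual else torch.tensor([1., 0, 0, 0, 1, 0])
        self.net[-1].weight.data.zero_()
        self.net[-1].bias.data.copy_(bias)

    def forward(self, x):
        return self.net(x).view(-1, 2, 3)

def warp(image, theta):
    grid = F.affine_grid(theta, image.size(), align_corners=False)
    return F.grid_sample(image, grid, align_corners=False)

stage1 = WarpStage(3 + 19)                 # input: I_p (3) + I_priors (19)
stage2 = WarpStage(3 + 19, residual=True)  # input: I_stn^0 + I_priors

def coarse_to_fine(i_p, priors):
    theta = stage1(torch.cat([i_p, priors], dim=1))
    i_stn0 = warp(i_p, theta)                  # coarse result I_stn^0
    d_theta = stage2(torch.cat([i_stn0, priors], dim=1))
    i_stn1 = warp(i_p, theta + d_theta)        # fine result I_stn^1
    return i_stn0, i_stn1
```

Note that both stages warp the original $I_p$; the second stage only refines the parameters, not the already-warped image.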
Perceptual Geometric Matching Loss
$$ L_{warp}=\lambda_1L_s^0+\lambda_2L_s^1+\lambda_3L_{pgm} $$
where:
$$
L_s^0 = | I_{gt} - I_{stn}^0 | \\
L_s^1 = | I_{gt} - I_{stn}^1 |
$$
$$ L_{pgm}=\lambda_4L_{push}+\lambda_5L_{align} $$
with
$$
L_{push} = k \cdot L_s^1 - | I_{stn}^1 - I_{stn}^0 |
$$
This encourages $I_{stn}^1$ to be closer to $I_{gt}$ than $I_{stn}^0$ is, where $I_{gt}$ denotes the ground-truth target garment.
$$
V^0 = VGG(I_{stn}^0) - VGG(I_{gt}) \\
V^1 = VGG(I_{stn}^1) - VGG(I_{gt}) \\
L_{align} = (CosineSimilarity(V^0, V^1) - 1)^2
$$
In effect this is a cosine-distance measure that pulls the vectors $V^0$ and $V^1$ toward the same direction; minimizing $L_{align}$ in turn helps drive down $L_{push}$.
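A sketch of $L_{pgm}$ under the definitions above; `vgg` stands in for a pretrained feature extractor, and the hyperparameter values are placeholders:

```python
import torch
import torch.nn.functional as F

def pgm_loss(i_stn0, i_stn1, i_gt, vgg, k=3.0, lam4=1.0, lam5=1.0):
    """Perceptual geometric matching loss. `vgg` is an assumed
    callable returning a feature map per image; k, lam4, lam5 are
    hyperparameters whose values here are placeholders."""
    l_s1 = F.l1_loss(i_stn1, i_gt)
    # push: the fine warp should both stay close to I_gt and move
    # away from the coarse warp I_stn^0
    l_push = k * l_s1 - F.l1_loss(i_stn1, i_stn0)
    # align: the two warps' residuals in VGG feature space should
    # point in the same direction (cosine similarity -> 1)
    v0 = (vgg(i_stn0) - vgg(i_gt)).flatten(1)
    v1 = (vgg(i_stn1) - vgg(i_gt)).flatten(1)
    l_align = ((F.cosine_similarity(v0, v1, dim=1) - 1) ** 2).mean()
    return lam4 * l_push + lam5 * l_align
```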
Texture Transfer
Conditional Segmentation Mask Prediction
A key problem with existing methods is that they fail to respect the boundary between the clothing product and the skin: clothing-product pixels bleed into skin pixels, and skin pixels bleed into clothing-product pixels. Under self-occlusion, skin pixels may even be replaced entirely. This is especially severe when the try-on cloth and the clothing in the model image differ in shape, and also when the target model is in a complex pose.
Inputs: $I_{priors}$ and $I_p$.
Output: $M_{exp}$, the "expected" segmentation mask, i.e. the segmentation of the target model as if wearing the try-on cloth.
Note the loss function: a weighted cross-entropy, with extra weight on the skin and background classes. Up-weighting skin better resolves self-occlusion, and up-weighting background keeps skin pixels from bleeding into the background (a sketch follows below).
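A minimal sketch of such a weighted cross-entropy in PyTorch; the class indices and weight values are placeholders, since the notes only state that skin and background are up-weighted:

```python
import torch
import torch.nn as nn

# Placeholder class indices and weights.
NUM_CLASSES, BACKGROUND, SKIN = 20, 0, 1
weights = torch.ones(NUM_CLASSES)
weights[SKIN] = 3.0        # helps recover skin under self-occlusion
weights[BACKGROUND] = 2.0  # stops skin bleeding into the background

criterion = nn.CrossEntropyLoss(weight=weights)
# logits: (N, NUM_CLASSES, H, W) from the mask-prediction network,
# target: (N, H, W) ground-truth class indices
# loss = criterion(logits, target)
```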
Segmentation Assisted Texture Translation
Inputs:
- The warped product image $I_{stn}^1$
- The expected seg. mask $M_{exp}$
- Pixels of $I_m$ for the unaffected regions (Texture Translation Priors in Figure 3), e.g. the face and bottom cloth if a top garment is being tried on.
Outputs:
- an RGB rendered person image $I_{rp}$
- a composition mask $M_{cm}$
The final try-on image is composed as:
$$
I_{try-on}=M_{cm}*I_{stn}^1+(1-M_{cm})*I_{rp}
$$
Loss function:
$$
L_{tt} = L_{l1} + L_{percep} + L_{mask} \\
L_{l1} = | I_{try-on} - I_m | \\
L_{percep} = | VGG(I_{try-on}) - VGG(I_m) | \\
L_{mask} = | M_{cm} - M_{gt}^{cloth} |
$$
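Putting the composition and the $L_{tt}$ terms together, a sketch (again with `vgg` standing in for a pretrained feature extractor):

```python
import torch
import torch.nn.functional as F

def texture_translation_loss(i_rp, m_cm, i_stn1, i_m, m_gt_cloth, vgg):
    """Composition plus the L_tt objective. `vgg` is an assumed
    callable returning a feature map per image."""
    # blend the rendered person and the warped product with the
    # predicted composition mask
    i_tryon = m_cm * i_stn1 + (1 - m_cm) * i_rp
    l_l1 = F.l1_loss(i_tryon, i_m)
    l_percep = F.l1_loss(vgg(i_tryon), vgg(i_m))
    l_mask = F.l1_loss(m_cm, m_gt_cloth)
    return i_tryon, l_l1 + l_percep + l_mask
```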
Note the training strategy here:
For the first K steps, train with $L_{tt}$ alone to reach a reasonably good output.
After that, train with $L_{tt}$ plus the triplet loss for fine-grained refinement.
Duelling Triplet Loss Strategy
Here $I_{try-on}^{i_{prev}}$ denotes the output from an earlier training step, which serves as the negative in the triplet; the current output acts as the anchor and the ground truth $I_m$ as the positive.
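A sketch of this triplet term under that reading (the margin value is an assumption):

```python
import torch
import torch.nn.functional as F

def duelling_triplet_loss(i_tryon, i_tryon_prev, i_m, margin=0.0):
    """The current output (anchor) should sit closer to the ground
    truth I_m (positive) than to the earlier output (negative).
    The margin value is an assumption."""
    d_pos = F.l1_loss(i_tryon, i_m)
    d_neg = F.l1_loss(i_tryon, i_tryon_prev.detach())
    return F.relu(d_pos - d_neg + margin)
```

After the initial K steps, this term is added to $L_{tt}$, so each phase of training "duels" against the network's own earlier predictions.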