Paper Notes + Source Code: DETR: End-to-End Object Detection with Transformers


Friday, 19 June 2020 14:50 (last updated Friday, 19 June 2020 14:50)

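Part 0. Background knowledge needed for this paper

Object detection: know the basic technical routes of traditional object detection (anchor-based methods, non-maximum suppression, one-stage and two-stage detectors), and be roughly familiar with strong detectors such as Faster R-CNN
Transformer: understand how the Transformer works and know the self-attention mechanism
Bipartite matching: understand bipartite matching from graph theory and know the Hungarian algorithm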

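Part 1. Key points of the abstract

1. Compared with the traditional pipeline, DETR removes many hand-designed components, such as non-maximum suppression and anchor design.
Each of these hand-designed components explicitly encodes human prior knowledge about the task.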

2. Core ingredients of DETR:
(a) a set-based global loss that forces unique predictions via bipartite matching, and
(b) a transformer encoder-decoder architecture. (The Transformer used in this paper is non-autoregressive.)

For an introduction to non-autoregressive decoding, see https://zhuanlan.zhihu.com/p/82892975

3. What DETR does:
· Input: a fixed small set of learned object queries
· DETR reasons about:

(a) the relations of the objects, and
(b) the global image context, in order to directly output the final set of predictions in parallel
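The difference between autoregressive and parallel (non-autoregressive) decoding can be sketched with a toy example. This is only an illustration under simplifying assumptions, not DETR's decoder: `decode_autoregressive`, `decode_parallel`, and the lambdas are hypothetical names, and integers stand in for feature vectors. In the real model the queries also interact through attention inside each decoder layer, but no slot has to wait for another slot's emitted output.

```python
from typing import Callable, List


def decode_autoregressive(step: Callable[[List[int]], int], length: int) -> List[int]:
    """Emit outputs one at a time; each step sees everything emitted so far."""
    outputs: List[int] = []
    for _ in range(length):
        outputs.append(step(outputs))  # step t must wait for steps 0..t-1
    return outputs


def decode_parallel(
    slot_fn: Callable[[int, int], int], queries: List[int], context: int
) -> List[int]:
    """Compute every slot from its own query and the shared context.

    No slot consumes another slot's output, so all slots could be
    evaluated at the same time.
    """
    return [slot_fn(query, context) for query in queries]


# Toy usage: integers stand in for feature vectors (hypothetical values).
sequential = decode_autoregressive(lambda prev: sum(prev) + 1, length=4)
parallel = decode_parallel(lambda q, ctx: q * 10 + ctx, queries=[0, 1, 2], context=5)
print(sequential)
print(parallel)
```

Reordering `queries` in the parallel version only reorders the outputs, which is why the set of predictions can be emitted in one pass.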

4. Architecture overview diagram:

[Figure: DETR architecture overview]

A more detailed architecture diagram:

[Figure: detailed DETR architecture]

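Part 2. Main text

1. First, the paper frames object detection as a set prediction problem.

2. The whole network is designed end-to-end and is trained with a set loss function, which performs bipartite matching between predicted and ground-truth objects.

3. DETR is an innovation in architecture only; it does not introduce any new kind of layer (ResNet, for example, introduced skip connections, whereas DETR does not innovate at the layer level).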

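4. DETR's matching loss function assigns predictions to ground-truth boxes one-to-one ("uniquely assigns a prediction to a ground truth object"; this one-to-one assignment is exactly what bipartite matching means). The loss is also invariant to a permutation of the predicted objects (this is the reason for modeling the problem as bipartite matching, here specifically on an undirected bipartite graph), which is one reason the predictions can be emitted in parallel.

"Matching" here is the graph-theory concept; see https://www.renfei.org/blog/bipartite-matching.html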
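Before predictions can be assigned to ground-truth objects, every (prediction, ground truth) pair needs a cost. The sketch below builds such a cost matrix under stated assumptions: the cost combines the negative probability of the correct class with an L1 box distance, the weights `class_weight` and `box_weight` are hypothetical, and the generalized IoU term that the paper's matching cost also uses is left out for brevity. All function names are made up for illustration.

```python
from typing import List, Sequence

Box = Sequence[float]  # (center_x, center_y, width, height), normalized to [0, 1]


def l1_distance(a: Box, b: Box) -> float:
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(x - y) for x, y in zip(a, b))


def matching_cost_matrix(
    pred_probs: List[List[float]],
    pred_boxes: List[Box],
    gt_labels: List[int],
    gt_boxes: List[Box],
    class_weight: float = 1.0,  # hypothetical weight, not taken from the post
    box_weight: float = 1.0,    # hypothetical weight, not taken from the post
) -> List[List[float]]:
    """cost[i][j] is the cost of assigning prediction i to ground-truth object j."""
    cost = []
    for probs, box in zip(pred_probs, pred_boxes):
        row = []
        for label, gt_box in zip(gt_labels, gt_boxes):
            class_term = -probs[label]           # confident, correct class lowers the cost
            box_term = l1_distance(box, gt_box)  # distant boxes raise the cost
            row.append(class_weight * class_term + box_weight * box_term)
        cost.append(row)
    return cost


# Three predictions, two object classes plus a trailing "no object" probability,
# and two ground-truth objects (all values are made up).
pred_probs = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
pred_boxes = [(0.2, 0.2, 0.1, 0.1), (0.7, 0.7, 0.2, 0.2), (0.5, 0.5, 0.3, 0.3)]
gt_labels = [0, 1]
gt_boxes = [(0.2, 0.2, 0.1, 0.1), (0.7, 0.7, 0.2, 0.2)]

cost = matching_cost_matrix(pred_probs, pred_boxes, gt_labels, gt_boxes)
for row in cost:
    print(row)
```

The matrix has one row per prediction and one column per ground-truth object; a one-to-one assignment then picks at most one entry per row and exactly one per column.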

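5. Why model detection as "set prediction":

The basic set prediction task is multi-label classification, and the usual baseline for it is one-vs-rest (also called one-vs-all: a classifier is trained with one label treated as the "one" and all remaining labels together as the "rest"). This approach does not apply to problems such as detection, where there is an underlying structure between elements (i.e., near-identical boxes). The first difficulty is avoiding near-duplicates: traditional detectors rely on post-processing such as non-maximum suppression to clean up piles of nearly identical predictions, whereas direct set prediction needs no such post-processing. Instead, set prediction needs a global strategy that models the relations between all predicted elements, so that redundant, duplicated predictions are avoided.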
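For reference, the post-processing step that set prediction makes unnecessary can be sketched as follows. This is a minimal greedy non-maximum suppression in plain Python, assuming corner-format boxes and a hypothetical IoU threshold of 0.5; production detectors use optimized library versions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) corners


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-identical boxes and one separate box (made-up values).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

The near-duplicate of the highest-scoring box is discarded by a hand-tuned rule; DETR instead trains the network so that such duplicates are not produced in the first place.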

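6. Why use bipartite matching as the "prediction → ground truth" loss:

In a set prediction problem the loss must be invariant by a permutation of the predictions, that is, the order of the predicted values/boxes must not affect the loss value. Bipartite matching, here specifically matching on an undirected bipartite graph, models the "prediction → ground truth" relation as an undirected bipartite graph, and a matching on such a graph has no notion of order. In particular, the Hungarian algorithm is used to solve the bipartite matching problem.

· Bipartite matching (1) guarantees invariance to the order of the predictions, and (2) guarantees a one-to-one match between predictions and ground-truth objects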
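Both properties can be checked on a tiny cost matrix. The sketch below is a brute-force stand-in, not the Hungarian algorithm: it enumerates every one-to-one assignment with `itertools.permutations`, which reaches the same optimum but only scales to a handful of boxes. The function name and the cost values are hypothetical.

```python
from itertools import permutations
from typing import List, Tuple


def min_cost_matching(cost: List[List[float]]) -> Tuple[float, List[Tuple[int, int]]]:
    """Optimal one-to-one assignment of ground-truth columns to prediction rows.

    Assumes at least as many predictions (rows) as ground-truth objects
    (columns). Returns the total cost and (prediction, ground_truth) pairs.
    """
    num_preds, num_gts = len(cost), len(cost[0])
    best_total, best_pairs = float("inf"), []
    for rows in permutations(range(num_preds), num_gts):
        total = sum(cost[r][c] for c, r in enumerate(rows))
        if total < best_total:
            best_total = total
            best_pairs = [(r, c) for c, r in enumerate(rows)]
    return best_total, best_pairs


cost = [
    [0.9, 0.1],  # prediction 0
    [0.2, 0.8],  # prediction 1
    [0.5, 0.6],  # prediction 2 (left unmatched, i.e. treated as "no object")
]
total, pairs = min_cost_matching(cost)

# Shuffle the predictions: the pairs are relabelled, the total cost is unchanged.
shuffled = [cost[2], cost[0], cost[1]]
shuffled_total, shuffled_pairs = min_cost_matching(shuffled)
print(total, pairs)
print(shuffled_total, shuffled_pairs)
```

Each ground-truth object receives exactly one prediction, and reordering the rows changes only the indices in the pairs, not the value of the loss.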

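7. More accurate predictions on large objects:

The paper describes this as "a result likely enabled by the non-local computations of the transformer". "Non-local computations" here refers to the non-local concept from the paper Non-local Neural Networks (https://arxiv.org/pdf/1711.07971.pdf).

Non-local computations aggregate information over a "non-local" receptive field, that is, from positions across the whole feature map rather than only a local neighborhood; see https://zhuanlan.zhihu.com/p/33345791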
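What "non-local" means can be seen in a stripped-down self-attention. The sketch below is a simplification under stated assumptions: it uses identity projections (queries, keys, and values are all the raw features) and a single head, whereas a real Transformer layer uses learned projections and multiple heads. The feature values are made up.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features: List[List[float]]) -> List[List[float]]:
    """Scaled dot-product self-attention with queries = keys = values = features.

    Every output position is a weighted average over ALL input positions,
    so its receptive field is the whole sequence, not a local window.
    """
    dim = len(features[0])
    outputs = []
    for query in features:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in features
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, features))
            for d in range(dim)
        ])
    return outputs


# Five positions of a flattened feature map, two channels each.
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0], [1.0, 1.0]]
baseline = self_attention(features)

# Change only the LAST position and look at the output of the FIRST position.
modified = [row[:] for row in features]
modified[-1] = [3.0, 3.0]
changed = self_attention(modified)
print(baseline[0], changed[0])
```

The first output moves even though only the most distant position was edited; a small convolution kernel would need many stacked layers before position 0 could see position 4.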

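Part 3. Results

Part 4. Source code discussion

To guard against later changes to the code base, I use the latest commit at the time of writing (2020.06.18), 1fcfc65, for the partial source walkthrough.

Overview of the DETR network structure: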

import torch
from torch import nn

# NestedTensor helpers live in util/misc.py of the DETR repository.
from util.misc import NestedTensor, nested_tensor_from_tensor_list

# MLP (used for the box head) is defined alongside DETR in models/detr.py.


class DETR(nn.Module):
    """ This is the DETR module that performs object detection """
    def __init__(self, backbone, transformer, num_classes, num_queries, aux_loss=False):
        """ Initializes the model.
        Parameters:
            backbone: torch module of the backbone to be used. See backbone.py
            transformer: torch module of the transformer architecture. See transformer.py
            num_classes: number of object classes
            num_queries: number of object queries, ie detection slot. This is the maximal number of objects
                         DETR can detect in a single image. For COCO, we recommend 100 queries.
            aux_loss: True if auxiliary decoding losses (loss at each decoder layer) are to be used.
        """
        super().__init__()
        self.num_queries = num_queries
        self.transformer = transformer
        hidden_dim = transformer.d_model
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)
        self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        self.input_proj = nn.Conv2d(backbone.num_channels, hidden_dim, kernel_size=1)
        self.backbone = backbone
        self.aux_loss = aux_loss

    def forward(self, samples: NestedTensor):
        """ The forward expects a NestedTensor, which consists of:
               - samples.tensor: batched images, of shape [batch_size x 3 x H x W]
               - samples.mask: a binary mask of shape [batch_size x H x W], containing 1 on padded pixels

            It returns a dict with the following elements:
               - "pred_logits": the classification logits (including no-object) for all queries.
                                Shape= [batch_size x num_queries x (num_classes + 1)]
               - "pred_boxes": The normalized boxes coordinates for all queries, represented as
                               (center_x, center_y, height, width). These values are normalized in [0, 1],
                               relative to the size of each individual image (disregarding possible padding).
                               See PostProcess for information on how to retrieve the unnormalized bounding box.
               - "aux_outputs": Optional, only returned when auxiliary losses are activated. It is a list of
                                dictionaries containing the two above keys for each decoder layer.
        """
        if not isinstance(samples, NestedTensor):
            samples = nested_tensor_from_tensor_list(samples)
        # The backbone is a CNN used for feature extraction.
        features, pos = self.backbone(samples)

        # Take the last feature level and split the NestedTensor into its tensor and padding mask.
        src, mask = features[-1].decompose()
        assert mask is not None
        # Only this last feature level is passed to the Transformer as src. input_proj is a 1x1
        # convolution that reduces the channel dimension to d_model (the model dimension).
        hs = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]

        # To apply the Transformer to object detection, the authors add a class head
        # (class_embed) and a box head (bbox_embed) on top of the decoder output.
        outputs_class = self.class_embed(hs)
        # A sigmoid after the box head gives box coordinates in [0, 1] (four values per box, as in
        # the paper); they are relative to the image size and must be rescaled to get pixels.
        outputs_coord = self.bbox_embed(hs).sigmoid()
        out = {'pred_logits': outputs_class[-1], 'pred_boxes': outputs_coord[-1]}
        if self.aux_loss:
            out['aux_outputs'] = self._set_aux_loss(outputs_class, outputs_coord)
        return out

    @torch.jit.unused
    def _set_aux_loss(self, outputs_class, outputs_coord):
        # this is a workaround to make torchscript happy, as torchscript
        # doesn't support dictionary with non-homogeneous values, such
        # as a dict having both a Tensor and a list.
        return [{'pred_logits': a, 'pred_boxes': b}
                for a, b in zip(outputs_class[:-1], outputs_coord[:-1])]
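The docstring points to PostProcess for turning the normalized outputs back into pixel boxes. A plain-Python sketch of that idea for a single image follows. It is not the repository's implementation: `decode_detections`, `cxcywh_to_xyxy`, and the `score_threshold` parameter are hypothetical, and the box layout is assumed to be (center_x, center_y, width, height) even though the docstring lists height before width.

```python
import math
from typing import Dict, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one query's class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cxcywh_to_xyxy(box: Sequence[float]) -> Tuple[float, float, float, float]:
    """Convert (center_x, center_y, width, height) to (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def decode_detections(
    pred_logits: List[Sequence[float]],
    pred_boxes: List[Sequence[float]],
    img_w: float,
    img_h: float,
    score_threshold: float = 0.5,  # hypothetical cut-off for this sketch
) -> List[Dict]:
    """Turn one image's per-query outputs into labelled pixel boxes."""
    detections = []
    for logits, box in zip(pred_logits, pred_boxes):
        probs = softmax(logits)[:-1]  # drop the trailing "no object" class
        label = max(range(len(probs)), key=probs.__getitem__)
        score = probs[label]
        if score < score_threshold:
            continue  # this query slot does not confidently predict an object
        x0, y0, x1, y1 = cxcywh_to_xyxy(box)
        detections.append({
            "label": label,
            "score": score,
            "box": (x0 * img_w, y0 * img_h, x1 * img_w, y1 * img_h),  # pixels
        })
    return detections


# Two query slots, two object classes plus "no object" (made-up values).
pred_logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
pred_boxes = [(0.5, 0.5, 0.2, 0.4), (0.1, 0.1, 0.1, 0.1)]
detections = decode_detections(pred_logits, pred_boxes, img_w=640, img_h=480)
print(detections)
```

The second slot puts most of its probability on "no object" and is dropped, which is how a fixed number of queries can represent a variable number of objects.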

TBC. (The unfinished parts will be filled in soon; after all, I am learning as I read and writing things down as I go...)
