# Detectron2 Model Zoo and Baselines

## Introduction

This file documents a large collection of baselines trained with detectron2 in Sep-Oct, 2019.
All numbers were obtained on [Big Basin](https://engineering.fb.com/data-center-engineering/introducing-big-basin-our-next-generation-ai-hardware/)
servers with 8 NVIDIA V100 GPUs & NVLink. The software used was PyTorch 1.3, CUDA 9.2, and cuDNN 7.4.2 or 7.6.3.
You can access these models from code using the [detectron2.model_zoo](https://detectron2.readthedocs.io/modules/model_zoo.html) APIs.

In addition to these official baseline models, you can find more models in [projects/](projects/).

#### How to Read the Tables

* The "Name" column contains a link to the config file.
  Running `tools/train_net.py` with this config file and 8 GPUs will reproduce the model.
* Training speed is averaged across the entire training run.
  We keep updating the speed with the latest version of detectron2/pytorch/etc.,
  so it might differ from the value in the `metrics` file.
  Training speed for multi-machine jobs is not provided.
* Inference speed is measured by `tools/train_net.py --eval-only`, or
  [inference_on_dataset()](https://detectron2.readthedocs.io/modules/evaluation.html#detectron2.evaluation.inference_on_dataset),
  with batch size 1 in detectron2 directly.
  Measuring it with custom code may introduce other overhead.
  Actual deployment in production should in general be faster than the given inference
  speed because more optimizations can be applied.
* The *model id* column is provided for ease of reference.
  So that you can check the integrity of a downloaded file, every model file on this page contains its md5 prefix in its file name.
* Training curves and other statistics can be found in `metrics` for each model.

#### Common Settings for COCO Models

* All COCO models were trained on `train2017` and evaluated on `val2017`.
* The default settings are __not directly comparable__ with Detectron's standard settings.
  For example, our default training data augmentation uses scale jittering in addition to horizontal flipping.

  To make fair comparisons with Detectron's settings, see
  [Detectron1-Comparisons](configs/Detectron1-Comparisons/) for accuracy comparison,
  and [benchmarks](https://detectron2.readthedocs.io/notes/benchmarks.html)
  for speed comparison.
* For Faster/Mask R-CNN, we provide baselines based on __3 different backbone combinations__:
  * __FPN__: Use a ResNet+FPN backbone with standard conv and FC heads for mask and box prediction,
    respectively. It obtains the best
    speed/accuracy tradeoff, but the other two are still useful for research.
  * __C4__: Use a ResNet conv4 backbone with a conv5 head. This is the original baseline in the Faster R-CNN paper.
  * __DC5__ (Dilated-C5): Use a ResNet conv5 backbone with dilations in conv5, and standard conv and FC heads
    for mask and box prediction, respectively.
    This is used by the Deformable ConvNet paper.
* Most models are trained with the 3x schedule (~37 COCO epochs).
  Although 1x models are heavily under-trained, we provide some ResNet-50 models with the 1x (~12 COCO epochs)
  training schedule for comparison when doing quick research iteration.

#### ImageNet Pretrained Models

We provide backbone models pretrained on the ImageNet-1k dataset.
These models have a __different__ format from those provided in Detectron: we do not fuse BatchNorm into an affine layer.

* [R-50.pkl](https://dl.fbaipublicfiles.com/detectron2/ImageNetPretrained/MSRA/R-50.pkl): converted copy of [MSRA's original ResNet-50](https://github.com/KaimingHe/deep-residual-networks) model.
* [R-101.pkl](https://dl.fbaipublicfiles.com/detectron2/ImageNetPretrained/MSRA/R-101.pkl): converted copy of [MSRA's original ResNet-101](https://github.com/KaimingHe/deep-residual-networks) model.
* [X-101-32x8d.pkl](https://dl.fbaipublicfiles.com/detectron2/ImageNetPretrained/FAIR/X-101-32x8d.pkl): ResNeXt-101-32x8d model trained with Caffe2 at FB.

Pretrained models in Detectron's format can still be used. For example:

* [X-152-32x8d-IN5k.pkl](https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/25093814/X-152-32x8d-IN5k.pkl):
  ResNeXt-152-32x8d model trained on ImageNet-5k with Caffe2 at FB (see ResNeXt paper for details on ImageNet-5k).
* [R-50-GN.pkl](https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/47261647/R-50-GN.pkl):
  ResNet-50 with Group Normalization.
* [R-101-GN.pkl](https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/47592356/R-101-GN.pkl):
  ResNet-101 with Group Normalization.

Torchvision's ResNet models can be used after being converted by [this script](tools/convert-torchvision-to-d2.py).

#### License

All models available for download through this document are licensed under the
[Creative Commons Attribution-ShareAlike 3.0 license](https://creativecommons.org/licenses/by-sa/3.0/).

### COCO Object Detection Baselines

#### Faster R-CNN:


| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | model id | download |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| R50-C4 | 1x | 0.551 | 0.102 | 4.8 | 35.7 | 137257644 | model \| metrics |
| R50-DC5 | 1x | 0.380 | 0.068 | 5.0 | 37.3 | 137847829 | model \| metrics |
| R50-FPN | 1x | 0.210 | 0.038 | 3.0 | 37.9 | 137257794 | model \| metrics |
| R50-C4 | 3x | 0.543 | 0.104 | 4.8 | 38.4 | 137849393 | model \| metrics |
| R50-DC5 | 3x | 0.378 | 0.070 | 5.0 | 39.0 | 137849425 | model \| metrics |
| R50-FPN | 3x | 0.209 | 0.038 | 3.0 | 40.2 | 137849458 | model \| metrics |
| R101-C4 | 3x | 0.619 | 0.139 | 5.9 | 41.1 | 138204752 | model \| metrics |
| R101-DC5 | 3x | 0.452 | 0.086 | 6.1 | 40.6 | 138204841 | model \| metrics |
| R101-FPN | 3x | 0.286 | 0.051 | 4.1 | 42.0 | 137851257 | model \| metrics |
| X101-FPN | 3x | 0.638 | 0.098 | 6.7 | 43.0 | 139173657 | model \| metrics |

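
The `lr sched` and `train time (s/iter)` columns can be combined into rough epoch counts and wall-clock estimates. The sketch below relies on values that are not stated on this page and should be treated as assumptions: that the 1x schedule runs 90,000 iterations, that each iteration uses a total batch of 16 images, and that `train2017` holds 118,287 images. Check the config file of the model you care about before relying on these numbers.

```python
# Assumed constants (not stated in this document; verify against the config).
ITERS_1X = 90_000           # assumed number of iterations in the 1x schedule
IMAGES_PER_BATCH = 16       # assumed total batch size across the 8 GPUs
TRAIN2017_IMAGES = 118_287  # assumed number of images in COCO train2017


def coco_epochs(schedule_multiplier: int) -> float:
    """Approximate number of COCO epochs covered by an Nx schedule."""
    iterations = ITERS_1X * schedule_multiplier
    return iterations * IMAGES_PER_BATCH / TRAIN2017_IMAGES


def training_hours(sec_per_iter: float, schedule_multiplier: int) -> float:
    """Approximate wall-clock training time in hours for an Nx schedule."""
    return sec_per_iter * ITERS_1X * schedule_multiplier / 3600


if __name__ == "__main__":
    print(f"1x: {coco_epochs(1):.1f} epochs, 3x: {coco_epochs(3):.1f} epochs")
    # Faster R-CNN R50-FPN rows: 0.210 s/iter at 1x and 0.209 s/iter at 3x.
    print(f"R50-FPN 1x: {training_hours(0.210, 1):.2f} h")
    print(f"R50-FPN 3x: {training_hours(0.209, 3):.2f} h")
```

Under these assumptions the 1x and 3x schedules come out at about 12.2 and 36.5 epochs, in line with the approximate figures quoted under Common Settings, and the 1x R50-FPN model takes about 5.25 hours to train.
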

| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | model id | download |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| R50 | 1x | 0.205 | 0.056 | 4.1 | 37.4 | 190397773 | model \| metrics |
| R50 | 3x | 0.205 | 0.056 | 4.1 | 38.7 | 190397829 | model \| metrics |
| R101 | 3x | 0.291 | 0.069 | 5.2 | 40.4 | 190397697 | model \| metrics |


| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | prop. AR | model id | download |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| RPN R50-C4 | 1x | 0.130 | 0.034 | 1.5 |  | 51.6 | 137258005 | model \| metrics |
| RPN R50-FPN | 1x | 0.186 | 0.032 | 2.7 |  | 58.0 | 137258492 | model \| metrics |
| Fast R-CNN R50-FPN | 1x | 0.140 | 0.029 | 2.6 | 37.8 |  | 137635226 | model \| metrics |


| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| R50-C4 | 1x | 0.584 | 0.110 | 5.2 | 36.8 | 32.2 | 137259246 | model \| metrics |
| R50-DC5 | 1x | 0.471 | 0.076 | 6.5 | 38.3 | 34.2 | 137260150 | model \| metrics |
| R50-FPN | 1x | 0.261 | 0.043 | 3.4 | 38.6 | 35.2 | 137260431 | model \| metrics |
| R50-C4 | 3x | 0.575 | 0.111 | 5.2 | 39.8 | 34.4 | 137849525 | model \| metrics |
| R50-DC5 | 3x | 0.470 | 0.076 | 6.5 | 40.0 | 35.9 | 137849551 | model \| metrics |
| R50-FPN | 3x | 0.261 | 0.043 | 3.4 | 41.0 | 37.2 | 137849600 | model \| metrics |
| R101-C4 | 3x | 0.652 | 0.145 | 6.3 | 42.6 | 36.7 | 138363239 | model \| metrics |
| R101-DC5 | 3x | 0.545 | 0.092 | 7.6 | 41.9 | 37.3 | 138363294 | model \| metrics |
| R101-FPN | 3x | 0.340 | 0.056 | 4.6 | 42.9 | 38.6 | 138205316 | model \| metrics |
| X101-FPN | 3x | 0.690 | 0.103 | 7.2 | 44.3 | 39.5 | 139653917 | model \| metrics |


| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | kp. AP | model id | download |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| R50-FPN | 1x | 0.315 | 0.072 | 5.0 | 53.6 | 64.0 | 137261548 | model \| metrics |
| R50-FPN | 3x | 0.316 | 0.066 | 5.0 | 55.4 | 65.5 | 137849621 | model \| metrics |
| R101-FPN | 3x | 0.390 | 0.076 | 6.1 | 56.4 | 66.1 | 138363331 | model \| metrics |
| X101-FPN | 3x | 0.738 | 0.121 | 8.7 | 57.3 | 66.0 | 139686956 | model \| metrics |


| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | PQ | model id | download |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| R50-FPN | 1x | 0.304 | 0.053 | 4.8 | 37.6 | 34.7 | 39.4 | 139514544 | model \| metrics |
| R50-FPN | 3x | 0.302 | 0.053 | 4.8 | 40.0 | 36.5 | 41.5 | 139514569 | model \| metrics |
| R101-FPN | 3x | 0.392 | 0.066 | 6.0 | 42.4 | 38.5 | 43.0 | 139514519 | model \| metrics |


| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| R50-FPN | 1x | 0.292 | 0.107 | 7.1 | 23.6 | 24.4 | 144219072 | model \| metrics |
| R101-FPN | 1x | 0.371 | 0.114 | 7.8 | 25.6 | 25.9 | 144219035 | model \| metrics |
| X101-FPN | 1x | 0.712 | 0.151 | 10.2 | 26.7 | 27.1 | 144219108 | model \| metrics |


| Name | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | box AP50 | mask AP | model id | download |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| R50-FPN, Cityscapes | 0.240 | 0.078 | 4.4 |  |  | 36.5 | 142423278 | model \| metrics |
| R50-C4, VOC | 0.537 | 0.081 | 4.8 | 51.9 | 80.3 |  | 142202221 | model \| metrics |


| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Baseline R50-FPN | 1x | 0.261 | 0.043 | 3.4 | 38.6 | 35.2 | 137260431 | model \| metrics |
| Deformable Conv | 1x | 0.342 | 0.048 | 3.5 | 41.5 | 37.5 | 138602867 | model \| metrics |
| Cascade R-CNN | 1x | 0.317 | 0.052 | 4.0 | 42.1 | 36.4 | 138602847 | model \| metrics |
| Baseline R50-FPN | 3x | 0.261 | 0.043 | 3.4 | 41.0 | 37.2 | 137849600 | model \| metrics |
| Deformable Conv | 3x | 0.349 | 0.047 | 3.5 | 42.7 | 38.5 | 144998336 | model \| metrics |
| Cascade R-CNN | 3x | 0.328 | 0.053 | 4.0 | 44.3 | 38.5 | 144998488 | model \| metrics |

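
One way to read the ablation rows above is as gains over the baseline at the same schedule. The following sketch simply subtracts the baseline's box AP and mask AP from each variant. The numbers are copied from the table, while the dictionary layout and the function name `gains_over_baseline` are illustrative choices, not part of detectron2.

```python
# (box AP, mask AP) copied from the ablation table above, grouped by schedule.
RESULTS = {
    "1x": {
        "Baseline R50-FPN": (38.6, 35.2),
        "Deformable Conv": (41.5, 37.5),
        "Cascade R-CNN": (42.1, 36.4),
    },
    "3x": {
        "Baseline R50-FPN": (41.0, 37.2),
        "Deformable Conv": (42.7, 38.5),
        "Cascade R-CNN": (44.3, 38.5),
    },
}


def gains_over_baseline(results, baseline="Baseline R50-FPN"):
    """Return {(schedule, name): (box AP gain, mask AP gain)} for each variant."""
    gains = {}
    for schedule, rows in results.items():
        base_box, base_mask = rows[baseline]
        for name, (box_ap, mask_ap) in rows.items():
            if name == baseline:
                continue
            # Round to one decimal to match the precision of the table.
            gains[(schedule, name)] = (
                round(box_ap - base_box, 1),
                round(mask_ap - base_mask, 1),
            )
    return gains


if __name__ == "__main__":
    for (schedule, name), (box_gain, mask_gain) in gains_over_baseline(RESULTS).items():
        print(f"{schedule} {name}: box AP +{box_gain}, mask AP +{mask_gain}")
```

Read this way, the Deformable Conv box AP gain shrinks from 2.9 at 1x to 1.7 at 3x, while the Cascade R-CNN box AP gain stays above 3 points at both schedules (3.5 and 3.3).
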

| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Baseline R50-FPN | 3x | 0.261 | 0.043 | 3.4 | 41.0 | 37.2 | 137849600 | model \| metrics |
| GN | 3x | 0.309 | 0.060 | 5.6 | 42.6 | 38.6 | 138602888 | model \| metrics |
| SyncBN | 3x | 0.345 | 0.053 | 5.5 | 41.9 | 37.8 | 169527823 | model \| metrics |
| GN (from scratch) | 3x | 0.338 | 0.061 | 7.2 | 39.9 | 36.6 | 138602908 | model \| metrics |
| GN (from scratch) | 9x | N/A | 0.061 | 7.2 | 43.7 | 39.6 | 183808979 | model \| metrics |
| SyncBN (from scratch) | 9x | N/A | 0.055 | 7.2 | 43.6 | 39.3 | 184226666 | model \| metrics |


| Name | inference time (s/im) | train mem (GB) | box AP | mask AP | PQ | model id | download |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Panoptic FPN R101 | 0.098 | 11.4 | 47.4 | 41.3 | 46.1 | 139797668 | model \| metrics |
| Mask R-CNN X152 | 0.234 | 15.1 | 50.2 | 44.0 |  | 18131413 | model \| metrics |
| above + test-time aug. |  |  | 51.9 | 45.9 |  |  |  |

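
How to Read the Tables notes that each model's file name contains its md5 prefix. A minimal sketch of checking a downloaded file with the Python standard library follows. It assumes the prefix is the last underscore-separated token of the file name before the extension, as in a hypothetical `model_final_abc123.pkl`. That naming pattern is an assumption rather than something this page specifies, so adjust `expected_prefix` if your file is named differently.

```python
import hashlib
from pathlib import Path


def md5_hexdigest(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_prefix(path):
    """Extract the md5 prefix from the file name.

    Assumed (hypothetical) naming scheme: <anything>_<md5 prefix>.<ext>
    """
    return Path(path).stem.rsplit("_", 1)[-1].lower()


def matches_md5_prefix(path):
    """True if the file's md5 digest starts with the prefix in its name."""
    prefix = expected_prefix(path)
    # An empty prefix would match any digest, so reject it explicitly.
    return bool(prefix) and md5_hexdigest(path).startswith(prefix)
```

Calling `matches_md5_prefix("path/to/downloaded_model.pkl")` on a truncated or corrupted download should return `False`, which is a cheap check to run before loading the weights.
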