Unable to reproduce benchmark results mentioned in the paper


As part of our submission pipeline for the challenge, I’m trying to reproduce the results reported in the paper. I use AlexNet trained on the jigsaw pretext task with YFCC100M to extract features from VOC07, then train and test the linear SVM on VOC07 for image classification. For features from different layers, I obtain the following results (mine vs. Table 19 in the paper):

  • conv3 = 47.64 vs 54.7
  • conv4 = 48.6 vs 55.4
  • conv5 = 43.67 vs 49.7

I see a consistent drop of ~6-7 percentage points compared to the results in the paper. I’m using the default config files (alexnet_jigsaw_extract_features.yaml) provided in the repo without any changes to the parameters. I’m also using the provided PARAMS_FILE of the corresponding pretext task.

Can you please let us know if this reduction is expected? If not, do you have any thoughts or suggestions on what I might be missing to reproduce the results from the paper?


Hi @nitheeshkl, thanks for reaching out. Can you please try reproducing the results on another dataset such as ImageNet-1K? For example, with this config: https://github.com/facebookresearch/fair_self_supervision_benchmark/blob/master/configs/legacy_tasks/imagenet_linear_tune/resnet50_jigsaw_finetune_linear.yaml

My guess is that your data for VOC07 might not be correct. Did you train on the “trainval” partition of VOC07 and then test on “test”?

Let me know if there are more questions. :slight_smile: All the best. Looking forward to your participation.


@prigoyal - Thanks for the suggestion. I did try with ImageNet and ResNet50 as well, and I see the same 6-7% reduction. Here are the numbers I’m obtaining:

AlexNet Jigsaw YFCC100M on VOC07 SVM image classification:

  • conv3 = 47.64 vs 54.7
  • conv4 = 48.6 vs 55.4
  • conv5 = 43.67 vs 49.7

ResNet50 Jigsaw YFCC100M on VOC07 SVM image classification:

  • layer3 = 51.11 vs 58.4
  • layer4 = 65.05 vs 71.0
  • layer5 = 56.63 vs 63.5

ResNet50 Jigsaw IN22K on VOC07 SVM image classification:

  • res3 = 50.99 vs 57.7
  • res4 = 65.57 vs 71.9
  • res5 = 58.19 vs 64.8
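The gaps in the lists above can be tallied with a short script. This is only an illustrative sketch: the numbers are copied from the lists as (reproduced, paper) pairs, and the names `RESULTS` and `gaps` are made up for this example.

```python
# (reproduced, paper) accuracy pairs copied from the lists above.
RESULTS = {
    "AlexNet Jigsaw YFCC100M": {
        "conv3": (47.64, 54.7), "conv4": (48.6, 55.4), "conv5": (43.67, 49.7),
    },
    "ResNet50 Jigsaw YFCC100M": {
        "layer3": (51.11, 58.4), "layer4": (65.05, 71.0), "layer5": (56.63, 63.5),
    },
    "ResNet50 Jigsaw IN22K": {
        "res3": (50.99, 57.7), "res4": (65.57, 71.9), "res5": (58.19, 64.8),
    },
}


def gaps(results):
    """Return paper-minus-reproduced gaps in percentage points (2 decimals)."""
    return {
        model: {
            layer: round(paper - mine, 2)
            for layer, (mine, paper) in layers.items()
        }
        for model, layers in results.items()
    }


if __name__ == "__main__":
    for model, layers in gaps(RESULTS).items():
        print(model, layers)
```

A gap that stays in the same narrow range across architectures and pretraining datasets points to something shared by every run, such as the evaluation data or the training settings, rather than to one particular model.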

I’ve used the provided scripts to prepare the VOC07 dataset as per the instructions in the readme file. My VOC07 dataset consists of 2501 images for the trainval set and 2510 images for the test set. Is this correct?

The only other changes I’ve made are setting GPU=1 and reducing batchsize to 64 (since I currently have limited GPU resources :slight_smile: )

So, any other suggestions?


The VOC07 dataset size isn’t correct. The trainval set should have 5011 images and the test set should have 4952 images. Note that this is the full trainval partition, not just the train or val partition.
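A quick way to catch this kind of problem is to count the image ids in each split file before extracting features. The sketch below assumes the standard VOC devkit layout, where `ImageSets/Main/trainval.txt` and `ImageSets/Main/test.txt` list one image id per line; the files produced by the repo's preparation scripts may use a different format, and the helper names `count_ids` and `check_splits` are hypothetical.

```python
from pathlib import Path

# Expected image counts quoted in the reply above.
EXPECTED = {"trainval": 5011, "test": 4952}


def count_ids(split_file):
    """Count non-empty lines (one image id per line) in a split file."""
    with open(split_file) as f:
        return sum(1 for line in f if line.strip())


def check_splits(image_sets_dir, expected=EXPECTED):
    """Return {split: (found, expected)} for every split whose count is off."""
    mismatches = {}
    for split, want in expected.items():
        found = count_ids(Path(image_sets_dir) / f"{split}.txt")
        if found != want:
            mismatches[split] = (found, want)
    return mismatches


if __name__ == "__main__":
    # Assumed location of the split files; adjust to your own setup.
    print(check_splits("VOCdevkit/VOC2007/ImageSets/Main"))
```

An empty result means both splits have the expected size. As far as I know, 2501 and 2510 are the sizes of the separate VOC07 train and val splits, so counts like those suggest that train and val were used in place of trainval and test.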

If you reduce the number of gpus and change the batch size, you need to change the learning rate as well. Otherwise the results won’t reproduce. For reference on how to adjust the batch size, learning rate etc. please see the paper https://arxiv.org/abs/1905.01235.
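One common heuristic for this adjustment is the linear scaling rule, where the learning rate is scaled in proportion to the total batch size. The reply above does not say which rule the benchmark uses, so treat this as an assumption and check the paper for the actual recipe. The function name `scale_lr` and the numbers in the example are hypothetical.

```python
def scale_lr(base_lr, base_batch_size, new_batch_size):
    """Scale a learning rate linearly with the total batch size.

    This is the linear scaling heuristic, assumed here for illustration;
    it is not confirmed as the rule used by the benchmark configs.
    """
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch_size / base_batch_size


if __name__ == "__main__":
    # Hypothetical values: a config tuned for batch size 256, run at 64.
    print(scale_lr(base_lr=0.01, base_batch_size=256, new_batch_size=64))
```

Under this heuristic, cutting the batch size to a quarter also cuts the learning rate to a quarter. Other settings that depend on the number of iterations, such as the schedule length, may need a matching adjustment.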

The above two things explain why your results don’t match. Feel free to reach out for any more questions.


That was it. My VOC07 dataset had only half the images (2.5k), although I’m not sure how that happened!
I’m now able to reproduce the numbers successfully.
Thanks @prigoyal for pointing me in the right direction! :slight_smile: