Are there any limitations on the backbone architecture? Is it limited to AlexNet and ResNet-50, or can we use any architecture of any capacity?

There are no limitations on the backbone architecture. We provide reference implementations for AlexNet and ResNet-50 at https://github.com/facebookresearch/fair_self_supervision_benchmark, but you can use any backbone.
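To illustrate why the backbone choice is free, here is a minimal sketch in plain Python. It is not code from the benchmark repository: the `Backbone` type, the two toy backbones, and the nearest-centroid classifier (used here as a simple stand-in for a classifier trained on frozen features) are all hypothetical names made up for this example. The point is only that the evaluation code touches the backbone through a single "input in, feature vector out" call, so anything that satisfies that contract can be plugged in.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Features = List[float]
# Hypothetical interface: a backbone is any callable mapping an input to features.
Backbone = Callable[[Sequence[float]], Features]
Dataset = List[Tuple[Sequence[float], int]]


def tiny_backbone(x: Sequence[float]) -> Features:
    # Toy low-capacity backbone: 2-D features (mean and max).
    return [sum(x) / len(x), max(x)]


def wide_backbone(x: Sequence[float]) -> Features:
    # Toy higher-capacity backbone: 4-D features (mean, max, min, range).
    lo, hi = min(x), max(x)
    return [sum(x) / len(x), hi, lo, hi - lo]


def fit_centroids(backbone: Backbone, data: Dataset) -> Dict[int, Features]:
    # Average the frozen features per class; the backbone itself is never updated.
    sums: Dict[int, Features] = {}
    counts: Dict[int, int] = {}
    for x, y in data:
        f = backbone(x)
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(backbone: Backbone, centroids: Dict[int, Features], x: Sequence[float]) -> int:
    # Assign the class whose centroid is closest in squared Euclidean distance.
    f = backbone(x)
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(f, centroids[y])),
    )


def accuracy(backbone: Backbone, train: Dataset, test: Dataset) -> float:
    centroids = fit_centroids(backbone, train)
    correct = sum(predict(backbone, centroids, x) == y for x, y in test)
    return correct / len(test)


# Made-up toy data, only to exercise the code path.
train: Dataset = [
    ([0.0, 0.1, 0.2], 0),
    ([0.1, 0.0, 0.1], 0),
    ([0.9, 1.0, 0.8], 1),
    ([1.0, 0.9, 1.0], 1),
]
test: Dataset = [([0.2, 0.1, 0.0], 0), ([0.8, 0.9, 1.0], 1)]

if __name__ == "__main__":
    # The same evaluation function runs unchanged for either backbone.
    for name, backbone in [("tiny", tiny_backbone), ("wide", wide_backbone)]:
        print(name, accuracy(backbone, train, test))
```

Swapping in a different architecture would, under this assumed design, only mean passing a different callable to `accuracy`; the feature dimensionality is inferred from the backbone's output rather than fixed in advance.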