Clarification on test score calculation

On the equation provided at


it is shown how one can calculate the “accuracy” of a single generated image, assuming that I_X and I_Y are image width or height.

How are these individual scores accumulated? Are the 1471 test set scores simply summed up?

Thanks a lot in advance for the clarification.

I was wondering about the same thing. When I sum up the scores calculated for all images, I usually get a validation error that is higher than my test scores on the leaderboard.
I would also appreciate a clarification on how the test score is calculated. Thank you.

Test error is the RMSE over all test images.

So just to be very clear…

RMSE would be the square root of the mean of squared errors…
What does this mean for the final equation which computes the test score?

Do we calculate a per-image score as described above and then, assuming N images with corresponding error values e_i, evaluate the following?

Or do we evaluate an RMSE over all pixels in the full test set in the following manner?

where p is a ground-truth pixel value, p_hat the corresponding predicted pixel value, and W and H the image dimensions.
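
To make the two readings concrete, here is a minimal sketch in Python. It is not the challenge's evaluation code: the function names are hypothetical, and it assumes that the per-image score e_i is the RMSE over that image's pixels and that images are given as NumPy arrays of pixel values.

```python
import numpy as np


def per_image_rmse(pred, truth):
    """RMSE over the pixels of one image (assumed form of the per-image score e_i)."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(truth, dtype=np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))


def rms_of_image_scores(preds, truths):
    """First reading: sqrt(mean(e_i^2)) over the N per-image scores."""
    scores = np.array([per_image_rmse(p, t) for p, t in zip(preds, truths)])
    return float(np.sqrt(np.mean(scores ** 2)))


def rmse_over_all_pixels(preds, truths):
    """Second reading: pool the squared errors of every pixel in the set."""
    squared = np.concatenate([
        ((np.asarray(p, dtype=np.float64) - np.asarray(t, dtype=np.float64)) ** 2).ravel()
        for p, t in zip(preds, truths)
    ])
    return float(np.sqrt(np.mean(squared)))


# Toy data: two 2x2 images, off by 1 and by 3 at every pixel.
preds = [np.zeros((2, 2)), np.zeros((2, 2))]
truths = [np.ones((2, 2)), np.full((2, 2), 3.0)]

print(rms_of_image_scores(preds, truths))
print(rmse_over_all_pixels(preds, truths))
# A plain average of the per-image scores, for comparison.
print(np.mean([per_image_rmse(p, t) for p, t in zip(preds, truths)]))
```

Under these assumptions, and when all images have the same size, the two readings give the same number (sqrt(5) on the toy data), whereas a plain average of the per-image scores gives 2.0 and a plain sum grows with the number of images. Which aggregation the leaderboard actually uses is exactly what is being asked here.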

Thanks for asking a question that may be confusing to many folks. RMSE is a per-image score, i.e.