Problems and Suggestions

Hello everyone,

As the final dataset evaluation is now out and Phase 2 has started for the top 10 participants, I have several questions and concerns about the final evaluation if the current evaluation methodology remains the same.

Below I have summarised some of my thoughts and problems. My aim is to start a discussion with other (top 10) participants and the organisers of this challenge.

  1. A first major problem I see concerns the “opportunity to refine [your] model while working on [your] technical write-up”.

    • There was no mention of this in the challenge description beforehand. I think it is a bit dishonest to change the rules of the challenge midway through without even giving a reason.
    • While I can see my ranking on the full dataset, I do not know which of my submissions scored the highest on the full dataset. Therefore, I do not know which model I should continue to develop and refine.
      • In addition, the main objective of Phase 2 is to write the technical report. However, if participants are allowed to change their models again, how will they know which model will score the highest in the now “final-final” evaluation on the full test set? Since the evaluation is on the full test set, participants are incentivised to quickly submit three new versions, see which one performs best, and then write the technical report for that particular model. In other words, we are supposed to write a technical report, but at the same time we are allowed to change the very model it describes.
    • Three new submissions leave very little room for fine-tuning or experimentation. The first submission is more or less spent on retraining our model, uploading it and seeing how it performs. The other two tries are then invested in trying to find good hyperparameters for the full test set. I doubt anyone is going to make changes to their architecture, so it is just a bit of gambling and hoping that the test results improve.
    • This leads to the next point: new model submissions are likely to randomise the F2 scores again. The evaluation of the full dataset already showed that there are huge differences between the training set and the full test set, so reliable training against the test distribution is not directly possible. I have my doubts that the new 50 samples will solve this problem. Combined with my previous point, this means that some people will be lucky with their two or three new submissions and some will not. I think this additional randomness in the final score is unnecessary and even more unfair to the participants.
    • Speaking for myself, I have already invested a lot of blood, sweat and tears into this challenge just to create a good model. Retraining models, finding hyperparameters, etc. again takes a lot of time that participants should rather invest in the technical report. Combined with the personal and, most likely, professional lives that everyone has, this leads to a lot more work and less time in the two weeks we have left for both the model and the technical report.
  2. I was generally amazed at how big the difference was between the small test set and the final full test evaluation. Overall, there was a huge drop in F2 for all participants. Some unlucky models, such as “colt”, were hit really hard, dropping from 0.949 and second place to 0.758 and eighth place! “Millenial-IUP”, on the other hand, jumped from fifth place to the top. I was aware that there would be changes between the previous test set and the full dataset, but I would have expected them to be less significant. This reinforces my point that the training and test distributions are too far apart and that further retraining of the model will randomise the scores again.

  3. I also have some questions about the technical report. It says that “The technical report will be evaluated based on clarity, novelty, technical depth, reproducibility, and insights by a panel of judges.” Giving only one sentence as a requirement is a bit sparse in my opinion.

    • Clarity: Makes sense to me and simply asks me to write an easy-to-understand report.
    • Novelty: This has already been determined by the model we chose. The importance of this criterion is perhaps up for debate: if a model performs best, why should it matter that it uses a simple and common approach? However, I can see the intention to encourage innovation.
    • Technical depth: How much is too much and how little is too little? While some technical background is certainly needed to describe the methodologies and approaches, the question is how much. I would expect a thorough background in a final paper rather than in a technical report, but since this point is mentioned as an evaluation criterion, it is difficult for me to judge how detailed my report should be. In addition, technical depth with regard to what? My model, the methods, or the initial problem we are looking at?
    • Reproducibility: What kind of reproducibility are we talking about? Is reproducibility meant as others actually being able to retrain the models themselves, or more as an understanding of how the submitted model is implemented? What exactly is desired here?
    • As a related question, is it desirable to include programming code in the technical report or to link to GitHub pages? As far as I know, technical reports usually do not include code, but the “reproducibility” point leaves me confused.
    • Insights by a panel of judges: What exactly does this mean? Unfortunately, I cannot picture what is expected under this point.

**Suggestions**

As a participant, I do not have the full background as to why decisions were made the way they were. However, I would suggest three things:

  • Either get rid of the new submission option, or limit it to just one submission and tell us which model performed best. This would give people the opportunity to retrain the model for better results. I would prefer the first option, as the second just randomises the final scores and heavily depends on the training set participants choose. Note: my ranking remained more or less the same, so I have no real personal benefit from the first option over the second.
  • Please elaborate on the evaluation of the report.
  • Extend the technical report period by a few days or a week. The evaluation of the full test dataset already took two days longer than expected, and I assume that this “opportunity to refine the model” and the discussion I hope to stimulate will reduce the time people have for their technical reports.

I would be happy to hear from both the other participants and the organisers.

1 Like

As another participant, I generally agree with the previous post and would like to add a few points of my own.

  1. On the “opportunity to refine [your] model while working on [your] technical write-up”.
  • I agree that it comes as quite a surprise that we are granted the opportunity to further refine our models. While I think I understand the motivation behind this, especially given the large difference between the partial and full-test-set performance, I agree that this should have been mentioned before. With just 3 submissions, it will be very tricky to find reliable improvements, and it feels like the results will be based largely on chance. On top of that, I planned my personal schedule around the assumption that no coding would be necessary beyond the Phase 1 deadline, and I will not be able to invest the time that may be necessary to make meaningful improvements. It feels like this news mainly benefits those who entered the challenge late, effectively giving them another month to catch up to the other participants.

  • On this note, I also have several questions regarding the new and previous datasets:

      1. Will the 50 released test samples be real or synthetic data?
      2. Will they be removed from the full test set in order to prevent overfitting?
      3. Were any real objects contained within the previous training & partial test datasets?
      4. What are the sizes (number of objects) of the partial and full test sets?
      5. Will the new dataset also contain fixes for the broken objects of v2/v3?
  2. On the differences between the partial and full test set:
  • I understand that differences between the two sets are to be expected. However, I did not expect them to be this severe. The fact that every submission dropped to ~0.8 F2, down from ~0.95, indicates to me that the partial test set was not sufficiently representative of the full test set. It seems likely that this is caused by the non-synthetic objects, but that would simply mean that the synthetic objects are insufficient approximations of the real data.
  • For some participants, it seems that very old models are now best-performing. This means that the last weeks and months were spent improving on a metric which is not representative of the full challenge. As you have stated that you do not intend to release full results for our other submissions, I wonder how these participants are supposed to find out which changes are beneficial or detrimental to the “real” performance of their submissions.
  3. On the technical report:
  • Would it be possible for you to provide us with a (preferably LaTeX) template? I think this would help both participants and organizers.
  • Could you elaborate slightly on the “novelty” aspect of the report evaluation and its relative importance? If an approach works well even though it is not particularly innovative, will it be penalized for being too simple?

Overall, I would mainly hope for some transparency on your reasoning behind the extension of the coding deadline. Was it always planned to give us additional submissions for fine-tuning, or was this decision based on the results on the full test set?

Best regards and have a great weekend!

David

1 Like

Hi,

We have reverted the decision on the new fine-tuning phase. It was not something that we had planned beforehand; it was driven by our intention to give some participants the chance to reduce the difference between their two leaderboard scores. Needless to say, the feedback has not been positive, and there were some aspects that we did not take into account, so we have rolled it back. We apologize for the inconvenience caused.

Regarding the evaluation of the technical report, we will use the following criteria:

  • Clarity: We expect the technical report to be written in a clear and concise manner, making it easy for the judges to understand your methodology, results, and insights. Avoiding jargon and providing clear explanations will help in achieving this criterion.
  • Novelty: While your model’s performance has already been assessed, the novelty criterion focuses on the originality and creativity of your approach. Even if your model is effective, a novel approach that introduces new ideas or techniques can be a distinguishing factor in the evaluation.
  • Technical depth: The level of technical depth should strike a balance between providing enough detail to explain your methodologies and approaches without overwhelming the reader. You should aim to provide sufficient information for the judges to understand the technical aspects of your work, including your model and methodologies.
  • Reproducibility: Reproducibility refers to the ability of others to replicate your results. In the context of the report, this includes providing enough information for someone to understand and potentially recreate your model. You are not required to include your code, but feel free to include pseudo-code or snippets if they help enhance the understanding of your methodology.
  • Insights: This criterion evaluates the insights that you provide in your report, such as the implications of your findings and the lessons learned from your approach. We are also interested in the challenges that you faced along the way and how you tackled them.

There will be a 1-10 score for each criterion, and the participant’s overall score will be the average of them.

Thank you so much for the time and effort you have put into this challenge.

Best regards,
Víctor

1 Like

Hi again,

With regard to the format of the report, you can use the LaTeX template for arXiv-like preprints:

Best!
Víctor

1 Like

Hello all,

Regarding the confusion surrounding the challenge dataset: there have always been real objects in both the training and test datasets. The partial test set included 161 objects, both real and synthetic. The full test set includes 500 objects, also with real and synthetic objects. After a full review of the training dataset, the cause and extent of the errors observed in V2 and V3 have been identified. Only 12 out of 1900 objects were affected, and all future versions of the dataset will reflect the correct node labels for these objects. The test dataset was also reviewed for similar errors, and only 2 objects out of 500 were found to be missing node labels. These labels have since been corrected.

The difference between the partial and complete test datasets is that the partial test set only included data interpolated from moderate- to high-frequency orbit updates, whereas the full test set also includes data interpolated from less frequent orbit estimates. The inclusion of multiple data sources and update frequencies reflects realistic conditions, in which the persistence of observation varies between objects and for individual objects over time. The synthetic data in both the training and test datasets was derived from a high frequency of orbit updates relative to many of the real objects represented in the challenge dataset, which is why we recommended that participants use data augmentation techniques while developing their solutions.

The decision not to downsample the synthetic data was made for two reasons. First, the frequency of orbit updates represented by the synthetic data is comparable to that of real data sources that are otherwise not represented by the challenge dataset. Second, the original resolution of the synthetic data cannot be recovered by the participants if given a downsampled version, whereas providing the high resolution data gives participants the flexibility to downsample as they see fit without the inclusion of redundant objects in the challenge dataset.

Let us know if you have any further questions about the data, and we will do our best to be transparent.

Best,
Liz

1 Like

Hello,

How do you compute the F2,norm from the F2 of the leaderboard?

Thanks

Hi,

The F2 is computed in the same way as in the rest of the phases, as described on the challenge page and in the paper.

Best!
Víctor

Thank you very much for the transparency and for reverting to the original challenge outline! This came as a big relief.

Thank you also @hlizsolera for the clarification on the datasets, that makes a lot of sense. I agree that dataset augmentation could be a strong method for adapting the existing models to such low-frequency data, and I understand how the fine-tuning phase might have helped better reveal each model’s robustness to such data. It will be interesting to explore this further once the full dataset has been released after the challenge.
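
For what it’s worth, here is a minimal sketch of the kind of augmentation I have in mind (the array shapes and the keep_fraction parameter are purely hypothetical, not anything from the challenge code): randomly drop timesteps from a high-frequency trajectory so that the training data mimics the less frequent orbit updates of the real objects.

```python
import numpy as np

def downsample_trajectory(states, keep_fraction=0.3, rng=None):
    """Randomly keep a fraction of timesteps from a (T, D) array of orbit states,
    mimicking objects with less frequent orbit updates (hypothetical augmentation)."""
    rng = np.random.default_rng() if rng is None else rng
    n_steps = states.shape[0]
    n_keep = max(2, int(n_steps * keep_fraction))
    kept = np.sort(rng.choice(n_steps, size=n_keep, replace=False))
    return states[kept], kept  # return indices so timestamps/labels can be subset too

# Example: a hypothetical high-frequency trajectory of 1000 steps with 6 features each
trajectory = np.random.rand(1000, 6)
sparse_trajectory, kept_idx = downsample_trajectory(trajectory, keep_fraction=0.1)
```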

@Haik_Isaac : From my understanding, F2,norm is a normalization of the F2 values, where 1 represents the highest F2 score achieved and 0 remains 0 (i.e. each score expressed as a fraction of the current best model’s score).
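
A minimal sketch of that interpretation (my own assumption, not the official scoring code), using the generic F-beta score with beta = 2 and a simple division by the current best F2:

```python
# Sketch of my reading of F2 and F2,norm -- not the official implementation.
from sklearn.metrics import fbeta_score

# Hypothetical labels and predictions for a single submission
y_true = [1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1]
f2 = fbeta_score(y_true, y_pred, beta=2)  # F-beta with beta=2 weights recall over precision

# Hypothetical leaderboard of F2 values, including the submission above
leaderboard_f2 = [0.82, 0.79, f2]
best = max(leaderboard_f2)
f2_norm = [score / best for score in leaderboard_f2]  # best score maps to 1.0, 0 stays 0
print(f2_norm)
```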

Best regards and have a great week

David

1 Like

Bummer on the elimination of the 3 additional submittals. I was looking forward to that aspect of Phase 2. Any chance submittals against the full test set could be turned on during Phase 2 anyway to help inform our technical reports?

Thanks,
Jeff

1 Like

We are disappointed about the elimination as well, and are confused about the downside of allowing the additional submissions. It seems like teams that don’t want to use their additional three submissions are not forced to in any way.

However, allowing evaluations on the full test set sounds like a great idea. Hopefully the organizers will do something like this.

1 Like

@FuturifAI_OFFICIAL There are a variety of downsides, all of which are listed above.

@beckja While this only indirectly affects the scoring, some of the points in my original post and @DavidB’s comment still apply.

One suggestion would be to let the challenge continue as originally planned. At the end of the challenge, the full dataset could be released to allow further experimentation with our current models.

@Backwelle See below; I am not sure what the downsides for the competition are.

There is a clear reason: the big difference in scores between the datasets.

Don’t change your model in a big way (or do, and write about both).

You are not supposed to train your model on the 50 samples, but rather use them to validate your model’s performance. 3 submissions is more than enough for a team that has already developed a well-performing model. The partial-to-full test set transition already randomised the scores; the intent of the second phase is to reduce that randomness.

So don’t make more submissions, and work on your technical report only. No one is forcing you to train your model further.

@beckja @FuturifAI_OFFICIAL We will be releasing the full test dataset as well as the details of the models’ performance on the full test set, so you will have all of this information available to you as you write your reports.

2 Likes

@FuturifAI_OFFICIAL With all due respect, you did not address any of the points I made.
To avoid prolonging the discussion unnecessarily, I believe the compromise made by the organisers is reasonable. The complete test dataset will be released, allowing all participants to include additional details in their report. However, the scoring system remains unchanged as any new evaluation would not be fully comparable anymore.

Any updates on the release of the full test dataset?

Thanks,
Jeff