Hello! I’m trying to figure out the cause of an issue that has been troubling me for some time now, and I’m writing in the hope that you might be able to assist.
After creating a challenge (the EIMTC challenge) on the EvalAI platform, I encountered the following issue:
Submissions get stuck in the “Running” state.
- When the worker has no other tasks/submissions to run, a small submission of around 1 KB is eventually evaluated successfully, although it can take a few minutes.
- In contrast, my actual submission files are fairly large (~6 MB). These can stay in “Running” status for days, possibly weeks.
- Any submission afterward gets stuck in “Running” as well, even small ones like in the first point.
- I did some digging, and the worker logs suggest that, for some reason, the worker is being “restarted” every time and re-running/re-evaluating the “Running” submissions, looping endlessly: trying to evaluate → submissions stay stuck in “Running” → trying to evaluate again…
What I mean by “restarting” is that I see the following messages:
[2022-04-11 02:57:31] INFO WORKER_LOG Using /tmp/tmpv__38txs as temp directory to store data
[2022-04-11 02:57:31] INFO No custom requirements for challenge 638
And then the submission that the worker is trying to evaluate:
[2022-04-11 03:05:15] INFO WORKER_LOG Processing message body: {"challenge_pk": "638", "phase_pk": "1860", "is_static_dataset_code_upload_submission": false, "submission_pk": 15319}
[2022-04-11 03:05:15] INFO SUBMISSION_LOG [x] Received submission message {"challenge_pk": "638", "phase_pk": "1860", "is_static_dataset_code_upload_submission": false, "submission_pk": 15319}
And then it starts all over again (note the new temp directory):
[2022-04-11 03:07:12] INFO WORKER_LOG Using /tmp/tmpkv0u1_w5 as temp directory to store data
[2022-04-11 03:07:12] INFO No custom requirements for challenge 638
… and it tries to evaluate submission 15319 again.
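For what it’s worth, the loop above behaves like a message queue re-delivering the same submission message whenever processing doesn’t finish in time. Here is a minimal, hypothetical sketch of that pattern, just to illustrate what I think is happening (this is a simplified model I wrote, not EvalAI’s actual worker code; the names and the timeout value are assumptions):

```python
import queue
import time

# Assumed timeout for illustration: if evaluation takes longer than
# this, the message is considered undelivered and is re-queued.
VISIBILITY_TIMEOUT = 0.1  # seconds


def evaluate(message, duration):
    """Pretend evaluation that takes `duration` seconds.

    Returns True if it finished within the timeout, False otherwise.
    """
    time.sleep(duration)
    return duration < VISIBILITY_TIMEOUT


def run_worker(q, evaluation_duration, max_cycles=3):
    """Consume messages from `q`; re-queue any that time out.

    `max_cycles` caps the loop so this sketch terminates; in the
    failure mode I'm describing, the real loop never ends.
    """
    cycles = 0
    while not q.empty() and cycles < max_cycles:
        message = q.get()
        finished = evaluate(message, evaluation_duration)
        if not finished:
            # Timeout expired: the same submission message becomes
            # visible again and is re-delivered to the worker.
            q.put(message)
        cycles += 1
    return cycles


q = queue.Queue()
q.put({"submission_pk": 15319})
# A large submission that always exceeds the timeout is re-delivered
# on every cycle (capped at max_cycles here so the sketch terminates).
cycles = run_worker(q, evaluation_duration=0.15, max_cycles=3)
print(cycles)  # 3: the same message was processed three times
```

In this model, a small submission finishes within the timeout and the queue drains after one cycle, while a large one is re-queued every time, which matches the small-vs-large behavior I described above.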
Observation:
This occurs in every phase of the challenge.
Observation:
The same phenomenon occurs on both the staging and production servers (staging.eval.ai and eval.ai).
Links (EIMTC challenge):
Staging:
Prod:
(Currently they have different source code; the staging one is more up to date.)
Thank you in advance.