About the test samples

Hello everyone!

Our team would like to understand how the test input data will be delivered to the model. As we see it, there are two possible approaches:

(I) The 10-minute segments are treated as independent inputs: no temporal correlation between them is preserved, and they are tested in random order.

(II) The 10-minute segments are correlated in time and tested in chronological order, so information from previous segments can be carried over to the next one.

Thank you in advance!

In the ‘Submit’ section of the challenge description there is the following statement:

Implement a Python model that accepts 1 hour of combined input data and returns a probability of a seizure occurring in the following 15 minute

This seems to imply that models will be tested on every segment of a pre-ictal event at once, and that a single overall probability must be returned. This is similar to the approach in the Nasseri paper (Ambulatory seizure forecasting with a wrist-worn device using long-short term memory deep learning | Scientific Reports (nature.com)), which is cited as the origin of the challenge:

Mean classifier probability values were calculated for groups of five consecutive 1-min segment, and the maximum probability across each 60-min interval was calculated.
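To make sure we read that sentence correctly, here is a minimal sketch of the aggregation as we understand it. It assumes 60 per-minute probabilities per hour and non-overlapping groups of five; the paper's wording would also be compatible with a sliding window, so the grouping is our assumption. The function name `hourly_forecast` is hypothetical.

```python
from statistics import mean


def hourly_forecast(minute_probs, group_size=5):
    """Average consecutive groups of `group_size` probabilities, then take the maximum.

    Assumption: groups are non-overlapping and any incomplete trailing group is ignored.
    """
    if len(minute_probs) < group_size:
        raise ValueError("need at least one full group of probabilities")
    last_start = len(minute_probs) - group_size
    group_means = [
        mean(minute_probs[i:i + group_size])
        for i in range(0, last_start + 1, group_size)
    ]
    return max(group_means)


# One hour of 1-min probabilities: low for 55 minutes, high for the last 5.
probs = [0.1] * 55 + [0.9] * 5
forecast = hourly_forecast(probs)
```

Under this reading, a single probability is produced per 60-minute interval, which is what makes us think the whole event is scored at once.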

However, we find some of the examples quite confusing. In outputs.md (msg-2022/outputs.md at master · seermedical/msg-2022 (github.com)), the submission example is a table with a probability prediction for each .parquet file individually:

filepath,prediction
1110/000/UTC-2020_12_06-21_00_00.parquet,0.417022004702574
1110/000/UTC-2020_12_06-21_10_00.parquet,0.7203244934421581
1110/000/UTC-2020_12_06-21_20_00.parquet,0.00011437481734488664
1110/001/UTC-2020_12_07-03_00_00.parquet,0.30233257263183977
1110/001/UTC-2020_12_07-03_10_00.parquet,0.14675589081711304
1869/002/UTC-2020_12_08-03_00_00.parquet,0.0923385947687978
1869/002/UTC-2020_12_08-03_10_00.parquet,0.1862602113776709
1876/000/UTC-2020_12_08-03_50_00.parquet,0.34556072704304774
1876/000/UTC-2020_12_08-04_00_00.parquet,0.39676747423066994
1876/003/UTC-2020_12_09-03_30_00.parquet,0.538816734003357
1876/003/UTC-2020_12_09-03_40_00.parquet,0.4191945144032948
1876/003/UTC-2020_12_09-03_50_00.parquet,0.6852195003967595

However, in the InceptionTime example, submission.csv (msg2022 → examples → inceptionTime → submission → submission.csv) does not make it explicit whether the probabilities are calculated for each .parquet file or for the group of samples in each subfolder. The output looks like one overall probability for each whole pre-ictal event:

filepath prediction
test/1110/000 0.0002441753
test/1110/001 0.00077731156
test/1904/001 0.010788057
test/1904/002 0.20220436
test/1876/000 0.054220196
test/1876/001 0.711805
test/1965/000 0.010349908
test/1965/001 0.006649486
test/1869/000 0.02781004
test/1869/001 0.0014016484
test/2002/000 0.00019426351
test/2002/001 0.00070443813

How are the tests designed? Are we expected to compute a probability for each .parquet segment independently, or to return a single overall probability for the whole pre-ictal event?
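For reference, the two formats can be bridged if needed. The sketch below collapses per-file predictions (as in the outputs.md example) into one value per subfolder. Taking the maximum is purely our assumption, as is the helper name `per_folder_predictions`; nothing in the challenge material confirms which reduction, if any, is expected.

```python
import csv
import io
from collections import defaultdict
from pathlib import PurePosixPath

# First rows of the per-file example from outputs.md.
PER_FILE_CSV = """filepath,prediction
1110/000/UTC-2020_12_06-21_00_00.parquet,0.417022004702574
1110/000/UTC-2020_12_06-21_10_00.parquet,0.7203244934421581
1110/000/UTC-2020_12_06-21_20_00.parquet,0.00011437481734488664
1110/001/UTC-2020_12_07-03_00_00.parquet,0.30233257263183977
1110/001/UTC-2020_12_07-03_10_00.parquet,0.14675589081711304
"""


def per_folder_predictions(csv_text, reduce=max):
    """Group per-file predictions by parent folder and reduce each group to one value.

    Assumption: `reduce=max` is an illustrative choice, not the organisers' rule.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        folder = str(PurePosixPath(row["filepath"]).parent)
        grouped[folder].append(float(row["prediction"]))
    return {folder: reduce(values) for folder, values in grouped.items()}


folder_predictions = per_folder_predictions(PER_FILE_CSV)
```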

Hi BlakeJC,

Sorry to bother you with a similar issue, but I have a question regarding the test data that I hope you can help me with.
The description states that:

> Seizures that occur within 4 hours of the previous seizure are not labelled (These are known as lead seizures).

In the training data, we could simply drop the sequences that fall within that time frame (or handle them in some other way).
But if we approach the problem like an “image classification task”, processing one sequence at a time without the possibility of including information from previous sequences (please correct me if I’m wrong), are we allowed to, for example, hardcode into the model the timeframes in which it is impossible for a seizure to be recorded?

Another aspect that is not too clear to me is related to the timeframe. From your comment:

it seems that the sequences used for the private leaderboard were randomly picked from the dataset, but I would have expected all of them to come from a period after the last values of the training data. Could you please help me clarify this?

Thank you.
Best regards,
Leonardo

Based on the discussion about the test samples, it is still not entirely clear to me whether samples from the past may be used. So, once again, my explicit question:

Is it explicitly allowed to use sequences from the past? That is, if I have a sample with a given timestamp and know that several consecutive 10-minute sequences precede it, can I also use that chain of 10-minute sequences for the prediction, so as to include the extended history?

Thank you in advance!

Thanks for the question!

As the contest rules make no mention of past data, you are able to use samples from the past to generate your predictions.
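As an illustration of what using past samples could look like, here is a minimal sketch that finds the unbroken chain of 10-minute segments immediately preceding a target file. It assumes the filename pattern seen in the examples above (`UTC-YYYY_MM_DD-HH_MM_SS.parquet`) and that only files in the same subfolder count as history; both are assumptions, and the names `segment_start` and `preceding_segments` are hypothetical.

```python
from datetime import datetime, timedelta
from pathlib import PurePosixPath

SEGMENT = timedelta(minutes=10)


def segment_start(filepath):
    """Parse the segment start time from a name like UTC-2020_12_06-21_00_00.parquet."""
    return datetime.strptime(PurePosixPath(filepath).stem, "UTC-%Y_%m_%d-%H_%M_%S")


def preceding_segments(target, available, max_history=6):
    """Return up to `max_history` contiguous earlier segments, oldest first.

    Assumption: history is limited to files in the same folder as `target`.
    """
    folder = PurePosixPath(target).parent
    by_start = {
        segment_start(path): path
        for path in available
        if PurePosixPath(path).parent == folder
    }
    history = []
    cursor = segment_start(target) - SEGMENT
    # Walk backwards in 10-minute steps until a gap or the history limit is reached.
    while cursor in by_start and len(history) < max_history:
        history.append(by_start[cursor])
        cursor -= SEGMENT
    return history[::-1]


files = [
    "1110/000/UTC-2020_12_06-21_00_00.parquet",
    "1110/000/UTC-2020_12_06-21_10_00.parquet",
    "1110/000/UTC-2020_12_06-21_20_00.parquet",
    "1110/001/UTC-2020_12_07-03_00_00.parquet",
]
history = preceding_segments("1110/000/UTC-2020_12_06-21_20_00.parquet", files)
```

Here `history` holds the 21:00 and 21:10 files, which could then be concatenated with the target segment before it is passed to the model.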