Question 1: Sachin already covered the file format. There is no backward mapping from valid.json to valid.parquet because we remove some information to make it harder to cheat by using different sequences.
Question 2, about the median filter: here’s the code. Since we only run it once across the datasets, it is neither optimized nor pretty. The reason for the sample count in the valid.json (and test.json) files is that we randomly sample a subset of sequences. We do this so that participants can’t infer predictions from future sequences.
import math
from statistics import median

def _median_filter_vector(vector, num_samples=5, min_support=3):
    """Median filters a 1-dimensional input vector

    vector: input vector to be filtered
    num_samples: support for the median filter
    min_support: if fewer than min_support values valid, sets filtered value to nan
    NOTE First and last num_samples // 2 values are not filtered
    """
    half = num_samples // 2
    # Copy the leading values unfiltered
    filtered = list(vector[:half])
    for index in range(half, len(vector) - half):
        # Collect the non-NaN values inside the window centered on index
        valid_values = [value for value in vector[index - half : index + half + 1] if not math.isnan(value)]
        support = len(valid_values)
        if support < min_support:
            filtered.append(math.nan)
        else:
            filtered.append(median(valid_values))
    # Copy the trailing values unfiltered
    filtered.extend(vector[len(vector) - half:])
    assert len(vector) == len(filtered)
    return filtered
We linearly interpolate between the median-filtered samples to obtain the prediction targets. The median filter acts as a low-pass filter to reduce detection noise, and the linear interpolation yields the correct values for the requested timestamp.
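To make the interpolation step concrete, here is a minimal sketch. It is not our production code: the helper name interpolate_at, the timestamps, and the sample values are all made up for illustration, and it assumes at least two samples with strictly increasing timestamps.

```python
from bisect import bisect_right


def interpolate_at(timestamps, values, query_time):
    """Linearly interpolate values (sampled at sorted timestamps) at query_time.

    Hypothetical helper for illustration; assumes at least two samples
    with strictly increasing timestamps.
    """
    if not timestamps[0] <= query_time <= timestamps[-1]:
        raise ValueError("query_time lies outside the sampled range")
    # Index of the first sample strictly after query_time, clamped so that
    # query_time == timestamps[-1] falls into the last interval.
    right = min(bisect_right(timestamps, query_time), len(timestamps) - 1)
    left = right - 1
    t0, t1 = timestamps[left], timestamps[right]
    v0, v1 = values[left], values[right]
    # Fraction of the way from the left sample to the right sample.
    weight = (query_time - t0) / (t1 - t0)
    return v0 + weight * (v1 - v0)


# Made-up median-filtered samples at 0.5 s spacing.
timestamps = [0.0, 0.5, 1.0, 1.5]
filtered = [10.0, 12.0, 13.0, 13.5]
target = interpolate_at(timestamps, filtered, 0.75)  # halfway between 12.0 and 13.0
```

Note that if either neighboring filtered sample is nan (too little support in the median window), the interpolated value is nan as well.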
I hope this helps.