@tonio your solution could work, but if you look at the video, the referee's head looks quite similar to the ball on the field, so the NN might make a wrong prediction with high confidence.

I think using a Kalman filter could also be a solution. The intuition is to model the ball's movement while taking into account the “probability” of where the ball is in each video frame. Using simple equations of motion, we keep track of where we expect the ball to be in the next frame, and if the observation is too different (the ball jumped to the referee’s head), we discard that detection and use the next-best prediction from the neural network.
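
To make the idea a bit more concrete, here is a rough sketch of what I have in mind; I haven't tested it on the actual video. It assumes a constant-velocity motion model and that the NN returns a list of `(x, y, confidence)` candidates per frame. The class name `BallTracker`, the noise values, and the gate threshold are all my own placeholders, not something taken from tonio's code.

```python
import numpy as np


class BallTracker:
    """Constant-velocity Kalman filter over the state [x, y, vx, vy] (pixels)."""

    def __init__(self, x, y, dt=1.0, process_var=1.0, meas_var=4.0, gate=9.21):
        # Initial state: known position, unknown (zero) velocity.
        self.x = np.array([x, y, 0.0, 0.0], dtype=float)
        # Large initial velocity variance because we have not seen any motion yet.
        self.P = np.diag([meas_var, meas_var, 100.0, 100.0])
        # Equations of motion: position += velocity * dt.
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        # We only observe position, not velocity.
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)
        # Assumed noise levels; these would need tuning on real footage.
        self.Q = process_var * np.eye(4)
        self.R = meas_var * np.eye(2)
        # Gate on squared Mahalanobis distance (9.21 is roughly the 99%
        # chi-square quantile for 2 degrees of freedom).
        self.gate = gate

    def predict(self):
        """Advance one frame and return the expected ball position."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, candidates):
        """Accept the most confident candidate that passes the gate.

        Returns the accepted (x, y), or None if every candidate is rejected,
        in which case the filter simply keeps its prediction for this frame.
        """
        S = self.H @ self.P @ self.H.T + self.R
        S_inv = np.linalg.inv(S)
        # Try detections from most to least confident.
        for cx, cy, _conf in sorted(candidates, key=lambda c: -c[2]):
            innovation = np.array([cx, cy]) - self.H @ self.x
            d2 = innovation @ S_inv @ innovation
            if d2 <= self.gate:
                K = self.P @ self.H.T @ S_inv
                self.x = self.x + K @ innovation
                self.P = (np.eye(4) - K @ self.H) @ self.P
                return (cx, cy)
        return None


if __name__ == "__main__":
    tracker = BallTracker(100.0, 100.0)
    tracker.predict()
    # The "referee's head" has higher confidence but is far from the prediction.
    print(tracker.update([(400.0, 50.0, 0.9), (105.0, 100.0, 0.6)]))
```

The gating step is the important part: a high-confidence detection that is far from where the ball could plausibly be gets skipped in favour of a lower-confidence one that fits the motion model.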

This is just speaking at a high level, but it’d be interesting to see the approach implemented. Maybe it can be improved by accounting for relative ball size in the frame as well (so that close-ups of bald people in the crowd are not picked up as a ball).
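
For the size idea, a very simple check could look something like this. It assumes the detector gives a bounding box width and height in pixels and that we have some estimate of the expected ball diameter for the current camera zoom; the function name and the tolerance values are made up for illustration.

```python
def plausible_ball_size(width, height, expected_diameter,
                        tolerance=0.5, max_aspect=1.5):
    """Return True if a bounding box is roughly ball-sized and roughly square.

    width, height: detected box size in pixels.
    expected_diameter: assumed ball diameter in pixels at the current zoom.
    tolerance: allowed relative deviation from the expected diameter.
    max_aspect: largest allowed ratio between the longer and shorter side.
    """
    if width <= 0 or height <= 0:
        return False
    # A ball should look close to a circle, so its box should be near-square.
    aspect = max(width, height) / min(width, height)
    if aspect > max_aspect:
        return False
    # Compare the mean side length with the expected diameter.
    diameter = (width + height) / 2.0
    return abs(diameter - expected_diameter) <= tolerance * expected_diameter
```

Candidates that fail this check could be dropped before they ever reach the tracker, so a large round head in a close-up would not count as a ball.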