It is a very interesting phenomenon that the predictions are all different but generate similar scores. If the teams are using a validation dataset, it would be interesting to compare their scores on it. Assuming they are not (and that k-fold cross-validation would introduce too much complexity), it would be interesting to see how much their R^2 or RMSE differ on the training dataset.
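Something like this sketch would do it, assuming each team's predictions and the shared training targets are available as numpy arrays (the names here are hypothetical):

    import numpy as np
    from sklearn.metrics import r2_score, mean_squared_error

    def compare_training_fit(y_train, team_preds):
        """Print each team's R^2 and RMSE against the training targets."""
        for team, y_pred in team_preds.items():
            r2 = r2_score(y_train, y_pred)
            rmse = np.sqrt(mean_squared_error(y_train, y_pred))
            print(f"{team}: R^2={r2:.4f}, RMSE={rmse:.4f}")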
Another interesting experiment would be to have some of the student teams merge and "ensemble" their submissions. Since the scores are so similar, you could reasonably ensemble the best submissions from multiple teams using the simple average of their respective predictions. If the underlying predictions are actually dissimilar, I think they'd be pleasantly surprised by how well this works.
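The averaging step itself is a one-liner; a minimal sketch, assuming each submission is a numpy array of predictions aligned on the same test rows:

    import numpy as np

    def ensemble_mean(submissions):
        """Blend several teams' submissions by simple averaging."""
        return np.mean(np.stack(submissions), axis=0)

    # e.g. blended = ensemble_mean([team_a_preds, team_b_preds, team_c_preds])

The intuition is that if the teams' errors are not perfectly correlated, averaging cancels some of them out, so the blend often scores better than any single submission.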