First, thanks to all the folks at Kaggle for running this contest. I'm sure today ("launch day") has been a busy one.
I've got a few questions about the rules & datasets --- can any of the organizers provide some answers? Thanks.
- Leaderboard: On the "Evaluation" page, it says entrants can submit beginning April 18th... but on the "Rules" page, section 8 (& elsewhere), it says submissions can begin on May 4th. Is the April 18th date a typo? Or can we submit then, but the leaderboard is just not active until May 4th? (I'm assuming a typo, but want to verify).
- Outside Data Use: The rules say outside data is permitted until April 4, 2012, as long as the source is publicly declared in the forums. Can you clarify what happens after that date? Is the intent that after 4/4/2012, we can only use the data sources that others have already declared in the forums? Or is the use of outside data after that date forbidden? (I'm assuming the former, but want to verify).
- Milestone prize winner: In section 13 of the rules, it says that Milestone prize candidates will provide their algorithm code and documentation and that the sponsors (Kaggle?) will "post the information on the Website for review and testing by other Entrants." What does "information" mean? Does it mean that all the CODE of the winners will be made available to all competitors, or just the high level descriptions of their algorithms?
- Y5: Dataset "Y5" is mentioned in the evaluation page, FAQ, and in the Rules (section 12). However, it's not described fully. In the Evaluation tab, for example, it mentions "Y4 (or if applicable, Y5)." Can you elaborate on Y5? When would competitors use it instead of Y4? Why use the phrase, "if applicable"?
- Sampling: Is the set of patients a random sample of all the patients, or was some selection criterion applied? (e.g. were non-emergency hospitalizations, like childbirths, excluded?)
- Releases of Data: Any particular reason why the data tables are not all being released at the same time? (e.g. is each table being released as soon as it becomes available from its source?) Just to state the obvious, competing is harder without the full dataset in hand! ;)