Initial Questions about the Rules & Dataset

First, thanks to all the folks at Kaggle for running this contest.  I'm sure today ("launch day") has been a busy one.   

I've got a few questions about the rules & datasets --- can any of the organizers provide some answers?  Thanks.


  • Leaderboard:   The "Evaluation" page says entrants can submit beginning April 18th... but the "Rules" page, section 8 (& elsewhere), says submissions can begin on May 4th.  Is the April 18th date a typo?  Or can we submit then, but the leaderboard is just not active until May 4th? (I'm assuming a typo, but want to verify).
  • Outside Data Use:  The rules say outside data is permitted until April 4, 2012, as long as the source is publicly declared in the forums. Can you clarify what happens after that date? Is the intent that after 4/4/2012, we can only use the data sources that others have already declared in the forums?  Or is the use of outside data after that date forbidden? (I'm assuming the former, but want to verify).
  • Milestone prize winner: In section 13 of the rules, it says that Milestone prize candidates will provide their algorithm code and documentation and that the sponsors (Kaggle?) will "post the information on the Website for review and testing by other Entrants."   What does "information" mean? Does it mean that all the CODE of the winners will be made available to all competitors, or just the high level descriptions of their algorithms?  


  • Y5:   Dataset "Y5" is mentioned in the evaluation page, FAQ, and in the Rules (section 12). However, it's not described fully.  In the Evaluation tab, for example, it mentions "Y4 (or if applicable, Y5)."   Can you elaborate on Y5?  When would competitors use it instead of Y4?  Why use the phrase, "if applicable"?  
  • Sampling:  Is the set of patients a random sample of all the patients, or was some selection criterion applied?  (e.g. were non-emergency hospitalizations, like childbirths, excluded?)
  • Releases of Data:  Any particular reason why the data tables are not all being released at the same time? (e.g.  is each table being released as soon as it becomes available from its source?) Just to state the obvious, competing is harder without the full dataset in hand! ;)
Thanks for your thorough questions. The first is a typo, fixed now.

You are correct in your understanding of external data - after April 4, 2012 you may use any external data sets that have been made public on the forum (unless for some reason they are explicitly disallowed - e.g. they have some IP protection that means they can't be used).

Only the paper describing the algorithm will be posted publicly. The paper must fully describe the algorithm. If other competitors find that it's missing key information, or doesn't behave as advertised, then they can appeal. The idea of course is that progress prize winners will fully share the results they've used to that point, so that all competitors can benefit for the remainder of the comp, and so that the overall outcome for health care is improved.

The references to Y5 are a hold-over from drafting - we didn't know whether we'd be able to get an extra year's data in time for the start of the comp. It turned out that we couldn't get a whole year's worth of data, so we decided not to use it. Those pages should of course have been updated after that was decided! I've updated them now - thanks for catching that.

Yes, some sampling did occur. It wasn't random - it was based on issues such as removing records that could be more easily de-identified, records with obvious data entry errors, and so forth. We won't be providing any further details about that process - it's up to competitors to get the most out of the data that they can. :)

I've answered the last question in the other forum post about that issue. Good luck in the comp!
Oh BTW Chris - did you see you were mentioned on Forbes today?

No, I didn't see that one! Looks like more good press for Kaggle, too. Thanks for pointing it out & for your clarifications. 


