"Once an Entry is selected as eligible for a prize, the conditional winner must deliver the Prediction Algorithm’s code and documentation to Sponsor for verification within 21 days. Documentation must be written in English and must be written so that individuals trained in computer science can replicate the winning results. Source code must contain a description of resources required to build and run the method. Conditional winners must be available to provide assistance to the judges verifying their Entries"
Consider an entry that uses a search heuristic (e.g. genetic algorithm, hill climbing, etc.) to do pre-processing. For example, one may use a search heuristic to generate 50 pairs of feature subsets and classifier choices based on some fitness measure. Then say those 50 constituent classifiers are each trained on their assigned subset of features and used to generate predictions. Finally, the 50 sets of predictions are combined using an ensemble method such as stacked regression.
The ambiguity is that the rules don't seem to place any limit on the computing resources required to recreate a winning entry.
For example, suppose:
1) each heuristic search required 24 hours processing time (on a high-end workstation).
2) training each constituent classifier required 2 hours processing time.
3) the final ensemble required 4 hours processing time.
If the Sponsor required full recreation of the winning entry from contest data, then (50 x 24) + (50 x 2) + 4 = 1304 computing hours are required, or about 54 days. Is it reasonable to assume that Sponsor will wait almost 2 months for verification? The rules are unclear here!
If the Sponsor required only recreation from the selected feature subsets and classifier choices (using the output of the preprocessing steps as a starting point), then verification is reduced to (50 x 2) + 4 = 104 hours, or about 4 1/3 days. Or perhaps it is sufficient to also validate only a small sample of the preprocessing steps, which could be completed in a few days. But again the rules are unclear here.
This ignores the impact of parallel computing. For example, perhaps the Sponsor has ample computing resources and can fully recreate the winning entry in 48 hours using a farm of 100 workstations. But is it the contestant's or the Sponsor's responsibility to identify and implement the parallel computations?
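To make the arithmetic easy to check, here is a rough Python sketch of the two serial estimates. The function and constant names are my own, and the hour figures are just the hypothetical ones from the example, not measurements.

```python
# Hypothetical figures from the example above (not measured values).
N_PAIRS = 50          # feature-subset / classifier pairs
SEARCH_HOURS = 24     # one heuristic search
TRAIN_HOURS = 2       # training one constituent classifier
ENSEMBLE_HOURS = 4    # final ensemble step


def serial_hours(include_search: bool) -> int:
    """Total hours on a single workstation, running every step one after another."""
    hours = N_PAIRS * TRAIN_HOURS + ENSEMBLE_HOURS
    if include_search:
        hours += N_PAIRS * SEARCH_HOURS
    return hours


full = serial_hours(include_search=True)      # recreate everything from contest data
partial = serial_hours(include_search=False)  # start from the saved search output

print(f"full recreation:    {full} hours, about {full / 24:.1f} days")
print(f"from search output: {partial} hours, about {partial / 24:.1f} days")
```

The gap between the two numbers is entirely the 50 heuristic searches, which is why it matters whether the Sponsor accepts the saved search output as a starting point.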
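As a rough idea of what parallelism could buy, here is a small Python sketch of an idealized estimate. It assumes (my assumption, not anything in the rules) that the 50 searches are independent of each other, that the 50 training jobs are independent of each other, that the ensemble step runs on one machine, and that there is no scheduling or data-transfer overhead. The function name and defaults are hypothetical.

```python
import math


def parallel_hours(n_workers: int,
                   n_pairs: int = 50,
                   search_hours: int = 24,
                   train_hours: int = 2,
                   ensemble_hours: int = 4) -> int:
    """Idealized wall-clock hours with n_workers identical workstations.

    Searches and training jobs are run in 'waves' of up to n_workers jobs;
    the ensemble step is treated as serial.
    """
    waves = math.ceil(n_pairs / n_workers)
    return waves * search_hours + waves * train_hours + ensemble_hours


for workers in (1, 10, 100):
    print(f"{workers:>3} workstation(s): {parallel_hours(workers)} hours")
```

Under these assumptions any farm of 50 or more workstations bottoms out at 24 + 2 + 4 = 30 hours, so a 48 hour turnaround on 100 workstations looks plausible, but only if someone does the work of splitting the jobs up.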
Basically, I don't want to run a program on my workstation for 2 months to generate an entry and be disqualified on a technicality.