Jeremy Howard (Kaggle) wrote:
Remember, a random number generator is simply a deterministic function that is applied to a seed. The randomisation that is used based on external factors (e.g. MAC address, ticks, process id, etc) is used to pick a seed if a specific one is not given, and
can be easily overridden by seeding the RNG.
Well, I use a lot of SAS, which has a large number of functions that generate pseudo-random numbers. For some of them, the built-in default choice is to use the system clock to generate the initial seed. So from that I am going to presume that using the
system clock to set the initial seed value is a well-accepted practice for pseudo-randomization. To the best of my knowledge, if one is using the built-in random number functions in SAS with the system clock as the initial seed, there is no mechanism available
to determine what the actual initial seed value was. And I am willing to take on faith that if one cannot determine the precise nano-second value used, the odds of a verification re-run of the code to ever re-enter the deterministic stream of numbers at exactly
the same point as the submission run are infinitely small.
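To make the contrast concrete, here is a minimal sketch in Python rather than SAS (the helper name draw and the seed value 12345 are just illustrative assumptions). With an explicit seed the stream replays exactly; with no seed, Python's random module seeds itself from OS entropy or the system time, and the seed it picked is not reported back.

```python
import random

def draw(seed=None, n=5):
    """Return n pseudo-random floats from a generator seeded with `seed`.

    With seed=None, Python seeds the generator from OS entropy (or the
    system time as a fallback), so the stream cannot be replayed later.
    """
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Explicit seed: the same stream on every call.
assert draw(seed=12345) == draw(seed=12345)

# No seed: two calls will almost certainly differ, and nothing records
# which seed was actually used.
print(draw() == draw())
```

A pre-compiled routine that only ever takes the second path gives the caller no way to get back onto the first one.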
Now, in the SAS code I write, I have the ability to control the seed, and can make sure I never use a seed based on the system clock. But if I use some compiled code constructed by someone else (and yes, people do sometimes distribute pre-compiled SAS macros,
as a way of protecting their intellectual property), I could be using a random number function where I have not been given the ability to set the initial seed.
Likewise, if anyone is using applications or code built by others, it is very conceivable that the logic to partition the data into test vs validation datasets, pick variables in a random forest ensemble, etc, may be controlled by unseen and untouchable random
number generators that are tied to the system clock.
I have the impression that a number of people are using the caret package for R ... shout-out and thanks to Max Kuhn right here .... SALUTE !!! .... Is it possible to explicitly set the seed for
every modeling package for which caret is a wrapper? Are there any with an inaccessible, clock-based seed under the covers? Fact is, I don't know, and I absolutely do not want to HAVE to know about such
details, and use that information to decide which modeling packages are safe to use for this contest, and which aren't.
The fact that I can't cite an example of a specific piece of s/w with hard-coded and inaccessible clock-based seed doesn't mean such s/w doesn't exist; I've never had to worry about that level of reproducibility before, so I haven't been accumulating a list
of examples. But in an earlier stage of my career, I was a data analyst for a statistical consulting company, where most of the work we did was litigation support, intended to be used as evidence in trials. Our working assumption was always that the other side
could pay other people equally as proficient more money to go through our data and analysis line by line in an effort to discredit it ... which some did attempt .... and even in that adversarial environment where convictions meant jail or multi-million dollar
fines, the question of "exact reproducibility to 0.0000000001" never came up.
But here and now it is becoming an issue, because of the interpretation you all are putting on "exactly reproducible."
Has anyone actually in practice during this comp re-run their algorithm using a given random seed, and got different results?
Until this came up, most people probably weren't thinking they'd be wise to test their code on a different hardware platform. And most people probably eyeball their self-generated "Kaggle score" at 4 decimal places, not at full double precision.
So here again, the absence of citable evidence is not compelling.
The practical difficulties of setting a specific difference threshold is not at all easy,
Really? How about "reproducible to within +/- 0.001% of the contestant's leaderboard score?" How hard is that? It's an arbitrary mgmt decision about how close is close enough to draw a reasonable conclusion that the code and algorithm you were given did
in fact generate the predicted scores that were submitted. What you are trying to rule out is the possibility that the submitted scores were generated from illicit access to the uncensored, non-public HPN data, rather than a legitimate, predictive algorithm,
constructed using the datasets made available to all contestants.
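As a rough sketch of how simple such a check could be, assuming the threshold is read as a relative difference against the submitted score (the function name reproduces and the example scores are made up for illustration, not taken from the contest):

```python
def reproduces(submitted_score, rerun_score, rel_tol=1e-5):
    """True if rerun_score is within rel_tol (a fraction) of submitted_score.

    rel_tol=1e-5 corresponds to +/- 0.001% of the submitted score.
    """
    return abs(rerun_score - submitted_score) <= rel_tol * abs(submitted_score)

# Hypothetical scores, for illustration only.
print(reproduces(0.461234, 0.461236))  # True: differs by about 0.0004%
print(reproduces(0.461234, 0.462000))  # False: differs by about 0.17%
```

The only real decision is the value of rel_tol, which is exactly the arbitrary mgmt call described above.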
and I haven't yet heard any practical reason as to why reproducible results should cause problems. (... snip snip ...) We're all very keen to make sure everyone is comfortable with this process. :)
(a) YOU haven't given any plausible reasons why a tolerance limit of +/- .0000x% is NOT acceptable,
and
(b) You have contestants, such as Sali Mali, who at one point was #1 on the leaderboard, telling you ("if it is the latter, then I guess this will be impossible for most people - an I for one am out"), and B Yang, currently ranked #6 ("I really
hope you reconsider the exact reproducibility requirement. In theory it's possible, in practice it'll probably mean the winner has to send you the computer(s) he used. If he used cloud computing, forget about it.") that this is a significant issue to
them. Feedback from your free labor pool is telling you that some are NOT comfortable with exact reproducibility to the last 0.0000000000001.
What you do with that information is entirely your mgmt decision, as it rightfully should be.
And how current and potential contestants react in turn to whatever decision you make is entirely their decision.
(I am not trying to be a hard- er "nose" on this ... but I AM trying to forcefully and clearly state why a reasonable tolerance limit is fair, practical, easy to administer, and accomplishes the essential requirements of the "reproducibility requirement"
from the viewpoint of the client, your company, and the contestants.)
Post-edit comment: this was written at the same time Christopher Hefele was writing his post on the inconsistent (non-reproducible) effects when multiple threads are hitting the same stream from a number generator. An issue I had read about, but forgotten.
Had I seen his post first, this one would have been much shorter !
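For anyone who has not run into the shared-stream problem, here is a small sketch of the general effect as I understand it (this is not Christopher Hefele's code; the worker names, seeds, and function names are assumptions). It imitates two workers "A" and "B" taking turns in different orders, without real threads, so the outcome is repeatable.

```python
import random

def shared_stream(order, seed=42):
    """Workers 'A' and 'B' draw from ONE generator, in the given order."""
    rng = random.Random(seed)
    draws = {"A": [], "B": []}
    for worker in order:
        draws[worker].append(rng.random())
    return draws

def private_streams(order, seeds=None):
    """Each worker draws from its OWN generator with its own seed."""
    seeds = seeds or {"A": 1, "B": 2}
    rngs = {w: random.Random(s) for w, s in seeds.items()}
    draws = {w: [] for w in seeds}
    for worker in order:
        draws[worker].append(rngs[worker].random())
    return draws

# Same seed, different scheduling: each worker sees different numbers.
assert shared_stream("ABAB") != shared_stream("AABB")

# One seeded generator per worker: scheduling no longer matters.
assert private_streams("ABAB") == private_streams("AABB")
```

With real threads the order is decided by the scheduler, so even a fixed seed does not pin down what each worker receives from a shared generator.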