The "Optimized Constant Value" Benchmark


(NOTE: The description below is not new insight, but rather is an expansion upon an earlier post by Allan Engelhardt. This post is effectively notes I took while reading his post. All of the credit goes to him for this technique. If I've made a mistake, all the errors within belong to me)

I've just added an "Optimized Constant Value Benchmark" to the leaderboard which represents roughly the best score you can get by submitting a single constant value for every member; that constant is approximately 0.209179. This approach has been used by several competitors and many have exceeded it already.

To understand this benchmark, we need to look at the evaluation metric for this competition which is the Root Mean Squared Logarithmic Error (RMSLE):

\[ \epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 }\]


  • \\(\epsilon\\) is the RMSLE value (score)
  • \\(n\\) is the total number of members
  • \\(p_i\\) is your predicted DaysInHospital value for member \\(i\\)
  • \\(a_i\\) is the actual DaysInHospital value for member \\(i\\)
  • \\(\log(x)\\) is the natural logarithm of \\(x\\)
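The metric above is straightforward to implement directly. Here's a minimal sketch in Python (the function name and sample values are just for illustration):

```python
import math

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error, as defined above."""
    n = len(actual)
    total = sum((math.log(p + 1) - math.log(a + 1)) ** 2
                for p, a in zip(predicted, actual))
    return math.sqrt(total / n)

# Example: a constant prediction of 0 against a few actual values
print(rmsle([0, 0, 0], [0, 2, 7]))
```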

We'd like to know an optimal \\(p\\) constant value such that if we make a submission where all \\(p_i = p\\), we'll get a decent score on the leaderboard for the RMSLE evaluation. This would roughly be the best possible score you could get without looking at any individual member data because this approach tries to identify an "average" member.

One good place to start in the process of finding an optimal \\(p\\) is to try \\(p = p_i = 0\\). This is exactly what the "All Zeros Benchmark" does. I'll call this score \\(\epsilon_0\\). If we look at how it's calculated, we see:

\[ \epsilon_0 = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(0 + 1) - \log(a_i+1))^2 }\]

\[ = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(1) - \log(a_i+1))^2 }\]

\[ = \sqrt{\frac{1}{n} \sum_{i=1}^n (0 - \log(a_i+1))^2 }\]

\[ = \sqrt{\frac{1}{n} \sum_{i=1}^n (- \log(a_i+1))^2 }\]

\[ = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(a_i+1))^2 }\]

To keep things compact, I'll use the "bar" notation to denote the mean (average). That is, \\(\overline{p_i} = \frac{1}{n} \sum_{i=1}^n p_i\\). This gives us:

\[ \epsilon_0 = \sqrt{\overline{\log(a_i+1)^2} }\]

\[ \epsilon_0^2 = \overline{\log(a_i+1)^2} \]

By looking at the public leaderboard, we see that \\(\epsilon_0 \approx 0.522226\\):

\[ \epsilon_0^2 = 0.522226^2 \approx 0.272720 = \overline{\log(a_i+1)^2} \]

Thus, we can figure out the average squared logarithm of \\(a_i + 1\\) for the public leaderboard's \\(a_i\\)'s.
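We can sanity-check this identity numerically. The values below are made up, but the relationship between the all-zeros RMSLE and the mean of the squared logs holds for any data:

```python
import math

# Hypothetical actual values, just to check the identity numerically
actual = [0, 0, 1, 3, 0, 7, 2]
n = len(actual)

# All-zeros RMSLE, computed straight from the metric's definition
eps_0 = math.sqrt(sum((math.log(0 + 1) - math.log(a + 1)) ** 2
                      for a in actual) / n)

# Mean of the squared logs: the right-hand side of the identity
mean_sq_log = sum(math.log(a + 1) ** 2 for a in actual) / n

print(eps_0 ** 2, mean_sq_log)  # the two agree
```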

Things get interesting if we compare another constant submission. The leaderboard also has an "All 15's Benchmark" which is a score obtained by submitting a constant value of \\(p = p_i = 15\\) for each member. Let's call this score \\(\epsilon_{15}\\):

\[ \epsilon_{15} = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 }\]

\[ = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(15 + 1) - \log(a_i+1))^2 }\]

\[ = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(16) - \log(a_i+1))^2 }\]

\[ = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(16) - \log(a_i+1))(\log(16) - \log(a_i+1)) }\]

\[ = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(16)^2 - \log(16)\log(a_i+1) - \log(a_i+1)\log(16) + \log(a_i+1)^2) }\]

This simplifies to:

\[ \epsilon_{15} = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(16)^2 - 2\log(16)\log(a_i+1) + \log(a_i+1)^2) }\]

We can square both sides:

\[ \epsilon_{15}^2 = \frac{1}{n} \sum_{i=1}^n (\log(16)^2 - 2\log(16)\log(a_i+1) + \log(a_i+1)^2) \]

We can use our "bar" notation for averages to simplify this:

\[ \epsilon_{15}^2 = \overline{\log(16)^2} - \overline{2\log(16)\log(a_i+1)} + \overline{\log(a_i+1)^2} \]

We know that the average of a constant value is the constant value itself, so this simplifies to:

\[ \epsilon_{15}^2 = \log(16)^2 - 2\log(16)\overline{\log(a_i+1)} + \overline{\log(a_i+1)^2} \]

In addition, we learned earlier that \\(\epsilon_0^2 = \overline{\log(a_i+1)^2}\\), so we can substitute that in to obtain:

\[ \epsilon_{15}^2 = \log(16)^2 - 2\log(16)\overline{\log(a_i+1)} + \epsilon_0^2 \]

We can rearrange terms:

\[ 2\log(16)\overline{\log(a_i+1)} = \log(16)^2 + \epsilon_0^2 - \epsilon_{15}^2 \]

and then divide each side by \\(2\log(16)\\):

\[ \overline{\log(a_i+1)} = \frac{\log(16)^2 + \epsilon_0^2 - \epsilon_{15}^2}{2\log(16)} \]
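The algebra above can be verified numerically: on synthetic data where we know the actual values, the mean log recovered from the two constant-submission scores alone matches the directly computed mean. A sketch (the synthetic values are arbitrary stand-ins for the hidden leaderboard data):

```python
import math

# Synthetic "actual" values standing in for the hidden leaderboard data
actual = [0, 1, 0, 4, 2, 0, 9, 1]
n = len(actual)

def rmsle_const(p, actual):
    """RMSLE of a constant submission p against the actual values."""
    return math.sqrt(sum((math.log(p + 1) - math.log(a + 1)) ** 2
                         for a in actual) / len(actual))

eps_0 = rmsle_const(0, actual)
eps_15 = rmsle_const(15, actual)

# Mean log recovered from the two scores alone, per the formula above
recovered = (math.log(16) ** 2 + eps_0 ** 2 - eps_15 ** 2) / (2 * math.log(16))

# Direct computation for comparison
direct = sum(math.log(a + 1) for a in actual) / n

print(recovered, direct)  # identical up to floating-point error
```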

We'd really like to solve for \\(\overline{a_i}\\). This is where we step into some dangerous territory from a mathematical perspective. We can apply \\(\exp(x) = e^x\\) to each side, but this step is misleading because it won't really tell us the true \\(\overline{a_i}\\), but rather a somewhat optimal value for an average with respect to our error metric.

We'll blissfully ignore this for now and press on:

\[ \exp{\left(\overline{\log(a_i+1)}\right)} = \exp{\left(\frac{\log(16)^2 + \epsilon_0^2 - \epsilon_{15}^2}{2\log(16)}\right)} \]

\[ \overline{a_i+1} = \exp{\left(\frac{\log(16)^2 + \epsilon_0^2 - \epsilon_{15}^2}{2\log(16)}\right)} \]

\[ \overline{a_i} = \exp{\left(\frac{\log(16)^2 + \epsilon_0^2 - \epsilon_{15}^2}{2\log(16)}\right)} - 1 \]

Again, by looking at the leaderboard, we know that

  • \\(\epsilon_{0} \approx 0.522226\\)
  • \\(\epsilon_{15} \approx 2.628062\\)

Thus, we can calculate \\(\overline{a_i}\\):

\[ \overline{a_i} = \exp{\left(\frac{\log(16)^2 + 0.522226^2 - 2.628062^2}{2\log(16)}\right)} - 1 \]

\[ \approx 0.209178645003481 \]

What's really interesting about this approach is that you can use any other non-zero constant value (\\(p\\)) submission along with its public leaderboard error (\\(\epsilon_{p}\\)) and you should get the same "optimized" constant value by calculating:

\[ \overline{a_i} = \exp{\left(\frac{\log(p + 1)^2 + \epsilon_0^2 - \epsilon_{p}^2}{2\log(p + 1)}\right)} - 1 \]
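This generalized formula is a one-liner to compute. A sketch (the function name is my own; the inputs are the leaderboard scores quoted above):

```python
import math

def optimized_constant(p, eps_p, eps_0):
    """Recover the 'optimized' constant from any non-zero constant
    submission p, its leaderboard score eps_p, and the all-zeros
    score eps_0, using the formula above."""
    log_p1 = math.log(p + 1)
    mean_log = (log_p1 ** 2 + eps_0 ** 2 - eps_p ** 2) / (2 * log_p1)
    return math.exp(mean_log) - 1

# Public leaderboard scores from the post
print(optimized_constant(15, 2.628062, 0.522226))  # ≈ 0.209179
```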

Now, a few words of caution:

  1. It's critical to realize that the calculated \\(\overline{a_i} \approx 0.209179 \\) value is not the real mean value of DaysInHospital. Instead, it is a constant that is useful only in the context of this contest's evaluation metric. A better approximation of the real mean can be obtained from the DaysInHospital values given for Y2 and Y3 in the dataset.
  2. Note that both \\(\epsilon_{0}\\) and \\(\epsilon_{15}\\) came from the public leaderboard, which represents approximately 30% of the solution set. The other 70% is used for the private leaderboard and thus will be used to calculate your real final scores. There is therefore a chance that this "optimized" constant differs from what we would have obtained from the complete solution set.
  3. Just by looking at the leaderboard, you can tell that a constant value submission will not put you in the running for any prizes, so you'll actually have to look at the data :)

Regardless, I thought this technique was a clever approach to the data. Thanks again to Allan for sharing it.

Glad you found the original approach useful.  I think my post was shorter: there is an awful lot of maths here :-)  Good to be comprehensive.

But there is nothing misleading about the approach. Everything in this competition is about \\(\log(1+a_i)\\), except for a slight complication in the submission format: you have to submit the \\(a_i\\) values, but that is really just a distraction. You shouldn't care about the mean of \\(a_i\\), and we don't really want to solve for it: only \\(\log(1+a_i)\\) has any real relevance. Taking the exponential on both sides is just a way of circumventing the curious submission format. You can calculate the mean of \\(\log(1+a_i)\\) from your formula using the known \\(\epsilon_0\\) and \\(\epsilon_{15}\\); from that you just need to put it in the format for the submission.

Hope this makes sense to someone.


(Will be back in the competition later this week after having done Real Work™ for the last few weeks.)

Allan Engelhardt wrote:

You shouldn't care about the mean of \\(a_i\\) and we don't really want to solve for it: only \\(\log(1+a_i)\\) has any real relevance. 

I care about \\(a_i\\) in the sense of relating it back to reality :)

Jeff Moser wrote:

I care about \\(a_i\\) in the sense of relating it back to reality :)

There is no reality. There is only data from a sample.

(And anonymized, simplified, censored, truncated data at that.)

Conduct yourselves accordingly.


I'm afraid the maths is dodgy on the exponentiation of the averaged log values. You have to do some calculus to end up at the final expression, which does give the correct optimized constant value, although that is probably not the mean of the actual values. Just a thought: isn't it cheating to have such a sneak peek into the future? Clever though, thanks for sharing.. :)

Jeff Moser wrote:

I care about \\(a_i\\) in the sense of relating it back to reality :)

Jeff, if you care about \\(\overline{a_i}\\), I suggest the following approach. I'll drop the subscript i for compactness. Clearly,
\[\log(\overline{a}+1) \neq \overline{\log(a+1)}. \]
By Jensen's Inequality, it is true that
\[\log(\overline{a}+1) > \overline{\log(a+1)}, \]
given the strict concavity of the log function.
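A quick numeric check of that inequality, on arbitrary made-up values:

```python
import math

# Jensen's inequality for the concave function log(x + 1):
# the log of the mean strictly exceeds the mean of the logs
# (for non-constant data).
a = [0, 0, 0, 1, 5, 15]
mean_a = sum(a) / len(a)

log_of_mean = math.log(mean_a + 1)
mean_of_logs = sum(math.log(x + 1) for x in a) / len(a)

print(log_of_mean, mean_of_logs)  # log_of_mean is strictly larger
```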

From the dataset, we can compute both \\(\overline{\log(a+1)}\\) and \\(\log(\overline{a}+1)\\) for Y2 and for Y3. The difference is quite large and in line with the theoretical result: the ratios \\(\overline{\log(a+1)} / \log(\overline{a}+1)\\) are 0.4861132 for Y2 and 0.490699 for Y3.

As for Y4, from your derivation, we know that
\[\overline{\log(a+1)} = \frac{\log(16)^2 + \epsilon_0^2 - \epsilon_{15}^2}{2\log(16)}\]
or, making the computations,
\[\overline{\log(a+1)} = 0.1899413.\]
Considering a 0.485 ratio (a little less than that observed for
Y2) we get the estimate:
\[\log(\overline{a}+1) \approx \overline{\log(a+1)}/0.485 \approx 0.3916,\]
so \\(\overline{a} \approx e^{0.3916} - 1 \approx 0.479\\), which
is more than twice your estimate (around 0.20) and closer to the ground truth
values for Y2 and Y3 in the dataset.

