(NOTE: The description below is not new insight, but rather is an expansion upon an earlier post by Allan Engelhardt. This post is effectively notes I took while reading his post. All of the credit goes to him for this technique. If I've made a mistake, all the errors within belong to me)
I've just added an "Optimized Constant Value Benchmark" to the leaderboard which represents approximately the best score you can get with a constant value of approximately 0.209179. This approach has been used by several competitors and many have exceeded it already.
To understand this benchmark, we need to look at the evaluation metric for this competition which is the Root Mean Squared Logarithmic Error (RMSLE):
\[ \epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 }\]
Where:
- \\(\epsilon\\) is the RMSLE value (score)
- \\(n\\) is the total number of members
- \\(p_i\\) is your predicted DaysInHospital value for member \\(i\\)
- \\(a_i\\) is the actual DaysInHospital value for member \\(i\\)
- \\(\log(x)\\) is the natural logarithim of \\(x\\)
We'd like to know an optimal \\(p\\) constant value such that if we make a submission where all \\(p_i = p\\), we'll get a decent score on the leaderboard for the RMSLE evaluation. This would roughly be the best possible score you could get without looking at any individual member data because this approach tries to identify an "average" member.
One good place to start in the process of finding an optimal \\(p\\) is to try \\(p = p_i = 0\\). This is exactly what the "All Zeros Benchmark" does. I'll call this score \\(\epsilon_0\\). If we look at how it's calculated, we see:
\[ \epsilon_0 = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(0 + 1) - \log(a_i+1))^2 }\]
\[ = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(1) - \log(a_i+1))^2 }\]
\[ = \sqrt{\frac{1}{n} \sum_{i=1}^n (0 - \log(a_i+1))^2 }\]
\[ = \sqrt{\frac{1}{n} \sum_{i=1}^n (- \log(a_i+1))^2 }\]
\[ = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(a_i+1))^2 }\]
To keep things compact, I'll use the "bar" notation to denote the mean (average). That is, \\(\overline{P} = \frac{1}{n} \sum_{i=1}^n p_i\\). This gives us:
\[ \epsilon_0 = \sqrt{\overline{\log(a_i+1)^2} }\]
\[ \epsilon_0^2 = \overline{\log(a_i+1)^2} \]
By looking at the public leaderboard, we see that \\(\epsilon_0 \approx 0.522226\\):
\[ \epsilon_0^2 = 0.522226^2 \approx 0.272720 = \overline{\log(a_i+1)^2} \]
Thus, we can figure out average squared logarithm value (plus 1) for the public leaderboard's \\(a_i\\)'s
Things get interesting if we compare another constant submission. The leaderboard also has an "All 15's Benchmark" which is a score obtained by submitting a constant value of \\(p = p_i = 15\\) for each member. Let's call this score \\(\epsilon_{15}\\):
\[ \epsilon_{15} = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 }\]
\[ = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(15 + 1) - \log(a_i+1))^2 }\]
\[ = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(16) - \log(a_i+1))^2 }\]
\[ = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(16) - \log(a_i+1))(\log(16) - \log(a_i+1)) }\]
\[ = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(16)^2 - \log(16)\log(a_i+1) - \log(a_i+1)\log(16) + \log(a_i+1)^2) }\]
This simplifies to:
\[ \epsilon_{15} = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(16)^2 - 2\log(16)\log(a_i+1) + \log(a_i+1)^2) }\]
We can square both sides:
\[ \epsilon_{15}^2 = \frac{1}{n} \sum_{i=1}^n (\log(16)^2 - 2\log(16)\log(a_i+1) + \log(a_i+1)^2) \]
We can use our "bar" notation for averages to simplify this:
\[ \epsilon_{15}^2 = \overline{\log(16)^2} - \overline{2\log(16)\log(a_i+1)} + \overline{\log(a_i+1)^2)} \]
We know that the average of a constant value is the constant value itself, so this simplifies to:
\[ \epsilon_{15}^2 = \log(16)^2 - 2\log(16)\overline{\log(a_i+1)} + \overline{\log(a_i+1)^2)} \]
In addition, we learned earlier that \\(\epsilon_0^2 = \overline{\log(a_i+1)^2}\\), so we can substitute that in to obtain:
\[ \epsilon_{15}^2 = \log(16)^2 - 2\log(16)\overline{\log(a_i+1)} + \epsilon_0^2 \]
We can rearrange terms:
\[ 2\log(16)\overline{\log(a_i+1)} = \log(16)^2 + \epsilon_0^2 - \epsilon_{15}^2 \]
and then divide each side by \\(2\log(16)\\):
\[ \overline{\log(a_i+1)} = \frac{\log(16)^2 + \epsilon_0^2 - \epsilon_{15}^2}{2\log(16)} \]
We'd really like to solve for \\(\overline{a_i}\\). This is where we step into some dangerous territory from a mathematical perspective. We can calculate \\(exp(x) = e^x\\) on each side, but this approach is misleading because it won't really tell us the true \\(\overline{a_i}\\) but rather a somewhat optimal value for an average based on our error metric.
We'll blissfully ignore this for now and press on:
\[ \exp{\left(\overline{\log(a_i+1)}\right)} = \exp{\left(\frac{\log(16)^2 + \epsilon_0^2 - \epsilon_{15}^2}{2\log(16)}\right)} \]
\[ \overline{a_i+1} = \exp{\left(\frac{\log(16)^2 + \epsilon_0^2 - \epsilon_{15}^2}{2\log(16)}\right)} \]
\[ \overline{a_i} = \exp{\left(\frac{\log(16)^2 + \epsilon_0^2 - \epsilon_{15}^2}{2\log(16)}\right)} - 1 \]
Again, by looking at the leaderboard, we know that
- \\(\epsilon_{0} \approx 0.522226\\)
- \\(\epsilon_{15} \approx 2.628062\\)
Thus, we can calculate \\(\overline{a_i}\\):
\[ \overline{a_i} = \exp{\left(\frac{\log(16)^2 + 0.522226^2 - 2.628062^2}{2\log(16)}\right)} - 1 \]
\[ \approx 0.209178645003481 \]
What's really interesting about this approach is that you can use any other non-zero constant value (\\(p\\)) submission along with its public leaderboard error (\\(\epsilon_{p}\\)) and you should get the same "optimized" constant value by calculating:
\[ \overline{a_i} = \exp{\left(\frac{\log(p + 1)^2 + \epsilon_0^2 - \epsilon_{p}^2}{2\log(p + 1)}\right)} - 1 \]
Now, a few words of caution:
- It's critical to realize that the calculated \\(\overline{a_i} \approx 0.209179 \\) value is not the real mean value for DaysInHospital. Instead, it is a constant value that is useful only in the context of this contest's evaluation metric. A better approximation for the real mean value can be obtained by finding the mean of Y2 and Y3.
- Note that both \\(\epsilon_{0}\\) and \\(\epsilon_{15}\\) came from the public leaderboard values which represent approximately 30% of the solution set. The other 70% is used in the private leaderboard and thus will be used to calculate your real final scores. Thus, there is a chance that this "optimized" constant will be different than had we calculated it based off the complete solution set.
- Just by looking at the leaderboard, you can tell that a constant value submission will not put you in the running for any prizes, so you'll actually have to look at the data :)
Regardless, I thought this technique was a clever approach to the data. Thanks again to Allan for sharing it.
Flagging is a way of notifying administrators that this message contents inappropriate or abusive content. Are you sure this forum post qualifies?

with —