We have read through the commentary and thank all the readers for their feedback. We have aggregated the following responses to hopefully answer the questions and fill
in any gaps that were pointed out in the methodology. We would like to say that even though the rules seem to lean toward a source-code oriented deliverable, we were asked by the organizers to format the methodology more like a paper, and we were given the
Netflix Prize winner's papers as the example. This format may be less conducive to replication by a computer scientist, but perhaps more conducive to replication by an experienced predictive modeler. Nevertheless, we did provide a fair amount of source code
to assist with the replication process. If others disagree with this format, they should probably take it up directly with the organizers. However we do happen to agree with this format. We believe that most of the top teams used ensemble models which have
some level of randomness, which means that there is some level of adaptation required to adjust to the variation in running the same technique twice. Providing exact coefficients could therefore be misleading, but providing the method for combining (e.g. linear regression) allows for this adaptation to be made and therefore provides for a better replication.
Phil, Dave & Randy
@Edward
-
Thanks for the example software, but shouldn't we get models and parameters in order to reproduce the results? Is this one of the models used in the ensembling, or just an example?
The code provided was just an example. A more accurate model can be achieved by including the remaining variables and fine-tuning the algorithm parameters.
-
This paper suggests that they improved their results by assigning a score to each PCG based on medical knowledge. I think the actual values used should be published as this could be considered additional data and is absolutely required to reproduce their results.
| PCG      | Risk | Risk (Age > 70) |
|----------|------|-----------------|
| AMI      | 5    | 5               |
| APPCHOL  | 1    | 1               |
| ARTHSPIN | 1    | 2               |
| CANCRA   | 3    | 4               |
| CANCRB   | 3    | 4               |
| CANCRM   | 4    | 3               |
| CATAST   | 3    | 4               |
| CHF      | 5    | 5               |
| COPD     | 5    | 5               |
| FLaELEC  | 1    | 1               |
| FXDISLC  | 1    | 1               |
| GIBLEED  | 4    | 3               |
| GIOBSENT | 3    | 2               |
| GYNEC1   | 1    | 1               |
| GYNECA   | 4    | 3               |
| HEART2   | 4    | 3               |
| HEART4   | 1    | 2               |
| HEMTOL   | 3    | 2               |
| HIPFX    | 1    | 4               |
| INFEC4   | 1    | 1               |
| LIVERDZ  | 3    | 3               |
| METAB1   | 3    | 4               |
| METAB3   | 1    | 1               |
| MISCHRT  | 2    | 3               |
| MISCL1   | 1    | 1               |
| MISCL5   | 1    | 1               |
| MSC2a3   | 1    | 1               |
| NEUMENT  | 2    | 1               |
| ODaBNCA  | 1    | 1               |
| PERINTL  | 1    | 1               |
| PERVALV  | 4    | 5               |
| PNCRDZ   | 2    | 3               |
| PNEUM    | 1    | 2               |
| PRGNCY   | 1    | 1               |
| RENAL1   | 2    | 3               |
| RENAL2   | 4    | 3               |
| RENAL3   | 2    | 2               |
| RESPR4   | 1    | 1               |
| ROAMI    | 3    | 4               |
| SEIZURE  | 2    | 2               |
| SEPSIS   | 2    | 5               |
| SKNAUT   | 2    | 1               |
| STROKE   | 5    | 5               |
| TRAUMA   | 1    | 2               |
| UTI      | 1    | 2               |
Specifically, the two predictors derived from this chart were the mean and the max of the risk score across all of a member's claims. These were the most influential predictors in our model.
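As a minimal sketch of deriving these two predictors, using the risk scores from the table above (the claims data and member IDs here are hypothetical toy values, not the competition data):

```python
from statistics import mean

# Hand-assigned PCG risk scores from the table above (subset shown for brevity).
RISK = {"AMI": 5, "APPCHOL": 1, "ARTHSPIN": 1, "CHF": 5, "UTI": 1}

# One (member_id, pcg) tuple per claim -- hypothetical toy data.
claims = [(1, "AMI"), (1, "UTI"), (2, "APPCHOL"), (2, "CHF"), (2, "ARTHSPIN")]

def risk_features(claims, risk_map):
    """Return {member_id: (mean_risk, max_risk)} over each member's claims."""
    by_member = {}
    for member, pcg in claims:
        by_member.setdefault(member, []).append(risk_map[pcg])
    return {m: (mean(scores), max(scores)) for m, scores in by_member.items()}

features = risk_features(claims, RISK)
```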
-
For the Neural Networks it looks like some proprietary software has been used (Tiberius), is this correct? What are the models used, the number of hidden layers, functions, number of neurons, parameters, etc.? None of this has been clarified.
We did use Tiberius for building the Neural Networks, primarily because of familiarity, as it was developed by a team member. The same ideas could easily be implemented in R.
We used a single hidden layer with 3-4 neurons, the backpropagation of error weight update rule, with a learning rate of 0.007 and 500 epochs. 50 models were built
with random weight initialisation and then the predictions averaged (this is essentially a single giant neural network).
For those wanting to develop their own neural network code please see
http://www.philbrierley.com/code.html
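The setup described above (one hidden layer of 3-4 tanh neurons, backpropagation with a learning rate of 0.007 for 500 epochs, 50 random restarts averaged) can be sketched in plain Python as follows. This is an illustrative re-implementation of the stated configuration, not the actual Tiberius code:

```python
import math
import random

def train_net(data, n_hidden=3, lr=0.007, epochs=500, seed=0):
    """One single-hidden-layer net (tanh hidden units, linear output),
    trained by online backpropagation; returns a predict(x) function."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Small random initial weights (the stochastic part averaged out below);
    # the last entry of each weight row is the bias.
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, y in data:
            xb = x + [1.0]
            h = [math.tanh(sum(w * v for w, v in zip(row, xb))) for row in w1]
            hb = h + [1.0]
            err = sum(w * v for w, v in zip(w2, hb)) - y
            # Backpropagation of error: hidden deltas use the pre-update w2.
            d = [err * w2[j] * (1.0 - h[j] ** 2) for j in range(n_hidden)]
            for j in range(n_hidden + 1):
                w2[j] -= lr * err * hb[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w1[j][i] -= lr * d[j] * xb[i]
    def predict(x):
        xb = x + [1.0]
        h = [math.tanh(sum(w * v for w, v in zip(row, xb))) for row in w1] + [1.0]
        return sum(w * v for w, v in zip(w2, h))
    return predict

def ensemble_predict(data, x, n_models=50, **kw):
    """Average over models differing only in random weight initialisation."""
    return sum(train_net(data, seed=s, **kw)(x) for s in range(n_models)) / n_models
```

Averaging the 50 restarts smooths out the initialisation noise, which is why the ensemble behaves like a single giant network.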
-
For the other algorithms (GBM, bagged trees and linear regression), there are also many models derived (20 is mentioned, but with some models consisting of even more models), but the models themselves are unclear to me. Note that many models were generated (up to 60? the exact number is unknown), so there are many ways to incorporate different models with different subsets, different parameters, and possibly different initialisations of weights. I cannot guess what these subsets or parameters are, so I am not able to reproduce the results. A good description of every model with the used subsets is necessary for me to reproduce the results.
The algorithms we used have very few learning parameters that need to be initialised. For each algorithm we tried to use the most appropriate parameters as determined by a combination of experience and techniques such as cross-validation.
We tried numerous subsets of variables and records, ultimately relying on the results to tell us what was working. The best models were tuned GBMs using all records
and all variables.
-
This paper suggests that they improved their results by assigning a score to each PCG based on medical knowledge. I think the actual values used should be published as this could be considered additional data and is absolutely required to reproduce their results.
See 2)
-
Combinations of each PCG, Specialty and PG resulted in many fields, which were reduced. Where is the model used for this? This is unclear. "Building classification models" seems to be a very vague description for this.
The task here is to remove 'useless' variables in order to reduce the data set size – for example, if there is a Specialty * PCG combination that occurs for only one patient, then including this combination as a variable is futile: it will add little to the overall model accuracy but will contribute to overfitting.
There are numerous techniques commonly used in classification problems for variable elimination, such as stepwise logistic regression. To make the
problem binary (0/1), we considered it as a 'did you go to hospital' prediction task (if DIH >= 1 then DIH = 1).
We considered the counts of each pairing individually (PCG * Specialty, PCG * PG, Specialty * PG). For each we built a logistic regression model using all combinations (i.e. variables), and then calculated the variable importance to the model of each combination. If the least important did not affect the model accuracy it was removed, and the process started again. This was repeated until all combinations in the model suggested they were important.
In order to calculate the model variable importance, each variable is in turn randomly permuted and the model accuracy (AUC/Gini) recalculated. This permutation is repeated several times and an average resulting accuracy taken. If the resulting model accuracy with the permuted variable is not significantly diminished, then this variable can be safely removed in the knowledge that it has little effect on the overall model.
Due to the random permutation, repeating this process will not always result in the same subset of selected variables, but it should produce the same model accuracy. Hence we do not expect you to be able to exactly replicate the variables we ended up with; if we repeated the process ourselves we would end up with a different subset in our final modelling data set. This is of little concern, as the important variables would be there, just perhaps not the same variables of lesser importance.
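A minimal sketch of the permutation test itself, using a fixed toy model and plain accuracy in place of AUC/Gini (all data and names here are hypothetical):

```python
import random

def accuracy(model, X, y):
    return sum(1 for row, t in zip(X, y) if model(row) == t) / len(y)

def permuted_score(model, X, y, col, n_repeats=10, seed=0):
    """Average accuracy after randomly permuting column `col` several times."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_repeats):
        perm = [row[:] for row in X]
        vals = [row[col] for row in perm]
        rng.shuffle(vals)
        for row, v in zip(perm, vals):
            row[col] = v
        total += accuracy(model, perm, y)
    return total / n_repeats

# Toy fixed model that only looks at column 0, so column 1 is 'useless'.
model = lambda row: 1 if row[0] > 0.5 else 0
X = [[0.9, 0.2], [0.1, 0.7], [0.8, 0.8], [0.2, 0.1]]
y = [1, 0, 1, 0]

base = accuracy(model, X, y)
drop0 = base - permuted_score(model, X, y, 0)  # permuting col 0 hurts accuracy
drop1 = base - permuted_score(model, X, y, 1)  # permuting col 1 changes nothing
```

A variable whose permuted-accuracy drop is near zero (like column 1 here) is a candidate for removal before the next elimination round.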
-
What are the weights used for the linear blending of the models?
We have described the method we used to determine the weights. Specific weights are meaningless unless we give the source code to reproduce all our
models exactly.
@Mark Waddle
Market Makers wrote:
Using n-fold cross validation to generate evaluation sets, further capping within these limits could be evaluated to determine if this decreased the error.
Some algorithms (such as linear regression) will result in negative predictions, even on the training set. We know this is impossible, and can improve the RMSE by
'manually' capping the negative predictions to zero. This demonstrates the flaws in the algorithm in trying to minimise the RMSE.
The 'further capping' we mention is to test what happens if, rather than choosing zero as the cap, other values such as 0.1, 0.2 etc. are investigated. We can use the cross validation sets to test which cap value results in the lowest RMSE (in fact cross validation sets are not required to determine if an algorithm is performing inefficiently at the extremes; the actual results on the training data can be used to see if capping improves the training RMSE).
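This cap search can be sketched as follows (the predictions and actuals are hypothetical toy values):

```python
import math

def rmse(pred, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))

def best_cap(pred, actual, caps=(0.0, 0.1, 0.2, 0.3)):
    """Return the lower cap in `caps` that minimises RMSE after clipping."""
    return min(caps, key=lambda c: rmse([max(p, c) for p in pred], actual))

# Toy predictions from a model that produced impossible negative values.
pred = [-0.4, -0.1, 0.3, 1.2]
actual = [0.0, 0.0, 0.0, 1.0]
cap = best_cap(pred, actual)
```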
Market Makers wrote:
The technique of blending algorithm dependent predictions with algorithm independent predictions was successful in a previous Kaggle competition.
When we say algorithm independent predictions, we are referring to the median models. For each record, the actual prediction could be from any of the algorithms in the median mix (whichever
one gives the median score for that record).
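Concretely, the median model picks the middle score per record, so each final value comes from whichever algorithm happened to land in the middle (the numbers below are toy values):

```python
from statistics import median

def median_model(per_algorithm_preds):
    """For each record, take the median of the per-algorithm predictions."""
    return [median(scores) for scores in zip(*per_algorithm_preds)]

# Toy predictions from three algorithms for three records.
blend = median_model([[0.1, 0.9, 0.4],
                      [0.2, 0.5, 0.6],
                      [0.3, 0.7, 0.5]])
```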
@AndyWocky
-
What is the admission risk score, defined on page 2 as “... An admission risk score was developed around the Primary Condition Group that was based on medical experience only. This score was a 1-5 ranking for each PCG, and split by age band.” This needs to
be defined mathematically, or programmatically, such that it can be reconstructed either from their SQL data, or the raw contest data.
Please see response 2) to @Edward
Please see response 3) to @Edward. There is a link to the source code at the heart of the Tiberius algorithm.
-
There is no detail to support “… Multiple models were built on the two data sets using various parameter settings and variable subsets.” The competitors need to provide
the precise models, model parameters, variable subsets and supporting data set for each model that was used to generate their winning submission.
If you try the holistic approach we described, then you will be able to generate a leaderboard score the same as ours. It might not be exactly the same, but it will be of the same accuracy. The heart of the approach is its stochastic nature, and if we started again, we would not look to replicate everything exactly.
-
There is no detail to support “… Truncation of the predictions was found to be useful in certain models.” Did truncations vary by model, or data set? What is the truncation
function? The competitors need to provide precise details of the truncation function, and the models (with corresponding data sets) to which it was applied.
Please see first response to @Mark Waddle
-
The term “multiple n-fold cross validation” is unclear in “…During the model generation process, multiple n-fold cross validation was used to essentially generate out-of-sample
prediction sets for the training data.” What is multiple n-fold CV? For example: is it simply n-fold CV applied for each model? Are the folds the same for each model, if not, how do they differ? What value of n is used?
Multiple n-fold cross validation is where you do n-fold cross validation not just once, but multiple times. The cross validation set is then just an average. Predominantly we used two values of n, 2 and 10. When 2 was used we repeated multiple times until the cv error converged. The reason for using multiple 2-fold cv was mainly to overcome computer memory issues (the training data set is half the size of the complete data set) and to decrease processing time for each pass of the algorithm, rather than any specific mathematical benefit.
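A sketch of repeated n-fold CV producing averaged out-of-sample predictions; `fit_predict` stands in for any train-then-score routine (the toy one below just predicts the training-set mean, and all data is hypothetical):

```python
import random

def multiple_nfold_oos(fit_predict, X, y, n=2, repeats=5, seed=0):
    """Repeat n-fold cross validation `repeats` times; each pass yields one
    out-of-sample prediction per record, and the final set is their average."""
    rng = random.Random(seed)
    total = [0.0] * len(X)
    for _ in range(repeats):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        folds = [idx[i::n] for i in range(n)]
        for k, hold in enumerate(folds):
            train = [i for j, f in enumerate(folds) if j != k for i in f]
            preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                                [X[i] for i in hold])
            for i, p in zip(hold, preds):
                total[i] += p
    return [t / repeats for t in total]

# Toy 'model' that predicts the training-set mean for every held-out record.
mean_model = lambda Xtr, ytr, Xte: [sum(ytr) / len(ytr)] * len(Xte)
oos = multiple_nfold_oos(mean_model, X=[[0], [1], [2], [3]], y=[0, 0, 1, 1])
```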
-
The linear regression ensembling needs to be described in reproducible detail. For example: was this a standard linear regression with an intercept? Were the folds for
each model the same? Was any sampling performed during this fitting procedure? How were out-of-bound predictions altered (e.g., was a negative value simply rounded to 0)? What models were included in the regression and what were the final weights?
Yes, it was standard linear regression with intercept. The key here was mainly experience in knowing not to use too many models in the mix and not to use highly correlated
models, otherwise overfitting and model instability could occur.
Please see the response to @Mark Waddle and response 7) to @Edward.
Yes, median models were included in the final ensemble.
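The blend fit itself, assuming ordinary least squares with an intercept over each model's out-of-sample predictions, can be sketched as follows (the predictions and targets are toy numbers, and these are not our actual weights):

```python
def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting (normal equations)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[c][c] != 0.0:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def blend_weights(model_preds, y):
    """OLS blend: one weight per model plus an intercept (last entry)."""
    X = [list(row) + [1.0] for row in zip(*model_preds)]
    k, n = len(X[0]), len(X)
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         for i in range(k)]
    b = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    return solve(A, b)
```

Keeping the number of blended models small and avoiding highly correlated ones keeps this normal-equations system well conditioned, which is the overfitting/instability concern mentioned above.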
-
“The weightings were applied to the log scale version of the predicted values” — how were out-of-bound values handled? Were they rounded into range, or was a shrinking
strategy used? Were out-of-bound values modified uniformly for all patients, or differently by subset?
Out of bound values were dealt with at the individual model level in the first instance. See response to @Mark Waddle
-
Complete details of the "...final solution ... ensemble of approximately 20 models” is needed: which models were used, what are their associated parameters, which data
sets were they trained on and what is their corresponding weight in the final solution?
Please see the response to your 3rd question.
-
In the last paragraph of the report, “We found that calibrating the final predictions so that the overall average predicted days in hospital matched the optimized constant
value benchmark gave a small improvement” — full details are required, for example: did they simply mean-shift their entire submission targets? Was any scaling or patient-subset specific shifting performed? What was the final value used for translating predictions?
What was the impact of the improvement?
The mean of the predictions should be around 0.209179, the optimised constant value benchmark for the leaderboard set. This can be achieved by performing a y=mx or
y=mx+c transformation, to name but two. The former is easier to apply and should result in an improvement. When we applied this transformation there was an improvement but it was very minimal.
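The simpler y=mx version of this calibration can be sketched as (the toy predictions are hypothetical):

```python
TARGET_MEAN = 0.209179  # optimised constant benchmark for the leaderboard set

def calibrate(preds, target=TARGET_MEAN):
    """Scale predictions (y = m*x) so their mean matches the target constant."""
    m = target * len(preds) / sum(preds)
    return [m * p for p in preds]

# Toy predictions whose mean (0.225) sits slightly above the benchmark.
calibrated = calibrate([0.05, 0.30, 0.10, 0.45])
```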