Can you provide days in Hospital in Y1?


Is it possible to add to the data the number of days in hospital in Y1?


We'd rather not - because it seems confusing and unhelpful to provide an answers file for a year for which there's no training data! The competition is complex enough already... If you find a genuine and compelling modelling reason why the results of the competition would be more effective if this data was provided, please let us know, and we'll consider it.

Thanks for the response.

There is a correlation between days in hospital in Y1 and Y2, which makes it worth exploring. Furthermore, since the data is really limited, releasing any additional field would be appreciated.

agree mgomari!

I don't know whether the "LengthOfStay" field in Claims_Y1 is the number of days in hospital in Y1?

In my view the most compelling reason to release year 1 days in hospital is consistency. One cannot accurately calculate year 1 days in hospital from length of stay, because length of stay has been generalised (all durations above 6 days have been transformed to weeks). Once year 2 claims are released it becomes possible to model days in hospital for that year; the same argument holds for year 3.
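To illustrate the information loss, here is a minimal sketch. The bin labels and day ranges are hypothetical stand-ins for the generalised length-of-stay values, not the official coding, and `total_days_bounds` is an illustrative helper. The point is only that a binned value yields a range of possible days rather than an exact count.

```python
# Hypothetical bin labels -> (min_days, max_days); None means open-ended.
# These labels and ranges are illustrative assumptions, not the official coding.
LOS_BOUNDS = {
    "1 day": (1, 1),
    "2 days": (2, 2),
    "6 days": (6, 6),
    "1-2 weeks": (7, 14),
    "2-4 weeks": (15, 28),
    "26+ weeks": (182, None),
}


def total_days_bounds(stays):
    """Return (lower, upper) bounds on total days for a list of LOS labels.

    The upper bound is None as soon as any stay falls in an open-ended bin.
    """
    lower = 0
    upper = 0
    for label in stays:
        lo, hi = LOS_BOUNDS[label]
        lower += lo
        if upper is not None:
            upper = None if hi is None else upper + hi
    return lower, upper


# One exact stay plus one binned stay: only a range can be recovered.
bounds = total_days_bounds(["2 days", "1-2 weeks"])
print(bounds)
```

Even with exact bounds per bin, this says nothing about which stays count towards days in hospital as defined for the competition.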

@mgomari, one issue we have to keep in mind is the tradeoff involved in releasing data. For data privacy reasons, HPN have a granularity threshold which they're not willing to breach. The data anonymization team (represented by keleman in the forums) are trying to release CPTCodes (probably at an aggregated level). Apparently it's pretty line-ball, and releasing DaysInHospital_Y1 might put this in jeopardy. (I think of the data privacy considerations as a waterbed: you push down on one part of the bed and it creates a bulge somewhere else.) After May 4, you'll be able to use DaysInHospital_Y2 and DaysInHospital_Y3 to predict DaysInHospital_Y4.

@ogenex, even if we release DaysInHospital_Y1, you won't be able to do a consistency check. Not all "length of stays" count as hospitalizations (as calculated for this competition) and you don't have enough detail in this dataset to work out which count and which don't.

Thanks for the response Anthony.

If I have to choose I do prefer CPT and ICD9 codes over days in Y1.

anthony.goldbloom wrote:
... even if we release DaysInHospital_Y1, you won't be able to do a consistency check. Not all "length of stays" count as hospitalizations (as calculated for this competition) and you don't have enough detail in this dataset to work out which count and which don't.

Dear Anthony,

I believe the precision of the predictive model will be higher if you release DaysInHospital_Y1, especially because for each patient we will then have 3 longitudinal data points: DaysInHospital_Y1, DaysInHospital_Y2, and DaysInHospital_Y3.

Without DaysInHospital_Y1 we will have only 2 data points, DaysInHospital_Y2 and DaysInHospital_Y3, which does not provide much longitudinal information.

Yours truly,


That was my meaning behind the consistency post: longitudinal consistency. If it comes down to a choice between CPT codes and Y1 days in hospital then I would definitely opt for CPT codes.

Glad I found this post. Yes, I was looking hard to find Y1 data, but realized that I could not create this from the Claims data as there is a mismatch due to binning. I am not sure how this would push you past any granularity threshold, as the more detailed info in claims is already out; this would at least allow us to use your consistent binning method for our training.

There are members with internal hospital/urgent care (and with LOS 1+) in year 2, but with no days in hospital in Year 2 or 3. This may be because the LOS has been counted in days in hospital for Year 1, perhaps if the claims year is based on a financial year and DIH on a calendar year. I have a promising pattern from which a lot of noise would be removed if I could factor in DIH_Year1. Or at least tell us why there is no record of DIH_Year 2 in such cases, please. I do understand that LOS and DIH would not match completely, but there are members with no claims at all in Year 2, and yet with DIH_Year 2. How?? Thanks!

SSRC wrote:

there are members with no claims at all in Year 2, and yet with DIH_Year 2. How?? Thanks!

The DIH have to be predicted from the claims made in the year before. There are claims from 71435 different members in year 2 and 71435 members in DaysInHospital_Y3 (and of course each member with claims in year 2 has a matching entry in DaysInHospital_Y3). I don't see the problem...
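A check like this can be sketched with plain set operations. The member IDs below are invented toy values, and `compare_member_sets` is a hypothetical helper; loading the real claims and days-in-hospital files is not shown.

```python
def compare_member_sets(claim_member_ids, dih_member_ids):
    """Compare the members seen in a claims year with those in a DIH file.

    claim_member_ids may contain repeats (one entry per claim row).
    """
    claim_members = set(claim_member_ids)
    dih_members = set(dih_member_ids)
    return {
        "claims_only": claim_members - dih_members,  # claims but no DIH entry
        "dih_only": dih_members - claim_members,     # DIH entry but no claims
    }


# Toy IDs, invented for illustration only.
y2_claim_rows = [101, 101, 102, 103]
dih_y3_members = [101, 102, 103]

result = compare_member_sets(y2_claim_rows, dih_y3_members)
print(result)
```

Two empty sets mean the two files cover exactly the same members. Running the same comparison on claims and DIH for the same year would surface the mismatches SSRC describes.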

@SSRC, mapping LOS to DIH is impossible. Not every LOS entry corresponds to a DIH (e.g. a hospice stay). One reason somebody may have DIH in Y2 but no claims is if they weren't eligible to claim in Y3 (in which case their Y2 claims would've been removed).

Hey Anthony,

I just joined the competition and have one maybe rudimentary question here. According to, we are asked to develop a predictive model over Y1 data first and evaluate it against Y2 data. Since you mentioned that you are not going to release DaysInHospital for Y1, how could we train on Y1 data?



We are always trying to use one year's encounter data for a member to predict the DaysInHospital (as defined) for the following year.

So Yr1 encounters are trained against and predict Yr2 DIH, etc.
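As a rough sketch of that pairing, assuming per-member features have already been derived from each year's claims: the dictionaries and `build_training_rows` below are hypothetical, and the values are toy data, not competition data.

```python
def build_training_rows(features_by_year, dih_by_year):
    """Pair each member's year-N features with their year-(N+1) DIH target.

    Years with no following-year DIH (e.g. the year being forecast from)
    produce no training rows.
    """
    rows = []
    for year, features in sorted(features_by_year.items()):
        targets = dih_by_year.get(year + 1)
        if targets is None:
            continue  # no known outcome for the following year
        for member_id, x in features.items():
            if member_id in targets:
                rows.append((member_id, year, x, targets[member_id]))
    return rows


# Toy data: one feature per member (say, a claim count), invented values.
features_by_year = {
    1: {"a": [3], "b": [0]},
    2: {"a": [5]},
    3: {"a": [1]},
}
dih_by_year = {2: {"a": 0, "b": 2}, 3: {"a": 4}}

rows = build_training_rows(features_by_year, dih_by_year)
print(rows)
```

Year 3 features yield no training rows here because the year 4 outcome is what has to be predicted.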

Dear, Anthony Goldbloom,

Firstly; thanks for a great initiative, Kaggle is awesome !

We believe that Y1DaysInHospital is critically important, as it then becomes possible to do either of the following.

Option A:

train on forecasting : Y2 with: claimsY1 and DaysInHospitalY1 and
train on forecasting : Y3 with: claimsY2 and DaysInHospitalY2
to forecast          : Y4 with: claimsY3 and DaysInHospitalY3

Option B:

train on forecasting : Y3 with:

claimsY1 and DaysInHospitalY1 and
claimsY2 and DaysInHospitalY2

to forecast          : Y4 with:

claimsY2 and DaysInHospitalY2 and
claimsY3 and DaysInHospitalY3

So without Y1DaysInHospital the whole of Option B falls away, and Option A has only half the complete data to train on.

We feel that without this information it will be hard, if not impossible, to construct a truly good model that can have real use for the sponsors, and it might make reaching the .4 mark unachievable.
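The effect on the two options can be checked with a small sketch. It reads Option A as using one prior year of claims plus that year's days in hospital, and Option B as using two prior years of each; `trainable_targets` and its parameters are hypothetical names introduced for illustration.

```python
def trainable_targets(window, claims_years, dih_years):
    """List target years that can be trained on.

    A target year is trainable if its own DIH is known and both claims and
    DIH are available for each of the `window` years before it.
    """
    trainable = []
    for target in sorted(dih_years):
        input_years = range(target - window, target)
        has_claims = all(y in claims_years for y in input_years)
        has_lagged_dih = all(y in dih_years for y in input_years)
        if has_claims and has_lagged_dih:
            trainable.append(target)
    return trainable


claims_years = {1, 2, 3}
released = {2, 3}        # DIH years available without Y1
with_y1 = {1, 2, 3}      # DIH years if Y1 were also released

option_a = (trainable_targets(1, claims_years, released),
            trainable_targets(1, claims_years, with_y1))
option_b = (trainable_targets(2, claims_years, released),
            trainable_targets(2, claims_years, with_y1))
print(option_a, option_b)
```

Under these assumptions Option A loses one of its two training targets without Y1, and Option B has none left.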

Glad you're enjoying the contest. But unfortunately, there's no way to change anything about the data at this point.

Even more critical is the fact that this effectively shuts the competition off from time series modeling, by not making the data time series friendly. Sadly, days in hospital Y1 is the only data that would be needed to use the data as a full time series. It's hard to understand why the sponsors would wish this competition to be flawed by not employing the vast tool sets from time series modeling. By deliberately closing off this possibility, how can one then expect state of the art results, which one would think was the primary incentive?

Furthermore, the heedless reply from DavidChudzicki contradicts the reply of Anthony Goldbloom, who explained that it is indeed possible, albeit a hard decision, as the limits of anonymization are already stretched (hence the waterbed example). But cutting off time series modeling tools is an extremely critical decision, and should perhaps be re-evaluated. Especially since full data for days in hospital Y2 and Y3 already exist, and are secured by the other ample anonymization strategies, one is led to think that adding Y1 can't possibly do more harm than Y2 and Y3 already do.

@NeoStrata Not having Y1 DIH data is detrimental, but you can make some intelligent assessments and weigh them. Sure, it's not perfect, but all contestants are working with this handicap. I entirely side with David in dismissing any changes to the data out of hand. The last time Anthony mentioned some flexibility was 16 months prior to your post. The contest is nearing its end, hence any changes now would simply lead to too many problems.

