AlKhwarizmi's image Posts 33
Thanks 4
Joined 11 Nov '11 Email user

I downloaded the data months ago but I am now finally finding time to look at this. I have a few questions about the data. Sorry if they have already been answered somewhere else.

1. The second column in the target file is ClaimsTruncated. The exact same numbers are in the sample entry. What is this? Are we supposed to estimate it or just copy it?

2. There is a field called DSFS (days since first service) in three different files - Claims, DrugCount, LabCount. Are these different days since first service?

3. DaysInHospital_Y2 is the outcome for members with claims in Year 1. Is this the length of time for the claim in Year 1? Something else? The naming is a little confusing. Same question for DaysInHospital_Y3.

 

 
DanB's image Rank 2nd
Posts 58
Thanks 46
Joined 6 Apr '11 Email user

1) It's data that you are given, not something you estimate. It indicates whether the claims information for the previous year is truncated (presumably to protect patient confidentiality).

2) These are used to help create a chronology of events for each patient. As you noticed, each drugcount, labcount and claim has its own dsfs. If you have two labcount records, the labcount record with the higher dsfs value is the later lab count. Drug counts and lab counts are aggregated... my recollection is that each record is the aggregate number of drug or lab counts for a month, but I may be remembering incorrectly (and it may be aggregated over a 2-month period).

3) No. DaysInHospital_Y2 is the number of days that the patient spent in the hospital in the 2nd year. The idea is to build a model where you use claims from year K to predict hospitalization in year K+1.
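On point 2, here is a minimal sketch of putting one member's lab-count records in chronological order by DSFS. It assumes DSFS is stored as a month-bucket label such as "0- 1 month" or "10-11 months"; that format, the helper name dsfs_start_month, and the toy rows are assumptions for illustration, so check them against your copy of the files. The label is parsed into a number because a plain string sort would put "10-11 months" ahead of "4- 5 months".

```python
import re

def dsfs_start_month(label):
    """Return the starting month of a DSFS bucket label, e.g. '4- 5 months' -> 4."""
    match = re.match(r"\s*(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognised DSFS label: {label!r}")
    return int(match.group(1))

# Hypothetical lab-count rows for one member: (MemberID, DSFS, LabCount).
lab_rows = [
    (101, "4- 5 months", 3),
    (101, "0- 1 month", 1),
    (101, "10-11 months", 2),
]

# Sort on the parsed month, not on the raw string.
ordered = sorted(lab_rows, key=lambda row: dsfs_start_month(row[1]))

for member_id, dsfs, lab_count in ordered:
    print(member_id, dsfs, lab_count)
```

The same key function would work for the DSFS column in the Claims and DrugCount files, provided they use the same label format.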

A common workflow is to train/estimate a model using hospitalization in the 2nd year as an outcome to be predicted from Y1 claims, and year 3 hospitalization as an outcome to be predicted from Y2 claims. Once you've trained those models, you can use claims data from the third year to predict hospitalization in the 4th year. We aren't given DaysInHospital_Y4, and that is what the competition is judged on.
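A rough sketch of that workflow, using only the standard library. Everything in it is a stand-in: the toy dictionaries, the single feature (claims per member per year), the log1p transform of the outcome, and the one-variable least-squares fit are illustrative assumptions, not a recommended model. The point is only the year alignment: features from year K are paired with the outcome from year K+1.

```python
import math

# Hypothetical toy data: number of claims per member, by year.
claims_per_year = {
    "Y1": {1: 2, 2: 10, 3: 0},
    "Y2": {1: 3, 2: 12, 4: 1},
    "Y3": {1: 1, 2: 15, 4: 0},
}
# Known outcomes (days in hospital). Y4 is not given, so it is absent here.
days_in_hospital = {
    "Y2": {1: 0, 2: 3, 3: 0},
    "Y3": {1: 0, 2: 5, 4: 0},
}

def training_pairs():
    """Pair year-K features with the year-(K+1) outcome for each member."""
    pairs = []
    for feature_year, target_year in [("Y1", "Y2"), ("Y2", "Y3")]:
        for member, days in days_in_hospital[target_year].items():
            if member in claims_per_year[feature_year]:
                n_claims = claims_per_year[feature_year][member]
                pairs.append((n_claims, math.log1p(days)))
    return pairs

def fit_line(pairs):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(training_pairs())

# Predict Y4 from Y3 features only; clip at zero days.
y4_predictions = {
    member: max(0.0, math.expm1(intercept + slope * n_claims))
    for member, n_claims in claims_per_year["Y3"].items()
}
print(y4_predictions)
```

In practice the feature set would be much richer than a claim count, but the pairing of feature year and target year stays the same.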

Thanked by AlKhwarizmi
 
AlKhwarizmi's image Posts 33
Thanks 4
Joined 11 Nov '11 Email user

Is it possible that the truncated claim indicator tells when a claim was started in the previous calendar year and is still open in the current calendar year? This would be important to know and a very common occurrence.

 
Chris Raimondi's image Rank 38th
Posts 194
Thanks 90
Joined 9 Jul '10 Email user

I don't think so - for three reasons:

1) In every single instance where a member has more than 43/44 claims, ClaimsTruncated is set. There are a few members at the max without CT, but that is what you would expect if someone had exactly 43/44 claims.
2) Other discussions, as well as the definitions, have made it pretty clear that it simply means the person's number of claims was above some percentile (I think it was the 95th). That is why the limit is different for years 1 & 2 vs. year 3 [yes - don't get us started - it doesn't make sense to most of us either].
3) There are no obvious correlations between CT and DSFS other than what you would expect to see: the more months someone is seen, the more claims they have (or rather the reverse), and the more likely they are to hit the 43/44 limit.

All it means is that there were more than X claims. Someone decided to dedupe the claims differently in year 1 (again, none of us thought that was a good idea either :) ). But other than that, it seems obvious that CT simply means what they say it does: they have removed all the claims over a certain percentile.

Keep in mind that I say "obvious" in the sense that it is obvious once you have spent as much time looking at the data as some of us have :)
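If you want to check this yourself, something like the following would do it: count the claims per member per year and compare against the ClaimsTruncated flag. The rows, the tuple layout, and the tiny cap of 3 are made up to keep the example short (in the real data the cap would be the 43/44 mentioned above), so treat this as a sketch rather than the actual file format.

```python
from collections import Counter

# Hypothetical claim rows, one (MemberID, Year) entry per claim.
claims = [
    (1, "Y1"), (1, "Y1"), (1, "Y1"),
    (2, "Y1"),
    (3, "Y1"), (3, "Y1"), (3, "Y1"),
]
# Hypothetical ClaimsTruncated flags keyed by (MemberID, Year).
claims_truncated = {(1, "Y1"): 1, (2, "Y1"): 0, (3, "Y1"): 0}

counts = Counter(claims)
max_count = max(counts.values())  # stands in for the 43/44 cap

# Flagged members with fewer claims than the cap would contradict the
# "more than X claims" reading of ClaimsTruncated.
flagged_below_cap = [
    key for key, flag in claims_truncated.items()
    if flag == 1 and counts[key] < max_count
]

# Unflagged members sitting exactly at the cap are expected to be rare:
# they had exactly that many claims, so nothing was removed.
at_cap_unflagged = [
    key for key, flag in claims_truncated.items()
    if flag == 0 and counts[key] == max_count
]

print("flagged but below cap:", flagged_below_cap)
print("at cap but not flagged:", at_cap_unflagged)
```

Since the limit differs by year, the maximum should be computed separately for each year when running this on more than one year of claims.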

 
Chris Raimondi's image Rank 38th
Posts 194
Thanks 90
Joined 9 Jul '10 Email user

Also - what may be confusing you is why it is in the file it is in. That is because, other than age and sex, it is the only variable that is not tied to an individual claim. It is also the only member-level variable that can change from year to year.

Thanked by AlKhwarizmi
 
AlKhwarizmi's image Posts 33
Thanks 4
Joined 11 Nov '11 Email user

Thanks. I guess I missed that in the definitions. I will use this as just another predictive variable like you suggested.

 
AMULET Analytics's image Posts 4
Joined 14 Jul '12 Email user

Re: DanB's comment:

3) No. DaysInHospital_Y2 is the number of days that the patient spent in the hospital in the 2nd year. The idea is to build a model where you use claims from year K to predict hospitalization in year K+1.

But to predict DaysInHospital for year K+1, don't you still need feature examples for year K+1 (Y4 in this case)? We don't have Y4 data, right?

 

 

 
DanB's image Rank 2nd
Posts 58
Thanks 46
Joined 6 Apr '11 Email user

Amulet: To predict Y4, you should only use features from Y3 (and prior years if you choose).

Imagine being an insurer on Dec 31 of Y3 trying to set insurance premiums for Y4. You don't have any Y4 data yet, so you will try to forecast Y4 outcomes (like days in hospital) from previous years' data.

 
