Ford Prefect
Posts 23
Thanks 10
Joined 2 Dec '10

Here's a list of issues I've discovered so far with the new dataset:

1) In the Claims.csv file, some of the ID numbers have extra zeros in front compared with release 2, but there are no changes to any of the values.

2) The DaysInHospital_Y2.csv file has one extra row, for MemberID 24027423.

3) The new DrugCount.csv file contains some entries that don't correspond to anything in Claims.csv:

210,Y3,7- 8 months,1
210,Y3,8- 9 months,1
210,Y1,4- 5 months,1
210,Y3,5- 6 months,2

As I understand it, there should be claims for MemberID 210 in Claims.csv for those particular Year and DSFS combinations, but they are missing. The other rows of DrugCount.csv have a corresponding claim in Claims.csv.

(Addendum) That last sentence isn't actually true. I've just looked at the data again, and there are a lot more MemberIDs where 3) applies. It turns out that every one of those MemberIDs has ClaimsTruncated=1 in the DaysInHospital CSV files.

I suppose this means that the claims were anonymized, but the drug count data associated with them wasn't. This raises a few questions: Is the drug count data complete for all members, or did DrugCount.csv  get anonymized as well? Either way, the missing claims can be used to obtain a better estimate of the true number of claims for those members :-)
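
The check described in 3) and the addendum can be sketched roughly as follows, in Python using only the standard library. The column names MemberID, Year, DSFS and ClaimsTruncated are the ones mentioned in this thread; the helper names and the tiny in-memory rows are made up for illustration, and stripping leading zeros from IDs is an assumption based on point 1).

```python
def norm_id(member_id):
    """Strip leading zeros so '00210' and '210' compare equal (assumption from point 1)."""
    return str(member_id).strip().lstrip("0") or "0"


def claim_key(row):
    """Join key: (MemberID, Year, DSFS)."""
    return (norm_id(row["MemberID"]), row["Year"].strip(), row["DSFS"].strip())


def unmatched_drug_rows(drug_rows, claim_rows):
    """Return DrugCount rows that have no claim with the same key (an anti-join)."""
    claim_keys = {claim_key(r) for r in claim_rows}
    return [r for r in drug_rows if claim_key(r) not in claim_keys]


def all_truncated(rows, days_rows):
    """True if every MemberID in rows has ClaimsTruncated == 1 in days_rows."""
    truncated = {norm_id(r["MemberID"]) for r in days_rows
                 if str(r["ClaimsTruncated"]).strip() == "1"}
    return all(norm_id(r["MemberID"]) in truncated for r in rows)


# Made-up rows for illustration only; real rows would come from csv.DictReader.
drug = [
    {"MemberID": "210", "Year": "Y3", "DSFS": "7- 8 months", "DrugCount": "1"},
    {"MemberID": "999", "Year": "Y1", "DSFS": "0- 1 month", "DrugCount": "2"},
]
claims = [{"MemberID": "0999", "Year": "Y1", "DSFS": "0- 1 month"}]
days = [{"MemberID": "210", "ClaimsTruncated": "1"},
        {"MemberID": "999", "ClaimsTruncated": "0"}]

missing = unmatched_drug_rows(drug, claims)
print(len(missing), all_truncated(missing, days))
```

If the second value printed is True for the real files, every unmatched drug row belongs to a member whose claims were truncated.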

Thanked by Allan Engelhardt
 
Allan Engelhardt
Posts 77
Thanks 29
Joined 28 May '10

Ford Prefect wrote:

2) The DaysInHospital_Y2.csv has one extra row for MemberID  24027423

That's good: http://www.heritagehealthprize.com/c/hhp/forums/t/620/missing-daysinhospital-y2-for-memberid-24027423

Thanks for all the observations: keep them coming!

 
Darragh
Posts 8
Joined 8 Apr '11
I noticed that in the LabCount.csv file LabCount has been capped at 10+, and in the DrugCount.csv file DrugCount has been capped at 7+. There is no information about which laboratory tests were carried out or which drugs were supplied.

Question for the HPN Administrator: are the counts of these services the only data that will be provided, for security reasons?

Thank you,
Jim
 
Kwaak
Posts 7
Joined 8 Apr '11

Quote from the data page:

c. Labs Table, which will contain certain details of lab tests provided to members.

d. RX Table, which will contain certain details of prescriptions filled by members.

There are no details? Just sums?

Also, lots of MemberID and Year combinations are missing in the two datasets. Does that mean that no lab tests or prescriptions were given to that MemberID in that year?

 
Ford Prefect
Posts 23
Thanks 10
Joined 2 Dec '10

@Kwaak The drug counts are all at least 1, so it makes sense that there are missing combinations: those claims would have 0 drugs prescribed. The same goes for the lab counts; only nonzero combinations are provided.

If the lab counts and drug counts are complete, then it's ok to put zero counts in all the other claims, but if there was anonymization, then that may not be appropriate.
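
A minimal sketch of that zero-filling step, in Python with only the standard library. It assumes counts can be joined to claims on (MemberID, Year, DSFS), and that capped values such as "7+" are read as their lower bound; the function names and the sample rows are invented for illustration.

```python
def parse_count(text):
    """Turn '3' into 3 and a capped value like '7+' into 7 (its lower bound)."""
    return int(str(text).strip().rstrip("+"))


def counts_by_key(rows, count_column):
    """Map (MemberID, Year, DSFS) -> count for the rows that are present."""
    return {(r["MemberID"], r["Year"], r["DSFS"]): parse_count(r[count_column])
            for r in rows}


def attach_counts(claim_rows, count_rows, count_column):
    """Copy each claim and add the count, using 0 when no count row exists."""
    lookup = counts_by_key(count_rows, count_column)
    out = []
    for claim in claim_rows:
        key = (claim["MemberID"], claim["Year"], claim["DSFS"])
        out.append({**claim, count_column: lookup.get(key, 0)})
    return out


# Invented sample rows, not taken from the real files.
claims = [
    {"MemberID": "1", "Year": "Y1", "DSFS": "0- 1 month"},
    {"MemberID": "1", "Year": "Y1", "DSFS": "1- 2 months"},
]
drugs = [{"MemberID": "1", "Year": "Y1", "DSFS": "0- 1 month", "DrugCount": "7+"}]

filled = attach_counts(claims, drugs, "DrugCount")
print([row["DrugCount"] for row in filled])
```

If anonymization did remove some count rows, the 0 written here really means "unknown", so a separate missing-value marker may be the safer choice.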

Thanked by Kwaak
 
David J. Slate
Rank 10th
Posts 85
Thanks 29
Joined 5 Aug '10
I noticed that the Data page still says "HHP_release2.zip contains the latest files, so you can ignore HHP_release1.zip." Shouldn't that be updated to say that HHP_release3.zip makes HHP_release2.zip obsolete?
 
Anthony Goldbloom (Kaggle)
Competition Admin
Kaggle Admin
Posts 383
Thanks 73
Joined 20 Jan '10
From Kaggle
Thanks Dave. The data description has been fixed.
 
Dirk Nachbar
Posts 84
Thanks 4
Joined 26 May '10
So, according to Ford Prefect, we should match the drug and lab data to claims by MemberID, Year and DSFS?
 
Allan Engelhardt
Posts 77
Thanks 29
Joined 28 May '10

We have drugs where there are no claims, e.g.

> drug[MemberID == "10002388" & Year == "Y1" & DSFS == "1- 2 months",]
     MemberID Year        DSFS DrugCount
[1,] 10002388   Y1 1- 2 months         2
> claims[MemberID == "10002388" & Year == "Y1" & DSFS == "1- 2 months",]
NULL data table

Are they free drugs, drugs paid for on a previous (later?) claim, or is dispensing of drugs not considered a treatment for the purposes of the claims table?

 
arbuckle
HHP Advisor
Posts 38
Thanks 21
Joined 5 May '11
The date a person fills a prescription won't necessarily correspond to the date in the claims. The drugs could be a refill, treatment for a recurring condition, or an "if it doesn't get better in a week, fill this prescription" instruction.
Thanked by Sarkis and Allan Engelhardt
 
Ford Prefect
Posts 23
Thanks 10
Joined 2 Dec '10

@Dirk Of course you're free to fit the data together any way you wish :)

However, if you do try to match drugs and labs to claims, then every one of the lab counts can be attached to an existing claim, but some claims won't have lab counts. Maybe the lab counts have no relation to the claim other than they occurred at the same time, but maybe the claim information complements the lab counts. I don't know :)

@Allan With drug counts, there are many cases where the drug count doesn't correspond to a claim in the dataset. There are 818241 drug counts and, assuming my code is correct, I've identified 311958 (38%) that cannot be attached to an existing claim, whereas the remaining ones can. But all 311958 of those instances have a MemberID with ClaimsTruncated=1 in the DaysInHospital file. Check your example.

My current theory is that these 311958 instances indicate phantom claims, i.e. anonymized claims we can't see but which generated a drug prescription. If we count the real claims and the phantom claims together, that's about 3 million claims, i.e. 10% more than the real claims alone.
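
The arithmetic behind those figures, as a small Python sketch. The two counts are the ones quoted above; the exact number of visible claims is not stated in this post, so it is left as a parameter, and the function names are just for illustration.

```python
def unmatched_share(unmatched, total):
    """Fraction of drug-count rows that cannot be attached to a claim."""
    return unmatched / total


def total_with_phantoms(visible_claims, phantom_claims):
    """Visible claims plus one assumed phantom claim per unmatched drug-count row."""
    return visible_claims + phantom_claims


DRUG_ROWS = 818241   # drug counts, as quoted above
UNMATCHED = 311958   # rows with no matching claim, as quoted above

print(f"{unmatched_share(UNMATCHED, DRUG_ROWS):.0%}")
```

Counting one phantom claim per unmatched row is itself part of the theory; several drug rows could in principle stem from a single hidden claim.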

However that may be too simple. The comment by @arbuckle suggests trying to match phantom claims to existing claims that are possibly earlier. That won't work in your example, but might work in a number of other cases.
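
One way to try that, sketched in Python with the standard library only: turn each DSFS bucket into a month index and look for the latest claim by the same member in the same year at or before the drug row's bucket. The parsing rule (take the first number in labels like "1- 2 months") and all helper names are assumptions for illustration, not a description of how the data was built.

```python
import re


def dsfs_index(label):
    """Return the starting month of a DSFS label, e.g. '1- 2 months' -> 1."""
    found = re.match(r"\s*(\d+)", label)
    if found is None:
        raise ValueError(f"unrecognised DSFS label: {label!r}")
    return int(found.group(1))


def nearest_earlier_claim(drug_row, claim_rows):
    """Latest claim for the same MemberID and Year whose DSFS is not after the drug row's."""
    target = dsfs_index(drug_row["DSFS"])
    candidates = [
        c for c in claim_rows
        if c["MemberID"] == drug_row["MemberID"]
        and c["Year"] == drug_row["Year"]
        and dsfs_index(c["DSFS"]) <= target
    ]
    return max(candidates, key=lambda c: dsfs_index(c["DSFS"]), default=None)


# Invented rows for illustration.
claims = [
    {"MemberID": "5", "Year": "Y1", "DSFS": "0- 1 month"},
    {"MemberID": "5", "Year": "Y1", "DSFS": "2- 3 months"},
    {"MemberID": "5", "Year": "Y1", "DSFS": "6- 7 months"},
]
drug_row = {"MemberID": "5", "Year": "Y1", "DSFS": "4- 5 months", "DrugCount": "1"}

nearest = nearest_earlier_claim(drug_row, claims)
print(nearest["DSFS"] if nearest else "no earlier claim")
```

Drug rows for which this returns None would remain candidates for phantom claims.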

 
Dr. Z
Posts 1
Joined 12 Jun '12

Dear Anthony

I couldn't find the RX table and Labs table in the third dataset. Were they removed recently?

Thanks

Z

 
