Here's a list of issues I've discovered so far with the new dataset:

1) In the Claims.csv file, some of the ID numbers have extra zeros in front compared with release 2, but there are no changes to any of the values.

2) The DaysInHospital_Y2.csv has one extra row for MemberID  24027423

3) The new DrugCount.csv file contains some entries that don't correspond to any claim in Claims.csv:

210,Y3,7- 8 months,1
210,Y3,8- 9 months,1
210,Y1,4- 5 months,1
210,Y3,5- 6 months,2

As I understand it, there should be claims for MemberID 210 appearing in Claims.csv for those particular Year and DSFS combinations, but they are missing. The other rows of DrugCount.csv have a corresponding claim in Claims.csv.

(Addendum) That last sentence isn't actually true. I've just looked at the data again, and there are a lot more MemberIDs where 3) applies. It turns out that every one of those MemberIDs has ClaimsTruncated=1 in the DaysInHospital csv files.

I suppose this means that the claims were anonymized, but the drug count data associated with them wasn't. This raises a few questions: Is the drug count data complete for all members, or did DrugCount.csv  get anonymized as well? Either way, the missing claims can be used to obtain a better estimate of the true number of claims for those members :-)
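For anyone who wants to repeat the check in 3), here is a minimal sketch of the comparison, not the exact code used above. It assumes DrugCount.csv and Claims.csv both carry MemberID, Year and DSFS columns, and it strips leading zeros from the IDs in case issue 1) affects the match. The helper names and the sample rows are made up for illustration; only the four MemberID 210 drug rows come from this post.

```python
import csv
import io


def normalize_id(raw):
    """Strip leading zeros so the same ID compares equal across releases."""
    return raw.strip().lstrip("0") or "0"


def key_of(row):
    """Build the (MemberID, Year, DSFS) key used to line up the two files."""
    return (normalize_id(row["MemberID"]), row["Year"].strip(), row["DSFS"].strip())


def unmatched_drug_rows(drug_rows, claim_rows):
    """Return the drug rows whose key has no counterpart among the claims."""
    claim_keys = {key_of(r) for r in claim_rows}
    return [r for r in drug_rows if key_of(r) not in claim_keys]


# Illustrative sample: member 999 and all claim rows are hypothetical.
drug_csv = """MemberID,Year,DSFS,DrugCount
210,Y3,7- 8 months,1
210,Y3,8- 9 months,1
210,Y1,4- 5 months,1
210,Y3,5- 6 months,2
999,Y1,0- 1 month,3
"""
claims_csv = """MemberID,Year,DSFS
0999,Y1,0- 1 month
210,Y2,0- 1 month
"""

drug_rows = list(csv.DictReader(io.StringIO(drug_csv)))
claim_rows = list(csv.DictReader(io.StringIO(claims_csv)))
missing = unmatched_drug_rows(drug_rows, claim_rows)

for r in missing:
    print(r["MemberID"], r["Year"], r["DSFS"], r["DrugCount"])
```

On the real files you would replace the two io.StringIO objects with open file handles. A set of keys keeps the lookup cheap even with millions of claim rows.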

Ford Prefect wrote:

2) The DaysInHospital_Y2.csv has one extra row for MemberID  24027423

That's good:

Thanks for all the observations: keep them coming!

I noted that in the LabCount.csv file LabCount has been capped at 10+, and in the DrugCount.csv file DrugCount has been capped at 7+. There is no information as to what laboratory tests were carried out, nor as to what drugs were supplied. Question to the HPN Administrator: are the counts of these services the only data that will be provided, for security reasons? Thanking you, Jim

Quote from the data page:

c. Labs Table, which will contain certain details of lab tests provided to members.

d. RX Table, which will contain certain details of prescriptions filled by members.

So there are no details, just sums?

Also, lots of MemberID and Year combinations are missing from the two datasets. Does that mean that no lab tests or prescriptions were given that year for that MemberID?

@Kwaak The drug counts are all at least 1, so it makes sense that there are missing combinations - those claims would have 0 drugs prescribed. Same with the lab counts, only nonzero combinations are provided.

If the lab counts and drug counts are complete, then it's ok to put zero counts in all the other claims, but if there was anonymization, then that may not be appropriate.
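A rough sketch of that zero-filling idea, under stated assumptions: counts are summed per MemberID and Year, a capped value such as "7+" is read as its lower bound, and a missing combination becomes 0 unless the member is in a set of truncated members, in which case it is left as None (unknown). The function names and the truncated_members parameter are hypothetical, not anything from the data description.

```python
def parse_count(raw):
    """Read '3' as 3 and a capped value like '7+' as its lower bound 7."""
    return int(str(raw).strip().rstrip("+"))


def counts_with_zeros(member_years, count_rows, truncated_members=frozenset()):
    """Total the counts per (MemberID, Year), filling gaps with 0 or None.

    member_years: iterable of (MemberID, Year) pairs we want a value for.
    count_rows: iterable of (MemberID, Year, count) taken from the count file.
    truncated_members: IDs for which a missing count is treated as unknown.
    """
    totals = {key: 0 for key in member_years}
    seen = set()
    for member_id, year, raw in count_rows:
        key = (member_id, year)
        if key in totals:
            totals[key] += parse_count(raw)
            seen.add(key)
    for key in totals:
        if key not in seen and key[0] in truncated_members:
            totals[key] = None  # gap may be due to anonymization, not a true zero
    return totals


# The four DrugCount rows quoted earlier in the thread for MemberID 210.
rows = [("210", "Y3", "1"), ("210", "Y3", "1"), ("210", "Y1", "1"), ("210", "Y3", "2")]
wanted = [("210", "Y1"), ("210", "Y2"), ("210", "Y3")]

print(counts_with_zeros(wanted, rows))
print(counts_with_zeros(wanted, rows, truncated_members={"210"}))
```

Whether None or 0 is the better default for truncated members depends on how the anonymization was done, which is still an open question here.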

I noticed that the Data page still says " contains the latest files, so you can ignore". Shouldn't that be updated to refer to obsoleting

Thanks Dave. The data description has been fixed.

So according to Ford Prefect we should match the drug and lab data by MemberID, Year and DSFS to claims?

We have drugs where there are no claims, e.g.

> drug[MemberID == "10002388" & Year == "Y1" & DSFS == "1- 2 months",]
     MemberID Year        DSFS DrugCount
[1,] 10002388   Y1 1- 2 months         2
> claims[MemberID == "10002388" & Year == "Y1" & DSFS == "1- 2 months",]
NULL data table

Are they free drugs, drugs paid for on a previous (later?) claim, or is dispensing of drugs not considered a treatment for the purposes of the claims table?

The date a person fills a prescription won't necessarily correspond to the date in the claims. The drugs could be a refill, a prescription for a recurring condition, or an "if it doesn't get better in a week, fill this prescription" case.

@Dirk  Of course you're free to fit the data together any way you wish :)

However, if you do try to match drugs and labs to claims, then every one of the lab counts can be attached to an existing claim, but some claims won't have lab counts. Maybe the lab counts have no relation to the claim other than they occurred at the same time, but maybe the claim information complements the lab counts. I don't know :)

@Allan With drug counts, there are many cases where the drug count doesn't correspond to a claim in the dataset. There are 818241 drug counts, and assuming my code is correct, I've identified 311958 (38%) instances which cannot be attached to an existing claim, whereas the remaining ones can. But all 311958 instances have a MemberID where ClaimsTruncated=1 in the DaysInHospital file; check your example.
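To make those two statements easy to re-check, here is a small sketch. The share is plain arithmetic on the two totals quoted above. The second helper lists any unmatched MemberID whose ClaimsTruncated flag is not 1, so an empty result supports the observation. The helper names are hypothetical, and the flag values in the sample dictionary are illustrative stand-ins, not values read from the DaysInHospital files.

```python
def share_unmatched(unmatched, total):
    """Fraction of drug-count rows that cannot be attached to a claim."""
    return unmatched / total


def members_not_truncated(unmatched_rows, claims_truncated):
    """Return sorted MemberIDs among the unmatched rows whose flag is not 1.

    unmatched_rows: iterable of dicts with a 'MemberID' key.
    claims_truncated: dict mapping MemberID -> ClaimsTruncated flag (0 or 1).
    """
    return sorted(
        {r["MemberID"] for r in unmatched_rows
         if claims_truncated.get(r["MemberID"]) != 1}
    )


print(f"{share_unmatched(311958, 818241):.1%}")  # 38.1%

# Illustrative flags only: assumed here, not taken from the real files.
sample_unmatched = [{"MemberID": "210"}, {"MemberID": "10002388"}]
sample_flags = {"210": 1, "10002388": 1}
print(members_not_truncated(sample_unmatched, sample_flags))  # []
```

A member missing from the flag dictionary is reported as well, which guards against IDs that fail to line up between files.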

My current theory is that these 311958 instances indicate phantom claims, i.e. anonymized claims we can't see but which generated a drug prescription. If we count the real claims and the phantom claims together, that's about 3 million claims, i.e. roughly 10% more than the real claims alone.

However that may be too simple. The comment by @arbuckle suggests trying to match phantom claims to existing claims that are possibly earlier. That won't work in your example, but might work in a number of other cases.
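One way that earlier-claim matching could be sketched: for an unmatched drug row, look for the same member's most recent claim in the same Year with a smaller DSFS. This is only one reading of the suggestion. It assumes DSFS labels always start with the month number, as in "7- 8 months", and that matching within a single Year is enough. The member ID 555 and its claim rows are invented for the example.

```python
def dsfs_start(dsfs):
    """Turn a DSFS label such as '7- 8 months' into its starting month, 7."""
    return int(dsfs.split("-")[0])


def latest_earlier_claim(drug_row, claim_rows):
    """Return the same member's latest claim before the drug row, or None.

    Rows are (MemberID, Year, DSFS) tuples. Only claims in the same Year
    with a strictly smaller DSFS start are considered.
    """
    member, year, dsfs = drug_row
    cutoff = dsfs_start(dsfs)
    candidates = [
        c for c in claim_rows
        if c[0] == member and c[1] == year and dsfs_start(c[2]) < cutoff
    ]
    return max(candidates, key=lambda c: dsfs_start(c[2]), default=None)


# Hypothetical member and claims, purely for illustration.
claims = [
    ("555", "Y3", "2- 3 months"),
    ("555", "Y3", "5- 6 months"),
    ("555", "Y3", "9-10 months"),
    ("555", "Y2", "6- 7 months"),
]

print(latest_earlier_claim(("555", "Y3", "7- 8 months"), claims))
print(latest_earlier_claim(("555", "Y3", "1- 2 months"), claims))
```

The second call returns None because no claim precedes that drug row, which is the situation where a phantom claim remains the simpler explanation.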

Dear Anthony

I couldn't find the RX table and Labs table in the third dataset. Were they removed recently?



