Why all the data censorship? Surely it will affect actual outcome accuracy...

Graham Giller's image Posts 3
Joined 30 Mar '12 Email user

This question isn't really about winning the competition, it's more about the actual goal of the competition, which presumably is to price health care premiums more accurately.

Just reading through the table schema, I'm noting that a lot of fields are truncated at the 95th percentile. Since we are talking about numbers such as 6 or 7 (7 becomes "6+" for DrugCount) or 163 or 162 for pay delay (163 becomes "162+" for days), storage cannot be the reason for this truncation ("162+" requires 4 bytes, 163 requires 1). Doesn't this censorship, which is all over the schema, have a major effect on the actual real-world utility of the forecasting data?
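To make the concern concrete, here is a minimal sketch of this kind of top-coding. The function names and the sample counts are made up for illustration, and the rule "anything above the cap becomes cap+" is an assumption taken from the DrugCount example above, not the organisers' actual code.

```python
from statistics import mean


def top_code(value: int, cap: int) -> str:
    """Collapse any value above `cap` into the label 'cap+' (assumed rule)."""
    return f"{cap}+" if value > cap else str(value)


def decode_naive(label: str) -> int:
    """Read a top-coded label back as a number, treating '6+' as plain 6."""
    return int(label.rstrip("+"))


# Made-up DrugCount values for eight hypothetical members.
drug_counts = [1, 2, 2, 3, 6, 7, 9, 12]

coded = [top_code(v, cap=6) for v in drug_counts]
true_mean = mean(drug_counts)
naive_mean = mean(decode_naive(label) for label in coded)

# The naive mean can only be lower than (or equal to) the true mean,
# because every censored value is replaced by the cap.
print(coded, true_mean, naive_mean)
```

Any model trained on the coded column sees the tail members as identical, which is exactly where the real costs are likely to differ most.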

Also consider the four reasons for a sick person not making a claim "next year": 1, death; 2, loss of financial ability to pay premiums/copays etc.; 3, transfer to another provider; 4, health. Naively, I would expect the costs of these outcomes to differ, and so the likelihood of each outcome should have an effect on the premium one charges a subscriber.

Yes, I understand it has no effect on the ability to forecast the out-of-sample data set if that is censored in the same way; but doesn't it have consequences for the real-world usage of the winning model? After all, the actual costs, and actual liabilities of the provider, don't get truncated in this manner.

Graham

 
S.U.T.'s image Posts 43
Thanks 7
Joined 5 Sep '11 Email user

Or, gains to accuracy... ageMISSING, and esp sexMISSING, seem to be highly predictive, more so than a given value for age or sex.

Of course, these result from a suppression algorithm, which is positively related to claims history "descriptiveness". This means they are artificially added and wouldn't be available from HPN's files directly. However, presumably the function SuppressSex( x ) operates on x, a claims history equivalent(?) to the dataset we have been given. As such, it could be incorporated into a

ProductionModel =
UnsuppressedData -> SuppressAlgo(UnsuppressedData) -> SuppressedData -> WinningModel(SuppressedData)
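A rough sketch of that composition, purely for illustration. Everything here is hypothetical: the real suppression rule is not public, so `suppress_sex` uses a made-up claim-count threshold as a stand-in for "descriptiveness", and `winning_model` is a toy scoring function rather than any actual entry.

```python
from typing import Callable, Dict, Optional

Record = Dict[str, Optional[object]]


def suppress_sex(record: Record) -> Record:
    """Hypothetical stand-in for SuppressSex(x): blank out sex when the
    claims history looks 'descriptive' (here: more than 40 claims)."""
    out = dict(record)  # copy, so the unsuppressed record is left untouched
    if record["n_claims"] > 40:
        out["sex"] = None
    return out


def winning_model(record: Record) -> float:
    """Toy placeholder model: scores a missing sex field as higher risk."""
    score = 0.1
    if record["sex"] is None:
        score += 0.5
    return score


def production_model(
    raw: Record,
    suppress: Callable[[Record], Record] = suppress_sex,
    model: Callable[[Record], float] = winning_model,
) -> float:
    """UnsuppressedData -> SuppressAlgo -> SuppressedData -> WinningModel."""
    return model(suppress(raw))


heavy_user = {"sex": "F", "n_claims": 44}
light_user = {"sex": "F", "n_claims": 3}
print(production_model(heavy_user), production_model(light_user))
```

The point of the sketch is only that the suppression step is deterministic given the claims history, so it can be re-run on unsuppressed production data before scoring.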

Cutting off the extreme 1% values of pay_delay seems to be unimportant: would a value of, e.g., "190 days" really be more informative than already knowing "162+"?

ClaimsTruncated and SupLOS are more troublesome, however. The above argument would go "if we already have 44 claims for a member, is it really helpful to get more information for this clearly quite sick patient?". The problem is that the stated suppression policy for these fields is to truncate the most "descriptive" claims records - those that are most unique, and isolate a member from a population. Described in this way, these seem to be exactly the kind of juicy identifiers that predictive modelers look for.

 
Graham Giller's image Posts 3
Joined 30 Mar '12 Email user

You seem to be implying that missing demographics (age and sex) are the result of deliberate data set obfuscation. If that is the case, and the missingness is a strong predictor, then the obfuscation is clearly broken and the models that learn how the obfuscator works are clearly going to fail on real (unobfuscated) data even if they win in the hidden testing data. 

My assumption was the missing demographics were due to errors... Could easily be an unreadable cell in some handwritten form and the company doesn't have the manpower to phone everybody up to fix the problems; it seems attractive to save costs by waiting for a member to call in and then trying to fix the data when they call etc.

Presumably missing demographics are not causal... I mean, they're not recorded accurately for some reason but how strongly is that unknown reason correlated with actual health? (Although it might be, elderly people's handwriting being less legible etc.)

Let's assume that the error rate in future data is going to decline --- because these patients do have an age and a sex, and presumably the company is going to try and find out what they are for all members, and also possesses the ability to correct this data after the fact --- so a model that relies on missing demographics is going to ultimately fail out-of-sample even if it does actually "win" in-sample *and* in the unknown testing data.

 
S.U.T.'s image Posts 43
Thanks 7
Joined 5 Sep '11 Email user

The data is obfuscated in order to minimize identification of hopefully anonymous patients from their records. As such, it takes into account how expressive or unique the events in the claims history are, and is more likely to obfuscate when presented with a highly unique member claim.

The fact that a member's history "unusualness" tends to correlate with hospitalization is not that surprising - a day in hospital is a rather rare event, and patients with a unique claims history may be perceived as being more at risk of becoming such an outlier.

Then, since the suppression algo is a function of claims history, if the code were known, the suppression flag could be generated from data with unobfuscated sex and age, as an additional field for the model input.

 
S.U.T.'s image Posts 43
Thanks 7
Joined 5 Sep '11 Email user

P.S. There was a presentation by the party involved in this process on this forum about a month ago.

 
Graham Giller's image Posts 3
Joined 30 Mar '12 Email user

Thanks, that's useful. Woke up thinking that the right approach might be to compose suitable Bayesian priors to put back the removed information!
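A minimal sketch of what that could look like for a top-coded count, under stated assumptions: the prior below is invented for illustration, and "6+" is read as "strictly more than 6", following the DrugCount example earlier in the thread. Conditioning on the censored observation just truncates the prior to the tail and renormalises it.

```python
from typing import Dict


def posterior_given_top_code(prior: Dict[int, float], cap: int) -> Dict[int, float]:
    """Posterior over the true value after observing the label 'cap+'.

    The observation only says value > cap, so the posterior is the prior
    restricted to the tail and rescaled to sum to one.
    """
    tail = {v: p for v, p in prior.items() if v > cap}
    total = sum(tail.values())
    if total == 0:
        raise ValueError("prior puts no mass above the cap")
    return {v: p / total for v, p in tail.items()}


def posterior_mean(posterior: Dict[int, float]) -> float:
    """Expected true value under the posterior."""
    return sum(v * p for v, p in posterior.items())


# Invented prior over DrugCount; in practice it would have to come from
# uncensored data or domain knowledge, which the competition does not supply.
prior = {5: 0.5, 7: 0.25, 8: 0.25}

post = posterior_given_top_code(prior, cap=6)
print(post, posterior_mean(post))
```

Replacing "6+" with the posterior mean instead of the cap would undo some of the downward bias, but only as well as the prior reflects the real tail.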

 
mkline55's image Posts 3
Joined 17 Apr '12 Email user

I know it's late, but I joined late, so... I couldn't resist this. If you want, you could mark all of the members with a primary condition of pregnant as being females, except for the three who checked the wrong box. ;) 

 
Jorgensen's image Posts 21
Joined 14 Feb '12 Email user

They won't use the winning model directly anyway. The winning model will be reviewed for insights into how to build a better internal model to work on the real, complete data. That is what happened with the Netflix prize, where the winning "model" was basically useless in a production setting.

 
