Let's talk unbalanced data sets


Okay, so the data is presented as days in hospital, which gets you thinking that this should be a regression and one should try to predict each member's DIH as accurately as possible.

But then we know that the "best constant value" benchmark of 0.21 is way better than most models people are chucking together. This suggests that it's hard to separate out classes of members. I like the idea of betting with means and groups of some kind, rather than placing a unique individual bet on every member.

The simplest possible split is a binary classifier: 0 days, or 1 or more days. The contest then allows us to provide a mean of some kind for each group, i.e. we get to submit real numbers, not integers.

Obviously something slightly smaller than 0.21 will be best for the 0 day guessed group, and something more than 0.21 for the 1 or more day guessed group.

I've been playing with a binary classifier using constant values of 0.15 and 0.85 (or more, like 1 to 1.5) for each group. I've been oversampling the minority output class with different weights to get better error rates on the minority class (compared to the majority 0-day output class).
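A minimal sketch of this scheme, assuming a generic feature matrix and binary target (the data here is random stand-in, not the actual competition tables): oversample the minority class by repeating rows, fit a random forest, then map each predicted class to a constant real-valued guess.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in data: 1 = member had >0 days in hospital (~15% positives).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.15).astype(int)

# Oversample the minority class by repeating its rows (a simple alternative to SMOTE).
pos = np.where(y == 1)[0]
extra = rng.choice(pos, size=3 * len(pos), replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_bal, y_bal)

# Map each predicted class to a constant real-valued submission, as described above.
constants = {0: 0.15, 1: 0.85}
preds = np.array([constants[c] for c in clf.predict(X)])
```

The oversampling ratio (3x here) is the knob to tune against the error rates on each class.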

I'm thinking there's more blood to squeeze from this approach, but I don't have the knowledge.

There are two interesting papers from 2002 and 2011 on dealing with unbalanced data sets and binary classifiers:

"SMOTE: Synthetic Minority Over-sampling Technique"

and

"Clustering-Based Binary-class Classification for Imbalanced Data Sets"

Question: what's considered the de facto state of the art for binary classifiers on unbalanced data sets? Undersampling the majority? Oversampling the minority? Weighted output classes (I'm using RF)? Thresholding on tree voting? I've seen people mention them all and am trying some.
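Two of those options can be sketched quickly with scikit-learn (again on stand-in data): `class_weight="balanced"` reweights classes inversely to their frequency, and for a random forest `predict_proba` is the fraction of trees voting for each class, so "thresholding on tree voting" just means picking a cutoff below 0.5.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in imbalanced data: ~10% minority class.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.1).astype(int)

# Weighted output classes: reweight inversely to class frequency.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(X, y)

# Thresholding on tree votes: lowering the cutoff below 0.5 trades
# majority-class accuracy for minority-class recall.
vote_fraction = clf.predict_proba(X)[:, 1]
preds_default = (vote_fraction >= 0.5).astype(int)
preds_low_threshold = (vote_fraction >= 0.3).astype(int)
```

The threshold itself can be tuned on held-out data against whatever per-class error rates you care about.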

I'm also encouraged by the use of extremely randomized trees (scikit-learn's ExtraTreesClassifier) because of their different bias behavior.
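For anyone curious, comparing the two is a one-swap change; this sketch uses a synthetic imbalanced problem (via `make_classification`, an assumed stand-in for the real features) and cross-validated AUC:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy imbalanced problem: ~15% minority class.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.85, 0.15], random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
# ExtraTrees draws split thresholds at random instead of optimizing them,
# which changes the bias/variance trade-off.
et = ExtraTreesClassifier(n_estimators=100, random_state=0)

rf_auc = cross_val_score(rf, X, y, cv=3, scoring="roc_auc").mean()
et_auc = cross_val_score(et, X, y, cv=3, scoring="roc_auc").mean()
```

Which one wins is problem-dependent; on the real data it's worth cross-validating both.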

thoughts/direction? I'm no expert here.


The method I've seen most adopted in medical data-mining research/competitions is custom ensembling methods for trees. Each prediction element applied to the ensemble has a distribution of tree ("expert") scores that, through the custom technique, refine the prediction element's score away from the ensemble mean score.

On the research/technical side, this distribution is parameterized, which provides further input variables for the element's score: http://rd.springer.com/chapter/10.1007/978-3-642-19423-8_33

On the applied/competition side, the distribution of the ensemble of experts is tuned with a more heuristic approach, often specific to the competition's data bias/error metric: see tuning the FROC on p. 17 of www.prem-melville.com/publications/medical-mining-dmkd09.pdf

Thanks S.U.T
Prem Melville's (@IBM) papers seem interesting, as does his KDD 2008/2009 work.

Talking about bias. Some thoughts:

This competition has an inherent bias toward humans who filed claims or had lab, drug, or DaysInHospital info, right? So it's not an unbiased sample of humans, or of HH customers?

There are 113,000 members, and 113,000 unique members in Claims.
Looking at unique members in Claims per year, it's:
76,037 in Y1, 71,434 in Y2, 70,942 in Y3

75,998 unique members in DrugCount
86,640 unique members in LabCount

So DrugCount and LabCount may be biased. There might be a flow of: get people with claims over the N-year period, list them as members, then get their drug and lab counts.
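These membership/coverage checks are easy to reproduce with pandas; a sketch on a tiny stand-in version of the tables (column names assumed from the data dictionary):

```python
import pandas as pd

# Toy stand-ins for the competition tables.
claims = pd.DataFrame({
    "MemberID": [1, 1, 2, 3, 3, 4],
    "Year": ["Y1", "Y2", "Y1", "Y2", "Y3", "Y3"],
})
drug_count = pd.DataFrame({"MemberID": [1, 2, 2]})

# Unique members overall and per year, as in the counts quoted above.
n_members = claims["MemberID"].nunique()
per_year = claims.groupby("Year")["MemberID"].nunique()

# Members with claims but no DrugCount rows -- the possible coverage bias.
missing_drugs = set(claims["MemberID"]) - set(drug_count["MemberID"])
```

Running the same three lines on the real tables would quantify exactly which members DrugCount/LabCount are missing.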

So the prediction doesn't include people who had no claims in prior years and just all of a sudden ended up in the hospital the next year.

There's a weird calendar-year effect: for some cases, the claims that are most relevant to predicting the hospital stay may occur in the same calendar year as the stay itself. I suppose, depending on the time period of the leading indicators, some predictable percentage of those events will be in the same year as the hospital stay, assuming even distribution of likelihood across a calendar year.

Maybe the biggest problem making this hard to classify is the calendar-year issue?

If the leading indicators for most things are < 6 months out, then predicting a hospital stay for next year might be most accurate if you only look at data from the last 6 months of the prior year? Or maybe even the last 2 months.

But strangely, we don't have accurate info for that. Is DSFS (days since first service that year) not useful?
Maybe it could be used this way: if there are multiple DSFS values for any data, just use the data associated with the biggest DSFS. That should be "last" in that calendar year, and closest in time for predicting short-onset events the next year?
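That "keep the largest DSFS per member" idea is a one-liner in pandas. A sketch, where DSFS is simplified to an integer months-since-first-service (the real field uses "0-1 month"-style buckets, so it would need decoding first):

```python
import pandas as pd

# Toy claims table; DSFS simplified to integer months-since-first-service.
claims = pd.DataFrame({
    "MemberID":  [1, 1, 1, 2, 2],
    "DSFS":      [2, 7, 5, 1, 11],
    "Specialty": ["A", "B", "C", "A", "D"],
})

# Keep only each member's latest claim in the calendar year (largest DSFS),
# on the theory that it sits closest in time to a next-year hospital stay.
latest = claims.loc[claims.groupby("MemberID")["DSFS"].idxmax()]
```

The same `groupby(...).idxmax()` pattern works for picking the latest row out of DrugCount or LabCount too.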

??

I suppose it's not useful to summarize counts of things over the entire prior year?

I'm also starting out trying to get accurate performance from a binary classifier (0 DIH or >0 DIH). If that works, I would then probably train a separate model for p(DIH | DIH > 0). I'm currently trying to train the first classifier using a deep learning approach. If I then assign DIH to the two classes using random values from two judiciously chosen distributions, I can get an OK score (~0.49), but it still doesn't beat the GBM and random forest approaches I've previously tried. I'm hoping performance will get much better when I implement the second model for DIH.
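The two-stage scheme described here can be sketched with off-the-shelf models (RF + GBM rather than the deep-learning first stage, and random stand-in data; modeling log(1+DIH) is an assumption motivated by the competition's log-scale error metric):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestClassifier

# Stand-in data: ~15% of members have 1-14 days in hospital, the rest 0.
rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 10))
dih = np.where(rng.random(2000) < 0.15, rng.integers(1, 15, 2000), 0)

# Stage 1: classify 0 vs >0 days in hospital.
stage1 = RandomForestClassifier(n_estimators=100, random_state=0)
stage1.fit(X, (dih > 0).astype(int))

# Stage 2: regress log(1 + DIH) on the positive cases only, i.e. p(DIH | DIH > 0).
positive = dih > 0
stage2 = GradientBoostingRegressor(random_state=0)
stage2.fit(X[positive], np.log1p(dih[positive]))

# Combine: expected DIH = P(DIH > 0) * E[DIH | DIH > 0].
p_pos = stage1.predict_proba(X)[:, 1]
pred = p_pos * np.expm1(stage2.predict(X))
```

Multiplying the stage-1 probability by the stage-2 conditional estimate is one way to combine them; hard-thresholding stage 1 and only applying stage 2 to predicted positives is another.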

Is anyone else attempting a similar approach?  If so, what kind of performance are you getting?
