Okay, so the data is presented as days in hospital (DIH), and that gets you thinking this should be a regression: try to predict each member's DIH as accurately as possible.
But we also know that the "best constant value" benchmark of 0.21 beats most of the models people are throwing together, which suggests it's hard to separate members into classes. I like the idea of betting with means over groups of some kind, rather than placing a unique bet on every individual.
The simplest possible split is a binary classifier: 0 days versus 1 or more days. The contest then lets us submit a mean of some kind for each group, i.e. we get to submit real numbers, not integers.
Obviously something slightly smaller than 0.21 will be best for the predicted-0-day group, and something larger than 0.21 for the predicted 1-or-more-day group.
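The two-constant scheme above can be sketched in a few lines; the 0.15/0.85 values are just placeholders to tune on a validation set, not anything blessed:

```python
import numpy as np

def group_predictions(predicted_zero_day, low=0.15, high=0.85):
    """Map a boolean array (True = classified as 0 days) to the
    real-valued submission for each member: a small constant for the
    predicted-0-day group, a larger one for the 1-or-more-day group."""
    predicted_zero_day = np.asarray(predicted_zero_day)
    return np.where(predicted_zero_day, low, high)
```

So the classifier only has to get the group right, and the constants carry the actual bet.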
I've been playing with a binary classifier using constant values of 0.15 and 0.85 (or more, like 1 to 1.5) for the two groups. I've been oversampling the minority output classes with different weights to get better error rates on the minority classes (compared to the majority 0-day output class).
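A related knob in scikit-learn is `class_weight` on the forest itself, which penalizes minority-class mistakes more without touching the sampling. A minimal sketch on synthetic data (the 1:5 weighting is an assumption to tune, not a recommendation):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the member data: ~15% minority class.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (rng.random(500) < 0.15).astype(int)

# Upweight the minority "1+ days" class instead of oversampling it.
clf = RandomForestClassifier(
    n_estimators=100,
    class_weight={0: 1, 1: 5},  # hypothetical weights; cross-validate these
    random_state=0,
)
clf.fit(X, y)
preds = clf.predict(X)
```

Oversampling and class weighting often end up in a similar place, but weighting avoids duplicating rows and is one line to change.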
I'm thinking there's more blood to squeeze from this approach, but I don't have the knowledge.
There are two interesting papers from 2002 and 2011 on dealing with unbalanced data sets and binary classifiers:
"SMOTE: Synthetic Minority Over-sampling Technique"
"Clustering-Based Binary-class Classification for Imbalanced Data Sets"
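For intuition, the core move in SMOTE is simple: synthesize new minority samples by interpolating between a minority point and one of its nearest minority neighbors. This is only a bare-bones sketch of that idea (the paper has more to it), with `k` and `n_new` as assumed knobs:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating
    between random minority points and their k nearest minority
    neighbors. Minimal illustration, not the full SMOTE algorithm."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from point i to every minority point.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbors)
        gap = rng.random()  # random spot on the segment between the two
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)
```

Because the synthetic points sit on segments between real minority points, they stay inside the minority region rather than just duplicating rows the way plain oversampling does.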
Question: what's considered the de facto state of the art for binary classifiers on unbalanced data sets? Undersampling the majority? Oversampling the minority? Weighted output classes (I'm using RF)? Thresholding on tree voting? I've seen people mention all of these and am trying some.
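On the thresholding option: with an RF you can read `predict_proba` as the fraction of trees voting for each class and cut it wherever you like instead of at the default 0.5. A sketch on synthetic data (the 0.3 threshold is an assumption; sweep it on a validation set):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced problem: class 1 is the rare "1+ days" class.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 1.0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Fraction of trees voting for class 1, then a lowered cut so more
# members land in the minority group than the default 0.5 would give.
p_minority = clf.predict_proba(X)[:, 1]
pred = (p_minority >= 0.3).astype(int)
```

The nice part is that one trained forest gives you the whole threshold curve for free, so you can trade minority recall against majority precision without refitting.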
I'm also encouraged by extremely randomized trees (scikit-learn's ExtraTreesClassifier) because of their different bias behavior.
thoughts/direction? I'm no expert here.