I am facing a puzzle and am looking for suggestions about what the issue could be. I am using random forests, and the table below shows how the train/test sets are organized, along with the corresponding results. Sets 1 and 2 are each split into training and testing subsets, while set 3 is trained on the Y1/Y2 combination and verified against Y3 days. I get fairly consistent results of ~0.45, but when I submit set 4 for verification I get ~0.47.
Cases 3 and 4 should be symmetrical, so I am looking for suggestions about what might generally be wrong. The data sets are prepared the same way, and the same code is used in all 4 instances.
| Set | Train data (drugs, labs, claims) | Days in hospital for training | Test days | Train / Test set size | Result |
|---|---|---|---|---|---|
| 1 | Y1 | Y2 | Y2 | 80% / 20% | ~0.45 |
| 2 | Y2 | Y3 | Y3 | 80% / 20% | ~0.45 |
| 3 | Y1 | Y2 | Y3 | 100% | ~0.45 |
| 4 | Y2 | Y3 | Target | 100% | ~0.47 |
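To make the table concrete, here is a minimal sketch of how the four setups could be written down and how the 80% / 20% member split might be done. The names `CASES` and `split_members` are hypothetical and only for illustration; this is not my actual pipeline, and the random forest itself is left out.

```python
import random

# The four setups from the table above (year labels as in the post).
# train_frac is the share of members used for training within that year.
CASES = [
    {"case": 1, "features": "Y1", "train_days": "Y2", "test_days": "Y2", "train_frac": 0.8},
    {"case": 2, "features": "Y2", "train_days": "Y3", "test_days": "Y3", "train_frac": 0.8},
    {"case": 3, "features": "Y1", "train_days": "Y2", "test_days": "Y3", "train_frac": 1.0},
    {"case": 4, "features": "Y2", "train_days": "Y3", "test_days": "Target", "train_frac": 1.0},
]


def split_members(member_ids, train_frac, seed=0):
    """Shuffle member ids and split them into (train, test) lists.

    With train_frac == 1.0 all members go to training and the test list
    is empty, since testing is then done against a different year.
    """
    ids = list(member_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    cut = round(len(ids) * train_frac)
    return ids[:cut], ids[cut:]


if __name__ == "__main__":
    for case in CASES:
        train, test = split_members(range(10), case["train_frac"])
        print(case["case"], len(train), len(test))
```

Written this way, cases 1 and 2 hold out members within the same year, while cases 3 and 4 train on every member and are scored on a later period, which is why I expected 3 and 4 to behave alike.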
Thanks in advance...
Mirko