The Order of MemberID in Target

I would appreciate it if someone from Kaggle could clarify whether keeping this particular order in our submissions matters. It is quite difficult to rearrange the output to match it. If the submission really does have to be made in this order, it would be a great help if one team posted their R code for it. Thank you.

There is probably a more elegant way to do this, but my shortcut in R is:

Target <- read.csv('Target.csv')
Target$ord <- 1:nrow(Target)  # remember the original row order
mydata <- some.data.frame.with.my.model.output
final <- merge(Target, mydata, by="MemberID", all=TRUE)
final <- final[order(final$ord),]  # put the rows back in Target order

Gary wrote:

> There is probably a more elegant way to do this, but my shortcut in R is: [...]

Thank you.  In the meantime I found "Merge Tables Wizard for Excel", which did the trick for me.  When I re-submitted the same prediction in the "correct" order, my score did improve from 0.534130 to 0.512619.

@Gary: Just use sort=FALSE in merge, e.g. (and assuming here that your target is log1p(DaysInHospital) which is usually the best):

log1p.predict <- predict(model.train, newdata = score.x)
submit <- data.frame(MemberID = score.data[["MemberID"]],
                     DaysInHospital = expm1(log1p.predict))
score <- target[, 1:2] # From reading Target.csv
submit <- merge(score, submit, sort = FALSE)
write.csv(submit, file = "submit.train.csv", quote = FALSE, row.names = FALSE)

Hope this helps someone.
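One caveat, offered as a hedged alternative rather than a correction: R's documentation for merge() describes the row order under sort = FALSE as unspecified, so if you would rather not depend on it, match() can do the reordering explicitly. The sketch below uses small made-up data frames (target and preds are stand-ins for Target.csv and the model output, not the real files).

```r
# Hypothetical stand-ins for Target.csv and the model output.
target <- data.frame(MemberID = c(30L, 10L, 20L))
preds  <- data.frame(MemberID = c(10L, 20L, 30L),
                     DaysInHospital = c(0.1, 0.2, 0.3))

# For each MemberID in target, find its row position in preds.
idx <- match(target$MemberID, preds$MemberID)
stopifnot(!anyNA(idx))  # every target member needs a prediction

# Build the submission in exactly the order of the target file.
submit <- data.frame(MemberID = target$MemberID,
                     DaysInHospital = preds$DaysInHospital[idx])
```

Because the rows are taken from preds by position, the result follows the target order regardless of how preds was sorted.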

It would still be nice to get clarification on whether the order matters, and also on whether the values in column 2 matter (or whether we can, for instance, output zeroes there). I have assumed that the order does not matter, but my results are ... a little counter-intuitive.

The current scoring code effectively ignores whatever you put in the first two columns, so the order of member IDs does matter. One suggestion: get the order we expect into a file format you like, use a dictionary/map internally to store the prediction for each member, and then, at the very end, iterate through the expected order and pull each value from your dictionary before you submit.
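For R users, the dictionary/map idea might look roughly like the sketch below. It is only an illustration under assumptions: the IDs and values are invented, and an environment is used as the dictionary (one of several ways to get keyed lookup in R).

```r
# Hypothetical data: expected order from the target file, plus
# predictions that were produced in some other order.
expected_ids <- c(30L, 10L, 20L)
pred_ids     <- c(10L, 20L, 30L)
pred_values  <- c(0.1, 0.2, 0.3)

# Build the dictionary: an environment keyed by MemberID (keys are strings).
lookup <- new.env(hash = TRUE)
for (i in seq_along(pred_ids)) {
  assign(as.character(pred_ids[i]), pred_values[i], envir = lookup)
}

# Walk the expected order and fetch each member's prediction.
# get() raises an error if a member has no prediction.
ordered_values <- vapply(as.character(expected_ids),
                         function(id) get(id, envir = lookup, inherits = FALSE),
                         numeric(1), USE.NAMES = FALSE)

submit <- data.frame(MemberID = expected_ids, DaysInHospital = ordered_values)
```

The error on a missing key is deliberate here: a silent gap would shift nothing, but it would leave a member without a value, which is easier to catch before submitting than after.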

@Allan, great solution, thanks!
Jeff, thanks for clarifying. My scores now make more sense. Also, I think you should be clearer about the order on the submission page/instructions. Maybe even have the checker compare the submitted MemberID field to the expected value and inform the submitter if the order is wrong.
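Until something like that exists on the server side, a check of this kind can be run locally before uploading. This is a rough sketch, not Kaggle's checker: check_member_order is a hypothetical helper name, and it assumes you have already read the MemberID column of your submission and of Target.csv into two vectors.

```r
# Hypothetical helper: returns TRUE if the two MemberID vectors agree
# row by row; otherwise stops with a message describing the problem.
check_member_order <- function(submitted_ids, expected_ids) {
  if (length(submitted_ids) != length(expected_ids)) {
    stop("Row count differs: got ", length(submitted_ids),
         ", expected ", length(expected_ids))
  }
  bad <- which(submitted_ids != expected_ids)
  if (length(bad) > 0) {
    stop("MemberID order differs from the expected file, first at row ", bad[1])
  }
  TRUE
}

# Example with made-up IDs: same order passes.
check_member_order(c(30L, 10L, 20L), c(30L, 10L, 20L))
```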

I agree about updating the description on the website; I just submitted my predictions sorted by memberID, and while it's not a big deal, it's a pain to have to wait another day to see how my model did.

> effectively ignores whatever you put in the first two columns

Jeff, I'm sorry, is it the year 2011 now? Not 1981? Why is it done this way? Are you lacking the computer resources to sort 70k rows? If you like, I can provide an efficient algorithm to match submissions regardless of row order.

As far as I can see, no official documentation mentions this limitation.

Usually this costs a newcomer one day of depression, seeing that the first results are completely wrong and then finding an answer in an obscure forum thread.

No checks, no error messages, no documentation.
Unexpectedly low quality in this aspect.

Or, is it some kind of IQ check for newcomers?

Sorry, I know it's frustrating. I plan to address it this week (along with making the parser more forgiving). There were just a few things ahead of it in the queue. 

Aligning one set of data to the order of the target file is a simple matter that can be done in just a few lines of code (at least in MATLAB). It runs in about 20 seconds.

I can post some sample code if anyone is interested. This is just gymnastics.

I see that the current scoring code actually checks MemberIDs (and column order doesn't matter, among other "improvements"; personally I prefer one format and one format only), but were the old submissions re-scored when sort-by-ID was implemented?
