The Order of MemberID in Target

R. Kaan Ozbayrak
Posts 13
Joined 20 Mar '11

I would appreciate it if someone from Kaggle could clarify whether our submissions must keep the MemberID order used in Target.csv. It is quite difficult to rearrange the output to match this particular order. If the submission really does have to be in this order, it would be a great help if one team posted their R code for it. Thank you.

 
Gary
Posts 2
Joined 14 Feb '11

There is probably a more elegant way to do this, but my shortcut in R is:

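# Read the target file and record its original row order
Target <- read.csv('Target.csv')
Target$ord <- seq_len(nrow(Target))

# Placeholder: a data frame with a MemberID column plus the model output
mydata <- some.data.frame.with.my.model.output

# all.x = TRUE keeps exactly the rows of Target
# (all = TRUE would also add any rows of mydata that have no match in Target)
final <- merge(Target, mydata, by = "MemberID", all.x = TRUE)

# Restore the original Target order
final <- final[order(final$ord), ]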

 
R. Kaan Ozbayrak
Posts 13
Joined 20 Mar '11

Gary wrote:

There is probably a more elegant way to do this, but my shortcut in R is: [...]

Thank you.  In the meantime I found "Merge Tables Wizard for Excel", which did the trick for me.  When I re-submitted the same prediction in the "correct" order, my score did improve from 0.534130 to 0.512619.

 
Allan Engelhardt
Posts 77
Thanks 29
Joined 28 May '10

@Gary: Just use sort=FALSE in merge, e.g. (and assuming here that your target is log1p(DaysInHospital) which is usually the best):

# Predict on the log1p scale, then transform back with expm1()
log1p.predict <- predict(model.train, newdata = score.x)
submit <- data.frame(MemberID = score.data[["MemberID"]],
                     DaysInHospital = expm1(log1p.predict))
# First two columns of the target, from reading Target.csv
score <- target[, 1:2]
# Merge on the shared MemberID column without re-sorting by it
submit <- merge(score, submit, sort = FALSE)
write.csv(submit, file = "submit.train.csv", quote = FALSE, row.names = FALSE)

Hope this helps someone.
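
A note on the merge() approach above: R's documentation for merge() says that with sort = FALSE the rows come back in an unspecified order, so the Target order is not strictly guaranteed. A minimal alternative sketch indexes the predictions with match() instead. The toy data frames, IDs, and values are made up for illustration, and the second column name (ClaimsTruncated) is an assumption about Target.csv.

```r
# Toy stand-ins for Target.csv and the model output (made-up IDs and values).
# The column name ClaimsTruncated is an assumption about the real file.
target <- data.frame(MemberID = c(30L, 10L, 20L),
                     ClaimsTruncated = c(0L, 1L, 0L))
preds <- data.frame(MemberID = c(10L, 20L, 30L),
                    DaysInHospital = c(0.1, 0.2, 0.3))

# For each MemberID in target, find its row position in preds.
idx <- match(target$MemberID, preds$MemberID)

# Stop early if any expected member has no prediction.
stopifnot(!any(is.na(idx)))

# Rows stay in target order; predictions are pulled in by position.
submit <- data.frame(target, DaysInHospital = preds$DaysInHospital[idx])
```

Because the result is built from target itself, the row order cannot change, and a missing prediction fails loudly rather than silently dropping a row.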

Thanked by R. Kaan Ozbayrak
 
Tapani
Posts 9
Joined 30 Apr '11
It would still be nice to get clarification on whether the order matters, and also on whether the values in column 2 matter (or whether we can, for instance, output zeroes there). I have assumed that the order does not matter, but my results are ... a little counter-intuitive.
 
Jeff Moser
Kaggle Admin
Posts 404
Thanks 214
Joined 21 Aug '10

The current scoring code effectively ignores whatever you put in the first two columns, so the order of the MemberIDs does matter. One suggestion: get the order we expect into a file format you like, use a dictionary/map internally to store the prediction for each member, and at the very end iterate through the expected order, pulling each value from your dictionary before you submit.
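
In R, the dictionary/map idea could be sketched with a named numeric vector, as below. This is only an illustration: the IDs, values, and variable names (expected_ids, pred_by_id) are made up, and it assumes MemberID was read as an integer column.

```r
# Expected order, as it would be read from the target file (toy values).
expected_ids <- c(30L, 10L, 20L)

# "Dictionary": one prediction per member, keyed by MemberID as a string.
pred_by_id <- c("10" = 0.1, "20" = 0.2, "30" = 0.3)

# Walk the expected order and look up each member's prediction.
# as.character() on integers yields plain digit strings that match the keys.
ordered_pred <- unname(pred_by_id[as.character(expected_ids)])

# A missing key would show up as NA, so check before writing the file.
stopifnot(!any(is.na(ordered_pred)))
```

One caveat with this sketch: if the IDs are stored as doubles rather than integers, as.character() may produce scientific notation for some values, so converting with as.integer() first is the safer choice.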

Thanked by R. Kaan Ozbayrak
 
Gary
Posts 2
Joined 14 Feb '11
@Allan, great solution, thanks!
 
Tapani
Posts 9
Joined 30 Apr '11
Jeff, thanks for clarifying. My scores now make more sense. Also, I think you should be clearer about the order on the submission page/instructions. Maybe even have the checker compare the submitted MemberID field to the expected value and inform the submitter if the order is wrong.
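
Until such a check exists on the server side, a submitter could run something similar locally before uploading. The function below is a hypothetical helper (check_order is not part of any Kaggle tooling) that compares the MemberID column of a submission against the expected order and reports the first mismatch.

```r
# Hypothetical local check: compare submitted IDs to the expected order.
# Returns "ok" or a short message describing the first problem found.
check_order <- function(submit_ids, expected_ids) {
  if (length(submit_ids) != length(expected_ids)) {
    return("row count differs from the expected file")
  }
  bad <- which(submit_ids != expected_ids)
  if (length(bad) > 0) {
    return(sprintf("first mismatch at row %d", bad[1]))
  }
  "ok"
}

# Toy usage with made-up IDs.
result_same   <- check_order(c(30L, 10L, 20L), c(30L, 10L, 20L))
result_sorted <- check_order(c(10L, 20L, 30L), c(30L, 10L, 20L))
result_short  <- check_order(c(30L, 10L), c(30L, 10L, 20L))
```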
 
MightyMidwest
Posts 1
Joined 26 Jul '11

I agree about updating the description on the website; I just submitted my predictions sorted by memberID, and while it's not a big deal, it's a pain to have to wait another day to see how my model did.

 
kozz
Posts 1
Joined 23 May '11

> effectively ignores whatever you put in the first two columns

Jeff, I'm sorry, is it year 2011 now? Not 1981? Why is it done this way? Are you lacking computer resources to sort 70k rows? If you like, I can provide an efficient algorithm to match submissions regardless of row order.

As far as I could see, no official documentation mentions this limitation.

Usually this costs a newcomer 1 day of depression, seeing that the first results are completely wrong and then finding an answer in an obscure forum thread.

No checks, no error messages, no documentation.
Unexpectedly low quality in this aspect.

Or, is it some kind of IQ check for newcomers?

 
Jeff Moser
Kaggle Admin
Posts 404
Thanks 214
Joined 21 Aug '10

kozz wrote:

> effectively ignores whatever you put in the first two columns

Jeff, I'm sorry, is it year 2011 now? Not 1981? Why is it done this way? Are you lacking computer resources to sort 70k rows? If you like, I can provide an efficient algorithm to match submissions regardless of row order.

As far as I could see, no official documentation mentions this limitation.

Usually this costs a newcomer 1 day of depression, seeing that the first results are completely wrong and then finding an answer in an obscure forum thread.

No checks, no error messages, no documentation.
Unexpectedly low quality in this aspect.

Or, is it some kind of IQ check for newcomers?

Sorry, I know it's frustrating. I plan to address it this week (along with making the parser more forgiving). There were just a few things ahead of it in the queue. 

 
Dobson
Posts 4
Joined 7 Jul '11

Aligning one set of data to the order of the target file is a simple matter that can be done in just a few lines of code (at least in MATLAB). It runs in about 20 seconds.

I can post some sample code if anyone is interested. This is just gymnastics.

 
B Yang
Rank 2nd
Posts 245
Thanks 65
Joined 12 Nov '10

I see the current scoring code actually checks MemberIDs (and column order doesn't matter, among other "improvements"; personally I prefer one format and one format only). But were the old submissions re-scored when sort-by-ID was implemented?

 
