The Order of Memberid in Target

R. Kaan Ozbayrak
Posts 13
Joined 20 Mar '11

I would appreciate it if someone from Kaggle could clarify whether our submissions must keep the MemberID order used in Target.csv. Rearranging the output to match that order is quite difficult. If submissions really do have to be in that order, it would be a great help if one team posted their R code for it. Thank you.

 
Gary
Posts 2
Joined 14 Feb '11

There is probably a more elegant way to do this, but my shortcut in R is:

 

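Target <- read.csv('Target.csv')

# Remember each row's original position in Target.csv
Target$ord <- seq_len(nrow(Target))

# Placeholder: substitute the data frame that holds your model output
mydata <- some.data.frame.with.my.model.output

# all.x=TRUE keeps every Target row and adds no extra rows from mydata
final <- merge(Target,mydata,by="MemberID",all.x=TRUE)

# merge() reorders rows, so sort back into the original Target.csv order
final <- final[order(final$ord),]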

 
R. Kaan Ozbayrak
Posts 13
Joined 20 Mar '11


Thank you. In the meantime I found "Merge Tables Wizard for Excel", which did the trick for me. When I re-submitted the same predictions in the "correct" order, my score improved from 0.534130 to 0.512619.

 
Allan Engelhardt
Rank 69th
Posts 77
Thanks 29
Joined 28 May '10

@Gary: Just use sort=FALSE in merge, e.g. (and assuming here that your target is log1p(DaysInHospital) which is usually the best):

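# Predict on the log1p scale, then map back to days with expm1
log1p.predict <- predict(model.train, newdata = score.x)
submit <- data.frame(MemberID = score.data[["MemberID"]],
                     DaysInHospital = expm1(log1p.predict))
score <- target[, 1:2] # From reading Target.csv
# Note: ?merge documents the row order as unspecified when sort = FALSE,
# so check that submit$MemberID matches target$MemberID before submitting
submit <- merge(score, submit, sort = FALSE)
write.csv(submit, file = "submit.train.csv", quote = FALSE, row.names = FALSE)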

Hope this helps someone.

Thanked by R. Kaan Ozbayrak
 
Tapani
Posts 9
Joined 30 Apr '11
It would still be nice to get clarification on whether the order matters, and also on whether the values in column 2 matter (or whether we can, for instance, output zeroes there). I have assumed that the order does not matter, but my results are ... a little counter-intuitive.
 
Jeff Moser
Kaggle Admin
Posts 347
Thanks 166
Joined 21 Aug '10
From Kaggle

The current scoring code effectively ignores whatever you put in the first two columns (in effect, rows are matched by position rather than by MemberID), so the order of member IDs does matter. One suggestion: get the order we expect into a file format you like, use a dictionary/map internally to store the prediction for each member, and then, at the very end, iterate through the expected order and look up each value in your dictionary before you submit.
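
A minimal R sketch of that lookup approach, offered as an illustration rather than as Kaggle's own code. It assumes the expected order comes from a MemberID column as in Target.csv and that the model output has MemberID and DaysInHospital columns; the toy data frames and the names `predictions`, `lookup`, and `submit` are made up for the example.

```r
# Toy stand-in for Target.csv: MemberIDs in the order the scorer expects.
target <- data.frame(MemberID = c(30L, 10L, 20L))

# Hypothetical model output, in some other order (here sorted by MemberID).
predictions <- data.frame(MemberID = c(10L, 20L, 30L),
                          DaysInHospital = c(0.1, 0.2, 0.3))

# Build the "dictionary": a named vector keyed by MemberID.
lookup <- setNames(predictions$DaysInHospital,
                   as.character(predictions$MemberID))

# Walk the expected order and pull each member's prediction from the lookup.
submit <- data.frame(
  MemberID = target$MemberID,
  DaysInHospital = unname(lookup[as.character(target$MemberID)])
)

# Fail loudly if any expected member has no prediction.
stopifnot(!anyNA(submit$DaysInHospital))
```

Keying the lookup by the ID as a character string avoids accidentally indexing by position, which is what a plain integer index into a vector would do.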

Thanked by R. Kaan Ozbayrak
 
Gary
Posts 2
Joined 14 Feb '11
@Allan, great solution, thanks!
 
Tapani
Posts 9
Joined 30 Apr '11
Jeff, thanks for clarifying. My scores now make more sense. Also, I think you should be clearer about the order on the submission page/instructions. Maybe even have the checker compare the submitted MemberID field to the expected values and inform the submitter if the order is wrong.
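
Until something like that exists on the server side, the same check can be run locally before uploading. The following is only a sketch: `same_order` is a hypothetical helper name, and the ID vectors are toy values standing in for the MemberID columns of the submission and of Target.csv.

```r
# TRUE only if the submission lists exactly the expected MemberIDs,
# in exactly the expected order (hypothetical helper, not a Kaggle function).
same_order <- function(submitted_ids, expected_ids) {
  length(submitted_ids) == length(expected_ids) &&
    isTRUE(all(submitted_ids == expected_ids))
}

expected <- c(30L, 10L, 20L)                          # toy stand-in for the Target.csv order
ok_in_order <- same_order(c(30L, 10L, 20L), expected) # same IDs, same order
ok_sorted   <- same_order(c(10L, 20L, 30L), expected) # same IDs, sorted by ID
```

Wrapping the comparison in `isTRUE` means a missing ID (NA) counts as a mismatch instead of propagating NA.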
 
MightyMidwest
Rank 59th
Posts 1
Joined 26 Jul '11

I agree about updating the description on the website; I just submitted my predictions sorted by MemberID, and while it's not a big deal, it's a pain to have to wait another day to see how my model did.

 
kozz
Posts 1
Joined 23 May '11

> effectively ignores whatever you put in the first two columns

Jeff, I'm sorry, but is it 2011 now, not 1981? Why is it done this way? Are you lacking the computing resources to sort 70k rows? If you like, I can provide an efficient algorithm to match submissions regardless of row order.

As far as I can see, no official documentation mentions this limitation.

This usually costs a newcomer a day of depression: they see that their first results are completely wrong and then find the answer in an obscure forum thread.

 

No checks, no error messages, no documentation.
Unexpectedly low quality in this aspect.

Or, is it some kind of IQ check for newcomers?

 
Jeff Moser
Kaggle Admin
Posts 347
Thanks 166
Joined 21 Aug '10
From Kaggle

Sorry, I know it's frustrating. I plan to address it this week (along with making the parser more forgiving). There were just a few things ahead of it in the queue. 

 
Dobson
Posts 4
Joined 7 Jul '11

Aligning one set of data to the order of the target file is a simple matter that can be done in just a few lines of code (at least in MATLAB). It runs in about 20 seconds.

I can post some sample code if anyone is interested. This is just gymnastics.

 
B Yang
Posts 120
Thanks 28
Joined 12 Nov '10

I see the current scoring code actually checks MemberIDs (and column order doesn't matter, among other "improvements"; personally, I prefer one format and one format only). But were the old submissions re-scored when sort-by-ID was implemented?

 