DavidChudzicki's image
DavidChudzicki
Kaggle Admin
Posts 423
Thanks 106
Joined 21 Nov '10 Email user
From Kaggle

The papers written by the milestone winners are now available as attachments to this forum post. As described in section 13 of the rules, if you have any concerns about these papers, you have 30 days from their posting to provide your feedback.

2 Attachments —
Thanked by linus , Scott Thompson , Yunwen Chen , and Vijay Ram
 
Scott Thompson's image Posts 3
Joined 26 Jan '11 Email user

Both the docx and pdf files for Edward / Willem's paper appear to be missing formulas and symbols, at least on every machine / software combination I tried.

 
Signipinnis's image Posts 94
Thanks 25
Joined 8 Apr '11 Email user

My browser also has a display problem with the Edward+Willem paper, but it works fine for me if I save it (edit: the docx version) to disk and open it with Word.

 

 
Scott Thompson's image Posts 3
Joined 26 Jan '11 Email user

Thanks Signipinnis. I have downloaded the file and can open it - there is text, but the formulas appear to be missing. For example, at the bottom of page 5 I have the sentence "Now we have obtained the following variables: " followed by ", , , , and", which suggests that something is missing. If someone could confirm that their copy shows more complete text, then the rendering problem is at my end. I've checked the docx xml code and there is only white space between the commas indicated above. The formulas do not appear in the pdf version either, again on multiple systems.

 
Haru's image Rank 39th
Posts 3
Thanks 2
Joined 1 Mar '12 Email user

Scott Thompson wrote:

I have downloaded the file and can open it - there is text, but the formulas appear to be missing. For example, at the bottom of page 5 I have the sentence "Now we have obtained the following variables: " followed by ", , , , and", which suggests that something is missing.

It looks like the pdf version is corrupted, but the docx is OK. Try the pdf in the attachment (converted from the docx).

1 Attachment —
Thanked by Scott Thompson , and Vikram Jha
 
Willem Mestrom's image Rank 4th
Posts 24
Thanks 9
Joined 28 Feb '11 Email user

Hi all, sorry for the trouble with the docx. Here is the correct pdf.

1 Attachment —
Thanked by Scott Thompson , and Vijay Ram
 
Scott Thompson's image Posts 3
Joined 26 Jan '11 Email user

Thank you Haru and Willem. The new pdfs are good.

 
Oleg Vasilyev's image Rank 6th
Posts 18
Thanks 1
Joined 4 Jun '11 Email user

Congratulations, Willem & Edward team, on your win.
Besides great modeling, you have solved the puzzle of the special paydelay value=0, which is quite a realistic puzzle: real-life data typically contains special values that combine several different cases, the meaning of which is long lost.

We have a question from our team regarding table 1 (Claims distribution of paydelay vs Dsfs) in your paper.

From the paper, on what is counted in the table: "Important is that only claims are used that belong to members of Year3, that have a maximum DSFS value of "11-12" (month), because we then know to which real month all the claims belong."

When we select the claims with DSFS=12 and Year=3, we get exactly the same counts as in the table: 16044 claims with paydelay=0, 152 claims with paydelay=1...10, etc. However, for any other DSFS, we get counts much larger than those in the table.
To be specific, let's consider DSFS="8-9" and paydelay="111...120". That cell in the table has count 0, but we get 9.
Member 35764059 at year=3 has 1 claim with DSFS="11-12", but in the same year=3 also has 3 claims with DSFS="8-9" and paydelay="111...120" (all 3 claims have paydelay=111). Member 47834891 at year=3 has a claim with DSFS="11-12", but also has 6 claims with DSFS="8-9" and paydelay="111...120" (these claims have paydelay=111, 112, 113, 114, 115 and 116). Thus, altogether we have 9 claims for DSFS="8-9" and paydelay="111...120".

Could you please let us know what we are missing here?
Thanks
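
For readers who want to repeat the count described above, here is a minimal sketch of one reading of the quoted selection rule: keep Year 3 members who have at least one claim with DSFS="11-12", then count all of their Year 3 claims by DSFS and paydelay bucket. The tuple layout, the function names and the toy records are assumptions made for illustration, not the competition file format or the paper's method, and the 10-day bucket labels simply follow the "1...10" and "111...120" labels used in this post.

```python
from collections import Counter

# Hypothetical toy records: (member_id, year, dsfs, paydelay_days). Not real data.
claims = [
    (1, 3, "11-12", 0),
    (1, 3, "8-9", 111),
    (1, 3, "8-9", 111),
    (2, 3, "8-9", 115),    # member 2 has no "11-12" claim in year 3
    (3, 2, "11-12", 5),    # wrong year
]

def paydelay_bucket(days):
    """Map a paydelay in days to a label such as '0', '1...10' or '111...120'."""
    if days == 0:
        return "0"
    low = (days - 1) // 10 * 10 + 1
    return f"{low}...{low + 9}"

def claims_table(claims, year=3, last_dsfs="11-12"):
    """Count claims per (dsfs, paydelay bucket) for the selected members."""
    # Members with at least one claim in `year` whose DSFS is `last_dsfs`.
    members = {m for m, y, d, _ in claims if y == year and d == last_dsfs}
    return Counter(
        (d, paydelay_bucket(p))
        for m, y, d, p in claims
        if y == year and m in members
    )

table = claims_table(claims)
print(table[("8-9", "111...120")])
```

If the paper applies an additional filter beyond this one, the counts for DSFS values other than "11-12" would shrink accordingly, which is the open question in this post.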

 
DaveC's image Posts 14
Thanks 3
Joined 16 Feb '11 Email user

Hi Team Crescendo. Congratulations on your milestone achievement.
I have a question about your 'post processing'. You work out the 'true average' values for the 4 categories. What scaling method did you then use? Did you use a multiplicative factor (eg * 1.1), or an additive factor (eg +0.1) to adjust your estimates? Or some other method?
Thanks
Dave Clark

 
shekhar gupta's image Rank 16th
Posts 1
Joined 30 Jul '12 Email user

Congrats to the Willem & Edward team on winning the milestone!

The modelling on DSFS looks interesting. Yeah, I agree with Oleg Vasilyev: the numbers do not match those in the paper. The paper also lacks detail on the simulations used to obtain the constants or interpolated values that are used to evaluate the offset values.

Could you please provide more detail on this?

Thanks

 
infty's image Rank 4th
Posts 7
Thanks 2
Joined 27 Mar '12 Email user
Hi Dave, 
We added a constant to match the average.
Rie from crescendo
Thanked by DaveC
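
To make the two options in Dave's question concrete, here is a minimal sketch of an additive shift (the kind of adjustment Rie describes) next to a multiplicative rescaling. The function names and the numbers are hypothetical, and the thread does not say on which scale or per which grouping the constant was applied, so this is only an illustration of the arithmetic.

```python
def shift_to_mean(predictions, target_mean):
    """Additive adjustment: add one constant so the mean equals target_mean."""
    offset = target_mean - sum(predictions) / len(predictions)
    return [p + offset for p in predictions]

def scale_to_mean(predictions, target_mean):
    """Multiplicative adjustment: multiply by one factor so the mean equals target_mean."""
    factor = target_mean / (sum(predictions) / len(predictions))
    return [p * factor for p in predictions]

# Hypothetical predictions for one category and a hypothetical 'true average'.
preds = [0.25, 0.5, 0.75]
shifted = shift_to_mean(preds, 0.625)
scaled = scale_to_mean(preds, 0.625)
print(shifted, scaled)
```

Both variants hit the target mean. The additive shift keeps the differences between predictions unchanged, while the multiplicative factor stretches them.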
 
infty's image Rank 4th
Posts 7
Thanks 2
Joined 27 Mar '12 Email user

I wonder why my reply above got a rectangle and colors, though.

 
Tim Salimans's image Rank 2nd
Posts 35
Thanks 14
Joined 25 Oct '10 Email user

Congratulations to the milestone winners! Willem & Edward, very impressive improvement yet again. Crescendo, impressive improvement on the private data compared to the public leaderboard!

The crescendo report is very well written, thanks for taking the effort. I do have a few remaining questions though:

  • What implementation of random forests did you use?

  • The data sets described contain an unusually large number of variables for use in a tree based model (e.g. 14700 providerID count variables). Is this correct? If so, I assume you use some kind of sparse data format to keep all of this in memory?

  • Could you please provide us with the weights of the individual runs in each of your 5 blended runs? Also it might be interesting to include the cross-validated rmse of each of these individual runs in the report.

Thanks in advance!

 
B Yang's image Rank 2nd
Posts 195
Thanks 46
Joined 12 Nov '10 Email user

A question for team Crescendo: how did you deal with missing values? For example, in feature set m1, what did you do for the numeric Age value if the member's age is missing?

 
infty's image Rank 4th
Posts 7
Thanks 2
Joined 27 Mar '12 Email user

Hi Tim,

Tim> What implementation of random forests did you use?

We used our own implementation.

Tim> The data sets described contain an unusually large number of variables for use in a tree based model (e.g. 14700 providerID count variables). Is this correct? If so, I assume you use some kind of sparse data format to keep all of this in memory?

Our code can handle sparse data efficiently.

Tim> Could you please provide us with the weights of the individual runs in each of your 5 blended runs? Also it might be interesting to include the cross-validated rmse of each of these individual runs in the report.

Thanks for the suggestions, but we chose not to provide those numbers, which are not needed for replicating the winning entry. In the course of replicating the results based on the current documentation, one would obtain those numbers as side products. It would just be too tedious to list, or even read, so many numbers. I hope you understand.

Rie
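
The thread does not describe Crescendo's sparse format. As a generic illustration of why thousands of providerID count variables need not be stored densely, here is a minimal sketch that keeps only the nonzero counts per member. The record layout, the identifiers and the function name are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (member_id, provider_id) pairs, one per claim. Not real data.
claims = [(1, "p7"), (1, "p7"), (1, "p9"), (2, "p3")]

def sparse_provider_counts(claims):
    """Build one sparse row per member: {provider_id: claim count}."""
    rows = defaultdict(Counter)
    for member, provider in claims:
        rows[member][provider] += 1
    return rows

rows = sparse_provider_counts(claims)
# Only providers a member actually used are stored; any other provider reads as 0.
print(rows[1]["p7"], rows[1]["p3"], len(rows[1]))
```

With this layout, memory grows with the number of distinct (member, provider) pairs rather than with members times providers.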

 