 DavidChudzicki Kaggle Admin Posts 418 Thanks 106 Joined 21 Nov '10 The papers written by the milestone winners are now available as attachments to this forum post. As described in section 13 of the rules, if you have any concerns about these papers, you have 30 days from their posting to provide your feedback. 2 Attachments. Thanked by linus, Scott Thompson, Yunwen Chen, and Vijay Ram #1 / Posted 7 months ago / Edited 7 months ago
 Posts 3 Joined 26 Jan '11 Both the docx and pdf files for Edward / Willem's paper appear to be missing formulas and symbols, at least on all the machine / software combinations I tried. #2 / Posted 7 months ago
 Posts 94 Thanks 25 Joined 8 Apr '11 My browser also has a display problem with the Edward+Willem paper, but it works fine for me if I save it (edit: the docx version) to disk and open it with Word. #3 / Posted 7 months ago
 Posts 3 Joined 26 Jan '11 Thanks Signipinnis. I have downloaded the file and can open it. There is text, but the formulas appear to be missing. For example, at the bottom of page 5 I have the sentence, "Now we have obtained the following variables: " and then ", , , , and", which suggests that something is missing. If someone could confirm that they see more complete text, then the problem is at my end with the rendering. I've checked the docx XML and there is only white space between the commas indicated above. The formulas do not appear in the pdf version either, again on multiple systems. #4 / Posted 7 months ago
 Rank 39th Posts 3 Thanks 2 Joined 1 Mar '12 Scott Thompson wrote: I have downloaded the file and can open it. There is text, but the formulas appear to be missing. For example, at the bottom of page 5 I have the sentence, "Now we have obtained the following variables: " and then ", , , , and", which suggests that something is missing. It looks like the pdf version is corrupted, but the docx is OK. Try the pdf in the attachment (converted from the docx). 1 Attachment. Thanked by Scott Thompson and Vikram Jha #5 / Posted 7 months ago
 Rank 4th Posts 24 Thanks 9 Joined 28 Feb '11 Hi all, sorry for the trouble with the docx. Here is the correct pdf. 1 Attachment. Thanked by Scott Thompson and Vijay Ram #6 / Posted 7 months ago
 Posts 3 Joined 26 Jan '11 Thank you Haru and Willem. The new pdfs are good. #7 / Posted 7 months ago
 Rank 6th Posts 18 Thanks 1 Joined 4 Jun '11 Congratulations, Willem & Edward team, on your win. Besides great modeling, you have solved the puzzle of the special value paydelay=0, which is quite a realistic puzzle: real-life data often contains special values that combine different cases whose meaning is long lost. Our team has a question about table 1 (Claims distribution of paydelay vs DSFS) in your paper. The paper says this about what is counted in the table: "Important is that only claims are used that belong to members of Year3, that have a maximum DSFS value of "11-12" (month), because we then know to which real month all the claims belong." When we select the claims with DSFS="11-12" and Year=3, we get exactly the same counts as in the table: 16044 claims with paydelay=0, 152 claims with paydelay=1...10, etc. However, for any other DSFS, we get counts much larger than in the table. To be specific, consider DSFS="8-9" and paydelay="111...120". That cell in the table has count 0, but we get 9. Member 35764059 has 1 claim with DSFS="11-12" in year 3, but in the same year also has 3 claims with DSFS="8-9" and paydelay="111...120" (all 3 claims have paydelay=111). Member 47834891 has a claim with DSFS="11-12" in year 3, but also has 6 claims with DSFS="8-9" and paydelay="111...120" (these claims have paydelay=111, 112, 113, 114, 115 and 116). Thus, altogether we have 9 claims for DSFS="8-9" and paydelay="111...120". Could you please let us know what we are missing here? Thanks #8 / Posted 7 months ago
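The count described in the post above can be sketched in a few lines of Python. This is only an illustration of the selection as the post words it (year-3 claims of members who have at least one "11-12" claim in that year), not the procedure used in the paper. The record layout `(member_id, year, dsfs, paydelay)`, the function names, and the 10-day binning helper are assumptions made for this sketch. The two member IDs and their "8-9" paydelay values come from the post; the paydelay of each "11-12" claim is not given there, so a placeholder of 0 is used.

```python
from collections import Counter

def paydelay_bin(paydelay):
    """Label a paydelay with its 10-day bin: 0 -> "0", 1..10 -> "1...10", and so on."""
    if paydelay == 0:
        return "0"
    low = ((paydelay - 1) // 10) * 10 + 1
    return f"{low}...{low + 9}"

def dsfs_paydelay_table(claims, year=3, last_dsfs="11-12"):
    """Count claims per (DSFS, paydelay bin) for members with a last_dsfs claim in `year`."""
    in_year = [c for c in claims if c[1] == year]
    # Members whose claims in this year reach the last DSFS interval.
    eligible = {member for member, _, dsfs, _ in in_year if dsfs == last_dsfs}
    table = Counter()
    for member, _, dsfs, paydelay in in_year:
        if member in eligible:
            table[(dsfs, paydelay_bin(paydelay))] += 1
    return table

# Claims quoted in the post; paydelay 0 on the "11-12" claims is a placeholder.
claims = [(35764059, 3, "11-12", 0)]
claims += [(35764059, 3, "8-9", 111)] * 3
claims += [(47834891, 3, "11-12", 0)]
claims += [(47834891, 3, "8-9", p) for p in (111, 112, 113, 114, 115, 116)]

table = dsfs_paydelay_table(claims)
print(table[("8-9", "111...120")])
```

On these nine claims the sketch reproduces the count of 9 reported in the post, so the gap with the paper's table presumably comes from an extra selection step that the quoted sentence does not spell out.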
 Posts 14 Thanks 3 Joined 16 Feb '11 Hi Team Crescendo. Congratulations on your milestone achievement. I have a question about your 'post processing'. You work out the 'true average' values for the 4 categories. What scaling method did you then use? Did you use a multiplicative factor (e.g. * 1.1), or an additive factor (e.g. +0.1) to adjust your estimates? Or some other method? Thanks Dave Clark #9 / Posted 7 months ago
 Rank 16th Posts 1 Joined 30 Jul '12 Congrats, Willem & Edward team, on winning the milestone. The modelling on DSFS looks interesting. I agree with Oleg Vasilyev that the numbers do not match the paper. The paper also lacks detail on the simulations used to obtain the constants or interpolated values used to evaluate the offset values. Could you please provide more detail on this? Thanks #10 / Posted 7 months ago
 Rank 4th Posts 7 Thanks 2 Joined 27 Mar '12 Hi Dave, We added a constant to match the average. Rie from Crescendo. Thanked by DaveC #11 / Posted 7 months ago
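For readers unsure what "added a constant to match the average" amounts to, here is a minimal Python sketch. It assumes the adjustment is a single additive offset per category, chosen so that the mean prediction equals a known target mean. The function name, the category labels, and all numbers are made up for illustration; the reply does not say whether the offset was applied to raw or log-transformed predictions.

```python
def shift_to_mean(predictions, target_mean):
    """Add one constant to every prediction so that their mean equals target_mean."""
    current_mean = sum(predictions) / len(predictions)
    offset = target_mean - current_mean
    return [p + offset for p in predictions]

# Hypothetical predictions and target averages for two categories.
predictions_by_category = {
    "category_a": [0.10, 0.30, 0.50],
    "category_b": [0.20, 0.20, 0.80],
}
target_mean_by_category = {"category_a": 0.35, "category_b": 0.30}

adjusted = {
    name: shift_to_mean(values, target_mean_by_category[name])
    for name, values in predictions_by_category.items()
}
```

An additive shift keeps the differences between predictions unchanged, whereas a multiplicative factor would stretch or shrink them, which is the distinction Dave's question draws.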
 Rank 4th Posts 7 Thanks 2 Joined 27 Mar '12 I wonder why my reply above got a rectangle and colors, though. #12 / Posted 7 months ago
 Rank 2nd Posts 35 Thanks 14 Joined 25 Oct '10 Congratulations to the milestone winners! Willem & Edward, very impressive improvement yet again. Crescendo, impressive improvement on the private data compared to the public leaderboard! The Crescendo report is very well written, thanks for making the effort. I do have a few remaining questions though: What implementation of random forests did you use? The data sets described contain an unusually large number of variables for use in a tree-based model (e.g. 14700 providerID count variables). Is this correct? If so, I assume you use some kind of sparse data format to keep all of this in memory? Could you please provide us with the weights of the individual runs in each of your 5 blended runs? Also, it might be interesting to include the cross-validated RMSE of each of these individual runs in the report. Thanks in advance! #13 / Posted 7 months ago
 Rank 2nd Posts 195 Thanks 46 Joined 12 Nov '10 A question for team Crescendo: how did you deal with missing values? For example, in feature set m1, what did you do for the numeric Age value if the member's age is missing? #14 / Posted 7 months ago
 Rank 4th Posts 7 Thanks 2 Joined 27 Mar '12 Hi Tim, Tim> What implementation of random forests did you use? We used our own implementation. Tim> The data sets described contain an unusually large number of variables for use in a tree-based model (e.g. 14700 providerID count variables). Is this correct? If so, I assume you use some kind of sparse data format to keep all of this in memory? Our code can handle sparse data efficiently. Tim> Could you please provide us with the weights of the individual runs in each of your 5 blended runs? Also, it might be interesting to include the cross-validated RMSE of each of these individual runs in the report. Thanks for the suggestions, but we choose not to provide those numbers, which are not needed for replicating the winning entry. In the course of replicating the results from the current documentation, one would obtain those numbers as by-products. It would be just too tedious to list or even read so many numbers. I hope you understand. Rie #15 / Posted 7 months ago
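The reply does not describe how Crescendo's code stores sparse data. As a generic illustration of why thousands of provider count variables need not be held as a dense matrix, the Python sketch below keeps only the nonzero counts for each member. The `(member_id, provider_id)` input layout, the function names, and the sample IDs are assumptions made for this example.

```python
from collections import Counter, defaultdict

def build_sparse_counts(claims):
    """Map member_id -> {provider_id: number of claims}, storing nonzero counts only."""
    rows = defaultdict(Counter)
    for member_id, provider_id in claims:
        rows[member_id][provider_id] += 1
    return rows

def feature_value(rows, member_id, provider_id):
    """Read one count variable; pairs that were never seen are implicitly zero."""
    return rows.get(member_id, {}).get(provider_id, 0)

# Made-up (member_id, provider_id) pairs, one per claim.
claims = [(1, "P17"), (1, "P17"), (1, "P42"), (2, "P42")]
rows = build_sparse_counts(claims)

print(feature_value(rows, 1, "P17"))  # stored count
print(feature_value(rows, 2, "P17"))  # not stored, read as zero
```

Memory use then grows with the number of distinct member/provider pairs that actually occur rather than with members times providers.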
