
The papers written by the milestone winners are now available as attachments to this forum post. As described in section 13 of the rules, if you have any concerns about these papers, you have 30 days from their posting to provide your feedback.

2 Attachments —

Both the docx and pdf files for Edward / Willem's paper appear to be missing formulas and symbols, at least on all the machine / software combinations I tried.

My browser also has a display problem with the Edward+Willem paper, but it works fine for me if I save it (edit: the docx version) to disk and open it with Word.

Thanks Signipinnis. I have downloaded and can open the file - there is text but formula appear to be missing. For example on the bottom of page 5 I have the sentence, "Now we have obtained the following variables: " and then ", , , , and" which suggest that something is missing. If someone could confirm that their copy shows more complete text, then the problem is at my end with the rendering. I've checked the docx XML and there is only white space between the commas indicated above. The formulas do not appear in the pdf version either, again on multiple systems.

Scott Thompson wrote:

I have downloaded and can open the file - there is text but formula appear to be missing. For example on the bottom of page 5 I have the sentence, "Now we have obtained the following variables: " and then ", , , , and" which suggest that something is missing.

It looks like the pdf version is corrupted, but the docx is OK. Try the pdf in the attachment (converted from the docx).

1 Attachment —

Hi all, sorry for the trouble with the docx. Here is the correct pdf.

1 Attachment —

Thank you Haru and Willem. The new pdfs are good.

Congratulations to the Willem & Edward team on your win.
Besides great modeling, you have solved the puzzle of the special value paydelay=0, which is quite a realistic puzzle: real-life data typically contains special values that combine several different cases, the meaning of which is long lost.

We have a question from our team regarding table 1 (Claims distribution of paydelay vs Dsfs) in your paper.

From the paper, on what is counted in the table: "Important is that only claims are used that belong to members of Year3, that have a maximum DSFS value of "11-12" (month), because we then know to which real month all the claims belong."

When we select the claims with DSFS="11-12" and Year=3, we get exactly the same counts as in the table: 16044 claims with paydelay=0, 152 claims with paydelay=1...10, etc. However, for any other DSFS, we get counts much larger than in the table.
To be specific, consider DSFS="8-9" and paydelay="111...120". That cell in the table has count 0, but we get 9.
Member 35764059 in year=3 has 1 claim with DSFS="11-12", but in the same year=3 also has 3 claims with DSFS="8-9" and paydelay="111...120" (all 3 claims have paydelay=111). Member 47834891 in year=3 has a claim with DSFS="11-12", but also has 6 claims with DSFS="8-9" and paydelay="111...120" (these claims have paydelay=111, 112, 113, 114, 115 and 116). Thus, altogether we have 9 claims for DSFS="8-9" and paydelay="111...120".
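The count above can be reproduced with a small sketch. It assumes each claim is a (member, year, DSFS, paydelay) tuple and uses only the two members named in this post as toy data; the paydelay of their DSFS="11-12" claims is not stated here, so 0 is a placeholder. The function names are made up for illustration, and the buckets are plain 10-wide ranges, so the last two rows of the paper's table (151..161 and 162) are not special-cased.

```python
from collections import Counter

# Toy claims: (member_id, year, dsfs, paydelay). Only the two members named
# in the post are included; the paydelay of their "11-12" claims is a placeholder.
claims = [
    (35764059, 3, "11-12", 0),
    (35764059, 3, "8-9", 111),
    (35764059, 3, "8-9", 111),
    (35764059, 3, "8-9", 111),
    (47834891, 3, "11-12", 0),
] + [(47834891, 3, "8-9", d) for d in range(111, 117)]


def dsfs_upper(dsfs):
    """Upper month of a DSFS label, e.g. "8-9" -> 9."""
    return int(dsfs.split("-")[1])


def paydelay_bucket(paydelay):
    """Map a paydelay to a 10-wide bucket label, e.g. 111 -> "111..120"."""
    if paydelay == 0:
        return "0"
    low = (paydelay - 1) // 10 * 10 + 1
    return f"{low}..{low + 9}"


def count_table(claims, year=3, max_dsfs="11-12"):
    """Count claims per (paydelay bucket, DSFS) for members whose
    maximum DSFS in the given year equals max_dsfs."""
    max_month = {}
    for member, y, dsfs, _ in claims:
        if y == year:
            max_month[member] = max(max_month.get(member, 0), dsfs_upper(dsfs))
    keep = {m for m, month in max_month.items() if month == dsfs_upper(max_dsfs)}

    counts = Counter()
    for member, y, dsfs, paydelay in claims:
        if y == year and member in keep:
            counts[(paydelay_bucket(paydelay), dsfs)] += 1
    return counts


counts = count_table(claims)
print(counts[("111..120", "8-9")])  # 9
```

Both members qualify because their maximum DSFS in year 3 is "11-12", so all of their year-3 claims are counted, including the nine at DSFS="8-9".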

Could you please let us know what do we miss here?
Thanks

Hi Team Crescendo. Congratulations on your milestone achievement.
I have a question about your 'post processing'. You work out the 'true average' values for the 4 categories. What scaling method did you then use? Did you use a multiplicative factor (eg * 1.1), or an additive factor (eg +0.1) to adjust your estimates? Or some other method?
Thanks
Dave Clark

Congrats to the Willem & Edward team on winning the milestone!

The modelling on DSFS looks interesting. I agree with Oleg Vasilyev that the numbers do not match the paper. The paper also lacks detail on the simulations used to obtain the constants or interpolated values used to evaluate the offset values.

Could you please provide more detail on this?

Thanks

Hi Dave, 
We added a constant to match the average.
Rie from crescendo
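For readers comparing the two options Dave asked about, here is a minimal sketch of an additive shift next to a multiplicative rescaling. The function names and the toy numbers are invented for illustration; the thread does not say in which domain (raw or log-transformed predictions) or exactly how the constant was applied per category, so this only handles a single group of predictions.

```python
def shift_to_mean(predictions, target_mean):
    """Add one constant to every prediction so that their mean equals target_mean."""
    offset = target_mean - sum(predictions) / len(predictions)
    return [p + offset for p in predictions]


def scale_to_mean(predictions, target_mean):
    """Multiplicative alternative, shown only for contrast with the additive shift."""
    factor = target_mean / (sum(predictions) / len(predictions))
    return [p * factor for p in predictions]


preds = [0.2, 0.4, 0.9]           # toy predictions, mean 0.5
shifted = shift_to_mean(preds, 0.6)
scaled = scale_to_mean(preds, 0.6)
print(shifted)
print(scaled)
```

The additive version keeps the differences between predictions unchanged, while the multiplicative version stretches them in proportion to their size.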

I wonder why my reply above got a rectangle and colors, though.

Congratulations to the milestone winners! Willem & Edward, very impressive improvement yet again. Crescendo, impressive improvement on the private data compared to the public leaderboard!

The crescendo report is very well written, thanks for taking the effort. I do have a few remaining questions though:

  • What implementation of random forests did you use?

  • The data sets described contain an unusually large number of variables for use in a tree based model (e.g. 14700 providerID count variables) Is this correct? If so, I assume you use some kind of sparse data format to keep all of this in memory?

  • Could you please provide us with the weights of the individual runs in each of your 5 blended runs? Also it might be interesting to include the cross-validated rmse of each of these individual runs in the report.

Thanks in advance!

A question for team Crescendo: how did you deal with missing values ? For example, in feature set m1, what did you do for numeric Age value if the member's age is missing ?

Hi Tim,

Tim> What implementation of random forests did you use?

We used our own implementation.

Tim> The data sets described contain an unusually large number of variables for use in a tree based model (e.g. 14700 providerID count variables) Is this correct? If so, I assume you use some kind of sparse data format to keep all of this in memory?

Our code can handle sparse data efficiently.
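As a rough illustration of what a sparse format for count features can look like (this is not the team's code, which is not shown in this thread), the sketch below stores only nonzero counts per member. The member and provider identifiers and the function names are hypothetical.

```python
from collections import defaultdict


def build_sparse_counts(claims):
    """Return {member: {provider: count}}, storing only nonzero counts."""
    rows = defaultdict(lambda: defaultdict(int))
    for member, provider in claims:
        rows[member][provider] += 1
    return {m: dict(cols) for m, cols in rows.items()}


def column(rows, provider):
    """Nonzero entries of one feature column; members not listed are implicitly 0."""
    return {m: cols[provider] for m, cols in rows.items() if provider in cols}


# Toy (member, provider) pairs, one per claim.
claims = [("m1", "p7"), ("m1", "p7"), ("m2", "p3"), ("m3", "p7")]
rows = build_sparse_counts(claims)
print(rows)
print(column(rows, "p7"))
```

With thousands of provider columns and only a handful of providers per member, memory use grows with the number of nonzero entries rather than with members times columns.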

Tim> Could you please provide us with the weights of the individual runs in each of your 5 blended runs? Also it might be interesting to include the cross-validated rmse of each of these individual runs in the report.

Thanks for the suggestions, but we choose not to provide those numbers, which are not needed for replicating the winning entry. In the course of replicating the results based on the current documentation, one would obtain those numbers as by-products. It would be too tedious to list or even read so many numbers. I hope you understand.

Rie

B Yang> how did you deal with missing values ? For example, in feature set m1, what did you do for numeric Age value if the member's age is missing ?

Please see the third bullet under "Notation" in A.2 "Features derived from Claim data" for how missing values in Claim data were treated. As for "numeric Age", the documentation refers to [4] for conversion of categorical values to numerical values; however, I just realized that we assigned -1 for missing age, whereas [4] assigned 80.
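To make the difference concrete, here is a minimal sketch of a categorical-to-numeric age conversion with a configurable value for missing ages. The band labels ("40-49", "80+") and the lower-bound rule are assumptions for illustration only; the actual conversion is the one described in [4], which is not reproduced in this thread. Only the two missing-value choices (-1 and 80) come from the reply above.

```python
def age_to_number(label, missing_value=-1):
    """Convert an age-band label to a number; None or an empty string means missing."""
    if label is None or label.strip() == "":
        return missing_value
    # Hypothetical rule: use the lower bound of the band ("40-49" -> 40, "80+" -> 80).
    return int(label.rstrip("+").split("-")[0])


print(age_to_number("40-49"))               # 40
print(age_to_number(""))                    # -1
print(age_to_number("", missing_value=80))  # 80
```

Note that with a missing value of 80, a member with unknown age becomes indistinguishable from one in the highest band under this hypothetical rule, whereas -1 stays outside the range of valid ages.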

infty wrote:
Please see the third bullet under "Notation" in A.2 "Features derived from Claim data" for how missing values in Claim data were treated. As for "numeric Age", the documentation refers to [4] for conversion of categorical values to numerical values; however, I just realized that we assigned -1 for missing age, whereas [4] assigned 80.

Thanks for the quick reply. Do your algorithms handle missing values or just use them as is ? For the numeric Age case, do they treat -1 as 'age missing' and do something special, or do they treat it same as other valid age values ?

How about values like avgs and SDs of various fields, for example LengthOfStay. For members who have at least one LengthOfStay value, do you just calculate min/max/avg/whatever while ignoring the missing values ? What about members who have all LengthOfStay values missing ?

Oleg wrote:
"Could you please let us know what do we miss here?"

Good observation about table 1.
Table 1 was generated in order to clarify the paydelay distribution.
The routine used to generate that table contained an error which resulted in incorrect counts.
The correct table is presented below.
Sorry for the confusion, and thanks for finding and reporting this fault.

DSFS:        0-1    1-2    2-3    3-4    4-5    5-6    6-7    7-8    8-9   9-10  10-11  11-12
month:       Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
paydelay:
0....0       898    518    525    561    532    708    769    932   1666   2599   6907  16044
1...10       688    484    564    527    371    145     66    103     87    139    250    152
11...20     4978   3457   2810   2510   1569   1267   1196   1371   3991   4531   3692   1258
21...30     8391   4729   5255   5426   5668   5709   5597   6280   5626   4401   4294    327
31...40     3984   2433   1995   2009   1672   2282   2129   2336   1987   2041   1515      0
41...50     3016   1618   1314   1254   1041   1617   1170   1395   1173   1443    615      0
51...60     2186    770    864    602    692   1160    978    822    689    904     64      0
61...70      891    657    411    347    509    602    983    560    423    622      0      0
71...80      454    533    330    321    661    449    528    350    564    290      0      0
81...90      328    328    356    391    329    277    219    174    481     35      0      0
91..100      408    240    218    352    211    219    143    184    164      0      0      0
101..110     478    166    194    264    117    121     76    126     29      0      0      0
111..120     219    105    138    191    127    110     59     44      9      0      0      0
121..130      97     42    106    107     43     70     28     38      0      0      0      0
131..140      66     50     61     43     24     51     74     61      0      0      0      0
141..150      63     53     89     24     25     35     25      0      0      0      0      0
151..161      41     42     39     46     27     24     25      0      0      0      0      0
162..162     517    263    204    225    121     76      9      0      0      0      0      0


In the milestone 3 report, the text where month(6-7) is stated should now be changed to month(7-8).
We will update this in the milestone 3 report.

Edward.

B Yang> Do your algorithms handle missing values or just use them as is ? For the numeric Age case, do they treat -1 as 'age missing' and do something special, or do they treat it same as other valid age values ? How about values like avgs and SDs of various fields, for example LengthOfStay. For members who have at least one LengthOfStay value, do you just calculate min/max/avg/whatever while ignoring the missing values ? What about members who have all LengthOfStay values missing ?

In our system, the algorithms (RGF, GBDT, random forests, etc.) do not handle missing values. You can't add or compare missing values, so yes, missing values were ignored in (i.e., excluded from) the computation of min/max/avg etc. For the members with all the values missing, min/max/avg etc. were set to zero except for "range", which was set to -1. I hope that this answers all of your questions.
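The rule described above can be sketched as follows. This is an illustration rather than the team's code: `None` stands in for a missing value, the function name is made up, and "range" is assumed to mean max minus min.

```python
def summarize(values):
    """min/max/avg/range over the non-missing values; None marks a missing value."""
    present = [v for v in values if v is not None]
    if not present:
        # All values missing: everything is zero except "range", which is -1.
        return {"min": 0, "max": 0, "avg": 0, "range": -1}
    lo, hi = min(present), max(present)
    return {
        "min": lo,
        "max": hi,
        "avg": sum(present) / len(present),
        "range": hi - lo,  # assumption: range = max - min
    }


print(summarize([2, None, 4]))   # missing value is excluded
print(summarize([None, None]))   # all missing
```

In this sketch a member with a single valid value gets a range of 0, so the -1 keeps the "all missing" case distinct from it.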

Hi Rie,
I'm trying to use the RGF code you linked to in your paper. When I try to compile the code by running the makefile on OS X, I get the following error:


Dans-Laptop:rgf1.2 dan2$ make
/bin/rm -f bin/rgf
g++ src/tet/driv_rgf.cpp src/com/AzDmat.cpp src/tet/AzFindSplit.cpp src/com/AzIntPool.cpp src/com/AzLoss.cpp src/tet/AzOptOnTree_TreeReg.cpp src/tet/AzOptOnTree.cpp src/com/AzParam.cpp src/tet/AzReg_Tsrbase.cpp src/tet/AzReg_TsrOpt.cpp src/tet/AzReg_TsrSib.cpp src/tet/AzRgf_FindSplit_Dflt.cpp src/tet/AzRgf_FindSplit_TreeReg.cpp src/tet/AzRgf_Optimizer_Dflt.cpp src/tet/AzRgforest.cpp src/tet/AzRgfTree.cpp src/com/AzSmat.cpp src/tet/AzSortedFeat.cpp src/com/AzStrPool.cpp src/com/AzSvDataS.cpp src/com/AzTaskTools.cpp src/tet/AzTETmain.cpp src/tet/AzTETproc.cpp src/com/AzTools.cpp src/tet/AzTree.cpp src/tet/AzTreeEnsemble.cpp src/tet/AzTrTree.cpp src/tet/AzTrTreeFeat.cpp src/com/AzUtil.cpp -Isrc/com -Isrc/tet_tools -O2 -o bin/rgf
src/tet/AzTree.cpp: In member function ‘void AzTree::finfo(AzIFarr*, AzIFarr*) const’:
src/tet/AzTree.cpp:232: error: call of overloaded ‘abs(double&)’ is ambiguous
/usr/include/stdlib.h:146: note: candidates are: int abs(int)
/usr/include/c++/4.2.1/cstdlib:174: note: long long int __gnu_cxx::abs(long long int)
/usr/include/c++/4.2.1/cstdlib:143: note: long int std::abs(long int)
make: *** [all] Error 1


Hopefully it's just a matter of changing the type declarations, but I'm years out of practice in C++.
Thanks for any help,
Dan
