# Clarifying Rule #13 for Milestone 1 - Open Questions & Issues

 Posts 18 Thanks 8 Joined 17 Jun '11
@Anthony There have been several concerns raised in the forum about the impact and interpretation of Rule 13 on the contest, which states that conditional milestone winners must disclose their "Prediction Algorithm and documentation" to the website for competitor review and commentary. In particular, there are unanswered questions with regard to inconsistencies and/or potentially unfair advantages arising from this rule. Can you comment on the following specific items so the community has firm, consistent and realistic expectations as we approach the Milestone 1 date?

1. Is it inconsistent, as Sali Mali pointed out in another thread, to require that documentation of the winning algorithms be publicly disclosed to all competitors, given Rule 20, Entrant Representations? It seems that this disclosure will encourage other competitors to use aspects of the winning Prediction Algorithm in ways that violate, directly or otherwise, (i) - (iii) and possibly (iv) of that Rule.

2. Can you clarify that code, libraries and software specifications are *not* required to be publicly disclosed to competitors? These materials and intellectual property appear to be referenced separately from "Prediction Algorithm and documentation."

3. Will Kaggle or Heritage have a moderation or appeals process for handling competitor complaints? From the winning entrant's point of view, they would not want the review process to force back-door answers about code and libraries that accelerate a competitor's integration of the winning solution.

4. Can you comment on the spirit and fairness of the public disclosure of the Prediction Algorithm documentation and its impact on competitiveness? In particular, if the documentation truly meets the requirement of enabling a skilled computer science practitioner to reproduce the winning result, then this places the winning team at an unfair disadvantage: all competitors will have access to their algorithms and research, in addition to the winning algorithm.

5. Can you provide more detailed clarification on the level of documentation required of conditional milestone winners? The guideline provided by the rules could cover a range of detail spanning from "lecture notes" to "detailed tutorial" to "whitepaper" to "conference paper", etc.

6. Can you comment on the reproducibility requirement? For example, it is possible to construct algorithms with stochastic elements that may not be precisely reproducible, even using the same random seed -- is it sufficient for these algorithms to reproduce the submission approximately? What if they don't reproduce exactly, or reproduce at a prediction accuracy that is worse than the submission score, possibly worse than other competitor submissions?

Thanks, Andy

Thanked by Bobby, and Anthony Goldbloom (Kaggle) #1 / Posted 21 months ago
 Posts 94 Thanks 25 Joined 8 Apr '11
reply deleted by author - I'm not Anthony. #2 / Posted 21 months ago
 Posts 5 Joined 14 Dec '10
I'm very interested in the answers to these questions as well. The answers will be make-or-break for a lot of contestants. #3 / Posted 21 months ago
 Posts 18 Thanks 8 Joined 17 Jun '11
@Signipinnis I don't mean this thread to be exclusionary -- sorry if it came across this way. I addressed Anthony because I specifically want to get Kaggle's official comments on these items in addition to any other replies. Please feel free to share your reply. #4 / Posted 21 months ago
 Posts 83 Thanks 50 Joined 1 Jul '10
I believe some of these points were addressed in an early post by Jeremy Howard (at Kaggle): “Only the paper describing the algorithm will be posted publicly. The paper must fully describe the algorithm. If other competitors find that it's missing key information, or doesn't behave as advertised, then they can appeal. The idea of course is that progress prize winners will fully share the results they've used to that point, so that all competitors can benefit for the remainder of the comp, and so that the overall outcome for health care is improved.”

Also, I think you won’t be forced to share your results, even if you’re in the #1 position – but then again, you won’t be able to claim the $30,000 or $20,000 either, unfortunately. Those are the rules, and it certainly does create a dilemma for top competitors. Whether or not this structure is "fair" I think might be a question for philosophers. As a practical matter, it will spur innovation as people build off of others' ideas, trying to stay competitive.

Also, note that there have been some great disclosures already in the forums (some with code!) posted by top competitors (Chris R in particular) which have already helped others.

Next, I should point out that the Netflix Prize had the same type of milestone prize structure & disclosure requirement. One team -- Team BellKor -- won milestone / 'progress' prizes, disclosed their methods along the way, and was still able to be part of the team that won the $1MM Grand Prize. Yes, other people built on the techniques they disclosed (but then again, BellKor's approach built on techniques that other teams had disclosed...). My point is that in at least that case, it was possible for the leaders to disclose their methods & still remain competitive.

About the level of detail required: my opinion is that I would hope the detail would strive to match the standards set by the Netflix Prize's "Progress Prize" papers.
See the solution papers referenced in these posts: [ EDIT, to address ChrisR's point below ] There's a lot of 'fancy' math in these papers, but I don't want to imply that that's necessary. In fact, too many equations can hinder understanding, and clear text or pseudocode might be better at times. My point is that these documents do not try to gloss over any details or hide critical parameters in footnotes, etc. [ /EDIT ]

Finally, just to be clear, much of the above is my own opinion (as a humble Kaggle competitor), not to be confused with any 'official' response to your questions.

Thanked by Bobby, Zach, and Anthony Goldbloom (Kaggle) #5 / Posted 21 months ago / Edited 21 months ago

 Posts 5 Joined 14 Dec '10
"The idea of course is that progress prize winners will fully share the results they've used to that point, so that all competitors can benefit for the remainder of the comp, and so that the overall outcome for health care is improved.”

Unacceptable. This is a contest, not a group collaboration.

"I think you won’t be forced to share your results, even if you’re in the #1 position – but then again, you won’t be able to claim the $30,000 or $20,000 either, unfortunately."

That is very unfortunate, and hopefully not true (I still hope a moderator will step in and inform us of the level of detail required). My only motive for this competition is the money, not to help others win money.

"Whether or not this structure is "fair" I think might be a question for philosophers --- but as a practical matter, it will spur innovation as people build off of others' ideas."

It's not fair to anyone, and copy-cats stand to benefit. The idea I'm implementing has taken me my ENTIRE LIFE of research to get to. There's not a chance in hell I would willingly give it away for others to get a free shortcut/cheat. I'm standing by to hear the official response before I even make my submission to the leaderboard.
#6 / Posted 21 months ago

 Posts 94 Thanks 25 Joined 8 Apr '11
Bobby wrote: Unacceptable. This is a contest not a group collaboration.

Not really; this is a hybrid model of a crowd-sourced search for a problem solution. There are two preliminary phases, incentivized by cash awards specifically for collaboration. Then there's the gold rush for the best ultimate solution, arising from the previously shared benchmark/algorithm/methodology.

Bobby wrote: The idea I'm implementing has taken me my ENTIRE LIFE of research to get to. There's not a chance in hell I would willingly give it away for others to get a free shortcut/cheat. I'm standing by to hear the official response before I even make my submission to the leader board.

So wait until the 3rd phase starts. The way I see it, there are (likely) a number of people here with a proprietary approach/tool that they think will absolutely, unquestionably, blow the doors off everyone else. And needless to say, if one has that kind of a competitive edge (esp. if based on one's own intellectual property developed from years of work), one would be extremely reluctant to give it up for a few pieces of silver. But here's the thing: many may THINK they exclusively have an unbeatable super-algorithm, but by definition, when all is said and done, only one of them can be the Bob Beamon of this contest. And there are a LOT of excellent data miners, using the best extant tools and a lot of time & ingenuity, working the solution space. Think of it as a genius ensemble, with a huge amount of available computational time. Odds are very good that a hard-working data miner or health care analyst using existing tools will ultimately barely edge out another hard-working data miner. But the easy cure for anyone with "I have proprietary secrets that are worth more than $x0,000" sentiments is to simply sandbag or wait on the sidelines until Phase 3 starts. Then take the Big Prize if you are able.
Thanked by Christopher Hefele , Zach , and Anthony Goldbloom (Kaggle) #7 / Posted 21 months ago
 Rank 38th Posts 194 Thanks 90 Joined 9 Jul '10
About the level of detail required: my opinion is that I would hope that the detail would match the standards set by the Netflix Prize's "Progress Prize" papers. See the solution papers referenced in these posts:

I love the Netflix papers. If I am lucky or skilled enough to win a prize, I would do my best to describe it in as much detail as possible, with a couple of things in mind:

1) Math - don't know it, can't read it, can't write it. If people want to see equations, you might be out of luck.

2) Statistics - don't know it, can't read it, can't write it. I have no clue half the time what people are talking about - I have my own methods for figuring stuff out. I am only somewhat exaggerating here.

3) R - my code sucks, but I would give details on all packages used and any non-default settings used as well. I like the way the Netflix papers are written - and it should be a lot shorter without all those equations. And of course all features used.

4) I don't know why people are so overprotective. Most of the stuff I have is 75% similar to the features Dan posted (I thought he was on the way out :) ). Dan is beating me now, but most of that stuff was stuff intelligent people could think of. If you really have a kick-ass algo, take your second-best kick-ass algo and use that instead. There are tons of people here - including people in this very thread who finished #2 (tied for first score-wise) in the Netflix competition. Do you really think they aren't going to think of something along the same lines? If you can't think of a second kick-ass algo - well, then your first one probably sucks too - you just don't know it yet :)

5) IMHO, it IS a collaboration - I plan on going to the Strata conference either way - hopefully I can meet a few more of you. Have met two of you already. I am more than willing to talk shop with other people and trade ideas (of course not any SECRET stuff).
I have learned a lot from this forum - from R packages I have never heard of, to better ways of doing things. The things I have learned at other conferences weren't necessarily related to what I was looking for (these aren't ML conferences, but SEO-related), but something someone would say would spark an idea about something else.

6) In the future, I think other competitions might want to consider going with a "second price auction" type model. In Google, when you bid on advertising, you pay the price of the person UNDER you. This encourages people to bid their true value (and according to economists works out best in a GT/NE type way). Using the same method, a person winning a progress prize could be required to produce an algo that is equal to or better than that of the person below them. Obviously doesn't work in this case, but for other competitions in the future, maybe other people would like the idea. This would allow people to feel more comfortable including their best stuff...

...Kind of droned on there - one last thing:

Documentation must be written in English and must be written so that individuals trained in computer science can replicate the winning results.

from the rules. I hope people aren't going to try and beat a dead horse with the two teams that win. In my mind, if other people can confirm they are able to reproduce the results, then that settles it for me. Hopefully I will be able to duplicate it as well, but IMHO it is not their job to get everyone's code working.

Thanked by Christopher Hefele, Zach, and Anthony Goldbloom (Kaggle) #8 / Posted 21 months ago
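The "second price auction" mechanism described in point 6 above can be sketched in a few lines. This is just an illustration of the standard Vickrey/second-price rule (the `bids` data and function name are hypothetical, not anything from the HHP rules):

```python
def second_price_auction(bids):
    """Vickrey-style second-price rule: the highest bidder wins,
    but pays only the bid of the person UNDER them."""
    # Rank bidders from highest to lowest bid.
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price_paid = ranked[1][1]  # second-highest bid sets the price
    return winner, price_paid

# Hypothetical bidders: the winner bids 10 but pays only 7.
winner, price = second_price_auction({"alice": 10, "bob": 7, "carol": 3})
assert (winner, price) == ("alice", 7)
```

The point of the mechanism is that shading your bid below your true value can only cost you the win, never lower your price, which is why (as noted above) economists consider truthful bidding the dominant strategy.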
 Rank 2nd Posts 58 Thanks 46 Joined 6 Apr '11
It seems a lot of people are concerned they might win a progress prize they don't want. My understanding is that you can choose which submission is considered for the prize. If you don't want the progress prize (and everything that comes with it), make a submission where every prediction is 1. For anyone concerned, NOT winning the prize should be very easy. Personally, I'd be surprised if those at risk of winning a progress prize did this. Thanked by Zach #9 / Posted 21 months ago
 Posts 94 Thanks 25 Joined 8 Apr '11
ExitingSlowlySoMaybeHe'sNotAndThatsAOkayByMe

DanB wrote: Personally, I'd be surprised if those at risk of winning a progress prize did this.

Nice phrase. Personally, being at risk of winning a prize is something I'm looking forward to. #10 / Posted 21 months ago
 Posts 94 Thanks 25 Joined 8 Apr '11
Speaking of (gasp) "collaboration": I hope it has not escaped anyone's attention that +/- 18 days ago, DanB announced "I don't have time for this anymore, here's what I've done so far, hope it helps somebody" ... and dumped various parts of his algorithm in forum posts for all to see. Various questions and answers then followed. Now DanB is in 5th place on the leaderboard. I could be wrong, but I don't think he was Top 10 before. Collaboration works! Sometimes in unexpected ways!!! Thanks, DanB. Hope you're able to stay in after all. #11 / Posted 21 months ago
 Rank 31st Posts 292 Thanks 64 Joined 2 Mar '11
Signipinnis wrote: Thanks DanB. Hope you're able to stay in after all.

Me too! And keep sharing ideas =) #12 / Posted 21 months ago
 Anthony Goldbloom (Kaggle) Competition Admin Kaggle Admin Posts 382 Thanks 72 Joined 20 Jan '10
Hi all, not ignoring this thread - just seeking clarification from HPN on one issue. Anthony #13 / Posted 21 months ago
 Anthony Goldbloom (Kaggle) Competition Admin Kaggle Admin Posts 382 Thanks 72 Joined 20 Jan '10
Sorry for the delay on this; I was just clarifying some issues with HPN.

"Is it inconsistent, as Sali Mali pointed out in another thread, to require documentation of the winning algorithms be publicly disclosed to all competitors given Rule 20, Entrant Representations? It seems that this disclosure will encourage other competitors to use aspects of the winning Prediction Algorithm which cause violation, directly or otherwise, of (i) - (iii) and possibly (iv) of that Rule."

Rule 20 does not apply to the extent that it prevents (a) competitors other than a milestone prize-winner from using code published by a milestone prize-winner in accordance with competition rules; and (b) a milestone prize-winner from competing subsequently in the competition using code for which it was awarded the milestone prize.

"Can you clarify that code, libraries and software specifications are *not* required to be publicly disclosed to competitors? These materials and intellectual property appear to be referenced separately from 'Prediction Algorithm and documentation.'"

Chris correctly points to Jeremy's response in an earlier forum post: “Only the paper describing the algorithm will be posted publicly. The paper must fully describe the algorithm. If other competitors find that it's missing key information, or doesn't behave as advertised, then they can appeal. The idea of course is that progress prize winners will fully share the results they've used to that point, so that all competitors can benefit for the remainder of the comp, and so that the overall outcome for health care is improved.”

"Will Kaggle or Heritage have a moderation or appeals process for handling competitor complaints? From the winning entrant's point of view, they would not want the review process to force back-door answers about code and libraries that accelerate a competitor's integration of the winning solution."

Kaggle and the HHP judging panel will moderate the appeals process.

"Can you comment on the spirit and fairness of the public disclosure of the Prediction Algorithm documentation and its impact on competitiveness? In particular, if the documentation truly meets the requirement of enabling a skilled computer science practitioner to reproduce the winning result, then this places the winning team at an unfair disadvantage: all competitors will have access to their algorithms and research, in addition to the winning algorithm."

This rule is in place to promote collaboration. Those who would prefer not to share can opt out of the prize.

"Can you provide more detailed clarification on the level of documentation required of conditional milestone winners? The guideline provided by the rules could cover a range of detail spanning from 'lecture notes' to 'detailed tutorial' to 'whitepaper' to 'conference paper', etc."

Hopefully this was adequately dealt with in Jeremy's response (requoted above). Let me know if further clarification is needed.

"Can you comment on the reproducibility requirement? For example, it is possible to construct algorithms with stochastic elements that may not be precisely reproducible, even using the same random seed -- is it sufficient for these algorithms to reproduce the submission approximately? What if they don't reproduce exactly, or reproduce at a prediction accuracy that is worse than the submission score, possibly worse than other competitor submissions?"

Exact reproducibility is required. #14 / Posted 21 months ago
 Rank 4th Posts 292 Thanks 113 Joined 22 Jun '10
Anthony Goldbloom wrote: Exact reproducibility is required.

Is this reproducibility of a submission that gives the same leaderboard score, or does the actual submission file have to be identical? If it is the latter, then I guess this will be impossible for most people - and I for one am out. An algorithm that relies on a particular setting of a random number seed to work is no good to anyone. The algorithm should result in the same overall predictive accuracy, but this is different from the exact same predictions. Thanked by Chris Raimondi #15 / Posted 21 months ago
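The distinction Sali Mali draws above -- the same leaderboard score versus a bit-identical submission file -- can be illustrated with a toy stochastic predictor. This is only a sketch: `train_and_predict` and the `data` values are hypothetical, standing in for a real Prediction Algorithm with random elements.

```python
import random

def train_and_predict(data, seed):
    """Toy stand-in for a stochastic Prediction Algorithm:
    predictions depend on random draws, so the seed controls the output."""
    rng = random.Random(seed)  # private RNG, isolated from global state
    return [round(x + rng.gauss(0, 0.1), 6) for x in data]

data = [1.0, 2.0, 3.0]

# Same seed, same single-threaded code path -> bit-identical submission file.
assert train_and_predict(data, seed=42) == train_and_predict(data, seed=42)

# A different seed produces a different submission file, even though the
# method (and typically its overall predictive accuracy) is unchanged.
assert train_and_predict(data, seed=42) != train_and_predict(data, seed=7)
```

Note that seed-level exactness only holds when every source of nondeterminism is pinned down; parallel execution or floating-point reordering can break bit-identical reproduction even with a fixed seed, which is presumably why the question of "exact" versus "score-level" reproducibility matters here.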