Four of the top 20 leaderboard entries (plus #21 in fact) were posted back in 2011, if I read the leaderboard correctly... most of them just before the Milestone 1 cutoff.  I can think of plenty of reasons that would legitimately happen, but I find it kind of surprising to get that close so soon and then disappear.  Where'd you guys go?!

Or do I misunderstand the date stamp?  Oh well.  The forums were slow lately so I thought I'd point it out.

I can't speak for the other people, but until last month I think I would have fallen into that group.  I only recently started submitting again.

Some of it has to do with some personal projects I am working on that consume more of my time than they did before.

Some has to do with frustrations.  I am having trouble with understanding the math and ridge regression.  I have single models that do fairly well and features that I am pretty sure no one is using (that help).

Also the team merger issue may be keeping some teams on the DL.  I know I have screwed myself to a large degree by the number of submissions I have made.  There is some interesting strategy if you look at the timeline as to what may occur with papers being published and teams merging.

Maybe everyone else is going after the $500k and is waiting for us to do the easy stuff first.

This competition is at the same time invigorating and soul sucking.

Do you guys think that it is possible that someone will beat the 0.4 threshold? The top score has not even broken 0.45 yet.

Or did HHP choose 0.4 just because it is out of reach? 

I don't think it will be beaten. You are talking about pretty much all the improvement that has already been made over a dummy model plus a little more. I don't think HHP is trying to be cheap - I think they wanted it to be a challenge. Netflix ended up working out by pure coincidence.

"This competition is at the same time invigorating and soul sucking."...

I couldn't agree more. :-) I've learned a LOT, though, which I fully enjoy, so it's been worth it.

I don't think any of it would be due to waiting for the big prize -- a team could enter and get a great leaderboard score, learning from their submissions, and just not select their best submission for consideration. That seems to be how the rules work. Not disqualifying yourself from a merger makes more sense, but even then I'd think a few submissions here and there would be necessary just to convince yourself you're on the right path. Maybe people have more confidence in their ability to avoid mistakes than I do, though.

To Damian's question, I agree with Chris -- I don't think 0.4 will be reached. I will say that I think it's impossible with the current milestone approaches, since they focus on a statistical best guess for each individual, rather than focusing on getting any of the values EXACTLY correct. In order to reach 0.4 I think a solution would have to nail a good chunk of the records with 0 error, which will require an entirely different approach.

I too think that it would be impossible to beat. But then this brings up the question about what is the best possible score that any algorithm can achieve.

Within the context of the competition, I have an idea on how this limit can be estimated... what if Kaggle creates a blend of all the current best individual submissions using the ridge regression technique? Surely the score of this blend will be better than the current top score, and it will give some indication of how much improvement can be made. If this makes any sense at all, it will certainly be a useful benchmark for all the other competitions.

You know I actually asked them to do that -- there's a thread here about suggestions for benchmarks, maybe together we can convince them:

You couldn't use ridge regression on them, however (or any regression), since you'd have to train it against the target set... well, Kaggle could, but it's effectively cheating; participants can only use regression on the training sets, so the result would still basically be unachievable. A straight average may still be an improvement.

They won't have to use the actual target set. There is a way to estimate the ridge regression (specifically, the t(X)%*%Y vector). See pages 20-21 of the Netflix solution. I believe this is the technique used by the top teams to blend their predictions.

Yes... the milestone papers here indicated they are using that technique; relying on the leaderboard scores to calculate an ensemble. Is that really ridge regression, though? Perhaps I'm having the same problem with the math as @Chris. ;-) I've learned something new today, yay!

From what I've been reading, ridge regression is just linear regression but with a diagonal matrix (the penalty) added to the t(X) %*% X matrix. The penalty brings the magnitude of the coefficients closer to 0.
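To make that description concrete, here is a minimal sketch in plain Python of the closed form beta = (t(X) %*% X + lambda * I)^-1 t(X) %*% y. The function names and the toy data are made up for illustration, and the small Gaussian elimination helper is only there to keep the example free of dependencies; it is not how the R package does it.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ridge_fit(X, y, lam):
    """Return (X'X + lam * I)^-1 X'y, where X is a list of rows."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(k)]
    for i in range(k):
        xtx[i][i] += lam  # the diagonal penalty
    return solve_linear(xtx, xty)


# Made-up toy data: two nearly collinear predictors.
X = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9]]
y = [1.0, 2.0, 3.0, 4.0]

beta_ols = ridge_fit(X, y, 0.0)     # lam = 0 is plain linear regression
beta_ridge = ridge_fit(X, y, 10.0)  # lam > 0 shrinks the coefficients

norm_ols = sum(b * b for b in beta_ols) ** 0.5
norm_ridge = sum(b * b for b in beta_ridge) ** 0.5
```

With lam = 0 this is ordinary least squares; raising lam pulls the coefficient vector toward zero, which is the shrinking effect described above.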

The estimation of the t(X)%*%Y vector is something that the Netflix guys did (I'm not sure if they were the first to use it though)
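For anyone who wants the algebra behind that estimate, here is a sketch, assuming the leaderboard score is a root-mean-squared error over N scored rows (for a log-scale metric, read x_j and y as the log-transformed values). The symbols are my own labels: x_j is submission j's prediction vector, y the hidden targets, r_j the leaderboard score of submission j, and r_0 the score of an all-zeros submission.

```latex
N r_j^2 = \lVert x_j - y \rVert^2
        = x_j^\top x_j - 2\, x_j^\top y + y^\top y
\quad\Longrightarrow\quad
x_j^\top y = \tfrac{1}{2}\left( x_j^\top x_j + y^\top y - N r_j^2 \right),
\qquad
y^\top y = N r_0^2 .
```

Every term on the right is known to the participant, so each entry of t(X)%*%Y can be recovered without seeing y, up to the rounding of the published scores.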

So does it seem now that Kaggle can put together everyone's submission in one blend?...think of it as a communal blend.

There is an R Package for ridge regression here:

technique comes down to performing a ridge regression based on the leaderboard scores. The regularization
parameter was chosen as 0.0015 * 70492.

Page 12 V1 of Milestone 1 paper by Willem

Fine - I sort of get that there is an alpha parameter (which I am guessing is the lambda parameter in the R package), and the R package allows for a vector or scalar for the lambda value. But if you somehow put the leaderboard scores in the lambda spot (which I guess I could do), where the hell do you put the lambda/alpha value?

I have read through some of the stuff on ridge regression, but I am guessing ridge regression was invented before data mining competitions - and this isn't pure ridge regression.  There were no leaderboard scores in Tikhonov's day.  Part of the problem is all the math looks like Greek to me (I guess some of it is Greek) - but I am assuming the alpha/lambda parameter deals with how ridgy it is - and functions similarly to the L1/L2 parameter in glmnet.  Problem is I don't understand that either :)  I am guessing it is a penalty that punishes worse predictors/features.
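On where the two things go: one reading of the quoted passage is that the leaderboard scores never sit in the lambda slot. They are only used to reconstruct t(X)%*%Y, while lambda is still added to the diagonal of t(X)%*%X as usual. Below is a minimal Python sketch of that reading, assuming an RMSE-style score (for a log-scale metric you would pass log-transformed predictions). All names and numbers are hypothetical, the "hidden" vector only stands in for the leaderboard in this toy, and scaling lambda by the row count is just my guess at what "0.0015 * 70492" means.

```python
import math


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rmse(pred, target):
    n = len(pred)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / n)


def estimate_xty(preds, scores, zero_score):
    """Recover each x_j'y from x_j'x_j, the score of submission j,
    and the score of an all-zeros submission (which gives y'y)."""
    n = len(preds[0])
    yty = n * zero_score ** 2
    return [(sum(v * v for v in p) + yty - n * s ** 2) / 2.0
            for p, s in zip(preds, scores)]


def blend_weights(preds, scores, zero_score, lam):
    """Ridge weights: scores feed X'y, lam goes on the diagonal of X'X."""
    k = len(preds)
    xtx = [[sum(a * b for a, b in zip(preds[i], preds[j])) for j in range(k)]
           for i in range(k)]
    for i in range(k):
        xtx[i][i] += lam
    return solve_linear(xtx, estimate_xty(preds, scores, zero_score))


# Toy stand-in for the leaderboard: in a real competition "hidden" is unknown
# and only the scores below would be visible.
hidden = [0.0, 1.0, 0.0, 2.0, 0.0, 3.0]
preds = [[0.2, 0.8, 0.1, 1.5, 0.3, 2.5],
         [0.5, 0.5, 0.5, 1.0, 0.5, 2.0]]
scores = [rmse(p, hidden) for p in preds]
zero_score = rmse([0.0] * len(hidden), hidden)

lam = 0.0001 * len(hidden)  # hypothetical per-row penalty times row count
weights = blend_weights(preds, scores, zero_score, lam)
blend = [sum(w * p[i] for w, p in zip(weights, preds))
         for i in range(len(hidden))]
```

In this toy the blend scores better than either input, but it is fitted to the same rows it is scored on, which is the leaderboard overfitting risk the milestone papers mention.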

Our candidate population contained 79 base models with each sub blend
containing a randomly selected n base models. The process was repeated 1,000
times with a ridge parameter of 0.0001. We built models with various values of
n, with generally increasing leaderboard performance as n increased, but
also with an increasing probability that the model has overfit to the leaderboard.
The final choice of n (20) was a tactical choice that resulted in a final model
slightly better on the leaderboard than the third placed team.

from mm page 3 of milestone 2 report

This talks about iterations - which none of the academic stuff on ridge regression mentions. In the paragraph before, team mm suggests this is their improvement to straight ridge regression.  They do use a regularization parameter - and the leaderboard scores.  All of the academic stuff I have read comes down to the whole matrix stuff that Damian mentions.
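As a rough illustration of the "randomly selected n base models, repeated many times" idea, here is a Python sketch. The quoted paragraph does not say how the sub blends are combined, so averaging the weights (with a model counting as zero in any sub blend that skips it) is purely my assumption, and the function names and toy numbers are hypothetical. It takes t(X)%*%X and t(X)%*%Y as inputs, so either the true or the leaderboard-estimated version could be plugged in.

```python
import random


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def bagged_ridge_weights(xtx, xty, n_sub, repeats, lam, seed=0):
    """Average ridge weights over random subsets of n_sub base models."""
    rng = random.Random(seed)
    k = len(xty)
    total = [0.0] * k
    for _ in range(repeats):
        idx = rng.sample(range(k), n_sub)  # models in this sub blend
        sub_a = [[xtx[i][j] + (lam if i == j else 0.0) for j in idx]
                 for i in idx]
        sub_b = [xty[i] for i in idx]
        for i, w in zip(idx, solve_linear(sub_a, sub_b)):
            total[i] += w  # models left out contribute zero this round
    return [t / repeats for t in total]


# Made-up toy: three base models scored on five rows.
y = [0.0, 1.0, 0.0, 2.0, 3.0]
preds = [[0.2, 0.8, 0.1, 1.5, 2.5],
         [0.5, 0.5, 0.5, 1.0, 2.0],
         [0.0, 1.2, 0.3, 1.8, 2.6]]
k = len(preds)
xtx = [[sum(a * b for a, b in zip(preds[i], preds[j])) for j in range(k)]
       for i in range(k)]
xty = [sum(a * b for a, b in zip(p, y)) for p in preds]

sub_blend = bagged_ridge_weights(xtx, xty, n_sub=2, repeats=200, lam=0.1)
full_blend = bagged_ridge_weights(xtx, xty, n_sub=k, repeats=5, lam=0.1)
```

When n_sub equals the number of models, every round solves the same full system, so this reduces to straight ridge regression; the random subsets only matter for smaller n_sub.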

I am curious as to how much improvement is to be made from straight linear regression to ridge regression.

Something like the very last graphic on this poster:

I know that graphic exists in non poster form - I just can't find it right now.

Anyway - if anyone knows on the netflix competition -

If straight linear regression gets you 0.87525 I would be curious as to:

1) What does [obviously did] straight ridge regression with just the alpha/lambda parameter get someone on the Netflix leaderboard?
2) What does using both the alpha/lambda parameter and the leaderboard scores get someone - but without any bagging or iterative process?
3) What does the best possible use of ridge regression get you - using alpha/lambda, leaderboard scores, and bagging or iterative training (but none of the BGBT, NN, or other stuff along those lines)?

@Chris thanks for that poster!

I can't answer your questions but just wanted to point out that even though the poster says Linear Regression... the formula is that of ridge regression because it has the penalty λ. Also, if you're gonna estimate t(X)%*%y, you won't need the package; it would be just a matter of doing the matrix multiplication.

ChipMonkey wrote:

Four of the top 20 leaderboard entries (plus #21 in fact) were posted back in 2011, if I read the leaderboard correctly... most of them just before the Milestone 1 cutoff.  I can think of plenty of reasons that would legitimately happen, but I find it kind of surprising to get that close so soon and then disappear.  Where'd you guys go?!

Or do I misunderstand the date stamp?  Oh well.  The forums were slow lately so I thought I'd point it out.

I've been busy with other projects (day job, other data stuff, a 2 year old at home), but do plan on diving back in fairly soon.  Part of the reason I stopped was burnout.  Pretty much all of my available free time in the month of August was spent running various models in R.  I was relieved once the milestone had passed; the pressure was off.

Clearly I am no expert, but the large number of zeros in the target makes producing a very good score somewhat difficult.

