Anthony Goldbloom (Kaggle)
Competition Admin
Kaggle Admin
Posts 383
Thanks 73
Joined 20 Jan '10
From Kaggle

We are aware that the rules haven't been as clear as we might have liked. Please be reminded that:

  • you cannot sign up to Kaggle from multiple accounts and therefore you cannot submit from multiple accounts; and
  • privately sharing code or data is not permitted outside of teams (sharing data or code is permissible if made available to all players, such as on the forums).

We've reached out to several teams about this issue. Please let us know ASAP if you have multiple accounts and we've not reached out to you.

 
Pablo Ruggia
Posts 7
Thanks 8
Joined 3 Jun '11

Great post from Sali Mali about this topic: http://anotherdataminingblog.blogspot.com/2011/12/phantom-of-opera.html

Thanked by DanB, Ghazian, knorthover and B Yang
 
RTDS
Posts 3
Joined 4 Jul '11

It seems "Yarong" and "liqo" have been pulled out of the contest, but their names are still on the leaderboard.

 
Sali Mali
Rank 1st
Posts 326
Thanks 146
Joined 22 Jun '10

Objectifi Team wrote:

It seems "Yarong" and "liqo" have been pulled out of the contest, but their names are still on the leaderboard.

and Larry_temp, lsun_sd, JYL, mmah, chongquingchef, HappyAcura, KatyPerry, vahid, mahdi, rutgers, mona, sailors, oakwood, syang, magneto, jing2009, thonda, cyclops, etc...

Is this an issue with the Kaggle site or have their accounts been frozen?

This is not really a problem though, as it is not always a good idea to try too hard...

http://anotherdataminingblog.blogspot.com/2011/12/two-become-one.html

 
Sergey Yurgenson
Posts 409
Thanks 221
Joined 2 Dec '10

Hi Sali Mali,
Some errata for your blog:
1. It was the "Give Me Some Credit" competition, not the "Don't Overfit" competition.
2. vsu (the person who was discussed in the forum) finished 9th, not 7th. Team vsh finished 7th. At the moment there is no indication that it belongs to the same person.

Thanked by Sali Mali
 
John
Rank 5th
Posts 26
Thanks 7
Joined 21 Jul '11

There will be a 3rd, a 4th, or more blog posts popping up about the relationship between SD_John_lily and the current top player, Opera Solutions, by our famous Phil Brierley, AKA Sali Mali, if I do not spend some time saying something. SD_John and Lily are two close teams collaborating with each other. Anyone can see that without learning data mining skills. We merged into one team to follow the new rules posted by Kaggle.

If the purpose of Phil's hard work is only to discover that SD_John and Lily are collaborating teams, it may not be interesting enough to justify the precious time and effort. The main goal I see here is to prove that SD_John_Lily is part of Opera Solutions' team. If that was not so certain in the first blog post, it is definitely certain in the second: "Lily, SD_John and Opera Solutions are all essentially the same entity (and JYL also entered) ....". So our famous data miner Phil dug out this undoubtable conclusion.

I was thinking of revealing our own blogs or LinkedIn pages to clear up the confusion. Now I feel it is better to leave it as is so Phil can keep digging. This reminds me of the story of Robert A. Millikan, an experimental physicist and Nobel laureate in physics. It was widely believed that he picked 58 data points to support his claim instead of using all the raw data on the measurement of the electron charge. If SD_John's correlation with JYL is > 0.99 based on Phil's measurement, I bet there must be some teams that have greater than 0.99 correlation with MarketMaker if you repeat the same calculation on MarketMaker against all other teams. The reason I am so sure is that I have never known JYL; I learned the name for the first time from the blog. When I saw Opera's final results in the credit competition, I could not help laughing. I know it gives another data point to prove I am part of it. Who knows, maybe in the future. I hear it is a pretty good place for data mining scientists. They always welcome great scientists.

I am surprised that I have not done anything on the HPN competition for more than 2 months. I have to come back and make more submissions. Maybe I should closely follow MarketMaker's submissions to increase the correlation? To give some hints about my profile: I have participated in several data mining competitions in the past (many times the performance was not bad); I met David, a member of MarketMaker, years ago at a conference, and have a lot of respect for his early winning achievements in the KDD Cup (great job on this competition, of course); Had chance to compete with Opera on other contest (outside of Kaggle); I encourage everyone to compete in the recently available CHALEARN Gesture Challenge, because I am somehow associated with it and not eligible to win the prize. There are not many teams yet, so it should be very interesting. ....

 
Sali Mali
Rank 1st
Posts 326
Thanks 146
Joined 22 Jun '10

Hi John,

I have reworded my blog - hope this now reads better.

Can you just clarify though that you have actually competed as part of the Opera team before?

John wrote:

 Had chance to compete with Opera on other contest (outside of Kaggle)

and if you and Lily had come first and second in the milestone prizes, what was the arrangement for dividing the prize money? You say you were close teams collaborating with each other - there are lots of other teams who would love to have collaborated and stayed as independent teams, but the rules were quite clear this was not allowed, so they didn't. Dave and I had to put our submissions on hold until our combined submission count was below the allowed level - we could have 'collaborated' and come first and second!

 
John
Rank 5th
Posts 26
Thanks 7
Joined 21 Jul '11

Unfortunately, I have never been part of the Opera team so far. Maybe I should have said "I had chance to compete against Opera on other contest". One more fact: I am not a native English speaker. Don't be confused by the name John. Dave is in Florida and you are on the other side of the ocean. If you guys had played as separate teams, shared code with each other, and been so sure that you could be first and second before the 1st milestone deadline, I would be totally fine with it. The old rule was not clear about this; I guess that's why the new rule was laid out.

I saw some discussions about the submission limit on the forum. Actually, I would like the sponsors to just use 5% or 10% of the total holdout set for the public leaderboard. In some other data mining competitions, there was no public leaderboard. You had to set aside your own validation set to avoid overfitting. That's actually a pretty good option. Very few players would combine hundreds of models in their submission. The Netflix competition started the public leaderboard idea (not sure it was the first one), which attracted so many players. I think there was a lot of marketing purpose in it for Netflix. Later competitions all like to have a public leaderboard in order to make the competition more fun, more intense, and more attractive to players. However, people started gaining insights from the public board.
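
The idea of setting aside your own validation set can be sketched in a few lines. This is only an illustration, not taken from any competition's tooling; the function name `train_validation_split`, the 20% default, and the fixed seed are all assumptions made for the example.

```python
import random


def train_validation_split(rows, validation_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` and hold out a fraction for validation.

    Hypothetical helper: the name, default fraction, and seed are
    illustrative assumptions, not part of any competition's rules.
    """
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between experiments.
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(round(len(shuffled) * validation_fraction)))
    # First n_valid shuffled rows are held out; the rest are for training.
    return shuffled[n_valid:], shuffled[:n_valid]


if __name__ == "__main__":
    train, valid = train_validation_split(range(100), validation_fraction=0.1)
    print(len(train), len(valid))
```

Models would then be compared on the held-out rows only, so the public leaderboard is not needed for model selection.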

As industry practitioners, we definitely want simple and robust solutions. It is a huge challenge to implement an ensemble model combining hundreds of complicated algorithms in a real-time environment. Take the credit score competition as an example: the best solution combined boosted decision trees, random forest, SVM and neural networks. The credit scoring industry in the US is strictly regulated by credit lending laws; you have to give clear reason codes to explain why a consumer gets a certain score. How can we explain a score if the model is a combination of all these non-linear algorithms?

Then what the heck are these competitions all about if the winning solution is not useful in practice? I appreciated one discussion I had with a great modeler at Facebook: "You can treat this ensemble solution as an ORACLE when the true solution is not available. You can also use the complicated solution to help you develop your simple, practical solution used in production."
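
A toy sketch of the "oracle" idea quoted above: score some inputs with the complicated model, then fit a simple model to those scores. Everything here is hypothetical (the two stand-in ensemble members, the `distill_to_line` name, and the choice of a one-variable least-squares line as the simple model); a real production model would have many features and its own validation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def distill_to_line(oracle, inputs):
    """Fit a simple line to the oracle's predictions (not to true labels)."""
    targets = [oracle(x) for x in inputs]
    return fit_line(inputs, targets)


if __name__ == "__main__":
    # Stand-ins for a complicated ensemble: the average of two "models".
    members = [lambda x: 2.0 * x, lambda x: 2.0 * x + 2.0]

    def ensemble(x):
        return sum(m(x) for m in members) / len(members)

    slope, intercept = distill_to_line(ensemble, [0.0, 1.0, 2.0, 3.0])
    print(slope, intercept)
```

The simple model is easier to explain with reason codes, while the ensemble is only used offline to generate targets.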

 
RTDS
Posts 3
Joined 4 Jul '11

Have you seen the first copying machine? The majority of technical people like John were asking who needed this giant machine just to copy a sheet of paper! The rest is history. How about ENIAC? It is true that it is difficult to implement an ensemble model, but did we forget that "Necessity is the mother of invention"?


 
Sarkis
Posts 41
Thanks 5
Joined 5 Apr '11

Anthony Goldbloom (Kaggle) wrote:

We are aware that the rules haven't been as clear as we might have liked. Please be reminded that:

  • you cannot sign up to Kaggle from multiple accounts and therefore you cannot submit from multiple accounts; and
  • privately sharing code or data is not permitted outside of teams (sharing data or code is permissible if made available to all players, such as on the forums).

We've reached out to several teams about this issue. Please let us know ASAP if you have multiple accounts and we've not reached out to you.

Thanks everyone for this interesting discussion. I was wondering how these rules are enforced in practice. I see many teams on the public leaderboard who are in the top 100 with single-digit submissions, e.g. realanalysis, PookyPANTS, datum miner, JackT, UCI-CS273A-Koi, shimi, sungw, riho, longi, etc.

Unless there is a leak of information from the top 20 teams on the leaderboard, it's hard to believe that any team can get into the top 100 with fewer than 10 submissions at this point in the game.

 
B Yang
Rank 2nd
Posts 245
Thanks 65
Joined 12 Nov '10

Alright, I'd really like to know what is going on here now.

Yesterday, the team #18 on the leaderboard, Closer, reached .4603 with 5 submissions. If you expanded team Closer, you'd have seen a single member named Vess or something, with a link to a linkedin.com profile showing this person is an employee of Opera Solutions. Today the link to linkedin profile's gone, and member name was changed to "Chinese Democracy".

Thanked by Oleg Vasilyev
 
syntax
Rank 7th
Posts 8
Thanks 7
Joined 9 Jan '12

B Yang wrote:

Alright, I'd really like to know what is going on here now.

Yesterday, the team #18 on the leaderboard, Closer, reached .4603 with 5 submissions. If you expanded team Closer, you'd have seen a single member named Vess or something, with a link to a linkedin.com profile showing this person is an employee of Opera Solutions. Today the link to linkedin profile's gone, and member name was changed to "Chinese Democracy".

As validation of this, you can still see user "vess" in team Closer by using the Google cache. Clicking it will redirect to Chinese Democracy, probably because the user was renamed.

http://webcache.googleusercontent.com/search?q=cache:3aDY96sLYhEJ:www.kaggle.com/teams/9615/closer+%22vess%22+%22opera+solutions%22&cd=5&hl=da&ct=clnk&gl=dk

Thanked by Oleg Vasilyev
 
Sali Mali
Rank 1st
Posts 326
Thanks 146
Joined 22 Jun '10

B Yang wrote:

Alright, I'd really like to know what is going on here now.

Yesterday, the team #18 on the leaderboard, Closer, reached .4603 with 5 submissions. If you expanded team Closer, you'd have seen a single member named Vess or something, with a link to a linkedin.com profile showing this person is an employee of Opera Solutions. Today the link to linkedin profile's gone, and member name was changed to "Chinese Democracy".

Here is another observation (which may be just a fluke)

On 26 Jan there were 2 teams at positions 18 & 19 who made very impressive progress on the leaderboard with very few entries. They got literally the same score within a few hours of each other...

http://www.heritagehealthprize.com/c/hhp/Leaderboard?asOf=2012-01-27

A few days later another team gets basically the same score, having done an equally impressive 4 entries. 

http://www.heritagehealthprize.com/c/hhp/Leaderboard?asOf=2012-02-05

Thanked by Oleg Vasilyev
 
Edward
Rank 1st
Posts 5
Thanks 7
Joined 16 Feb '11

Hello All,

This is a call for clarification by Kaggle, because I worry about the fairness of this contest:

  • It has been noted that there are teams that had or have (!) multiple accounts and submissions, and that these additional submissions weren't added to their total when this thread started. If this is true, it is very unfair and should be corrected; otherwise the other players are at a large disadvantage. The rule on the number of submissions was imposed on us, and has already had a big influence on the ability of many teams to merge, so this rule must apply to every team.
  • There are indications in this thread that there are teams that still perform additional submissions, with different accounts. Is this correct?
  • There has not been any reaction (that I noticed) from the teams (in)correctly accused on the forum.
    This could be a strong indication that the observations are indeed correct.
  • Kaggle has the data and very smart people to investigate this (it could be their internal mining competition...). They can, for instance, investigate the correlations between submissions, along with other techniques already presented by other teams (thanks for that!). Although I do not know how much effort Kaggle has put into this, it is remarkable that the only reaction presented (2 months ago) came after other teams pointed out certain observations, which Kaggle should have found on their own and could now find automatically (from now on at least). Can we expect Kaggle to search more actively than they do now for observations that point to rule infringements, to increase the fairness of this contest?
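
To make the correlation check mentioned in the last point concrete, here is a minimal sketch. It is not Kaggle's actual procedure; the function names, the toy prediction vectors, and the team labels are hypothetical, and the 0.99 threshold is simply the figure quoted earlier in this thread.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length prediction vectors."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def flag_similar_pairs(submissions, threshold=0.99):
    """Return (team_a, team_b, r) for every pair with correlation above threshold.

    `submissions` maps a team name to its predictions on the same rows.
    """
    names = sorted(submissions)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(submissions[a], submissions[b])
            if r > threshold:
                flagged.append((a, b, r))
    return flagged


if __name__ == "__main__":
    toy = {  # made-up numbers, for illustration only
        "team_a": [0.10, 0.40, 0.35, 0.80],
        "team_b": [0.11, 0.41, 0.36, 0.79],
        "team_c": [0.90, 0.10, 0.50, 0.20],
    }
    for a, b, r in flag_similar_pairs(toy):
        print(a, b, round(r, 4))
```

A high correlation only flags a pair for a closer look. As argued earlier in the thread, independent teams using similar methods may also score above the threshold, so it is not proof of shared accounts or code.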

It should be clear that every team should be playing under the same rules and restrictions.
It certainly doesn't feel that way to me at this moment.

If I don't get a reaction from Kaggle or these teams, I will get the impression that I am right.

Thanked by Sali Mali and Oleg Vasilyev
 
Ben Hamner
Kaggle Admin
Posts 809
Thanks 356
Joined 31 May '10
From Kaggle

We've been looking into this, and will have a more formal response soon. If there are any other teams that you've been suspicious of, please post them here.

Teams violating the rules will be disqualified from the competition and will not receive any prizes.

 
Sali Mali
Rank 1st
Posts 326
Thanks 146
Joined 22 Jun '10

Ben Hamner wrote:

If there are any other teams that you've been suspicious of, please post them here.

I think Opera Solutions, Edward & Willem, Modeling Dudes, Petterson & Caetano and everyone else on the leaderboard are suspicious. Please disqualify them all immediately. ;-)

Thanked by syntax
 
DavidChudzicki
Kaggle Admin
Posts 447
Thanks 107
Joined 21 Nov '10
From Kaggle

I apologize for the delay. We're working on this, but in the meantime, I want to reiterate what Ben said: teams violating the rules will be disqualified from the competition and will not receive any prizes.

We will have records from all entries in the competition, and will examine them for evidence of multiple accounts:

(a) before they can merge teams; and
(b) before they can win anything.

We will also do our best to identify and deal with everyone who has multiple accounts, but prioritizing those situations should deal with those concerns about the integrity of the competition. Let us know if you have other concerns.

 
