alexanderr
Posts 42
Thanks 2
Joined 5 Apr '11

Is MS Excel better for this task than, say, MySQL or PHP?

 

 
Jeff Moser
Kaggle Admin
Posts 356
Thanks 178
Joined 21 Aug '10
From Kaggle

I think several tools will be helpful for this competition, including Excel. Many Kaggle competitions are won by a clever insight into the data rather than by complicated algorithms or powerful machines.

Kaggle's own Jeremy Howard gave a great talk on getting ready for competitions like the Heritage Health Prize, in which he shows how he uses Excel. It's definitely worth a look.

 
TMiranda
Posts 1
Joined 14 Apr '11

I think some advanced correlations could be better measured with Minitab.

 
Anthony Goldbloom (Kaggle)
Competition Admin
Kaggle Admin
Posts 382
Thanks 72
Joined 20 Jan '10
From Kaggle
For generating features, I recommend SQLite, though MySQL does the same thing. I know Jeremy and Jeff like C#'s LINQ. For building models, I use R.
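
To illustrate the feature-generation idea above, here is a minimal sketch using SQLite through Python's built-in sqlite3 module. The table layout, the column names (member_id, year, specialty, los_days), and the sample rows are made up for illustration and are not the competition's actual schema.

```python
import sqlite3

# Hypothetical claim rows: (member_id, year, specialty, los_days).
# These values are invented for the example, not taken from the real data.
claims = [
    (1, "Y1", "Internal", 0),
    (1, "Y1", "Surgery", 3),
    (2, "Y1", "Internal", 0),
    (2, "Y1", "Internal", 1),
    (2, "Y1", "Emergency", 0),
]


def build_features(rows):
    """Return one feature row per member, aggregated from claim rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE claims ("
        "member_id INTEGER, year TEXT, specialty TEXT, los_days INTEGER)"
    )
    conn.executemany("INSERT INTO claims VALUES (?, ?, ?, ?)", rows)
    # One GROUP BY query turns many claim rows into per-member features.
    query = """
        SELECT member_id,
               COUNT(*)                  AS n_claims,
               COUNT(DISTINCT specialty) AS n_specialties,
               SUM(los_days)             AS total_los_days
        FROM claims
        GROUP BY member_id
        ORDER BY member_id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


features = build_features(claims)
for row in features:
    print(row)
```

Each output row could then be exported and fed to whatever modelling tool you prefer; the same GROUP BY query should also work in MySQL with little or no change.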
Thanked by Cyfarwyddyd
 
Aeoliana
Posts 17
Thanks 15
Joined 4 Apr '11
I use SQL for storing the data; this way I can create views etc. to show me certain facets of the data that you might not see otherwise. For analysis I use C# and make almost exclusive use of collections and lambda functions (LINQ). To get the data from SQL to C# I use the ADO.NET Entity Framework. I also use Tableau to visualize the data.
 
Zach
Rank 31st
Posts 292
Thanks 64
Joined 2 Mar '11

TMiranda wrote:

I think some advanced correlations could be better measured with Minitab.

 

The R packages 'plyr' and 'reshape2' are pretty great for generating features, especially since 'plyr' can be easily parallelized.  It probably isn't as fast as using SQL, but I'm a much better R programmer than SQL programmer, so the tradeoff in speed is worth it.

 
abbas shojaee
Posts 9
Thanks 1
Joined 4 Apr '11

Hi Alexander

Based on this post and your other post (teaming up to implement your algorithm), and considering that the competition is a long-term one, you might start learning and using F#, a free computational language by Microsoft which provides many great features. Of course you need to master several other technologies and concepts too, e.g. DBMSs (MySQL or SQL Server, DB4a, ...), LINQ, visualization and mathematics libraries, etc.

 

 

 
Zach
Rank 31st
Posts 292
Thanks 64
Joined 2 Mar '11
Learning R would also be beneficial, as it's the most common (and successful) language on Kaggle.
 
alexanderr
Posts 42
Thanks 2
Joined 5 Apr '11

I have been learning MySQL, but phpMyAdmin won't allow the whole database CSV file to import. My web host said the maximum file size for CSV import is 8 MB.

I need 50-60 MB.

This makes life very difficult. Chopping up CSV files lots of times and putting them together again just makes an already difficult problem worse! I use a home laptop; do I need a more powerful computer to use SQL? My laptop would not allow Visual Studio files onto it and failed to install R.
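
One way around a web host's upload limit is to load the CSV into a local database yourself, reading it in batches so the whole file never has to sit in memory. The sketch below is a rough illustration using Python's built-in csv and sqlite3 modules, with SQLite swapped in for MySQL. The function name load_csv, the table name, and the sample data are all hypothetical.

```python
import csv
import io
import itertools
import sqlite3


def load_csv(conn, csv_file, table, batch_size=10000):
    """Stream rows from an open CSV file into a new SQLite table.

    The first CSV row is used as the column names. The table and column
    names are pasted into the SQL text, so only use this with files you
    trust. Returns the number of data rows inserted.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join('"{}"'.format(name) for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))
    insert_sql = 'INSERT INTO "{}" VALUES ({})'.format(table, placeholders)

    total = 0
    while True:
        # Pull at most batch_size rows at a time from the reader.
        batch = list(itertools.islice(reader, batch_size))
        if not batch:
            break
        conn.executemany(insert_sql, batch)
        total += len(batch)
    conn.commit()
    return total


# Tiny in-memory stand-in for a large CSV file (invented sample data).
demo_csv = io.StringIO(
    "member_id,specialty\n1,Internal\n2,Surgery\n3,Internal\n"
)
conn = sqlite3.connect(":memory:")
rows_loaded = load_csv(conn, demo_csv, "claims", batch_size=2)
print(rows_loaded)
```

For a real file you would pass an open file object (opened with newline="") instead of the in-memory sample, and connect to a database file on disk rather than ":memory:".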

 
Zach
Rank 31st
Posts 292
Thanks 64
Joined 2 Mar '11
PHP is probably a bad choice for this competition. What kind of laptop do you have and what operating system are you running?
 
alexanderr
Posts 42
Thanks 2
Joined 5 Apr '11

I have an Acer laptop with a 2 GHz processor, 1 GB of memory, and an 80 GB hard drive, running 32-bit Windows Vista.

I have tried PHP but ran into lots of bugs. I think this competition can be done initially with Excel to get a feel for the data. I am even trying the OpenOffice.org program.

 
Chris Raimondi
Rank 38th
Posts 194
Thanks 90
Joined 9 Jul '10
I have never had a problem installing R on a Windows system. I have probably done it 8 times. It always works for me without having to tweak anything, as long as I put any files I want to import/export in "My Documents". What happened when you tried to install it? I am no R expert, but I think it should work with those specs, although you'll have to be careful with memory.
 
alexanderr
Posts 42
Thanks 2
Joined 5 Apr '11
It may just be that I have insufficient space on my hard disk: only 500 MB left! I have just found that the OpenOffice.org spreadsheet can take 65,000 rows of data and analyse them with functions like sum(), average(), etc., like Excel. I am going to see how I get on with this. Perhaps other people may find OpenOffice useful too.
 
William Cukierski
Kaggle Admin
Posts 337
Thanks 165
Joined 13 Oct '10
From Kaggle

I would say start with R and forget the Heritage prize for a while. This is an advanced contest with some complicated data. You will learn much more and get less frustrated if you take on some simpler problems first. Get your feet wet on smaller problems and enjoy the learning. There are going to be teams of the brightest computer scientists and data miners from the best institutions in the world competing for this. Don't worry about the prizes.

 
Dirk Nachbar
Posts 83
Thanks 3
Joined 26 May '10
I am using SAS, but might transfer some problems to R.
 
Team Fox
Posts 1
Joined 4 Apr '11
Excel 2010 PowerPivot is great for exploring large datasets like the HHP. Because it has an OLAP engine and data compression (VertiPaq) under the hood, it gobbles up these big datasets. The slicer functionality means it's quick to slice and dice the dataset in multiple ways.
 
alexanderr
Posts 42
Thanks 2
Joined 5 Apr '11

I have managed to get PHP and MySQL working well now. I want to use this competition to improve my programming skills on a difficult task.

I have been learning programming for 6 months now and I am improving. This is definitely a difficult task, or else the problem would have been solved years ago; think of all the money governments put into health research.

I am a novice at programming but not at stats or biology, so that is why I think it is worth having a go.

 
Tatiana McClintock
Posts 9
Joined 15 Apr '11
And no one is using Excel? Is there a free version of Minitab?
 
Solo Dolo
Posts 8
Joined 17 Mar '11
I'm using Mathematica V8 ... anyone know if this is legal?
 
boegel
Posts 17
Thanks 4
Joined 5 Apr '11
I'm using Haskell. Yes, I'm brave.
 
Information Man
Posts 14
Thanks 1
Joined 8 Apr '11
1) SQL 2) Excel 3) RapidMiner 4) R
 
Rapid Insight
Posts 1
Joined 6 Apr '11
Rapid Insight Analytics for modeling. Rapid Insight Veera for preparing the analytic file and scoring.
 
Jason Morris
Posts 11
Thanks 3
Joined 2 Apr '11

You can always try out Oracle Express. It should work on your laptop and will have no issues digesting the large amount of data being used in the competition. It is a free download from Oracle.com.

 
alexanderr
Posts 42
Thanks 2
Joined 5 Apr '11

I tried Oracle Express and it was too slow at executing queries!

I will continue with MySQL and PHP together.


Thanks for the suggestion though.

 
Solo Dolo
Posts 8
Joined 17 Mar '11
Seriously though, is Mathematica an acceptable product to use in this competition? You have to get a license to use it, but there are no restrictions as to who can get a license and it isn't prohibitively expensive. Thanks in advance for the clarification.
 
Timmay
Posts 1
Joined 17 Mar '11

Is anyone having memory problems getting anything done with the claims data in R, particularly in Revolution?

 
inf2207
Posts 9
Joined 28 Apr '11

Timmay wrote:

Is anyone having memory problems getting anything done with the claims data in R, particularly in Revolution?

Yes, on my laptop (it has only 2 GB of RAM). But I'm going to switch to my PC as the working platform for this competition as soon as I install 64-bit Windows 7 on it.

 
BotM
Posts 11
Thanks 4
Joined 5 Aug '10

Timmay wrote:

Is anyone having memory problems getting anything done with the claims data in R, particularly in Revolution?

You need to use vectorized operations instead of loop constructs. If memory is still a problem, you should move to Linux and a 64-bit version of R, or try using biglm. I have tried R on both 32-bit and 64-bit XP, but available memory is still strongly limited.

 
Zaccak Solutions
Posts 39
Thanks 7
Joined 10 Feb '11

Timmay wrote:

Is anyone having memory problems getting anything done with the claims data in R, particularly in Revolution?

 

What operations are you trying to do?

I've been testing out Revolution for the past 2 weeks. It's not bad, but it still needs a lot of work to be a good IDE. Some things drive me nuts in it. I would have preferred it if they had built it on top of Eclipse rather than Visual Studio.

I have Win 7 64-bit with 4 GB of RAM and everything I do on the claims data seems to be fine. Doing a random forest, I can only use about 200 trees; if I try 300 or more I run out of memory.

 
cbusch
Posts 7
Joined 31 Aug '11

Do the rules prohibit using a tool such as Oracle Data Miner?

 
Anthony Goldbloom (Kaggle)
Competition Admin
Kaggle Admin
Posts 382
Thanks 72
Joined 20 Jan '10
From Kaggle

The rules do not prohibit Oracle Data Miner.

 
