Using Weka on large data-sets

Hi,

Could someone share ideas on how to train the algorithms implemented in WEKA on large data-sets without running out of memory? Is it possible?

Right now, the largest training data-set I have managed to load has 2000 instances, and that is not enough to give good results.

Any help or suggestions in this regard would be appreciated.

Thanks in advance!

To use WEKA on larger data sets, you have to raise the memory limit for the program manually.  Use the following steps:

1.  Find the directory where WEKA is installed, specifically the one containing a file called weka.jar (for me this was the following directory:  C:\Program Files\Weka-3-6)

2.  Open your command prompt and change to the directory that contains weka.jar.

3.  Once you are in that directory, use the following command:  java -Xmx2g -classpath weka.jar weka.gui.GUIChooser

4.  Once you hit enter, WEKA will open the GUI and you can start working with your data.  Be aware that the setting only lasts for that session.  Once you close WEKA it reverts to the default, so you will have to run this command every time you want to use WEKA on a large dataset.

Note:  In this example, I set the maximum Java heap size to 2 GB.  You can see this in the -Xmx2g part of the command; the 2g means 2 gigabytes.  If you want 128 MB, type -Xmx128m instead, and so on.  If you have more questions, try this link:

http://old.nabble.com/Changing-heap-size-for-knowledge-flow-interface-td9368828.html
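If you want to check that the -Xmx setting from the steps above actually took effect, a small stand-alone Java class can print the maximum heap the JVM reports. This is only a sketch: the class name HeapCheck and the helper toMiB are made up for illustration and are not part of WEKA.

```java
public class HeapCheck {

    // Converts a byte count to whole mebibytes (1 MiB = 1024 * 1024 bytes).
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() is the upper limit the JVM will try to use for the heap,
        // which is what -Xmx controls.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap available to this JVM: " + toMiB(maxBytes) + " MiB");
    }
}
```

Compile it with javac HeapCheck.java and run it with the same flag you use for WEKA, for example java -Xmx2g HeapCheck. The number printed can be a little lower than the value you asked for, because the JVM keeps part of the heap for its own bookkeeping.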

You can set it permanently using RunWeka.ini if you are on Windows.

See http://weka.wikispaces.com/Java+Virtual+Machine#Invocation

Cheers!

Thank you all for the very specific suggestions. I actually did change the configuration file, and I was wondering if there was anything else I could do. My OS allows me a maximum allocation of 1400m (1.4g), so I guess it comes down to upgrading the hardware. Also, is it normal for some of the regression algorithms to take more than 5 minutes on a numeric class attribute? I was wondering whether this is generally the case or whether I was doing something wrong.

Didn't know you could do it permanently; good to know.

JLDml wrote:

Thank you all for the very specific suggestions. I actually did change the configuration file, and I was wondering if there was anything else I could do. My OS allows me a maximum allocation of 1400m (1.4g), so I guess it comes down to upgrading the hardware. Also, is it normal for some of the regression algorithms to take more than 5 minutes on a numeric class attribute? I was wondering whether this is generally the case or whether I was doing something wrong.

I have had the same experience. It may take some tweaking to find a way to digest this amount of data.  For instance, you might be able to process parts of the data in parallel and then combine the results by averaging them.  This article just scratches the surface:

http://pubs.rgrossman.com/dl/proc-058.pdf
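To make the "process in parallel, then average" idea a bit more concrete, here is a rough sketch in plain Java. It does not use WEKA at all: a one-variable least-squares line stands in for whatever regression algorithm you would really train, and the names ChunkedAveraging, Line, fitChunks and predict are invented for the example. It also keeps all the data in two arrays for simplicity; to actually save memory you would load each chunk separately.

```java
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ChunkedAveraging {

    /** A fitted line y = intercept + slope * x (stand-in for a real model). */
    static final class Line {
        final double intercept;
        final double slope;

        Line(double intercept, double slope) {
            this.intercept = intercept;
            this.slope = slope;
        }

        double predict(double x) {
            return intercept + slope * x;
        }
    }

    /** Ordinary least-squares fit on rows from (inclusive) to to (exclusive). */
    static Line fit(double[] x, double[] y, int from, int to) {
        int n = to - from;
        double meanX = 0, meanY = 0;
        for (int i = from; i < to; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0;
        for (int i = from; i < to; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        // If every x in the chunk is identical, fall back to a flat line.
        double slope = (sxx == 0) ? 0 : sxy / sxx;
        return new Line(meanY - slope * meanX, slope);
    }

    /** Splits the rows into equal-sized chunks and fits one model per chunk in parallel. */
    static List<Line> fitChunks(double[] x, double[] y, int chunks) {
        int n = x.length;
        return IntStream.range(0, chunks)
                .parallel()
                .mapToObj(c -> fit(x, y,
                        (int) ((long) c * n / chunks),
                        (int) ((long) (c + 1) * n / chunks)))
                .collect(Collectors.toList());
    }

    /** Combines the chunk models by averaging their predictions. */
    static double predict(List<Line> models, double x) {
        return models.stream().mapToDouble(m -> m.predict(x)).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Synthetic data: y = 3 + 2x plus noise (made up for the demonstration).
        int n = 1_000_000;
        double[] x = new double[n];
        double[] y = new double[n];
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) {
            x[i] = rnd.nextDouble() * 10;
            y[i] = 3 + 2 * x[i] + rnd.nextGaussian();
        }
        List<Line> models = fitChunks(x, y, 8);
        System.out.println("Averaged prediction at x = 5: " + predict(models, 5.0));
    }
}
```

Each chunk only needs enough memory for its own rows while it is being fitted, which is the point of the approach. Whether averaging gives results close to a single model trained on everything depends on the algorithm and the data, so it is worth comparing on a sample that does fit in memory.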
