Using Weka on large data-sets

JLDml (Posts 7, Thanks 1, Joined 4 Apr '11)

Hi,

Could someone share ideas on how to train the algorithms implemented in WEKA on large data-sets without running out of memory? Is it possible?

Right now, the largest training set I have managed is 2,000 instances, and a set that small is not serving me very well.

Any help or suggestions in this regard would be appreciated.

Thanks in advance!

 
Jason Morris (Posts 11, Thanks 3, Joined 2 Apr '11)

To do this in WEKA, you have to raise the Java memory (heap) limit for the program manually. Use the following steps:

1.  Find the directory of your WEKA installation, specifically the one containing the file weka.jar (for me this was C:\Program Files\Weka-3-6).

2.  Open your command prompt and change to the directory that contains weka.jar.

3.  Once you are in that directory, use the following command:  java -Xmx2g -classpath weka.jar weka.gui.GUIChooser

4.  After you press Enter, WEKA opens its GUI and you can start working with your data.  Be aware that the setting only lasts for that session: once you close WEKA it reverts to the default, so any time you want to use WEKA on a large dataset you will have to run this command again.

Note:  In this example I set the maximum Java heap size to 2 GB.  That is the -Xmx2g part of the command, where 2g means 2 gigabytes.  If you want 128 MB, type -Xmx128m instead, and so on.  If you have more questions, try this link:

http://old.nabble.com/Changing-heap-size-for-knowledge-flow-interface-td9368828.html
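
If you want to confirm that the -Xmx setting actually took effect, one option is a tiny Java program that asks the JVM for its maximum heap. This is only a sketch: the class name HeapCheck and the method maxHeapMegabytes are made up for illustration and are not part of WEKA. Runtime.maxMemory() is the standard Java call.

```java
// Hypothetical helper, not part of WEKA: reports the heap limit of the running JVM.
class HeapCheck {

    /** Maximum heap size in megabytes, as reported by the JVM. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Compile it with javac HeapCheck.java and run it with the same flag you give WEKA, for example java -Xmx2g HeapCheck. The number reported should be in the neighbourhood of 2 GB; the exact figure varies between JVMs.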

Thanked by JLDml
 
Pablo Ruggia (Posts 7, Thanks 8, Joined 3 Jun '11)

You can set it permanently using RunWeka.ini if you are on Windows.

See http://weka.wikispaces.com/Java+Virtual+Machine#Invocation

Cheers!

Thanked by JLDml and Jason Morris
 
JLDml (Posts 7, Thanks 1, Joined 4 Apr '11)
Thank you all for the very specific suggestions. I actually did change the configuration file, and I was wondering if there was anything I could do in addition. My OS allows me a maximum allocation of 1400m (1.4 GB), so I guess beyond that it'd be a matter of upgrading the hardware. Also, do some of the regression algorithms normally take more than 5 minutes to run on a numeric class attribute? I was just wondering whether this is generally the case or whether I was doing something wrong.
 
Jason Morris (Posts 11, Thanks 3, Joined 2 Apr '11)

I didn't know you could do it permanently. Good to know.

 
Jason Morris (Posts 11, Thanks 3, Joined 2 Apr '11)

JLDml wrote:

Thank you all for the very specific suggestions. I actually did change the configuration file, and I was wondering if there was anything I could do in addition. My OS allows me a maximum allocation of 1400m (1.4 GB), so I guess beyond that it'd be a matter of upgrading the hardware. Also, do some of the regression algorithms normally take more than 5 minutes to run on a numeric class attribute? I was just wondering whether this is generally the case or whether I was doing something wrong.

I have had the same experience. It may take some tweaking to find a way to digest this amount of data.  For instance, you might be able to process parts of the data in parallel and then combine the results by averaging.  This article just scratches the surface:

http://pubs.rgrossman.com/dl/proc-058.pdf
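
To make the "process in parallel, then average" idea a bit more concrete, here is a rough sketch in plain Java. It is not WEKA code and it is not the method from the article: the class ChunkedAverage, the nested Line class, and the methods fit, fitChunks and predict are all invented for this example, and a simple least-squares line stands in for whatever regression algorithm you actually use. It fits one small model per chunk of the data and averages the predictions of those models.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical sketch: none of these names come from WEKA.
class ChunkedAverage {

    /** A fitted line y = slope * x + intercept. */
    static final class Line {
        final double slope;
        final double intercept;

        Line(double slope, double intercept) {
            this.slope = slope;
            this.intercept = intercept;
        }

        double predict(double x) {
            return slope * x + intercept;
        }
    }

    /** Ordinary least-squares fit using only rows [from, to). */
    static Line fit(double[] x, double[] y, int from, int to) {
        int n = to - from;
        double meanX = 0, meanY = 0;
        for (int i = from; i < to; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0;
        for (int i = from; i < to; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxx == 0 ? 0 : sxy / sxx;
        return new Line(slope, meanY - slope * meanX);
    }

    /** Fits one model per chunk in parallel. Assumes 1 <= chunks <= x.length. */
    static List<Line> fitChunks(double[] x, double[] y, int chunks) {
        int n = x.length;
        return IntStream.range(0, chunks)
                .parallel()
                .mapToObj(c -> fit(x, y,
                        (int) ((long) c * n / chunks),
                        (int) ((long) (c + 1) * n / chunks)))
                .collect(Collectors.toList());
    }

    /** Combines the per-chunk models by averaging their predictions. */
    static double predict(List<Line> models, double x) {
        return models.stream().mapToDouble(m -> m.predict(x)).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Synthetic data on the line y = 3x + 1, used only to exercise the sketch.
        int n = 1000;
        double[] x = new double[n];
        double[] y = new double[n];
        for (int i = 0; i < n; i++) {
            x[i] = i;
            y[i] = 3.0 * i + 1.0;
        }
        List<Line> models = fitChunks(x, y, 4);
        System.out.println("Averaged prediction at x = 10: " + predict(models, 10.0));
    }
}
```

Note that this sketch keeps all the data in memory and only shows the split-and-combine step. To actually save memory, each chunk would have to be loaded and trained on separately (for example from separate files), keeping only the small fitted models around for the averaging.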

Thanked by JLDml
 
