
I am facing serious problems running R on my 64-bit IBM machine with only 4 GB of RAM: I run out of memory very quickly, which is getting frustrating, as I know I could extract just that little bit more if I could get the computation done without worrying about memory or CPU usage.

Is it possible to run R on Amazon's cloud service, i.e. rent a Windows/Linux instance (preferably 64-bit) with much more memory? Has anybody done this? More importantly, what would be the cost of doing, say, 12-hour modeling runs a few times a week? Can Kaggle wangle us a discount? It should be good publicity for Amazon (better than the cloud being used to hack into Sony!)

Hey Karan-

I don't know R, but AFAIK out-of-memory errors do not depend on the amount of RAM you have, but rather on the amount/size of virtual memory the application sees. On a 64-bit machine this virtual address space should be practically unlimited (~O(2^64)) and will be limited only by the amount of available disk space (i.e. swap file).

The available RAM should only accelerate the operations (i.e. fewer page faults) but not limit the virtual memory or cause "out of memory" errors for the application.

So - is there an option to see the amount of memory your R application sees? Or its blocks of free memory (i.e. the maximal size of an array you can create at any point)?

Make sure that R takes advantage of this 64-bit memory space.
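For anyone who wants to check this from within R, a few introspection commands cover it (a sketch, not a complete recipe; note that memory.limit() exists only on Windows builds of R):

```r
.Machine$sizeof.pointer                         # 8 on a 64-bit build of R, 4 on 32-bit
gc()                                            # runs a garbage collection and prints memory-usage statistics
print(object.size(numeric(1e6)), units = "Mb")  # memory consumed by a specific object
memory.limit()                                  # Windows only: the memory ceiling R will use, in MB
```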

I would be interested in this as well.  I have used their S3 and CloudFront services, but only tried the EC2 stuff real quick, got frustrated and gave up.  I give up quickly on stuff that requires SSH - so don't let that deter you.

http://aws.amazon.com/ec2/pricing/

has pricing and what not.

You might want to look at this -- I am not sure whether it will do what we want:

http://user2010.org/tutorials/Chine.html

If you try it and it works - let me know!

I use Amazon's EC2 running R with RStudio frequently. I can point you at two decent intros that should help you:

http://www.drewconway.com/zia/?p=2701
http://inundata.org/2011/03/30/r-ec2-rstudio-server/

It's fairly easy to set up and get running. Chris posted links to the pricing. If you have any trouble getting it set up, post it on here and I'll see if I can help.
Wild Cherry notes that "The available RAM should only accelerate the operations (i.e. fewer page faults) but not limit the virtual memory or cause 'out of memory' problems for the application." While that is technically true, the real issue is the pattern of memory use by the application. If there are large areas of virtual memory that have been allocated but are rarely used, then the system may be able to swap them out to disk in a reasonable amount of time and leave more physical RAM for use by active vectors, data frames, etc. But if, as is often the case, the user's program runs all over memory, actively using data that can't all fit in physical RAM simultaneously, then the system may "thrash", constantly paging to and from disk, thus slowing program execution down severely.

Note that a call to gc() in R forces an immediate "garbage collection" to free up unused memory space, and also prints some statistics regarding memory usage.

I'm finding that although R is a fairly powerful system for doing data analysis, it can be a real memory and CPU "hog". I'm trying to optimize my programs by vectorizing and parallelizing code, and also by using the "data.table" package to optimize table lookups.

-- Dave Slate
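To make these suggestions concrete, here is a minimal R sketch of the techniques mentioned above (assuming the data.table package is installed; the data is made up for illustration):

```r
library(data.table)

gc()  # force an immediate garbage collection and print memory statistics

# Vectorized code is usually far faster than an explicit loop:
x <- runif(1e6)
system.time(y1 <- sqrt(x))     # vectorized: one call over the whole vector
system.time({                  # element-by-element loop: much slower
  y2 <- numeric(length(x))
  for (i in seq_along(x)) y2[i] <- sqrt(x[i])
})

# data.table keyed lookups use a binary search instead of scanning the whole table:
dt <- data.table(id = 1:1e6, val = runif(1e6))
setkey(dt, id)
dt[J(123456L)]  # fast lookup of the row with id == 123456
```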
The bioconductor AMI includes R and RStudio, and will quickly get you up and running with R on Amazon EC2. You can use RStudio to access the server without using SSH. Getting data on and off the server can be tricky, though, and if you want to use packages like "multicore" to take full advantage of EC2 servers, you'll need to get familiar with running R through SSH.

http://www.bioconductor.org/help/bioconductor-cloud-ami/

Thanks for the responses. It seems it's possible, but I really don't want to do any shell scripting etc.; that's way beyond my capabilities.

Is there a way I can use it through a remote desktop login into a virtual machine and do my stuff in R/Rattle there? Also, any approximate estimate regarding the cost?

FWIW, Opani is an interesting start-up that has a free cluster available for you to run R, Octave, and Python scripts:

http://opani.com/

Their paid plans use AWS and charge at-cost, so this is a gentler, simpler solution for those who can't be bothered to set up AWS on their own.

Karan Sarao wrote:
Thanks for the responses. It seems it's possible, but I really don't want to do any shell scripting etc.; that's way beyond my capabilities.

You don't need to do any shell scripting.  You can start up an instance of the bioconductor AMI, log in to it via PuTTY and then simply start R from the command line by typing "R" and hitting Enter.

Or you can use Opani.

I have built a "large" EC2 instance with 2 virtual cores and 8GB of RAM and 8GB of storage. The processing and memory resources have been more than enough so far. I may need to increase the storage space at some point. I tried to get away with a free "tiny" instance, but it did not have enough memory. My first month's bill was $6 and change. I expect this month's to be around $10.

Services for data analysis:
R through RStudio
MySQL

Services for administration:
phpMyAdmin (create MySQL database, tables and indexes)

Encryption:
RStudio and phpMyAdmin through Apache2 with SSL
MySQL over SSH
File transfer over SSH

Clients from my Windows7 laptop:
RStudio and phpMyAdmin - Firefox
MySQL - MySQL Workbench
Terminal - Putty
File transfer - WinSCP

I should probably create an AMI one of these days.

I am very happy with this setup so far because it is secure, very flexible, and the expandable storage on the machine allows me to archive my inputs and results every time I change my data or algorithm.

Apparently you can bid on unused CPU time on Amazon's EC2...

I picked up a spot instance of the "High-CPU Extra Large Instance" variety last night. I ended up paying about 23 cents per hour for four hours of usage, which is well less than half the "retail" cost. Spot prices haven't bounced much above that rate since May (when they were inexplicably about $1.00 per hour for a while), although you run a risk of them just shutting you off on the Spot instances (make use of frequent external backups).

So for roughly $1 I saved myself about 30 hours of compute time.

It's important to note that this did not, in fact, improve my HHP score. :-)

Wow.... Now I just need to figure out how to use Rattle or R on the cloud. Haven't touched the data in over a month; can't wait to put in Lab and Drugs and see if it improves!

ChipMonkey wrote:

Apparently you can bid on unused CPU time on Amazon's EC2...

I picked up a spot instance of the "High-CPU Extra Large Instance" variety last night. I ended up paying about 23 cents per hour for four hours of usage, which is well less than half the "retail" cost. Spot prices haven't bounced much above that rate since May (when they were inexplicably about $1.00 per hour for a while), although you run a risk of them just shutting you off on the Spot instances (make use of frequent external backups).

So for roughly $1 I saved myself about 30 hours of compute time.

It's important to note that this did not, in fact, improve my HHP score. :-)

However, if the spot price jumps back up to $1 again, your instance will terminate and you'll lose your work.

Karan Sarao wrote:

Wow.... Now I just need to figure out how to use Rattle or R on the cloud. Haven't touched the data in over a month; can't wait to put in Lab and Drugs and see if it improves!

I suggest trying RStudio.  The bioconductor AMI comes with it pre-installed.

However, if the spot price jumps back up to $1 again, your instance will terminate and you'll lose your work.

Agreed, which is why I mentioned frequent backups -- it's a good point to reinforce though.

What I've actually done is attach a second EBS volume to the Spot instance.  I'm running all of my work there, saving often to rotating files, and reasonably frequently backing up key content off of the system entirely (using S3 is probably Amazon's recommended approach, although I'm using rsync to a non-Amazon Unix box entirely under my control).  If the machine terminates, the separate volume persists (only the volume created at system startup is deleted), so no work is lost. Although I pay a tiny amount for the storage, it's still roughly a third of the total cost of the dedicated cycles.

I use the free micro instance for persistent code, storage, syntax checking and other minor testing, and the Spot instances for short bursts of high compute needs.  I think that's a decent model.  The few dollars in savings may not really be worth it, but I'm actually never risking more than about an hour's compute time.  It's a very workable option if you're cheap-like-me.  :-)

Just came across this blog post that might be useful.

Hi,
To get an idea of the RAM size on the machines of the participants - could whoever's willing post their machine configuration, or at least the RAM size?
As far as I'm concerned, I have a machine with 4GB of RAM and I find myself falling short.
The purpose of asking is to get a sense of whether my requirements for the machine are overboard. I guess I'm trying to remain aligned with the Occam's Razor principle.
I do appreciate any help!

I have more now, but a few weeks ago (when I was ranked higher) I had 6GB of RAM.

Conclusion: adding more RAM makes you lose your position...

I am not much of a programmer, so in some cases I don't know how to make things more efficient.  Also, I tend to overbook my workspace....

I am trying some of everything I can think of, but so far - everything that has worked for me can be done in 6GB.

I guess I'm trying to remain aligned with the Occam's Razor principle

Have you read the netflix papers? :)

I think Occam would tell you to buy more memory in case you need it - especially if you won't need to swap out motherboards.  I have almost never run up against hard drive or CPU limitations in the things I want to do - it always seems to be the memory!

I run all my models on a dual-core laptop (i5) with 4GB of memory. So far I've had very few memory issues; CPU seems to be the most limiting factor currently.

Chris Raimondi wrote:

Have you read the netflix papers? :)

I haven't read the Netflix papers. This philosophy has been drilled in by a professor at school, who'd say: "The best solution needn't be the most complicated." :)

