While it's a touchy subject in the software development world, I think it's generally not a good idea to put binary data into source control systems. There are many blogs that discuss the pros and cons, and some of the popular SCM manuals have sections on the topic. I generally set up a Data directory, which I exclude from version control, and create submissions, release, and working subdirectories. I include MD5 hashes of data files in every script I run that references or creates a data set, so I can confirm the run is reproducible, and I back up the Data directory using normal means.
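As a rough illustration of the hash check, here is a minimal sketch in Python. The helper names md5_of_file and verify_data_file are made up for this example, and it assumes the expected hash has been recorded in the script by hand after a known-good run.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_data_file(path, expected_md5):
    """Raise ValueError if the file's MD5 differs from the hash recorded in the script."""
    actual = md5_of_file(path)
    if actual != expected_md5.lower():
        raise ValueError(f"{path}: expected md5 {expected_md5}, got {actual}")
    return actual
```

One way to use this: run md5_of_file once on each input, paste the result into the top of the analysis script, and call verify_data_file before any processing starts, so the script fails loudly if the data has changed underneath it.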
"Make files" or scripts for generating submissions are a good idea, of course, and these things, which are basically source code, do benefit from source control.
Pastebins or codeboxes or snippet libraries are useful for keeping ad-hoc coding idioms handy.
Bookmarking and notebook services like Evernote and Instapaper are useful for caching handy papers, web pages, and other online references for studying. Use a URL shortener and put links to these references in the comments of your source files so you can find them again later.
More generally, it sounds like you are interested in the topic of "scientific workflows." This is an active research area, and there are many tools available that you can try freely. I've used some of these in the past with some success, but I think they are a long way from being compelling. Still, if you read a couple of papers on the topic, I am sure you can get some ideas about how to abstract your data contest (and other data mining / stats) workflows into modular components, and develop or borrow some best practices to accelerate your productivity.
Andy