Project Management software for Data Analysis

DanB's image Rank 2nd
Posts 58
Thanks 46
Joined 6 Apr '11 Email user

Are there any project management tools for data analysis (something that integrates a version control system, keeps track of relationships between data files and source code, etc.)?

While I'm at it, what larger data analysis community forums are there for asking this sort of question?

 
Sarkis's image Posts 41
Thanks 5
Joined 5 Apr '11 Email user

DanB wrote:

Are there any project management tools for data analysis (something that integrates a version control system, keeps track of relationships between data files and source code, etc.)?

I'm using Eclipse with PyDev for code editing and version control. It doesn't track relationships between data files and source code though. Having a tool like that would be really useful. I can make one in 2 years for $200k upfront and $200k upon delivery.

DanB wrote:

While I'm at it, what larger data analysis community forums are there for asking this sort of question?

I don't know of any larger data analysis community forums, but you could also search Google for machine learning forums.

 
Jeff Moser's image
Jeff Moser
Kaggle Admin
Posts 356
Thanks 178
Joined 21 Aug '10 Email user
From Kaggle

DanB wrote:

Are there any project management tools for data analysis (something that integrates a version control system, keeps track of relationships between data files and source code, etc.)?

Just curious: why not just have a simple directory structure of /data, /code, /submissions in a version control system like git and then do a commit after each submission? A commit would implicitly tie together/link data and code.

You could make your "master" branch what you submit to Kaggle and then have other branches for explorations that may or may not lead to a submission.
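
For illustration, the layout and the commit-per-submission habit might be scripted roughly as follows. This is only a sketch: it assumes Python, git on the PATH, and a repository already initialised in the project root. The function names (init_layout, commit_commands, commit_submission) are made up for the example.

```python
import subprocess
from pathlib import Path

# Directory names taken from the suggestion above.
LAYOUT = ("data", "code", "submissions")


def init_layout(root):
    """Create the data/code/submissions directories under root."""
    root = Path(root)
    for name in LAYOUT:
        (root / name).mkdir(parents=True, exist_ok=True)
    return [root / name for name in LAYOUT]


def commit_commands(label):
    """Build the git commands that snapshot the whole tree for one submission."""
    return [
        ["git", "add", "--all"],
        ["git", "commit", "-m", f"Submission: {label}"],
    ]


def commit_submission(root, label):
    """Run the snapshot commands inside root (needs an existing git repo)."""
    for cmd in commit_commands(label):
        subprocess.run(cmd, cwd=root, check=True)
```

Calling commit_submission(".", "first random forest") after each upload would then leave one commit per submission on whichever branch is checked out. Note that git does not record empty directories, so a folder only shows up in a commit once it contains a file.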

Thanked by DanB and Zach
 
Allan Engelhardt's image Posts 77
Thanks 29
Joined 28 May '10 Email user

Tool-specific packages, but the approach may be generally useful.

Long has a book on Stata workflow that is apparently very good (and part of the inspiration for dryWorkflow which is hot off the useR! 2011 conference).  That should cover two of the OP's favourite tools.

Thanked by DanB, Gabi Huiber, and Pablo Ruggia
 
DanB's image Rank 2nd
Posts 58
Thanks 46
Joined 6 Apr '11 Email user

Jeff,

I'm doing something close to what I think you are suggesting (using Mercurial).  I only use version control on my source code, and I recreate intermediate data files from source when necessary. Do you put your data in version control?  

When you say committing links the data and source, do you mean something beyond the fact that you committed them at the same time?  Unless you always run all your code at the same time, this still requires some care.

For instance, when I modify the source that creates a data file, I want a warning if I try to use that data file before updating it by re-running my modified source.  Can git do that?

 
Jeff Moser's image
Jeff Moser
Kaggle Admin
Posts 356
Thanks 178
Joined 21 Aug '10 Email user
From Kaggle

DanB wrote:

Do you put your data in version control?  

You could put the data from Kaggle, and any derived data that was used as input to generate your submission, into source control. The key idea is being able to reproduce any submission exactly.

With version control you can keep one simple file name ("submission.csv") instead of numbering your files (submission156.csv and so on).

Your commit directory snapshot would have everything you need to reproduce the submission exactly.

DanB wrote:

When you say committing links the data and source, do you mean something beyond the fact that you committed them at the same time?  Unless you always run all your code at the same time, this still requires some care.

You can have separate passes that generate any intermediate data, and that data can be versioned separately. The key thing is exact reproducibility for any Kaggle submission: you create a commit for every Kaggle submission.

DanB wrote:

For instance, when I modify the source that creates a data file, I want a warning if I try to use that data file before updating it by re-running my modified source.  Can git do that?

I'd suggest you create a simple makefile or script that creates your submission for you. It could do a diff of your data file(s) and source and generate a warning for you. Given this arrangement, any decent version control should work (Mercurial, git, etc).
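
As a rough sketch of what such a script could look like in Python: instead of a diff, this version compares file modification times, which is how make itself decides what is out of date. The function names and the dependency mapping are assumptions for the example, not part of any tool.

```python
import os


def stale_outputs(dependencies):
    """Return outputs that are missing or older than any of their sources.

    dependencies maps an output path to the list of source paths that
    produce it, the same information a makefile rule carries.
    """
    stale = []
    for output, sources in dependencies.items():
        if not os.path.exists(output):
            stale.append(output)
            continue
        built = os.path.getmtime(output)
        # A source modified after the output was written means the
        # output no longer reflects the current code.
        if any(os.path.getmtime(src) > built for src in sources):
            stale.append(output)
    return stale


def warn_if_stale(dependencies):
    """Print one warning per stale output; return True if all are fresh."""
    stale = stale_outputs(dependencies)
    for output in stale:
        print(f"WARNING: {output} is older than its source; re-run before use")
    return not stale
```

A submission script could start with something like warn_if_stale({"data/features.csv": ["code/make_features.py"]}) (hypothetical paths) and stop if it returns False. Modification times can be fooled by a fresh checkout or a copied file, so comparing content hashes is a stricter alternative.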

 
DanB's image Rank 2nd
Posts 58
Thanks 46
Joined 6 Apr '11 Email user

Jeff,

The makefile is a great suggestion. I don't know how I've never thought about using those for statistical work.

I was hoping to change my workflow in a way that goes beyond reproducibility (though that's important too). I'm hoping for something that improves my efficiency along the way. If you have other workflow suggestions, I'd love to hear them.

I just followed Allan's links, and Project Template looks useful. I'm also about to buy Long's Stata book.

Thanks guys!

 
andywocky's image Posts 18
Thanks 8
Joined 17 Jun '11 Email user

While it's a touchy subject in the software development world, I think it's generally not a good idea to put binary data into source control systems.  There are many blogs that discuss the pros and cons, as well as sections on this topic in some of the popular SCM manuals.  I generally set up a Data directory which I exclude from version control, and create submissions, release, and working subdirectories.  I include md5 hashes of data files in all scripts that I run that reference or create data sets, so I know it's reproducible, and I back up the Data directory using normal means.
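
To make the hash idea concrete, here is a minimal Python sketch of checking data files against recorded md5 hashes before a script uses them. The helper names md5_of and check_inputs are hypothetical, and the recorded hashes would be whatever you noted down when the data was created.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_inputs(expected):
    """Raise ValueError if any file's MD5 differs from its recorded hash.

    expected maps a data file path to the hash recorded for it.
    """
    for path, recorded in expected.items():
        actual = md5_of(path)
        if actual != recorded:
            raise ValueError(f"{path}: expected {recorded}, got {actual}")
```

A script that reads a data set would call check_inputs at the top with the hashes written into the script itself, so a changed or corrupted file stops the run rather than silently producing a different result.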

"Make files" or scripts for generating submissions are a good idea, of course, and these things, which are basically source code, do benefit from source control.

Pastebins or codeboxes or snippet libraries are useful for keeping ad-hoc coding idioms handy.

Bookmarking and notebook services like Evernote and Instapaper are useful for caching handy papers, web pages and other online references for studying.  Use a URL shortener and put a link to references in the comments of your source files for future reference.

More generally, it sounds like you are interested in the topic of "scientific workflows."  This is an active research area, and there are many tools available that you can try freely.  I've used some of these in the past with some success, but I think they are a long way from being compelling.  Still, if you read a couple of papers on it I am sure you can get some ideas about how to abstract your data contest (and other data mining / stats) workflows into modular components, and develop or borrow some best practices to accelerate your productivity.

Andy

Thanked by DanB
 
ChipMonkey's image Rank 84th
Posts 60
Thanks 14
Joined 20 Mar '11 Email user

For my own personal machinations, I use svn to keep a standing repository of my work which lets me check it in and out of multiple systems, particularly the Amazon EC2 cloud. I have a folder for R code, one for perl, one for SQL, and folders for notes/documentation.  I typically tag YYYYMM, although I've made a few branches when trying wild ideas (which haven't panned out) to keep the main trunk clean.

I've been meaning to move to git, by the way -- I like it more, I just haven't migrated yet.

I don't keep data in the repository; I come from a data warehouse background with strong ETL history, so I've got my scripts set up so they can rebuild everything from the original HHP data with minimal intervention, and I actually frequently do this when my database gets too cluttered with abandoned ideas. This helps with the reproducible requirement if nothing else.

For team work we've stood up a Google Site (http://sites.google.com/), although GitHub seems a viable option, as does any of a million other content management systems: Joomla, Mambo, or a MediaWiki.

I love wikis.

---Chip

 
Zach's image Rank 31st
Posts 292
Thanks 64
Joined 2 Mar '11 Email user

I think RStudio now has built-in git support, which is worth checking out if you use R.

Thanked by Gabi Huiber and DanB
 
 
Cloud9's image Posts 4
Joined 9 Jul '11 Email user

DanB wrote:

While I'm at it, what larger data analysis community forums are there for asking this sort of question?

I recently launched something in this area as a result of this very project, seeing that there was really no community for healthcare analytics professionals. WWW.healthdataweb.com is basically a venue for people who work with health data - data analysts, health economists and other health data professionals to share stories, ideas, best practices, ask questions and get feedback from a community of peers.

 
