Over the past year, I’ve been co-organizing a jointly-funded, large, international working group (with co-PIs Maria Dornelas, Mary O’Connor, and Andrew Gonzalez). The group has had several meetings now (October 2014, May 2015, June 2015) and will meet again next week with the goal of finishing ongoing projects and making major progress on several new ones. I’ve learned a lot about managing large collaborative working groups through this process, but one area I’ve been thinking a lot about this week is how to maintain good communication before, during, and after the working group, especially with respect to code.
Since our group is particularly large (> 25 participants) and since different combinations of participants have attended different meetings, communicating the important discussions, decisions about data and analysis, progress on projects, and general working group goals has been a challenge, and somewhat of a moving target. Overall, I think we’ve managed this generally well, by having semi-regular discussions among the co-PIs, having subsets of participants help on “core teams” to work on specific data, models, or manuscripts between working group meetings, and by using online collaborative tools such as Google Docs for writing manuscripts and keeping track of meeting notes.
As we prepare for a meeting that will be more strongly focused on using models and data analysis to test hypotheses, it is critical to also keep track of the code we are developing, and to ensure that members within our group (as well as future collaborators, reviewers, and manuscript readers) can understand what we did. In my own work, I rely on git and GitHub (www.github.com) to help me remember what I’ve been working on, to quickly resolve mistakes, and to share my code with collaborators or as Supplementary Material to a manuscript.
I’ve proposed that our working group use GitHub as the main place to keep track of and share code being developed for the projects. On GitHub, it is possible to request a private Organization for educational or research purposes, which allows the code to be developed privately within the group (e.g., only invited collaborators can view the Organization’s repositories), but later released publicly when the final product is ready. An Organization allows large teams (potentially with varying read/write permissions) to view, develop, edit, and comment on repositories from a central location – a great tool for a working group! I am aware that members of our group have varying levels of comfort with this tool (ranging from no familiarity, to an inactive account, to regular use, to complex collaborative workflows). My goal is for us to have a central “master” location for developing code (versus people’s personal repositories, laptops, or email) and a place to discuss issues that arise. I do not want it to become a barrier for participants who are not currently using or familiar with GitHub, or who are uncomfortable with developing code “out in the open”.
With this in mind, it seems useful for a large working group to develop a set of guidelines for its GitHub Organization. These would cover how to set up and use repositories, how to communicate, and preferred style formats, and would include links to cheatsheets or guidelines for novice users, as well as a README.md template that could be used for each new project repository. Researchers are generally very busy, and the guidelines should be concise and informative so they don’t end up TL;DR.
I’m curious whether others have used GitHub for large working groups, whether you developed a set of use guidelines, and how you felt the approach was received (Were participants on board with the plan? Did it help your code organization?). If you developed a set of basic guidelines, which points did you emphasize the most? What are the most important points to get across to GitHub users, from novice to advanced? Is there anything you see in the proposed guidelines below that seems unclear, too wordy, or incorrect? I’d love to hear what has worked (or not) for other teams.
Here’s what I’ve outlined so far:
Contains guidelines for collaborating in the GitHub Organization for the working group. In order to keep track of what we’ve done, enhance reproducibility, improve communication across groups, and to be able to easily track down and fix potential errors, we strongly encourage all sub-groups to use GitHub to store and edit code related to the main projects.
- repositories should have short, but informative, names, with no spaces
- README.md file – this file has brief text with the names of the main participants, the main questions being addressed (like a mini-abstract), and the main tools (R, R packages, datasets) that are needed to run the code. It should also include any important information someone new looking at the project would need to run or understand the code (e.g. the code needs to be run on a server, dependencies, known bugs, or the order in which multiple files are to be run).
- Please look at the README_template.md file in this repository as an example.
- If you are unfamiliar with markdown, you can find a cheatsheet here.
- code – we anticipate that most projects will be done using R scripts, but other code is OK, if it is useful/necessary. Ideally, the code will allow a user to go from raw data to final figures and results.
- DO NOT PUT DATA FILES ON GITHUB – While small data files can often be stored on GitHub without causing large issues, larger data files or files that are frequently edited may exceed GitHub’s storage space and cause problems downstream. It would be better to access the data within the code from its home on Dropbox (or elsewhere), and be really clear in your README file about which data you are using.
3. At a minimum, someone from each break-out group should update the repository at the end of the day (but feel free to commit your changes more frequently!).
- Please use short, but informative commit messages that clearly describe what the new changes accomplish. Ideally, each commit message represents a cohesive “chunk” or a single type of update (e.g. NOT “We made a lot of changes in 5 different files that all run models”).
4. Strive for clear communication among team members to keep track of changes, comments, and bugs in the code being developed.
- Use short, informative commit messages (see above)
- We don’t need to follow a really complex workflow (e.g. you can update files in the repositories directly instead of using pull requests), but we can use some of the tools to help us keep track of comments and issues
- Issues are a way for code collaborators to comment and receive notification about a particular question, bug, or opinion about the code, and to be able to join a broader discussion. You can read more about issues here.
- When submitting an issue, please try to briefly (1) describe the problem and exactly where in the code it occurs, (2) explain why, specifically, it is a problem for you, and (3) try to reproduce the error, suggest what might be causing the problem (line of code, commit, or data), and/or suggest a possible solution (if you have one).
5. As much as possible, follow consistent code format and style guidelines so that it can be more easily read and shared among team members.
- R Style guidelines can be found here
- Use TODO and FIXME comments to identify specific areas in the code that need attention
- Don’t worry too much about specific styles, but consistent naming practices, good use of white space, and informative comments can go a long way towards making collaborative coding better!
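As a rough illustration of how the TODO and FIXME convention could be put to use, here is a minimal sketch that lists those comments across the R scripts in a repository. It is written in Python purely for illustration and is not part of the proposed guidelines; the function names `find_tags` and `scan_repository` are hypothetical, and it assumes scripts end in `.R` and that comments start with `#`.

```python
import re
from pathlib import Path

# Matches a TODO or FIXME tag that appears after a '#' comment marker.
# Assumption: a '#' inside a string literal will also be treated as a comment.
TAG_PATTERN = re.compile(r"#.*?\b(TODO|FIXME)\b:?\s*(.*)")


def find_tags(text):
    """Return (line_number, tag, note) for each TODO/FIXME comment in text."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = TAG_PATTERN.search(line)
        if match:
            hits.append((number, match.group(1), match.group(2).strip()))
    return hits


def scan_repository(root, pattern="*.R"):
    """Map each script under root (hypothetical layout) to its tagged comments."""
    report = {}
    for path in sorted(Path(root).rglob(pattern)):
        hits = find_tags(path.read_text(encoding="utf-8", errors="replace"))
        if hits:
            report[str(path)] = hits
    return report
```

Calling `scan_repository(".")` from the top of a project repository would return a dictionary keyed by script path, which a break-out group could skim at the end of the day to decide which items deserve a GitHub issue.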