Lately, I’ve been trying to figure out how I might change the way my lab collects and handles data and carries out analyses. I think we’re doing an okay job, but I know we could be doing better; I’m just not exactly sure how! A key goal of this post is to get ideas from others to try to figure out approaches that might work for my lab. Please comment if you have ideas!
First, to describe the general way in which we collect, store, analyze, and publish data at present:
Data collection: Data collection is always done in a notebook or on a data sheet with pencil. I know that officially pencil is a no-no for lab notebooks, but this is how I was trained, with the argument that, for a lab that works with lots and lots of water, pencil is a safer way of recording things than pen. (Even with this general guideline to collect data in pencil, I remove any pen lying around the lab that looks like it has ink that wouldn’t hold up to a spill.) We used to have notebooks for individual people, but that was becoming a mess for collaborative projects, where one day’s data collection would be in one notebook and the next day’s in another. So, we’ve moved to a system where there is a lab notebook for each individual project. For certain types of data (especially life tables), we record the data on data sheets. The upside is that data collection is much more efficient with data sheets. The downside is that they are so much easier to misplace, which is a source of anxiety for me. Data sheets get collected in binders or, if there are only one or two of them, taped into a lab notebook, and I emphasize that they should get scanned as soon as possible. I tend to take a photo of the data sheets at the end of the day with my phone just to be on the safe side.
Data entry: Data are then entered into Excel and proofed to make sure there weren’t data entry errors. This means that data end up being stored both as a hard copy and in Excel files. Any files on the lab computer or my computers get automatically backed up to the cloud. Writing this post is making me realize that I don’t know how my lab folks back up their computers, and that that is something I should ask them about!
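For what it’s worth, one way the proofing step could be made more systematic is to have the same data sheet entered twice by different people and then compare the two copies in R. Here is a minimal sketch of what that might look like; the file names are hypothetical, and it assumes the two files have identical rows and columns:

```r
# A minimal sketch of double-entry proofing (hypothetical file names).
library(readxl)  # for reading Excel files

entry1 <- read_excel("lifetable_entry1.xlsx")  # typed in by person 1
entry2 <- read_excel("lifetable_entry2.xlsx")  # typed in independently by person 2

# Assumes both versions have the same rows and columns in the same order
stopifnot(identical(dim(entry1), dim(entry2)))

# TRUE wherever the two entries disagree (including where only one is missing)
disagree <- (entry1 != entry2) | (is.na(entry1) != is.na(entry2))
disagree[is.na(disagree)] <- FALSE

# Row and column positions to check against the original paper data sheet
which(disagree, arr.ind = TRUE)
```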
Data analysis: For the most part, we now carry out analyses in R. We share some code with each other, but people are mostly writing code on their own. We don’t have a central lab repository for code, but I did recently start an Evernote notebook where we can keep track of code for basic things that we need to do frequently (but perhaps not quite frequently enough to remember quickly!). I meet with students as they work out new analyses to talk about things like error structure, model terms, etc., but they are writing the code themselves, for the most part.
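To give a sense of the sort of thing that ends up in that Evernote notebook, here is a hypothetical example of a snippet we find ourselves rewriting just often enough to forget:

```r
# R has no built-in standard error function, so everyone ends up rewriting one.
se <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

# Quick check on some made-up clutch size data
clutch_sizes <- c(3, 4, 2, 5, 4, NA, 3)
se(clutch_sizes)  # standard error, ignoring the missing value
```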
Data publishing: We eventually create a folder containing all the data files and code for the analyses in a given paper. We used to not gather them together this coherently; then we moved to doing it once the paper was accepted; now I do it as soon as I start drafting the manuscript. In my opinion, a great benefit of the move towards publishing data along with papers has been that it gives all scientists extra incentive to keep their data files and code in a relatively easily retrievable form.
Okay, so that explains what we’re currently doing. Now, to move on to things I’m considering changing or that I think could be improved:
Data collection: I’ve been considering whether we should move to electronic notebooks (most likely Evernote, since we’ve been using that for other lab things). I think the biggest benefits would be that:
- people would probably write out more detail about lab experiments, because typing is faster than writing (though see concerns below),
- I could more easily keep an eye on what is going on in terms of data collection in the lab (which feels a little Big Brother-ish, but we’ve all had experiences like the “WHERE IS THE NOTEBOOK AND WHY AREN’T YOU WRITING MORE THINGS IN IT?” one relayed in this comment), and
- it would be easily searchable. This seems especially key as we have more and more projects in the lab. Right now, it can be hard for me to go back and find specific information (e.g., the temperature of a PCR reaction run in 2009 or the temperature at which we grew rotifers for a particular life table in 2010)*, especially if it is from before the point where we switched to project-based lab notebooks.
The downsides to this approach that I worry about are:
- whether people will tend to think “Oh, I’ll just fill in these details about experiment setup when I get home, after eating some dinner” and end up not writing as much (or, worse, forgetting to come back to it entirely),
- sometimes drawing diagrams is handy, and this would be harder (though that could probably be solved by quickly uploading a cell phone photo of a sketch), and
- something about not having a hard copy of data feels weird to me. (I realize that’s not the most scientific reasoning!)
I would love to hear from people who’ve tried out electronic lab notebooks about their experiences!
Data entry: I’m not sure that there’s a better way to do this, though I suppose there could be some fancy way of taking data directly from an electronic lab notebook and getting it into a data file. But I don’t anticipate us moving away from the general approach of typing everything into Excel and proofing it.
Data analysis: Two things I’ve started emphasizing to my lab are the Software Carpentry mantra that “Your primary collaborator is yourself 6 months from now, and your past self doesn’t answer email,” and that other people will eventually be looking through their data and code, so they need to make sure things can be understood by someone else (and that they won’t be embarrassed by the state of their files!). I think these are both really helpful.
The main things about data analysis that I want to change are:
- to build a better culture of people checking over each other’s data and code to look for errors, and
- to not have everyone reinventing the wheel in terms of analyses.
I’ve heard that some labs have set scripts that everyone uses for certain tasks. This sounds great in principle, but I have no idea how to implement it in practice. I feel like every specific analysis we do is different (say, in the error structure we need or whether we care about the interactions or whatever), though I imagine that is true of pretty much everyone. I would love to get ideas from others on how they handle this!
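To make the “set scripts” idea a bit more concrete, here is a minimal sketch (file, function, and variable names are all made up) of what I imagine: a single shared script of helper functions that every analysis sources at the top, so the routine data checks are standardized while the models themselves stay analysis-specific.

```r
# Sketch of a shared "lab_helpers.R" script (hypothetical names throughout).

# Read a data file and run the checks we always want:
# warn about duplicated rows and report missing values per column.
load_and_check <- function(path) {
  dat <- read.csv(path, stringsAsFactors = FALSE)
  if (any(duplicated(dat))) warning("Duplicated rows in ", path)
  message("Missing values per column:")
  print(colSums(is.na(dat)))
  dat
}

# An individual analysis script would then look something like:
# source("lab_helpers.R")                                  # the shared, standardized part
# lifetable <- load_and_check("data/lifetable_2010.csv")   # routine checks for free
# fit <- glm(survived ~ temperature * food_level,          # the analysis-specific part
#            family = binomial, data = lifetable)
```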
What does your lab do?
What do you think would be ideal?
Data publishing: I think I probably need to spend more time figuring out GitHub to see whether it would help with the data publishing process. And I’ll continue with my plan to always publish code and data with publications. In most cases, I think this will be done either as an appendix/supplement to the paper or via something like Data Dryad or FigShare. As I said above, I really like this approach in part because it helps emphasize the importance of making sure the data and code are saved in a way that they can be accessed and understood by others well into the future. What is your preferred way of publishing data and code? Am I totally missing an aspect of data publishing that I should be considering?
As I said at the beginning, I’d love to hear how other labs handle issues related to data collection, storage, analysis, and publishing. What works? What doesn’t work? And how did you make the shift to a new system?
*In my experience, the easiest way to find this information is to go to the end-of-semester write-up by the undergrad who worked on the project. They are the best at including all the nitty-gritty details in their write-ups!