Lately, I’ve been trying to figure out how I might change the way my lab collects and handles data and carries out analyses. I think we’re doing an okay job, but I know we could be doing better; I’m just not exactly sure how! A key goal of this post is to get ideas from others to figure out approaches that might work for my lab. Please comment if you have ideas!
First, to describe the general way in which we collect, store, analyze, and publish data at present:
Data collection: Data collection is always done in a notebook or on a data sheet with pencil. I know that officially pencil is a no-no for lab notebooks, but this is how I was trained, with the argument that, for a lab that works with lots and lots of water, pencil is a safer way of recording things than pen. (Even with this general guideline to collect data in pencil, I remove any pen lying around the lab that looks like it has ink that wouldn’t hold up to a spill.) In terms of notebooks, we used to have notebooks for individual people, but it was becoming a mess in terms of collaborative projects, where one day’s data collection would be in one notebook and the next day’s in another. So, we’ve moved to a system where there are lab notebooks for each individual project. For certain types of data (especially life tables), we record the data on data sheets. The upside is that data collection is much more efficient with data sheets. The downside is that they are so much easier to misplace, which is a source of anxiety for me. These get collected in binders or, if there are only one or two of them, taped into a lab notebook. And I emphasize that they should get scanned as soon as possible. I tend to take a photo of the data sheets at the end of the day with my phone just to be on the safe side.
Data entry: Data are then entered into Excel and proofed to make sure there weren’t data entry errors. This means that data end up being stored both as a hard copy and in Excel files. Any files on the lab computer or my computers get automatically backed up to the cloud. Writing this post is making me realize that I don’t know how my lab folks back up their computers, and that that is something I should ask them about!
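One low-tech way to proof, for what it’s worth, is double entry: two people type the same sheet independently, and any line where the files differ flags a likely typo. A toy sketch in the shell (file names and contents invented):

```shell
# Two independent typings of the same (made-up) data sheet
cat > entry_A.csv <<'EOF'
clone,rep,count
Bd5,3,42
Bd3,5,17
EOF

cat > entry_B.csv <<'EOF'
clone,rep,count
Bd5,3,42
Bd3,5,71
EOF

# diff prints only the mismatched lines; no output means the entries agree
diff entry_A.csv entry_B.csv || echo "Mismatch found: re-check the paper sheet"
```

Any flagged line gets resolved against the original paper sheet, which catches transpositions that a single proofreader tends to miss.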
Data analysis: For the most part, we now carry out analyses in R. We do some amount of code sharing with each other, but mostly people are coding on their own. We don’t have a central lab repository for code, but I did recently start an Evernote notebook where we can keep track of code for basic things that we need to do frequently (but perhaps not quite frequently enough to remember quickly!). I meet with students as they work out new analyses to talk about things like error structure, model terms, etc., but they are writing the code themselves, for the most part.
Data publishing: We eventually create a folder where we have all the data files and code for the analyses in a given paper. We used to not gather them all together as coherently as we do now. Then, we moved to a point where we would do this once the paper was accepted. Now, I do this as soon as I start drafting the manuscript. In my opinion a great benefit of the move towards publishing data along with papers has been that it provides all scientists with extra incentive to have their data files and code in a relatively easily retrievable form.
Okay, so that explains what we’re currently doing. Now, to move on to things I’m considering changing or that I think could be improved:
Data collection: I’ve been considering whether we should move to electronic notebooks (most likely Evernote, since we’ve been using that for other lab things). I think the biggest benefits would be that:
- people would probably write out more detail about lab experiments, because typing is faster than writing (though see concerns below),
- I could more easily keep an eye on what is going on in terms of data collection in the lab (which feels a little big brother, but we’ve all had experiences like the “WHERE IS THE NOTEBOOK AND WHY AREN’T YOU WRITING MORE THINGS IN IT?” one relayed in this comment), and
- it would be easily searchable. This seems especially key as we have more and more projects in the lab. Right now, it can be hard for me to go back and find specific information (e.g., the temperature of a PCR reaction run in 2009 or the temperature at which we grew rotifers for a particular life table in 2010)*, especially if it is from before the point where we switched to project-based lab notebooks.
The downsides to this approach that I worry about are:
- whether people will tend to think “Oh, I’ll just fill in these details about experiment setup when I get home, after eating some dinner” and end up not writing as much (or, worse, forgetting to come back to it entirely),
- sometimes drawing diagrams is handy, and this would be harder (though it could probably be solved by quickly uploading a cell phone photo of a sketch), and
- something about not having a hard copy of data feels weird to me. (I realize that’s not the most scientific reasoning!)
I would love to hear from people who’ve tried out electronic lab notebooks to hear their experiences!
Data entry: I’m not sure that there’s a better way to do this, though I suppose there could be some fancy way of taking data directly from an electronic lab notebook and getting it into a data file. But I don’t anticipate us moving away from the general approach of typing everything into Excel and proofing it.
Data analysis: Two things I’ve started emphasizing to my lab are the Software Carpentry mantra of “Your primary collaborator is yourself 6 months from now, and your past self doesn’t answer email” and that other people will eventually be looking through their data and code, so they need to make sure things can be understood by someone else (and that they won’t be embarrassed by the state of their files!). I think these are both really helpful.
The main things about data analysis that I want to change are:
- to get a better culture of different people checking out other people’s data and code to look for errors, and
- to not have everyone reinventing the wheel in terms of analyses.
I’ve heard that some labs have set scripts that everyone uses for certain tasks. This sounds great in principle, but I have no idea how to implement it in practice. I feel like every specific analysis we do is different (say, in the error structure we need or whether we care about the interactions or whatever), though I imagine that is true of pretty much everyone. I would love to get ideas from others on how they handle this!
What does your lab do?
What do you think would be ideal?
Data publishing: I think I probably need to spend more time figuring out GitHub and see if that would help with the data publishing process. And I’ll continue with my plan to always publish code and data with publications. In most cases, I think this will be done as either an appendix/supplement to the paper or via something like Data Dryad or FigShare. As I said above, I really like this approach in part because it helps emphasize the importance of making sure the data and code are saved in a way that they can be accessed and understood by others well into the future. What is your preferred way of publishing data and code? Am I totally missing an aspect of data publishing that I should be considering?
As I said at the beginning, I’d love to hear how other labs handle issues related to data collection, storage, analysis, and publishing. What works? What doesn’t work? And how did you make the shift to a new system?
*In my experience, the easiest way to find this information is to go to the end-of-semester write-up by the undergrad who worked on the project. They are the best at including all the nitty gritty details in their write-ups!
My biggest hangup with electronic lab notebooks is that everything I work with at the microscope has a high chance of being covered in salt water, so the electronics stay far away from the lab bench. How do you anticipate getting around that in a lab that also plays with water a lot?
Excellent question. I hadn’t really thought through the full logistics of this. I imagine that people would initially write some things down on paper and then transfer to the electronic notebook, but now that I type that out that seems potentially problematic. Plus, while most people in the lab have laptops, not all do. I also don’t know how to deal with that. I would love to hear how other people do this!
A Surface tablet would be most useful. I’m sure you can get a waterproof one (the NFL uses them). You can also handwrite and use text recognition while at the scope, and draw, too.
I think this notion of “getting a better culture” is one of the hardest things to achieve, mainly because it involves people and their behaviour. So I would be wary of anyone who claims to have the perfect formula!
Re Data: I think it’s important to distinguish between the raw data (that gets collected in the lab, e.g. the sequence) and the derived data on which the published research is based (e.g. the genotype and its frequency). As a reader of your paper, I would be more interested in the latter, but your lab archive will be more concerned with the former. As a discipline, I think we ecologists need to have a serious discussion about which kinds of data should be prioritized for archiving.
This is a really important point. We rarely talk about raw and derived data in ecology and the proper annotation, archiving and sharing of them.
Using Evernote as a code repository is interesting but, as you mention, why not use GitHub? That’s what it’s designed for. There’s a bit of a learning curve, but the benefits are well worth it.
Also, I think you’re missing a key piece, which is designing a data model. Obviously it’s important to think about how you’re storing your data physically, but it’s equally important to think about how you’re storing your data logically/conceptually. Hadley Wickham’s paper on Tidy Data is a great resource: http://vita.had.co.nz/papers/tidy-data.pdf
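To make the tidy-data idea concrete, here is a toy reshape in the shell (all data invented): the wide table hides a variable (treatment) in its column headers, while the tidy version gives each observation its own row. In practice you’d do this in R with a package like tidyr rather than awk; this is just a minimal sketch of the transformation.

```shell
# A wide table that hides a variable (treatment) in its column headers:
cat > wide.csv <<'EOF'
site,control,treated
pond_A,10,14
pond_B,7,9
EOF

# Reshape to tidy form: one row per (site, treatment, count) observation
awk -F, 'NR==1 {print "site,treatment,count"; next}
         {print $1",control,"$2; print $1",treated,"$3}' wide.csv > tidy.csv
cat tidy.csv
```

The tidy version is longer, but every column is now a single variable, which is what model-fitting and plotting functions expect.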
The main reason I’ve been using Evernote rather than GitHub is that I already regularly use Evernote but not GitHub. Not the best reason, I know! I had a short introduction to git/GitHub as part of the Software Carpentry workshop I attended, but still don’t feel very comfortable with it. I probably should invest more time in figuring it out.
And, yes, figuring out how to store data logically is a really important piece that I left out! As others note in the comments, it’s definitely not always intuitive, especially to people new to the lab, what the best way is to store data. An issue we’ve run into is that the format that makes the most sense for data sheets in terms of recording data is not the format that makes the most sense for the electronic data files. That can be a hard shift for students to make, too.
In our microcosm work, the most sensible arrangement of data sheets and electronic data files is exactly the same. It had never occurred to me that it would be really inconvenient if that weren’t the case. Now I’m really glad that it is!
I think you’re exactly right, for most people it makes sense to store data in the same way it’s collected, i.e. one big table with data on multiple observational units, columns that aren’t actually variables, lots of data duplication, etc. Unfortunately, this results in so much time wasted in processing and analyzing data, as well as in all sorts of data quality issues. I often cringe when I look at the data being collected by other grad students, but I’ve found it’s hard to convince people of the value of investing time up front in designing a good data model. It shocks me that ecology departments don’t make grad students take some sort of data management course before collecting data.
“It shocks me that ecology departments don’t make grad students take some sort of data management course before collecting data.”
Fair enough. What part of grad student training should be axed to free up the time? Honest question; see here for background: https://dynamicecology.wordpress.com/2014/10/08/what-should-ecologists-learn-less-of/
Yeah, great point about how one would find time to insert yet another course into the grad curriculum. Although I know that during my doctoral studies, data management became a nightmare early on due to the number of people on the project. Language barriers also reared their ugly heads, as we had non-native English speakers. We spent inordinate amounts of time early on just reconciling all the issues of different people using different approaches to data management. In the end we were able to devise a system that was optimal for all users, but I think the headaches could have been avoided. So rather than a formal course, maybe a better approach would be for the lab or field group to brainstorm in advance and create a system that allows for relatively easy use. For myself, once I established my system of analyses, I had about nine Excel sheets that had all the requisite formulae encoded, so that all I had to do was enter the raw data and everything took care of itself. I was also able to export raw data to other software packages and essentially do the same.
Good point, Jeremy; however, from my experience that’s more of an issue in undergrad than grad school. At my school, we only have to take four one-semester courses over the course of a master’s, and many struggle to meet that requirement because available courses are so specialized. A data-focused course would be applicable to almost everyone, I would imagine. Also, I think a short (maybe half-day) workshop would provide enough of an overview of the tools and concepts that students could then delve deeper into the areas most relevant to them on their own. No need for a full semester course. Having said that, I recognize that others might make the same argument for any number of other potential course offerings…
I still side with Jeremy on this one. Even though a particular program might be starved for course offerings, I think you run the risk of “instructional overkill” with a course of this nature. I recall my department requiring a semester-long course in scientific ethics when I was working on my doctoral degree. Ugh… yes, these are really important topics, but to be honest, a one-day workshop would have been more than sufficient. So from the perspective of the students, it was like watching paint dry…
Perhaps a better approach would be to implement data management instruction into a statistics course. Seems like an appropriate place for it.
GitHub is great for collaborating with code, probably less useful for data publishing (although you can link GitHub repositories to figshare for a DOI and so maybe could combine code and data through figshare). I have found that it is well worth the time investment to learn how to use git/GitHub. I would also say that while it is nice to be able to do this via the command line, RStudio has great git integration that makes it really easy for beginners (just start a new Rproject with git enabled) and GitHub actually does have a nice GUI for using git/GitHub (I started out using git through the GUI and definitely found it worthwhile).
“something about not having a hard copy of data feels weird to me. (I realize that’s not the most scientific reasoning!)”
I think that would feel weird to me as well, it was something hardwired into me with training.
I think my main issue is that I don’t really understand what I should be doing with git/GitHub. That is, the logic of what I’m accomplishing with it is still kind of fuzzy to me. Yes, I can push code there. And, yes, it would be helpful for avoiding keeping unneeded bits of code in my R script, just in case I decide later I do need them. If I’m doing an analysis on my own, what are the other advantages to GitHub?
I think feeling unsure of what to do with GitHub is common, and I know that people who’ve adopted it seem to feel like they can no longer live without it. I need to figure out what they’ve figured out that I haven’t!
I use it now with RStudio and love it, though my use only scratches the surface of what it’s for. Honestly, I think that if all I/we do is use it for code backup and version control, and it’s easy for other lab members or collaborators to see it, that’s good enough. One level up for me was learning how to freeze a version with Zenodo to get a DOI and include it in the literature cited. And if one day someone actually uses it and suggests improvements, I’ll be over the moon.
But the biggest upside? Now that I know my code is there for all to see I am way better about commenting and keeping code clean and organized! It’s not always slick code, but at least anyone can understand what I’m trying to do, including Future Me.
I would be very interested to hear more about how people handle these issues for more field-based projects. In my own group, we tend to have projects that are collaborative with multiple partnering agencies/labs. We may collect some additional dataset ourselves, which is then related to other ongoing data collection by collaborators. As the analysts, we are then responsible for tidying up the data (see Matt’s previous comment) and combining it with other datasets, which sometimes results in a lot of back and forth fact-checking. At present, this takes the form of a folder with all the original data files for a project, some R scripts that clean those up and output cleaned data files for further analyses. Some of us have made the leap over to GitHub/BitBucket, but it hasn’t been ideal for projects where only a small subset of folks on the project are able to use it (because the others are spending their time in the field) or where the datasets may be large and changing.
I would love to hear more thoughts on this, too! I know some people with field-based projects who have people in the field with cellular-enabled iPads (or some similar device), so the PI or collaborators back in their offices can watch the data coming in real time. But that’s about the extent of detail that I know about what they do.
Photograph, photograph, photograph everything. In this digital era, I do not understand why this practice is not widespread. As there is no way to predict what might happen to the original data (field data forms & lab notebooks), back it up every day. Fire, flood, theft, you name it… you can lose it. My practice for many years now is to have all personnel photograph these materials at the end of each day and send me the pic files. I save them on a hard drive & CD.
Better safe than sorry…
Good point! I like Meg’s idea to take the photo at the end of the day, and I keep trying to instill this practice in my students… but they haven’t experienced a catastrophe or missing data yet, so I don’t think they are as motivated.
Why not scan or use an app like CamScanner to convert to PDF? Though I suppose JPEG isn’t proprietary, so that’s better?
Scan scan scan. All raw data sheets get scanned. Then printed. Data entry is off the printed sheets, as, if you can’t read them, the scan isn’t good enough, so go back and scan again. And it provides a more bulletproof way to save the originals. It’s a nice way to create a clean chain of custody.
In my lab we are most concerned with sample tracking and (meta)data management, so we went with a custom scientific database designer that I had collaborated with during my postdoc: http://bigrosestudio.com/. Once we got our data model correct (which, as noted above, was a big deal!), it gives us both casual access (e.g., for undergrads archiving samples) and data mining capability via MySQL for analysis. I keep close watch on my group so that they are entering data as close to source as possible – fortunately we don’t have the water issue you do! Field data entry is possible when there is cell coverage; otherwise we have Excel templates that can be uploaded separately. There was an upfront cost (~$5K), but I was easily able to justify it in my data management plan, and the benefits have spread out to subsequent grants, so it was a good investment IMO. We are also working to tie in some of our outreach to the resource, so there have been multiple benefits of the custom solution.
In my experience, data checking is hard and entirely depends on enforcement. I’ve been in some labs where the PI was sloppy and so went the group. I’ve tried to build a more rigorous culture in my lab, maybe with ~75% success to date.
I believe your point about the example set by the PI is a great one. In my experience, it was not necessarily the group imprinting upon the bad behavior of the alpha monkey, but the bad behavior in and of itself really curtailing optimal function. So for example, I once had a PI who rarely showed up for work, preferring to “mail it in” from home. She also waited until the last minute for every deadline – whether it was experimental design, staffing, data entry, analysis, writing & publishing. These behaviors translated into everyone being frenzied, disorganized and prone to errors… not because they were sloppy, but because the PI painted them into a corner with no escape.
I’m really curious about using MySQL to manage ecology data. It seems like the ideal in many ways, but since the data are likely stored in a format quite different than they’re collected, I see a challenge getting the data into MySQL. Can you elaborate on your workflow for getting data from a notebook into MySQL. Did the database designer develop tools to help with this?
Also, I can see how this would be really useful for long term projects with a standard data collection protocol. However, if a new or short term project starts up with different data requirements, how easy is it for you to incorporate that into your existing database? Is your system flexible?
If there is one program that has made the biggest improvement in my research, it is MySQL (GitHub/Bitbucket come second for sure). In MySQL, it is easy to load data from Excel as a CSV file. You have to make sure some of the formatting is correct. This takes only a few minutes once you get the hang of it and is similar to other database programs. You can also easily link other data – say, site or trait data – to sample data. It is really powerful.
On top of that, you can import data from MySQL directly into R. There you can do all the subsetting and data formatting for specific analyses. That way you don’t have to do queries each time.
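As a concrete sketch of the CSV-to-database step: the snippet below uses SQLite’s command-line shell as a stand-in for MySQL (so it runs without a server); in MySQL the import line would instead be `LOAD DATA LOCAL INFILE`. All file, table, and column names are invented.

```shell
# A made-up sample table exported from Excel as CSV
cat > samples.csv <<'EOF'
sample_id,site,abundance
S1,pond_A,12
S2,pond_B,30
S3,pond_A,5
EOF

tail -n +2 samples.csv > body.csv   # drop the header row before import

# SQLite stands in for MySQL here; same idea, no server needed
sqlite3 lab.db <<'EOF'
CREATE TABLE samples (sample_id TEXT, site TEXT, abundance INTEGER);
.mode csv
.import body.csv samples
SELECT site, SUM(abundance) FROM samples GROUP BY site;
EOF
```

From R, the DBI package (with RMySQL or RMariaDB) can then pull the same table straight into a data frame for analysis.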
This is an important point – our experience is that there is certainly a range of what should be kept in our database vs. more ad-hoc media. One way that we’ve addressed this is to provide a database field that links to a file location on our local data server where unique data files are stored. So the uniquely-formatted data is still linked with corresponding metadata while still being formally stored. When the protocols/data formats become more widely used, we work with our collaborating designer to roll it into the more formally structured part of the database. This is a minor cost (~$500/data type) that I budget into grants that will produce that data type in abundance.
Relating to the second question: yes, my database designer has many collaborations implementing new tools for data entry. Sample barcoding is a basic one, although we are letting other labs work the bugs out before we adopt it ourselves. 😉 Certainly “virtuous cycles of reuse” happen – another advantage we’ve found of having a collaborating data scientist vs. an off-the-shelf solution. Another layer of this is that I own the code, which means I can do low-level modifications myself. All in all it fits our evolving work flow nicely. We can also integrate other resources like git and googledocs as we need that flexibility.
One other comment on the above: our database is stored in the cloud, but we also maintain a local server for some data (e.g., raw DNA sequence data) that do not port well to the cloud (cost me ~$600 to set up). Currently, my group can write to the server, which is backed up nightly to a different hard drive while at the same time changing file ownerships so that they are read-only. So we have redundant, unalterable copies of these data that are linked to our cloud-based metadata. Layers upon layers…
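For anyone wanting a starting point, a rough sketch of that nightly routine might look like this (paths and file names invented; a real setup would run it from cron, and rsync would be preferable to cp for large trees):

```shell
# Mirror the lab server to a second drive, then make the copies read-only
SRC=./lab_server
DEST=./backup_drive
mkdir -p "$SRC" "$DEST"
echo "ACGTACGT" > "$SRC/run1.fastq"         # stand-in for a raw data file

cp -a "$SRC/." "$DEST/"                     # copy everything, preserving attributes
find "$DEST" -type f -exec chmod a-w {} +   # archived copies can no longer be altered
```

Stripping the write bit on the backup copies is a cheap way to guarantee that a later mistake (or a confused script) can’t silently overwrite the archive.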
It is really interesting to hear about your experience! I would love to hear more. (Seriously: if you want to write a whole post on this, I know I am not the only person who would love to know more about what you’re doing.) Matt Hall (@mattd_hall) said via twitter this morning that he is moving to a barcoding system. I can see lots of ways where that would help us. The two most common data recording/entry errors we have are:
1. Confusing clone numbers and rep numbers (e.g., putting the data for Bd 5 rep 3 in the cell for Bd 3 rep 5 — this has made me wish that we named our clones Mabel, Daisy, etc.)
2. Being off by one day on field-related data. For each sampling day, we have at least the live count (which we scan for infection) and the preserved count (which we use to get density). Sometimes we also have chlorophyll samples, nutrient samples, etc. My poor grad student spent a huge amount of time lining up samples where one of them had gotten entered with the wrong day (generally off by just one day, so easy enough to figure out, but still quite tedious).
How do folks joining your lab learn MySQL? How did you make the switch to this approach? Who was involved in figuring out the data model?
Can you tell I’m really interested? 😉
As a person who has done a ton of biochemical and field research, I’ve experienced these issues too when it comes to coding & labeling samples, subsamples, etc. The Bd5-rep3 experience is a good one. I went to an alpha, numeric, Roman numeral system which seemed pretty foolproof. So for example, CDC25-IV, CDC25-V, CDC25-VI was the kind of system that allowed me to use a consistent and repetitive system with little chance for error.
Barcodes can be helpful in some situations. However, if samples, for example, are taken from liquid nitrogen, thawed on dry ice with ethanol, then maintained in a humidified environment, chances are the barcode will be in tatters. Caustic chemicals can do the same.
The other thing I found to be tremendously helpful was to either print or hand-write all labels IN ADVANCE of doing any experiments- opposed to the label-as-you-go approach. When you are thinking about liters, nanograms, beakers, flasks, sterile technique & so on- labeling becomes problematic. So I always strive to get all my coconuts in a row before geeking out on the science.
This is also related to teaching. Many (most) of our lab courses stress the importance of a lab notebook and every lab works on these skills. All our courses use traditional paper notebooks and pen, not by any policy, but inertia. Last week I sat in on a meeting with 10-12 representatives, each from a different local biotech company, to open a conversation about what our students learn and what these biotech companies value. Lab notebooks were of primary value at every company but, interestingly, every single company uses electronic notebooks. So, we’d probably be doing our students a favor if our courses migrated to electronic notebooks too.
This is an excellent point. For universities that have laptop requirements, it seems like it could be relatively doable to implement.
Learning git (or some other version control system) is absolutely, totally, 100% worth doing if you are writing code for real projects. I was skeptical before I learned it myself, but doing a project now without checking it in to git makes me feel really exposed. Git/GitHub provides a backup, against both hardware failures and your own mistakes (i.e., if you inadvertently bork your code you can easily un-bork it).
A side benefit is that you can keep your scripts tidier–you don’t have to leave blocks of old code commented out in them “just in case you need them again.” Delete freely, and let your git repository store all the old “might need it again, but probably won’t” code.
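For the skeptical, the delete-freely workflow can be sketched in a throwaway repository (file and variable names invented):

```shell
# Commit a script, delete an unused block, commit again, then use git's
# "pickaxe" search (-S) to find the commits where the deleted code appeared
git init -q demo
cd demo
git config user.email "you@example.com"
git config user.name "Demo User"

printf 'fit <- lm(y ~ x)\nold_diagnostic_plot <- TRUE\n' > analysis.R
git add analysis.R
git commit -qm "analysis, including old diagnostic code"

printf 'fit <- lm(y ~ x)\n' > analysis.R    # delete the unused block
git commit -qam "remove unused diagnostic code"

# Every commit that added or removed the string, newest first:
git log -S "old_diagnostic_plot" --oneline
# "git show <commit>:analysis.R" then prints the file as it was back then
cd ..
```

So the answer to “how do I get the old bits back?” is `git log -S "some string"` to find the commit, then `git show` to read the file as it existed there.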
Have you been looking at my scripts? 😉 How easy is it to search git to find those old bits of code?
As I said in response to a comment above, I think I still don’t really “get” git, in terms of how it would be useful. My computer is backed up via CrashPlan, which saves various versions of files automatically. It wouldn’t be as easy to find as with git, most likely, but I could reconstruct old code if I needed to. But people love git so much that there must be some reason why it’s better that I just haven’t grasped yet.
Regarding lab notebooks, towards the end of my PhD I switched over to a hybrid Evernote/paper system. Paper (either a big TOPS computation book in the lab or Rite-in-the-Rain/Field Notes Expedition for field work) just works too well in too many places to give up. Then at the end of the day, I use my phone and a scanning app to take pictures of the pages and add them into Evernote. One of the killer features of Evernote, in my opinion, is its ability to create searchable indexing of _handwritten_ text in images. This even works for my quite bad handwriting. So, this allows the best of both worlds: fully searchable, taggable, backed up and synced across all devices, shareable, plus the convenience of paper/pen. For code, +1 for git and GitHub; for sample tracking etc., I am working on a custom FileMaker database.
Ah, yes, I’d forgotten about the searchable handwritten text feature. That is definitely very nice, and is one of the things that people I know who use Evernote really like. But Margaret’s comment below about Evernote is a really important one to consider, too.
I totally agree with Margaret’s point; hence having both the original paper and the digital copies! Plus, you can export your entire Evernote database including all files etc as HTML, which is easily readable and/or parseable if Evernote goes out of business or something. I try to do this weekly, just to be safe.
I think this is the system we’re going to use. At least, we’re trying it out! We are still recording everything on paper, but then, at the end of each day, everything will be scanned (most likely via phones) and put into Evernote. Your endorsement of its OCR capabilities even for bad handwriting helped influence me! Thanks!
Awesome! It has worked really well for me. I have found that the “Scanner Pro” app for iOS consistently works the best for capturing notebook pages (auto-cropping, deskewing, increasing contrast, etc). I think it is ~$3 or so on the App Store. https://readdle.com/products/scannerpro5
Excellent post, seems like there’s a lot we could all learn by comparing notes on this kind of thing, so thank you for sharing. Has the NSF data management plan requirement had any influence on your practices, and do you describe this stuff there?
I’m curious if you think mobile devices have a role here. Equipped with everything from cameras and network connections, they could make it easy to both collect data and sync it to a central (and thus backed up) collection. Google docs or dropbox would be simple versions of this, though there are some promising apps now like ODK Collect (see the video tutorial https://play.google.com/store/apps/details?id=org.odk.collect.android&hl=en) where you can define your data form in an excel spreadsheet, and then it creates an easy data-entry interface on phone or tablet that automatically syncs to a central database.
My understanding is that electronic notebooks are de rigueur in the private sector (e.g., the pharmaceutical industry), so there are now many companies selling mature products that must be familiar with the usual hurdles regarding spills etc. Nature and Science both cover free and commercial offerings aimed at scientists, e.g. http://www.nature.com/news/going-paperless-the-digital-lab-1.9881 http://www.sciencemag.org/site/products/lst_20140613.xhtml
I would definitely echo what other commenters have said about raw data vs. final data, and the ‘data model.’ Talking about how to lay out data in a spreadsheet (e.g. each variable as a column, observations as rows) sounds so trivial, but in my experience it isn’t either common or easy. The temptation to add pretty formatting like colors and bold in Excel also gets in the way.
One thing I’d add is thinking about recording metadata. Where do you record the units that a measurement is made in, or the definitions for a categorical variable code or abbreviation used in data entry? This would also include things like who is recording the data, when, where, and for what experiment. In publishing data, this information is a lot more useful if it is captured in a machine-readable way, and some repositories support that much better than others: Dryad at least asks for location, time, and species coverage, things that other users might want to search by in the future. The KNB’s Ecological Metadata Language offers a richer and more flexible way to do this.
“Has the NSF data management plan requirement had any influence on your practices, and do you describe this stuff there?”
Not really. At least, not to date, but I should probably change that. The general data collection and management system I use in my lab is not all that different from the one I used as a grad student. The main exception is that things automatically back up to the cloud now, rather than me burning copies of my hard drive to CDs periodically. I have always been really careful about backing up data, based on having heard horror stories of buildings going down in flames (literally). My friend’s father worked in a building that burned down near the end of his PhD. He was only able to finish his dissertation because someone went into the burning building and grabbed the folder that contained all his figures. (This was well before the days of personal computers, let alone cloud backup!)
In terms of data publishing, I’d say that at first I was unsure, but now I’m a believer. If someone else can use data I collected, great. But the bigger value to me, as I mentioned in the post, is that I think it provides a bit more incentive to lab members to have all the metadata easily accessible, the files neatly organized, the code not full of kludges, etc.
“I’m curious if you think mobile devices have a role here.”
Something that came up in a comment above and via twitter is the potential to use barcodes for samples, which would then link with mobile devices. I am really intrigued by this possibility! Right now, the main way we use mobile devices is to take pictures of data sheets. In my opinion, there’s no reason not to do that at the end of every day.
I imagine that there’s a lot that mobile devices could do that I haven’t thought of. Part of why I thought Jonathan’s comment above was so interesting is that it makes me wonder what solutions someone with real expertise in data management could suggest. Part of why I wrote this post was feeling like there are options out there that would be valuable, but that I am completely unaware of.
Your comment and those of others is making me wonder if we should have a “Good Data Practices” intro that we do with all new lab members, that focuses on data management, tidy data, archiving, metadata, etc. I’d say that, by the time we are ready to publish a study, the metadata are in good shape. But before then, they aren’t always, and that’s a problem. This includes in cases where I want to quickly check on something, but realize I don’t have access to the right file, or am not sure if “Experiment 1” was the one that manipulated X or whatever.
I’m so glad this post has sparked such thoughtful discussion! It’s really getting me thinking more carefully about this. So, to come back to your first point: I think my data management plan for my next proposal will be much, much better and more thoughtful based on this.
When I read that you were thinking of using Evernote for your data collection / lab notebook, warning bells went off. I would strongly advise against using any sort of proprietary software for anything you want to keep long-term. What happens if you have all your notes with Evernote and then Evernote goes out of business? Or has a bug in it that evaporates all your notes? Or adopts a for-fee structure that doesn’t work for your budget? These aren’t crazy hypotheticals. I adopted Evernote fairly recently, and early on some of my notes disappeared for a few days (and then magically reappeared). Made me really hesitant to use Evernote for anything really important.
This is an excellent, really important point that I hadn’t considered at all. I guess I should look into other options!
I haven’t done much lab work — more field work. I’ve found data loggers to be much more efficient and better for backing up than paper. They save the transcribing step, which can be very time-consuming and adds an extra error-prone step. When I had field assistants, they had to make a backup on the device after every plot. At lunch and at the end of the day, they had to upload all the data onto a computer and email it to me. That meant the data was in three places — the device, the field computer, and the cloud — twice daily. Before the field season I write scripts to error-check the data, so if there’s a mistake (e.g., a subplot was missed or a plot was mislabeled) I catch it that evening simply by running the script on that day’s data. Then I can request a redo of the affected plot(s) the next day. I also have scripts prepared to convert data from raw format to useful format and to run initial analyses. That way I can easily see how things are going and catch any potential weirdness right away.
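For anyone curious what those evening checks look like, here’s a stripped-down sketch in Python (my actual scripts differ; the plot labels and the four-subplots-per-plot design are invented for illustration):

```python
import pandas as pd

EXPECTED_SUBPLOTS = {1, 2, 3, 4}                    # assumed design: 4 subplots per plot
VALID_PLOTS = {f"P{i:02d}" for i in range(1, 21)}   # assumed labels P01..P20

def check_day(df):
    """Return a list of problems found in one day's data."""
    problems = []
    # Mislabeled plots: any label not in the known set
    bad = set(df["plot"]) - VALID_PLOTS
    if bad:
        problems.append(f"unknown plot labels: {sorted(bad)}")
    # Missed subplots: each plot should have all expected subplots
    for plot, grp in df.groupby("plot"):
        missing = EXPECTED_SUBPLOTS - set(grp["subplot"])
        if missing:
            problems.append(f"plot {plot}: missing subplots {sorted(missing)}")
    return problems

# One (made-up) day's data, with a typo'd plot label
day = pd.DataFrame({
    "plot":    ["P01", "P01", "P01", "P01", "P99"],
    "subplot": [1, 2, 3, 4, 1],
})
for problem in check_day(day):
    print(problem)
```

Running it the same evening means the redo list for the next morning writes itself.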
What this system is not good for is ad hoc notes, diagrams, etc. Those things get jotted down on paper, typed or scanned daily, emailed, and consolidated in a Word file. This works OK for me, but likely doesn’t scale up well to a lab group.
I’m a fan of GitHub for sharing code. It could be used for sharing data within a group, too. And I also use SQL for large and complex data sets, though I’m not sure the learning curve is worth it unless you’ve got so much data that it becomes unwieldy as flat files.
Is GitHub really so different from Evernote? Yes, you still have the underlying git repository if the company goes under or you switch hosting, but all the other features (issue tickets, etc.) will be lost or won’t work quite the same.
Fair criticism. I’ve only started using GitHub fairly recently, also, but it seems a lot more stable to me than Evernote. The fact that the repos are clearly marked and easy to find on my computer makes it seem a bit safer. I have no idea where Evernote is storing stuff on my computer. (Though I could probably go find it if necessary.)
One option for data storage might be REDCap. It is a full-blown database management suite that allows you to enter data through a web browser. Projects are super easy to set up – it’s possible to create a fully functioning database from scratch in less than an hour. It’s incredibly easy to use.
It was originally designed for the clinical trials world but I could see it being very suitable for ecological data too. One of the good things about it is that it’s free to research centers, as I understand it at least – all you would need is a server and perhaps someone to set it up (I would imagine your IT department). I know that the institute for clinical health and research at Michigan have a REDCap installation…perhaps if you asked nicely they’d let you have a play.
If you coded up all of your existing data to match the structure of your REDCap projects you would also be able to import all of your existing data….
REDCap also allows for some validations (min/max/text/number/date…), which should cut down on at least some typos.
There are also export formats that allow very easy export to R/Stata/SPSS/SAS, including defining all of your variables appropriately. Alternatively, you can access the data through an API and pull it down from within R.
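As a sketch of what the API route looks like (the server URL and token here are placeholders; check your installation’s API documentation for the exact parameters it supports), exporting records boils down to one authenticated POST. In Python, using only the standard library:

```python
from urllib import parse, request

API_URL = "https://redcap.example.edu/api/"   # placeholder: your institution's REDCap server
API_TOKEN = "YOUR_PROJECT_TOKEN"              # issued per project by the REDCap admin

def export_payload(token, fmt="csv"):
    """Build the POST body for REDCap's 'export records' API method."""
    return {
        "token": token,
        "content": "record",   # export records
        "format": fmt,         # csv reads straight into R with read.csv
        "type": "flat",        # one row per record
    }

def fetch_records():
    """POST the export request and return the response body as text."""
    data = parse.urlencode(export_payload(API_TOKEN)).encode()
    with request.urlopen(request.Request(API_URL, data=data)) as resp:
        return resp.read().decode()
```

The same payload works from R via httr, or via the redcapAPI package, which wraps these calls for you.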
As far as water and computers go…apparently iPads and the like work quite well in sealed bags…
I can recommend it. Far easier to learn than SQL! It might get a bit awkward with large vegetation surveys though perhaps…
Last I heard they were also looking at making it work off line too…enter the data into an app and the next time the device connects to the internet it should sync. Useful for in the field…
I don’t know how close to completion this is though…
This is really interesting! I will definitely look into this. Thanks!
Pen and paper for collecting data in the field, then scanning the notebook now and then (it should probably be done more frequently, but I normally enter the data on a daily basis, so I am not too stressed out about it)
Google Sheets for data entry. Although I would like to set up a proper database for this, it almost always ends up in a flat data format. But for most of the things I do this is alright – no hugely complex data structures. I always use the validation tools in Google Sheets to ensure only valid values enter the sheet. Works pretty well, I think. I have collaborated on a couple of projects where several people contributed to the data collection, and in these projects being able to work in the same sheet at the same time was a huge advantage.
For collaborating on code – +1 for GitHub. As mentioned above there is a nice GUI on both mac and win, so it is really not that bad getting started.
For publishing data, I think Dryad is just fine. I would not try to use GitHub for that. And if you go with Dryad you can use the rdryad package to load the data directly from a script – a nice bonus 🙂
On the polls, in my experience it is not so much how many people check but how the checking is done. There are more and less rigorous ways to do it. Two people doing loose checks is not as good as one person following a well-thought-out protocol that is very likely to catch errors. It’s hard to be more specific because it depends so much on what you’re doing. But at a minimum, if you’re not catching a lot of errors, you know you’re not being rigorous.
In the software engineering world they sometimes seed intentional known errors and then assess how many of them are caught as a measure of their thoroughness.
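A toy version of that seeding idea in Python (the sentinel value and the “checker” here are invented for illustration; in practice the planted errors should resemble realistic mistakes):

```python
import random

def seed_errors(records, n_seed, seed=42):
    """Plant n_seed deliberate errors; return the corrupted copy and the seeded row indices."""
    rng = random.Random(seed)
    data = [dict(r) for r in records]
    planted = set(rng.sample(range(len(data)), n_seed))
    for i in planted:
        data[i]["value"] = -999  # obviously-wrong sentinel standing in for a realistic error
    return data, planted

def catch_rate(flagged, planted):
    """Share of planted errors that the checking protocol actually flagged."""
    return len(flagged & planted) / len(planted)

records = [{"value": v} for v in range(100)]
data, planted = seed_errors(records, n_seed=10)

# Suppose the proofreading pass flagged these rows:
flagged = {i for i, r in enumerate(data) if r["value"] < 0}
print(catch_rate(flagged, planted))  # 1.0 here; in practice, a rate below 1 reveals gaps
```

The catch rate on the planted errors then serves as an estimate of how many real errors the protocol is missing.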
A post on rigorous ways to check data and code would be really interesting! (Hint, hint 😉 )
Well, here is my 2 cents’ worth: don’t overdo it. Meaning, the frequency of original errors goes waaay up as burn-out sets in. The efficacy of checking goes waaay down as burn-out takes hold. This is where exercising your role as a PI is really important. I always very carefully schedule data entry periods & data checking periods. I do not allow anyone working for me (or myself) to perform either or both of these tasks for more than two hours per day. Typically, I will schedule two 30-minute blocks before lunch and two blocks after lunch. I also ensure there is at least 90 minutes of “other” activity between blocks of data work. I know that doesn’t sound like much for those who accumulate scads of data. But, if you budget time accordingly, you will have much happier employees and much purer data to work with. In fact, our internal studies suggest this approach almost entirely eliminates data entry errors.
+1000 to what Margaret Kosmala said above. People think of paper as the ‘safe’ way to collect data, but it is actually *more* prone to errors because there are two steps at which they can be introduced – when recording on paper or when transferring from paper to computer. The more data points collected, the less likely it is that these errors will be caught, even with careful review or QA/QC, and some can’t be caught at all.
This error minimization is one advantage of digital collection in both field and lab, and there are a number of apps to do so…ahem…at http://brunalab.org/apps/. Epicollect is a good one because you create your own customized forms and the data can be auto-uploaded to the cloud. (The other major advantage: immediate availability for analysis.)
But if for some reason paper is essential, at least one can make the forms less error-prone by doing things like making users circle values from a limited range instead of filling in blanks, using pre-made number labels or barcode stickers, or just minimizing the amount of writing one has to do.
Just gave a talk on this a few weeks ago – maybe I’ll post it on slideshare and link here for others to give feedback on the content.
Pingback: Weekly links round-up: 10/04/2015 | BES Quantitative Ecology Blog
My institution is switching to Microsoft OneNote as our main notebook software and investing in Surface tablets. Margaret’s points about proprietary software above are well taken, but we acknowledge that people are not going to give up Excel (which I guess has a quasi-open file format), so we might as well share and version those files.
I still use quite a bit of paper at the lab bench, scan everything, and transcribe many of my hand-written notes. Multiple places that error can be introduced, but also multiple steps of review to catch error.
As a simpler alternative to Git I recommend http://fossil-scm.org/. An SCM is a much better way to share/track plain text than Evernote or the like, especially for your most frequently used code.
Pingback: Guest post: setting up a lab data management system | Dynamic Ecology
Pingback: My first experience with GitHub for sharing data and code | Dynamic Ecology