Note from Meg: This post is by Jonathan Klassen, who is an Assistant Professor of Molecular & Cell Biology at the University of Connecticut. He had a couple of really interesting comments on my post on collecting and managing data. I was really interested in learning more about the system he had set up, and suspected others would be too. I asked him if he could tell us more in a guest post, and am very happy he agreed!
Lab databases aren’t sexy. Figuring out what’s in all of those barely readable tubes in the -80°C freezer sounds like a nightmare to most of us. But consider the math: each grant (often worth hundreds of thousands of dollars) generates a freezer rack or two worth of samples, a hard drive full of data, and a folder full of metadata. Given direct and indirect costs, it’s not long before the value invested in these resources exceeds that of my mortgage! The bank doesn’t use the “smudged Sharpie and spreadsheet” filing system, though…
Once I wrote this great proposal about how I was going to use these wonderful samples, with their extensive meta- and experimental data, to answer big questions in ecology and evolution. Fantastic! And the reviewers agreed, sending me away funded to do noble things for the good of all. But you can guess what I found, right? Snowed under by freezer boxes. With smudged labels. And spreadsheets. Well, for some boxes at least. Actually, why are there three different spreadsheets for this box? And they all contradict each other! How many samples named “A1” does my lab HAVE!?! What do you mean “ask so-and-so”, they quit the PhD program YEARS ago!!
A better system was clearly needed. But what? After some brainstorming, we established five main goals for what would become our lab database: (i) store meta- and experimental data so that they remained unambiguously linked; (ii) redundantly back up all data; (iii) be compatible with existing metadata standards (e.g., the Genomic Standards Consortium projects); (iv) make it easy for neophytes to import and access data; and (v) allow programmatic access by power users. Importantly, our design was NOT an electronic lab notebook, but instead a system to record the experimental outputs so that they could be reused in perpetuity.
No existing product met these requirements. Although we could have spent months designing and implementing such a system ourselves, we quickly realized that it would be cheaper and more defensible to collaborate with a data management expert.* With their guidance, we developed a data model that defined the logical structure of our data and how different data types relate to each other. Our collaborator also helped us develop an identification scheme that removed the ambiguities plaguing the multiple-spreadsheet system and allowed us to clearly cross-reference different data types**.
My lab studies the ecology and evolution of fungus-growing ants and their microbial symbionts. These ants cultivate a symbiotic fungus garden and protect it from microbial pathogens using (among other things) antibiotic-producing bacteria. We collect a large number of ant colonies each summer, and use genomics to characterize the ants and their microbial symbionts. This generates a large amount of data, currently on the order of ~100 colonies, ~800 samples, and ~800 strains per year.
After thinking about how our data relate to each other, and making (and breaking!) several database prototypes, we settled on the following database design:
Each box represents a database “book” containing all information pertaining to a specific data type. Each arrow represents a cross-reference between records belonging to different data types.

At the highest level, we organize our data into broad projects (e.g., “fungus-growing ants”) and sub-projects (e.g., individual sampling trips). This efficiently summarizes our data, e.g., for grant and permit reports.

Because we collect entire ant colonies, “host” is the next logical layer. These data pertain to the entire ant colony (e.g., where and when it was collected, nest structure, soil temperature, etc.) and are compliant with the MIxS guidelines. Each “host” entry is cross-linked to all “project” and “sub-project” information, clearly linking these different conceptual levels together.

Next in our workflow, we preserve multiple sub-samples from each colony, e.g., for vouchers and microbiome analyses. Metadata for each tube is stored in the “sample” book, e.g., what each tube contains, where it is stored, and how it is preserved. Each of these “sample” records is cross-referenced to a matching “host” record, thereby preserving links to colony-level metadata. “Metagenome” and “community amplicon sequence” books contain MIMS- and MIMARKS-compliant metadata, respectively, and are linked to “sample” and “host” metadata.

Similar to “sample”, the “strain” book contains metadata for each microbe that we isolate (e.g., where the strain is stored and in what storage medium) and cross-references them to “host” and “sample” as appropriate. Sequence data associated with each strain are kept in the MIMARKS- and MIGS-compliant “amplicon sequence” and “genome” books. Note that data can be cross-referenced to some books but not others (e.g., “image” can refer to a “host”, “sample”, or “strain”), or none at all (e.g., strains purchased for a culture collection).
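To make this layered “book” structure concrete, here is a minimal sketch of how it might map onto relational tables. All table and column names are my own illustrative choices, not the lab’s actual schema, and SQLite stands in for their MySQL backend:

```python
import sqlite3

# Illustrative sketch only: table/column names are hypothetical,
# with SQLite standing in for a MySQL backend.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE project (project_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE host (                  -- one record per ant colony
    host_id INTEGER PRIMARY KEY,
    project_id INTEGER REFERENCES project(project_id),
    collected_on TEXT,               -- MIxS-style colony metadata
    latitude REAL, longitude REAL
);
CREATE TABLE sample (                -- one record per preserved tube
    sample_id INTEGER PRIMARY KEY,
    host_id INTEGER REFERENCES host(host_id),
    freezer_location TEXT, preservation TEXT
);
CREATE TABLE strain (                -- one record per isolated microbe
    strain_id INTEGER PRIMARY KEY,
    host_id INTEGER REFERENCES host(host_id),     -- NULL for purchased strains
    sample_id INTEGER REFERENCES sample(sample_id),
    storage_medium TEXT
);
""")

# One colony, one sub-sample, one isolated strain
conn.execute("INSERT INTO project VALUES (1, 'fungus-growing ants')")
conn.execute("INSERT INTO host VALUES (1, 1, '2016-07-04', 30.5, -84.3)")
conn.execute("INSERT INTO sample VALUES (1, 1, '-80C rack 2, box 3', 'glycerol')")
conn.execute("INSERT INTO strain VALUES (1, 1, 1, 'YMEA slant')")

# A strain record stays unambiguously linked to its colony-level
# metadata by following the cross-references upward.
row = conn.execute("""
    SELECT host.collected_on, sample.freezer_location
    FROM strain
    JOIN host   ON strain.host_id   = host.host_id
    JOIN sample ON strain.sample_id = sample.sample_id
    WHERE strain.strain_id = 1
""").fetchone()
print(row)  # ('2016-07-04', '-80C rack 2, box 3')
```

The point of the sketch is the arrows: each lower-level record carries only the identifier of the record above it, so colony-level metadata is stored once and reached by a join.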
We also use our database for more standard lab management tasks, e.g., chemical and supply inventories*** and lab scheduling. This part of the database also includes lab protocols and scripts, which can link to Google Docs and GitHub for collaborative work. We also store papers, theses, grant applications, reports, posters, and presentations in our library to make them available to the entire lab.
Our database is almost entirely web-based, and is therefore accessible from any device that can host a web browser. This means that we can even use it while in the field, allowing data entry on site. Our system is compatible with sample barcoding, although we have not gone quite this far yet (our collaborators have). Data entry is done through web forms or by importing template spreadsheets. The website also has a “cart” function that lets you search the database and export matching data as a table or multi-record .fasta file (similar to using NCBI). This makes it easy to submit data to public databases (e.g., Figshare or NCBI)****. Most of these data are securely and confidentially stored in the cloud, thanks to the expertise of our data management collaborator. The only exceptions are large data files that we instead maintain locally*****. For advanced users, the database can also be directly searched using MySQL. Both novice and advanced users can therefore efficiently interact with our stored data at their level of computational expertise.
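In spirit, the “cart” export works like the toy function below: filter records that match a search, then dump the matches as either a tab-separated table or a multi-record .fasta file. The records, field names, and function are entirely hypothetical, just to show the idea:

```python
# Hypothetical sketch of a "cart"-style export: search the stored
# records, then export the matches as a table or a multi-record .fasta.
records = [
    {"strain_id": "JK001", "host": "host_A1", "seq": "ATGCCGTA"},
    {"strain_id": "JK002", "host": "host_B7", "seq": "ATGAAGTC"},
    {"strain_id": "JK003", "host": "host_A1", "seq": "TTGCCGAA"},
]

def export_cart(records, host, fmt="tsv"):
    """Return all records matching `host`, formatted as TSV or FASTA."""
    hits = [r for r in records if r["host"] == host]
    if fmt == "fasta":
        # One ">header\nsequence" entry per matching record
        return "\n".join(f">{r['strain_id']}\n{r['seq']}" for r in hits)
    header = "strain_id\thost"
    return "\n".join([header] + [f"{r['strain_id']}\t{r['host']}" for r in hits])

print(export_cart(records, "host_A1", fmt="fasta"))
# >JK001
# ATGCCGTA
# >JK003
# TTGCCGAA
```

Because every record already carries standards-compliant metadata, an export like this is most of the way to a public-database submission.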
Bottom line: does it work? Yes, quite well! Although with the caveat that we are still developing a lab culture of data stewardship and that enforcement is still required. However, I think that having such a well-developed system helps to show my lab that data management is a priority. Additionally, our ongoing data management collaboration allows us to creatively innovate in this area. Was the cost worth it? Yes! There are some upfront costs (in the thousands of dollars), but this is easily justified in grants or start-up negotiations to support the achievability and rigor of our data management plans. And in retrospect, it would have cost far more for us to create an equivalent product ourselves. I therefore highly recommend collaboratively creating a customized lab database as a lab data management strategy.
*The company that we collaborate with is Big Rose Web Design (UPDATE: link fixed), a specialist in custom scientific data management solutions. And I can’t say enough good things about them – you should really all go and buy 3-4 databases! (No, I don’t have any vested interest.) Importantly, they have extensive wet lab experience and a unique understanding of how labs actually function, unlike many generic LIMS providers.
**Formally, this attribute is inherent to databases themselves: each record has a unique identifier. This is a key difference between databases and spreadsheets that is often overlooked. Cross-linking data is easy in MySQL: each data type has its own identifiers, and records reference each other by those identifiers. Such cross-links are preferable to a single “master table” because they easily accommodate many-to-one data relationships: a change to high-level data (e.g., host) automatically propagates to all lower levels (e.g., sample, strain, genome, etc.), because those lower-level records store only a reference to the host’s identifier rather than a copy of its data.
***Our chemical safety officers REALLY like how our chemical inventory includes an MSDS for each item. The biosafety officers similarly REALLY like our sample and strain lists.
****In principle, the database could output submission-ready files directly, e.g., for submitting sequence data to NCBI. But implementing this remains for the future.
*****We maintain a dedicated data server for our raw data that both backs up nightly and converts file permissions such that lab members can read but not edit files once they’ve been uploaded. Not quite as good as distributed cloud backup, but far cheaper over the long run. File locations on this server are included in the web database as metadata.
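The “read but not edit once uploaded” policy amounts to stripping the write bits from each file after upload. A minimal sketch (the function name is mine; the real server presumably does this as part of its nightly job):

```python
import os
import stat
import tempfile

def lock_uploaded_file(path):
    """Remove write permission so the file can be read but not edited.
    (Hypothetical helper illustrating the permission-conversion idea.)"""
    mode = os.stat(path).st_mode
    # Clear the write bits for owner, group, and others
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))

# Simulate an uploaded raw-data file, then lock it
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"raw sequencing reads")
    path = f.name
lock_uploaded_file(path)

writable = os.stat(path).st_mode & stat.S_IWUSR
print(bool(writable))  # False
```

Locking files this way turns the server into a cheap append-only archive: new data can be added, but nothing already deposited can be silently changed.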