Note from Meg: This post is by Jonathan Klassen, who is an Assistant Professor of Molecular & Cell Biology at the University of Connecticut. He had a couple of really interesting comments on my post on collecting and managing data. I was really interested in learning more about the system he had set up, and suspected others would be too. I asked him if he could tell us more in a guest post, and am very happy he agreed!
Lab databases aren’t sexy. Figuring out what’s in all of those barely readable tubes in the -80°C freezer sounds like a nightmare to most of us. But consider the math: each grant (often worth hundreds of thousands of dollars) generates a freezer rack or two worth of samples, a hard drive full of data, and a folder full of metadata. Given direct and indirect costs, it’s not long before the value invested in these resources passes that of my mortgage! The bank doesn’t use the “smudged sharpie and spreadsheet” filing system, though…
Once I wrote this great proposal, about how I was going to use these wonderful samples that had extensive meta- and experimental data to answer big questions in ecology and evolution. Fantastic! And the reviewers agreed, sending me away funded to do noble things for the good of all. But you can guess what I found, right? Snowed under freezer boxes. With smudged labels. And spreadsheets. Well, for some boxes at least. Actually, why are there three different spreadsheets for this box? And they are all contradictory! How many samples named “A1” does my lab HAVE!?! What do you mean “ask so-and-so”, they quit the PhD program YEARS ago!!
A better system was clearly needed. But what? After some brainstorming, we established five main goals for what would become our lab database: (i) store meta- and experimental data so that they remained unambiguously linked; (ii) redundantly back up all data; (iii) be compatible with existing metadata standards (e.g., the Genomic Standards Consortium projects); (iv) make it easy for neophytes to import and access data; and (v) allow programmatic access by power users. Importantly, our design was NOT an electronic lab notebook, but instead a system to record the experimental outputs so that they could be reused in perpetuity.
No existing product met these requirements. Although we could have spent months designing and implementing such a system ourselves, we quickly realized that it would be cheaper and more defensible to collaborate with a data management expert.* With their guidance, we developed a model that defined the logical structure and how different data types related to each other. Our collaborator also helped us develop an identification scheme that removed the ambiguities plaguing the multiple-spreadsheet system and allowed us to clearly cross-reference different data types**.
My lab studies the eco-evolution of fungus-growing ants and their microbial symbionts. These ants cultivate a symbiotic fungus garden and protect it from microbial pathogens using (among other things) antibiotic-producing bacteria. We collect a large number of ant colonies each summer, and use genomics to characterize the ants and their microbial symbionts. This generates a large amount of data, currently on the order of ~100 colonies, ~800 samples, and ~800 strains per year.
After thinking about how our data relate to one another, and making (and breaking!) several database prototypes, we settled on the following database design:
Each box represents a database “book” containing all information pertaining to a specific data type, and each arrow represents a cross-reference between records belonging to different data types.

At the highest level, we organize our data into broad projects (e.g., “fungus-growing ants”) and sub-projects (e.g., individual sampling trips). This efficiently summarizes our data, e.g., for grant and permit reports.

Because we collect entire ant colonies, “host” is the next logical layer. These data pertain to the entire ant colony (e.g., where and when it was collected, nest structure, soil temperature, etc.) and are compliant with the MIxS guidelines. Each “host” entry is cross-linked to all “project” and “sub-project” information, clearly linking these different conceptual levels together.

Next in our workflow, we preserve multiple sub-samples from each colony, e.g., for vouchers and microbiome analyses. Metadata for each tube is stored in the “sample” book, e.g., what each tube contains, where it is stored, and how it is preserved. Each of these “sample” records is cross-referenced to a matching “host” record, thereby preserving links to colony-level metadata. “Metagenome” and “community amplicon sequence” books contain MIMS- and MIMARKS-compliant metadata, respectively, and are linked to “sample” and “host” metadata.

Similar to “sample”, the “strain” book contains metadata for each microbe that we isolate (e.g., where the strain is stored and in what storage medium) and cross-references it to “host” and “sample” as appropriate. Sequence data associated with each strain are kept in the MIMARKS- and MIGS-compliant “amplicon sequence” and “genome” books.

Note that data can be cross-referenced to some books but not others (e.g., “image” can refer to a “host”, “sample”, or “strain”), or to none at all (e.g., strains purchased for a culture collection).
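To make the “book” structure concrete, here is a minimal sketch of how the cross-references described above might look as relational tables. This uses Python's built-in sqlite3 as a stand-in for the lab's MySQL backend, and all table and column names are illustrative assumptions, not the lab's actual schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Each "book" becomes a table; each arrow becomes a foreign-key column.
cur.executescript("""
CREATE TABLE project (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE host (
    id INTEGER PRIMARY KEY,
    project_id INTEGER REFERENCES project(id),
    collected_on TEXT,               -- MIxS-style colony metadata lives here
    latitude REAL, longitude REAL
);
CREATE TABLE sample (
    id INTEGER PRIMARY KEY,
    host_id INTEGER REFERENCES host(id),
    freezer_location TEXT, preservation TEXT
);
CREATE TABLE strain (
    id INTEGER PRIMARY KEY,
    host_id INTEGER REFERENCES host(id),
    sample_id INTEGER REFERENCES sample(id),  -- cross-references are optional
    storage_medium TEXT
);
""")

# Every record gets a unique auto-assigned id, so two tubes can never
# share a name the way "A1" did across multiple spreadsheets.
cur.execute("INSERT INTO project (name) VALUES ('fungus-growing ants')")
cur.execute("INSERT INTO host (project_id, collected_on) VALUES (1, '2016-07-01')")
cur.execute("INSERT INTO sample (host_id, freezer_location) VALUES (1, '-80C rack 2, box 3')")

# A purchased culture-collection strain can exist with no host or sample at all:
cur.execute("INSERT INTO strain (host_id, sample_id, storage_medium) VALUES (NULL, NULL, 'glycerol')")
con.commit()
```

The same layout extends naturally to the “metagenome”, “genome”, and “image” books: each is just another table whose foreign-key columns point at whichever existing records it describes.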
We also use our database for more standard lab management tasks, e.g., chemical and supply inventories*** and lab scheduling. This part of the database also includes lab protocols and scripts, which can link to GoogleDocs and Github for collaborative work. We also store papers, theses, grant applications, reports, posters, and presentations in our library to make them available to the entire lab.
Our database is almost entirely web-based, and is therefore accessible from any device that can run a web browser. This means that we can even use it while in the field, allowing data entry on site. Our system is compatible with sample barcoding, although we have not gone quite that far yet (our collaborators have). Data entry is done through web forms or by importing template spreadsheets. The website also has a “cart” function that lets you search the database and export matching data as a table or multiple .fasta files (similar to using NCBI). This makes it easy to submit data to public databases (e.g., Figshare or NCBI)****. Most of these data are securely and confidentially stored in the cloud, thanks to the expertise of our data management collaborator. The only exceptions are large data files, which we instead maintain locally*****. For advanced users, the database can also be directly searched using MySQL. Both novice and advanced users can therefore efficiently interact with our stored data at their level of computational expertise.
Bottom line: does it work? Yes, quite well! Although with the caveat that we are still developing a lab culture of data stewardship and that enforcement is still required. However, I think that having such a well-developed system helps to show my lab that data management is a priority. Additionally, our ongoing data management collaboration allows us to creatively innovate in this area. Was the cost worth it? Yes! There are some upfront costs (in the thousands of dollars), but this is easily justified in grants or start-up negotiations to support the achievability and rigor of our data management plans. And in retrospect, it would have cost far more for us to create an equivalent product ourselves. I therefore highly recommend collaboratively creating a customized lab database as a lab data management strategy.
*The company that we collaborate with is Big Rose Web Design (UPDATE: link fixed), a specialist in custom scientific data management solutions. And I can’t say enough good things about them – you should really all go and buy 3-4 databases! (No, I don’t have any vested interest.) Importantly, they have extensive wet lab experience and a unique understanding of how labs actually function, unlike many generic LIMS providers.
**Formally, this attribute is inherent to databases themselves: each record has a unique identifier. This is a key difference between databases and spreadsheets that is often overlooked. Cross-linking data is easy in MySQL by using different identifiers for each data type. Such cross-links are preferable to a single “master table” because they easily accommodate many-to-one data relationships: changes to high-level data (e.g., host) automatically propagate to all lower levels (e.g., sample, strain, genome, etc.).
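This propagation is just a consequence of storing each fact once and joining on identifiers, rather than copying host metadata into every sample row. A small sketch (again using sqlite3 as a stand-in for MySQL, with hypothetical table names):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE host (id INTEGER PRIMARY KEY, site TEXT);
CREATE TABLE sample (id INTEGER PRIMARY KEY,
                     host_id INTEGER REFERENCES host(id),
                     tube TEXT);
""")

cur.execute("INSERT INTO host (site) VALUES ('old site name')")
cur.executemany("INSERT INTO sample (host_id, tube) VALUES (1, ?)",
                [("A1",), ("A2",)])

# Correct the host-level metadata exactly once...
cur.execute("UPDATE host SET site = 'corrected site name' WHERE id = 1")

# ...and every sample sees the fix through the join; nothing is
# duplicated, so nothing can fall out of sync.
rows = cur.execute("""
    SELECT sample.tube, host.site
    FROM sample JOIN host ON sample.host_id = host.id
    ORDER BY sample.id
""").fetchall()
# rows == [('A1', 'corrected site name'), ('A2', 'corrected site name')]
```

In a master-table layout, the same correction would have to be repeated in every row that mentions the host, which is exactly how the contradictory-spreadsheets problem arises.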
***Our chemical safety officers REALLY like how our chemical inventory includes an MSDS for each item. Our biosafety officers similarly REALLY like our sample and strain lists.
****In principle, the system could output submission-ready files directly, e.g., for submitting sequence data to NCBI. But implementing this remains for the future.
*****We maintain a dedicated data server for our raw data that both backs up nightly and converts file permissions such that lab members can read but not edit files once they’ve been uploaded. Not quite as good as distributed cloud backup, but far cheaper over the long run. File locations on this server are included in the web database as metadata.
Question: how much of the value of a system like this is in the data management itself (making data easier to access and share, etc.) vs. the culture and practices of data management that the system instills in (or forces on!) everyone in the lab? I ask because the (all too familiar!) examples of bad data management with which the post begins sound to me like bad practices, like writing “ask so-and-so” as a substitute for metadata. Bad practices that, in principle, could be fixed with good paper-based data management just as much as by a custom database.
Second question: how much of the value of a system like this depends on the volume and variety of data the lab generates? Ecology labs vary a *lot* on these dimensions. My lab for instance is probably (?) on the small side for research university ecology labs in terms of the volume of data we generate, but perhaps unusually (?) varied in terms of the kinds of data we generate since each grad student has his/her own project and those projects often are totally unrelated (everything from spatial synchrony in protist microcosms to character displacement in bean beetles to plant-pollinator interaction networks in alpine meadows). Presumably there’s some threshold of overall volume and/or variety of lab activity outside of which a system like this isn’t worth the investment. But for someone who doesn’t yet have a system like this (which is most ecology labs, I’d guess), it could be hard to judge whether the investment would be worth it. Any guidance to offer?
Great questions, Jeremy.
re: paper-based data management vs. a database – I think that you are absolutely right that no matter what form they take, some form of good data management is key. I think that formal databases have unique advantages, especially by making redundant names impossible and enforcing a central, standardized data store. Whether you need a custom solution like mine or a more stock solution will depend on what kind of data you generate.
re: data types – I’ve seen a pretty wide gradient of how much data “belongs” to a single person vs. the lab more generally. At one extreme, you have places like clinical laboratories where everything must be systematically archived forever using very sophisticated formal LIMS systems. At the other end, there are labs (perhaps more like yours?) where data from different projects will (can?) not likely be combined or reused once each project is finished. In that case, a formal lab system like mine may make less sense if all the data from each project can be properly archived elsewhere (e.g., in Dryad, supplemental data, or with the PI or project lead) and properly managed. My lab falls somewhere in between: most of the data that we generate will be reused by multiple projects, and we also have a responsibility to provide voucher specimens and microbes if requested by colleagues. So in my case, the opportunity cost of not being able to reuse or share our data clearly exceeds the cost of the database. As an attempt to generalize, I think that the more you tend toward “big” data, the more useful a formal database like this becomes (auxiliary issues of proper data management aside).
This sounds fantastic. I’ve been struggling with these issues for decades. What were the ballpark costs of the consult, products, etc.?
More questions from an old curmudgeon: How transportable are the databases? What happens when your grant runs out and you can no longer afford the cloud services? It’s absolutely crucial that, when that time comes, you the PI can actually access and manage the database. I once had a technician set up a database using Paradox, but it has since been lost to the mists of time and operating system upgrades.
Our start-up costs were not too bad (IMO), ~$8000 for the bad database and a public lab website as I recall. I do write improvements into my grants (e.g., when starting to use a new data type), typically $500 to a few thousand dollars per grant, depending on the complexity. My consultant, at least, has a stock set of database types that can be applied to different labs. It’s only when something more exotic needs to be coded from scratch that it gets more expensive. For example, we’ve proposed a project to let high-school students deposit and analyze citizen science data on our website and budgeted ~$1000/yr for the life of the grant to build the infrastructure.
Transportability is a great question! I own all of the code and databases, and actually do some of the maintenance myself (although my collaborator supports that too, typically without charge). Storing my database in the cloud costs me ~$200/yr. I know other folks with similar systems that store theirs on university servers, which requires working with university IT but is workable. And like I mentioned in the footnotes, I built my own server for ~$600 for data that was less suited for the cloud – one could easily put the entire database there. Certainly there are many solutions to keep the data from vanishing!
That should be “lab database”, not “bad database”!
I love that typo! Bad database, bad! 🙂
Your Big Rose Web Design link is broken
That’s great that you still own the code. Fancy putting it on GitHub for others to remix for their own needs?
I’ve talked that one over with the consultant, and he’s hesitant for business reasons, so I’m deferring to him as my collaborator. We’ve talked about doing a public version with a paper, etc., but this one doesn’t quite have legs yet.
Thanks for the great post! How universal vs. custom do you think a data system like yours is for lab-based ecology? You mention there were no off-the-shelf solutions when you looked. But your set of 5 (fairly universal) needs and the generic structure of relational databases make me think that it might be possible to design a single solution that would answer the needs of, say, 80% of ecology labs.
Thanks Margaret – This is a pretty common question, it seems. On one hand, I do think that there are parts of it that are very generalizable, e.g., lab management, and I see things like Quartzy (https://www.quartzy.com/) in this space. In my experience (now on my third database), what tends not to be very generalizable is the metadata model. For example, the logic I outlined above works well for our sequence data. But we have had to do an (ongoing) rethink to incorporate our bioassay data, which is ontologically quite different from sequence. This seems to be where customization is important: different labs have data with different ontological structures, and different database structures are needed to accommodate them. Another surprising thing that we found was that information overload strongly inhibited user uptake of the system. So for example, while we have only included the small subset of MIxS terms (out of hundreds) that are relevant to us in our system, other labs will inevitably need a different subset.
Interesting, and makes sense. Thanks for the reply!
I am working on exactly this problem to make life easier in labs. I am building a project and data management software tool mainly for bio labs.
Please get in touch with me as I would love to discuss this with you.