Yesterday I presented what I tongue-in-cheek (or arrogantly – take your pick) called “10 commandments for good data management”. In that post I laid out what I believe to be best practices for managing and analyzing scientific data. Key points were to separate a data entry copy from an analysis copy of the data and to organize them differently, to use row-column organization of raw data, to use a star schema, and to denormalize data before analysis.
Here I present a worked example. It is from a hypothetical survey of raptors (the data were actually generated by a computer simulation). It records abundances for a number of raptor species at a number of sites on a number of days. The sites are unimaginatively named alpha, beta, gamma, etc. Dates are in American (mm/dd/yyyy) format. The species names are real names of North American raptors. The abundances are made up. There are also temperature data for each of those sites on each of those days, plus some ancillary information on the sites (including lat/lon coordinates). In the terms of yesterday's post, it is a constellation schema. One fact table is abundance, with dimensions of time, site, and taxa. The other fact table is temperature, with dimensions of time and site. The data also contain a number of data-entry errors of the types typically seen.
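To make the constellation schema concrete, here is a minimal sketch of the three tables as R data frames. The rows and column names are invented for illustration (the actual CSV files below may use different headers), but the shape is the point: two fact tables sharing the site and time dimensions, and one dimension table with one row per site.

```r
# Fact table 1: abundance, with site, date, and taxa dimensions
# (toy rows; column names are assumptions, not the real file's headers)
abund <- data.frame(
  site      = c("alpha", "alpha", "beta"),
  date      = c("5/14/2016", "5/14/2016", "5/15/2016"),
  species   = c("Buteo jamaicensis", "Falco sparverius", "Buteo jamaicensis"),
  abundance = c(3, 1, 2)
)

# Fact table 2: temperature, with site and date dimensions (no taxa)
temps <- data.frame(
  site        = c("alpha", "beta"),
  date        = c("5/14/2016", "5/15/2016"),
  temperature = c(18.5, 21.0)
)

# Dimension table: one row per site with ancillary information
sites <- data.frame(
  site = c("alpha", "beta"),
  lat  = c(44.9, 45.3),
  lon  = c(-68.7, -69.1)
)
```

Note that site appears in all three tables and date in both fact tables; those shared columns are what make the two "stars" into a constellation.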
You can download the three data files here (a word of caution: you can open them in Excel or a word processor to look at them, but do NOT save them from there – it will cause you much grief – instead use your browser's "save attachment as" feature to put these files in the directory you want):
http://www.brianmcgill.org/raptor_survey.csv (the fact table, erroneously stored in dimensional format, with dimensions of site, time, and taxa)
http://www.brianmcgill.org/raptor_sites.csv (the site dimension table)
http://www.brianmcgill.org/raptor_temps.csv (a 2nd fact table of abiotic observations with site and time dimensions)
And you can download the R script here:
If you just put the three data files and the R script in your R working directory and run it, it will produce outputs. Just before the outputs it creates a dataframe called "thedata", which is a row-column, denormalized analysis table – the main point of this exercise. But the more interesting thing to do is to walk through the R script, executing it line by line and understanding what it does. We did this as a group in my lab last semester. I won't say I created a bunch of clones who will exactly copy my style (and I wouldn't want to), but people had a lot of fun talking through the code together and learning from the code and from each other. In fact, it inspired us to plan some code walkthroughs as a lab group this fall to continue the learning experience.
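The denormalization step that produces "thedata" boils down to joining the dimension and fact tables back onto the abundance fact table. Here is a hedged sketch of that step using toy data frames; the column names and join keys are assumptions for illustration, and the actual script may do more (e.g. error cleaning) before this point.

```r
library(dplyr)

# Toy versions of the three tables (column names are assumptions)
abund <- data.frame(site = c("alpha", "beta"),
                    date = c("5/14/2016", "5/15/2016"),
                    species = c("Buteo jamaicensis", "Falco sparverius"),
                    abundance = c(3, 1))
sites <- data.frame(site = c("alpha", "beta"),
                    lat = c(44.9, 45.3), lon = c(-68.7, -69.1))
temps <- data.frame(site = c("alpha", "beta"),
                    date = c("5/14/2016", "5/15/2016"),
                    temperature = c(18.5, 21.0))

# Denormalize: attach site info by site, then temperature by site + date,
# yielding one wide row per observation, ready for analysis
thedata <- abund %>%
  left_join(sites, by = "site") %>%
  left_join(temps, by = c("site", "date"))
```

Each left join keeps every abundance record and pulls the matching dimension or temperature columns alongside it, which is exactly what "denormalize before analysis" means in practice.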
For those of you without compatriots at hand, I hope you can use the comment section as a way to interact around this code. I will try to answer, but others are welcome to chip in too.
So although my approach is software neutral (indeed, you could load my 3 data files into SQL or Python or Matlab or Excel if you wanted to), the example I give is in R. Specifically, it uses three packages by Hadley Wickham (also the author of ggplot2): dplyr, tidyr, and lubridate. tidyr is used primarily for the gather and spread commands, which convert between dimensional and row-column format (Commandment #2). It also has a command to split one column into two, which my example uses to split species binomials and extract a genus name. lubridate is for working with dates and is especially useful for taking dates in a format like 7/4/2016 and converting them into a day of year (DOY) or month of year (MOY) for group-by analysis. dplyr is the core package. It essentially implements SQL-like operations that start with a row-column dataframe and let you add calculated columns, filter rows (or columns), sort rows, and do group-by summarise operations. It is well worth the learning curve if you are a regular R user. Teaching dplyr is beyond the scope of this blog post, but dplyr comes with a good vignette and a number of other good tutorials are out there (e.g. this and resources here). The dplyr/tidyr (aka data wrangling) cheat sheet from RStudio's cheat sheet page is an invaluable quick reference.
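To give a flavor of how the three packages fit together, here is a short toy pipeline. The function names (gather, separate, mdy, yday, mutate, group_by, summarise) are the real ones described above, but the data frame and its column names are invented for this sketch, not taken from the survey files.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# A small table in dimensional format: one column per site
wide <- data.frame(
  date    = c("7/4/2016", "7/5/2016"),
  alpha   = c(3, 2),                  # abundance at site alpha
  beta    = c(1, 4),                  # abundance at site beta
  species = "Buteo jamaicensis"
)

long <- wide %>%
  # tidyr: dimensional -> row-column (one row per site/date observation)
  gather(key = "site", value = "abundance", alpha, beta) %>%
  # tidyr: split the species binomial to get a genus column
  separate(species, into = c("genus", "epithet"), sep = " ", remove = FALSE) %>%
  # lubridate: parse mm/dd/yyyy, then convert to day of year
  mutate(doy = yday(mdy(date)))

# dplyr: group-by summarise, SQL-style
bysite <- long %>%
  group_by(site) %>%
  summarise(mean_abund = mean(abundance))
```

(If you are reading this with a recent tidyr, gather and spread have newer equivalents, pivot_longer and pivot_wider, but the older commands still work.)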
So walk through the code and have fun. And let me know what you like or don't like.
Update: check out Karl Cottenie’s comment below or visit his blog directly http://www.cottenielab.org/2017/02/standing-on-shoulders-of-giants.html for some tweaks and alternative opinions on approaches.