You asked us anything–here’s our latest answer!
What “nuts and bolts” methods and data analysis techniques should every ecologist be familiar with, no matter what their job? (Matt Ricketts)
Brian: At the top of the list have to be ANOVA, regression and Principal Component Analysis – I would think it would be hard to read very far without understanding these. Beyond that, I think I would prioritize a deeper understanding of these. What are the assumptions of regression? How important are the violations? What can regression tell us and not tell us?
Beyond that, generalized linear models (e.g., Poisson and logistic regression) are really important these days, as are mixed models/random effects (the old fashioned linear regression context, not necessarily the Bayesian). Again deeper understanding is important – getting to know the Bayesian vs Monte Carlo vs likelihood approaches to these techniques is worth the time.
As far as newer stuff, I always say the three most underutilized statistical methods in ecology are regression trees (and more generally machine learning), path analysis and quantile regression. These are not just elaborations and improvements on the core linear regression (as everything in my last paragraph was), but fundamentally different ways to approach scientific inference.
Jeremy: More or less what Brian said: general linear models (which subsumes linear regression, multiple regression, ANOVA, ANCOVA, and combinations thereof as special cases) is the big one. Then generalized linear models and the simplest dimensionality reduction techniques like PCA. Then maybe randomization tests and bootstrapping, and the simplest model selection/guarding-against-overfitting techniques: AIC and cross validation. Beyond that, I think it’s very much need- and preference-specific. Brian cites path analysis, regression trees, and quantile regression, but I’ve never had the least need for any of those things in my own work, am not nearly as enthusiastic about them as Brian is, and haven’t even needed to know much about them to read the work of others. Much more useful to me has been time series analysis techniques (both time domain and frequency domain).
I wonder how much interesting data lies mouldering in Excel files because folks don’t know about quantile regression? Since so many of our hypotheses deal with constraints, it breaks my heart to see a beautiful polygon skewered with a LSM regression line. Quantile regression expresses a concept ecologists should take to heart: drivers that are “Necessary but not sufficient” to account for a phenomenon.
Of the three underused methods Brian speaks of, the one that most intrigues me is decision tree/machine learning. Do you have in mind a handful of papers that have used it to great effect, or a nice tutorial? Christmas break can’t just be about eggnog and bingeing on Breaking Bad.
“I wonder how much interesting data lies mouldering in Excel files because folks don’t know about quantile regression?”
On the other hand, I wonder how often quantile regression is abused by people who have vague or seriously-confused ideas about what “constraints” are. My worry here is not at all hypothetical. For instance, advocates of the idea that diversity peaks at intermediate disturbance or productivity are fond of using quantile regression to show that maximum diversity is a humped function of disturbance or productivity, and claiming vindication for their hypothesis. The trouble is that, if diversity and disturbance or productivity are both unimodally distributed (which they are), then even if those two variables are independent of one another, the plot of diversity vs. disturbance or productivity is going to have a humped upper bound. Which to my mind does not indicate a “constraint”, much less an association of any sort between the two variables. I freely admit this is just one specific example, but it does make me worry if the whole notion of “constraint” is sufficiently clearly defined to make trying to detect “constraints” via quantile regression a useful exercise. Frankly, it’s my impression that some ecologists (not all, but more than a trivially small number), whenever they see a scatterplot with some polygonal-ish shape, immediately think “Aha–constraints!” Which is simply wrong.
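The null expectation in the comment above is easy to check by simulation. Here is a minimal sketch (Python, standard library only) that draws two independent, normally distributed variables, bins one of them, and records the largest value of the other in each bin, averaged over many replicate data sets. The bin edges, sample sizes, and function names are illustrative assumptions and are not taken from any of the papers discussed in this thread.

```python
import random

def binned_max(xs, ys, edges):
    """Largest y within each x-bin [edges[i], edges[i+1]); None if a bin is empty."""
    maxima = [None] * (len(edges) - 1)
    for x, y in zip(xs, ys):
        for i in range(len(edges) - 1):
            if edges[i] <= x < edges[i + 1]:
                if maxima[i] is None or y > maxima[i]:
                    maxima[i] = y
                break
    return maxima

def mean_upper_bound(n_points=500, n_reps=200, seed=1):
    """Average per-bin maximum of y over replicate data sets with x and y independent."""
    rng = random.Random(seed)
    edges = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]  # assumed bins, in standard deviations
    totals = [0.0] * (len(edges) - 1)
    counts = [0] * (len(edges) - 1)
    for _ in range(n_reps):
        xs = [rng.gauss(0, 1) for _ in range(n_points)]
        ys = [rng.gauss(0, 1) for _ in range(n_points)]  # drawn independently of xs
        for i, m in enumerate(binned_max(xs, ys, edges)):
            if m is not None:
                totals[i] += m
                counts[i] += 1
    return [t / c if c else float("nan") for t, c in zip(totals, counts)]

if __name__ == "__main__":
    for bound in mean_upper_bound():
        print(round(bound, 2))
```

The averaged upper bound should come out highest in the central bin and lowest in the outer bins, even though x and y are independent by construction: bins holding more points simply sample further into the upper tail of y.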
“Frankly, it’s my impression that some ecologists (not all, but more than a trivially small number), whenever they see a scatterplot with some polygonal-ish shape, immediately think “Aha–constraints!” Which is simply wrong.”
Yes, but as you said, this post is about underused techniques. Quantile regression, to many of us, is an underused method that reflects the under-appreciated concept in ecology of “necessary vs. sufficient”. Think of all the right triangle scatterplots supporting the hypothesis “X is necessary but not sufficient to generate Y”, and all the things that go wrong when these are analyzed with linear regression.
“but as you said, this post is about underused techniques. ”
I never said that. The original question was about techniques it’s essential for every ecologist (academic and non-academic) to know, not about techniques that, while not essential to all ecologists, are used less than they should be by some ecologists. Here’s the original question, which I shortened and paraphrased at the top of the post for benefit of readers: https://dynamicecology.wordpress.com/2013/11/26/ask-us-anything/comment-page-1/#comment-20760
Brian chose to also throw in a plug for some methods he finds useful and feels are underused, and I’m happy for the conversation to go in that direction if that’s what people want. I just would prefer that we not confuse “techniques that are underused” with “techniques that are essential for all ecologists to know”. It’s easy for all of us (very much including me) to overgeneralize from our own experiences and so mix up “our own personal favorite approaches” with “approaches that it’s essential for everyone to know and use.” So if you say that quantile regression is underused, I’m happy to agree, with the caveat that when it’s used I see it used badly more often than I would like (and presumably if quantile regression was used more often and taught more widely, it would be used better when it is used). But if you say that quantile regression is *essential* for all ecologists to know (where “essential” means something like “at least as essential as GLMs” or “you can’t consider yourself a competent ecologist unless you know quantile regression”), sorry, got to disagree with you on that.
Jeremy – the example you cite is an odd one. I actually haven’t seen the papers using quantile regression to support the IDH. But it sounds like a poor application to me.
First off in your example where the two are normally distributed – you should get a classic ellipsoid cloud with density tapering off towards the edges. That is not a situation that should be addressed by quantile regression (and I personally haven’t read a paper where I’ve seen this). I’m not even a fan of applying quantile regression to “house shaped” plots or other circumstances except very carefully.
So in general, I think quantile regression probably in most cases should stick to the linear form (because the proper null hypothesis for peaked forms is not obvious as you highlight). In this context there are good inference methods (e.g. pseudo r2, bootstrapping, etc) to be rigorous. And as Mike says (or as I’m paraphrasing Mike in more exaggerated language) it is a crime to see a triangular plot and then run a line through the mean (or median) without thinking about if this is the right interpretation.
Ecological data is full of situations where quantile regression IS a good interpretation. Indeed one of the oldest “laws” in ecology (Liebig’s law of the minimum) is directly related to quantile regression which explains the ubiquity of data needing quantile regression. As an informal but frequent statistical consultant, I can’t tell you how many times I’ve had somebody come into my office and say “my data is a mess what can I do” and the answer is quantile regression. Most of these people then tell me that quantile regression changes their whole world view. Which is what I claimed.
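To make the triangular-plot point concrete, here is a small sketch of linear quantile regression on simulated "necessary but not sufficient" data, where x sets a ceiling on y and unmeasured factors pull y below it. It is a toy under stated assumptions: the data-generating process is invented, and the brute-force fit (trying every line through two data points, which works because some optimal quantile line always passes through two observations) is only practical for small samples. Real analyses would use a dedicated routine such as the quantreg package in R.

```python
import random

def pinball_loss(residual, tau):
    """Check (pinball) loss for one residual at quantile level tau."""
    return tau * residual if residual >= 0 else (tau - 1) * residual

def total_loss(xs, ys, intercept, slope, tau):
    """Summed pinball loss of the line intercept + slope * x."""
    return sum(pinball_loss(y - (intercept + slope * x), tau) for x, y in zip(xs, ys))

def fit_quantile_line(xs, ys, tau):
    """Brute-force quantile regression for one predictor; returns (intercept, slope)."""
    best = None
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            if xs[i] == xs[j]:
                continue  # a vertical line is not a candidate
            slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
            intercept = ys[i] - slope * xs[i]
            loss = total_loss(xs, ys, intercept, slope, tau)
            if best is None or loss < best[0]:
                best = (loss, intercept, slope)
    return best[1], best[2]

def simulate_triangle(n=100, seed=42):
    """Hypothetical limiting-factor data: y is a random fraction of the ceiling x."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [x * rng.uniform(0, 1) for x in xs]
    return xs, ys

if __name__ == "__main__":
    xs, ys = simulate_triangle()
    for tau in (0.5, 0.9):
        intercept, slope = fit_quantile_line(xs, ys, tau)
        print(tau, round(intercept, 2), round(slope, 2))
```

In this simulation the true 90th-percentile line has slope 0.9 and the median line has slope 0.5, so the upper quantile fit should track the edge of the triangle while a line through the middle of the cloud describes the limiting relationship poorly.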
The example I cite may be odd, but it’s a real example. I’m absolutely not out to strawman quantile regression here!
An informal version is in the next to last paragraph of Fridley et al. 2012. The (bang on correct) reply is Fig. 1 in Grace et al. 2012. Of course, Grace et al.’s reply didn’t stop somebody else from repeating Fridley et al.’s mistake (this time with an actual quantile regression): Pierce in press. And I just reviewed a paper in which someone used a quantile regression in exactly the same way Pierce did.
I’d like to believe that Fridley et al., Pierce, and the paper I just reviewed are isolated, atypical examples of people misusing quantile regression. But the total number of authors involved is sufficiently large (and many of them are sufficiently prominent) that I’m reluctant to just write them off as isolated cases. On the other hand, I freely admit that these particular examples probably bug me more than they should, because they involve people abusing quantile regression in order to attack a paper that I think is very good, very important, and that was lead-authored by a friend.
@Jeremy – thanks for those three papers. I have to agree that those papers are using quantile regression wrong. The null model you explain is absolutely the best explanation for the data. The first one doesn’t disturb me too much as it is just sort of off the cuff verbal. But I am surprised the last one got through reviewers.
I do stick by my claim though that not only is that an improper use, but it is also not a typical example of the use of quantile regression in ecology.
“But I am surprised the last one got through reviewers.”
That surprises me too. Especially since Jim Grace, lead author of Grace et al. 2012, is acknowledged as one of the reviewers, so I’d imagine that he said more or less the same thing about Pierce’s quantile regression that Grace et al. said about the verbal version in Fridley et al. 2012. But Pierce got published anyway, and one can of course imagine various reasons. For instance, Pierce is a “commentary” paper, and perhaps the editor took the view that authors of “commentaries” should be free to express their own views even if some or all of the reviewers disagree. Or maybe the editor and/or the other reviewers agree with Pierce’s use of quantile regression, so that any dissent from Grace was “outvoted”. Or there are all sorts of other reasons.
@Mike and Brian:
So is there a need for a post (which I could imagine might lead to a paper) reviewing/explicating/discussing what a “constraint” is, and in that context reviewing/critiquing applications of quantile regression in ecology (good and bad)? Or maybe the folks from Ethan White and Morgan Ernest’s Ignite session at the last ESA are already doing that? (http://jabberwocky.weecology.org/2013/08/14/ignite-talk-why-constraint-based-approaches-to-ecology/)
We decided not to, given time constraints.
Almost forgot: this old post is relevant to the general topic of what a “constraint” is and the links between “constraints”, “trade-offs”, and “shapes and bounds of statistical distributions”:
> Or maybe the folks from Ethan White and Morgan Ernest’s Ignite
> session at the last ESA are already doing that?
The constraint related work that the two of us are interested in is different from the kind one would approach using quantile regression. I agree with Brian & Mike that it is a potentially valuable tool that hasn’t been well explored, and also with Jeremy that it is sometimes applied rather naively (just like everything else I suppose). I’d certainly be interested in reading a good well considered treatment of the topic.
Regarding Simon Pierce’s use of Boundary Regression (wasn’t actually quantile regression, but somewhat similar intent) it was said,
“That surprises me too. Especially since Jim Grace, lead author of Grace et al. 2012, is acknowledged as one of the reviewers”
Then, “Pierce is a “commentary” paper, and perhaps the editor took the view that authors of “commentaries” should be free to express their own views even if some or all of the reviewers disagree.”
You are spot on here Jeremy. My review of Pierce’s paper was very critical, but I imagine it was clear to the Editor that this was a discussion that should be held in public. Peter Adler and I, along with some of the usual suspects from the Nutrient Network (Stan Harpole, Elizabeth Borer, Eric Seabloom) are revising a response commentary on Pierce’s commentary at this instant (well, back to it when I finish this post).
Briefly, we think this discussion with Pierce may be an educational moment regarding some fundamental issues related to how data are used to evaluate theories. I think some of the analyses and illustrations we provide will relate quite directly to issues that lie behind points raised in your discussion of essential statistical techniques. In particular, I hope you both (Jeremy and Brian) find our application of causal analysis interesting. I think that behind the disagreements being expressed in the thread about the interpretation of quantile regression are assumptions about the underlying causal interpretations. I won’t go into all that now, but will be sure to pass along our response to Pierce when it can be distributed. Lots of heat, and hopefully some light as well.
@Jim – thanks for your comments. You raise a really important point I missed on a quick read through. Pierce et al did not even use quantile regression (which fits a percentile – e.g. 90th percentile – and is therefore at least somewhat self-correcting to the issues Jeremy raised – i.e. requiring more points above the line in regions dense with data). They arbitrarily took the top 20 points in each bin and fit a curve through it which has no mechanism for correcting for density of points in different regions (indeed pretty much does the opposite).
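The contrast drawn here can be illustrated with a rough sketch. Using two independent normal variables again, it compares a per-bin 90th percentile (a crude stand-in for what a quantile regression estimates) with the mean of the top 20 points in each bin (a simplified stand-in for a "take the highest points per bin" boundary method, not a reproduction of the procedure in the paper under discussion). Bin edges, sample size, and function names are assumptions made for illustration.

```python
import random

def bin_values(xs, ys, edges):
    """Group y values by the x-bin [edges[i], edges[i+1]) they fall into."""
    groups = [[] for _ in range(len(edges) - 1)]
    for x, y in zip(xs, ys):
        for i in range(len(edges) - 1):
            if edges[i] <= x < edges[i + 1]:
                groups[i].append(y)
                break
    return groups

def percentile(values, q):
    """Simple empirical quantile for 0 < q < 1 (no interpolation)."""
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]

def top_k_mean(values, k):
    """Mean of the k largest values in a bin."""
    top = sorted(values)[-k:]
    return sum(top) / len(top)

def compare_boundaries(n_points=20000, k=20, q=0.9, seed=7):
    """Return (per-bin q-th percentiles, per-bin top-k means) for independent x and y."""
    rng = random.Random(seed)
    edges = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
    xs = [rng.gauss(0, 1) for _ in range(n_points)]
    ys = [rng.gauss(0, 1) for _ in range(n_points)]  # independent of xs
    groups = bin_values(xs, ys, edges)
    return ([percentile(g, q) for g in groups],
            [top_k_mean(g, k) for g in groups])

if __name__ == "__main__":
    q90, topk = compare_boundaries()
    print([round(v, 2) for v in q90])
    print([round(v, 2) for v in topk])
```

The percentile should stay roughly flat across bins, because it scales with the number of points in each bin, whereas the top-20 summary should rise in the densely populated middle bins and fall at the sparse edges, producing a hump from sampling density alone.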
I’d be really curious to know what a true, good quantile regression through that data looks like! Do you address that in your reply?
Thanks for responding. Good to know there are some folks interested in the next page of the story. We are attempting a proper-ish quantile regression analysis on the Adler data at the request of the reviewers. Because plots are clustered within sites and because of the interest in a non-linear modal relationship, we are using a newer R package “lqmm” for mixed model quantile regression, which has some limitations to it. All this is a bit frustrating, however, because we do not believe a continued fixation on the bivariate relationship between biomass production and richness moves our understanding forward on underlying mechanisms one bit. Our Response works very hard to explain why the study of bivariate relations will not move us forward and how progress in understanding might be made. We are very excited to see what folks think of our causal analysis of Grime’s humped-back model.
Will keep you posted.
Some good introductory papers on machine learning (both in the tutorial vein but with empirical examples):
CLASSIFICATION AND REGRESSION TREES: A POWERFUL YET SIMPLE TECHNIQUE FOR ECOLOGICAL DATA ANALYSIS
Glenn De’ath and Katharina E. Fabricius (Ecology 2000)
MACHINE LEARNING METHODS WITHOUT TEARS: A PRIMER
Julian D. Olden et al (Quarterly Review of Biology 2008)
Of course any paper on niche modelling is using machine learning under the hood, although I don’t think those are the most interesting applications.
Broadly, I think machine learning is useful in several contexts:
1) When we just want to see how much variance in dependent variable Y a set of variables X explains (and we can additionally compare and contrast different sets of X)
2) When we expect there are a lot of interactions and non-linearities in the responses but we don’t have prior hypotheses about what these are. They can of course be chased down by adding interactions and polynomial terms to a traditional linear regression but if you have 5 or 6 explanatory variables this gets very unwieldy if you don’t have prior hypotheses about where to add interactions and nonlinearities
3) In a more applied context where prediction is your primary focus.
If regression (or classification) trees are underused in ecology, they should probably stay that way. It is quite dangerous to point to tree models *alone* because they are exceedingly sensitive to the sample of data you happen to have collected. By the time Glenn and Katharina published their Ecology paper, Leo Breiman was publishing his Random Forests paper, and many years before that he’d promoted the use of bagged trees (bagging) as a way to improve the predictions from trees, because of these instability defects. Unfortunately, with these ensemble methods one loses the nice tree-like interpretation that often catches the eye of the user.
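The instability point can be seen with the smallest possible tree. The sketch below fits a one-split regression "stump" to bootstrap resamples of a simulated data set and then averages the stumps, which is the basic idea behind bagging. It is a toy with invented data and hypothetical function names, not the CART or random forest algorithms as implemented in any package.

```python
import random

def fit_stump(xs, ys):
    """Best single split on x by squared error; returns (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean

def bagged_stumps(xs, ys, n_boot=50, seed=3):
    """Fit one stump to each bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def predict_bagged(stumps, x):
    """Bagged prediction: the average of the individual stump predictions."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)

def simulate_linear(n=60, seed=0):
    """Hypothetical smooth relationship, for which no single split is clearly best."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 1) for _ in range(n)]
    ys = [x + rng.gauss(0, 0.2) for x in xs]
    return xs, ys

if __name__ == "__main__":
    xs, ys = simulate_linear()
    stumps = bagged_stumps(xs, ys)
    thresholds = sorted(s[0] for s in stumps)
    print(round(thresholds[0], 2), round(thresholds[-1], 2))
```

The chosen split point should move around from one resample to the next, which is the instability being described. Averaging the stumps gives a smoother prediction, at the cost of no longer having a single tree to draw.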
Whilst you have to start somewhere, I would suggest also reading Glenn’s later paper (Ecology, 2007) on boosted trees (which includes discussion of bagging, random forests etc.) in conjunction with the earlier CART paper you mention. And also Cutler et al. (Ecology 2007), who focus on random forests.
I do agree, however, with your three contexts in which machine learning methods, properly applied, are useful, as long as 2) is not an excuse not to think 😉
More broadly, Elements of Statistical Learning is a solid choice, as is this book (http://www-bcf.usc.edu/~gareth/ISL/) which just came out and I’m pretty stoked about.
There’s also a nice MOOC on machine learning (https://www.coursera.org/course/ml), with the one downside (or opportunity!!) being that the programming exercises will force you to use Matlab (or Octave).
@Tad – Elements of Statistical Learning is the classic and still my go-to on a regular basis. But it is definitely for somebody with (at least what is considered in ecology) pretty good statistical/mathematical chops. It’s good to know that Hastie has done an intro book for non-statisticians. Based on the clarity of his other work and writings, I bet it is great.
I would add visualizing model-based estimates/predictions and comparing them to observed data in order to understand: 1) the importance of the estimates for the study system, 2) the relevance of the estimates to the conditions the ecologist is interested in, and 3) the appropriateness of the model. I think of these as independent techniques even though any class which covers (generalized) linear models ought to cover visualization of estimates/predictions but most graduate students I’ve met struggle with these. Also, knowing what over-fitting or poor fit look like in visualizations independent of AIC/etc…
I agree that visual approaches to model checking are essential. But I don’t think of those as independent techniques, I think of them as part and parcel of learning to do statistics well, and teach them that way in my own classes.
It is interesting that you both mention PCA and yet this is generally considered the worst of the ordination methods when it comes to species data because it fails to handle non-linear responses in the variables (without some prior transformation sensu Legendre & Gallagher (2001; Oecologia)). I’m not suggesting we should all be schooled in the delights of NMDS, CA, RDA and CCA (and the multitude of other acronym-bearing methods), just that it would appear that PCA is really of limited use compared to NMDS (which has its own baggage and problems) or other methods.
Similarly interesting is that you both highlight the general linear model as core understanding and then mention the need to understand the GLM almost as an extra. It has been my experience working with ecologists in a range of fields with a range of data that the general linear model is rarely of much use. So much so that I have toyed with the idea of not teaching the general linear model aspects first and then introducing the extensions required to move to the GLM, but to just start with the GLM and really hone in on the distributional assumptions of the models and then go on to discuss how these assumptions might be changed (in the GLM sense to other probability distributions) or relaxed (in the sense of GLS, GEE) to deal with other issues such as arise with spatial or temporal data. (Needless to say this would be an upper undergraduate, graduate level course, not biostats 101.)
Having attended a workshop by James Grace & Don Schoolmaster on structural equation models, I can certainly relate to Brian’s mentioning of path analysis.
One key skill, but perhaps stretching the bounds of “statistical” would be exploratory data analysis and data visualization; we see so many poor plots of data/results in the literature that this would be a Day 1 topic on any course I give.
There seems to be a lot of attention paid to fitting models, but not on interpreting and critiquing them. How many talks or papers have we seen where a statistically significant effect is lauded but is of ecologically irrelevant size? Really understanding how to go from the summary outputs that our stats software of choice gives us to dissection of the model outputs via diagnostics stats/plots, partial-effects plots etc. seems to be missing from the toolbox in many instances.
Not sure where I sit with AIC and model selection. Selecting via AIC is effectively the same as selecting terms via p-values and we know how bad that can be. Model averaging or shrinkage methods would seem to get us out of making the strong statement that certain variables have exactly zero effect, but there is a price to pay.
“Similarly interesting is that you both highlight the general linear model as core understanding and then mention the need to understand the GLM almost as an extra”
For all methods I suggest, I think you should understand them! Sorry if that wasn’t clear. And by “understand them” I include “be able to interpret their output sensibly” and “be able to use these methods in the aid of answering an intelligent scientific question” (which includes things like not mixing up statistical and biological significance, not trumpeting the rejection of an obviously-false null hypothesis that we learn nothing by rejecting, etc.)
Model selection via AIC isn’t “effectively” the same as selecting terms via p-values as far as I know. Can you provide a citation for that equivalence?
Yes, there are various other ordination methods besides PCA. But they’re just that–ordination methods. PCA has other uses besides ordinating species data. I’d put ordination methods like CCA, NMDS, etc. in the category of more specialized techniques that are essential to know about for ecologists working in certain areas, and very much optional or even irrelevant for ecologists working in other areas. The question asked about the methods everyone needs to know about, which in practice means “the methods lots of ecologists use”. If I had to guess (and I emphasize I am just guessing), I bet many more ecology papers, on a much wider range of topics, use PCA than use CCA, NMDS, etc.
Depends what you mean by model selection then. I can’t put my finger on it just now, but if by model selection you mean choosing whether a fixed effect term should remain in the model or not, one variable at a time, AIC is a restatement of the p-value; you are just using a different alpha from say the usual 1-0.05 value when deciding to retain or remove that variable. Frank Harrell of Regression Modelling Strategies fame told me this on R-Help list 5 odd years ago now. Hence you won’t necessarily get the same final model via AIC and p <= 0.05 as the step-wise selection criteria, but there is a value x such that selecting via AIC is the same as using p <= x only I forget what that value of x is.
Additionally, my point was that choosing models stepwise via AIC retains many of the problems that we complain about when people do this using p-values. Not least the risk of strong bias in the estimated parameters when we make the implicit statement that an effect is zero.
But perhaps you were referring to other forms of selection, in which case we may be talking past each other?
I think GLMs are very useful, and their zero-inflated extensions are as well. An excessive number of zeros in species count data is quite common. Check out Zuur et al.’s “Mixed Effects Models and Extensions in Ecology with R” for a great qualitative section on zero-inflated models that includes talking hippos (and lots of other stat goodies!).
AIC analysis is a start, not a finish, but as a start, AIC and other information-theoretic approaches can be really useful. Unfortunately, I see AIC analysis being misused all the time. Using the model with the lowest AIC score for inference is often not an acceptable practice. Model averaging should be used to estimate parameters when more than one model has a similar AIC score. I’m no expert, but I would refer folks to Burnham & Anderson’s book, “Multimodel inference: understanding AIC and BIC in Model Selection” (2004). Also, giant tables of AIC scores reported as end results are not informative, but parameter estimates from one or more models (using averaging) can be.
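For readers who have not seen model averaging written out, here is a minimal sketch of the usual Akaike-weight calculation: each model gets a weight proportional to exp(-0.5 × ΔAIC), where ΔAIC is its difference from the best model, and a parameter estimate is then averaged across models using those weights. The AIC values and estimates in the example are made up, and the sketch ignores real complications such as what to do when a parameter is absent from some models or how to average standard errors.

```python
import math

def akaike_weights(aic_values):
    """Akaike weights: relative support for each model in the candidate set."""
    best = min(aic_values)
    relative = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(relative)
    return [r / total for r in relative]

def model_averaged_estimate(estimates, aic_values):
    """Weighted average of one parameter's estimates across the candidate models."""
    weights = akaike_weights(aic_values)
    return sum(w * e for w, e in zip(weights, estimates))

if __name__ == "__main__":
    # Hypothetical example: two models whose AIC scores differ by 2 units
    aics = [100.0, 102.0]
    slopes = [1.0, 3.0]
    print([round(w, 3) for w in akaike_weights(aics)])
    print(round(model_averaged_estimate(slopes, aics), 3))
```

With a difference of only 2 AIC units the second model still carries about 27% of the weight, which is one way to see why reporting only the top-ranked model can be misleading.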
Multivariate methods are not appropriate for many data and/or questions, but they can be powerful tools when summoned. You will probably need lots of data for these methods. For an intro to ordination methods, regression trees (and forests), and cluster analyses, check out the course page from my department head Dave Roberts. His class was great, and all his notes and R code are here: http://ecology.msu.montana.edu/labdsv/R/labs/
He also invented fuzzy set ordination a long time ago, which is largely unknown, but I found to be really useful for examining a priori species/environment relationships. Here’s how he assessed fuzzy set ordination in comparison to CCA and others: http://www.esajournals.org/doi/abs/10.1890/07-1673.1
If you are going to do an analysis that requires your species data to be converted to a (dis)similarity matrix, it might be worth looking into Chao et al.’s new approach that can account for unseen, shared species absences among sites: http://onlinelibrary.wiley.com/doi/10.1111/j.1461-0248.2004.00707.x/abstract;jsessionid=B524A36777BD7C43E601EA7D09DCC060.f02t04?deniedAccessCustomisedMessage=&userIsAuthenticated=false
I mostly agree with Gavin here. And it seems the issues with PCA vs. advanced multivariate techniques are analogous to linear models vs. GLMs – the simple versions are subclasses of larger concepts, not distinct units. And yet, the presentation/teaching can often take a sequential approach that is not intuitive, especially when the advanced material is never reached. I don’t remember when I learned about the relationship between ANOVA and linear regression, but it was not in a classroom (which is disappointing). It also does not help to have different unrelated names and abbreviations that only further confusion, especially with multivariate techniques.
On the other hand, the question posed was for all ecologists regardless of their specific job. I suppose it’s pretty hard to provide an all-encompassing answer. There is an obvious trade-off between being broadly knowledgeable and having a deep understanding of statistics (or any subject matter, really).
Re: model selection, for nested models that differ by a single parameter, choosing the model with the lower AIC is equivalent to a likelihood ratio test with a critical p-value of roughly 0.15 (AIC penalizes each extra parameter by 2 units, so the extra term is kept whenever the likelihood ratio statistic exceeds 2). So for stepwise procedures, using AIC values to select variables is effectively the same as using p-values (albeit with a higher threshold than the usual 0.05). This is at the heart of why AIC model selection generally favors more complex models and can tend to overfit.
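The figure of roughly 0.15 can be worked out directly for the simplest case of two nested models differing by one parameter. Since AIC = 2k − 2 log L, the larger model has the lower AIC exactly when the likelihood ratio statistic exceeds 2, and under the usual asymptotics that statistic follows a chi-square distribution with 1 degree of freedom. The sketch below computes the implied p-value threshold using only the standard library; the function names are made up for illustration.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def implied_alpha(delta_aic_required=0.0, penalty_per_parameter=2.0):
    """p-value threshold implied by AIC when one parameter is added to a nested model.

    The larger model beats the smaller one by more than `delta_aic_required`
    AIC units when the likelihood ratio statistic exceeds
    penalty_per_parameter + delta_aic_required.
    """
    return chi2_sf_1df(penalty_per_parameter + delta_aic_required)

if __name__ == "__main__":
    print(round(implied_alpha(), 4))     # keep the term if it lowers AIC at all
    print(round(implied_alpha(2.0), 4))  # keep it only if AIC drops by more than 2
```

Simply picking the lower AIC corresponds to p < 0.157 or so, while insisting on an improvement of more than 2 AIC units corresponds to p < 0.046 or so, which is close to the conventional 0.05. Both numbers apply only to this one-parameter, nested-model case.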
What one needs to know to read the literature and what one should know to improve one’s research aren’t the same. In my field, PCA is way, way overused. It’s often used where other multivariate techniques (PLS, DFA/CVA, CCA) would be much more efficient and effective.
A response to Jeremy’s comment: we are all a bit like the hammer that sees every problem as a nail. Our research path (including how we design and analyze experiments and even what kinds of questions we ask) is very much influenced by the tools we know. As a consequence we don’t recognize that other tools are useful. Or more importantly, other tools might allow us to see the problem in a way that we just couldn’t see before. I scan something like Marc Mangel’s books and think, “I don’t need to know this”. But a better way to think about it would be, “I wonder how thinking like this could improve how I address the problems that interest me or even identify new problems that I didn’t recognize were a problem!”
I would add: become really, really comfortable with matrix algebra. Not linear algebra (although that would be nice), just matrix algebra, including some of the basic matrix algebra manipulations. Can you read a Lande paper and understand the math? If not, I think it would be really hard to have anything but a superficial understanding of the content (I think this is why PCA is overused by the way, a limited understanding of matrix algebra and linear algebra to an extent).
I would also add: learn how to write code really really well to build up the skill of thinking algorithmically. A problem that I have with some literature is that hypotheses or models or theory are not quantitative and not algorithmic. An example is much of the work on plasticity/accommodation – does the phenotype “lead” the genotype when colonizing a novel environment? When I try to model this I find that the verbal descriptions in various introductions and discussions are too vague to really know what the authors mean.
All good advice Jeff–though it does get back to the issue of what it’s truly *essential* for every ecologist to know in order to do their job (the original question, which I shortened, emphasized that the questioner was asking about what non-academic as well as academic ecologists need to know).
For instance, I’m a rubbish coder. So are lots of ecologists. If being able to code really well is “essential”, then that means I’m a bad ecologist, and so are lots of other people.
Don’t get me wrong, your suggestions are well taken. The ability to do things like code well is indeed very useful in many contexts, and I’m sure many good ecologists (including me) would be even better ecologists if they were better at coding (and at various other things). But I have the impression that you and the other commenters are (gently) disagreeing with our answers because you’re interpreting the question differently than I (and perhaps Brian) did.
Similarly, I totally agree that understanding matrix algebra (and some calculus) is really helpful for understanding large chunks of the ecology literature, particularly the modeling literature. But the question asked about statistical techniques, so I limited my answer to that.
In that regard, I would humbly disagree with PCA (or any dimensionality reduction method) being a statistical technique that every ecologist needs to know. I think whether or not you need to know PCA (or multivariate methods in general) depends on both your sub-discipline and your “scientific style”.
For me personally, the ability to understand, interpret, or implement a PCA has never furthered my own scientific understanding or research. Probably a lot of this has to do with my own biases about what is interesting. I do not want to imply in any way that these are bad or worthless techniques. It’s just that I am generally not interested in questions where a PCA is needed to determine the answer.
It’s obviously difficult to impossible to draw a clear bright line between essential and inessential techniques for all ecologists. I intended my own answer to describe a gradient from more to less essential techniques, and I think Brian intended something similar. Sorry if that wasn’t entirely clear.
And I suspect that the further one gets towards the less essential end of the gradient, the more scope there is for reasonable disagreement about how to order different techniques along the gradient. Because almost by definition, “less essential” really means “essential to smaller and smaller numbers of people”. I do think almost all ecologists need to know something about GLMs, which is why GLMs (or special cases like t-tests, ANOVA, linear regression, etc.) are taught to most ecology undergraduates. But yeah, if you’d place PCA further towards the less essential end of the gradient than Brian or I would, I think that’s reasonable even if I don’t agree.
I hastily finished that comment because a student came in.
I don’t disagree that the core (ANOVA + regression + PCA) is needed to understand the literature, but what I hoped to say is that this constrains the way we do our science, and there may be better tools, so maybe these other tools should be the core. This isn’t too dissimilar to the discussion here several months ago about the problems of recipe-driven STATS101 teaching – if everyone agrees this isn’t the best, then why teach it? Because it’s what everyone uses, it’s the core. Why? Because that is what we were taught! So there is this circularity, and the only way to break it is to not teach recipe-driven stats 101 from the start.
The point about coding was not to gain the skill but to train the brain to think using algorithms and models. I can understand much more about a statistic, say a P-value or an F-statistic, if I model a process and then perturb parameters and see how the statistics behave. But if I don’t know how to code I don’t think of statistics as a modeling process but as a recipe.
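To make that concrete, here is a minimal sketch of the "model a process, perturb a parameter, watch the statistic" exercise. Everything in it is an assumption for illustration: two groups drawn from normal distributions, a permutation test on the difference in means, and the hypothetical function names `permutation_p_value` and `rejection_rate`. It only shows the general idea, not any particular commenter's workflow.

```python
import random
import statistics


def permutation_p_value(a, b, n_perm=199, rng=random):
    """Two-sided permutation test for a difference in group means."""
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign observations to groups
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Count the observed arrangement itself so the p-value is never zero.
    return (extreme + 1) / (n_perm + 1)


def rejection_rate(effect, n=20, n_sims=200, alpha=0.05, seed=1):
    """Fraction of simulated experiments with p <= alpha.

    'effect' is the assumed true difference in means (in SD units);
    this is the parameter to perturb.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(effect, 1.0) for _ in range(n)]
        if permutation_p_value(a, b, rng=rng) <= alpha:
            rejections += 1
    return rejections / n_sims


# Perturb the true effect and see how often the test rejects.
for effect in (0.0, 0.5, 1.0):
    print(f"true effect = {effect}: rejection rate = {rejection_rate(effect):.2f}")
```

With a true effect of zero, the rejection rate should sit near alpha; as the effect grows, it climbs towards one. Changing `n` or the noise level in the same way shows how the P-value responds to sample size and variance.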
I’m going to defend PCA here. First, nobody said it should always be used instead of NMDS, CCA etc. This was a post focused on core techniques and what somebody with limited time should learn. In that vein, I would argue that PCA is the proper gateway into the whole suite of multivariate dimension reduction techniques. Second, PCA is not just for ordination. It is used in morphometrics, increasingly in trait analysis, and in many other domains in which CCA is not really appropriate and NMDS greatly weakens the amount of information that can be interpreted. Similarly, many techniques in use (e.g. PCNM for spatial regression) boil down to PCA under the hood. Things like multivariate selection (one commenter mentioned Lande) and structured population models also depend on eigenvector analysis, which, while not identical to PCA, is very closely related. In short, understanding PCA is a gateway to a whole world of techniques. It is also very widely used across many domains of ecology in many papers. That makes it a good tool to learn with a limited amount of time, which was the original poster’s question.
Hmmm, I’ll take issue with that.
I’m not sure PCA has any real advantages over correspondence analysis (CA, not CCA, which is a different beast) in any situation (if it does, I’m not aware of it), but it has a very definite, and strong, disadvantage with certain types of data that are very common in plant ecology in particular: estimates of relative abundance in any system with more than a few species and/or a few samples. Such samples are often zero-inflated, but shared values of zero do not necessarily indicate similarity, and PCA thinks they do. And from an ease-of-understanding standpoint, I find CA much easier to understand than PCA, since eigenanalysis is not involved in the CA algorithm–it’s just an iterative computation.
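One way to see the shared-zeros concern is a toy example, assuming only that PCA on raw abundances preserves Euclidean distances among samples. The four sites and their abundances below are made up for illustration.

```python
import math

# Hypothetical abundances of five species at four sites (made-up numbers).
sites = {
    "A": [10, 6, 0, 0, 0],  # A and B hold the same two species
    "B": [4, 1, 0, 0, 0],
    "C": [0, 0, 1, 0, 0],   # C and D share no species at all,
    "D": [0, 0, 0, 1, 0],   # but both lack species 1, 2 and 5
}


def euclidean(u, v):
    """Euclidean distance between two abundance vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


d_ab = euclidean(sites["A"], sites["B"])
d_cd = euclidean(sites["C"], sites["D"])

print(f"A-B (same species, different abundances): {d_ab:.3f}")
print(f"C-D (no species in common):               {d_cd:.3f}")
```

Sites C and D come out much closer than A and B, even though C and D have no species in common. Their joint absences contribute nothing to the distance, so they look alike mainly because of what neither contains.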
Ok, but again, it seems like you’re restricting attention to one type of data. PCA gets used in lots of situations, and for lots of purposes, besides those in which one can use ordination methods. PCA isn’t just for ordination of species abundance data.
@Jim your description of CA is an old one. CA is an eigenanalysis; take a look at the algorithm in Legendre & Legendre’s Numerical Ecology book (2nd or 3rd editions), or that of the cca() function in the vegan package for R (although in that we use the SVD, not an eigendecomposition).
My main point there was that I can’t see how PCA is more generally useful than CA is if you feel the need to include a multivariate analysis tool in your kit. And on that note, I more generally have sympathy with Benjamin’s view above; I’m not sure I’d include any such on the list of must-have tools, for the simple reason that many ecological issues addressed are not multivariate. Moreover, even if you’re working with multivariate data, I’d raise questions about whether “dimensionality reduction” really does much for one’s understanding of the system at hand. I can see its usefulness in something like multi-spectral satellite image interpretation (especially as things become increasingly hyper-spectral), because your question there is very focused (what material substance am I likely detecting here?), you know that collinearity among various bands is possible to likely, depending on the spectral resolution, and the response variables are not involved in complex web-like relationships, e.g. feedbacks etc.
Getting back to the original question, I’d say that a categorical data analysis tool, like chi-square or G or whatever, is a definite essential.
And also a basic TSA tool, like ARMA.
But even more fundamentally, any number of exploratory data analysis tools.
This seems like a good place to share a mistake of mine that I learned about after I thought my analyses were finished. ANOVA in SAS reports Type III sums of squares by default, yet base R uses Type I (sequential) sums of squares by default. This is very important because, whenever the design is unbalanced, Type I results are sensitive to the order in which terms are entered into the model, and the results for main effects may not be accurate when interactions are included. Somehow, as I made the SAS to R transition in my classes and independent research, this fact escaped me for much longer than it should have.
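For anyone who wants to see the order dependence for themselves, here is a minimal sketch in plain Python. It assumes two correlated continuous predictors as a stand-in for an unbalanced design, uses made-up data, and the helper names (`solve`, `rss`, `sequential_ss`) are hypothetical. It reproduces the sequential logic only, not SAS or R output.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def rss(y, predictors):
    """Residual sum of squares from OLS of y on an intercept plus predictors."""
    cols = [[1.0] * len(y)] + [list(p) for p in predictors]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def sequential_ss(y, ordered_terms):
    """Sequential (Type I) SS: each term's drop in RSS given the terms before it."""
    out, current, previous = {}, [], rss(y, [])
    for name, values in ordered_terms:
        current.append(values)
        now = rss(y, current)
        out[name] = previous - now
        previous = now
    return out


# Made-up data: x1 and x2 are strongly correlated with each other and with y.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 9]
y = [3, 4, 8, 7, 12, 11, 15, 19]

x1_first = sequential_ss(y, [("x1", x1), ("x2", x2)])
x2_first = sequential_ss(y, [("x2", x2), ("x1", x1)])
print("x1 entered first: ", x1_first)
print("x2 entered first: ", x2_first)
```

The total explained sum of squares is the same either way, but the share credited to `x1` is large when it enters first and small when it enters after `x2`. That is the behaviour to watch for in sequential ANOVA tables when the predictors are not orthogonal.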
I have recently attended some university lectures on classification techniques such as trees, random forests, and support vector machines. I work in the health field, and from numerous conversations and articles I’ve read, the general opinion seems to be that, in most cases, logistic regression performs just as well as these more complex machine learning methods.
Does anyone have any experience of comparing these approaches in ecology?
It seems odd that most ‘Data Science’ MScs I’ve seen don’t even mention logistic regression.