Detection probabilities, statistical machismo, and estimator theory

Posted on September 15, 2014 by Brian McGill

Detection probabilities are a statistical method using repeated sampling of the same site combined with hierarchical statistical models to estimate the true occupancy of a site*. See here for a detailed explanation including formulas.

Statistical machismo, as I define it in this blog, is the pushing of complex statistical methods (e.g. reviewers requiring the use of a method, authors claiming their paper is better solely because of the use of a complex method) when the gains are small or even occur at some cost. By the way, the opposite of statistical machismo is an inclusive approach that recognizes every method has trade-offs and there is no such thing as a best statistical method.

This post is a fairly technical statistical discussion .If you’re interested in detection probabilities but don’t want to follow the details, skip to the last section for my summary recommendations.

Background

I have claimed in the past that I think there is a lot of statistical machismo around detection probabilities these days. I cited some examples from my own experience where reviewers insisted that detection probabilities be used on data sets that had high value in their spatial and temporal coverage but for which detection probabilities were not possible (and even in some cases when I wasn’t even interested in occupancy). I also discussed a paper by Welsh, Lindenmayer and Donnelly (or WLD) which used simulations to show limitations of detection probability methods in estimating occupancy (clearly driven by their own frustrations of being on the receiving end of statistical machismo for their own ecological papers).

In July the detection probability proponents fired back at WLD with a rebuttal paper By Guillero-Arroita and four coauthors (hereafter GLMWM). Several people have asked me what I think about this paper including some comments on my earlier blog post (I think usually in the same way one approaches a Red Sox fan and asks them about the Yankees – mostly hoping for an entertaining reaction).

The original WLD paper basically claimed that in a number of real world scenarios, just ignoring detection probabilities gave a better estimator of occupancy. Three real-world scenarios they invoked were: a) when the software had a hard time finding the best fit detection probability model, b) a scenario with moderate occupancy (Ψ=40%) and moderate detection probabilities (about p=50%), and c) a scenario where detection probabilities depend on abundance (which they obviously do). In each of these cases they showed, using Mean Squared Error (or MSE, see here for a definition), that using simple logistic regression only of occupancy ignoring detection probabilities had better behavior (lower MSE).

GLMWM basically pick different scenarios (higher occupancy Ψ=80%, lower detection p=20% and a different SAD for abundances) and show that detection probability models have a lower MSE. They also argue extensively that software problems finding best fits are not that big a problem**. This is not really a deeply informative debate. It is basically,” I can find a case where your method sucks. Oh yeah, well, I can find a case where your method sucks.”

Trying to make sense of the opposing views

But I do think stepping back, thinking a little deeper, framing this debate in the appropriate technical context – the concept of estimation theory, and pulling out a really great appendix in GLMWM that unfortunately barely got addressed in their main paper, a lot of progress can be made.

First, lets think about the two cases where each works well. Ignoring detection worked well when detection probability, p, was high (50%). It worked poorly when p was very low (20%). This is just not surprising. When detection is good you can ignore it, when it is bad you err to ignore it! Now WLD did go a little further, they didn’t just say that you can get away with ignoring detection probability at a high p – they actually showed you get a better result than if you don’t ignore it. That might at first glance seem a bit surprising – surely the more complex model should do better? Well, actually no. The big problem with the detection probability model is identifability – separating out occupancy from detection. What one actually observes is Ψ*p (i.e. that % of sites will have an observed individual). So how do you go from observing Ψ*p to estimating Ψ (and p in the case of the detection model)? Well ignoring p is just the same as taking Ψ*p as your estimate. I’ll return to the issues with this in a minute. But in the detection probability model you are trying to disentangle Ψ vs. p just from observed % of sites with very little additional information (the fact that observations are repeated on a site). Without this additional information Ψ*p are completely unseparable – you cannot do better than randomly pick some combination of Ψ and p and that together multiple to give the % of sites observed (and again the non-detection model essentially does this by assuming p=1 so it will be really wrong when p=0.2 but only a bit wrong p=0.8). The problem for the detection model is that if you only have two or three repeat observations at a site and p is high, then most sites where the species is actually present it will show up at all two or three observations (and of course not at all when it is not present). So you will end up with observations of mostly 0/0/0 or 1/1/1 at a given site. This does not help differentiate (identify) Ψ from p at all. Thus it is actually completely predictable that detection models shine when p is low and ignoring detection shines when p is high.

Now what to make of the fact, something that GLMWM make much of, that just using Ψ*p as an estimate for Ψ is always wrong anytime p<1. Well, they are correct about it always being wrong. In fact using the observed % of sites present (Ψ*p) as an estimator for Ψ is wrong in a specific way known as bias. Ψ*p is a biased estimator of Ψ. Recall that bias is when the estimate consistently overshoots or undershoots the true answer. Here Ψ*p consistently undershoots the real answer by a very precise amount Ψ*(1-p) (so by 0.2 when Ψ=40% and p=50%). Surely this must be a fatal flaw to intentionally choose an approach that you know on average is always wrong? Actually, no, it is well known in statistics that sometimes biased estimator are the best estimator (by criteria like MSE).

Estimation theory

Pay attention here – this is the pivotal point – a good estimator has two properties – it’s on average close to right (low bias), and the spread of its guesses (i.e. the variance of the estimate over many different samples of the data) is small (low variance). And in most real world examples there is a tradeoff between bias and variance! More accurate on average (less bias) means more variance in the guesses (more variance)! In a few special cases you can pick an estimator that has both the lowest bias and the lowest variance. But anytime there is a trade-off you have to look at the nature of the trade-off to minimize MSE (best overall estimator by at least one criteria). (Since Mean Squared Error or MSE=Bias^2+Variance one can actually minimize MSE if one knows the trade-off between bias and variance).This is the bias/variance trade-off to a statistician (Jeremy has given Friday links to posts on this topic by Gelman).

Bias and Variance - Here estimator A is biased (average guess is off the true value) but it has low variance. Estimator B has zero bias (average guess is exactly on the true value) but the variance is larger. In such a case Estimator B can (and in this example does) have a larger Mean Squared Error (MSE) - a metric of overall goodness of an estimator.

Figure 1 – Bias and Variance – Here estimator A is biased (average guess is off the true value) but it has low variance. Estimator B has zero bias (average guess is exactly on the true value) but the variance is larger. In such a case Estimator B can (and in this example does) have a larger Mean Squared Error (MSE) – a metric of overall goodness of an estimator. This can happen because MSE depends on both bias and variance – specifically MSE=Bias^2+Variance.

This is exactly why the WLD ignore detection probabilities method (which GLMWM somewhat disparagingly call the naive method) can have a lower Mean Square Error (MSE) than using detection probabilities despite always being biased (starting from behind if you will). Detection probabilities have zero bias and non-detection methods have bias, but in some scenarios, non-detection methods have so much lower variance than detection methods that the overall MSE is better to ignore the detection method. Not so naive after all! Or in other words, being unbiased isn’t everything. Having low variance (known in statistics as an efficient estimator) is also important. Both the bias of ignoring detection probabilities (labelled “naive” by GLMWM) and the higher variances of the detection methods can easily be seen in Figures 2 and 3 of GLMWM.

When does ignoring detection probabilities give a lower MSE than using them?

OK – so we dove into enough estimation theory to understand that both WLD and GLMWM are correct in the scenarios they chose (and that the authors of both papers were probably smart enough to pick in advance a scenario that would make their side look good). Where does this leave the question most readers will care about most – “should I use detection probabilities or not?” Well the appendix to GLMWM is actually exceptionally useful (although it would have been more useful if they bothered to discuss it!) – specifically supplemental material tables S2.1 and S2.2.

Let’s start with S2.1. This shows the MSE (remember low is good) of the ignore detection model in the top half and the MSE of the use the deteciton model in the bottom half for different samples sizes S, repeat visits K, and values of Ψ and p. They color code the cases red when ignore beats use detection, and green when detection beats ignore (and no color when they are too close to call). Many of the differences are small, but some are gigantic in either direction (e.g. for Ψ=0.2, p=0.2, ignoring detection has an MSE of 0.025 – a really accurate estimator – while using detection probabilities has an MSE of 0.536 – a really bad estimate given Ψ ranges only from 0-1, but similar discrepancies can be found in the opposite direction too). The first thing to note is that at smaller sample sizes the red, green and no color regions are all pretty equal! IE ignoring or using detection probabilities is a tossup! Flip a coin! But we can do better than that. When Ψ (occupancy) is < 50% ignore wins, when Ψ>50%, use detection wins, and when p (detection rate) is high, say>60% then it doesn’t matter. In short, the contrasting results between WLD and GLMWM are general! Going a little further, we can see that when sample sizes (S but especially number of repeat visits K) creep up, then using detection probabilities starts to win much more often which also makes sense – more complicated models always win when you have enough data, but don’t necessarily (and here don’t) win when you don’t have enough data.

Figure 2 – Figure 1 with confidence intervals added

Now lets look at table S2.2. This is looking at something that we haven’t talked about yet. Namely, most estimators have, for a given set of data, a guess about how much variance they have. This is basically the confidence interval in Figure 2. In Figure 2, Estimator A is a better estimator of the true value (it is biased, but the variance is low so MSE is much lower), but Estimator A is over confident – it reports a confidence interval (estimate of variance) that is much smaller than reality. Estimator B is a worse estimator, but it is at least honest – it has really large variance and it reports a really large confidence interval. Table S2.2 in GLMWM shows that ignoring detection probabilities is often too cocky – the reported confidence intervals are too small (which has nothing to do with and in no way changes that ignoring detection probabilities is in many case still a better or equally good estimator of the mean – the conclusion from table S2.1). But using detection probabilities is just right – not too cocky, not too pessimistic – it’s confidence intervals are very accurate – when there’s a lot of variance, it knows it! In short Figure 2 is a good representation of reality over a large chunk of parameter space where method A is ignore detection (and has lower MSE on the estimate for Ψ but over-confident confidence intervals) and method B is use detection-based methods (and has worse MSE for the estimation of Ψ but has very accurate confidence intervals)..

(As a side-note, this closely parallels the situation for ignoring vs statistically treating spatial, temporal and phylogenetic autocorrelation. In that case both estimators are unbiased . In principal the variance of the methods treating autocorrelation should be lower, although in practice they can have larger variance when bad estimates of autocorrrelation occur so they are both roughly equally good estimators of the regression coefficients. But the methods ignoring autocorrelation are always over-confident – their reported confidence intervals are too small.)

So which is better – a low MSE (metric of how good at guessing the mean) or an honest, not cocky estimator that tells you when its got big error bars? Well in some regions you don’t have to choose using detection probabilities is a better estimator of the mean by MSE and you get good confidence intervals. But in other regions – especially when Ψ and p are low you have to pick – there is a tradeoff – more honesty gets you worse estimates of the occupancy. Ouch! That’s statistics for you. No easy obvious choice. You have to think! You have to reject statistical machismo!

Summary and recommendations

Let me summarize four facts that emerge across the WLD and GLMWM papers:

Ignoring detection probabilities (sensu WLD) can give an estimate of occupancy that is better (1/3 of parameter space), as good as (1/3 of parameter space) or worse than (1/3 of parameter space) estimates using hierarchical detection probability models in terms of estimating the actual occupancy. Specifically, ignoring detection guarantees bias, but may result in sufficiently reduced variance to give an improved MSE.These results come from well-known proponents of using detection probabilities using a well-known package (unmarked in R), so they’re hard to argue with. More precisely, ignoring detection works best when Ψ is low (<50%) and p is low, using detection works best when Ψ is high (>50%) and p is low, and both work very well (and roughly equally well) when p is high (roughly when p>50% and certainly when p>80%) rgardless of Ψ.
Ignoring detection probabilities leads to overconfidence (reported confidence intervals that are too small) except when p is high (say >70%). This is a statement about confidence intervals. It does not affect the actual point estimate of occupancy which is described by #1 above.
As data size gets very large (e.g. 4-5 repeat visits of 165 sites) detection probability models general get noticeably better – the results in #1 mostly apply at smaller, but in my opinion more typically found, sample sizes (55 sites, 2 repeat visits).

And one thing talked about a lot which we don’t really know yet:

Both WLD and GLMWM talk about whether working with detection probabilities requires larger samples than ignoring detection probabilities. Ignoring detection probabilities allows Ψ to be estimated with only single visits to a site, while hierarchical detection probabilities requires a minimum of 2 and as GLMWM shows really shines most with 3 or 4 repeat visits. To keep a level playing field both WLD and GLMWM reports results where the non-detection approach uses the repeat visits too (it just makes less use of the information by collapsing all visits into either species seen at least once or never seen). Otherwise you would be comparing a model with more data to a model with less data which isn’t fair. However, nobody has really full evaluated the real trade-off – 50 sites visited 3 times with detection probabilities vs 150 sites visited once with no detection probabilities. And in particular nobody has really visited this in a general way across the whole parameter space for the real-world case where the interest is not in estimating Ψ, the occupancy, but the β’s or coefficients in a logistic regression of how Ψ varies with environmental covariates (like vegetation height, food abundance, predator abundance, degree of human impact, etc). My intuition tells me that with 4-5 covariates that are realistically covarying (e.g. correlations of 0.3-0.7) getting 150 independent measures of the covariates will outweigh the benefits of 3 replicates of 50 sites (again especially for accurate estimation of the β’s) but to my knowledge this has never been measured. The question of whether estimating detection probabilities requires more data (site visits) remains unaswered by WLD and GLMWM but badly needs to be answered (hint: free paper idea here).

So with these 3 facts and one fact remaining unknown, what can we say?

Detection probabilities are not an uber method that strictly dominates ignoring them. As first found by WLD and now clearly shown to be general in the appendices of GLMWM, there are fairly large regions of parameter space where the primary focus – the estimate of Ψ – is more accurate if one ignores detection probabilities! This is news the detection probably machismo-ists probably don’t want you to know (which could be an explanation for why it is never discussed in GLMWM).
Detection probabilities clearly give better estimates of their certainty (or in a lot of cases uncertainty) – i.e. the variance of the estimates.
If you’re designing data collection (i.e. estimating # of sites vs # visits/site before you’ve taken measurements – e.g. visit 150 sites once or 50 sites 3 times), I would recommend something like the following decision tree:
1. Do you care more about the estimate of error (confidence intervals) than the error the estimate (accuracy of Ψ)? If yes then use detection probabilities (unless p is high).
2. If you care more about accuracy of Ψ, do you have a pretty good guess that Ψ much less or much greater than 50% or that p is much greater than 70%? If so then you should use detection probabilities if Ψ is much greater than 50% and p less than or equal to 50-60%, but ignore them if Ψ much less than 50% or p clearly greater than 50-60%.
3. If you care more about accuracy of Ψ and don’t have a good idea in advance of roughly what Ψ or p will be, then you have really entered a zone of judgement call where you have to weigh the benefits of more sites visited vs. more repeat visits (or hope somebody answers my question #4 above soon!).
4. And always, always if you’re interested in abundance or species richness, don’t let somebody bully you into switching over to occupancy because of the “superiority” of detection models (which as we’ve seen is not even always superior at occupancy). Both the abundance and species richness fields have other well established methods (e.g. indices of abundance, rarefaction and extrapolation) for dealing with non-detection.
5. Similarly, if you have a fantastic dataset (e.g. a long term monitoring dataset) set up before detection probabilities became fashionable (i.e. no repeat visits) don’t let the enormous benefits of long term (and perhaps large spatial scale) data get lost just because you can’t use detection probabilities. As we’ve seen detection probabilities are (a good method, but also a flawed method which is clearly outperformed in some cases just like every other method in statistics. They are not so perfect that they mandate throwing away good data.

The debate over detection probabilities have generated a lot more heat and smoke than light, and there are clearly some very machismo types out there, but I feel like if you read carefully between the lines and into the appendices, we have learned some things about when to use detection probabilities and when not to. The question #4 still remains a major open question just begging for a truly balanced, even-handed assessment. What do you think? Do you use detection probabilities in your work? Do you use them because you think they’re a good idea or because you fear you can’t get your paper published without them? Has your opinion changed with this blog?

*I’m aware there are other kinds of detection probabilities (e.g. distance based) and that what I’m really talking about here are hierarchical detection probabilities – I’m just trying to keep the terminology from getting too thick.

**Although I have to say I found it very ironic that the software code GLMWM provided in an appendix, which uses the R package unmarked, arguably the dominant detection probability estimation software, apparently had enough problems finding optima that they rerun each estimation problem 10 times from different starting points – a pretty sure sign that optima are not easy to find.

65 thoughts on “Detection probabilities, statistical machismo, and estimator theory”

Dan Linden on September 15, 2014 at 1:07 pm said:

Glad I could contribute to the poking of the bear.

I still don’t see why precision should be celebrated when it is so biased (as I said in my comment on the previous blog post). I agree that those appendices provide some great info to this discussion, but I fail to see where biased yet precise estimates are useful. Maybe you can provide an example.

A biased estimation of where the species is present will not get you very useful estimates of regression coefficients. How could it? You want to understand the relationship between X and Y, yet you don’t have accurate estimates of Y. There is a long history of handling measurement error in statistical modeling. If anything, the bias will be compounded by trying to estimate relationships with a error-filled response variable. I would say it’s better to recognize how little information your data contain with understandably large confidence intervals. With small data, you just can’t say much.

I agree that answering question #4 would shed some important light here.

Reply ↓
- Brian McGill on September 15, 2014 at 1:41 pm said:
  
  Well, lets say the true answer is psi=0.5
  Method A gives 0.51, 0.52, 050, 0.51, 0.52, 0.50 (biased but precise)
  Method B gives 0.50, 0.60, 0.70, 0.40, 0.30, 0.50 (unbiased but imprecise)
  
  It is ultimately subjective, but I would sure prefer Method A over Method B. And MSE/RMSE would certainly say Method A did a lot better. When evaluating these, recall that in the real world you will only get one of those numbers effectively chosen at random – not the whole distribution.
  
  I agree with you that the conversation needs to turn more to estimating beta (i.e. how occupancy varies with environmental covariates) as that is in many (most?) cases the real world problem. But personally I suspect (nobody will know until somebody does the simulations) that bias will cause even less problems here – we may be consistently 0.10 low, but if what we really care about is how occupancy goes up with vegetation height (the derivative or slope), then a bias affects mostly the intercept not the slope. At least that’s what my intuition is saying.
  
  Keep in mind that when you say error-filled response variable messing up estimates of slope that variance in psi matters as well as bias in creating an error-filled response variable (i.e. Method B above), and as shown in GLMWM, the variance can be pretty bad (for both hierarchial and no-detection models actually) – that’s a subtheme I didn’t touch on – just how bad either model can be in very high or low occupancy, very low detection scenarios.
  
  Reply ↓
- Jeremy Fox on September 15, 2014 at 1:52 pm said:
  
  Dan, are you questioning Brian’s argument that unbiasedness isn’t everything only in this particular context? Or are you denying it in general, and saying that unbiasedness is always and everywhere far more important than precision in all statistical applications? If the former, I don’t really have anything to add, not knowing anything about occupancy modeling besides what I’ve learned from Brian’s posts. But if it’s the latter, that’s quite an extreme position, and I confess I don’t quite see the motivation for adopting it. For instance (just the first thing that occurred to me): http://statweb.stanford.edu/~tibs/lasso/lasso.pdf
  
  Or am I just misunderstanding? (if so, apologies)
  
  Reply ↓
  - Dan Linden on September 15, 2014 at 2:14 pm said:
    
    Definitely the former. I wouldn’t argue that there are never cases where the use of a biased estimator is necessary or useful. Sometimes it is all you have, and that is part of Brian’s point here. It depends on the objectives of your modeling exercise and I was making a generalization about species distribution modeling.
    
    I think the lasso is trying to address a problem that Brian mentions earlier about how any given dataset represents a single realization, and a model trying to extract parameters may on average perform well but for a particular case be off in certain ways. The lasso tries to rein in regression coefficients that are simply artifacts, but in doing so can induce bias. I think… 🙂
Brian McGill on September 15, 2014 at 1:58 pm said:

Just found this very nice source on the bias-variance trade-off: http://scott.fortmann-roe.com/docs/BiasVariance.html

And the money quote from there:

Fight Your Instincts
A gut feeling many people have is that they should minimize bias even at the expense of variance. Their thinking goes that the presence of bias indicates something basically wrong with their model and algorithm. Yes, they acknowledge, variance is also bad but a model with high variance could at least predict well on average, at least it is not fundamentally wrong.

This is mistaken logic. It is true that a high variance and low bias model can preform well in some sort of long-run average sense. However, in practice modelers are always dealing with a single realization of the data set. In these cases, long run averages are irrelevant, what is important is the performance of the model on the data you actually have and in this case bias and variance are equally important and one should not be improved at an excessive expense to the other.

The piece is targeted at machine learning (a discipline that talks a lot about the bias-variance trade-off) but it is a good read for somebody wanting to dig deeper into the topic.

Reply ↓
- Dan Linden on September 15, 2014 at 2:23 pm said:
  
  Should have refreshed before I replied above. But I agree that this problem of a single realization is really important and until you simulate data that mimic your own situation and fit the models you intend to use, you’ll never fully appreciate the uncertainty in the estimation. So yeah, trying to apply hard and fast rules about which models to use can lead you astray. The rules are general for a reason.
  
  I’ve spent my fair share of time simulating data for occupancy and other mark-recapture models, only to find that certain data configurations can perform very poorly. It can be a sobering experience!
  
  Reply ↓
- Dan Linden on September 15, 2014 at 2:53 pm said:
  
  If I really wanted to open a can of worms, I would say that using a Bayesian approach is useful for handling the bias-variance trade-off in small data situations. But I’ll keep that can closed.
  
  Reply ↓
  - Brian McGill on September 15, 2014 at 3:53 pm said:
    
    Hah! – you do enjoy poking sticks don’t you 🙂 But in all honesty if that’s your rationale, then its a good one and go for it. Its the people who use Bayesian methods without really understanding why but think it will “make the paper better” or “using the most up to date methods” – or most calculating of all “will get in a better journal with a fancier method” that I take issue with.
  - Andrew Dolman on September 16, 2014 at 1:49 pm said:
    
    A bayesian approach was my thought too. It must be possible to define some reasonable prior distributions for detection probabilities of various organism types in various habitats. We shouldn’t have to estimate each time from scratch the probability of detecting say a bird in a patch of woodland.
Jeremy Fox on September 15, 2014 at 10:58 pm said:

Another excellent paper for shaking up your intuitions about what makes for a good estimator (and then replacing them with better intuitions): a classic popular article by Brad Efron on “Stein’s paradox”:

Click to access Article1977.pdf

Reply ↓
Mark Brewer on September 16, 2014 at 8:57 am said:

Thanks Brian – more great thought-provoking material! I agree that a priori, neither including or excluding detection probabilities is uniformly optimal.

One of the reasons I like Simon Wood’s mgcv package in R so much is that it has built-in guards against machismo. When fitting a GAM using that package’s gam() function, response/covariate relationships are shrunk towards being straight lines if there in insufficient evidence for them to be wiggly. A related, sensible approach to the detection probability issue discussed here would be to apply some shrinkage – via either some ridge-type penalty (in a frequentist world) or some regularising prior (in a Bayesian world) – to allow the data to decide whether detection probability is important enough or even estimable.

Of course, if the data say NO to detection probabilities, I’d then feel justified in fitting a simpler model without them. In a similar vein – and this is something that bugs me! – I wish people would refit GAMs with linear terms when gam() is telling them some relationships are straight lines!

Reply ↓
Richard on September 16, 2014 at 12:38 pm said:

Brian, thanks for the engaging post. I agree with you on several points: all statistical methods have their limitations; researchers should use whatever method they think is best; you shouldn’t discard data in cases where detection probability can’t be estimated; and it’s not good to focus on occupancy if you’re really interested in abundance. However, I disagree with many of your characterizations of the GLMWM paper, and I think a better summary of this debate is as follows:

(1) If you model the detection process, the estimator of psi will have low bias and appropriate confidence interval coverage.

(2) If you ignore the detection process, and view the estimator as an estimator of psi, it will be biased and will underestimate the variance, which explains the poor coverage (and the lower MSE).

(3) If you ignore the detection process, and view the estimator as an estimator of psi*p, it will be unbiased and will exhibit correct coverage.

(4) However, if you are modeling psi*p, you can’t be sure if your covariates are explaining variation in occurrence or detection, which is one reason why very few people have advocated this approach.

I wonder if you agree with these conclusions?

Reply ↓
- Brian McGill on September 16, 2014 at 1:21 pm said:
  
  I already acknowledged every one of those points in my post.
  
  I’m getting kind of tired of detection probability folks repeating over and over and over that ignoring detection probabilities is biased. Yes it is. But is that fact important? No – not particularly. I think it is used as a bogeyman to scare people without advanced statistical training. And I hope my post has changed that.
  
  In return, why do you find it so hard to acknowledge the main point of my post? – that for the primary job of providing a point estimate of psi it is not at all obvious that you are better off to use detection probabilities – indeed often are demonstrably by objective, quantitative criteria (MSE) much worse off.
  
  Reply ↓
Richard on September 16, 2014 at 1:49 pm said:

It’s not just that the point estimator is biased, the variance is underestimated too (if you regard the naive model as an estimator of psi). So, I think your focus on MSE as the best criterion for evaluation is misleading.

Also, having worked with many of the people that developed these methods, I can tell you that no one is trying scare anyone. Where did get that idea? Who has been talking about the boogyman?

The motivation for developing these models has always been a desire to advance knowledge and inform conservation efforts. These methods are especially important when working with managers and policy makes because they usually want to know what the population size is or how many sites are occupied, not an index of these parameters. So, the intentions are good, and no one will be upset if some researchers prefer biased estimator with unreasonably small confidence intervals.

Reply ↓
- Brian McGill on September 16, 2014 at 3:23 pm said:
  
  I’m not sure what you’re saying that I haven’t also said, except acknowledging the point that the point estimators are much better for non-trivial parts of parameter space if you ignore detection probabilities. Do you disagree with this? or just avoid saying it?
  
  As far as motivations, I agree that more and better models are good. And hierarchical detection probabilities were a great innovation. However, it has reached a point of group think where people who don’t know the models as well as the inventors now insist that you “have” to use them. This is all I object to. Its usually the adoptors of methods that become the excessive advocates.
  
  Reply ↓
  - Richard on September 16, 2014 at 3:40 pm said:
    
    I was trying to make the point that I don’t find it all that useful to compare the MSE of the estimator of psi, to the MSE of the estimator of psi*p. They are different quantities and it’s important to decide which one is of interest. Sorry if that wasn’t clear.
    
    I agree with you that it is easy to abuse and over use these methods. Unfortunately, that is a hard thing to address. I appreciate that this is one of your objectives, but I think the message should be that people need to think very hard about what process they are trying to model, and then be up-front about the pros and cons of the method they choose. Instead, in my opinion, this discussion has focused more on technical points that obscure this message.
  - Brian McGill on September 16, 2014 at 4:00 pm said:
    
    I don’t understand why you say there was a comparison of the MSE of an estimator of psi to the MSE of an estimator to psi*p. The MSE reported in WLD and GLMWM are all the MSE vs the known true value of psi alone. Thus both the non-detection and hierarchical detection models are being evaluated (using MSE) as an estimator of psi alone (and some of the time the detection models lose). Nobody is evaluating the performance of an estimator of psi*p because nobody cares about such a thing.
    
    WLD was very clear that they were interested in psi, not psi*p. Its just that sometimes psi*p is a better point estimator of psi (hard to get your head around maybe but clearly shown to be true by GLMWM). Nobody is interested in psi*p, at least not that I know of – its just that psi*p is what is observed. Then the question becomes how best to estimate psi. (and sometimes going with the simple % of sites occupied works better than trying to disentangle psi from p).
    
    On a side note, I don’t personally think its an argument in favor of detection models to keep emphasizing that psi*p is what is measured directly. Anybody who knows statistics will quickly see that that means that psi and p will be very hard to estimate separately as two distinct parameters. In fact, I personally would interpret GLMWM as showing you need 4 or 5 site visits to actually begin to get good at disentangling (identifying in statistical terminology) – this is not something many people realize.
    
    I do find it kind of ironic that I am being accused of taking this debate into technical details. I’m not the one who can’t mention detection probabilities without saying the word “bias” which is a specific statistical term (not saying you are either, but there are a lot of such people out there). If the end result of this post is that people think carefully about the pros and cons of each method as it relates to their real-world biological system and there is a diversity of methods making it into the literature, my job is done!
    
    Thanks for the discussion so far!
Michael McCarthy on September 16, 2014 at 1:53 pm said:

Hi Brian,

I (or someone else) will get to the technical details in due course – that might take a few days. But can I just clarify in the meantime – do you really think we tried to hide particular results in the appendix rather than present them in the main body of the paper? In an era where everything is mostly read online, is this likely to be an effective strategy to hide results? And would we colour code the results if we were trying to hide them? And if you thought that hiding results was our motivation, would it be fair to ask before accusing us (given you are just guessing our motives)?

By showing that “the model that does not account for imperfect detection” usually has lower MSE error than the hierarchical model for one particular combination of parameters (see Table 3a in the main body of our paper – it is not hidden in the appendix), we are not really doing an effective job of hiding these results anyway. And your claim that we don’t discuss this result is wrong – it is in the “Message 3” section of our paper, accompanying that table.

Also, your suggestion “that the authors of both papers were probably smart enough to pick in advance a scenario that would make their side look good” is extremely questionable. I doubt WLD did that. I know we didn’t. Our paper points out that WLD examined a narrow range of parameters, yet made general statements about hierarchical detection models being unsuitable. If WLD had looked at other parameter values where the hierarchical model achieved better results, I doubt they would have made their general claim because they would have anticipated a response such as ours.

We could have used only one set of parameters to disprove WLD’s general claim – that is simple falsification. So it is not a question of cherry-picking, but using simple logic to disprove a general claim.

However, we examined a wider range of parameter values than just one set that falsified the general claim. See Fig. 6 of our paper as one example where we examined results over a wide range of parameter values. Your blog post itself points out other instances where we examined a wide range of parameter values. Your implication that we cherry-picked a limited set of parameters to support our case is simply wrong – and not supported by what we present in the paper.

Cheers,

Mick

P.S. We use the term “naive model” instead of “model that does not account for imperfect detection” because the former is less of a mouthful and the model is naive to imperfect detection – it seems fair enough to me and it has been used previously in the literature. However, give me a better term and we’ll use it in future so as not to offend anyone. (“Non-detection model” doesn’t work because there are detections.)

Reply ↓
- Brian McGill on September 16, 2014 at 3:47 pm said:
  
  Hi Mick – I appreciate your stopping by and engaging.
  
  As far as :”hiding” I never used this word you did. As best I can tell you have one sentence in Message 3 that acknowledge MSE is lower for non-detection models and paragraphs that talk about other stuff. Also, you did present MSE comparisons in Table 3 in the main text, but the only scenario you present there where non-detection does better than detection is the one WLD already showed which is a qualitatively different message than what comes out of Table S2.1. So I’ll stick by my claim that I actually did say that you produced “great appendix in GLMWM that unfortunately barely got addressed in their main paper” (meaning here Table S2.1 – you obviously spend a lot of time talking about S2.2). I don’t think I attributed motives to the authors about why Table S2.1 barely got addressed.
  
  But still I apologize if you thought I was accusing you of something untoward. Personally I think table S2.1 is really important and you did a great service in providing it.
  
  We’re just going to have to agree to disagree about whether the burden of proof is on GLMWM or WLD in terms of where a single counter-example is surprising news and where an exhaustive analysis of parameter space is needed. This comes from my viewpoint that detection probability models completely dominate the literature these days. Do you disagree with that last assessment?
  
  I’m not sure what you mean by “will get to the technical details in due course”. You’ve managed to kind of imply there are errors without naming any, which is not really playing fair in my book. I’m sure you can find a nit (or three) if you spend enough time – I wrote the post in a couple of hours as is usual on blog posts. I also kept the language really basic and non-technical because of general readership. But the basic facts of a bias-variance trade-off, MSE (and yes coverage of confidence intervals) are pretty simple stuff to somebody of your or my technical background and are laid out in black and white in your own paper and accurately reported here. I don’t really have an interest in distracting onto minor details when the core points are true. Do you disagree with my core points (#1-#3 in the summary)? If so which?
  
  “Simple” model would be my preference instead of “naive” – it means the same thing but doesn’t have the judgmental connotations.
  
  Again, thanks for stopping by and looking forward to further conversation.
  
  Reply ↓
  - Michael McCarthy on September 17, 2014 at 6:20 am said:
    
    I don’t mean to be unfair. I was posting my comment at around midnight Melbourne time, and the next few days are solid with meetings, teaching and family things in front of me (I’m in the middle of that now). If you want a sensible response to the technical details, it will take me longer than the time I have available over the next couple of days. I know various co-authors of our paper are travelling, so I doubt they have time to chip in just now either. I’m afraid you will have to be patient if you want a sensible response.
    
    I read your post as implying that we tried to hide results that didn’t support a pre-existing opinion. Here are three quotes from your piece:
    
    “Well the appendix to GLMWM is actually exceptionally useful (although it would have been more useful if they bothered to discuss it!)”
    
    “the authors of both papers were probably smart enough to pick in advance a scenario that would make their side look good”
    
    “This is news the detection probably machismo-ists probably don’t want you to know (which could be an explanation for why it is never discussed in GLMWM).”
    
    Given what you wrote, I think it is reasonable for me to conclude that you are suggesting we (including WLD) cherry-picked results and tried not to present others. I think at least some of your readers would come to the same conclusion. But as I have said in my previous comment, the evidence in our paper does not support that implication.
    
    The reason I took the time to comment on your post was that I think it unfairly disparages my colleagues, including WLD. I figured the technical discussion could wait, but I thought setting the record straight for your readers needed to be addressed sooner rather than later.
  - Brian McGill on September 17, 2014 at 12:00 pm said:
    
    Mick – definitely didn’t mean to disparage you or your colleagues. It’s hard for me to wrap my head around how you could produce the whole table S2.1 a lot of work (and not really give it more than a sentence or two in the whole main text. But it is your choice and I don’t know why you made the choice. The 2nd quote, well before ever looking at your table I had an intuition about how high vs low psi and p would play out in how the two models performed. If you didn’t I take your word. I don’t know about WLD. But either way, I don’t see it is negative/unethical/wrong if you did. We work with such intuitions guiding our analysis all the time. The 3rd quote is the only one I understand why somebody would feel targeted, but it is definitely not targeted at the authors of ZGLMWM. It is targeted at a specific group “deteciton probability macismo-ists” (which do exist, people who insist in every review that you need det probs and defend them as perfect to the death in public). I don’t consider it my role who fits in that box and who doesn’t. I leave it up to the individual. But I definitely don’t assume any of GLMWM fit in that category.
    
    Again sorry if you felt my saying you didn’t talk very much about the appendices was disparaging. It wasn’t intended that way.
  - Brian McGill on September 17, 2014 at 12:04 pm said:
    
    And Mick – I’m really most interested in whether you think my 3 summary points are right or wrong. I don’t think it’s very productive to go haring off on some technical details. Thanks
Richard on September 16, 2014 at 4:20 pm said:

I misspoke. However, I still think the focus on MSE in this context is misleading because (if you are trying to model variation in psi, and you acknowledge imperfect detection), then ignoring the detection process will underestimate psi and the sampling variance. This is problematic in my view. Anyhow, gotta go. Thanks for the discussion.

Reply ↓
- Brian McGill on September 16, 2014 at 7:48 pm said:
  
  One can prioritize confidence intervals over overall MSE of point estimation. It is a subjective call, but I don’t think its the choice most people would make.
  
  Reply ↓
  - Mark Brewer on September 16, 2014 at 8:07 pm said:
    
    Well, this discussion heated up!
    
    On this particular point, surely it’s a question of what you’re trying to do?
    
    If all you want is an estimate of psi, then Table S2.1 in GLMWM does indeed provide guidance as to when detection models might be beneficial or not, IF you are prepared to make assumptions as to the likely levels of psi (e.g. high vs low).
    
    On the other hand, if you actually want to make inferences, you need to get the variances right or your inferences will be garbage. This is why statisticians prefer the detection model; it’s not even machismo, it’s just the right way to achieve what we (statisticians) ordinarily want to do.
    
    I can’t help but feel there’s a lot of arguing at cross purposes here.
    
    For the record, as a statistician, I have spoken (at a stats ecology conference!) about “method ageism” (I guess akin to machismo) where statistical techniques are discounted by some just because newer methods exist; those newer methods aren’t always better, sometimes aren’t even appropriate. But detection models have their uses, just as they have their weaknesses – and I honestly thought GLMWM was a superb piece of work, as those weaknesses were not hidden away.
    
    Brian, your article above was excellent and thought-provoking; I’ve enjoyed some of the responses here rather less, I’m afraid to say.
  - Brian McGill on September 16, 2014 at 8:20 pm said:
    
    Mark I agree that you need confidence intervals to do inference. The issues arise when there is a trade-off. EG you get a worse point estimate to get better confidence intervals (which applies to some regions of parameter space for detection probabilities). Or in the real world, you have to collect much more data to fit detection models. In these cases, I think we’re agreeing that there is no right answer and that it depends on the goals?
    
    In reply to your earlier point about GAM and mgcv – I tend to agree I’m also a fan of GAM (although I find the interpretation of model complexity with interacting terms a bit of a slog, but there’s probably no easy answer there).
    
    I think it is not so easy in the detection case though to say just run it with detection models to see if you need them for three reasons: 1) sometimes the detection model is very wrong, and 2) running the detection model seems to be very costly in data collection – it’s not just an analysis issue but a data collection issue, and related 3) some really high quality datasets were never designed with detection proabilities in mind and so don’t have repeat visits. Does this make sense or am I missing something?
Dan Linden on September 16, 2014 at 7:31 pm said:

This does not address question #4, but it does illustrate the simple case where the regression estimates are biased low (for both the intercept and slope) when trying to model the relationship between species occurrence and some site covariate, in the face of imperfect detection:

Code for simulating occurrence data and fitting simple (naive) model

Note that in my R code example I’ve set p = 0.8, which you might assume would not be overly problematic. You can be the judge as to whether that’s the case, but you can also examine how much worse it gets under lower values.

Reply ↓
- Brian McGill on September 16, 2014 at 8:03 pm said:
  
  Thanks Dan – code is always a constructive contribution!
  
  I guess to convert this into an answer of #4 (I know, not your agenda!) one would have to:
  a) estimate both the simple/naive and the hierarchical detection and compare their MSE for beta0 and beta1 (the fact that the simple/naive is biased by itself is not news – although I do find it interesting it affects slope as well as intercept)
  b) make this multivariate (instead of one predictor, use 4 or 5 explanatory variables and have them moderately correlated)
  c) do this over the whole parameter space and especially compare the 50×3 vs 150×1 case
  
  Reply ↓
Special Agent Oso on September 16, 2014 at 8:24 pm said:

Hi all,

Here are a few comments on the general issue addressed by the blog
post as well as previous incarnations of the whole thing (I edited my
comments from the unmarked group to remove all the inflammatory
stuff):

A. I think the core argument about “detectability” that Brian makes
(and Welsh et al. too, and others) and the back and forth here in the
discussion of the post is really a distraction and misses the big
point, which is: what is the ecological process that is the focus of
inference? The use of an occupancy model clearly identifies that
objective as “probability of occurrence” which I think is often a
reasonable thing to develop models for. But what about Brian McGill
and these guys? Would they claim the object of inference to be the
product p*psi? If this is such a good thing, how come no one ever
states upfront that p*psi is the target of their inference? Think
about that: what if we claim to do science about some ecological
process, but then never declare what that process is? Does that make
sense? To me it does not.

By saying it’s really great to have estimators of p*psi that have low
MSE, then what problem in ecological science does this help you solve?
So you estimate thing A with less variability than someone else
estimates thing B. But it seems relevant what thing A and thing B are
and whether one of them is more relevant to the scientific or
management question being asked.

This has nothing to do about “detectability”, nothing even about
statistics, and especially not about being macho in your use of statistics,
but is really more of a basic conceptual or philosophical issue.
What system are we studying, what is our objective, what is a
sensible model of that system? (and note that occupancy models are a
model of the system, not some black-box for generating estimators of
things, so the model makes sense as a description of a state of
nature and how we observe that).

B. There are now many papers (and blog entries) which scold people for
“detectability”. Any statistician could make a living by writing such
papers — simply take any procedure, find where it breaks, and then
hold that situation up as the evidence against doing it (and write a
paper scolding the users). But this evidence doesn’t support the
alternative, or any specific alternative for the simple reason that we
know, without doing ANY analysis, that any statistical procedure will
not function well uniformly over all conditions. Understanding
performance over a range of conditions is useful for designing studies
in which we aim to use the procedure but there is no logical basis for
dismissing the whole procedure because we identify some conditions
where it breaks down. (unless you are against all of statistics which is
not an argument I’m seeing anyone make).

So we have an example of this: People use occupancy models for making
inferences about psi (see point 4 above), Welsh et al. write a paper
demonstrating some unfavorable conditions and it is then used to
dismiss the whole idea of occupancy models.
“here are some cases where, for an estimator of some thing, and by
some objective function, we do better by not accounting for detection
probability and therefore the detection probability guys are wrong!”
Instead, this could at best be interpreted as being relevant to study
design and not the model of the system (occupancy), unless someone has
a coherent argument to support the virtues of inference about
“p*psi”. i.e., what ecological problem does it solve?

C. Summary:

Getting mired down in quibbling about bias and variance and MSE
is completely unproductive. You can build a robot to take a random side of
any statistical issue (pro or con) and support it’s randomly chosen side with
simulations.

It is preferable to focus on clear statements of objectives and building
sensible models for ecological systems (which we normally have to observe
by taking samples of individuals, of spatial locations, of species, etc.). Does
some methodological framework describe your system and facilitate inference
about some quantity that gives a clear interpretation to your scientific or management
question? Then use it.

You should be able to convince ecologists of this without talking about bias,
variance, estimation or anything else having to do with statistics. Models
have utility (as descriptions of a system) even in the absence of data to train
them on and I think people lose sight of that in this age of so much data, fast
computers and cool statistical procedures.

Special Agent Oso

Reply ↓
- Brian McGill on September 16, 2014 at 8:42 pm said:
  
  SAO – this closely parallels my discussion with Richard.
  
  Everybody is trying to estimate psi. I don’t know anybody who is trying to estimate psi*p. But the fact is we observe psi*p. So the question is how best to estimate psi. Since it is so hard to separate psi from p in psi*p without very many repeat visits, sometimes ignoring detection probabilities is the best way to estimate psi. IE sometimes psi*p IS the best point estimator for psi, so if you want the best point estimator for psi (which, along with good confidence intervals for estimate of psi, is the goal), then strict mathematics tells you you should use psi*p, not because your goal is to study psi*p. And again, I’m not the one who keeps bringing up bias! I just said IF you want to talk about bias you better be fair and talk about variance MSE. And the ONLY time I bring up MSE it is as an estimator of psi. I NEVER brought up MSE as an estimator of psi*p.
  
  Just to be clear my goal is the best estimator of psi. There is no best estimator. It is a complex problem that requires trade-offs. Often the application is a real world conservation problem where the question is not whether psi is significantly different from something but just the best point estimate of psi (endangered species has occupancy of x or for endangered species occupancy depends on vegetation as relation y). It’s a sad but true fact that sometimes to get this you have to sacrifice the best confidence intervals or collect boatloads of data that may not be realistic. And you end up using psi*p as the best way to get to estimating psi. Don’t blame the messenger! And don’t twist my words about what my goal is.
  
  Reply ↓
  - Torbjørn Ergon on September 17, 2014 at 7:56 am said:
    
    I understand what you are saying Brian, but I don’t agree with your conclusion. MSE can be a very useful tool if your goal is to minimize prediction error, but this requires that you have validation data where the response variable can be measured exactly (and inexpensively). But we don’t have this for occupancy probability (and many other quantities we are interested in ecology) because occupancy probability (psi) is a latent variable that cannot be measured exactly. Neither can occupancy state at specific sites (0/1 variable) in many situations – at least not without very high effort. Of course, you can always calculate MSE from simulations (where you know the truth), but basing your choice of interference model on this would be very sketchy because you have to rely on the very assumptions you put into the simulations – I wouldn’t use such an argument unless it is not possible to separate psi from psi*p.
    
    I would never base my inferences about psi on psi*p if it was possible to estimate psi, unless I had a very good argument that p was close to 1 and constant – but the best argument for that would be to show that the estimates of p from an occupancy model is close to 1. It is not harder to fit an occupancy model in e.g. the R package ‘unmarked’ than a GLM, so I don’t see any good reason for basing your inference about psi on psi*p if you don’t have to.
    
    One more related comment (and related to many of the comments by others above): When assessing model performance for best inference (as opposed to best prediction), based on simulations, I think it is a lot more relevant to look at confidence interval coverage failure rates than MSE. If you do that, you may reach a different conclusion.
  - Brian McGill on September 17, 2014 at 11:46 am said:
    
    I’m not quite sure I get your point about MSE. They’re not based on real-world measurements of true psi. They’re based on simulations where we know exactly what psi is supposed to be. This is a fairly standard technique.
    
    I have absolutely nothing new to say about your 2nd point except nobody is talking about doing inference on psi*p so why does everybody keep repeating it? It seems like some talking point that has been circulated?
    
    Again – I’ve already said if inference trumps point estimate then yes you have to go with the method that gives better confidence intervals. I’m just curious exactly what the real-world scenario is where people are so ready to throw out a good point estimate to make sure they have the p-values right?
  - Dan Linden on September 17, 2014 at 2:46 pm said:
    
    Brian, when you survey for species occurrence the detection data are generated by an ecological process (psi) and an observation process (p). If you choose to ignore p, then your parameter estimate for psi is actually that for psi*p. Therefore, any inferences being made with the simple model relate to psi*p, not to psi. This is why people keep mentioning it. You may find that an estimator of psi*p can approximate psi well enough and you can say your intended inferences are for psi, but the parameter being estimated is still psi*p.
    
    If your observation process is really good and p is near 1 then obviously psi*p will match psi well, if not perfectly. Also, if p is constant across sites and surveys, making inferences on psi*p for the purposes of habitat relationships may still not be problematic. This is especially true if all you are interested in is showing some non-zero effect of a covariate. My R example showed that the covariate was important, despite the dampened effect.
    
    This talk about MSE being better for the simple model refers to the very simple simulations where NO covariates are included for either data-generating process and the observation process is imperfect but constant. That is really the best case scenario for field data collection, and for the simple model. If you start introducing covariates into these simulations, which is far more applicable to real world scenarios, we would get a better idea of how the tradeoffs change. It is hard for me to imagine that unexplained heterogeneity in p would allow the MSE to remain low for the simple model. And there are many examples showing how ignoring imperfect detection can alter inferences about habitat relationships. I suppose we need the question #4 material to fully explore just how often it does and does not matter.
  - Brian McGill on September 17, 2014 at 3:42 pm said:
    
    You can tell me I am estimating psi*p all you want. But I only care about psi, am estimating psi, and sometimes get closer to psi using the dreaded psi*p then you can get using psi and p estimated separately. Really not sure what more I can say about that. The confidence interval argument (need to estimate separately to get better confidence intervals) I at least understand and quickly agree its a matter of priority point estimation accuracy vs. inferential accuracy (I would make the case that in the real world for this question point estimation is often more important but not always and I can quickly see how reasonable people can disagree). But the “I’m really studying psi*p” which I’ve heard over and over is wrong and I just don’t know what more I can say. If I want to study population variance I don’t study 1/N*Σ(x-xbar)^2 which is the sample variance, nor do I study the population variance (because I cannot measure it and don’t know what it is). Instead, I study N/(N-1)*sample variance which is NOT the sample variance, the population variance or anything else except an estimator of the population variance that has good properties. The analogy to using psi*p to estimate psi (which is also unknown and unmeasurable) is pretty exact. Estimating is tricky and non-intuitive.
    
    I think there was one covariate in WLD and GMLWM, no? And without knowing true occupancy values in the field, simulations (where we do know the true value of psi) is about the only way to assess MSE, no?
  - Dan Linden on September 17, 2014 at 4:51 pm said:
    
    The simulations in Appendix S2 that you claim provide the smoking gun for this debate do not examine variation in either psi or p. Both are held constant.
    
    For another analogy, how well could we make inferences on the incidence/prevalence of a disease over time and space if we had no understanding of the reporting rate? If you make faulty assumptions about the reporting rate and trudge ahead, you risk making conclusions about variation in incidence that is actually attributable to variation in reporting. This doesn’t stop one from taking available data and trying to make the most of it when necessary, but it DOES convince other scientists to temper the conclusions drawn from such a study and consider the observation process when designing new studies.
  - Brian McGill on September 17, 2014 at 5:54 pm said:
    
    You’re right about S2, but the results in the main text of GLMWM and WLD that are single points but kind of echo the results in S2 have covariates I think. In any case, we probably both agree more analysis of situations with multiple covariates are needed?
    
    All I’m going to say on the last point is many goals, many ways to get to each goal.
Florian Hartig on September 17, 2014 at 8:26 am said:

I guess we can agree on the fact that a more complicated model, even if structurally correct, will not always have a lower MSE when comparing it to a simpler model using the same, limited amount of data.

What I’m missing in the discussion though (not sure if the mentioned papers cover that) is that observation submodels permit including additional data that cannot be used in simpler models, either in the form of priors on detection probabilities, or in the form of repeated measurements, to fix detection probabilities directly. May be our current datasets do not include this kind of data, but future datasets could, and having such dedicated data should be vastly more effective than estimating detection probabilities indirectly. May result in a very different assessment of inferential power.

Reply ↓
- Brian McGill on September 17, 2014 at 11:55 am said:
  
  Thanks FLorian,
  
  I think you have hit the key point – how much additional information is there in repeat visits? The simple, no-detection model uses this only minimally – it basically boils it down to all absent=absent or at least one present=present. The hierarchical model makes better use of this. Specifically it uses the frequency of 1/1 vs 1/0 and 0/1 vs 0/0 (to give the simplest two visit case) to estimate detection probability. My main conclusion from GLMWM is that two visits doesn’t give enough information to do this very well (indeed this is in essence why some of the detection probabilities fit so badly). And they do especially badly if p is really low (because then there are lots of 0/0 but not many 1/0 or 0/1). But if you have 4 repeat visits you start to get a good estimate of p and then the whole model starts to work well.
  
  Which kind of takes me back to my original post (and today’s post). What are the costs of visiting a site 4 times? Obviously labor costs. But I speculate there may be some statistical costs too – when you are looking at covariates between sites losing on number of sites you can visit hurts. Of coursing its not hard to imagine visiting fewer sites could degrade the accuracy of estimating the overall occupancy of a region too via poorer sampling. I don’t think anybody really knows right now.
  
  Reply ↓
  - Florian Hartig on September 17, 2014 at 3:30 pm said:
    
    Hi Brian, I missed the details about how they set it up, thanks for clarifying.
    
    Well, as you say, its hard to make any general statements without going through this more specifically. And there’s a lot of details that could make a difference. For example, I’m not totally convinced that it is necessary to visit each site x times to build a good observation model. Depending on how complex one wants the observation model to be, it may be OK to make detectability tests on a limited number of sites, and then use this to correct much larger datasets.
    
    Similarly (also relating to your new post), I probably wouldn’t recommend people to visit every site 3 times by default, but I think it’s a good idea to visit a few sites several times to get an appreciation of the observation uncertainty and think about whether it is a problem. And for later analysis (e.g. uncertainty estimation) it can be really useful to have these data.
Pingback: Detection probabilities – back to the big picture … and a poll | Dynamic Ecology
Darryl MacKenzie on September 17, 2014 at 10:40 pm said:

Hi Brian
Guru metioned in an emial to the coauthors of our paper that you’d posted something about our paper. After reading your post and the comments I thought I’d chip in with a couple of my own.

1. It’s a fact that if you’re going to use a method that does not account for detection then the quantity you’re estimateing/modelling is psi*p, or more generally psi*(1-(1-p)^k) = psi*p_overall if number of surveys (k) >1. It doesn’t matter how much you protest that your actually intersted in psi, but ignoring detection that’s not what you’re getting unless p_overall is close to 1. It keeps getting raised because it’s something you don’t seem to be grasping and I really doubt anyone is circulating it as a talking point.

2. I agree that MSE is useful for comparing methods and in some cases the MSE from a simple model that ignores detection can be smaller than the MSE for psi from an occupancy model. Table S.2.1 clearly demonstrates this. However, and this was one of the points of our paper, is that if you don’t estimate detection you don’t know what p is so therefore you can get the same naive occupancy estimate for a number of different combinations of values for psi and p (and you note this yourself above). So lets follow that thought. Suppose we come back from the field after surveying 55 sites twice and decide we want to estimate occupancy but not worry about detection, and come up with an estimate of 0.15. Does that estimate have a smaller MSE than what we would get if we accounted for detection? Below I’ve rearranged Table S.2.1 S=55, K=2 and ordered the results in terms of naive occupancy (ie psi*p_overall). What you can see is that we get a naive value of 0.15 if the combination (psi,p) = (0.2,0.5) in which case the MSE of the simple approach (ignoring detection) is smaller than the MSE from the occupancy model. However, we also get a naive value of 0.152 (and could make it 0.15 with some small tweaks) for the case (0.8,0.1) where the MSE of the occupancy model is the smaller. Obviously the table only examines things at finite number of discrete points, but the pattern holds for other combinations of values: the same naive occupancy value (which is what you would observe in the real world) can have quite a varying MSE depending upon what the true (unknown) values for psi and p acutally are. Arguing that the it’s ok to ignore detection in some cases because the MSE is smaller can only be reasonble if you can identify which case you might be in, but if you’re not going to bother with estimating detection, how do you know which case that might be?

naïve psi p MSE_simple MSE_occ
0.038 0.2 0.1 0.025 0.536
0.076 0.4 0.1 0.104 0.289
0.102 0.2 0.3 0.011 0.242
0.114 0.6 0.1 0.239 0.147
0.15 0.2 0.5 0.005 0.052
0.152 0.8 0.1 0.423 0.091
0.182 0.2 0.7 0.003 0.005
0.198 0.2 0.9 0.003 0.003
0.204 0.4 0.3 0.042 0.088
0.3 0.4 0.5 0.014 0.023
0.306 0.6 0.3 0.091 0.057
0.364 0.4 0.7 0.005 0.006
0.396 0.4 0.9 0.004 0.004
0.408 0.8 0.3 0.159 0.033
0.45 0.6 0.5 0.027 0.02
0.546 0.6 0.7 0.007 0.007
0.594 0.6 0.9 0.005 0.005
0.6 0.8 0.5 0.045 0.015
0.728 0.8 0.7 0.009 0.006
0.792 0.8 0.9 0.003 0.003

3. You’re also quite wrong with this statement in your post:
“The problem for the detection model is that if you only have two or three repeat observations at a site and p is high, then most sites where the species is actually present it will show up at all two or three observations (and of course not at all when it is not present). So you will end up with observations of mostly 0/0/0 or 1/1/1 at a given site. This does not help differentiate (identify) Ψ from p at all. Thus it is actually completely predictable that detection models shine when p is low and ignoring detection shines when p is high.”
There is actually lot of information about detection if your observations are mostly all 0’s or all 1’s and detection models actually work really well in those cases. But, because detection is so high, inferences from methods that include or ignore detection are going to be very similar.

Cheers
Darryl

Reply ↓
- Brian McGill on September 18, 2014 at 1:31 am said:
  
  Thanks Darryl for dropping by.
  Point by point
  1) This is at best a semantic argument (and one in which I challenge you to find a statistics book supporting your definition of estimator). psi*p may be a good or a bad estimator of psi which is the discussion I am trying to have but it’s an estimator. In point of fact 0.42 is an estimator of psi as well. So is the number of legs on the last observed creature divided by four. Did you read Jeremy’s link to Stein’s paradox where Brad Efron goes on about how using baseball batting averages improves the estimator of the percent of foreign cars? In statistical terms ANY function of the data is a statistic which may be evaluated as an estimator. Direct from the Wikipedia entry on estimators “The definition places virtually no restrictions on which functions of the data can be called the “estimators”.” I can find you similar quotes in hard core stats textbooks but we can’t all access them. Until somebody wants to quote a real, credibly sourced statistical definition of estimator that proves I can’t use psi*p to estimate psi or engage with me seriously about why N/(N-1)*sample variance which is the product of a function of N and something that is not population variance cannot (in analogy to the claim you’re making) be an estimator of population variance, I am past done wasting time on this one. Really people. Statistics is a precise mathematical language. You can’t just make stuff up.
  
  psi*p is a biased estimator is true but barely relevant (the main point of this post) and psi*p is not an estimator of psi is just flat out wrong. Nearly everything else is a matter of opinion about which one can have a reasonable discussion. But these two are a waste of time.
  
  2) Different emphasis but not fundamentally different from what I said. For example I talked about a scenario where you have a good guess at the detection rates and another scenario where you don’t. You may or may not think it is kosher to use biological knowledge to get a “prior” on the estimator and use it in experimental design/determination of analysis, but I do and it would appear many of the poll participants do (indeed a few scolded me for not giving more detail about the type of organism up front for this reason). This is of course a point where people can disagree but I don’t think it is a domain where one can say one person is absolutely right and the other is absolutely wrong.
  
  3) I think actually 0/0/0 still doesn’t have much information about whether psi is very close to zero or p is very close to zero? 0/0/0/1/0 is a different story but I didn’t talk about that. You are right and I was wrong that 1/1/1 strongly implies both psi and p are close to one which is a stronger statement. But I don’t think it changes my final conclusion “Thus it is actually completely predictable that detection models shine when p is low and ignoring detection shines when p is high.” Do you disagree?
  
  What I think would move the argument forward most would be just a very direct simple statement from proponents of detection probabilities which of my three summary statements (#1-#3 at the top of the summary section) are right and which are wrong and why wrong (you will note that 2 of the 3 are favorable to detection probabilities). That gives a starting point that is still in the quantitative, objective realm. Then there is a whole long discussion to be had where reasonable people can disagree about implications for moving forward (my A,B,C, etc, or even my second #1 and #2). This latter area is what my poll is getting at.
  
  But straight up – do you agree with #1, #2 and #3? if not why not?
  
  Cheers
  Brian
  
  Reply ↓
  - Darryl MacKenzie on September 18, 2014 at 5:26 am said:
    
    1. LOL! If you want to use psi*p_overall for your estimator for psi, go ahead. Doesn’t matter what you say it’s an estimator for, how you use your data certainly has an effect on how you should interpret the results. If you’re going to acknowledge it’s a biased estimate, you need to honestly take that into account with your final inferences. Are you now claiming that using the number of legs on the last observed creature divided by four is just as valid an estimator for psi as anything else? We are in agreement that continuing this line of discussion is clearly a waste of time.
    
    2. I have absolutely no problem with people using biological knowledge in design/analysis. People should always use their knowledge to carefully design their studies before even thinking about going into the field. Then they can get a realistic expectation of what sort of results they can achieve. The sad fact is that people do not spend enough time doing so in their haste to get out into the field. In my experience, a bit of extra forethought could stop a whole lot of the troubles that people run into once they start analysing data. If you do have prior expectations about parameter values, then by all means bring them into the analysis, and a Bayesian approach would obviously be suitable. However, if the data being used has little information about those parameters, results may be sensitive to exactly what prior is being used.
    
    3. Your quote wasn’t just about the all 0’s observations, or just about the all 1’s observations, but about when you have high detection so mostly all 0’s or all 1’s in combination. What you’re saying in the original vs replying to my post are related but different arguments. Your original argument was flawed, but I do agree with your conclusion although would say that when p is high, both give similar results so it shouldn’t matter too much which method you use.
    
    In terms of your original summary points:
    #1 partially agree. There are situations where ignoring detection gives a smaller MSE, but in order to know whether you’re in that situation you need to know something about true psi and p, which will often be problematic. Whether one method works ‘better’ depends on what your definition of ‘better’ is. You seem to be focusing on MSE and just the point estimate, but how useful is a ‘naked’ point estimate.
    #2 agree, but when wouldn’t you be interested in confidence intervals (or some measure of uncertainty)?
    #3 agree detection models work better with larger sample sizes (what stats method doesn’t), though number of repeat surveys required depends on p, 2-3 can be than sufficient if p>0.5 or so (this has been covered in the literature on these models). I don’t think the view of ‘I don’t have enough data for the detection models to work properly so therefore am going to ignore detection’ is well justified. These methods were developed because ecologists wanted to deal with detection to get a more reliable view of where the species might really be. That added degree of reliability requires more information. If the detection models aren’t working that well, that tells you something and playing ostrich doesn’t resolve the underlying issue. This is where study design is important to give you a realistic since of what’s achievable given your resources. It may be more productive to follow an alternative line of inquiry.
    
    #4 would be a valuable contribution. But as Dan said earlier, WLD said and I’ve said a number of times, detection is really a measurement error problem. The real value of repeat surveys isn’t that they allow detection to be estimated, but they provide more surity (ie reduce measurement error) on underlying species presence/absence. Therefore, even if you start looking at covariate relationships there’s going to be some benefit to dealing with detection either by having repeat surveys and/or accounting for it in analysis. The level of benefit is going to depend on the nature of the covariate relationships with both psi and p.
    
    As for your 3 concluding points:
    #1 it’s debatable about whether the primary focus is strictly about the estimation of psi alone. There’s a very strong implication in your final sentence that GLMWM are part of the detection probability machismo-ists you’re decrying, despite your earlier comment to Mick. I’ve got a pretty thick skin and been called plenty worse in my time, but you know nothing about my motives for working in this field and I find the implication both insulting and offensive. If you’re going to use such language don’t been surprised if some people are a bit prickly in their comments.
    #2 agree.
    #3 I have a hard time getting to B and C, because why wouldn’t I want to know something about how much uncertainty might associated with my estimate? D, I completely agree with, wholeheartedly (though there are detection issues there that need considered). E, ‘fantastic dataset’ and ‘long term monitoring’ aren’t equivalent. Some long term datasets are good and useful, and some aren’t and deserved to be scrapped (and not necessarily just because they may not have collected detection data). If you have a useful one, but recognise it could be improved in someway why not modify it? If you want to maintain some continuity, make the changes in such a way that allows you to do so. Quantity doesn’t always trump quality, and just because data has been collected in the same way for 20 years doesn’t make it inherently fantastic. GIGO always applies.
  - Brian McGill on September 18, 2014 at 11:57 am said:
    
    Thanks Darryl,
    
    #1 – Done
    
    #2 – the only problem with putting things in a Bayesian formal context, at least as I think you’re describing, is it requires collecting the data as if you’re going to do a hierarchical analysis. That may have a high cost. As per my discussion w/ Mick I would advocate sometimes just assuming during the sampling design that you’re in a particular region of space.
    
    #3 – we’re converging – nothing to add
    
    Summaries
    #1 See #2 above and my response to Mick about that. Sometimes you have a pretty darn good idea where you’re at and I’m OK with building that into the design. Sometimes of course you don’t and then life is tough.
    #2 – of course I’m interested in confidence intervals. I’m also interested in high quality point estimates. Unfortunately sometimes there is a trade-off
    #3 – agree on facts, disagree on interpretation.
    #4 – what you say is true – what I said about the value of more sites to disentangle covariance among the variables is also true. I am really curious how the two trade-off
    
    Conclusions
    #1 – that sentence clearly bothered you and Mick. I regret it. But I will say I wasn’t thinking of people like you & Mick when I wrote it. It is almost always not the people highly informed (or the inventors of a method) who are most dogmatic about it – it is the 2ary group who adopt it too fiercely. And they’re the ones who do your side harm by making ridiculous claims in reviews about the impossibility of doing science without using hierarchical models and never acknowledging any trade-offs what-so-ever.
    #3 – can’t disagree with what you say in generalities, but I have a lot of stories about what I think are quality long-term datasets that are used to address long term questions (e.g. cliamte change that go back before detection probability models existed) that are told that can’t be published, effectively saying no science can be done on this question until we wait another 20 years to get fully detection probability designed data just over a methodological issue. That is wrong!
    
    Thanks for the conversation!
  - Darryl MacKenzie on September 19, 2014 at 12:17 am said:
    
    Hi Brian,
    Replying the points in the same order, hopefully readers can follow it! Apologies also for any poor spelling, wordpress and Firefox on ubuntu don’t seem to play nice with automatic spell check.
    
    #2 You’re reading too much into my comment, I didn’t imply anything about having to have repeat surveys for a Bayesian approach to be applied. You could do it with a single survey although you would have to use a detection model in the analysis and have a prior on the detection probability. Pretty straight forward really. It’s not something I would generally recommend though unless your prior comes from some real data and not just expert opinion. Even then, detection probabilities are often time/space/method specific so you would want to consider some form of sensitivity analysis if you’re doing it properly. That all said, collecting information to inform you about detection doesn’t necessarily mean a large increase in costs; often there are ways to collect the required information in a single visit which may be reasonable depending on your objectives.
    
    Summaries
    #1 I’m pretty comfortable with people having some feel for whether detection might be low, medium or high, but think it would be real tough to expect much more than that. Although it also depends on how the study design and goals relate to the species ecology, and whether in order for a detection to be made, a number of things have to happen. Sometimes people might only be thinking about ‘detection’ in terms of one of the sequence of events so there’s a mismatch between how they’re thinking about ‘detection’ vs a ‘detection’ given their study design. eg. people might be thinking that when a bird calls I have a high chance of detecting it, but there are a number of other things that have to happen before the bird calls that could be components of detection, hence actual detection probabilty could be a lot lower. There’s no subsitute for some careful, considered, and informed upfont thinking before going out into the field. Pretty sure we’re in agreement on that.
    #2 I’m really struggling to see what the value of such a tradeoff would be. Perhaps we have a different definition of ‘quality’. An estimator that produces an estimate with an unknown bias (given you don’t know what p is) and poor coverage (often <50% and even essentially 0% for the parts of the parameter space I think we're discussing, p<0.5) is not one that I would regard as 'quality'.
    #4 Would be a good masters project for someone. I agree there will be tradeoff, but I'm picking that going to lots of places and collecting data that has potentially high measurement error (due to low detection and unsufficient repeat surveys) and the nature of the measurement error also varies (because detection varying in time/space), isn't going to be the winner.
    
    Conclusions
    #1 Good to know it's unintentional, but you might what to carefully reconsider your writing style then because there's a number of places where it reads like you're lumping detection model proponents in with your machismo-ists, not just that one sentence. Judging by some of the other comments, it's not just coauthors of GLMWM that got that impression.
    #3 Can't really comment without knowing specifics of the cases, but I've also got stories where people have thought they've got fantastic datasets but once you start getting into the details of them things quickly fall apart for a variety of reasons.
  - Brian McGill on September 19, 2014 at 8:38 am said:
    
    Darryl – much to agree on. For #2 – I’m pretty happy with MSE as a balancer of bias and variance. Doesn’t address confidence intervals though. So for me its MSE vs confidence intervals is the trade-off (again as I said in the original post – only in some parts of parameter space – other parts there is no trade-off – you get best of both worlds with hierarchical models)
Michael McCarthy on September 18, 2014 at 1:02 am said:

Hi Brian,

I’ll go through your summary recommendations and facts. I’ll use “simple model” to refer to cases where imperfect detection is ignored, and “hierarchical model” to refer to cases where it is included. So, starting with Point 1 in your summary, the point I’d like to make is:

The simple model has superior MSE in one region of the parameter space only if it uses information not available to the hierarchical model (that detection probability is high). Let me explain what I mean (I’ve just noticed that Darryl has already made this point, more succinctly than me, but I’ll paste in my comment anyway):

I agree that hierarchical models sometimes have poorer MSE than simple models. As WLD show and we replicate, there are cases where the MSE of the simple model is better. As you note in your post, the simple model has lower MSE only when both occupancy and detection probability are low and there are relatively few sites. I say “relatively few sites” because it basically applies to cases where there are few detections (and occupancy is low) such that estimates of detection area uncertain.

Here’s an example. Assume occupancy is 0.1, and detection probability (per visit) is 0.8. If there are two visits to each site, then the probability of failing to see the species on both visits is 0.2*0.2 = 0.04 (assuming constant detection and independence), so the chance of seeing it at least once at each site where it is present is 0.96 (1 – 0.04).

Therefore, estimates of occupancy based on the simple model (the proportion of sites at which the species is seen) will tend to be around 0.096 (=0.1 * 0.096; occupancy times the probability of detection per site). Estimates of this proportion will be reasonably precise (depending on the number of sites), and not too far from the true occupancy of 0.1. So the simple model will perform quite well (values around 0.096 as a simple estimate when true occupancy is 0.1 – few if any people would complain about that).

In contrast, the hierarchical model will try to adjust for imperfect detection. But with only two repeat visits (and assuming few surveyed sites), the estimate of detection probability per visit will be very uncertain; this makes the probability of detection per site very uncertain.

You can think of the estimate of occupancy as being:

occupancy = (proportion of sites at which the species is detected) / (probability of detection per site).

With two visits per site and few sites, uncertainty in the probability of detection per site (the denominator) makes the estimate of occupancy very uncertain. If uncertainty in the detection probability is large enough, uncertainty in the occupancy estimate will span almost all possible values between one and something just over zero. This uncertainty arises because the species will be seen so infrequently – the model can’t distinguish between low detection probabilities and low occupancy.

The high uncertainty in the estimate of occupancy flows through to high MSE. With insufficient replication, some datasets will suggest small values for detection probability, which will then imply high occupancy. In other cases, the estimate of detection probability will be better, and the estimated occupancy will be closer to the truth. But generally with a variable estimate of detection, the estimate of occupancy will be all over the place – hence a poor MSE.

Note, however, any estimate of detection probability for the hierarchical model will be tend to be uncertain when data are scarce, so the confidence interval will be wide – regardless of whether the data suggest occupancy might be high or low, the hierarchical model will tend to report high uncertainty in this estimate. It will suggest that occupancy is uncertain, regardless of what the best estimate might be.

The simple model avoids this problem by assuming that the probability of detection per site is one. Because it doesn’t need to estimate this value, the simple model provides relatively precise estimates, which in this case (high site-level detection) is close to the truth (i.e., it happens to be precisely close to being correct).

So, with few detections, can we assume that the simple model will work better (in the sense of having lower MSE)? Unfortunately, the answer is no. Few detections can also arise when occupancy is relatively high and detection is particularly low. For example, we’ll also have few detections when occupancy is 0.8 and per visit detection probabilities are 0.1. In this case, the site-level detection probability is 0.19 (= 1 – 0.9*0.9), so the simple model will tend to estimate occupancy to be around 0.152 (0.8 * 0.19). This estimate will tend to be relatively precise under the simple model. So now the simple model provides a very poor estimate of the true occupancy (values around 0.152, which is quite far from 0.8). In this case, the simple model is precisely wrong (and will have high MSE).

In contrast, the hierarchical model will still have a hard time estimating the detection probability, so the uncertain estimates of occupancy will remain. So, when detections are rare (and detection probability could be high or low), the hierarchical model will be honestly uncertain about the occupancy.

It is for these parameters (high occupancy and low detection) that the hierarchical model has clearly lower MSE than the simple model; the simple model has high precision but it is precisely wrong (very biased and high MSE). I think Brian, I and everyone else agree on this point, and it is one that Brian has already summarised in his post.

Now, the interesting part is when we have few detections, how do we know which in part of the parameter space do we fall? Do we have low detection and high occupancy, or high detection and low occupancy? If we don’t know the detection probability, then I’d rather be honest about the uncertainty and use the hierarchical model, rather than assume high detection rate, use the simple model, and risk being precisely wrong.

Now, we might be prepared to assume that detection probabilities are high for a particular species. For example, the species we are looking for might be loud, active, colourful and hard to confuse with another species (think of rainbow lorikeets; although they move about, so the meaning of occupancy needs to be carefully defined). In that case, assuming high detection (and using the simple model) would be reasonable, and in most cases the sensible choice. And the simple model would have better MSE in this case than using a hierarchical model where detection probability is estimated but remains uncertain. However, if we know the detection probability is high, then we could constrain it within the hierarchical model (putting bounds on it, or using an informative prior). In that case, the MSE of the occupancy estimate from the hierarchical model will tend to be similar to that from the simple model.

So in the region of the parameter space where the simple model has better MSE than the hierarchical model (low occupancy, high detection; it works well because the simple model implies perfect detection, which is roughly correct), if we put a similar constraint on the hierarchical model (detection is high), the superior MSE of the simple model will evaporate.

So, if we know that detection probability is high, it is fine to use the simple model. But it is also fine to use the hierarchical model if we constrain detection to high values. In other cases, the hierarchical model will do roughly as well, or outperform, the simple model (in terms of MSE).

Got to run now, but I agree with your summary points 2 and 3.

There is some literature about your 4th point (perhaps there could be more to fully explore what you suggest):

Papers that assess consequences of imperfect detection on regression coefficients:

Tyre, A.J., Tenhumberg, B., Field, S.A., Niejalke, D., Parris, K. & Possingham, H.P. (2003) Improving precision and reducing bias in biological surveys: estimating false-negative error rates. Ecological Applications, 13, 1790-1801.

Gu, W. & Swihart, R.H. (2004) Absent or undetected? Effects of non-detection of species occurrence on wildlife–habitat models. Biological Conservation, 116, 195-203.

The consequences of this for decision making have also been explored (both in a paper and even a movie!):

Video: Phil talks about our research on detectability and species distribution models!

From paper to movie

Reply ↓
- Brian McGill on September 18, 2014 at 1:51 am said:
  
  Thanks – I really appreciate your tone and constructive approach.
  
  So what I hear is you agree with #2 and #3, but disagree with #1. Your math and logic are correct of course. I just approach #1 with a different perspective. To me the key implications of #1/your Table S2.1 are:
  a) if you have prior expectations about p and/or psi you can use Table S2.1 to give guidance about whether to use simple or hierarchical models. You are saying after you’ve collected the data, you can’t tell from the data where you are in parameter space. These are different claims/uses of Table S2.1. I think we’re both right on this one.
  b) just that using detection models is not so perfectly better than simple models that it is reasonable to start telling people to throw away fabulous long term datasets that don’t have repeat visits. You might not have needed Table S2.1 to get to this conclusion, but a good number of people are recommending throwing away long term datasets because they don’t support detection probabilities and the fact that detection probabilities are not perfect or always best is not at all surprising (such perfect methods don’t exist and the people who know them best don’t usually claim they are perfect) but is a good counterargument to the idea that you have to 100% use hierarchical methods regardless of the downsides (like throwing away good data).
  
  Again – thanks for the constructive response
  
  Reply ↓
  - Michael McCarthy on September 18, 2014 at 3:11 am said:
    
    I don’t disagree with #1 – but your point doesn’t seem to answer what to choose when we don’t know detection probabilities.
    
    If we know detection probabilities are high, then we can go ahead with the simple model (or a hierarchical model with constraints on the detection probability – that might be unnecessarily complicated, but wouldn’t be bad per se).
    
    If we know detection probabilities are low, we should use the hierarchical model.
    
    We definitely agree up to this stage as far as I can tell.
    
    If we don’t know detection probabilities, then I’d rather use a hierarchical model so that I can estimate detection probabilities and know how bad the estimate of occupancy might be.
    
    I’m not sure whether we agree at this point or not, but I can’t quite understand how an alternative might be better. Please let me know your thoughts.
    
    If we can’t estimate detection probabilities (e.g., the datasets you mention), then we might feel like we have no choice if we want to analyse the data (I wouldn’t advocate throwing away the data) – so we might use the simple model. But I would wonder if the inference is sensitive to imperfect detection (by inference I don’t mean just p-values – I mean the estimates and confidence intervals combined). That, I think, gets to your your fourth point, which has been at least partly answered in the literature that I mentioned (one example – ignoring imperfect detection can bias parameter estimates in a regression model).
    
    In response to not knowing detectability, it might be prudent to analyse the data under different assumptions of detectability and see if the results are robust (e.g., use a hierarchical model with detection fixed at particular values). Or we could use the simple model and hope everything is OK. I think the former option is safer. I’m not sure you agree with that point.
    
    As an aside, in David Lindenmayer’s datasets (David is the L in WLD – I know you know that – I’m just clarifying for readers), he generally has notoriously many sites, but also many repeat visits. For the vast majority of David’s species, the probability that they are present at a site and remain undetected over multiple visits is close to zero. Therefore, I understand why he would choose to use an analysis that doesn’t account for imperfect detection, and his results will be robust. In fact, when his data are fit to a hierarchical model, he tells me that he gets basically the same result, so he doesn’t see the point of going to the extra trouble. I’m not surprised that is the case, and I agree what he does is reasonable. For average punters with smaller datasets and fewer repeat visits, they might not be in such a comfortable position.
  - Brian McGill on September 18, 2014 at 12:02 pm said:
    
    I think we’re back at the core issue of there is a cost to collecting the extra data. There is a benefit too. It is not obvious to me it is ALWAYS (often certainly, but not always) worth the extra cost. In particular if I have good biological knowledge that suggests we’re in a region of parameter space where simple model is likely to beat hierarchical, why would I collect 3 times as much data.
    
    You make an interesting point about David’s data. When I analyze the North American BBS, I almost always average a route (already 25 stops) across 5 years. Now the detection at a single stop is pretty low, but across a route, reasonably high, and across five repeat visits I think extremely high (yes I know there are temporal changes but I’m looking at really large scale patterns where I am content to treat 5 years as a point in time). It sounds like even though it is a not-fancy approach to detection, you would trust it?
    
    Cheers
Michael McCarthy on September 18, 2014 at 7:31 am said:

OK – I have a bit more time to work through the summary and facts. I was up to the facts section.

Fact #1. I agree that “Detection probabilities are not an uber method that strictly dominates ignoring them.” I think that is covered in my comments on Summary Point 1. However, it is also true that “Ignoring detection probabilities is not an uber method that strictly dominates including them.” I’m reasonably sure Brian agrees with that second statement too.

The WLD paper is used by others to defend ignoring detection probabilities without any evidence that their parameter space does not cover the area where simple models perform noticeably worse by all criteria. Look at the papers that cite WLD and you will see statements such as:

“We did not account for detectability because it introduces problems that are at least as large as when it is included (Welsh et al.).”

That is not an exact quote came from any particular paper but it paraphrases what can be found in papers by different people. WLD is being cited as a reason for not considering imperfect detection without demonstrating that the parameters (in particular the detection probabilities) justify that stance. Essentially, researchers are using WLD to not apparently not even think about the consequences of imperfect detection. That is at least as bad as requiring people to fit hierarchical models when there might be good reasons to believe that doing so is unnecessarily complicated.

People should think about imperfect detection, but having thought about it, they might justify not including it in their analysis for sound reasons. Brian, I think you would also agree with that statement. We all want people to think about their analyses.

Reply ↓
- Brian McGill on September 18, 2014 at 12:08 pm said:
  
  Definitley agree simple models are not an uber model. Would even agree that hierarchical models are the better approach a majority of the time. Just don’t like being told that simple models are NEVER acceptable (which I do hear a lot).
  
  You are preaching to the choir on this. The world of data and the world of statistical analysis is extraordinarily complex and full of high-diemnsional trade-offs. It is inexecusable not to think these through and justify the choices you make carefully (glib justifications don’t count).
  
  By the same token in this complex, trade-off world, it is pretty much impossible that any one analysis approach is ALWAYS right 100% of the time. That is the bee in my bonnet – the # of times I have been told that I HAVE to do hierarchical models (usually completely irrelevant to the questions I am addressing since I rarely even look at occupancy).
  
  So I completely agree with your assessments. We’re probably coming at it from opposite sides (with opposite frustrations), and would probably have different default biases. But I think we fundamentally agree. As I said to Darryl below its never the people who really understand what is going on that are unreasonable – its the people who know just enough to be dangerous and turn things into dogma.
  
  Thanks again for the conversation.
  
  Reply ↓
gguillera on September 18, 2014 at 2:09 pm said:

Hi Brian,

As Mick pointed out, I was travelling and it’s only now that I have the chance to comment on your post. I see there has been a lot of discussion going on, so I’ll stick to a few key points.

First I’d like to say that I’m really glad you clarified you were not suggesting we tricked our paper by hiding or manipulating results, because that is the exact impression I got when I first read your post. As you know those would have been extremely serious accusations, offensive and defamatory. In connection to this, I’d suggest you reconsider whether it is appropriate to accuse people of being “machismo types” or to claim that some are deliberately trying to scare people with difficult methods… I don’t think people are so mean; I believe most of us are just trying to do the best science we can. In fact, if you think it inappropriate to label a model as naïve (a term used in the literature), it seems strange that you think labeling people as machismo-ists is acceptable in a scientific discussion.

Also, I see you have misunderstood/misinterpreted our paper in several ways. Essentially you summarize our work as:

“” GLMWM basically pick different scenarios […] and show that detection probability models have a lower MSE. […] This is not really a deeply informative debate. It is basically, ”I can find a case where your method sucks. Oh yeah, well, I can find a case where your method sucks.” “”

This is not what we do. Our paper is not about “picking different scenarios” to show occupancy-detection models can have lower MSE. What we do is:

1. We consider a new scenario in comparison to that in WLD (scenario 2 per your blog post) to illustrate the point that one can get the same estimates from the “simple model” in very different situations (some with low MSE, some with large MSE), and that one cannot tell from the estimates in which case one is. With this we support our argument that the occupancy-detection model is better because it honestly represents the uncertainty that we have once we recognize that detectability may be imperfect. Both Mick and Darryl have explained this point very well in their comments (see a further attempt in Note1 below).

2. We investigate why WLD obtained the counter-intuitive result that occupancy-detection models and the “simple models” have the same bias when detectability depends on abundance regardless of sample size. We find that this is because they considered a scenario of extreme heterogeneity (see Note2 below). We considered a couple of less extreme scenarios to confirm that the bias in the occupancy-detection model is smaller. We also explored this theoretically (with one site category). I think this is clearly more than simply “picking” one scenario that has lower MSE.

Also, it is important to note that you give recommendations based on our table, while our table considers the same amount of survey effort per site for both models (as you also mention somewhere in your post). If part of your argument to support the “simple model” is the difficulty to have repeat visits to model detection, then you should compare MSEs considering the case where the “simple model” is fitted to data from a single visit. Our table does not give you that.

Finally, you may be aware that a while ago I wrote a blog post where I explain the motivations behind our paper as well as its key results/conclusions. There it is clear that our paper is not just about picking counterexamples with low MSE. I invite those interested to read it at:

Accounting for detectability is not a waste – a response to Welsh et al

Regards,
Guru

PS: Regarding your last note in your post, in our paper we report that “in 98.5% of the simulations, unmarked found the maximum-likelihood estimates at the first attempt using its default values”.

******************************************************************************************************
(NOTE 1) The discussion about which estimator is best is comparable to the following toy scenario. Let’s imagine we need to estimate a proportion but only have a few samples to do so. Now I present you with two estimators:

• Estimator 1: take the samples and estimate the proportion as usual, i.e. number_positive_records/total_number_records. Here we will obtain an estimate with a large CI because we have few samples.
• Estimator 2: estimate the proportion to be 0.5 (fixed), with no uncertainty.

Which estimator would you choose? I’d say it is more satisfactory to go for estimator 1 and claim that “with this amount of data all I can say is that the proportion can pretty much take any value from 0.x-0.y”, than to trust estimator 2, and claim “I’m pretty sure the proportion is around 0.5”. Estimator 2 will indeed have lower MSE in some cases, but in others it will be very biased. If those estimates are to have any real use (i.e. a decision or conclusion needs to be based on them), it seems better to openly acknowledge uncertainty than to confidently rely on an estimate that can be largely wrong. I think that there is general consensus nowadays about the need to estimate and report uncertainty in estimation.

One could draw recommendations (similar to those you suggest) and say that estimator 2 is better when the proportion has mid values. But as Mick says, if one truly has that sort of prior knowledge, it could also be combined with Estimator 1 to improve its precision.

(NOTE 2) In your blog post you state that WLD test “real-world scenarios”. I’d be interested in whether you really think that the third scenario (the one where detectability depends on abundance, or more generally, when detectability was heterogeneous) was a particularly realistic one. They consider a case where, within sites of a given type, in some sites overall detectability is around 1 (the species is detected for sure), in others is around 0 (there are no chances of detecting the species), and in practically none detectability falls between those two extremes. To me that sounds as a quite extreme scenario, and I’m not sure if there are many ecological explanations for such conditions.

Reply ↓
- Brian McGill on September 18, 2014 at 7:46 pm said:
  
  Welcome back Guru,
  
  Just a quick note – models are “statistical machismo” – some (and only some) people who obnoxiously advocate a model, usually without being that well informed, are. In my experience people like yourself, Mick and Darryl are not usually such people.
  
  Table S2.1 got very little comment in the main text of GLMWM (I think only one sentence?). That is not the choice I would make – I think that table is very interesting and useful (which is why I wrote a whole blog post about it – and thank you for providing the table!) – but its a free world. I guess you’re entitled to emphasize what you want. And I certainly don’t know why you chose not to discuss in any depth Table S2.1.. If you want to answer, I’d be curious to know why?
  
  I just have a very different view about the state of the field, and hence where the burden of proof lies (where a single counter example vs a general statement). FWIW my view starts with the fact that hierarchical detection models are strongly entrenched (something the poll is bearing out), so that is not a criticism of hierarchical deteciton models. I guess all I’m saying is that hierarchical detection models are SO entrenched that some people forget they’re not perfect either.
  
  One could debate whether WLD or GLMWM have more realistic scenarios for the case based on abundance, but personally I think they’re both a bit orthogonal to the real issues invoked when abundances vary lognormal/logseries-ish so its not really a debate I want to get into. I certainly can see your point of view on this.
  
  I agree that repeating table S2.1 for what you call the K=1 scenario would be very useful/interesting.
  
  I read your blog with interest when it came out. TO be honest – this is why I got into blogging – the opportunity to have conversations with people such as yourself that I might never have had a conversation otherwise.
  
  Thanks for your thoughtful comments.
  
  Reply ↓
  - Michael McCarthy on September 18, 2014 at 8:59 pm said:
    
    In response to your question about why we didn’t spend more time discussing the appendix, Guru notes in her blog post that we decided to focus the paper on five key messages (that did include the issue of MSE, and not knowing where in the parameter space we might fall). It was a matter of balance. You think we didn’t get the balance correct, which will be true for at least some other readers too. It might be even universally true.
    
    But we thought carefully about how much detail we wanted to include in the paper. We thought the paper was already quite dense, so adding any further detail on any one issue (plenty of other points could have been discussed in more detail – drafts of our paper certainly had those) might have distracted from the other messages.
    
    Dissecting the appendix in more detail as you have done is useful, but it also opens up a can of worms such as figuring out how to design surveys when we don’t know occupancy or detection rates (a priori). So, in the paper we simply pointed out this difficulty (we can’t really tell in which part of the parameter space we fall, so we can’t be sure if the simple model will be precisely right or precisely wrong). Addressing in more detail would seem to demand a paper in itself (one I’ve had in draft form for years I’m sorry to say).
  - Brian McGill on September 18, 2014 at 9:44 pm said:
    
    Fair enough! Space limitations unfortunately drive lots of decisions. As you say you can’t please all the readers all the time. Just because I would have had different priorities doesn’t mean much. I do think the appendix is very interesting and useful, so regardless of how much it was commented upon, I am grateful to GLMWM for producing it!
  - Michael McCarthy on September 18, 2014 at 9:56 pm said:
    
    Reflecting on this some more, some of the items in the paper were developed in response to reviewer comments. Fig. 6, from memory, was one item; aspects of the appendix too I think, but I can’t remember the details, and I haven’t recently looked through the revisions and reviewer comments. I remember that none of our additions changed the fundamental messages we tried to convey. Thus, we were also balancing the amount we changed in the manuscript, and how much was just interesting addenda that the reviewers wanted to see.
Michael McCarthy on September 19, 2014 at 6:41 am said:

I’ve chatted with a few people about this post. Most people understand why I defended my colleagues (and me) over the suggestion they might have intended to conceal results that did not support their position. Others thought that concealing such results would have been fine in the cut and thrust of debate.

Let me be clear – concealing evidence that doesn’t support a position falls squarely in the realm of scientific misconduct. There are papers about the prevalence of these sorts of behaviours. Even though some people might think these behaviours are acceptable, they are not and they harm science. That’s why I sought to set the record straight, and why Brain and his readers might sense we were offended by the suggestion.

Below are two examples of the literature on this topic. This quote from the abstract of the first paper by John et al. (2012) is worrying:

“This finding suggests that some questionable practices may constitute the prevailing research norm.”

Click to access MeasPrevalQuestTruthTelling.pdf

https://peerj.com/articles/562/

Reply ↓
- Brian McGill on September 19, 2014 at 8:41 am said:
  
  It is very evident from your comments here that you are a very thoughtful, upright person. I rather suspect you’re one of the last people science has to worry about being unethical. Again, no reason for you to have perceived the same importance in different aspects of GLMWM that I did.
  
  Reply ↓
  - Michael McCarthy on September 19, 2014 at 11:11 am said:
    
    Sorry – I should have said you’d covered that ground plenty enough already in your comments.
    
    I posted this because I thought it was worth letting other people know the literature on this, and the changing landscape. Fiona Fidler gave a seminar covering this in my department this week. She pointed out that, in the not too distant future, it is very likely that ecology will be scrutinized on matters such as this (essentially it gets to the issue of reproducibility of science). It would be good to get our house in order before then.
Pingback: Detection probability survey results | Dynamic Ecology
Jose L. Alcantara on September 27, 2014 at 2:12 am said:

I just want to make a suggestion (not directly related to the technical aspect of detection probabilities ). It is rather about the use of the term “statistical machismo”. Because the concept of machismo has to do mainly with a gender issue (i.e., men feeling superior to women) then, it can be misleading. So, I propose instead “statistical despotism”, which I think fits much better the idea of a group imposing certain rules to others. Just my three cents.

Reply ↓
- Brian McGill on September 27, 2014 at 2:00 pm said:
  
  I do understand your points. I addressed the gendered aspect of it in my first post on the topic. Perhaps you’re right that statistical despotism is a better term, although it may be harder to change now.
  
  Reply ↓
Nick Peterson on October 23, 2015 at 2:24 pm said:

Hi Brian, I have what I hope is a rather simple question, but it may not be so simple. I know that when using DISTANCE to calculate density, Buckland recommends having a minimum of 60-80 detections so that your density calculation is relatively accurate. My question is, if I’m only calculating detection probability, using Program MARK, is there a similar general rule-of-thumb on the minimum number of detections one should have before even bothering to calculate detection probability? I know MARK will of course run the analysis with very few detections, but the results may not necessarily be accurate. What are your thoughts? Thanks for the advice.

Reply ↓
Pingback: Statistical Machismo | brouwern