Also this week: what it’s like to visit Congress, in (qualified) praise of impact metrics, research integrity zealot vs. humor, how to talk to people at conferences, journal life list, and more.
From Jeremy:
UPDATE: Wish I’d seen this earlier, want to link to it now while it’s still timely. A very sensible comment on the kerfuffle over whether to lower the conventional alpha (type I error threshold) from P<0.05 to P<0.005. tl;dr: If we want to reduce the rate at which we publish type I errors, and reduce the file drawer problem, don’t we want to raise the type I error threshold rather than lower it? Click through for the full argument; it’s not trolling.
How to start a conversation with a stranger at a scientific conference. Hint: don’t start with your “elevator pitch”.
The “Caltech rules” for how to structure a scientific paper, particularly one applying a theoretical idea to a particular question. (ht @jtlevy) I’m still mulling over this remark, about why you should not try to model your own papers after great papers that you’ve read:
The reason is that your own experience in reading [great papers] is a bad model of a reader of your own paper. Most papers in this category have already been acclaimed as great. When a reader gets yours (e.g., as a referee, a senior person in your field to whom you have sent the paper), you will be unknown. Most of these readers will therefore read it quickly. A complex, intricate, or discursive argument will confuse such a reader.
We have lots of advice on how to write a paper: here, here, here, and here for instance.
Arjun Raj with qualified praise for metrics of scientific impact. Some good points here, especially about everyone’s tendency to refute metrics they don’t like with anecdotes (I’m guilty of that). I do think he overlooks the problem of gaming metrics. There are contexts in which that’s an overrated problem (e.g., people self-citing so as to increase their citation counts; not enough people do that often enough to be worth worrying about). But there are contexts in which it’s a huge deal. I’m thinking of how the British research assessment exercise–a metrics-based assessment of research at all British university departments–has distorted faculty hiring practices in obviously-suboptimal ways. (Corrected; commenter Jeff Ollerton points out that my memory is faulty and the REF is not explicitly metric-based. Though as Brian points out, even evaluation systems not explicitly based on metrics may produce results similar to the results that would have been produced by a metrics-based system. In any case, the broader point that universities do try to game the REF in a big way still stands, I believe.)
A 2×2 decision matrix to help you decide if your dissertation topic is a good idea. (ht @dandrezner)
Old school science cred: when “Big Data” meant “a truck full of water-damaged punch cards”.
Hoisted from the comments: Gregor alerts us that some journals (including Plos Genetics and Genome Biology) are now having their editors scour preprint servers and inviting authors of suitable preprints to submit them to the journal. I’m curious to hear what folks think of this practice. Is it a good thing because it improves efficiency of the peer review system (presumably, it cuts down on rejection)? Is it a good thing because it’s an endorsement of preprints, and any endorsement of preprints is a Good Thing? Or is it at least potentially a bad thing depending on how journals are choosing invitees? For instance, one can imagine that preprints from famous people might tend to garner invitations from selective journals. And one could imagine an unselective author-pays open access journal inviting anybody who posts a preprint in order to attract publication fees. Also: I’m reminded of this old post, from back in my Oikos Blog days, noting that law journals already have a centralized system that allows them to “bid” on unpublished manuscripts. But a key difference with that system is that authors decide which journals to seek bids from, as opposed to having to hope a journal editor favors them with an invitation. I’m also recalling an old link I can’t find just now from someone arguing that you can either have wide uptake of preprints, or you can have double-blind peer review, but not both.
A while back Brian posted on “crowdsourced truth telling” as a highly-imperfect-but-better-than-nothing way to deal with serial bullies in academia. The basic idea was for everyone who knows about a bully to cautiously and truthfully share that information in their own local circles, with the hope that word will get around and allow others to steer clear of the bully. Of course, one major limitation of the approach is that the information may not reach people who are in greatest need of it. Think for instance of prospective grad students who might join a bully’s lab. One commenter suggested a solution to that, but with obvious and serious downsides: something like RateMyProfessor. A site on which anyone could anonymously post statements about any graduate advisor. Well, somebody else thought of the same idea, and is actually doing it. It’s called QCist. I confess that I don’t see it working, because their moderation methods seem both ineffective and problematic. “When a user contacts us about a potential problem [with a supervisor], we will investigate the user who left the comments.” Really? And, um, how do you propose to do that, exactly, while also preserving anonymity? “[W]e also encourage the reviewer to leave words in the “additional comments” section to make the review more authentic. This way, a prospective trainee will be able to tell a bogus review from a truly authentic one.” Again, really? “[A] popular (and truly good) professor should receive multiple reviews. So one or two “bad-mouth” reviews will be drowned out by other good reviews.” Nice thought, but no: that doesn’t even work for RateMyProfessor (the ratings for which are infamously sexist), and they have way more ratings per professor than QCist will ever get. I’m sorry, I know QCist’s founders are trying to deal with a real problem, and that’s very much to their credit. But I just can’t get behind their proposed solution. This seems like a much more promising initiative to me.
Several scientists are planning to run for Congress in the next US election; it’s an unfortunate sign of the times that they’re all Democrats. Hats off to them, and I wish them the best of luck. Running for federal office as your first political campaign is really ambitious. It’s tough to do successfully, though it has been done. I would be interested to hear if there are more scientists planning to run for state and local office. Collectively, state and local governments matter at least as much as the federal government, those offices are easier to run for in many cases, and those offices can be stepping stones to higher office. And a lot of policy that scientists tend to care a lot about is set at the state level (e.g., education and higher education policy).
In other political news, US scientist readers may want to call their senators regarding Sam Clovis, the nominee for USDA chief scientist. He previously accused the Obama administration of paying climate scientists to falsify their work, and that’s far from the only crazy idea he pushed during his years as a conservative talk radio host.
“How can I give honest feedback [on statistics to my lab group] in a way that doesn’t come across as overly negative?” Good discussion thread. (ht Andrew Gelman, who has the good suggestion to frame your criticism as coming from a hypothetical reviewer)
University of Vermont medical school will end all lectures in two years. Because they got a $60+ M donation to build classrooms suited to active learning in small groups.
Stephen Heard’s journal life list. I have a JLL too, but it’s not nearly as long as his.
Philosophy journal corrects 35-year old paper by famous philosopher’s cat. It was actually by the famous philosopher himself, of course, and published under his cat’s name as a joke. A joke of which philosophers were, and are, widely but not universally aware. Recently, a humor-challenged philosopher complained that this harmless decades-old joke contravened “research integrity” because think of the children! some younger philosophers might be unaware of the joke. So the journal added a note to the article spelling out the joke. I’m reminded of the sad effort to have Onion articles prominently labeled as satire on social media, because occasionally somebody thinks they’re real. Somewhere, a single tear is rolling down Stephen Heard’s cheek. I hope that if anybody ever asks Evolutionary Ecology to attach a note to Vincent, Van & Goh 1996, the journal has the good sense to tell them to go pound sand. It’s ok for some people to be taken in by a joke. Trying to make sure nobody is ever taken in by a joke makes the world a worse place on balance.
And finally, you when you first started on Twitter vs. you now. I defer to Meghan on whether this is universally true. 🙂
From Meghan:
Georgia Auteri, a grad student in UMich’s EEB department, wrote a post about her experience on Capitol Hill for the annual Biological and Ecological Sciences Coalition’s (BESC’s) Congressional Visits Day, where she lobbied for support for NSF. The event is co-hosted by AIBS and ESA and sounds like a great opportunity! And, as she notes there, now is a key time to contact legislators about the budget! If you think you don’t have time, perhaps this idea will work for you:
Hi Jeremy – a slight correction: the British Research Excellence Framework, and its predecessor the Research Assessment Exercise, are emphatically not metrics-based assessments of research at all British university departments. In fact the REF panels were told specifically not to take journal impact factors into account, and citations and h-indexes were not explicitly assessed.
There have been calls to make the REF more metrics-based but there seems to be little appetite for that, certainly in the humanities. The rules for the next REF have yet to be published but the word on the street is that they are unlikely to change much.
Note also that there is no requirement for all departments, nor indeed all universities, to submit to the REF. But if they don’t then they lose out on government cash and it reflects on their research reputation.
Thank you for correcting my memory Jeff, I’ll update the post.
“Note also that there is no requirement for all departments, nor indeed all universities, to submit to the REF.”
A very Humean view of freedom of action! (https://plato.stanford.edu/entries/hume-freewill/) Universities are free to not submit to the REF in the same sense that my students are free to cheat on exams. They’re welcome to cheat, they just have to be ok with failing the course and possibly being expelled. 🙂
Jeff – my admittedly remote understanding is that it is not officially metric-based. But in practice, lo and behold, papers in Science, Nature, PNAS and Ecology Letters end up counted as four star publications. And a spate of part-time hires of people with four star publications ensued: https://www.timeshighereducation.com/news/twenty-per-cent-contracts-rise-in-run-up-to-ref/2007670.article
Mind you I’m not knocking the British system. Most of the world is moving this way.
I didn’t know about this REF, and just looked up a bit about it. Hmm. Well, if the outcome is universities trying to hire more people with better papers, is that necessarily bad? (The part time hiring thing is admittedly weird and distorting, though.) I mean, by that logic, all universities are trying to “game” the system by hiring the best people and poaching them from other places. One could call this “gaming” or call it “trying to hire better people”, no? Not quite sure also why it is so surprising that papers in good journals often earn 4 stars. Would one expect that not to be the case? Strikes me that if anything, the directive not to consider impact factor and citations and so forth could be more subjective (and subject to bias) than just counting cites. But forgive me if I’m misunderstanding; again, unfamiliar with this system, so this is just a quick impression.
Hi Brian – there is lots of discussion and speculation about this but the truth is that we don’t actually know if REF panels are *automatically* rating papers published in Science, Nature, PNAS, Ecology Letters, etc. as four star. That’s because the grading of individual outputs (papers, books, etc.) is not revealed, just the overall grade profile for all outputs submitted by a university to a given Unit of Assessment.
Received wisdom/academic psychology would suggest that the panels are doing just that, but I suspect that it’s certainly not always the case: the panel members do discuss individual outputs amongst themselves so there will be some degree of finessing.
Having said all that I’d be dumb not to submit my 2014 Science paper to the next REF in 2021, just as I submitted my 2011 Ecology Letters paper to the last one 🙂
@Arjun,
Well, it is bad if unis hire profs at other unis (even as far away as Australia) to “part time” jobs (which in practice involve little or no physical presence at the uni) just so they can count the hired prof’s publications towards the REF.
Yep, agree completely that that’s a pretty warped outcome.
Any sort of comparison or ranking implies a metric, doesn’t it, even if it’s not explicitly calculated?
Besides citation impact, here are some that have been in use at various times:
1. number of publications, irrespective of whether anyone read or cited them
2. volume of research funds
3. diligence in flattering the DVC for Research or equivalent
4. best 5 publications, and whether some selected group of experts likes them
I do entirely understand the problems with citation impact, and also that whatever you use for comparison people will apply their energies to gaming it. Still and all, citation impact feels to me like some sort of corrective compared to 1-3 above (each of which I’ve experienced).
Your comment on type I errors is a little weird to me. Isn’t the point that we should also care about type II errors, and that these two errors trade off? So lowering type I increases type II etc. If we care more about type II then we should increase type II and so on.
Oops, last sentence should read ‘if we care more about type II then we should increase type I’. We could also just plot a likelihood function in this context, which displays the optimal trade off at various levels.
Click through and read the linked piece. Apologies if my brief blurb for it is unclear. The piece makes the same point you’re making.
That’s what I meant – your blurb doesn’t mention type II errors, which seems to be the main point of the piece?
“I’m also recalling an old link I can’t find just now from someone arguing that you can either have wide uptake of preprints, or you can have double-blind peer review, but not both.”
Was it: Academic publishing death match: Double blind review vs. preprints
http://steamtraen.blogspot.co.uk/2017/01/academic-publishing-death-match-double.html
via a Friday link 20 Jan 2017
https://dynamicecology.wordpress.com/2017/01/20/friday-links-115/?
Got it in one!
In the linked post, Schimmack writes (of the recommendation to reduce the criterial p value from 0.05 to 0.005), “I believe that this recommendation is misguided because it ignores the consequences of a more stringent significance criterion on type-II errors.”
The Benjamin et al paper proposing this change explicitly discusses the trade off between Type I and Type II errors. On page 8 of the preprint, the first sentence in the section Why 0.005 is “The choice of any particular threshold is arbitrary and involves a trade-off between Type I and II errors.” They go on to discuss their rationale for recommending 0.005, and Figure 2 illustrates the false positive rate as a function of power for three different prior odds of a true effect for p < 0.05 and p < 0.005.
In addition, on page 10, in the Potential Objections section, they write (the first bit is a potential objection):
“The false negative rate would become unacceptably high. Evidence that does not reach the new significance threshold should be treated as suggestive, and where possible further evidence should be accumulated; indeed, the combined results from several studies may be compelling even if any particular study is not. Failing to reject the null hypothesis does not mean accepting the null hypothesis. Moreover, the false negative rate will not increase if sample sizes are increased so that statistical power is held constant.”
Schimmack may be right that more stringent alpha values contribute to the replicability/reproducibility problem by virtue of their relation to questionable research practices and exploitation of researcher degrees of freedom, and he’s right to discuss all of this in the context of efficient use of resources. But he somehow either missed or ignored the fact that Benjamin et al are explicitly arguing about efficient use of resources, too. Continuing directly from the quoted paragraph above, they write (emphasis mine):
“For a wide range of common statistical tests, transitioning from a P-value threshold of 𝛼 = 0.05 to 𝛼 = 0.005 while maintaining 80% power would require an increase in sample sizes of about 70%. Such an increase means that fewer studies can be conducted using current experimental designs and budgets. But Figure 2 shows the benefit: false positive rates would typically fall by factors greater than two. Hence, considerable resources would be saved by not performing future studies based on false premises. Increasing sample sizes is also desirable because studies with small sample sizes tend to yield inflated effect size estimates (11), and publication and other biases may be more likely in an environment of small studies (12). We believe that efficiency gains would far outweigh losses.”
None of this is meant to defend Benjamin et al’s recommendation, mind you. Ultimately, it all seems to boil down to a semantic argument about which subset of (new) results we can label “statistically significant,” which subset we can merely label “suggestive,” and which subset we can label “inconclusive.” And maybe Schimmack is right that worrying about power is more important to the larger project of improving the credibility of psychology as a whole. But he grossly misrepresents the proposal he claims to be arguing against.
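For what it’s worth, here is a rough numerical sketch of the two quantities discussed above: the per-group sample size needed for 80% power, and the false positive rate under some assumed prior odds that a tested effect is real. This is my own back-of-the-envelope calculation, not Benjamin et al.’s code; I’ve assumed a two-sided two-sample t-test, a medium effect size (Cohen’s d = 0.5), and 1:10 prior odds, so the exact numbers will differ from theirs depending on the test and priors used.

```python
# Back-of-the-envelope check of the alpha = 0.05 vs. 0.005 comparison,
# assuming a two-sided two-sample t-test with Cohen's d = 0.5.
from statsmodels.stats.power import TTestIndPower

def false_positive_rate(alpha, power, prior_odds):
    """P(H0 is true | significant result), via Bayes' rule, given the prior
    odds that a tested effect is real and the power to detect it."""
    return alpha / (alpha + power * prior_odds)

solver = TTestIndPower()
for alpha in (0.05, 0.005):
    # Solve for the per-group sample size that gives 80% power at this alpha.
    n = solver.solve_power(effect_size=0.5, alpha=alpha, power=0.8)
    fpr = false_positive_rate(alpha, power=0.8, prior_odds=1 / 10)
    print(f"alpha={alpha}: n per group ~ {n:.0f}, "
          f"false positive rate at 1:10 prior odds ~ {fpr:.0%}")

# Under these assumptions, per-group n rises from ~64 to ~107 (roughly the
# 70% increase quoted above), while the false positive rate falls from
# ~38% to ~6%.
```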
Yeah, for me p-values generally have little to no weight when forming an opinion on the results of a paper/study (not always of course). My impression is that there are many other folks who have a similar outlook. But more often than seems quite right, I will run into someone who is hard core ‘p < 0.05’. Not just in the sense that if p > 0.05 then there is no publication/no reward, but it often seems to be from the perspective of statistical insignificance = biological insignificance/the study was a waste of time. I’m curious as to whether my experience is common. Is it generational? Or more a case of where a person studied and who they learned their stats from? More to the point, is this perspective declining?
What about dropping the significance threshold altogether and presenting the p-value as a measure of the weight of evidence against the null hypothesis?
As far as I understand, it doesn’t matter what the threshold is; as long as there is one, there is publication bias toward statistically significant results (usually studies that found a large effect). I think this works like p-hacking at the level of the scientific community. We just see the results that are above the line we defined, and hide the others. If we don’t know how many studies had non-significant results, how can we know whether the statistically significant results are not just the 5% expected by chance to have p < 0.05? https://peerj.com/articles/3544/
I’m sorry for the poor English. Non native here!!
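To illustrate the “5% expected by chance” point above, here is a tiny simulation of my own (not taken from the linked PeerJ paper): many studies in which the true effect is exactly zero, with only the statistically significant ones surviving the file drawer.

```python
# Minimal sketch: simulate many two-group studies with a true effect of zero,
# then "publish" only those with p < 0.05 and look at their effect sizes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_per_group = 10_000, 20

published_effects = []
for _ in range(n_studies):
    a = rng.normal(0, 1, n_per_group)   # true group difference is zero
    b = rng.normal(0, 1, n_per_group)
    t, p = stats.ttest_ind(a, b)
    if p < 0.05:                        # the file drawer hides the rest
        published_effects.append(abs(a.mean() - b.mean()))

print(f"'Published' studies: {len(published_effects)} of {n_studies}")
print(f"Mean absolute 'effect' among them: {np.mean(published_effects):.2f} SD")
```

On a typical run, roughly 5% of the studies clear the threshold, and their average absolute “effect” comes out around 0.7 standard deviations, even though the true effect is zero in every single study.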
P-value as a measure of weight of evidence is an idea with a long history. My own view, following Deborah Mayo, is that that’s not really a good way to interpret p-values. If you’re going to do hypothesis testing (and that’s not the only useful thing to do in science!), then I think you want to aim for what Mayo calls “severe tests”. See this old post for an overview: https://dynamicecology.wordpress.com/2012/11/28/why-and-how-to-do-statistics-its-probably-not-why-and-how-you-think/
Thank you for the suggestion. Based on your post, it looks to me like the p-value is one of the things I should consider when evaluating the severity of my test. Not the only thing, and not necessarily the most important. But I need to read Mayo’s paper to understand the idea better.
However, the main question I am trying to answer is: why classify results as significant or non-significant at all? What do we gain from it? It looks to me like establishing a threshold just increases the selection for significant results in the publication process (not that this wouldn’t happen if thresholds were removed, but I’d guess it would be weaker).
As you wrote: “It’s rarely (not never, but rarely) the case in science that a single experiment, analyzed with a single statistical test, can justify drawing a substantive scientific conclusion.” So, in the most common cases, how can we draw substantive scientific conclusions from a sample of studies biased toward statistical significance and large effects?
Even when evaluating how severely some hypothesis has been tested, how can we do this if some results are just not published? “Test T severely passes H with data x if x accords with H, and if, with very high probability, test T would have produced a result that accords less well with H than x does if H were false.” If there is a population of studies a through z, but we only know about the test T of x, our evaluation of how severely H has been tested is not reliable.
Another site where you can comment on mentors is rateyourmentor.net. I don’t know much about their protocols, and it looks like only a few people have been reviewed there so far.
Pingback: Should journals invite submissions from preprint authors? | Dynamic Ecology