This week: OLS regression is like democracy (it’s the worst option, except for all the others), xkcd vs. coronavirus, and more.
RIP Adam Schlesinger, brilliant songwriter and fellow Eph. I got to see Fountains of Wayne once in Boston; they were excellent.
This is probably the last thing the faculty among you want to hear right now, but have you started to think yet about how you’re going to teach your classes in the fall? Especially the big ones? Note that the linked post doesn’t even get into the challenges of wet labs. I confess I find it hard to think about this. Presumably because I wish I didn’t have to.
The latest “many analysts, one dataset” project, this one from sociology. 160 analytical teams were given a massive, well-known, long-term longitudinal dataset on people’s life outcomes. The teams were asked to predict the next, as-yet-unpublished tranche of life outcomes, using any statistical method they wanted. Despite (or because of?) having data on thousands of potential predictor variables, and despite the fact that social scientists think these predictor variables are informative about life outcomes, even the best models (which were machine learning models) did poorly. They had very low predictive ability in an absolute sense, and only barely improved on garden-variety least-squares linear regressions using just a couple of predictor variables chosen by subject-matter experts. Basically, all methods are about equally good at predicting typical life outcomes, and equally terrible at predicting extreme or uncommon ones (e.g., predicting which subjects would go on to be laid off from a job). I leave it to you to decide whether this is encouraging or depressing news for (i) sociologists, (ii) machine learning methods, and (iii) OLS regression. Related old post from Brian on how many predictor variables your model should include. (ht @kjhealy)
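For a concrete sense of the comparison at issue, here’s a minimal sketch of “holdout predictive R²” for a two-predictor OLS baseline versus an everything-in regression. This is purely illustrative, on made-up data (not the actual study’s data); all the numbers, sample sizes, and variable names are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a longitudinal dataset: many candidate predictors,
# but the outcome depends (noisily) on only a couple of them.
n, p = 400, 50
X = rng.normal(size=(n, p))
y = 0.4 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=1.0, size=n)

def holdout_r2(X_tr, y_tr, X_te, y_te):
    """Fit OLS on training data, score predictive R^2 on held-out data."""
    X_tr1 = np.column_stack([np.ones(len(X_tr)), X_tr])  # add intercept
    X_te1 = np.column_stack([np.ones(len(X_te)), X_te])
    beta, *_ = np.linalg.lstsq(X_tr1, y_tr, rcond=None)
    resid = y_te - X_te1 @ beta
    return 1 - (resid @ resid) / ((y_te - y_te.mean()) @ (y_te - y_te.mean()))

train, test = slice(0, 300), slice(300, None)
r2_small = holdout_r2(X[train, :2], y[train], X[test, :2], y[test])
r2_big = holdout_r2(X[train], y[train], X[test], y[test])
print(f"2 'expert-chosen' predictors: holdout R^2 = {r2_small:.2f}")
print(f"all {p} predictors:           holdout R^2 = {r2_big:.2f}")
```

The point of scoring on held-out cases, rather than in-sample fit, is that throwing in all 50 predictors always improves the in-sample fit but needn’t improve (and can hurt) prediction of new cases, which is the study’s yardstick.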
“We’re not trapped in here with the coronavirus. The coronavirus is trapped in here with us.” 🙂
Hey look, there’s still some professional sport being played. The sport of…responding to peer reviews. 🙂 (ht Meghan)
It almost seems too far off to even dream about right now, but someday instead of having to hear the socially-distanced Ode to Joy I linked to the other week, we’ll be able to hear it like this again:
“Note that the linked post doesn’t even get into the challenges of wet labs.” — Jeremy Fox
Well, for human anatomy and physiology, students, through “distance learning”, could dissect bodies lying in the street near their place of residence — or perhaps even within it. Supply should be plentiful, thanks to political incompetence.
As Martin Rowson pointed out in his March 7 editorial cartoon in The Guardian newspaper: “Don’t bring out your dead. Stay indoors with them until you need to eat them”.
(While currently correct, the above is not a stable link but rather a cartoon queue.)
Seriously, though, the “how you’re going to teach your classes in the fall?” article you linked to is spot-on. But beyond issues of pedagogical efficacy, I predict that students will not be willing to pay anywhere near current charges for a “Zoom”-based education. The intelligent, serious ones will, on their own, hole up in isolation with a stack of books and additionally make use of whatever journal articles or other resources they can access for free/minimal cost via the internet. The majority of students will probably simply drop out, and economic considerations will probably make that severance permanent.
The sociology paper is quite interesting. They mention a few reasons why this dataset is particularly unpredictable, such as the large time gap between waves 5 and 6, the 2008 recession, some predictor variables not being shared for privacy reasons, and so on.
As for predicting uncommon/extreme life outcomes, perhaps extreme value theory might be a good candidate to consider? It may also be that studies including a much larger number of “units” (here, the family is the unit for some of the variables being predicted) would be better able to predict these extreme values. But in the absence of such data (and sometimes reality itself may not offer enough sample size), maybe the best use of models in this case is to generate probability distributions for the future, rather than estimating actual values?
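The “probability distributions rather than point estimates” idea above can be sketched with a residual bootstrap: fit a simple regression, then resample its residuals to build a predictive distribution for a new case. Again purely illustrative, on made-up data; the numbers and the choice of a residual bootstrap (rather than any method from the paper) are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: one predictor, noisy outcome.
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=1.0, size=200)

# Fit simple OLS; np.polyfit returns coefficients highest-degree first.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# Predictive distribution for a new unit at x_new:
# the point prediction, plus bootstrap-resampled residuals.
x_new = 1.0
point = intercept + slope * x_new
draws = point + rng.choice(resid, size=5000, replace=True)
lo, hi = np.percentile(draws, [5, 95])
print(f"point prediction: {point:.2f}")
print(f"90% predictive interval: [{lo:.2f}, {hi:.2f}]")
```

The payoff is that the interval communicates how uncertain the forecast is for an individual unit, which a single predicted value hides.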
I also would ask whether this kind of prediction is even useful. The goal here is only to know what the future is like, and whether we can predict it from the past. But science isn’t just that. You also want to understand *why* the future, or any particular event, is the way it is. One downside of the way this paper aims to predict the future is that you don’t have a mechanistic model or understanding of why the predictor variables affect the outcome variables the way they do. So, for example, you can’t extrapolate the results (in terms of predictor variables and their coefficients) to another group of families in a different socio-economic setting, say in another country. This is almost like black-boxing, and I feel like that isn’t really the point of science, because it doesn’t give a good understanding of nature.
An example from one of Feynman’s lectures (https://www.youtube.com/watch?v=NM-zWTU7X-k) is as follows: “Suppose that a young man went to the astronomer and said, ‘I have an idea. Maybe those things are going around, and there are balls of something like rocks out there, and we could calculate how they move in a completely different way from just calculating what time they appear in the sky’. ‘Yes’, says the astronomer, ‘and how accurately can you predict eclipses?’ He says, ‘I haven’t developed the thing very far yet’. Then says the astronomer, ‘Well, we can calculate eclipses more accurately than you can with your model, so you must not pay any attention to your idea because obviously the mathematical scheme is better’.” Prediction is not sufficient for scientific understanding, I would say.
I agree that forecasting isn’t sufficient for scientific understanding. And in my own work, I care much more about understanding than forecasting. But in this context, I do think our inability to forecast life outcomes should give us pause about whether we really understand the determinants of life outcomes, even in retrospect. (But this isn’t at all my field of course, so take my opinions with a very large grain of salt).
Some relevant old posts, addressing these broad issues in an ecological context:
https://dynamicecology.wordpress.com/2013/03/19/ecologists-need-to-do-a-better-job-of-prediction-part-iv-quantifying-prediction-quality/ (part of a series, includes links to other posts in the series)
Oh man, the post-peer-review press conference definitely took it to the next level! Here’s an early effort from me: https://case.edu/artsci/biol/snyder/olympic_scoring.html
“but the questionable limit on line 6 causes her to fall short of a perfect score”. Your proof may not have been perfect, but that joke is 10/10.
That pic has me thinking about possible variants. You could do a boxing version–have a couple of cornermen who rub your shoulders and squirt coffee into your mouth, in between rounds of fighting a particularly tricky derivation.