Once upon a time there was a cute, little method called the Linear Regression. It had some interesting uses. It blew many people’s minds. It was loved and cherished by economists. But then some people loved it too much. They used it for everything. Everything. I’d like to give a little example of where it goes wrong (a nonlinear relationship) and a possible fix for the problem.
I’m usually a Python user myself, but for quick and easy convenience, and because far more economists and social scientists use R, I’ve gone with the latter. Let’s imagine an independent variable, x, and two dependent variables (y, z) derived entirely from x plus noise:
y = x^2 + \varepsilon_y
z = x^2 + \varepsilon_z
\varepsilon_y, \varepsilon_z \sim \mathcal{N}(0,\ 0.25^2)
created with the following R code:
# simulate 1,000 observations: y and z depend on x only through x^2
x <- runif(1000, -1, 1)
y <- x^2 + rnorm(1000, 0, 0.25)
z <- x^2 + rnorm(1000, 0, 0.25)
You might see where this is heading. Let’s try running the simple linear regression
z = \beta_0 + \beta_x x + \beta_y y + \varepsilon
summary(lm(z~x+y))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.123104 0.013118 9.385 <2e-16 ***
x -0.003933 0.017132 -0.230 0.818
y 0.601872 0.025118 23.961 <2e-16 ***
Residual standard error: 0.3168 on 997 degrees of freedom
Well here’s a problem!!! In case you didn’t notice, the variable from which z is derived (x) does not show up as "significant" (what significance really means is a discussion for another day) in the linear regression. Oh, and y, the variable which is related to z only through x, shows up as unbelievably significant. The only thing even close to right here is the standard error. To be fair, who can blame the computer?
So here we are, using a technique that does not automatically pick up nonlinearities. In such a simple example with only two variables, the problem could perhaps be rectified by eyeballing the plots and accordingly creating something like x2 = x^2 and then running lm(z ~ x2 + y), as in the quick sketch below.
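As a quick sanity check, here is that manual fix applied to the simulated data above (a sketch; the name x2 is just for illustration):
x2 <- x^2                    # hand-crafted quadratic term
summary(lm(z ~ x2 + y))      # x2 should now show up as highly significant, while y should not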
In real life, however, we’re usually looking at at least a dozen variables, perhaps hundreds (or thousands). Eyeballing a relationship for each variable isn’t even remotely feasible. What is one to do in light of such misfortune and villainy?!
In walks nonparametric statistics, a wide-brimmed Stetson shading its eyes. A Native American flute plays a foreboding tune.
Let’s try something else: an extraordinary tool that should have been implemented in everyday econometrics a decade ago, the GAM (generalized additive model). The simplest way to describe a GAM is as an additive penalized spline. For example, instead of y = \beta_0 + \beta_1 x_1 + \beta_2 x_2, one can think of a GAM as estimating smoothed curves, y = f_1(x_1) + f_2(x_2). The amount of curviness is optimized (rather than being forced, like our prior x^2 example), so it gives you a line if the relationship is linear and a curve where a curve is appropriate. And yes, you can even get p-values. The most commonly used R package is mgcv, with its gam() function (a difficult-to-remember function name…). Trying out gam():
library(mgcv)  # provides gam() and the s() smooth terms
summary(gam(z ~ s(x) + s(y)))
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x) 5.403 6.541 84.176 <2e-16 ***
s(y) 1.633 2.055 0.654 0.568
Holy crap! y drops out like the poser it really is, while x makes a move to become z’s new significant other (pun totally intended). And thus they lived happily ever after. And look at that smile!
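If you want to see the smile for yourself, plotting the fitted smooths does the trick (a quick sketch; plot() on a fitted gam object draws each estimated smooth):
library(mgcv)  # already attached above; repeated so the snippet stands alone
fit <- gam(z ~ s(x) + s(y))
plot(fit, pages = 1, shade = TRUE)  # s(x) should trace out the familiar parabola; s(y) should be essentially flat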
Ok. I’ll concede, a GAM isn’t a cure-all elixir. But it solves problems of indescribable magnitude in empirical research. And I’ll give my opinion without shame: a GAM should always be fit as a comparison whenever you run a linear model in econometrics. If I can dare to be even more controversial, I’d say we should just stop using linear regressions altogether in favor of GAMs.
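For what it’s worth, putting the two side by side on the same data is a one-liner (a sketch; AIC() works on both lm fits and mgcv gam fits):
fit_lm  <- lm(z ~ x + y)
fit_gam <- gam(z ~ s(x) + s(y))
AIC(fit_lm, fit_gam)  # on this simulated data the GAM should win by a wide margin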
I’ve considered working on a paper along the same lines as this blog post, obviously more involved, drawing on previous studies, technical technicalities, and cool applications. Suggestions are appreciated!