Once upon a time there was a cute little method called *Linear Regression*. It had some interesting uses. It blew many people’s minds. It was loved and cherished by economists. But then some people loved it *too* much. They used it for everything. *Everything*. I’d like to give a little example of where it fails (a nonlinear relationship) and a possible fix for the problem.

I’m usually a Python user myself, but for quick and easy convenience, and because a lot more economists and social scientists use `R`, I’ve gone with the latter. Let’s imagine an independent variable, $x$, and two dependent variables ($y$ and $z$) derived entirely from $x$ plus noise:

$y=x^{2}+ε_{y}$

$z=x^{2}+ε_{z}$

$ε_{y},ε_{z}\sim N(0,\,0.25^{2})$

created with the following `R` code:

```
x = runif(1000, -1, 1)          # 1,000 draws of x, uniform on [-1, 1]
y = x^2 + rnorm(1000, 0, 0.25)  # y depends on x only through x^2
z = x^2 + rnorm(1000, 0, 0.25)  # z likewise; y and z have no direct link
```
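
Note that this makes $y$ and $z$ correlated even though neither causes the other: with $x\sim U(-1,1)$, $\mathrm{Var}(x^{2})=4/45\approx 0.089$, so against a noise variance of $0.25^{2}=0.0625$ the correlation should land near $0.089/(0.089+0.0625)\approx 0.59$. A one-liner to check:

```
cor(y, z)  # about 0.59 in expectation; the exact value varies with the random draw
```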

You might see where this is heading. Let’s try running the simple linear regression

$z=β_{0}+β_{x}x+β_{y}y+ε$

`summary(lm(z~x+y))`

```
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.123104   0.013118   9.385   <2e-16 ***
x           -0.003933   0.017132  -0.230    0.818
y            0.601872   0.025118  23.961   <2e-16 ***

Residual standard error: 0.3168 on 997 degrees of freedom
```

Well, here’s a problem! In case you didn’t notice, the variable from which $z$ is actually derived ($x$) does not show up as "*significant*" (what that word really means is a discussion for another day) in the linear regression. Oh, and $y$, the variable related to $z$ only via $x$, shows up as unbelievably significant. The only thing even close to right here is the standard error. To be fair, who can blame the computer?
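
One standard diagnostic, sketched here for illustration on the simulated data above, is to plot the residuals against $x$; the quadratic signal the linear fit left behind should be hard to miss:

```
fit_lm <- lm(z ~ x + y)
plot(x, resid(fit_lm))  # a U-shape should emerge: the leftover x^2 signal
```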

So here we are, using a technique that does not automatically pick up nonlinearities. In such a simple example, with only two variables, the problem could perhaps be rectified by eyeballing the plots and accordingly making something like `x2 = x^2` and then `lm(z~x2+y)`, as sketched below.
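
A minimal sketch of that manual fix, reusing the simulated data from above:

```
x2 <- x^2                # hand-built quadratic term, found by eyeballing the plot
summary(lm(z ~ x2 + y))  # x2 should now claim the significance, and y should fade
```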

In real life, however, we’re usually looking at a dozen variables at the very least, perhaps hundreds (or thousands). Eyeballing a relationship for each variable isn’t even remotely plausible. What is one to do in light of such misfortune and villainy?! **In walks nonparametric statistics, a wide-brimmed Stetson shading its eyes. A Native American flute plays a foreboding tune.**

Let’s try something else. An extraordinary tool that should have been adopted into everyday econometrics a decade ago: **the GAM** (generalized additive model). The simplest way to describe a GAM is as an additive model of penalized splines. For example, instead of $y=β_{0}+β_{1}x_{1}+β_{2}x_{2}$, one can think of a GAM as estimating smoothed curves, $y=f_{1}(x_{1})+f_{2}(x_{2})$. The curviness is optimized (rather than forced, like our `x2 = x^2` fix above), so it gives you a line if the relationship is linear and a curve where a curve is appropriate. And yes, you can even get p-values. The most commonly used `R` package is `mgcv`, with its `gam()` function (a difficult to remember function name…). Trying out `gam()` (after `library(mgcv)`):

`summary(gam(z~s(x)+s(y)))`

```
Approximate significance of smooth terms:
       edf Ref.df      F p-value
s(x) 5.403  6.541 84.176  <2e-16 ***
s(y) 1.633  2.055  0.654   0.568
```

Holy crap! $y$ drops out like the poser it really is, while $x$ makes its move to become $z$’s new significant other (pun totally intended). And thus they lived happily ever after. And when you plot the fitted smooth, look at that smile!
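
For the curious, the smile is easy to reproduce with `mgcv`’s built-in plot method for fitted `gam` objects; a sketch, again reusing the simulated data:

```
fit_gam <- gam(z ~ s(x) + s(y))
plot(fit_gam, pages = 1)  # s(x) comes out as a parabola (the smile); s(y) is close to flat
```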

Ok. I’ll concede, a GAM isn’t a cure-all elixir. But it solves problems of indescribable magnitude in empirical research. And I’ll give my opinion without shame: **a GAM should always be run for comparison whenever you fit a linear model in econometrics.** If I can dare to be even more controversial, I’d say

*we should just stop using linear regressions altogether in favor of GAMs.*
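
In practice, "compare" can be as lightweight as fitting both specifications and letting an information criterion arbitrate; a sketch with the data from above:

```
library(mgcv)

fit_lm  <- lm(z ~ x + y)         # the classic linear specification
fit_gam <- gam(z ~ s(x) + s(y))  # the GAM alternative

AIC(fit_lm, fit_gam)             # the lower AIC should point to the GAM here
```

And if the true relationship really were linear, each smooth would shrink toward an edf of about 1 and the two models would essentially agree, so the comparison costs you almost nothing.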

I’ve considered working on a paper along the same lines as this blog post, obviously more involved, with previous studies, technical technicalities, and cool applications. Suggestions are appreciated!