Spurious Correlations

This joint post grew out of several discussions about correlations with (soon to be) Dr Martin Brandt, who helped write it. You will find the same post on his blog, along with many other interesting posts.

Correlations are a popular way to express the relationship between two variables and its strength. Applications in environmental sciences range from relating satellite-based parameters to ground observations, to relationships between parameters like vegetation and precipitation. Furthermore, scientists use correlations to find links between datasets from entirely different scientific disciplines and spatial scales, e.g. migration and environment. However, many scientists trust these statistical analyses blindly, and even low correlations are often interpreted in a very speculative way without questioning the results.

Too much reliance on statistical parameters can be dangerous, because two variables that are not related can still be strongly correlated. This is shown on this website where, for example, the per capita consumption of margarine (US) is correlated with the divorce rate in Maine at a correlation coefficient of 0.992558. How would you interpret such a relationship? Does this prove that married people shouldn’t eat margarine? It’s a nonsense correlation: the two variables simply happened to change in a similar way over the same years (the correlation is computed over time, not space). In this relationship, the scale problems are quite obvious. First of all, the variables do not have the same spatial extent, even though they overlap in Maine. The temporal detail can also be questioned. Many things happen during a year, so how would this correlation look at a finer temporal detail, for example monthly? We’re sure it would not be as strong.
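How such a nonsense correlation can arise is sketched below in Python. The data are synthetic and invented for illustration; they are not the margarine or divorce figures. The two series are generated independently of each other, and the only thing they share is a downward drift over time. The slopes, noise levels, series length and the helper names `pearson_r` and `first_differences` are all assumptions made for this sketch.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def first_differences(series):
    """Step-to-step changes, which remove a linear trend."""
    return [b - a for a, b in zip(series, series[1:])]


rng = random.Random(42)  # fixed seed so the run is repeatable
steps = range(200)       # hypothetical number of time steps

# Two synthetic series generated independently of each other.
# The only thing they have in common is that both decline over time.
series_a = [5.0 - 0.010 * t + rng.gauss(0, 0.1) for t in steps]
series_b = [8.0 - 0.020 * t + rng.gauss(0, 0.2) for t in steps]

# Correlation of the raw values versus correlation of the changes.
r_levels = pearson_r(series_a, series_b)
r_changes = pearson_r(first_differences(series_a),
                      first_differences(series_b))

print(f"correlation of the raw series:       {r_levels:.3f}")
print(f"correlation of step-to-step changes: {r_changes:.3f}")
```

With settings like these, the raw series should correlate strongly while their step-to-step changes should not. That suggests the high coefficient comes from the shared trend, not from any link between the two variables.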

Strong correlations can often be found between variables that are not directly linked, especially when the spatial and temporal details are coarse (e.g. nationwide, yearly). The interpretation of statistical analysis outputs can be a challenge and therefore it is important to make sure that you know what you’re doing. Furthermore, the output values should be interpreted using common sense and an awareness of how scale issues might affect the results.
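The effect of coarse temporal detail can be sketched in Python as well. The data are again synthetic and invented for illustration, and the record length, slope, noise level and helper names (`pearson_r`, `yearly_means`) are assumptions. Two independently generated monthly series share a weak upward trend; averaging them to yearly values smooths away much of the month-to-month noise, so the shared trend dominates.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def yearly_means(monthly):
    """Average consecutive blocks of 12 monthly values."""
    return [sum(monthly[i:i + 12]) / 12 for i in range(0, len(monthly), 12)]


rng = random.Random(7)  # fixed seed so the run is repeatable
n_months = 40 * 12      # hypothetical record length: 40 years

# Two independent synthetic monthly series: the same weak upward trend,
# but separate, much larger month-to-month noise in each.
series_a = [0.004 * m + rng.gauss(0, 1.0) for m in range(n_months)]
series_b = [0.004 * m + rng.gauss(0, 1.0) for m in range(n_months)]

# Same data, two levels of temporal detail.
r_monthly = pearson_r(series_a, series_b)
r_yearly = pearson_r(yearly_means(series_a), yearly_means(series_b))

print(f"monthly correlation: {r_monthly:.3f}")
print(f"yearly correlation:  {r_yearly:.3f}")
```

In runs like this one, the yearly correlation typically comes out well above the monthly one, although neither series influences the other. Only the level of aggregation has changed.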

Time series of divorce rate in Maine (left axis) and per capita margarine consumption (right axis). Source: http://www.tylervigen.com/
