Separating the Signal and the Noise

Applying tests of statistical significance to your marketing analyses without overfitting the data.

One of the biggest challenges facing a marketing analyst, particularly one tasked with analyzing “small data”, is drawing statistically sound conclusions. Often, in an attempt to apply as many valuable attributes as possible to a data set, a statistical model becomes so complex that it starts amplifying errors and losing its predictive value. In essence, the model begins to “memorize” the data, offering valid conclusions about that particular data set but losing the ability to properly analyze new data.

I’ll offer a real-world example, one that I borrowed from an old colleague and that has stuck with me. Imagine you’re at a symphony and you want to capture the music as accurately as you possibly can. To do so, you buy a top-of-the-line microphone and bring it into the concert hall. This microphone records with such acuity that it picks up all of the ambient sounds in addition to the symphony. You hear people shuffling and clearing their throats, soft conversations within the audience, programs crumpling in hands. Perhaps this ambient sound detracts from the symphony; even worse, perhaps it overwhelms the symphony and results in a recording muddled with background noise.

In this scenario, the omnidirectional microphone is the statistical model. The microphone, while very capable in its own right, is much too powerful for the situation it’s being applied to. Perhaps a unidirectional microphone would have been more fitting. Your microphone overfits the symphony.

To give a more statistical example, and perhaps some insight into what set this train of thought in motion, I was recently working on a data set that consisted of only 370 rows, each with as many as 52 attributes. I was tasked with isolating which particular metrics were predictive of attrition; the task was further complicated by the fact that I only had 70 or so historical data points to analyze. Fortunately, with 52 attributes at my disposal, the potential for meaningful, correlating metrics was immense, right?

Wrong. The deeper into the model I went, the less predictive it became. At first this didn’t make any sense: if I was finding dozens of correlated, significant data pairs, why wasn’t my model correctly predicting attrition? My model seemed infallible. My Pearson r values suggested correlation; the corresponding p-values suggested significance.
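To make that concrete, here is a minimal sketch of the multiple-comparisons trap, using randomly generated noise shaped like the data set above rather than my actual data (the attributes and figures are purely illustrative). Run Pearson correlations across 52 noise attributes and a handful of them will look “significant” at the usual 0.05 threshold by chance alone:

```python
import numpy as np
from scipy.stats import pearsonr

# Pure noise shaped like the data set described above:
# 370 rows, 52 attributes, and an outcome with no real signal.
rng = np.random.default_rng(42)
X = rng.normal(size=(370, 52))
attrition = rng.normal(size=370)

significant = 0
for j in range(X.shape[1]):
    r, p = pearsonr(X[:, j], attrition)
    if p < 0.05:
        significant += 1
        print(f"attribute {j:2d}: r = {r:+.3f}, p = {p:.4f}")

# At alpha = 0.05 across 52 independent tests, we expect roughly
# 52 * 0.05 = 2.6 spurious "significant" correlations on average.
print(f"{significant} of 52 noise attributes passed the significance test")
```

Change the seed and the list of “winning” attributes changes with it, which is exactly how noise behaves and exactly how signal doesn’t.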

Ultimately, the answer was buried in my inability to separate the signal from the noise. Consider the question here, simply: what factors contribute to attrition? In this scenario, the signal could be interpreted as metrics like purchase consistency, seasonality, return on investment and other common data points that exist in most transactional data sets. But the noise? Oh, the noise. There is quite a bit of it in only 370 rows. When you consider the total number of values in play, 370 × 52 = 19,240, the opportunity for noisy, unpredictable data is immense. In this data set, finding the noise was relatively easy. Applying a regression model to such a small amount of data doesn’t make much sense, and breaking metrics down into population means and then into standard deviations created a seemingly unnecessary amount of complexity. Unfortunately, identifying noise in a data model can be difficult; sometimes it’s only possible once the model fails to be predictive. When your model’s performance on test data starts to diverge dramatically from its performance on the training data, you’ve likely built a model that fits just a bit too well.
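That divergence is easy to demonstrate. The sketch below uses scikit-learn with randomly generated stand-in data (again, not my actual data set): an unconstrained decision tree effectively memorizes the training rows, scoring near-perfectly on the data it has seen and falling apart on the data it hasn’t:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Stand-in data: one weak signal buried in 52 mostly-noise attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(370, 52))
y = 0.5 * X[:, 0] + rng.normal(size=370)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# An unconstrained tree is complex enough to memorize the training rows.
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

print("train R^2:", r2_score(y_train, model.predict(X_train)))  # ~1.0
print("test R^2: ", r2_score(y_test, model.predict(X_test)))    # far lower
```

The honest fix is usually less model, not more: fewer attributes, simpler structure, and held-out data checked early and often.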

To summarize: an overfit model is one where the ratio of model complexity to training data size is too large. While striving for linearity, it’s very easy to lose sight of the goal. Making x and y appear perfectly linear is easy if you have a model complex enough to reduce each error to almost nothing. Any data set can be manipulated to the point where all of the data correlates so well as to be completely unrealistic. But, unfortunately, we live in a noisy world, and your perfect, overfit model will sometimes fall apart when exposed to the elements.
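One last toy illustration of that point, using nothing but NumPy: fit ten noisy points with a degree-nine polynomial and the training error drops to essentially zero, yet the moment you step outside the training range, the “perfect” model collapses:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Ten points from a simple linear trend, plus noise.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
y = x + rng.normal(scale=0.2, size=10)

# A degree-9 polynomial through 10 points: enough complexity to
# reduce every training error to (almost) zero.
fit = Polynomial.fit(x, y, deg=9)

print("max training error:", np.max(np.abs(fit(x) - y)))  # ~0
print("prediction at x = 1.1:", fit(1.1))  # likely far from the true trend
```

The fit is flawless on the ten points it has seen and meaningless everywhere else, which is the overfitting trade in miniature.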