A friend of mine asked me what I thought of this article in Wired, which argues that the existence of extremely large data sets is fundamentally changing our approach to data. The thesis is roughly that we no longer need to worry about constructing models or distinguishing between correlation and causation: with the ability to gather and mine ultralarge data sets, we can just measure correlations and be done with it.
This article is certainly entertaining, but fundamentally it’s deeply silly. There’s a nugget of truth in it: large data sets allow people to do data mining – that is, to search for patterns and correlations in data without really knowing in advance what they’re looking for. That’s an approach to data that used to be rare and is now very common, and it genuinely is changing some areas of science. But it’s simply not the case that this replaces the idea of a model or blurs the distinction between correlation and causation. On the contrary, the reason this sort of thing matters to science is that it’s a great tool for constructing and refining models, which can then be tested in the usual way.
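To make "data mining" concrete, here's a minimal sketch of what searching for correlations with no prior hypothesis looks like. Everything in it is invented for illustration: the synthetic data, the column names, and the 0.5 threshold.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a large data set: 10,000 records, 20 measured variables.
# A couple of columns are deliberately related so the search finds something.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(10_000, 20)),
                    columns=[f"x{i}" for i in range(20)])
data["x1"] = 0.8 * data["x0"] + 0.2 * rng.normal(size=10_000)
data["x7"] = -0.6 * data["x3"] + 0.4 * rng.normal(size=10_000)

# "Data mining" in the loosest sense: compute every pairwise correlation
# and flag the strong ones, with no prior idea of which pairs should matter.
corr = data.corr()
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().sort_values(key=np.abs, ascending=False)
candidates = pairs[pairs.abs() > 0.5]
print(candidates)

# The output is a list of candidate relationships -- raw material for a
# model, not a finished explanation. Each one still has to be tested.
```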
Two of the specific cases the article cites are actually examples of precisely this process. As I understand it, Google’s search algorithm started in more or less the way the article describes, but over time Google refined it through old-fashioned model-building. Google doesn’t rank pages simply by counting the number of incoming links; that’s one ingredient in a complicated algorithm (most of which is secret, of course) that is continually refined based on tests performed on the data. Craig Venter’s an even better example: sure, his approach is based on gathering and mining large data sets, but the reason scientists care about those data sets is precisely so that they can use them to construct old-fashioned models and testable hypotheses.
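For a sense of what "counting links is just one ingredient" means, here's a toy contrast between raw in-link counts and the published PageRank idea of weighting a link by the importance of the page it comes from. To be clear, this is not Google's actual (and secret) ranking algorithm, just a sketch on an invented five-page web:

```python
import numpy as np

# Toy web of five pages; links[i] lists the pages that page i links to.
# (The graph is invented purely for illustration.)
links = {0: [1, 2], 1: [2], 2: [0], 3: [2], 4: [2, 3]}
n = len(links)

# Crudest possible ranking: count incoming links.
in_degree = [sum(i in targets for targets in links.values()) for i in range(n)]

# The published PageRank idea refines this: a link from an important page
# counts for more, computed by iterating to a fixed point with damping.
damping = 0.85
rank = np.full(n, 1.0 / n)
for _ in range(100):
    new_rank = np.full(n, (1 - damping) / n)
    for page, targets in links.items():
        for t in targets:
            new_rank[t] += damping * rank[page] / len(targets)
    rank = new_rank

print("in-degree ranking:", np.argsort(in_degree)[::-1])
print("PageRank ranking: ", np.argsort(rank)[::-1])
```

The point is only that the naive "count the links" rule is a starting point for a model that then gets refined and tested against data, which is exactly the old-fashioned process the Wired article claims is obsolete.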
The case of fundamental particle physics, which is also discussed in the article, is quite different. In fact, it’s totally unclear what it has to do with anything else in the article. Fundamental particle physics has arguably been stuck for a couple of decades now because of a lack of data, not an excess of data. There’s not even the remotest connection between the woes of particle physics and the phenomenon of mining large data sets.
Here’s another way to think about the whole business. The distinction between correlation and causation was always a distinction in principle, not a mere practicality, so it’s not affected by the quality or quantity of the data. No matter how many correlations you measure, the data alone can’t tell you whether A causes B, B causes A, or some third factor drives both; answering that takes a model, an experiment, or both.
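A tiny simulation makes the point (everything here is synthetic): two variables driven by a hidden common cause are strongly correlated, and piling on more data only sharpens the estimate of that correlation. It never reveals that neither variable causes the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hidden common cause drives two measured variables; neither causes the other.
for n in (1_000, 1_000_000):
    confounder = rng.normal(size=n)
    a = confounder + 0.5 * rng.normal(size=n)
    b = confounder + 0.5 * rng.normal(size=n)
    r = np.corrcoef(a, b)[0, 1]
    print(f"n = {n:>9,}: correlation between a and b = {r:.3f}")

# More data makes the estimate of the correlation sharper; it does nothing
# to tell you that intervening on a would leave b completely unchanged.
```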