Does data mining supersede science?

A friend of mine asked me what I thought of this article in Wired, which argues that the existence of extremely large data sets is fundamentally changing our approach to data. The thesis is roughly that we no longer need to worry about constructing models or distinguishing between correlation and causation: with the ability to gather and mine ultralarge data sets, we can just measure correlations and be done with it.

This article is certainly entertaining, but fundamentally it’s deeply silly. There’s a nugget of truth in it, which is that large data sets allow people to do data mining – that is, searching for patterns and correlations in data without really knowing what they’re looking for. That’s an approach to data that used to be very rare and is now very common, and it certainly is changing some areas of science. But it’s certainly not the case that this somehow replaces the idea of a model or blurs the distinction between correlation and causation. On the contrary, the reason that this sort of thing has been important to science is that it’s a great tool for constructing and refining models, which can then be tested in the usual way.

Two of the specific cases the article cites are actually examples of precisely this process. As I understand it, Google’s search algorithm started in more or less the way it’s described in the article, but over time they refined it by means of a process of old-fashioned model-building. Google doesn’t rank pages by simply counting the number of incoming links; that’s one ingredient in a complicated algorithm (most of which is secret, of course) that is continually refined based on tests performed on the data. Craig Venter’s an even better example: sure, his approach is based on gathering and mining large data sets, but the reason scientists care about those data sets is precisely so that they can use them to construct old-fashioned models and testable hypotheses.

The case of fundamental particle physics, which is also discussed in the article, is quite different. In fact, it’s totally unclear what it has to do with anything else in the article. Fundamental particle physics has arguably been stuck for a couple of decades now because of a lack of data, not an excess of data. There’s not even the remotest connection between the woes of particle physics and the phenomenon of mining large data sets.

Here’s another way to think about the whole business. The distinction between correlation and causation was always a distinction in principle, not a mere practicality. It’s not affected by the quality and quantity of the data.
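To make that concrete, here is a minimal toy simulation (my own illustration, not anything from the Wired article). It assumes a made-up hidden variable z that drives both x and y, while x has no effect on y at all. The function names and noise levels are arbitrary choices for the sketch.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def observe(n, rng):
    """Passive observation: hidden z drives both x and y; x never affects y."""
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)
        xs.append(z + rng.gauss(0, 0.5))
        ys.append(z + rng.gauss(0, 0.5))
    return xs, ys


def intervene(n, rng):
    """Experiment: we set x ourselves, independently of z, then record y."""
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)
        xs.append(rng.gauss(0, 1))
        ys.append(z + rng.gauss(0, 0.5))
    return xs, ys


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (100, 10_000, 100_000):
        r_obs = pearson(*observe(n, rng))
        r_exp = pearson(*intervene(n, rng))
        print(f"n={n:>7}  observed r={r_obs:+.3f}  experimental r={r_exp:+.3f}")
```

With these particular noise levels, the observed correlation should settle near 0.8 as n grows, while the experimental one should settle near zero. More data only pins down the correlation more precisely; it takes the intervention (that is, a model and a test of it) to reveal that x does nothing to y.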

Bonjour!

I’m spending this week in Paris, at a conference on Bolometric interferometry for the B mode search.  This is a narrowly focused workshop on a specific technique that may be used for future measurements of the microwave background.  (That’s why its name is so completely esoteric.) For a number of years now, I’ve been part of a collaboration trying to develop this technique.

This is the second conference I’ve been to in the past few weeks.  St. Louis was perfectly nice, but I have to say it’s nicer to be in Paris, weak dollar notwithstanding.

Plutoids

According to the International Astronomical Union, Pluto is still not a planet, but it is a Plutoid.  If I recall correctly, at the time the original naming decision was made, there was a proposal to call the class of Pluto-like objects “Plutons,” but that was rejected, in part because “Pluton” is already the name of Pluto in various languages, including French.  I guess “Plutoid” solves that problem.

I don’t much care about Pluto no longer being considered a planet, but I do think that the IAU made a poor choice of naming conventions.  According to the new system, Pluto and similar objects are not planets, but they are “dwarf planets.”  That’s right: a dwarf planet is not a planet.  That’s a needlessly confusing naming convention, especially since it’s inconsistent with the terminology in the rest of astronomy: dwarf stars are stars, and dwarf galaxies are galaxies.

That’s old news now, of course: the new wrinkle, namely the introduction of the term “Plutoid,” neither solves nor worsens that problem.

Even though it’s all in the past, here are a couple of observations about the Pluto-classification flap:

1. Obviously, no interesting scientific questions hinge on whether we choose to classify Pluto as a planet.  I recall a news article at the time of the Great Naming Controversy saying that the future of NASA’s New Horizons probe was in doubt because of the reclassification of Pluto as a non-planet.  That’s an obviously ridiculous notion: As Abraham Lincoln could tell you, the nature of a thing doesn’t change because of what we call it.

2.  The justification for the Great Renaming was to have a precise physical definition of the word “planet.”  Mike Brown has argued against the need for such a definition: Why not just consider the word “planet” to mean the nine bodies that it has traditionally meant?  By way of analogy, the word “continent” refers to a conventional set of seven land masses.  We don’t really need to justify why it’s that list of seven (Why aren’t Europe and Asia considered as one? Why not include Greenland?).   Often, science needs precise, objective definitions in order to proceed.  But it’s not clear that in this case anyone was being hampered by the arbitrary nine-body definition of the word “planet.”  What, exactly, was the problem that the IAU solved?

American Astronomical Society Meeting

My research group and I are just back from the summer AAS meeting in St. Louis. Here we all are at the hotel just before leaving:

[Photo: aas20082.JPG]

Our group presented four posters, with primary authors Austin Bourdon, Brent Follin, Ben Rybolt, and me. This means that a pretty large fraction of the undergraduate presentations were by UR students. I suspect that we had more undergraduates presenting than any other college, although someone said that Wesleyan had a lot too. If you want to know about the research we were presenting, take a look at Austin’s, Brent’s, Ben’s, and my posters.

The fifth member of the group is Haoxuan Jeff Zheng, who wasn’t presenting this time because he only started research quite recently.

The meeting felt a bit small to me, compared to past AAS meetings I’ve been to: there didn’t seem to be that much going on. There were certainly some good talks, though. John Monnier talked about using interferometers (particularly CHARA) to produce images of rapidly rotating stars. In general, stars just look like points of light, even through the largest telescopes. But with these interferometers, you can actually resolve the stars well enough to see their overall shapes. Rapidly rotating stars bulge out at the equator, so they’re quite far from circular in appearance. Some are rotating so fast that they’re pretty close to breaking up. I had no idea how far the state of the art in this field had advanced in recent years.

The best talk was by Sean Carroll, on one of those questions that sounds stupid when you first hear it, but gets more interesting the more you think about it: Why does time flow in one direction and not the other? Why is the future different from the past? The reason it’s puzzling is that the microscopic laws of physics look the same whether you run time forwards or backwards, which makes it a bit strange that the large-scale universe doesn’t.

A conventional answer to this question is to invoke the second law of thermodynamics. Carroll argued that this only pushes the problem back one step, rather than really solving it. He argued further that none of the usual attempts at further explanation, including the theory of inflation, really solve the problem. He speculated a bit on what a true explanation might look like, but mostly had to admit that we have no idea.