“Using data as a singular is wrong,” writes Barbara Wallraff in Word Court. I agreed with this in my middle-school days, when I strove to show that I was destined for great things and illustrated it by being painfully, officiously, incessantly prim. I went to science fairs, and on the little note cards that carried my polished speech about bacteria, or supernovae, or whatever it was I was claiming to have made an important discovery about, I usually wrote “The data show that …” just before revealing the results that I expected would blow the judges away. Surprisingly, though, they weren’t shocked by results such as “oil-eating bacteria do indeed eat oil” nor amazed by my revelations that “salt-water shrimp die in fresh water”. But maybe they were just too distracted by the phrase “the data show” to actually listen to what it was that the data were showing.
I generally stopped treating data as a plural a few years ago, because no matter how many times I used it that way, it always sounded like I was putting on airs. I know, I know: data entered the language as the plural of the Latin borrowing datum, and therefore data should forever be a plural in English. But it’s really not so simple as that. I’m not about to argue that data are is wrong. But I am going to argue that there are some good reasons to accept data is.
Exhibit A: the acceptability of — in fact, the preference for — data is in certain circumstances. There are two major senses for the word data. The original sense is a collection of numbers, facts, results, etc. from experiments and observations, as in (1). The other sense is a collection of information stored on a computer, usually in binary form, as in (2).
The second sense of data is a mass noun; it sounds quite odd to say “I have a data/datum on this hard drive”. It’s like mail, milk, money, and some non-m words as well. Mass nouns receive singular agreement:
(3a) Your mail is/*are sitting on the table.
(3b) The data on these hard drives is/*are corrupt.
So for this computerized sense, data is is not only acceptable, but strongly preferred. (There are a few instances of plural agreement with computer data, but these are quite rare.) Now here’s the problem: nowadays it’s awful hard to separate the two senses of data. I, for instance, build computer models of human language usage. So my data is a collection of facts in the world that is represented as a collection of binary digits on a computer disk; I could be using either sense of data to describe it. So what’s the problem with choosing to treat it as a mass noun, if that’s one possible form for it?
Exhibit B: other Latinate words have shed their plural history for the singular. Most prominent amongst these is agenda. Yes, agenda, meaning the set of points to be discussed in a meeting, the set of things to do in the future, or the book in which a calendar is kept. Agenda is treated as a singular noun, with agendas as its plural. Agendum, the “proper” singular form, has pretty well disappeared from English. Surprisingly, this transition has NOT led to linguistic anarchy, nor any other notable harm to the language or its speakers. It seems to be safe to allow data to follow the same path.
Exhibit C: there is not always agreement between semantics and syntax.
(4a) Where are my pants?
(4b) My scissors have rusted.
(4c) I own many pairs of plaid shorts.
What do these sentences have in common? Each refers to a semantically singular object with plural syntactic agreement. Note that each of these objects is composed of two parts, but each undeniably functions as a whole. A pair of pants is not like a pair of shoes in that regard, because you could have a single shoe, but not a single pant — that’s a pant leg. (I’m ignoring Express’s Editor Pant here.) So if we English speakers are willing to tolerate a single object taking plural agreement, why can’t we tolerate the “plural” data taking singular agreement?
Exhibit D: Lastly, when I’m speaking of data as a linguist, I’m not just talking about a set of facts, but rather a collection of facts, observations, arguments, and analyses. It is rare, in this day and age, that the points of data in an experiment, taken alone, can justify a claim. (I’m pretty sure this is the case in most fields.) The fact that people take longer to read certain words in a certain task does not, in and of itself, establish that these words are harder than others. Rather, this fact, combined with a set of assumptions and analyses that we all agree to accept, establishes the claim. We’re viewing the datums as a sort of team, all working together. In that sense, the data is an inter-related mass, rather than a series of separable points of data. Thus, it can reasonably be thought of as a mass noun or a collective noun (like family or team), either of which would take singular agreement in Standard American English.
That’s four reasons why I think data is should be all right. Insisting that people should say data are, in spite of the fact that an American English speaker can’t use data are without sounding pretentious or outmoded, is stupid. You’re welcome to keep using it, but stop making other people use it too. I don’t see any harm coming to the language based on how you use data. I don’t see any improvement to the language either. Go with what feels right. I’m guessing that’s data is.
Summary: A lot of people insist that data is is unacceptable. But there’re at least four reasons why data is should be fine. So if you think, like I do, that data is works better than data are, well, go ahead and use it! The same holds if you think data are is better. But it’s stupid to argue that only one or the other is correct.
The Stupid Grammar Rules series as it stands:
- I: Email vs. e-mail (04/11/08)
- II: data are (08/11/08)