The explosion of data available to language researchers in the form of the Internet and massive corpora (e.g., the Corpus of Contemporary American English or the British National Corpus) is, I think, a necessary step toward a complete theory of what the users of a language know about their language and how they use that information. I became convinced of this by Joan Bresnan’s work on the dative alternation (which I’ve previously fawned over as the research that really drew me into linguistics), in which she and her colleagues show that people unconsciously combine multiple pieces of information during language production in order to make probabilistic decisions about the grammatical structures they use. This went against the original idea (which many grammaticasters still hold) that sentences are always either strictly grammatical or strictly ungrammatical. Furthermore, it showed the essential wrongness of arguing that one structure is grammatical or ungrammatical by analogy to another structure. After all, if (1a) is grammatical, by analogy (1b) has to be as well, right?
(1a) Ann faxed Beth the news
(1b) ?Ann yelled Beth the news
That’s not the case, though.* There are a lot of different factors affecting grammaticality in the dative alternation, including the length difference between the objects, their animacy and number, and even the verb itself. But this conclusion was only reached by using a regression model over a large corpus of dative sentences. This regression identified both the significant features and their effects on the alternation proportions. In addition, having the corpus allowed the researchers to find grammatical sentences that broke previously assumed rules about the dative alternation, showing that the assumed rules were false. Before the corpus study, people thought they mostly understood this alternation; the corpus results turned out to be quite different from what we’d been saying.
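To make "combining multiple pieces of information into a probabilistic decision" a bit more concrete, here is a minimal sketch of the general shape of a logistic model. The feature names and weights below are invented for illustration; they are not the fitted coefficients from the study, and the real model used many more predictors.

```python
import math

# Hypothetical weights, NOT taken from the study. Positive values push
# toward the double-object form ("faxed Beth the news").
WEIGHTS = {
    "intercept": 0.0,
    "recipient_is_pronoun": 1.5,
    "recipient_is_animate": 0.8,
    "theme_longer_by_words": 0.4,  # per extra word in the theme
}

def prob_double_object(recipient_is_pronoun, recipient_is_animate,
                       theme_longer_by_words):
    """Combine several cues into one probability via the logistic function."""
    score = (
        WEIGHTS["intercept"]
        + WEIGHTS["recipient_is_pronoun"] * recipient_is_pronoun
        + WEIGHTS["recipient_is_animate"] * recipient_is_animate
        + WEIGHTS["theme_longer_by_words"] * theme_longer_by_words
    )
    return 1.0 / (1.0 + math.exp(-score))

# No cues either way: the toy model is indifferent between the two forms.
print(prob_double_object(0, 0, 0))
# Pronoun, animate recipient with a theme three words longer.
print(prob_double_object(1, 1, 3))
```

The point of the sketch is only that no single cue decides the outcome: each one shifts the probability a little, which is why neither form has to be strictly "right" or "wrong".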
And this illustrates the power and downright necessity of corpora to descriptivist linguistics (i.e., linguistics). Sure, it might seem obvious that if you really want to describe a language, you need to have massive amounts of data about the language to drive your conclusions. But for almost the whole history of linguistics, we didn’t have it, and had to make do with extracted snippets of the language and imagined sentences, and those are susceptible to all kinds of biases and illusions. Having the corpora available and accessible can save us from some of these biases.
But, of course, corpora can introduce biases of their own. Corpora are imperfect, and in general they still must be supplemented by value judgments and constructed examples. An example that I once had the pleasure of seeing Ivan Sag and Joan Bresnan discuss was that if we go by raw word counts, the common typo “teh” was as much a word in the 1800s as “crinkled”. Similarly, if we were to turn linguistics over to corpora entirely and only accept observed sentences as grammatical, then “I swept a sphere under the fogged window” would be ungrammatical, since it has no hits on Google (at least until this post is indexed). Corpora are treasure troves, but as a quick review of the Indiana Jones series will remind you, treasure troves are laden with pitfalls and spikes.
I was reminded of this when I looked up the historical usage of common English first names to look for rises and falls in their popularity. I looked up Brian in the Google Books N-grams, and found a spike that represents what I like to call the era of Brian:
Hmm, something sent Brian usage through the roof in the late 1920s, only to come crashing back down like the stock market (might they have been linked?!). Time to investigate further in the Corpus of Historical American English (COHA):
Oh wait, never mind, the era of Brian wasn’t in the 1920s; it was in the 1860s (and presaged in the 1830s). Wait, what? Let me go back to Google N-grams:
Oh dear, it’s spreading! What is happening? What is the meaning of Brian?!
The fact is, as you surely already knew, that there was no era of Brian. The variability of the length of the era in the first and third graphs is due to me changing the smoothing factor on the graph. The source of the spike is that in one year the proportion of “Brian” in the corpus shot up to around 10 to 20 times its base level. (This becomes clear if you look at the unsmoothed numbers.) And if we look at the composition of the corpus at that point (1929), it turns out that the Google Books corpus contains a 262-page book titled “Brian, a story”, which seems like it would account for this surge. The COHA corpus has a similar thing going on; two books in 1832 and 1834 have prominent characters with the name Brian, and 1860 has a book titled “Brian O’Linn”.
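Here is a small sketch of how a one-year spike can turn into an "era" of varying length. It assumes the smoothing is a plain centered moving average, and the numbers are made up (a flat baseline with a single 15x spike in 1929), not pulled from either corpus.

```python
def moving_average(values, radius):
    """Centered moving average; radius=0 returns the raw series.
    Windows are truncated at the edges of the series."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - radius)
        hi = min(len(values), i + radius + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed

def years_above(series, years, threshold):
    """Years in which the (smoothed) series exceeds the threshold."""
    return [y for y, v in zip(years, series) if v > threshold]

years = list(range(1920, 1940))
# Hypothetical relative frequencies: baseline 1.0, one 15x spike in 1929.
raw = [15.0 if y == 1929 else 1.0 for y in years]

for radius in (0, 1, 3):
    era = years_above(moving_average(raw, radius), years, threshold=2.0)
    print(f"smoothing {radius}: apparent era {era[0]}-{era[-1]}")
```

With no smoothing the "era" is the single year 1929; with a radius of 3 it stretches from 1926 to 1932, even though the underlying data never changed.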
And that’s one of the problems of corpora. Sure, they’re full of far more linguistic information than the little samples we used to use, but they’re still incomplete, and they aren’t a statistically independent sample of the full range of language. If these corpora contained the whole of all writing published in these years, the Brian spike would be negligible, but because of the inherently incomplete nature of corpora, a single book can have an inordinate effect on the apparent proportions of different words.
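The arithmetic behind that is easy to sketch. All the figures below are hypothetical (an 80,000-word book that uses a name 800 times, against a baseline of 10 uses per million words); the only thing that varies is how much other text the corpus holds for that year.

```python
def spike_ratio(book_hits, book_tokens, baseline_per_million, other_tokens):
    """How many times the baseline rate the name appears at once a single
    name-heavy book is added to the rest of that year's corpus."""
    baseline_rate = baseline_per_million / 1_000_000
    other_hits = baseline_rate * other_tokens
    combined_rate = (book_hits + other_hits) / (book_tokens + other_tokens)
    return combined_rate / baseline_rate

# Hypothetical book: 80,000 words, 800 mentions of the name.
small_corpus = spike_ratio(800, 80_000, 10, 10_000_000)      # 10M other words
large_corpus = spike_ratio(800, 80_000, 10, 1_000_000_000)   # 1B other words

print(f"small corpus: {small_corpus:.2f}x baseline")
print(f"large corpus: {large_corpus:.2f}x baseline")
```

Under these assumptions the same book pushes the name to roughly nine times its baseline in the small corpus, and barely moves it in the large one.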
Corpora are great, but they’re also noisy, and they do require interpretation. I didn’t get that at first, and thought that interpreting corpus data was invariably infecting it with one’s own prejudices. And yeah, that’s a danger, writing off real phenomena that you don’t believe in because you don’t believe in them. But the answer isn’t to accept the corpus data as absolute truth. You have to be as skeptical of your corpora as you are of your constructed examples. And that’s advice, I’m sure, that very few of you will ever need.
*: If you find (1b) to be perfectly grammatical, that’s fine. I think you’ll find other examples in the paper that you consider less than perfectly grammatical but have grammatical analogues. And even if you don’t, the data will hopefully assure you that other people do.