
One of my labmates was practicing his talk for an upcoming conference, looking at the predictability of different continuations of a sentence. Showing a logarithmic graph of word frequencies, he remarked that at one end of the scale, the words were one-in-a-million continuations. None of us were surprised. That’s one of the neat things about being a computational psycholinguist; we deal with one-in-a-million occurrences so often that they’re mundane.

But as I thought about it, I realized that one-in-a-million events shouldn’t be so surprising to any language user. Consider the following sentences:

(1a) The eulogy started sweet and was joined by a chorus of sniffles […]
(1b) The dignified woman, who came to the shelter in February, taught a younger woman to tell time […]
(1c) The entrenched board is made up of individuals […]
(1d) The kitties hid inside the space under the dishwasher […]

Each of the above sentences starts out with a one-in-a-million event; if you look at a million sentences starting with The, on average there will be one sentence with eulogy as the second word, one with dignified as the second word, and so on. Those might not feel like one-in-a-million words, so let me go a little into how we get the odds of a certain word. (You can skip the next section if you already know about corpus probabilities.)

The odds of a word. I used the Google N-gram corpus, a collection of a trillion words from the Internet. It’s called the n-gram corpus because it gives counts of n-grams, which are just phrases of n words. A 1-gram (or unigram) is just an individual word, a 2-gram (or bigram) is something like the house or standing after, and so on for larger values of n. N-grams are really useful in natural language processing because they are an easy-to-use stand-in for more complicated linguistic knowledge.
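To make the idea of an n-gram concrete, here is a minimal sketch in Python. The helper name `ngrams` and the toy sentence are illustrative assumptions, not anything from the Google corpus tooling; the point is only that an n-gram is a sliding window of n consecutive words, and that a corpus like this one stores how often each window occurs.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy sentence, lowercased and split on whitespace (real corpora tokenize more carefully).
tokens = "the kitties hid inside the space under the dishwasher".split()

unigram_counts = Counter(ngrams(tokens, 1))  # counts of individual words
bigram_counts = Counter(ngrams(tokens, 2))   # counts of two-word phrases

print(unigram_counts[("the",)])              # how many times "the" appears
print(bigram_counts[("the", "kitties")])     # how many times "the kitties" appears
```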

The question we’re looking at here is how predictable a word is given all the context you have. Context can cover a wide range of information, including what’s already been said, the environment the words are said in, and personal knowledge about the speaker and the world. For instance, if you hear someone carrying an umbrella say “The weather forecast calls for”, you’re probably going to predict the next word is “rain”. If the speaker were carrying a shovel instead, you might guess “snow”.

If you want a quick estimate of the predictability of a word, you can use n-grams to give a sort of general probability for the next word. So, in (1a), the predictability of eulogy is estimated as the probability of seeing eulogy following The at the start of the sentence based on the counts in the corpus. Here’s how we get the one-in-a-million estimate:

\[ p(\text{eulogy} \mid \text{The}) = \frac{C(\text{The eulogy})}{C(\text{The})} = \frac{2288}{2{,}300{,}000{,}000} \approx \frac{1}{1{,}000{,}000} \]

Let me break this equation down. The left-hand side, p(eulogy|The) is the estimated probability of seeing eulogy given that we started the sentence with The. This estimate is gotten by counting the number of times we see The eulogy at the start of a sentence, and dividing by the number of times we see The at the start of a sentence. (I’ve written these as C(The eulogy) and C(The).) The reason we’re dividing is that we know, when we reach the second word in the sentence, that the first word was The, and we want to know what proportion of those sentences continue with eulogy. (If we wanted the probability that a randomly-chosen sentence starts with The eulogy, we’d divide by the total number of sentences in the corpus instead.) In the corpus, there are 2.3 billion sentences starting with The, and 2288 starting with The eulogy. So, given that we’ve seen The, there’s a one-in-a-million chance we’ll see eulogy next.
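The same arithmetic can be written as a short Python sketch. The two counts are the ones quoted above; the function name `conditional_probability` is just an illustrative label for the count ratio, not part of any corpus toolkit.

```python
# Counts quoted in the post for sentence-initial n-grams.
count_the = 2_300_000_000   # C(The): sentences starting with "The" (rounded figure)
count_the_eulogy = 2288     # C(The eulogy): sentences starting with "The eulogy"

def conditional_probability(bigram_count, context_count):
    """Estimate p(word | context) as C(context word) / C(context)."""
    return bigram_count / context_count

p = conditional_probability(count_the_eulogy, count_the)
print(f"p(eulogy | The) = {p:.2e}")
print(f"roughly 1 in {round(1 / p):,}")
```

Dividing by `count_the` rather than by the total number of sentences is what makes this a conditional estimate: it only looks at sentences that already began with The.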

From 1/1,000,000 to 1/20. Okay, thanks for sticking with me through the math. Now let’s talk about what this really means. In conversation, saying that the odds are a million to one against something means that it’s not going to happen, and yet we often see these linguistic one-in-a-million events. In fact, to finally get around to the point I mentioned in the post title, it turns out that if the first word of a sentence is The, there’s a one-in-twenty chance of a one-in-a-million event. How’s that?

Well, let’s start by thinking of a scenario where there is a 100% chance of a one-in-a-million occurrence: a sweepstakes with one million tickets. If the sweepstakes is above-board, someone has to win it. The probability of some specific ticket winning is one in a million, but at the same time, the probability that some ticket wins is one. The sweepstakes is guaranteed to generate a one-in-a-million event based on the way it is set up. That’s why it’s no surprise to find out someone won the lottery, but it’s a shock when it turns out to be you.

Now suppose you want to boost your chances by buying 1,000 tickets. Each individual ticket still has the one-in-a-million probability, but the probability of the winning ticket being one of your purchased ones is now one in a thousand. This is sort of what’s going on in the linguistic world. In language, there are so many low-probability words that even though they individually have less than a one-in-a-million chance of following The, the aggregate likelihood of seeing one of these words is relatively high. The starts 2.3 billion sentences, and of those sentences, 0.15 billion continue with “rare” words like eulogy or kitties in the second position. Each of these words is individually rare, but there are a whole lot of them, so they carry a lot of probability mass.
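Here is a minimal sketch of both calculations, the sweepstakes and the rare-word share. It uses only the rounded figures quoted above, so the language ratio is an order-of-magnitude check (it lands at a few percent, the same ballpark as the one-in-twenty figure) rather than an exact reproduction of the underlying counts. All variable names are illustrative.

```python
from fractions import Fraction

# Sweepstakes with one million tickets and exactly one winner.
tickets = 1_000_000
p_single_ticket = Fraction(1, tickets)         # any one specific ticket wins
p_some_ticket_wins = tickets * p_single_ticket # some ticket wins: adds up to 1
p_yours_wins = 1000 * p_single_ticket          # one of your 1,000 tickets wins

# Language: rounded counts quoted in the post.
sentences_starting_with_the = 2.3e9
sentences_with_rare_second_word = 0.15e9
p_rare_continuation = sentences_with_rare_second_word / sentences_starting_with_the

print(p_some_ticket_wins, p_yours_wins)
print(f"share of rare continuations: {p_rare_continuation:.1%}")
```

The pattern is the same in both halves: many individually tiny probabilities are summed, and the sum is not tiny.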

Outside of language, too. This is a more general point: rare events aren’t so rare if you don’t care which rare event occurs. As a big sports watcher, I’m always amazed at a good sports statistician’s ability to find a rare event in the most mundane of games. For instance, consider today’s report from the Elias Sports Bureau, where they note that yesterday was the first time that there were seven or more baseball games in which the winning team scored less than four runs and won by a single run since May 21, 1978. It’s a rare event, sure, but if it hadn’t been this specific rare event, it would have been another.

The webcomic XKCD shows how science (and science journalism, especially) can suffer from this same problem. A statistically significant result is generally one where there is only a one-in-twenty chance of its occurring as a coincidence. But if you test, as the scientists in the comic do, twenty colors of jellybeans independently to see if they cause acne, at least one probably will appear to. (There’s a 64% chance of that, in fact.) Again, a rare event becomes expected simply because there are so many ways it can occur. This is why it’s easy to find coincidences.
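The 64% figure follows from the complement rule, assuming the twenty tests are independent and each has a 5% false-positive rate. A minimal sketch (variable names are illustrative):

```python
alpha = 0.05   # chance a single test looks significant by coincidence
n_tests = 20   # twenty jellybean colors, tested independently

# P(at least one false positive) = 1 - P(no false positives in any test)
p_no_false_positives = (1 - alpha) ** n_tests
p_at_least_one = 1 - p_no_false_positives

print(f"{p_at_least_one:.0%}")
```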

This is getting long, even for me, so let me wrap up with two basic points. If a lot of events occur, some of them will almost certainly be rare ones. Similarly, if all of the possible outcomes are individually rare, then the observed outcome will almost certainly be rare. It’s true in language, and it’s true in life. I’m sure you can find other morals to these stories, and I’d love to hear them.

[By the way, if you want to read more math posts written for a lay audience, go check out Math Goes Pop!, written by my college roommate who, unlike me, didn’t defect out of mathematics.]

The explosion of data available to language researchers in the form of the Internet and massive corpora (e.g., the Corpus of Contemporary American English or the British National Corpus) is, I think, a necessary step toward a complete theory of what the users of a language know about their language and how they use that information. I became convinced of this with Joan Bresnan’s work on the dative alternation — which I’ve previously fawned over as the research that really drew me into linguistics — in which she and her colleagues show that people unconsciously combine multiple pieces of information during language production in order to make probabilistic decisions about the grammatical structures they use. This went against the original idea (which many grammaticasters still hold) that sentences are always either strictly grammatical or strictly ungrammatical. Furthermore, it showed the essential wrongness of arguing that one structure is ungrammatical on analogy to another structure. After all, if (1a) is grammatical, by analogy (1b) has to be as well, right?

(1a) Ann faxed Beth the news
(1b) ?Ann yelled Beth the news

That’s not the case, though.* There are a lot of different factors affecting grammaticality in the dative alternation, including the length difference between the objects, their animacy and number, and even the verb itself. But this conclusion was only reached by using a regression model over a large corpus of dative sentences. This regression identified both the significant features and their effects on the alternation proportions. In addition, having the corpus allowed the researchers to find grammatical sentences that broke previously assumed rules about the dative alternation, showing that the assumed rules were false. Prior to having a corpus study on this alternation, people thought they mostly understood it, but now that we have the corpus study, the results are much different from what we’d been saying.

And this illustrates the power and downright necessity of corpora to descriptivist linguistics (i.e., linguistics). Sure, it might seem obvious that if you really want to describe a language, you need to have massive amounts of data about the language to drive your conclusions. But for almost the whole history of linguistics, we didn’t have it, and had to make do with extracted snippets of the language and imagined sentences, and those are susceptible to all kinds of biases and illusions. Having the corpora available and accessible can save us from some of these biases.

But, of course, corpora can introduce biases of their own. Corpora are imperfect, and in general they still must be supplemented by value judgments and constructed examples. An example that I once had the pleasure of seeing Ivan Sag and Joan Bresnan discuss was that if we go by raw word counts, the common typo teh was as much a word in the 1800s as crinkled. Similarly, if we were to turn linguistics over to corpora entirely and only accept observed sentences as grammatical, then I swept a sphere under the fogged window would be ungrammatical, since it has no hits on Google (at least until this post is indexed). Corpora are treasure troves, but as a quick review of the Indiana Jones series will remind you, treasure troves are laden with pitfalls and spikes.

Yep, this is exactly the sort of danger I face in my day-to-day research.

I was reminded of this when I looked up the historical usage of common English first names to look for rises and falls in their popularity. I looked up Brian in the Google Books N-grams, and found a spike that represents what I like to call the era of Brian:

Hmm, something sent Brian usage through the roof in the late 1920s, only to come crashing back down like the stock market (might they have been linked?!). Time to investigate further in the Corpus of Historical American English (COHA):

Oh wait, never mind, the era of Brian wasn’t in the 1920s; it was in the 1860s (and presaged in the 1830s). Wait, what? Let me go back to Google N-grams:

Oh dear, it’s spreading! What is happening? What is the meaning of Brian?!

The fact is, as you surely already knew, that there was no era of Brian. The variability of the length of the era in the first and third graphs is due to me changing the smoothing factor on the graph. The source of the spike is that in one year the proportion of “Brian” in the corpus shot up to around 10 to 20 times its base level. (This becomes clear if you look at the unsmoothed numbers.) And if we look at the composition of the corpus at that point (1929), it turns out that the Google Books corpus contains a 262-page book titled “Brian, a story”, which seems like it would account for this surge. The COHA corpus has a similar thing going on; two books in 1832 and 1834 have prominent characters with the name Brian, and 1860 has a book titled “Brian O’Linn”.
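To see how a smoothing factor can stretch a one-year spike into an "era", here is a minimal sketch with made-up numbers. It assumes the smoothing is a centered moving average (each year averaged with up to k years on either side), and the yearly values are hypothetical: a flat baseline with a single 15x spike in 1929, not real corpus data.

```python
def smooth(values, k):
    """Centered moving average: average each point with up to k neighbors per side."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - k): i + k + 1]
        out.append(sum(window) / len(window))
    return out

# Hypothetical relative frequencies: baseline 1.0, one-year spike of 15.0 in 1929.
years = list(range(1920, 1940))
freq = [15.0 if year == 1929 else 1.0 for year in years]

def elevated_years(k):
    """Years whose smoothed value sits above the baseline."""
    return [y for y, v in zip(years, smooth(freq, k)) if v > 1.0]

for k in (0, 1, 3):
    span = elevated_years(k)
    print(f"smoothing {k}: elevated {span[0]}-{span[-1]}, peak {max(smooth(freq, k)):.2f}")
```

With no smoothing the bump is a single year; with wider windows the same spike is spread over more years at a lower height, which is why the apparent length of the era changes between graphs.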

And that’s one of the problems of corpora. Sure, they’re full of far more linguistic information than the little samples we used to use, but they’re still incomplete, and they aren’t a statistically independent sample of the full range of language. If these corpora contained the whole of all writing published in these years, the Brian spike would be negligible, but because of the inherently incomplete nature of corpora, a single book can have an inordinate effect on the apparent proportions of different words.

Corpora are great, but they’re also noisy, and they do require interpretation. I didn’t get that at first, and thought that interpreting corpus data was invariably infecting it with one’s own prejudices. And yeah, that’s a danger, writing off real phenomena that you don’t believe in because you don’t believe in them. But the answer isn’t to accept the corpus data as absolute truth. You have to be as skeptical of your corpora as you are of your constructed examples. And that’s advice, I’m sure, that very few of you will ever need.

*: If you find (1b) to be perfectly grammatical, that’s fine. I think you’ll find other examples in the paper that you consider less than perfectly grammatical but have grammatical analogues. And even if you don’t, the data will hopefully assure you that other people do.

About The Blog

A lot of people make claims about what "good English" is. Much of what they say is flim-flam, and this blog aims to set the record straight. Its goal is to explain the motivations behind the real grammar of English and to debunk ill-founded claims about what is grammatical and what isn't. Somehow, this was enough to garner a favorable mention in the Wall Street Journal.

About Me

I'm Gabe Doyle, currently an assistant professor at San Diego State University, in the Department of Linguistics and Asian/Middle Eastern Languages, and a member of the Digital Humanities. Prior to that, I was a postdoctoral scholar in the Language and Cognition Lab at Stanford University. And before that, I got a doctorate in linguistics from UC San Diego and a bachelor's in math from Princeton.

My research and teaching connects language, the mind, and society (in fact, I teach a 500-level class with that title!). I use probabilistic models to understand how people learn, represent, and comprehend language. These models have helped us understand the ways that parents tailor their speech to their child's needs, why sports fans say more or less informative things while watching a game, and why people who disagree politically fight over the meaning of "we".
