One of my labmates was practicing his talk for an upcoming conference, looking at the predictability of different continuations of a sentence. Showing a logarithmic graph of word frequencies, he remarked that at one end of the scale, the words were one-in-a-million continuations. None of us were surprised. That’s one of the neat things about being a computational psycholinguist; we deal with one-in-a-million occurrences so often that they’re mundane.

But as I thought about it, I realized that one-in-a-million events shouldn’t be so surprising to any language user. Consider the following sentences:

(1a) The eulogy started sweet and was joined by a chorus of sniffles […]

(1b) The dignified woman, who came to the shelter in February, taught a younger woman to tell time […]

(1c) The entrenched board is made up of individuals […]

(1d) The kitties hid inside the space under the dishwasher […]

Each of the above sentences starts out with a one-in-a-million event; if you look at a million sentences starting with *The*, on average there will be one sentence with *eulogy* as the second word, one with *dignified* as the second word, and so on. Those might not feel like one-in-a-million words, so let me go a little into how we get the odds of a certain word. (You can skip the next section if you already know about corpus probabilities.)

**The odds of a word.** I used the Google N-gram corpus, a collection of a trillion words from the Internet. It’s called the n-gram corpus because it gives counts of n-grams, which are just phrases of *n* words. A 1-gram (or unigram) is just an individual word, a 2-gram (or bigram) is something like *the house* or *standing after*, and so on for larger values of *n*. *N*-grams are really useful in natural language processing because they are an easy-to-use stand-in for more complicated linguistic knowledge.
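(If a concrete picture helps, here’s a minimal sketch of what "extracting the n-grams of a sentence" means; the function name and example sentence are my own, not part of the Google corpus tooling.)

```python
def ngrams(words, n):
    """Return the list of n-grams (as tuples) in a sequence of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "the kitties hid inside the space".split()
print(ngrams(sentence, 1))  # unigrams: [('the',), ('kitties',), ('hid',), ...]
print(ngrams(sentence, 2))  # bigrams: [('the', 'kitties'), ('kitties', 'hid'), ...]
```

An n-gram model then just counts how often each of these tuples occurs across the whole corpus.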

The question we’re looking at here is how predictable a word is given all the context you have. Context can cover a wide range of information, including what’s already been said, the environment the words are said in, and personal knowledge about the speaker and the world. For instance, if you hear someone carrying an umbrella say “The weather forecast calls for”, you’re probably going to predict the next word is “rain”. If the speaker were carrying a shovel instead, you might guess “snow”.

If you want a quick estimate of the predictability of a word, you can use *n*-grams to give a sort of general probability for the next word. So, in (1a), the predictability of *eulogy* is estimated as the probability of seeing *eulogy* following *The* at the start of the sentence, based on the counts in the corpus. Here’s how we get the one-in-a-million estimate:

*p(eulogy|The) = C(The eulogy) / C(The) = 2,288 / 2,300,000,000 ≈ 1/1,000,000*

Let me break this equation down. The left-hand side, *p(eulogy|The)*, is the estimated probability of seeing *eulogy* given that we started the sentence with *The*. We get this estimate by counting the number of times we see *The eulogy* at the start of a sentence, and dividing by the number of times we see *The* at the start of a sentence. (I’ve written these as *C(The eulogy)* and *C(The)*.) The reason we’re dividing is that we know, when we reach the second word in the sentence, that the first word was *The*, and we want to know what proportion of those sentences continue with *eulogy*. (If we wanted the probability that a randomly-chosen sentence starts with *The eulogy*, we’d divide by the total number of sentences in the corpus instead.) In the corpus, there are 2.3 billion sentences starting with *The*, and 2,288 starting with *The eulogy*. So, given that we’ve seen *The*, there’s a one-in-a-million chance we’ll see *eulogy* next.
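The arithmetic is easy to check for yourself, using the counts quoted above:

```python
count_The = 2_300_000_000    # sentences in the corpus starting with "The"
count_The_eulogy = 2288      # sentences starting with "The eulogy"

p = count_The_eulogy / count_The  # p(eulogy | The)
print(p)      # about 1e-06
print(1 / p)  # about one million: a one-in-a-million continuation
```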

**From 1/1,000,000 to 1/20.** Okay, thanks for sticking with me through the math. Now let’s talk about what this really means. In conversation, saying that the odds are a million to one against something means that it’s not going to happen, and yet we often see these linguistic one-in-a-million events. In fact, to finally get around to the point I mentioned in the post title, it turns out that if the first word of a sentence is *The*, there’s a one-in-twenty chance of a one-in-a-million event. How’s that?

Well, let’s start by thinking of a scenario where there is a 100% chance of a one-in-a-million occurrence: a sweepstakes with one million tickets. If the sweepstakes is above-board, someone has to win it. The probability of some specific ticket winning is one in a million, but at the same time, the probability that some ticket wins is one. The sweepstakes is guaranteed to generate a one-in-a-million event based on the way it is set up. That’s why it’s no surprise to find out someone won the lottery, but it’s a shock when it turns out to be you.

Now suppose you want to boost your chances by buying 1,000 tickets. Each individual ticket still has the one-in-a-million probability, but the probability of the winning ticket being one of your purchased ones is now one in a thousand. This is sort of what’s going on in the linguistic world. In language, there are so many low-probability words that even though they individually have less than a one-in-a-million chance of following *The*, the aggregate likelihood of seeing one of these words is relatively high. *The* starts 2.3 billion sentences, and of those sentences, 0.15 billion of them continue with “rare” words like *eulogy* or *kitties* in the second position. Each of these words is individually rare, but there are a whole lot of them, so they carry a lot of probability mass.
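Both halves of that argument fit in a few lines of arithmetic (the counts are the ones quoted in this post):

```python
# Sweepstakes: each ticket is a one-in-a-million shot, but you hold 1,000 of them.
# Exactly one of the million tickets wins, so the events are mutually exclusive
# and the probabilities simply add up.
tickets = 1_000
p_per_ticket = 1 / 1_000_000
p_you_win = tickets * p_per_ticket
print(p_you_win)  # 0.001, i.e. one in a thousand

# Language: individually rare continuations of "The" add up the same way.
rare_continuations = 0.15e9  # sentences whose second word is a "rare" one
the_sentences = 2.3e9        # sentences starting with "The"
print(rare_continuations / the_sentences)  # about 0.065: the one-in-twenty-ish chance
```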

**Outside of language, too.** This is a more general point: rare events aren’t so rare if you don’t care which rare event occurs. As a big sports watcher, I’m always amazed at a good sports statistician’s ability to find a rare event in the most mundane of games. For instance, consider today’s report from the Elias Sports Bureau, where they note that yesterday was the first time since May 21, 1978 that there were seven or more baseball games in which the winning team scored fewer than four runs and won by a single run. It’s a rare event, sure, but if it hadn’t been this specific rare event, it would have been another.

The webcomic XKCD shows how science (and science journalism, especially) can suffer from this same problem. A statistically significant result is generally one where there is only a one-in-twenty chance of its occurring as a coincidence. But if you test, as the scientists in the comic do, twenty colors of jellybeans independently to see if they cause acne, at least one probably will appear to. (There’s a 64% chance of that, in fact.) Again, a rare event becomes expected simply because there are so many ways it can occur. This is why it’s easy to find coincidences.
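That 64% figure follows from the standard multiple-comparisons arithmetic: the chance that *none* of twenty independent tests produces a coincidental "significant" result is 0.95 raised to the twentieth power.

```python
# Twenty independent tests, each with a 1-in-20 (p < 0.05) chance of a
# false positive. What's the chance at least one jellybean color "causes" acne?
p_false_positive = 0.05
n_tests = 20
p_at_least_one = 1 - (1 - p_false_positive) ** n_tests
print(round(p_at_least_one, 2))  # 0.64
```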

This is getting long, even for me, so let me wrap up with two basic points. If a lot of events occur, some of them will almost certainly be rare ones. Similarly, if all of the possible outcomes are individually rare, then the observed outcome will almost certainly be rare. It’s true in language, and it’s true in life. I’m sure you can find other morals to these stories, and I’d love to hear them.

*[By the way, if you want to read more math posts written for a lay audience, go check out Math Goes Pop!, written by my college roommate who, unlike me, didn’t defect out of mathematics.]*

## 3 comments


June 30, 2011 at 6:28 pm

johnwcowan: The trouble is that the one-in-a-million number is only good for a particular corpus. If you add more sentences to the corpus, the probability is likely to go down, because the new sentences probably won’t have the pattern of interest. But if you randomly remove sentences, the probability won’t necessarily go up: indeed, some of your samples may disappear, causing the probability to go down! So such numbers are not statistically stable.

July 2, 2011 at 6:11 am

The Ridger: As they say: if there are 7 billion people in the world, you can expect a 1-in-a-million shot to happen roughly 7,000 times…

July 5, 2011 at 4:12 pm

Gabe: john: I agree with you that the specific number is fairly meaningless, and depends crucially on the composition of the corpus itself. The point I’m making remains the same even if the numbers change a bit; it might really be a 4 or 6 percent chance of getting a word with less than a 1/1,000,000 bigram probability, but it will definitely be substantially more than the individual word’s 1/1,000,000 probability.

I have to disagree, though, with your claim that the probability of these rare events is likely to diminish as sentences are added to the corpus. I’d agree if the rare events appeared only once or twice in the corpus, but once they’ve appeared a few thousand times, that’s enough instances to suggest that we’re fairly close to the true distribution.