One of my labmates was practicing his talk for an upcoming conference, looking at the predictability of different continuations of a sentence. Showing a logarithmic graph of word frequencies, he remarked that at one end of the scale, the words were one-in-a-million continuations. None of us were surprised. That’s one of the neat things about being a computational psycholinguist; we deal with one-in-a-million occurrences so often that they’re mundane.

But as I thought about it, I realized that one-in-a-million events shouldn’t be so surprising to any language user. Consider the following sentences:

(1a) The eulogy started sweet and was joined by a chorus of sniffles […]
(1b) The dignified woman, who came to the shelter in February, taught a younger woman to tell time […]
(1c) The entrenched board is made up of individuals […]
(1d) The kitties hid inside the space under the dishwasher […]

Each of the above sentences starts out with a one-in-a-million event; if you look at a million sentences starting with The, on average there will be one sentence with eulogy as the second word, one with dignified as the second word, and so on. Those might not feel like one-in-a-million words, so let me go a little into how we get the odds of a certain word. (You can skip the next section if you already know about corpus probabilities.)

The odds of a word. I used the Google N-gram corpus, a collection of a trillion words from the Internet. It’s called the n-gram corpus because it gives counts of n-grams, which are just phrases of n words. A 1-gram (or unigram) is just an individual word, a 2-gram (or bigram) is something like the house or standing after, and so on for larger values of n. N-grams are really useful in natural language processing because they are an easy-to-use stand-in for more complicated linguistic knowledge.

The question we’re looking at here is how predictable a word is given all the context you have. Context can cover a wide range of information, including what’s already been said, the environment the words are said in, and personal knowledge about the speaker and the world. For instance, if you hear someone carrying an umbrella say “The weather forecast calls for”, you’re probably going to predict the next word is “rain”. If the speaker were carrying a shovel instead, you might guess “snow”.

If you want a quick estimate of the predictability of a word, you can use n-grams to give a sort of general probability for the next word. So, in (1a), the predictability of eulogy is estimated as the probability of seeing eulogy following The at the start of the sentence, based on the counts in the corpus. Here’s how we get the one-in-a-million estimate:

p(eulogy|The) = C(The eulogy) / C(The) = 2,288 / 2,300,000,000 ≈ 1 / 1,000,000

Let me break this equation down. The left-hand side, p(eulogy|The), is the estimated probability of seeing eulogy given that we started the sentence with The. We get this estimate by counting the number of times we see The eulogy at the start of a sentence, and dividing by the number of times we see The at the start of a sentence. (I’ve written these as C(The eulogy) and C(The).) The reason we’re dividing is that we know, when we reach the second word in the sentence, that the first word was The, and we want to know what proportion of those sentences continue with eulogy. (If we wanted the probability that a randomly-chosen sentence starts with The eulogy, we’d divide by the total number of sentences in the corpus instead.) In the corpus, there are 2.3 billion sentences starting with The, and 2,288 starting with The eulogy. So, given that we’ve seen The, there’s a one-in-a-million chance we’ll see eulogy next.
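If you'd rather see the division done than take my word for it, here is a minimal Python sketch of the same estimate. The function name conditional_probability is just something I made up for illustration, and the 2.3 billion figure is the rounded count quoted above, so the result is only approximately one in a million.

```python
def conditional_probability(count_bigram: int, count_context: int) -> float:
    """Estimate p(word | context) as C(context word) / C(context)."""
    if count_context <= 0:
        raise ValueError("context count must be positive")
    return count_bigram / count_context


# Counts quoted in the post; 2.3 billion is a rounded figure.
c_the_eulogy = 2288
c_the = 2_300_000_000

p = conditional_probability(c_the_eulogy, c_the)
one_in = 1 / p  # number of "The ..." sentences per "The eulogy ..." sentence
print(f"p(eulogy | The) is about {p:.2e}, or one in {one_in:,.0f}")
```

Swapping in the count for any other second word gives that word's estimate in the same way.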

From 1/1,000,000 to 1/20. Okay, thanks for sticking with me through the math. Now let’s talk about what this really means. In conversation, saying that the odds are a million to one against something means that it’s not going to happen, and yet we often see these linguistic one-in-a-million events. In fact, to finally get around to the point I mentioned in the post title, it turns out that if the first word of a sentence is The, there’s a one-in-twenty chance of a one-in-a-million event. How’s that?

Well, let’s start by thinking of a scenario where there is a 100% chance of a one-in-a-million occurrence: a sweepstakes with one million tickets. If the sweepstakes is above-board, someone has to win it. The probability of some specific ticket winning is one in a million, but at the same time, the probability that some ticket wins is one. The sweepstakes is guaranteed to generate a one-in-a-million event based on the way it is set up. That’s why it’s no surprise to find out someone won the lottery, but it’s a shock when it turns out to be you.

Now suppose you want to boost your chances by buying 1,000 tickets. Each individual ticket still has the one-in-a-million probability, but the probability of the winning ticket being one of your purchased ones is now one in a thousand. This is sort of what’s going on in the linguistic world. In language, there are so many low-probability words that even though they individually have less than a one-in-a-million chance of following The, the aggregate likelihood of seeing one of these words is relatively high. The starts 2.3 billion sentences, and of those sentences, 0.15 billion continue with “rare” words like eulogy or kitties in the second position. Each of these words is individually rare, but there are a whole lot of them, so they carry a lot of probability mass.
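To make the "lots of rare words add up" idea concrete, here is a small Python sketch. The counts are hypothetical toy numbers I picked for illustration, not figures from the corpus, and rare_mass is a made-up helper name. It adds up the share of continuations belonging to words that are individually below some rarity threshold.

```python
from typing import Mapping


def rare_mass(counts: Mapping[str, int], threshold: float) -> float:
    """Share of all continuations taken by words whose own probability is below threshold."""
    total = sum(counts.values())
    rare_total = sum(c for c in counts.values() if c / total < threshold)
    return rare_total / total


# Hypothetical toy counts: three common continuations and fifty rare ones.
toy_counts = {"first": 400, "same": 300, "other": 250}
toy_counts.update({f"rare_word_{i}": 1 for i in range(50)})

each = toy_counts["rare_word_0"] / sum(toy_counts.values())
share = rare_mass(toy_counts, threshold=0.01)
print(f"each rare word: {each:.3f}; all rare words together: {share:.3f}")
```

In this toy setup each rare word is a one-in-a-thousand continuation, but together the fifty of them account for one continuation in twenty. The real corpus works the same way, just with far more words and far smaller individual probabilities.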

Outside of language, too. This is a more general point: rare events aren’t so rare if you don’t care which rare event occurs. As a big sports watcher, I’m always amazed at a good sports statistician’s ability to find a rare event in the most mundane of games. For instance, consider today’s report from the Elias Sports Bureau, where they note that yesterday was the first time since May 21, 1978, that there were seven or more baseball games in which the winning team scored fewer than four runs and won by a single run. It’s a rare event, sure, but if it hadn’t been this specific rare event, it would have been another.

The webcomic XKCD shows how science (and science journalism, especially) can suffer from this same problem. A statistically significant result is generally one where there is only a one-in-twenty chance of its occurring as a coincidence. But if you test, as the scientists in the comic do, twenty colors of jellybeans independently to see if they cause acne, at least one probably will appear to. (There’s a 64% chance of that, in fact.) Again, a rare event becomes expected simply because there are so many ways it can occur. This is why it’s easy to find coincidences.
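The 64% figure comes from asking how likely it is that all twenty tests come up clean, and subtracting that from one. Here is a quick Python check, assuming the twenty tests are independent and each has a one-in-twenty chance of a coincidental hit; the function name is my own invention.

```python
def chance_of_any_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of num_tests independent tests passes by chance."""
    # (1 - alpha) ** num_tests is the chance that every test comes up clean.
    return 1 - (1 - alpha) ** num_tests


p_any = chance_of_any_false_positive(20)
print(f"{p_any:.0%}")  # 64%
```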

This is getting long, even for me, so let me wrap up with two basic points. If a lot of events occur, some of them will almost certainly be rare ones. Similarly, if all of the possible outcomes are individually rare, then the observed outcome will almost certainly be rare. It’s true in language, and it’s true in life. I’m sure you can find other morals to these stories, and I’d love to hear them.

[By the way, if you want to read more math posts written for a lay audience, go check out Math Goes Pop!, written by my college roommate who, unlike me, didn’t defect out of mathematics.]