
One of my labmates was practicing his talk for an upcoming conference, looking at the predictability of different continuations of a sentence. Showing a logarithmic graph of word frequencies, he remarked that at one end of the scale, the words were one-in-a-million continuations. None of us were surprised. That’s one of the neat things about being a computational psycholinguist; we deal with one-in-a-million occurrences so often that they’re mundane.

But as I thought about it, I realized that one-in-a-million events shouldn’t be so surprising to any language user. Consider the following sentences:

(1a) The eulogy started sweet and was joined by a chorus of sniffles […]
(1b) The dignified woman, who came to the shelter in February, taught a younger woman to tell time […]
(1c) The entrenched board is made up of individuals […]
(1d) The kitties hid inside the space under the dishwasher […]

Each of the above sentences starts out with a one-in-a-million event; if you look at a million sentences starting with The, on average there will be one sentence with eulogy as the second word, one with dignified as the second word, and so on. Those might not feel like one-in-a-million words, so let me go a little into how we get the odds of a certain word. (You can skip the next section if you already know about corpus probabilities.)

The odds of a word. I used the Google N-gram corpus, a collection of a trillion words from the Internet. It’s called the n-gram corpus because it gives counts of n-grams, which are just phrases of n words. A 1-gram (or unigram) is just an individual word, a 2-gram (or bigram) is something like the house or standing after, and so on for larger values of n. N-grams are really useful in natural language processing because they are an easy-to-use stand-in for more complicated linguistic knowledge.

The question we’re looking at here is how predictable a word is given all the context you have. Context can cover a wide range of information, including what’s already been said, the environment the words are said in, and personal knowledge about the speaker and the world. For instance, if you hear someone carrying an umbrella say “The weather forecast calls for”, you’re probably going to predict the next word is “rain”. If the speaker were carrying a shovel instead, you might guess “snow”.

If you want a quick estimate of the predictability of a word, you can use n-grams to give a sort of general probability for the next word. So, in (1a), the predictability of eulogy is estimated as the probability of seeing eulogy following The at the start of the sentence based on the counts in the corpus. Here’s how we get the one-in-a-million estimate:

p(eulogy|The) = C(The eulogy) / C(The) = 2288 / 2,300,000,000 ≈ 1 / 1,000,000

Let me break this equation down. The left-hand side, p(eulogy|The) is the estimated probability of seeing eulogy given that we started the sentence with The. This estimate is gotten by counting the number of times we see The eulogy at the start of a sentence, and dividing by the number of times we see The at the start of a sentence. (I’ve written these as C(The eulogy) and C(The).) The reason we’re dividing is that we know, when we reach the second word in the sentence, that the first word was The, and we want to know what proportion of those sentences continue with eulogy. (If we wanted the probability that a randomly-chosen sentence starts with The eulogy, we’d divide by the total number of sentences in the corpus instead.) In the corpus, there are 2.3 billion sentences starting with The, and 2288 starting with The eulogy. So, given that we’ve seen The, there’s a one-in-a-million chance we’ll see eulogy next.
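For anyone who would rather see the arithmetic run than read it, here is a minimal sketch of that estimate in Python. The function and variable names are my own inventions for illustration, and the count for The is the rounded 2.3 billion quoted above, so the result is only approximate.

```python
def conditional_probability(count_bigram, count_first_word):
    """Estimate p(second word | first word) from raw corpus counts."""
    return count_bigram / count_first_word


# Counts quoted in the post; the figure for "The" is rounded to 2.3 billion.
C_THE = 2_300_000_000
C_THE_EULOGY = 2288

p_eulogy = conditional_probability(C_THE_EULOGY, C_THE)
print(f"p(eulogy | The) is about {p_eulogy:.2e}")
```

The printed value comes out just under one in a million, which is where the headline figure comes from.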

From 1/1,000,000 to 1/20. Okay, thanks for sticking with me through the math. Now let’s talk about what this really means. In conversation, saying that the odds are a million to one against something means that it’s not going to happen, and yet we often see these linguistic one-in-a-million events. In fact, to finally get around to the point I mentioned in the post title, it turns out that if the first word of a sentence is The, there’s a one-in-twenty chance of a one-in-a-million event. How’s that?

Well, let’s start by thinking of a scenario where there is a 100% chance of a one-in-a-million occurrence: a sweepstakes with one million tickets. If the sweepstakes is above-board, someone has to win it. The probability of some specific ticket winning is one in a million, but at the same time, the probability that some ticket wins is one. The sweepstakes is guaranteed to generate a one-in-a-million event based on the way it is set up. That’s why it’s no surprise to find out someone won the lottery, but it’s a shock when it turns out to be you.

Now suppose you want to boost your chances by buying 1,000 tickets. Each individual ticket still has the one in a million probability, but the probability of the winning ticket being one of your purchased ones is now one in a thousand. This is sort of what’s going on in the linguistic world. In language, there are so many low-probability words that even though they individually have less than a one-in-a-million chance of following The, the aggregate likelihood of seeing one of these words is relatively high. The starts 2.3 billion sentences, and of those sentences, .15 billion of them continue with “rare” words like eulogy or kitties in the second position. Each of these words is individually rare, but there are a whole lot of them, so they carry a lot of probability mass.
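Here is a rough sketch of how lots of individually rare continuations can add up. The counts are invented so the arithmetic is easy to follow; they are not taken from the n-gram corpus, and the function name and threshold are assumptions of mine.

```python
def rare_mass(counts, threshold=1e-6):
    """Share of all occurrences belonging to continuations rarer than threshold."""
    total = sum(counts.values())
    rare = sum(c for c in counts.values() if c / total < threshold)
    return rare / total


# Toy data (made up): one very common continuation plus 100,000 rare ones.
toy_counts = {"common_word": 950_000_000}
for i in range(100_000):
    toy_counts[f"rare_word_{i}"] = 500  # each is a 1-in-2,000,000 continuation

share = rare_mass(toy_counts)
print(f"Rare continuations make up {share:.1%} of the toy corpus")
```

In this toy setup no single rare word clears the one-in-a-million bar, yet together they account for one continuation in twenty.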

Outside of language, too. This is a more general point: rare events aren’t so rare if you don’t care which rare event occurs. As a big sports watcher, I’m always amazed at a good sports statistician’s ability to find a rare event in the most mundane of games. For instance, consider today’s report from the Elias Sports Bureau, where they note that yesterday was the first time that there were seven or more baseball games in which the winning team scored less than four runs and won by a single run since May 21, 1978. It’s a rare event, sure, but if it hadn’t been this specific rare event, it would have been another.

The webcomic XKCD shows how science (and science journalism, especially) can suffer from this same problem. A statistically significant result is generally one where there is only a one-in-twenty chance of its occurring as a coincidence. But if you test, as the scientists in the comic do, twenty colors of jellybeans independently to see if they cause acne, at least one probably will appear to. (There’s a 64% chance of that, in fact.) Again, a rare event becomes expected simply because there are so many ways it can occur. This is why it’s easy to find coincidences.
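The 64% figure is easy to check. This sketch assumes, as the comic does, that the twenty tests are independent and that each has a one-in-twenty chance of a fluke result; the function name is mine.

```python
def chance_of_at_least_one_fluke(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    comes up 'significant' purely by coincidence."""
    return 1 - (1 - alpha) ** num_tests


p_fluke = chance_of_at_least_one_fluke(20)
print(f"Chance of at least one spurious result: {p_fluke:.0%}")
```

The trick is to compute the chance that all twenty tests come up clean (0.95 multiplied by itself twenty times) and subtract that from one.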

This is getting long, even for me, so let me wrap up with two basic points. If a lot of events occur, some of them will almost certainly be rare ones. Similarly, if all of the possible outcomes are individually rare, then the observed outcome will almost certainly be rare. It’s true in language, and it’s true in life. I’m sure you can find other morals to these stories, and I’d love to hear them.

[By the way, if you want to read more math posts written for a lay audience, go check out Math Goes Pop!, written by my college roommate who, unlike me, didn’t defect out of mathematics.]

There’s an unfortunate tendency to believe that we are the inheritors of a Golden Age of Punctuation, and that people today are ruining it with their errant apostrophes, unnecessary quotation marks, and overabundant ellipses. I consider it unfortunate for two reasons. The first is that it exposes a vanity within us, a belief that we were decent enough in our day, but that the younger folks are ruining the brilliant language we built and maintained. The second is that it suggests that new teaching methods or new technology are primarily to blame for modern linguistic shortcomings, when the fact is that these errors existed back in our day as well. The problem isn’t (primarily) that kids aren’t being taught what we were, but rather that the new ideas failed to solve our problems.

So I really enjoy collecting examples of incorrect usage from the past, such as an apostrophe to mark a plural in a famous 1856 editorial cartoon or its with an apostrophe in a 1984 John Mellencamp music video, as a reminder that errors in English are not solely the province of the current age. At least some sources of these errors are timeless, and it’s just as important to fix the timeless ones as any uniquely modern sources.

The Pittsburgh Post-Gazette has put together a beautiful multimedia presentation of one of the great moments of Pittsburgh sports history, the 1960 World Series. The ’60 Series, which concluded 50 years ago today, was your standard David-Goliath series. The relatively-unknown Pittsburgh Pirates (David) were up against the nearly-universally-hated New York Yankees (Goliath), and through the first six games the Yankees had outscored the Pirates 46-17. Despite the lopsided scoring, the Pirates and Yankees had split the six games 3-3, setting up the deciding Game Seven in Pittsburgh. The final game was a back-and-forth affair that was capped with a walk-off home run by “Maz” (Bill Mazeroski), a popular second baseman known for his glove, not his bat. The home run moved Maz into the pantheon of Pittsburgh sports legends, and in the minds of a few ambitious Pittsburghers, into politics:

“President”. Maybe these fellows were just being temperate in their revelry, knowing that Maz wasn’t really in the running for the Presidency. But I think it’s more likely that they’re just your average guys, making the same average misuses as we do 50 years later. In fact, I’m reminded of a picture I found a month ago of some Steeler fans who’d made an error of their own:

[Photo by Jared Wickerham/Getty Images]

So it goes.

The World Cup’s over now, but there’s a little point that keeps gnawing at me. I followed the World Cup primarily through Yahoo!’s sports site (previously mentioned for its poor choices in headline truncation), and I have to admit that despite my general disdain for comments on sports sites, I found myself actually following theirs. Not, of course, because the comments offered any insights, but rather out of a worrisome inability to stop looking at them. They were mini-Medusas, turning my brain to stone each time I looked upon their inane blabberings and tried to figure out why the commenter thought I needed to hear their thoughts. And worse, they had a siren’s song, a cer---n undeniable beauty in their weird blend of nationalism, chauvinism, mockery, pop culture references, and insanity that kept me unable to turn away.

Curious about the dashes in cer---n? Well, so am I. Yahoo!’s commenting software has an apparently very strange censorship module in it. Like a standard censorship module, it replaces words it finds offensive with dashes. In order to deter the more clever vulgarians, it also replaces dirty words hidden within other words. This is why glasses is censored into gl--- in the following comment:

The Refs need gl---.

That’s a little out of the ordinary; in my experience, most automatic-censoring software checks against a dictionary, and lets words whose only fault is containing an obscene word go through untouched. This isn’t a hard feature to program in, so I am led to believe that Yahoo! consciously decided to omit it. Maybe they were having trouble with commenters using minced oaths like “We’re going to kick your glasses!” and they decided to remove even within-word obscenities to foil them. That would also explain this comment:

kudos in major quan---ies

I’m going to go out on a limb and suppose that the commenter wished to offer major quantities of kudos, which would of course be censored by a censor that seeks out vulgarities lurking within words. Nothing too weird there. But then I found these comments:


Apparently FIFA president Sepp Blatter isn’t the only one against technology; Yahoo!’s censor is adamant that the word not be reproduced in full. For some reason, the string gy is marked as obscene. The only explanation I can come up with for that is that the censor wanted to prevent brainiacs slipping gay by the censor by omitting its vowel. That’s an implausible explanation, though, especially since I’ve seen gay come through uncensored in other comments.

Now what about the censorship I engaged in in the opening paragraph, cer---n? Why would I do something so silly? Well, check out these comments:

Based on context, surely the censored words in the comments above are meant to be Captain, certain, and entertaining, which suggests that the Yahoo! censor believes tai to be a vulgarity.*
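To make the difference concrete, here is a minimal sketch of the two censoring strategies discussed above: one that dashes out a banned string wherever it turns up, and one that only fires on whole words. I have no idea how Yahoo!’s censor is actually written; the function names are hypothetical, and the banned list contains only the two strings this post observes being dashed out.

```python
import re

# Assumption: the real banned list is unknown. These are just the two
# strings this post reports seeing censored.
BANNED = ["tai", "gy"]


def censor_substrings(text, banned=BANNED):
    """Dash out banned strings anywhere, even inside longer words."""
    pattern = re.compile("|".join(re.escape(b) for b in banned), re.IGNORECASE)
    return pattern.sub(lambda m: "-" * len(m.group()), text)


def censor_whole_words(text, banned=BANNED):
    """Dash out banned strings only when they stand alone as words."""
    alternatives = "|".join(re.escape(b) for b in banned)
    pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return pattern.sub(lambda m: "-" * len(m.group()), text)


sample = "a certain technology"
print(censor_substrings(sample))   # a cer---n technolo--
print(censor_whole_words(sample))  # a certain technology
```

The whole-word version is only one extra line of pattern-building, which is part of why the substring behavior looks like a deliberate choice.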

I was worried that my lexicon of vulgarities had fallen out of date, which would ruin the street cred that I have so precisely cultivated, so I rushed onto Urban Dictionary to find out what made tai censorable. Strangely, there was only one obscene definition for tai on Urban Dictionary. But I don’t think that it has anywhere near the general appeal to need censoring; it was the eighth definition listed on Urban Dictionary, buried under references to the band The Academy Is… and claims that folks with the name Tai are “unusually fly”, “elite, perfect, cool guy in planet”, and “a total badass”. I tried looking on Google, but struck out there as well, with searches for “tai obscene” and “tai vulgarity” not returning anything useful.**

Does anyone have any idea what’s going on here? Have I offended you by saying gy and tai all willy-nilly? If so, please accept my heartfelt apolo--.

*: Perhaps, you’re thinking, it’s not tai that’s obscene but rather ta or tain, which are also in all three words. Judging from the gl--- and quan---ies examples, though, it appears that all and only the obscene letters are dashed out.

**: I was shocked to find out you could search for any phrase with “obscene” in it and not get a single porn site. I found that especially surprising with “Tai” given that Kobe Tai was a famous pornographic actress in the late 90s.

One of the fun things about dialectal differences in English is how the poetry turns out. There are some rhymes that just wouldn’t work in your own dialect of English, but work fine in another. For instance, the way I learned that Canadian English has a different pronunciation of sorry from mine was by hearing a Nickelback song on the radio five hundred million times in 2002:

“It’s not like you to say sorry
I was waiting on a different story.”

My hometown of Pittsburgh has this too, as I found out reading a poem about the game in which the Terrible Towel (the original rally towel, which Pittsburghers wave at Steeler games, Olympic award ceremonies, weddings, births, presidential inaugurations, etc.) debuted:

“‘It was easy,’ said Andy
And he flashed a crooked smile,
‘I was snapped on the fanny
By the Terrible Towel!'”

That probably seems like terrible poetry to you, not only because it is, but also because the end-rhymes of the second and fourth lines, smile and Towel, aren’t remotely similar. But to native Pittsburghers, they are. That’s because we have two vowel shifts that move us away from the “standard” American English pronunciations. Both of them are “monophthongizations”, which is a really fun word to say once you figure out how to. Monophthongization is the process of converting a diphthong to a monophthong. (I’ll explain those terms in a minute.)

The first vowel shift is the conversion of /aɪ/ to /ɑ/ before an l or r. /aɪ/ is the phonetician’s way of writing what you learn in school as “long i”; it’s the vowel in sight, rhyme, or the pronoun I. It is a diphthong, which means that it’s really two vowels jammed together. If you say sight really slowly, you’ll notice that your lower jaw comes down as you start the vowel (the /a/ part), and then it starts back up, moving into sort of an “ee” sound (the /ɪ/ part) before you stop. If you don’t have anyone looking at you right now, try it yourself, and you’ll actually feel your mouth move from /a/ to /ɪ/. That’s a diphthong; it’s a sound where you start at one vowel and keep going until you finish at a new vowel. A monophthong, on the other hand, is a vowel sound that has the same sound throughout, like the /æ/ sound in American English hat. (Or the /a/ in British hat.) If you say hat slowly, you’ll notice that you start the vowel with your lower jaw down, and you only raise it back up when you start to make the t sound at the end, maintaining more or less the same vowel sound throughout.  If you’re having trouble seeing the difference between mono- & diphthongs, don’t worry.  The only crucial point is that the vowels in question are different in some way.

Returning to the Pittsburgh monophthongizations, we convert /aɪ/ to /ɑ/ before an l or r, so smile has a vowel that’s more like an “ah” sound than the standard “long i”.*  The other monophthongization is the conversion of /aʊ/, the sound in Standard American English town, to /a/, another “ah”-type sound. This is why Pittsburghers sometimes write “dahntahn” for downtown.   The first of these monophthongizations isn’t particularly rare in American English, occurring (if I remember correctly) in Appalachia, and parts of the Eastern Midwest as well.  The second monophthongization is pretty much unique to Pittsburgh, at least among American English speakers.

And that’s how the rhyme works. /aɪ/ in smile turns into one “ah”-like vowel, and /aʊ/ in towel turns into another “ah”, and tah-dah! We get poetry that seems like free verse to anyone from another city! And at the low cost of making the word pairs dowel-dial, foul-file, towel-tile, and vowel-vile more or less indistinguishable.

*: Since this shift only applies to vowels before an l or r, Pittsburghers pronounce the vowels in smile and smite differently.
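If it helps to see the two shifts as rewrite rules, here is a toy sketch. The transcriptions are simplified, one-syllable approximations I made up for illustration (towel in particular is flattened), and treating the two “ah” vowels as identical is a deliberate simplification of “more or less indistinguishable”.

```python
import re


def pittsburgh_shift(ipa):
    """Apply the two monophthongizations to a simplified IPA string."""
    # Shift 1: /aɪ/ becomes /ɑ/, but only before l or r.
    ipa = re.sub(r"aɪ(?=[lr])", "ɑ", ipa)
    # Shift 2: /aʊ/ becomes /a/.
    return ipa.replace("aʊ", "a")


def sounds_alike(ipa_1, ipa_2):
    """Compare two words after shifting, lumping both 'ah' vowels together."""
    def merged(ipa):
        return pittsburgh_shift(ipa).replace("ɑ", "a")
    return merged(ipa_1) == merged(ipa_2)


# Hypothetical, simplified transcriptions.
WORDS = {"smile": "smaɪl", "smite": "smaɪt", "towel": "taʊl", "tile": "taɪl"}

print(pittsburgh_shift(WORDS["smile"]))             # smɑl
print(pittsburgh_shift(WORDS["smite"]))             # smaɪt (no l or r, no shift)
print(sounds_alike(WORDS["towel"], WORDS["tile"]))  # True
```

The lookahead in the first rule is what keeps smite out of the shift while smile goes through it.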

About The Blog

A lot of people make claims about what "good English" is. Much of what they say is flim-flam, and this blog aims to set the record straight. Its goal is to explain the motivations behind the real grammar of English and to debunk ill-founded claims about what is grammatical and what isn't. Somehow, this was enough to garner a favorable mention in the Wall Street Journal.

About Me

I'm Gabe Doyle, currently a postdoctoral scholar in the Language and Cognition Lab at Stanford University. Before that, I got a doctorate in linguistics from UC San Diego and a bachelor's in math from Princeton.

In my research, I look at how humans manage one of their greatest learning achievements: the acquisition of language. I build computational models of how people can learn language with cognitively-general processes and as few presuppositions as possible. Currently, I'm working on models for acquiring phonology and other constraint-based aspects of cognition.

I also examine how we can use large electronic resources, such as Twitter, to learn about how we speak to each other. Some of my recent work uses Twitter to map dialect regions in the United States.

@MGrammar on twitter
