Language Log (in the guise of Arnold Zwicky) alerted me this morning that Wordnik is now up and running. Succinctly, and stealing from their front page, Wordnik is an “ongoing project devoted to discovering all the words and everything about them”. Seems a laudable aim to me.

I went and checked out the site, and not feeling particularly imaginative today, looked up some rather quotidian words: whether, capital, and the. (I had principled reasons for this, I assure you, having to do with worries of data sparsity in a new search engine.  Those reasons oughtn’t to detract from the lamentable fact that when presented with a choice to stick in the everyday or to press on into terra infirma, I slipped off my boots and sidled back to the known. I intend the use of “quotidian” here instead of “mundane” or “commonplace” to function as some minor penance.) All in all, it’s a pleasant experience they have over at Wordnik, and I suspect I’ll end up wasting more than a few moments of my life finding out what their algorithm considers the most representative Flickr pictures for various abstract words — everyday, for instance.

But, as I am the sort who has never managed to fully slip the surly bonds of mathematics, it was the statistics that really snagged me. There’s a graph for each word showing its frequency change over the last 200 years. I don’t get what exactly it means for a word to be “unusual” in a year, but check out the statistics for the:


For some reason, from the 1920s to 1950, the was much more unusual than in the rest of the past 200 years (ignoring that outlier around 1980). This is almost certainly an artifact of the data, rather than an actual increase in the unusualness of the. I’m betting it has to do with different corpora being used at different points in the historical statistics. My guess would be that up to the 1920s, the data is dominated by prose writing that has entered into public domain (books in Project Gutenberg, etc.) and that when the public domain prose dried up, the corpora were dominated by work from some other genre that uses fewer articles. This hypothesis is backed up by a similar pattern on the indefinite articles a and an. I don’t know what genre would lead to a decrease in the number of articles — an increased proportion of dictionaries or telegrams? — or why there is the return to normalcy at 1950 — perhaps digitized news archives take over the corpus at that point? (If you have any ideas, please post a comment.) I assume it’s not really the case that articles became unusual at those points in time, but this points out an important failing of relying too heavily on automatically mined historical data; you can get really funky results due to changing corpus demographics, data sparsity, and the like.  If you think you’ve found an interesting result, you should always check it against some sort of baseline.
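Here's a toy sketch of the sort of baseline check I have in mind. Every number in it is invented, and the genre labels and function names are hypothetical; none of it comes from Wordnik's actual data or methods. It just shows how a word's pooled frequency can dip when the genre mix of a corpus shifts, even though the word's frequency within each genre never changes.

```python
from collections import defaultdict

# Hypothetical toy counts: (year, genre, occurrences of "the", total tokens).
# All figures are made up for illustration; they are not Wordnik data.
TOY_COUNTS = [
    (1910, "prose", 6000, 100000),
    (1910, "telegram", 20, 1000),
    (1930, "prose", 600, 10000),
    (1930, "telegram", 1800, 90000),
    (1960, "prose", 5400, 90000),
    (1960, "telegram", 200, 10000),
]


def pooled_rate(rows):
    """Relative frequency per year, lumping every genre together."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, _genre, count, tokens in rows:
        hits[year] += count
        totals[year] += tokens
    return {year: hits[year] / totals[year] for year in hits}


def rate_by_genre(rows):
    """Relative frequency per (year, genre): the baseline to check against."""
    return {(year, genre): count / tokens for year, genre, count, tokens in rows}


POOLED = pooled_rate(TOY_COUNTS)
BY_GENRE = rate_by_genre(TOY_COUNTS)

for year in sorted(POOLED):
    print(year, "pooled:", round(POOLED[year], 4))
for (year, genre), rate in sorted(BY_GENRE.items()):
    print(year, genre, round(rate, 4))
```

In this made-up corpus, the pooled rate for 1930 falls well below the rates for 1910 and 1960, while the per-genre rates stay flat the whole time. If a real dip vanished once you split the data this way, that would point to a change in the corpus rather than a change in the language.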

Let this be a lesson to all you word explorers out there. For all that can be found in the exotic parts of the dictionary (the most recently searched Wordnik words at the moment include palimpsest, irredentist, epaulette, and interrobang), the greatest mysteries sometimes lurk in the everyday.

[P.S.: I’ve been gone awhile, but I should be back to posting and commenting and responding to email semi-regularly next week.  The school year has finally ended, I’m back in the U.S. from a machine learning conference, and now I have nothing to do but blog. Oh, and all the work I pushed off to the summer and the two summer jobs I’ve ended up with. Crumbs.]

Saturday morning in our apartment was marked, as all of them ought to be, by Saturday morning cartoons. It may be more accurate to say “cartoon”, singular — only one cartoon was shown, on repeat, because my roommate’s visiting friends had fallen asleep watching it the night before. It was “The Old Man and the Lisa”, the episode of The Simpsons where Mr. Burns loses all his money and is forced to make a living by recycling. Sent to a retirement home, Mr. Burns looks for something to do, such as a newspaper to read, only to be met with Grampa Simpson’s explanation of why none are available: “We’re not allowed to read newspapers. They angry up the blood.”

The same restriction ought to be placed on me as well, except I shouldn’t be allowed to read grammar blogs. For you see, as I was busy working on my big yearly paper, I needed to read something to clear my head of all the Dirichlet distributions dancing in it. Having already hit all of the sites I normally hit for distracting stories and finding nothing new, I foolishly sought out what other grammar bloggers had to say for themselves. Three minutes later, my blood had been so angried that I actually left a corrective comment on one blog — something that I virtually never do. I felt soothed and returned to my paper with renewed vigor.

The next day I noticed that there was no comment on that post. Odd, I thought, but then again, I’d been up late writing the night before. It was entirely possible that I’d thought better of posting the comment. So I tried another comment, shorter and less confrontational. It too disappeared.  And so I have to go to all the bother of debunking this grammar gremlin here instead of settling it there.

The post in question is just the same junk everyone says on the internet to show their linguistic superiority — complaining that the so-called “educated” amongst us are actually uneducated, blaming the ills of modern language usage on “the drone of mass media”, all that jazz. The whole point of the post is that the rabble is destroying the language by replacing adverbs with adjectives.  The post drips with disdain for those dips whose slovenly usage is slowly leaching our precious adverbs from our precious language.

Look, I don’t have a lot of patience for this garbage. I’m not going to assert that adverbs definitely aren’t disappearing, but let me point out that the first three examples given to support the claim that our language is falling apart are completely specious.  This is the opening paragraph of the post:

My theory—though I cannot call it my own, original theory—is that within the next hundred years or so, all adverbs will cease to exist. I see them slowly disappearing throughout the various levels of education: the un-tenured freshman recalling that her O-Chem professor “talks too fast” (forgetting, for a moment, the equivocation of the verbs talk and speak); the corporate guru pitching his product as “built tough;” all the way up to the double-doctorate responding “I’m good, thanks” when confronted by the everyday salutation “how are you?”

So we have three examples of adverbs being displaced: talks too fast, built tough, and I’m good.  There’s just one problem.  Adverbs aren’t being displaced in any of these.

Let’s start with “talks too fast”. I’m supposing that the author presumes it’s an error because fast is an adjective and not an adverb.  Since fast is modifying the verb talks, an adjective would indeed be inappropriate.  But here’s the thing: fast is both an adjective and an adverb. It’s been an adverb since around 1200, according to the Oxford English Dictionary. In fact, the OED notes that the adjectival form of fast came from the adverbial form!  I don’t even know what the intended correction of talks too fast is supposed to be.  Talks too fastly?  Nope.

Now on to that damnable Ford advertising slogan: “built tough”. Okay, that complaint at least gets the part of speech right; tough is indeed an adjective, and there is no adverbial usage of tough that would be consistent with the intended meaning. But as it turns out, the adjectival form is totally fine there. It’s called a predicative adjective. Compare it to

(1a) I painted the door white.
(1b) The door was painted white.

(2a) The company built the truck tough.
(2b) The truck was built tough.

And note that an adverb doesn’t actually work here.  You can’t say the door was painted whitely, and while I think you could say the truck was built toughly, it doesn’t have the right meaning.  Toughly in that phrase describes the manner by which the truck was built, while tough in (2) is modifying the truck itself.  And since the truck is a noun phrase, it gets modified by an adjective, not an adverb.

I’ll admit that the predicative adjective sounds a little odd — I don’t often use it myself — but it’s been standard English for quite some time. While you may have many objections to the Ford Motor Company, this one just isn’t justified.

The last complaint is saying “I’m good.”  On occasion back at college, I caught some guff for this.  In my family, we just don’t say well. We’re not well, we’re good. There is a substantial difference to me — well implies mere healthiness, while good implies an overall contentedness.  One can be well without being good, and vice versa.  But I digress. What’s more important than a brief overview of my family’s social interactions is that well in this situation isn’t an adverb, either. It’s an adjective.

You have to use an adjective in this sentence because there’s only a linking verb.  You couldn’t say I’m indignantly; you’d say I’m indignant.  The modifier is modifying the subject of the sentence, so it’s got to be an adjective.  When you say I’m well, you’re not using adverbial well, because there wouldn’t be anything for the adverb to modify. You’re using adjectival well, which just means “healthy”.  It’s a separate question whether you think well is a better adjective than good in this sentence, but the choice has to be between adjectives.  Adverbs are strictly ruled out.  Strike three.

Okay, so someone on the internet is wrong.  Why was I so riled up? Honestly, I wouldn’t have cared about this junk if it weren’t for the last paragraph of the post:

I blame the drone of the mass media, producing poorly thought-out mind-tranquilizers without regard for elevating the comprehension of the masses. But then, I generally hate the entertainment industry and am always quick to point out its culpability in the denigration of our society whenever possible. Meanwhile, if at some point you catch me twitching while listening to you, there’s a good chance you’ve forgotten two very important things: first and foremost, you’ve forgotten your third grade grammar lessons; and second, you’ve forgotten that you’re talking to a grammar snob.

See, that’s why people don’t like self-appointed “grammar snobs”. Not only are they often completely wrong, but they’re insufferably condescending about it. If you’re going to go around telling everyone that they’re idiots, you should probably do a little research to make sure they really are.

Ben Zimmer has once again written a cutting post about Global Language Monitor, its absurd claim that the English language is about to get its millionth word, and the news sources that blindly regurgitate GLM’s warmed-over press releases about that.   I know it’s become cliche, upon reading an article that one disagrees with, to ask “So this is what passes for journalism these days?”   But articles like the BBC’s really demand that question. Here’s another, from the Telegraph, touting an obviously false claim: “One millionth English word could be ‘defriend’ or ‘noob’.”

First off, to the reporter’s credit, he manages to answer one question about GLM’s methodology; a word is a word by their count once it has been attested 25,000 times “by media outlets, on social networking websites and in other sources.” This information is not available on GLM’s website — I searched for 25,000, 25000, “twenty-five thousand”, “twenty five thousand”, “twentyfive thousand”, and “25 thousand” on the GLM website and didn’t get a single hit.  So kudos to the reporter for getting this nugget out!

But then the whole enterprise falls apart. The article notes that among the words GLM is “currently monitoring which could take English to the one million threshold” is noob. If that’s the case, then GLM’s monitors are incompetent.  I popped over to MySpace, which surely would be included in any reasonable list of social networking sites, and lo! 145,000 hits. It’s already a word by GLM’s arbitrary standard!  Who is GLM using to monitor the social sites? Clearly they ought to be fired. If noob, which has been in wide use by computer folks since the turn of the millennium, managed to slip under their nose, think of how many other unnoticed words there are! For all we know, English might have already passed this made-up milestone a month ago!  To call this possibility a tragedy is an unacceptable understatement.  And the claim that noob hadn’t yet been used 25,000 times on the Internet — where it was born all those years ago! — didn’t set off any alarms at the Telegraph?

How credulous can one be? Here’s the lead paragraph of the Telegraph article:

“The milestone will be passed at 10.22am on June 10 according to the Global Language Monitor, an association of academics that tracks the use of new words.”

And the last paragraph:

“The organisation first predicted that the millionth English word was imminent in 2006, and has repeatedly pushed back the expected date. Other linguist[s] have expressed scepticism about its methods, claiming that there is no agreement about how to classify a word.”

Of course if the first guess was only off by three years, it’s totally reasonable to assume the current guess is off by less than a minute.

Also, “other linguists” implies that Paul Payack is a linguist. He is not. I’m not even convinced he or his merry monitors can be called academics. They are entrepreneurs at best, and they are peddling nothing worth acknowledging.

A lot of people make claims about what "good English" is. Much of what they say is flim-flam, and this blog aims to set the record straight. Its goal is to explain the motivations behind the real grammar of English and to debunk ill-founded claims about what is grammatical and what isn't. Somehow, this was enough to garner a favorable mention in the Wall Street Journal.

About Me

I'm Gabe Doyle, currently a postdoctoral scholar in the Language and Cognition Lab at Stanford University. Before that, I got a doctorate in linguistics from UC San Diego and a bachelor's in math from Princeton.

I also examine how we can use large electronic resources, such as Twitter, to learn about how we speak to each other. Some of my recent work uses Twitter to map dialect regions in the United States.


