
Language Log (in the guise of Arnold Zwicky) alerted me this morning that Wordnik is now up and running. Succinctly, and stealing from their front page, Wordnik is an “ongoing project devoted to discovering all the words and everything about them”. Seems a laudable aim to me.

I went and checked out the site, and not feeling particularly imaginative today, looked up some rather quotidian words: whether, capital, and the. (I had principled reasons for this, I assure you, having to do with worries about data sparsity in a new search engine. Those reasons oughtn’t to detract from the lamentable fact that when presented with a choice to stick with the everyday or to press on into terra infirma, I slipped off my boots and sidled back to the known. I intend the use of “quotidian” here instead of “mundane” or “commonplace” to function as some minor penance.) All in all, it’s a pleasant experience they have over at Wordnik, and I suspect I’ll end up wasting more than a few moments of my life finding out what their algorithm considers the most representative Flickr pictures for various abstract words (everyday, for instance).

But, as I am the sort who has never managed to fully slip the surly bonds of mathematics, it was the statistics that really snagged me. There’s a graph for each word showing its frequency change over the last 200 years. I don’t get what exactly it means for a word to be “unusual” in a year, but check out the statistics for the:


For some reason, from the 1920s to 1950, the was much more unusual than in the rest of the past 200 years (ignoring that outlier around 1980). This is almost certainly an artifact of the data, rather than an actual increase in the unusualness of the. I’m betting it has to do with different corpora being used at different points in the historical statistics. My guess would be that up to the 1920s, the data is dominated by prose writing that has entered the public domain (books in Project Gutenberg, etc.) and that when the public domain prose dried up, the corpora were dominated by work from some other genre that uses fewer articles. This hypothesis is backed up by a similar pattern for the indefinite articles a and an. I don’t know what genre would lead to a decrease in the number of articles (an increased proportion of dictionaries or telegrams?) or why there is the return to normalcy at 1950 (perhaps digitized news archives take over the corpus at that point?). If you have any ideas, please post a comment. I assume it’s not really the case that articles became unusual at those points in time, but this points up an important danger of relying too heavily on automatically mined historical data: you can get really funky results due to changing corpus demographics, data sparsity, and the like. If you think you’ve found an interesting result, you should always check it against some sort of baseline.
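For the programmatically inclined, here is a rough sketch of what such a baseline check might look like. To be clear, this is not how Wordnik computes anything. The three tiny "corpora" are text I made up, and the function names, the choice of baseline words, and the 0.5 cutoff are all my own assumptions for illustration. The idea is simply that if a word and a handful of other common function words all dip in the same period, the corpus probably changed, not the language.

```python
from collections import Counter
from statistics import median

# Toy, invented text standing in for per-period corpora; NOT Wordnik's data.
corpora = {
    "1900s": "the cat sat on the mat and the dog slept in the sun",
    "1930s": "arriving tuesday stop send money stop dog fine stop",
    "1960s": "the mayor said the city and the state would pay for the bridge",
}

def rate(text, words):
    """Share of whitespace-separated tokens in `text` that belong to `words`."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return sum(counts[w] for w in words) / len(tokens)

def suspicious_periods(corpora, target, baseline, ratio=0.5):
    """Periods where the target words AND the baseline words both fall
    below `ratio` times their own median rate across all periods.
    The cutoff is an arbitrary illustrative choice."""
    t = {p: rate(txt, target) for p, txt in corpora.items()}
    b = {p: rate(txt, baseline) for p, txt in corpora.items()}
    t_med, b_med = median(t.values()), median(b.values())
    return [p for p in corpora if t[p] < ratio * t_med and b[p] < ratio * b_med]

flagged = suspicious_periods(
    corpora,
    target={"the"},
    baseline={"and", "on", "in", "for"},  # hypothetical baseline set
)
print(flagged)
```

With this toy data, the telegram-style period is the one that gets flagged, because the and the baseline prepositions and conjunctions vanish together. On a real corpus you would want far more text per period and a more careful choice of baseline words.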

Let this be a lesson to all you word explorers out there. For all that can be found in the exotic parts of the dictionary (the most recently searched Wordnik words at the moment include palimpsest, irredentist, epaulette, and interrobang), the greatest mysteries sometimes lurk in the everyday.

[P.S.: I’ve been gone awhile, but I should be back to posting and commenting and responding to email semi-regularly next week.  The school year has finally ended, I’m back in the U.S. from a machine learning conference, and now I have nothing to do but blog. Oh, and all the work I pushed off to the summer and the two summer jobs I’ve ended up with. Crumbs.]



About The Blog

A lot of people make claims about what "good English" is. Much of what they say is flim-flam, and this blog aims to set the record straight. Its goal is to explain the motivations behind the real grammar of English and to debunk ill-founded claims about what is grammatical and what isn't. Somehow, this was enough to garner a favorable mention in the Wall Street Journal.

About Me

I'm Gabe Doyle, currently an assistant professor at San Diego State University, in the Department of Linguistics and Asian/Middle Eastern Languages, and a member of the Digital Humanities. Prior to that, I was a postdoctoral scholar in the Language and Cognition Lab at Stanford University. And before that, I got a doctorate in linguistics from UC San Diego and a bachelor's in math from Princeton.

My research and teaching connect language, the mind, and society (in fact, I teach a 500-level class with that title!). I use probabilistic models to understand how people learn, represent, and comprehend language. These models have helped us understand the ways that parents tailor their speech to their child's needs, why sports fans say more or less informative things while watching a game, and why people who disagree politically fight over the meaning of "we".
