Language Log (in the guise of Arnold Zwicky) alerted me this morning that Wordnik is now up and running. Succinctly, and stealing from their front page, Wordnik is an “ongoing project devoted to discovering all the words and everything about them”. Seems a laudable aim to me.

I went and checked out the site, and not feeling particularly imaginative today, looked up some rather quotidian words: whether, capital, and the. (I had principled reasons for this, I assure you, having to do with worries of data sparsity in a new search engine.  Those reasons oughtn’t to detract from the lamentable fact that when presented with a choice to stick in the everyday or to press on into terra infirma, I slipped off my boots and sidled back to the known. I intend the use of “quotidian” here instead of “mundane” or “commonplace” to function as some minor penance.) All in all, it’s a pleasant experience they have over at Wordnik, and I suspect I’ll end up wasting more than a few moments of my life finding out what their algorithm considers the most representative Flickr pictures for various abstract words — everyday, for instance.

But, as I am the sort who has never managed to fully slip the surly bonds of mathematics, it was the statistics that really snagged me. There’s a graph for each word showing its frequency change over the last 200 years. I don’t get what exactly it means for a word to be “unusual” in a year, but check out the statistics for the:

[Image: Wordnik’s statistics graph for “the”]

For some reason, from the 1920s to 1950, the was much more unusual than in the rest of the past 200 years (ignoring that outlier around 1980). This is almost certainly an artifact of the data, rather than an actual increase in the unusualness of the. I’m betting it has to do with different corpora being used at different points in the historical statistics. My guess would be that up to the 1920s, the data is dominated by prose writing that has entered into public domain (books in Project Gutenberg, etc.) and that when the public domain prose dried up, the corpora were dominated by work from some other genre that uses fewer articles. This hypothesis is backed up by a similar pattern on the indefinite articles a and an. I don’t know what genre would lead to a decrease in the number of articles — an increased proportion of dictionaries or telegrams? — or why there is the return to normalcy at 1950 — perhaps digitized news archives take over the corpus at that point? (If you have any ideas, please post a comment.) I assume it’s not really the case that articles became unusual at those points in time, but this points out an important failing of relying too heavily on automatically mined historical data; you can get really funky results due to changing corpus demographics, data sparsity, and the like.  If you think you’ve found an interesting result, you should always check it against some sort of baseline.
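For the programmatically inclined, here is a minimal sketch of the kind of baseline check I have in mind. I have no idea how Wordnik actually computes its statistics, so everything here is an assumption: the function names are made up, the counts are toy numbers rather than real data, and "dips below 80% of its own median" is an arbitrary stand-in for "unusual". The idea is only that if a word and its baseline words (here, the and a) all dip in the same years, the corpus is a more likely culprit than the language.

```python
from statistics import median


def relative_frequency(word_counts, total_tokens):
    """Per-year share of tokens that are the word (count / corpus size).

    Years with a missing or zero corpus size are skipped.
    """
    return {year: word_counts[year] / total_tokens[year]
            for year in word_counts if total_tokens.get(year)}


def flag_shared_dips(target, baselines, threshold=0.8):
    """Return the years in which the target word AND every baseline word
    fall below `threshold` times their own median relative frequency.

    A dip shared with the baselines hints at a corpus artifact rather
    than a real change in how the target word is used.
    """
    def low_years(freqs):
        mid = median(freqs.values())
        return {year for year, f in freqs.items() if f < threshold * mid}

    shared = low_years(target)
    for freqs in baselines:
        shared &= low_years(freqs)
    return sorted(shared)


# Toy, made-up counts purely for illustration (NOT Wordnik data).
totals = {1900: 1000, 1910: 1000, 1930: 1000, 1940: 1000, 1960: 1000}
the_counts = {1900: 60, 1910: 62, 1930: 40, 1940: 41, 1960: 61}
a_counts = {1900: 20, 1910: 21, 1930: 12, 1940: 13, 1960: 20}

suspect_years = flag_shared_dips(
    relative_frequency(the_counts, totals),
    [relative_frequency(a_counts, totals)],
)
print(suspect_years)  # [1930, 1940]
```

With these invented numbers, both words dip in 1930 and 1940, which is the pattern that would make me suspect the corpus rather than the articles.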

Let this be a lesson to all you word explorers out there. For all that can be found in the exotic parts of the dictionary (the most recently searched Wordnik words at the moment include palimpsest, irredentist, epaulette, and interrobang), the greatest mysteries sometimes lurk in the everyday.

[P.S.: I’ve been gone awhile, but I should be back to posting and commenting and responding to email semi-regularly next week.  The school year has finally ended, I’m back in the U.S. from a machine learning conference, and now I have nothing to do but blog. Oh, and all the work I pushed off to the summer and the two summer jobs I’ve ended up with. Crumbs.]
