
A few days ago, John McGrath, Wordnik’s Director of Product Development, sent me a link to the preview version of Wordnik’s new thesaurus feature.  Wordnik, if you’re not familiar with it, is an online dictionary that integrates information from traditional dictionaries and online usage to give a more complete picture of a word’s meaning.  Merging these supervised and unsupervised data sources is of course a brilliant idea, and I think within a few years it will become a necessary part of any online dictionary.

I decided to test the Wordnik thesaurus with two types of words that often aren’t adequately represented in traditional thesauruses: colloquial phrasal verbs and insults.  The particular colloquial verb I tested was flesh out, which tends to pop into my head when I’m writing academically, as I want to first give an overview of the point I’m arguing, and then flesh it out.  Sadly, I’ve never found a synonym for flesh out that befits the tone of academic writing. Many thesauruses, even online ones, don’t list flesh out, and those that do haven’t given me enough alternatives to find a good one.  So I tried looking up flesh out on Wordnik, and I have to say it performed better than I expected.  It offered a few words that were pretty good equivalents (detail, fill in, round out, exposit), and, as would be expected from a semisupervised method, a few that were somewhat off (instance, set forth).  Still nothing that really fits my needs, but I’m not sure the word I’d be looking for even exists. (If you have any suggestions for a flesh out equivalent, let me know.)

The second test word was a common insult I employ in writing: imbecile.  The problem is that it’s so general; I often have situations where I want to make a quite specific insult, not merely to point out that someone is an imbecile, but also to specify the type of their imbecility (conscious ignorance, malicious misinformation, insufficient expertise, etc.).  Ever since I realized that “The Big Book of Being Rude” that I purchased on clearance at Half Price Books was woefully lacking in specific insults, I’ve been looking for a new source. I was hoping the thesaurus would suggest some more specific insults that I could record for later use in particular situations.

It seemed like this was a task that a thesaurus that monitored online usage would be preternaturally good at; after all, what does one do on the internet other than call people idiots?  Alas, this search didn’t go as well as flesh out, although the thesaurus still made a good effort.  Strangely, most of the responses were for imbecile as an adjective (which strikes me as comparatively rare) rather than a noun.  My main source of sadness was that it didn’t generate anywhere near the range of possibilities I’d expect in insults, offering mostly run-of-the-mill words like buffoon, dullard, or fool.  But it did offer two interesting ones with which I was unfamiliar. One was nidget, a now-forgotten word that lacked a single usage example.  The other was anile, which led me to uncover what I like to call the Great Anile Conspiracy — a strange and almost exciting phenomenon that I hope to detail in an upcoming post.  While the Wordnik thesaurus didn’t really give me a more specific insult, at least it tipped me off to two interesting words, so that’s something.

I realized, though, that expecting more specific insults from imbecile may have been an unfair query. I decided to try again with a more specific insult: blowhard.  The results were hit-and-miss.  The synonyms were spot-on: big mouth, blusterer, boaster, braggart, line-shooter, loudmouth, and — my personal favorite — vaunter.  The “words used in the same context” results weren’t, offering such words as Parker, valetudinarian, and book-review. How those occur in similar contexts to blowhard is opaque to me. However, I found rather hilarious and surprisingly accurate its choice of ex-governor as a contextual neighbor of blowhard — are there better examples of blowhards than Sarah Palin and Rod Blagojevich?

So all in all, the Wordnik thesaurus was worth checking out. It takes advantage of the capabilities of the Internet to offer both solid synonyms and noisy, possibly related words. Its algorithms aren’t perfect, of course, but the mistakes are mostly pretty reasonable and/or enjoyable. It hasn’t replaced thesaurus.com as my primary online thesaurus*, but it’s already interesting, and I’m looking forward to future developments that could make it supplant Roget’s in my heart.

*: I certainly hope that Wordnik hurries up and replaces thesaurus.com as my thesaurus of choice, now that I’ve read the Wall Street Journal’s blog post noting that it (well, its parent site, reference.com) has the highest number of trackers on its site of any of the top 50 most popular domains.

Language Log (in the guise of Arnold Zwicky) alerted me this morning that Wordnik is now up and running. Succinctly, and stealing from their front page, Wordnik is an “ongoing project devoted to discovering all the words and everything about them”. Seems a laudable aim to me.

I went and checked out the site, and not feeling particularly imaginative today, looked up some rather quotidian words: whether, capital, and the. (I had principled reasons for this, I assure you, having to do with worries of data sparsity in a new search engine.  Those reasons oughtn’t to detract from the lamentable fact that when presented with a choice to stick in the everyday or to press on into terra infirma, I slipped off my boots and sidled back to the known. I intend the use of “quotidian” here instead of “mundane” or “commonplace” to function as some minor penance.) All in all, it’s a pleasant experience they have over at Wordnik, and I suspect I’ll end up wasting more than a few moments of my life finding out what their algorithm considers the most representative Flickr pictures for various abstract words — everyday, for instance.

But, as I am the sort who has never managed to fully slip the surly bonds of mathematics, it was the statistics that really snagged me. There’s a graph for each word showing its frequency change over the last 200 years. I don’t get what exactly it means for a word to be “unusual” in a year, but check out the statistics for the:

[Image: wordnik-the, Wordnik’s statistics graph for the]

For some reason, from the 1920s to 1950, the was much more unusual than in the rest of the past 200 years (ignoring that outlier around 1980). This is almost certainly an artifact of the data, rather than an actual increase in the unusualness of the. I’m betting it has to do with different corpora being used at different points in the historical statistics. My guess would be that up to the 1920s, the data is dominated by prose writing that has entered the public domain (books in Project Gutenberg, etc.) and that when the public domain prose dried up, the corpora were dominated by work from some other genre that uses fewer articles. This hypothesis is backed up by a similar pattern for the indefinite articles a and an. I don’t know what genre would lead to a decrease in the number of articles (an increased proportion of dictionaries or telegrams?) or why there is the return to normalcy at 1950 (perhaps digitized news archives take over the corpus at that point?). (If you have any ideas, please post a comment.) I assume it’s not really the case that articles became unusual at those points in time, but this points out an important failing of relying too heavily on automatically mined historical data: you can get really funky results due to changing corpus demographics, data sparsity, and the like.  If you think you’ve found an interesting result, you should always check it against some sort of baseline.
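For the programmers in the audience, here is a rough sketch of the kind of baseline check I have in mind. It is not how Wordnik computes its statistics (I don't know how they do that); the function names, the period labels, and the tiny token lists are all made up for illustration. The idea is just to compare a word's rate across two periods and see whether other function words shift by about the same ratio, which would point to a change in the corpus rather than a change in the word.

```python
from collections import Counter


def relative_frequency(tokens, word):
    """Occurrences of `word` per 1,000 tokens (case-insensitive)."""
    if not tokens:
        return 0.0
    counts = Counter(t.lower() for t in tokens)
    return 1000 * counts[word] / len(tokens)


def build_report(corpora_by_period, words):
    """Map each period to the per-1,000-token rate of every word of interest."""
    return {
        period: {w: relative_frequency(tokens, w) for w in words}
        for period, tokens in corpora_by_period.items()
    }


def shift_ratio(report, word, period_a, period_b):
    """How much the word's rate changed from period_a to period_b."""
    before = report[period_a][word]
    if before == 0:
        return float("inf")  # no sensible ratio if the word never occurred
    return report[period_b][word] / before


def looks_like_corpus_artifact(report, target, baselines, period_a, period_b,
                               tolerance=0.2):
    """True if every baseline word shifted by about the same ratio as target.

    `tolerance` is an arbitrary illustrative threshold, not a tuned value.
    """
    target_ratio = shift_ratio(report, target, period_a, period_b)
    return all(
        abs(shift_ratio(report, w, period_a, period_b) - target_ratio) <= tolerance
        for w in baselines
    )


# Toy, invented data: ordinary prose vs. telegram-style text.
corpora = {
    "1900s": "the cat sat on the mat and a dog saw a bird".split(),
    "1930s": "arrive monday stop send the money stop need a car stop now".split(),
}
report = build_report(corpora, ["the", "a"])
artifact = looks_like_corpus_artifact(report, "the", ["a"], "1900s", "1930s")
```

In this toy example the and a both drop by the same proportion between the two periods, so the check flags the drop as a likely corpus effect. With real data you would want many baseline words and far more text per period before trusting anything it says.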

Let this be a lesson to all you word explorers out there. For all that can be found in the exotic parts of the dictionary (the most recently searched Wordnik words at the moment include palimpsest, irredentist, epaulette, and interrobang), the greatest mysteries sometimes lurk in the everyday.

[P.S.: I’ve been gone awhile, but I should be back to posting and commenting and responding to email semi-regularly next week.  The school year has finally ended, I’m back in the U.S. from a machine learning conference, and now I have nothing to do but blog. Oh, and all the work I pushed off to the summer and the two summer jobs I’ve ended up with. Crumbs.]

About The Blog

A lot of people make claims about what "good English" is. Much of what they say is flim-flam, and this blog aims to set the record straight. Its goal is to explain the motivations behind the real grammar of English and to debunk ill-founded claims about what is grammatical and what isn't. Somehow, this was enough to garner a favorable mention in the Wall Street Journal.

About Me

I'm Gabe Doyle, currently a postdoctoral scholar in the Language and Cognition Lab at Stanford University. Before that, I got a doctorate in linguistics from UC San Diego and a bachelor's in math from Princeton.

In my research, I look at how humans manage one of their greatest learning achievements: the acquisition of language. I build computational models of how people can learn language with cognitively-general processes and as few presuppositions as possible. Currently, I'm working on models for acquiring phonology and other constraint-based aspects of cognition.

I also examine how we can use large electronic resources, such as Twitter, to learn about how we speak to each other. Some of my recent work uses Twitter to map dialect regions in the United States.



@MGrammar on twitter
