Tuesday, February 11, 2014

English word frequency list

I recently found myself in need of an English word list ordered by frequency, but could not find a free (in both the freedom-of-use and free-of-charge senses) one that satisfied me.  So I compiled one from the word counts in the Google Ngrams database, with just a little processing: I kept only counts from 2005 onward (to avoid archaic words) and stripped the part-of-speech tags from the words.
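For anyone who wants to do something similar, here is a rough sketch of that kind of processing. It is not the exact script I used. It assumes the 1-gram files are tab-separated as "ngram, year, match_count, volume_count" and that part-of-speech tags are appended as suffixes such as "_NOUN"; the tag list, the function names and the inclusive 2005 cutoff are assumptions made for illustration.

```python
from collections import Counter

# Assumed set of part-of-speech suffixes in the Ngrams data (e.g. "dog_NOUN").
POS_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
            "ADP", "NUM", "CONJ", "PRT", "X", "."}


def strip_pos(token):
    """Remove a trailing _TAG suffix if TAG is a known part-of-speech tag."""
    word, sep, tag = token.rpartition("_")
    if sep and word and tag in POS_TAGS:
        return word
    return token


def count_words(lines, min_year=2005):
    """Sum match counts per lower-cased word for years >= min_year."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed rows
        token, year, match_count = fields[0], fields[1], fields[2]
        if not (year.isdigit() and match_count.isdigit()):
            continue  # skip rows with non-numeric year or count
        if int(year) < min_year:
            continue  # drop older counts to avoid archaic words
        counts[strip_pos(token).lower()] += int(match_count)
    return counts
```

Calling counts.most_common(100000) on the result would then give the terms in descending order of frequency.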

It seems adequate for my purposes, but I have not done any extensive checking on it.  It should cover everyday language ("Hello, how is your dog?") as well as more formal writing.  For example, it contains the words "phylogenetic", "immunoblotting" and "histochemical" -- all fairly specialized molecular biology terms.

Be aware that the terms are not filtered in any way (if you want to strip out profanity, for example, you will need to do some further processing).  The file begins with a header of comment lines; these can be filtered out by excluding lines beginning with "#".  All entries are in lower case.
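As a minimal sketch of reading the file, the following skips the "#" header lines and optionally drops unwanted terms. It assumes the word is the first whitespace-separated field on each line; the function name and the blocklist parameter are hypothetical and only there for illustration.

```python
def load_words(lines, blocklist=frozenset()):
    """Return the words in file order, skipping comments and blocked terms."""
    words = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # header comment or blank line
        word = line.split()[0]  # assumption: the word is the first field
        if word in blocklist:
            continue  # e.g. profanity you want to exclude
        words.append(word)
    return words
```

With the file downloaded, something like load_words(open("top100k_words_ngrams_djp.txt", encoding="utf-8")) should return the terms in the order they appear in the file.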

If this would be useful to you, the list of the 100,000 most common terms can be downloaded at: http://www.biophysengr.net/files/blog/wordlist/top100k_words_ngrams_djp.txt

The Ngrams data seems to be under the Creative Commons Attribution 3.0 Unported license (CC-BY-3.0), so I will follow suit and release my processed list under the same license.

Happy word frequency-ing!
