Friday, February 28, 2014

PubMed BibTex QuickLink: A GreaseMonkey / TamperMonkey script adding a link to TexMed from PubMed

I have been writing my doctoral dissertation in LaTeX and have found TexMed -- a web service that provides the BibTex-formatted citation for any article in PubMed -- to be an invaluable resource.  My typical workflow has been:

  1. Find the article I want to cite on PubMed
  2. Copy the PubMed ID
  3. Search TexMed for the PubMed ID
  4. Download the citation from TexMed

This works, but I'd rather cut out the middle steps.  So, I've written a script for Google Chrome* and Mozilla Firefox* that adds a new "[Get BibTex]" link to each PubMed article page.  The link takes you directly to a TexMed page with the citation (screenshot below):



The script is available at UserScripts.org: http://userscripts.org/scripts/show/400556

UPDATE: Because UserScripts.org appears to be persistently down, the script is now available on GitHub: https://github.com/djparente/pubmed-texmed-quicklink as pubmed-bibtex-quicklink.js

*The script runs in Chrome via the TamperMonkey extension and in Firefox via the GreaseMonkey add-on.
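
For readers who want to see the idea behind the shortcut, here is a minimal Python sketch of the transformation: pull the PubMed ID out of an article URL and drop it into a TexMed-style query link.  This is not the userscript itself (that is JavaScript and lives in the repository above).  The function names are made up for illustration, and the TexMed URL template is an assumption that should be checked against the live site.

```python
import re
from urllib.parse import urlparse

# ASSUMPTION: a TexMed-style query URL keyed on the PubMed ID.  Verify this
# against the live site before relying on it.
ASSUMED_TEXMED_TEMPLATE = (
    "https://www.bioinformatics.org/texmed/cgi-bin/list.cgi?PMID={pmid}"
)


def extract_pmid(pubmed_url):
    """Return the all-digit path segment of a PubMed article URL, or None."""
    path = urlparse(pubmed_url).path
    # Walk the path from the end so a trailing slash does not matter.
    for segment in reversed(path.split("/")):
        if re.fullmatch(r"\d+", segment):
            return segment
    return None


def texmed_link(pubmed_url, template=ASSUMED_TEXMED_TEMPLATE):
    """Build a citation link for a PubMed page, or None if no ID is found."""
    pmid = extract_pmid(pubmed_url)
    if pmid is None:
        return None
    return template.format(pmid=pmid)
```

Passing the template as a parameter keeps the guess about TexMed's address in one place, so it is easy to correct.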

Tuesday, February 11, 2014

English word frequency list

I recently found myself in need of an English word list ordered by frequency, but could not find a free (in both the freedom-of-use and free-of-charge senses) one that satisfied me.  So, I have compiled one using word counts in the Google Ngrams database, doing just a little processing to extract counts since 2005 (to avoid archaic words) and to strip out part-of-speech identifiers from the word stems.
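
The processing described above can be sketched roughly as follows.  This is a reconstruction for illustration, not the exact code used to build the list.  It assumes the 1-gram records are tab-separated as word, year, match count, volume count; that part-of-speech tags appear as an upper-case suffix such as "_NOUN"; and that "since 2005" includes 2005 itself.  The function name is hypothetical.

```python
import re
from collections import Counter

# ASSUMPTION: a trailing part-of-speech tag looks like "_NOUN" or "_VERB".
POS_SUFFIX = re.compile(r"_[A-Z]+$")


def count_recent_words(lines, min_year=2005):
    """Sum match counts per lower-cased word for records from min_year on."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed records
        ngram, year, match_count = fields[0], int(fields[1]), int(fields[2])
        if year < min_year:
            continue  # drop older (possibly archaic) usage
        word = POS_SUFFIX.sub("", ngram).lower()
        # Skip empty results and stand-alone tag entries such as "_NOUN_".
        if word and not word.startswith("_"):
            counts[word] += match_count
    return counts
```

Calling `count_recent_words(lines).most_common(100000)` on the result would then give the most frequent entries in descending order.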

It seems adequate for my purposes, but I have not done any extensive checking on it.  It should be adequate for common use ("Hello, how is your dog?"), but also for more formal writing.  For example, it contains the words "phylogenetic", "immunoblotting" and "histochemical" -- all fairly specialized molecular biology terms.

Be aware that the included terms have not been filtered: if you want to strip out profanity, for example, you will need to do some further processing.  The file begins with a header of comment lines; these can be filtered out by excluding lines beginning with "#".  All entries are in lower case.
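
As a small sketch of that filtering, the Python function below skips the "#" header lines and can optionally drop unwanted terms.  It assumes the word is the first whitespace-separated field on each line; the function name and the blocklist parameter are hypothetical, so adjust them to the file as you find it.

```python
def load_words(lines, blocklist=frozenset()):
    """Return words in file order, skipping '#' comments and blocked terms."""
    words = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or header comment
        # ASSUMPTION: the word is the first whitespace-separated field.
        word = line.split()[0]
        if word not in blocklist:
            words.append(word)
    return words
```

Because the function only iterates over lines, an open file handle can be passed to it directly, for example the handle returned by `open("top100k_words_ngrams_djp.txt", encoding="utf-8")`.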

If this would be useful to you, the word list of the top 100,000 most common terms can be downloaded at: http://www.biophysengr.net/files/blog/wordlist/top100k_words_ngrams_djp.txt

The Ngrams data appears to be released under the Creative Commons Attribution 3.0 Unported license (CC-BY-3.0), so I will follow suit and release my processed list under the same license.

Happy word frequency-ing!