Converting smart quotes to dumb quotes
DumbQuotes was born out of a frustration shared by many webloggers: most publishing software can automatically convert straight ASCII quotes to “smart quotes”, but the same software is stupid about handling “smart quotes” when an author copies and pastes them into an article. A common example is a blogger who wants to quote a passage from another site. They select a few sentences in their web browser and paste them into a web form to post to their own site, and the pasted passage ends up looking nothing like the original because their publishing software doesn't track character encoding properly.
Well, I can't fix every publishing system in the world, but I can fix the problem for myself by writing a user script that automatically converts smart quotes and other problematic high-bit characters to their 7-bit ASCII equivalents.
Example: dumbquotes.user.js
// ==UserScript==
// @name          DumbQuotes
// @namespace     http://diveintogreasemonkey.org/download/
// @description   straighten curly quotes and apostrophes, simplify fancy dashes, etc.
// @include       *
// ==/UserScript==

var replacements, regex, key, textnodes, node, s;

replacements = {
    "\xa0": " ",
    "\xa9": "(c)",
    "\xae": "(r)",
    "\xb7": "*",
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
    "\u2026": "...",
    "\u2002": " ",
    "\u2003": " ",
    "\u2009": " ",
    "\u2013": "-",
    "\u2014": "--",
    "\u2122": "(tm)"};
regex = {};
for (key in replacements) {
    regex[key] = new RegExp(key, 'g');
}

textnodes = document.evaluate(
    "//text()",
    document,
    null,
    XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE,
    null);
for (var i = 0; i < textnodes.snapshotLength; i++) {
    node = textnodes.snapshotItem(i);
    s = node.data;
    for (key in replacements) {
        s = s.replace(regex[key], replacements[key]);
    }
    node.data = s;
}
The code breaks down into four steps:
- Define a list of string replacements, mapping certain 8-bit characters to their 7-bit equivalents.
- Get all the text nodes of the current page.
- Loop through the list of text nodes.
- In each text node, for each 8-bit character, replace it with its 7-bit equivalent.
The first step is really two steps. In JavaScript, replacing every occurrence of a string is done with regular expressions. So in order to replace 8-bit characters with 7-bit equivalents, I really need to create a set of regular expression objects.
replacements = {
    "\xa0": " ",
    "\xa9": "(c)",
    "\xae": "(r)",
    "\xb7": "*",
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
    "\u2026": "...",
    "\u2002": " ",
    "\u2003": " ",
    "\u2009": " ",
    "\u2013": "-",
    "\u2014": "--",
    "\u2122": "(tm)"};
regex = {};
for (key in replacements) {
    regex[key] = new RegExp(key, 'g');
}
I use the curly braces syntax to quickly create an associative array. This is equivalent to (but less typing than) assigning each key-value pair individually:
replacements = {};
replacements["\xa0"] = " ";
replacements["\xa9"] = "(c)";
replacements["\xae"] = "(r)";
// and so forth
The individual 8-bit characters are represented by their hexadecimal values, using the escaping syntax "\xa0" or "\u2018". Once I have an associative array of characters to strings, I loop through the array and create a list of regular expression objects. Each regular expression object will search globally for one 8-bit character. (The 'g' in the second argument means search globally; otherwise each regular expression would only search and replace the first occurrence of a particular 8-bit character, and I might end up missing some.)
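To make the effect of the 'g' flag concrete, here is a minimal sketch that runs the same replacement with and without it. The sample sentence and the variable names are invented for illustration and are not part of the DumbQuotes script.

```javascript
// Hypothetical sample text containing two right single quotation marks (U+2019).
var sample = "It\u2019s the author\u2019s problem";

// Without the 'g' flag, only the first match is replaced.
var firstOnly = sample.replace(new RegExp("\u2019"), "'");

// With the 'g' flag, every match is replaced.
var all = sample.replace(new RegExp("\u2019", "g"), "'");

console.log(firstOnly); // the second curly apostrophe is still present
console.log(all);       // both apostrophes are now straight
```

The non-global version quietly leaves the second apostrophe curly, which is exactly the kind of miss the global flag prevents.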
The next step is to get a list of all the text nodes in the current document. You might be tempted to say to yourself, “Hey, I could just use document.body.innerHTML and get the whole page as a single string, and search and replace in that.”
var tmp = document.body.innerHTML;
// do a bunch of search/replace on tmp
document.body.innerHTML = tmp;
But this is a bad habit to get into, because the innerHTML property will return the entire markup of the page body: all the tags, all the inline script, all the attributes, everything. In this case it probably wouldn't cause a problem (HTML tag names never contain 8-bit characters, although attribute values can), but in other cases it could be disastrous, and very difficult to debug.
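As a rough illustration of how a blanket replacement on page source can go wrong, the sketch below operates on a plain string rather than a live page. The markup is invented and only stands in for what innerHTML might return on a page whose attribute values contain curly quotes.

```javascript
// Hypothetical markup, standing in for a serialized page fragment.
var markup = '<p title="\u201cfancy\u201d">She said \u201chi\u201d</p>';

// A blanket replacement on the source also rewrites the attribute value.
var replaced = markup.replace(/[\u201c\u201d]/g, '"');

console.log(replaced);
```

The visible text comes out as intended, but the title attribute now contains unescaped straight quotes, so the tag would no longer parse the way the page's author meant it to. Working on text nodes avoids this because attribute values are never touched.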
You need to ask yourself what exactly you are trying to search-and-replace. If the answer is “the raw page source,” then go ahead and use innerHTML. However, in this case, the answer is “all the text on the page,” so the correct solution is to use an XPath query to get all the text nodes.
textnodes = document.evaluate(
    "//text()",
    document,
    null,
    XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE,
    null);
This works by using an XPath node test, text(), which matches any text node. You're probably more familiar with querying for element nodes: a list of all the <a> elements, or all the <img> elements that have an alt attribute. But the DOM also contains nodes for the actual text content within elements. (There are other types of nodes too, like comments and processing instructions.) And it's the text nodes that we're interested in at the moment.
Step 3 is to loop through all the text nodes. This is exactly the same as looping through all the elements returned from an XPath query like //a[@href]; the only difference is that each item in the loop is a text node, not an element node.
for (var i = 0; i < textnodes.snapshotLength; i++) {
    node = textnodes.snapshotItem(i);
    s = node.data;
    // do replacements
    node.data = s;
}
node is the current text node in the loop, and s is a string that is the actual text of node. I'm going to use s to do the replacements, then copy the result back to the original node.
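The read, modify, and write-back pattern can also be wrapped in a small reusable helper. The sketch below is one possible shape, not part of the original script: the name transformTextNodes, its doc parameter, and the skip-if-unchanged check are all assumptions made for illustration. The fallback value 6 is the numeric value the DOM XPath specification assigns to UNORDERED_NODE_SNAPSHOT_TYPE, used only where the XPathResult global is unavailable.

```javascript
// Use the browser's constant when it exists; otherwise fall back to its
// numeric value (6) as defined by the DOM XPath specification.
var SNAPSHOT_TYPE = (typeof XPathResult !== "undefined")
    ? XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE
    : 6;

// Hypothetical helper: apply `transform` to the text of every text node in
// `doc`. Returns the number of nodes whose text actually changed.
function transformTextNodes(doc, transform) {
    var textnodes = doc.evaluate("//text()", doc, null, SNAPSHOT_TYPE, null);
    var changed = 0;
    for (var i = 0; i < textnodes.snapshotLength; i++) {
        var node = textnodes.snapshotItem(i);
        var s = transform(node.data);
        if (s !== node.data) { // skip the write when nothing changed
            node.data = s;
            changed++;
        }
    }
    return changed;
}

// In a user script this might be called as:
// transformTextNodes(document, function (s) { return s.replace(/\u2019/g, "'"); });
```

Passing the document in as a parameter keeps the helper independent of the global page, which makes it easier to try out against a stand-in object.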
So now that I have the text of a single node, I need to actually do the replacements. Since I pre-created a list of regular expressions and a list of replacement strings, this is relatively straightforward.
for (key in replacements) {
    s = s.replace(regex[key], replacements[key]);
}
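Because this inner loop only works on a string, it can be tried outside the browser. The sketch below pulls it into a standalone function with a shortened replacement table; the function name straighten and the sample sentence are made up for illustration.

```javascript
// A subset of the replacement table from the script.
var replacements = {
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
    "\u2026": "...",
    "\u2014": "--"
};

// Pre-build one global regular expression per character, as the script does.
var regex = {};
for (var key in replacements) {
    regex[key] = new RegExp(key, "g");
}

// Hypothetical helper: return `text` with every listed character replaced.
function straighten(text) {
    for (var key in replacements) {
        text = text.replace(regex[key], replacements[key]);
    }
    return text;
}

console.log(straighten("\u201cWait\u2026\u201d she said \u2014 \u2018no\u2019"));
```

Each pass through the loop handles one character from the table, so the order of the keys does not matter here: no replacement string contains a character that another entry would match.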