geek stuff - handling Microsoft Word text in web development

Tuesday, November 13th, 2007

Note: Please contribute to this code in the comments if you find mistakes or know of improvements.

It's surprising to most people - especially computer programmers, it seems - that the English language actually has more than one kind of single quote and one kind of double quote. In fact, if you go back to single-quote characters in old-school typography, you find: (a) the opening single quote, (b) the closing single quote, (c) the possessive apostrophe, and (d) the contraction apostrophe. "Straight" quotes and apostrophes are a later invention, to reduce the number of keys on a keyboard.

Many modern word processing programs such as Microsoft Word automatically insert these special characters into text while the user is typing. Word calls them "smart quotes," and refers to fancy em dashes and the like as "symbols." These characters are very pretty in printed documents; in web development, however, they're a bit of a nightmare. Users routinely copy text from Word documents and paste it into forms for blogs, content management systems and the like; the smart quotes then end up rendering as question marks, because the bytes that reach the browser don't match the character encoding the page declares.
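
To make the problem concrete, here is a small sketch of what one pasted curly apostrophe looks like as raw bytes. It assumes the text arrives either as UTF-8 or as Windows-1252, the two encodings a paste from Word most commonly ends up in; the variable names are just for illustration.

```php
<?php
// The right single quote / apostrophe (U+2019) as raw bytes in two encodings.
$utf8   = "\xE2\x80\x99"; // three bytes in UTF-8
$cp1252 = "\x92";         // one byte in Windows-1252

echo bin2hex($utf8), "\n";   // e28099
echo bin2hex($cp1252), "\n"; // 92

// The same bytes as decimal values, the form PHP's chr() takes.
echo ord($utf8[0]), ' ', ord($utf8[1]), ' ', ord($utf8[2]), "\n"; // 226 128 153
echo ord($cp1252), "\n";                                          // 146
```

A page that declares one encoding but receives the other has no valid character for these bytes, which is typically when the question marks show up.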

If you're experiencing this problem, you may hope for a nice "character set" solution in a meta tag (charset=xxx). Stop. Seriously. There's no character set at the moment that will both solve this problem and render well in most browsers.

The ideal solution is to convert Word documents into plain text before copying and pasting; unfortunately, this has little to do with how most people work. Therefore, if you're a PHP developer, you might need a function to replace these special characters. There are many proposed functions out there for exactly this task; the one I'm posting here is just the one that works for me.

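<?php
// Replace Word "smart" punctuation with plain ASCII or HTML entities.
// $search and $replace are parallel arrays: entry N of one pairs with entry N of the other.
// Note: the single-byte entries assume Windows-1252 input; in UTF-8 text the
// same bytes can also occur inside other multi-byte characters.
function fixSmartQuotes($string) {
    // Leading two bytes shared by the UTF-8 sequences listed below.
    $pre = chr(226) . chr(128);

    $search = array(
        $pre . chr(152), // UTF-8 left single quote (U+2018)
        $pre . chr(153), // UTF-8 right single quote / apostrophe (U+2019)
        $pre . chr(156), // UTF-8 left double quote (U+201C)
        $pre . chr(157), // UTF-8 right double quote (U+201D)
        $pre . chr(147), // UTF-8 en dash (U+2013)
        chr(145),        // Windows-1252 left single quote
        chr(146),        // Windows-1252 right single quote / apostrophe
        chr(147),        // Windows-1252 left double quote
        chr(148),        // Windows-1252 right double quote
        chr(150),        // Windows-1252 en dash
        chr(151),        // Windows-1252 em dash
        chr(130),        // Windows-1252 low single quote
        chr(133),        // Windows-1252 ellipsis
        chr(152),        // Windows-1252 small tilde
        chr(154),        // Windows-1252 s with caron
        chr(160)         // Windows-1252 non-breaking space
    );

    $replace = array(
        "'",
        "'",
        '"',
        '"',
        '-',
        "'",
        "'",
        '"',
        '"',
        '–',       // en dash kept as a literal character
        '-',
        "&#8218;",
        "&#8230;",
        '-',
        "&#353;",  // this entry was garbled in the listing; the entity for byte 154 is an assumption
        ' '
    );

    return str_replace($search, $replace, $string);
}
?>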
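
One caveat with searching for UTF-8 sequences and single Windows-1252 bytes in the same pass: a byte such as chr(160) also appears inside unrelated UTF-8 characters (the second byte of "à", for example), so accented text can get damaged. Below is a rough sketch of one way around that, not a drop-in replacement. The function name `fixSmartQuotesByEncoding` and the choice of characters to map are my own assumptions; the idea is simply to check whether the input is valid UTF-8 and then apply only the matching table.

```php
<?php
// Hypothetical variant: pick one replacement table based on the input's encoding.
function fixSmartQuotesByEncoding($string) {
    // preg_match() with the /u modifier returns false when the subject
    // is not valid UTF-8, and 1 when the empty pattern matches.
    $isUtf8 = (preg_match('//u', $string) === 1);

    if ($isUtf8) {
        $map = array(
            "\xE2\x80\x98" => "'",   // left single quote
            "\xE2\x80\x99" => "'",   // right single quote / apostrophe
            "\xE2\x80\x9C" => '"',   // left double quote
            "\xE2\x80\x9D" => '"',   // right double quote
            "\xE2\x80\x93" => '-',   // en dash
            "\xE2\x80\x94" => '-',   // em dash
            "\xE2\x80\xA6" => '...'  // ellipsis
        );
    } else {
        // Assumption: anything that is not valid UTF-8 is Windows-1252.
        $map = array(
            "\x91" => "'",   // left single quote
            "\x92" => "'",   // right single quote / apostrophe
            "\x93" => '"',   // left double quote
            "\x94" => '"',   // right double quote
            "\x96" => '-',   // en dash
            "\x97" => '-',   // em dash
            "\x85" => '...'  // ellipsis
        );
    }

    // strtr() with an array replaces each key once and never rescans its own output.
    return strtr($string, $map);
}

echo fixSmartQuotesByEncoding("It\x92s"), "\n"; // It's
```

With UTF-8 input such as "\xE2\x80\x9Cvoil\xC3\xA0\xE2\x80\x9D", the curly quotes become straight quotes and the two bytes of "à" are left alone.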


I'll explain this in detail a little later.