I’m confused about the correct way to insert special typographic characters (like quotation marks and em dashes) when producing content that will be displayed on a web page. Is it better or preferred to type the actual Unicode characters like this:
or to use entities for special characters:
Are there any risks associated with using either (specifically the risk of the user seeing some crazy character substitution), or are they truly equivalent?
Use the actual character.
The disadvantage to using entities is readability. Pop quiz: what does the following output?
&dagger;&lsaquo; some text &rsaquo;
Without looking it up, I would have had no idea. Even if you do know, consider that others reading your markup might not.
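If you do run into entity-heavy markup, you can decode it rather than look up each name. Here is a minimal sketch using Python’s standard html module; the sample string (the named entities for a dagger and the single angle quotation marks) is only an illustrative assumption.

```python
import html

# Entity-encoded markup, as it might appear in a source file (illustrative sample).
markup = "&dagger;&lsaquo; some text &rsaquo;"

# html.unescape converts named and numeric character references
# into the characters they stand for.
decoded = html.unescape(markup)

print(decoded)
```

The decoded string contains U+2020, U+2039 and U+203A directly, which is what a reader would see if the characters had been typed as-is in the first place.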
For the most part, there’s no reason you shouldn’t just use the actual character. To avoid any issues, be sure you’re using UTF-8 everywhere. You want to be sure of the following:
- The page is saved with UTF-8 encoding
- The ‘Content-Type’ HTTP header specifies UTF-8 encoding
- Data stored in databases is saved with UTF-8 encoding
- Database connections use UTF-8 encoding
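The checklist above matters because the “crazy character substitution” happens when bytes written in one encoding are read back in another. The following rough Python sketch shows the effect. The sample text is made up, and Windows-1252 is just one common example of a mismatched encoding, not the only way this can go wrong.

```python
# A string containing an em dash (U+2014), written as an escape for clarity.
text = "wait \u2014 what"

# The bytes as they would be saved or sent when everything uses UTF-8.
utf8_bytes = text.encode("utf-8")

# Matching encodings on both ends: the text round-trips unchanged.
round_trip = utf8_bytes.decode("utf-8")

# Mismatch: the same bytes interpreted as Windows-1252.
# The three UTF-8 bytes of the em dash turn into three unrelated characters.
garbled = utf8_bytes.decode("cp1252")

# A Content-Type header value that tells the browser to decode the body as UTF-8.
content_type = "text/html; charset=utf-8"

print(round_trip == text)
print(garbled)
```

When every layer (file, header, database, connection) agrees on UTF-8, only the round-trip case occurs and the literal characters are safe to use.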
There are some exceptions.
Syntax characters. There are three characters that should always appear in content as escapes, so that they do not interact with the syntax of the markup: < (written as &lt;), > (written as &gt;), and & (written as &amp;). These escapes are part of the language for all documents based on XML and for HTML.
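In practice this escaping is usually done by a library rather than by hand. As a hedged sketch, Python’s standard html.escape handles the syntax characters and leaves typographic characters alone; the sample content is an assumption for illustration.

```python
import html

# Content with curly quotes (typed as escapes here) plus markup syntax characters.
content = "\u201cAT&T\u201d: 1 < 2 and 3 > 2"

# With quote=False, html.escape replaces only &, < and >.
# The curly quotes pass through untouched.
escaped = html.escape(content, quote=False)

print(escaped)
```

Pass quote=True (the default) when the text will be placed inside an attribute value, so that straight double and single quotes are escaped as well.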