Monday, September 13, 2010

Document defaults

As mentioned in Chapter 1, we’ll be working with XHTML markup in this book rather than
HTML. Although XHTML markup differs slightly from HTML, the file suffix for XHTML web
pages remains .html (or .htm if you swear by old-fashioned 8.3 DOS naming techniques).
Although XHTML’s stricter rules make it easier to work with than HTML, you need to be
aware of the differences in the basic document structure. In HTML, many designers are
used to starting out with something like the following code:
<html>
<head>
<title></title>
</head>
<body>
</body>
</html>
But in XHTML, a basic, blank document awaiting content may well look like this (although
there are variations):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title></title>
</head>
<body>
</body>
</html>
Although this is similar to the minimal HTML document, there are important differences.
The most obvious is found at the beginning of the document: a DOCTYPE declaration that
states what document type definition (DTD) you are following (and no, I’m not shouting—
DOCTYPE is spelled in all caps according to the W3C).
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
The DTD indicates to a web browser what markup you’re using, thereby enabling the
browser to accurately display the document in question (or at least as accurately as it
can—as shown in Chapter 9, browsers have various quirks, even when you’re using 100%
validated markup).
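As one example of the variations mentioned earlier, a page that still relies on some presentational markup might declare the XHTML 1.0 Transitional DTD rather than Strict. The identifiers below are the ones the W3C publishes for that DTD. Whether Transitional suits a given page is a judgment call, so treat this as a sketch rather than a recommendation:

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
```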
Next is the html start tag, which contains both a namespace and a language declaration.
The first of those is intended to reduce the ambiguity of defined elements within the web
page. (In XML, elements can mean different things, depending on what technology is being
used.) The language declaration indicates the (default) language used for the document’s
contents. This can assist various devices, for example helping a screen reader pronounce
words on a page correctly rather than guessing at the language. (Also, internal
content can have language declarations applied to override the default, for example when
embedding some French within an English page.) The xml:lang attribute is a reserved
attribute of XML, while the lang attribute is a fallback, used for browsers that lack XML
support. Should the values of the two attributes differ, xml:lang outranks lang.
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
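To illustrate the override described above, a French phrase inside an otherwise English page could carry its own language declarations. The sentence and the choice of a span element are invented here purely for illustration, and any element that wraps the foreign-language text should work the same way:

```html
<p>The waiter wished us <span xml:lang="fr" lang="fr">bon appétit</span>
before leaving the table.</p>
```

Both attributes are given the same value, mirroring the pairing used on the html start tag.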
You’ll also notice that a meta tag appears in the head section of the document:
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
To pass validation tests, you must declare your content type, which can be done using this
meta element. Here, the defined character set is UTF-8 (Unicode), the recommended default
encoding, and one that supports many languages and characters (so many characters
needn’t be converted to HTML entities).
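As a rough sketch of what that means in practice, the two paragraphs below should render identically, assuming the file itself is saved as UTF-8 (the sample words are arbitrary):

```html
<!-- With charset=utf-8 declared and the file saved as UTF-8, -->
<!-- accented characters can be typed directly: -->
<p>Café, naïve, Zürich</p>
<!-- The entity-based equivalent, needed only when the encoding -->
<!-- cannot represent the characters: -->
<p>Caf&eacute;, na&iuml;ve, Z&uuml;rich</p>
```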
There are other sets in use, too, for the likes of Hebrew, Nordic, and Eastern European languages,
and if you’re using them, the charset value would be changed accordingly.
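For instance, a Hebrew page saved in the ISO-8859-8 encoding might use the following meta element. This assumes the file really is saved in that encoding, because the declared charset has to match the bytes on disk:

```html
<meta http-equiv="content-type" content="text/html; charset=iso-8859-8" />
```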
Although www.iana.org/assignments/character-sets provides a thorough character set
listing, and www.czyborra.com/charsets/iso8859.html contains useful character set diagrams,
it’s tricky to wade through it all, so listed here are some common values and their
associated languages:
ISO-8859-1 (Latin1): Western European and American, including Afrikaans, Albanian,
Basque, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German,
Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish, and Swedish.
ISO-8859-2 (Latin2): Central and Eastern European, including Croatian, Czech,
Hungarian, Polish, Romanian, Serbian, Slovak, and Slovene.
ISO-8859-3 (Latin3): Southern European, including Esperanto, Galician, Maltese,
and Turkish. (See also ISO-8859-9.)
ISO-8859-4 (Latin4): Northern European, including Estonian, Greenlandic, Lappish,
Latvian, and Lithuanian. (See also ISO-8859-10.)
ISO-8859-5: Cyrillic, including Bulgarian, Byelorussian, Macedonian, Russian, Serbian,
and Ukrainian.
ISO-8859-6: Arabic.
ISO-8859-7: Modern Greek.
ISO-8859-8: Hebrew.
ISO-8859-9 (Latin5): Western European and Turkish. A variant of ISO-8859-1 that
replaces the Icelandic-specific characters with Turkish ones.
ISO-8859-10 (Latin6): Nordic, including Icelandic, Inuit, and Lappish.
For an overview of the ISO-8859 standard, see http://en.wikipedia.org/wiki/ISO_8859.
