To Serve Man (Web I18n Tips)

I18nGuy

Here are some tips for working with web pages, especially those facts that I found very difficult to discover myself. As for the choice of title for this page, "To Serve Man", the explanation is at the end of the page.

Table of Contents
Tip #01 Configuring Apache Web Servers with correct HTTP CHARSET
Tip #02 Setting the encoding of a CSS style sheet
Tip #03 Margins and Centering with CSS style sheets
Tip #04 Debugging HTTP Protocol
Tip #05 Netscape 4 blank lines before TABLE- padding bug
Tip #06 Preventing Line Breaking At Hyphens
Tip #07 Avoiding SPAM To E-Mail Addresses In Your Web Pages
Tip #08 Creating Customized 404 Error Pages
To Serve Man

Tip #01 Configuring Apache Web Servers with correct HTTP CHARSET

All of my web pages declare their character encoding in the <head> section of the document, with the <meta> Content-Type statement. This page for example declares:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

Some other pages use Unicode and declare the charset to be UTF-8, UTF-16, or UTF-32 as appropriate. (See the Introduction to the Business Example for Unicode to see examples of each.) Most of the encodings I use are ASCII-based, so reading the first couple of lines in the HTML file is not a problem, as only ASCII characters are used in these statements. Declaring the encoding helps the browser interpret the page. For charsets such as UTF-16 and UTF-32, however, the encoding information comes too late. By the time the browser reads the <meta> statement, it has had to negotiate several characters that are not ASCII-compatible in these encodings. (Instead of 8-bit units, UTF-16 uses 16-bit code units, and UTF-32 uses 32-bit units.) Well, actually the browser may not have been able to negotiate them at all...
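The ASCII-compatibility problem can be seen by looking at the raw bytes. The following is a minimal Python sketch, not something taken from my pages; the variable names are only for illustration, and it assumes little-endian byte order with no byte order mark.

```python
# Encode the start of a <meta> tag three ways and look at the raw bytes.
snippet = "<meta"

utf8_bytes = snippet.encode("utf-8")
utf16_bytes = snippet.encode("utf-16-le")   # little-endian, no byte order mark
utf32_bytes = snippet.encode("utf-32-le")   # little-endian, no byte order mark

# UTF-8 is ASCII-compatible: the bytes are exactly the ASCII bytes.
print(utf8_bytes)        # b'<meta'
# UTF-16 adds a zero byte after each ASCII character.
print(utf16_bytes)       # b'<\x00m\x00e\x00t\x00a\x00'
# UTF-32 uses four bytes per character, so five characters take 20 bytes.
print(len(utf32_bytes))  # 20

# A parser scanning for the ASCII bytes of "<meta" only finds them in UTF-8.
print(b"<meta" in utf8_bytes, b"<meta" in utf16_bytes, b"<meta" in utf32_bytes)
```

A browser that assumes an ASCII-based encoding while looking for the <meta> statement has nothing to match against in the UTF-16 and UTF-32 cases.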

To address this I wanted to specify the charset in the HTTP protocol (RFC 2616, RFC 2068). Declaring the encoding in the protocol informs the browser about the encoding before it attempts to read the web page. If the encoding is not declared, the browser should presume ISO-8859-1. (Although many will instead attempt to detect the encoding.)
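To make the protocol-level declaration concrete, here is a small Python sketch of how a client might read the charset from a Content-Type header value and fall back to ISO-8859-1 when none is declared. The function name and the constant are my own inventions for illustration, and real browsers do more than this (many sniff the content instead of applying the default).

```python
from email.message import Message

# Default described above for text served without a declared charset.
HTTP_DEFAULT_CHARSET = "iso-8859-1"

def charset_from_content_type(header_value):
    """Return the charset named in a Content-Type value, or the HTTP default."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or HTTP_DEFAULT_CHARSET

print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # iso-8859-1
```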

Sounds easy. It proved to be very difficult to discover how to do this. My ISP uses an Apache server. They suggested I could use a .htaccess file to set the configuration in each directory. They couldn't help me with the commands and syntax though. I read the Apache doc, but the obvious approaches didn't work. Since the server is at the ISP, I couldn't debug the file contents. I would just get errors when accessing the site with my browser. I was trying the AddCharset and similarly named commands. Based on the errors I was getting, I am not privileged to use these commands. My ISP didn't respond to suggestions to let me have these privileges... James H. Cloos Jr. provided a number of suggestions in answer to my mail to the Unicode list, but they didn't work for me. If you own and configure the server yourself, they will probably work assuming you give yourself the appropriate privileges.

Jungshik Shin told me how to do it. Instead of using the commands with Charset in their names, you need to use the AddType command. You apply the command by including it in a <Files> section that names the file. The commands are placed in a plain text file named .htaccess that is placed in the directory containing the named files. The "#" character is used to begin lines that contain comments. The last line in the example below contains a comment. Make sure the last line in the file ends with a line terminator (a carriage return). Apparently, the ISP allows me to execute this command.

This example sets the files unicode-example-intro.html and unicode-example-ruby.html to UTF-8. The file char-count.html is set to Windows-1252 encoding. The file plane1-utf-16.html is set to UTF-16 encoding. The file plane1-utf-32.html is set to UTF-32 encoding.

<Files unicode-example-intro.html>
AddType "text/html; charset=UTF-8" html
</Files>
<Files unicode-example-ruby.html>
AddType "text/html; charset=UTF-8" html
</Files>
<Files char-count.html>
AddType "text/html; charset=WINDOWS-1252" html
</Files>
<Files plane1-utf-16.html>
AddType "text/html; charset=UTF-16" html
</Files>
<Files plane1-utf-32.html>
AddType "text/html; charset=UTF-32" html
</Files>
# end file with c/r
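If a directory holds many files with different encodings, the per-file sections can be generated rather than typed by hand. This is only a sketch in Python; the function names and the file-to-charset mapping are hypothetical, and it writes nothing to disk, it just builds the text in the same form as the example above.

```python
def files_block(filename, mime_type, charset, suffix):
    """Build one <Files> section in the form shown above."""
    return (
        f"<Files {filename}>\n"
        f'AddType "{mime_type}; charset={charset}" {suffix}\n'
        "</Files>\n"
    )

def htaccess_for(encodings):
    """encodings maps an HTML file name to its charset label."""
    body = "".join(
        files_block(name, "text/html", charset, "html")
        for name, charset in encodings.items()
    )
    # End with a comment line that is itself terminated by a newline.
    return body + "# end file with c/r\n"

print(htaccess_for({
    "char-count.html": "WINDOWS-1252",
    "plane1-utf-16.html": "UTF-16",
}))
```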

If all the files in the directory with the same suffix (e.g. "html") have the same encoding, then the "<Files ***>" and "</Files>" statements are not needed, and a single AddType statement will declare the encoding of all the files matching the suffix used in the statement. In the following example of the contents of a .htaccess file, the files with a suffix of html or txt in the same directory as the .htaccess file will have a charset encoding of "UTF-8".


AddType "text/html; charset=UTF-8" html
AddType "text/plain; charset=UTF-8" txt


Tip #02 Setting the encoding of a CSS style sheet

You can use the @charset rule in an external CSS style sheet to declare its encoding. However, an important point is that the rule must be the very first characters in the file. If you are inclined to put comments at the top of a file, they must come after the @charset rule. This is according to CSS2 4.4 CSS document representation:
"At most one @charset rule may appear in an external style sheet -- it must not appear in an embedded style sheet -- and it must appear at the very start of the document, not preceded by any characters." For example, here is an excerpt of a UTF-8 based style sheet:


@charset "utf-8";
/* unicode-example.css */

body {
color : black;
background-color : #FFFFED;
}
.
.
.
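A quick way to check that a style sheet obeys the "very first characters" requirement is to test the start of the file. This is a rough Python sketch: the function name is made up, it assumes an ASCII-compatible encoding, and it ignores byte order marks and other details a real CSS parser handles.

```python
import re

# The rule must be the first bytes of the file: @charset "name";
CHARSET_RULE = re.compile(rb'^@charset "([^"]+)";')

def declared_css_charset(css_bytes):
    """Return the charset named by a leading @charset rule, else None."""
    match = CHARSET_RULE.match(css_bytes)
    return match.group(1).decode("ascii", "replace") if match else None

good = b'@charset "utf-8";\n/* unicode-example.css */\nbody { color : black; }\n'
bad = b'/* unicode-example.css */\n@charset "utf-8";\nbody { color : black; }\n'

print(declared_css_charset(good))  # utf-8
print(declared_css_charset(bad))   # None (a comment precedes the rule)
```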


Tip #03 Margins and Centering with CSS style sheets

If you review the web pages you develop with Microsoft Internet Explorer, you come to expect that using the CSS property text-align with a value of "center" will center just about anything, tables for example. If you view the same pages with Netscape, Opera or other browsers, you will find text is centered, but tables and other block-level entities are not. The point is that text-align is only supposed to affect inline content.

For browsers other than IE, the key is to set margin-left and margin-right to auto. When both are set to auto, their values are made equal, which effectively accomplishes centering. (For a block other than a table, this is visible only when the element also has a width narrower than its container.) This is stated in the CSS2 section on Computing widths and margins with respect to several block-level scenarios.


Of special note for Opera 6 users: I find that setting margins with CSS shorthand rules like the following does not actually set the top margins:
margin : 1em auto;   or
margin : 1em auto 2em auto;

Instead, use the individual margin-top, margin-left, margin-right, margin-bottom rules.

Tip #04 Debugging HTTP Protocol

When working with multilingual or multinational web sites, it is sometimes necessary to review and/or debug browser-webserver HTTP exchanges. For example, you may want to examine the CHARSET encoding, or to see the values of ACCEPT-LANGUAGE, ACCEPT-CHARSET, etc. The tool HttpInspector from FreakySoft is very useful. It records and displays both the browser and server HTTP headers and content. FreakySoft's web site is no longer around, but the tool is available for download at many sites.

Rex Swain has an HTTP viewer that can be useful to display the contents of an HTTP response returned by a server. For example, if you enter the URL http://www.i18nguy.com/unicode-example.html into the viewer, you can see the results of the .htaccess file setting the file to a charset of UTF-8 (as described in Tip #01). The page is returned with CHARSET=UTF-8 in the HTTP header.
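If you would rather check the headers from a script, something like the following Python sketch may do. It is not a replacement for the tools above; the function names and the short list of "interesting" headers are my own choices for illustration, and it only looks at the server's response headers, not at what a browser sends.

```python
import urllib.request

# Response headers that usually matter for encoding and language questions.
I18N_HEADERS = ("content-type", "content-language", "vary")

def fetch_headers(url):
    """Send a HEAD request and return the response headers as (name, value) pairs."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.getheaders()

def i18n_headers(headers):
    """Keep only the headers named in I18N_HEADERS (case-insensitive)."""
    return [(name, value) for name, value in headers if name.lower() in I18N_HEADERS]

def show(url):
    """Print the encoding and language related headers for one URL."""
    for name, value in i18n_headers(fetch_headers(url)):
        print(f"{name}: {value}")
```

Calling show() with the URL of a page lists its Content-Type line, which is where a charset set through .htaccess should appear.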


Tip #05 Netscape 4 blank lines before TABLE- padding bug

One of the more frustrating problems with Netscape 4 was that tables would have a large number of blank lines in front of them. Netscape 4.7 users would have to scroll past the blank lines to discover the table. I finally determined that using CSS padding or padding-top in the table cells' style (e.g. td {padding:1em}) would cause the blank lines to appear above the table. I suspect, but didn't care to prove, that the more rows in the table the more blank lines accumulated above the table. Removing padding and padding-top fixes this. Padding-right, padding-left, and padding-bottom did not cause a problem.

Tip #06 Preventing Line Breaking At Hyphens

Today, many hyphenated words, such as "e-mail" or "e-business", should not have a line break immediately after the hyphen. You would not like to see the line broken there. However, breaking after the hyphen is the default behavior for many browsers.

Please e-mail me your e-
business proposal.

I found it very difficult to learn how to prevent this. (Thanks go to Michel Suignard, who gave me the final solution.)

At first, I used the Unicode character non-breaking hyphen (Unicode Character U+2011), which I can easily enter in HTML with "&#x2011;". I would write "e-mail" as "e&#x2011;mail". However, then searches for hyphenated words like e-mail and e-business would not find my text. Apparently, most search routines do not make the two hyphens equivalent for comparison or search purposes.
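The search problem is easy to reproduce. In this small Python sketch (the sample sentence is made up), a plain substring search for "e-mail" misses text written with U+2011, and finds it only after the non-breaking hyphen is mapped back to the ordinary hyphen:

```python
NON_BREAKING_HYPHEN = "\u2011"

# Text as it would be stored when "e-mail" is written with U+2011.
page_text = "Please e" + NON_BREAKING_HYPHEN + "mail me your proposal."

# A search for the ordinary hyphenated word does not match.
print("e-mail" in page_text)   # False

# Mapping U+2011 back to the ASCII hyphen makes the text searchable again.
searchable = page_text.replace(NON_BREAKING_HYPHEN, "-")
print("e-mail" in searchable)  # True
```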

I had searched the web for "line-breaking" and related terms and did not find an HTML or CSS solution, although one exists. It turns out you can use the property white-space and set it to nowrap. The solution existed as early as CSS1 and is also mentioned in CSS2. My searches didn't find it because it is described in terms of white space and wrapping, and I was looking for hyphenation, line breaking and justification. (Hopefully, web searches will now find those terms here!)

So now my style sheets all define a style that prevents wrapping. For example, I define the class "nobrk":

.nobrk {
white-space : nowrap;
}

and I span all the hyphenated terms such as e-mail, e-business as follows:

Please <span class="nobrk">e-mail</span> me your <span class="nobrk">e-business</span> proposal.

This will satisfy searches for hyphenated words, will not break lines at the hyphen and will display correctly:

Please e-mail me your
e-business proposal.
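Adding the spans by hand gets tedious. As a sketch only, a few lines of Python can wrap hyphenated words in plain text; the function name and the regular expression are my own, and the approach is not safe to run over text that already contains HTML markup, since attribute names and values can contain hyphens too.

```python
import re

# A word, then one or more "-word" parts: e-mail, e-business, state-of-the-art.
HYPHENATED = re.compile(r"\b\w+(?:-\w+)+\b")

def wrap_hyphenated(text, css_class="nobrk"):
    """Wrap each hyphenated word in a span with the no-wrap class."""
    return HYPHENATED.sub(
        lambda m: f'<span class="{css_class}">{m.group(0)}</span>', text
    )

print(wrap_hyphenated("Please e-mail me your e-business proposal."))
```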

Tip #07 Avoiding SPAM To E-Mail Addresses In Your Web Pages

You want potential customers to contact you, so you want to include your e-mail address in your web page. But you don't want to get SPAM (unsolicited and innumerable junk mail). It's a Catch-22: customers need your address, but if you display it, SPAMmers will send you SPAM. SPAMmers have web-bots (web programs) that continually scan the web solely to extract e-mail addresses from web pages and add them to their lists of addresses to SPAM.

The web-bots only recognize certain kinds of addresses however. One way to fool them (at least for now) is to use an HTML Numeric Character Reference (NCR). An NCR is simply a way to represent characters using their Unicode Scalar Value (Code Point). There is more about NCRs at Representing Characters in HTML.

For example, an "@" can be written in HTML with "&#x40;". The "&#x" indicates a hexadecimal number is being used. The "40" is the Unicode hexadecimal scalar value for the at-sign. The semi-colon terminates the NCR. (Decimal values can also be used. Simply drop the "x". So, in decimal, the character "@" is written as an NCR: "&#64;". The value "64" is the decimal equivalent of hex "40".)

You could write every character in your e-mail address using NCRs, but it is only necessary to write the at-sign this way to fool SPAMmers today. You can use NCRs within the href attribute of the anchor element as well as in the content of the element. For example:

<a href="mailto:tex&#x40;i18nguy.com">E-mail me at tex&#x40;i18nguy.com</a>.

Note that any character written as an NCR will display on the screen as the character, not as an NCR, and if copied to the clipboard will be copied as the character. So your customers viewing the address or copy/pasting it into a mail program will be using the actual at-sign, not the NCR. In this example, they would see the following (not an actual mail link though):

E-mail me at tex@i18nguy.com.
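The hexadecimal and decimal forms are easy to compute. Here is a minimal Python sketch with hypothetical helper names; html.unescape from the standard library stands in for what the browser does when it displays the page.

```python
import html

def ncr_hex(char):
    """Hexadecimal NCR for one character, e.g. '@' -> '&#x40;'."""
    return f"&#x{ord(char):X};"

def ncr_decimal(char):
    """Decimal NCR for one character, e.g. '@' -> '&#64;'."""
    return f"&#{ord(char)};"

def obfuscate_at_sign(address):
    """Replace only the at-sign, as described above."""
    return address.replace("@", ncr_hex("@"))

print(ncr_hex("@"))             # &#x40;
print(ncr_decimal("@"))         # &#64;

encoded = obfuscate_at_sign("tex@i18nguy.com")
print(encoded)                  # tex&#x40;i18nguy.com
print(html.unescape(encoded))   # tex@i18nguy.com
```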

Tip #08 Creating Customized 404 Error Pages

If you have information that will help users who come to your site looking for a page that has been removed or perhaps that they misspelled, you can create your own error page and present it to the user instead of the default file not found, "404 error" page.

Create a page with the information you want to provide. Make sure that any link or image references in the html page are absolute (fully specified from the root of the domain).

Apache Web Server users should create a .htaccess file, if they don't already have one, and add a line like the following to it:

ErrorDocument 404 /Path To The/Customized404page.html

Of course, replace the directory path and the error page's filename in the above command with the correct values for your page. This command instructs the Apache web server to use your custom 404 error page when a "file not found" error occurs.

You can have different custom error pages for different sections of your website. Simply put the "ErrorDocument 404" statement naming the custom page you want to associate with a particular section in a .htaccess file in the server directory that houses that section. There is a very nice and more detailed description of how to configure custom error pages at The Site Wizard.

To Serve Man

The title is apt for a page on serving web pages, but it is also the title of a short story by Damon Knight, which was made into a screenplay by Rod Serling and was an episode of "The Twilight Zone" that aired March 2, 1962. In short, aliens (the Kanamit) come to Earth and persuade Earth's denizens that they are benevolent. One piece of evidence is a book inadvertently misplaced by an alien, with its cover page translated by an Earth scientist. The book is titled "To Serve Man", convincing people that the aliens are truly here to help mankind. The aliens teach the earthlings how to improve crop production, cure disease, produce cheap power and achieve other efficiencies, improving the overall quality of life. Of course, given the improvements in Earth's lifestyle, people are curious to visit the aliens' home planet. The aliens invite earthlings and begin loading them in vast numbers onto spaceships to take them home. However, continued translation of the book uncovers the aliens' real motives. The book is a cookbook! The title "To Serve Man" takes on a whole new meaning, as do the actions of the aliens in fattening up Earth's population. For more details, see
http://members.cox.net/kaiotea/serveman.htm,
http://www.thetzsite.com/pages/scripts/serve.html, and the pictures at
http://www.thetzsite.com/pages/pictures/serve.html.

