Article | October 17th, 2020
Note: This is a reprint of an article on Perl text normalization. It refers to Perl code, not PHP code. I hope you find it useful.
Text is text, right? Not necessarily, when you are writing web-based applications. Because the net is composed of many different kinds of computers, a CGI program cannot take a one-size-fits-all attitude. It must expect text to arrive from a variety of sources and in a variety of formats. There are three main formats of text in use today, differing only in the way the end of line, or newline, is marked. The PC uses a carriage-return linefeed pair (CRLF) to mark the newline. Mac newlines are just a single carriage return (CR). And Unix/Linux systems employ a single linefeed (LF). If you ever wonder why your files appear to “shrink” when uploaded to a web server, it’s because the extra carriage return the PC format puts on every line is removed (typically the conversion takes place when files are uploaded in ASCII mode by FTP). But enough history; let’s look at what a CGI forms interface sees when it receives input from the user.
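To see which of the three formats a piece of text actually uses, you can count the markers. The following is only a rough sketch; the subroutine name newline_styles is made up for this illustration and is not part of the original article.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many newlines of each style a string holds.
sub newline_styles {
    my ($text) = @_;

    # Count CRLF pairs first.
    my $crlf = () = $text =~ /\r\n/g;

    # Remove the pairs from a copy so their halves are not counted
    # again as lone CR or lone LF characters.
    (my $rest = $text) =~ s/\r\n//g;

    # tr with an empty replacement list counts without modifying.
    my $cr = ($rest =~ tr/\r//);
    my $lf = ($rest =~ tr/\n//);

    return { CRLF => $crlf, CR => $cr, LF => $lf };
}

my $counts = newline_styles("pc line\r\nmac line\runix line\n");
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

A string that reports more than one non-zero count contains mixed newlines, which is exactly the situation normalization is meant to clean up.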
Attaining a State of Normalcy
I think it’s good practice (as a friendly Perl expert once taught me) to normalize all text before attempting to process it. If you operate on text expecting each line to end with a linefeed but the lines actually end with a carriage-return linefeed pair, the results may be unexpected. By normalizing text, you won’t have to deal with several different kinds of newline all mixed together. The main parts of your CGI program will only have to understand one format, not three. Since you can’t predict which browser or operating system, and hence which text format, a user will have, it’s best to normalize all text input to a common format. One user may be on a Unix system, then another may come along on a Mac. If you just append their data to a text file, it will end up scrambled by the mix of different newlines.
Note: For this reason, the W3C has recommended in the HTML 4.0 specification that all browsers normalize TEXTAREA content (and, I suppose, TEXT input content) to CRLF format. But it will be a long time before you can rely on it. I see Netscape 1.1 occasionally in my logs to this day [at the time of writing, 1996-97].
Because Unix/Linux is the operating system most widely used to run web servers, this example normalizes to the Unix single linefeed. But you can normalize to any format, such as the PC format if you are running Perl on Windows NT, or the Mac format if your web server runs under Mac OS.
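A small experiment may make the problem concrete. This sketch splits the same three lines of text on a linefeed only, the way naive line-handling code might. The sample strings and the naive_lines helper are invented for the illustration.

```perl
use strict;
use warnings;

# The same three lines of text in each newline format (sample data).
my %samples = (
    pc   => "one\r\ntwo\r\nthree",
    mac  => "one\rtwo\rthree",
    unix => "one\ntwo\nthree",
);

# Hypothetical helper: split on LF only, ignoring the other formats.
sub naive_lines {
    my ($text) = @_;
    return split /\n/, $text;
}

for my $format (sort keys %samples) {
    my @lines = naive_lines($samples{$format});
    printf "%-4s: %d line(s), first line is %d characters long\n",
        $format, scalar @lines, length $lines[0];
}
```

Only the Unix sample splits cleanly. The Mac sample comes back as one long line, and each PC line except the last keeps a stray carriage return at its end.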
    #--------------#
    # Normalize textarea input to Unix newline format.
    # Input:
    #   $text: string to be normalized
    # Note: HTML 4.0 recommends browsers normalize their textarea
    # content to CRLF format. This, I hope, will eventually make
    # this step unnecessary.
    #---------------#

    # Convert PC newline (CRLF) to Unix newline format (LF)
    $text =~ s/\r\n/\n/g;

    # Convert Mac newline (CR) to Unix newline format (LF)
    $text =~ s/\r/\n/g;
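If you normalize in more than one place, it may be convenient to wrap the two substitutions in a subroutine. Here is a minimal sketch; the name normalize_newlines is my own choice for the example, not something Perl provides.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the two substitutions shown above.
sub normalize_newlines {
    my ($text) = @_;
    return '' unless defined $text;

    # CRLF first, so each pair collapses to a single LF.
    $text =~ s/\r\n/\n/g;

    # Then any lone CR that remains (Mac format).
    $text =~ s/\r/\n/g;

    return $text;
}

my $mixed = "pc line\r\nmac line\runix line\n";
print normalize_newlines($mixed);
```

The order of the two substitutions matters. If the lone CR were converted first, every CRLF pair would become two linefeeds and the text would gain a blank line after every line. The single pattern s/\r\n?/\n/g should give the same result in one pass, since it matches a CR with or without a following LF.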
Other Text Tricks
Now that you have the text input from the form normalized, what else should you consider? A line that contains only spaces or a few tab characters is not really useful; the extra characters are unnecessary baggage, so you can remove them.
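    # remove spaces and tabs from blank lines
    # (/m lets ^ and $ match at every line boundary, so indentation
    # on lines that contain other text is left alone)
    $text =~ s/^[ \t]+$//mg;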
If you have no need to retain the formatting of the original plain text, then you can remove leading and trailing line breaks. This is fine if the text is only intended for inclusion in an HTML document, say in a paragraph element (the exception is PRE, which retains formatting).
    # remove leading and trailing line breaks
    $text =~ s/^\n+|\n+$//g;
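Both clean-up steps can be combined into one helper. This is a sketch under the assumption that the text has already been normalized to Unix newlines. The name tidy_text is invented here, and the patterns are written with explicit anchors so that only whitespace-only lines and the very ends of the string are touched.

```perl
use strict;
use warnings;

# Hypothetical helper: assumes $text already uses Unix (LF) newlines.
sub tidy_text {
    my ($text) = @_;

    # With /m, ^ and $ match at every line boundary, so only lines made up
    # entirely of spaces and tabs are emptied; indented text is kept.
    $text =~ s/^[ \t]+$//mg;

    # \A and \z anchor to the very start and very end of the string.
    $text =~ s/\A\n+|\n+\z//g;

    return $text;
}

my $raw = "\n\nfirst\n \t \n  indented\n\n";
print tidy_text($raw), "\n";
```

The whitespace-only lines are emptied before the line breaks are trimmed, so a trailing line of stray spaces does not protect the final newlines from removal.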
Returning From Normal
You can also convert text normalized to Unix format back into PC or Mac formats.
    # Translate to target line format.
    #----------------------------------#
    # Change to Mac format
    # LF to CR
    $text =~ s/\n/\r/g;

    # Translate to target line format.
    #-----------------------------------#
    # Put Unix into PC format
    # LF to CRLF
    $text =~ s/\n/\r\n/g;
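These conversions are only safe on text that is already in Unix format. Run the LF to CRLF substitution on text that still contains CRLF pairs and each pair turns into CR CR LF. One way to guard against that is to normalize first and then translate, as in this sketch. The format names (unix, pc, mac) and the subroutine name to_newline_format are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical lookup table; the keys are names chosen for this sketch.
my %NEWLINE_FOR = (
    unix => "\n",
    pc   => "\r\n",
    mac  => "\r",
);

sub to_newline_format {
    my ($text, $format) = @_;
    my $newline = $NEWLINE_FOR{$format}
        or die "Unknown newline format: $format\n";

    # Normalize to LF first so an existing CRLF is not doubled up.
    $text =~ s/\r\n?/\n/g;

    # Then translate every LF to the target newline.
    $text =~ s/\n/$newline/g;

    return $text;
}

my $for_download = to_newline_format("line one\nline two\n", 'pc');
```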
Does Normalizing Matter?
Yes. As long as Mac browsers submit Mac format text and PC browsers submit DOS format text, you will need to normalize TEXTAREA input. Of course, you do not have to normalize INPUT text because it is not a multi-line input. Perl processes the text in whatever format it arrives, so your program must do the normalizing. It’s best to normalize text to the format native to the server hosting the website, unless you have a special need, such as writing in PC format for download and editing on PCs.
PHP and Normalization
I’m uncertain whether other languages accept and process text in its native format. I have not had any trouble with TEXTAREA input in PHP, nor have I had to normalize it. Perhaps PHP normalizes input itself, or most browsers today submit normalized text [in 2002].