
Re: TextRider bug on last token




Stewart Smith <ssmith@murdoch.edu.au> wrote:
> If this is so then we need to fix Files.C and OakFiles.C. The
> underlying assumption in all C compilers that I have met is that
> files are opened as "text" unless otherwise specified. Under Unix it
> makes no difference, since text and binary files have the same
> representation. Under DOS and Windows (and other operating systems?)
> if you don't explicitly request "binary" mode, the native end-of-line
> representation is automatically translated to / from the "C"
> end-of-line representation (LF terminator). For Windows, this means
> that CR-LF is translated to LF on all file reads, and LF is
> translated to CR-LF on all file writes. This is done at the level of
> the C run-time library. 

One must note that the Unix EOL convention is *the* standard simply
because the C library was standardized under Unix long before MS-DOS
or MacOS were conceived. (The ANSI standard came later but for the
most part mimics the Unix convention.) Since the Unix convention of
EOL == LF didn't match well with the MS-DOS / Windows convention of
EOL == CR-LF (nor the MacOS convention of EOL == CR), non-Unix C
compiler vendors automatically assume you are reading a text file and
translate CR-LF (or CR) to LF in the standard C library. If you are
really reading a binary file, you have to say so. It is a fairly
elegant solution to a thorny problem, but having different conventions
haunts us still.

My favorite solution, based on experience gained by porting many
programs between Unix, Macintosh, and MS-DOS / Windows, is to always
read files in binary mode (using #ifdef's as appropriate). I then
convert CR-LF or CR sequences to LF before I use the data as text.
(Usually I have a compiler-like scanner anyway, so accepting CR, LF,
or CR-LF is not a problem in my software.)
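
A minimal C sketch of this approach, for illustration only. The names
normalize_eol and read_normalized are hypothetical and not part of any
existing library. It assumes the whole file fits in the caller's buffer,
so a CR-LF pair split across two reads is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Convert CR-LF pairs and lone CRs to LF, in place.
   buf holds len raw bytes; returns the new (never larger) length. */
size_t normalize_eol(char *buf, size_t len)
{
    size_t in = 0, out = 0;

    assert(buf != NULL);
    while (in < len) {
        char c = buf[in++];
        if (c == '\r') {
            if (in < len && buf[in] == '\n')
                in++;           /* swallow the LF of a CR-LF pair */
            c = '\n';
        }
        buf[out++] = c;
    }
    return out;
}

/* Read up to cap bytes from path in binary mode, then normalize.
   Returns the number of normalized bytes, or 0 if the file
   cannot be opened. */
size_t read_normalized(const char *path, char *buf, size_t cap)
{
    FILE *fp = fopen(path, "rb");   /* "b": no translation by the C library */
    size_t n;

    if (fp == NULL)
        return 0;
    n = fread(buf, 1, cap, fp);
    fclose(fp);
    return normalize_eol(buf, n);
}
```

Because the translation happens after the read, the same code behaves
identically on Unix, Macintosh, and MS-DOS / Windows.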

Michael van Acken <KK120y2@mail.lvr.de> wrote:
> The question still remains, how different end of line conventions
> can be selected for the TextRider module. The current implementation
> hardwires a single end of line character (ASCII.lf, I believe) into
> the module. I think we must make the eol convention an attribute of
> a rider instance instead. 

Mike Griebling <grieblm@trt.allied.com> wrote: 
> Maybe the end-of-line "character" should be defined as a string
> which is treated as an aggregate character. Both CR/LF are read but
> only a CR is actually returned. When ungetting, maybe it's enough
> to unget just the CR. Would that cause problems for anyone?
> Obviously, the position would also have to be updated by 2 whenever
> an eol is reached.

I generally write all output according to Unix convention (and so
pretend that the file was written on Unix). However, I like Michael's
approach of making the EOL convention an attribute of TextRider so
that the write-convention can be changed on a file by file basis. I
would suggest that TextRider allow any of the three (or more) standard
read-conventions to be recognized automatically and converted to a
single LF for subsequent use. This would allow seamless reading of
text files from any platform, while still allowing text files to be
written with any of the conventions. The only problem with Mike's
suggestion that we unget only one character is that a
Pos-ReadChar-UngetChar-Pos sequence will give different results for
the first and second Pos calls. Even this could be eliminated with an
internal attribute which keeps track of how many characters were read
for the last EOL.
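
To make the Pos-ReadChar-UngetChar-Pos point concrete, here is a small
C sketch of a reader that remembers how many raw bytes the last read
consumed. The Reader type and its functions are hypothetical stand-ins
for illustration, not the TextRider interface, and only one level of
unget is assumed.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *data;    /* raw bytes, as read in binary mode */
    size_t len;
    size_t pos;          /* offset of the next unread raw byte */
    size_t last_width;   /* raw bytes consumed by the last read; 0 if none */
} Reader;

void reader_init(Reader *r, const char *data, size_t len)
{
    assert(r != NULL && data != NULL);
    r->data = data;
    r->len = len;
    r->pos = 0;
    r->last_width = 0;
}

/* Return the next character, with CR-LF, CR, and LF all reported
   as '\n'. Returns -1 at end of input. */
int read_char(Reader *r)
{
    unsigned char c;
    size_t width = 1;

    if (r->pos >= r->len) {
        r->last_width = 0;
        return -1;
    }
    c = (unsigned char)r->data[r->pos];
    if (c == '\r') {
        if (r->pos + 1 < r->len && r->data[r->pos + 1] == '\n')
            width = 2;              /* CR-LF counts as one character */
        c = '\n';
    }
    r->pos += width;
    r->last_width = width;
    return c;
}

/* Undo the last read_char. Returns 1 on success, 0 if there is
   nothing to unget. */
int unget_char(Reader *r)
{
    if (r->last_width == 0)
        return 0;
    r->pos -= r->last_width;        /* back up over the whole EOL */
    r->last_width = 0;
    return 1;
}

size_t reader_pos(const Reader *r)
{
    return r->pos;
}
```

With this bookkeeping, ungetting an EOL that was read as CR-LF moves
the position back by two, so Pos reports the same value before the
read and after the unget.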

I suggest that we adopt the Unix convention for our default, i.e.,
EOL == LF. It maintains POSIX compatibility, makes Michael's Linux
machine happy, and I am already accustomed to doing it that way. :)
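
On the write side, an EOL attribute that defaults to LF but can be
changed per file might look like the following C sketch. The Writer
type and function names are hypothetical. The stream is assumed to be
open in binary mode so that the C library does not translate the line
ending a second time.

```c
#include <assert.h>
#include <stdio.h>

typedef struct {
    FILE *fp;           /* assumed to be opened in binary mode */
    const char *eol;    /* line terminator written by writer_put_line */
} Writer;

/* Attach a writer to fp with the default convention, EOL == LF. */
void writer_init(Writer *w, FILE *fp)
{
    assert(w != NULL && fp != NULL);
    w->fp = fp;
    w->eol = "\n";
}

/* Select another convention, e.g. "\r\n" or "\r". */
void writer_set_eol(Writer *w, const char *eol)
{
    assert(w != NULL && eol != NULL);
    w->eol = eol;
}

/* Write text followed by the writer's EOL. Returns 0 on success,
   -1 on a write error. */
int writer_put_line(Writer *w, const char *text)
{
    if (fputs(text, w->fp) == EOF)
        return -1;
    if (fputs(w->eol, w->fp) == EOF)
        return -1;
    return 0;
}
```

A caller that never touches the attribute gets Unix-style output; one
that needs CR-LF calls writer_set_eol(&w, "\r\n") once after
writer_init.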

Mark

-- 
Mark K. Gardner (mkgardne@cs.uiuc.edu)
University of Illinois at Urbana-Champaign
Real-Time Systems Laboratory
--