
Re: New TextRider & oo2c patch



> From: "Eric W. Nikitin" <enikitin@apk.net>
> Date: Fri, 10 Jul 1998 09:14:04 -0400 (EDT)
> 
> Michael van Acken wrote:
> > While hacking away at TextRider, I noticed a small number of minor
> > inconsistencies of the reference manual.
> > 
> > Quote OOC RM:
> > >   When attempting to read, and if the value is not properly formatted
> > >for its type, `r.Res()' returns `invalidFormat'.  The reader remains
> > >positioned at the character which caused the `invalidFormat' error, but
> > >further reading can not take place until the error is cleared.
> > 
> > The bit about the reader position after `invalidFormat' errors does
> > not hold after a failed ReadBool.  In this case, the position is
> > after the invalid identifier.
> 
> The way I read it, the RM is correct.  Assume a `ReadBool' is done on each
> of the following (^ indicates rider position after read attempt):
> 
> TRUE
>     ^   (* r.Res() => done *)
> 
> FALSE
>      ^  (* r.Res() => done *)
> 
> TRUUE
>    ^    (* r.Res() => invalidFormat, expects an `E' here not `U' *)

I should have been more precise when I wrote this.  The bit about the
reader position is no longer correct, because I changed it with my
current patch.  There are two reasons for this: 

The old behaviour accepted things like
TRUEBAR
    ^  (* r.Res() => done *)
as a valid boolean value.  One can argue for and against this, but
personally I think this should not happen.

Then, the scanner treats booleans like identifiers.  It first reads
an identifier, and then checks for the special cases TRUE/FALSE.  In
other words, the scanner will _not_ accept "TRUEBAR".  For consistency
reasons, the reader should do the same IMO.
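
To illustrate the identifier-first approach, here is a minimal sketch
written as a Python model, not the actual TextRider code.  The name
`read_bool', the returned tuple, and the assumed identifier syntax (a
letter followed by letters or digits) are illustrative assumptions;
whitespace skipping is left out.

```python
def read_bool(text, pos=0):
    """Model of an identifier-first ReadBool (hypothetical helper).

    Returns (value, new_pos, res); value is None on failure.
    """
    end = pos
    # Scan a whole identifier first: a letter, then letters or digits.
    if end < len(text) and text[end].isalpha():
        end += 1
        while end < len(text) and text[end].isalnum():
            end += 1
    ident = text[pos:end]
    # Only now check for the two special cases.
    if ident == "TRUE":
        return True, end, "done"
    if ident == "FALSE":
        return False, end, "done"
    # The position is left after the invalid identifier.
    return None, end, "invalidFormat"
```

In this model "TRUEBAR" and "TRUUE" are both consumed as complete
identifiers and then rejected, so the position ends up after the
identifier rather than at the first mismatching character.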

> When reading numbers, invalidFormat can only occur at the beginning (first
> or second character).  After that, "invalid" characters signal end-of-input
> for a number.  ReadInt on each of the following:
> 
> -123A
>     ^  (* r.Res() => done *)
> 
> -A
>  ^     (* r.Res() => invalidFormat *)
> 
> A123
> ^      (* r.Res() => invalidFormat *)

This is true for integers.  With reals there are additional "invalid"
variants.
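
The integer cases quoted above can be modelled roughly as follows.
This is a Python sketch under assumptions, not the library code: the
name `read_int' and the returned tuple are made up for illustration,
an optional leading "+" is assumed to be accepted, and the range check
that would produce `valueOutOfRange' is omitted.

```python
def _is_digit(ch):
    # ASCII digits only.
    return "0" <= ch <= "9"


def read_int(text, pos=0):
    """Model of a lookahead-based ReadInt (hypothetical helper).

    Returns (value, new_pos, res); value is None on failure.
    """
    start = pos
    # Optional sign (accepting "+" is an assumption).
    if pos < len(text) and text[pos] in "+-":
        pos += 1
    # The first character after the optional sign must be a digit;
    # otherwise stop at the offending character.
    if pos >= len(text) or not _is_digit(text[pos]):
        return None, pos, "invalidFormat"
    # After that, any non-digit simply ends the number.
    while pos < len(text) and _is_digit(text[pos]):
        pos += 1
    return int(text[start:pos]), pos, "done"
```

With this model "-123A" yields -123 with the position at the "A",
while "-A" and "A123" fail at index 1 and index 0 respectively,
matching the three cases quoted above.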

> Should all of this be explained in the RM?  Or is there a better short
> explanation for this than what is already there?

I never believed in specifying this error behaviour in detail.  Some
cases signal an error if the lookahead character is invalid (like
ReadInt), others if the next character does not fit into the variable
(like ReadLine), and ReadBool if the identifier does not match.

You covered the first case, and the second case behaves accordingly by
leaving the offending character in the input, but the ReadBool case
does something slightly different.

> > `valueOutOfRange' is signaled after the whole integer number is read,
> > but for strings the reading procedure may return with this error "in
> > between", without scanning to the end of the string first.
> 
> I'll add a note about this to the RM.

I guess I could change ReadString/ReadLine to scan to the end, but I
am not sure if this is worthwhile.  It would make overflow handling
more consistent with ReadInt & friends, though.  

> > Also: The RM does not explain the difference between the types `error'
> > and `invalid'.  Lacking any further information, I am using `invalid'
> > as an equivalent to `undefined', which can only happen after
> > `InitScanner' or `ClearError'.
> 
> The difference should be that `invalid' (for Scanners) is equivalent to
> `invalidFormat' OR `valueOutOfRange' (for Readers) -- i.e., problems
> interpreting tokens.  On the other hand, an `error' is when an error occurs
> on the underlying Reader (i.e., `s.r.byteReader.res#done'); `error' is
> therefore used to determine when you've reached end-of-text.  So normally,
> you would write something like
> 
> 
>   s.Scan;
>   WHILE s.type # TextRider.error DO
>      IF s.type = TextRider.string THEN
>   ...
>      ELSIF s.type = TextRider.invalid THEN
>         (* Do something special with a bad token *)
>   ...

This makes sense.  Of course, the reference manual should not mention
`s.r.byteReader.res', because the `byteReader' field should be hidden
from the user -- it is only exported to be able to create extensions
of the module within the OOC library.  Also, `invalidFormat' and
`valueOutOfRange' are written to `byteReader.res' like low-level
errors (which is another thing the user does not need to know).

With `invalid' and `error' specified this way, I need another value to
initialize `Scanner.type' when doing `ConnectScanner' or
`ClearError'.  I suggest using `undefined' for this.

> > Btw: I did _not_ fix the potential string buffer overruns in ReadLReal
> > and Scanner.ReadNum.
> 
> What should we do about this?  Make it POINTER TO ARRAY OF CHAR?  Or ADT
> Lib's dynamic String type?

IMO, the correct way to do this is to have the procedures accept
strings of arbitrary length, and return the nearest LONGREAL value in
any case.  The trick is to discard stuff like leading zeroes, cut off
digits beyond the maximum number of significant digits, and to detect
overflows just by counting digits.  With these techniques and
MAX(LONGREAL) < 2E308, one could store any valid read number in a
buffer of 330 characters, and detect overflows and underflows for
longer real strings "by hand".  I implemented something similar for
integers, but it is a little bit more complex for reals.  Right now I
only signal "valueOutOfRange" if the real string is longer than 1023
characters, although I do scan to the end of the number before
reporting this.
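
A rough sketch of the digit-counting idea, as a Python model and not
the ReadLReal implementation.  The name `classify_real', the category
strings, the 17-digit cutoff, the lower exponent bound of -324, and
the accepted syntax (optional sign, digits, ".", digits, optional E/D
scale factor) are all assumptions made for illustration.

```python
import re

MAX_SIG_DIGITS = 17   # assumed cutoff for significant digits
MAX_DEC_EXP = 308     # from MAX(LONGREAL) < 2E308
MIN_DEC_EXP = -324    # assumed: smaller exponents round to zero

# Assumed syntax: [sign] digits "." [digits] [("E"|"D") [sign] digits]
_REAL = re.compile(r"([+-]?)([0-9]+)\.([0-9]*)(?:[ED]([+-]?[0-9]+))?")


def classify_real(s):
    """Return (category, short_form) for a real literal of any length."""
    m = _REAL.fullmatch(s)
    if m is None:
        return "invalid", None
    sign, int_part, frac_part, exp = m.groups()
    # A real implementation would also have to bound the exponent
    # digits; Python integers make that unnecessary in this model.
    exp = int(exp) if exp else 0
    # Discard leading zeroes; they carry no information.
    digits = (int_part + frac_part).lstrip("0")
    if not digits:
        return "ok", sign + "0.0E0"
    # Decimal exponent of the first significant digit, found by counting.
    sci = exp - len(frac_part) + len(digits) - 1
    if sci > MAX_DEC_EXP:
        return "overflow", None
    if sci < MIN_DEC_EXP:
        return "underflow", None
    # Cut off (truncate, not round) digits beyond the maximum.
    sig = digits[:MAX_SIG_DIGITS]
    return "ok", "%s%s.%sE%d" % (sign, sig[0], sig[1:] or "0", sci)
```

The short form has a bounded length no matter how long the input is,
so it could be handed to a fixed-size conversion buffer.  Borderline
values such as 9E308 still pass the counting test here and would have
to be caught by the conversion itself.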

-- mva