Re: LONGCHAR proposal
"Eric W. Nikitin" <enikitin@apk.net> writes:
> Michael van Acken wrote:
> > >
> > > CAP(x)    LONGCHAR    LONGCHAR    if x is a letter,
> > >                                   corresponding capital letter;
> > >                                   otherwise, identity
> >
> > Do we restrict CAP to the ASCII range, i.e., implement it as a mapping
> > of [a-z] to [A-Z], and identity otherwise, or do we deal with the
> > whole set of characters that have a corresponding upper case
> > equivalent?
>
> After thinking this over and looking at the Component Pascal language
> report, which defines CAP as follows:
>
> CAP(x)    character type    type of x    x is a Latin-1 letter:
>                                          corresponding capital letter
Currently OOC implements CAP only for ASCII; it is the identity
operation for chars in the range 80X..FFX. Is anyone in favor of
extending capitalization to the full ISO Latin-1 range?
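
To make the difference concrete, here is a rough sketch of the two variants. It is written in Python rather than Oberon-2 purely for illustration; the names `cap_ascii` and `cap_latin1` are made up and are not part of OOC. It assumes that the Latin-1 small letters sit exactly 20X above their capitals, and that the characters without a capital inside Latin-1 are left alone.

```python
def cap_ascii(ch: str) -> str:
    """ASCII-only variant: map a-z to A-Z, identity otherwise."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - 0x20)
    return ch


def cap_latin1(ch: str) -> str:
    """Latin-1 variant: additionally map the small letters in E0..FE."""
    code = ord(ch)
    # F7 is the division sign, not a letter. DF (sharp s) and FF
    # (y with diaeresis) have no capital inside Latin-1, so they fall
    # through to the identity case below.
    if 0xE0 <= code <= 0xFE and code != 0xF7:
        return chr(code - 0x20)
    return cap_ascii(ch)
```

With this sketch, `cap_ascii` leaves e-acute (E9) untouched, while `cap_latin1` maps it to C9.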
> I think we should *not* define CAP for LONGCHAR. Unicode
> capitalization is locale dependent, and I wouldn't want that sort of
> dependency lurking about in a predeclared procedure.
>
> > >
> > > SHORT(x) LONGCHAR CHAR projection
> >
> > This will lose the most significant part of the 2-byte Unicode
> > character.
>
> True. But are you saying that this simply needs to be noted in the
> proposal? Or that SHORT shouldn't be defined to operate on LONGCHAR
> and LongStrings?
There are two ways to deal with this loss of information.
1) Truncation. Disadvantage: The resulting Latin-1 character is
quite arbitrary, since it depends only on the low byte of the code.
2) Mapping of 0100X..FFFFX onto a single character, e.g. "?".
Advantage: More deterministic, and the shortened string can be
readable if the original Unicode text uses mostly Latin-1
characters, e.g. if it is an English text with a few special
characters like quotes, hyphens, and the like. The effect would be
pretty much like viewing an HTML document produced by an MS product
with a non-Windows browser.
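
The two options above could be sketched roughly as follows. This is Python for illustration only, with character codes as plain integers; `short_truncate`, `short_replace`, and `short_string` are hypothetical names, not proposed library procedures, and the input is assumed to consist of 16-bit codes.

```python
def short_truncate(code: int) -> int:
    """Option 1: keep only the low byte of the 16-bit code."""
    return code & 0xFF


def short_replace(code: int, substitute: str = "?") -> int:
    """Option 2: map every code in 0100X..FFFFX to one substitute."""
    return code if code < 0x100 else ord(substitute)


def short_string(text: str, short=short_replace) -> str:
    """Apply one of the two projections to every character of text."""
    return "".join(chr(short(ord(c))) for c in text)
```

For example, truncation turns U+0141 (L with stroke) into 41X, i.e. "A", and U+201C (left double quote) into the control character 1CX, while the second option turns both into "?".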
> > > Thus, BinaryRider needs to know only about a single Unicode encoding
> > > because its function is to mirror OOC's internal data format.
> >
> > Actually, it needs two. Writing fixed width 16-bit Unicode can put
> > the least significant or the most significant byte first. Both
> > variants are explicitly permitted. To distinguish the two, it is
> > possible to prefix the binary file with the byte order mark U+FEFF.
>
> Both Reader and Writer in BinaryRider already have a `SetByteOrder'
> method, should ReadLChar, etc., use this setting as well?
This might be a good idea. In addition, BinaryRider could be told
through an option whether it should interpret (and discard) a leading
U+FEFF byte order mark.
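
A rough sketch of what such an option might do, in Python for illustration only. `read_lchars` and its parameters are invented for this example and do not describe the actual BinaryRider interface. The assumption is that a leading FEFF confirms the configured byte order, and a leading FFFE (the byte-swapped mark) means the data uses the opposite order.

```python
def read_lchars(data: bytes, byteorder: str = "big",
                detect_bom: bool = False) -> list:
    """Decode fixed-width 16-bit codes; byteorder is 'big' or 'little'."""
    # A trailing odd byte, if any, is ignored in this sketch.
    units = [int.from_bytes(data[i:i + 2], byteorder)
             for i in range(0, len(data) - 1, 2)]
    if detect_bom and units:
        if units[0] == 0xFEFF:
            # Mark read correctly: configured order is right, drop the mark.
            units = units[1:]
        elif units[0] == 0xFFFE:
            # Mark read byte-swapped: drop it and swap the remaining codes.
            units = [((u & 0xFF) << 8) | (u >> 8) for u in units[1:]]
    return units
```

Without `detect_bom`, the mark is passed through as an ordinary character, which matches the behaviour of a reader that relies on `SetByteOrder` alone.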
> Even though I implied code reuse in the above, it wasn't my primary
> concern. I just want to be able to drop any TextRider variant into
> my application with as few changes as possible. Hence, a common
> abstract class at the base defining the interface with multiple
> implementations such as UnicodeRider and KSCRider. Whether KSCRider
> inherits from UnicodeRider doesn't matter, but what does matter is
> that they are (for the most part) interchangeable.
If KSCRider maps to the Unicode character encoding, these two would
share most, if not all, of their interface. Fixing the interface in
an abstract class is a good idea to formalize this.
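
As a very rough illustration of the idea (Python, not Oberon-2; the class and method names are hypothetical and much simpler than a real TextRider), both variants deliver Unicode text through the same abstract interface. The sketch assumes big-endian 16-bit Unicode for one reader and the EUC-KR byte encoding of KS C 5601 for the other.

```python
from abc import ABC, abstractmethod


class TextReader(ABC):
    """Hypothetical common interface: every variant yields Unicode text."""

    def __init__(self, data: bytes):
        self.data = data

    @abstractmethod
    def read_string(self) -> str:
        """Return the underlying bytes as Unicode text."""


class UnicodeReader(TextReader):
    def read_string(self) -> str:
        # Assumption: fixed-width 16-bit Unicode, most significant byte first.
        return self.data.decode("utf-16-be")


class KSCReader(TextReader):
    def read_string(self) -> str:
        # Assumption: KS C 5601 text stored in its EUC-KR byte encoding.
        return self.data.decode("euc_kr")
```

An application written against `TextReader` does not need to know which of the two implementations it was handed.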
-- mva