Re: LONGCHAR proposal



Michael van Acken wrote:
> > 
> >      CAP(x)      LONGCHAR             LONGCHAR     if x is a letter,
> >                                                      corresponding capital
> >                                                      letter;
> >                                                      otherwise, identity
> 
> Do we restrict CAP to the ASCII range, i.e., implement it as a mapping
> of [a-z] to [A-Z], and identity otherwise, or do we deal with the
> whole set of characters that have a corresponding upper case
> equivalent?

After thinking this over and looking at the Component Pascal language
report, which defines CAP as follows:

       CAP(x)      character type      type of x    x is a Latin-1 letter:
                                                    corresponding capital 
                                                    letter 


I think we should *not* define CAP for LONGCHAR.  Unicode
capitalization is locale dependent, and I wouldn't want that sort of
dependency lurking about in a predeclared procedure.
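
To make the concern concrete, here is a minimal sketch in Python (not
Oberon-2, and not part of the proposal).  `cap_ascii' is a hypothetical
helper implementing the ASCII-only reading of CAP from Michael's
question; the remaining lines show two cases where full Unicode
uppercasing is not a simple character-to-character mapping.

```python
def cap_ascii(ch: str) -> str:
    """Hypothetical ASCII-only CAP: map a-z to A-Z, identity otherwise."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - ord("a") + ord("A"))
    return ch


# ASCII letters are capitalized; everything else is left alone.
print(cap_ascii("q"))   # Q
print(cap_ascii("ß"))   # ß (unchanged, outside a-z)

# Python's locale-independent upper() turns one character into two here,
# so the result no longer fits in a single LONGCHAR-sized value.
print("ß".upper())      # SS

# Default uppercasing of "i" gives "I"; Turkish text expects U+0130
# (capital I with dot above), which requires knowing the locale.
print("i".upper())      # I
```

A predeclared CAP returning a single character cannot express either
case without picking a locale or silently ignoring it.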

> > 
> >      SHORT(x)    LONGCHAR             CHAR         projection 
> 
> This will lose the most significant part of the 2-byte Unicode
> character.

True.  But are you saying that this simply needs to be noted in the
proposal?  Or that SHORT shouldn't be defined to operate on LONGCHAR
and LongStrings?  
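
For reference, a small Python sketch of what the loss looks like.  It
assumes that "projection" means keeping only the least significant
byte of the 16-bit value, which is how Michael's comment reads; the
proposal itself does not spell this out, and `short' is only an
illustrative stand-in for SHORT.

```python
def short(ch: str) -> str:
    """Illustrative SHORT on a 16-bit character: keep the low byte only.

    Assumption: 'projection' drops the most significant byte.
    """
    return chr(ord(ch) & 0xFF)


# Characters below 256 survive unchanged.
print(short("A") == "A")                  # True

# U+0141 (L with stroke) has low byte 0x41, so it collapses to "A".
print(short("\u0141"))                    # A

# U+20AC (euro sign) has low byte 0xAC, which is U+00AC (not sign).
print(short("\u20ac") == "\u00ac")        # True
```

Under this reading, distinct LONGCHAR values can map to the same CHAR,
and the result may be an unrelated character rather than an obvious
placeholder.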

> >      Name            Argument type                 Function
> >      COPY(x, v)      x: character array, string    v := x
> >                      v: character array
> 
> Invalid: 
>   COPY(ARRAY OF CHAR, ARRAY OF LONGCHAR) or
>   COPY (ARRAY OF CHAR, LongString)
> 
> All other variations are legal.

I'll add this in.

> > Thus, BinaryRider needs to know only about a single Unicode encoding
> > because its function is to mirror OOC's internal data format.
> 
> Actually, it needs two.  Writing fixed width 16-bit Unicode can put
> the least significant or the most significant byte first.  Both
> variants are explicitly permitted.  To distinguish the two, it is
> possible to prefix the binary file with the byte order mark U+FEFF.

Both Reader and Writer in BinaryRider already have a `SetByteOrder'
method.  Should ReadLChar, etc., use this setting as well?
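
As an illustration of the two byte orders and the U+FEFF prefix
Michael mentions, here is a rough Python sketch.  The function names
are hypothetical and have nothing to do with BinaryRider's actual
interface; the sketch handles only characters that fit in 16 bits.

```python
import struct

BOM = 0xFEFF  # byte order mark, written first so a reader can detect order


def write_lchars(text: str, big_endian: bool) -> bytes:
    """Write text as fixed-width 16-bit units, prefixed with the BOM.

    Assumption: every character fits in 16 bits (no surrogate handling).
    """
    fmt = ">H" if big_endian else "<H"
    out = struct.pack(fmt, BOM)
    for ch in text:
        out += struct.pack(fmt, ord(ch))
    return out


def read_lchars(data: bytes) -> str:
    """Detect the byte order from the leading BOM, then decode the rest."""
    if data[:2] == b"\xfe\xff":
        fmt = ">H"          # most significant byte first
    elif data[:2] == b"\xff\xfe":
        fmt = "<H"          # least significant byte first
    else:
        raise ValueError("missing byte order mark")
    units = (struct.unpack_from(fmt, data, i)[0]
             for i in range(2, len(data), 2))
    return "".join(chr(u) for u in units)


big = write_lchars("A", big_endian=True)      # b'\xfe\xff\x00A'
little = write_lchars("A", big_endian=False)  # b'\xff\xfeA\x00'
print(read_lchars(big) == read_lchars(little) == "A")  # True
```

The reader can recover the order from the mark alone, whereas a
writer still needs to be told which order to produce, which is the
role a setting like `SetByteOrder' would play.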


> > > I strongly suggest we create a `UnicodeRider'
> > > module, akin to `TextRider', that does conversion from and to Unicode
> > > texts.  It should deal with the various encodings like UTF-7, UTF-8,
> > > and little-endian/big-endian 16 bit.  (Are there more variants?)
> > 
> > If you want a separate module for each encoding (and there would be a
> > lot of them, see the following for those supported by Java:
> > http://www.javasoft.com/products/jdk/1.1/docs/guide/intl/encoding.doc.html
> > )
> 
> Actually I meant that UnicodeRider would understand the most common
> Unicode encoding (transformation format) variants.  Mapping between
> different character sets, like JIS0208 to Unicode, should be done by
> other modules.

Ok.  It makes sense to enable UnicodeRider to understand the common
transformation formats.  


> > Then I suggest we try to build this as a class hierarchy, with
> > possibly using an abstract class at the base.  Then, once the two
> > basic mapper classes, say TextRider and UnicodeRider, are built, any
> > extension class like KSCRider would simply have to supply conversions
> > and the rest of the behavior would be the same.
> 
> IMO code reuse is a non-topic here.  We can write a rider class that
> siphons every single character through a state machine to do the
> decoding and conversion, but the performance penalty would be severe.

Even though I implied code reuse in the above, it wasn't my primary
concern.  I just want to be able to drop any TextRider variant into
my application with as few changes as possible.  Hence, a common
abstract class at the base defining the interface with multiple
implementations such as UnicodeRider and KSCRider.  Whether KSCRider
inherits from UnicodeRider doesn't matter, but what does matter is
that they are (for the most part) interchangeable.    
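
A rough sketch of that arrangement, in Python rather than Oberon-2.
All class and function names here are hypothetical, and the two
concrete readers use Latin-1 and UTF-8 merely as stand-ins for
TextRider-like and UnicodeRider-like variants.

```python
from abc import ABC, abstractmethod


class CharReader(ABC):
    """Hypothetical abstract base: the interface the application sees."""

    def __init__(self, data: bytes):
        self.data = data

    @abstractmethod
    def decode(self) -> str:
        """Convert the raw bytes to text; each variant supplies this."""

    def read_line(self) -> str:
        # Shared behavior, written once against the abstract interface.
        return self.decode().split("\n", 1)[0]


class Latin1Reader(CharReader):
    def decode(self) -> str:
        return self.data.decode("latin-1")


class Utf8Reader(CharReader):
    def decode(self) -> str:
        return self.data.decode("utf-8")


def first_line(reader: CharReader) -> str:
    """Application code depends only on the base class."""
    return reader.read_line()


text = "héllo\nworld"
print(first_line(Latin1Reader(text.encode("latin-1"))))  # héllo
print(first_line(Utf8Reader(text.encode("utf-8"))))      # héllo
```

Neither concrete reader inherits from the other, yet swapping one for
the other changes a single constructor call in the application.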


Eric