
Re: LONGCHAR proposal



> Date: Tue, 23 Feb 1999 11:02:25
> From: Stewart Greenhill <greenhil@murdoch.edu.au>
> 
> >The following is a proposal to add the type LONGCHAR to OOC, which has been
> >discussed here very briefly before.  Please feel free to critique it.
> 
> I think this would be a useful addition to OOC. Currently, there is no good
> mapping for Unicode data types under Microsoft Windows. LONGCHAR would
> solve this problem.

I'm quite fond of Unicode myself.  XML uses it as its character set, too.

> [...]
> 
> >Constant strings which consist solely of characters in the range
> >`0X..0FFX' and strings stored in an array of CHAR are of type String,
> >all others are of type LongString.
> >
> >(LongString constants need to have a means of representing Unicode
> > character values.  Java does it using escape sequences like,
> > "\uc0ac\uc6a9".  What would be an equivalent Oberon-like way to do
> > this?  How does Component Pascal handle this?  I'd guess that, since
> > CP has a string concatenation operator `+', you could write
> > 0C0ACX+0C6A9X to represent the above Java string.)
> 
> Yes, this is correct. While concatenation is generally used for operating
> on variables, it can also be used to build strings containing special
> characters (eg. control characters, Unicode characters). It would be useful
> to introduce a limited form of string concatenation that works for
> constants. This could also be useful in many Windows functions. For
> example, MessageBoxA is used to display informational or error messages in
> a window. Without string concatenation, one needs to build strings at
> run-time:
> [...]

Concatenation of string constants would be useful indeed, even if
completely against standard O2.  For such minor transgressions OOC has
its "non-conformant mode".  I would be willing to make it a little
less conformant by adding a string concatenation operator.  How could
we otherwise express strings like "a"+0X+"b"?  ;-)

> PROCEDURE Usage;
> CONST
>   nl = 0DX + 0AX;

With string concat, we could turn CharClass.systemEol into a constant.

> >The following predeclared function procedures support these
> >additional operations:
> >
> >       Name      Argument type        Result type  Function
> >       LONG(x)   CHAR                 LONGCHAR     identity 
> >                 String               LongString   identity 
> >
> >       SHORT(x)  LONGCHAR             CHAR         projection 
> >                 LongString           String       projection 
> 
> I would suggest in addition:
> 
>         ORD(x)     LONGCHAR            INTEGER      ordinal value of x
> 
>         LONGCHR(x) integer type        LONGCHAR     long character with
>                                                     ordinal value x

I would prefer ORD(LONGCHAR) to map to LONGINT.  Otherwise half of the
LONGCHAR values would be mapped to negative values.
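
To illustrate the sign problem, here is a minimal sketch (in Python
rather than Oberon-2, so it can be run as is).  It assumes INTEGER is
a 16 bit signed type and LONGCHAR a 16 bit unsigned code; the helper
name `ord_as_int16' is made up for this example.

```python
import struct

def ord_as_int16(code):
    """Reinterpret a 16-bit character code as a signed 16-bit integer.

    Assumption: this mimics what ORD would return if its result type
    were a 16-bit signed INTEGER.
    """
    # Pack as unsigned 16-bit, then read the same two bytes back as signed.
    return struct.unpack("<h", struct.pack("<H", code))[0]

for code in (0x0041, 0x7FFF, 0x8000, 0xC0AC):
    print("%04X -> %d" % (code, ord_as_int16(code)))
```

Every code from 8000X upwards comes out negative, which is exactly
the half of the range I would like to keep positive by using LONGINT.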

We also have CAP and COPY.  CAP is a problem, because mapping to upper
case is only simple if it is restricted to the ASCII set.  Currently,
the implementation of CAP cannot even do ISO-Latin-1 capitalization.

> >The following modules would have to be modified:
> >
> >	Channel     (add Channel: LErrorDescr; 
> >		         Reader:  LErrorDescr;
> >			 Writer:  LErrorDescr)
> >	Files       (add LErrorDescr, LNew, LOld, LTmp, LSetModTime,
> 
> I don't think the underlying run-time system will allow LONGCHARs in file
> names. This means that longstrings will need to be converted to strings,
> possibly signalling an error if non-representable characters are included
> in file names.

I agree.  And as long as the error messages in the modules are not
localizable, we can safely stick to single byte ASCII for them, too.  
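
The conversion for file names could look roughly like the following
sketch (Python, for illustration only).  It assumes that a String
holds characters in the range 0X..0FFX, as in the proposal; the
function name `short_string' and the choice of exception are my own
invention, not part of any existing module.

```python
def short_string(long_string):
    """Project a Unicode string onto 8-bit characters (0X..0FFX).

    Raises ValueError for the first character that cannot be
    represented, instead of silently truncating it.
    """
    for ch in long_string:
        if ord(ch) > 0xFF:
            raise ValueError("character U+%04X has no CHAR equivalent" % ord(ch))
    # Latin-1 maps code points 0..255 one-to-one onto single bytes.
    return long_string.encode("latin-1")

print(short_string("data.txt"))
```

A file name made up of 8 bit characters passes through unchanged,
while one containing, say, Korean characters is rejected.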

What encoding should the additions to BinaryRider (ReadLChar &
friends) use?  There are a _lot_ of Unicode encodings, and I really
don't want to burden `BinaryRider' with an intimate knowledge of them
all.

> >			 LGetModTime, LExists;
> >		         File:    LErrorDescr, 
> >		         Reader:  LErrorDescr;
> >			 Writer:  LErrorDescr)
> >
> >	BinaryRider (add Reader: ReadLChar, LErrorDescr, ReadLString;
> >		         Writer: WriteLChar, LErrorDescr, WriteLString)
> >
> >	TextRider   (This will require the most significant changes,
> >		     and will need at least some discussion as to
> >		     what should be done.  Some things to think
> >		     about:
> >
> >		     1) We could just add methods like
> >		        reader.ReadLChar and reader.ReadLString, but
> >		        what about things like reader.ReadInt, how
> >		        would they be set to read from Unicode
> >		        streams?  Should there be methods like
> >		        reader.LongReadInt?
> >
> >		     2) Or should there be a separate module
> >		        LongTextRider?  And if so, how would that
> >		        affect modules In, Out, Err, and Log?  Could
> >		        we make it so that TextRider.reader is
> >		        interchangeable with LongTextRider.reader?)
> 
> I think approach (1) is probably the simpler and would be easier to
> maintain since it would not require duplication of the majority of
> the code.
> [... TextRider outline for both CHAR and LONGCHAR deleted]

Why bother to create a single TextRider module for all encodings?  We
have a clear separation of concerns in the I/O section of the library.
Channels deal with byte streams, mappers convert from bytes to
meaningful data.  I strongly suggest we create a `UnicodeRider'
module, akin to `TextRider', that does conversion from and to Unicode
texts.  It should deal with the various encodings like UTF-7, UTF-8,
and little-endian/big-endian 16 bit.  (Are there more variants?)
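
To show how much the byte streams differ, here is a small sketch
(Python, for illustration only) that encodes the two characters from
the Java example quoted above in each of these encodings.  The codec
names are the ones Python's standard library uses; nothing here is an
existing OOC interface.

```python
text = "\uc0ac\uc6a9"  # the two characters from the Java example

for codec in ("utf-7", "utf-8", "utf-16-le", "utf-16-be"):
    data = text.encode(codec)
    # Each encoding must decode back to the same Unicode text.
    assert data.decode(codec) == text
    print("%-10s %d bytes: %s" % (codec, len(data), data.hex()))
```

The same two characters take six bytes in UTF-8 and four in either
16 bit form, with the byte order swapped between the two.  A rider
is the natural place to hide these differences.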

> PS. CSTRING would also need to be changed to deal with LongStrings.

Yes.  Eric must not forget this in his draft ;-)


Quoting Eric's original post:

> Unicode is not the only international standard for character
> encoding; for example, a standard Korean character encoding is
> "KSC5601".  When manipulating non-Unicode text data, a program
> would convert the data into Unicode, perform its processing, and
> then convert the result back to an external character encoding.
>
> OOC could provide conversion I/O routines that convert between
> various external encodings and internal (Unicode).  This could
> be done using a "layered channel" approach:

While this is feasible, I would rather not do it at the channel level.
Instead, I propose a module `KSCRider', implementing the same
interface as `UnicodeRider', that does the necessary conversion.
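
A rough sketch of what I have in mind (Python, for illustration
only): the channel stays a plain byte stream, and each rider attaches
a decoder to it while offering the same methods.  All class, function,
and method names here are hypothetical, and I am assuming that
Python's "euc_kr" codec is an acceptable stand-in for KSC5601 data.

```python
import io

class EncodedReader:
    """Hypothetical rider: reads Unicode characters from a byte channel."""

    def __init__(self, channel, codec):
        # The channel only delivers bytes; decoding happens in the rider.
        self.text = io.TextIOWrapper(channel, encoding=codec)

    def read_lchar(self):
        """Return the next character, or '' at end of input."""
        return self.text.read(1)

    def read_lstring(self):
        """Return the rest of the current line without its line end."""
        return self.text.readline().rstrip("\n")

def unicode_reader(channel):
    """Stand-in for a `UnicodeRider' reader (UTF-8 assumed here)."""
    return EncodedReader(channel, "utf-8")

def ksc_reader(channel):
    """Stand-in for a `KSCRider' reader."""
    return EncodedReader(channel, "euc_kr")

sample = "\uc0ac\uc6a9\n"
for make_reader, codec in ((unicode_reader, "utf-8"), (ksc_reader, "euc_kr")):
    reader = make_reader(io.BytesIO(sample.encode(codec)))
    print(codec, reader.read_lstring() == "\uc0ac\uc6a9")
```

Client code sees the same interface and the same Unicode characters
no matter which external encoding the bytes were in.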


One point hasn't been mentioned yet: With Unicode support in the
compiler, the compiler should also be able to read source code written
directly in Unicode.  Then it would be possible to define Unicode
string constants without resorting to *X escapes.

-- mva