Re: LONGCHAR proposal




In response to various comments, the section of the proposal on
predeclared function procedures now reads:

---
The following predeclared function procedures support these
additional operations:

     Name        Argument type        Result type  Function
     LONG(x)     CHAR                 LONGCHAR     identity 
                 String               LongString   identity 

     CAP(x)      LONGCHAR             LONGCHAR     if x is a letter,
                                                   corresponding capital
                                                   letter;
                                                   otherwise, identity

     LONGCHR(x)  integer type         LONGCHAR     long character with
                                                   ordinal value x

     ORD(x)      LONGCHAR             LONGINT      ordinal value of x

     SHORT(x)    LONGCHAR             CHAR         projection 
                 LongString           String       projection 
---
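
To make the intended semantics a bit more concrete, here is a rough
model of the table in Python (not Oberon-2).  It is only a sketch:
one-character Python strings stand in for CHAR and LONGCHAR values,
and the names long_, short, longchr and cap are mine.  Two things are
assumptions, not part of the proposal: that SHORT rejects values above
0FFX (the table only says "projection"), and that Python's upper() is
an acceptable stand-in for "corresponding capital letter".

```python
CHAR_MAX = 0xFF        # CHAR: 8-bit ISO-Latin-1
LONGCHAR_MAX = 0xFFFF  # LONGCHAR: 16-bit Unicode


def long_(c: str) -> str:
    """LONG(x): CHAR -> LONGCHAR, identity on the ordinal value."""
    if ord(c) > CHAR_MAX:
        raise ValueError("argument is not a CHAR value")
    return c


def short(c: str) -> str:
    """SHORT(x): LONGCHAR -> CHAR.

    ASSUMPTION: values that do not fit in a CHAR are rejected; the
    proposal does not define this case.
    """
    if ord(c) > CHAR_MAX:
        raise ValueError("LONGCHAR value not representable as CHAR")
    return c


def longchr(x: int) -> str:
    """LONGCHR(x): long character with ordinal value x."""
    if not 0 <= x <= LONGCHAR_MAX:
        raise ValueError("ordinal value outside LONGCHAR range")
    return chr(x)


def cap(c: str) -> str:
    """CAP(x): capital letter if x is a letter, otherwise identity.

    ASSUMPTION: upper() models the letter mapping; results that are not
    a single LONGCHAR fall back to identity.
    """
    u = c.upper()
    return u if len(u) == 1 and ord(u) <= LONGCHAR_MAX else c
```

ORD(x) corresponds directly to Python's built-in ord().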

Does anything special need to be said about COPY?  Does the current
definition cover everything?  That is, does the following need to
change at all?  Or does a note need to be added afterwards about
LongStrings?

     Name            Argument type                 Function
     COPY(x, v)      x: character array, string    v := x
                     v: character array


Michael van Acken wrote:
> Concatenation of string constants would be useful indeed, even if
> completely against standard O2.  For such minor transgressions OOC has
> its "non-conformant mode".  I would be willing to make it a little
> less conformant by adding a string concatenation operator.  How could
> we otherwise express strings like "a"+0X+"b"?  ;-)

I'd vote for adding a string concatenation operator.


> > >The following modules would have to be modified:
> > >
> > >     Channel     (add Channel: LErrorDescr;
> > >                      Reader:  LErrorDescr;
> > >                      Writer:  LErrorDescr)
> > >     Files       (add LErrorDescr, LNew, LOld, LTmp, LSetModTime,
> >
> > I don't think the underlying run-time system will allow LONGCHARs in file
> > names. This means that longstrings will need to be converted to strings,
> > possibly signalling an error if non-representable characters are included
> > in file names.
> 
> I agree.  And as long as the error messages in the modules are not
> localizable, we can safely stick to single byte ASCII for them, too.

I'll concede this for now, although full internationalization
support would eventually require some way to access files with long
character names.  Can URLs have Unicode in them?  If so, maybe this
could be handled in some sort of URL resolver channel class (later
on).  I also reserve the right to bring this up again when we get
around to discussing Locales.


> What encoding should the additions to BinaryRider (ReadLChar &
> friends) use?  There are a _lot_ of Unicode encodings, and I really
> don't want to burden `BinaryRider' with an intimate knowledge of them
> all.

Let me try to clear something up here.

You have to be careful when using the word ``encoding'' because it
covers two related concepts.  One is the ``character set'' itself;
that is, a mapping from numeric values to particular characters.  The
other is the ``data format'', which is the way each character value
is actually represented.

So, for ``ASCII encoding'' a minimum of 7 bits is required to
represent its 128 values.  But it could be formatted as 7-bit ASCII,
or 8-bit ASCII, or 9-bit ASCII, or...

Unicode works the same way.  There is a single Unicode character set,
but it can be formatted in a number of different representations.  A
16-bit value is enough for every character in its Basic Multilingual
Plane; code points beyond that plane need more.  The two formats I am
most familiar with are UCS-2 and UTF-8.  UCS-2 is a 16-bit fixed
width format covering the Basic Multilingual Plane, whereas UTF-8 is
an 8-bit variable width format requiring from one to four bytes to
represent each character.
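
The distinction between character set and data format can be seen in
a few lines of Python.  This is only an illustration; the helper name
representations is made up for this example.  The character value
(code point) stays the same while the bytes differ by format.

```python
def representations(ch: str) -> dict:
    """Return one character's value and two byte-level formats of it."""
    return {
        # The character-set part: a single numeric value.
        "code_point": ord(ch),
        # For characters below 10000H, UTF-16 big-endian produces the
        # same two bytes as big-endian UCS-2.
        "ucs2_be": ch.encode("utf-16-be"),
        # Variable width: one byte for ASCII, more for higher values.
        "utf8": ch.encode("utf-8"),
    }


if __name__ == "__main__":
    for ch in ("A", "\u00e9", "\u20ac"):
        r = representations(ch)
        print(hex(r["code_point"]), r["ucs2_be"].hex(), r["utf8"].hex())
```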

The LONGCHAR proposal suggests that OOC adopt exactly two character
encodings for internal use: 8-bit fixed width ISO-Latin-1 for CHAR
and 16-bit fixed width Unicode for LONGCHAR.  This simplifies all
locale sensitive operations, and when dealing with other encodings,
the only thing you need to worry about is conversion when reading
from or writing to external data sources.  

Thus, BinaryRider needs to know only about a single Unicode encoding
because its function is to mirror OOC's internal data format.
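
A minimal sketch of that division of labour, again in Python rather
than Oberon-2: the binary layer reads and writes a fixed 16-bit value
per long character, and conversion happens once, at the boundary to
external data.  The names write_lchar, read_lchar and import_utf8 are
hypothetical (only ReadLChar is mentioned in this thread), and the
little-endian default is an assumption; the proposal does not fix a
byte order here.

```python
import io
import struct


def write_lchar(stream, ch: str, byteorder: str = "<") -> None:
    """Write one long character as a fixed-width 16-bit value."""
    code = ord(ch)
    if code > 0xFFFF:
        raise ValueError("character does not fit in 16 bits")
    stream.write(struct.pack(byteorder + "H", code))


def read_lchar(stream, byteorder: str = "<") -> str:
    """Read back one long character written by write_lchar."""
    data = stream.read(2)
    if len(data) != 2:
        raise EOFError("truncated long character")
    (code,) = struct.unpack(byteorder + "H", data)
    return chr(code)


def import_utf8(raw: bytes) -> str:
    """Boundary conversion: external UTF-8 bytes to internal text."""
    return raw.decode("utf-8")


if __name__ == "__main__":
    buf = io.BytesIO()
    for ch in import_utf8(b"\xe2\x82\xac"):  # external UTF-8 input
        write_lchar(buf, ch)                 # internal 16-bit format
    buf.seek(0)
    print(hex(ord(read_lchar(buf))))
```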


With respect to TextRider and Misc:

Stewart Greenhill wrote:
>I think approach (1) is probably the simpler and would be easier to
>maintain since it would not require duplication of the majority of the code.
>We could regard the "character width" as simply an attribute of the text.
>Thus, there would be only one TextRider class, but it would deal with
>either 1-byte or 2-byte characters. Character types should be able to be
>used interchangeably, provided that they are representable.

Michael van Acken:
> Why bother to create a single TextRider module for all encodings?  We
> have a clear separation of concerns in the I/O section of the library.
> Channels deal with byte streams, mappers convert from bytes to
> meaningful data.  I strongly suggest we create a `UnicodeRider'
> module, akin to `TextRider', that does conversion from and to Unicode
> texts.  It should deal with the various encodings like UTF-7, UTF-8,
> and little-endian/big-endian 16 bit.  (Are there more variants?)

If you want a separate module for each encoding (and there would be a
lot of them; see the following for those supported by Java:
http://www.javasoft.com/products/jdk/1.1/docs/guide/intl/encoding.doc.html
), then I suggest we try to build this as a class hierarchy, possibly
with an abstract class at the base.  Then, once the two basic mapper
classes, say TextRider and UnicodeRider, are built, any extension
class like KSCRider would simply have to supply conversions, and the
rest of the behavior would be the same.
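
Roughly what I have in mind, sketched in Python rather than Oberon-2.
Everything here is illustrative: the base class name Rider, the
methods to_internal, to_external, write_string and read_all, and the
choice of big-endian 16-bit for UnicodeRider are all assumptions, and
Python's "euc_kr" codec merely stands in for whatever conversion a
KSCRider would supply.

```python
import io
from abc import ABC, abstractmethod


class Rider(ABC):
    """Abstract mapper: shared behavior lives here, conversions do not."""

    def __init__(self, channel):
        self.channel = channel  # any byte stream

    @abstractmethod
    def to_internal(self, raw: bytes) -> str:
        """Convert external bytes to internal text."""

    @abstractmethod
    def to_external(self, text: str) -> bytes:
        """Convert internal text to external bytes."""

    def write_string(self, text: str) -> None:
        self.channel.write(self.to_external(text))

    def read_all(self) -> str:
        return self.to_internal(self.channel.read())


class TextRider(Rider):
    """8-bit ISO-Latin-1 text."""

    def to_internal(self, raw):
        return raw.decode("latin-1")

    def to_external(self, text):
        return text.encode("latin-1")


class UnicodeRider(Rider):
    """16-bit Unicode text (byte order here is an assumption)."""

    def to_internal(self, raw):
        return raw.decode("utf-16-be")

    def to_external(self, text):
        return text.encode("utf-16-be")


class KSCRider(Rider):
    """Extension class: supplies only its own conversions."""

    def to_internal(self, raw):
        return raw.decode("euc_kr")

    def to_external(self, text):
        return text.encode("euc_kr")
```

A KSCRider over a byte channel then inherits write_string and read_all
unchanged; only the two conversion methods differ per encoding.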


Eric