
Revised LONGCHAR proposal



Here is the revised proposal for LONGCHAR.  

If this is acceptable, I believe Michael van Acken will implement the
compiler changes and probably module LongStrings as well, and I'll do
the documentation.  Can I get some volunteers to work on the rest of
the library changes?  


Thanks,
Eric
---

In order to support the Unicode character set, OOC adds the type
LONGCHAR and introduces the concept of long strings.  The `character
types' are now CHAR and LONGCHAR, and the `string types' are String
and LongString.


I.  Language

The basic character types are as follows:

    * CHAR      the characters of the ISO-Latin-1 (i.e., ISO-8859-1)
	        character set (0X..0FFX)

    * LONGCHAR  the characters of the Unicode character set
		(0X..0FFFFX)

The character type LONGCHAR includes the values of type CHAR
according to the following hierarchy:

    LONGCHAR >= CHAR

Character constants are denoted by the ordinal number of the
character in hexadecimal notation followed by the letter X.  The type
of a character constant is the minimal type to which the constant
value belongs.  That is, if the constant value is in the range
`0X..0FFX', its type is CHAR; otherwise, it is LONGCHAR.
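As an illustration of the typing rule above, here is a small sketch.
The module and constant names are made up for the example; only the
rule itself comes from the proposal.

```oberon
MODULE CharConstSketch;  (* hypothetical module name *)

CONST
  capA   = 41X;     (* in 0X..0FFX, so the constant is of type CHAR *)
  last8  = 0FFX;    (* largest value in the CHAR range; still CHAR *)
  first9 = 100X;    (* first value above 0FFX, so LONGCHAR *)
  hangul = 0C0ACX;  (* well above 0FFX, so LONGCHAR *)

VAR
  c:  CHAR;
  lc: LONGCHAR;

BEGIN
  c  := capA;       (* CHAR constant to CHAR variable *)
  c  := last8;
  lc := first9;     (* LONGCHAR constant to LONGCHAR variable *)
  lc := hangul
END CharConstSketch.
```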


Constant strings which consist solely of characters in the range
`0X..0FFX' and strings stored in an ARRAY OF CHAR are of type String;
all others are of type LongString.

Constant strings can be built with the string concatenation operator
`+' from any combination of character and string constants.
For example, the following is of type LongString:

  CONST
     aLongString = 0C0ACX + 0C6A9X + " " + 0C2E4X + 0D328X; 


The following predeclared function procedures support these
additional operations:

     Name        Argument type        Result type  Function
     LONG(x)     CHAR                 LONGCHAR     identity 
                 String               LongString   identity 

     LONGCHR(x)  integer type         LONGCHAR     long character with
                                                   ordinal value x

     ORD(x)      LONGCHAR             LONGINT      ordinal value of x

     SHORT(x)    LONGCHAR             CHAR         projection 
                 LongString           String       projection 


Please note:

SHORT(x), where x is of type LONGCHAR, can result in overflow, which
triggers a compilation or run-time error.  The result of an operation
that causes an overflow, but is not detected as such, is undefined.
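A minimal sketch of how these functions might be used together,
assuming they behave exactly as tabulated above.  The module name is
hypothetical.  The range test before SHORT is one way to avoid the
overflow case; it is written with ORD so that it only relies on
ordinary integer comparison.

```oberon
MODULE ConvertSketch;  (* hypothetical module name *)

VAR
  c:  CHAR;
  lc: LONGCHAR;
  n:  LONGINT;

BEGIN
  c  := 41X;
  lc := LONG(c);           (* CHAR -> LONGCHAR, identity *)
  n  := ORD(lc);           (* LONGINT ordinal value, here 65 *)

  lc := LONGCHR(0C0ACH);   (* long character with ordinal value 0C0ACH *)
  n  := ORD(lc);

  (* SHORT is only safe when the value fits into 0X..0FFX. *)
  IF n <= 0FFH THEN
    c := SHORT(lc)
  ELSE
    c := "?"               (* substitute chosen for this example only *)
  END
END ConvertSketch.
```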


The predeclared procedure COPY(x, v) also supports LongStrings.

     Name            Argument type                 Function
     COPY(x, v)      x: character array, string    v := x
                     v: character array

Note that COPY(x, v) is invalid if x is of type ARRAY OF CHAR, and v
is of type LongString or ARRAY OF LONGCHAR.
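A short sketch of COPY with both string types, assuming the extension
described above.  The module and variable names are hypothetical, and
the example deliberately stays with source and destination of the
same character width.

```oberon
MODULE CopySketch;  (* hypothetical module name *)

CONST
  aLongString = 0C0ACX + 0C6A9X + " " + 0C2E4X + 0D328X;

VAR
  wide1, wide2:     ARRAY 16 OF LONGCHAR;
  narrow1, narrow2: ARRAY 16 OF CHAR;

BEGIN
  COPY(aLongString, wide1);  (* LongString -> array of LONGCHAR *)
  COPY(wide1, wide2);        (* array of LONGCHAR -> array of LONGCHAR *)
  COPY("plain", narrow1);    (* String -> array of CHAR *)
  COPY(narrow1, narrow2)     (* array of CHAR -> array of CHAR *)
END CopySketch.
```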


String types are assignment compatible as follows:

  An expression e of type Te is assignment compatible with a variable
  v of type Tv if one of the following conditions holds:

   1. Tv is an array of LONGCHAR, Te is LongString or String, and
      LEN(e) < LEN(v); or
   2. Tv is an array of CHAR, Te is String, and LEN(e) < LEN(v).
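The two conditions can be illustrated as follows.  This is only a
sketch under the rules as stated; the names are hypothetical, and
LEN(e) is taken here to mean the number of characters in the string
constant.

```oberon
MODULE AssignSketch;  (* hypothetical module name *)

CONST
  aLongString = 0C0ACX + 0C6A9X + " " + 0C2E4X + 0D328X;  (* 5 characters *)

VAR
  wide:   ARRAY 8 OF LONGCHAR;
  narrow: ARRAY 8 OF CHAR;

BEGIN
  wide   := aLongString;  (* condition 1 with Te = LongString, 5 < 8 *)
  wide   := "abc";        (* condition 1 with Te = String, 3 < 8 *)
  narrow := "abc"         (* condition 2, 3 < 8 *)
  (* narrow := aLongString would match neither condition. *)
END AssignSketch.
```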


String types are array compatible as follows:

  An actual parameter a of type Ta is array compatible with a formal
  parameter f of type Tf if

   1. Tf is an open array of LONGCHAR and Ta is LongString, or 
   2. Tf is an open array of CHAR and Ta is String. 
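A sketch of what the array compatibility rules mean at a call site.
The procedure LongLength and the other names are invented for the
example.  Since the rules above do not make a String array compatible
with an open array of LONGCHAR, the second call converts explicitly
with LONG.

```oberon
MODULE ParamSketch;  (* hypothetical module name *)

CONST
  wideText = 0C0ACX + 0C6A9X;  (* LongString constant *)

VAR
  n: LONGINT;

(* Number of characters before the terminating 0X. *)
PROCEDURE LongLength (s: ARRAY OF LONGCHAR): LONGINT;
  VAR i: LONGINT;
BEGIN
  i := 0;
  WHILE (i < LEN(s)) & (ORD(s[i]) # 0) DO
    INC(i)
  END;
  RETURN i
END LongLength;

BEGIN
  n := LongLength(wideText);     (* rule 1: Ta is LongString *)
  n := LongLength(LONG("abc"))   (* String converted to LongString first *)
END ParamSketch.
```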


Character and string types are expression compatible as follows:

       Operator        First operand   Second operand  Result type
       = # < <= > >=   character type  character type  BOOLEAN
                       string type     string type     BOOLEAN
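For example (a sketch only; the names are hypothetical, and reading
the table as allowing a CHAR operand to be compared with a LONGCHAR
operand is my assumption):

```oberon
MODULE CompareSketch;  (* hypothetical module name *)

CONST
  sa   = 0C0ACX;            (* LONGCHAR constant *)
  word = 0C0ACX + 0C6A9X;   (* LongString constant *)

VAR
  lc:          LONGCHAR;
  name:        ARRAY 8 OF LONGCHAR;
  equal, less: BOOLEAN;

BEGIN
  lc    := sa;
  equal := lc = sa;         (* LONGCHAR compared with LONGCHAR *)
  less  := 41X < lc;        (* CHAR compared with LONGCHAR (assumed allowed) *)

  COPY(word, name);
  equal := name = word      (* long string compared with long string *)
END CompareSketch.
```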


II.  Library


The following modules would be added to support LONGCHAR and
LongStrings:

	LongStrings  

	LongRider [ABSTRACT]
	UnicodeRider  (or is UnicodeMapper a better name?)

(I removed the addition of `LongCharClass' from this proposal because
 character classification of LONGCHARs is always locale-sensitive.  I
 will bring this up again in later discussions on Locales.)


The following modules would have to be modified:

	BinaryRider (add Reader: ReadLChar,  ReadLString;
		         Writer: WriteLChar, WriteLString;

		     Note that the above would be affected by calls
		     to SetByteOrder.  Also, an option needs to be
		     added to set how a BinaryRider interprets (and
		     possibly discards) a leading U+FEFF byte order
		     mark.)

	Calendar    (procedures TimeToLStr and LStrToTime)

	Exception   (LRaise and GetLMessage)

	CSTRING would also need to be changed to deal with
	LongStrings.  (Or is it that we need to add CWIDESTRING?)

     (The following could also be changed/added to support LONGCHAR,
      but these aren't absolutely necessary:

	Integers    (procedures ConvertFromLString and
	             ConvertToLString)
	Integer/LongString Conversions
	Real/LongString Conversions)
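To make the byte order point for BinaryRider concrete, here is a rough
sketch of how two bytes might be combined into a LONGCHAR, and how a
leading U+FEFF could be used to pick the byte order.  Everything here
(module, constants, procedure names) is hypothetical and not part of
the existing BinaryRider interface.

```oberon
MODULE ByteOrderSketch;  (* hypothetical module name *)

CONST
  bigEndian*    = 0;   (* hypothetical byte order tags *)
  littleEndian* = 1;

(* Combine two bytes, in the order they were read, into one LONGCHAR. *)
PROCEDURE Decode* (b0, b1: CHAR; order: SHORTINT): LONGCHAR;
BEGIN
  (* LONG(...) widens INTEGER to LONGINT so that the product cannot
     overflow INTEGER. *)
  IF order = bigEndian THEN
    RETURN LONGCHR(LONG(ORD(b0)) * 256 + ORD(b1))
  ELSE
    RETURN LONGCHR(LONG(ORD(b1)) * 256 + ORD(b0))
  END
END Decode;

(* Inspect the first two bytes of a stream.  If they form U+FEFF in
   either order, return that order; otherwise return `fallback'. *)
PROCEDURE DetectOrder* (b0, b1: CHAR; fallback: SHORTINT): SHORTINT;
BEGIN
  IF (b0 = 0FEX) & (b1 = 0FFX) THEN
    RETURN bigEndian
  ELSIF (b0 = 0FFX) & (b1 = 0FEX) THEN
    RETURN littleEndian
  ELSE
    RETURN fallback
  END
END DetectOrder;

END ByteOrderSketch.
```

Whether the mark is then discarded or passed on to the caller is
exactly the option mentioned above; the sketch leaves that open.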
	


III.  Miscellaneous

      Other mapper classes can be added (as time permits) to handle
      additional 8- and 16-bit encodings.  These classes map from
      another encoding (e.g., "KSC5601", a standard Korean character
      encoding) to Unicode or Latin-1 (as appropriate), and vice
      versa.  Here "encoding" means both the encoding of n-bit values
      in byte streams, and translation of character codes between the
      two standards.

      A possible class hierarchy is as follows:

  
                      Rider [ABSTRACT]
                      /    \
                    /        \
                  /            \
             TextRider        LongRider [ABSTRACT]
                /\                 |
              /    \               |
            /        \             |
        Cp037Rider (other       UnicodeRider
                    8-bit           /\
                    encodings)    /    \
                                /        \
                            KSCRider    (other 16-bit encodings)



      `Rider' is the abstract class that defines the Reader, Writer,
      and Scanner interfaces as currently implemented in `TextRider'.
      Modules In, Out, and Err would be defined relative to this
      class.

      `LongRider' adds LONGCHAR and LongString support.

      Input/output of TextRider is ISO-Latin-1, and likewise the I/O
      of UnicodeRider is Unicode.