
LONGCHAR proposal




The following is a proposal to add the type LONGCHAR to OOC, which has been
discussed here very briefly before.  Please feel free to critique it.


Thanks,
Eric

---

In order to support the Unicode character set, OOC adds the type
LONGCHAR and introduces the concept of long strings.  The `character
types' are now CHAR and LONGCHAR, and the `string types' are String
and LongString.


I.  Language

The basic character types are as follows:

    * CHAR      the characters of the ISO-Latin-1 (i.e., ISO-8859-1)
	        character set (0X..0FFX)

    * LONGCHAR  the characters of the Unicode character set
		(0X..0FFFFX)

The character type LONGCHAR includes the values of type CHAR
according to the following hierarchy:

    LONGCHAR >= CHAR

Character constants are denoted by the ordinal number of the
character in hexadecimal notation followed by the letter X.  The type
of a character constant is the minimal type to which the constant
value belongs.  (That is, if the constant value is in the range
`0X..0FFX', its type is CHAR; otherwise, it is LONGCHAR.)
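
As a rough sketch of how this rule would classify constants (the
module and constant names are made up for illustration, and the last
assignment assumes that the inclusion LONGCHAR >= CHAR also makes CHAR
values assignable to LONGCHAR variables, which the proposal does not
spell out):

```
MODULE CharConsts;

CONST
  latinA    = 41X;     (* in 0X..0FFX, so its type is CHAR *)
  lastShort = 0FFX;    (* largest value in 0X..0FFX; still CHAR *)
  firstLong = 100X;    (* outside 0X..0FFX, so its type is LONGCHAR *)
  hangulSa  = 0C0ACX;  (* LONGCHAR; leading 0 because the number starts with a letter *)

VAR
  c: CHAR;
  lc: LONGCHAR;

BEGIN
  c := latinA;
  lc := hangulSa;
  lc := latinA         (* assumed legal because LONGCHAR includes CHAR *)
END CharConsts.
```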


Constant strings which consist solely of characters in the range
`0X..0FFX' and strings stored in an array of CHAR are of type String;
all others are of type LongString.

(LongString constants need a means of representing Unicode
 character values.  Java does it with escape sequences such as
 "\uc0ac\uc6a9".  What would be an equivalent Oberon-like way to do
 this?  How does Component Pascal handle this?  I'd guess that, since
 CP has a string concatenation operator `+', you could write
 0C0ACX+0C6A9X to represent the above Java string.)


The following predeclared function procedures support these
additional operations:

       Name      Argument type        Result type  Function
       LONG(x)   CHAR                 LONGCHAR     identity 
                 String               LongString   identity 

       SHORT(x)  LONGCHAR             CHAR         projection 
                 LongString           String       projection 
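
A minimal sketch of how LONG and SHORT might be used on character
values (the module name is hypothetical, and what SHORT yields for a
value outside `0X..0FFX' is left open by the table above):

```
MODULE LongShort;

VAR
  c: CHAR;
  lc: LONGCHAR;

BEGIN
  c := "A";
  lc := LONG(c);     (* identity: same ordinal value 41X, now of type LONGCHAR *)
  c := SHORT(lc);    (* projection: the value lies in 0X..0FFX, so c is "A" again *)

  lc := 0C0ACX;
  c := SHORT(lc)     (* value outside 0X..0FFX; the result is not specified here *)
END LongShort.
```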


String types are assignment compatible as follows:

  An expression e of type Te is assignment compatible with a variable
  v of type Tv if one of the following conditions holds:

   1. Tv is an array of LONGCHAR, Te is LongString or String, and
      LEN(e) < LEN(v), or
   2. Tv is an array of CHAR, Te is String, and LEN(e) < LEN(v).
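
The two rules could play out roughly as follows (a sketch only; the
module and variable names are invented, and LONG("abc") stands in for
a LongString because no literal syntax for one has been settled):

```
MODULE AssignDemo;

VAR
  s: ARRAY 8 OF CHAR;
  ls: ARRAY 8 OF LONGCHAR;

BEGIN
  s := "abc";                  (* rule 2: String, 3 characters < LEN(s) = 8 *)
  ls := "abc";                 (* rule 1: a String may go into an array of LONGCHAR *)
  ls := LONG("abc");           (* rule 1: LongString into an array of LONGCHAR *)
  s := SHORT(LONG("abc"))      (* SHORT turns the LongString back into a String, so rule 2 applies *)

  (* Rejected under these rules:
       s := LONG("abc")        LongString into an array of CHAR
       s := "abcdefgh"         8 characters, not less than LEN(s)     *)
END AssignDemo.
```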


String types are array compatible as follows:

  An actual parameter a of type Ta is array compatible with a formal
  parameter f of type Tf if

   1. Tf is an open array of LONGCHAR and Ta is LongString, or 
   2. Tf is an open array of CHAR and Ta is String. 
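
To illustrate, here is a sketch with two hypothetical procedures, one
per open array type.  Read literally, the rules above do not make a
String array compatible with an open array of LONGCHAR (unlike
assignment rule 1), so the sketch widens the argument with LONG; that
reading is my assumption, not something the rules state outright.

```
MODULE ParamDemo;

VAR
  n: LONGINT;

PROCEDURE Length (s: ARRAY OF CHAR): LONGINT;
  VAR i: LONGINT;
BEGIN
  i := 0;
  WHILE (i < LEN(s)) & (s[i] # 0X) DO INC(i) END;
  RETURN i
END Length;

PROCEDURE LongLength (s: ARRAY OF LONGCHAR): LONGINT;
  VAR i: LONGINT;
BEGIN
  i := 0;
  (* comparing a LONGCHAR with the CHAR constant 0X is expression compatible *)
  WHILE (i < LEN(s)) & (s[i] # 0X) DO INC(i) END;
  RETURN i
END LongLength;

BEGIN
  n := Length("abc");             (* rule 2: String to open array of CHAR *)
  n := LongLength(LONG("abc"))    (* rule 1: LongString to open array of LONGCHAR *)
END ParamDemo.
```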


Character and string types are expression compatible as follows:

       Operator        First operand   Second operand  Result type
       = # < <= > >=   character type  character type  BOOLEAN
                       string type     string type     BOOLEAN


II.  Library


The following modules would be added to support LONGCHAR and
LongStrings:

	LongCharClass
	LongStrings  

(I have an issue with `LongCharClass' because OOC already provides
 character classification routines in two places: CharClass and
 LocStrings.  The difference is that CharClass is not locale
 sensitive.  Are we going to need two additional separate modules,
 LongCharClass and LocLongStrings?)


The following modules would have to be modified:

	Channel     (add Channel: LErrorDescr;
	                 Reader:  LErrorDescr;
	                 Writer:  LErrorDescr)
	Files       (add LErrorDescr, LNew, LOld, LTmp, LSetModTime,
	                 LGetModTime, LExists;
	                 File:    LErrorDescr;
	                 Reader:  LErrorDescr;
	                 Writer:  LErrorDescr)

	BinaryRider (add Reader: ReadLChar, LErrorDescr, ReadLString;
		         Writer: WriteLChar, LErrorDescr, WriteLString)

	TextRider   (This will require the most significant changes,
		     and will need at least some discussion as to
		     what should be done.  Some things to think
		     about:

		     1) We could just add methods like
		        reader.ReadLChar and reader.ReadLString, but
		        what about things like reader.ReadInt?  How
		        would they be set to read from Unicode
		        streams?  Should there be methods like
		        reader.LongReadInt?

		     2) Or should there be a separate module
		        LongTextRider?  And if so, how would that
		        affect modules In, Out, Err, and Log?  Could
		        we make it so that TextRider.reader is
		        interchangeable with LongTextRider.reader?)

	In, Out, Err, Log (dependent on what happens to TextRider)

	Calendar    (procedures TimeToLStr and LStrToTime)

	Exception   (LRaise and GetLMessage)


     (The following could also be changed/added to support LONGCHAR,
      but these aren't absolutely necessary:

	Integers    (procedures ConvertFromLString and
	             ConvertToLString)
	Integer/LongString Conversions
	Real/LongString Conversions)
	


III.  Misc

     Unicode is not the only international standard for character
     encoding; for example, a standard Korean character encoding is
     "KSC5601".  When manipulating non-Unicode text data, a program
     would convert the data into Unicode, perform its processing, and
     then convert the result back to an external character encoding.

     OOC could provide conversion I/O routines that convert between
     various external encodings and the internal one (Unicode).  This
     could be done using a "layered channel" approach:

     Ex:

	VAR
	  file: Files.File;
	  convertChannel: Converters.Channel;
	...

	file := Files.Old(fname, {Files.read, Files.write}, res);
	...

	convertChannel := Converters.Attach(file, "KSC5601");
	(* when reading, converts from "KSC5601" to Unicode,
	   when writing, converts from Unicode to "KSC5601".
	 *)