LONGCHAR proposal
The following is a proposal to add the type LONGCHAR to OOC, which has been
discussed here very briefly before. Please feel free to critique it.
Thanks,
Eric
---
In order to support the Unicode character set, OOC adds the type
LONGCHAR and introduces the concept of long strings. The `character
types' are now CHAR and LONGCHAR, and the `string types' are String
and LongString.
I. Language
The basic character types are as follows:
* CHAR the characters of the ISO-Latin-1 (i.e., ISO-8859-1)
character set (0X..0FFX)
* LONGCHAR the characters of the Unicode character set
(0X..0FFFFX)
The character type LONGCHAR includes the values of type CHAR
according to the following hierarchy:
LONGCHAR >= CHAR
Character constants are denoted by the ordinal number of the
character in hexadecimal notation followed by the letter X. The type
of a character constant is the minimal type to which the constant
value belongs (i.e., if the constant value is in the range
`0X..0FFX', its type is CHAR; otherwise, it is LONGCHAR).
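To make the "minimal type" rule concrete, here is a small sketch in Python (used only as an executable notation, not as anything proposed for OOC). The function name char_constant_type is made up for illustration, and the rejection of values above 0FFFFX is my reading of the ranges given above.

```python
HEX_DIGITS = "0123456789ABCDEF"


def char_constant_type(literal: str) -> str:
    """Classify a character constant such as '41X' or '0C0ACX'.

    Hypothetical helper: it only models the rule that a constant gets
    the smallest character type that contains its value.
    """
    if len(literal) < 2 or not literal.endswith("X"):
        raise ValueError("expected hex digits followed by X: " + literal)
    digits = literal[:-1]
    # Oberon-style hex constants start with a decimal digit and use A..F.
    if not digits[0].isdigit() or any(d not in HEX_DIGITS for d in digits):
        raise ValueError("malformed character constant: " + literal)
    value = int(digits, 16)
    if value <= 0xFF:
        return "CHAR"        # 0X..0FFX
    if value <= 0xFFFF:
        return "LONGCHAR"    # 100X..0FFFFX
    raise ValueError("value outside 0X..0FFFFX: " + literal)
```

With this reading, 41X and 0FFX are CHAR constants, while 100X and 0C0ACX are LONGCHAR constants.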
Constant strings which consist solely of characters in the range
`0X..0FFX' and strings stored in an array of CHAR are of type String;
all others are of type LongString.
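The same idea applied to constant strings can be sketched as follows. This is Python used as an executable notation; string_constant_type is a made-up name, and treating the empty string as a String is my assumption, since the text does not mention it.

```python
def string_constant_type(text: str) -> str:
    """Return 'String' or 'LongString' for a constant string.

    Hypothetical helper modelling the rule above: a constant is a
    String only if every character lies in 0X..0FFX.
    """
    ordinals = [ord(ch) for ch in text]
    if any(o > 0xFFFF for o in ordinals):
        # The proposal only covers LONGCHAR values 0X..0FFFFX.
        raise ValueError("character outside 0X..0FFFFX")
    if all(o <= 0xFF for o in ordinals):
        return "String"      # also covers the empty string (assumption)
    return "LongString"
```

A single character above 0FFX anywhere in the constant is enough to make the whole constant a LongString.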
(LongString constants need to have a means of representing Unicode
character values. Java does it using escape sequences such as
"\uc0ac\uc6a9". What would be an equivalent Oberon-like way to do
this? How does Component Pascal handle this? I'd guess that, since
CP has a string concatenation operator `+', you could write
0C0ACX+0C6A9X to represent the above Java string.)
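For illustration, the mapping from the Java escape sequence to the guessed concatenation form can be computed mechanically. This is a Python sketch; the function names are made up, and the `+' notation is only the guess stated above, not confirmed Component Pascal or OOC syntax.

```python
def to_char_constant(ch: str) -> str:
    """Spell one character as an Oberon-style constant, e.g. '0C0ACX'."""
    digits = format(ord(ch), "X")      # uppercase hex, no prefix
    if not digits[0].isdigit():
        digits = "0" + digits          # hex constants must start with 0..9
    return digits + "X"


def to_concatenation(text: str) -> str:
    """Join the constants with '+', following the guess in the text above."""
    return "+".join(to_char_constant(ch) for ch in text)
```

Applied to the Java string "\uc0ac\uc6a9", to_concatenation yields 0C0ACX+0C6A9X.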
The following predeclared function procedures support these
additional operations:
Name      Argument type   Result type   Function
LONG(x)   CHAR            LONGCHAR      identity
          String          LongString    identity
SHORT(x)  LONGCHAR        CHAR          projection
          LongString      String        projection
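A rough model of LONG and SHORT, in Python as an executable notation. The proposal only says "projection" for SHORT; keeping the low 8 bits of each character is purely my assumption here, and other definitions (such as raising an error for values above 0FFX) would be equally plausible.

```python
def long_value(x: str) -> str:
    """Model of LONG(x): every CHAR value is already a LONGCHAR value."""
    if any(ord(ch) > 0xFF for ch in x):
        raise ValueError("argument is not of type CHAR/String")
    return x                     # identity


def short_value(x: str) -> str:
    """Model of SHORT(x).

    ASSUMPTION: "projection" is taken to mean keeping the low 8 bits of
    each character. The proposal does not define it more precisely.
    """
    return "".join(chr(ord(ch) & 0xFF) for ch in x)
```

Under this assumption, SHORT(LONG(x)) = x for every CHAR or String value, while SHORT loses information for anything above 0FFX (for example, 141X would become 41X).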
String types are assignment compatible as follows:
An expression e of type Te is assignment compatible with a variable
v of type Tv if one of the following conditions holds:
1. Tv is an array of LONGCHAR, Te is LongString or String, and
LEN(e) < LEN(v);
2. Tv is an array of CHAR, Te is String, and LEN(e) < LEN(v).
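The two conditions can be written out as a predicate. This is a Python sketch with made-up names; types are represented as plain strings, and the strict inequality presumably leaves room for the terminating 0X, as with ordinary Oberon-2 strings.

```python
def assignment_compatible(tv_element: str, len_v: int,
                          te: str, len_e: int) -> bool:
    """Check the string assignment rule above.

    tv_element: element type of the destination array ('CHAR'/'LONGCHAR')
    len_v:      LEN(v), the length of the destination array
    te:         type of the expression ('String' or 'LongString')
    len_e:      LEN(e), the length of the string expression
    """
    if len_e >= len_v:
        return False             # LEN(e) < LEN(v) is required in both cases
    if tv_element == "LONGCHAR":
        return te in ("String", "LongString")    # condition 1
    if tv_element == "CHAR":
        return te == "String"                    # condition 2
    return False
```

Note that a LongString is never assignable to an array of CHAR without an explicit SHORT.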
String types are array compatible as follows:
An actual parameter a of type Ta is array compatible with a formal
parameter f of type Tf if
1. Tf is an open array of LONGCHAR and Ta is LongString, or
2. Tf is an open array of CHAR and Ta is String.
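The array compatibility rule is even simpler when written as a predicate (Python sketch, made-up names, types as plain strings):

```python
# Pairs (element type of the open array formal, type of the actual)
# accepted by the rule above.
_ARRAY_COMPATIBLE = {
    ("LONGCHAR", "LongString"),   # condition 1
    ("CHAR", "String"),           # condition 2
}


def array_compatible(tf_element: str, ta: str) -> bool:
    """Check whether an actual of type ta fits an open array formal."""
    return (tf_element, ta) in _ARRAY_COMPATIBLE
```

Read literally, this means a String actual is not accepted for an open array of LONGCHAR formal, unlike in the assignment rule; whether that asymmetry is intended is open for discussion.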
Character and string types are expression compatible as follows:
Operator         First operand    Second operand   Result type
= # < <= > >=    character type   character type   BOOLEAN
                 string type      string type      BOOLEAN
II. Library
The following modules would be added to support LONGCHAR and
LongStrings:
LongCharClass
LongStrings
(I have an issue with `LongCharClass' because OOC already provides
character classification routines in two places: CharClass and
LocStrings. The difference is that CharClass is not
locale-sensitive. Are we going to need two additional separate
modules for LongCharClass and LocLongStrings?)
The following modules would have to be modified:
Channel       (add Channel: LErrorDescr;
                   Reader: LErrorDescr;
                   Writer: LErrorDescr)
Files         (add LErrorDescr, LNew, LOld, LTmp, LSetModTime,
                   LGetModTime, LExists;
                   File: LErrorDescr;
                   Reader: LErrorDescr;
                   Writer: LErrorDescr)
BinaryRider   (add Reader: ReadLChar, LErrorDescr, ReadLString;
                   Writer: WriteLChar, LErrorDescr, WriteLString)
TextRider     (This will require the most significant changes
               and will need at least some discussion as to
               what should be done. Some things to think
               about:
               1) We could just add methods like
                  reader.ReadLChar and reader.ReadLString, but
                  what about things like reader.ReadInt? How
                  would they be set to read from Unicode
                  streams? Should there be methods like
                  reader.LongReadInt?
               2) Or should there be a separate module
                  LongTextRider? If so, how would that
                  affect modules In, Out, Err, and Log? Could
                  we make it so that TextRider.reader is
                  interchangeable with LongTextRider.reader?)
In, Out, Err, Log  (dependent on what happens to TextRider)
Calendar (procedures TimeToLStr and LStrToTime)
Exception (LRaise and GetLMessage)
(The following could also be changed/added to support LONGCHAR,
but these aren't absolutely necessary:
Integers (procedures ConvertFromLString and
ConvertToLString)
Integer/LongString Conversions
Real/LongString Conversions)
III. Miscellaneous
Unicode is not the only international standard for character
encoding; for example, a standard Korean character encoding is
"KSC5601". When manipulating non-Unicode text data, a program
would convert the data into Unicode, perform its processing, and
then convert the result back to an external character encoding.
OOC could provide conversion I/O routines that convert between
various external encodings and the internal one (Unicode). This
could be done using a "layered channel" approach:
Example:
  VAR
    file: Files.File;
    convertChannel: Converters.Channel;
  ...
  file := Files.Old(fname, {Files.read, Files.write}, res);
  ...
  convertChannel := Converters.Attach(file, "KSC5601");
  (* When reading, converts from "KSC5601" to Unicode;
     when writing, converts from Unicode to "KSC5601". *)
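For comparison, here is how the same layering looks in a language that already has it. This is a Python sketch, not a design for Converters: attach, write_text and read_text are made-up names, an in-memory byte buffer stands in for the file, and I am assuming that Python's "euc_kr" codec is an acceptable stand-in for the "KSC5601" encoding mentioned above.

```python
import io


def attach(raw, encoding):
    """Layer a converting text channel over a byte channel.

    Hypothetical analogue of Converters.Attach: text written to the
    returned channel is encoded on its way to `raw`, and bytes read
    from `raw` are decoded to Unicode.
    """
    return io.TextIOWrapper(raw, encoding=encoding, newline="")


def write_text(text, encoding):
    """Write Unicode text through the layered channel; return raw bytes."""
    raw = io.BytesIO()           # stands in for the underlying file
    channel = attach(raw, encoding)
    channel.write(text)
    channel.flush()              # push the encoded bytes down to `raw`
    channel.detach()             # leave `raw` open after the layer is gone
    return raw.getvalue()


def read_text(data, encoding):
    """Read raw bytes through the layered channel; return Unicode text."""
    return attach(io.BytesIO(data), encoding).read()
```

The program itself only ever sees Unicode text. Writing the two-character string "\uc0ac\uc6a9" through the "euc_kr" layer produces four bytes in the external encoding, and reading those bytes back through the same kind of layer gives the original string again.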