Re: LONGCHAR proposal
"Eric W. Nikitin" <enikitin@apk.net> writes:
> In response to various, the section of the proposal on predeclared
> function procedures now looks like:
>
> ---
> The following predeclared function procedures support these
> additional operations:
>
> Name Argument type Result type Function
> LONG(x) CHAR LONGCHAR identity
> String LongString identity
>
> CAP(x) LONGCHAR LONGCHAR if x is a letter,
> corresponding capital
> letter;
> otherwise, identity
Do we restrict CAP to the ASCII range, i.e., implement it as a mapping
of [a-z] to [A-Z], and identity otherwise, or do we deal with the
whole set of characters that have a corresponding upper case
equivalent?
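
As a rough illustration of the two options (in Python rather than
Oberon-2; `cap_ascii` and `cap_full` are hypothetical names, not part
of the proposal), the difference could look like this. The second
variant assumes that CAP must return exactly one 16-bit character, so
it falls back to identity when the upper case form is longer.

```python
def cap_ascii(ch: str) -> str:
    """Option 1: map a-z to A-Z, identity for everything else."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - 32)  # ASCII upper case is 32 below lower case
    return ch


def cap_full(ch: str) -> str:
    """Option 2: use the upper case equivalent wherever one exists.

    Assumption: the result has to be a single 16-bit character, so a
    mapping such as sharp s -> "SS" is treated as identity.
    """
    up = ch.upper()
    if len(up) == 1 and ord(up) <= 0xFFFF:
        return up
    return ch
```

With these definitions `cap_ascii("é")` leaves the character alone,
while `cap_full("é")` yields "É"; both leave digits and punctuation
unchanged.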
> LONGCHR(x) integer type LONGCHAR long character with
> ordinal value x
>
> ORD(x) LONGCHAR LONGINT ordinal value of x
>
> SHORT(x) LONGCHAR CHAR projection
This will lose the most significant byte of the 2-byte Unicode
character, so any character outside the ISO-Latin-1 range cannot be
represented in the result.
> LongString String projection
> ---
>
> Does anything special need to be said about COPY? Does the current
> definition cover everything? That is, does the following need to
> change at all? Or does a note need to be added afterwards about
> LongStrings?
>
> Name Argument type Function
> COPY(x, v) x: character array, string v := x
> v: character array
Invalid:
COPY(ARRAY OF CHAR, ARRAY OF LONGCHAR) or
COPY (ARRAY OF CHAR, LongString)
All other variations are legal.
> I'll concede this for now. Although for full internationalization
> support, you'd somehow have to be able to access files with long
> character names. Can URLs have Unicode in them? If so, maybe this
> could be handled in some sort of URL resolver channel class (later
> on). I also reserve the right to bring this up again when we get
> around to discussing Locales.
Agreed.
> > What encoding should the additions to BinaryRider (ReadLChar &
> > friends) use? There are a _lot_ of Unicode encodings, and I really
> > don't want to burden `BinaryRider' with an intimate knowledge of them
> > all.
>
> Let me try to clear something up here.
>
> You have to be careful when using the word ``encoding'' because it is
> comprised of two related concepts. One is the ``character set''
> itself; that is, a set of values that map to particular characters.
> It also refers to the ``data format'', which is the way each
> character value is actually represented.
>
> So, for ``ASCII encoding'' a minimum of 7 bits is required to
> represent its 128 values. But it could be formatted as 7-bit ASCII,
> or 8-bit ASCII, or 9-bit ASCII, or...
>
> Unicode works the same way. There is a single Unicode character set,
> which requires a minimum of 16 bits to represent all characters, but
> it can be formatted in a number of different representations. The
> two I am most familiar with are UCS-2 and UTF-8. UCS-2 is a 16-bit
> fixed width format, whereas UTF-8 is an 8-bit variable width format
> requiring from one to four bytes to represent each character.
>
> The LONGCHAR proposal suggests that OOC adopt exactly two character
> encodings for internal use: 8-bit fixed width ISO-Latin-1 for CHAR
> and 16-bit fixed width Unicode for LONGCHAR. This simplifies all
> locale sensitive operations, and when dealing with other encodings,
> the only thing you need to worry about is conversion when reading
> from or writing to external data sources.
>
> Thus, BinaryRider needs to know only about a single Unicode encoding
> because its function is to mirror OOC's internal data format.
Actually, it needs two. Writing fixed width 16-bit Unicode can put
the least significant or the most significant byte first. Both
variants are explicitly permitted. To distinguish the two, it is
possible to prefix the binary file with the byte order mark U+FEFF.
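
A minimal sketch of how the byte order mark distinguishes the two
variants, in Python rather than Oberon-2. The helper names
`write_lchars` and `read_lchars` are made up for this example and are
not the proposed BinaryRider interface; the sketch also assumes the
mark is always present, which the text above only describes as
possible.

```python
import struct

BOM = 0xFEFF  # byte order mark; the swapped value 0xFFFE is not a character


def write_lchars(values, big_endian=True):
    """Serialize 16-bit character values, prefixed with the byte order mark."""
    fmt = ">H" if big_endian else "<H"
    return b"".join(struct.pack(fmt, v) for v in [BOM, *values])


def read_lchars(data):
    """Read 16-bit character values, using the leading mark to pick the order."""
    if len(data) < 2 or len(data) % 2 != 0:
        raise ValueError("expected an even number of bytes, at least two")
    mark = data[:2]
    if mark == b"\xfe\xff":
        fmt = ">H"  # most significant byte first
    elif mark == b"\xff\xfe":
        fmt = "<H"  # least significant byte first
    else:
        raise ValueError("missing byte order mark")
    return [struct.unpack(fmt, data[i:i + 2])[0] for i in range(2, len(data), 2)]
```

A reader built this way accepts either byte order, because the first
two bytes tell it how to interpret the rest.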
> With respect to TextRider and Misc:
>
> Stewart Greenhill wrote:
> >I think approach (1) is probably the simpler and would be easier to
> >maintain since it would not require duplication of the majority of the code.
> >We could regard the "character width" as simply an attribute of the text.
> >Thus, there would be only one TextRider class, but it would deal with
> >either 1-byte or 2-byte characters. Character types should be able to be
> >used interchangeably, provided that they are representable.
>
> Michael van Acken:
> > Why bother to create a single TextRider module for all encodings? We
> > have a clear separation of concerns in the I/O section of the library.
> > Channels deal with byte streams, mappers convert from bytes to
> > meaningful data. I strongly suggest we create a `UnicodeRider'
> > module, akin to `TextRider', that does conversion from and to Unicode
> > texts. It should deal with the various encodings like UTF-7, UTF-8,
> > and little-endian/big-endian 16 bit. (Are there more variants?)
>
> If you want a separate module for each encoding (and there would be a
> lot of them, see the following for those supported by Java:
> http://www.javasoft.com/products/jdk/1.1/docs/guide/intl/encoding.doc.html
> )
Actually I meant that UnicodeRider would understand the most common
Unicode encoding (transformation format) variants. Mapping between
different character sets, like JIS0208 to Unicode, should be done by
other modules.
> Then I suggest we try to build this as a class hierarchy, possibly
> with an abstract class at the base. Then, once the two
> basic mapper classes, say TextRider and UnicodeRider, are built, any
> extension class like KSCRider would simply have to supply conversions
> and the rest of the behavior would be the same.
IMO code reuse is a non-topic here. We can write a rider class that
siphons every single character through a state machine to do the
decoding and conversion, but the performance penalty would be severe.
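
To make concrete what "every single character through a state
machine" means, here is a rough sketch of a byte-at-a-time UTF-8
decoder in Python (not Oberon-2). `Utf8Decoder` and `decode` are
hypothetical names, and the sketch assumes 16-bit characters only, so
four-byte sequences are rejected. It does not check for overlong
three-byte forms or surrogate values, and it says nothing about actual
timings.

```python
class Utf8Decoder:
    """Byte-at-a-time UTF-8 decoder limited to 16-bit character values."""

    def __init__(self):
        self.pending = 0  # continuation bytes still expected
        self.value = 0    # character value accumulated so far

    def feed(self, byte):
        """Consume one byte; return a character value once complete, else None."""
        if self.pending == 0:
            if byte < 0x80:               # one-byte sequence (ASCII)
                return byte
            if 0xC2 <= byte <= 0xDF:      # lead byte of a two-byte sequence
                self.pending, self.value = 1, byte & 0x1F
                return None
            if 0xE0 <= byte <= 0xEF:      # lead byte of a three-byte sequence
                self.pending, self.value = 2, byte & 0x0F
                return None
            raise ValueError("unsupported or invalid lead byte")
        if byte & 0xC0 != 0x80:
            self.pending = 0
            raise ValueError("expected a continuation byte")
        self.value = (self.value << 6) | (byte & 0x3F)
        self.pending -= 1
        return self.value if self.pending == 0 else None


def decode(data):
    """Run every byte of `data` through the state machine."""
    decoder = Utf8Decoder()
    result = []
    for byte in data:
        value = decoder.feed(byte)
        if value is not None:
            result.append(value)
    if decoder.pending:
        raise ValueError("truncated sequence at end of input")
    return result
```

Every input byte costs one call and several comparisons here, which is
the per-character overhead in question.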
-- mva