Thanks, Erik. I agree with all of that, except for the term "glyph",
which isn't quite right -- the correct term is "character". There's a
good glossary, as well as a discussion of the changes made to Java to
support Unicode supplementary characters, and the issues involved, at
It's highly recommended reading for anyone interested in this discussion.
At the moment, my concern is mostly with neither UTF-8 nor UTF-32 but
with UTF-16, because that is the encoding used by java.lang.String
and, as so many people have pointed out, Java-Scala interoperability
is essential. I think perhaps the lowest-impact
approach would be to provide a Unicode view of Strings, so for
instance one could do
All that would take is an implicit conversion from a String to a
UnicodeView, and I have written code to do that. It would be nice to
be able to do
and I have written that code too, but it makes isLetter an implicit
method for any Int, which may not be desirable.
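
To make the idea concrete, here is a rough sketch of what such a view might look like. It is not the code referred to above: the names UnicodeSketch, UnicodeView, unicodeLength, unicodeAt, toCodePoints and CodePointOps are placeholders chosen for illustration, and only the java.lang.String and java.lang.Character calls are real library methods. The view methods are deliberately named so they do not clash with members String already has, since an implicit conversion is only tried when the method is missing.

```scala
import scala.language.implicitConversions

object UnicodeSketch {
  // Hypothetical wrapper: treats a UTF-16 String as a sequence of
  // Unicode code points rather than 16-bit chars.
  final class UnicodeView(val underlying: String) {
    // Number of code points (a surrogate pair counts once).
    def unicodeLength: Int =
      underlying.codePointCount(0, underlying.length)

    // Code point at the given code point index. This is O(n),
    // because the char offset has to be found by scanning.
    def unicodeAt(index: Int): Int =
      underlying.codePointAt(underlying.offsetByCodePoints(0, index))

    // All code points in order, as a UTF-32 style Seq[Int].
    def toCodePoints: Seq[Int] = {
      val buf = scala.collection.mutable.ListBuffer.empty[Int]
      var i = 0
      while (i < underlying.length) {
        val cp = underlying.codePointAt(i)
        buf += cp
        i += Character.charCount(cp) // 1 char, or 2 for a surrogate pair
      }
      buf.toList
    }
  }

  // The implicit conversion from String to the view.
  implicit def stringToUnicodeView(s: String): UnicodeView =
    new UnicodeView(s)

  // Hypothetical wrapper giving every Int an isLetter method.
  // This is the part that may be too intrusive.
  final class CodePointOps(val codePoint: Int) {
    def isLetter: Boolean = Character.isLetter(codePoint)
  }

  implicit def intToCodePointOps(cp: Int): CodePointOps =
    new CodePointOps(cp)
}
```

With `import UnicodeSketch._` in scope, a string such as "a\uD801\uDC00b" (the middle character is U+10400, outside the BMP) has length 4 as a String but unicodeLength 3, and unicodeAt(1).isLetter goes through the Int conversion.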
> On Thu, Dec 23, 2010 at 11:50:00PM -0500, Arya Irani wrote:
>> What are some performance considerations that should be kept in mind when
>> implementing a UTF string library?
>> Must a UTF8 string be stored as an Array[Byte]? What about Seq[Seq[Byte]]
>> or Array[Seq[Byte]] or Array[Array[Byte]], where each element represents a
>> code point?
> Just to be clear: I think Jim wants a Unicode string library rather
> than just a UTF-8 string library. Unicode is a specification which
> assigns numbers to glyphs, whereas UTF-8 is a particular method of
> storing strings of Unicode glyphs as bytes.
> You can use Seq[Int] (which corresponds to the UTF-32 encoding) to
> correctly represent all existing Unicode glyphs at the cost of
> increased memory usage. For instance, "cat" takes 12 bytes when
> represented as Array[Int] (UTF-32) but 3 bytes when represented as
> Array[Byte] (UTF-8).
> UTF-8 uses a variable number of bits to reduce memory usage (in essence
> a simple form of compression), but this complicates code which wants to
> handle the string in terms of glyphs (for instance the number of bytes
> in a UTF-8 string will often differ from the number of Unicode glyphs).
> With the Seq[Int] representation the length of the sequence is the same
> as the number of glyphs.
> I think the ideal would be to have a library which can deal with (at
> least) two representations of a Unicode string: UTF-32 (Seq[Int]) and
> UTF-8 (Seq[Byte]). One could use the former for simplicity and speed,
> and the latter when trying to conserve memory and for I/O.
> None of this is profound, but I thought it would be useful to make the
> distinction between Unicode and UTF-8 (the two are often conflated).
> -- Erik