Unicode issues

Jim Balter-2
[Moved to scala-debate per Tony's suggestion]

On Thu, Dec 23, 2010 at 8:50 PM, Arya Irani <[hidden email]> wrote:
> Ok, so...
>
> What are some performance considerations that should be kept in mind when
> implementing a UTF string library?
> Must a UTF8 string be stored as an Array[Byte]?  What about Seq[Seq[Byte]]
> or Array[Seq[Byte]] or Array[Array[Byte]], where each element represents a
> code point?
> -Arya

The point of UTF-8 is space efficiency; the latter three
representations would defeat that and are inferior in all respects to
an Array[Int] (UTF-32) representation. Each of the three standard
representations has some benefit that the others don't:

UTF-8: space
UTF-16: interoperability with Java and existing Scala code/libraries
UTF-32: speed

There's no point in any other representation.
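
To make the trade-off concrete, here is a minimal sketch that holds the same short text in each of the three representations. The object name EncodingSizes and the sample text are made up for illustration, and it assumes a JVM recent enough (Java 8 or later) to provide codePoints() on strings.

```scala
import java.nio.charset.StandardCharsets

object EncodingSizes {
  // Sample text (an assumption for illustration): ASCII 'a', U+00E9,
  // and one supplementary code point, U+1D11E.
  val text: String = "a\u00e9" + new String(Character.toChars(0x1D11E))

  // UTF-8: variable width, 1 to 4 bytes per code point.
  val utf8: Array[Byte] = text.getBytes(StandardCharsets.UTF_8)

  // UTF-16: what java.lang.String uses internally; a supplementary
  // code point occupies two Chars (a surrogate pair).
  val utf16: Array[Char] = text.toCharArray

  // UTF-32: one Int per code point, so indexing by code point is a
  // plain array lookup.
  val utf32: Array[Int] = text.codePoints().toArray

  def main(args: Array[String]): Unit = {
    println(s"UTF-8:  ${utf8.length} bytes")
    println(s"UTF-16: ${utf16.length} chars (${utf16.length * 2} bytes)")
    println(s"UTF-32: ${utf32.length} ints (${utf32.length * 4} bytes)")
  }
}
```

Only in the Array[Int] form does element i correspond directly to code point i; in the other two, finding the i-th code point means scanning from the start (or keeping a side index).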

-- Jim

Re: Unicode issues

Jim Balter-2
On Thu, Dec 23, 2010 at 10:01 PM, Jim Balter <[hidden email]> wrote:

> [Moved to scala-debate per Tony's suggestion]
>
> On Thu, Dec 23, 2010 at 8:50 PM, Arya Irani <[hidden email]> wrote:
>> Ok, so...
>>
>> What are some performance considerations that should be kept in mind when
>> implementing a UTF string library?
>> Must a UTF8 string be stored as an Array[Byte]?  What about Seq[Seq[Byte]]
>> or Array[Seq[Byte]] or Array[Array[Byte]], where each element represents a
>> code point?
>> -Arya
>
> The point of UTF-8 is space efficiency; the latter three
> representations would defeat that and are inferior in all respects to
> an Array[Int] (UTF-32) representation. Each of the three standard
> representations has some benefit that the others don't:
>
> UTF-8: space
> UTF-16: interoperability with Java and existing Scala code/libraries
> UTF-32: speed
>
> There's no point in any other representation.

I somewhat overstated that. There are, of course, ropes, which is
where we started. But ropes, which provide inexpensive concatenation
and slicing, don't address any of the Unicode issues -- the internal
substrings in ropes can use any of the above encodings. And there is
also List[SomeCharacterType], a linked list with one element per
character.
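
A rough sketch of why ropes are orthogonal to the encoding question: the tree structure only needs each leaf to report its length and answer index lookups. All names below (RopeSketch, Rope, Leaf, Concat, leaf) are hypothetical, slicing and rebalancing are left out, and the leaves happen to store UTF-32 code points, though a leaf backed by UTF-8 bytes or a String would fit the same interface.

```scala
object RopeSketch {
  // Hypothetical minimal rope; not taken from any existing library.
  sealed trait Rope {
    def length: Int          // length in code points
    def apply(i: Int): Int   // code point at index i
    // Concatenation just allocates a new node; no text is copied.
    def ++(that: Rope): Rope = Concat(this, that)
  }

  // A leaf holding UTF-32 code points. The choice of encoding is
  // local to the leaf and invisible to the rest of the tree.
  final case class Leaf(codePoints: Array[Int]) extends Rope {
    def length: Int = codePoints.length
    def apply(i: Int): Int = codePoints(i)
  }

  // An inner node caches the combined length so lookups can decide
  // which side to descend into.
  final case class Concat(left: Rope, right: Rope) extends Rope {
    val length: Int = left.length + right.length
    def apply(i: Int): Int =
      if (i < left.length) left(i) else right(i - left.length)
  }

  // Convenience constructor (assumes Java 8+ for codePoints()).
  def leaf(s: String): Rope = Leaf(s.codePoints().toArray)
}
```

For example, RopeSketch.leaf("ab") ++ RopeSketch.leaf("c") builds a three-code-point rope without copying either piece; lookup cost then depends on tree depth rather than on how the leaves are encoded.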