Unicode issues


Unicode issues

Jim Balter-2
[moved to scala-debate]

On Thu, Dec 23, 2010 at 10:20 PM, Rex Kerr <[hidden email]> wrote:

> On Thu, Dec 23, 2010 at 7:22 PM, Jim Balter <[hidden email]> wrote:
>>
>> On Thu, Dec 23, 2010 at 4:09 PM, Rex Kerr <[hidden email]> wrote:
>> > If Scala changes, Java-Scala interop breaks.
>>
>> There are changes, and then there are changes. For instance, changing
>> StringOps to implement the correct abstraction would not break interop
>> with Java.
>
> Sure it would, because StringOps look like methods on String, and if you
> changed StringOps to implement a different abstraction, String wouldn't even
> look self-consistent.

I can't make sense of that. I'm working on such a version of
StringOps, and it looks consistent to me. The two abstractions we're
talking about are: an indexed sequence of 16-bit characters, and a
sequence of Unicode code points -- both are implemented as an array of
Java chars, the latter via a UTF-16 encoding. The first abstraction,
which is what the current StringOps provides, is wrong -- in Java 1.5
and above, Java Strings are mandated to have the latter abstraction.
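
To make the two abstractions concrete, here is a minimal sketch that reads the same UTF-16 data both ways. The object name `TwoViews` and the helper `codePoints` are illustrative assumptions, not part of StringOps; `codePointAt` and `Character.charCount` are standard Java methods.

```scala
object TwoViews {
  // U+1D11E (a treble clef, stored as a surrogate pair) followed by 'a'.
  val s: String = "\ud834\udd1ea"

  // View 1: an indexed sequence of 16-bit chars.
  val chars: Vector[Char] = s.toVector

  // View 2: a sequence of Unicode code points decoded from the same chars.
  def codePoints(str: String): Vector[Int] = {
    val out = Vector.newBuilder[Int]
    var i = 0
    while (i < str.length) {
      val cp = str.codePointAt(i)
      out += cp
      i += Character.charCount(cp) // 1 for a BMP char, 2 for a surrogate pair
    }
    out.result()
  }
}
```

The char view has three elements here, while the code point view has two.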

>  So you're basically forced to use something other
> than String.  Maybe a UTF-8 rope?
>
>>
>> >  But you can always write a
>> > unicode-aware layer on top of Java chars and Strings, and use that
>> > instead.
>> > If you're careful, you can make it look to Java-land like whatever the
>> > consensus solution is in Java-land, and make it extra-pretty in Scala.
>> > Implicit conversions are your friend.
>>
>> However, Jon Pretty pointed out that implicit conversions may not
>> always be invoked when you want them to be -- they only are when the
>> code doesn't type check. This may put serious restrictions on a
>> solution.
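
The restriction can be sketched as follows. `CodePointView` and `toCodePointView` are hypothetical names invented for this illustration. Because `String` already has a `length` member, the call type checks without the conversion, so the wrapper's `length` is never reached.

```scala
import scala.language.implicitConversions

object ImplicitPitfall {
  // Hypothetical wrapper; not part of the Scala library.
  final class CodePointView(val s: String) {
    def length: Int = s.codePointCount(0, s.length)          // never picked for s.length
    def codePointLength: Int = s.codePointCount(0, s.length) // reachable: String lacks it
  }

  implicit def toCodePointView(s: String): CodePointView = new CodePointView(s)

  val clef = "\ud834\udd1e"

  // Type checks against java.lang.String, so no conversion is applied.
  val viaString: Int = clef.length // 2

  // String has no such member, so the conversion is applied.
  val viaWrapper: Int = clef.codePointLength // 1
}
```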
>
> You _might_ have to do something as drastic as define a .u method on String
> to translate to a Unicode-safe class (n.b. Regex and .r).  Horrors!

Yes, in my last post I mentioned such a thing, and that I am currently
working on an implementation of it as well. It does not, however,
solve all problems.

[personal commentary along the lines of your previous instruction to
shut up and just code snipped]

Re: Unicode issues

Rex Kerr-2
On Fri, Dec 24, 2010 at 1:49 AM, Jim Balter <[hidden email]> wrote:
> I can't make sense of that. I'm working on such a version of
> StringOps, and it looks consistent to me. The two abstractions we're
> talking about are: an indexed sequence of 16-bit characters, and a
> sequence of Unicode code points -- both are implemented as an array of
> Java chars, the latter via a UTF-16 encoding. The first abstraction,
> which is what the current StringOps provides, is wrong -- in Java 1.5
> and above, Java Strings are mandated to have the latter abstraction.

No they aren't.  Try "\ud834\udd1e" (U+1D11E, encoded as a surrogate pair): you get a single treble clef glyph printed, but the .length method returns 2, and substring(1) produces the Unicode-nonsensical string "\udd1e", a lone low surrogate.  You do get correct results with higher-level methods using regular expressions, but the low-level methods still exist.
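
The behaviour described above can be checked directly. This is a small sketch using only standard `java.lang.String` methods; the object name `ClefCheck` is an assumption made for the example.

```scala
object ClefCheck {
  val clef = "\ud834\udd1e" // U+1D11E as a UTF-16 surrogate pair

  // Low-level methods count and slice 16-bit chars.
  val charLength: Int = clef.length    // 2
  val tail: String = clef.substring(1) // just the low surrogate

  // Code-point-aware methods see a single character.
  val codePointLength: Int = clef.codePointCount(0, clef.length) // 1

  // java.util.regex matches by code point, so "." consumes the whole pair.
  val regexSeesOne: Boolean = clef.matches(".")
}
```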

So you really need separate collections for treating the low-level char array and the sequence of code points.  Thus, why not leave StringOps as it is, dealing with chars, and add UnicodeOps (or just Unicode) that handles code points (with a .u method or somesuch in StringOps to return a UnicodeOps)?
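
One possible shape for such a layer is sketched below, under stated assumptions. `UnicodeOps` here is a hypothetical class written for illustration, not an existing library type, and the `.u` method is added through an implicit class rather than inside StringOps itself. Indices are counted in code points, using the standard `codePointCount`, `codePointAt`, and `offsetByCodePoints` methods.

```scala
object UnicodeSyntax {
  // Hypothetical code-point view of a String; names are illustrative.
  final class UnicodeOps(val underlying: String) {
    // Number of code points, not chars.
    def length: Int = underlying.codePointCount(0, underlying.length)

    // The i-th code point (index counted in code points).
    def apply(i: Int): Int =
      underlying.codePointAt(underlying.offsetByCodePoints(0, i))

    // Suffix starting at the given code point index; never splits a pair.
    def substring(from: Int): String =
      underlying.substring(underlying.offsetByCodePoints(0, from))
  }

  // Explicit opt-in, analogous to .r for Regex.
  implicit class UnicodeString(val s: String) extends AnyVal {
    def u: UnicodeOps = new UnicodeOps(s)
  }
}
```

With `import UnicodeSyntax._`, `"\ud834\udd1ea".u.length` counts two code points and `"\ud834\udd1ea".u.substring(1)` yields `"a"`. Indexed access in this sketch is linear in the index, since it walks the chars from the start each time.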

What problem would a .u method not solve, aside from laziness or lack of awareness on the part of the programmer?

  --Rex