[squeak-dev] Recent change in byte array at:put:

Jakob Reschke jakob.reschke at student.hpi.de
Sun Jul 30 14:39:26 UTC 2017


I did not want to advertise UTF-16, but I thought maybe something
could be learned from the Java implementation of string building. ;-)
But now that I have read up on UTF-16 and its surrogates again, and
found this [1], I doubt it.

[1] https://stackoverflow.com/questions/26170180/complexity-of-insert0-c-operation-on-stringbuffer-is-it-o1
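
To illustrate why surrogates get in the way, here is a minimal sketch
(Python rather than Smalltalk, purely for illustration; the helper names
utf16_code_units and is_surrogate are made up for this example). A
character outside the Basic Multilingual Plane takes two 16-bit code
units, so an index into the code units is not an index into the
characters:

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")  # little-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_surrogate(unit):
    """True if a 16-bit code unit is one half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDFFF


# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, outside the BMP), 'b'
sample = "a\U0001D11Eb"
units = utf16_code_units(sample)

print(len(sample))  # number of characters (code points)
print(len(units))   # number of UTF-16 code units
print([hex(u) for u in units])
```

Inserting at code unit index 2 of this sample would split the surrogate
pair, so an editor working on UTF-16 has to check for surrogates at
every edit position.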


2017-07-30 15:34 GMT+02:00 Tobias Pape <Das.Linux at gmx.de>:
>
>> On 30.07.2017, at 11:02, Jakob Reschke <jakob.reschke at student.hpi.de> wrote:
>>
>> Simple "solution" to editing: treat (encoded) Strings as immutable.
>>
>> For editing/stringbuilding, use a WideString or a special kind of stream (MultiByteBinaryOrTextStream, or whatever it is called) plus additional support for inserting in the middle if desired. I remember somebody proposing Ropes when discussing a reformation of strings previously. Could be interesting at "edit-time".
>>
>> For (in-memory) storage, encoded Strings should maybe just be ByteArrays paired with some TextConverter-like thing or at least a spec of the encoding so you can fetch Characters or configure streams from it on demand.
>>
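
A rough sketch of that storage idea, in Python rather than Smalltalk
and only as an illustration: the class name EncodedString and its
methods are hypothetical, not an existing Squeak or Python API. The
bytes and the name of their encoding travel together, the object is
immutable, and characters are only produced on demand:

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class EncodedString:
    data: bytes      # the raw encoded bytes
    encoding: str    # the spec needed to interpret them, e.g. "utf-8"

    @classmethod
    def from_text(cls, text, encoding="utf-8"):
        """Encode a character string once and remember how it was encoded."""
        return cls(text.encode(encoding), encoding)

    def characters(self):
        """Decode on demand; the stored bytes are never modified."""
        return self.data.decode(self.encoding)

    def reencoded(self, encoding):
        """Return a new object in another encoding instead of mutating."""
        return EncodedString.from_text(self.characters(), encoding)


greeting = EncodedString.from_text("Grüße")
latin1 = greeting.reencoded("latin-1")
print(len(greeting.data), len(latin1.data))  # byte counts differ per encoding
print(greeting.characters() == latin1.characters())
```

Because the encoding is part of the object, nobody has to guess how to
read the bytes, which is the fragile part of the squeakToUtf8 approach
discussed further down.
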
>> If licensing permits it one could also have a look at how the OpenJDK deals with UTF-16 in StringBuilder.
>>
>
> While I agree in principle, don't come near me with utf16 ;)
>
>>
>> On 30.07.2017 03:20, "tim Rowledge" <tim at rowledge.org> wrote:
>>
>> > On 29-07-2017, at 12:48 PM, Nicolas Cellier <nicolas.cellier.aka.nice at gmail.com> wrote:
>> > Absolutely,
>> > to me a String is a sequence of characters.
>> > squeakToUtf8 is a hack that makes us consider a String as a sequence of codePoints whose encoding is in the eye of the beholder (or implicitly in the Context - the Smalltalk one).
>> > It's not very object oriented and quite fragile.
>> > We started to clean Multilingual but never finished the job…
>>
>> Yes, that’s pretty much how I see it. Currently the utf8 ‘string’ is just kept as a byte string and the user is expected to understand that it is in a rather dangerous state.
>>
>> >
>> > It's difficult to finish it, because we value backward compatibility.
>> > So maybe the ByteArray change was a bit radical in this respect.
>>
>> Backward compatibility can sometimes drive you to loud swearing!
>>
>> Maybe a new message to return the bytearray of the utf8 data could be added, leaving the old one alone. We should probably consider making an actual UTF8String class, though I did try to work out the best thing to do for that several years ago for NuScratch and got lost in the tangles. Editing the damn things is a pain, to say the least, so you get to thinking about having the canonical string as an instvar and a byte array and edits work on the String which gets converted at the end of the edit to update bytearray. Or the other way around… or… aaargh!
>>
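
One way the "canonical string plus byte array" variant could look, as a
hedged sketch in Python rather than Smalltalk. The class Utf8Text and
its methods are hypothetical, and it deviates slightly from the
description above: instead of converting at the end of each edit, it
drops the cached bytes on every edit and rebuilds them the next time
they are requested:

```python
class Utf8Text:
    """Characters are canonical; the UTF-8 bytes are a derived cache."""

    def __init__(self, text=""):
        self._chars = list(text)  # edits happen here, one item per character
        self._utf8 = None         # cached encoding, None means stale

    def insert(self, index, text):
        """Insert text before the character at index."""
        self._chars[index:index] = list(text)
        self._utf8 = None         # any edit invalidates the cached bytes

    def text(self):
        """Return the current characters as one string."""
        return "".join(self._chars)

    def utf8(self):
        """Return the UTF-8 bytes, re-encoding only after an edit."""
        if self._utf8 is None:
            self._utf8 = self.text().encode("utf-8")
        return self._utf8


doc = Utf8Text("ac")
doc.insert(1, "é")        # character index 1, regardless of byte widths
print(doc.text())
print(doc.utf8())
```

The edit positions are character indices, so multi-byte characters
never get split; the price is keeping two representations around and
re-encoding the whole text after a change.
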
>>
>> tim
>> --
>> tim Rowledge; tim at rowledge.org; http://www.rowledge.org/tim
>> Useful random insult:- Immune from any serious head injury.
>>
>>
>>
>>
>>
>
>

