Simple "solution" to editing: treat (encoded) Strings as immutable.
For editing/stringbuilding, use a WideString or a special kind of stream (MultiByteBinaryOrTextStream or how is it called) plus additional support for inserting in the middle if desired. I remember somebody proposing Ropes when discussing a reformation of strings previously. Could be interesting at "edit-time".
For (in-memory) storage, encoded Strings should maybe just be ByteArrays paired with some TextConverter-like thing or at least a spec of the encoding so you can fetch Characters or configure streams from it on demand.
If licensing permits it one could also have a look at how the OpenJDK deals with UTF-16 in StringBuilder.
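The bytes-plus-encoding-spec idea above can be sketched quickly. This is a minimal illustration in Python rather than Smalltalk, and the class and method names (`EncodedString`, `character_at`, `reading_stream`) are hypothetical, not any actual Squeak API:

```python
import io

class EncodedString:
    """An immutable encoded string: raw bytes plus a spec of their encoding.
    Characters and streams are produced on demand, never stored decoded."""

    def __init__(self, data: bytes, encoding: str = "utf-8"):
        self._data = data          # the encoded byte content, treated as immutable
        self._encoding = encoding  # the TextConverter-like spec

    def byte_array(self) -> bytes:
        return self._data

    def character_at(self, index: int) -> str:
        # Decode on demand; O(n) for variable-width encodings like UTF-8.
        return self._data.decode(self._encoding)[index]

    def reading_stream(self):
        # Configure a character stream over the stored bytes on demand.
        return io.TextIOWrapper(io.BytesIO(self._data), encoding=self._encoding)

s = EncodedString("héllo".encode("utf-8"))
print(s.character_at(1))          # é
print(s.reading_stream().read())  # héllo
```

Any editing would then go through a separate builder rather than mutating the bytes in place.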
On 30.07.2017 03:20, "tim Rowledge" <tim@rowledge.org> wrote:
On 29-07-2017, at 12:48 PM, Nicolas Cellier <nicolas.cellier.aka.nice@gmail.com> wrote:
Absolutely, to me a String is a sequence of characters. squeakToUtf8 is a hack that makes us consider a String as a sequence of
codePoints whose encoding is in the eye of the beholder (or implicitly in the Context - the Smalltalk one).
It's not very object oriented and quite fragile. We started to clean Multilingual but never finished the job…
Yes, that’s pretty much how I see it. Currently the utf8 ‘string’ is just kept as a byte string and the user is expected to understand that it is in a rather dangerous state.
It's difficult to finish it, because we value backward compatibility. So maybe the ByteArray change was a bit radical in this respect.
Backward compatibility can sometimes drive you to loud swearing!
Maybe a new message to return the bytearray of the utf8 data could be added, leaving the old one alone. We should probably consider making an actual UTF8String class, though I did try to work out the best thing to do for that several years ago for NuScratch and got lost in the tangles. Editing the damn things is a pain, to say the least, so you get to thinking about having the canonical string as an instvar and a byte array, with edits working on the String, which gets converted at the end of the edit to update the bytearray. Or the other way around… or… aaargh!
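The "canonical string instvar plus byte array" scheme can be sketched as a lazily synchronized cache. A Python illustration with hypothetical names (`UTF8String`, `replace`, `utf8_bytes`), not a proposal for the actual class:

```python
class UTF8String:
    """Edits go to the canonical decoded string; the UTF-8 byte array is
    regenerated lazily, only when someone asks for it after an edit."""

    def __init__(self, text: str):
        self._text = text   # canonical, editable representation
        self._utf8 = None   # cached encoded form; None means stale

    def replace(self, start: int, stop: int, replacement: str):
        # Edit the canonical string and invalidate the cached bytes.
        self._text = self._text[:start] + replacement + self._text[stop:]
        self._utf8 = None

    def utf8_bytes(self) -> bytes:
        # Re-encode only when the cache is stale.
        if self._utf8 is None:
            self._utf8 = self._text.encode("utf-8")
        return self._utf8

s = UTF8String("naïve")
s.replace(0, 3, "waï")
print(s.utf8_bytes())   # b'wa\xc3\xafve'
```

Going "the other way around" would keep the bytes canonical and decode lazily instead; the trade-off is which side pays the conversion cost.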
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Useful random insult:- Immune from any serious head injury.
On 30.07.2017, at 11:02, Jakob Reschke <jakob.reschke@student.hpi.de> wrote:
If licensing permits it one could also have a look at how the OpenJDK deals with UTF-16 in StringBuilder.
While I agree in principle, don't come near me with utf16 ;)
On 30-07-2017, at 6:34 AM, Tobias Pape <Das.Linux@gmx.de> wrote:
On 30.07.2017, at 11:02, Jakob Reschke jakob.reschke@student.hpi.de wrote:
If licensing permits it one could also have a look at how the OpenJDK deals with UTF-16 in StringBuilder.
While I agree in principle, don't come near me with utf16 ;)
I was about to say something similar :-)
I think it’s reasonably clear that nobody wants to have UTF-X as the main representation of text within their system if any sort of editing might be involved. It’s just too painful. However, there seem to be quite a lot of places where UTF-8 has been chosen as a sort of interface coding, I imagine for some sort of space-saving reasons in general. It does seem like a bit of an early-90s “oh my gosh, all the furrin letters take up so much space, what can we do, we can’t ask people to install an entire megabyte of memory on their PCs!” thing.
For the NuScratch stuff I used Cairo/Pango to render text nicely and thus had to convert everything to UTF-8 in order to pass it to the renderer. No editing was done to any of that, so no backward conversions or complex parsing were required. To my surprise the general performance on the Pis was not noticeably impacted; when I did my first experiments I thought I would have to render the full fonts out to make my own glyph bitmaps and so on, but in fact it worked nicely. Which meant that the languages with complex layout and kerning rules could be dealt with by somebody else’s code, which I like.
Jakob mentioned pairing encoded bytes with converters of some kind and that made me think of Text, where we pretty much do that already. I wonder if using a runarray paired with the bytearray of UTF-8 (or even, dog help us, UTF-16) to call out where non-byte characters lurk would work? Think about behaving as if the text attribute were ‘this one needs 3 bytes’ rather than ‘this one is in flashing red sparkles with rotating underlines and winking quotes’. Given that we are able to handle editing Text pretty well, maybe, just maybe, that would make editing UTF-X work decently? Sounds like a good student project to me ;-)
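The runarray-of-byte-widths idea can be sketched concretely. A Python illustration, with the run format and function names (`run_array`, `character_at`) invented for the sketch; the widths come from UTF-8's leading-byte rules:

```python
def byte_width(first_byte: int) -> int:
    # Number of bytes in a UTF-8 sequence, read off its first byte.
    if first_byte < 0x80: return 1
    if first_byte < 0xE0: return 2
    if first_byte < 0xF0: return 3
    return 4

def run_array(data: bytes):
    # Compress per-character byte widths into (width, count) runs,
    # much like a RunArray of Text attributes.
    runs, i = [], 0
    while i < len(data):
        w = byte_width(data[i])
        if runs and runs[-1][0] == w:
            runs[-1][1] += 1
        else:
            runs.append([w, 1])
        i += w
    return runs

def character_at(data: bytes, runs, index: int) -> str:
    # Walk the runs to find the byte offset of character `index`,
    # skipping whole runs instead of decoding byte by byte.
    offset = 0
    for width, count in runs:
        if index < count:
            start = offset + index * width
            return data[start:start + width].decode("utf-8")
        offset += count * width
        index -= count
    raise IndexError(index)

text = "abédef".encode("utf-8")
print(run_array(text))                         # [[1, 2], [2, 1], [1, 3]]
print(character_at(text, run_array(text), 2))  # é
```

For mostly-ASCII text the runarray stays tiny, and indexing becomes a walk over a handful of runs rather than a scan of every byte, which is what makes the Text analogy attractive.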
tim -- tim Rowledge; tim@rowledge.org; http://www.rowledge.org/tim Useful random insult:- Calls people to ask them their phone number.
squeak-dev@lists.squeakfoundation.org