On Mon, Jun 29, 2009 at 3:09 PM, Paolo Bonzini <paolo.bonzini@gmail.com> wrote:
On 06/29/2009 11:08 PM, Eliot Miranda wrote:
On reading this my first question is "what should at: do?". Have you thought this through? Does at: have to search for TAG marks and skip over them, or is the problem punted up to the client?
Tags are zero-width Unicode characters just like the byte-order mark U+FEFF. Note that the tag uses a completely different set of characters than the normal Latin alphabet. Similar to how in UTF-8/UTF-16 it is possible to find in O(1) time the beginning of a character, in this RFC it is always clear if a character is part of a tag or not.
But being able to find the start of a character in O(1) doesn't tell you how many characters lie between a given address within the string and the string's start address, nor does it tell you the address of the character at a given index. So if the TAG representation is the internal representation (which I think is implied by using it to carry language information around with the character data), then it implies an O(N) at:. That means it is either suitable only as an exchange representation (and expensive to encode/decode to/from), or it needs an additional index structure, or...?
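To make the cost concrete, here is a rough sketch in Python (not Smalltalk, and not anything from the RFC or an existing implementation). It assumes tags are the code points in the Unicode Tags block U+E0000..U+E007F, uses 0-based indices, and all of the names (is_tag, at_linear, build_index, at_indexed) are made up for illustration. The first lookup has to scan from the start; the second relies on one possible side index of untagged runs, which is exactly the extra structure in question.

```python
from bisect import bisect_right

# Assumption: tag characters are the code points of the Unicode "Tags" block.
TAG_FIRST, TAG_LAST = 0xE0000, 0xE007F

def is_tag(ch):
    """O(1) test: is this code point part of a tag?"""
    return TAG_FIRST <= ord(ch) <= TAG_LAST

def at_linear(s, index):
    """Return the index-th visible character (0-based) by scanning: O(N)."""
    seen = 0
    for ch in s:
        if is_tag(ch):
            continue
        if seen == index:
            return ch
        seen += 1
    raise IndexError(index)

def build_index(s):
    """Build a side index of visible runs: parallel lists of the logical
    start and raw start of each run, plus the total visible length."""
    starts_logical, starts_raw = [], []
    logical = 0
    after_tag = True  # treat the string start like the end of a tag
    for raw, ch in enumerate(s):
        if is_tag(ch):
            after_tag = True
            continue
        if after_tag:
            starts_logical.append(logical)
            starts_raw.append(raw)
            after_tag = False
        logical += 1
    return starts_logical, starts_raw, logical

def at_indexed(s, index, idx):
    """Return the index-th visible character using the side index:
    O(log R) for R runs, via binary search over the run starts."""
    starts_logical, starts_raw, length = idx
    if not 0 <= index < length:
        raise IndexError(index)
    run = bisect_right(starts_logical, index) - 1
    return s[starts_raw[run] + (index - starts_logical[run])]

# Language tag + "en", then "hi", then language tag + "fr", then "zut".
sample = "\U000E0001\U000E0065\U000E006Ehi\U000E0001\U000E0066\U000E0072zut"
sample_index = build_index(sample)
```

The index has to be rebuilt or patched whenever the string is mutated, so it only moves the cost around; whether that is acceptable depends on how often strings are indexed versus modified.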
Paolo