<br><br><div class="gmail_quote">On Mon, Jun 29, 2009 at 3:09 PM, Paolo Bonzini <span dir="ltr">&lt;<a href="mailto:paolo.bonzini@gmail.com">paolo.bonzini@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div class="im">On 06/29/2009 11:08 PM, Eliot Miranda wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

On reading this my first question is &quot;what should at: do?&quot;.  Have you<br>

thought this through?  Does at: have to search for TAG marks and skip<br>

over them, or is the problem punted up to the client?<br>

</blockquote>

<br></div>

Tags are zero-width Unicode characters just like the byte-order mark U+FEFF.  Note that tags use a completely different set of characters from the normal Latin alphabet.  Just as in UTF-8/UTF-16 it is possible to find the beginning of a character in O(1) time, in this RFC it is always clear whether a character is part of a tag or not.</blockquote>

<div><br></div><div>But being able to find the start of a character in O(1) doesn&#39;t tell you how many characters lie between a given address within the string and its start address, nor does it tell you the address of the character at a given index.  So if the TAG representation is the internal representation (which I think is implied by using it to carry language information around with the character data), then at: becomes O(N).  That means either it is only suitable as an exchange representation (and expensive to encode to and decode from), or it needs an additional index structure, or...?</div>
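<div><br></div><div>To make the cost concrete, here is a rough sketch in Python rather than Smalltalk.  It assumes tags are the code points U+E0000 to U+E007F stored inline with the text; the names is_tag, at, build_index, at_indexed and tag are made up for illustration and are not part of any existing string class.  Testing one element for tag-ness is O(1), but at: has to skip tags one by one unless a side index is kept.</div>

```python
# Assumption: tag characters are the code points U+E0000..U+E007F,
# stored inline with the visible text.
TAG_FIRST = 0xE0000
TAG_LAST = 0xE007F


def is_tag(ch):
    """O(1) test: is this code point part of a language tag?"""
    return ord(ch) in range(TAG_FIRST, TAG_LAST + 1)


def at(tagged, index):
    """Return the index-th visible character (1-based, like Smalltalk at:).

    Tags are skipped one at a time, so this is O(N) in the stored length.
    """
    seen = 0
    for ch in tagged:
        if is_tag(ch):
            continue
        seen += 1
        if seen == index:
            return ch
    raise IndexError(index)


def build_index(tagged):
    """One possible side structure: raw offsets of the visible characters."""
    return [i for i, ch in enumerate(tagged) if not is_tag(ch)]


def at_indexed(tagged, offsets, index):
    """O(1) lookup, paid for by the O(N) index built beforehand."""
    return tagged[offsets[index - 1]]


def tag(language):
    """Hypothetical helper: U+E0001 followed by tag copies of ASCII letters."""
    return chr(0xE0001) + "".join(chr(0xE0000 + ord(c)) for c in language)


# Two tagged runs: 12 stored elements, only 6 visible characters.
sample = tag("en") + "hi " + tag("fr") + "oui"
```

<div>With this sample, at(sample, 4) walks past both tags before answering &#39;o&#39;, while at_indexed(sample, build_index(sample), 4) answers the same thing in constant time at the price of an extra array that must be kept in sync with the string.</div>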

<div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><br><font color="#888888">

<br>

Paolo<br>

</font></blockquote></div><br>