[squeak-dev] Why doesn't ECompletion / OCompletion use a Trie?

Levente Uzonyi leves at elte.hu
Mon Jun 20 21:31:17 UTC 2011


On Wed, 15 Jun 2011, Juan Vuletich wrote:

> Levente Uzonyi wrote:
>> ...
>>> 
>>> 
>>> Showing the most popular selectors first is quite nice. But having a hard, 
>>> hidden boundary on what will be suggested and what won't is not that nice.
>> 
>> It's only true when the OController is not #expanded (see #menuMorph), 
>> which is the case when you didn't ask for completion. When you ask for 
>> completion (by pressing tab), then you'll get the full list.
>
> Thanks. I wasn't aware of that. In that case, it just scans the symbol table 
> on each keystroke. It doesn't get too slow, though. The only problem is that 
> it misses the 'sort by date' feature.

That's not entirely true. OCompletion consults its completion table first, 
so it adds the (at most) 7 recently used selectors matching the prefix to 
the entries list. See OModel >> #entries for details.
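
Roughly, the merge looks like this (an illustrative sketch only, not the 
actual OModel >> #entries code; #recentSelectors and #allSelectorsMatching: 
are made-up helpers):

	entriesForPrefix: aPrefix
		"Recently used matches come first, then the rest of the
		 suggestion list, without duplicates. Sketch only."
		| recent |
		recent := (self recentSelectors
			select: [:sel | sel beginsWith: aPrefix]) asOrderedCollection.
		recent := recent copyFrom: 1 to: (recent size min: 7).
		^recent , ((self allSelectorsMatching: aPrefix)
			reject: [:sel | recent includes: sel])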

>
>> ...
>>> 
>>> The trie will get faster as the prefixes get longer, easily beating the 
>>> table, which will have to filter a much larger set of words. Besides, the
>> 
>> It doesn't matter if it's faster or not, because as long as the prefix 
>> length is >= 2, the cost is a dictionary lookup + filtering a 
>> constant-length (40) list, which is O(1). If the prefix length is less than 2, then 
>> OCompletion won't give you suggestions. Start typing: "0 tinyBenchmarks" 
>> and you'll see how it works. When you get to "0 ti" you'll see 
>> tinyBenchmarks being suggested. If you press tab after typing "0 t", then 
>> you'll get all (at most 500) selectors beginning with t, if you keep 
>> typing, then you'll see another selector in the system (tinyDisplay) which 
>> is not present in the suggestion list if you don't ask for completion, 
>> because it's not part of Kernel* and Collections* which are preprocessed 
>> during the installation.
>
> Right. I was thinking of 'filter from all symbols' + 'sort by date', a combo 
> that ECompletion / OCompletion doesn't offer. This sounds like the ideal 
> behavior, maybe the only one to have. Otherwise, I need to think about 
> whether the selector I want would be looked up by ECompletion or by 
> OCompletion to know if I should press <tab> after the first character, but 
> before I type the second character...

Suggestions are only offered by OCompletion; ECompletion doesn't have this 
feature.
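
For reference, the constant-cost lookup described above amounts to 
something like this (a sketch: a Dictionary keyed by two-character 
prefixes, each entry holding at most 40 selectors; purely illustrative, 
not OCompletion's actual code):

	| table suggestionsFor |
	table := Dictionary new.
	#(tinyBenchmarks timesRepeat: truncated) do: [:sel |
		(table at: (sel first: 2) ifAbsentPut: [OrderedCollection new])
			add: sel].
	"assumes input is at least 2 characters long, as described above"
	suggestionsFor := [:input |
		(table at: (input first: 2) ifAbsent: [#()])
			select: [:sel | sel beginsWith: input]].
	suggestionsFor value: 'ti'. "tinyBenchmarks and timesRepeat:"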

>
>>> sorting of the matching results of the trie will not be too bad, if I do 
>>> filtering and sorting at the same time, keeping (for example) just the 40 
>>> latest words. (Note to Juan: do it!).
>> 
>> Note that 40 here is different from the 40 in OCompletion. Btw this is a 
>> great use case for Heap.
>
> I was thinking of just an array and a binary search on each addition to the 
> list, but I will take a close look at Heap. Thanks.

With a Heap you can have O(log n) insertion cost instead of O(n), but 
since n = 40, the constant factors may dominate and it may actually be 
less efficient. For the same reason, binary search is probably slower 
than a linear search on such a small list.

>
>>> And as I said before, holding all possible selectors is much better than 
>>> holding (maybe) 20% of them.
>> 
>> Adding all selectors to a trie would take much more space than the symbol 
>> table itself. What I planned to do to speed up selector lookup by prefix 
>> (while keeping the space overhead low) is to create a WeakArray from 
>> SymbolTable and use binary search on it. But implementing an efficient 
>> binary search on weak arrays is not easy. :)
>
> I'll do a space analysis when done. I'm thinking of a kind of 
> CharacterKeyDictionary that could be pretty compact, making the object 
> headers for the trie nodes the biggest 'waste'. Besides, I plan to avoid 
> nodes with just one child altogether. I hope it won't be too bad.

I tried with our trie and got pretty bad results, though it has no path 
compression yet. It had 27942 selectors and 238065 internal nodes. So 
even if it were optimized for space, it'd still take 5498440 bytes to 
store it (about 23 bytes per internal node). Path compression could 
decrease the number of internal nodes anywhere between 3x and 10x, but 
it'd introduce new objects (strings/arrays) which are not present 
currently, so I doubt it would be possible to create a trie which takes 
less than 2MB for these 27942 selectors.

>
> In any case, I find the combination of being able to suggest any selector 
> and sorting by date really valuable. So, some kind of data structure is 
> needed. It could be an Array as you say instead of a Trie, but why make it 
> weak? You can check for implementors when deleting a method instead. It is 
> also tempting to change the symbol table itself...

The current symbol table implementation uses two WeakSets: SymbolTable and 
NewSymbols. All new symbols are added to NewSymbols and it's expected to 
be a small data structure. So almost all symbols are stored in 
SymbolTable. The idea is to create a WeakArray which stores the same 
symbols as SymbolTable, but it'd be sorted, so prefix and range searches 
can be done with it. It must be weak, because it contains all symbols, not 
just selectors. So searching for a given prefix would be done on this new 
WeakArray with binary search and on NewSymbols with a linear search. 
Assuming that the size of NewSymbols is small enough, this method would be 
fast enough for syntax highlighting and code completion.
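
The hard part is that slots of the sorted WeakArray can turn nil whenever 
the garbage collector reclaims a symbol. A minimal sketch of one way to 
handle that (illustrative, not existing code): when a probe hits nil, 
scan towards the upper bound for the nearest surviving entry.

	prefixIndexOf: aPrefix in: sorted
		"Binary search a sorted WeakArray for a symbol starting with
		 aPrefix, skipping slots nilled by the GC. Sketch only."
		| low high mid idx probe |
		low := 1.
		high := sorted size.
		[low <= high] whileTrue: [
			mid := low + high // 2.
			idx := mid.
			probe := sorted at: idx.
			"skip reclaimed slots, scanning towards high"
			[probe isNil and: [idx < high]] whileTrue: [
				idx := idx + 1.
				probe := sorted at: idx].
			probe
				ifNil: [high := mid - 1] "everything in [mid..high] is gone"
				ifNotNil: [
					(probe beginsWith: aPrefix) ifTrue: [
						^idx "a match; scan left to find the first one"].
					probe < aPrefix
						ifTrue: [low := idx + 1]
						ifFalse: [high := mid - 1]]].
		^nil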

The size of SymbolTable in my image is 36511, so a new WeakArray would 
occupy only 146056 bytes (36511 4-byte slots plus a 12-byte object header 
on a 32-bit image). Storing an integer timestamp per symbol would double 
that, but it's still about 1/8th of what I expect an optimized trie would 
occupy.
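
Building such an array is straightforward; keeping it in sync with 
SymbolTable as symbols come and go is the real work. A sketch of the 
initial construction:

	| sorted weak |
	"Snapshot all symbols, sort them, and copy them into a WeakArray
	 so the GC can still reclaim symbols nothing else references."
	sorted := Symbol allSymbols asSortedCollection: [:a :b | a <= b].
	weak := WeakArray new: sorted size.
	sorted doWithIndex: [:sym :i | weak at: i put: sym].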


Levente

>
>>> 
>>> Besides, the trie is better for natural language dictionaries, where the 
>>> sorting by date feature is most likely not good and the set of words can 
>>> get much larger. My current implementation can also match words regardless 
>>> of case and accents, and suggest a preferred spelling (for example, with 
>>> correct accents).
>> 
>> Cool. We also faced some NLP problems, so we've created a few tries so far. 
>> The one we use the most does pretty much what yours does + it can search 
>> for words with a given maximum Levenshtein distance. It doesn't use the 
>> most efficient solution (Levenshtein automata), but from 400k words it 
>> finds the results in less than 100 milliseconds for distance <= 2.
>
> Interesting!
>
> Cheers,
> Juan Vuletich
>
>> ...
>> 
>> Levente 
>
>


