[squeak-dev] Why doesn't ECompletion / OCompletion use a Trie?

Wed Jun 15 12:54:06 UTC 2011

On Wed, 15 Jun 2011, Juan Vuletich wrote:

> Hi Levente,
>
> Thanks for your answer. I thought I had missed something, as I think a trie 
> would do a much better job. Your answer tells me I had already got it, and I 
> still believe a trie is better.
> (comments inline)
>
> Levente Uzonyi wrote:
>> On Tue, 14 Jun 2011, Juan Vuletich wrote:
>> 
>>> Hi Folks,
>>> 
>>> I've been playing a bit with ECompletion / OCompletion on Cuis (thanks, 
>>> Levente!). What surprises me is the use of a single level Dictionary with 
>>> 2 character prefixes as the data structure to hold the string dictionary. 
>>> Does anybody know why a Trie ( http://en.wikipedia.org/wiki/Trie ) wasn't 
>>> used?
>> 
>> OCompletion suggests the recently used methods, therefore the length of the 
>> list for each prefix is limited to 40 by default. So looking up selectors 
>> beginning with a two letter prefix is fast enough (I increased it to 400 in 
>> my image and it's still fast enough). 40 is not always sufficient, 
>> sometimes the method you're looking for is not in the list of the three 
>> recently used methods, but it works for most cases.
>
> Having the list limited to 40 might be ok when you have a 2 or 3 letter 
> prefix, but the majority of the methods will never be suggested, even if you 
> keep typing a longer prefix. At some moment, the list will be empty, and 
> maybe you still need to finish typing a rather long (seldom used, hard to 
> remember) selector. When trying OCompletion, I found this very annoying. For 
> instance, in Pharo 1.2, the selectors table holds less than 6,000 selectors, 
> while the system has over 32,000 distinct selectors. This means that over 80% 
> of the selectors will never be suggested.
>
> Showing the most popular selectors first is quite nice. But having a hard, 
> hidden boundary on what will be suggested and what not is not that nice.

It's only true when the OController is not #expanded (see #menuMorph), 
which is the case when you didn't ask for completion. When you ask for 
completion (by pressing tab), then you'll get the full list.

When you install OCompletion, you tell it which packages should it 
preprocess for you. From then it learns from the code as you write/load 
it. It's designed to be familiar with those packages which you're using, 
but I guess Romain Robbes could tell you more about this.

>
>> With a trie you'd have to sort the words matching the prefix by access time 
>> for every lookup which may be less efficient, especially for small 
>> prefixes. A trie would also have to hold all possible selectors, while the 
>> lists limited to 40 avoid this.
>
> The trie will get faster as the prefixes are longer, beating easily the 
> table, that will have to filter a much larger set of words. Besides, the

It doesn't matter if it's faster or not, because as long as the prefix 
lenght is >= 2, the cost is a dictionary lookup + filtering a constant 
(40) length list which has O(1) cost. If the prefix length is less than 2, 
then OCompletion won't give you suggestions. Start typing: "0 
tinyBenchmarks" and you'll see how it works. When you get to "0 ti" you'll 
see tinyBenchmarks being suggested. If you press tab after typing "0 t", 
then you'll get all (at most 500) selectors beginning with t, if you keep 
typing, then you'll see another selector in the system (tinyDisplay) which 
is not present in the suggestion list if you don't ask for completion, 
because it's not part of Kernel* and Collections* which are preprocessed 
during the installation.

> sorting of the matching results of the trie will not be too bad, if I do 
> filtering and sorting at the same time, keeping (for example) just the 40 
> latests words. (Note to Juan: do it!).

Note that 40 here is different from the 40 in OCompletion. Btw this is a 
great use case for Heap.

>
> And as I said before, holding all possible selectors is much better than 
> holding (maybe) 20% of them.

Adding all selectors to a trie would take much more space than the symbol 
table itself. What I planned to do to speed up selector lookup by prefix 
(while keeping the space overhead low) is to create a WeakArray from 
SymbolTable and use binary search on it. But implementing an efficient 
binary search on weak arrays is not easy. :)

>
> Besides, the trie is better for natural language dictionaries, where the 
> sorting by date feature is most likely not good and the set of words can get 
> much larger. My current implementation can also match words regardless of 
> case and accents, and suggest a preferred spelling (for example, with correct 
> accents).

Cool. We also faced some NLP problems, so we created a few tries so far. 
The one we use the most knows pretty much what yours does + it can search 
for words with a given maximum levenshtein distance. It doesn't use the 
most efficient solution (levenshtein automata), but from 400k words it 
finds the results in less than 100 milliseconds for distance <= 2.

>
> Not to mention that code is much simpler, smaller and general...

I'm not sure we agree here. An efficient trie is not that small and 
simple. I'd guess that the size of the implementation would be on par.

>
>> Btw, I'm thinking about integrating code completion into Squeak itself. The 
>> idea is to add hooks to the system for code completion and provide an 
>> implementation for common things, like menus, key bindings, etc. There 
>> would also be a simple symbol table based completion tool integrated by 
>> default, like Cmd+Q, but with menus, like in Pharo. Then stripped down 
>> versions of E/OCompletion could be loaded as a replacement. I'm also 
>> thinking about implementing a mix of ECompletion and OCompletion which 
>> would show the recently used selectors first like OCompletion, but 
>> filters/sorts the suggestions based on type information like ECompletion.
>
> Sounds great. Maybe my implementation can be one of the alternatives.

Sure.

Levente

>
> Cheers,
> Juan Vuletich
>
>> Levente
>
>
>