[squeak-dev] Why doesn't ECompletion / OCompletion use a Trie?

Wed Jun 15 14:22:54 UTC 2011

Levente Uzonyi wrote:
> ...
>>
>>
>> Showing the most popular selectors first is quite nice. But having a 
>> hard, hidden boundary on what will be suggested and what not is not 
>> that nice.
>
> It's only true when the OController is not #expanded (see #menuMorph), 
> which is the case when you didn't ask for completion. When you ask for 
> completion (by pressing tab), then you'll get the full list.

Thanks. I wasn't aware of that. In that case, it just scans the symbol 
table on each keystroke. It doesn't get too slow, though. Only problem 
is that it misses the 'sort by date' feature.

> ...
>>
>> The trie will get faster as the prefixes are longer, beating easily 
>> the table, that will have to filter a much larger set of words. 
>> Besides, the
>
> It doesn't matter if it's faster or not, because as long as the prefix 
> lenght is >= 2, the cost is a dictionary lookup + filtering a constant 
> (40) length list which has O(1) cost. If the prefix length is less 
> than 2, then OCompletion won't give you suggestions. Start typing: "0 
> tinyBenchmarks" and you'll see how it works. When you get to "0 ti" 
> you'll see tinyBenchmarks being suggested. If you press tab after 
> typing "0 t", then you'll get all (at most 500) selectors beginning 
> with t, if you keep typing, then you'll see another selector in the 
> system (tinyDisplay) which is not present in the suggestion list if 
> you don't ask for completion, because it's not part of Kernel* and 
> Collections* which are preprocessed during the installation.

Right. I was thinking of 'filter from all symbols' + 'sort by date', a 
combo that ECompletion / OCompletion doesn't offer. This sounds like the 
ideal behavior, maybe the only one to have. Otherwise, I need to think 
if the selector I want should be looked by ECompletion or by OCompletion 
to know if I should press <tab> after the first character, but before I 
type the second character...

>> sorting of the matching results of the trie will not be too bad, if I 
>> do filtering and sorting at the same time, keeping (for example) just 
>> the 40 latests words. (Note to Juan: do it!).
>
> Note that 40 here is different from the 40 in OCompletion. Btw this is 
> a great use case for Heap.

I was thinking of just an array and binary search on each addition to 
the list, but will take a close look at Heap. Thanks.

>> And as I said before, holding all possible selectors is much better 
>> than holding (maybe) 20% of them.
>
> Adding all selectors to a trie would take much more space than the 
> symbol table itself. What I planned to do to speed up selector lookup 
> by prefix (while keeping the space overhead low) is to create a 
> WeakArray from SymbolTable and use binary search on it. But 
> implementing an efficient binary search on weak arrays is not easy. :)

I'll do space analysis when done. I'm thinking on a kind of 
CharacterKeyDictionary that could be pretty compact, making the object 
headers for the trie nodes the biggest 'waste'. Besides, I plan to avoid 
nodes with just one child altogether. I hope it won't be too bad.

In any case, I find both being able to suggest any selector + sorting by 
date as really valuable. So, some kind of data structure is needed. It 
could be an Array as you say instead of a Trie, but why make it weak? 
You can check for implementors when deleting a method instead. It is 
also tempting to change the symbol table itself...

>>
>> Besides, the trie is better for natural language dictionaries, where 
>> the sorting by date feature is most likely not good and the set of 
>> words can get much larger. My current implementation can also match 
>> words regardless of case and accents, and suggest a preferred 
>> spelling (for example, with correct accents).
>
> Cool. We also faced some NLP problems, so we created a few tries so 
> far. The one we use the most knows pretty much what yours does + it 
> can search for words with a given maximum levenshtein distance. It 
> doesn't use the most efficient solution (levenshtein automata), but 
> from 400k words it finds the results in less than 100 milliseconds for 
> distance <= 2.

Interesting!

Cheers,
Juan Vuletich

> ...
>
> Levente