[squeak-dev] [ANN] [Cuis] Cuis 2.1 is available. Lots of cool stuff inside!

Sat Feb 13 16:31:35 UTC 2010

On Sat, 13 Feb 2010, Leandro wrote:

> Levente,
>
> Thank you for remarking this. Could you please suggest a reference where I 
> can learn more about the O(n+m) algorithm? (Note however that our "trick" 
> could also be useful with the fast algorithm because it avoids comparing 
> individual words of lines that are common).

TextDiffBuilder uses a modified version of the algorithm described here:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.6927
You can find our implementation in TextDiffBuilder >> #lcsFor:and:.
O(n+m) is only true for typical source code/text diffs, where the 
number of differences between the two sequences is small. TextDiffBuilder 
also uses an extra filtering which reduces n, m and the number of 
differences by removing elements which occur in only one of the sequences 
(this is common if the elements are limes).

Note that there are plenty of algorithms with various time/space 
complexity for the lcs problem, even hierachical ones which may be better 
for line+word diffs.

We chose this one, because it's widely used (current unix diff 
implementations use the recursive version of this algorithm which has 
guaranteed O(n+m) space complexity), relatively easy to implement and 
works well with source code and text.

Levente

>
> /Leandro
>
> Levente Uzonyi wrote:
>> On Fri, 12 Feb 2010, Leandro wrote:
>> 
>>> 
>>> Levente Uzonyi wrote:
>>>
>>>       You can also get the word based diff using TextDiffBuilder by 
>>> subclassing it and overriding #split:. Actually ClassDiffBuilder does 
>>> exactly
>>>       this, so using ClassDiffBuilder instead of TextDiffBuilder will give 
>>> you the expected result.
>>> 
>>> 
>>> I think that the difference with the implementation Valeria and I did is 
>>> that ours switches from lines to words automatically. It also defines 
>>> "words"
>>> as Smalltalk words, meaning that it understands keyword limits, which 
>>> gives better results. The differentiator starts comparing lines, and if 
>>> the
>>> proportion of line differences isn't too high, then it reutilizes the work 
>>> already done (with lines) and refines it to get differences in between
>>> keywords. The refining strategy was crucial for achieving good 
>>> performance.
>> 
>> If you subclass ClassDiffBuilder (let's call the subclass 
>> KeywordDiffBuilder)
>> and implement KeywordDiffBuilder >> #split: just like DifferenceFinder >> 
>> #keywordsAndBlanksFrom: you get even better results.
>> 
>> DifferenceFinder has to use such tricks/hacks to have good performance, 
>> because it uses the simplest lcs algorithm which requires n*m time and 
>> space in all cases, while for practical cases TextDiffBuilder requires 
>> O(n+m) time and space.
>> 
>> 
>> Levente
>> 
>>> 
>>> Not sure if the full functionality I'm describing here is in the last Cuis 
>>> version though.
>>> 
>>> /Leandro
>>> 
>>> 
>> 
>
>