Full-text search, performance results.

Wed Jan 23 02:44:50 UTC 2002

On 23 Jan 2002, Cees de Groot wrote:

> Scott A Crosby <crosby at qwes.math.cmu.edu> said:
> >The complexity is in dealing with the disk. If you can assume
> >everything is in RAM, things get a *lot* easier. I'm assuming this.
> >
> The Squeak-dev mailing list archive for January so far is 2.5Mb worth of text,
> the collected archives since squeak-dev moved to squeakfoundation.org last
> July is around 35Mb.

... So don't index *everything*. Very frequent words may not need to be
indexed, short words also can be ignored. :)

Anyways, I've already got an implementation.

>
> Now, there is something to say for squeak -memory 512Mb and let Linux
> sort of swapping, but it'd probably be better to be able to write the
> thing on disk right away.
>

Then you trade memory for HD latency. In essence, you're doing the
swapping to the drive manually. There isn't really any way to win.  I'm
going for simplicity. 512mb isn't that much RAM. If you can index a large
swiki, all the source code and the squeak mailing list going back 3 years
in 300mb of RAM, is it worth the slowdown and complexity of diskbasing it?

Now, for quick performance numbers:

--------------------------------------
# Index all method source in class Morph:

- Time to build index: 13 seconds (1/4 of which was getting the source)
- Space for index: 670kb.
- Space for method text: 285kb.
- Searches:  10-20ms (matching a few hundred and resulting in 20)

I do:
((di anyOf: #(super topleft this that) asSet) intersection:
 (di anyOf: #(size submorphs) asSet)) size
which has 400 intermediate results and 50 final results

too fast to profile. :)

---------------------------------

#Index of all methods in system (??)
Smalltalk allClassesDo: [ :aClass | aClass selectorsDo:
    [ :method | di add: ((aClass sourceCodeAt: method) asString)]]].

- Time to build index: 8 minutes
- Space for index: 17mb
- Size of all method text: 10mb
- Number distinct strings: 40,000
- Searches: fast.

--
((di anyOf: #(super topleft this that) asSet) intersection:
 (di anyOf: #(size submorphs) asSet)) size

uses 11,000 intermediate values, and creates 1200 results in 2 seconds.
 (95% of the time is spent in IdentitySet>>scansFor: [*])

--
(di anyOf: #(color) asSet) size + (di anyOf: #(size) asSet) size
uses 5000 intermediate values, and creates 183 results in 244 ms.
 (64% of the time spent in identitySet>>scansFor:),

-------------------- Performance analysis -------------

[*] If method text's had a 28-bit hash value associated with them, then
the this test would likely be about 10x faster; 400ms.  Thats because I'm
dealing with sets > 4000 in size. I cannot use normal Sets because
Text>>hash, is extremely expensive. (20x slowdown).

Furthermore, I'm not very intelligent about doing work in an order that
minimizes the work needed. This sort of optimization is mostly
heuristic-based, and a function of the types of queries, so I'm not
focussing on it.

-------------------------------------------

In these, I remove any strings <4 characters long, and a few
common method names (ifTrue, ifFalse, self)

> Can't wait for it, query language looks cool :-)

Query language is trivial. You just query each index with the list of
disjunctive phrases, which it union's together, then
Collection>>intersect: those together for each disjunctive clause. There's
no requirement that the seperate disjunctive clauses be in the same index,
just that they index the same document set.

And, if you want to be smarter, intersect starting with the smallest set
first.

--

Anyways, the work here depends on an updated version of my string refactor
patch, my MiscEnhancements patch (for new stuff in CharacterSet), and a
signifigant rewrite of the Skiplists currently in the image.

The interface is you construct the index around an adaptor that, given a
document, creates a list of keywords in that document.. Documents may be
indexed in more than one index. For example, methods may have several
adaptors. One for the method text, one for all the stuff they invoke, one
for their comments, etc. Each forms a seperate index. Furthermore, I can
index stuff other than strings. I also allow removing documents from the
index.

--

Give me a few days to clean it up, and I'll post them away, unless you
want the messy patches to play with?

Scott