Process, harvesting, getting your favorite things in the image
Daniel Vainsencher
danielv at netvision.net.il
Tue Mar 11 15:28:25 UTC 2003
Sure, but using hashes might introduce noise even as it reduces cost, by
mixing spam and non-spam tokens under one slot. I'd thought of simply not
saving any tokens that are very un-indicative. Probably none of the tokens
with spam ratios between 30% and 70% are ever critical to classification.
But this would require testing.
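A minimal sketch of that pruning idea, in Python rather than Squeak
Smalltalk (the thresholds are the 30%/70% ratios from above; the token
names and counts are made-up illustrations, not real data):

```python
# Sketch (not Celeste's actual code): prune tokens whose spam ratio is
# un-indicative, i.e. between 30% and 70%, before saving the database.

def spam_ratio(spam_count, ham_count):
    """Fraction of a token's occurrences that were in spam."""
    total = spam_count + ham_count
    return spam_count / total if total else 0.5

def prune(tokens, low=0.30, high=0.70):
    """Keep only tokens whose spam ratio falls outside [low, high]."""
    return {tok: counts for tok, counts in tokens.items()
            if not (low <= spam_ratio(*counts) <= high)}

tokens = {
    "viagra": (95, 5),    # strongly spammy -> kept
    "meeting": (2, 98),   # strongly hammy  -> kept
    "the": (50, 50),      # un-indicative   -> dropped
}
print(sorted(prune(tokens)))  # ['meeting', 'viagra']
```

Whether dropping that middle band actually hurts accuracy is exactly the
open question: the testing mentioned above would have to decide.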
Hmm, I should stop emptying my trash, so I can build up a spam repository
to test theories on.
We could have some fun here, once we make Celeste and SpamFilter into
packages.
Daniel
Tim Olson <tim at io.com> wrote:
> Daniel Vainsencher <danielv at netvision.net.il> wrote:
> | Yes, I applied his fix, but - DataStream saves a Bag naively (not as
> | item-count pairs). Saving 600K tokens this way is very wasteful.
> | Specializing the save could remove a factor of 10. Making the token
> | saving incremental (as the saving of the index already is) would make
> | it near instantaneous.
>
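The quoted point about item-count pairs can be sketched in Python (the
original is Squeak Smalltalk; a `Counter` stands in for a Bag, `pickle`
for DataStream, and the 10,000-word vocabulary with 60 occurrences each
is an illustrative assumption, not a measurement from Celeste):

```python
# Sketch: a Bag holding 600K token occurrences of a much smaller
# vocabulary serializes far more compactly as (item, count) pairs
# than as a flat list with one entry per occurrence.
from collections import Counter
import pickle

bag = Counter({f"token{i}": 60 for i in range(10_000)})  # 600K occurrences

naive = pickle.dumps(list(bag.elements()))   # one entry per occurrence
paired = pickle.dumps(sorted(bag.items()))   # one (item, count) pair each

print(len(naive) > 5 * len(paired))  # True: pairs are much smaller
```

The same reasoning applies to DataStream: writing each distinct token
once with its count avoids repeating tokens per occurrence.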
> This would be great. I tend to keep my Celeste window open in my saved
> image, so I didn't run into the performance problem of the
> DataStream/Bag serializer until fairly recently.
>
> Another possibility I'm experimenting with: instead of saving the actual
> tokens, convert each token to an MD5 hash (or some other
> well-distributing hash) and simply increment/decrement counts in
> fixed-size hash tables for the spam tokens and the mail tokens. This
> would prevent uncontrolled growth of the token database, while still
> providing continuous "training". It would take some experimentation to
> determine what size hash tables are effective.
>
> -- tim
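Tim's fixed-size hash-table idea might look like the following Python
sketch (the original would be Squeak Smalltalk; `TABLE_SIZE` and the
sample token streams are illustrative assumptions, not values from the
post):

```python
# Sketch: hash each token with MD5 and increment a counter in a
# fixed-size table, one table for spam tokens and one for mail tokens,
# so the token database can never grow without bound. Collisions trade
# some accuracy for bounded space.
import hashlib

TABLE_SIZE = 2 ** 16  # experimentation would determine a good size

def slot(token):
    """Map a token to a table index via a well-distributing hash (MD5)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % TABLE_SIZE

spam_counts = [0] * TABLE_SIZE
mail_counts = [0] * TABLE_SIZE

def train(tokens, is_spam):
    """Continuous training: bump the slot count for every token seen."""
    table = spam_counts if is_spam else mail_counts
    for tok in tokens:
        table[slot(tok)] += 1

train(["cheap", "pills", "cheap"], is_spam=True)
train(["lunch", "tomorrow"], is_spam=False)
print(spam_counts[slot("cheap")])  # 2, barring an unlikely collision
```

Decrementing counts (e.g. when a message is reclassified) works the same
way, which is what makes the "continuous training" in the quoted text
possible without storing the tokens themselves.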
More information about the Squeak-dev
mailing list