Process, harvesting, getting your favorite things in the image

Daniel Vainsencher danielv at netvision.net.il
Tue Mar 11 15:28:25 UTC 2003


Sure, but using hashes might introduce noise even as it reduces cost,
by mixing spam and non-spam tokens in the same buckets. I'd thought of
simply not saving any tokens that are very unindicative. Probably all
those tokens with ratios between 30 and 70 are never that critical.
But this would require testing.
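
Roughly what I mean, as a sketch (the token->ratio dictionary and the
exact cutoffs here are just illustrative, not what SpamFilter actually
stores):

  | tokenRatios indicative |
  tokenRatios := Dictionary new.
  tokenRatios at: 'viagra' put: 99.
  tokenRatios at: 'the' put: 50.
  tokenRatios at: 'squeak' put: 2.
  "Keep only the tokens outside the uninformative 30-70 band."
  indicative := tokenRatios associations select:
      [:assoc | assoc value < 30 or: [assoc value > 70]].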

Hmm, I should stop emptying my trash, so it can grow into a spam
repository to test theories on.

We could have some fun here, after we make Celeste and SpamFilter into
packages.

Daniel

Tim Olson <tim at io.com> wrote:
> Daniel Vainsencher <danielv at netvision.net.il> wrote:
> | Yes, I applied his fix, but - DataStream saves a Bag naively (not
> | as item-count pairs). Saving 600K tokens this way is very wasteful.
> | Specializing the save could remove a factor of 10. Making the token
> | saving incremental (as saving the index is) would make it nearly
> | instantaneous.
> 
> This would be great.  I tend to keep my Celeste window open in my saved
> image, so I didn't run into the performance problem of the
> DataStream/Bag serializer until fairly recently.
> 
> Another possibility I'm experimenting with: instead of saving the actual
> tokens, convert each token to an MD5 hash (or some other
> well-distributing hash) and simply increment/decrement counts in 
> fixed-size hash tables for the spam tokens and the mail tokens.  This
> would prevent uncontrolled growth of the token database, while still
> providing continuous "training".  It would take some experimentation to
> determine what size hash tables are effective.
> 
> 	-- tim
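
To spell out the item-count save from my quoted text above: the
specialized version would write each distinct token once with its
count, instead of letting DataStream write every occurrence. A rough
sketch, with the actual stream handling elided:

  | bag pairs |
  bag := Bag new.
  bag add: 'spam'; add: 'spam'; add: 'ham'.
  "One association per distinct token instead of one entry per
  occurrence - 600K occurrences collapse to the distinct vocabulary."
  pairs := bag asSet collect:
      [:token | token -> (bag occurrencesOf: token)].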
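And to make the fixed-size table idea concrete, a rough sketch (I'm
using the SecureHashAlgorithm that ships with Squeak in place of MD5,
and the table size is just a guess, to be settled by the experiments
Tim mentions):

  | tableSize spamCounts bucketFor |
  tableSize := 65536.
  spamCounts := Array new: tableSize withAll: 0.
  "Map a token to a bucket via a well-distributing hash."
  bucketFor := [:token |
      (SecureHashAlgorithm new hashMessage: token) \\ tableSize + 1].
  "Training on a spam token just bumps its bucket's count. Colliding
  tokens share a bucket, which is exactly the noise I worried about."
  spamCounts
      at: (bucketFor value: 'viagra')
      put: (spamCounts at: (bucketFor value: 'viagra')) + 1.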


