Process, harvesting, getting your favorite things in the image

Tim Olson tim at io.com
Tue Mar 11 14:47:49 UTC 2003


Daniel Vainsencher <danielv at netvision.net.il> wrote:
| Yes I applied his fix, but - DataStream saves a Bag naively (not as
| item-count pairs). Saving 600K tokens this way is very wasteful.
| Specializing the save could remove a factor of 10. Making the token
| saving incremental (as saving the index is) would make it near
| instantaneous.

This would be great.  I tend to keep my Celeste window open in my saved
image, so I didn't run into the performance problem of the
DataStream/Bag serializer until fairly recently.

Another possibility I'm experimenting with: instead of saving the actual
tokens, convert each token to an MD5 hash (or some other
well-distributed hash), use that hash to pick a slot, and simply
increment/decrement counts in fixed-size hash tables for the spam tokens
and the mail tokens.  This would prevent uncontrolled growth of the
token database, while still providing continuous "training".  It would
take some experimentation to determine what size hash tables are
effective.
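
To make the idea concrete, here is a minimal sketch in Python (not
Squeak, and not anything taken from Celeste).  The class name, the
default table size, and the choice of the first 8 digest bytes are all
assumptions made for illustration.  Tokens that hash to the same slot
share a count; that is the price of a fixed-size table.

```python
import hashlib


class HashedTokenCounts:
    """Fixed-size table of counts, indexed by a hash of each token.

    Hypothetical illustration only: the name, the default size, and the
    slot computation are assumptions, not part of any existing filter.
    """

    def __init__(self, size=1 << 16):
        self.size = size
        self.counts = [0] * size  # never grows, whatever tokens arrive

    def _slot(self, token):
        # MD5 is used here only as a well-distributed hash, not for security.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.size

    def add(self, token):
        """Train: count one more occurrence of the token."""
        self.counts[self._slot(token)] += 1

    def remove(self, token):
        """Untrain: count one fewer occurrence, never going below zero."""
        slot = self._slot(token)
        if self.counts[slot] > 0:
            self.counts[slot] -= 1

    def count(self, token):
        """Return the count in the token's slot (collisions included)."""
        return self.counts[self._slot(token)]


# One table per category, as described above.
spam_counts = HashedTokenCounts()
mail_counts = HashedTokenCounts()

for token in ["free", "offer", "free"]:
    spam_counts.add(token)
for token in ["meeting", "squeak"]:
    mail_counts.add(token)
```

Saving such a table means writing a fixed number of integers, no matter
how many distinct tokens have been seen.  A smaller table saves space
but increases collisions, which is where the experimentation comes in.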

	-- tim



More information about the Squeak-dev mailing list