[squeak-dev] [Documentation] Spell checking within Squeak - suggestion of a method

Hannes Hirzel hannes.hirzel at gmail.com
Fri Apr 30 23:16:47 UTC 2010


On 4/30/10, Ian Trudel <ian.trudel at gmail.com> wrote:
> 2010/4/30 Casey Ransberger <casey.obrien.r at gmail.com>:
>> Ian, thanks. Comments inline.
>
>> I don't know of a spell checker implementation for Squeak. Is there one
>> out there? If not, can you implement one and then get back to me right
>> away? :P
>
> Neither do I know.
>

In contrast to general word processing, our spell-checking needs are
simpler. We are writing technical documentation, so the number of word
forms is more limited (my estimate: somewhere between 3000 and 5000).
So a simple dictionary of allowed word forms could probably do the job.

How do we get that dictionary?

Method 1)
We create a Bag of all word forms in the existing comments. The words
which are infrequent are candidates for being misspelt. They can be
flagged and put into another collection.
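
Method 1 could be sketched roughly as follows. This is Python rather than
Squeak code, purely for illustration: `Counter` plays the role of the Bag,
and the names `word_forms` and `rare_word_forms`, the tokenisation, and the
threshold `max_count=1` are all assumptions, not part of any existing tool.

```python
import re
from collections import Counter

def word_forms(comments):
    """Count every word form in the given comment strings (the 'Bag')."""
    bag = Counter()
    for comment in comments:
        # Assumption: lower-case the text and keep alphabetic runs only.
        bag.update(re.findall(r"[a-z]+", comment.lower()))
    return bag

def rare_word_forms(bag, max_count=1):
    """Return word forms seen at most max_count times (misspelling candidates)."""
    return sorted(word for word, count in bag.items() if count <= max_count)

# Made-up sample comments; one of them contains the typo 'reciever'.
comments = [
    "Answer the receiver as a string.",
    "Answer the reciever as a symbol.",
    "Answer a copy of the receiver.",
]
bag = word_forms(comments)
candidates = rare_word_forms(bag)
```

Rare words are only candidates: correct but uncommon words (here `copy` or
`string`) are flagged as well, so the flagged collection still needs a
human look.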

Method 2)
We paste the list of words obtained from the comments into a regular
word processor and run a spell check there.

A diff against the original word list then gives the misspelt words.

We paste the result back into the Squeak image.
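
The diff step of Method 2 amounts to a set difference. A minimal sketch in
Python, for illustration only; the function name `misspelt_words` and both
sample lists are hypothetical:

```python
def misspelt_words(original_words, accepted_words):
    """Words from the original list that did not survive the external spell check."""
    return sorted(set(original_words) - set(accepted_words))

# Word list extracted from the comments (made-up sample).
original = ["answer", "receiver", "reciever", "symbol"]
# Hypothetical list that came back from the word processor's spell check.
accepted = ["answer", "receiver", "symbol"]

misspelt = misspelt_words(original, accepted)
```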

Within the Squeak image resides a collection of acceptable word forms
(as part of the HelpSystem-Tools).

Before accepting a comment this list is consulted.
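
Consulting the list before accepting a comment could look something like
the sketch below. It is Python for illustration only; the function name
`unknown_words`, the tokenisation, and the sample word list are
assumptions, and a real check would live in the Squeak image.

```python
import re

def unknown_words(comment, allowed_words):
    """Return word forms in the comment that are not allowed, in order of first appearance."""
    unknown = []
    for word in re.findall(r"[a-z]+", comment.lower()):
        if word not in allowed_words and word not in unknown:
            unknown.append(word)
    return unknown

# Made-up collection of acceptable word forms.
allowed = {"answer", "the", "receiver", "as", "a", "string"}

# An empty result would mean the comment can be accepted.
problems = unknown_words("Answer the reciever as a string.", allowed)
```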

This process may have to be repeated over a few versions until we have
a comprehensive word list.

An additional benefit: we limit the number of words used, which makes
the texts easier to understand and more consistent.

The idea here is 'controlled language' - i.e. "Squeak Technical English" :-)

--Hannes

------------------------------------------------------------------------------------
http://en.wikipedia.org/wiki/Controlled_language

Controlled natural languages (CNLs) are subsets of natural languages,
obtained by restricting the grammar and vocabulary in order to reduce
or eliminate ambiguity and complexity. Traditionally, controlled
languages fall into two major types: those that improve readability
for human readers (e.g. non-native speakers), and those that enable
reliable automatic semantic analysis of the language.

The first type of languages (often called "simplified" or "technical"
languages), for example ASD Simplified Technical English, Caterpillar
Technical English, IBM's Easy English, are used in the industry to
increase the quality of technical documentation, and possibly simplify
the (semi-)automatic translation of the documentation. These languages
restrict the writer by general rules such as "write short and
grammatically simple sentences", "use nouns instead of pronouns", "use
determiners", and "use active instead of passive".[1]


