[squeak-dev] [Documentation] Spell checking within Squeak - suggestion of a method

Ian Trudel ian.trudel at gmail.com
Sat May 1 10:25:09 UTC 2010


Hannes,

I am not particularly fond of this solution. There are two major
problems: 1) we would have to build our own dictionary and risk
introducing errors, misspellings, etc., and 2) existing spell checkers
have more features (e.g. suggesting close matches when a word is
misspelled).
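
To illustrate what "suggesting close matches" means, here is a minimal
sketch in Python. It uses the standard library's difflib as a stand-in
and is not how Hunspell works internally. The word list, the function
name suggest and the 0.7 cutoff are assumptions made up for this example.

```python
import difflib

# Hypothetical word list; a real spell checker ships its own dictionaries.
KNOWN_WORDS = ["browser", "class", "collection", "method", "morph", "workspace"]

def suggest(word, known=KNOWN_WORDS, limit=3):
    """Return up to `limit` known words resembling `word`, or [] if it is known."""
    lowered = word.lower()
    if lowered in known:
        return []
    # get_close_matches ranks candidates by similarity ratio, best first.
    return difflib.get_close_matches(lowered, known, n=limit, cutoff=0.7)

print(suggest("colection"))
print(suggest("method"))
```

A hand-built list of allowed word forms can say that a word is unknown,
but it offers nothing like this ranking unless we write it ourselves.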

Hunspell is a good choice because it has been peer reviewed over and
over. It is used in projects with a large audience, such as OpenOffice
and Firefox. We may have to define a terminology dictionary, but we
won't have to deal with plain English, as that is already available.

Ian.

2010/4/30 Hannes Hirzel <hannes.hirzel at gmail.com>:
> On 4/30/10, Ian Trudel <ian.trudel at gmail.com> wrote:
>> 2010/4/30 Casey Ransberger <casey.obrien.r at gmail.com>:
>>> Ian, thanks. Comments inline.
>>
>>> I don't know of a spell checker implementation for Squeak. Is there one
>>> out
>>> there? If not, can you implement one and then get back to me right away?
>>> :P
>>
>> I don't know of one either.
>>
>
> In contrast to general word processing, our spell-checking needs are
> simpler. We are writing technical documentation and the number of word
> forms is more limited (my estimate: somewhere between 3000 and 5000).
> This means a simple dictionary of allowed word forms could do the job?
>
> How do we get that dictionary?
>
> Method 1)
> We create a Bag of all word forms in the existing comments. The words
> which are infrequent are candidates for being misspelt. They can be
> flagged and put into another collection.
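
Method 1 can be sketched roughly as follows. This is Python rather than
Squeak, with collections.Counter playing the role of the Bag. The sample
comments, the function name rare_word_forms and the max_count threshold
are assumptions for illustration only.

```python
import re
from collections import Counter

def rare_word_forms(comments, max_count=1):
    """Count all word forms in `comments`; return those seen at most `max_count` times."""
    counts = Counter()
    for comment in comments:
        # Simplification: a word form is any run of ASCII letters, lowercased.
        counts.update(re.findall(r"[a-z]+", comment.lower()))
    return sorted(word for word, n in counts.items() if n <= max_count)

# Hypothetical comments; in practice these would come from the image.
comments = [
    "Answer the receiver as a collection.",
    "Answer a new collection.",
    "Answer the recevier.",
]

print(rare_word_forms(comments))
```

With these three comments the result contains "recevier", but also the
correct words "as", "new" and "receiver". Low frequency only yields
candidates, so the flagged collection still needs a human or a real
dictionary to sort it out.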
>
> Method 2)
> We paste the list of words obtained from the comments into a regular
> word processor and run a spell check there.
>
> A diff against the original word list then gives the misspelt words.
>
> We paste the result back into the Squeak image.
>
> Within the Squeak image resides a collection of acceptable word forms
> (as part of the HelpSystem-Tools).
>
> Before accepting a comment this list is consulted.
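
The diff and the check before accepting a comment could look roughly
like this. Again a Python sketch, not Squeak code. The names
misspelt_forms, unknown_words and can_accept, and the sample word lists,
are hypothetical.

```python
import re

def misspelt_forms(original_words, checked_words):
    """Words in the original list that did not survive the external spell check."""
    return sorted(set(original_words) - set(checked_words))

def unknown_words(comment, allowed):
    """Word forms in `comment` that are missing from `allowed`, in first-seen order."""
    found = []
    for word in re.findall(r"[a-z]+", comment.lower()):
        if word not in allowed and word not in found:
            found.append(word)
    return found

def can_accept(comment, allowed):
    """A comment is acceptable when every word form is in the allowed list."""
    return not unknown_words(comment, allowed)

# Hypothetical data: the list taken from the comments and the corrected copy.
original = ["answer", "the", "receiver", "recevier", "collection"]
checked = ["answer", "the", "receiver", "collection"]
allowed = set(checked)

print(misspelt_forms(original, checked))
print(unknown_words("Answer the recevier.", allowed))
print(can_accept("Answer the receiver.", allowed))
```

Note that a check like this rejects every new but correct word until
someone adds it to the list, which is part of the maintenance cost.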
>
> This process has to be repeated for a few versions maybe until we have
> a comprehensive wordlist.
>
> Additional benefit: we limit the number of words used, thus making the
> texts easier to understand and more consistent.
>
> The idea here is 'controlled language', i.e. "Squeak Technical English" :-)
>
> --Hannes
>
> ------------------------------------------------------------------------------------
> http://en.wikipedia.org/wiki/Controlled_language
>
> Controlled natural languages (CNLs) are subsets of natural languages,
> obtained by restricting the grammar and vocabulary in order to reduce
> or eliminate ambiguity and complexity. Traditionally, controlled
> languages fall into two major types: those that improve readability
> for human readers (e.g. non-native speakers), and those that enable
> reliable automatic semantic analysis of the language.
>
> The first type of languages (often called "simplified" or "technical"
> languages), for example ASD Simplified Technical English, Caterpillar
> Technical English, IBM's Easy English, are used in the industry to
> increase the quality of technical documentation, and possibly simplify
> the (semi-)automatic translation of the documentation. These languages
> restrict the writer by general rules such as "write short and
> grammatically simple sentences", "use nouns instead of pronouns", "use
> determiners", and "use active instead of passive".[1]
>
>



-- 
http://mecenia.blogspot.com/


