[squeak-dev] The Trunk: Multilingual-ul.208.mcz

Nicolas Cellier nicolas.cellier.aka.nice at gmail.com
Sun May 3 01:29:14 UTC 2015


Ouch, yes, extracting the simple case mappings from the full CaseFolding data
was probably a mistake...
Thanks for reviewing, and as we say, vieux motard que jamais (better late
than never): that code is almost 5 years old.

The next job will be to comment the Unicode class and explain which Unicode
operations are supported...

--------------------------

Multilingual-nice.123
Author: nice
Time: 14 July 2010, 1:17:02.219 pm
UUID: ec8f05b8-78a6-4496-aca9-8f9b2e54823d
Ancestors: Multilingual-ul.122

1) simplify a case of at:ifAbsentPut: pattern in SparseXTable
2) provide a simple mapping of unicode upper/lower case characters as
described at http://unicode.org/reports/tr21/tr21-5.html

Note 1: Unicode class now provides two utilities to transform case of a
String rather than of a Character. This is for enabling future enhancements
like handling special casings when case folding does change the number of
characters.

Note 2: there is no automatic initialization performed yet. You'll have to
execute this before using above utilities:
Unicode initializeCaseMappings.

This is only an unoptimized, first attempt proposal. Comments and changes
are of course welcome.
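
To illustrate Note 1, here is a minimal workspace sketch of why a String-level
utility is needed once a mapping can change the number of characters. It is
not the code of the Unicode class: the table `special` and every variable name
are hypothetical, and the single entry (U+00DF uppercasing to 'SS') is the
well-known special casing from the Unicode data.

```smalltalk
| sharpS special source out result |
sharpS := Character value: 16rDF.

"Hypothetical one-entry table: a character mapped to a replacement String."
special := Dictionary new.
special at: sharpS put: 'SS'.

source := 'stra' , (String with: sharpS) , 'e'.

"Map character by character, but write Strings so the result may grow."
out := WriteStream on: String new.
source do: [ :ch |
    out nextPutAll: (special at: ch ifAbsent: [ String with: ch asUppercase ]) ].
result := out contents.

Transcript show: result; cr.
Transcript show: source size printString , ' -> ' , result size printString; cr.
```

A Character-to-Character mapping cannot express this, since the result has one
more character than the source.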

2015-05-03 2:15 GMT+02:00 <commits at source.squeak.org>:

> Levente Uzonyi uploaded a new version of Multilingual to project The Trunk:
> http://source.squeak.org/trunk/Multilingual-ul.208.mcz
>
> ==================== Summary ====================
>
> Name: Multilingual-ul.208
> Author: ul
> Time: 1 May 2015, 3:25:18.828 pm
> UUID: 82d19dac-c602-4c0d-bc9a-7858e3a3c283
> Ancestors: Multilingual-ul.206
>
> Improved Unicode caseMappings:
> - Don't overwrite an existing mapping, because that leads to problems
> (like (Unicode toUppercaseCode: $k asciiValue) = 8490)
> - Use PluggableDictionary class >> #integerDictionary for better lookup
> performance (~+16%), and compaction resistance (done at every release).
> - Compact the dictionaries before saving.
> - Save the new dictionaries atomically.
>
> =============== Diff against Multilingual-ul.206 ===============
>
> Item was changed:
>   ----- Method: Unicode class>>initializeCaseMappings (in category 'casing') -----
>   initializeCaseMappings
>         "Unicode initializeCaseMappings"
> +
> +       UIManager default informUserDuring: [ :bar |
> -       ToCasefold := IdentityDictionary new.
> -       ToUpper := IdentityDictionary new.
> -       ToLower := IdentityDictionary new.
> -       UIManager default informUserDuring: [:bar|
>                 | stream |
>                 bar value: 'Downloading Unicode data'.
>                 stream := HTTPClient httpGet: 'http://www.unicode.org/Public/UNIDATA/CaseFolding.txt'.
>                 (stream isKindOf: RWBinaryOrTextStream) ifFalse:[^self error: 'Download failed'].
>                 stream reset.
>                 bar value: 'Updating Case Mappings'.
> +               self parseCaseMappingFrom: stream ].!
> -               self parseCaseMappingFrom: stream.
> -       ].!
>
> Item was changed:
>   ----- Method: Unicode class>>parseCaseMappingFrom: (in category 'casing') -----
>   parseCaseMappingFrom: stream
>         "Parse the Unicode casing mappings from the given stream.
>         Handle only the simple mappings"
>         "
>                 Unicode initializeCaseMappings.
>         "
>
> +       | newToCasefold newToUpper newToLower casefoldKeys |
> +       newToCasefold := PluggableDictionary integerDictionary.
> +       newToUpper := PluggableDictionary integerDictionary.
> +       newToLower := PluggableDictionary integerDictionary.
> -       ToCasefold := IdentityDictionary new: 2048.
> -       ToUpper := IdentityDictionary new: 2048.
> -       ToLower := IdentityDictionary new: 2048.
>
> +       "Filter the mappings (Simple and Common) to newToCasefold."
> +       stream contents linesDo: [ :line |
> +               | data fields sourceCode destinationCode |
> +               data := line copyUpTo: $#.
> +               fields := data findTokens: '; '.
> +               (fields size > 2 and: [ #('C' 'S') includes: (fields at: 2) ]) ifTrue:[
> +                       sourceCode := Integer readFrom: (fields at: 1) base: 16.
> +                       destinationCode := Integer readFrom: (fields at: 3) base: 16.
> +                       newToCasefold at: sourceCode put: destinationCode ] ].
> -       [stream atEnd] whileFalse:[
> -               | fields line srcCode dstCode |
> -               line := stream nextLine copyUpTo: $#.
> -               fields := line withBlanksTrimmed findTokens: $;.
> -               (fields size > 2 and: [#('C' 'S') includes: (fields at: 2) withBlanksTrimmed]) ifTrue:[
> -                       srcCode := Integer readFrom: (fields at: 1) withBlanksTrimmed base: 16.
> -                       dstCode := Integer readFrom: (fields at: 3) withBlanksTrimmed base: 16.
> -                       ToCasefold at: srcCode put: dstCode.
> -               ].
> -       ].
>
> +       casefoldKeys := newToCasefold keys.
> +       newToCasefold keysAndValuesDo: [ :sourceCode :destinationCode |
> +               (self isUppercaseCode: sourceCode) ifTrue: [
> +                       "In most cases, uppercase letter are folded to lower case"
> +                       newToUpper at: destinationCode put: sourceCode.
> +                       newToLower at: sourceCode ifAbsentPut: destinationCode "Don't overwrite existing pairs. To avoid $k asUppercase to return the Kelvin character (8490)." ].
> +               (self isLowercaseCode: sourceCode) ifTrue: [
> +                       "In a few cases, two upper case letters are folded to the same lower case.
> +                       We must find an upper case letter folded to the same letter"
> +                       casefoldKeys
> +                               detect: [ :each |
> +                                       (self isUppercaseCode: each) and: [
> +                                               (newToCasefold at: each) = destinationCode ] ]
> +                               ifFound: [ :uppercaseCode |
> +                                       newToUpper at: sourceCode put: uppercaseCode ]
> +                               ifNone: [ ] ] ].
> +
> +       "Compact the dictionaries."
> +       newToCasefold compact.
> +       newToUpper compact.
> +       newToLower compact.
> +       "Save in an atomic operation."
> +       ToCasefold := newToCasefold.
> +       ToUpper := newToUpper.
> +       ToLower := newToLower
> +       !
> -       ToCasefold keysAndValuesDo:
> -               [:k :v |
> -               (self isUppercaseCode: k)
> -                       ifTrue:
> -                               ["In most cases, uppercase letter are folded to lower case"
> -                               ToUpper at: v put: k.
> -                               ToLower at: k put: v].
> -               (self isLowercaseCode: k)
> -                       ifTrue:
> -                               ["In a few cases, two upper case letters are folded to the same lower case.
> -                               We must find an upper case letter folded to the same letter"
> -                               | up |
> -                               up := ToCasefold keys detect: [:e | (self isUppercaseCode: e) and: [(ToCasefold at: e) = v]] ifNone: [nil].
> -                               up ifNotNil: [ToUpper at: k put: up]]].!
>
>
>
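
For the record, the overwrite problem mentioned in the summary can be
reproduced in a workspace without touching the Unicode class. This is only a
sketch under assumptions: it uses plain Dictionaries, the names `pairs`,
`overwritten` and `guarded` are made up for the illustration, and the two
entries are processed in a fixed order (the real iteration order depends on
the dictionary). CaseFolding.txt folds both U+004B (K) and U+212A (KELVIN
SIGN) to U+006B (k).

```smalltalk
| pairs overwritten guarded |
"Each pair is (source code, folded code), as parsed from CaseFolding.txt."
pairs := Array
    with: (Array with: 16r004B with: 16r006B)
    with: (Array with: 16r212A with: 16r006B).

"Unconditional at:put: lets the later entry win."
overwritten := Dictionary new.
pairs do: [ :pair | overwritten at: (pair at: 2) put: (pair at: 1) ].

"at:ifAbsentPut: keeps the first entry that was stored."
guarded := Dictionary new.
pairs do: [ :pair | guarded at: (pair at: 2) ifAbsentPut: [ pair at: 1 ] ].

Transcript show: (overwritten at: 16r006B) printString; cr. "8490, the Kelvin sign"
Transcript show: (guarded at: 16r006B) printString; cr. "75, plain K"
```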