[squeak-dev] The Inbox: Collections-mt.838.mcz

Marcel Taeumel marcel.taeumel at hpi.de
Fri Jul 5 14:57:19 UTC 2019


Hi Levente,

thanks for the tips! I would appreciate another review of Collections-mt.839 and CollectionsTests-mt.313. :-)

http://forum.world.st/The-Inbox-Collections-mt-839-mcz-tp5100939.html [http://forum.world.st/The-Inbox-Collections-mt-839-mcz-tp5100939.html]
http://forum.world.st/The-Inbox-CollectionsTests-mt-313-mcz-tp5100940.html [http://forum.world.st/The-Inbox-CollectionsTests-mt-313-mcz-tp5100940.html]

Best,
Marcel

Am 05.07.2019 00:44:39 schrieb Levente Uzonyi <leves at caesar.elte.hu>:
On Thu, 4 Jul 2019, commits at source.squeak.org wrote:

> A new version of Collections was added to project The Inbox:
> http://source.squeak.org/inbox/Collections-mt.838.mcz
>
> ==================== Summary ====================
>
> Name: Collections-mt.838
> Author: mt
> Time: 4 July 2019, 4:32:40.854026 pm
> UUID: ac4ab442-79c0-d246-8dec-914be7ee5356
> Ancestors: Collections-pre.837
>
> To String, adds simple analysis of natural language in source code. No word stemming.
>
> 1) Refactor #findTokens: to look like #lines (i.e. #linesDo: and #lineIndicesDo:).
> 2) Add #findFeaturesDo: like #findTokens:do: and #linesDo:.
>
> Try this:
>
> HTTPDownloadRequest name findFeatures.
> (Morph >> #drawOn:) getSource asString findFeatures.
>
> Where can that be useful?
>
> - Automatic insertion of "*" for search terms like "WeakDictionary" to also find WeakIdentityDictionary etc.
> - Prefix emphasis for names lists of classes in code browsers: MCAddition, MCAncestry, etc.

Given the new methods' completixy, I think they deserve tests.

>
> =============== Diff against Collections-pre.837 ===============
>
> Item was added:
> + ----- Method: String>>findFeatureIndicesDo: (in category 'accessing - features') -----
> + findFeatureIndicesDo: aBlock
> + "State machine that separates camelCase, UPPERCase, number/operator combinations and skips colons"

I think an example would help make it is easier to understand what this
method does. (The same applies to #findTokens:, but I'm already familiar
with that.)

> + | last state char "0 = start, 1 = a, 2 = A, 3 = AA, 4 = num, 5 = op" |
> +
> + state := 0.
> + last := 1.
> +
> + 1 to: self size do: [ :index |
> + char := self at: index.
> + "a"
> + char isLowercase ifTrue: [
> + (state a"
> + (state == 3) ifTrue: [

#= is optimized just as good as #== when the argument is a constant. Using
#= and dropping the unnecessary parentheses would make the code look a bit
less "C-style".

> + "AAa -> A + Aa (camel case follows uppercase)"
> + aBlock value: last value: index - 2.
> + last := index - 1.
> + state := 2].
> + (state > 3) ifTrue: [
> + "+a -> + | a (letter follows non-letter)"
> + aBlock value: last value: index - 1.
> + last := index.
> + state := 1]]
> + ifFalse: [
> + char isUppercase ifTrue: [
> + (state == 0)
> + ifTrue: [state := 2] "start -> A"
> + ifFalse: [
> + (state 3]) ifTrue: [
> + "*A -> * | A (uppercase begins, flush before)"
> + aBlock value: last value: index - 1.
> + last := index.
> + state := 2] ifFalse: [
> + "AA -> AA (uppercase continues)"
> + state := 3]]]
> + ifFalse: [
> + ("char == $: or:" char isSeparator) ifTrue: [
> + "skip colon/whitespace"
> + (state > 0) ifTrue: [
> + aBlock value: last value: index - 1.
> + state := 0].
> + last := index + 1]
> + ifFalse: [
> + char isDigit ifTrue: [
> + (state == 0)
> + ifTrue: [state := 4]
> + ifFalse: [
> + (state ~= 4) ifTrue: [
> + aBlock value: last value: index - 1.
> + last := index.
> + state := 4]]]
> + ifFalse: [
> + (state == 0)
> + ifTrue: [state := 5]
> + ifFalse: [
> + (state
> + aBlock value: last value: index - 1.
> + last := index.
> + state := 5]]]]]]].
> + last
> + aBlock value: last value: self size]!
>
> Item was added:
> + ----- Method: String>>findFeatures (in category 'accessing - features') -----
> + findFeatures
> +
> + ^ Array streamContents: [:features |
> + self findFeaturesDo: [:feature | features nextPut: feature]]!
>
> Item was added:
> + ----- Method: String>>findFeaturesDo: (in category 'accessing - features') -----
> + findFeaturesDo: aBlock
> + "Simple analysis for natural language in source code. No support for word stemming."
> +
> + self findFeatureIndicesDo: [:start :end |
> + (self at: start) isLetter ifTrue: [
> + aBlock value: (self copyFrom: start to: end) asLowercase]].!
>
> Item was changed:
> ----- Method: String>>findTokens: (in category 'accessing') -----
> findTokens: delimiters
> + "Answer the collection of tokens that result from parsing self."
> +
> + ^ OrderedCollection streamContents: [:tokens |

#streamContents: should never be used with OrderedCollection.
OrderedCollection has its own streaming API (I would use #addLast: here)
which is way more efficient.

> + self
> + findTokens: delimiters
> + do: [:token | tokens nextPut: token]]!
> - "Answer the collection of tokens that result from parsing self. Return strings between the delimiters. Any character in the Collection delimiters marks a border. Several delimiters in a row are considered as just one separation. Also, allow delimiters to be a single character."
> -
> - | tokens keyStart keyStop separators |
> -
> - tokens := OrderedCollection new.
> - separators := delimiters isCharacter
> - ifTrue: [Array with: delimiters]
> - ifFalse: [delimiters].
> - keyStop := 1.
> - [keyStop
> - [keyStart := self skipDelimiters: separators startingAt: keyStop.
> - keyStop := self findDelimiters: separators startingAt: keyStart.
> - keyStart
> - ifTrue: [tokens add: (self copyFrom: keyStart to: (keyStop - 1))]].
> - ^tokens!
>
> Item was added:
> + ----- Method: String>>findTokens:do: (in category 'accessing') -----
> + findTokens: delimiters do: aBlock
> +
> + self
> + findTokens: delimiters
> + indicesDo: [:start :end | aBlock value: (self copyFrom: start to: end)].!
>
> Item was added:
> + ----- Method: String>>findTokens:indicesDo: (in category 'accessing') -----
> + findTokens: delimiters indicesDo: aBlock
> + "Parse self to find tokens between delimiters. Any character in the Collection delimiters marks a border. Several delimiters in a row are considered as just one separation. Also, allow delimiters to be a single character. Similar to #lineIndicesDo:."
> +
> + | tokens keyStart keyStop separators |

There are a few opportunities to regain the performance lost with the
introduction of blocks and sends:
- the tokens temporary is unused
- self size should be cached in a temporary (size)
- instead of Array >> #with:, a brace array should be used
- | keyStop keyStart separators size | would probably yield the best
performance

Levente

> + separators := delimiters isCharacter
> + ifTrue: [Array with: delimiters]
> + ifFalse: [delimiters].
> + keyStop := 1.
> + [keyStop
> + keyStart := self skipDelimiters: separators startingAt: keyStop.
> + keyStop := self findDelimiters: separators startingAt: keyStart.
> + keyStart
> + ifTrue: [aBlock value: keyStart value: keyStop - 1]].!

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20190705/a44e224a/attachment.html>


More information about the Squeak-dev mailing list