[squeak-dev] The Inbox: Collections-mt.838.mcz

Levente Uzonyi leves at caesar.elte.hu
Thu Jul 4 22:44:29 UTC 2019


On Thu, 4 Jul 2019, commits at source.squeak.org wrote:

> A new version of Collections was added to project The Inbox:
> http://source.squeak.org/inbox/Collections-mt.838.mcz
>
> ==================== Summary ====================
>
> Name: Collections-mt.838
> Author: mt
> Time: 4 July 2019, 4:32:40.854026 pm
> UUID: ac4ab442-79c0-d246-8dec-914be7ee5356
> Ancestors: Collections-pre.837
>
> To String, adds simple analysis of natural language in source code. No word stemming.
>
> 1) Refactor #findTokens: to look like #lines (i.e. #linesDo: and #lineIndicesDo:).
> 2) Add #findFeaturesDo: like #findTokens:do: and #linesDo:.
>
> Try this:
>
> HTTPDownloadRequest name findFeatures.
> (Morph >> #drawOn:) getSource asString findFeatures.
>
> Where can that be useful?
>
> - Automatic insertion of "*" for search terms like "WeakDictionary" to also find WeakIdentityDictionary etc.
> - Prefix emphasis for names lists of classes in code browsers: MCAddition, MCAncestry, etc.

Given the new methods' completixy, I think they deserve tests.

>
> =============== Diff against Collections-pre.837 ===============
>
> Item was added:
> + ----- Method: String>>findFeatureIndicesDo: (in category 'accessing - features') -----
> + findFeatureIndicesDo: aBlock
> + 	"State machine that separates camelCase, UPPERCase, number/operator combinations and skips colons"

I think an example would help make it is easier to understand what this 
method does. (The same applies to #findTokens:, but I'm already familiar 
with that.)

> + 	| last state char "0 = start, 1 = a, 2 = A, 3 = AA, 4 = num, 5 = op"  |
> + 
> + 	state := 0.
> + 	last := 1.
> + 
> + 	1 to: self size do: [ :index |
> + 		char := self at: index.
> + 		"a"
> + 		char isLowercase ifTrue: [
> + 			(state < 3) ifTrue: [state := 1]. "*a -> a"
> + 			(state == 3) ifTrue: [

#= is optimized just as good as #== when the argument is a constant. Using 
#= and dropping the unnecessary parentheses would make the code look a bit 
less "C-style".

> + 				"AAa -> A + Aa (camel case follows uppercase)"
> + 				aBlock value: last value: index - 2.
> + 				last := index - 1.
> + 				state := 2].
> + 			(state > 3) ifTrue: [
> + 				"+a -> + | a (letter follows non-letter)" 
> + 				aBlock value: last value: index - 1.
> + 				last := index.
> + 				state := 1]] 
> + 		ifFalse: [
> + 			char isUppercase ifTrue: [
> + 				(state == 0)
> + 					ifTrue: [state := 2] "start -> A"
> + 					ifFalse: [
> + 						(state < 2 or: [state > 3]) ifTrue: [
> + 							"*A -> * | A (uppercase begins, flush before)"
> + 							aBlock value: last value: index - 1.
> + 							last := index.
> + 							state := 2] ifFalse: [
> + 								"AA -> AA (uppercase continues)"
> + 								state := 3]]]
> + 			ifFalse: [
> + 				("char == $: or:" char isSeparator) ifTrue: [
> + 					"skip colon/whitespace"
> + 					(state > 0) ifTrue: [
> + 						aBlock value: last value: index - 1.
> + 						state := 0].
> + 					last := index + 1]
> + 				ifFalse: [
> + 					char isDigit ifTrue: [
> + 						(state == 0)
> + 							ifTrue: [state := 4]
> + 							ifFalse: [
> + 							(state ~= 4) ifTrue: [
> + 								aBlock value: last value: index - 1.
> + 								last := index.
> + 								state := 4]]]
> + 						ifFalse: [
> + 							(state == 0)
> + 								ifTrue: [state := 5]
> + 								ifFalse: [
> + 									(state < 5) ifTrue: [
> + 										aBlock value: last value: index - 1.
> + 										last := index.
> + 										state := 5]]]]]]].
> + 	last <= self size ifTrue: [
> + 		aBlock value: last value: self size]!
>
> Item was added:
> + ----- Method: String>>findFeatures (in category 'accessing - features') -----
> + findFeatures
> + 
> + 	^ Array streamContents: [:features |
> + 		self findFeaturesDo: [:feature | features nextPut: feature]]!
>
> Item was added:
> + ----- Method: String>>findFeaturesDo: (in category 'accessing - features') -----
> + findFeaturesDo: aBlock
> + 	"Simple analysis for natural language in source code. No support for word stemming."
> + 
> + 	self findFeatureIndicesDo: [:start :end |
> + 		(self at: start) isLetter ifTrue: [
> + 			aBlock value: (self copyFrom: start to: end) asLowercase]].!
>
> Item was changed:
>  ----- Method: String>>findTokens: (in category 'accessing') -----
>  findTokens: delimiters
> + 	"Answer the collection of tokens that result from parsing self."
> + 
> + 	^ OrderedCollection streamContents: [:tokens |

#streamContents: should never be used with OrderedCollection.
OrderedCollection has its own streaming API (I would use #addLast: here) 
which is way more efficient.

> + 		self
> + 			findTokens: delimiters
> + 			do: [:token | tokens nextPut: token]]!
> - 	"Answer the collection of tokens that result from parsing self.  Return strings between the delimiters.  Any character in the Collection delimiters marks a border.  Several delimiters in a row are considered as just one separation.  Also, allow delimiters to be a single character."
> - 
> - 	| tokens keyStart keyStop separators |
> - 
> - 	tokens := OrderedCollection new.
> - 	separators := delimiters isCharacter 
> - 		ifTrue: [Array with: delimiters]
> - 		ifFalse: [delimiters].
> - 	keyStop := 1.
> - 	[keyStop <= self size] whileTrue:
> - 		[keyStart := self skipDelimiters: separators startingAt: keyStop.
> - 		keyStop := self findDelimiters: separators startingAt: keyStart.
> - 		keyStart < keyStop
> - 			ifTrue: [tokens add: (self copyFrom: keyStart to: (keyStop - 1))]].
> - 	^tokens!
>
> Item was added:
> + ----- Method: String>>findTokens:do: (in category 'accessing') -----
> + findTokens: delimiters do: aBlock
> + 
> + 	self
> + 		findTokens: delimiters
> + 		indicesDo: [:start :end | aBlock value: (self copyFrom: start to: end)].!
>
> Item was added:
> + ----- Method: String>>findTokens:indicesDo: (in category 'accessing') -----
> + findTokens: delimiters indicesDo: aBlock
> + 	"Parse self to find tokens between delimiters. Any character in the Collection delimiters marks a border.  Several delimiters in a row are considered as just one separation.  Also, allow delimiters to be a single character. Similar to #lineIndicesDo:."
> + 
> + 	| tokens keyStart keyStop separators |

There are a few opportunities to regain the performance lost with the 
introduction of blocks and sends:
- the tokens temporary is unused
- self size should be cached in a temporary (size)
- instead of Array >> #with:, a brace array should be used
- | keyStop keyStart separators size | would probably yield the best 
performance

Levente

> + 	separators := delimiters isCharacter 
> + 		ifTrue: [Array with: delimiters]
> + 		ifFalse: [delimiters].
> + 	keyStop := 1.
> + 	[keyStop <= self size] whileTrue: [
> + 		keyStart := self skipDelimiters: separators startingAt: keyStop.
> + 		keyStop := self findDelimiters: separators startingAt: keyStart.
> + 		keyStart < keyStop
> + 			ifTrue: [aBlock value: keyStart value: keyStop - 1]].!


More information about the Squeak-dev mailing list