[squeak-dev] The Trunk: Collections-mt.839.mcz

commits at source.squeak.org commits at source.squeak.org
Thu Jul 11 05:21:01 UTC 2019


Marcel Taeumel uploaded a new version of Collections to project The Trunk:
http://source.squeak.org/trunk/Collections-mt.839.mcz

==================== Summary ====================

Name: Collections-mt.839
Author: mt
Time: 5 July 2019, 4:54:03.194089 pm
UUID: 76695fa3-7ed2-cb4e-8c6e-b33c83dd526f
Ancestors: Collections-mt.838

Improves code commentary, structure, and performance. Thanks Levente for the tips!

(Adds tests for #findFeatures in CollectionsTests-mt.313).

=============== Diff against Collections-mt.838 ===============

Item was changed:
  ----- Method: String>>findFeatureIndicesDo: (in category 'accessing - features') -----
  findFeatureIndicesDo: aBlock
+ 	"Support for simple analysis of natural language in source code.
- 	"State machine that separates camelCase, UPPERCase, number/operator combinations and skips colons"
- 	| last state char "0 = start, 1 = a, 2 = A, 3 = AA, 4 = num, 5 = op"  |
  	
+ 	In addition to whitespace separation like #findTokens:, also separate features using higher-level rules:
+ 		(1) 'camelCase' -> #('camel' 'Case'),
+ 		(2) 'UPPERCase' -> #('UPPER' 'Case'),
+ 		(3) integer numbers such as 'MyModel55' -> #('My' 'Model' '55'), and 
+ 		(4) operators such as '5 <= 4' -> #('5' '<=' '4').
+ 	Other kinds of characters are tokenized as operators: '[state := 2]' -> #('[' 'state' ':=' '2' ']').
- 	state := 0.
- 	last := 1.
  	
+ 	This method works like #lineIndicesDo: and provides start/stop indices of tokens to the given aBlock to, for example, extract and normalize features (or tokens)."
+ 		
+ 	| last state char |
+ 	state := 0. "0 = start, 1 = a, 2 = A, 3 = AA, 4 = num, 5 = op"
+ 	last := 1. "last character index"
+ 	
  	1 to: self size do: [ :index |
  		char := self at: index.
+ 		
+ 		char isLowercase
+ 			ifTrue: [ "a"
+ 				state < 3 ifTrue: [state := 1]. "*a -> a"
+ 				state = 3 ifTrue: [
+ 					"AAa -> A + Aa (camel case follows uppercase)"
+ 					aBlock value: last value: index - 2.
+ 					last := index - 1.
+ 					state := 2].
+ 				state > 3 ifTrue: [
+ 					"+a -> + | a (letter follows non-letter)" 
+ 					aBlock value: last value: index - 1.
+ 					last := index.
+ 					state := 1]] 
- 		"a"
- 		char isLowercase ifTrue: [
- 			(state < 3) ifTrue: [state := 1]. "*a -> a"
- 			(state == 3) ifTrue: [
- 				"AAa -> A + Aa (camel case follows uppercase)"
- 				aBlock value: last value: index - 2.
- 				last := index - 1.
- 				state := 2].
- 			(state > 3) ifTrue: [
- 				"+a -> + | a (letter follows non-letter)" 
- 				aBlock value: last value: index - 1.
- 				last := index.
- 				state := 1]] 
- 		ifFalse: [
- 			char isUppercase ifTrue: [
- 				(state == 0)
- 					ifTrue: [state := 2] "start -> A"
- 					ifFalse: [
- 						(state < 2 or: [state > 3]) ifTrue: [
- 							"*A -> * | A (uppercase begins, flush before)"
- 							aBlock value: last value: index - 1.
- 							last := index.
- 							state := 2] ifFalse: [
- 								"AA -> AA (uppercase continues)"
- 								state := 3]]]
  			ifFalse: [
+ 				char isUppercase
+ 					ifTrue: [ "A"
+ 						state = 0
+ 							ifTrue: [state := 2] "start -> A"
- 				("char == $: or:" char isSeparator) ifTrue: [
- 					"skip colon/whitespace"
- 					(state > 0) ifTrue: [
- 						aBlock value: last value: index - 1.
- 						state := 0].
- 					last := index + 1]
- 				ifFalse: [
- 					char isDigit ifTrue: [
- 						(state == 0)
- 							ifTrue: [state := 4]
  							ifFalse: [
+ 								(state < 2) | (state > 3) ifTrue: [
+ 									"*A -> * | A (uppercase begins, flush before)"
+ 									aBlock value: last value: index - 1.
+ 									last := index.
+ 									state := 2] ifFalse: [
+ 										"AA -> AA (uppercase continues)"
+ 										state := 3]]]
+ 					ifFalse: [	
+ 						char isSeparator
+ 							ifTrue: [ " "
+ 								"skip whitespace"
+ 								state > 0 ifTrue: [
+ 									aBlock value: last value: index - 1.
+ 									state := 0].
+ 								last := index + 1]
+ 							ifFalse: [
+ 								
+ 								char isDigit
+ 									ifTrue: [ "num"
+ 										state = 0
+ 											ifTrue: [state := 4]
+ 											ifFalse: [
+ 											state ~= 4 ifTrue: [
+ 												aBlock value: last value: index - 1.
+ 												last := index.
+ 												state := 4]]]
+ 									ifFalse: [ "op"
+ 										state = 0
+ 											ifTrue: [state := 5]
+ 											ifFalse: [
+ 												state < 5 ifTrue: [
+ 													aBlock value: last value: index - 1.
+ 													last := index.
+ 													state := 5]] ] ] ] ] ].
- 							(state ~= 4) ifTrue: [
- 								aBlock value: last value: index - 1.
- 								last := index.
- 								state := 4]]]
- 						ifFalse: [
- 							(state == 0)
- 								ifTrue: [state := 5]
- 								ifFalse: [
- 									(state < 5) ifTrue: [
- 										aBlock value: last value: index - 1.
- 										last := index.
- 										state := 5]]]]]]].
  	last <= self size ifTrue: [
  		aBlock value: last value: self size]!

Item was changed:
  ----- Method: String>>findFeaturesDo: (in category 'accessing - features') -----
  findFeaturesDo: aBlock
+ 	"Simple analysis of natural language in source code. Select all features that are letters only, normalize them as lowercase. No support for word stemming.
+ 	
+ 	Example:
+ 		'Transcript show: 123 asString; cr; show: #HelloWorld.'
+ 			-> #('transcript' 'show' 'as' 'string' 'cr' 'show' 'hello' 'world')
+ 	"
- 	"Simple analysis for natural language in source code. No support for word stemming."
  
  	self findFeatureIndicesDo: [:start :end |
  		(self at: start) isLetter ifTrue: [
  			aBlock value: (self copyFrom: start to: end) asLowercase]].!

Item was added:
+ ----- Method: String>>findTokens (in category 'accessing - tokens') -----
+ findTokens
+ 
+ 	^ self findTokens: Character separators!

Item was changed:
  ----- Method: String>>findTokens: (in category 'accessing') -----
  findTokens: delimiters
+ 	"Answer the collection of tokens between delimiters, which results from parsing self."
- 	"Answer the collection of tokens that result from parsing self."
  	
+ 	| tokens |
+ 	tokens := OrderedCollection new.
+ 	self
+ 		findTokens: delimiters
+ 		do: [:token | tokens addLast: token].
+ 	^ tokens!
- 	^ OrderedCollection streamContents: [:tokens |
- 		self
- 			findTokens: delimiters
- 			do: [:token | tokens nextPut: token]]!

Item was changed:
+ ----- Method: String>>findTokens:do: (in category 'accessing - tokens') -----
- ----- Method: String>>findTokens:do: (in category 'accessing') -----
  findTokens: delimiters do: aBlock
  	
  	self
  		findTokens: delimiters
  		indicesDo: [:start :end | aBlock value: (self copyFrom: start to: end)].!

Item was changed:
+ ----- Method: String>>findTokens:indicesDo: (in category 'accessing - tokens') -----
+ findTokens: oneOrMoreCharacters indicesDo: aBlock
+ 	"Parse self to find tokens between delimiters. Any character in the Collection delimiters marks a border.  Several delimiters in a row are considered as just one separation. The interface is similar to #lineIndicesDo:."
- ----- Method: String>>findTokens:indicesDo: (in category 'accessing') -----
- findTokens: delimiters indicesDo: aBlock
- 	"Parse self to find tokens between delimiters. Any character in the Collection delimiters marks a border.  Several delimiters in a row are considered as just one separation.  Also, allow delimiters to be a single character. Similar to #lineIndicesDo:."
  	
+ 	| keyStart keyStop separators size |
+ 	size := self size.
+ 	separators := oneOrMoreCharacters isCharacter 
+ 		ifTrue: [{oneOrMoreCharacters}]
+ 		ifFalse: [oneOrMoreCharacters].
- 	| tokens keyStart keyStop separators |
- 	separators := delimiters isCharacter 
- 		ifTrue: [Array with: delimiters]
- 		ifFalse: [delimiters].
  	keyStop := 1.
+ 	[keyStop <= size] whileTrue: [
- 	[keyStop <= self size] whileTrue: [
  		keyStart := self skipDelimiters: separators startingAt: keyStop.
  		keyStop := self findDelimiters: separators startingAt: keyStart.
  		keyStart < keyStop
  			ifTrue: [aBlock value: keyStart value: keyStop - 1]].!



More information about the Squeak-dev mailing list