Just print these two scripts:
|s| s := WideString with: (Character value: 16r55E4). [100000 timesRepeat: [s hash]] timeToRun.
|s| s := ByteString with: (Character value: 16rE4) with: (Character value: 16rE4) with: (Character value: 16rE4) with: (Character value: 16rE4). [100000 timesRepeat: [s hash]] timeToRun.
I've seen that ByteString>>stringHash:initialHash uses a primitive. Could we have such an optimisation for WideString?
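For context, the primitive computes the hash by adding each character value and then applying hashMultiply, which works out to a multiply by 16r19660D modulo 2^28 (the two halves 16r0065 and 16r260D visible in the image-side fallback code recombine into that single multiplier). A Python sketch of the arithmetic, for illustration only (the function names are mine, not Squeak's):

```python
M = 0x19660D      # hashMultiply multiplier (16r19660D)
MASK = 0xFFFFFFF  # hashes are kept to 28 bits

def hash_multiply(h):
    # Piecewise form used by the image-side fallback: split h into a
    # 14-bit low part and a high part so intermediate products stay small.
    low = h & 16383
    return (0x260D * low
            + (((0x260D * (h >> 14) + 0x0065 * low) & 16383) * 16384)) & MASK

def string_hash(char_values, initial_hash):
    # Add each character value, then scramble with hashMultiply.
    h = initial_hash & MASK
    for c in char_values:
        h = hash_multiply(h + c)
    return h

# The piecewise form agrees with a plain 28-bit modular multiply:
assert all(hash_multiply(h) == h * M & MASK for h in (0, 1, 0x55E4, MASK))
```

The piecewise decomposition only matters in Smalltalk, where it keeps intermediate results within SmallInteger range; a primitive can do the multiply directly.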
Christophe.
On Fri, 14 May 2010, christophe.jalady@free.fr wrote:
[...]
Sure. There are (at least) four ways to do it: 1) Update the implementation of the primitive to support word arrays, then build a VM with it, and finally change the image-side code. This would be the best IMHO. Is there a reason why the primitive is not implemented this way? 2) Use NativeBoost to write a primitive for this method. This is much easier than building a VM/plugin, though you have to write the code in x86 assembly. This would be a cool demo for NativeBoost's primitive-writing capabilities. (Much better than the Fibonacci-number calculator. :)) 3) Use Exupery; I think it can optimize methods like this. 4) Use the Cog VM. ;) (I wonder how fast this method is with Cog.)
Levente
On 5/14/10, Levente Uzonyi leves@elte.hu wrote:
On Fri, 14 May 2010, christophe.jalady@free.fr wrote:
[...]
Sure. There are (at least) four ways to do it:
- Update the implementation of the primitive to support word arrays. Then build a VM with it and finally change the image-side code. This would be the best IMHO. Is there a reason why the primitive is not implemented this way?

If this is not work you feel comfortable with, then post this on the VM developers' list and/or create a Mantis entry.

- Use NativeBoost to write a primitive for this method. This is much easier than building a VM/plugin, though you have to write the code in x86 assembly. This would be a cool demo for NativeBoost's primitive-writing capabilities. (Much better than the Fibonacci-number calculator. :))

Yes, this would be a very good demonstration.

- Use Exupery; I think it can optimize methods like this.
- Use the Cog VM. ;) (I wonder how fast this method is with Cog.)
What kind of VM are we using in 4.1? It supports closures, but seemingly is not the Cog VM (yet)?
Hannes
On Fri, May 14, 2010 at 10:28 AM, Levente Uzonyi leves@elte.hu wrote:
On Fri, 14 May 2010, christophe.jalady@free.fr wrote:
[...]
4) Use the Cog VM. ;) (I wonder how fast this method is with Cog.)
2.66GHz Intel Core i7 MacBook Pro (my new tool!!!!). Current Cog, my 3.8 derived work image:
|s| s := WideString with: (Character value: 16r55E4). [100000 timesRepeat: [s hash]] timeToRun. 16
|s| s := ByteString with: (Character value: 16rE4) with: (Character value: 16rE4) with: (Character value: 16rE4) with: (Character value: 16rE4). [100000 timesRepeat: [s hash]] timeToRun. 10
|s| s := WideString with: (Character value: 16r55E4). [100000000 timesRepeat: [s hash]] timeToRun. 15725
|s| s := ByteString with: (Character value: 16rE4) with: (Character value: 16rE4) with: (Character value: 16rE4) with: (Character value: 16rE4). [100000000 timesRepeat: [s hash]] timeToRun. 11078
Current 4.1/Squeak 4.1.1beta2U.app:
|s| s := WideString with: (Character value: 16r55E4). [100000 timesRepeat: [s hash]] timeToRun. 98
|s| s := ByteString with: (Character value: 16rE4) with: (Character value: 16rE4) with: (Character value: 16rE4) with: (Character value: 16rE4). [100000 timesRepeat: [s hash]] timeToRun. 31
|s| s := WideString with: (Character value: 16r55E4). [100000000 timesRepeat: [s hash]] timeToRun. 100660
|s| s := ByteString with: (Character value: 16rE4) with: (Character value: 16rE4) with: (Character value: 16rE4) with: (Character value: 16rE4). [100000000 timesRepeat: [s hash]] timeToRun. 29476
Current 4.1/current Cog:
|s| s := WideString with: (Character value: 16r55E4). [100000 timesRepeat: [s hash]] timeToRun. 14
|s| s := ByteString with: (Character value: 16rE4) with: (Character value: 16rE4) with: (Character value: 16rE4) with: (Character value: 16rE4). [100000 timesRepeat: [s hash]] timeToRun. 10
|s| s := WideString with: (Character value: 16r55E4). [100000000 timesRepeat: [s hash]] timeToRun. 14946
|s| s := ByteString with: (Character value: 16rE4) with: (Character value: 16rE4) with: (Character value: 16rE4) with: (Character value: 16rE4). [100000000 timesRepeat: [s hash]] timeToRun. 10171
So quite a bit faster :)
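To put the long (100000000-iteration) runs above in perspective, the speedup of current Cog over the 4.1 interpreter VM works out to roughly:

```python
# Timings in ms, taken verbatim from the results quoted above.
wide_speedup = 100660 / 14946   # WideString>>hash: interpreter vs. Cog
byte_speedup = 29476 / 10171    # ByteString>>hash: interpreter vs. Cog
print(f"~{wide_speedup:.1f}x wide, ~{byte_speedup:.1f}x byte")  # ~6.7x wide, ~2.9x byte
```

The wide case gains more because its hash is computed entirely in bytecode, which is exactly what Cog accelerates; the byte case spends most of its time in the primitive either way.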
best Eliot
On Fri, 14 May 2010, Eliot Miranda wrote:
[...]
So quite a bit faster :)
Thanks, it looks really nice. I wonder why the results with 4.1 are better than with your 3.8-derived image.
Levente
On Fri, May 14, 2010 at 1:03 PM, Levente Uzonyi leves@elte.hu wrote:
[...]
Thanks, it looks really nice. I wonder why the results with 4.1 are better than with your 3.8-derived image.
The code looks the same, so I expect the times are close enough to be in the noise. I didn't shut my machine down to a consistent state, so other tasks could have affected things, etc.
best Eliot
I implemented WideString hash in NB.
|s| s := WideString with: (Character value: 16r55E4). [100000000 timesRepeat: [s hash]] timeToRun. 155718
|s| s := WideString with: (Character value: 16r55E4). [100000000 timesRepeat: [s nbHash]] timeToRun. 81231
Which makes me think that we're benchmarking the outer loop rather than the hash function itself :)
So, here is a more appropriate benchmark, I think:
|s t1 t2 |
s := (WideString with: (Character value: 16r55E4)) , 'abcdefghijklmno'.
10 timesRepeat: [ s := s , s ].
self assert: (s hash = s nbHash).
t1 := [1000 timesRepeat: [s hash]] timeToRun.
t2 := [1000 timesRepeat: [s nbHash]] timeToRun.
{ t1. t2 } #(11479 134)
~ 85x faster :)
I added the implementation to the examples package: http://www.squeaksource.com/NativeBoost/NativeBoost-Examples-Igor.Stasenko.1...
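Igor's point about benching the outer loop is a general pitfall: with a one-character string, block activation and loop bookkeeping can dominate the hash itself. One common correction, sketched here in Python (the helper names are mine, not from the thread), is to time an empty loop of the same shape and subtract it:

```python
import time

def time_to_run(block, n):
    # Rough analogue of Smalltalk's [n timesRepeat: [...]] timeToRun,
    # returning seconds instead of milliseconds.
    start = time.perf_counter()
    for _ in range(n):
        block()
    return time.perf_counter() - start

def net_cost(block, n=100_000):
    # Subtract an empty-loop baseline so the result approximates the
    # cost of the body alone, not loop overhead plus body.
    baseline = time_to_run(lambda: None, n)
    return time_to_run(block, n) - baseline

# Example: with a longer string the hash dominates anyway, which is
# exactly why Igor switched to longer strings below.
wide = "\u55e4" + "abcdefghijklmno"
cost = net_cost(lambda: hash(wide))
```

Lengthening the string, as the revised benchmark does, achieves the same thing by amortizing the per-call overhead.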
On 15 May 2010 01:14, Igor Stasenko siguctua@gmail.com wrote:
[...]
Oh, yeah... I forgot to hash a plain ByteString for comparison:
|ss s t1 t2 t3 |
s := (WideString with: (Character value: 16r55E4)) , 'abcdefghijklmno'.
ss := (String with: $z) , 'abcdefghijklmno'.
10 timesRepeat: [ s := s , s. ss := ss , ss ].
self assert: (s hash = s nbHash).
t1 := [1000 timesRepeat: [s hash]] timeToRun.
t2 := [1000 timesRepeat: [s nbHash]] timeToRun.
t3 := [1000 timesRepeat: [ss hash]] timeToRun.
{ t1. t2. t3 } #(11477 134 190)
So, WideString>>nbHash outperforms ByteString>>hash a little for the same string sizes. ;)
Hi Igor,
is the NB implementation mapping the character codes in the wide strings into Character objects and taking the character hashes? If so, very cool. The NB code is very fast. If, on the other hand, you're just short-circuiting the character-code lookup, then you're cheating :)
On Fri, May 14, 2010 at 3:31 PM, Igor Stasenko siguctua@gmail.com wrote:
[...]
On 15 May 2010 01:38, Eliot Miranda eliot.miranda@gmail.com wrote:
[...]
What mapping do you have in mind?
WideString>>at: index
	"Answer the Character stored in the field of the receiver indexed by the argument."
	^ Character value: (self wordAt: index)

Character class>>value: anInteger
	"Answer the Character whose value is anInteger."
	anInteger > 255 ifTrue: [^self basicNew setValue: anInteger].
	^ CharacterTable at: anInteger + 1
(Character classPool at: #CharacterTable) withIndexDo: [:ch :i | self assert: (ch asInteger = (i-1))]
So, there is a 1:1 correspondence between the word stored in the wide string (self wordAt: index) and the Character value used for hashing. So, no mapping is required.
And besides, if someone bothered implementing a primitive for hashing WideString, I doubt that they would create an instance of Character for each index, read its value, and only then use it for hashing. So I don't think this is cheating; it's just an optimization :) Of course Cog, even if it optimizes things cleverly, still has to follow the code and create real Character instances, simply because that is how it is written, and it has to honor the language semantics.
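To make the no-mapping argument concrete: a Character above 255 is just a box around its integer value, so hashing the raw words must give the same answer as hashing via Character objects. A trivial Python analogue, with chr/ord standing in for Character value:/asInteger (names are illustrative only):

```python
def char_value_hash(word):
    # The "mapping" path: materialize the character object, then take its
    # value back, mirroring Character value: (self wordAt: index) asInteger.
    ch = chr(word)
    return ord(ch)

words = [0x55E4, 0xE4, 0x61]  # code points used in the thread's examples
# The round trip is the identity, so a primitive may hash the words directly.
assert [char_value_hash(w) for w in words] == words
```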
On 15 May 2010 01:53, Igor Stasenko siguctua@gmail.com wrote:
[...]
The cardinal rule of running benchmarks is to compare apples to apples. You've compared apples to oranges, i.e. an optimized reimplementation of WideString>>hash that eliminates the mapping of codes to characters, against the vanilla Squeak implementation. You need to at least compare the NB implementation against this WideString method (protocol 'comparison'):

fastHash
	| stringSize hash low |
	stringSize := self size.
	hash := ByteString identityHash bitAnd: 16rFFFFFFF.
	1 to: stringSize do: [:pos |
		hash := hash + (self wordAt: pos).
		"Begin hashMultiply"
		low := hash bitAnd: 16383.
		hash := (16r260D * low + ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF].
	^ hash

| s n |
s := (WideString with: (Character value: 16r55E4)) , 'abcdefghijklmno'.
n := 100000.
{ [1 to: n do: [:i | s fastHash. s fastHash. s fastHash. s fastHash. s fastHash. s fastHash. s fastHash. s fastHash. s fastHash. s fastHash]] timeToRun.
  [1 to: n do: [:i | s hash. s hash. s hash. s hash. s hash. s hash. s hash. s hash. s hash. s hash]] timeToRun. }
#(829 1254)
So your measurements tell us nothing about a general comparison of NB against the Squeak VM or Cog. They only demonstrate (unsurprisingly) that a loop summing integers in an array goes PDQ. On the other hand, my numbers showed Cog 10x faster than the Squeak interpreter when executing exactly the same bytecode.
best Eliot
On 15 May 2010 02:35, Eliot Miranda eliot.miranda@gmail.com wrote:
[...]
Yes, of course you're right. But I didn't compare it with Cog, because I can't. And NB is not for translating bytecodes into native code; it is for authoring primitives using native code. So, my comparison shows how much faster it could be if implemented primitively.
I can only imagine how much faster the whole thing would be if we cross-hatch NB and Cog, where Cog serves as a Smalltalk optimizer and NB serves for manual optimization of heavy numerical crunching :)
On Fri, May 14, 2010 at 4:43 PM, Igor Stasenko siguctua@gmail.com wrote:
[...]
I can only imagine how much faster the whole thing would be if we cross-hatch NB and Cog, where Cog serves as a Smalltalk optimizer and NB serves for manual optimization of heavy numerical crunching :)
Yes, this is a very cool direction. One effort should be AOStA/SIStA with Marcus, which will do adaptive optimization/speculative inlining a la Self & HotSpot. Being able to directly generate machine code with NB, and not have to rely on the VM's naive code generator, would allow machine-specific optimization and excellent register usage. This could beat C. The difficulty here is in portability. Another, more researchy direction would be a Klein approach where the Cog code generator is eliminated and replaced by NB; this is perfectly possible since Cog retains the interpreter, which can be fallen back upon at any time. Again, portability is an issue.
best, Eliot
On 15 May 2010 02:43, Igor Stasenko siguctua@gmail.com wrote:
[...]
It's now time to verify things, since now I can run native code under Cog:
|ss s t1 t2 t3 |
s := (WideString with: (Character value: 16r55E4)) , 'abcdefghijklmno'.
ss := (String with: $z) , 'abcdefghijklmno'.
10 timesRepeat: [ s := s , s. ss := ss , ss ].
self assert: (s hash = s nbHash).
t1 := [1000 timesRepeat: [s hash]] timeToRun.
t2 := [1000 timesRepeat: [s nbHash]] timeToRun.
t3 := [1000 timesRepeat: [ss hash]] timeToRun.
{ t1. t2. t3 }
Squeak VM:
#(7929 89 125)
Cog VM:
#(1256 88 99)
However, in tight loops, where the loop's own cost is comparable to the hash function's cost, Cog is the winner:
|s| s := WideString with: (Character value: 16r55E4). [10000000 timesRepeat: [s hash]] timeToRun. 1307
|s| s := WideString with: (Character value: 16r55E4). [ s hash ] bench '2,840,000 per second.'
|s| s := WideString with: (Character value: 16r55E4). [10000000 timesRepeat: [s nbHash]] timeToRun.
1438
|s| s := WideString with: (Character value: 16r55E4). [ s nbHash ] bench '2,710,000 per second.'
Obviously, because it can't inline the native code generated by NB :)
On Thu, 30 Sep 2010, Igor Stasenko wrote:
[...]
Squeak VM:
#(7929 89 125)
You should update your image, because WideString's hash is now ~2x faster than before with SqueakVM. I get #(3822 119) for t1 and t3.
Cog VM:
#(1256 88 99)
However, in tight loops, where the loop's overhead is comparable to the function's payload itself, Cog is the winner:
|s| s := WideString with: (Character value: 16r55E4).
[10000000 timesRepeat: [s hash]] timeToRun.
 1307

|s| s := WideString with: (Character value: 16r55E4).
[ s hash ] bench.
 '2,840,000 per second.'

|s| s := WideString with: (Character value: 16r55E4).
[10000000 timesRepeat: [s nbHash]] timeToRun.
 1438

|s| s := WideString with: (Character value: 16r55E4).
[ s nbHash ] bench.
 '2,710,000 per second.'
Obviously, because it can't inline the native code generated by NB :)
Can Cog inline message sends? I thought it can't ATM. Btw, can you send a message from NB code?
Levente
-- Best regards, Igor Stasenko AKA sig.
2010/9/29 Levente Uzonyi leves@elte.hu
On Thu, 30 Sep 2010, Igor Stasenko wrote:
On 15 May 2010 02:43, Igor Stasenko siguctua@gmail.com wrote:
On 15 May 2010 02:35, Eliot Miranda eliot.miranda@gmail.com wrote:
Can Cog inline message sends? I thought it can't ATM.
Right. But the thought is that because Cog does inline machine-code for certain primitives (actually it generates machine-code versions of core primitives) that with a little work NB would be able to generate the machine code for primitives that Cog's code generator doesn't handle.
Igor will correct me if I'm wrong but I think the idea is thus. NB-generated code needs to be stored somewhere in a CompiledMethod where Cog can find it, along with some link-edit metadata to allow Cog to fix-up data references in the code etc. If Cog finds a method containing NB code it copies this code into the start of the method (where primitive code is compiled) and does some simple link-editing on the code. Now the Cog method starts with NB code which operates like a primitive, jumping to the start of the Cog method proper if it fails.
If we're talking about NB FFI calls then there are additional complications. A stack switch is required from Cog's stack pages to the real C stack, and back on return, etc. But I'm encouraged that Igor has already got NB going on Cog.
Btw, can you send a message from NB code?
As I understand it, no.
best Eliot
On 30 September 2010 03:39, Eliot Miranda eliot.miranda@gmail.com wrote:
Can Cog inline message sends? I thought it can't ATM.
Right. But the thought is that because Cog does inline machine-code for certain primitives (actually it generates machine-code versions of core primitives) that with a little work NB would be able to generate the machine code for primitives that Cog's code generator doesn't handle. Igor will correct me if I'm wrong but I think the idea is thus.
You are right here.
NB-generated code needs to be stored somewhere in a CompiledMethod where Cog can find it,
This is trivial. It's stored right after the last bytecode in the compiled method :)
along with some link-edit metadata to allow Cog to fix-up data references in the code etc.
No need. The generated code is position-agnostic, and it uses the interpreterProxy API for speaking with the VM. No position- or object-memory-dependent data is stored there. You can copy it anywhere you want and it will still work correctly. During its interaction with the VM, the native code fully relies on the interpreterProxy interface (as any other primitive does).
The native code is a function of this kind:

sqInt nativeFunction();

If it doesn't fail, it returns a valid oop. So, after the call you simply check whether the primitive failed:
result := self fnPtr. "fnPtr is native code"
self failed
	ifFalse: [ interpreterProxy pop: argCount + 1 thenPush: result ]
	ifTrue: [ lastError := interpreterProxy primitiveFailureCode ].
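The calling convention above can be modeled in a few lines of Python (a toy sketch; `failed`, the flat list standing in for the stack, and the method names are illustrative, not NB's actual API):

```python
class PrimCall:
    """Toy model of the primitive calling convention sketched above:
    the stack holds [..., receiver, arg1, ..., argN]; on success the
    caller pops receiver+args and pushes the result, on failure it
    leaves the stack untouched so the fallback path can run."""
    def __init__(self, stack):
        self.stack = stack
        self.failed = False

    def call(self, native_fn, arg_count):
        self.failed = False
        result = native_fn(self)                # native code may set self.failed
        if not self.failed:
            del self.stack[-(arg_count + 1):]   # pop: argCount + 1
            self.stack.append(result)           # thenPush: result
            return True
        return False                            # caller falls back to bytecode
```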
But of course, for inlined code we can change that, in order to generate even more optimal code :)
If Cog finds a method containing NB code it copies this code into the start of the method (where primitive code is compiled) and does some simple link-editing on the code. Now the Cog method starts with NB code which operates like a primitive, jumping to the start of the Cog method proper if it fails.
A sketch of the idea is the following:
- reserve a special primitive number for 'a compiled method with native code, which can be inlined'.
Then you can easily detect in the Cog inliner whether a method can be inlined, after some extra checks: a method could have such a primitive but not yet have native code, since it's not generated yet. Then you simply 'fail' the prim, i.e. go and run the method interpreted. This is what I'm actually checking/doing each time before making a native call.
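As a hedged illustration of the dispatch Igor sketches (the primitive number, class names, and the (ok, result) convention below are all made up for the example; real NB/Cog details differ):

```python
NATIVE_PRIM = 220   # hypothetical "has inlinable native code" primitive number

class CompiledMethod:
    def __init__(self, primitive_index, native_code=None, bytecode_eval=None):
        self.primitive_index = primitive_index
        self.native_code = native_code      # callable returning (ok, result), or None
        self.bytecode_eval = bytecode_eval  # the interpreted fallback

def execute(method, *args):
    """Try the native code when the special primitive number is present
    and code has been generated; on absence or failure, 'fail the prim'
    and run the method interpreted."""
    if method.primitive_index == NATIVE_PRIM and method.native_code is not None:
        ok, result = method.native_code(*args)
        if ok:
            return result
    return method.bytecode_eval(*args)
```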
Once everything is right (it finds correct native code within a method), it's then free to either make a call to it or inline it.
And as I understand it, from the image side we have to guarantee that such special inlined code won't overflow the stack, so Cog won't need to switch the stack pointer to the C stack before running it.
Or, if you extend interpreterProxy with something like pleaseMakeSureIAmUsingCStack() and pleaseMakeSureIAmUsingCogStackAgain(), then NB could generate code which calls these things before/after external calls, and so Cog is free to inline even code which may contain external calls.
If we're talking about NB FFI calls then there are additional complications. A stack switch is required from Cog's stack pages to the real C stack, and back on return, etc. But I'm encouraged that Igor has already got NB going on Cog.
It wasn't hard. I had to match the interpreterProxy interface with Cog's one. (Btw, I really miss getStackPointer(), which is currently not implemented in the Cog VM; I would be happy to have this function available without the Alien plugin :) Second, I changed two methods which were responsible for marshalling Float instances, since Cog stores them in native format, while Squeak stores them big-endian.
Btw, can you send a message from NB code?
As I understand it, no.
Of course it can't, because the native code runs as a primitive. A primitive can't do sends, because then, obviously, it has to enter the interpreter loop (via a call to interpret()), and once the send completes, it should somehow continue running the native code again. So, its implementation would be much more like a callback mechanism, which is already there.
With NB you can create a native function with a callback, then you make an FFI callout to that function, and when it calls back you can do a 'message send from NB code' or whatever you want. So, roughly speaking, it could be seen as a way to do message sends from native code :)
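The callout-plus-callback pattern Igor describes has a close analogue in Python's ctypes, where a C function (qsort here) re-enters the host language through a callback; this sketch assumes a POSIX system where CDLL(None) exposes libc:

```python
import ctypes

libc = ctypes.CDLL(None)   # assumption: POSIX; makes libc's qsort visible

# C-compatible comparator type: int (*)(const int *, const int *)
CMPFUNC = ctypes.CFUNCTYPE(ctypes.c_int,
                           ctypes.POINTER(ctypes.c_int),
                           ctypes.POINTER(ctypes.c_int))

def py_cmp(a, b):
    # Control is back in the host language here -- the analogue of
    # doing a "message send" from inside a native callout.
    return a[0] - b[0]

arr = (ctypes.c_int * 5)(5, 1, 4, 2, 3)
libc.qsort(arr, len(arr), ctypes.sizeof(ctypes.c_int), CMPFUNC(py_cmp))
```

After the call, arr has been sorted by the C routine, with every comparison performed back in Python.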
-- Best regards, Igor Stasenko AKA sig.
On Wed, Sep 29, 2010 at 6:38 PM, Igor Stasenko siguctua@gmail.com wrote:
On 30 September 2010 03:39, Eliot Miranda eliot.miranda@gmail.com wrote:
Can Cog inline message sends? I thought it can't ATM.
Right. But the thought is that because Cog does inline machine-code for certain primitives (actually it generates machine-code versions of core primitives) that with a little work NB would be able to generate the machine code for primitives that Cog's code generator doesn't handle. Igor will correct me if I'm wrong but I think the idea is thus.
You are right here.
NB-generated code needs to be stored somewhere in a CompiledMethod where Cog can find it,
This is trivial. Its stored right after last bytecode in compiled method :)
No, it's not trivial. Cog doesn't know how to distinguish an NB trailer from any old trailer. Personally I'd prefer to see the NB code in a special location in the literal frame (e.g. the first literal) using an instance of a class that can't be a literal. Either that, or a bit in the CompiledMethod header should indicate that the trailer contains the code. (Oh, but I see below you're thinking of a specific primitive number; that's good). Also, if the NB code /is/ in the trailer, how does Cog know where it ends, since presumably the trailer includes other stuff after the code too.
along with some link-edit metadata to allow Cog to fix-up data references in the code etc.
No need. The generated code is position agnostic, and its using interpreterProxy api for speaking with VM. No position/object memory dependent data is stored there. You can copy it anywhere you want and it will still work correctly. Native code during its interaction with VM fully relies on interpreterProxy interface (as any other primitive).
So the NB code uses absolute addressing to get hold of the interpreterProxy, right? Depending on what the primitive is doing, using interpreterProxy may be OK or may have poor performance. A little bit of metadata would allow NB direct access to certain variables and allow it to short-circuit the slow double-indirect calls through interpreterProxy.
A native code is a function of kind:
sqInt nativeFunction();
if its not fails, it should return a valid oop. So, after call you simply check if prim not failed:
result := self fnPtr. "fnPtr is native code"
self failed
	ifFalse: [ interpreterProxy pop: argCount + 1 thenPush: result ]
	ifTrue: [ lastError := interpreterProxy primitiveFailureCode ].
but of course, for inlined code we can change that , in order to generate even more optimal code :)
If Cog finds a method containing NB code it copies this code into the start of the method (where primitive code is compiled) and does some simple link-editing on the code. Now the Cog method starts with NB code which operates like a primitive, jumping to the start of the Cog method proper if it fails.
A sketch of the idea is the following: reserve a special primitive number for 'a compiled method with native code, which can be inlined'.
Yes. That's nice.
Then you can easily detect in the Cog inliner whether a method can be inlined, after some extra checks: a method could have such a primitive but not yet have native code, since it's not generated yet. Then you simply 'fail' the prim, i.e. simply go and run the method interpreted.
This is what i am actually checking/doing each time before making a native call.
Once everything right (it finds a correct native code within a method), its then free to either make a call to it or inline it.
And as i understood, from image side, we have to guarantee that such special inlined code won't overflow the stack , so Cog won't need to switch stack pointer to C stack before running it.
Right. But Cog stack pages have generous headroom, so you can count on 256 bytes of headroom. But we'd have to specify this. Hopefully 256 bytes is enough for something that runs on the Smalltalk stack, since one could argue that if it uses lots of stack it's probably going to run for a while, and so the overhead of switching to/from the C stack is affordable. I.e. run only small performance-critical things on the Smalltalk stack.
Or, if you extend interpreterProxy with something like pleaseMakeSureIAmUsingCStack() and pleaseMakeSureIAmUsingCogStackAgain() then NB could generate a code which will call these things before/after external calls, and so, Cog are free to inline even code which may contain external calls.
How about a primitive that says "run this code on the C stack" and one that says "run this on the native stack"?
If we're talking about NB FFI calls then there are additional complications. A stack switch is required from Cog's stack pages to the real C stack, and back on return, etc. But I'm encouraged that Igor has already got NB going on Cog.
It wasn't hard. I had to match the interpreterProxy interface with Cog's one. (Btw, I really miss getStackPointer(), which is currently not implemented in the Cog VM; I would be happy to have this function available without the Alien plugin :)
I think that's very easy to add :)
Second, I changed two methods which were responsible for marshalling Float instances, since Cog stores them in native format, while Squeak stores them big-endian.
Btw, can you send a message from NB code?
As I understand it, no.
Of course it can't, because the native code runs as a primitive. A primitive can't do sends, because then, obviously, it has to enter the interpreter loop (via a call to interpret()), and once the send completes, it should somehow continue running the native code again. So, its implementation would be much more like a callback mechanism, which is already there.
With NB you can create a native function with a callback, then you make an FFI callout to that function, and when it calls back you can do a 'message send from NB code' or whatever you want. So, roughly speaking, it could be seen as a way to do message sends from native code :)
cheers Eliot
On 30 September 2010 04:53, Eliot Miranda eliot.miranda@gmail.com wrote:
On Wed, Sep 29, 2010 at 6:38 PM, Igor Stasenko siguctua@gmail.com wrote:
On 30 September 2010 03:39, Eliot Miranda eliot.miranda@gmail.com wrote:
Can Cog inline message sends? I thought it can't ATM.
Right. But the thought is that because Cog does inline machine-code for certain primitives (actually it generates machine-code versions of core primitives) that with a little work NB would be able to generate the machine code for primitives that Cog's code generator doesn't handle. Igor will correct me if I'm wrong but I think the idea is thus.
You are right here.
NB-generated code needs to be stored somewhere in a CompiledMethod where Cog can find it,
This is trivial. Its stored right after last bytecode in compiled method :)
No, it's not trivial. Cog doesn't know how to distinguish an NB trailer from any old trailer.
Obviously the NB plugin knows how to do that, and so does Cog, since the plugin is part of the VM :)
Personally I'd prefer to see the NB code in a special location in the literal frame (e.g. the first literal) using an instance of a class that can't be a literal. Either that, or a bit in the CompiledMethod header should indicate that the trailer contains the code. (Oh, but I see below you're thinking of a specific primitive number; that's good). Also, if the NB code /is/ in the trailer, how does Cog know where it ends, since presumably the trailer includes other stuff after the code too.
I did the simplest possible thing to make it work, without changing the parser/compiler to handle some additional pragma and insert a special literal into the method.
If you're wondering how I detect the bounds of the native code located inside a compiled method, you don't even need to go to the VM side to look at how I did it in my primitive. Just install this changeset into your image: http://code.google.com/p/nativeboost/downloads/detail?name=000-NativeCodeTra...
where you can see how it encodes raw bytes into the trailer, and how it decodes them.
See, I've been there, done that. And all I understood while sitting on squeak-dev and trying to sell my ideas is that if you want your stuff to be included/used, then you need to make very small changes, and don't you dare ever touch the VM!
So, the NativeBoost project is my answer to this problem. I can do my stuff without having to touch the VM.
Sure thing; if you want to integrate it in a nicer/more generic way, I won't object.
along with some link-edit metadata to allow Cog to fix-up data references in the code etc.
No need. The generated code is position agnostic, and its using interpreterProxy api for speaking with VM. No position/object memory dependent data is stored there. You can copy it anywhere you want and it will still work correctly. Native code during its interaction with VM fully relies on interpreterProxy interface (as any other primitive).
So the NB code uses absolute addressing to get hold of the interpreterProxy, right? Depending on what the primitive is doing, using interpreterProxy may be OK or may have poor performance. A little bit of metadata would allow NB direct access to certain variables and allow it to short-circuit the slow double-indirect calls through interpreterProxy.
Yes, I know very well about the drawbacks of such an approach. Exposing some variables, like newMethod etc., could make the code a little faster. A couple of years ago I did that for Exupery's VM-side code, so Exupery was able to get a pointer to any of the VM's global state and use it in generated code. But I don't really miss that :) If you provide a primitive which can give me a pointer to any/some VM variable by its name, then I will simply alter the code generator to use it, if it's available. And if not, then it will keep doing it the hard way.
About metadata: again, metadata for native code means more handling at VM side. This is what i was trying to avoid all the time. And this is the reason why i integrated NB with Cog so fast (except callbacks stuff) :)
And btw, with some effort, I think it's possible to make NativeBoost handle code relocation(s) on its own. The code generator would simply emit a call to a special checkForRelocationAndRelocateIfNeeded() routine, which would be the first thing at the entry point of native code which is not position-agnostic.
Additionally, in NB, I started implementing an NBObjectFormat thingy, so in some cases the code generator is already using it and can completely avoid making calls to the interpreter proxy, generating proper code for poking inside objects.
So, I wouldn't bother with relocation(s) and optimization(s) at the current stage; only after we can see clearly whether there is something which belongs to the VM side and can't be handled by the native code itself :)
The native code is a function of this kind:

sqInt nativeFunction();

If it doesn't fail, it returns a valid oop. So, after the call you simply check whether the primitive failed:

result := self fnPtr. "fnPtr is native code"
self failed
	ifFalse: [ interpreterProxy pop: argCount + 1 thenPush: result ]
	ifTrue: [ lastError := interpreterProxy primitiveFailureCode ].

But of course, for inlined code we can change that, in order to generate even more optimal code :)
If Cog finds a method containing NB code it copies this code into the start of the method (where primitive code is compiled) and does some simple link-editing on the code. Now the Cog method starts with NB code which operates like a primitive, jumping to the start of the Cog method proper if it fails.
A sketch of the idea is the following: reserve a special primitive number for 'a compiled method with native code, which can be inlined'.
Yes. That's nice.
Then you can easily detect in the Cog inliner whether a method can be inlined, after some extra checks: a method could have such a primitive but not yet have native code, since it's not generated yet. Then you simply 'fail' the prim, i.e. simply go and run the method interpreted.
This is what i am actually checking/doing each time before making a native call.
Once everything right (it finds a correct native code within a method), its then free to either make a call to it or inline it.
And as i understood, from image side, we have to guarantee that such special inlined code won't overflow the stack , so Cog won't need to switch stack pointer to C stack before running it.
Right. But Cog stack pages have generous headroom, so you can count on 256 bytes of headroom. But we'd have to specify this. Hopefully 256 bytes is enough for something that runs on the Smalltalk stack, since one could argue that if it uses lots of stack it's probably going to run for a while, and so the overhead of switching to/from the C stack is affordable. I.e. run only small performance-critical things on the Smalltalk stack.
Completely reasonable.
Or, if you extend interpreterProxy with something like pleaseMakeSureIAmUsingCStack() and pleaseMakeSureIAmUsingCogStackAgain() then NB could generate a code which will call these things before/after external calls, and so, Cog are free to inline even code which may contain external calls.
How about a primitive that says "run this code on the C stack" and one that says "run this on the native stack"?
Maybe. Then the developer needs to decide (figure out) whether his code uses less than 256 bytes of stack or not, which could be controversial. But I can't find a way to detect that automatically, since I allow putting arbitrary code into a method: no lint rules, no stack checking, whatever. :) Freedom comes with responsibility.
If we're talking about NB FFI calls then there are additional complications. A stack switch is required from Cog's stack pages to the real C stack, and back on return, etc. But I'm encouraged that Igor has already got NB going on Cog.
It wasn't hard. I had to match the interpreterProxy interface with Cog's one. (Btw, I really miss getStackPointer(), which is currently not implemented in the Cog VM; I would be happy to have this function available without the Alien plugin :)
I think that's very easy to add :)
Second, I changed two methods which were responsible for marshalling Float instances, since Cog stores them in native format, while Squeak stores them big-endian.
Btw, can you send a message from NB code?
As I understand it, no.
Of course it can't, because the native code runs as a primitive. A primitive can't do sends, because then, obviously, it has to enter the interpreter loop (via a call to interpret()), and once the send completes, it should somehow continue running the native code again. So, its implementation would be much more like a callback mechanism, which is already there.
With NB you can create a native function with a callback, then you make an FFI callout to that function, and when it calls back you can do a 'message send from NB code' or whatever you want. So, roughly speaking, it could be seen as a way to do message sends from native code :)
-- Best regards, Igor Stasenko AKA sig.
Should the hash function be the same between ByteString and WideString? I mean: if a WideString contains the same characters as a ByteString, I guess that they should both have the same hash code, no?
Christophe
----- Original Message ----- From: "Igor Stasenko" siguctua@gmail.com To: "The general-purpose Squeak developers list" squeak-dev@lists.squeakfoundation.org Sent: Saturday, 15 May 2010 00:26:18 GMT +01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna Subject: Re: [squeak-dev] WideString hash is way slower than ByteString hash.
added implementation into examples package: http://www.squeaksource.com/NativeBoost/NativeBoost-Examples-Igor.Stasenko.1...
On 15 May 2010 01:14, Igor Stasenko siguctua@gmail.com wrote:
I implemented WideString hash in NB.
|s| s := WideString with: (Character value: 16r55E4).
[100000000 timesRepeat: [s hash]] timeToRun.
 155718

|s| s := WideString with: (Character value: 16r55E4).
[100000000 timesRepeat: [s nbHash]] timeToRun.
 81231
Which makes me think that we're benching the outer loop, rather than the hash function itself :)
So, here is a more appropriate one, I think:
| s t1 t2 |
s := (WideString with: (Character value: 16r55E4)) , 'abcdefghijklmno'.
10 timesRepeat: [ s := s , s ].
self assert: (s hash = s nbHash).
t1 := [1000 timesRepeat: [s hash]] timeToRun.
t2 := [1000 timesRepeat: [s nbHash]] timeToRun.
{ t1. t2 }
 #(11479 134)
~ 85x faster :)
-- Best regards, Igor Stasenko AKA sig.
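Igor's point above, that with very short inputs the benchmark mostly measures the driving loop rather than the hash, holds in any language; a quick Python sketch (the payload below is a stand-in, not Squeak's hash):

```python
import timeit

def toy_hash(s):
    # Stand-in payload whose cost is proportional to the input length.
    h = 0
    for c in s:
        h = (h + ord(c)) & 0x0FFFFFFF
    return h

n = 20000
t_short = timeit.timeit(lambda: toy_hash("a"), number=n)        # 1 char
t_long = timeit.timeit(lambda: toy_hash("a" * 1024), number=n)  # 1024 chars
# The long input does 1024x the hashing work but costs far less than
# 1024x the time, because call/loop overhead dominates the short case.
```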
On Sat, 15 May 2010, christophe.jalady@free.fr wrote:
Should the hash function be the same between ByteString and WideString? I mean: if a WideString contains the same characters as a ByteString, I guess that they should both have the same hash code, no?
Yes, they should be the same. And they are the same, aren't they?
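To see why they coincide: both implementations fold the character values through the same hashMultiply mix, and for characters below 256 those values are identical whether stored in a ByteString or a WideString. A Python sketch using the constants quoted earlier in the thread (the initial hash is a placeholder; Squeak seeds it from a class identityHash shared via ByteString):

```python
def mix(h):
    # hashMultiply-style step, constants from the thread (16r260D, 16r0065)
    low = h & 16383
    return (0x260D * low
            + ((((0x260D * (h >> 14)) + (0x0065 * low)) & 16383) * 16384)) & 0x0FFFFFFF

def str_hash(code_points, initial=99):   # placeholder initial hash
    h = initial & 0x0FFFFFFF
    for cp in code_points:
        h = mix(h + cp)
    return h

byte_form = b"abc"                 # ByteString: 8-bit character values
wide_form = [0x61, 0x62, 0x63]     # the same text as wide code points
assert str_hash(byte_form) == str_hash(wide_form)
```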
Levente
On 15 May 2010 01:53, christophe.jalady@free.fr wrote:
Should the hash function be the same between ByteString and WideString? I mean: if a WideString contains the same characters as a ByteString, I guess that they should both have the same hash code, no?
Seems so.
| s |
s := (WideString with: (Character value: 16r55E4)) , 'abcdefghijklmno'.
s at: 1 put: $a.
s class. "WideString"
s nbHash. "196130722"
s nbHash. "196130722"
'aabcdefghijklmno' hash. "196130722"
Hi again,
On Sat, 15 May 2010, Igor Stasenko wrote:
added implementation into examples package: http://www.squeaksource.com/NativeBoost/NativeBoost-Examples-Igor.Stasenko.1...
just tried to run the following code:
'foo' asWideString nbHash
but I get an assertion failure at this line:

imul: ESI with: 16r260D; " 16r260D * low "

The code generator is unhappy about ESI. In AJInstructionDescription >> #emitimul:operand1:operand2:operand3: there's an assertion:

self assert: op1 isRegTypeGPW.

If I just press proceed, it will be unhappy with the other two registers passed to #imul:with:, but the code compiles and works, so probably the assertion is wrong.
Levente
2010/5/15 Levente Uzonyi leves@elte.hu:
Hi again,
On Sat, 15 May 2010, Igor Stasenko wrote:
added implementation into examples package:
http://www.squeaksource.com/NativeBoost/NativeBoost-Examples-Igor.Stasenko.1...
just tried to run the following code:
'foo' asWideString nbHash
but I get an assertion failure at this line:

imul: ESI with: 16r260D; " 16r260D * low "

The code generator is unhappy about ESI. In AJInstructionDescription >> #emitimul:operand1:operand2:operand3: there's an assertion:

self assert: op1 isRegTypeGPW.

If I just press proceed, it will be unhappy with the other two registers passed to #imul:with:, but the code compiles and works, so probably the assertion is wrong.
Ah yes, I just commented this assert out. I'm not sure, but it seems to be either in the wrong place or simply wrong.
Levente
-- Best regards, Igor Stasenko AKA sig.