Hi Ben,

On Thu, Jun 27, 2019 at 10:31 PM Ben Coman <btc@openinworld.com> wrote:
Hi Eliot,

On Tue, 7 Aug 2018 at 03:32, Eliot Miranda <eliot.miranda@gmail.com> wrote:
Hi Ben,

> On Aug 5, 2018, at 6:34 PM, Ben Coman <btc@openinworld.com> wrote:
>> On 5 August 2018 at 23:10, Eliot Miranda <eliot.miranda@gmail.com> wrote:
>> Hi Ben,
>>> On Aug 4, 2018, at 8:40 AM, Ben Coman <btc@openinworld.com> wrote:
>>> A problem with FFI is that if a callout segfaults, all of memory
>>> including that of the Image is suspect, and execution of the Image terminates.
>>> Occasionally I hunt around hoping to find technology to mitigate that problem.
>>> Maybe this time I found something... Memory Protection Keys [1]
>>> Perhaps these could keep Image memory safe when an FFI callout segfaults.
>>> IIUC the main problem with protecting Image memory on every FFI callout
>>> is the time it would take to update the flags on every page of Image memory.
>>> Would being able to change the protection of a massive number of pages
>>> with one syscall make it feasible to wrap them around FFI callouts?
>>> This may be useful at least where the FFI use is more about reuse of
>>> existing functionality than about performance.
>>> Or at least useful while someone is learning/experimenting with FFI for
>>> the first time or while becoming familiar with some external library.
>>> Further info at [2] & [3].
>> I think there’s a much simpler improvement that doesn’t go this far.  I implemented it in VisualWorks and it’s been in production for more than a decade.  It should be easy to add to Cog.
>> The idea is simply to add a flag that tracks if the VM is in an FFI call or not and to test this flag in the VM’s exception handlers for SIGBUS, SIGILL, SIGSEGV and their equivalents on Windows.  The exception handlers then respond when in an FFI call by failing the FFI call primitive, answering a primitive fail code that includes the exception information.  Recently we extended Cog’s failure codes to allow a structured object (I don’t have the details handy; I’ll check soon).  In this case we need a pc and/or address and an exception code.
>> Would this approach satisfy you?
> That sounds good.  Although the argument I've seen is that a memory
> access error means you "can't recover because you don't know what may have been corrupted",
> I think it's worthwhile to be optimistic that the Image may last a bit
> longer to get more information about what call from the Image invoked the FFI failure.
> And if you've been notified (e.g. via Growl message) you can still
> take steps to move to a new Image if the current one is suspect.

>> While it’s possible that an FFI call could damage the Smalltalk heap and VM state, it’s often not the case, so one wants to be able to at least reap the error code and hence identify where the error occurred, inspect the arguments to the call, etc.  It’s hence about having enough functionality to gather what information is available from outside the call, not about being able to continue for a long time after.

> Also, the approach you suggest would be a pre-requisite for what I
> suggested anyway, and make it easier to later experiment with MPKs.


> Let me know what I can do to help (probably more capable on the testing side).

Will do.  If you’re happy with C programming and the simulator, part of it is adding a global flag variable, setting and unsetting it in FFI calls and callbacks, and then responding to the flag in the exception handler.  The tricky bit is arranging failure.  That I can work on when time allows.

I believe you did some work on this catching of segfaults in FFI callouts to return a primitive failure.  
Where did it get up to?  Can you point me at the code that sets/tests this flag and sets up the primitive failure?

Indeed I did.  Here's the status.  It works on Unix and MacOS but fails on Windows due to a failure in structured exception handling (stack walking) with the MinGW toolchain.  I may revisit the code in the context of MSVC Community Edition 2017, which I'm using for the Terf VM.  Thanks for the reminder.  Here's the structure of the code:

In SmalltalkImage>>#recreateSpecialObjectsArray the error table is extended to include a prototype instance of ExceptionInFFICallError.  When wanting to deliver such an error the VM creates a shallow copy of this object, fills it in, and supplies it as the errorCode in an FFI primitive method.  This change was introduced in System-eem.1041.

    newArray at: 52 put: #(nil "nil => generic error" #'bad receiver'
            #'bad argument' #'bad index'
            #'bad number of arguments'
            #'inappropriate operation'  #'unsupported operation'
            #'no modification' #'insufficient object memory'
            #'insufficient C memory' #'not found' #'bad method'
            #'internal error in named primitive machinery'
            #'object may move' #'resource limit exceeded'
            #'object is pinned' #'primitive write beyond end of object'
            #'object moved' #'object not pinned' #'callback error'),
        {PrimitiveError new errorName: #'operating system error'; yourself.
         ExceptionInFFICallError new errorName: #'exception in FFI call'; yourself}.

    ExceptionInFFICallError allInstVarNames "=> #('errorName' 'errorCode' 'pc')"

So errorName is #'exception in FFI call', errorCode will be either the second argument to the signal handler on Unix, or the Win32 exception code on Win32.  The pc is the pc at which the exception took place.  The error code will only be delivered if the method contains a primitive error code.  There is a flag in the VM to provide overriding of this behavior, but as yet there is no primitive to access this flag.  See the two implementors of primitiveFailForFFIException:at: in the VMMaker source code.

Within the VM the fatal exception handlers (sigsegv in the Unix & MacOS VMs; squeakExceptionHandler within the win32 VM) always call primitiveFailForFFIExceptionat.  primitiveFailForFFIExceptionat checks to see if the VM is in an FFI call and if not, simply returns.  If so, it does the relevant stack switching actions to discard the C stack and fail the primitive with the supplied error code & pc.
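
To make the shape of that concrete, here is a minimal, POSIX-only sketch of the idea in plain C.  It is not the VM's code: every name in it (in_ffi_call, fault_handler, guarded_callout, callout_result and so on) is hypothetical, it records only the signal number rather than a pc, and it assumes callouts do not nest through callbacks.  It uses sigsetjmp/siglongjmp to stand in for the stack switching that discards the C stack, which is an assumption about one reasonable way to do it, not a description of what primitiveFailForFFIExceptionat actually does.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* All names below are hypothetical; none are taken from the VM sources. */
enum { CALLOUT_OK = 0, CALLOUT_FAULTED = 1 };

typedef struct {
    int status;   /* CALLOUT_OK or CALLOUT_FAULTED */
    int signo;    /* signal that interrupted the callout, if any */
    int result;   /* return value of the external function, if it returned */
} callout_result;

static volatile sig_atomic_t in_ffi_call = 0;  /* the "am I in a callout" flag */
static volatile sig_atomic_t fault_signo = 0;
static sigjmp_buf ffi_escape;                  /* where to resume on a fault */

static void fault_handler(int signo)
{
    if (!in_ffi_call) {
        /* Fault outside a callout: fall back to the default fatal action. */
        signal(signo, SIG_DFL);
        raise(signo);
        return;
    }
    /* Fault inside a callout: abandon the callee's stack frames. */
    fault_signo = signo;
    siglongjmp(ffi_escape, 1);
}

void install_fault_handlers(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = fault_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGBUS, &sa, NULL);
    sigaction(SIGILL, &sa, NULL);
}

/* Call fn(arg) with the flag set; report a failure instead of crashing. */
callout_result guarded_callout(int (*fn)(int), int arg)
{
    assert(!in_ffi_call);  /* this sketch does not handle nested callouts */
    if (sigsetjmp(ffi_escape, 1) != 0) {
        /* Reached via siglongjmp from fault_handler. */
        callout_result failed = { CALLOUT_FAULTED, (int)fault_signo, 0 };
        in_ffi_call = 0;
        return failed;
    }
    in_ffi_call = 1;
    int value = fn(arg);
    in_ffi_call = 0;
    callout_result ok = { CALLOUT_OK, 0, value };
    return ok;
}

/* Two stand-in "external" functions for trying the sketch out. */
int add_one(int x) { return x + 1; }
int crashing_function(int x) { raise(SIGSEGV); return x; }
```

After install_fault_handlers(), guarded_callout(add_one, 20) reports CALLOUT_OK with the function's result, while guarded_callout(crashing_function, 0) reports CALLOUT_FAULTED with SIGSEGV instead of taking the process down.  As discussed earlier in the thread, the heap may still have been damaged by the callee, so this is about getting the failure back to the caller, not about guaranteeing the process is healthy afterwards.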

For reasons unknown, the exception handler squeakExceptionHandler seems not to be reached if the VM is compiled with clang and/or gcc on win32.  [I have to confirm this; it's been ten months].

N.B. I had to add an error code variable to ExternalFunction>>#invokeWithArguments:. Pharo should ensure it also has one.

best, Eliot