While I've done a lot of C programming that is useful for FFI interfacing, I've not done much C++. So just sharing something new I learnt today to help with FFI interfacing to combined C/C++ libraries. I thought maybe others in the same boat could be interested in this. [Original question asked in squeak-dev, cross-posting to pharo-dev]
On Fri, 2 Nov 2018 at 21:06, Ben Coman btc@openinworld.com wrote:
On Fri, 2 Nov 2018 at 18:44, Edwin Ancaer eancaer@gmail.com wrote:
As I'm looking for a way to automate searching the documents in my humble administration, I read some articles about OCR. I came across an article about using Python with Tesseract to transform a scan of a document into searchable text.
My question now is if I can do something similar with Squeak. To my inexperienced eye, it seems like I should use FFI to call the functions in the Tesseract API, but this API is in C++, and I don't know if it is possible to use FFI to call C++ functions?
You are right that C++ is difficult because of the name mangling of function symbols, but fortunately I notice Tesseract has C bindings...
https://github.com/tesseract-ocr/tesseract#for-developers
https://github.com/tesseract-ocr/tesseract/blob/master/src/api/capi.h
so it looks like you are in the clear.
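To make the name-mangling point concrete, here is a minimal sketch. The function names are made up for illustration and have nothing to do with Tesseract. It shows two C++ overloads, which only work because the compiler encodes the parameter types into each linker symbol, next to an extern "C" function that is exported under its plain name.

```cpp
#include <cassert>

// Two overloads sharing one source-level name. The compiler has to give
// them different linker symbols, so it mangles the parameter types into
// the symbol name.
int area(int side) { return side * side; }
int area(int width, int height) { return width * height; }

// extern "C" gives this function C linkage: no mangling, so a C program
// or an FFI can look it up by the plain name "area_of_square".
extern "C" int area_of_square(int side) { return area(side); }

// Small usage check (hypothetical helper, not part of any real API).
void self_check() {
    assert(area(2, 3) == 6);
    assert(area_of_square(4) == 16);
}
```

With GCC on Linux, running nm on the compiled object should list the overloads under mangled names such as _Z4areai and _Z4areaii, while area_of_square appears unchanged. An FFI looks symbols up by exactly these exported names, which is why a C binding layer makes life so much easier.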
Browsing a bit deeper, I got quite confused for a while. I could see a typedef for TessResultRenderer here...
https://github.com/tesseract-ocr/tesseract/blob/master/src/api/capi.h#L83
    typedef struct TessResultRenderer TessResultRenderer;
which I understood must refer to an *existing* struct, but I couldn't find the definition of that struct anywhere. In particular...
    $ git clone git@github.com:tesseract-ocr/tesseract.git
    $ cd tesseract
    $ find . -type f -name "*h" -exec grep -Hn TessResultRenderer {} \;
didn't find any struct definitions.
I could only find TessResultRenderer as a class definition...
https://github.com/tesseract-ocr/tesseract/blob/master/src/api/renderer.h#L4...
and the only thing that I guessed could possibly make sense was that C++ classes and structs could be used interchangeably. My google-fu failed to find anything useful, so an experiment...
    $ vi test.cpp
    #include <stdio.h>
    class SomeClass {
      public:
        int a;
        int b;
    };
    typedef struct SomeClass SomeTypeDef;
    int main() {
        SomeTypeDef x;
        x.a = 5;
        x.b = 7;
        printf("Answer is %d\n", x.a + x.b);
    }
    $ gcc test.cpp
    $ ./a.out
    Answer is 12
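The same experiment can be checked at compile time. This is a small sketch reusing the names from the test above: in C++ the keywords class and struct declare the same kind of type and differ only in default member access, so "struct SomeClass" names the very class that was declared with "class". Some compilers may warn about the mismatched keyword, but it is valid C++.

```cpp
#include <cassert>
#include <type_traits>

class SomeClass {
public:
    int a;
    int b;
};

// The C-style spelling "struct SomeClass" refers to the class above.
typedef struct SomeClass SomeTypeDef;

// Both names denote one and the same class type.
static_assert(std::is_same<SomeClass, SomeTypeDef>::value,
              "typedef names the same type as the class");
static_assert(std::is_class<SomeTypeDef>::value,
              "a type declared with 'class' is a class type");

// Same arithmetic as the original experiment.
int sum_fields() {
    SomeTypeDef x;
    x.a = 5;
    x.b = 7;
    return x.a + x.b;
}

void self_check() { assert(sum_fields() == 12); }
```

When a header like this is compiled as plain C instead, a line of the form "typedef struct X X;" does not need an existing definition at all. It simply declares an incomplete type, which is enough as long as the C code only passes pointers to it around.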
Now I noticed that the TessResultRenderer member variables were private... https://github.com/tesseract-ocr/tesseract/blob/master/src/api/renderer.h#L1... and, curious about that, I changed my test example from public to private, which somewhat expectedly produced compile errors.
So those TessResultRenderer member variables must only be accessed from a member function, but how is that C++ member function called from C to operate on a particular object? An example is TessResultRendererInsert...
C Declaration: https://github.com/tesseract-ocr/tesseract/blob/c375f4fbf73b8f761b2e65e0e3ad...
C Definition: https://github.com/tesseract-ocr/tesseract/blob/c375f4fbf73b8f761b2e65e0e3ad...
C++ Declaration: https://github.com/tesseract-ocr/tesseract/blob/master/src/api/renderer.h#L5...
C++ Definition: https://github.com/tesseract-ocr/tesseract/blob/master/src/api/renderer.cpp#...
So in the C definition, I would describe "the C++ member function insert() as being called via a function pointer in the struct" (is that a reasonable way to describe it?)
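To illustrate the general pattern, here is a minimal sketch of a C wrapper around a C++ class with private members. Counter and all the function names are hypothetical and are not Tesseract's API. One detail worth noting: a non-virtual member function is not stored inside the object. The wrapper makes an ordinary member call on the pointer it was handed, and the compiler passes that pointer along as the hidden "this" argument.

```cpp
#include <cassert>

// --- What a C caller (or an FFI) sees: hypothetical declarations ---
extern "C" {
typedef struct Counter Counter;  // incomplete type, only usable via pointer
Counter* CounterCreate(int start);
void CounterAdd(Counter* handle, int amount);
int CounterValue(const Counter* handle);
void CounterDelete(Counter* handle);
}

// --- The C++ side: a normal class with a private member ---
class Counter {
public:
    explicit Counter(int start) : total_(start) {}
    void add(int amount) { total_ += amount; }
    int value() const { return total_; }

private:
    int total_;
};

// --- The wrappers: each one forwards to a member function ---
extern "C" {
Counter* CounterCreate(int start) { return new Counter(start); }
void CounterAdd(Counter* handle, int amount) { handle->add(amount); }
int CounterValue(const Counter* handle) { return handle->value(); }
void CounterDelete(Counter* handle) { delete handle; }
}

// Usage as a C caller would write it: the handle is never dereferenced.
void self_check() {
    Counter* counter = CounterCreate(5);
    CounterAdd(counter, 7);
    assert(CounterValue(counter) == 12);
    CounterDelete(counter);
}
```

From the FFI side, only the four plain-named functions and an opaque pointer are needed. The private member never has to be described to the caller.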
In this case, because of the private member variables, our FFI would treat TessResultRenderer as an opaque object, which simplifies things. I would guess that direct in-image access to the member variables would need to account for the offset due to the variables holding the function pointers to the member functions.
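The offset guess can be probed with a small experiment. This is a sketch with made-up types, and object layout is compiler and ABI specific, so treat the results as typical rather than guaranteed. As far as I know, ordinary member functions take no space in an object. Only virtual functions add hidden storage, usually a single pointer to a per-class table, and that pointer normally sits in front of the data members.

```cpp
#include <cassert>
#include <cstddef>

// Plain data, no virtual functions: the first member starts at offset 0.
struct PlainPair {
    int a;
    int b;
};

// One virtual function is enough for the compiler to add a hidden
// pointer to the class's virtual function table.
struct VirtualPair {
    virtual ~VirtualPair() {}
    int a;
    int b;
};

// Byte distance from the start of an object to its member "a".
template <typename T>
std::ptrdiff_t offset_of_a() {
    T object;
    const char* base = reinterpret_cast<const char*>(&object);
    const char* field = reinterpret_cast<const char*>(&object.a);
    return field - base;
}

void self_check() {
    assert(offset_of_a<PlainPair>() == 0);
    // Holds on the mainstream ABIs, where the hidden pointer comes first.
    assert(offset_of_a<VirtualPair>() > 0);
    assert(sizeof(VirtualPair) > sizeof(PlainPair));
}
```

So an in-image struct definition for a class with virtual functions would have to skip that hidden pointer before reaching the first data member, which is one more reason to stay with the opaque-handle approach.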
cheers -ben
P.S. for Tesseract FFI it might be good to start with reproducing this example... https://github.com/tesseract-ocr/tesseract/wiki/APIExample#example-using-the...