[squeak-dev] [ANN] BioSqueak 0.4

Hernán Morales Durand hernan.morales at gmail.com
Fri Feb 1 16:31:24 UTC 2013


Hello Hannes,
Thanks for the feedback! My answers are inline below:

On 01/02/2013 11:35, H. Hirzel wrote:
> Hello Hernán
>
> This is interesting.
> http://biosmalltalk.blogspot.com/
>
> I understand that you have constructed an internal domain specific
> language (a DSL, a query language) for dealing with genetic data in
> Smalltalk
>
> search := BioNCBIWWWBlastClient new nucleotide query: 'CCCTCAAACAT...TTTGAGGAG';
>     hitListSize: 150;
>     filterLowComplexity;
>     expectValue: 10;
>     wordSize: 11;
>     blastn;
>     blastPlainService;
>     alignmentViewFlatQueryAnchored;
>     formatTypeXML;
>     fetch.
> search outputToFile: 'blast-query-result.xml' contents: search result.
>
> Is there a description of this DSL?

It is not a DSL in the traditional sense, i.e., one built with ANTLR,
Lex or Yacc, but an embedded "DSL" that inherits the syntax and
execution semantics of Smalltalk.
To clarify: I have not built a DSL specification for the QBlast API,
although I'm willing to develop DSLs for bioinformatics APIs in a
Smalltalk language workbench (anyone?).

Currently the messages for performing alignments at the NCBI are based
on the API specification,
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/node9.html . The unary
sends are the result of a plan to reduce parametrization and to make it
possible to replicate or customize Blast settings through a UI. This is
because geneticists experiment with changing Blast parameters over time,
and I don't want my system to be tied to textual parameters.
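
As a rough illustration of the idea above (argument-free messages instead
of textual parameters), here is a minimal sketch in Python rather than
Smalltalk. It is not BioSmalltalk's implementation: the class
BlastRequest and its method names are hypothetical, and the QBlast
parameter names and values (CMD, PROGRAM, DATABASE, FILTER, and so on)
are my reading of the URL API and should be checked against the
specification. Nothing is sent over the network; the sketch only builds
the encoded parameter string.

```python
from urllib.parse import urlencode


class BlastRequest:
    """Hypothetical fluent builder; not BioSmalltalk's actual code."""

    def __init__(self, query):
        # Parameter names are assumed from the QBlast URL API.
        self.params = {"CMD": "Put", "QUERY": query}

    def _set(self, key, value):
        self.params[key] = value
        return self  # returning self allows chaining, much like a cascade

    # Argument-free methods play the role of unary sends such as #blastn.
    def blastn(self):
        return self._set("PROGRAM", "blastn")

    def nucleotide(self):
        return self._set("DATABASE", "nt")  # assumed database name

    def filter_low_complexity(self):
        return self._set("FILTER", "L")  # assumed filter code

    def format_type_xml(self):
        return self._set("FORMAT_TYPE", "XML")

    # Methods taking an argument correspond to keyword messages.
    def hit_list_size(self, n):
        return self._set("HITLIST_SIZE", n)

    def expect_value(self, e):
        return self._set("EXPECT", e)

    def word_size(self, w):
        return self._set("WORD_SIZE", w)

    def encoded(self):
        """Return the settings as a URL-encoded parameter string."""
        return urlencode(self.params)


request = (
    BlastRequest("CCCTCAAACAT")  # placeholder sequence, not real input
    .nucleotide()
    .blastn()
    .filter_low_complexity()
    .hit_list_size(150)
    .expect_value(10)
    .word_size(11)
    .format_type_xml()
)
print(request.encoded())
```

Because every setting is a method rather than a string typed by the
user, a UI can list the available settings and store or replay a chosen
combination without parsing any text.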

> The data is kept in XML files and
> all is read into the image to be queried. It seems that you don't have
> a problem with the image size?

Yes, I had problems with image size and performance, quite a lot indeed.
Working with an XML DOM of alignments with 5000 or more hits, Squeak
(and Pharo of course) started to show slowness, so I cannot keep all
XML nodes in memory. To overcome this problem I've tried the SAX (push)
parser and the XMLPullParser (which is a StAX-style parser). My idea
then was to reduce the tree by specifying only the XML nodes I'm
interested in. After reducing the nodes, I wrote custom XML tree
classes with a specific API to query Blast XML results, taken from the
DTD specification. AFAIK this is known as an XML digester, which is
somewhat "evolved" in Java
(http://commons.apache.org/digester/xmlrules.html). I have also built a
dynamic query builder in Morphic for querying the XML, which makes it
possible to persist and update the filters. Unfortunately for Squeak
users I'm using the Polymorph API, which I think is not available in Squeak.
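
The node-reduction idea described above can be sketched as follows, in
Python rather than Smalltalk, using the standard library's incremental
parser. This is only an illustration under assumptions, not the custom
tree classes mentioned: the function reduced_hits and the WANTED tuple
are hypothetical names, the sample document is made up, and the element
names (Hit, Hit_id, Hit_len) are taken from my reading of the Blast XML
DTD.

```python
import io
import xml.etree.ElementTree as ET

# Made-up stand-in for a Blast XML result; real files hold far more elements.
SAMPLE = b"""<BlastOutput>
  <BlastOutput_iterations><Iteration><Iteration_hits>
    <Hit>
      <Hit_id>hit-1</Hit_id><Hit_def>first hit</Hit_def><Hit_len>120</Hit_len>
      <Hit_hsps><Hsp><Hsp_evalue>1e-20</Hsp_evalue></Hsp></Hit_hsps>
    </Hit>
    <Hit>
      <Hit_id>hit-2</Hit_id><Hit_def>second hit</Hit_def><Hit_len>95</Hit_len>
      <Hit_hsps><Hsp><Hsp_evalue>3e-05</Hsp_evalue></Hsp></Hit_hsps>
    </Hit>
  </Iteration_hits></Iteration></BlastOutput_iterations>
</BlastOutput>"""

# Only these child elements of each <Hit> are kept (hypothetical selection).
WANTED = ("Hit_id", "Hit_len")


def reduced_hits(source, wanted=WANTED):
    """Yield one small dict per <Hit>, keeping only the wanted children."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Hit":
            yield {tag: elem.findtext(tag) for tag in wanted}
            # Free the children of the processed hit; the emptied element
            # itself stays attached to its parent.
            elem.clear()


hits = list(reduced_hits(io.BytesIO(SAMPLE)))
print(hits)
```

The point of the design is that the full subtree of a hit exists only
while that hit is being read, so memory use depends on the size of one
hit plus the reduced records, not on the size of the whole file.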

We used the XML push/pull parsers for reading genomes and they worked
acceptably. But it is impossible to keep the nodes for 3 GBytes of XML
in memory, at least for now, in Squeak/Pharo.

More, and more critical, problems arise when trying to work in Smalltalk
with microarray data (big data), which is not document-oriented. I had
to switch to "solutions" like SQL, or HDF5 using PyTables with a
well-designed schema for our input. The advantages are that they support
indexing and reading data in blocks, besides tools like ViTables or
HDFView to navigate the data. Until someone provides some bits in this
field, there is little opportunity for using Smalltalk.
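
To make the two advantages mentioned above concrete (indexing and
reading data in blocks), here is a minimal sketch of the SQL variant
using Python's built-in sqlite3 module. It is not our actual schema: the
table and column names, the helper read_in_blocks, and the values are
all invented for illustration, and the HDF5/PyTables route is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path would be used for real data
conn.execute(
    "CREATE TABLE expression (probe_id TEXT, sample_id TEXT, intensity REAL)"
)
# The index lets lookups by probe avoid scanning the whole table.
conn.execute("CREATE INDEX idx_probe ON expression (probe_id)")

# Synthetic values, only to have something to read back.
rows = [
    (f"probe{p}", f"sample{s}", float(p * 10 + s))
    for p in range(100)
    for s in range(5)
]
conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", rows)


def read_in_blocks(cursor, block_size=64):
    """Yield lists of at most block_size rows until the cursor is exhausted."""
    while True:
        block = cursor.fetchmany(block_size)
        if not block:
            break
        yield block


# Block-wise read: only one block of rows is held by this loop at a time.
cursor = conn.execute("SELECT probe_id, sample_id, intensity FROM expression")
block_sizes = [len(block) for block in read_in_blocks(cursor)]

# Indexed lookup of all samples for a single probe.
probe7 = conn.execute(
    "SELECT sample_id, intensity FROM expression "
    "WHERE probe_id = ? ORDER BY sample_id",
    ("probe7",),
).fetchall()
print(len(block_sizes), probe7)
```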

> I would welcome a short writeup with a general introduction to what
> you are doing in http://biosmalltalk.blogspot.com/.
>
> Or pointers to papers (Castilian is fine)
>

We have submitted a paper recently and we are waiting for the review
results. In addition, we are preparing another paper on a phylogenetics
decision support system which includes text-mining and a rule engine.
I will try to write a blog entry next week with screenshots.

Best regards,

Hernán

> Kind regards
>
> Hannes Hirzel
>
> On 2/1/13, Hernán Morales Durand <hernan.morales at gmail.com> wrote:
>> Hi,
>>
>> Few days ago I created a port of BioSmalltalk for Squeak too.
>> BioSmalltalk is a library for doing Bioinformatics with Smalltalk. This
>> port is labelled "BioSqueak" and I expect to release a version for
>> Windows sometime soon. You can find it in:
>>
>> http://code.google.com/p/biosmalltalk/downloads/list
>>
>> I'm very interested in feedback.
>> Thanks for reading.
>>
>> Hernán
>>
>> --
>> Hernán Morales
>> Institute of Veterinary Genetics (IGEVET)
>> http://igevet.fcv.unlp.edu.ar
>> National Scientific and Technical Research Council (CONICET).
>> La Plata (1900), Buenos Aires, Argentina.
>> Telephone: +54 (0221) 421-1799.
>> Internal: 422
>> Fax: 425-7980 or 421-1799.
>>
>>
>
>


