[squeak-dev] [ANN] BioSqueak 0.4

H. Hirzel hannes.hirzel at gmail.com
Sat Feb 16 19:00:29 UTC 2013


Hello Hernán

Thank you for your elaboration on the topic of BioSqueak.

On 2/1/13, Hernán Morales Durand <hernan.morales at gmail.com> wrote:
>
> Hello Hannes,
> Thanks for the feedback! Some answers then between the lines:
>
> El 01/02/2013 11:35, H. Hirzel escribió:
>> Hello Hernán
>>
>> This is interesting.
>> http://biosmalltalk.blogspot.com/
>>
>> I understand that you have constructed an internal domain specific
>> language (a DSL, a query language) for dealing with genetic data in
>> Smalltalk
>>
>> search := BioNCBIWWWBlastClient new nucleotide query:
>> 'CCCTCAAACAT...TTTGAGGAG';
>>     hitListSize: 150;
>>     filterLowComplexity;
>>     expectValue: 10;
>>     wordSize: 11;
>>     blastn;
>>     blastPlainService;
>>     alignmentViewFlatQueryAnchored;
>>     formatTypeXML;
>>     fetch.
>> search outputToFile: 'blast-query-result.xml' contents: search result.
>>
>> Is there a description of this DSL?
>
> It is not a DSL in the traditional sense, i.e., one built with ANTLR,
> Lex or Yacc, but an embedded "DSL" that inherits the syntax and
> execution semantics of Smalltalk.

Yes, I understand. That is the regular thing in Smalltalk, as every
Smalltalk domain model could be considered a DSL to a certain extent.

Lukas Renggli has a useful classification of DSLs in his PhD dissertation
   'Dynamic Language Embedding'
    http://scg.unibe.ch/archive/phd/renggli-phd.pdf
    Chapter 2

According to that you probably have an Internal DSL (chapter 2.1), right?


> To clarify: I've not built a DSL specification for the QBlast API,
> although I'm willing to develop DSLs for bioinformatics APIs in a
> Smalltalk language workbench (anyone?).

OK

> Currently the messages for performing alignments at the NCBI are based
> on the API specification,
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/node9.html . The unary
> sends are the result of a plan to reduce parametrization and to
> replicate or customize Blast settings through a UI. This is because
> geneticists experiment with changing Blast parameters over time and I
> want my system not to be tied to textual parameters.
>
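
As a minimal sketch of how unary sends and cascades can act as an
embedded DSL: the class BlastQuerySketch below is hypothetical and is
not part of BioSmalltalk, and the parameter keys are assumed from the
URL API page linked above. Each message only records a setting, so the
settings end up as an object rather than as a parameter string.

```smalltalk
"Hypothetical class for illustration only, not BioSmalltalk code."
Object subclass: #BlastQuerySketch
    instanceVariableNames: 'params'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Sketch-DSL'

BlastQuerySketch >> initialize
    "Start with an empty set of Blast settings."
    super initialize.
    params := Dictionary new

BlastQuerySketch >> blastn
    "Unary send: the selector itself names the setting."
    params at: 'PROGRAM' put: 'blastn'

BlastQuerySketch >> hitListSize: anInteger
    "Keyword send for a setting that needs a value."
    params at: 'HITLIST_SIZE' put: anInteger

BlastQuerySketch >> expectValue: aNumber
    params at: 'EXPECT' put: aNumber

BlastQuerySketch >> parameters
    "Answer the settings collected so far."
    ^ params
```

In a workspace the cascade (;) sends every message to the same
receiver, whatever the individual methods answer:

```smalltalk
| query |
query := BlastQuerySketch new.
query
    blastn;
    hitListSize: 150;
    expectValue: 10.
query parameters  "a Dictionary holding the three settings"
```

Because the settings live in a Dictionary, a UI could in principle
inspect, copy or edit them without parsing any text.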


>> The data is kept in XML files and
>> all is read into the image to be queried. It seems that you don't have
>> a problem with the image size?
>
> Yes, I had problems with image size and performance, quite a lot indeed.
> Actually, working with the XML DOM on alignments of 5000 or more hits,
> Squeak (and Pharo of course) started to show slowness. So I cannot
> keep all XML nodes in memory. To overcome this problem I've tried the
> SAX (push) parser and the XMLPullParser (which is a StAX parser). Then
> my idea was to reduce the tree by specifying only the XML nodes which
> I'm interested in. After reducing the nodes, I wrote custom XML tree
> classes with a specific API to query blast XML results, taken from the
> DTD specification. AFAIK this is known as an XML digester, which is
> somewhat "evolved" in Java
> (http://commons.apache.org/digester/xmlrules.html).

I understand that you took
      http://www.squeaksource.com/XMLSupport/
      (the XML support repo for Pharo; for Squeak, XML support is in
the trunk image)
and modified it.

> I have built a
> dynamic query builder in Morphic for querying the XML, providing the
> possibility to persist and update the filters. Unfortunately for Squeak
> users I'm using the Polymorph API, which I think is not available in
> Squeak.

A screenshot would be appreciated... :-)

> We worked using the XML push/pull parsers for reading genomes and they
> worked acceptably. But it is impossible to keep nodes for 3 GBytes of
> XML at least for now in Squeak/Pharo.

In my experience, keeping XML structures in the image is inefficient in
terms of memory usage. More efficient in-memory representations are
needed, with XML used only for reading from and writing to external files.

> More, and more critical, problems arise when trying to work in Smalltalk
> with microarray data (big data), which is not document-oriented. I had
> to switch to "solutions" like SQL, or HDF5 using PyTables with a
> well-designed schema for our input. The advantages are that these
> support indexing and reading data in blocks, besides tools like ViTables
> or HDFView to navigate the data. Until someone provides some bits in
> this field, there is little opportunity for using Smalltalk.

But what I understand is that people keep DNA data in memory for speed
reasons and use C++ or Perl programs to deal with it.

>> I would welcome a short writeup with a general introduction to what
>> you are doing in http://biosmalltalk.blogspot.com/.


>
> We have submitted a paper recently and we are waiting for the review
> results. Besides that, we are preparing another paper on a
> phylogenetics decision support system which includes text-mining and a
> rule engine. I will try to write an entry in the next week with
> screenshots.

Any news on this?

Kind regards
Hannes


> Best regards,
>
> Hernán
>
>> Kind regards
>>
>> Hannes Hirzel
>>
>> On 2/1/13, Hernán Morales Durand <hernan.morales at gmail.com> wrote:
>>> Hi,
>>>
>>> Few days ago I created a port of BioSmalltalk for Squeak too.
>>> BioSmalltalk is a library for doing Bioinformatics with Smalltalk. This
>>> port is labelled "BioSqueak" and I expect to release a version for
>>> Windows sometime soon. You can find it in:
>>>
>>> http://code.google.com/p/biosmalltalk/downloads/list
>>>
>>> I'm very interested in feedback.
>>> Thanks for reading.
>>>
>>> Hernán
>>>
>>> --
>>> Hernán Morales
>>> Institute of Veterinary Genetics (IGEVET)
>>> http://igevet.fcv.unlp.edu.ar
>>> National Scientific and Technical Research Council (CONICET).
>>> La Plata (1900), Buenos Aires, Argentina.
>>> Telephone: +54 (0221) 421-1799.
>>> Internal: 422
>>> Fax: 425-7980 or 421-1799.
>>>
>>>
>>
>>
>
>
>

