[squeak-dev] [ANN] BioSqueak 0.4
Hernán Morales Durand
hernan.morales at gmail.com
Fri Feb 1 16:31:24 UTC 2013
Hello Hannes,
Thanks for the feedback! My answers are inline below:
On 01/02/2013 11:35, H. Hirzel wrote:
> Hello Hernán
>
> This is interesting.
> http://biosmalltalk.blogspot.com/
>
> I understand that you have constructed an internal domain specific
> language (a DSL, a query language) for dealing with genetic data in
> Smalltalk
>
> search := BioNCBIWWWBlastClient new nucleotide query: 'CCCTCAAACAT...TTTGAGGAG';
> hitListSize: 150;
> filterLowComplexity;
> expectValue: 10;
> wordSize: 11;
> blastn;
> blastPlainService;
> alignmentViewFlatQueryAnchored;
> formatTypeXML;
> fetch.
> search outputToFile: 'blast-query-result.xml' contents: search result.
>
> Is there a description of this DSL?
It is not a DSL in the traditional sense, i.e., one built with ANTLR, Lex
or Yacc, but an embedded "DSL" that inherits the syntax and execution
semantics of Smalltalk.
To clarify: I've not built a DSL specification for the QBlast API,
although I'm willing to develop DSLs for bioinformatics APIs in a
Smalltalk language workbench (anyone?).
Currently the messages for performing alignments at the NCBI are based
on the API specification,
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/node9.html . The unary
sends are the result of a plan to reduce parametrization and to
replicate or customize Blast settings through a UI. This is because
geneticists experiment with changing Blast parameters over time, and I do
not want my system to be tied to textual parameters.
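
To illustrate the idea of argument-free settings outside Smalltalk, here is a minimal Python sketch. It is not BioSmalltalk code: the class `BlastQuery` and its method names are hypothetical, and the parameter names (`PROGRAM`, `FILTER`, `HITLIST_SIZE`, and so on) follow my reading of the NCBI URL API, so they should be checked against the specification before use. Nothing is sent over the network; the sketch only builds the encoded parameter string.

```python
from urllib.parse import urlencode


class BlastQuery:
    """Hypothetical fluent builder; not the BioSmalltalk implementation."""

    def __init__(self, sequence):
        # Assumption: CMD=Put submits a new search in the NCBI URL API.
        self.params = {"CMD": "Put", "QUERY": sequence}

    def _set(self, key, value):
        self.params[key] = value
        return self  # returning self allows chaining, similar to a cascade

    # Argument-free methods fix one setting each, so a UI could expose
    # them as checkboxes or menu entries instead of free text.
    def blastn(self):
        return self._set("PROGRAM", "blastn")

    def filter_low_complexity(self):
        return self._set("FILTER", "L")

    def format_type_xml(self):
        return self._set("FORMAT_TYPE", "XML")

    # Numeric settings still take an argument.
    def hit_list_size(self, n):
        return self._set("HITLIST_SIZE", n)

    def expect_value(self, e):
        return self._set("EXPECT", e)

    def word_size(self, n):
        return self._set("WORD_SIZE", n)

    def encoded(self):
        # urlencode converts non-string values with str().
        return urlencode(self.params)


query = (
    BlastQuery("CCCTCAAACAT")  # made-up short sequence
    .blastn()
    .filter_low_complexity()
    .hit_list_size(150)
    .expect_value(10)
    .word_size(11)
    .format_type_xml()
)
print(query.encoded())
```

Because every setting ends up as an entry in one dictionary, a set of Blast settings can be stored, copied and modified as data rather than re-typed as text.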
> The data is kept in XML files and
> all is read into the image to be queried. It seems that you don't have
> a problem with the image size?
Yes, I had problems with image size and performance, quite a lot indeed.
When working with the XML DOM on alignments of 5000 or more hits,
Squeak (and Pharo of course) started to show slowness. So I cannot
keep all XML nodes in memory. To overcome this problem I tried the
SAX (push) parser and the XMLPullParser (a StAX-style pull parser). Then
my idea was to reduce the tree by specifying only the XML nodes
I'm interested in. After reducing the nodes, I wrote custom XML tree
classes with a specific API to query Blast XML results, taken from the
DTD specification. AFAIK this is known as an XML digester, which is
somewhat more "evolved" in Java
(http://commons.apache.org/digester/xmlrules.html). I have built a
dynamic query builder in Morphic for querying the XML, with the
possibility of persisting and updating the filters. Unfortunately for Squeak
users, I'm using the Polymorph API, which I think is not available in Squeak.
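
The node-reduction idea can be sketched with a streaming parser. The example below uses Python's standard `xml.etree.ElementTree.iterparse` rather than the Squeak parsers mentioned above, so it shows the technique and not the BioSmalltalk classes. The function `iter_hits`, the `WANTED` tuple and the sample document are made up for illustration; the element names (`Hit`, `Hit_id`, `Hit_def`, `Hit_len`) are the ones used in NCBI Blast XML output.

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up document using Blast XML element names.
SAMPLE = b"""<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_id>id1</Hit_id>
          <Hit_def>first made-up hit</Hit_def>
          <Hit_len>120</Hit_len>
          <Hit_hsps><Hsp><Hsp_evalue>1e-20</Hsp_evalue></Hsp></Hit_hsps>
        </Hit>
        <Hit>
          <Hit_id>id2</Hit_id>
          <Hit_def>second made-up hit</Hit_def>
          <Hit_len>95</Hit_len>
          <Hit_hsps><Hsp><Hsp_evalue>3e-05</Hsp_evalue></Hsp></Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""

# Only these child nodes of each hit are kept; everything else is discarded.
WANTED = ("Hit_id", "Hit_def", "Hit_len")


def iter_hits(source, wanted=WANTED):
    """Yield one small dict per <Hit> without building the whole tree."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Hit":
            yield {tag: elem.findtext(tag) for tag in wanted}
            # Drop this hit's children; the emptied <Hit> element itself
            # stays attached to its parent.
            elem.clear()


hits = list(iter_hits(io.BytesIO(SAMPLE)))
for hit in hits:
    print(hit)
```

Each hit is reduced to a small record as soon as its closing tag is seen, so memory use depends mostly on the number of fields kept and not on the full size of the XML.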
We used the XML push/pull parsers for reading genomes and they
worked acceptably. But it is impossible to keep the nodes for 3 GBytes of
XML in memory, at least for now, in Squeak/Pharo.
More, and more critical, problems arise when trying to work in Smalltalk
with microarray data (big data), which is not document-oriented. I had to
switch to "solutions" like SQL, or HDF5 using PyTables with a
well-designed schema for our input. The advantages are that they support
indexing and reading data in blocks, and that tools like ViTables or
HDFView are available to navigate the data. Until someone provides some
bits in this field, there is little opportunity for using Smalltalk.
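
As a small illustration of the two advantages named above (indexing and reading data in blocks), here is a sketch using Python's standard `sqlite3` module in place of a full SQL server or HDF5/PyTables. The table layout, the column names and the generated values are invented for the example and do not reflect our real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expression (probe TEXT, sample TEXT, value REAL)")

# Made-up data: 100 probes x 5 samples = 500 rows.
rows = [
    (f"probe{p}", f"sample{s}", float(p * 10 + s))
    for p in range(100)
    for s in range(5)
]
conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", rows)

# The index lets lookups by probe avoid scanning the whole table.
conn.execute("CREATE INDEX idx_probe ON expression (probe)")


def read_blocks(cursor, block_size):
    """Yield lists of at most block_size rows until the cursor is exhausted."""
    while True:
        block = cursor.fetchmany(block_size)
        if not block:
            break
        yield block


# Block-wise read: only one block of rows is held at a time.
cur = conn.execute("SELECT probe, sample, value FROM expression ORDER BY rowid")
block_sizes = [len(block) for block in read_blocks(cur, 128)]

# Indexed lookup of a single probe.
one_probe = conn.execute(
    "SELECT sample, value FROM expression WHERE probe = ? ORDER BY sample",
    ("probe42",),
).fetchall()

print(block_sizes)
print(one_probe)
```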
> I would welcome a short writeup with a general introduction to what
> you are doing in http://biosmalltalk.blogspot.com/.
>
> Or pointers to papers (Castilian is fine)
>
We submitted a paper recently and are waiting for the review
results. In addition, we are preparing another paper on a
phylogenetics decision support system which includes text-mining and a
rule engine. I will try to write a blog entry with screenshots next week.
Best regards,
Hernán
> Kind regards
>
> Hannes Hirzel
>
> On 2/1/13, Hernán Morales Durand <hernan.morales at gmail.com> wrote:
>> Hi,
>>
>> A few days ago I created a port of BioSmalltalk for Squeak too.
>> BioSmalltalk is a library for doing Bioinformatics with Smalltalk. This
>> port is labelled "BioSqueak" and I expect to release a version for
>> Windows sometime soon. You can find it in:
>>
>> http://code.google.com/p/biosmalltalk/downloads/list
>>
>> I'm very interested in feedback.
>> Thanks for reading.
>>
>> Hernán
>>
>> --
>> Hernán Morales
>> Institute of Veterinary Genetics (IGEVET)
>> http://igevet.fcv.unlp.edu.ar
>> National Scientific and Technical Research Council (CONICET).
>> La Plata (1900), Buenos Aires, Argentina.
>> Telephone: +54 (0221) 421-1799.
>> Internal: 422
>> Fax: 425-7980 or 421-1799.
>>
>>
>
>