PhyloclassTalk (was: Re: [squeak-dev] [ANN] BioSqueak 0.4)

Hernán Morales Durand hernan.morales at gmail.com
Sat Feb 23 19:53:22 UTC 2013


Hello Hannes,

Sorry for the late response, I have been working intensively on an
application using BioSmalltalk. Here is a post with some screenshots:
http://biosmalltalk.blogspot.com.ar/2013/02/phyloclasstalk-preview.html

As I've said, it is developed in Pharo but most subsystems work in
Squeak too. I am cross-posting to the Pharo users list in case someone is
interested.

On 16/02/2013 16:00, H. Hirzel wrote:
> Hello Hernán
>
> Thank you for your elaboration on the topic of BioSqueak.
>
> On 2/1/13, Hernán Morales Durand <hernan.morales at gmail.com> wrote:
>>
>> Hello Hannes,
>> Thanks for the feedback! Some answers then between the lines:
>>
>> On 01/02/2013 11:35, H. Hirzel wrote:
>>> Hello Hernán
>>>
>>> This is interesting.
>>> http://biosmalltalk.blogspot.com/
>>>
>>> I understand that you have constructed an internal domain specific
>>> language (a DSL, a query language) for dealing with genetic data in
>>> Smalltalk
>>>
>>> search := BioNCBIWWWBlastClient new nucleotide query:
>>> 'CCCTCAAACAT...TTTGAGGAG';
>>>      hitListSize: 150;
>>>      filterLowComplexity;
>>>      expectValue: 10;
>>>      wordSize: 11;
>>>      blastn;
>>>      blastPlainService;
>>>      alignmentViewFlatQueryAnchored;
>>>      formatTypeXML;
>>>      fetch.
>>> search outputToFile: 'blast-query-result.xml' contents: search result.
>>>
>>> Is there a description of this DSL?
>>
>> It is not a DSL in the traditional sense, i.e., built with ANTLR, Lex or
>> Yacc, but an embedded "DSL", which thus inherits the syntax and execution
>> semantics of Smalltalk.
>
> Yes, I understand, the regular thing in Smalltalk, as every Smalltalk
> domain model could be considered a DSL to a certain extent.
>
> Lukas Renggli has a useful classification of DSLs in his PhD dissertation on
>     'Dynamic Language Embedding'
>      http://scg.unibe.ch/archive/phd/renggli-phd.pdf
>      Chapter 2
>
> According to that you probably have an Internal DSL (chapter 2.1), right?
>

Yes, it would fit into the Internal DSL category. I didn't know about
that classification, thanks for sharing.
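
To make the "internal DSL" idea concrete outside Smalltalk, here is a minimal
sketch in Python (not BioSmalltalk code). It mimics the cascade quoted above
with method chaining: every call records one setting and returns the same
object. The class BlastQuery and all its method names are hypothetical, the
query string is a placeholder, and the parameter names (PROGRAM, HITLIST_SIZE,
EXPECT, WORD_SIZE, FILTER, FORMAT_TYPE) are assumed to match the NCBI URL API
and should be checked against the specification. Nothing is sent over the
network.

```python
from urllib.parse import urlencode


class BlastQuery:
    """Hypothetical fluent builder; not part of BioSmalltalk or any NCBI library."""

    def __init__(self):
        # Settings are accumulated as plain key/value pairs.
        self.params = {"CMD": "Put"}

    def _set(self, key, value):
        self.params[key] = value
        return self  # returning self is what makes chaining possible

    def query(self, sequence):
        return self._set("QUERY", sequence)

    def hit_list_size(self, n):
        return self._set("HITLIST_SIZE", n)

    def filter_low_complexity(self):
        # Parameterless call, comparable to a unary send in Smalltalk.
        return self._set("FILTER", "L")

    def expect_value(self, e):
        return self._set("EXPECT", e)

    def word_size(self, n):
        return self._set("WORD_SIZE", n)

    def blastn(self):
        return self._set("PROGRAM", "blastn")

    def format_type_xml(self):
        return self._set("FORMAT_TYPE", "XML")

    def encoded(self):
        # Builds the URL-encoded parameter string only; no request is made.
        return urlencode(self.params)


search = (
    BlastQuery()
    .query("ACGTACGT")  # placeholder sequence, not real data
    .hit_list_size(150)
    .filter_low_complexity()
    .expect_value(10)
    .word_size(11)
    .blastn()
    .format_type_xml()
)
print(search.encoded())
```

The parameterless methods (blastn, filter_low_complexity, format_type_xml)
play the role of the unary sends: each one could be bound to a checkbox or
menu entry in a UI without passing text around.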

>
>> To clarify: I've not built a DSL specification for the QBlast API,
>> although I'm willing to develop DSLs for bioinformatics APIs in a
>> Smalltalk language workbench (anyone?).
>
> OK
>
>> Currently the messages for performing alignments at the NCBI are based
>> on the API specification,
>> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/node9.html . The unary
>> sends are the result of a plan to reduce parametrization and to
>> replicate or customize Blast settings through a UI. This is because
>> geneticists experiment with changing Blast parameters over time and I
>> want my system not to be tied to textual parameters.
>>
>
>
>>   > The data is kept in XML files and
>>   > all is read into the image to be queried. It seems that you don't have
>>   > a problem with the image size?
>>
>> Yes, I had problems with image size and performance, a lot indeed.
>> Actually, when working with the XML DOM on alignments of 5000 or more
>> hits, Squeak (and Pharo of course) started to show slowness. So I cannot
>> keep all XML nodes in memory. To overcome this problem I've tried the
>> SAX (push) parser and the XMLPullParser (which is a StAX parser). Then
>> my idea was to reduce the tree by specifying only the XML nodes which
>> I'm interested in. After reducing the nodes, I wrote custom XML tree
>> classes with a specific API to query blast XML results, taken from the
>> DTD specification. AFAIK this is known as an XML digester, which is
>> somewhat "evolved" in Java
>> (http://commons.apache.org/digester/xmlrules.html).
>
> I understand that you took
>        http://www.squeaksource.com/XMLSupport/
>        (the XML support repo for Pharo, for Squeak XML support is in
> the trunk image)
> and modified it.
>
>> I have built a
>> dynamic query builder in Morphic for querying the XML, providing the
>> possibility of persisting and updating the filters. Unfortunately for
>> Squeak users I'm using the Polymorph API, which I think is not available
>> in Squeak.
>
> A screen shot would be appreciated... :-)
>

Ok, the blog post includes some screenshots.

>> We worked using the XML push/pull parsers for reading genomes and they
>> worked acceptably. But it is impossible to keep nodes for 3 GBytes of
>> XML at least for now in Squeak/Pharo.
>
> According to my experience keeping XML structures in the image is
> inefficient in terms of memory usage. More efficient ways are needed
> and XML is then only for reading/writing to external files.
>

Exactly, XML is not good at all for big data.
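
For readers who have not used a streaming parser, the "keep only the nodes
you are interested in" approach described above can be sketched as follows.
This is Python's standard xml.etree.ElementTree.iterparse, not the Smalltalk
XMLPullParser, and the inline sample is a heavily simplified fragment that
only assumes the Hit, Hit_id and Hit_len element names from the Blast XML
output. The function name iter_hits is made up for the illustration.

```python
import io
import xml.etree.ElementTree as ET

# Simplified stand-in for a Blast XML result; real output has many more nodes.
SAMPLE = b"""<BlastOutput>
  <BlastOutput_iterations><Iteration><Iteration_hits>
    <Hit><Hit_id>id1</Hit_id><Hit_def>first hit</Hit_def><Hit_len>100</Hit_len></Hit>
    <Hit><Hit_id>id2</Hit_id><Hit_def>second hit</Hit_def><Hit_len>250</Hit_len></Hit>
  </Iteration_hits></Iteration></BlastOutput_iterations>
</BlastOutput>"""

WANTED = ("Hit_id", "Hit_len")  # the only child nodes we choose to keep


def iter_hits(source, wanted=WANTED):
    """Yield one small dict per <Hit>, discarding the rest of each subtree."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Hit":
            yield {tag: elem.findtext(tag) for tag in wanted}
            # Drop the children and text of this hit; an empty <Hit>
            # element stays attached to its parent.
            elem.clear()


hits = list(iter_hits(io.BytesIO(SAMPLE)))
print(hits)
```

Memory use then depends on the size of the reduced records rather than on the
full DOM, which is the same trade-off as with the custom tree classes.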

>> More and critical problems arise when trying to work with microarray
>> data (big data) in Smalltalk which is not document-oriented. I had to
>> switch to "solutions" like SQL, or HDF5 using PyTables with a
>> well-designed schema for our input. The advantages are that they support
>> indexing and reading data in blocks, besides tools like ViTables or
>> HDFView to navigate the data. Until someone provides some bits in this
>> field, there is little opportunity for using Smalltalk.
>
> But what I understand is that people keep DNA data in memory for speed
> reasons and use C++ or Perl programs to deal with it.
>

It really depends on the type of analysis. I've seen that most beginning
bioinformaticians prefer Python over Perl because of the nicer syntax
and more complete library support.

I don't know of any big data projects using C++ with raw DNA data.
Compression with indexing and specialized file formats are used these
days, splitting data across clusters where needed. I would love to see
some Smalltalkers working on dataspaces too.
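
As a rough illustration of what "indexing" buys, here is a minimal sketch in
Python that records the byte offset of every record in a FASTA-like file and
then reads a single sequence by seeking, without loading the rest. It is not
one of the specialized formats mentioned above and it leaves compression out
entirely; the function names build_index and fetch and the sample data are
invented for the example.

```python
import io


def build_index(handle):
    """Map each record name to the byte offset of its '>' header line."""
    index = {}
    handle.seek(0)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break  # end of file
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:  # ignore headers without a name
                index[fields[0].decode("ascii")] = offset
    return index


def fetch(handle, index, name):
    """Read one sequence by jumping straight to its recorded offset."""
    handle.seek(index[name])
    handle.readline()  # skip the header line itself
    parts = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # next record begins
        parts.append(line.strip())
    return b"".join(parts).decode("ascii")


# Tiny in-memory stand-in for a large file opened in binary mode.
DATA = b">seq1 first\nACGT\nACGT\n>seq2 second\nTTGA\n"
handle = io.BytesIO(DATA)
index = build_index(handle)
print(index)
print(fetch(handle, index, "seq2"))
```

The index is small enough to keep in memory or store next to the data file,
so only the requested block is ever read.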

See these presentations: http://www.slideshare.net/mndoci/presentations

>>> I would welcome a short writeup with a general introduction to what
>>> you are doing in http://biosmalltalk.blogspot.com/.
>
>
>>
>> We have submitted a paper recently and we are waiting for the review
>> results. Separately, we are preparing another paper on a
>> phylogenetics decision support system which includes text-mining and a
>> rule engine. I will try to write an entry in the next week with
>> screenshots.
>
> Any news on this?
>

No news so far, it is still in the review process.

Best regards,

Hernán

> Kind regards
> Hannes
>
>
>> Best regards,
>>
>> Hernán
>>
>>> Kind regards
>>>
>>> Hannes Hirzel
>>>
>>> On 2/1/13, Hernán Morales Durand <hernan.morales at gmail.com> wrote:
>>>> Hi,
>>>>
>>>> A few days ago I created a port of BioSmalltalk for Squeak too.
>>>> BioSmalltalk is a library for doing Bioinformatics with Smalltalk. This
>>>> port is labelled "BioSqueak" and I expect to release a version for
>>>> Windows sometime soon. You can find it in:
>>>>
>>>> http://code.google.com/p/biosmalltalk/downloads/list
>>>>
>>>> I'm very interested in feedback.
>>>> Thanks for reading.
>>>>
>>>> Hernán
>>>>
>>>> --
>>>> Hernán Morales
>>>> Institute of Veterinary Genetics (IGEVET)
>>>> http://igevet.fcv.unlp.edu.ar
>>>> National Scientific and Technical Research Council (CONICET).
>>>> La Plata (1900), Buenos Aires, Argentina.
>>>> Telephone: +54 (0221) 421-1799.
>>>> Internal: 422
>>>> Fax: 425-7980 or 421-1799.
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>


