PhyloclassTalk (was: Re: [squeak-dev] [ANN] BioSqueak 0.4)

Germán Arduino garduino at gmail.com
Sat Feb 23 23:09:02 UTC 2013


Really nice UI Hernán! Congrats!

2013/2/23 Hernán Morales Durand <hernan.morales at gmail.com>:
> Hello Hannes,
>
> Sorry for the late response, I have been working intensively on an
> application using BioSmalltalk. Here is a post with some screenshots:
> http://biosmalltalk.blogspot.com.ar/2013/02/phyloclasstalk-preview.html
>
> As I've said, it is developed in Pharo but most subsystems work in Squeak
> too. I cross-post to the Pharo users list in case someone is interested.
>
> On 16/02/2013 16:00, H. Hirzel wrote:
>>
>> Hello Hernán
>>
>> Thank you for your elaboration on the topic of BioSqueak.
>>
>> On 2/1/13, Hernán Morales Durand <hernan.morales at gmail.com> wrote:
>>>
>>>
>>> Hello Hannes,
>>> Thanks for the feedback! Some answers then between the lines:
>>>
>>> On 01/02/2013 11:35, H. Hirzel wrote:
>>>>
>>>> Hello Hernán
>>>>
>>>> This is interesting.
>>>> http://biosmalltalk.blogspot.com/
>>>>
>>>> I understand that you have constructed an internal domain-specific
>>>> language (a DSL, a query language) for dealing with genetic data in
>>>> Smalltalk:
>>>>
>>>> search := BioNCBIWWWBlastClient new nucleotide
>>>>      query: 'CCCTCAAACAT...TTTGAGGAG';
>>>>      hitListSize: 150;
>>>>      filterLowComplexity;
>>>>      expectValue: 10;
>>>>      wordSize: 11;
>>>>      blastn;
>>>>      blastPlainService;
>>>>      alignmentViewFlatQueryAnchored;
>>>>      formatTypeXML;
>>>>      fetch.
>>>> search outputToFile: 'blast-query-result.xml' contents: search result.
>>>>
>>>> Is there a description of this DSL?
>>>
>>>
>>> It is not a DSL in the traditional sense, i.e., one built with ANTLR,
>>> Lex or Yacc, but an embedded "DSL" that inherits the syntax and
>>> execution semantics of Smalltalk.
>>
>>
>> Yes, I understand. That is the regular thing in Smalltalk, as every
>> Smalltalk domain model could be considered a DSL to a certain extent.
>>
>> Lukas Renggli has a useful classification of DSLs in his PhD dissertation
>> on
>>     'Dynamic Language Embedding'
>>      http://scg.unibe.ch/archive/phd/renggli-phd.pdf
>>      Chapter 2
>>
>> According to that you probably have an Internal DSL (chapter 2.1), right?
>>
>
> Yes, it would fit into the Internal DSL category. I didn't know about that
> classification, thanks for sharing.
>
>>
>>> To clarify: I've not built a DSL specification for the QBlast API,
>>> although I'm willing to develop DSLs for bioinformatics APIs in a
>>> Smalltalk language workbench (anyone?).
>>
>>
>> OK
>>
>>> Currently the messages for performing alignments at the NCBI are based
>>> on the API specification,
>>> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/node9.html .
>>
>>
>>> The unary sends are the result of a plan to reduce parametrization and to
>>> replicate or customize Blast settings through a UI. This is because
>>> geneticists experiment with changing Blast parameters over time and I want
>>> my system not to be tied to textual parameters.
>>>
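
A minimal sketch of why unary sends help here: a unary selector carries no arguments, so a group of settings can be stored as plain symbols (for example, collected from checkboxes in a UI) and replayed later with perform:. The OrderedCollection receiver and its selectors are only stand-ins (an assumption) so that the snippet runs in a plain Squeak or Pharo image; this is not how BioNCBIWWWBlastClient is implemented.

```smalltalk
| settings target |
"Stored settings: unary selectors kept as symbols, not as text parameters."
settings := #(#removeFirst #removeLast).
"Stand-in receiver (assumption); a Blast client would take this place."
target := OrderedCollection withAll: #(1 2 3 4).
"Replay each stored setting as a message send."
settings do: [:selector | target perform: selector].
target
```

After the loop, target holds 2 and 3. With the selectors from the query earlier in the thread, the same loop could send #blastn, #filterLowComplexity or #formatTypeXML to a client instance.
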
>>
>>
>>>   > The data is kept in XML files and
>>>   > all is read into the image to be queried. It seems that you don't have
>>>   > a problem with the image size?
>>>
>>> Yes, I had problems with image size and performance, a lot indeed.
>>
>>
>>> Actually, when working with the XML DOM on alignments of 5000 or more
>>> hits, Squeak (and Pharo of course) started to show slowness, so I cannot
>>> keep all XML nodes in memory. To overcome this problem I've tried the
>>> SAX (push) parser and the XMLPullParser (which is a StAX-style parser).
>>> Then my idea was to reduce the tree by specifying only the XML nodes which
>>> I'm interested in. After reducing the nodes, I wrote custom XML tree
>>> classes with a specific API to query Blast XML results, taken from the
>>> DTD specification. AFAIK this is known as an XML digester, which is
>>> somewhat "evolved" in Java
>>> (http://commons.apache.org/digester/xmlrules.html).
>>
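
The node-reduction idea can be sketched without any XML library: a push parser reports one element at a time, and the handler keeps only the elements named in a small set, so the full tree is never built. The event list is a hand-written stand-in for parser callbacks, and the element names and values are illustrative assumptions, not output from a real Blast run nor the custom classes described above.

```smalltalk
| wanted events kept |
"Names of the elements worth keeping (assumed for illustration)."
wanted := Set withAll: #('Hit_id' 'Hsp_evalue').
"Stand-in for the stream of elements a push parser would report."
events := {
    'Hit_num' -> '1'.
    'Hit_id' -> 'example-hit'.
    'Hsp_evalue' -> '0.001'.
    'Hsp_midline' -> '||||' }.
kept := OrderedCollection new.
"Keep an element only when its name is in the wanted set."
events do: [:each |
    (wanted includes: each key) ifTrue: [kept add: each]].
kept
```

Only the two wanted entries end up in kept; everything else is dropped as soon as it is seen, which is what keeps memory use bounded for large result files.
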
>>
>> I understand that you took
>>        http://www.squeaksource.com/XMLSupport/
>>        (the XML support repo for Pharo; for Squeak, XML support is in
>>        the trunk image)
>> and modified it.
>>
>>> I have built a
>>> dynamic query builder in Morphic for querying the XML, with the
>>> possibility to persist and update the filters. Unfortunately for Squeak
>>> users I'm using the Polymorph API, which I think is not available in
>>> Squeak.
>>
>>
>> A screen shot would be appreciated... :-)
>>
>
> Ok, the blog post includes some screenshots.
>
>>> We worked using the XML push/pull parsers for reading genomes and they
>>> worked acceptably. But it is impossible to keep nodes for 3 GBytes of
>>> XML at least for now in Squeak/Pharo.
>>
>>
>> In my experience, keeping XML structures in the image is
>> inefficient in terms of memory usage. More efficient ways are needed,
>> and XML is then only for reading/writing to external files.
>>
>
> Exactly, XML is not good at all for big data.
>
>>> More critical problems arise when trying to work in Smalltalk with
>>> microarray data (big data), which is not document-oriented. I had to
>>> switch to "solutions" like SQL, or HDF5 using PyTables with a
>>> well-designed schema for our input. The advantages are that these support
>>> indexing and reading data in blocks, and that tools like ViTables or
>>> HDFView can navigate the data. Until someone provides some bits in this
>>> field, there is little opportunity for using Smalltalk.
>>
>>
>> But what I understand is that people keep DNA data in memory for speed
>> reasons and use C++ or Perl programs to deal with it.
>>
>
> It really depends on the type of analysis. I've seen that most beginning
> bioinformaticians prefer Python over Perl because of the nicer syntax and
> more complete library support.
>
> I don't know of any big data projects using C++ with raw DNA data. Compression with
> indexing, and specialized file formats are used these days, splitting data
> in clusters where needed. I would love to see some Smalltalkers working on
> dataspaces too.
>
> See these presentations: http://www.slideshare.net/mndoci/presentations
>
>>>> I would welcome a short writeup with a general introduction to what
>>>> you are doing in http://biosmalltalk.blogspot.com/.
>>
>>
>>
>>>
>>> We have submitted a paper recently and we are waiting for the review
>>> results. Meanwhile, we are preparing another paper on a
>>> phylogenetics decision support system which includes text mining and a
>>> rule engine. I will try to write an entry next week with
>>> screenshots.
>>
>>
>> Any news on this?
>>
>
> No news so far, still in the reviewing process.
> Best regards,
>
> Hernán
>
>> Kind regards
>> Hannes
>>
>>
>>> Best regards,
>>>
>>> Hernán
>>>
>>>> Kind regards
>>>>
>>>> Hannes Hirzel
>>>>
>>>> On 2/1/13, Hernán Morales Durand <hernan.morales at gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> A few days ago I created a port of BioSmalltalk for Squeak too.
>>>>> BioSmalltalk is a library for doing bioinformatics with Smalltalk. This
>>>>> port is labelled "BioSqueak" and I expect to release a version for
>>>>> Windows sometime soon. You can find it at:
>>>>>
>>>>> http://code.google.com/p/biosmalltalk/downloads/list
>>>>>
>>>>> I'm very interested in feedback.
>>>>> Thanks for reading.
>>>>>
>>>>> Hernán
>>>>>
>>>>> --
>>>>> Hernán Morales
>>>>> Institute of Veterinary Genetics (IGEVET)
>>>>> http://igevet.fcv.unlp.edu.ar
>>>>> National Scientific and Technical Research Council (CONICET).
>>>>> La Plata (1900), Buenos Aires, Argentina.
>>>>> Telephone: +54 (0221) 421-1799.
>>>>> Internal: 422
>>>>> Fax: 425-7980 or 421-1799.
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>

