Has anyone been working on speech synthesis for Squeak? I have roughly implemented -entirely in Squeak- a synthesizer based upon rsynth (Klatt cascade/parallel filter bank synthesizer and rule-based text to phonemes translator), but it's just an experiment... the quality is poor and it's very slow. I'd like to do something in the diphone concatenation direction (just like Festival does, for instance). I'd like to know of other people working on the same thing for Squeak, if there are any.
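For anyone unfamiliar with the Klatt design mentioned above: the core building block is a two-pole digital resonator, one per formant, excited by a glottal pulse source. The sketch below shows the idea in Python purely for illustration (the actual project is in Squeak Smalltalk, and the function names and vowel formant values here are my own assumptions, not taken from rsynth):

```python
import math

def resonator(signal, f, bw, fs):
    """Two-pole digital resonator of the kind used in Klatt-style
    formant synthesis: center frequency f, bandwidth bw, sample rate fs."""
    c = -math.exp(-2 * math.pi * bw / fs)
    b = 2 * math.exp(-math.pi * bw / fs) * math.cos(2 * math.pi * f / fs)
    a = 1.0 - b - c                      # normalizes gain to 1 at DC
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(f0, fs, n_samples):
    """Crude stand-in for a glottal pulse source at pitch f0."""
    period = int(fs / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

fs = 8000
source = impulse_train(100, fs, 800)     # 100 ms of a 100 Hz pulse train
wave = source
# cascade of three formant resonators (rough values for the vowel /a/)
for f, bw in [(700, 60), (1220, 70), (2600, 110)]:
    wave = resonator(wave, f, bw, fs)
```

In the cascade configuration the resonators are simply chained, as in the loop above; the parallel branch of a full Klatt synthesizer adds per-formant amplitude controls, which this sketch omits.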
Squeak! Luciano.-
Luciano Esteban Notarfrancesco wrote:
I have roughly implemented -entirely in Squeak- a synthesizer based upon rsynth (Klatt cascade/parallel filter bank synthesizer and rule-based text to phonemes translator), but it's just an experiment...
Wow!
Rsynth is public domain I believe. Do you have any plans to put your work with it under the Squeak license (or something similar)? If so, I'd be interested in helping out as a tester or with comments related to getting it to work efficiently or better under Squeak.
We have a Delphi product (an interactive story authoring system called StoryHarp) which I'd like to see in Squeak (we almost wrote it in Squeak in the first place). It uses speech recognition and text-to-speech (although the speech recognition is not as essential).
A native Squeak text-to-speech (TTS) system would be much easier to manage across multiple platforms than an API w/ primitives to call out to platform-dependent TTS systems.
the quality is poor and it's very slow. I'd like to do something in the diphone concatenation direction (just like Festival does, for instance).
How slow is it? What system (processor/speed/memory) are you running it on? Is it as understandable as the original rsynth itself? Are there specific Squeak speed related problems (like clicking or drop outs or stuttering) that could be worked around?
Is it possible to use the Smalltalk->C translator to take your work and build it into the VM (like some current sound primitives)? It would have to be written with the translator restrictions in mind, or the translator would have to be expanded to support it.
Festival is only free for non-commercial use I believe. If you used that code directly, your work could not be part of the general Squeak distribution under the Squeak license (although I guess it could be distributed as an add on).
I'd like to know of other people working on the same thing for Squeak, if there are any.
I'd be very interested in finding this out too.
I've looked at linking Squeak to SAPI and Microsoft Agent under Windows via Squeak primitives -- but that would just provide access to external TTS and SR systems. I did not get anything working though, because I decided to use Delphi instead. I don't know much about implementing TTS or SR engines themselves, but I've done some signal processing stuff for machine vision and acoustic positioning.
It may sound silly, but TTS is a real whiz-bang feature that would attract more attention to Squeak. People are very excited by Macintalk on the Newton, for example. Also, I've heard one can use a good TTS system to generate wave forms to use as the basis of a speech recognition algorithm. Your work could enable a lot of really neat stuff.
-Paul Fernhout Kurtz-Fernhout Software ========================================================= Developers of custom software and educational simulations Creators of the Garden with Insight(TM) garden simulator http://www.kurtz-fernhout.com
On Tue, 28 Apr 1998, Paul Fernhout wrote:
Rsynth is public domain I believe. Do you have any plans to put your work with it under the Squeak license (or something similar)? If so, I'd be interested in helping out as a tester or with comments related to getting it to work efficiently or better under Squeak.
It was just an experiment... I'd like to have the time to do it much better. Anyway, everything I do is always public domain, so if you or others want it I can post it. I will send it to you and to a Squeak ftp site next Tuesday (I don't have it here, sorry), and along the way I'll try to improve it a little.
How slow is it? What system (processor/speed/memory) are you running it on? Is it as understandable as the original rsynth itself? Are there specific Squeak speed related problems (like clicking or drop outs or stuttering) that could be worked around?
Roughly, it needed about 10 to 20 seconds to say 'how are you' at a sampling rate of 8 kHz on a 486 DX2 66 with 12 MB running Linux. It sounds like rsynth, with the exception of intonation (stress), which I have not yet implemented. It could probably be sped up by adding some primitives.
Is it possible to use the Smalltalk->C translator to take your work and build it into the VM (like some current sound primitives)? It would have to be written with the translator restrictions in mind, or the translator would have to be expanded to support it.
I tried to do it, but the translator does not support changing instance variables which are not integers (floats, for instance) and does not support returns other than ^self.
Festival is only free for non-commercial use I believe. If you used that code directly, your work could not be part of the general Squeak distribution under the Squeak license (although I guess it could be distributed as an add on).
I didn't mean using Festival code directly, but using a diphone concatenation approach just in the way Festival and other synthesizers (most of them, I think) do. The disadvantage of that method is that it needs a large (between 4 and 12 MB) database consisting of sampled spoken units, coded in such a way that it is easy to change the duration, pitch and amplitude of the voice; these units are concatenated to form words and phrases. The advantage of this approach is that the resulting voice sounds very natural -- so natural that it can even make the synthesizer sing, as in the Lyricos project.
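The joining step of diphone concatenation can be sketched very simply: overlap the end of one sampled unit with the start of the next and crossfade between them. This Python sketch shows only that step, under my own assumptions (the sine "units" stand in for recorded diphones, and a real system would also adjust pitch and duration pitch-synchronously, e.g. with PSOLA, before joining):

```python
import math

def crossfade_concat(units, fs, overlap_ms=10):
    """Concatenate sampled speech units with a linear crossfade at each joint."""
    n_ov = int(fs * overlap_ms / 1000)
    out = list(units[0])
    for unit in units[1:]:
        tail = out[-n_ov:]          # fading-out end of the previous unit
        del out[-n_ov:]
        for i in range(n_ov):
            w = i / n_ov            # fade weight ramps 0 -> 1
            out.append(tail[i] * (1 - w) + unit[i] * w)
        out.extend(unit[n_ov:])
    return out

fs = 8000
# two hypothetical 50 ms "units" (stand-ins for recorded diphones)
u1 = [math.sin(2 * math.pi * 200 * t / fs) for t in range(400)]
u2 = [math.sin(2 * math.pi * 300 * t / fs) for t in range(400)]
joined = crossfade_concat([u1, u2], fs)
```

Each joint consumes one overlap window, so two 400-sample units with an 80-sample overlap yield 720 samples; the crossfade is what avoids the audible click a plain splice would produce.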
regards, Luciano.-
Luciano -
Thanks for the reply.
Luciano Esteban Notarfrancesco wrote:
if you or others want it I can post it. I will send it to you and to a Squeak ftp site the next Tuesday (I don't have it here, sorry) and by the way I'll try to improve it a little.
Yes, please! I'm still very interested in trying out what you have done when you are ready to release it, and look at what I could do to make it go faster. I don't mean to rush you though if you want to make improvements first.
Best not to post it to the list (since I assume it is big). Email to me directly is fine or let me know where you ftp it (or if you put it up somewhere as a URL). If you want, I can put the package on our web site somewhere for http downloading by others.
Is it possible to use the Smalltalk->C translator to take your work and build it into the VM (like some current sound primitives)? It would have to be written with the translator restrictions in mind, or the translator would have to be expanded to support it.
I tried to do it, but the translator does not support changing instance variables which are not integers (floats, for instance) and does not support returns other than ^self.
John Maloney suggested in another post on your project that maybe only a few key methods might need to be converted to C. That might get around the floating point / return problems.
Of course, one could also just plug parts of the rsynth code into the Squeak shell (but that might be ugly). Also, maybe it would only take one or two modifications to the Smalltalk->C translator (as a subclass perhaps) to get it to handle what your code needs (or perhaps just to optimize the inner loops of your code).
Are there one or two CPU-intensive operations -- like generating a wave form from a function, or superimposing two waves -- at the heart of rsynth that could be made into primitives (like the sound primitives)? Can the sound primitives do that already? I've heard of music algorithms being used for TTS before. Would the Festival-like approach be easier to do than rsynth by using the existing Squeak sound primitives?
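To make the "superimposing two waves" candidate concrete, the inner loop of such a primitive might look like the following. This is a Python sketch under my own assumptions (the name `mix_into` is hypothetical, and I'm assuming signed 16-bit sample buffers like those the Squeak sound system plays):

```python
def mix_into(buffer, samples, offset=0):
    """Superimpose `samples` onto `buffer` starting at `offset`,
    clipping each sum to the signed 16-bit range."""
    for i, s in enumerate(samples):
        v = buffer[offset + i] + s
        buffer[offset + i] = max(-32768, min(32767, v))
    return buffer

# 30000 + 10000 overflows 16 bits and is clipped; other samples add normally
mixed = mix_into([30000, 0, -100], [10000, -5], offset=0)
```

A loop like this touches only integer state and returns the receiver's buffer, which is exactly the shape of code the current Smalltalk->C translator restrictions (integers only, ^self returns) can accommodate.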
Roughly, it needed about 10 to 20 seconds to say 'how are you' at a sampling rate of 8 kHz on a 486 DX2 66 with 12 MB running Linux. It sounds like rsynth, with the exception of intonation (stress), which I have not yet implemented. It could probably be sped up by adding some primitives.
I run Squeak on a 486 DX2 66 24MB running Win95, a Pentium Pro 180Mhz 64MB running NT4.0, and a Quadra 630 68040 66Mhz 36MB. The PentiumPro is probably up to twelve times faster for some things than the other two (3 for clock X 2 for 486->Pentium X 2 for Pentium->Pro). So it is possible your code would run as is, without translation, on higher-end Pentiums or Mac PowerPCs in a second or two. I can try it out on both the 486 and PPro and compare. Of course, since you run Linux, you're probably getting better performance than these Windows sloths on the same hardware (no offense intended to sloths of course, who play an essential role in their ecological niche).
Dell, for example, is now advertising a 400Mhz Pentium II for US$3699 w/ 128MB RAM and 21" monitor. They were selling stuff at half that speed for that price a year or two ago. That's about what I paid for a Gateway 386 25Mhz 4MB 14" monitor system about ten years ago (the system on which I first used ParcPlace's ObjectWorks for Smalltalk 2.5). So in a couple of years speed may not even be an issue for this sort of waveform manipulation work in pure Squeak Smalltalk (especially with that great Jitter system which I expect will just keep improving).
Smalltalk -- the solution that just keeps getting better as computers keep getting faster...
I didn't mean using Festival code directly, but using a diphone concatenation approach just in the way Festival and other synthesizers (most of them, I think) do. The disadvantage of that method is that it needs a large (between 4 and 12 MB) database consisting of sampled spoken units, coded in such a way that it is easy to change the duration, pitch and amplitude of the voice; these units are concatenated to form words and phrases.
Macintalk for the Newton is only a 243K or so package but has very good voice quality (and several voices). I would be very pleased if we had a voice synthesizer for Squeak with similar specifications. Does anyone know what approach Macintalk is based on? Maybe it has internal compression of waveforms?
The advantage of this approach is that the resulting voice sounds very natural -- so natural that it can even make the synthesizer sing, as in the Lyricos project.
A singing Squeak. That would be really cool!
-Paul Fernhout Kurtz-Fernhout Software ========================================================= Developers of custom software and educational simulations Creators of the GPL'd Garden with Insight(TM) garden simulator http://www.kurtz-fernhout.com
squeak-dev@lists.squeakfoundation.org