speech synthesis

Paul Fernhout kfsoft at netins.net
Fri May 1 15:05:20 UTC 1998

Luciano -

Thanks for the reply. 

Luciano Esteban Notarfrancesco wrote:
> if you or others want it I can post it. 
> I will send it to you and to a Squeak ftp site the next Tuesday 
> (I don't have it here, sorry) and by the way I'll
> try to improve it a little.

Yes, please! I'm still very interested in trying out what you have done
when you are ready to release it, and look at what I could do to make it
go faster. I don't mean to rush you though if you want to make
improvements first. 

Best not to post it to the list (since I assume it is big). Email to me
directly is fine or let me know where you ftp it (or if you put it up
somewhere as a URL). If you want, I can put the package on our web site
somewhere for http downloading by others.

> > Is it possible to use the Smalltalk->C translator to take your work and
> > build it into the VM (like some current sound primitives)? It would have
> > to be written with the translator restrictions in mind, or the
> > translator would have to be expanded to support it.
> I tried to do it, but the translator does not support changing instance
> variables which are not integers (floats, for instance) and doesn't support
> returns which are not ^self.

John Maloney suggested in another post on your project that maybe only a
few key methods might need to be converted to C. That might get around
the floating point / return problems. 

Of course, one could also just plug parts of the rsynth code into the
Squeak shell too (but that might be ugly). Also, maybe it would only
take one or two modifications to the Smalltalk->C translator (as a
subclass perhaps) to get it to handle what your code needs (or perhaps
just to optimize the inner loops of your code). 

Are there one or two CPU-intensive operations -- like generating a
waveform from a function, or superimposing two waves -- at the heart of
rsynth that could be made into a primitive (like the sound primitives)?
Can the sound primitives do that already? I've heard of music algorithms
being used for TTS before. Would the Festival-like approach be easier to
do than rsynth by using the existing Squeak sound primitives?
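
To make the question concrete, here is a rough sketch of the two inner
loops I have in mind, written in Python only for illustration. The names
(sine_wave, mix, to_int16) are my own and are not taken from rsynth or
from the Squeak sound primitives; the 8 kHz rate is the one you mentioned.

```python
import math

SAMPLE_RATE = 8000  # Hz; the sampling rate mentioned in this thread


def sine_wave(freq_hz, duration_s, amplitude=1.0, sample_rate=SAMPLE_RATE):
    """Generate a waveform from a function (here a sine) as a list of floats."""
    n = round(duration_s * sample_rate)
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]


def mix(a, b):
    """Superimpose two waves by adding them sample by sample.

    The shorter wave is treated as if it were padded with silence.
    """
    if len(a) < len(b):
        a, b = b, a
    out = list(a)
    for i, s in enumerate(b):
        out[i] += s
    return out


def to_int16(samples):
    """Clip to [-1, 1] and scale to signed 16-bit integer samples."""
    return [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]


# Hypothetical usage: a 10 ms tone with a shorter overtone mixed in.
tone = mix(sine_wave(200, 0.01, 0.5), sine_wave(400, 0.005, 0.25))
pcm = to_int16(tone)
```

Loops like these do one multiply or add per sample, so at 8 kHz they are
exactly the sort of thing that might be worth turning into a primitive.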

> Roughly it needed about 10 or 20 seconds to say 'how are you' at a sampling
> rate of 8 kHz on a 486 DX2 66 with 12 MBs running Linux. It sounds like
> rsynth, with the exception of intonation (stress), that I have not yet
> implemented. It can probably be sped up by adding some primitives.

I run Squeak on a 486 DX2 66 24MB running Win95, a Pentium Pro 180MHz
64MB running NT4.0, and a Quadra 630 68040 66MHz 36MB. The PentiumPro is
probably up to twelve times faster for some things than the other two (3
for clock X 2 for 486->Pentium X 2 for Pentium->Pro). So it is possible
your code would run as is without translation on higher end Pentiums or
Mac PowerPCs in a second or two. I can try it out on both the 486 and
PPro and compare. Of course, since you run Linux, you're probably
getting better performance than these Windows sloths on the same
hardware (no offense intended to sloths of course, who play an essential
role in their ecological niche).
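
Spelling out that back-of-the-envelope estimate (the three factors are my
rough guesses from above, not benchmarks, and the 10 to 20 second range is
the figure you reported for the 486):

```python
# Rough speed-up factors from the message (estimates, not measurements).
clock = 3        # 180 MHz vs 66 MHz, rounded up from about 2.7
to_pentium = 2   # 486 -> Pentium
to_pro = 2       # Pentium -> Pentium Pro

speedup = clock * to_pentium * to_pro

# Reported time range to say 'how are you' on the 486 DX2 66, in seconds.
slow_times = (10.0, 20.0)
fast_times = tuple(t / speedup for t in slow_times)

print(f"speed-up: about {speedup}x")
print(f"estimated time: {fast_times[0]:.1f} to {fast_times[1]:.1f} seconds")
```

That works out to somewhere between a little under one second and a little
under two, if the factors really do multiply that cleanly.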

Dell, for example, is now advertising a 400MHz Pentium II for US$3699 w/
128MB RAM and 21" monitor. They were selling stuff at half that speed
for that price a year or two ago. That's about what I paid for a Gateway
386 25MHz 4MB 14" monitor system about ten years ago (the system on
which I first used ParcPlace's ObjectWorks for Smalltalk 2.5). So in a
couple of years speed may not even be an issue for this sort of waveform
manipulation work in pure Squeak Smalltalk (especially with that great
Jitter system which I expect will just keep improving).

Smalltalk -- the solution that just keeps getting better as computers
keep getting faster...

> I didn't mean using Festival code directly, but using a diphone
> concatenation approach just in the way Festival and other synthesizers
> (most of them I think) do. The disadvantage of that method is that
> it needs a big (between 4 and 12 MBs) base consisting of sampled spoken
> units coded in such a way to make it easy to change duration, pitch and
> amplitude of voice; these units are concatenated to form words and
> phrases. 
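
As I understand the concatenation step you describe, it could be sketched
roughly like this (Python for illustration only). The toy inventory, the
phoneme names, and the helper names are all made up; a real system would
store recorded units and would also adjust duration and pitch, which this
sketch leaves out and only scales amplitude.

```python
def diphones_for(phonemes):
    """Pair up neighbouring phonemes; '_' stands for silence at the edges."""
    padded = ["_"] + list(phonemes) + ["_"]
    return [padded[i] + "-" + padded[i + 1] for i in range(len(padded) - 1)]


def concatenate(units, inventory, gain=1.0):
    """Join the stored samples for each unit, scaling amplitude by gain."""
    out = []
    for name in units:
        out.extend(gain * s for s in inventory[name])
    return out


# Hypothetical inventory: a few samples per diphone instead of real recordings.
inventory = {
    "_-h": [0.0, 0.1],
    "h-aw": [0.2, 0.3, 0.2],
    "aw-_": [0.1, 0.0],
}

units = diphones_for(["h", "aw"])
speech = concatenate(units, inventory)
```

The size problem you mention comes from the inventory: one recorded unit
is needed for every phoneme pair the language uses.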

Macintalk for the Newton is only a 243K or so package but has very good
voice quality (and several voices). I would be very pleased if we had a
voice synthesizer for Squeak with similar specifications. Does anyone
know what approach Macintalk is based on? Maybe it has internal
compression of waveforms?

> The advantage of this approach is that the resulting voice is very
> natural sounding, so much so that it can even make the synthesizer sing, as
> in the Lyricos project.

A singing Squeak. That would be really cool!

-Paul Fernhout
Kurtz-Fernhout Software 
Developers of custom software and educational simulations
Creators of the GPL'd Garden with Insight(TM) garden simulator
