Audio Engineering Society Singapore Section

Time-Scale Modification Algorithm for Audio & Speech Signal Applications*
Mr. Amerson Lin Hong Jie

Friday, 8 May 1998

reported by: Dr. Roland K C Tan
                  Chairman (Term 1997/98)

On Friday, 8 May 1998, about 12 members and 36 invited guests attended a seminar on "Time-Scale Modification Algorithm for Audio & Speech Signal Applications" organised by the Audio Engineering Society (AES) Singapore Section. The seminar was based on our paper, which was also presented at the AES 104th Convention (preprint no. 4644)* at the RAI Congress Centre in Amsterdam on Sunday, 17 May 1998.

The speaker for the evening was my co-author, 15-year-old student Amerson H J Lin. The young prodigy from Raffles Institution, Singapore's premier school, is sitting for his GCE 'O' level exam at the end of this year and has been involved in DSP research projects under my mentorship since January 1996. The event was held in the Staff Conference Room of the Electrical Engineering Department at Ngee Ann Polytechnic. The audience comprised audio professionals and tertiary students from the local TV and radio broadcasting stations, the audio industry, and the local universities and polytechnics. Amerson's teachers and classmates from Raffles Institution were also present during his talk.

A high-quality time-scale modification algorithm applicable to both digital audio and speech is a useful feature for dedicated audio systems. A novel approach called SASOLA (sub-band analysis synchronous overlap-and-add) allows the tempo of music to be varied by up to a factor of two, in either expansion or reduction, without affecting the pitch of the musical instruments or the characteristics of the singer's voice.
 

Amerson Lin, a 15-year-old prodigy from Singapore's premier school Raffles Institution, presenting a talk on his research findings at Ngee Ann Polytechnic -
photograph by Mr. Michael Teh, Committee Member.

Time-scale modification algorithms originally developed for speech, such as the pitch synchronous overlap-and-add (PSOLA) technique, can produce excellent results. However, they may not perform as well with audio signals because accurate pitch prediction is difficult to achieve on an audio waveform. The proposed SASOLA algorithm offers an alternative to the existing time-scale modification algorithms because of its computational efficiency and higher output sound quality. SASOLA uses sub-band analysis and is based on the time-domain synchronous overlap-and-add (SOLA) technique originally developed for speech.

The overlap-and-add principle concatenates two frames of speech/audio samples (that is, the analysis frame and the synthesis frame) at the alignment point in the region of overlap where the two frames are most similar. In SOLA, this point is found by maximising the cross-correlation function between the analysis frame and the synthesis frame in the overlapping region.
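The alignment search described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation from the paper: the function names, the use of a normalised cross-correlation, the linear cross-fade and the parameters `overlap` and `max_shift` are all assumptions made for the example. It also assumes the incoming frame holds at least `overlap + max_shift` samples.

```python
import math


def best_offset(tail, frame, max_shift):
    """Return the shift k (0..max_shift) at which frame[k:k+len(tail)]
    has the highest normalised cross-correlation with tail."""
    n = len(tail)
    tail_energy = sum(a * a for a in tail)
    best_k, best_score = 0, -math.inf
    for k in range(max_shift + 1):
        seg = frame[k:k + n]
        if len(seg) < n:              # ran out of samples in the frame
            break
        num = sum(a * b for a, b in zip(tail, seg))
        den = math.sqrt(tail_energy * sum(b * b for b in seg))
        score = num / den if den > 0.0 else 0.0
        if score > best_score:
            best_k, best_score = k, score
    return best_k


def overlap_add(output, frame, overlap, max_shift):
    """Append frame to output in place, cross-fading over `overlap`
    samples at the best-aligned shift. Returns the chosen shift."""
    tail = output[-overlap:]
    k = best_offset(tail, frame, max_shift)
    aligned = frame[k:]
    start = len(output) - overlap
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)   # linear fade-in weight for the new frame
        output[start + i] = (1.0 - w) * tail[i] + w * aligned[i]
    output.extend(aligned[overlap:])
    return k
```

For a pure tone the search should pick the shift that brings the new frame back into phase with the existing output, so the joined signal continues without a discontinuity. Broadband music rarely offers such a clean match, which is the difficulty the report goes on to describe.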
 

Amerson Lin (centre, in school blazer and tie) with AES members and guests after the talk - photograph by Christopher K C Yap, Treasurer.

Unlike speech signals, most broadband audio signals spanning the full audible bandwidth from 20 Hz to 20 kHz contain high transients and are inherently non-stationary, so the best alignment point found in the region of overlap may not always be ideal. The resulting misalignment introduces audible distortions that affect the pitch of the instruments and the singer's voice. The SASOLA algorithm overcomes this problem by first decomposing the broadband signal into narrower sub-bands and then performing overlap-and-add on the individual bands. Within each narrower band the signal becomes more predictable, so a better alignment can be achieved, which improves the overall output sound quality.

Although the computational complexity of the SOLA algorithm is lower than that of frequency-domain processing techniques such as the short-time Fourier transform (STFT) algorithm, a single-chip real-time hardware implementation for both audio and speech applications at a 44.1 kHz or 48 kHz sampling frequency is not viable. This is due to the compute-intensive time-domain cross-correlation in the SOLA algorithm. However, the overall computational efficiency can be increased simply by moving the cross-correlation computation from the time domain to the frequency domain, using the convolution-multiplication relationship, which can be proven mathematically.
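The time-domain versus frequency-domain trade-off can be illustrated with a small Python sketch. It is an assumption-laden example rather than the method used in the paper: the helper names `fft`, `xcorr_direct` and `xcorr_fft` are hypothetical, and the recursive radix-2 FFT is written out only to keep the example self-contained. The direct form costs on the order of N operations per lag, while the FFT form obtains all lags at once in roughly N log N operations.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT. len(x) must be a power of two.
    The inverse transform is returned unscaled."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def xcorr_direct(x, y, max_lag):
    """Time-domain form: r[k] = sum over n of x[n] * y[n + k], k = 0..max_lag."""
    return [sum(x[n] * y[n + k] for n in range(len(x)) if n + k < len(y))
            for k in range(max_lag + 1)]


def xcorr_fft(x, y, max_lag):
    """Same result via the frequency domain: inverse FFT of conj(X) * Y.
    Zero-padding to at least len(x) + len(y) avoids circular wrap-around."""
    size = 1
    while size < len(x) + len(y):
        size *= 2
    X = fft(list(x) + [0.0] * (size - len(x)))
    Y = fft(list(y) + [0.0] * (size - len(y)))
    R = fft([a.conjugate() * b for a, b in zip(X, Y)], inverse=True)
    return [R[k].real / size for k in range(max_lag + 1)]
```

Both functions should return the same values up to rounding error, so the frequency-domain version can replace the direct sum inside an alignment search without changing which lag is chosen.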

Amerson H J Lin (right) receiving the plaque from Chairman AES Singapore Section, Dr. Roland K C Tan - photograph by C S Lim, SBA.

The sound quality obtained with a commercial technique on speech and on music signals was subjectively compared during a sound demo session that followed the presentation. With speech signals, the sound quality was clearly superior to that obtained with music signals, for both time-scale expansion and reduction. This can be explained by the fact that the time-domain waveform of a music signal is generally more complex and non-stationary (high variations with time) than that of speech, so a good cross-correlation between frames is difficult to achieve for music.

Therefore, decomposing the full audio bandwidth into smaller sub-bands reduces the complexity of the music signal in each band, making it more "stationary". A better cross-correlation between frames can then be achieved. After time-scale modification, the processed sub-band signals are synthesised (combined) again to obtain the output music signal at full bandwidth.
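The split-and-recombine idea can be shown with a deliberately simple two-band example in Python. The filter bank actually used by SASOLA is not described in this report, so the moving-average low band and its complementary high band below are assumptions chosen only because they sum back to the input; `split_two_bands`, `recombine` and `taps` are hypothetical names.

```python
def split_two_bands(x, taps=5):
    """Split x into a low band (centred moving average over `taps` samples,
    shortened at the edges) and a complementary high band, so that
    low[n] + high[n] equals x[n] up to rounding."""
    half = taps // 2
    low = []
    for n in range(len(x)):
        window = x[max(0, n - half):n + half + 1]
        low.append(sum(window) / len(window))
    high = [xi - li for xi, li in zip(x, low)]
    return low, high


def recombine(bands):
    """Sum equal-length band signals sample by sample."""
    return [sum(samples) for samples in zip(*bands)]
```

In a sub-band scheme of the kind described above, each band would be time-scaled with its own alignment search before `recombine` is applied, and the processed bands would need to have the same length for the sum to be meaningful.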

Amerson Lin presenting during the AES 104th Convention at the RAI Congress Centre, Amsterdam - photograph by Dr. Roland K C Tan, Chairman. 

A second subjective comparison was then made between SASOLA and the commercial technique. At twice the expansion (-50%) in particular, time-scale modification using the commercial technique generated audible "echo" and "stuttering" distortion effects in the background. At twice the reduction (+100%), on the other hand, information was clearly missing. These audible distortions were eliminated using SASOLA.

The distortions were more obvious with contemporary pop music containing instruments that produce high-transient waveforms, such as kick drum, castanets or hi-hats. Overall, the audience felt that the results obtained using SASOLA better retained the pitch and tone characteristics of the original music and speech signals.

The technology developed is suitable for applications in the pro-audio, communication, broadcast and entertainment industries. For example, in lip synchronisation during voice-over work or in special sound effects for cinematography, there is no need to re-record the actor's voice or to involve the orchestra again. This could save both time and money. With CD, DAB and a digital mixer, a DJ can vary the tempo of music for a "smooth mix" without affecting the original music signal characteristics - a technique currently not possible with an analogue mixer. In communication system applications, long recorded or 'live' voice messages from a fast talker can be slowed down to improve intelligibility. Similarly, listening time can be shortened by speeding up music and speech recordings during playback.
 

Dr. Roland K C Tan (left) with Amerson H J Lin (right) in front of the RAI Congress Centre, Amsterdam, The Netherlands, on Sunday, 17 May 1998 - photograph by C S Lim, SBA.

REFERENCES

* Amerson H J Lin & Roland K C Tan, "Time-Scale Modification Algorithm for Audio and Speech Signal Applications", presented at the 104th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 46, p. 574 (1998 June), preprint 4644.

* Roland K C Tan & Amerson H J Lin, "A Time-Scale Modification Algorithm Based on the Subband Time-Domain Technique for Broad-Band Signal Applications", J. Audio Eng. Soc., vol. 48, no. 5, pp. 437-449 (2000 May).

 


Copyright 1998 AES Singapore Section