Speech production is a highly complex process that involves approximately 100 muscles. Even more critical is the precise timing of muscle contractions required to produce temporally complex speech sounds. Speech production is conventionally subdivided into three sub-processes: respiration, phonation and articulation. Respiration is typically described as the mechanical process of air flowing into and out of the lungs. As part of respiration, forced expiration, which involves the cooperative contraction of the muscles of the trunk and abdomen, provides the airflow coming from the lungs. The air then passes through the vocal folds. The vibration of the vocal folds, caused by the air pressure and laryngeal muscle contraction, produces a periodic sound wave; this process is called phonation. During the last stage of speech production, the periodic sound wave passes through the pharynx, mouth and nasal cavities. The varying shapes of the oral cavity, pharynx and nasal cavities shift the resonant frequencies of the tract. This process of shape adjustment is called articulation and requires the precise coordination of the mobile articulators (i.e. the lips, tongue and lower jaw).
In the last decade, several articulatory synthesis systems have been proposed that model the physical and physiological processes taking place in speech production. The table below summarises the key aspects of the current models.
Key aspects compared:
- modelling the geometry of the pharynx, larynx and nasal cavities
- modelling of the vocal folds
- naturalness of the synthesized speech

Models covered:
- the geometrical articulatory model generating high-quality articulatory and acoustic speech signals
- the geometrical model based on reconstruction of the vocal tract using cineradiography, labiofilm and
- the parametrical 2D model of the vocal tract that defines the relation between the mid-sagittal dimension of the VT and its cross-sectional area using the parameters α and β
- the improved αβ-model with eight parameters to control the sub-models representing the mobile articulators

Table 1. Articulatory models of speech synthesis
The most commonly used method for defining the relation between the mid-sagittal dimension of the VT and its cross-sectional area was first published in the study by Heinz and Stevens 1. It has come to be known as the αβ-model after its parameters. It relates the sagittal cross-sectional distance d to the cross-sectional area A as follows:

A = α·d^β

Here, α and β are parameters that depend on the position in the VT. Several more elaborate implementations of this model with various improvements exist 2 3. Despite its attractive simplicity, the model has a significant disadvantage: the relationship between the cross-sectional distance d and the area A is more complex than the model describes, so the relationship has to be redefined for each speaker.
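As an illustration, the αβ relation can be sketched in a few lines of Python; the parameter values and section distances below are purely illustrative assumptions, not values fitted to any speaker or published table:

```python
def cross_sectional_area(d_cm, alpha, beta):
    """alpha-beta model: convert mid-sagittal distance d (cm) to
    cross-sectional area A (cm^2) via A = alpha * d**beta."""
    return alpha * d_cm ** beta

# alpha and beta depend on the position along the vocal tract;
# the values below are illustrative placeholders only.
sections = [
    ("pharynx", 1.5, 1.3, 1.5),  # (region, d in cm, alpha, beta)
    ("velum",   1.0, 1.5, 1.4),
    ("lips",    0.8, 1.8, 1.5),
]
for region, d, alpha, beta in sections:
    print(f"{region}: A = {cross_sectional_area(d, alpha, beta):.2f} cm^2")
```

The speaker-dependence noted above corresponds to having to re-estimate α and β per region and per speaker before the conversion is usable.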
A more recent version of the αβ-model, the APEX system, is described by Stark 4 5. The goal of the system is to study apical sounds (hence the name). Similar to the original model, it uses the αβ-model for cross-sectional distance to area conversion. The model has eight synthesis parameters controlling the sub-models. The lips are included only as an area model, and the epiglottis and larynx are modelled by translating and rotating static contours. The mandible (teeth and mouth floor) is modelled as a static contour positioned according to the mandible parameter. In contrast, the tongue model includes two more detailed sub-models. The tongue apex is modelled as a parabolic curve between the tongue body and tip, controlled by protrusion and curvature; since the apex-body fit is kept smooth by rotation, the curvature parameter has the effect of lifting the tongue tip. The tongue body is controlled by two parameters called position and deviation: position is the back-front position and deviation is the degree of constriction. The actual tongue hump is produced with a modified (non-symmetric) Gaussian function.
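A minimal sketch of such a non-symmetric Gaussian tongue hump is given below. The asymmetry mechanism (different widths in front of and behind the peak) and all numeric constants are assumptions for illustration; they are not Stark's actual formulation:

```python
import math

def tongue_hump(x, position, deviation, width_front=0.15, width_back=0.30):
    """Modified (non-symmetric) Gaussian tongue-body contour.

    x         : normalized back-to-front coordinate in [0, 1]
    position  : back-front location of the hump peak ('position' parameter)
    deviation : degree of constriction, i.e. hump height ('deviation')

    The two width constants are illustrative: using a narrower Gaussian
    in front of the peak than behind it makes the hump non-symmetric.
    """
    width = width_front if x >= position else width_back
    return deviation * math.exp(-((x - position) ** 2) / (2.0 * width ** 2))

# Sample the contour at 11 points along the normalized tract coordinate
contour = [tongue_hump(x / 10.0, position=0.6, deviation=1.0) for x in range(11)]
```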
Maeda used arbitrary factor analysis in his studies to reconstruct the vocal-tract shape from data gathered from two speakers, one female and one male 6. As a result, a model of the vocal tract controlled by seven parameters was created: jaw opening, tongue dorsum position, tongue dorsum shape, tongue apex position, lip opening, lip protrusion and larynx height. Besides the parameters controlling the oral area function, Maeda's synthesizer includes parameters for controlling the glottis sub-model and nasal coupling. The glottal area is modelled as the sum of a slow- and a fast-varying component 7. Overall, this model can be used effectively to synthesize connected speech, but there is significant room for improvement 8.
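The two-component glottal-area decomposition can be sketched as follows. The half-rectified sinusoid for the fast (vibratory) component and all numeric values are simplifying assumptions made here for illustration, not the exact waveforms of Maeda's glottis sub-model:

```python
import math

def glottal_area(t, f0=120.0, a_dc=0.05, a_ac=0.15):
    """Glottal area as the sum of a slow-varying and a fast-varying
    component, in the spirit of the decomposition used in Maeda's
    synthesizer.

    a_dc : slow (abduction/DC) component in cm^2 -- illustrative value
    a_ac : amplitude of the fast vibratory component in cm^2
    f0   : fundamental frequency in Hz
    """
    # Half-rectified sinusoid: the glottis cannot have negative area
    fast = max(0.0, a_ac * math.sin(2.0 * math.pi * f0 * t))
    return a_dc + fast
```

In a fuller model the slow component would itself vary over time to encode voiced/voiceless distinctions; here it is held constant for brevity.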
One of the newest geometrical articulatory models was presented by Birkholz 9. In this model, the vocal-tract shape is defined in terms of a number of geometric surfaces of the articulators and vocal-tract walls. Their shape and position in 3D space are specified by a set of 23 control parameters, each corresponding to one degree of freedom. This model also incorporates the results of significant work on modelling losses due to turbulence in vocal systems 10 and on modelling self-oscillating vocal folds 11.
Conclusion on speech synthesis review
Extensive work in the field of speech production modelling over the last couple of decades has resulted in a number of accurate models of the human vocal tract that are able to synthesize high-quality, intelligible, distinct speech sounds 12. However, despite the steps made towards simulating the production of short syllables, the main goal of these systems, the generation of fluent continuous speech, has yet to be achieved 12. The most challenging aspects of real speech, such as co-articulation and emotional influence, still need to be taken into account. To address the question of how to model these aspects, a vast array of possible solutions has been proposed, including modelling a context-sensitive consonant shape as a weighted average of three reference shapes for that consonant in the context of the corner vowels /a/, /i/ and /u/ 12. Coarticulation in the context of a task-dynamic model has also been studied by Fowler and Saltzman 13, and even models based on the superposition principle have been considered to simulate the coarticulation phenomenon 14 15 16. However, the most successful, in terms of the naturalness of the synthesized speech, are cognitive models that aim to simulate the process of infant language acquisition 17 18 19. A detailed review of the most advanced models is presented below.
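The weighted-average scheme for context-sensitive consonant shapes can be sketched in a few lines. The shape vectors and weights below are illustrative placeholders (real shapes would be area functions or articulator parameter vectors measured in the /a/, /i/ and /u/ contexts):

```python
def context_shape(shape_a, shape_i, shape_u, w_a, w_i, w_u):
    """Context-sensitive consonant shape as a weighted average of three
    reference shapes of the same consonant, recorded in the context of
    the corner vowels /a/, /i/ and /u/. Shapes are equal-length lists
    of area-function values; the weights encode the current vowel
    context and must sum to 1."""
    assert abs(w_a + w_i + w_u - 1.0) < 1e-9
    return [w_a * a + w_i * i + w_u * u
            for a, i, u in zip(shape_a, shape_i, shape_u)]
```

A consonant produced in a context acoustically between /a/ and /i/ would, for example, use weights such as (0.5, 0.5, 0.0), smoothly interpolating between the two reference shapes.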
The DIVA model
Figure 1. The DIVA model of speech acquisition and production (adapted from Jason A. Tourville and Frank H. Guenther) 20
The DIVA model was proposed by Guenther 20 and comprises neural network components corresponding to regions of the cerebral cortex and cerebellum, including premotor, motor, auditory and somatosensory cortical areas. Many functional brain imaging studies have provided experimental evidence for the brain regions involved in speech production 21 22. The DIVA model takes the results of these studies into account and explicitly, mathematically describes the relationship between the different brain regions and their roles in speech production. In its current form 23, the DIVA model, shown in Figure 1, consists of integrated feedforward and feedback control subsystems. Together, they learn to control a simulated vocal tract, a modified version of the synthesizer described by Maeda 6. Once trained, the model takes a speech sound as input and generates a time-varying sequence of articulator positions that command movements of the simulated vocal tract to produce the desired sound. As can be seen from Figure 1, speech production starts with the activation of a speech sound map. Cells in the speech sound map project to cells in the feedforward articulator velocity maps. These projections represent a set of feedforward motor commands, or articulatory gestures, for the speech sound 24 that determine the positions of the eight articulators in the Maeda synthesizer. At the same time, the feedback control subsystem begins with the speech sound map projecting to auditory and somatosensory target maps. These projections encode the time-varying sensory expectations, or targets, associated with the active speech sound map cell. The signal then propagates to different regions, including the auditory and somatosensory error maps. Activity in the error maps then represents the difference between the expected and actual sensory states associated with the production of the current speech sound.
This difference is then used to correct the articulatory movements for each particular sound. Overall, this model is capable of describing many aspects of speech 25 26 19 27 28, and has thus been used as a tool for studying the mechanisms underlying normal 29 30 31 and disordered speech 32 33.
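The interplay of the two subsystems can be sketched as a single control step. This is a deliberately simplified caricature: the scalar feedback gain, the vector sizes and the direct subtraction in one shared space are assumptions made here, whereas DIVA itself maps auditory/somatosensory errors back into articulator space through learned transformations:

```python
def articulator_command(feedforward, target_sensory, actual_sensory,
                        feedback_gain=0.5):
    """One control step in the spirit of DIVA's integrated control:
    the learned feedforward command is corrected by a feedback term
    proportional to the sensory error, i.e. the difference between
    the expected (target) and actual sensory states.

    All vectors are plain lists of equal length; the mapping from
    sensory-error space to articulator space is omitted for brevity.
    """
    error = [t - a for t, a in zip(target_sensory, actual_sensory)]
    return [ff + feedback_gain * e for ff, e in zip(feedforward, error)]
```

With a well-trained feedforward command the error term shrinks towards zero, which mirrors DIVA's shift from feedback-dominated to feedforward-dominated control over the course of learning.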
Figure 2. Organization of the neurocomputational model (adapted from Kroger et al.) 34
The aim of the neurocomputational model proposed by Kroger 34 is to implement a biologically motivated approach to speech recognition and synthesis, i.e. a computer-implemented neural model using artificial neural networks, capable of imitating the human processes of speech production and perception. Functionally, the model comprises production and perception parts. In its current state, the model excludes linguistic processing (mental grammar, mental lexicon, comprehension and conceptualization) and focuses on the sensorimotor processes of speech production and on sublexical speech perception, i.e. sound and syllable identification and discrimination. Even though this model is based on the DIVA model 23, there are three major differences between the two approaches. Firstly, the DIVA approach does not separate motor planning from motor execution, as Kroger's model does; this separation is motivated by studies of speech gestures and motor speech execution 35 36 37. Secondly, the DIVA model does not explicitly introduce a phonetic map, or at least a map reflecting the self-organization of speech items between sensory, motor and phonemic representations, and it does not explicitly claim bidirectional mappings between phonemic, sensory and motor representations. Thirdly, the DIVA model is a production model that does not aim to model speech perception. Modelling speech production and speech perception as two closely related processes is of great importance, and this is achieved in Kroger's approach.
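The idea of a phonetic map with bidirectional links can be sketched as follows. This is a toy data structure assumed for illustration: Kroger's model uses self-organizing neural maps rather than explicit records, and the nearest-neighbour lookup stands in for distributed neural activation:

```python
class PhoneticMap:
    """Minimal sketch of a phonetic map linking phonemic, motor and
    auditory representations bidirectionally. Each node is a simple
    (phoneme, motor_vector, auditory_vector) record."""

    def __init__(self):
        self.nodes = []

    def learn(self, phoneme, motor, auditory):
        """Associate a phoneme with co-occurring motor and auditory states."""
        self.nodes.append((phoneme, motor, auditory))

    def produce(self, phoneme):
        """Phonemic -> motor direction (speech production)."""
        for p, motor, _ in self.nodes:
            if p == phoneme:
                return motor
        return None

    def perceive(self, auditory):
        """Auditory -> phonemic direction (sublexical perception):
        the node with the nearest stored auditory state wins."""
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        best = min(self.nodes, key=lambda n: dist(n[2], auditory))
        return best[0]
```

The same stored associations serve both directions, which is the essential point of the bidirectional mapping absent from DIVA.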
Conclusion on the neurocomputational models of speech production
Both models described in the section above are the result of outstanding work leading to the development of a biologically plausible, complex and comprehensive mathematical and computational framework for simulating speech production and perception. These models reflect the major known facts about the speech production mechanisms of the human brain, such as mirror neurons 38 39 and self-organization as a central principle of learning 40 41 42. However, recent neurobiological studies of speech processing have revealed significant new patterns in the development of the auditory cortex 43 44 45. These studies investigated the topography of the auditory cortex and, in addition to the well-known tonotopic organization, discovered an orthogonal periodotopic pattern. This sensitivity of the auditory field maps (AFMs) to periodicity allows processing of the temporal aspects of sound. Modelling this parameter in the AFMs holds the potential to significantly improve a model's ability to discriminate the fast but important temporal envelopes of speech, leading to better recognition accuracy. Another important aspect of brain development overlooked in current neurocomputational models of speech acquisition is the constant change in the number of neurons, caused by their birth and death, together with alterations in connectivity patterns, which depend on activity and homeostatic plasticity rules. These rules have been well studied, and comprehensive models of them have been proposed recently 46. It is of great interest and relevance to this PhD research project to examine how the above models can influence speech production and perception. Hence, extensive research in this specific area is needed, which will make it possible to address the current limitations in modelling the speech acquisition process.
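The notion of orthogonal tonotopic and periodotopic axes can be sketched as a 2D indexing scheme for an auditory field map. The logarithmic bin edges below are illustrative assumptions, not values taken from the cited neuroimaging studies:

```python
def afm_cell(carrier_hz, periodicity_hz,
             freq_edges=(200, 400, 800, 1600, 3200),
             period_edges=(2, 4, 8, 16, 32)):
    """Index into a 2D auditory field map organized along two orthogonal
    axes: tonotopy (spectral carrier frequency) and periodotopy
    (temporal-envelope periodicity). Returns the (tonotopic_bin,
    periodotopic_bin) pair that a sound with the given carrier
    frequency and envelope periodicity would activate."""
    f_bin = sum(carrier_hz >= e for e in freq_edges)
    p_bin = sum(periodicity_hz >= e for e in period_edges)
    return f_bin, p_bin
```

Two sounds with the same spectrum but different envelope periodicities land in different cells along the second axis, which is exactly the temporal discrimination that a purely tonotopic model cannot represent.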