Summary of the hierarchical model of speech learning and recognition.

The core of the model is equivalent to the core of the birdsong model [14]. The Equations 1 and 2 on the right side generate the dynamics shown on the left side, and are described in the Model section (see also Table 1 for the meaning of parameters). Speech sounds, i.e., sound waves, enter the model through the cochlear level. The output is a cochleagram (shown for the speech stimulus “zero”), which is a type of frequency-time diagram. There are 86 channels, which represent the firing rate (warm colors for high firing rate and cold colors for low firing rate) of the neuronal ensembles that encode lower frequencies as the channel number increases. We decrease the dimension of this input to six dimensions by averaging every 14 channels (see the color coding to the right of the cochleagram and also see Model). After this cochlear processing, activity is fed forward into the two-level hierarchical model. This input is encoded by the activity of the first level network (shown with the same color coding on the right), which is in turn encoded by activity at the second level (no color coding at this level, different colors represent different neuronal ensembles). From the generative model shown here (core model), we derived a recognition model (for mathematical details see Model).