Spectral feature development in birdsong

Simon Overduin
March 25, 2003

 

Background

Birdsong is an important model for animal imitation and learning. In many species of songbirds, a father's song is passed on to his male offspring by a process of imitation that is independent of genetic ties between the birds (Eales, 1985). In some species, other members of the bird's social environment, including females responding to the father (King & West, 1983), additional tutors in the environment (Williams, 1990), and the male's siblings (Tchernichovski & Nottebohm, 1998), may also have a minor role in shaping the male's song. While certain species, like the canary (Marler & Peters, 1987), are open-ended imitators able to learn new songs into adulthood, males of other species like the zebra finch (Taeniopygia guttata) are age-limited learners, with a relatively fixed period in their youth in which to acquire the song they will sing as adults. (In the case of the zebra finch, the critical period spans from a variable initial date until about 90 days after hatching.) While this adult song can be gradually changed by systematically altering the auditory feedback heard by the bird (Leonardo & Konishi, 1999), in natural scenarios the adult song is believed to be largely crystallized and stereotyped for each bird.

The song of the zebra finch, like that of other birds, can be analyzed at three hierarchical levels: bouts, motifs, and syllables (Leonardo & Konishi, 1999). Bouts are the highest level of organization, and consist of a sequence of motifs that is not necessarily stereotyped even in the adult. Motifs, in turn, consist of a number of introductory notes followed by sequences of syllables that become very stereotyped as learning proceeds. Finally, syllables progress over the course of learning from an unstable spectral structure to a crystallized one in adulthood, when they may be represented symbolically (e.g. syllable A, B, C etc.). Types of syllables include harmonic stacks, frequency sweeps, high-pitch notes, broadband sounds, vibratos, and male long calls (Tchernichovski et al., 2001). In adult birdsong, syllables may also be demarcated by intervals of relative silence or by sharp changes in frequency modulation (Morrison & Nottebohm, 1993).

 

Pitch estimation and harmonic stack development

Fourier analysis is well suited to the structure of birdsong vocalizations, given that they often (particularly in adulthood) have stereotyped frequency relations. This is particularly true in the striking case of harmonic stacks, which are prolonged syllables consisting of simultaneous fundamental and harmonic production, without the amplitude dropoff at higher frequencies typical of human speech. Among the spectral features that have been used to describe harmonic stacks and other syllables (Tchernichovski et al., 2001) are pitch, Wiener entropy, spectral continuity, and frequency modulation. (See Tchernichovski et al., 2000, for a diagrammatic explanation.) Higher-order features may describe the frequency or periodicity of syllables within a motif (Tchernichovski et al., 2001).

In this project we consider pitch, a feature that has attracted particular attention, likely due to its intuitive nature. However, multiple definitions of pitch are possible. The simplest defines pitch as the fundamental frequency in a stack of harmonics. A more complicated definition, implicit in several measures described below, is biased by amplification of higher harmonics relative to the fundamental. A traditional measure of pitch is based directly on the signal or a filtered version of the signal (e.g. Dologlou & Carayannis, 1989), and takes the pitch period to be twice the distance between successive zero crossings (assuming a normalized signal), so that the pitch is the reciprocal of that period. However, such a measure clearly breaks down for signals with a weak or missing fundamental frequency, even if the signal is lowpass filtered to dampen higher frequencies.
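The zero-crossing measure and its weakness can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the function name `zero_crossing_pitch`, the 44.1 kHz sampling rate, and the synthetic 600 Hz test signals are hypothetical, and no filtering step is included.

```python
import numpy as np

def zero_crossing_pitch(signal, sample_rate):
    """Estimate pitch in Hz as 1 / (2 * mean interval between zero crossings)."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                                   # centre the signal on zero
    negative = np.signbit(x)
    crossings = np.flatnonzero(negative[1:] != negative[:-1])  # indices where the sign flips
    if len(crossings) < 2:
        return float("nan")                            # too few crossings to measure an interval
    mean_interval = np.diff(crossings).mean() / sample_rate    # seconds between crossings
    return 1.0 / (2.0 * mean_interval)

sample_rate = 44100                                    # Hz, hypothetical recording rate
t = np.arange(0, 0.05, 1.0 / sample_rate)
theta = 2 * np.pi * 600 * t                            # phase of a hypothetical 600 Hz fundamental

pure_tone = np.sin(theta)
# Harmonic stack whose fundamental is much weaker than its 2nd and 3rd harmonics
weak_fundamental = 0.05 * np.sin(theta) + np.sin(2 * theta) + np.sin(3 * theta)

pure_estimate = zero_crossing_pitch(pure_tone, sample_rate)
weak_estimate = zero_crossing_pitch(weak_fundamental, sample_rate)
print(pure_estimate, weak_estimate)
```

For the pure tone the estimate should fall close to 600 Hz. For the stack with a weak fundamental, the extra zero crossings contributed by the higher harmonics push the estimate well above the fundamental.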

Some alternative estimates of pitch are based on autocorrelation peaks (Rabiner, 1977). While even in a complex signal it may be possible to detect frequencies based on amplitude modulation, in a Fourier-transformed signal the relative contribution of each frequency is made explicit. Thus an autocorrelogram of the Fourier-transformed signal will have more pronounced peaks representing "common" frequencies than a signal autocorrelogram, in which the highest peaks represent the period of these common frequencies. A pitch algorithm may either select the largest peak in the correlogram (after "clipping" the central peak; Sondhi, 1968) as representing the fundamental frequency, as has been done in this project, or consider the harmonic relationships between neighboring peaks to reconstruct missing fundamentals. A related measure of pitch is based on the cepstrum method (Noll, 1967), which computes the inverse Fourier transform of the natural logarithm of the Fourier-transformed signal (e.g. a power spectrum).
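The largest-peak approach can be sketched as follows for a time-domain (signal) autocorrelogram. This is an illustration under stated assumptions, not the code used in this project: the function name `autocorr_pitch`, the search range `fmin` to `fmax`, and the synthetic 700 Hz stack are hypothetical. In particular, the central peak is handled here simply by ignoring lags shorter than `sample_rate / fmax`, which stands in for the clipping step mentioned above.

```python
import numpy as np

def autocorr_pitch(frame, sample_rate, fmin=300.0, fmax=8000.0):
    """Estimate pitch in Hz from the largest autocorrelation peak away from lag 0."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Autocorrelation computed from the power spectrum; zero-padding to 2n
    # keeps the result linear (no circular wrap-around).
    spectrum = np.fft.rfft(x, 2 * n)
    acf = np.fft.irfft(np.abs(spectrum) ** 2)[:n]
    shortest = int(sample_rate / fmax)                 # skip the central peak around lag 0
    longest = min(int(sample_rate / fmin), n - 1)
    lag = shortest + np.argmax(acf[shortest:longest + 1])
    return sample_rate / lag

sample_rate = 44100                                    # Hz, hypothetical recording rate
t = np.arange(2048) / sample_rate
# Synthetic harmonic stack: 700 Hz fundamental plus harmonics 2 to 5 at equal amplitude
stack = sum(np.sin(2 * np.pi * 700 * k * t) for k in range(1, 6))

estimate = autocorr_pitch(stack, sample_rate)
print(estimate)
```

Because only the single largest peak is used, this sketch makes no attempt to reconstruct a missing fundamental from the spacing of neighboring peaks.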

While such frequency-based estimates of pitch may be less susceptible to noise or harmonic interference, they are still susceptible to a phenomenon called period doubling, illustrated in Tchernichovski et al. (2001). It occurs as a juvenile bird imitates a model harmonic stack by increasing its periodicity to twice that of the model stack, and then filling in the intervening frequencies. Pitch estimates like those above would indicate a gradual doubling of pitch followed by a sudden halving, even though the actual modulation of different frequency components may be gradual throughout this process (Tchernichovski et al., 2001). In an attempt to track the imitation process in a more linear fashion, I am developing a novel pitch estimate, also based on the power spectrum of the signal, that is explicitly biased towards lower frequencies. The algorithm essentially finds the single frequency at a point in time that maximizes the mean power in all integer multiples of that frequency.
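One possible reading of this estimate is sketched below. Since the algorithm is still under development, every detail is an assumption: the function name `harmonic_mean_power_pitch`, the candidate range `fmin` to `fmax`, the `band_limit` above which multiples are ignored, and especially the `tolerance` rule, which implements the bias towards lower frequencies by returning the lowest candidate whose score is within 10% of the best one.

```python
import numpy as np

def harmonic_mean_power_pitch(power, freqs, fmin=300.0, fmax=4000.0,
                              band_limit=7000.0, tolerance=0.9):
    """Return the lowest candidate frequency whose mean power over its integer
    multiples is within `tolerance` of the best candidate's mean power."""
    df = freqs[1] - freqs[0]                           # bin spacing in Hz (uniform grid assumed)
    top = min(int(round(band_limit / df)), len(power) - 1)   # highest bin counted as a harmonic
    first = max(1, int(round(fmin / df)))
    last = min(int(round(fmax / df)), top)
    candidates = np.arange(first, last + 1)            # candidate fundamentals, as bin indices
    # power[c:top + 1:c] picks the bins c, 2c, 3c, ... up to the band limit
    scores = np.array([power[c:top + 1:c].mean() for c in candidates])
    good = scores >= tolerance * scores.max()
    return candidates[np.argmax(good)] * df            # argmax of a boolean array = first True

sample_rate = 44100                                    # Hz, hypothetical recording rate
n = 4410                                               # frame length giving 10 Hz bins
t = np.arange(n) / sample_rate
freqs = np.fft.rfftfreq(n, 1.0 / sample_rate)

def stack_power(f0, n_harmonics):
    """Power spectrum of a synthetic equal-amplitude harmonic stack."""
    x = sum(np.sin(2 * np.pi * f0 * k * t) for k in range(1, n_harmonics + 1))
    return np.abs(np.fft.rfft(x)) ** 2

model_pitch = harmonic_mean_power_pitch(stack_power(700, 10), freqs)
doubled_pitch = harmonic_mean_power_pitch(stack_power(1400, 5), freqs)
print(model_pitch, doubled_pitch)
```

In the first stack every multiple of 700 Hz up to the band limit carries power, so 700 Hz, 1400 Hz, and higher multiples all score about equally, and the tolerance rule settles on the lowest. In the second stack only multiples of 1400 Hz carry power, so 700 Hz scores about half as well and 1400 Hz is returned.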

At present I have applied the autocorrelation-based pitch measure to a pseudorandom set of frequency downsweeps manually parsed from files recorded throughout one zebra finch's learning period. As depicted, the pitch shows an initial increase but little clear trend after that. The variability in the pitch within each downsweep sample also grows slightly over learning. This runs contrary to the pitch stabilization that might be expected for pure harmonic stacks, but is perhaps to be expected for downsweeps as they become more pronounced. Analysis of a larger sample of data is planned to make these observations more reliable.

 

Filtering the developing song with subsong primitives

Several outstanding questions in the field relate to the mechanics of this learning. Are there birdsong "primitives" that are used as the building blocks for adult songs, or can new syllables develop de novo? Do these subsong prototypes take the form of individual syllables, or sequences of syllables? In the absence of neurophysiological measurements of the neural or muscular underpinnings of these prototypes, can the primitives be identified in the gradual progression from juvenile to adult song features? That is, does the bird amplify and modulate the amplitude of subsong primitives, or shorten and protract them, or reorder them, etc.? Or are learning-related changes evident within the spectral structure of the primitives as well, e.g. in the pitch or frequency modulation of given primitives? Following exposure to the tutor song, do individual birds follow circumscribed imitation trajectories through feature space, or show an initial explosion of feature diversity followed by a gradual selection of features appropriate to the model song (Tchernichovski et al., 2001)?

Many of these questions are difficult to answer at the level of the syllable features described earlier. All of these features, and not only pitch, are open to a variety of operationalizations. Further, each of these methods attempts to reduce a complex spectrogram to a single statistic. While such efforts are useful in applying intuitive auditory dimensions to vocalizations, they may obscure hidden dimensions within the song, or extract features which are highly interdependent, like the negatively related Wiener entropy and spectral continuity measures (Tchernichovski et al., 2001), or the covarying frequency modulation and pitch variance estimates. Compound statistics like the Kolmogorov-Smirnov value and a "feature diversity" score have been used to evaluate a bird's song repertoire more comprehensively.

But one could also imagine features extracted independently of any of these intuitive dimensions, e.g. using a modified version of principal or independent component analysis. Time-varying components could further be useful for capturing entire primitives like harmonic sweeps, rather than static coproduction of frequencies (not unlike the approach used in d'Avella and Tresch, 2002). I am working on developing an algorithm to extract these independent components, which could then be used as filters applied to later songs in order to trace their gradual modification. In so doing, the identification of subsong primitives could be divorced not only from ease of syllable demarcation and feature identification, but also from the limited repertoire that defines the bird's model song.
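The idea of using extracted components as filters can be sketched with ordinary principal component analysis on spectrogram frames. This is only a stand-in for the modified, time-varying decomposition envisaged above. The function names `fit_spectral_components` and `apply_filters`, the two Gaussian spectral templates, and the random mixing weights are all hypothetical.

```python
import numpy as np

def fit_spectral_components(frames, n_components):
    """Plain PCA via SVD: rows of `frames` are spectra, columns are frequency bins."""
    mean = frames.mean(axis=0)
    _, _, vt = np.linalg.svd(frames - mean, full_matrices=False)
    return mean, vt[:n_components]                     # each row is one spectral component

def apply_filters(frames, mean, components):
    """Project spectra onto the components, giving one score per frame and component."""
    return (frames - mean) @ components.T

# Synthetic "early song": random mixtures of two hypothetical spectral templates
bins = np.arange(64)
templates = np.stack([
    np.exp(-0.5 * ((bins - 15) / 3.0) ** 2),           # low-frequency bump
    np.exp(-0.5 * ((bins - 45) / 3.0) ** 2),           # high-frequency bump
])
rng = np.random.default_rng(0)
early = rng.random((50, 2)) @ templates                # 50 frames x 64 bins

mean, components = fit_spectral_components(early, n_components=2)

# A "later song" frame built from the same templates with different weights
later = np.array([[0.2, 0.9]]) @ templates
scores = apply_filters(later, mean, components)
reconstruction = mean + scores @ components            # back-projection from the scores
print(scores.shape, np.allclose(reconstruction, later))
```

Because the later frame is built from the same two templates, two components suffice to reconstruct it. With real recordings the residual left after such a projection would be one way to gauge how far a later song has moved away from the early components.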

 

Links

 

References

 

Acknowledgments