Thus Spoke Zarathustra in High Fidelity Sound Reproduction

Peter Wurmsdobler
9 min read · Feb 24, 2024

Some time ago many people aspired to owning a HiFi system, a high fidelity means of sound reproduction, including a power amplifier and large speakers; it allowed playing loud music in small rooms. There was and still is some mystery around HiFi systems and decent sound: lots of snake oil, but little measurement and science. Nowadays, it is clear that the limitations in sound reproduction are (in descending order of importance):

  • between the ears of the listener, i.e. perception abilities,
  • room acoustics, in particular for small rooms,
  • loud speaker design.

After decades of evolution, the electronic components should not be a challenge any more: frequency and phase response can be thought to be flat for the audible frequencies with the total harmonic distortion being several orders of magnitude below the signal level. Floyd E Toole confirms this statement in his book Sound Reproduction — The Acoustics and Psychoacoustics of Loudspeakers and Rooms:

Electronic devices, analog or digital, are also in the signal path, but it is not difficult to demonstrate that in competently designed products, any effects they may have are small if they are not driven into gross distortion or clipping.

Nobody builds electronics for an audio signal path any more that does not have a ruler flat frequency response over more than the audible bandwidth. They are becoming “invisible” parts of our systems.

Flat amplitude response of Philips Applied Technologies/Hypex class D amplifier at different loads.

The above plot would have been obtained by passing pure sinusoidal tones at constant frequency and amplitude through the audio component under test. But is such a flat frequency response really sufficient for decent sound reproduction? I hear people claiming: music is more than just a sequence of pure sinusoidal tones, music is about transients. This story tries to communicate my thoughts on transients and the requirements for audio equipment using simple thought experiments, some calculations and Richard Strauss’ “Also Sprach Zarathustra”.

The Hypothesis

Based on conversations with audio experts I arrived at the following understanding: an audio system that produces a perfect sinusoidal tone across a wide frequency range at a constant gain and low distortion is necessary but not sufficient. A system needs to be able to respond to transients, i.e. to signals that change in amplitude (and perhaps in frequency content) within a very short time. This demand is quite common in classical music, e.g. the opening of Richard Strauss’ “Also Sprach Zarathustra”, going from pianissimo to fortissimo within a few bars. An important contributing factor is the slew rate required for electrical quantities, i.e. the rate at which quantities have to change; in other words, the capability to energise the system within a short period of time. The hypothesis is that only a system with a provision for a high slew rate will be able to reproduce these transients, so important to music.

The Thought Experiment

The question is, how could a transient be modelled? As a starting point, let’s consider a sinusoidal signal at a constant amplitude; then let’s increase the amplitude of that signal according to a function that allows the power to swell up and then come back down again. The result is the modulation of a sinusoidal “carrier” signal with a time dependent amplitude. In more concrete terms, let’s assume:

  • a loudspeaker with a sensitivity of 90dB/W/m at a nominal impedance of 8 Ohms; the voltage amplitude required to produce that loudness is the well-known 2.83V for an electrical power of 1W.
  • a pure tone as a carrier c(t) at a constant frequency of f = 262Hz (C4 or middle C, with a period Tc = 3.82ms), but at two levels:
    - a quiet level of 60dB (pianissimo); the required power is 30dB down, p₀=0.001W, resulting in an amplitude a₀ = 0.089V.
    - a loud level of 100dB (fortissimo); the required power is 10dB up, p₁=10W, resulting in an amplitude a₁ = 8.949V.
  • to model a transient with a dynamic range of just 40dB, an amplitude profile a(t) raises the level from the quiet value to the loud value and back down again in a harmonic pattern within a certain pulse time Tₚ; an inverted and weighted raised cosine is used for this profile.

Modulated sine wave, from 0.089V to 8.9V within 4 oscillations of a pure C4 tone at 262 Hz
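The setup in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exact envelope formula is not given in the text, so the raised cosine below is one plausible reading; the function names are made up for this sketch; and the carrier is taken as the equal-tempered C4 of about 261.63Hz, which the text rounds to 262Hz.

```python
import math

V_REF = 2.83           # V for 1 W into 8 Ohms, used as the amplitude as in the text
                       # (strictly this is the RMS value; the sketch follows the text)
SENSITIVITY_DB = 90.0  # dB/W/m, speaker sensitivity from the text
F_C = 261.63           # Hz, assumed equal-tempered C4 (rounded to 262 Hz in the text)

def amplitude_for_level(level_db, sensitivity_db=SENSITIVITY_DB):
    """Voltage needed for a given sound level, scaling from the 1 W reference."""
    power_w = 10.0 ** ((level_db - sensitivity_db) / 10.0)
    return V_REF * math.sqrt(power_w)

def envelope(t, a0, a1, t_p):
    """Assumed inverted raised cosine: a0 at t=0 and t=t_p, a1 at t=t_p/2."""
    return a0 + (a1 - a0) * 0.5 * (1.0 - math.cos(2.0 * math.pi * t / t_p))

def modulated_signal(t, a0, a1, t_p, f_c=F_C):
    """Carrier tone multiplied by the time-dependent amplitude."""
    return envelope(t, a0, a1, t_p) * math.cos(2.0 * math.pi * f_c * t)

a0 = amplitude_for_level(60.0)    # pianissimo
a1 = amplitude_for_level(100.0)   # fortissimo
t_p = 4.0 / F_C                   # pulse lasting four carrier periods

print(f"a0 = {a0:.3f} V, a1 = {a1:.3f} V, Tp = {t_p * 1e3:.2f} ms")
```

Sampling `modulated_signal` over the interval from 0 to `t_p` should give a curve of the kind shown in the plot, swelling from the quiet level to the loud level and back.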

The slew rate of a harmonic signal described by A·cos(2π·f·t) is simply its time derivative, -2π·f·A·sin(2π·f·t), with the maximum being 2π·f·A. For the example above, this means the maximum slew rate for the low level signal is ~147V/s, and for the high level signal 14,710V/s or 14.7V/ms; neither is demanding, and both are well within the range of many electronic devices.
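As a quick check of these figures, the maximum slew rate 2π·f·A can be evaluated directly. The sketch assumes the carrier is the equal-tempered C4 at about 261.63Hz; that value seems to match the quoted numbers better than the rounded 262Hz, but this is my reading rather than something stated in the text.

```python
import math

def max_slew_rate(amplitude_v, frequency_hz):
    """Peak of the time derivative of A*cos(2*pi*f*t), in V/s."""
    return 2.0 * math.pi * frequency_hz * amplitude_v

F_C = 261.63                          # Hz, assumed equal-tempered C4
low = max_slew_rate(0.0895, F_C)      # quiet level amplitude
high = max_slew_rate(8.949, F_C)      # loud level amplitude

print(f"low level:  {low:.0f} V/s")
print(f"high level: {high:.0f} V/s ({high / 1000.0:.1f} V/ms)")
```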

The claim I refer to is that the modulation of a harmonic signal using a pulse pattern would generate an even higher slew rate; the proof of a good amplifier would be its ability to follow that transition. But is that so? Looking at the pulse model above, the slew rate is the time derivative of the modulated signal.

Slew rate for the modulated sine wave, from 0.089V to 8.9V within 4 oscillations of a pure C4 tone at 262 Hz
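For reference, the derivative can be written out with the product rule. The envelope a(t) below is an assumed form of the inverted, weighted raised cosine (it starts and ends at a₀ and peaks at a₁ in the middle of the pulse); the expression actually used for the plots may differ in detail.

```latex
\begin{align}
s(t) &= a(t)\,\cos(2\pi f_c t),
\qquad
a(t) = a_0 + \frac{a_1 - a_0}{2}\left(1 - \cos\frac{2\pi t}{T_p}\right) \\
\dot{s}(t) &= \dot{a}(t)\,\cos(2\pi f_c t) \;-\; 2\pi f_c\, a(t)\,\sin(2\pi f_c t) \\
\dot{a}(t) &= \frac{\pi\,(a_1 - a_0)}{T_p}\,\sin\frac{2\pi t}{T_p} \\
\left|\dot{s}(t)\right| &\le \frac{\pi\,(a_1 - a_0)}{T_p} + 2\pi f_c\, a_1
\end{align}
```

The first term of the bound comes from the envelope and scales with 1/Tₚ; the second comes from the carrier and scales with 1/Tc. Under this assumed envelope the two are of similar size only when Tₚ is roughly half a carrier period.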

The slew rate can now be calculated for various pulse lengths Tₚ at a constant carrier frequency fc (and equivalent carrier period Tc). From that the maximum slew rate can be obtained for a wide range of Tₚ to Tc ratios.

Maximum slew rate for a range of pulse length Tₚ to carrier period Tc ratios, left shaded area is Tₚ < Tc with pulse dominating slew rate, right shaded area is Tₚ >> Tc with carrier dominating slew rate.
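A sweep of this kind can be sketched numerically. The code below is not the calculation behind the plot; it assumes a raised-cosine envelope from a₀ up to a₁ and back, a carrier at about 261.63Hz, and simply samples the analytic derivative over one pulse.

```python
import math

F_C = 261.63            # Hz, assumed equal-tempered C4 (rounded to 262 Hz in the text)
A0, A1 = 0.0895, 8.949  # V, quiet and loud amplitudes from the text

def slew(t, t_p):
    """Derivative of a(t)*cos(2*pi*F_C*t) for the assumed raised-cosine envelope."""
    w_c = 2.0 * math.pi * F_C
    w_p = 2.0 * math.pi / t_p
    a = A0 + 0.5 * (A1 - A0) * (1.0 - math.cos(w_p * t))
    da = 0.5 * (A1 - A0) * w_p * math.sin(w_p * t)
    return da * math.cos(w_c * t) - w_c * a * math.sin(w_c * t)

def max_slew(ratio, samples=20000):
    """Largest |slew| over one pulse of length Tp = ratio * Tc."""
    t_p = ratio / F_C
    return max(abs(slew(i * t_p / samples, t_p)) for i in range(samples + 1))

carrier_only = 2.0 * math.pi * F_C * A1
for ratio in (0.1, 0.5, 1.0, 4.0, 100.0):
    print(f"Tp/Tc = {ratio:6.1f}: max slew = {max_slew(ratio):9.0f} V/s "
          f"({max_slew(ratio) / carrier_only:.2f} x carrier)")
```

For large ratios the result should settle at the carrier-only value, while for ratios well below one the envelope term takes over.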

It follows from the equation above, and can be confirmed by numerical calculations, what seems obvious in hindsight:

  • for long pulses, i.e. Tₚ >> Tc, the slew rate is dominated and determined by the slew rate of the carrier tone, in this case 14.7V/ms at high level amplitude. This is a likely scenario as most pulses would be much longer than the carrier period; they would last a second or so.
  • for very short pulses, Tₚ < Tc, the slew rate is dominated and determined by the slew rate of the pulse shape. However, for this transient slew rate to matter, these pulses would have to be very, very short, shorter than a millisecond; that would be inaudible.

Conclusion: as long as the system can accommodate the rated power at the rated frequency range (flat frequency response), it implicitly follows that all components in the signal chain can manage the necessary slew rate. But is that again proof enough?

Another Thought Experiment

Music is not about pure sinusoidal tones at constant power; it is about spectra of tones and timbral richness. For instance, an orchestral piece contains many chords with many, many notes being played simultaneously by instruments with different timbre. What would the requirements be for analogue electronics in the signal chain in order to be able to play back the signals representing that music? What would be the maximum slew rate the electronics would have to deal with? This section is about getting an idea of the order of magnitude in comparison to a pure tone.

As an example for a rich chord across an entire orchestra, the initial fanfare in Richard Strauss’ “Also Sprach Zarathustra” came to my mind, in particular, the last C-major chord in bars 12–14. A simple tally of all notes being played taken from the sheet music shown below tells us how many instruments play each note in the chord:

‘C1’: 1, ‘C2’: 6, ‘G2’: 1, ‘C3’: 6, ‘E3’: 1, ‘G3’: 4, ‘C4’: 3, ‘E4’: 3,
‘G4’: 2, ‘C5’: 5, ‘E5’: 4, ‘G5’: 4, ‘C6’: 5, ‘E6’: 1, ‘G6’: 1, ‘C7’: 3

The fanfare in “Also Sprach Zarathustra” with 16 different notes being played by the entire orchestra in the C-major chord of bars 12–14, ranging from C1 to C7.

How can the expected sound signal representing this chord be modelled? Well, this is what synthesisers do. Here, however, I do not want to simulate the entire sound, I would like to get an idea of the maximum slew rate needed in a worst case scenario. Therefore, I try to implement a very simple “synthesiser” that produces every note played as a pure tone with a certain number of harmonics, and every note at an amplitude commensurate with how often it occurred in the chord using the tally above. The total power should be equivalent to the pure tone above.

To keep it simple, a small algorithm sums up all the frequency components (fundamental and 9 harmonics) of all notes present, where each frequency contribution is multiplied by how often the corresponding note occurs in the chord. This results in the distribution of amplitudes over all notes and their harmonics shown in the following plot.

Contribution of frequencies as amplitudes when all 16 notes of the large C-major chord are played.
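A toy version of such a summation might look as follows. Several things here are assumptions and not taken from the text: equal temperament with A4 = 440Hz, a 1/n roll-off for the n-th partial (the relative strength of the harmonics is not stated), and merging of partials that land on the same frequency to within 0.01Hz.

```python
# Tally of notes in the C-major chord, as listed in the text.
TALLY = {"C1": 1, "C2": 6, "G2": 1, "C3": 6, "E3": 1, "G3": 4, "C4": 3, "E4": 3,
         "G4": 2, "C5": 5, "E5": 4, "G5": 4, "C6": 5, "E6": 1, "G6": 1, "C7": 3}

SEMITONE = {"C": 0, "E": 4, "G": 7}   # semitones above C; only these notes occur
N_PARTIALS = 10                       # fundamental plus 9 harmonics

def note_frequency(name):
    """Equal-tempered frequency in Hz for names like 'C4' (assumes A4 = 440 Hz)."""
    midi = 12 * (int(name[1]) + 1) + SEMITONE[name[0]]
    return 440.0 * 2.0 ** ((midi - 69) / 12.0)

def chord_spectrum(tally=TALLY, n_partials=N_PARTIALS):
    """Map frequency (rounded to 0.01 Hz) to an unscaled relative amplitude."""
    spectrum = {}
    for name, count in tally.items():
        f0 = note_frequency(name)
        for n in range(1, n_partials + 1):
            key = round(n * f0, 2)
            # ASSUMPTION: the n-th partial is weighted 1/n; the text does not say.
            spectrum[key] = spectrum.get(key, 0.0) + count / n
    return spectrum

spectrum = chord_spectrum()
for freq in sorted(spectrum)[:5]:
    print(f"{freq:8.2f} Hz: relative amplitude {spectrum[freq]:.2f}")
```

The amplitudes are relative at this point; they still have to be scaled so that the total power matches that of the pure tone.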

For a pure sinusoidal tone of amplitude a, the power is P = a²/8, with 8 being the nominal impedance of the speaker in Ohms. For a signal composed of many frequencies with amplitudes aᵢ, the total power is the sum of all the contributions. In this case a scalar k is added to allow the signal to be scaled: P = Σᵢ (k·aᵢ)²/8.

Simple algebra yields the scalar k that makes the composite tone produce the same power as the pure tone with amplitude a₁, albeit with the power distributed over all frequencies. Thus the complex sound signal can be synthesised and plotted over time as shown below.
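The algebra amounts to setting the sum of (k·aᵢ)²/8 equal to a₁²/8 and solving for k. A minimal sketch, with made-up amplitudes and hypothetical function names purely for illustration:

```python
import math

R_NOMINAL = 8.0   # Ohms, nominal speaker impedance from the text

def total_power(amplitudes, k=1.0, r=R_NOMINAL):
    """Sum of per-component powers, using the text's convention P = a^2 / R."""
    return sum((k * a) ** 2 for a in amplitudes) / r

def scale_factor(amplitudes, a_ref):
    """k such that the scaled components carry the same power as one tone of a_ref."""
    return a_ref / math.sqrt(sum(a * a for a in amplitudes))

amplitudes = [3.0, 4.0]   # illustrative relative amplitudes, not the chord data
a1 = 8.949                # V, loud-level amplitude from the text
k = scale_factor(amplitudes, a1)

print(f"k = {k:.4f}")
print(f"composite power = {total_power(amplitudes, k):.2f} W, "
      f"pure tone power = {a1 ** 2 / R_NOMINAL:.2f} W")
```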

Synthesised signal of 16 notes of a C-major chord from C1 to C7 with fundamentals and harmonics

Numerically differentiating the complex sound signal yields the slew rate as a function of time; the maximum was 503,554V/s or 503.6V/ms. Alternatively, a maximum can be estimated for a point in time when all harmonics are in phase, a worst case assumption which may well happen. The total slew rate would then be the sum of all contributions, Σᵢ 2π·fᵢ·k·aᵢ.

The worst case slew rate turns out to be 521,437V/s or 521.4V/ms, which is significantly higher than the 14,710V/s or 14.7V/ms needed for a pure sinusoidal tone at the same power, by a factor of about 35. In other words, the bandwidth and/or the power reserve of the electronics would have to be higher to accommodate that slew rate. Of course, a real recording of that famous chord will probably demand smaller slew rates, as not all frequencies will be in perfect phase all the time.
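The relation between the two estimates can be illustrated with a small example. The sketch below uses two made-up components instead of the full chord, and the helper names are hypothetical; it compares the in-phase bound (sum of 2π·f·a) with the maximum of a numerically differentiated sum of cosines.

```python
import math

def worst_case_slew(components):
    """Upper bound in V/s: every component at its peak slew at the same instant."""
    return sum(2.0 * math.pi * f * a for f, a in components)

def signal(components, t):
    """Sum of cosines, all starting at their maximum at t = 0."""
    return sum(a * math.cos(2.0 * math.pi * f * t) for f, a in components)

def numerical_max_slew(components, duration, dt):
    """Largest central-difference slope found on a regular time grid."""
    steps = int(duration / dt)
    best = 0.0
    for i in range(steps + 1):
        t = i * dt
        slope = (signal(components, t + dt) - signal(components, t - dt)) / (2.0 * dt)
        best = max(best, abs(slope))
    return best

# Illustrative (frequency in Hz, amplitude in V) pairs, not the chord data.
components = [(100.0, 1.0), (300.0, 0.5)]
bound = worst_case_slew(components)
measured = numerical_max_slew(components, duration=0.01, dt=1e-5)

print(f"worst case bound: {bound:.0f} V/s")
print(f"numerical max:    {measured:.0f} V/s")
```

As with the chord, the numerically found maximum stays somewhat below the in-phase bound, because the components do not all reach their steepest point at the same instant.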

Conclusion: because of the richness of complex sounds, larger slew rates are needed than for a single tone of the same average power. In addition, all circuits need to be energised at that rate, too. This is perhaps where slew-rate reserves help, and this may be what separates decent from excellent audio electronics.

Post Scriptum

To alleviate the demand on the electronics in terms of power and slew rate, using speakers with higher sensitivity can help. For instance, a high sensitivity horn speaker such as Haigner’s Alphahorn sports a sensitivity of 103dB/W/m. In order to achieve the sound pressure level of 100dB mentioned above, this speaker would need 3dB less than the nominal 1W power, i.e. 0.5W; this is 1/20 of the 10W for the 90dB/W/m speaker used above. The resulting voltage amplitude would be 2V (instead of 8.9V), for a maximum slew rate of 3,293V/s (instead of 14,710V/s).

Haigner’s Betahorn (97dB/W/m) and Alphahorn (103dB/W/m)
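The comparison between the two speakers can be reproduced with a small helper. The function name is made up for this sketch, the tone is assumed to be the equal-tempered C4 at about 261.63Hz, and 2.83V is used as the amplitude for 1W as in the rest of the article.

```python
import math

V_REF = 2.83   # V for 1 W into 8 Ohms, used as the amplitude as in the text

def drive_requirements(target_db, sensitivity_db, freq_hz=261.63):
    """Return (power in W, amplitude in V, max slew in V/s) for a pure tone."""
    power_w = 10.0 ** ((target_db - sensitivity_db) / 10.0)
    amplitude_v = V_REF * math.sqrt(power_w)
    slew_v_per_s = 2.0 * math.pi * freq_hz * amplitude_v
    return power_w, amplitude_v, slew_v_per_s

for label, sensitivity in (("90dB/W/m speaker", 90.0), ("103dB/W/m horn", 103.0)):
    p, a, s = drive_requirements(100.0, sensitivity)
    print(f"{label}: {p:.2f} W, {a:.2f} V, {s:.0f} V/s")
```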

Peter Wurmsdobler

Works on the technological foundations of autonomous vehicles at Five, UK. Interested in sustainable mobility, renewable energy and regenerative agriculture.