The Engineering and Art of Headphone Design
A Brief Overview
by Noel Lee
The Art of Listening
Critical listening is an acquired art. It’s a skill that must be learned, just like tasting
fine wines, beers or even fine cigars. It takes time and experience to know what to look for
when determining the highest levels of quality. Such is the challenge in determining
what is a good headphone or a good speaker. For novice listeners, what they hear in a
club or disco is good music reproduction. But advanced listeners will often find these
speakers unnatural and inaccurate, and actually fatiguing to listen to over a long period
of time.
With headphones, determining what superior sound is can be even more difficult than with
speakers. Each ear is different, and even details that seem relatively insignificant, like
ear tip fit, can dramatically influence the results. However, there are some absolute
terms that we can use to describe the listening experience that go beyond just numbers,
like measured frequency response, which is common among the many manufacturers that
design headphones. If it were that simple, why wouldn’t two headphones that measured
similarly sound the same? Why does a balanced armature design have a “signature
sound” that dynamics don’t have? Why do headphones come in so many types and
varieties and sound so different?
In search of the perfect sound.
High quality headphones will reproduce all music accurately and allow the listener to
enjoy the music as if they were transparent. They don’t sound like headphones, but like
real music. The sound feels alive and sounds lifelike. One can listen to them for hours
without fatigue. Unfortunately, really great headphones are extremely rare. But we are
all in search of the perfect sound, and learning how to achieve that in a headphone is a
combination of engineering and art.
A headphone. A speaker. A microphone. Understanding the similarities and
differences as they relate to music reproduction.
These three devices are the same in that they’re all transducers. In other words, they
turn mechanical energy into electrical signals, and vice versa. Headphones are similar
to speakers in that they are both transducers on the reproduction end. Their job is to
recreate the music signal, without adding sounds of their own. But that’s almost
impossible since every mechanical device has sounds, resonances and distortions of
its own. For example, when reproducing a bass kick drum, the recorded sound may
stop, but because of the inertia of the speaker or headphone diaphragm, it keeps on
going. This is known as “decay” over a period of time and it is easily measured today in
the form of a “waterfall” graph. Kind of like a tuning fork that keeps on ringing. This is
The “speed” at which the music signal occurs is also important to create a sense of
realism. In real life, when a guitar pick hits the string, or when one hits a triangle, how
fast is the initial impact? It’s immediate. But, in the same way a speaker or headphone
has trouble stopping, it can also have trouble accelerating fast enough to accurately
capture the initial impact of the music.
The microphone is a speaker in reverse. It captures the music as the sound waves hit its
diaphragm, and this also has a stop and start factor, as well as frequency response.
That’s why you will see recording engineers being fanatical about their selection of
microphones for various instruments. Even singers prefer particular microphones
because they reproduce their voice in the way they want to hear it.
Various technologies have been invented over the years to optimize some of these
parameters. Dynamic speakers with huge magnets help bass speakers stop and start
accurately, along with different cone materials that stiffen the speaker. On the high end,
metalized mids and tweeters help rapid starting and stopping of the signal, but may have
“ringing” distortions of their own. Electrostatic speakers with extremely light diaphragms
are the reference used by many headphone and speaker listeners because of their
ability to start and stop, but because they don’t move great distances, they may lack
power and dynamic range.
In headphones there are designs that help one parameter, but they’re often at the
expense of another. Electrostatic headphones are considered the best, but they can’t
move a lot of air so they lack bass response. Dynamic headphones are all over the map
in their ability to accurately reproduce music, but represent a good compromise if
designed properly. Balanced armatures are fast in reacting, but are bad at stopping and
produce resonances and sounds of their own, as can be seen in their waterfall graphs.
There is one last difference between speakers and headphones. Everyone knows that a
speaker sounds best in a tuned room that is designed for the speaker. That’s how many
recording studios are designed. However, with a headphone, everyone’s ear is slightly
different. Obviously there’s no room, but there is an ear cup on over-ear headphones,
and an ear tip on in-ears. Both can dramatically affect the sound. Both are an
“ecosystem” where a whole lot of parameters depend on one another to get the best sound.
That’s why designing a great headphone means knowing how to balance all of the
parameters of the ecosystem to get the best reproduction in sound. That is where the
“art” and the “ear” are part of the design process.
Measurements vs. Listening.
Measurements are useful in design, but there is no one measurement that will tell you
what a headphone will sound like. Many novices focus on frequency response
callouts that have no +/- dB variation specification or distortion measurement attached,
so they don’t mean anything. Even if the frequency response were exact, without the
various distortion figures (IM, harmonic, TIM), the number is meaningless.
Since the ear does not hear flat frequency response, a correct frequency response
should not be flat. The human ear does not hear all frequencies at the same level, and
is more sensitive to the middle ranges. Usually every other frequency is referenced to
1 kHz, or 1000 Hz.
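As a rough illustration of referencing a response to 1 kHz and stating its +/- dB variation, here is a minimal Python sketch. The measurement values and the function names (`relative_to_1k`, `tolerance_band`) are hypothetical and chosen only for this example; they are not taken from any real headphone or measurement tool.

```python
def relative_to_1k(response_db):
    """Re-reference a response so that the 1 kHz point sits at 0 dB."""
    ref = response_db[1000]
    return {freq: level - ref for freq, level in response_db.items()}


def tolerance_band(response_db, low_hz, high_hz):
    """Return (min, max) deviation in dB from the 1 kHz level within a band."""
    rel = relative_to_1k(response_db)
    in_band = [level for freq, level in rel.items() if low_hz <= freq <= high_hz]
    return min(in_band), max(in_band)


# Hypothetical measurement: frequency in Hz -> level in dB SPL (assumed values).
measured = {20: 88.0, 100: 93.0, 1000: 94.0, 5000: 97.0, 10000: 91.0, 20000: 85.0}

lo, hi = tolerance_band(measured, 20, 20000)
print(f"20 Hz to 20 kHz, {lo:+.1f} dB / {hi:+.1f} dB re 1 kHz")
# prints: 20 Hz to 20 kHz, -9.0 dB / +3.0 dB re 1 kHz
```

A callout of "20 Hz to 20 kHz" with no such tolerance attached would describe this hypothetical headphone just as well as one that stays within 1 dB, which is why the bare range says so little.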
Figure: Frequency response curve of what the ear wants to hear. An ideal and
“balanced” headphone would have a frequency response similar to this.
Figure: This test shows a relatively flat frequency response of a popular headphone and
the harmonic distortion below. Since this curve does not match the reference curve,
this headphone will have difficulty reproducing audible bass.
Yet another measurement that affects perceived frequency response is “waterfall,” or
“decay” of the original signal. Just as the tuning fork can’t stop, various headphone
diaphragms can’t stop, thereby adding coloration to the sound around that particular
frequency. So the “frequency response” might look good, but the “waterfall” or decay
may look bad, causing exaggerated high frequencies (such as in balanced armature
designs) or exaggerated bass, such as in dynamic designs.
Figure: The same frequency response with the waterfall response plotted next to it. One
can see a slow decay across the frequency spectrum where the energy is stored. This
will result in a harsh sound with high and long decay
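To put a number on the decay idea, the sketch below treats one resonance as a simple exponential envelope, exp(-t/tau), and asks how long it takes to fall by a given number of dB. This is a simplified assumption made for illustration, not how a full waterfall plot is computed, and the time constants are made-up values.

```python
import math


def ring_down_time_ms(tau_ms, drop_db=40.0):
    """Time for an envelope exp(-t/tau) to fall by drop_db decibels."""
    return tau_ms * drop_db * math.log(10) / 20.0


def envelope_db(t_ms, tau_ms):
    """Envelope level in dB, relative to its starting level, after t_ms."""
    return 20.0 * math.log10(math.exp(-t_ms / tau_ms))


# Assumed time constants: a well-damped resonance vs. one that rings on.
for label, tau_ms in (("well damped", 0.5), ("ringing", 5.0)):
    print(f"{label}: {ring_down_time_ms(tau_ms):.1f} ms to fall 40 dB")
# prints: well damped: 2.3 ms to fall 40 dB
# prints: ringing: 23.0 ms to fall 40 dB
```

In this toy model a tenfold longer time constant means the stored energy lingers ten times as long at that frequency, which is the kind of ridge a waterfall graph makes visible.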
Lastly, impulse response helps to determine how fast a headphone can respond to a
signal, and how fast it can stop. An impulse allows us to see the rise and fall of the
transducer and its ability to reproduce musical instruments that have fast transients.
One can see why two headphones that measure the same in frequency response can
sound very different. It’s a combination of tests that will give us an indication of how a
headphone will sound.
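As a hedged illustration of reading "how fast it can stop" off an impulse response, the sketch below builds a toy response (a decaying sine) and measures how long it takes to stay below 10% of its peak. The model, the 10% threshold, and the function names are assumptions for this example only; real measurements use recorded responses rather than a formula.

```python
import math


def damped_response(freq_hz, tau_ms, sample_rate=48000, duration_ms=20.0):
    """Samples of exp(-t/tau) * sin(2*pi*f*t), a toy stand-in for a driver's ringing."""
    n = int(sample_rate * duration_ms / 1000.0)
    tau_s = tau_ms / 1000.0
    return [
        math.exp(-(i / sample_rate) / tau_s)
        * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
        for i in range(n)
    ]


def settle_time_ms(samples, sample_rate=48000, threshold=0.1):
    """Time of the last sample whose magnitude is at least threshold * peak."""
    peak = max(abs(s) for s in samples)
    last = max(i for i, s in enumerate(samples) if abs(s) >= threshold * peak)
    return 1000.0 * last / sample_rate


# Two hypothetical drivers ringing at the same frequency with different damping.
fast = damped_response(3000.0, tau_ms=0.5)
slow = damped_response(3000.0, tau_ms=3.0)
print(settle_time_ms(fast) < settle_time_ms(slow))  # prints: True
```

Both toy drivers ring at the same frequency, so a single-frequency level reading would not separate them, yet the second one keeps sounding several milliseconds longer.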
This is a simplified explanation. Headphone housing, materials, driver design, and ear
tip design are only some of the other considerations in making a great headphone.
The final analysis is how it sounds to the critical human ear. How to “tune” all of the
parameters is the “art” in the design. Knowing what measures well needs to be in
concert with what sounds good. Years of experience in knowing what to do, combined
with a critical ear, are a rare combination indeed.
Talking the Talk: Audio Terms Describing Headphone Listening
The following terms are intended to help us establish a common language to talk about
headphone music reproduction. Just as wine aficionados have their terms of oaky, airy,
fruity, and others, we too need to have terms to describe the sound of headphones.
We have enhanced the description of these terms specifically around headphone
listening, and have also introduced measurements so that we can begin to
correlate what we hear with what we can measure. It is impossible to rely on one
measurement, as it is a combination of measurements, along with the “ecosystem”
which includes your ear, the ear tips, and even the seal around the ear, that will
determine the final result.
Aggressive: Forward and overly bright sonic character, as opposed to being smooth
and balanced. It can be measured in high amounts of IM distortion and poor waterfall
response. This distortion can cause long term listening fatigue.
Air: Spacious and open with a sense of lightness and transparency. Achieved through
reproducing mid and high frequencies accurately with good phase response throughout
Airy: Pertaining to treble which sounds light, delicate, open, and seemingly unrestricted
in upper extension. From quality reproducing systems that have smooth and very
extended HF response.
Ambience: Psychoacoustic impression of a physical acoustic space, such as a concert
hall in which a recording is made.
Articulate: Imparting a sense of precise intelligibility and definition of vocals,
instrumentals and the interactions between them. Comes from good transient and
waterfall, especially at high frequencies.
Attack: The leading edge of a note, such as the “snap” of the drumstick as it hits the
snare so one hears the individual snares. Also pertains to the ability of a system to
reproduce the attack transients in music. Accomplished with extremely good transient
and impulse response with great waterfall with no overhang.
Awesome: The sound when all of the positive parameters of headphone design
come together in one listening experience.
Articulation: The ability to reproduce fine details, especially quick transients. Tiny
details reproduced accurately are a hallmark of a headphone with good articulation.
Balance: The smooth non-emphasis of any part of the audible spectrum. A headphone
with good tonal balance can be played louder without fatigue as it does not
overemphasize any part of the frequency spectrum and therefore the overall level can be
louder. Proper reproduction of a thunderous orchestra or big band is a good
demonstration of tonal balance.
Another balance is channel balance, or the relative level of the left and right stereo
channels. Channel balance is critical to good soundstage and imaging in a headphone,
as the signals must arrive at both ears at the same time at the same level.
Bass: The audio frequencies between about 20Hz and 250Hz. New music with
synthesized effects can produce very powerful low notes, so reproduction in the 30
to 50Hz region becomes important. Well recorded bass guitar is a good test for a
combination of low end bass response, with higher end fundamentals as when a
performer plays “slap” bass.
For a headphone design, the proper response needs to follow the insensitivities of the
human ear. Flat response may not give very satisfying bass as the ear is less sensitive
as the volume and frequency go down. Good bass should also be “tight” as the
headphone diaphragm needs to start (speed) and stop (waterfall) with the signal and
not add sounds of its own, as with acoustic and electric bass.
Bass Extension: Realization of all low bass information from 250Hz down to 20Hz.
Very few headphones can reproduce this well because of the tiny diaphragm that needs
to move a lot of air to create these frequencies. Also, because the ear is less sensitive
to bass frequencies, the bass response needs to go up as the frequency and volume go
down.
Body: Fullness of sound, with particular emphasis on upper bass. Opposite of thin.
Bright: A sound that over-emphasizes the upper midrange and lower treble. This can
be seen by exaggerated high frequency response as well as poor waterfall (long decay).
Clarity: Is the sound “clear” and “transparent” as opposed to muddy or fuzzy?
Accomplished with good impulse and waterfall, so the headphone diaphragm starts and
stops rapidly. A “slow” headphone may have all of the frequencies, but not good clarity.
Clear: Similar to clarity, but used to describe a lack of “speaker sound” or headphone
sound. See also, Transparent.
Coloration: An audible added characteristic that a headphone produces and that is
not a part of the original source material. Caused by poor waterfall and/or frequency
response resulting from resonances in the design of the diaphragm, as well as the
earphone housing. Heavy metals are usually preferable to plastics, which can resonate.
Coherent: Showing no audible evidence of a crossover or of different driver colorations
in any of the various frequency ranges. For example, a saxophone doesn’t sound like
it’s coming from a low frequency and high frequency diaphragm, but from one
diaphragm. In dual and triple diaphragm headphone designs, it is extremely difficult to
make all diaphragms sound as if they operate as one. Phase and impulse response
measurements will show the presence or lack of coherency.
Decay: Fadeout of a note following the initial attack, easily seen in waterfall response.
Some frequencies may decay longer than others depending on the headphone design.
Decay negatively affects sound accuracy, since it adds “coloration” to the music that
wasn’t in the original recording.
Definition (or resolution): The ability of a component to reveal the subtle information
that is fundamental to high fidelity sound. Also “inner definition” such as the drawing of a
bow across Yo-Yo Ma’s cello, which is extremely difficult to reproduce. Also revealed in
the “bite” of horns in a big band recorded with a great condenser (electrostatic in
Delicate: High frequencies extending from 8kHz to 20kHz without accentuating peaks.
This also describes a headphone’s ability to respond to extremely low-level signals,
where some headphones may not have enough sensitivity or response.
Depth: A sense of hearing “into” the music, the 3rd dimension. Also referred to as front
to back, there is a great sense of space. Great phase response is required to reproduce
depth and “ambiance” of a recording or the “depth” of the soundstage.
Detail: The most delicate elements of the original recorded sound. These elements are
the first to disappear with lesser equipment and headphones, as it requires high
sensitivity and response.
Distortion: A sound that is not part of the original signal. Distortion can add harmonics
of the original signal (harmonic distortion), or generate new signals that result from the
interaction of a combination of signals (intermodulation distortion). Can
be characterized as a roughness, fuzziness, harshness, or stridency in the music.
There are many distortions that can be measured: IM (intermodulation distortion),
harmonic distortion, and TIM (transient intermodulation distortion). Also breakup, where
the headphone can’t handle the power or the low frequencies, which causes sound to
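For readers who want to see how two of these distortion figures are usually expressed, here is a minimal sketch using the common textbook definitions: total harmonic distortion as the RMS sum of the harmonic amplitudes relative to the fundamental, and intermodulation products as sum and difference frequencies of two test tones. The amplitudes and tone frequencies are made-up values, and the function names are hypothetical.

```python
import math


def thd_percent(fundamental, harmonics):
    """Total harmonic distortion: RMS sum of harmonic amplitudes over the fundamental."""
    return 100.0 * math.sqrt(sum(h * h for h in harmonics)) / fundamental


def im_products(f1_hz, f2_hz):
    """Second- and third-order intermodulation frequencies for two test tones."""
    return {
        "f2 - f1": f2_hz - f1_hz,
        "f1 + f2": f1_hz + f2_hz,
        "2*f1 - f2": 2 * f1_hz - f2_hz,
        "2*f2 - f1": 2 * f2_hz - f1_hz,
    }


# Assumed amplitudes: fundamental of 1.0 with 2nd and 3rd harmonics at 0.03 and 0.04.
print(f"THD: {thd_percent(1.0, [0.03, 0.04]):.1f}%")  # prints: THD: 5.0%

# Assumed test tones at 1000 Hz and 1200 Hz.
print(im_products(1000, 1200))
```

Harmonic products fall on multiples of the original note, while intermodulation products such as the 200 Hz and 800 Hz terms here are not harmonically related to either tone.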
Dynamic: The ability to play very loud as well as very soft. Some headphones may
“compress” on loud passages, or simply “distort.” Some headphones will only reproduce
the mids with moderate dynamics and cannot capture the power of real music. Some
insensitive headphones are not able to resolve subtle low level signals which make
Dynamic is also used to describe a type of headphone speaker that uses a typical
magnet and voice coil design, as opposed to “electrostatic” or “balanced armature”
designs.
Dynamic Range: Pertaining to the ratio between the loudest and the quietest sounds.
Wide dynamic range could be an explosion out of silence. Small dynamic range could
be a loud rock band that doesn't ever play soft.
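Since dynamic range is a ratio, it is normally quoted in decibels. The short sketch below shows that conversion for amplitude quantities such as sound pressure or voltage, using the standard 20*log10 rule; the example amplitudes are arbitrary and the function name is hypothetical.

```python
import math


def dynamic_range_db(loudest, quietest):
    """Dynamic range in dB between two amplitudes (e.g. sound pressure or voltage)."""
    return 20.0 * math.log10(loudest / quietest)


# Arbitrary example: the loudest passage has 1000 times the amplitude of the quietest.
print(f"{dynamic_range_db(1.0, 0.001):.0f} dB")  # prints: 60 dB
```

Each tenfold increase in the amplitude ratio adds 20 dB, so a recording that never plays soft ends up with a small figure no matter how loud it is.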
Efficient: The ratio of sound output to the level of signal in. An efficient headphone will
play louder at the same volume setting, while an inefficient one will require that the
volume level be turned up to get the same music level.
Therefore different headphones may play louder or quieter, despite being given the
same level of output from the same media player. This does not mean that they are
good or bad, unless you like to listen at loud levels and your headphones cannot play as
loud as you want them to. In this case, high efficiency headphones are more satisfying.
Many headphones cannot be driven to high output levels, which will require the use of a
quality headphone amplifier.
When comparing headphones, it is helpful to set the volume levels so they play at the
same loudness in the middle frequencies, adjusting the volume between the two if
necessary.
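One way to think about level matching is as a dB offset between the two headphones at the same volume setting. The sketch below assumes you already have a midrange level reading for each headphone (the numbers are invented) and converts the difference into the voltage gain needed on the quieter one; the function names are hypothetical.

```python
def match_offset_db(level_a_db, level_b_db):
    """How many dB louder headphone A plays than headphone B at the same setting."""
    return level_a_db - level_b_db


def voltage_gain(offset_db):
    """Voltage multiplier that corresponds to a level change of offset_db."""
    return 10.0 ** (offset_db / 20.0)


# Invented midrange readings, in dB SPL, taken at the same volume setting.
offset = match_offset_db(100.0, 94.0)
print(f"Raise headphone B by {offset:.1f} dB (about {voltage_gain(offset):.2f}x voltage)")
# prints: Raise headphone B by 6.0 dB (about 2.00x voltage)
```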
Edgy: Excessive high frequency response. Occurs when there is a high level of distortion
due to frequency response exaggeration, with a raspiness or harshness to the sound.
Excessive decay and poor waterfall, which are inherent in the headphone design itself,
can cause this unpleasant sound, which will create listening fatigue.
Sometimes referred to in a positive sense when reproducing brass instruments such as
trumpets where one hears the “edge.”
Extreme Highs: Audio frequencies of 10 kHz and above. Examples of instruments are
triangles and cymbals. These frequencies are hard to hear because they do not occur
often in music and the ear is less sensitive to them.
Fast: Good reproduction of rapid transients which give a sense of realism. Often refers
to the ability of a headphone to reproduce sharp transients. Timely response and
acceleration of a speaker to an incoming signal. A good brass band will show this off.
Focus: A strong, precise sense of image and music instrument placement.
Forward: Usually referring to the midrange, vocals or projection of instruments as being
in the front of the soundstage.
Frequency Ranges:
Sub-bass – 16Hz to 30Hz
Bass – 20Hz to 250Hz
Mid-bass – 60Hz to 250Hz
Mid-range – 250Hz to 6kHz
Upper-mid-range – 2kHz to 6kHz
Highs – 6kHz and above
Extreme Highs – 10kHz and above
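The ranges listed above overlap, so a single frequency can belong to more than one named band. A small lookup, sketched here in Python, makes that explicit. The band edges are copied from the list; treating the edges as inclusive and the name `bands_for` are assumptions made for this example.

```python
# (name, low edge in Hz, high edge in Hz or None for "and above")
BANDS = [
    ("Sub-bass", 16, 30),
    ("Bass", 20, 250),
    ("Mid-bass", 60, 250),
    ("Mid-range", 250, 6000),
    ("Upper-mid-range", 2000, 6000),
    ("Highs", 6000, None),
    ("Extreme Highs", 10000, None),
]


def bands_for(freq_hz):
    """Return every named band that contains freq_hz (edges treated as inclusive)."""
    return [
        name
        for name, low, high in BANDS
        if freq_hz >= low and (high is None or freq_hz <= high)
    ]


print(bands_for(100))    # prints: ['Bass', 'Mid-bass']
print(bands_for(3000))   # prints: ['Mid-range', 'Upper-mid-range']
print(bands_for(12000))  # prints: ['Highs', 'Extreme Highs']
```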
Full: Strong sense of balance in the music with all instruments being equally
reproduced. Good low frequency response, not necessarily extended, but with
adequate level around 100 to 300 Hz. Male voices are full around 125 Hz; female
voices and violins are full around 250 Hz; sax is full around 250 to 400 Hz. Opposite of
thin.
Gritty: A harsh sound in the upper frequencies. Rough sandpapery sound caused by
exaggerated high end and long decay times in the high frequencies.
Harmonics: The richness of sound and production of instrument overtones. Examples
of this are the sound reproduction of a guitar, saxophone, and piano. Sometimes long
decays in bass frequencies, although a form of distortion, can add a sense of fullness.
Harmonics can also be a pleasant kind of distortion caused by electronics or the
headphones themselves, but it is a distortion in that the harmonics are not part of the
original signal.
Harsh: Combination of unpleasant high frequency peaks and a hashy distortion.
Harshness makes one want to turn the volume down as the harshness overrides other
parts of the music.
Highs: The audio frequencies above about 6 kHz. Examples are the upper ranges of
electric guitars, flutes, triangles. Most headphones cannot reproduce highs accurately.
High Midrange (High Mids, Upper Mids): The audio frequencies between about 2kHz
and 6kHz. Examples are frequencies from the upper voice range, and the bite of electric
guitars and brass instruments such as horns and big band.
Imaging: The placement of vocals or instruments within the soundstage. Good phase
response and coherency will help provide realistic image.
Impact: How music “hits” the listener is an indication of how impactful music can be.
Kick drums and explosions in a movie are more dynamic with a headphone’s ability to
reproduce impact. Very few headphones have this ability.
Incredible: Like awesome, will describe a combination of desirable parameters.
Liquid: Smooth, relaxing, yet detailed sound. Opposite of harsh.
Low Bass: The audio frequencies ranging from below 20Hz to 60Hz. These are the
hardest frequencies for headphones to reproduce because it involves moving a lot of air
with a very small diaphragm. Very low organ pedal notes and today’s
electronic instruments present new challenges for headphones to reproduce. You want
a headphone to accurately reproduce this low bottom end when it’s there in the music,
but not reproduce it when it’s not.
Low Level Detail: Being able to resolve the delicate nuances of music, especially
during quiet passages.
Low End Detail: The subtle distinct sounds that you can hear in the bass frequencies.
Examples are when a pedal impacts the bass drum or when fingers move across a
Low Midrange (Low Mids): The audio frequencies between about 250Hz and 500Hz.
Very critical in reproducing vocals accurately. Artificial exaggeration here can create a
sense of fullness that is not natural.
Lush: Very rich and full reproduction. Smooth.
Mellow: Reduced high frequencies. The opposite of edgy.
Midbass: The audio frequencies between about 60Hz and 250Hz. Kick drum and bass
guitar are examples of instruments represented by these frequencies.
Midrange (Mids): The audio frequencies between about 250 Hz and 6000 Hz. This is
the ear’s most sensitive range. We can hear even small variations in this region.
Because of this sensitivity, natural, not artificial-sounding, reproduction of vocals,
piano, and guitars is extremely important in this range.
Muddy: Ill-defined and congested. Headphone keeps going after the signal stops. Can
easily be seen with bad waterfall and/or high harmonic distortion. Muddy sound can
actually be caused by soft eartips, such as some foam eartips.
Musical (or musicality): The ability to hear through the headphone and into the music.
Instruments sound more like real instruments as opposed to speakers reproducing
Natural: Realistic sound reproduction.
Neutral: Free from coloration. Not artificially exaggerating one frequency or another.
Open: Sound which has height and “air.” Relates to clean upper midrange and treble.
Perfection: No such thing, but it’s a nice word.
Presence: A sense that the instrument is present in the room with the listener. Able to
reproduce an experience of “being there.”
Punch: Good reproduction of dynamics. Good transient response, with strong impact.
Sometimes a bump around 5 kHz or 200 Hz. Good waterfall is required here. Ear tips
will also affect the ability to properly reproduce punch.
Reference: A standard by which all others are compared. The highest quality available.
Resolution (Resolving): Hearing under a microscope. The ability to hear fine details.
Can also refer to sampling rates and ability to hear all of the harmonics and tonality of
Rich: See “Full.” Also, having euphonic distortion made of even order harmonics.
Roll off: A frequency response which falls gradually above or below a certain frequency
range. This can cause inaccurate sound, in that a headphone cannot reproduce all of
Sibilance: A coloration that resembles or exaggerates the vocal ssss sound. Bad
waterfall in the high frequency can exaggerate these sounds.
Smooth: Easy on the ears. Not harsh. Flat (neutral) frequency response, especially in
the midrange. Lack of peaks and dips in the response.
Soundstage: The space in front of the listener from far left, center, to far right. Can
also incorporate the “depth” of soundstage. Is the listener in front of the band? Do
instruments in the back of the orchestra sound like they are in the back? Soundstage
should have width, depth, and height. Higher resolution source material (sampled at
higher rates) will reproduce soundstage better.
Speed: Timely impulse response. The ability of a speaker to respond quickly to signal
input. A fast system with good pace gives the impression of being right on the money in
its timing. Good speed is absolutely critical to realistic musical reproduction and
separates ordinary headphones from great ones.
Sub-Bass: The audio frequencies between about 16Hz and 30Hz. These frequencies
are more “felt” rather than heard. Nearly impossible to reproduce with a headphone.
Sweet: Not strident or piercing. Delicate. Flat high frequency response, low distortion.
Lack of peaks in the response. Highs are extended to above 10 kHz.
Texture: A perceptible pattern or structure in reproduced sound. The ability to hear the
differences between two similar instruments can be described as texture.
Tight: Good low frequency transient response and detail. Accurate and fast impulse
response. Low decay (great waterfall).
Timbre: The tonal character of an instrument with all of the harmonics that give it its
identity. Headphones themselves have a timbre of their own, which should not interfere
with their ability to reproduce music.
Transient: The leading edge of a percussive sound. Good transient response makes
the sound as a whole more live and realistic. Accomplished with fast impulse response
and good waterfall.
Transparent: Easy to hear into the music. Detailed, clear and not muddy. Wide flat
frequency response, sharp time response, very low distortion and noise. The
headphone sound becomes invisible and only the music remains.
Upper Midrange (Upper Mids, High Mids): The audio frequencies between 2 kHz and
6 kHz. Electric guitar, brass instruments and high vocal ranges are in this region.
Uncolored: Free from audible colorations; sounding more like real music.
Warm: Describes satisfying full sound. Low and mid bass need to be accurately
reproduced without becoming lean or thin. High frequencies need to be full and
Weighty: Good low frequency response below 50 Hz. A sense of substance
produced by deep, controlled bass.