3. Motivation
To validate a real-time system that updates head-related
impulse responses
Goal is to show that the acoustic waveforms measured
on KEMAR match between real and virtual presentations
Applications:
Explore the effects of dynamic sound presentation on sound
localization
4. Introduction: What is Real/Virtual Audio?
Real Audio consists of presenting sounds over
loudspeakers
Virtual Audio consists of presenting acoustic waveforms
over headphones.
Advantages of virtual audio
Cost-effective
Portable
Doesn’t depend on room effects
Disadvantages
Unrealistic
5. Introduction: Sound Localization
Interaural Time Difference – ITD – Difference between
sound arrival times at the two ears
Predominant cue at low frequencies (below ~2 kHz)
Interaural Level Difference – ILD – Difference between
sound levels at the two ears
Predominant cue at higher frequencies (above ~2 kHz) due
to head shadowing
Encoded in Head-Related Transfer Function (HRTF)
ILD in Magnitude
ITD in Phase
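Both cues can be read off an HRTF pair directly. A minimal NumPy sketch (the function name and impulse responses are hypothetical, not from the thesis): ILD from the interaural magnitude ratio per frequency, and a broadband ITD from the peak of the interaural cross-correlation.

```python
import numpy as np

def binaural_cues(hrir_left, hrir_right, fs):
    """Estimate ITD and ILD from a left/right HRIR pair (illustrative).
    The returned itd_s is negative when the left-ear signal leads."""
    H_l = np.fft.rfft(hrir_left)
    H_r = np.fft.rfft(hrir_right)
    freqs = np.fft.rfftfreq(len(hrir_left), d=1.0 / fs)
    # ILD: interaural difference of log magnitudes (dB) per frequency bin
    ild_db = 20.0 * np.log10((np.abs(H_l) + 1e-12) / (np.abs(H_r) + 1e-12))
    # ITD: broadband estimate from the lag of the interaural cross-correlation
    xcorr = np.correlate(hrir_left, hrir_right, mode="full")
    lag = int(np.argmax(xcorr)) - (len(hrir_right) - 1)
    return lag / fs, freqs, ild_db
```

For example, a right-ear response delayed by 5 samples and attenuated by half yields an ITD of -5/fs and a flat 6 dB ILD.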
6. Background of RTVAS System
Developed by Jacob Scarpaci (2006)
Uses a Real-Time Kernel in Linux to update HRTF filters
Key to the system is that the HRTF convolved with the
input signal corresponds to the difference between
where the sound should be and the subject's current
head position.
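A minimal sketch of that idea (not the RTVAS internals; the bank structure, integer-azimuth lookup, and names are assumptions): the filter is looked up at the source azimuth minus the head azimuth.

```python
def select_hrtf(hrtf_bank, target_az_deg, head_az_deg):
    """Look up the HRTF for the source position *relative to the head*.
    hrtf_bank is assumed to map integer azimuths (-90..90 degrees) to
    impulse-response pairs; the RTVAS surely does something richer."""
    rel_az = int(round(target_az_deg - head_az_deg))
    rel_az = max(-90, min(90, rel_az))  # clamp to the measured range
    return hrtf_bank[rel_az]
```

So a source at 30 degrees heard with the head turned to 10 degrees uses the 20-degree filter.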
7. Project Motivation/Aims
Goal is to validate that the Real-Time Virtual Auditory
System, developed by Jacob Scarpaci (2006), correctly
updates HRTFs in accordance with head location relative
to sound location.
Approach to validation:
Compare acoustic waveforms measured on KEMAR when
presented with sound over headphones to those presented
over loudspeakers.
Mathematical, signals approach
Perform a behavioral task where subjects are to track dynamic
sound played over headphones or loudspeakers.
Perceptual approach
8. Methods: Real Presentation - Panning
Loudspeaker setup to create a virtual speaker (shown as
dashed outline) by interpolating between two speakers
located symmetrically about 0 degrees azimuth.
Nonlinear panning law (Leakey, 1959):
CH1 = 1/2 − sin(θ) / (2 sin(θ_pos))
CH2 = 1/2 + sin(θ) / (2 sin(θ_pos))
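The sine-law gains can be sketched as follows (a minimal sketch; CH1 takes the minus sign so the two gains sum to one at every angle):

```python
import numpy as np

def panning_gains(theta_deg, theta_pos_deg):
    """Channel gains for a phantom source at theta degrees between two
    loudspeakers at +/-theta_pos degrees azimuth (after Leakey, 1959)."""
    s = np.sin(np.radians(theta_deg)) / (2.0 * np.sin(np.radians(theta_pos_deg)))
    return 0.5 - s, 0.5 + s  # (CH1, CH2)
```

At theta = 0 both speakers get equal gain (0.5, 0.5); at theta = theta_pos all the signal goes to one speaker.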
9. HRTF Measurement
Empirical KEMAR
17th order MLS used to measure HRTF at every degree from -90 to 90 degrees.
All measurements were windowed to 226 coefficients using a modified
Hanning window to remove reverberations.
Minimum-Phase plus Linear Phase Interpolation
Interpolated from empirical measurements spaced every 5 degrees.
Magnitude function was derived using a linear weighted average of the log
magnitude functions from the empirical measurements.
Minimum Phase function was derived from the magnitude function.
Linear Phase component was added corresponding to the ITD calculated for
that position.
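The minimum-phase reconstruction step can be sketched with the real-cepstrum method (a standard technique; the function, the even FFT length, and the integer-sample delay here are illustrative, not the thesis code):

```python
import numpy as np

def interp_min_phase_hrtf(h_a, h_b, w, itd_samples):
    """Blend two HRIRs by the minimum-phase-plus-linear-phase recipe:
    weighted average of the log magnitudes, minimum-phase rebuild via
    the real cepstrum, then a pure delay carrying the interpolated ITD."""
    n = len(h_a)
    log_mag = ((1 - w) * np.log(np.abs(np.fft.fft(h_a)) + 1e-12)
               + w * np.log(np.abs(np.fft.fft(h_b)) + 1e-12))
    cep = np.fft.ifft(log_mag).real          # real cepstrum of the blend
    cep_mp = np.zeros(n)                     # fold -> minimum-phase cepstrum
    cep_mp[0] = cep[0]
    cep_mp[1:n // 2] = 2 * cep[1:n // 2]
    cep_mp[n // 2] = cep[n // 2]
    h_min = np.fft.ifft(np.exp(np.fft.fft(cep_mp))).real
    return np.roll(h_min, int(round(itd_samples)))  # linear-phase component
```

Blending an impulse with itself and delaying by 3 samples simply returns the delayed impulse, which is a quick sanity check of the cepstral folding.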
10. Acoustic Waveform Comparison: Static
Sound/Static Head Methods
Presented either a speech waveform or noise waveform at three
different static locations: 5, 23, and -23 degrees
During the free-field presentation the positions were created by
using the panning technique (outlined previously) from speakers.
Used 4 different KEMAR HRTF sets in the virtual presentation
Empirical, Min-Phase Interp., Empirical Headphone TF, Min-Phase
Headphone TF
Recorded sounds on KEMAR with microphones located at the
position corresponding to the human eardrum.
11. Static Sound/Static Head: Analysis
Correlated the waveforms recorded over loudspeakers
with the waveforms recorded over headphones for a
given set of HRTFs.
Correlated time, magnitude, and phase functions
Allowed for a maximum delay of 4ms in time to allow for
transmission delays
Broke signals into third-octave bands with the following
center frequencies:
[200 250 315 400 500 630 800 1000 1250 1600 2000 2500 3150 4000
5000 6300 8000 10000]
Correlated time, magnitude, and phase within each band and calculated
the delay (lag) that had to be imposed on one signal to achieve
maximum correlation.
Looked at differences in binaural cues within each band
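A minimal sketch of the per-band analysis, assuming an ideal FFT band mask and circular shifts stand in for whatever third-octave filters and alignment the thesis actually used:

```python
import numpy as np

def band_corr_with_lag(x, y, fs, f_lo, f_hi, max_lag_s=0.004):
    """Normalized correlation of two recordings within one band, plus the
    shift of y (within +/-4 ms, the allowed transmission delay) that
    maximizes it. Band-limiting uses an ideal FFT mask (illustrative)."""
    n = len(x)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    mask = (freqs >= f_lo) & (freqs < f_hi)
    xb = np.fft.irfft(np.fft.rfft(x) * mask, n)
    yb = np.fft.irfft(np.fft.rfft(y) * mask, n)
    max_lag = int(max_lag_s * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    denom = np.sqrt(np.dot(xb, xb) * np.dot(yb, yb))
    corrs = [np.dot(xb, np.roll(yb, k)) / denom for k in lags]
    best = int(np.argmax(corrs))
    return int(lags[best]), corrs[best]
```

Running this over the 18 third-octave center frequencies listed above gives one (lag, correlation) pair per band.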
18. Dynamic Sound/Static Head: Methods
Presented a speech or a noise waveform either over
loudspeakers or headphones using panning or convolution
algorithm
Sound was swept from 0 to 30 degrees
Used same 4 HRTF sets
25. Static Sound/Dynamic Head: Methods
Speech or noise waveform was presented over
loudspeakers or headphones at a fixed position, 30
degrees.
4 HRTF sets were used
KEMAR was moved from 30 degrees to 0 degree position
while sound was presented.
Head position was monitored using Intersense® IS900
VWT head tracker.
26. Static Sound/Dynamic Head: Analysis
Similar data analysis was performed in this case as in the
previous two cases.
Only tracks that followed the same trajectory were
correlated.
The acceptance criterion was less than a 1 to 1.5 degree
difference between the tracks.
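The acceptance test reduces to a pointwise comparison of two trajectories; a sketch (the function name and the pointwise reading of the criterion are our assumptions):

```python
import numpy as np

def tracks_match(track_a, track_b, tol_deg=1.0):
    """Accept a pair of head-position tracks for correlation only if they
    never differ by more than tol_deg degrees (1 to 1.5 deg in the text)."""
    track_a = np.asarray(track_a, dtype=float)
    track_b = np.asarray(track_b, dtype=float)
    return bool(np.max(np.abs(track_a - track_b)) < tol_deg)
```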
31. Difference in ITDs from Free-Field and
Headphones for Static Noise/Dynamic Head
32. Difference in ILDs from Free-Field and
Headphones for Static Noise/Dynamic Head.
33. Waveform Comparison Discussion
Interaural cues match up very well across the different
conditions as well as between loudspeakers and
headphones.
This follows from the high correlations in the magnitude and phase
functions.
Differences in the waveforms (correlation) may not matter
perceptually if the listener receives the same binaural cues.
The output algorithm in the RTVAS seems to present correctly
localized sounds as well as to adjust correctly to head
movement.
34. Psychophysical Experiment: Details
6 Normal Hearing Subjects
4 Male, 2 Female
Sound was presented over headphones or loudspeakers
Task was to track, using their head, a moving sound
source.
HRTFs tested were Empirical KEMAR, Minimum-Phase
KEMAR, and Individual (interpolated using minimum phase)
35. Psychophysical Experiment: Details cont.
Sound Details
White noise
Frequency content was 200Hz to 10kHz
Presented at 65dB SPL
5 seconds in duration
Track Details
15(sin((2π/5)t) + sin((2π/2)t·rand))
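Read literally, with rand drawn once per trial, the track could be generated as follows (the 100 Hz track sampling rate and the function name are assumptions):

```python
import numpy as np

def make_track(duration_s=5.0, fs=100.0, rng=None):
    """Target azimuth trajectory 15*(sin((2*pi/5)*t) + sin((2*pi/2)*t*r)),
    with r drawn once per trial -- our reading of the slide's "rand"."""
    if rng is None:
        rng = np.random.default_rng()
    t = np.arange(0.0, duration_s, 1.0 / fs)
    r = rng.random()
    return t, 15.0 * (np.sin(2 * np.pi / 5 * t) + np.sin(2 * np.pi / 2 * t * r))
```

Each trajectory starts at 0 degrees and stays within ±30 degrees, matching the speaker spans used elsewhere in the experiments.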
36. Psychophysical Experiment: Virtual Setup
Head Movement Training – Subjects just moved head (no sound)
5 repetitions where the subjects' task was to put a square (representing
the head) inside another box.
This also centers the head.
Training – All using Empirical KEMAR
10 trials where subject was shown, via plot, the path of the sound before
it played.
10 trials where the same track as before was presented but no visual
cue was available.
10 trials where subject was shown, via plot, the path but path was
random from trial to trial.
10 trials where tracks were random and no visualization.
37. Psychophysical Experiment: Setup cont.
Experiment (Headphones)
10 trials using Empirical KEMAR HRTFs
10 trials using Minimum-Phase KEMAR HRTFs
10 trials using Individual HRTFs
Repeated 3 times
Loudspeaker Training
Same as headphones but trials were reduced to 5.
Loudspeaker Experiment
30 trials repeated only once
Subjects were instructed to press a button as soon as they
heard the sound. This started the head tracking.
46. Psychophysical Experiment: Discussion
Coherence
The coherence (correlation) measure in the empirical and minimum-phase
interpolation cases is not statistically different from that measured
over loudspeakers.
Coherence with individual HRTFs was surprisingly worse.
Coherence also stays strong as the complexity of the track varies.
Latency
Individual HRTFs show more variability in latency.
Subjects might be able to track changes more quickly using their own HRTFs.
Loudspeaker latency is negative, which means that subjects are
predicting the path.
This could be because the sound always moves to the right first, as well
as a result of the delay in pressing the button.
47. Psychophysical Experiment: Discussion
Cont.
RMS
No significant difference in total RMS error or in RMS
undershoot error between the Empirical and Minimum-Phase
HRTF conditions and loudspeakers.
Subjects generally undershoot the path of the sound.
Could be a motor problem (i.e., laziness) as well as perceptual.
48. Overall Conclusions
Coherence of acoustic recordings may not be the best
measure for validation
Confounded by reverberation and by the panning technique
If perception is the only thing that matters, then we have to
conclude that the algorithm works
49. Future Work
Look at different methods for presenting dynamic sound
over loudspeakers.
Try different room environments.
Closer look at differences between headphones
Particularly looking at open canal tube-phones to see if
subjects could distinguish between real and virtual sources.
Various psychophysical experiments that involve dynamic
sound (speech, masking)
Sound localization
Source separation
50. Acknowledgements
Committee
Dave Freedman
Steven Colburn
Jake Scarpaci
Barb Shinn-Cunningham
Nathaniel Durlach
Other
My Subjects
All in Attendance
Binaural Gang
Todd Jennings
Le Wang
Tim Streeter
Varun Parmar
Akshay Navaladi
Antje Ihlefeld
53. Methods: Real Presentation Continued
Input stimulus was a 17th-order MLS sequence sampled at 50 kHz.
Corresponds to a duration of ~2.6 sec.
Waveforms were recorded on KEMAR (Knowles Electronics Manikin for
Acoustic Research).
Speaker Presentation – sources created at each speaker position:
Speaker Position | Source Created
10               | -5, 0, 5
15               | -10, 0, 10
30               | -20, -10, 0, 10, 20
45               | -40, -30, -10, 0, 10, 30, 40
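A 17th-order MLS can be generated with a linear-feedback shift register. The sketch below uses the primitive trinomial x^17 + x^3 + 1 (an assumed tap choice; the thesis does not state its taps), giving the maximal period 2^17 − 1 = 131071 samples, i.e. ~2.62 s at 50 kHz.

```python
import numpy as np

def mls17():
    """17th-order maximum-length sequence as a +/-1 signal,
    from a Fibonacci LFSR with feedback taps 17 and 3."""
    reg = [1] * 17                       # any nonzero seed works
    out = np.empty(2**17 - 1, dtype=np.int8)
    for i in range(out.size):
        out[i] = reg[-1]                 # output the oldest bit
        fb = reg[16] ^ reg[2]            # taps 17 and 3 (1-indexed)
        reg.pop()
        reg.insert(0, fb)
    return 2 * out.astype(np.int64) - 1  # map {0,1} -> {-1,+1}
```

The resulting sequence has the flat spectrum and impulse-like circular autocorrelation (exactly −1 at every nonzero lag) that make MLS attractive for impulse-response measurement.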
54. Results: Real Presentation
• HRTFs measured when sound was presented over loudspeakers using the
linear and nonlinear interpolation functions
[Figure panels: Linear, Nonlinear]
55. Results: Correlation Coefficients at all Spatial Locations for Interpolated Sound over Loudspeakers
Correlation coefficients between a virtual point source and a real source.
Table: Correlation Coefficients
Speaker   Virtual    Linear Function      Non-linear Function
Location  Position   Left      Right      Left      Right
45        -40        0.98799   0.97580    0.98655   0.97769
          -30        0.97427   0.96611    0.97534   0.96777
          -10        0.96842   0.94612    0.96858   0.94660
          0          0.95736   0.91602    0.95693   0.91709
          10         0.96374   0.95282    0.96384   0.95276
          30         0.97532   0.97095    0.97644   0.97084
          40         0.98397   0.98194    0.98268   0.98177
30        -20        0.98372   0.97316    0.98385   0.97357
          -10        0.98054   0.95640    0.98054   0.95649
          0          0.97184   0.93755    0.97171   0.93774
          10         0.97151   0.96414    0.97147   0.96448
          20         0.97844   0.97768    0.97883   0.97762
15        -10        0.99300   0.97775    0.99301   0.97787
          0          0.97821   0.95517    0.97817   0.95503
          10         0.98406   0.98576    0.98412   0.98572
10        -5         0.99326   0.97585    0.99328   0.97601
          0          0.98927   0.96086    0.98924   0.96077
          5          0.99319   0.98977    0.99312   0.98977
Very strong correlation, generally, for all spatial locations
Weaker correlation as speakers become more spatially separated
Weakest correlation when created sound is furthest from both speakers (0
degrees)
56. Spatial Separation of Loudspeakers
Correlation coefficients
for a virtually created
sound source at -10
degrees at various
spatial separations of the
loudspeakers
Correlation declines as the loudspeakers become more spatially separated
57. Example of Pseudo-Anechoic HRTFs
• Correlation coefficients are slightly better when reverberations are removed from the impulse
responses
• Linear Reverberant: 0.98054, 0.95640 (Left, Right Ears)
• Linear Pseudo-Anechoic: 0.98545, 0.96019 (Left, Right Ears)
• Nonlinear Reverberant: 0.98054, 0.95649 (Left, Right Ears)
• Nonlinear Pseudo-Anechoic: 0.98550, 0.96007 (Left, Right Ears)
58. Correlation Coefficients at all Spatial Locations for Interpolated Sound over Loudspeakers (Pseudo-Anechoic)
Table 3. Correlation Coefficients for Pseudo-Anechoic HRTFs
Speaker   Virtual    Linear Function      Non-linear Function
Location  Position   Left      Right      Left      Right
45        -40        0.96567   0.99168    0.96416   0.98421
          -30        0.96223   0.95356    0.96138   0.95815
          -10        0.96348   0.93433    0.96299   0.93902
          0          0.95471   0.89491    0.95436   0.89968
          10         0.95856   0.93652    0.95913   0.93953
          30         0.97678   0.94500    0.97825   0.94013
          40         0.99563   0.98140    0.99000   0.98018
30        -20        0.98762   0.97555    0.98767   0.97663
          -10        0.98545   0.96019    0.98550   0.96007
          0          0.97281   0.93616    0.97284   0.93623
          10         0.97927   0.96945    0.97912   0.96968
          20         0.97904   0.98188    0.97846   0.98183
15        -10        0.99608   0.98114    0.99592   0.98167
          0          0.97891   0.95475    0.97880   0.95461
          10         0.99280   0.98922    0.99287   0.98920
10        -5         0.99738   0.98141    0.99736   0.98162
          0          0.99329   0.96323    0.99333   0.96320
          5          0.99731   0.99460    0.99736   0.99462
Correlations generally are better when reverberant energy is taken out of the impulse
responses.