Inferring Speech Activity from Encrypted Skype Traffic

Normally, voice activity detection (VAD) refers to speech processing algorithms for detecting the presence or absence of human speech in segments of audio signals. In this paper, however, we focus on speech detection algorithms that take VoIP traffic instead of audio signals as input. We call this category of algorithms network-level VAD.

Traditional VAD usually plays a fundamental role in speech processing systems because of its ability to delimit speech segments. Network-level VAD, on the other hand, can be quite helpful in network management, which is the motivation for our study. We propose the first real-time network-level VAD algorithm that can extract voice activity from encrypted and non-silence-suppressed Skype traffic. We evaluate the speech detection accuracy of the proposed algorithm with extensive real-life traces. The results show that our scheme achieves reasonably good performance even when a high degree of randomness has been injected into the network traffic.

Presentation Transcript

  • Inferring Speech Activity from Encrypted Skype Traffic
    Yu-Chun Chang, Kuan-Ta Chen, Chen-Chi Wu, and Chin-Laung Lei
    Oct. 27, 2008
  • Outline
    • Introduction
    • Data description
    • Proposed scheme
    • Performance evaluation
    • Conclusion
  • Introduction
    • VAD (Voice Activity Detection)
        – An algorithm that detects the presence or absence of human speech in speech processing.
    • Source-level VAD
        – Audio signal
        – Silence suppression
    • Network-level VAD
        – Network traffic
        – Flow identification, QoS measurement
  • The differences between source-level and network-level VAD

                 source-level                              network-level
    input        audio signal                              network traffic
    location     speaker's host                            network node
    purpose      silence suppression, echo cancellation    traffic management, QoS measurement
  • Introduction (contd.)
    • Challenges
        – Payload encryption
        – Skype does not support silence suppression
    • Contribution
        – We propose a network-level VAD that can infer speech activity from encrypted and non-silence-suppressed VoIP traffic.
  • Data Description
    • Experiment setup
        [Figure: experiment setup. Labels: "Audio signal", "Network traffic", "(Chosen by Skype)"]
  • Data Description (contd.)
    • Trace summary

    Total # of traces    # TCP               # UDP
    1839                 1427                412

    # Relay node         Mean packet size    Mean time period
    1677                 109.6 bytes         612.5 sec
  • Proposed Scheme
    • The indicator of voice activity – packet size
    • Smoothing
    • Adaptive thresholding
  • The indicator of voice activity – Packet size
  • Smoothing
    • EWMA (Exponentially Weighted Moving Average)
        $P_i = \lambda Y_i + (1 - \lambda) P_{i-1}$, with $\lambda = 0.2$
        $Y$: observed packet size
        $P$: smoothed packet size
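
The EWMA recurrence on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `ewma` and the sample packet sizes are made up for the example, and seeding the filter with the first observation is an assumption, since the slide does not say how the initial smoothed value is chosen.

```python
def ewma(sizes, lam=0.2):
    """Smooth packet sizes with P_i = lam * Y_i + (1 - lam) * P_{i-1}."""
    smoothed = []
    prev = None
    for y in sizes:
        # Assumption: the first smoothed value equals the first observation,
        # because the slide does not define the initial P.
        prev = y if prev is None else lam * y + (1 - lam) * prev
        smoothed.append(prev)
    return smoothed


# Hypothetical packet sizes in bytes (not taken from the traces).
observed = [100, 100, 150, 150, 100]
smoothed = ewma(observed)
print([round(p, 2) for p in smoothed])
```

With λ = 0.2 each new packet contributes only 20% of the smoothed value, so a short burst of large packets moves the smoothed size gradually rather than all at once.
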
  • Adaptive thresholding
    [Figure; axis label: Packet Size (bytes)]
  • Adaptive thresholding (contd.)
    [Figure annotations]
        – P: 140 bytes
        – T1: 74 bytes
        – T2: 80 bytes
        – (P + T1)/2 = 107 bytes
        – (P + T2)/2 = 110 bytes
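
The annotated values on this slide are consistent with a threshold placed midway between a peak and a trough of the smoothed packet size. The sketch below only reproduces that arithmetic and applies the result to a few made-up values. It assumes that P is a peak and that T1 and T2 are troughs, and that a smoothed size above the threshold means an ON (speech) period; the transcript does not state either point, nor how peaks and troughs are found or updated. The function names are hypothetical.

```python
def midpoint_threshold(peak, trough):
    """Threshold halfway between a peak and a trough, in bytes."""
    return (peak + trough) / 2


def classify(smoothed_sizes, threshold):
    """Mark each smoothed packet size as ON (1) or OFF (0).

    Assumption: sizes above the threshold indicate speech activity.
    """
    return [1 if p > threshold else 0 for p in smoothed_sizes]


# Values annotated on the slide.
threshold_t2 = midpoint_threshold(140, 80)  # (P + T2) / 2 = 110.0
threshold_t1 = midpoint_threshold(140, 74)  # (P + T1) / 2 = 107.0

# Hypothetical smoothed packet sizes in bytes.
states = classify([75, 108, 130, 112, 90], threshold_t2)
print(threshold_t1, threshold_t2, states)
```
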
  • Adaptive thresholding (contd.)
    [Figure; axis label: Packet Size (bytes); legend: estimated ON periods]
  • Performance Evaluation
    • Number of ON periods
        $\frac{\text{Number of estimated ON periods}}{\text{Number of true ON periods}}$
  • Performance Evaluation (contd.)
    • Average length of ON periods
        $\frac{\text{Mean length of estimated ON periods}}{\text{Mean length of true ON periods}}$
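
The two ON-period ratios (number of ON periods and average length of ON periods) can be computed from a pair of 0/1 state sequences as sketched below. This is an illustrative sketch rather than the evaluation code used in the study: the helper names are hypothetical, an ON period is taken to be a maximal run of consecutive 1s, and lengths are counted in samples because the transcript gives no time unit. The input sequences are the example ones from the state-correctness slide.

```python
from itertools import groupby


def on_period_lengths(states):
    """Lengths of maximal runs of 1s (ON periods) in a 0/1 sequence."""
    return [len(list(run)) for value, run in groupby(states) if value == 1]


def on_period_ratios(true_states, est_states):
    """Return (count ratio, mean-length ratio), estimated over true."""
    true_runs = on_period_lengths(true_states)
    est_runs = on_period_lengths(est_states)
    if not true_runs or not est_runs:
        raise ValueError("both sequences need at least one ON period")
    count_ratio = len(est_runs) / len(true_runs)
    mean_true = sum(true_runs) / len(true_runs)
    mean_est = sum(est_runs) / len(est_runs)
    return count_ratio, mean_est / mean_true


# Example sequences from the state-correctness slide.
M = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
N = [0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1]

count_ratio, length_ratio = on_period_ratios(M, N)
print(count_ratio, round(length_ratio, 4))
```

For these sequences the true activity has 4 ON periods with mean length 3 and the estimate has 5 ON periods with mean length 2.6, so the count ratio is above 1 and the length ratio is below 1: the estimate splits speech into more, shorter bursts.
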
  • Performance Evaluation (contd.)
    • State correctness
        $\frac{M \text{ and } N}{M \text{ or } N}$, where ON period -> 1 and OFF period -> 0
        True speech activity (M):      0 0 1 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 0
        Estimated speech activity (N): 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0 0 0 1 1 1 0 1 1 1
        M and N:                       0 0 1 1 1 0 0 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0
        M or N:                        0 1 1 1 1 0 0 0 1 1 0 1 1 0 0 0 0 1 1 1 1 1 1 1
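
Using the sequences on this slide, the state-correctness ratio can be worked through as below. The sketch assumes the ratio is the number of positions where both sequences are ON divided by the number of positions where at least one is ON, which is how the "M and N over M or N" notation reads; the function name and the handling of two all-OFF sequences are choices made for the example.

```python
def state_correctness(true_states, est_states):
    """Count of (M and N) positions divided by count of (M or N) positions."""
    both = sum(m & n for m, n in zip(true_states, est_states))
    either = sum(m | n for m, n in zip(true_states, est_states))
    if either == 0:
        # Assumption: two all-OFF sequences are treated as fully correct.
        return 1.0
    return both / either


M = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
N = [0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1]

score = state_correctness(M, N)
print(round(score, 4))
```

The "M and N" row contains 10 ones and the "M or N" row contains 15, so the score for this example is 10/15, or about 0.67.
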
  • Performance Evaluation (contd.)
    • State correctness
  • Conclusion
    • We propose a network-level VAD, which infers speech activity from network traffic instead of audio signals.
    • We propose a VAD algorithm that can extract voice activity from encrypted and non-silence-suppressed VoIP network traffic.
  • Thanks
  • Backup slides
  • VAD on audio signaling
    volume = $10 \log\left(\sum_i S_i^2\right)$
    Static threshold: 183 dB
    J.-S. R. Jang, "Audio signal processing and recognition," http://www.cs.nthu.edu.tw/jang
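
The volume formula on this backup slide can be sketched as a per-frame computation followed by a static threshold test. Several details are assumptions, labeled as such in the code: the slide does not state the logarithm base (base 10 is used here as the usual decibel convention), the frame length, or the sample scale, so the 183 dB figure is not hard-coded and the threshold is left as a required argument. The function names are hypothetical.

```python
import math


def frame_volume(samples, base=10):
    """Volume of one audio frame: 10 * log(sum of squared samples).

    Assumption: base-10 logarithm; the slide does not give the base.
    """
    energy = sum(s * s for s in samples)
    if energy <= 0:
        return float("-inf")  # a silent frame has no finite log volume
    return 10 * math.log(energy, base)


def is_speech(samples, threshold):
    """Static-threshold decision: speech if the frame volume exceeds it."""
    return frame_volume(samples) > threshold


# Hypothetical frames of raw sample values.
loud_frame = [6, 8]    # energy 100
quiet_frame = [1, 0]   # energy 1
print(frame_volume(loud_frame), frame_volume(quiet_frame))
print(is_speech(loud_frame, threshold=10), is_speech(quiet_frame, threshold=10))
```
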
  • I am a student of National Taiwan University.
  • Performance Evaluation (contd.)
    • Number of ON periods
  • Performance Evaluation (contd.)
    • Average length of ON periods
  • Performance Evaluation (contd.)
    • State correctness
        $\frac{M \cap N}{M \cup N}$
        True speech activity (M):      0 0 1 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 0
        Estimated speech activity (N): 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0 0 0 1 1 1 0 1 1 1