SFMap: Inferring Services over
Encrypted Web Flows using
Dynamical Domain Name Graphs
TMA 2015
Tatsuya Mori1, Takeru Inoue...

SFMap (TMA 2015)

We present the Service-Flow Map (SFMap) framework, which statistically estimates the hostname of an HTTPS server given a pair of client and server IP addresses. We evaluate the performance of SFMap through extensive analysis of real packet traces collected at two locations of different scales. We demonstrate that SFMap achieves good estimation accuracy and outperforms a state-of-the-art approach.

  1. 1. SFMap: Inferring Services over Encrypted Web Flows using Dynamical Domain Name Graphs TMA 2015 Tatsuya Mori1, Takeru Inoue2, Akihiro Shimoda3, Kazumichi Sato3, Keisuke Ishibashi3, and Shigeki Goto1 1 Waseda University 2 NTT Network Innovation Laboratories 3 NTT Network Technology Laboratories
  2. 2. Background (1): Era of the Web. Change of Internet traffic from 2002/12/1 to 2012/12/1 (figure; source: WIDE MAWI Project, http://mawi.wide.ad.jp, samplepoints B and F). Many primary Internet services have shifted to the Web (HTTP): everything over HTTP. Era of P2P → Era of Web
  3. 3. Background (2): Encrypting the Web. D. Naylor et al., The Cost of the “S” in HTTPS. Proceedings of ACM CoNEXT, 2014. • Deploying HTTPS is no longer costly • A significant portion of web traffic is now encrypted
  4. 4. YouTube video over HTTPS!
  5. 5. Netflix started encrypting streams
  6. 6. Era of WebSocket
  7. 7. HTTP = non-secure!
  8. 8. ISPs need to understand the traffic mix • to figure out what to control in the presence of congestion. – Shaping all HTTP flows is too coarse-grained. – Shaping flows from a range of IP addresses is also too coarse-grained. • to know the demand of end users – What types of services are consuming network resources. → Can be used to rethink architectures or business models: peering policy, cache deployment, WAN optimization, CCN/ICN • Obstacle: coping with HTTPS
  9. 9. HTTP vs. HTTPS • HTTP: – The HTTP header contains URL information (e.g., http://www.example.com) • HTTPS: – The entire HTTP message, including the header, is encrypted. No URL information is available. (Figure: an HTTP packet consists of IP header, TCP header, HTTP header, and HTTP payload; in an HTTPS packet, everything after the IP and TCP headers is under SSL/TLS encryption.)
  10. 10. Solution 1: Server IP addresses • Many IP addresses can be reverse-resolved (PTR record) • Many IP addresses are not configured with PTR records. • A single IP address can be associated with many distinct FQDNs (cloud, hosting services, etc.)
  11. 11. Solution 2: SSL/TLS certificates • The server's public key certificate is exchanged during the SSL/TLS handshake. The certificate should contain the domain name of the server. • An organization can register a single certificate for many subdomains, a so-called wildcard certificate – E.g., *.google.com
  12. 12. Solution 3: SNI extension • The SNI (Server Name Indication) extension of TLS can be used to obtain the FQDN of an HTTPS server. • Many client/server implementations have not adopted SNI yet. – In our dataset, roughly half of HTTPS clients did not use the SNI extension.
  13. 13. Solution 4: Decrypting HTTPS • Anti-virus software or firewall products have mechanisms to intercept HTTPS traffic • They use self-signed certificates to work as a transparent HTTP(S) proxy. – Same as a MITM (man-in-the-middle) attack – Requires installing certificates for each OS/application • [cf] IETF Explicit Trusted Proxy in HTTP/2.0 (I-D expired)
  14. 14. Goal • Estimate server hostnames of HTTPS traffic – Server hostnames are a good hint for estimating the services provided by the server – E.g., www.apple.com, itunes.apple.com, … • Achieve better performance than the existing solution (DN-Hunter)
  15. 15. Idea • Leverage DNS name resolutions that precede HTTPS transactions – Labeling the data plane using the control plane – This is not a simple task, as we will describe soon. – [cf] state of the art = DN-Hunter (IMC 2012) • Use statistical inference when measurement is incomplete – DNS resolutions can be missed for various reasons
  16. 16. Illustration of DNS approach (figure): Client C1 (10.20.2.3, NTT-COM) asks the DNS resolver “www.apple.com?” and the answer 61.213.146.4 is observed for 10.20.2.3. The HTTPS flow between 10.20.2.3:31587 and 61.213.146.4:443 (HTTPS server S1, Akamai) is therefore labeled www.apple.com (APPLE).
  17. 17. Three practical challenges: 1) CNAME tricks 2) Incomplete measurements 3) Dynamicity, diversity, and ambiguity
  18. 18. 1) CNAME tricks • Modern CDN providers make heavy use of CNAME tricks to optimize content distribution • We need to keep track of not only client/server IP addresses and hostnames, but also intermediate CNAMEs (Figure: hostname n1 = www.ieee.org (TTL=60); intermediate CNAMEs m1, m2 = www.ieee.org.edgekey.net (TTL=11002) and e1630.c.akamaiedge.net (TTL=20); server s1 = 23.2.132.181)
  19. 19. 2) Incomplete measurements • Various DNS caching mechanisms exist in the wild – Browsers/apps – OS – Home routers with DNS resolvers – DNS resolvers (organization/ISP/open) • From the viewpoint of ISPs, DNS queries originating from end users can be missed due to intermediate caching mechanisms
  20. 20. 3) Dynamicity, diversity, and ambiguity • A pathological/popular example (figure: nodes s1, m1, n1, then s1, s2, m1, n1, n2) Observation 1: n1 → s1 Observation 2: n2 → s2
  21. 21. SFMap (Service-Flow Map) (Figure: input = DNS queries/responses and an HTTPS flow with flow key {s, c}; output = hostname n; components = DNG (Sec 3.2), Freq. Counter, Estimator (Sec 3.3), Learner, Updater (Sec 3.4)) • Learning with monitored queries: building the DNG (Domain Name Graph) • Hostname estimation using the DNG • Use maximum likelihood estimation when necessary
  22. 22. Illustration of DNG (figure): per-client graphs (local DNGs) for clients c1 and c2, built from servers s1 to s3, intermediate CNAMEs m1 and m2, and hostnames n1 to n6, are merged into a global graph (union DNG)
  23. 23. Overview of hostname estimation algorithm (1) • Get the client/server IP addresses (C, S) from an HTTPS flow • Search the DNG for the set of hostnames N corresponding to (C, S) – Enumerate the edge nodes N that are reachable on a path from C to S on the DNG – Also consider TTL expiration • If |N| = 1, it is the estimated hostname
  24. 24. An example (figure: client c2; servers s1, s3; intermediate CNAMEs m2, m3; hostnames n5, n6, n7) (c2, s1) → estimation = n5 (c2, s3) → candidates = n6, n7
  25. 25. Overview of hostname estimation algorithm (2) • If there are multiple candidates, sort them in descending order of likelihood – Uncertain events → use frequencies • The second and third candidates can be informative – Note: The statistical inference can be extended to Bayes estimation that uses P(n) (a priori probability)
  26. 26. Updating DNG • The states/statistics of the DNG are updated online when a DNS query is observed
  27. 27. Dataset • LAB: – A small LAN used by a research group • PROD: – A middle-scale production network
  28. 28. Scales of DNGs (12 hours long)
  29. 29. Estimation accuracy (1) (figure: UNION DNG without TTL expiration; series: exact match, public suffix match)
  30. 30. Estimation accuracy (2) Accuracies of top-3 estimations (UE-NTE). The top-3 ranked hostnames were similar in many cases; e.g., pagead2.googlesyndication.com, pubads.g.doubleclick.net, googleads.g.doubleclick.net
  31. 31. Discussion • Sources of inevitable misclassification – DNS implementations that ignore TTL expiration • They keep holding old information – Mobility • DNS names could be resolved at a different vantage point – Hardcoded IP addresses • Some gaming apps had such a mechanism
  32. 32. Discussion (cont.) • Scalability – Was not an issue for our datasets – The size of the DNG depends on the number of client IP addresses • Some aging mechanism should be incorporated for much larger-scale DNGs (future work) • URL = hostname + path. – How can we deal with the path? – Is a standard mechanism needed to explicitly expose the path, like SNI?
  33. 33. Summary • The SFMap framework estimates hostnames (~services) of HTTPS traffic using past DNS queries • Key ideas: use of the DNG and statistical inference • SFMap achieved better accuracy than the state-of-the-art work (DN-Hunter) – Exact match: 90-92% accuracy – Public suffix match: 94-95% accuracy – Top-3 hit: 97-98% accuracy
  34. 34. Acknowledgements • This work was supported by JSPS KAKENHI Grant Number 25880020.
  35. 35. Existing research: DN-Hunter • Bermudez et al., “DNS to the Rescue: Discerning Content and Services in a Tangled Web”, ACM IMC 2012
  36. 36. Comparison with DN-Hunter (distributed monitoring / statistical estimation) • DN-Hunter: △ / × • SFMap: ○ / ○
  37. 37. Distributed monitoring (figure): SFMap combines DNS traffic (control plane) seen at the DNS resolver with sampled NetFlow (data plane) from the routers labeled ER, CR, and GW
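
The hostname search described in slides 23 and 24 of the transcript can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class `DomainNameGraph`, its methods, the record tuple layout `(owner_name, value, ttl)`, and the exact edges and TTL values in the toy data are all assumptions made for the example. It keeps one graph per client, follows CNAME chains backwards from the server IP, and optionally ignores TTL expiration (as in the "without TTL expiration" variant mentioned in the evaluation).

```python
from collections import defaultdict


class DomainNameGraph:
    """Toy per-client DNG: edges run from a record's value back to its owner name."""

    def __init__(self):
        # parents[client][value] = {owner_name: expiry_time}
        self.parents = defaultdict(lambda: defaultdict(dict))

    def observe(self, client, records, now):
        """Add one DNS response: records is a list of (owner_name, value, ttl)."""
        for owner, value, ttl in records:
            self.parents[client][value][owner] = now + ttl

    def candidates(self, client, server, now, use_ttl=True):
        """Hostnames reachable from `server` in the graph learned for `client`."""
        graph = self.parents.get(client, {})
        found, seen, stack = set(), {server}, [server]
        while stack:
            node = stack.pop()
            edges = graph.get(node, {})
            if not edges:
                # Nothing points further back: a queried hostname (unless it is
                # the server itself, which means nothing was learned).
                if node != server:
                    found.add(node)
                continue
            for owner, expiry in edges.items():
                if (not use_ttl or expiry >= now) and owner not in seen:
                    seen.add(owner)
                    stack.append(owner)
        return found


# Hypothetical edges mirroring the example slide: n5 -> s1, n6 -> m2 -> s3, n7 -> m3 -> s3.
dng = DomainNameGraph()
dng.observe("c2", [("n5", "s1", 60)], now=0)
dng.observe("c2", [("n6", "m2", 300), ("m2", "s3", 20)], now=0)
dng.observe("c2", [("n7", "m3", 300), ("m3", "s3", 20)], now=0)

print(sorted(dng.candidates("c2", "s1", now=10)))  # ['n5']
print(sorted(dng.candidates("c2", "s3", now=10)))  # ['n6', 'n7']
```

With a single candidate the estimate is immediate; with several, a ranking step is needed. Note that once the records toward `s3` expire, the TTL-aware search returns nothing, which is one reason a variant without TTL expiration can be useful when DNS observations are incomplete.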

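
When several hostnames remain as candidates (slide 25 of the transcript), the slides say they are sorted by likelihood estimated from frequencies, with an optional extension to Bayes estimation using a prior P(n). The sketch below shows one way that could look. It assumes, purely for illustration, that the likelihood of server s given hostname n is the relative frequency with which n resolved to s; the class `FrequencyRanker`, its method names, and the sample counts are hypothetical and not taken from the paper.

```python
from collections import Counter


class FrequencyRanker:
    """Rank candidate hostnames for a server by observed resolution frequency."""

    def __init__(self):
        self.pair_counts = Counter()  # (hostname, server) -> number of observations
        self.name_counts = Counter()  # hostname -> number of observations

    def update(self, hostname, server):
        """Record that `hostname` was observed resolving to `server`."""
        self.pair_counts[(hostname, server)] += 1
        self.name_counts[hostname] += 1

    def likelihood(self, hostname, server):
        """Relative frequency of `server` among all resolutions of `hostname`."""
        total = self.name_counts[hostname]
        return self.pair_counts[(hostname, server)] / total if total else 0.0

    def rank(self, candidates, server, prior=None):
        """Sort candidates by likelihood, or by likelihood * prior if one is given."""
        def score(name):
            p = self.likelihood(name, server)
            return p * prior.get(name, 0.0) if prior is not None else p
        # Ties are broken alphabetically so the output is deterministic.
        return sorted(candidates, key=lambda name: (-score(name), name))


ranker = FrequencyRanker()
for hostname, server, times in [("n6", "s3", 3), ("n6", "s4", 1),
                                ("n7", "s3", 1), ("n7", "s5", 3)]:
    for _ in range(times):
        ranker.update(hostname, server)

print(ranker.rank({"n6", "n7"}, "s3"))                                # ['n6', 'n7']
print(ranker.rank({"n6", "n7"}, "s3", prior={"n6": 0.1, "n7": 0.9}))  # ['n7', 'n6']
```

Returning the full ranked list rather than only the top entry matches the observation in the slides that the second and third candidates can still be informative.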