Reputation Digital Vaccine: Reinventing Internet Blacklists



SOURCE Seattle 2011 - Marc Eisenbarth


Published in: Technology



  • 1. Abstract With increasing inventiveness and agility, cutting-edge Internet attack techniques such as “fast fluxing” and advanced persistent threats challenge the effectiveness of traditional blacklists. The challenge undertaken by HP TippingPoint is to rapidly counteract attacks such as these by classifying as much of the Internet as possible along a continuum between reputable and disreputable. Our solution implements a number of novel methods to identify and track Internet hosts, which in turn provide intelligence to the Reputation Digital Vaccine (Reputation DV) Service. Reputation DV provides IPv4, IPv6 and Domain Name System (DNS) security intelligence feeds from a global reputation database, enabling customers to actively enforce and manage reputation security policies using the HP TippingPoint Intrusion Prevention System (IPS) Platform. Bio Marc Eisenbarth recently noticed the word "Architect" has been appended to his business cards, and while not entirely sure what that means, he has continued to do what he has been doing for the last five years, namely improving the HP TippingPoint Intrusion Prevention System (IPS) as a member of the DVLabs Advanced Security Intelligence team. Prior to this, he managed "cyber liability" at a US defense contractor for five years and completed a graduate program in Computer Science at Columbia University. Off the clock, he is a "hardware guy" who enjoys releasing various do-it-yourself projects to the general public. HP Confidential 1 16 June 2011
  • 3. Problem Statement A blacklist is simply a list of Internet hosts from which all traffic should be discarded indiscriminately. Challenges with traditional blacklists exist in both the development and the implementation of the blacklist.
  • 4. Problem Statement The greatest technical challenge for blacklists is keeping abreast of a rapidly changing threat landscape. Attack techniques such as “fast fluxing” [1] and advanced persistent threats [2] are particularly difficult to identify and protect against because they represent two extremes of the time scale. In the former, a single Internet host constantly varies its IP address and DNS name in order to avoid detection. In the latter, a single stealthy attack is carried out over a long period of time, using specific domain knowledge of the target. Other complications include DNS entries with an inordinately large number of A or NS records, and similar mechanisms that resist takedown and complicate traditional mitigation strategies. Detecting these techniques requires storing, evaluating, and comparing against historical state information.
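
To make the point about historical state concrete, here is a minimal sketch of one way to keep per-domain resolution history and flag names whose address set churns quickly. The class name, the one-hour window, and the threshold of five addresses are illustrative assumptions, not details of the Reputation DV implementation.

```python
from collections import defaultdict


class DnsHistory:
    """Hypothetical store of (timestamp, ip) observations per domain."""

    def __init__(self):
        self._seen = defaultdict(list)  # domain -> [(timestamp, ip), ...]

    def record(self, domain, ip, timestamp):
        """Remember that `domain` resolved to `ip` at `timestamp` (seconds)."""
        self._seen[domain].append((timestamp, ip))

    def distinct_ips(self, domain, since):
        """Return the set of addresses observed for `domain` at or after `since`."""
        return {ip for ts, ip in self._seen.get(domain, []) if ts >= since}

    def looks_fast_flux(self, domain, now, window=3600, max_ips=5):
        """Flag a domain that used more than `max_ips` addresses in the window.

        Both `window` and `max_ips` are assumed values chosen for illustration.
        """
        return len(self.distinct_ips(domain, now - window)) > max_ips


history = DnsHistory()
# A name that resolves to a new address every minute (documentation range).
for minute in range(8):
    history.record("flux.example", f"192.0.2.{minute + 1}", minute * 60)
# A name that keeps resolving to the same address.
for minute in range(8):
    history.record("static.example", "198.51.100.7", minute * 60)

flux_flag = history.looks_fast_flux("flux.example", now=480)
static_flag = history.looks_fast_flux("static.example", now=480)
```

A count of distinct addresses is only one signal; legitimate services can also rotate addresses, so a check like this would feed a score rather than decide on its own.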
  • 5. Problem Statement The major shortfall of existing blacklists is that they do not classify or discriminate via a relative or absolute reputation score, nor do they offer a confidence metric. Furthermore, traditional blacklists assess reputation simplistically, using a binary classifier rather than a continuum of risk and reputation.
  • 6. Problem Statement The source and quality of the data used to compile a blacklist are often suspect. The data typically originates from email server logs, firewall logs, and DNS responses, all of which provide meaningful information but fall far short of profiling a modern attacker for inclusion in a truly reliable and trusted blacklist. To complicate the problem, people who purchase this data rather than stand up internal systems to compile and maintain blacklist entries run into poor-quality data from blacklist vendors, typically because of increasing pressure to deliver ever larger numbers of entries. Some vendors also balk at the question of whether their lists are of suitable quality for blocking scenarios in large enterprise networks. Even among highly specialized, trusted feeds we see disagreement. For example, and produce disjoint blacklists for Zeus botnets, which admittedly are among the more difficult botnets to track and attribute. We assume that this is due to the inherently limited visibility of both monitoring approaches.
  • 7. Problem Statement Maintaining a reliable, timely, and actionable blacklist that can be enforced in today’s enterprise networks is challenging. The problem to solve is how to better scrutinize host systems before adding them to a blacklist, so as to minimize false positives and to reassure skeptical customers who want greater transparency into the implications of implementing the blacklist. To achieve this level of improvement, blacklist research must perform additional intelligence gathering out-of-band and must analyze attacks that occur across multiple, disparate network flows over an arbitrary amount of time. Finally, active interaction with a suspected malicious host is often needed to confirm its disreputable intent. These reasons explain why we chose not to do this analysis inline using the existing IPS engine and instead invented the HP TippingPoint Reputation Digital Vaccine (Reputation DV) service. It automates the identification and blocking of known bad traffic before it reaches the IPS deep packet inspection engine, thus relieving the load on the IPS and deterring traffic from disreputable host systems.
  • 9. Question This sophisticated use of DNS by modern attackers is very much a response to the simplistic attempts at DNS blacklists that began more than 10 years ago. Attackers are generally lazy, and their innovation is driven by necessity. At this point in time, technology is responding, and it seems to be responding favorably, as evidenced by customers who are blocking millions of reputation-sourced events per day.
  • 10. Question The next logical step is to consider replacing your IDS with a reputation-based system. However, if you really mean intrusion detection system, I’d argue that this does not make sense, because of the historical nature of IDS and its interest in answering questions such as “what happened, and why did it happen?” [emphasis on past tense]. Now, if you want to talk about IPS, with a focus on the timely enforcement component that IPS brings to the table, then the conversation gets interesting. In the context of IPS, reputation focuses on the source of a threat, not the vulnerability or payload of the attack. As such, it’s not a replacement for an IPS, but another tool to provide preemptive threat protection. In some senses, reputation provides a cloudy crystal ball that can forecast, to some extent, what attackers will do and how to catch them, because it is vulnerability agnostic. Again, this allows reputation-based approaches to outpace attackers by focusing on their infrastructure rather than their wares.
  • 12. Approach The first step in generating the Reputation DV package is the continual acquisition of potentially malicious hosts from various external intelligence feeds (Figure 1). These feeds fall into three broad categories: commercial, open-source, and automated customer submissions. Without exception, the first two feed types suffer from the pitfalls and limitations outlined in the previous section. The third category is split between customers who have elected to share security event data and entities that have allowed collocation of an HP TippingPoint controlled IPS as part of our Lighthouse program. The data received from these sources is unique in that it is not simply low-level event data, but a set of context-rich security events associated with particular HP TippingPoint Digital Vaccine (DV) filters. This allows high-level correlation between attacks and attackers, despite their efforts to evade detection by manipulating the fluid assignment scheme of IP addresses and DNS host names.
  • 13. Approach The process of placing a host on a blacklist uses a series of modules, each of which generates data we use to classify the host. The first module is responsible for tracking content on these hosts: it retrieves copies of malicious documents, scripts, and executables, and tracks changes to these files. This data is used to compute a similarity metric, which is then used to cluster hosts that are hosting related malware and exploits. The last is a series of modules collectively called “meta”, a collection of active and passive intelligence gathering techniques. These techniques define more data points for each entry, which are ultimately used to mine additional interrelationships between entries. The passive intelligence techniques that we employ include search engine results, DNS and whois information, and the like. The active intelligence techniques include, but are not limited to, port scanning, banner grabbing, content spidering, operating system fingerprinting, uptime tracking, and even high-interaction honeypots. This monitoring results in a rich collection of out-of-band intelligence that can be warehoused and, at a later point, compared against current state.
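
The slide above does not say which similarity metric is used. As a rough sketch of the clustering idea, the following assumes each host is summarized by the set of hashes of the files retrieved from it, uses Jaccard similarity as a stand-in metric, and groups hosts whose similarity meets an assumed threshold of 0.5. The host names and hash labels are made up.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_hosts(host_files, threshold=0.5):
    """Group hosts whose file-hash sets are pairwise similar (single linkage).

    `host_files` maps a host name to a set of content hashes. The threshold
    is an assumption for illustration only.
    """
    parent = {host: host for host in host_files}

    def find(host):
        # Union-find lookup with path halving.
        while parent[host] != host:
            parent[host] = parent[parent[host]]
            host = parent[host]
        return host

    for a, b in combinations(host_files, 2):
        if jaccard(host_files[a], host_files[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for host in host_files:
        clusters.setdefault(find(host), set()).add(host)
    return sorted(sorted(members) for members in clusters.values())


observed = {
    "a.example": {"hash1", "hash2", "hash3"},
    "b.example": {"hash2", "hash3", "hash4"},
    "c.example": {"hash9"},
}
clusters = cluster_hosts(observed)
```

Here `a.example` and `b.example` share two of four distinct hashes and land in one cluster, while `c.example` stays alone. Exact-hash overlap misses lightly modified files, so a real system would likely need a fuzzier content comparison.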
  • 14. Approach (continued) Inline monitoring is expensive and difficult to scale for global coverage. Our goal is to monitor an arbitrary system by detecting outwardly visible changes wherever possible. The motivation for this module lies in the desire to increase the number of Internet systems under surveillance. Unlike many organizations, we do in fact have a large network of sensors that monitor the Internet both inline and through network span ports. While this is useful in its own right and is currently scaled out far enough for the results to be considered statistically viable, the chief limitation of this approach is that only traffic which crosses this sphere of inspection can be considered for analysis. Born out of this realization was the concept of an active monitoring system that could reach out and query an arbitrary host and could scale to the point of tracking the Internet as a whole. Currently our systems track around four billion events annually, which are distilled into a set of approximately two million IP addresses and half a million DNS entries. These are distributed to the end user and comprise the Reputation DV. To support this massive effort, we developed the extensible architecture outlined above, which is responsible for constructing, maintaining and distributing this blacklist of disreputable Internet hosts.
  • 15. Approach Once module-based classification work is complete, there is an enormous amount of information associated with each entry that can now be consumed by the “rule engine” module, which exists to further classify and score each entry. At the heart of the rule engine is the support-vector machine (SVM) algorithm [3]. SVMs are a set of related supervised learning methods that analyze data and recognize patterns. The advantage that SVMs offer is a soft margin classifier which is able to reduce a single multiclass problem into multiple binary classifications. In other words, it is possible to operate on arbitrary data types and to reduce the chance of overfitting the data by accounting for mislabeled examples. Additional algorithms are used to assign various scores to the blacklist entries. These scores represent our assessment of the host’s potential to generate malicious behavior, along with our confidence that it is not a false positive. We also distribute tags composed of the metadata mined above, which serve to classify each entry and provide useful data, such as country of origin, attack family and reason for inclusion.
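
For readers unfamiliar with soft-margin SVMs, here is a small, self-contained sketch of a linear one trained by subgradient descent on the regularized hinge loss. It is not the rule engine's implementation: the two features (a suspicious-port fraction and a DNS churn rate), the toy data, and the hyperparameters are all assumptions made for illustration.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit weights and bias minimizing lam/2 * |w|^2 + hinge loss.

    `labels` must be +1 (disreputable) or -1 (reputable). The soft margin
    comes from the hinge loss: points inside the margin are penalized
    rather than forbidden, so a few mislabeled examples are tolerated.
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            # Regularization step: shrink the weights toward zero.
            weights = [w * (1.0 - lr * lam) for w in weights]
            if margin < 1.0:
                # Hinge-loss subgradient step for a margin violation.
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
    return weights, bias


def predict(weights, bias, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1


# Hypothetical features: (suspicious-port fraction, DNS churn rate).
samples = [(0.9, 0.8), (0.8, 0.9), (1.0, 0.7),
           (0.1, 0.2), (0.2, 0.1), (0.0, 0.3)]
labels = [1, 1, 1, -1, -1, -1]

weights, bias = train_linear_svm(samples, labels)
bad_host = predict(weights, bias, (0.95, 0.9))
good_host = predict(weights, bias, (0.05, 0.1))
```

The signed distance from the hyperplane could serve as a raw input to a score, but how the actual service maps classifier output to scores and confidence is not described in the slides.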
  • 16. Approach Administrators can leverage the tags and scores to build custom filters that tailor the blacklist to their company’s business and risk management requirements. An example filter would read, “block all botnets but not spam originating from Azakstan with a score greater than 80”. The flexibility that these filters offer gives administrators a level of transparency and control that traditional blacklists cannot provide. Bad traffic is dropped before it reaches the IPS deep packet inspection engine, resulting in efficient, scalable policy enforcement.
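
The example filter above can be read as a conjunction of tag, country, and score conditions. The sketch below encodes one such reading; the `Entry` and `TagFilter` types and their field names are hypothetical and do not reflect the product's actual filter syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical blacklist entry with a score and descriptive tags."""
    address: str
    score: int
    country: str
    tags: frozenset


@dataclass(frozen=True)
class TagFilter:
    """Hypothetical filter: block when every condition below holds."""
    required_tags: frozenset
    excluded_tags: frozenset
    country: str
    min_score: int

    def blocks(self, entry):
        return (
            self.required_tags <= entry.tags          # has all required tags
            and not (self.excluded_tags & entry.tags)  # has no excluded tag
            and entry.country == self.country
            and entry.score > self.min_score
        )


# One reading of: block all botnets but not spam originating from
# Azakstan with a score greater than 80.
example_filter = TagFilter(
    required_tags=frozenset({"botnet"}),
    excluded_tags=frozenset({"spam"}),
    country="Azakstan",
    min_score=80,
)

botnet_host = Entry("192.0.2.10", 95, "Azakstan", frozenset({"botnet"}))
spam_host = Entry("192.0.2.11", 95, "Azakstan", frozenset({"botnet", "spam"}))
low_score_host = Entry("192.0.2.12", 60, "Azakstan", frozenset({"botnet"}))

decisions = [example_filter.blocks(e)
             for e in (botnet_host, spam_host, low_score_host)]
```

Keeping the filter as data rather than code is what would let an administrator inspect and adjust why a given entry is blocked.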
  • 18. Observations It became clear very early on that a whitelist mechanism was needed to train and validate our algorithms. Alexa, which offers a ranking of the top million domains, serves as a good place to start, along with search engine results. However, looking at the top 250,000 domains, we found that a notable percentage would show up in our blacklist algorithms from time to time. It’s important to note that this list contains popular file sharing, porn-related, and unethical advertising websites, which often deserve disreputable scores. In further investigating some of these domains, we found that they are often hosted in networks that contain proven malicious domains, and thus there is some validity to a certain amount of “guilt by association”. This idea of the reputation of an ISP is something that we are looking to explore further. It has already made the news on a few blogs and elicited a response from a known German ISP, which appeared near the top of this list. All this is to say that reputation-based blocking lacks granularity. For cases where a site such as Google is delivering malicious content, the IPS signature engine is much more adept. For the less obvious cases, compared to Google anyway, a mixture of reputation and filter technologies proves promising, as we shall explore later.
  • 19. Observations (continued) A corollary to this is perhaps that vendors can artificially inflate their reputation lists by including large numbers of addresses that have a low probability of causing business interruption, but are not necessarily malicious. However, we believe this approach to be specious at best. Consider an online retailer that deploys a blacklist: blocking these hosts results in lost revenue. In fact, it’s conceivable that the retailer in this scenario wouldn’t require a squeaky clean bill of reputation health in order to do business with certain potential customers, and this use case helps underscore the flexibility of our approach.
  • 20. Observations After an initial learning period, we see that the rate of new DNS entries is relatively flat. In fact, it doesn’t take long to discover that a relatively small percentage of the Internet is actively queried. Given a large ISP to sample from, we see that this list converges fairly rapidly, and we would expect the same convergence in smaller networks given additional time. Anomalous deviation from this trend suggests malicious activity. A notable exception is content distribution networks, such as Akamai, which behave very similarly to fast-flux networks: the number of unique IP addresses keeps pace with the number of new, unique domains, and each new DNS entry is a new IP and a new child domain. The reason is in fact very similar to why fast-flux networks behave in this fashion, namely geo-based high availability. In the malicious case, the new entries are compromised hosts, each of which has a new IP address in a different topological location in the network and is given a new domain name for tracking purposes.
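
The convergence described above can be measured by counting, per observation interval, how many queried names have never been seen before. This is a minimal sketch; the function name and the toy intervals are illustrative, and real input would come from DNS query logs.

```python
def new_names_per_interval(intervals):
    """Count first-seen DNS names in each interval of observed queries.

    `intervals` is a sequence of iterables of names, one per time bucket.
    A flattening count suggests the vantage point has learned most of the
    names its users actually query.
    """
    seen = set()
    counts = []
    for names in intervals:
        fresh = set(names) - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts


observed = [
    ["a.example", "b.example", "c.example"],
    ["a.example", "b.example", "d.example"],
    ["a.example", "c.example", "d.example"],
]
counts = new_names_per_interval(observed)
```

For the toy data the counts are 3, 1 and 0. A later interval whose count jumps well above the settled baseline is the kind of deviation the slide refers to, keeping in mind that content distribution networks produce similar jumps.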
  • 21. Observations (continued) On the opposite side of the spectrum, we note that popular sites tend to use a relatively small number of IP addresses and have a large number of associated domain names. Obvious reasons for this include schemes such as virtual hosting; less obvious reasons point to the fact that more sites are encoding information into the domain name itself. Finally, Dynamic DNS and publicly routed DHCP networks form a very interesting study in and of themselves. Yet two observations make identification and tracking much more tractable than it seems at first: the IP address is often encoded into the DNS entry itself, and there is a one-to-one relationship between dynamic DNS names and IP addresses.
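
As an illustration of an IP address encoded in a DNS entry, the sketch below pulls a dotted or dashed IPv4 address out of a host name. The naming pattern (four decimal groups separated by `-` or `.`) is an assumption about how some providers name dynamic hosts, and the example names are invented.

```python
import ipaddress
import re

# Four decimal groups separated by '-' or '.', not adjacent to other digits.
_EMBEDDED_IPV4 = re.compile(
    r"(?<!\d)(\d{1,3})[-.](\d{1,3})[-.](\d{1,3})[-.](\d{1,3})(?!\d)"
)


def embedded_ipv4(hostname):
    """Return the first valid IPv4 address embedded in `hostname`, or None."""
    for match in _EMBEDDED_IPV4.finditer(hostname):
        candidate = ".".join(match.groups())
        try:
            return ipaddress.IPv4Address(candidate)
        except ipaddress.AddressValueError:
            # Looks like an address but is out of range, e.g. 999.1.1.1.
            continue
    return None


found = embedded_ipv4("host-192-0-2-10.dyn.example.net")
missing = embedded_ipv4("www.example.com")
invalid = embedded_ipv4("pool-999-1-1-1.example.net")
```

Comparing the embedded address with the address the name currently resolves to is one cheap consistency check; names that encode the address in reversed octet order or in hexadecimal would need additional patterns.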
  • 22. Observations When attackers do not reuse domain names and address space, always procuring new resources instead, reputation-based tracking becomes difficult. This becomes much more problematic as the shift towards IPv6, with its vastly larger address space, occurs.
  • 25. Work In Progress Combining reputation information with filters into a hybrid policy model increases the performance and accuracy of the overall security solution. For example, imagine being able to push a policy that simply states “Block all compound document types originating from China”, or to instruct a filter that might not be “recommended on” in the default configuration to block only if the host has a reputation score below a given threshold. This additional information allows customers to justify a more aggressive, and thus more effective, security policy.
  • 26. Work In Progress We believe that we can not only offer protection for the subscribers and consumers of cloud computing and large data center deployments, but also have the unique capability to protect the reputation of the services themselves by vetting outbound traffic, thereby bringing to market a significant differentiator in this rapidly emerging space.
  • 27. Work In Progress Dynamically decide the reputation of never-before-seen hosts by moving from historical and statistical evaluation to predictive, dynamic methods.
  • 29. References [1] Jamie Riden, 2008, “The Honeynet Project: How Fast-Flux Service Networks Work”. [2] Michael K. Daly, 2009, “Advanced Persistent Threat (or Informationized Force Operations)”. [3] Corinna Cortes and V. Vapnik, 1995, “Support-Vector Networks”.
  • 30. Q&A Evaluation accounts are available that you can use to check out the system, punch in your own address range, and so on. A couple of publicly available whitepapers are also available.