A key concern on today's Internet is the threat of cybercrime. Cybercrime on the Web uses different types of malware and fraud for purposes such as financial theft, espionage, copyright infringement, denial of service, and cyber-warfare. Attacks spread through channels such as HTTP and HTTPS, links in email or IM, IRC, malware attachments, and phishing. This threat landscape, often controlled by organized crime and nation states, has been evolving rapidly, and the threats themselves are becoming more evasive and difficult to detect. Attackers often use multiple infection mechanisms to take control of machines and make them part of botnets, which can then be used to perpetrate other kinds of attacks such as data leakage and denial of service. As threats blend across diverse data channels, their detection requires scalable distributed monitoring and cross-correlation with a substantial amount of contextual information. Conventional methods of protecting against cyber attacks, such as signature-based detection and firewalls, have become less effective.
Many corporations, security companies, and governments are therefore beginning to employ increasingly sophisticated means of detecting and protecting against cyber attacks. Recently, data-driven approaches have become popular for detecting new kinds of attacks. Instead of relying on static signature-based detection, these techniques seek to detect anomalies and other patterns in various kinds of data, such as network traffic statistics and server and application logs. For example, a sudden increase in the number of unresolvable DNS requests from a laptop might indicate that it is infected by a bot. These approaches rely on very large volumes of data and a variety of analytics to analyze that data. In this talk, I will describe some Big Data-based analytics and systems that IBM has built for detecting different kinds of cyber-attacks, particularly new kinds or new sources of cyber-attacks that may not have been seen before. These analytics span both real-time processing on the IBM InfoSphere Streams platform and off-line processing using InfoSphere BigInsights and data mining tools like SPSS.
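As a rough illustration of the unresolvable-DNS example mentioned above, the following Python sketch flags a host whose count of failed lookups in the current time window is far above its own recent baseline. This is not the analytic IBM built; the record format `(host, rcode)`, the function names, and the thresholds `k` and `min_count` are all assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def count_nxdomain(records):
    """Count unresolvable (NXDOMAIN) responses per host for one time window.

    `records` is an iterable of (host, rcode) pairs. This input format is
    hypothetical; a real deployment would parse DNS server or sensor logs.
    """
    return Counter(host for host, rcode in records if rcode == "NXDOMAIN")


def nxdomain_spike(history, current, k=3.0, min_count=20):
    """Return True if `current` is anomalously high relative to `history`.

    `history` holds the host's NXDOMAIN counts from earlier windows.
    The rule (assumed, not from the talk): flag when the current count
    exceeds mean + k * stddev of the history, and also exceeds an absolute
    floor `min_count` so that very quiet hosts do not trigger on noise.
    """
    if len(history) < 2:
        return False  # not enough baseline to judge
    baseline = mean(history)
    spread = max(pstdev(history), 1.0)  # avoid a zero-width band
    threshold = max(baseline + k * spread, min_count)
    return current > threshold


if __name__ == "__main__":
    # Synthetic window: one laptop produces many failed lookups.
    window = [("laptop-17", "NXDOMAIN")] * 150 + [("desktop-02", "NOERROR")] * 40
    counts = count_nxdomain(window)
    past = {"laptop-17": [2, 3, 1, 4, 2], "desktop-02": [0, 1, 0, 0, 1]}
    for host, hist in past.items():
        print(host, nxdomain_spike(hist, counts[host]))
```

A per-host baseline is used here because normal failure rates differ widely between machines; at the data volumes described in this abstract, the same rule would presumably run as a windowed aggregation on a streaming platform rather than in a single process.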