
Disruptive innovation in security technologies



Summer Course 'Innovation in security applied to the protection of digital identity #CIGTR2015'. (EN)

URJC Summer University Courses
'Disruptive innovation in security technologies (dinnoTSec14)' (EN)

Published in: Technology

Disruptive innovation in security technologies

  2. dinnoTSec14: Disruptive innovation in security technologies. 2014 Summer Course, Rey Juan Carlos University, Vicálvaro Campus, Madrid, from June 30 to July 2, 2014.
  3. PUBLISHING PRODUCTION. DESIGN AND LAYOUT: Miguel Salgueiro / MSGrafica. PRINTING AND BINDING: Gráficas Monterreina. Legal Deposit: M-18110-2015
  4. 2014 Summer Course dinnoTSec14: Disruptive innovation in security technologies. Centro de Investigación para la Gestión Tecnológica del Riesgo (CIGTR). INDEX
     • PROLOGUE (Santiago Moral)
     • ENFORCING LOCATION AND TIME-BASED ACCESS CONTROL ON CLOUD-STORED DATA (Claudio Soriente)
     • ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING IN THE INVESTIGATION OF APT CAMPAIGNS (Vicente Díaz)
     • CYBERPROBE: TOWARDS INTERNET-SCALE ACTIVE DETECTION OF MALICIOUS SERVER (Juan Caballero)
     • PANDORA FMS: HOW TO COMPETE WITH THE MAJOR SOFTWARE VENDORS AND NOT DIE TRYING (Sancho Lerena)
     • SPECIALIZATION AND INNOVATION TO COMPETE IN SECURITY (Andrés Tarascó Acuña)
     • PROTECTING INFORMATION IN THE CLOUD. ENCRYPTION TECHNOLOGIES (Isaac Agudo)
  5. INDEX (continued)
     • DISRUPTIVE INNOVATION IN CYBERCRIME TECHNIQUES (Etay Maor)
     • INNOVATION IN IDENTITY (Luis Saiz)
     • ENIGMEDIA. INNOVATION IN ENCRYPTED COMMUNICATIONS (Gerard Vidal)
     • ROUND-TABLE DISCUSSION: PERSONAL INFORMATION (PII), CITIZENS’ RIGHTS AND INTERNATIONAL REGULATION (Taking part: Luis Saiz, Isaac Agudo, Juan López-Rubio Fernández, Esther González Hernández and Pablo García Mexía. Chaired by Miguel Ángel Cano Gómez)
     • ANOMALY DETECTION WITH APACHE SPARK (Sean Owen)
     • CRIME SENSING THROUGH SOCIAL MEDIA (Luke Sloan)
     • A MODEL OF UNIVERSITY-INDUSTRY COLLABORATION: THE RELATIONSHIP BETWEEN BBVA GROUP AND URJC (Regino Criado / Santiago Moral Rubio)
     • PHOTO GALLERY
Contents of talks are available on the official webpage. Both slides and videos can be found on the official CIGTR channels on YouTube and SlideShare.
  6. PROLOGUE. Santiago Moral Rubio, Director IT Risk, Fraud & Security, BBVA Group.
For the fourth consecutive year, we bring you this publication, which collects the papers presented during the Summer Course organized by the CIGTR hand in hand with the Rey Juan Carlos University. In the 2014 course we dealt with how new technologies, while providing opportunities to connect our lives ever more closely, also compel us to adopt new measures to keep them secure. Being prepared for such opportunities, as well as for the threats that accompany them, requires different approaches and ways of thinking. At the same time, organizations are increasingly concerned about the direction their strategic plans should take, given that this technological evolution is happening faster than they can comfortably manage. The ecosystem in which we are immersed, beyond inviting us to innovate, challenges us to be disruptive in our ideas and solutions. Disruptive innovation occurs when a product, service, system, process or organizational method is brought to market that breaks with what is already established rather than being a natural evolution of it. In this way, revolutionary new technologies are born that mark turning points in established practices and produce changes of global scope. This type of innovation is generally less efficient when it first enters markets that are already mature in the previous practice, but it is very competitive in markets that are open to a new, lower-cost offering, even if that offering shows some initial shortcomings.
  7. In this sense, startups play a fundamental role. They tend to show greater agility than large corporations in adapting to the needs of the market. They are also better motivated to address niches which, in principle, are small and offer low profit margins. If this is combined with a close and stable partnership with universities or other scientific research bodies, it is possible to gain real, differentiating competitive advantages while also advancing knowledge. The BBVA Group is committed to finding the solutions that will allow us to meet the challenges we face in the field of cybersecurity, and the content selected for this course follows that strategy.
  8. ENFORCING LOCATION AND TIME-BASED ACCESS CONTROL ON CLOUD-STORED DATA. Claudio Soriente, Senior researcher in the Group of Systems Security (D-INFK) at the Swiss Federal Institute of Technology (ETH) in Zurich. Contents of this presentation are available on the official webpage of CIGTR.
The work that I am going to present is about access control that also takes time and location into account. Perhaps you’re familiar with location-based services. If you look at disruptive technologies, location-based services are one of them. This talk discusses how to secure this type of system. I’ll begin with a slide about access control (AC). Here is Alice, who owns some resources, such as some files, and wants to apply access control policies to these files. She is not always connected, so she turns to a server in the cloud; we’ll call this server the Policy Enforcement Point (PEP). Alice specifies a file and an access policy based on the identity of the users who are supposed to access this file. For example, the access policy allows access to Bob or Charlie. Since Alice is not always connected, she transfers the file and the security policy to the policy enforcement point, which does exactly what the name suggests: it enforces the security policy on behalf of Alice. To do so, the PEP must be the gateway to the file and must identify the users. This means that when Bob wants to access the file, his identity is compared with the established security policy and, as it is consistent with it, he gets a copy of the file. Similarly, if David wants to access the file, since his identity does not agree with that established
  9. in the security policy, he will be denied access to this file. That would be standard access control based on the user’s identity or, alternatively, role-based access control according to an organizational hierarchy. Why do we add location to access control? Some companies already do it through a service called Location-Based Rewards. The idea is that customers can get coupons or discounts for visiting locations classified as premium. For example, Starbucks could set a policy saying that if you visit one of their sites you get a free coffee. Another reason for adding location to access control is geo-fencing. A geo-fence is the perimeter of a geographic area; events can be triggered when the user moves into or out of that area. For example, some companies send sensitive data to their clients that can only be accessed within the company premises. A bank, for instance, can establish that certain data may only be accessed within the bank’s premises. This is a matter of regulatory compliance, but it also has security implications. Symantec ran a project called ‘Smartphone Honey Stick Project’, whose idea was to leave devices at different sites in the cities of New York and Chicago, as if they had been lost. These devices contained files that had been clearly marked as confidential, and Symantec monitored how the people who found the devices looked at those files. The result of the experiment was that even the people who were willing to return the device to its owner looked at those files. In this case, if a geo-fence had been applied to those files, the finders would not have been able to access the sensitive data unless the device was back inside the geo-fenced zone.
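As a rough illustration of the two checks described so far, the sketch below combines an identity check with a geo-fence test. It is only a sketch under stated assumptions: the circular fence, the `GeoFence` and `pep_decide` names, and the coordinates are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


@dataclass(frozen=True)
class GeoFence:
    """Hypothetical circular geo-fence: centre plus radius in metres."""
    lat: float
    lon: float
    radius_m: float

    def contains(self, lat: float, lon: float) -> bool:
        # Great-circle distance via the haversine formula.
        dlat = radians(lat - self.lat)
        dlon = radians(lon - self.lon)
        a = (sin(dlat / 2) ** 2
             + cos(radians(self.lat)) * cos(radians(lat)) * sin(dlon / 2) ** 2)
        distance = 2 * EARTH_RADIUS_M * asin(sqrt(a))
        return distance <= self.radius_m


def pep_decide(user: str, lat: float, lon: float,
               allowed_users: set, fence: GeoFence) -> bool:
    """Grant access only to listed users who are inside the fence."""
    return user in allowed_users and fence.contains(lat, lon)


# Hypothetical premises: a 200 m fence around an arbitrary point.
premises = GeoFence(lat=40.4168, lon=-3.7038, radius_m=200)
policy_users = {"bob", "charlie"}
```

Note that this check runs entirely inside the PEP, which is exactly the trust assumption the rest of the talk questions.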
In addition, when you put location in your access control policies you also have to look at time. For location-based rewards, time is important because you want customers to claim the reward at specific locations at a particular moment, such as the opening time of the premises. For the geo-fence, you want data to be accessible only within the premises of your company and only during working hours. With this type of system you can specify such security policies, so you can establish that both Bob and Charlie can access the file if they are in this place at that time. You can also make the security policy as complex as you want: perhaps, in order to access the file, these two people should be in this location at this time, or in another location at another time. You transfer all that to the PEP, which must store the files, identify the users and have localization capabilities, i.e. it must locate the users. And requiring all this of one entity may be the main problem here. If you look at the design space of existing solutions, first we have research papers on security policies, second we have cryptographic solutions, and third we have deployed systems, i.e. the systems that we use today. Let’s start with the security policies, which are usually an extension of role-based access control frameworks. There is a standard that defines access control policies, and these works extend its framework to also express time- and location-based access. On the one hand, they are clearly expressive; you can arbitrarily define a
  10. combination of roles, locations and time intervals. But the problem is that they leave everything in the hands of a single entity: one entity that stores files, enforces the security policies and locates users. The problem is that this entity, the PEP, is considered trustworthy. We trust it to access plaintext data and we trust it to apply security policies correctly. There is no way for Alice to tell whether the PEP enforces her security policy correctly or not. Now let’s take a look at the deployed systems. The deployed systems are based on check-ins at premium locations. The idea is that the user installs an application on his smartphone, visits these premium locations and checks in at them; these check-ins become points and those points entitle him to a reward, i.e. with enough points he could win a free coffee. This is what Bob is going to do. He’ll visit a local Starbucks and check in, and his phone sends its GPS coordinates to the cloud-based gift certificate server; then he visits another shop and checks in there too, and the phone contacts the server again; once he has visited a certain number of locations, the gift certificate server sends him a gift certificate that perhaps allows Bob to get a free coffee. One of the main advantages is that these systems do not need a localization (positioning) infrastructure. They don’t need the PEP to locate users because everything is based on the GPS coordinates reported by the user’s phone. Checking in is a voluntary action by which the user tells the server that he is at a particular location. But the GPS location can be false: the user can be malicious and, in that case, he could abuse the system.
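The weakness of check-in systems can be seen in a few lines. The following is a minimal sketch, not a model of any real rewards service: the `GiftServer` class, the shop names, the coordinates and the two-visit threshold are all hypothetical. The point is only that the server has nothing but the client-reported coordinates to go on.

```python
# Hypothetical premium locations (name -> latitude, longitude).
PREMIUM = {
    "shop_a": (40.4168, -3.7038),
    "shop_b": (40.4200, -3.7000),
}


class GiftServer:
    """Toy check-in server that trusts whatever coordinates the client sends."""

    def __init__(self, needed: int = 2, tolerance: float = 0.001):
        self.needed = needed        # visits required for a reward
        self.tolerance = tolerance  # accepted deviation in degrees
        self.visited = {}           # user -> set of shop names

    def check_in(self, user: str, lat: float, lon: float) -> bool:
        """Record a check-in; return True once the user has earned a reward."""
        for name, (plat, plon) in PREMIUM.items():
            if abs(lat - plat) <= self.tolerance and abs(lon - plon) <= self.tolerance:
                self.visited.setdefault(user, set()).add(name)
        return len(self.visited.get(user, ())) >= self.needed


server = GiftServer()

# Honest Bob actually walks to both shops and reports real coordinates.
server.check_in("bob", *PREMIUM["shop_a"])
bob_rewarded = server.check_in("bob", *PREMIUM["shop_b"])

# A malicious user sends the very same numbers without leaving home.
server.check_in("mallory", *PREMIUM["shop_a"])
mallory_rewarded = server.check_in("mallory", *PREMIUM["shop_b"])
```

Both users end up rewarded, because the two sequences of requests are indistinguishable from the server's point of view.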
That is, he could tell the gift certificate server that he is at a Starbucks, while he is actually sitting on his couch. In addition, we have the same problem as before: the PEP is trusted to access the data and to enforce the security policies correctly, so we have no way to tell whether a security policy has been correctly enforced or not. To deal with malicious users who provide false GPS coordinates to the server, there are some cryptographic solutions. One of them was presented in 2009 and is based on location proofs. The basic idea is that the PEP is now a different entity from the localization infrastructure. In fact, the localization infrastructure here is an ad hoc infrastructure consisting of access points. That is, if a store wants to participate in this type of system, it can buy one of these access points and put it in the shop. These access points issue location proofs. A location proof is a digital statement which says that a user was at a given place at a given time. The idea is that Bob visits a location and gets a location proof that contains Bob’s identity, the location and the time; the line that surrounds it on the slide means that it is signed by this access point. Bob can also go and connect to different access points to collect several location proofs. When he has collected enough location proofs, he can return to the PEP, and the PEP can compare the location proofs collected by Bob with the established security policy to decide whether to grant or deny Bob access to the file. This means that, apart from checking location proofs against the security policy, the PEP should also check the validity of the security
  11. policy. Location proofs are a very pragmatic approach because the PEP does not have to locate users; localization is separate. And the user cannot fake a proof, because the only way to collect a location proof is to go and connect to the localization infrastructure, i.e. go and ‘talk’ to one of the access points that issue such proofs. It is not based on the GPS coordinates of the user’s phone. On the other hand, the PEP must trust the localization infrastructure; hence, before the system can operate there must be a trust relationship between these two entities: one of them must verify the signatures issued by the other. In addition, we return again to the fact that we totally trust the PEP. It is expected to access the file and to enforce the security policy defined by the owner correctly. If we look at the design space, a system of this kind needs at least three components: policy enforcement (someone who verifies the credentials of users who want to access the files), storage for these files, and localization (a way of locating users). There are solutions that consist only of the PEP, i.e. one entity that does all of the above. This is the case of the deployed systems, where localization is not performed by the PEP and relies on the user’s GPS coordinates. This is how systems work today. There are also solutions that separate the PEP from the localization infrastructure. The PEP stores files and enforces the security policies while the localization infrastructure, as its name suggests, locates users. But these two entities have to trust each other; a trust relationship is needed for the system to work.
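A location proof can be pictured as an authenticated statement of the form (user, location, time). The sketch below is an assumption-laden stand-in rather than the 2009 scheme itself: it uses an HMAC from the Python standard library in place of a real digital signature, so the key shared between access point and PEP plays the role of the trust relationship mentioned above. The function names and the JSON layout are hypothetical.

```python
import hashlib
import hmac
import json


def issue_proof(ap_key: bytes, user: str, location: str, timestamp: int) -> dict:
    """Access point side: bind user, location and time into one statement."""
    statement = {"user": user, "location": location, "time": timestamp}
    payload = json.dumps(statement, sort_keys=True).encode()
    # Assumption: HMAC-SHA256 stands in for the access point's signature.
    mac = hmac.new(ap_key, payload, hashlib.sha256).hexdigest()
    return {"statement": statement, "mac": mac}


def verify_proof(ap_key: bytes, proof: dict) -> bool:
    """PEP side: recompute the MAC and compare in constant time."""
    payload = json.dumps(proof["statement"], sort_keys=True).encode()
    expected = hmac.new(ap_key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, proof["mac"])


# Hypothetical key provisioned when the shop joins the system.
AP_KEY = b"demo-key-shared-by-access-point-and-pep"

proof = issue_proof(AP_KEY, "bob", "shop_a", 1404205200)

# Bob cannot move the proof to another location without invalidating it.
forged = {
    "statement": dict(proof["statement"], location="shop_b"),
    "mac": proof["mac"],
}
```

With a real signature scheme the PEP would hold only the access point's public key, but the trust relationship between the two parties would still have to be set up beforehand.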
In addition, in all cases the systems trust the PEP to access the data and to enforce the security policies. Again, if the PEP is malicious the system breaks. This is the starting point of the system that we have built, LoTAC (Location and Time-based Access Control). The first thing to bear in mind is that we didn’t want any trusted PEP, that is, we didn’t want to trust anyone to enforce the security policies set by users. The idea was that no one, apart from authorized users who are in a certain location at a specific time, should be able to access the files. What we wanted to do was to enforce security policies through encryption. The owner of the data, Alice, who sets the security policy, encrypts the file in such a way that only authorized users, in the right place and at the right time, will be able to decrypt it. Now, if we leave aside policy enforcement, we still have storage and localization. For storage we resort to cloud storage services such as Dropbox. The only thing we need is ubiquitous data storage and access. As you know, this type of system does not have localization capabilities, so it doesn’t know where the users are. It can enforce access control based on the identity of the user, so you can say ‘I want my friend to be able to access this file’. But we also wanted to set this option aside.
  12. Once you have storage, you need localization. The only localization infrastructure that can locate users on a large scale today is the cellular network operator. There is nothing else you can use if you want to deploy a system that covers a specific geographical area. Operators do not offer storage services, but we have Dropbox for that. They can identify and locate users, in this case, across the whole national territory. Once we have those two components of the system, we don’t want them to have to trust each other. If we expect a cloud server such as Dropbox to talk to an operator’s network such as Movistar, this will never happen. So you want a system that integrates seamlessly with these two. In addition, Movistar is good for Spain, but if Dropbox wants to do business in another country, then we have to talk with another cellular network operator and, perhaps, in that other country there is more than one network operator. Therefore, a trust relationship between both parties is not easy to establish. You want a system that works today and integrates these two entities as they are. Let’s take a look at some of the design choices we have made. We have the cellular network operator, which can identify and locate users. This is what happens today when Movistar finds out where its user is and who he is or, at least, what identity is associated with his phone number. The area is divided into cells of the 3G network. We call them locations and refer to them as e. Each location is covered by a location server (you can think of these location servers as the base station controllers behind the antennas that are scattered throughout the national territory). Each location server is solely responsible for locating the users in its location, that is, the ones inside its 3G cell.
The location servers have key pairs: a public key assigned to the location and a secret key. So for the rest of the lecture we agree that a location equals its location server, which equals its public key (e1 = LS1 = pke1). That is, you can say this is cell number one, location 1; there is a location server that corresponds to location 1, and it has a public key that is published while the corresponding secret key is kept secret. Thus, you can think of the geographic area covered by the system as divided into these locations, which are the granular unit of our system. As for the storage server (Dropbox will do, but you can pick another), it provides ubiquitous storage of and access to data, and does not enforce access control: Alice will encrypt the data and upload them to Dropbox, and Dropbox will allow anyone to download them. But as they are encrypted, we don’t care about Dropbox’s access control capabilities. And, finally, we have users. Users access data on the move and have mobile devices. They publish a public key (Bob’s is pkb) and the secret key (this is sk) is stored securely on their phones. You can think of these public keys, these digital identities, as being linked to Bob’s SIM card. Movistar already has a binding between the user’s identity and some keys that are embedded in the SIM card, so it is easy to extend this to also include these digital identities. Now, how would we like the system to work? The idea is that here we have the owner of the file, Alice, who starts with this file. Then there is
  13. a security policy that is based on the identity of the users who should access that file. It is also based on the locations and the time windows in which those users should be present before attempting to access the file. For this lecture we will call the group of users the access set, and the set of locations and times the contextual security policy. Both the access set and the contextual security policy are given as input to the encryption algorithm together with the file; the output is the encrypted file, which is sent to the storage server. Now, anyone can download that file. Eventually, one of the users in the access set will download the file and take it to the location servers named in the contextual security policy. The location server will accept the ciphertext from Bob and will return it after partially decrypting it. This processing depends only on the identity of the user who initiated the protocol, the location that this server covers and the current time. The location server does not know which file it has processed or how complex the security policy is. It simply accepts inputs and processes them based on these attributes: the current location, the current time and the identity of the person who initiated the protocol. Once Bob has interacted with a sufficient number of location servers, covering all locations of the access policy, he can use the secret key on his phone to decrypt the ciphertext and access the original file. This is how we want the system to work. For all this to work, we use some tools with which some of you may be familiar and others not.
Our contribution is an encryption scheme that integrates all the tools I’m going to show you. They are not new primitives, but the encryption scheme combines all of them, so we will break the scheme down to explain it. The first tool I am going to talk about is tag-based encryption. This is a diagram of a public-key encryption scheme: we begin with a file and encrypt it under a public key to get the ciphertext (the yellow circle on the slide indicates a ciphertext, whose content is protected); then you decrypt it using the corresponding secret key and get back the file. A tag-based encryption scheme is special in that encryption also takes a piece of public information called a tag. The tag is an arbitrary string: when you encrypt a file under the public key you can specify as a tag any string you can imagine, even the empty string if you like. Now, in order to decrypt you need not only the secret key corresponding to the public key but also the original tag; only with both can you get back the original file. The security notion for tag-based encryption is the one named here on the slide, but the point is that to recover the original file you need two things: one, the secret key corresponding to the public key and two, exactly the same tag. Even if you have the correct secret key, if you modify a single bit of the tag (so that it differs from the original one), what you get is something different. Even if you are the legitimate owner of the secret key corresponding to this encryption key, a different tag ruins the decryption. This is the first tool that we use. Another tool that we use is onion encryption (layered encryption). You may be
  14. familiar with onion encryption if you know Tor, the anonymity network. The idea of onion encryption is to add consecutive layers of encryption, like an onion, by cascading encryption operations. Suppose we have multiple public keys: you start with your plaintext file, encrypt it under one public key and get one layer of encryption. Then you take another public key, encrypt the already encrypted text and get two layers of encryption, and you can continue for as long as you want. Here we stop at three layers of encryption. To decrypt, we need to remove these layers one by one. If you have the secret keys corresponding to these encryption keys, you start with your plaintext under three layers; you remove a layer with one secret key and get a ciphertext with two layers, you remove another layer with another secret key and get a ciphertext with one layer and, finally, you remove the innermost layer to get the original file. These are the first two tools we use in LoTAC. How do we use them? The first thing we use is onion encryption. Again, the user defines an access set and a contextual security policy, these go into the encryption algorithm together with the file and, as if by magic, the ciphertext comes out. For the users in the access set, you add a layer of encryption to the file using the public key of the user in the access set. Thus, if Bob is in the access set, a layer of encryption is added to the file with Bob’s public key, so that only Bob will be able to decrypt it. For the contextual security policy you do something similar: an outer encryption layer is added with the public key of the location server specified in the contextual security policy.
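The layering just described can be sketched with textbook ElGamal over the integers modulo a prime. This is a toy under loud assumptions: the modulus, the base, the message encoding (a single integer) and the function names are all illustrative choices, the parameters are far from secure, and the code is not the LoTAC construction. It only shows three layers being added (Bob, then location 1, then location 2) and peeled off in reverse order.

```python
import secrets

P = 2**127 - 1  # toy prime modulus (assumption; far too weak for real use)
G = 3           # fixed base (assumption)


def keygen():
    """Return a (secret key, public key) pair with pk = G^sk mod P."""
    sk = secrets.randbelow(P - 2) + 1
    return sk, pow(G, sk, P)


def add_layer(pk, headers, body):
    """Wrap the body in one more ElGamal layer under pk.

    Each layer contributes one header element G^r; headers act as a stack.
    """
    r = secrets.randbelow(P - 2) + 1
    return headers + [pow(G, r, P)], body * pow(pk, r, P) % P


def remove_layer(sk, headers, body):
    """Peel the outermost layer using the matching secret key."""
    *rest, c1 = headers
    shared = pow(c1, sk, P)  # equals pk^r for the matching layer
    return rest, body * pow(shared, -1, P) % P


# Hypothetical parties: Bob plus two location servers.
sk_bob, pk_bob = keygen()
sk_ls1, pk_ls1 = keygen()
sk_ls2, pk_ls2 = keygen()

message = 123456789  # the file, encoded as an integer in [1, P-1]

# Alice: inner layer for Bob, then one layer per location server.
headers, body = add_layer(pk_bob, [], message)
headers, body = add_layer(pk_ls1, headers, body)
headers, body = add_layer(pk_ls2, headers, body)

# Decryption peels the layers from the outside in.
h, b = remove_layer(sk_ls2, headers, body)
h, b = remove_layer(sk_ls1, h, b)
h, recovered = remove_layer(sk_bob, h, b)
```

In a real deployment each `remove_layer` call would run on the party holding that secret key, so Bob alone could only remove his own layer.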
So if you want Bob to be in location 1, you add a layer of encryption with the key published for location 1. This means that this layer can only be removed by location server 1, with its secret key. And you can add as many layers as you want. If we care about identities and locations, we also have to care about time. To do this, we use tag-based encryption, where the tags, these arbitrary strings that are used during encryption, encode time. A tag may be something like this [points to the slide], and it can be as complex as you want, so you are free to specify time intervals. The idea is that now, apart from having the right secret key, you also need the original tag when decrypting. The tag defined by Alice in the contextual security policy cannot be modified. [Slide] This is an example where we have Alice with her file, and this is the contextual security policy. Alice wants to grant access to Bob. Here he must be in location 1 on one of these days, and here he must be in location 2 on this particular day, within certain time intervals. Now, on one side we have the location servers, which publish their public keys, and on the other side we have users, who also publish their public keys. Again, this is the access set of authorized users and this is the contextual security policy. We start the encryption with the access set, encrypting the file with ElGamal under Bob’s public key: now we have one layer of encryption. Then we have to handle the contextual security policy. As the
  15. contextual security policy says that first he has to be in location 1 in this time interval, we take the public key of location server 1 and add a layer of encryption using tag-based ElGamal, for which we have to specify a tag. The tag is exactly this time interval. As there is another clause in the contextual policy, we repeat the process: we take the public key of location server 2 and use this other string as the tag. That is, tag-based ElGamal, the public key of the server, and this string as the second tag. Once the encryption is finished, you upload everything to the storage server, that is, we upload this encrypted blob [acronym for Binary Large Object] to Dropbox. Once this is done, anyone can download the encrypted blob but only authorized users will be able to make sense of the file hidden in the ciphertext. How does it work once Bob has downloaded the ciphertext? Decrypting the blob first requires the secret key of location server 2, together with the tag specified at encryption time, to remove the outer layer of encryption. Once this is done, Bob has to ‘talk’ to the location server that covers location 1, since he needs the secret key of that location server in addition to the original tag with which that layer was encrypted. Once Bob has ‘talked’ to all the location servers in the contextual security policy, he can remove the innermost layer of encryption using his own secret key. And how does the interaction between Bob and each of the location servers work? The idea is that at some point Bob will move into the area covered by one of the location servers. The location server will first identify Bob, who has to prove that he is the legitimate owner of that public key.
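To make the role of the time tag concrete, here is a toy in which the tag is hashed into the ElGamal mask, so that decryption with the right secret key but a different tag yields garbage. This is explicitly not the tag-based ElGamal construction referred to in the talk; it is a simplified stand-in with insecure toy parameters, and the tag format and function names are assumptions made for illustration.

```python
import hashlib
import secrets

P = 2**127 - 1  # toy prime modulus (assumption; not secure)
G = 3           # fixed base (assumption)


def keygen():
    """Return a (secret key, public key) pair with pk = G^sk mod P."""
    sk = secrets.randbelow(P - 2) + 1
    return sk, pow(G, sk, P)


def _mask(shared: int, tag: str) -> int:
    """Derive an invertible mask from the shared value and the public tag."""
    digest = hashlib.sha256(shared.to_bytes(16, "big") + tag.encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1  # in [1, P-1]


def encrypt_with_tag(pk: int, tag: str, m: int):
    """Encrypt integer m under pk, binding the ciphertext to the tag."""
    r = secrets.randbelow(P - 2) + 1
    shared = pow(pk, r, P)
    return pow(G, r, P), m * _mask(shared, tag) % P


def decrypt_with_tag(sk: int, tag: str, ciphertext):
    """Decrypt; only the original tag reproduces the original mask."""
    c1, c2 = ciphertext
    shared = pow(c1, sk, P)
    return c2 * pow(_mask(shared, tag), -1, P) % P


sk_ls1, pk_ls1 = keygen()
message = 424242

# Hypothetical tag encoding a date and a time window.
tag = "2014-07-01T09:00/2014-07-01T17:00"
ct = encrypt_with_tag(pk_ls1, tag, message)

right = decrypt_with_tag(sk_ls1, tag, ct)
wrong = decrypt_with_tag(sk_ls1, "2014-07-02T09:00/2014-07-02T17:00", ct)
```

Because the tag is bound into the ciphertext, a location server can safely compare the tag it is handed with its own clock before agreeing to remove its layer.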
Once this is done, Bob sends the location server the encrypted blob and the tag that was used when the layer of encryption was created with this location server’s public key. Now, the only thing we want the location server to do is check the tag against the current time. If the current time matches the tag (and remember that, thanks to the security of tag-based encryption, Bob cannot change that tag), the server removes a layer of encryption from the blob using its secret key. Now Bob has gone from a blob with three encryption layers to one with two. Clearly, he still needs to connect to the server that allows him to remove the next encryption layer, and something similar happens there. Bob goes and ‘talks’ to location server 1, which is located inside the location base server; the location server takes the ciphertext and the tag provided by Bob, checks the current time against this tag and, if the two match, removes a layer of encryption using its secret key. Once Bob has removed all the layers of encryption related to the contextual security policy, he can use the secret key on his phone to access the original file. This is not all, because we have to be careful with malicious users who work together to get around the access rights. This is an example of an attack on the system as I’ve shown it to you so far. It was a complex problem to solve. The idea is that Bob, to access this file, must be at each location at a particular time, but Bob is lazy, so he is only at location 1, that is, he is only next to server 1 and cannot ‘talk’ directly to server 2, so he asks a friend for help. The idea is that David, who is within the area of location server 2, provides the ciphertext and the correct tag. The
server checks that the tag matches the time, accepts David’s ciphertext and removes a layer of encryption. Now David can pass the ciphertext to Bob, so Bob can ‘talk’ to location server 1 at the appropriate time; he provides the correct tag, the server removes another layer of encryption, and Bob can access the file, bypassing the access policy. We wanted Bob to go to both locations, in two specific time windows, but he didn’t, because David helped him. How do we solve this? Here we have another tool called re-randomization. Again, I show the standard diagram of an encryption process, where we have a plaintext and a public key, its conversion to ciphertext, and how, with the secret key, it returns to the original text. What we can do here is re-randomize the ciphertext under a key. That is, if you take the public key that you used for the first encryption, you can re-randomize the ciphertext: you take the encrypted text, apply the original public key and you get a re-randomized ciphertext. Look, this is a circle and this is a hexagon, and this tells you that these two encrypted texts cannot be linked. If you look at these two encrypted texts, you cannot say that this one is the re-randomized version of that one. Because of this unlinkability property, re-randomization has so far been used to provide privacy, in mix networks and other privacy technologies. We use it for security. Once you have re-randomized, the two encrypted texts look different, but with the correct key you will be able to decrypt both of them. It is just a way to unlink this ciphertext from this other one; confidentiality keeps the same properties, so you need the same secret keys.
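A minimal sketch of re-randomization, assuming plain textbook ElGamal over a toy group rather than the hashed variant used in the talk. The function names and the optional `r` and `s` parameters (there only to make the example reproducible) are assumptions of this sketch. It also shows what happens when the "wrong" public key is used for re-randomization, which is the property the speaker relies on next.

```python
import secrets

# Toy safe-prime group (P = 2*Q + 1); far too small for real use.
P, Q, G = 2039, 1019, 4


def keygen(sk=None):
    """Return (secret key, public key); `sk` can be fixed for reproducibility."""
    sk = sk or secrets.randbelow(Q - 1) + 1
    return sk, pow(G, sk, P)


def encrypt(pk, m, r=None):
    """Textbook ElGamal: (g^r, m * pk^r)."""
    r = r or secrets.randbelow(Q - 1) + 1
    return pow(G, r, P), (m * pow(pk, r, P)) % P


def decrypt(sk, ct):
    c1, c2 = ct
    return (c2 * pow(pow(c1, sk, P), -1, P)) % P


def rerandomize(pk, ct, s=None):
    """Multiply in a fresh encryption of 1 under `pk`: (c1 * g^s, c2 * pk^s)."""
    s = s or secrets.randbelow(Q - 1) + 1
    c1, c2 = ct
    return (c1 * pow(G, s, P)) % P, (c2 * pow(pk, s, P)) % P


if __name__ == "__main__":
    bob_sk, bob_pk = keygen()
    david_sk, david_pk = keygen()
    message = pow(G, 42, P)
    ct = encrypt(bob_pk, message)

    same_key = rerandomize(bob_pk, ct)       # looks different, still Bob's
    print(decrypt(bob_sk, same_key) == message)

    other_key = rerandomize(david_pk, ct)    # mixes two public keys
    print(decrypt(bob_sk, other_key) == message,
          decrypt(david_sk, other_key) == message)
```

With the same key the plaintext survives and only the ciphertext changes. With a different key the result carries a leftover factor g^(s·(sk2 − sk1)), so neither secret key alone recovers the message (provided the two keys differ).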
The idea is that if you ‘re-randomize’ the ciphertext that was encrypted under pk1 with another public key, you get this type of blob, this one here that is half yellow and half red, which shows that two public keys have been used. Once you have re-randomized with a public key that is not the original public key used for encryption, the ciphertext cannot be decrypted with the secret key corresponding to the key used for the re-randomization, nor with the secret key corresponding to the public key used for the original encryption. This only happens in some ciphertext groups, so you need to take the mathematical operations into account to make it work. To sum up the properties of this system: once you have encrypted under a key, if you re-randomize under the same key everything is fine, but if you re-randomize under a different key nobody will be able to decrypt the text. These are the features that we want. How do we use this in LoTAC? Well, the interaction between the user and the location server is exactly the same, but remember that the first step is that the location server identifies Bob, which means that it checks that Bob is actually the owner of the public key. Once this has been done, the ciphertext that the location server returns is re-randomized under Bob’s public key. Let’s see how this works in the case of a collusion attack. This is the diagram that represents a collusion attack: David sends the ciphertext and the ciphertext is returned with one layer less; he passes this ciphertext to Bob, Bob ‘talks’ with the other server to remove
the other layer of encryption and he can finally remove the inner layer with his secret key. Now, if this ciphertext is re-randomized with David’s public key (because, again, we have an identification protocol that makes sure that whoever is speaking with the location server is David), and then this ciphertext is re-randomized with Bob’s public key, you end up with a ciphertext whose innermost layer has been re-randomized with two different public keys. Thus, Bob will not be able to access the encrypted text. This trick ensures that Bob himself has to go to each of the locations whose location servers appear in the contextual security policy before being able to access the file. Bob cannot ask other users for help to get around the access rights. This is the last trick I am going to show. We also look at macro-locations. The idea is that we have location servers that cover 3G network cells, and you can specify policies that name one or more of these locations. So, I want this user to be here, here and here before being able to access the file. But what if I want to define a macro-location? I want something like this [on the slide]. This is our campus, which may be covered by several location servers, so there would be six servers covering our campus. What if Alice, who has defined the security policy, wants to give access to Bob if he is in Vicálvaro on this date? This means that, as long as Bob is at any of these locations, he should be able to access the file, because that is what the security policy says. Now, how do we deal with this? Here we have another tool called re-encryption, which works in the following manner.
Again, we’ll start with a diagram that goes from the message to the ciphertext, with the public key applied to the original text and its corresponding secret key. Now, I want something that takes this ciphertext, which is yellow (so it can be decrypted with the yellow secret key), and turns it red, so that it can be decrypted with the red secret key. I want to change the public key under which the ciphertext is encrypted without having to decrypt and re-encrypt, and we have an algorithm that does exactly that. The idea is to calculate a re-encryption key. For this we have a key-generation algorithm that takes the secret key corresponding to the public key originally used to calculate the ciphertext, and the public key to which I want to move my ciphertext, and outputs my re-encryption key. This key is half red and half yellow again, and it goes from 1 to 2, which means that I can transform a ciphertext that was encrypted under public key 1 into a ciphertext under public key 2. Once I have this re-encryption key, it is an input to the re-encryption algorithm, which transforms the ciphertext that is yellow (so it can be decrypted using the yellow secret key) into a ciphertext of the same message that can be decrypted with the red secret key. How do we use this in LoTAC? The idea is to have a hierarchy of locations within the localization infrastructure. Let’s say we have something like this: Madrid, which is made up of neighborhoods, among them Vicálvaro (I don’t know if Vicálvaro is a neighborhood, but let’s assume it is for this example). Now Serrano is covered by two location servers, Chamberí is covered by three location servers and Vicálvaro also has
three location servers. If you create a security policy that says that users must be in Vicálvaro, these users should be able to talk to any of these location servers [points to the screen]. The location servers sit at the bottom of our hierarchy. The idea is that we start at the highest level of the hierarchy. I started with Vicálvaro, but you can start with Spain if you want. The localization infrastructure, Movistar, publishes a public key for Vicálvaro and the re-encryption keys that allow anyone to change a text that was encrypted under the public key of Vicálvaro into a text encrypted under the public key of one of these servers. So a ciphertext encrypted under this public key can be transformed into a ciphertext that can be decrypted by this other party. And something similar happens here. We start in Madrid, take the public keys of Serrano or Chamberí and publish the re-encryption keys. All these keys are public for the users who are specified in the security policy. Once you have the public keys of Serrano and Chamberí, you use them to publish the re-encryption keys for the servers that cover each of these two neighborhoods. Once they are all public, let’s see how they can be used. Assume this is the access policy: Alice wants to give access to Bob if he is in Vicálvaro on this particular date. This means that the ciphertext will look like this: since the public key of Vicálvaro is blue, you have the innermost layer in red, for Bob’s public key, and the other layer in blue, because we’ve used the public key of Vicálvaro to create a new layer of encryption. Bob, as long as he talks to one of those servers, should be fine: he should be accepted by the security policy, because each of them is a server of Vicálvaro. How does it work?
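To make the re-encryption step concrete, here is a toy sketch in the style of the classic ElGamal-based proxy re-encryption of Blaze, Bleumer and Strauss. It is a swapped-in technique chosen for brevity, not the construction from the talk: in particular, this re-encryption key is derived from both secret keys, whereas the speaker describes deriving it from the source secret key and the target public key. The key values for "Vicálvaro" and "location server 8" are made up.

```python
import secrets

# Toy safe-prime group (P = 2*Q + 1); far too small for real use.
P, Q, G = 2039, 1019, 4


def keygen(sk=None):
    """Return (secret key, public key); `sk` can be fixed for reproducibility."""
    sk = sk or secrets.randbelow(Q - 1) + 1
    return sk, pow(G, sk, P)


def encrypt(pk, m, r=None):
    """Ciphertext (m * g^r, pk^r), i.e. (m * g^r, g^(a*r)) for pk = g^a."""
    r = r or secrets.randbelow(Q - 1) + 1
    return (m * pow(G, r, P)) % P, pow(pk, r, P)


def decrypt(sk, ct):
    c1, c2 = ct
    g_r = pow(c2, pow(sk, -1, Q), P)     # (g^(a*r))^(1/a) = g^r
    return (c1 * pow(g_r, -1, P)) % P


def rekey(sk_from, sk_to):
    """Re-encryption key b/a mod Q (needs both secrets in this toy variant)."""
    return (sk_to * pow(sk_from, -1, Q)) % Q


def reencrypt(rk, ct):
    """Turn a ciphertext under g^a into one under g^b without decrypting."""
    c1, c2 = ct
    return c1, pow(c2, rk, P)


if __name__ == "__main__":
    vicalvaro_sk, vicalvaro_pk = keygen()     # macro-location key
    server8_sk, server8_pk = keygen()         # one server inside Vicálvaro
    rk = rekey(vicalvaro_sk, server8_sk)      # published by the operator

    message = pow(G, 42, P)
    blue = encrypt(vicalvaro_pk, message)     # layer under the Vicálvaro key
    purple = reencrypt(rk, blue)              # now under server 8's key
    print(decrypt(server8_sk, purple) == message)
```

In the hierarchy from the talk, the operator would hold the keys of every level, so publishing one such re-encryption key per edge (Vicálvaro to each of its servers, Madrid to each neighborhood) is at least plausible in this simplified setting.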
Assume that Bob is actually talking to location server 8, so he is at this location. He takes this ciphertext, takes the published re-encryption key that moves a ciphertext from Vicálvaro to location 8, and re-encrypts the ciphertext, which changes the outer layer from blue to purple. Now, since this layer is purple, it can be processed by this location server. So the protocol works as usual: Bob is identified, he sends the ciphertext and the tag, the server verifies that the tag matches the current time and removes a layer of encryption. Once this is done, Bob can remove the inner layer of encryption and retrieve the original text. These are all the tricks we use in this type of system and, again, our contribution rests on an encryption scheme that can combine all these tricks. Now we want to assess how these systems work in practice. We have created a prototype implementation where we have a server, clients, and the clients’ GSM network. These [on the screen] are some of the findings about how long it takes to encrypt the file, that is, how long it takes Alice to create a ciphertext starting from a file and a security policy. Here we see that it is in the order of seconds. We also see that the system scales better with the number of locations than with the number of users, but you can use some tricks, such as the hierarchy of locations, to do the same with the identities of the users, so that aspect can be improved without a problem. However, even in the non-optimized version, encryption under 20 locations and 75 users takes no more than one second, so the system is fast enough. Now, what we really need to know
is how long it takes smartphone users to process the ciphertext, that is, how long the conversation between the smartphone and the location server takes. This is the time taken on the smartphone side where, as you can see, everything is dominated by communication: it takes about three seconds to download the encrypted file, in this case a simple file of 20 k. The communication between the user and the location server takes about two seconds, and the computation on the location server and the user’s final decryption are pretty fast too. This assures us that the system can be used on the smartphones that we have today. With this I have finished. In conclusion, I would say that location and time are definitely a way to open your business model to the development of new applications, as access control becomes increasingly trustworthy. However, unless we address the security problems, security-critical applications will not benefit from this new technology. And if you develop such a system, you need to be careful about choosing the right localization infrastructure; in this case it is the cellular network operator, because it is the only deployed infrastructure we have today. There are other systems that are not secure, such as those that rely on the GPS coordinates reported by the users. When we speak of localization, many users are afraid of exposing their privacy, so private localization is also a very active field. Here you have some of the research that I have used to prepare this presentation, and this is our paper, which we will present in three days here, in Madrid.

Question time

I have two questions.
The first one is: in this system of tags that you’ve established and mounted on a role-based access control, what are the differences, if any, compared with the published technologies that use attribute-based encryption based on bilinear pairings? And the second question is about the concept of layers you have introduced and the possible commutativity between layers, whether the order is important or not; that is one thing that attribute-based encryption can solve and that I don’t see in the scheme that you’ve described. Thank you.

They are two excellent questions. In fact, I have two slides that explain why attribute-based encryption does not work, but I removed them because of time constraints. The main problem with attribute-based encryption is that it is very difficult to define intervals. So, what is the granularity of the attribute? Is it a second, a minute, an hour, a day...? In attribute-based encryption, depending on the number of attributes that you have, the tree of your security policy will be this big. If the time window of your policy is a day but you have a granularity of a minute, you need many attributes under that tree you have built to define access. This is a problem with attribute-based
encryption. It is not very expressive when we want to set intervals. It could be fine for setting locations, but the time intervals won’t be so expressive. Secondly, attribute-based encryption is much more expensive than encryption based on ElGamal, so you’ll need more battery in your smartphone. And, third, we didn’t want to trust any part of the system, and attribute-based encryption has this inherent problem: there must be an authority holding the secret key related to those attributes, and this authority is in fact a party who can access the files that have been uploaded. So there is an inherent security problem there. For these three reasons we didn’t want to bet on attribute-based encryption. When I was preparing this lecture the issue did come up. We considered it, but it didn’t work. The other question was about the commutativity of the layers. Let me go back again. In our example, I showed you that location server 2 removed the outermost layer and location server 1 removed the inner layer. You can do it out of order, and how do you make sure... Even if you do it out of order, you comply with the policy. The idea is to use tags. So, even if Bob talks to location server 1 first, it’s always OK provided that Bob does so within this period of time. Maybe on July 14th he’ll talk to location server 1, because that is the correct time window, so the location server will remove this layer. On July 16th he will talk to location server 2, which will remove this layer. He can do it in any order, and the tags make sure that the security policies are met. I hope this makes sense...
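Why the order does not matter can be seen from a short calculation. The following is a sketch for plain multiplicative ElGamal layering, assumed here for illustration (the hashed, tag-based variant from the talk may differ in detail); the symbols $x_i$, $h_i$, $r_i$, $u_i$ and $v$ are notation introduced for this example.

```latex
% Keys: Bob has h_B = g^{x_B}; location server i has h_i = g^{x_i}.
% Layered ciphertext: headers u_0 = g^{r_0}, u_1 = g^{r_1}, u_2 = g^{r_2} and body
\[
  v = m \cdot h_B^{\,r_0} \cdot h_1^{\,r_1} \cdot h_2^{\,r_2} .
\]
% Location server i removes its layer by multiplying with u_i^{-x_i} = h_i^{-r_i}:
\[
  v \cdot u_1^{-x_1} \cdot u_2^{-x_2}
  \;=\; v \cdot u_2^{-x_2} \cdot u_1^{-x_1}
  \;=\; m \cdot h_B^{\,r_0} ,
\]
% because multiplication in the group is commutative. Bob then finishes with
\[
  m = \bigl(m \cdot h_B^{\,r_0}\bigr) \cdot u_0^{-x_B} .
\]
```

Each layer contributes an independent multiplicative mask, so the servers can strip their masks in any order; what the tags add is the check that each removal happens inside its own time window.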
This is the hashed version of ElGamal, where the ciphertext group is the group of positive quadratic residues. It gives you the tags, the possibility of removing layers of encryption in the order that you want, and, because of the group we work in, re-randomization that allows you to prevent a ciphertext from being decrypted if it has been re-randomized with two different public keys. It also allows you to have this hierarchy of locations built on the location keys. All this is in this encryption scheme.

In addition to trusting the cellular network infrastructure, implicitly you are also trusting the authentication between the cellphone and the infrastructure, aren’t you? Could some infrastructure acting as a proxy of this communications infrastructure be used? I mean, something that is synchronized with the cellular network but that you connect to via Wi-Fi or Bluetooth. That could not be done because you need the cell phone to be authenticated, right?

The idea is that the SIM card is authenticated...

But do you have to use that explicit channel? Is it of any use to communicate with a web server in the cellular network?

No. For this reason, we use the cellular network operator, because it is the only ubiquitous localization infrastructure so far and it gives you the additional property of identifying users in a place. We don’t pay any cost for that identification. If the cellular network operator talks to the cellphones through its base station, they are already authenticated...
But it’s not only that: can someone act as a relay? Because this authentication you’re doing on the cell phone... what stops someone from placing a cell phone that acts as a relay, authenticating the location at any time and attacking you?

Yes. That would be possible. We are as secure as the location and identification provided by the cellular network infrastructure. We do not propose new localization schemes, so we are as secure as that, and there is no other option if you want to develop such a system today. If the cellular network infrastructure is secure against possible relays, our system will be too.

I find the way in which you deal with the collusion of users very interesting, but the question is: how do you prevent Bob from sending his private key to David?

If the key is embedded in the user’s SIM card (and today there are SIM cards that have public keys inside), we could use the asymmetric cryptography built into the card to make sure that doesn’t happen. If you consider the SIM card a tamper-resistant device, the secret key is stored securely in Bob’s SIM card.

One of the problems we have in other latitudes is identity theft at the operators. How could we restrict or control the possibility of someone impersonating me at the operator and getting my SIM card without being me?

It is a big question, but right now I don’t have the answer in my head. Isn’t this more at a physical level? Someone who wants to steal your identity must provide some form of identification. I believe that this is what happens at the moment and I do not know a better way to deal with it, but in our system we take many things for granted, such as identification and localization, and this is because we use a cellular network infrastructure. How to deal with identity theft?
It is a complicated problem. How do we face identity theft when the authentication authority, in this case the cellular network infrastructure, becomes malicious? It is very complicated; I don’t have the answer in my head.
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING IN THE INVESTIGATION OF APT CAMPAIGNS

Vicente Díaz. Malware Senior Analyst. Kaspersky Lab. Global Research & Analysis Team (GReAT)

Contents of this presentation are available on the official webpage of CIGTR

The idea of this lecture is to see which Artificial Intelligence techniques can be used in the investigation of APT campaigns. In fact, in my company we spend our days investigating this type of new threat, cyber-espionage, etc. It is a topic that is quite hot today. The purpose of this lecture is to explain how to carry out this type of investigation and what techniques can be used: which techniques, especially those related to Artificial Intelligence, make sense to apply to this type of campaign. The point is to take a practical rather than theoretical approach. To that end, I have tried to stay close to real-world applications and to tools that we can all use. The first thing is to define what an APT campaign is, since this is one of the main points that we will see in this lecture. APT is an acronym for Advanced Persistent Threat, which means that the threat is persistent and advanced. It is a term that was coined between 2010 and 2011, during the Aurora case, an attack against Google, which made some of its details public. Through a zero-day, hackers gained access to its servers and managed to exfiltrate data whose destination was apparently Chinese servers.
At that time, when the details were made public, other IT companies reported similar attacks. After looking into how they were related, it was confirmed that all of them had been victims of the same attacks. For this reason, they were described as Advanced Persistent Threat attacks: Advanced because they were carried out through a zero-day; Persistent, because the victims were infected for several weeks or even months; and Threat, because it was a threat indeed. Since then, the term has been used massively for almost any type of threat, and today everyone is talking about APT campaigns when in most cases the term ‘Advanced’ is quite relative. Many times, very important banks are infected because someone has sent them a link that says “double-click here” to install a file, and it’s done. I remember a case where a few emails were sent with an executable; someone ran that file, became infected and tried to fix it by deleting the email, believing that this got rid of the attack. For this reason, the term Advanced is used very often when it does not really apply. What is certain is that they are persistent since, in most cases, when we investigate this type of campaign, the victims have been the target of data theft for years without knowing it. One of the campaigns that aroused most interest at national level was The Mask (also named ‘Careto’). The truth is that the name is unique. It’s interesting because when people ask me how we name the campaigns, the truth is that we rely on what we see within the campaign and identify it like that. Often the name goes nowhere, but when a campaign reaches the media it is widely reported and you have to explain where the name came from. In most cases the names are not very amusing. In the case of Careto, our colleagues asked us the reason for the name.
‘Careto’ means something like ‘ugly face’ in English. In the cyber-espionage campaign that we were investigating, which dates back to 2007, we detected that some binary files contained this string, ‘careto’, which was quite odd. In addition, there was a second string, a password that was used to encrypt the traffic, and it was ‘me cago en la mar’ (a Spanish saying). In the end, we decided not to name the campaign after this second string, as it would have taken me much longer to explain what it meant. In any case, it was a cyber-espionage campaign whose code, used to spy on the victims, was very advanced because it hid itself very well in the system; it had many techniques, many modules... It really was a remarkable piece of malicious code because of its complexity. It all started because a vulnerability in Kaspersky was exploited to try to become invisible to the system. It was an old vulnerability that had already been fixed, but the malicious code exploited it. In any case, we have signatures for this kind of thing and, from there, we started the investigation, where we were surprised by the complexity and the large size of this campaign. One of the most interesting aspects was that it also had modules for other operating systems, such as Linux. We saw traces in some victims that could indicate that it also had modules for iPad and OS X, but we never found the malicious code. We did, however, notice very characteristic traces of Android.
Analyzing the campaign in depth, the most active period was between 2012 and 2013, so it was recent at the time we were investigating it. We found more than 380 victims with more than 1,000 IPs from different countries. The victims included government institutions, embassies, energy companies, research institutions, etc. Morocco was the country with the largest number of IPs. The caveat is that, with most of those IPs located in Morocco, we believe this was in part due to the module for Android: an Android phone changes IP address more frequently than a device that is not mobile. I just wanted to use this example to explain what we mean by APTs: cyber-espionage campaigns of a kind that, nowadays, there are more and more opportunities to carry out. The second part we are going to talk about is Artificial Intelligence. In general, the main problem of Artificial Intelligence, as a professor once told me, is expectations. Intelligence is a very ambitious word, with which we imagine an entity that is completely independent and capable of thinking for itself. Perhaps this is where Artificial Intelligence is heading, but what I mean is that, today, we apply a series of techniques, algorithms and tools that allow us to do very interesting things. This is the approach of this lecture: seeing what type of Artificial Intelligence techniques could help to detect this type of campaign. One of the things I did to arrive at a definition was to look at the contents of Berkeley’s Artificial Intelligence course, where we have topics on problem solving, machine learning, clustering and semantic analysis, which is a completely different area. There are a lot of topics, but only some of them are of interest for investigating APT campaigns.
Before that, I would like to point out that Artificial Intelligence is a very wide field, and that some of its areas can help us to investigate this type of campaign and others cannot. Additionally, I would like to mention that one of the reasons why I am here is perhaps the lecture I gave last year, in which I talked about how to detect malicious profiles on Twitter using machine learning techniques. I don’t want to repeat it, but if anyone is interested we can look at it, because it is published. It is another application of these techniques in a different context: the detection of malicious profiles within social networks. What are we going to use from Artificial Intelligence in APT campaigns? From my point of view, the most interesting aspects are those related to data mining, clustering, machine learning and also expert systems. All this is very complex in itself, but I think that they all have techniques that can assist us in the investigation of this type of campaign. Along the way, I will show you some tools that are already being used, so we don’t have to implement them ourselves. In my opinion, the best analysts are those who know what lies behind things, although they do not necessarily have to be experts or be able to develop them from scratch. That is, we need to know what the tools we work with are, how they help us, the limitations they have and what’s behind them, since sometimes we
use them blindly. We just know that they don’t give us any positive and that’s all. But if we don’t know what is behind them, we don’t know why they might be failing, and perhaps they are not giving us all the information. Today, the investigation of APTs is a mixture of this very technical knowledge and more general knowledge about how to carry out this kind of investigation. We’ll see how an APT campaign is investigated and, at every point, I will mention what type of tools and what type of Artificial Intelligence techniques can be used. The first thing you need to know is how a campaign starts. Imagine a detective film, where there is a murder and it all starts with a corpse and a scream in the middle of the night. In this case, the corpse can be a binary file that has been leaking data, and the scream in the middle of the night can be an alert from the intrusion detection system. Either way, we have some initial clue that makes us think there is something interesting enough to start an investigation. For example, as I mentioned before about ‘Careto’, the first clue was finding some binaries that were exploiting a vulnerability in a Kaspersky product. Here, we have to try to find as many clues as possible and collect as many artifacts as possible, which will then allow us to develop our investigation. The first problem begins here, and it is finding what is related to this campaign. In this case, the binary files, all the artifacts that we can find that are used in a campaign to exploit and infect the system, to exfiltrate data, or for anything else, are our primary source of information. They are the most interesting thing at this point. Then, the first question is what a binary file is and what we can do with it.
The first thing we would like, once we have detected a binary file that seems to come from a campaign and looks interesting enough to want more information, is to find other binary files (binaries) that can be connected to the same campaign and to the group behind it. In the end, what we want is a picture, a global vision as wide as possible, into which we can incorporate as many elements as we can. To find related binaries, one of the things that could come to mind is to use distance functions. A distance function is related to clustering: the idea is to choose attributes of a binary from which we can build a numeric representation, a score, and compare it with that of another binary. In this way we can establish the distance between the two binaries. Once we have this function defined, we can do clustering, which simply consists of grouping the binaries that are similar. Once we have this, everything is easier. We can also use machine learning to learn which binaries are related to each other. Clustering does have a disadvantage, which is that we don’t know whether it works or not. Imagine that we have a distance function we believe is good, we throw it against a terabyte of binaries, some hundreds of thousands or millions of them, and it gives us a number of groups. How do we know that, really,
  26. 26. this clustering is good, that these binaries really look alike? It is not easy. When we use machine learning, on the other hand, the approach is different. The machine learning algorithms I am referring to use supervised learning, which means that we give them a series of binaries and say “these belong to this group, these belong to that one and these belong to a third group”. From there, the algorithm learns which characteristics of the binaries make each group different from the others. Then, when we come across a binary that we do not know, it will be classified for us automatically. As you can see, here you don’t need a distance function but a selection of the appropriate attributes. Is the size of a binary relevant for grouping? Probably not. Is the name of the binary? Perhaps it is. Is the import table? Probably it is. The fact is that there are a number of features that will allow us to know whether these binaries really look alike or not. Having established the distinction between machine learning and clustering, we can use one approach or the other depending on what we want to do. Before we start talking about distance functions, we must ask whether they are really needed. Perhaps we can make everything a bit easier. It won’t always be necessary to use a distance function; many times we can search for a pattern within the code, something relevant and unique that distinguishes this binary from the rest. As I said before, within the ‘Careto’ binaries we found that string, and it is quite unique. If we look for this ‘Careto’ string among all the binaries we have and it gives us other positives, they are worth taking a look at. 
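As a rough illustration of the string search just described, the following Python sketch scans a folder of samples for a marker byte string. The function name, the folder layout and the marker value are assumptions made for the example; the actual ‘Careto’ string is not reproduced here.

```python
from pathlib import Path


def find_files_containing(root: str, marker: bytes) -> list[str]:
    """Return the paths of files under `root` whose raw bytes contain `marker`."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Each file is read fully into memory, which is fine for a sketch
        # but would need chunked reads for very large samples.
        if path.is_file() and marker in path.read_bytes():
            hits.append(str(path))
    return hits


if __name__ == "__main__":
    # Hypothetical collection directory and marker string.
    for hit in find_files_containing("samples", b"EXAMPLE-MARKER"):
        print(hit)
```

Any file reported here is only a candidate: a shared string is a reason to take a closer look, not proof of a common origin.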
So we can look for byte sequences that are unique within our entire collection of binaries. For example, the Linux ‘file’ command tells us what type a file is. It uses the information in a file called magic, a database of the byte sequences that identify each type of file. The file command, which works quite well, just looks up in that database the byte sequence that characterizes each file type, and this allows it to identify the type. We can use the same approach when we do searches. In fact, maybe you know ‘Yara’, a rule language for byte sequence searches in binaries. Yara also allows us to define logic that decides whether a binary is interesting or not. This tool is often used for the analysis of binaries and malicious campaigns. Another thing we can use is the binary’s metadata. Metadata gives us interesting information that is not necessarily related to the structure of the binary itself, but is data about the binary. For example, there is a utility called ‘PEframe’; if we run it on the file we found, ‘Hot Brasilian XXX’, we can check the compilation date, the packer, URLs, the API functions and many other interesting things, such as the internal name, which in this case is ‘Power’; accordingly, its Windows icon tells you that this is a PowerPoint file. And the original name is ‘ThisReclamo.exe’ (‘reclamo’ is Spanish for ‘lure’). This means that the person who created this binary internally calls it a lure, while the user sees it as ‘Hot Brasilian XXX’, impersonating a PowerPoint
  27. 27. file. You can imagine that, indeed, it is a lure for a malicious campaign. Thanks to this metadata we can check the internal name, and we can also use it to search for more ‘.exe’ lures, because this is the internal path used by the person who created the file; if we find a similar path, it will surely have a common origin with this campaign. Metadata is an unjustly forgotten source, in my opinion. The Yara example I wanted to show is this sequence of bytes, the signature I made to find a ransomware that impersonates the police and affects Android systems. The condition was simply that the signature matches. It’s that simple. There is nothing else here, but we can combine several conditions: that one or another is met, that if two are not met a third one must be, and so on. This is the simplest possible Yara example. The signature is a sequence of bytes, so we simply look for it among all the malicious APKs (Android application package files), which lets us detect whether there are others in this campaign. It is a simple way to identify whether there are other related binaries. As for building a distance function versus searching for a pattern: the VirusTotal web page has a similarity function, a distance function, called ‘similar to’, and it returns no results when we look for binaries similar to the APK I mentioned. However, if I use the Yara rule I mentioned, it gives me a lot of results, simply by searching all the APKs in VirusTotal for a pattern, which in my case was the public signature of this APK. 
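The logic of such a rule (a set of byte signatures plus a condition on how many of them must be present) can be sketched in plain Python. This is not Yara syntax and not the rule shown in the lecture; the class name, the signature bytes and the threshold are all hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """A toy signature rule: named byte sequences plus a match threshold."""
    name: str
    signatures: dict[str, bytes]  # label -> byte sequence to look for
    min_matches: int = 1          # how many signatures must be present

    def matches(self, data: bytes) -> bool:
        # Count how many of the signatures occur anywhere in the data.
        found = sum(1 for sig in self.signatures.values() if sig in data)
        return found >= self.min_matches


# Hypothetical rule: at least two of the three byte sequences must appear.
demo_rule = Rule(
    name="demo_lure",
    signatures={"a": b"\xde\xad", "b": b"\xbe\xef", "c": b"police"},
    min_matches=2,
)
```

With `min_matches=1` this reduces to the simplest case described above, where a single signature match is the whole condition.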
That is why it is sometimes better to keep things simple and not build anything very complicated with a distance function, because a plain pattern search will give the same result or an even better one. Back to binary comparison: the simplest binary we can imagine is a ‘Hello World’, which is just one line of code and an ‘include’. If we compile it with a different flag, we see that the resulting hash is totally different. Hashes are of no use for making comparisons. What I mean is that the compiler has a lot of weight in determining the final shape of a binary: the same source code compiled with different options gives a totally different end result. The compiler always carries a lot of weight; however, there are some approaches that can be used. If we keep treating a binary as simply a sequence of bytes, there are some interesting approaches such as ‘n-grams’, which use sequences of ‘n’ bytes. That is, instead of treating the binary as one large byte sequence, we take the bytes in groups of five, or of seven, and so on, depending on the granularity, which allows us to make partial comparisons that can give us information. In the end, byte sequences correspond to operations at a higher level, so it makes more sense to compare sequences than individual bytes. There is also a family of techniques based on histograms of byte values. That is, we can compute a histogram of all the bytes inside a binary and, from this distribution, find out the type of file. 
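A minimal Python sketch of these two ideas follows, together with one possible distance function built on n-grams and a very simple grouping step. The choice of the Jaccard distance, the default n-gram size, the threshold and the grouping strategy are assumptions for illustration; the lecture does not prescribe any of them.

```python
from collections import Counter


def ngrams(data: bytes, n: int = 4) -> set[bytes]:
    """All distinct runs of n consecutive bytes in the data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard_distance(a: bytes, b: bytes, n: int = 4) -> float:
    """0.0 when both inputs share all n-grams, 1.0 when they share none."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 0.0
    return 1.0 - len(ga & gb) / len(ga | gb)


def byte_histogram(data: bytes) -> list[float]:
    """Relative frequency of each of the 256 possible byte values."""
    counts = Counter(data)
    total = len(data) or 1
    return [counts.get(value, 0) / total for value in range(256)]


def group_similar(samples: dict[str, bytes], threshold: float = 0.5) -> list[list[str]]:
    """Greedy grouping: a sample joins the first group whose first member is close enough."""
    groups: list[list[str]] = []
    for name, data in samples.items():
        for group in groups:
            if jaccard_distance(data, samples[group[0]]) <= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups
```

The grouping step is deliberately naive; as noted earlier, the hard part is knowing whether the resulting groups mean anything.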
  28. 28. The good thing about this approach is that it also works with a partial sample. That is, if we are doing this on the network, for example, and we have captured only some of the bytes of a binary that is arriving, we can still identify the type of binary. In this presentation I have included several references to papers on this type of study because, as you can see, explaining it is quite complex and I don’t want to get into very technical details. They are complicated and it would take a long time to learn them all. I explain this so that you have a view of the things that can be done and of the references and tools you can use. SSdeep is a very interesting tool that takes byte sequences within a binary and computes a hash of each one. This gives us a signature for the binary, made of partial hashes of the byte sequences it contains, that we can compare with the signature of another binary. This helps to identify whether two binaries match or not. For example, I have two binaries, both of them the ‘Hello World’ I mentioned before; SSdeep generates a .txt file with the signature of these partial internal hashes and makes the comparison. In this case, the tool says that they have something in common. Seen from outside they seem to be the same, but this utility tells us whether they actually match. Remember that their MD5 hashes are totally different. So, although we cannot use ordinary hashes for a direct comparison, there are approaches, like these partial hashes of byte sequences within the binaries, that do allow us to make such comparisons. In any case, as I said before, you don’t need to know all the theory, but it is important to know that this approach exists and how it works. I recommend SSdeep because it is a very good tool for this type of research. 
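The idea of partial hashes can be illustrated with a much simplified Python sketch. It is not SSdeep’s actual algorithm: the boundary rule, the window size, the use of truncated MD5 and the set-based comparison are all assumptions chosen to keep the example short. It only shows why hashing pieces, rather than the whole file, lets two slightly different files still look similar.

```python
import hashlib


def piece_hashes(data: bytes, window: int = 4, modulus: int = 16) -> list[str]:
    """Cut the data at content-defined boundaries and hash each piece."""
    pieces, start = [], 0
    for i in range(len(data)):
        recent = data[max(0, i - window + 1): i + 1]
        # A boundary is declared when the sum of the last `window` bytes
        # hits a fixed value modulo `modulus`, so it depends only on content.
        if sum(recent) % modulus == modulus - 1:
            pieces.append(hashlib.md5(data[start:i + 1]).hexdigest()[:8])
            start = i + 1
    if start < len(data):
        pieces.append(hashlib.md5(data[start:]).hexdigest()[:8])
    return pieces


def piecewise_similarity(a: bytes, b: bytes) -> float:
    """Share of piece hashes the two inputs have in common (0.0 to 1.0)."""
    ha, hb = set(piece_hashes(a)), set(piece_hashes(b))
    if not ha and not hb:
        return 1.0
    return len(ha & hb) / len(ha | hb)
```

Because the cut points depend on the content and not on the offset, inserting a few bytes in one file only changes the pieces around the insertion, while a whole-file MD5 changes completely.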
Remember that we are still looking for binaries that could be related to a campaign, and we are going through, from the lowest level up, what is interesting in a binary for this search. At a higher level of abstraction, if we remember that the bytes represent the assembly instructions that the binary executes, this assembly code is much more interesting when it comes to comparing two binaries. The issue is that the parsing methods needed to recover code at a high level are not trivial. We all use tools such as ‘IDA Pro’, but you should know that they may fail. That is, the high-level representation of the code may contain errors. In fact, this type of disassembler makes very strong assumptions when interpreting code and produces many errors if these assumptions are not met, for example with an unusual compiler. So it may be that the code we obtain with IDA has nothing to do with reality. For the comparison of these binaries there are mixed approaches, for example comparing graphs to see whether they match. Searching for a sub-graph within a graph is a problem known to be NP-complete, and no polynomial-time algorithm is known for general graph comparison either, so these computations can take a very long time. BinDiff is a tool that performs this type of comparison in a very visual manner on the control flow of the binary; it finds differences in code blocks by comparing the adjacency matrix of a graph (which is a very quick way to make this comparison). It also has its limitations. But knowing how such
  29. 29. tools work internally matters: what you get from IDA Pro is not necessarily the actual representation of the binary code, and the comparison between control flows is not always perfect. You always have to keep this in mind. An interesting piece of information about these binaries, at the highest level of representation, is their imports. Here we see all the files from which this binary imports the functions it uses. Remember that many of those functions may be used by the packer: if the file is packed, it uses certain Windows API functions to unpack and execute itself. So do not trust this IAT, the Import Address Table, blindly either, because sometimes the packer itself uses functions that the binary never uses. Apart from seeing the DLLs it uses, in this case through IDA we can also see the different functions used within each DLL. This can give us good clues about what kind of activity the binary will perform on a system. Another interesting tool related to the import table is ImpHash. It is a technique published by Mandiant, although it seems it was already in use before. As I said, the imports made by a binary give us many clues about what it does: if it uses a DLL to communicate over the Internet, we can suspect that it does so; likewise if it opens a file, and so on. ImpHash computes a hash of the binary’s import table, so we can compare binaries with each other. The key to this approach can be summarized in two points. The first is that it is very fast: the hash gives us a single signature of the imports that we can compare very quickly against a few million binaries. 
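The import hashing idea can be sketched in a few lines of Python. The sketch assumes the imports have already been extracted as ordered (DLL, function) pairs, and its normalization only loosely follows the published technique; in practice a PE parsing library would do the extraction (the pefile library, for instance, offers a `get_imphash()` method). The function name and the sample imports are hypothetical.

```python
import hashlib


def import_hash(imports: list[tuple[str, str]]) -> str:
    """MD5 over the ordered, normalized list of imported functions."""
    parts = []
    for dll, func in imports:
        lib = dll.lower()
        # Drop the library extension so 'KERNEL32.dll' becomes 'kernel32'.
        for ext in (".dll", ".ocx", ".sys"):
            if lib.endswith(ext):
                lib = lib[: -len(ext)]
                break
        parts.append(f"{lib}.{func.lower()}")
    # The list is NOT sorted: the order of the imports is part of the signature.
    return hashlib.md5(",".join(parts).encode()).hexdigest()


# Same imports, different order: the two hashes differ.
first = import_hash([("KERNEL32.dll", "CreateFileA"), ("WS2_32.dll", "connect")])
second = import_hash([("WS2_32.dll", "connect"), ("KERNEL32.dll", "CreateFileA")])
```

Comparing two binaries then reduces to comparing two short strings, which is what makes the approach practical over millions of samples.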
The second is a very interesting point: the import table generated at build time depends on the order in which the function calls appear in the source code. This means that if each of us writes a program that uses, for example, the same two DLLs, we might all get different signatures depending on where we call them within the code. That is, it matters not only that we call those functions but also when we call them. Because the table is built in that order, the hash of the import table changes with it, which makes it very interesting when looking for code that is very similar. As I say, it is not a perfect approach for finding similarity between two binaries, but it is certainly interesting. ImpHash is increasingly used and, in addition, it is a tool that everyone can apply. The most interesting features of a binary are those you obtain from a dynamic analysis. A dynamic analysis means that we take a binary, execute it in a sandbox and observe its behaviour. In this way we can find all the features and, in particular, the whole behaviour of the binary. It is the simplest approach. What is the problem? That it is not always possible, for example because of anti-emulation techniques, or because the binary only executes once the user has done a number of things on the system. But if we are able to get this information, it is the most valuable, since it will surely give us more clues about what the binary does and whether it really is the same as the other one we are looking at. Here we are no longer comparing the structural characteristics of the binary, but the data related to it. For example, if
  30. 30. it connects to a URL, a domain, a file, etc.; this may of course differ between two identical binaries, depending on the data. But such data is surely the most important. If we can get it, it is the most effective. The problem is that it is the most difficult to compute with, because there is a lot of it and it is not atomic (qualitative data is probably not very good for building a function), and it is also quite expensive, in the sense that we need many machines to execute all of these binaries and collect all the activity they perform. In short, if we want to build a good distance function we have to keep three things in mind. First, we need expert knowledge: all the features we have been talking about are fine, but you have to know which ones are good. What we said before about ImpHash is an example: the order of function calls alters the binary’s IAT, and if we didn’t know this it would never occur to us to build a feature like ImpHash. Second, the curse of dimensionality means that having many features per binary makes distance functions almost impossible to handle, so we have to select only the good ones. And third, overfitting means that if we train on a few binaries that are not representative, our recognition functions will be too specific to that group; that is, the training set has to be good, something that really works afterwards. I have here a blog entry in which someone says he would like to build a malware clustering related to different APTs, which is basically what we are talking about in this lecture. Look at the features we have for each binary: the file name, extension, size, type, compiler, packer, detections, dynamic behaviour, etc. 
In the end the man did not manage to solve it, but what I want to focus on is the large amount of data we have and the difficulty of selecting what is really relevant. No Artificial Intelligence technique will be useful here if we don’t have expert knowledge behind it. However, it is important not to let the trees keep us from seeing the forest. In this first part we have spoken only about binaries, but an APT campaign involves many more things, and in this second part of the presentation we will talk about them. First of all, there are the limitations of the real world. The binaries may all be different, or all the same, or they may be irrelevant. Why? Because nowadays there is generic malware, and there is no way to relate that malicious code to any group in particular. That is, if we use tools that we can obtain on the market, such as remote access tools, to get information about our victims, they cannot be used to link us to a group. What is used most today is what is known as TTPs (techniques, tactics and procedures): the techniques, tactics and procedures of the groups behind the attacks. Suppose, for example, that I analyze a machine that has been infected at the Rey Juan Carlos University and find generic malware. At first we think the system got infected because someone was surfing some site and that it is a normal infection. But it turns out that, if you were infected through this particular web page, this malware is the one used by such-and-such group, which often uses it to infect the first victim, as a first step to set foot within the organization, and then uses a lateral movement tool to infect the rest of the
  31. 31. domain, etc. These tactics and procedures make up the behaviour that allows us to identify the opponent. Today this is known as attribution; companies pay for it, and that is why all security vendors put it in their portfolio. Which elements must be considered to identify the group working in the shadows? Do you know the APT1 report from Mandiant? Mandiant is a company dedicated to forensic analysis and to the investigation of this type of campaign. It works together with the American government, and last year it published this report about different APTs that originated in China and targeted the United States. The report has an interesting part about attribution and how to use it. The description goes like this: considering the analogy with the physical world, imagine a thief who leaves traces at different crime scenes. In each individual theft we can see how the thief gained access, the tools used to break the security, and whether he stole a particular item or took everything to see later what was interesting. It’s a good analogy for how this sort of thing is investigated today, based on the way a group acts. It helps us identify who may be behind it. Of course, a perfect group does not act in the same way twice. In addition, nowadays it is very usual to find false clues. That is the problem of attribution in the digital world: it is very difficult to be sure. This type of initiative and this type of information are what we try to put to use to identify those groups. 
But notice that this information is very difficult to analyze automatically. In the first part of the lecture we were talking about machine learning tools that we really can use in a first phase, but for the second phase of the investigation we are talking about analysts who are able to make sense of all the information we have found. What else can we look at? Mainly network communications. The servers from which data is sent, the domains used, the information about those domains: everything about the network infrastructure used in a campaign is very interesting to know, and it is certainly the most valuable information after the binaries. The methods used for communication are often non-standard, with a characteristic use of encryption, and this is very interesting because it allows us to identify not only artifacts but also domains. For example, going back to the Android malware campaign I mentioned before, when we investigated it we found that it was being distributed through various porn sites. After analyzing them in depth, we believed they had similarities, albeit at the structural level; that is, they all seemed to point to the same web sites for resources. We wrote a small script to download all those sites, see where they pointed and draw a small map where you could see the relations. Think about the probability that, if we take a hundred porn sites, all of them have a similar structure. That likelihood is really low. So this type of analysis lets us see that there is an underlying infrastructure, which helps us find more sites that are being used for this kind of campaign. That is why the domains used in these campaigns and their internal structure are very valuable information that allows us not only to
  32. 32. establish the relations, but also to discover more related items. As for Whois information, you know what it is: if you have a domain and run a Whois query, it gives you data about who the owner is (often the owner is anonymous), and this data is commonly false. Does this mean that Whois information is not important or valuable? Absolutely not; it is very interesting information for several reasons. Remember what we said about TTPs, where groups have preferences: there are groups that perhaps always use the same mail servers in the Whois record, or use the same address to register 40 domains. Everyone has limitations and, being practical, ends up using a few favourites: you may use the same providers to register the domains, or make the registrations on the same date. As you can see in this campaign, looking at the records we clearly see that registrations of the domains that were used later peaked in November. The same goes for the companies where the hosting is located and for the domains used in these campaigns: the domains rotate among different hosting providers in different countries, and they are always the same ones, you always see the same rotation. All this information is very interesting because these are bad habits that groups have, small errors that let us find a relation and clues with which to analyze the specific campaign. Whois information is very valuable even if it is false, because maybe they always use the same address, the same domain or the same provider, or everything is registered at the same time. There are also a couple of tools, DomainTools and Whoisology, that allow us to find the existing domains registered with a specific email address; this is what is known as reverse Whois. 
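If Whois records are available locally, this kind of pivot (which domains share the same registrant email, registrar or registration date) is easy to sketch in Python. The record layout, the field names and the sample values below are assumptions for the example, not the format of any particular Whois database or service.

```python
from collections import defaultdict


def pivot_on(records: list[dict], field: str) -> dict[str, list[str]]:
    """Group domains by a shared Whois field, keeping only values seen more than once."""
    groups = defaultdict(list)
    for rec in records:
        value = (rec.get(field) or "").strip().lower()
        if value:
            groups[value].append(rec["domain"])
    return {value: sorted(doms) for value, doms in groups.items() if len(doms) > 1}


# Hypothetical records; real ones would come from a Whois database.
records = [
    {"domain": "one.example", "email": "reg@mail.example", "created": "2013-11-04"},
    {"domain": "two.example", "email": "REG@mail.example", "created": "2013-11-04"},
    {"domain": "three.example", "email": "other@mail.example", "created": "2012-02-01"},
]
shared_email = pivot_on(records, "email")   # {'reg@mail.example': ['one.example', 'two.example']}
shared_date = pivot_on(records, "created")  # {'2013-11-04': ['one.example', 'two.example']}
```

Even when every value in a record is false, the repetition of the same false value across domains is the signal of interest.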
You can also download the Whois databases, or buy them; they take up several terabytes, but you can keep them in your organization and try to find relations among domains, which is also very interesting for the investigation. What else do we have? We have sinkhole data. Take a malicious campaign in which all the stolen data is collected on a server with a specific domain. You can go to the police and show them the evidence that this domain is involved in malicious activity and is collecting the data stolen in that campaign. You can then get the domain redirected to your own server instead of the malicious one, so that all the infected machines report to your server. What will we see? Usually nothing in terms of data, because the stolen data is normally encrypted; but we will see the IPs of the infected users. As a result, many times we are able to warn victims who are infected and don’t even know it. If you call a victim and say “Hey, I need to talk to the security guy because we believe you are infected”, the first things they ask are “How do you know?” and “Are you trying to sell me an antivirus?” But we know because we are seeing the victim’s IP sending data stolen from their organization without their knowledge. So, thanks to a sinkhole we can see this type of data, which is very valuable because it shows the typology of the victims, the dates of infection, how the data is reported, and so on; we can obtain a lot of information about a campaign. 
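As an illustration of what can be extracted from a sinkhole, the following Python sketch summarizes victim IPs from web server log lines. It assumes the sinkhole logs requests in the common Apache access-log layout; the function name, the sample line and the request path are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

# Client IP, two ignored fields, then the timestamp in square brackets.
LINE = re.compile(r"^(\S+) \S+ \S+ \[([^\]]+)\]")


def summarize_victims(lines):
    """Count hits per client IP and record when each IP was first seen."""
    seen = defaultdict(lambda: {"hits": 0, "first_seen": None})
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # skip lines that do not follow the assumed layout
        ip = match.group(1)
        when = datetime.strptime(match.group(2), "%d/%b/%Y:%H:%M:%S %z")
        entry = seen[ip]
        entry["hits"] += 1
        if entry["first_seen"] is None or when < entry["first_seen"]:
            entry["first_seen"] = when
    return dict(seen)


sample = ['203.0.113.5 - - [10/Nov/2013:10:00:00 +0000] "POST /gate HTTP/1.1" 200 12']
victims = summarize_victims(sample)
```

The first-seen date per IP gives a rough view of when each victim started reporting, which is the kind of timeline mentioned above.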
  33. 33. Regarding Big Data, I’ll just comment on a couple of things, because nowadays Big Data seems to be a magic solution for everything, just like Artificial Intelligence, and it is not. First, to do Big Data you need an infrastructure: installing Hadoop on your home server is not Big Data. Should we normalize the data or not? Today it is also fashionable not to normalize anything. You put all the data on a server, 100 terabytes of it, with Elasticsearch on top, and your searches return data. This is a good approach for some things, but not for others. Take sinkhole data, for example: if you have to ‘grep’ through terabytes of data every time you want to find an IP, it will be completely unusable. Some data can be normalized: you take it and put it in a database, even if it is MySQL, you store the IP as an integer and build a search tree index on it and, in less than a second, you have the IPs related to a campaign, and you can perform range searches more easily, using masks. If you have to run the same search over all your access.log files, it will take hours. So Big Data is not a magic solution, and keeping your data normalized is good practice. I just wanted to mention it, although it is only loosely related to this lecture. To finish, I would like to mention the topic of record linkage. Everything we have been talking about is fine: we have a lot of data, but not all of it can be represented easily and it is not simple to link it. Not only that, we also have data that we obtain from other organizations. How do we represent this data so that it is useful for us, for other organizations, and so on? There are several tools. 
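Returning to the normalization point made above, storing each IP as an integer turns a network mask into a simple numeric range. The Python sketch below uses the standard library’s SQLite module instead of MySQL purely so that it is self-contained; the table layout, the function names and the sample addresses are assumptions, and only IPv4 is handled.

```python
import ipaddress
import sqlite3


def build_db(observations):
    """Load (ip, campaign) pairs into an indexed table, with the IP stored as an integer."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE hits (ip INTEGER, campaign TEXT)")
    db.execute("CREATE INDEX idx_ip ON hits (ip)")
    db.executemany(
        "INSERT INTO hits VALUES (?, ?)",
        [(int(ipaddress.IPv4Address(ip)), campaign) for ip, campaign in observations],
    )
    return db


def hits_in_network(db, cidr):
    """Return the stored hits whose IP falls inside the given network, e.g. '203.0.113.0/24'."""
    net = ipaddress.ip_network(cidr)
    low, high = int(net.network_address), int(net.broadcast_address)
    rows = db.execute(
        "SELECT ip, campaign FROM hits WHERE ip BETWEEN ? AND ? ORDER BY ip",
        (low, high),
    )
    return [(str(ipaddress.IPv4Address(ip)), campaign) for ip, campaign in rows]
```

With the index in place, the range query does not scan every row, which is the difference from running grep over raw log files.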
For example, one of the most famous is Maltego, which allows us to integrate all this information on servers that Maltego itself runs; it exchanges this data with its other servers, so we get more intelligence. We can also buy our own servers and connect them to each other. In the end we want a framework of this type, in which we can see all these relationships while integrating all this intelligence, but it is not simple. Maltego is a commercial tool; there are also public, free and open source frameworks such as CIF (Collective Intelligence Framework), which likewise allows us to integrate intelligence from many sources in one place where we can exploit it. Be careful: it does not exploit itself; we will have to establish the relations ourselves. In the end, the type of information we have looks something like what is shown on the screen. This was not produced by an automated tool, it is handmade, and I think it is a fair view of what the analyst has to see. It shows very extensive information; the part that can be broken down into atomic items can be classified easily, but sometimes that is not possible. So, from my point of view, today there are indeed many tools that help us and many frameworks for integrating all this intelligence, but there is still very important work left for the analyst, who has to know how to use those tools and what they mean in order to interpret the data afterwards. What we said about TTPs often cannot be done automatically. 
  34. 34. Today, if you want to go further with Artificial Intelligence, the most fashionable things are Bayesian networks and graphical models. Imagine a model in which the graphs we have seen before carry probabilities, so that depending on what we observe we know whether we are in one state or another, and we can identify something as malicious or not, as belonging to a group, a campaign, and so on. This is the approach we are working on at the moment. I believe no current model, commercial or not, has this well integrated. I may be totally wrong, but I don’t know whether it is possible to do all this in a semi-automatic way and build a framework that integrates all these possibilities. Why? Because there are too many features, and integrating them all from a formal and generic point of view is quite difficult. However, I think that certain parts, as we have seen, do benefit from these tools, which can be used within such a framework to help us make decisions and perform our analysis, although I am currently sceptical about what can be done from a generic point of view. To conclude: the idea of this lecture was to look at some issues that must be taken into account in an investigation of APT campaigns, and to present some useful tools while helping you understand better how the tools used at different levels and at different stages work. 
  36. CYBERPROBE: TOWARDS INTERNET-SCALE ACTIVE DETECTION OF MALICIOUS SERVERS. Juan Caballero, Assistant Research Professor, IMDEA Software Institute. Contents of this presentation are available on the official webpage of CIGTR. Thank you very much. Indeed, I am going to talk about CyberProbe: Towards Internet-Scale Active Detection of Malicious Servers. This is joint work with my students Antonio Nappa and Zubair Rafique at the IMDEA Software Institute, which is here in Madrid, on the Montegancedo campus of the Technical University of Madrid, and with our colleagues Zhaoyan Xu and Guofei Gu of Texas A&M University in the United States. I can summarize the kind of things I do in one slide. I work in systems security, software security and network security, on topics such as software vulnerabilities, exploits, malware, intrusion detection and forensic analysis techniques. One thing that unites all of these research areas is what I put in the center: binary program analysis, which basically means analysing a program without access to its source code, having only the executable files that implement it. Vicente spoke just before me, and he has already introduced many of the issues that I am going to deal with. Every day we have more cyberattacks, right? Basically, there are three types of attacker profile out there. We have the cybercriminals, who have a very clear motivation: they want to make money. We have a second group of attackers that we usually call hacktivists, like Wikileaks, Anonymous, etc., and normally we can classify their motivation as political. And