Methodology of Intentional Risk Analysis and Complex Networks

2013 Summer Course Book. 139 pages.

  1. Methodology of Intentional Risk Analysis on Internet and Complex Networks. 2013 Summer Course, Rey Juan Carlos University, Aranjuez, 8–10 July 2013
  2. PUBLISHING PRODUCTION, DESIGN AND LAYOUT: Miguel Salgueiro / MSGrafica
  3. INDEX
     INTRODUCTION (Santiago Moral), p. 5
     PROLOGUE (Regino Criado), p. 7
     COMPLEX NETWORK THEORY: INTRODUCTION AND APPLICATIONS (Stefano Boccaletti), p. 9
     SCALE-FREE RISK MODELING (Regino Criado / Víctor Chapela), p. 21
     THREAT HORIZON 2015: MORE DANGER FROM KNOWN THREATS (Adrian Davis), p. 37
     CASANDRA: A FRAMEWORK FOR MANAGING TECHNOLOGY RISK (Juan Manuel Vara / Marcos López), p. 49
     CASANDRA IN PRACTICE: THE DEVELOPMENT OF A SECURITY MASTER PLAN (Rafael Ortega), p. 57
  4. ROUND-TABLE DISCUSSION: METHODOLOGIES AND TOOLS FOR INTENTIONAL RISK ANALYSIS (taking part: Javier Candau, Juan Corredor Pinilla, José Antonio Mañas, Rafael Ortega García; chaired by Luis Fernández Delgado), p. 67
     ADVANCED TECHNIQUES FOR DETECTING COMPLEX FRAUD SCHEMES IN LARGE DATASETS (Stephen Moody), p. 85
     STATISTICAL MECHANICS AND INFORMATION THEORY AS THE BASIS OF STATISTICAL SECURITY (Teresa Correas / Roberto Ortiz / Santiago Moral), p. 105
     PHOTO GALLERY, p. 123
     Contents of the talks are available on the official webpage (www.cigtr.info). Both slides and videos can be looked up on the official CIGTR channels on YouTube (www.youtube.com/user/CIGTR) and SlideShare (www.slideshare.net/CIGTR).
  5. INTRODUCTION. Santiago Moral Rubio, Director of IT Risk, Fraud and Security, BBVA Group. The limits of the digital world are growing in an unstoppable way, to the point where it is very hard to discern its boundaries with the physical world. We are no longer talking about a connected ('always online') world but about a hyper-connected world [1] in which multiple mobile connections per individual coexist, steadily increasing numbers that could reach 5,000 million users in 2020 (according to J.P. Morgan reports), approaching the total population of the planet. The rise of mobile devices (smartphones, tablets and even the incipient "appcessories") and the adoption of decentralized and virtualized cloud paradigms contribute to the continued growth of information systems. The offering of content and services for users is increasingly extensive and interconnected. Digital identities are shared and managed in areas that must safeguard the privacy rights of individuals. In this scenario, information systems are becoming more dynamic and heterogeneous in terms of the number and type of their components, and the interconnections between them keep increasing. This complexity makes it difficult to analyze and manage risks under the traditional paradigms: it is hardly feasible to treat each one of the components individually. Moreover, to these circumstances we must add the continuous and rapid evolution of the environment surrounding organizations. The individuals who perpetrate fraud and attacks on corporate objectives are increasingly organized and hone their professional skills in order to adapt with greater agility to changes in defense strategies and to get the most out of this growing complexity. We are talking, without a doubt, about teams with business models whose maturity levels are high. [1] Mobile World Congress 2013: www.lavanguardia.com/tecnologia/20130124/54363068379/llega-el-mundo-hiperconectado.html#ixzz2wrxTRb00
  6. 6. Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR Methodology of Intentional Risk Analysis on Internet and Complex Networks 2013 Summer Course The assessment of the intentional risk associated with these scenarios requires more agile tools and methodologies which adapt to the continuous dynamism of these circumstances. A great opportunity is thus opened to grow in these areas and promote research environments in which both professionals in the business world and academicians collaborate. We need to enrich the business vision with new scientific alternatives that achieve methods of analysis and management which, on the one hand, are able to treat the systems as a whole beyond their individual elements, and on the other, provide agility in adapting to these dynamic and competitive environments. From Research Center for Technological Risk Management (CIGTR Centro de Investigación para la Gestión Tecnológica del Riesgo) research initiatives are promoted in these areas. Highlighting also the conferences and training activities we organize annually with the objective of promoting spaces for the University-Industry knowledge exchange. This is the third course organized on behalf of CIGTR in the setting of Rey Juan Carlos University Summer Courses . The last edition was attended by more than 190 people from diverse backgrounds. Information security professionals lived during these three days with scientists from fields as diverse as Computer Science, Physics or Applied Mathematics; we also had college students and PhD students willing to learn about these highly innovative initiatives. Through this publication, we convey to those interested the transcription of the lectures presented in this Summer Course.
  7. PROLOGUE. Regino Criado, Professor of Applied Mathematics, Rey Juan Carlos University. Society, in general, does not recognize mathematicians as creative people. Indeed, in the words of J.D. Barrow, the usual definition of a mathematician is "that kind of person you don't want to meet at a party". However, doing science and, in particular, solving mathematical problems requires a great deal of creativity. Otherwise, how would Archimedes have managed to verify whether King Hieron's crown was real gold or not? How would the inventor of the Enigma machine have managed to produce an encryption system able to drive the Allies mad during the Second World War? How could someone devise a method to fly, and even reach the Moon? Solving a mathematical problem involves a creative process which must sometimes be complemented with a great deal of dedication and effort. Creativity does not belong to a single area, and a very important component of it is that it allows promoting and building bridges between different disciplines. Mathematics is linked to creativity in solving very different problems and, also, with very varied techniques. In fact, mathematics is the language of science, a symbolic language that banishes ambiguity and doubt, the only language with a built-in logic that makes it possible to establish an intimate connection with the deeper mechanisms by which nature operates. In this context it is also important to bear in mind that mathematics evolves and changes rapidly, and constitutes an important part of the engine that makes society and technology progress. Many examples support this assertion. Chaos and fractals have widely surpassed the limits they had at their origins and have spread throughout the scientific world, overcoming the arguments of a few diehard scientists who claimed that chaos amounted to little more than computer error and that chaos did not actually exist. The classical form
  8. 8. Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR Methodology of Intentional Risk Analysis on Internet and Complex Networks 2013 Summer Course of the theory of knots was transplanted to the biochemistry of DNA and various advances in algebraic geometry have allowed developing some of the most known asymmetric algorithms in the field of cryptography. Many areas of mathematical research are now enriched by the active and direct contact with the applied sciences, and it should be noted at this point that a particularly relevant feature of the science from the beginning of the 21st century is the fading of the traditional boundaries of the materials. About fifty years ago the mathematician Paul Erdös proposed a particularly simple approach to communication networks: together with his colleague Alfred Rényi invented the formal theory of random graphs, a network of nodes connected by links in a purely random manner, key event related to the birth of the theory of complex networks. It is precisely in the field of complex networks and the search of innovative solutions for the analysis, assessment and management of intentional risk and fraud prevention, where the talks developed in the course we are presenting have their place, “Methodology of Intentional Risk Analysis on Internet and Complex Networks”, sponsored by the Research Center for Technological Risk Management (CIGTR Centro de Investigación para la Gestión Tecnológica del Riesgo) and which I have the honor to introduce. Theory and applications of complex networks, their use in risk modelling, CASANDRA methodology, advanced techniques for detecting complex fraud schemes in large datasets, and the role of statistical mechanics and information theory as the basis of statistical security, are some of the topics that were addressed in the various papers of the course that are collected in this volume. All these topics constitute part of the brilliant collaboration initiated more than three years ago between BBVA CIGTR and Rey Juan Carlos University and we hope it will have the scope and success whose first results seem to predict, and which has allowed us, among other things, to present an overview of some of the progress made, where creativity in the search for solutions and innovation in the fields of risk management and fraud prevention have lead this collaboration.
  9. COMPLEX NETWORK THEORY: INTRODUCTION AND APPLICATIONS. Stefano Boccaletti, CNR-Institute of Complex Systems (Florence, Italy). Contents of this presentation are available on the official webpage of CIGTR (www.cigtr.info); both slides and videos can be looked up on the official CIGTR channels on YouTube and SlideShare. The objective of this paper is to give a brief introductory presentation of complex networks, focusing particularly on a question that may be of greater interest to you: how, from the data available about a specific problem, one can obtain a complex-network representation. This is a priority, since complex networks are tools and models that allow the treatment of systems composed of many elements that interact with each other. In this sense, there is an abundant literature on the types of indicators that can characterize these systems once we have a network representation; for example, once we have a network, we can classify its different components, specifying which are the most important or the most vulnerable. We will then know where the network has to be protected, where the system has to be defended against possible attacks, or where to act to ensure the desired behavior. So the first step in analyzing the information of a complex system is to transform a huge repository of data from that system into a network representation that is useful for performing a comprehensive analysis. Take for example the network displayed in slide 1. It is the representation of a repository of a social network that shows the collaborations among movie actors. Connections between actors have
  10. 10. Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR10 Methodology of Intentional Risk Analysis on Internet and Complex Networks 2013 Summer Course been established when they have been in the same movie, which allows us to study this social network of this type and, starting from this structure, know which has been the actor who has had a greater influence on a certain genre or how the structure of some movie genres is. Therefore, the idea that arises is how to transform the data repositories into objects of this style (into networks), which subsequently may be of interest for the analysis and information obtaining from the system itself. Slide 4 shows a summary of the content of this lecture. Three forms of networks will be presented: Physical Networks, Functional Networks and Parenclitic Networks. Each form corresponds to a particular class of systems. PHYSICAL NETWORKS A Physical Network is the representation of a system where the system itself suggests what the elements are –which we call nodes or unitary elements– and what are the interactions between them (slide 5). For this reason is called Physical, because it is the system itself that suggests the form of representing it on a network. For example, we could map the information of what is happening in a society into a network considering each individual of the society as a node and each social relationship between individuals (family, work, friendship, etc.) as a link or interaction between nodes. The social relationship can be selected considering the problem to address (study the influence of work relationships in social behaviors, the bonds of friendship with fashions or social trends, etc.). Another example of physical network would be the representation of the Word Wide Web (WWW), although this is actually a virtual network. The WWW is a huge container of web pages. A representation of this in a physical network would have as nodes the documents of the webpages and as links or interactions between nodes the links to other webpages that are included in each page. This complex network is currently the largest we can think in terms of the number of elements it’s made of, since there are hundreds of millions of documents or nodes that are interconnected to each other. The network representations of the transport networks are also important examples of physical networks. For example, the network of air transport could be represented as a network whose nodes are the airports of the world and the links or interactions between nodes would correspond to the flights between airports. Another type of transport network is a road network; we can represent each city as a node and the roads as links. In these cases, the criticalities in the transport of persons or goods can be studied once we have taken into account the network representation from all the information available about flights/travels. As we can see, in all of these examples the system itself suggests what or who the nodes are and what or who the links are in a very simple way.
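To make the idea concrete, the following minimal sketch (illustrative only: the airports and routes are invented) builds such an air-transport network with the Python networkx library; the degree of each node, the quantity discussed next, can then be read directly off the graph.

```python
# Illustrative sketch: an air-transport system as a "physical" network.
# Airports are nodes, flights are links; the routes below are invented.
import networkx as nx

flights = [
    ("MAD", "JFK"), ("MAD", "LHR"), ("MAD", "CDG"), ("MAD", "SDR"),
    ("JFK", "LHR"), ("JFK", "CDG"), ("LHR", "CDG"), ("SDR", "CDG"),
]

G = nx.Graph()            # undirected: a flight links two airports
G.add_edges_from(flights)

# Degree of a node = number of other airports it is directly connected to.
degrees = dict(G.degree())
print("degrees:", degrees)
print("most connected airport (candidate hub):",
      max(degrees, key=degrees.get))
```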
  11. Homogeneous networks and heterogeneous networks. In slide 6 the example of the transport networks (road and air transportation) is discussed in more detail. As we can see, there are fundamental differences between the road and the air transport networks, even though a priori the spatial structure of both is similar (cities are connected by roads, and airports are usually located close to cities). To see the differences between the two networks, let us first define the degree of a node, k, as the number of links connecting that node to other nodes in the network. In the road network shown on slide 6, Kansas City has degree 4 (k = 4), since it has four roads, one going to St. Louis and three others towards Omaha, Wichita and Denver. In addition, we can plot the degree distribution, setting the degree k on the x-axis and, on the y-axis, the number of nodes in the network with that degree, P(k). This degree distribution can also be read as a probability distribution (it tells us the probability that a node of the network has a certain degree). Looking at the degree distributions of the two transport networks shown on slide 6, we see a fundamental difference. In the case of the road network, the axes have a linear scale and the distribution is centered on a specific average value of k, has a relatively small width (sigma) and drops exponentially to the right and to the left of the average value (a Poisson-like distribution). In the case of the air transport network, we have a power-law distribution, plotted with both axes on a logarithmic scale (in this case the probability of finding a degree k is proportional to k raised to minus a given exponent). The first type of network, the road network, is called a homogeneous network: if we choose a node at random, we will find that it has approximately the average number of connections (the average plus or minus sigma, which is practically the average since sigma is usually small). The second type, the air transport network, is called a heterogeneous network, since in it nodes of very high degree and of very low degree coexist (there are airports where many connections are concentrated and which function as hubs, such as the airports of Madrid or New York, and others with far fewer connections, called peripheral airports, such as the airport of Santander). From the point of view of risk, when the network is homogeneous the strategy to study this risk is called random failure. In this case, given that the nodes are more or less equivalent with respect to their degree, it is enough to attack one of them chosen at random and see how the network reacts, because the answer will be approximately the same for any node we choose. On the contrary, in heterogeneous networks we must make a very clear distinction between a random failure and a targeted attack. For example, a terrorist who wants to harm the network will tend to attack hub airports, not peripheral
  12. 12. Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR12 Methodology of Intentional Risk Analysis on Internet and Complex Networks 2013 Summer Course airports. As we have seen, in heterogeneous networks the power-law distribution denotes the existence of few hub nodes (on the right side of the graph), that is, there is a low probability of finding nodes with a high degree or large number of links; however, there is a high probability of finding random (random failure) peripheral nodes, since most are of this type. However, the robustness of the network depends on the targeted attacks to the hubs. Therefore, in order to protect the network, it is important to have a classification of the nodes and know which are the most important. Physical networks. Examples Another type of network would be the Internet backbone. Slide 7 shows how the Internet backbone looks like in the United States in 2001, where each node is a router and each link a connection between two routers. Nodes have a certain color depending on their grade, and as you can see, it is a very heterogeneous network, because nodes with many links (yellow) coexist with nodes on the periphery of the network that, at best, have a single link (red). Next there is an example of physical network in the context of social networks called science citation index (slide 8). Scientific papers that have been published in the magazines refer to other previous papers and, in turn, are referenced in papers that were published later. In this case, each paper is represented in the network as a node and each link as a reference to a paper. From this network mapping we can obtain indicators of the degree of influence of a given paper in a sector, or in the research activities of this sector, even which are the sectors of interest for a scientist to publish something, among other examples. An example with collaborative networks could also be considered. We have previously seen an example that mapped data of movie actors but we can also perform studies with data on scientists (slide 9 - science co-authorship). Using the same index of mentions, it is considered in this case that the nodes are scientists and the links are the collaborations between scientists working together on a project. In this network we could study the extent of influence of a scientist in one or several multidisciplinary sectors. For example, in the field of research on complex networks engineers, mathematicians, physicists, etc., collaborate. They also serve to analyze the ability to interact that the scientific community has. This type of network can be applied to any other society that cooperates in a particular activity. Now let’s talk about ecology. Another example of physical networks is food webs (slide 10), in which the nodes are species and the links are the “prey- predator” relationships. In this respect, in these networks we can see activities of cannibalism in which species eat themselves; these cases are represented by “self-links”. This representation of the network is very important for ecology, since it serves to see what happens to the food web when one of the species is at risk. In fact, it is the basis of multiple new studies on environmental sustainability. Another area of application of the physical networks is epidemiology through, for example,
  13. the study of networks of sexual contacts. In this way one analyzes how a disease spreads in a population. In particular, an experiment was conducted in Sweden in which several men and women were asked how many people they had maintained sexual relationships with in the past ten years (slide 11). We can see that the network is very heterogeneous and its scale-free character is very clear. That is why these studies can be applied to develop vaccination policies for diseases transmitted by sexual contact (vaccination would be targeted at the hub nodes). The examples given in this section are examples of physical networks: in all of them the system itself suggests which elements are the nodes and which are the links. FUNCTIONAL NETWORKS The question that arises with this type of network is how to represent the information of a large number of time series in a complex network. An example would be the information obtained from an electroencephalogram, which is a collection of time series taken in different areas of the brain. Another example would be the evolution over time of the shares of a company listed on the stock exchange (slide 12). The concept of functional network is reflected in its own name: function. Depending on the problem to be solved, a metric is defined, that is, a measurement (for example, the correlation between two time series, or the Pearson coefficient, or phase synchronization, often used in studies of epilepsy) with which groups of time series can be mapped onto a network. The resulting network is "all-to-all coupled", since each node is linked to every other node and the coupling value is the value of the measurement for the corresponding pair of series. Let us take an example: after selecting a particular measurement, two time series, say 1 and 2, are compared on the basis of that measurement, and we get a value that is associated with the link between node 1 and node 2 in the network. In the same way, we apply the same measurement to the rest of the pairs of series until we finally obtain a network with all nodes connected to each other and with the links assigned values on the basis of the measurement used. This type of network is called a weighted clique, that is, a network in which all nodes are linked (all to all) in a weighted way, since each link is associated with a real number, the value obtained by applying the measurement. Once the network is obtained, we want to simplify this set of information to make it easier to work with. To do so, we set a threshold, which must lie between 0 and 1, and transform the weighted functional network into a binary network of zeroes and ones: we set to 1 all the link values above the threshold and to 0 those below it. Starting from such a threshold we thus build a network in which some nodes are connected and others are not. By studying the properties of these networks we can extract information from the original system.
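The procedure just described can be sketched in a few lines of Python (an illustration only, with random toy signals standing in for real time series and the absolute Pearson correlation used as the chosen measurement):

```python
# Illustrative sketch of a functional network: nodes are time series, every
# pair is linked with the value of a chosen measurement (here |Pearson|),
# and a threshold turns the weighted clique into a binary 0/1 network.
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((6, 500))      # 6 toy time series of 500 samples

weights = np.abs(np.corrcoef(data))       # all-to-all coupling in [0, 1]
np.fill_diagonal(weights, 0.0)            # discard self-links

threshold = 0.1                           # the threshold choice is part of the method
adjacency = (weights > threshold).astype(int)

print("weighted clique:\n", np.round(weights, 2))
print("binary functional network:\n", adjacency)
```

The two subjective ingredients mentioned in the talk, the measurement and the threshold, appear explicitly here as the choice of correlation and the threshold value.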
  14. 14. Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR14 Methodology of Intentional Risk Analysis on Internet and Complex Networks 2013 Summer Course These networks have nothing physical. The nodes are the time series and the links depend on the type of measurement that has been chosen. For this reason they are called functional networks, because they have two elements of subjectivity: on the one hand, the type of measurement that gives better information on the system, and, on the other, the selected threshold (when it’s applied you lose part of the system information). Our group has studied a method that offers optimization criteria on the threshold and measurements (see slide 12). The question focuses on defining the threshold and measurement to obtain maximum information about a problem with regard to its resolution. Slide 13 shows an example concerning a work done by our team in Israel aimed at improving epilepsy treatment. There are various methods to combat epilepsy however, there are people who do not respond to drug treatments. In these cases, it is necessary to surgically remove the portion of the brain where the source of epilepsy is located. This method is usually applied in children aged 7 to 14 having attacks every three or four hours and who cannot live a normal life. Before surgery, a test named intracranial electroencephalogram is performed. In this test patient’s head is opened and electrodes are placed on the brain’s surface to obtain information that allow us to know where the source of epilepsy is. The signal provided by the electrodes is clean, free from noise as the electrodes are placed just on top of the brain. The measurement used is phase synchronization on which a phase is associated with each retrieved data and the correlation between phases is measured. We perform that because we know that epilepsy is a phase synchronization problem. In an epileptic attack all neurons in the brain are synchronized at the same time, causing them to do only a single task and the modularity to do various tasks is lost. From the data obtained with the electrodes, we create a functional network (slide 14). This shows that, between two certain instants of time, the phase is in sync and an epileptic attack has occurred. Once the first attack is studied, also the preparation for the next one is analyzed. In the network we can see that the first attack remains latent, i.e., that in some areas the nodes remain in phase synchronization after the first attack, then it gradually increases and triggers. This is the reason why later the entire system is “re-synched”. It’s as if there is an area of the brain that keeps memory of that state and triggers again in the future. So, this is precisely the area that needs to be removed, which acts as a hidden trigger of the system. PARENCLITIC NETWORKS There are a number of systems that cannot be represented as physical networks, or from which we don’t have time series that can be mapped in functional networks. A typical case is blood tests, whose result is static (cholesterol, white blood cells, etc.) - see slide 16. In this case, the question that arises is the following: Can I make a representation that tells me something about a person from these data: If he’s sick or healthy, if you have a particular kind of disease or if you have a specific risk? Another example is the genetic expressions, such as those of a plant. If
  15. 15. 2013 Summer Course Methodology of Intentional Risk Analysis on Internet and Complex Networks Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR 15 it suffers from some kind of disorder during its growth, we can draw a picture of its genes in an instant of time. But, what can we say about it? In general, that can be applied to any medical test, for instance magnetic resonance imaging. These have a very high spatial resolution, since on the scale of a group of neurons of the brain I can do an estimate of the oxygen consumption. However, they have a very poor temporal resolution, since images of neurons can only be taken every ten seconds and these trigger or react in times in the order of milliseconds, which is not suitable for a functional network since it can’t represent the dynamics of a neuron. Another example can be the evolution of the gross domestic product (GDP) of a country, since we get the picture over too long a period (months), which implies too big evolutions over time where it is not possible to make a functional analysis to answer questions such as: which sector of the country is in crisis? how can crisis in a sector affect another? how a crisis in the financial sector can affect, for example, the services sector? Therefore, we have static data that are expressions of variables or characteristics. And the problem happens because we cannot build physical or functional networks from them, since the relationship is not clear (e.g.: cholesterol and blood sugar levels). Parenclisis The concept of parenclisis was defined a long time ago by Leucippus, Democritus, Epicurus and Titus Lucretius Caro (slide 18). It should be emphasized the works of Lucretius, author of De rerum natura. This work that talks about the cosmogonic theory that Democritus and Epicurus had already put on the table (slide 19). The main concept they raised was the indivisibility of matter, composed of components called atoms, word of Greek origin meaning “which cannot be divided”, “a” (not) and “tomos” (divisible). Subsequently it was scientifically demonstrated that the word “atom” was incorrectly used when it was proved that they can be divided. However, they raised the concept of the existence of an essential element or indivisible entity and studied how the world could have been produced from these elements (currently we speak of quarks, Higgs boson, etc.). The cosmogonic theory arose that these elements fell from the void as drops of rain describing parallel paths (i.e., they assumed that they would never converge). To justify the formation of molecules by joining atoms, they posed that, unpredictably, it had to take place a deflection from the path through which an atom meets with another. They called it “parenclisis”, which is a very important concept in the history of philosophy. Slide 19 shows a fragment in English of De rerum natura, where the concept of “parenclisis” is defined as unpredictable swerve. This is the origin of the word parenclisis, which fundamentally means deviation. Let’s see how we used this concept. In our studies with networks we have a certain number “n” of subjects or systems {s1 , s2 ,..., sn }. For example, if we talk about the case of blood tests, we will have millions of people who had the same blood test. Each person is a system or subject. We
  16. consider that each subject is classified according to the problem we want to address (for example, "healthy" and "sick" subjects), so we have m classes to which these subjects can belong, {c_1, c_2, ..., c_m}. Each subject or system i is described by a vector of p features f^i = (f^i_1, f^i_2, ..., f^i_p). For example, in the case of blood tests the features will be cholesterol, blood sugar, etc. Now we make the following hypothesis: each one of these m classes (for example, healthy or sick) corresponds, in the space of the p features, to a link or constraint defined by F(f_1, f_2, ..., f_p) = 0. That is to say, if the subject is healthy, it does not mean that the value of his cholesterol lies between two values, but that the features (the p variables: cholesterol, blood sugar, etc.) of his blood test are such that the function F_healthy equals zero. In mathematical terms this is called a link or constraint (see slide 20), and the function F_healthy is, itself, the model. In other words, the constraint F_healthy(f_1, f_2, ..., f_p) = 0 defines the combination of features associated with a subject of the 'healthy' class, and the constraint F_disease(f_1, f_2, ..., f_p) = 0 corresponds to the 'sick' case. In general there will be m different constraints, one for each of the m classes. However, it is not easy to obtain an exact expression for these functions or models F, since they can have many variables. In the genetic expression of plants, for example, if we have 22,900 genes we would have to write a function F of 22,900 variables, which is practically impossible. Therefore, the idea behind parenclitic networks is to take the space of the p variables or features in all possible planes, in other words all possible two-dimensional projections (for example, pairs of variables such as "blood sugar and cholesterol" or "cholesterol and white blood cells"). For each pair of features i, j (with i, j = 1, ..., p), the population (the subjects) is represented as a distribution of points in a space of dimension 2. Here we can obtain the projections of the constraint, which will be functions F_ij such that F_ij(f_i, f_j) = 0 in the i-j plane. That is, these projections (local models) F_ij correspond to the intersections of the constraint F with each of the planes we have in the space of the features. There are several ways to obtain them, such as polynomial fitting or data-mining methods (Support Vector Machines, Artificial Neural Networks, etc.); with these, the collection or distribution of points is interpolated with a curve that gives us the model for the pair of features (for example, the model of how cholesterol and blood sugar behave in the "healthy" class); a schematic code sketch of this construction is included at the end of this talk. To study the case of each subject and see to which class it belongs, each subject is characterized by the location of its point in that plane (its own blood test) and by its distance from the model that has been obtained. Each point (subject) is associated with a network, where each node is a feature of the blood test, for example, and each link is the value of the distance between the point and
  17. 17. 2013 Summer Course Methodology of Intentional Risk Analysis on Internet and Complex Networks Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR 17 the model in the i-j space. From here comes the concept of parenclisis (slide 21). We are going to see it now in a more clear form with an example (slide 22). We have a three- dimensional space with three features, for example: cholesterol, blood sugar, and white blood cells. We assume that the population, that is, the points (subjects) are located on a link (the green surface - see slide 22). We consider the pairs of planes (“Feature 1” with “Feature 2”, “Feature 2” with “Feature 3” and “Feature 1” with “Feature 3”), and observe the intersections with the links (see bottom left image of slide 22), which are the dotted lines, and which would be the models. For each subject analyzed (see the red dot in the slide) we associate a network, where each node is a feature and each link is the distance to the model (see slide). In each plane we have a different model, we’re putting the distance of all possible models representing all intersections possible if the unique model, which is the constraint that we are imagining. And, why is it called parenclitic? Because it has information of the deviations. Each link contains the information on how much the subject is deflected at the corresponding plane from the normal tendency of those belonging to its class (the dotted lines that are displayed on the slide). The use of parenclitic networks Now let’s see how we can use this information in the case of early diagnosis of obstructive nephropathy. It is a disease that occurs in the newborn that causes significant damage to the kidneys, and since it leads to an obstruction of the urinary tract, it causes urine to go back to these organs. Since babies are not able to speak and point out where it hurts them, it is very important to make an early diagnosis of a disease that is the leading cause of transplantation of kidneys in children. From the analysis of the urine, where parameters such as metabolites and waste are measured, we obtain groups of population who have the disease and groups who haven’t. This is the on/off model that we have talked about, with two classes: healthy group and sick group. If we build parenclitic networks of each of the subjects from the urine analyses, we have networks of around a thousand nodes each. It is noted that networks of normal subjects (i.e., the “healthy ones”) are fairly homogeneous networks, they are random (see upper images of slide 23 - green color). That is quite normal, since if we have a model and a person who belongs to it, the deviation of the model will be more or less random with respect to the population I have. On the contrary, in the case of sick patients, the resulting network is totally different, is star- shaped, that is, there is a central node connected to the rest of nodes (see bottom images of slide 23 - red color). This tells us there is a metabolite that differs substantially, implying that the individual is sick, and that I can isolate the metabolite that is responsible for the disease, given that it shows us where all the differences of the subject are concentrated with regard to the normal subjects
  18. class. Therefore, we have a method that tells us which are the key factors of a particular disease. Next, we are going to focus on something more complicated. We carried out an experiment on a type of plant, Arabidopsis thaliana, which is famous for being the base plant on which genetic experiments are usually performed; its entire DNA has been sequenced. This plant is subjected to abiotic stress. In contrast to biotic stress, in which parasites (living organisms) are used, abiotic stress is used to study phenomena such as heat or cold, salinity, water scarcity, climate change or any other factor that damages the normal behavior of a plant and is not related to living organisms. These phenomena cause significant crop losses. In particular, there is an abiotic stress that interferes with the osmotic activity of the roots. To carry out the experiment, the plant is grown and provided with chemical components that have an adverse effect on its osmosis. We then analyze its 22,591 gene expression levels at six moments in time during the 240 minutes following the stress, so we get six snapshots with the expression value of each of the genes. Later, we build the parenclitic representation and we see that the network is very heterogeneous: we have different stars connected by few links (see slide 24). Thanks to a centrality measure (for example, α-centrality), we can obtain a ranking of the nodes. The center nodes of the stars in the parenclitic network correspond to key genes that regulate the plant's response to this kind of stress. Outcomes are shown for the analysis of each of the six instants of time (see slide 25). This information is very valuable because, taking this case as an example, we found 20 genes responsible for orchestrating the genetic behavior of the plant under this kind of stress. If we supply those data to geneticists, they will know where to genetically modify the plant so that it behaves better under these circumstances, for example, to grow better in barren soils. After this research, we reviewed the scientific literature in this field and found that some of these genes had already been identified in previous experiments. However, we realized that there were genes of which there had been no previous knowledge. From here it was proposed to carry out an experiment creating a transgenic line of the plant in which the expression of one specific gene out of more than 22,000 was blocked. That is, we modify a single component of a huge network, namely the component that the parenclitic analysis predicts to be critical. Slide 26 shows the results of this experiment using photos of the plant and histograms. The picture strip labelled WT (wild type) shows how the plant grows under this kind of stress without any modification. The average root length of the unaltered plant (WT) and of the transgenic plants (with modifications in 7 of the genes identified as key factors through parenclitic networks) was analyzed. The average root length and its standard deviation can be seen in the histograms in the upper part of the slide; no average value falls outside normal conditions. Again, we have shown a parenclitic representation where it has been made clear which are the critical
  19. 19. 2013 Summer Course Methodology of Intentional Risk Analysis on Internet and Complex Networks Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR 19 elements (in this case, genes) fundamental of the system. We have also applied this to human genetics, with data from women who suffer from certain types of cancer. We can identify genes that are risk factors for the development of a given disease in a population. In fact, there are thousands of applications for parenclitic networks, because they allow you to represent systems where there is no evolution over time and extract the key factors of the system. In conclusion, I would point out that we have a network representation tool for any data repository. On the one hand, we have Physical Networks, used in millions of examples since 1999 in fields such as epidemiology. On the other hand, there are also Functional Networks, very popular in econophysics and neuroscience applications. Finally, we also have the Parenclitic Networks, which can be applied to any kind of static data repository. That being said, we have the first step to build a network representation that gives system information.
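The parenclitic construction described in this talk can be sketched schematically as follows (an illustration only, not the authors' implementation: the feature names, the toy "healthy" population and the use of a simple polynomial fit as the pairwise model are all assumptions made for the example):

```python
# Schematic sketch of a parenclitic network: for every pair of features a
# simple model is fitted on a reference ("healthy") class, and a subject's
# network links each pair with that subject's deviation from the model.
import itertools
import numpy as np
import networkx as nx

rng = np.random.default_rng(1)
features = ["cholesterol", "glucose", "white_cells", "urea"]   # invented names
healthy = rng.normal(size=(200, len(features)))                # toy reference class

def parenclitic_network(subject, reference, names, degree=1):
    G = nx.Graph()
    G.add_nodes_from(names)
    for i, j in itertools.combinations(range(len(names)), 2):
        # Pairwise models fitted on the reference population (both directions).
        ij = np.polyfit(reference[:, i], reference[:, j], degree)
        ji = np.polyfit(reference[:, j], reference[:, i], degree)
        deviation = max(abs(subject[j] - np.polyval(ij, subject[i])),
                        abs(subject[i] - np.polyval(ji, subject[j])))
        G.add_edge(names[i], names[j], weight=deviation)       # the "parenclisis"
    return G

subject = rng.normal(size=len(features))
subject[1] += 5.0                       # one feature deviates strongly
net = parenclitic_network(subject, healthy, features)

# The feature whose links carry the largest total deviation sits at the
# centre of the "star" (here it should point at "glucose").
strength = {n: sum(d["weight"] for _, _, d in net.edges(n, data=True)) for n in net}
print(max(strength, key=strength.get))
```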
  21. SCALE-FREE RISK MODELING. Regino Criado, Professor of Applied Mathematics, Rey Juan Carlos University; Víctor Chapela, CEO of Sm4rt Corp. Contents of this presentation are available on the official webpage of CIGTR (www.cigtr.info); both slides and videos can be looked up on the official CIGTR channels on YouTube and SlideShare. Regino Criado The phrase attributed to Socrates, "Human science is more about destroying errors than discovering truths", is a reflection on how we have progressed from the first model, which arose two years ago, up to the result we present today, in which we will talk about the five problems of digital risk management. Víctor Chapela The scale-free risk model is where this investigation is headed. We will share with you the problem we have tried to solve. Many of us who work in security live through these five problems every day, especially when managing and modeling risks, as well as when trying to reduce and mitigate them in the future. PROBLEM 1: "TOO MUCH THEORY AND LITTLE PRACTICE" When someone carries out an intrusion into a system, he normally leaps from one machine to another, from one operating system to the next, in order to obtain passwords that give access to other places. That has a very clear representation as a graph. Thus was born, several years ago, the idea that graphs could represent these intrusions, that they'd even
  22. allow us to automate the hacking and, therefore, automate the way to prevent it. Graphs are a static representation of a series of relationships, but Regino went much further, up to complex networks, which allow us to find new ways for things to interrelate. However, those new forms, which would allow us to have the various components of a network interrelated with each other (IPs, application information, etc.), were much more difficult to model and understand. The first intuitions emerged just over ten years ago, with something called Preferential Attachment, whereby the most connected nodes attract new connections with greater probability than the less connected nodes. On the Internet and in scale-free networks, those most connected nodes are exponentially more connected and this is maintained over time. This means, for example, that the rich become richer, or that Wikipedia is going to have more and more connections and will tend to grow exponentially over time, while web sites with few connections generally tend to stay with those few connections. These mathematical intuitions led us to begin the exploration. Regino Criado Within the classical version of risk measurement as a function of impact or consequences, we can think of availability, integrity and confidentiality, establishing a formula that allows us to calculate that risk. We introduce a scaling constant α and a constant β that measures the convergence of the expression, in such a way that, in order to calculate these parameters, we position ourselves in the case of greatest risk, obtaining for α a value of 4 and for the convergence parameter a value of -0.016. So we have an expression of the risk which, although it is not the classical one, derives from it (see slide 6). In the first proposed approach we had certain elements shaping the risk of a network: value, accessibility and anonymity. We used the exponential of a matrix, an infinite series that gave us the form in which the value is distributed from the source nodes (the vaults) where the information lies. Accessibility, on the other hand, spreads from the source nodes towards the target nodes. In this way the other nodes receive the value conferred on them by a possible access to this valuable information. In short, here we have another expression in which we use a formula that allows us to calculate the probability of making a leap from node i to node j, in which a function appears (slide 8) that sets a parameter with a different value for the type of connection, depending on whether it is due to affinity or is an existing physical connection. We consider the adjacency matrix and the connections graph where, in addition to the real connections that are part of the static risk, we consider the connections due to affinity that are part of the dynamic risk. The expression d is the Hausdorff distance from node j to the set of vaults, that is, the minimum of the distances from a node to a set of nodes already located in the graph.
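As an illustration of the general mechanism mentioned here (the exact formulas, with the constants α and β and the affinity-dependent parameters, are on the course slides and are not reproduced), the following toy sketch spreads value placed on a "vault" node through a small network using the matrix exponential:

```python
# Toy illustration only: spreading value from a "vault" node with the matrix
# exponential of the adjacency matrix. The topology and numbers are invented
# and the slide formulas (alpha, beta, affinity weights) are not reproduced.
import numpy as np
from scipy.linalg import expm

A = np.array([[0, 1, 1, 0, 0],     # node 0 is the vault holding the data
              [1, 0, 1, 1, 0],
              [1, 1, 0, 0, 0],
              [0, 1, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

value = np.array([10.0, 0, 0, 0, 0])   # all the value sits in the vault

# exp(A) = I + A + A^2/2! + ... weighs walks of every length, so nodes better
# connected to the vault end up receiving a larger share of its value.
print(np.round(expm(A) @ value, 2))
```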
  23. Víctor Chapela It is clear, then, that we have resolved many of the equations, but the problem is that the theory is often very difficult to implement. The ideal in these cases would be to present this information in a usable form. This year we have been building software and completing the mathematical modeling, thus enabling understanding from the point of view of someone dedicated to risk management in an organization. With this software we can load data from a network sniffer. Based on this loaded data we map a number of nodes. There are several types of nodes (the source machines from which connections are started, applications, servers). Risk is represented in red on the connections of the chart (see slide 9), on the basis of which ones have more accessibility. The software also allows collapsing the nodes: those nodes going to the same places are collapsed. It is also possible to add value to the nodes. The nodes can be applications, servers, groups of users, etc., and if some of them hold stored data of a certain value, whether intellectual property or credit card information, you can add this value to the node. This value permeates through the rest of the network and changes the values of its different parts; the software recalculates the values and how they are spread throughout the network. Although it is still a version not intended to be used by a system administrator, it aims to tell us which are the routes of higher risk, for example the higher-risk connections between a group of users and a server, from the point of view of how anonymous the person connecting is (anonymity), how accessible the information is (accessibility) and how much value there is at the end point. The way in which we have tried to go beyond the theory is to take it down to applications that no longer show this formulation in an evident way, but that represent graphically and intuitively how we may use this information. PROBLEM 2: "WE MANAGE SECURITY, NOT RISK" What we mean by managing security is closely related to what Carl Sagan said in the Cosmos series, where it was explained that Venus had led to much speculation: Venus is a very bright object, the second brightest after the Moon, but when it was pointed at with the telescope nothing could be seen, only clouds. So in the 1800s people assumed that if there were clouds, they were made of water; if there was water, then it was very humid and there were marshes, and thus there was life; and if there was life, then most likely there were dinosaurs. So the observation was that nothing could be seen, while the conclusion was that there should be dinosaurs. Actually, the clouds are made of sulfuric acid, so it would be rather difficult to live on the surface of Venus, which besides is totally arid. Something very similar happens with how we see security: it is something that nobody sees. The same goes for risk, no one sees it. And you reach the conclusion that more hardware and more software are necessary. You reach a point where the risk that we can't measure leads us to make decisions
  24. from something that we don't know, and it is essential that we understand this. In the book I presented a year ago, reflecting on how security works today, I tried to make an analogy by taking it to a well-known level, the personal one. If I had to make a risk analysis, I would really be making a vulnerability assessment based on the criteria of engineers in the field of security. To perform this type of analysis, based on my criteria (Víctor Chapela's), I would start with a scan. The reality is that today risks are managed when we find faults, although we then try to manage them with the concepts of impact times probability to give them some type of rating. For this analysis of impact and probability we rely on the opinion of experts who evaluate situations based on how the security levels (confidentiality, integrity and availability) are affected, etc. Then come the findings of this analysis, which in my case would be the result of evaluating the various components of Víctor Chapela that could be at risk. For instance, these would say that physical interaction is very dangerous, because there are germs; walking is very dangerous; bathing is forbidden because it's the number one accident type in the world... And then the personal Master Security Plan would include a rating of critical assets together with suggested security controls, where we would firstly pay attention to availability. Therefore, since I could slip and lose the ability to move, I'd need a vehicle so that I cannot fall, another vehicle for the garden, and it'd be necessary to clone myself so that my children would have a spare in case something goes wrong. Additionally, from the point of view of confidentiality, I would need to keep everything hermetic, with a special suit with an integrated toilet to go out on the street, while for my physical integrity it'd be very important to wear a helmet, an armored truck... In companies we act in the same way, that is to say, there is a series of vulnerabilities and we consider the possibilities of resolving them all without having an accurate measurement of how much we are reducing the risk. Now consider this aspect in a clearer way: consider what my likelihood of having an accident is. We have this well measured, and we can know what my policy is depending on my age, how I have driven in the past, etc. Accidents resemble a server or a data center going down (as we have redundancy there, we can reduce the accidental risk). Likewise, diseases are like the viruses that affect our data centers: as we have antibodies, we have signatures for the different viruses that exist in the systems. But where we get confused is with regard to intentional risk. We know that if we go out there with 10,000 euros in cash, we run a higher risk. Nobody has to explain that, we sense it. A potential aggression is then more likely to occur, and that is something usually not taken into account in networks or in digital risk. At BBVA we have spent many years indoctrinating with the concept of intentional risk. I've learned a lot in this regard, for example that intentional risk is something that changes the way we evaluate, because we work on the principle that a potential attacker wants to minimize his risk and maximize his benefit, like any businessman.
The hacker basically seeks to obtain the greatest benefit (the more he can steal, the better) while minimizing the risk he faces.
  25. 25. 2013 Summer Course Methodology of Intentional Risk Analysis on Internet and Complex Networks Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR 25 But this risk has two factors (see slide 47): Anonymity of third parties: how anonymous am I? Although this aspect is only a piece of the component, because when I measure consciously or unconsciously my risk to commit any misdeed, I’m weighing up what are the consequences. That is a constant and does not depend on the area of risk management, but it does at the country level (laws say what I can do and what I cannot). In this context what changes are when they know who have commited the crime (if you don’t know who steals or breaks something, the risk is reduced). Then what we measure in networks is part of the anonymity, which is something that can be measured because there is a property in the social sciences called de-individualization, whereby we perceive less risk if we move as part of a group. Accessibility for third parties: On the other hand we take into consideration the cost. In the case of the networks cost could be how accessible is something? That is to say, how much does it take you to reach the goal? If I have to learn how to hack into a database where valuable information is saved, then it will take more time and more effort, and it won’t be the same as if I already have the username and password. Digital Risk Mitigation The three components we are talking about: value, anonymity and accessibility really come from the field of Game Theory. We have implemented them in the field of Complex Networks Theory in the following way. If we return to the idea of managing risk (see slide 48), we understand accessibility very well. To reduce it, that is, to hinder the access to the information, we can authorize or not people to access, filter out different types of packets/ protocols, encrypt... But we can also reduce the value if we disassociate information and, for example, have half card number, or numbering a user instead of naming it. In the same way, anonymity has to do with authentication. I remember that in Brazil they had problems of identity theft until they start using fingerprint readers. Somehow, these three groups of controls attack different areas in which we can reduce business risk by increasing the risk to the attacker or reduce his profit. Complex networks and intentional risk (static and dynamic) Regino Criado During this year we have been developing the basic elements for the construction of a graph that represents a real network, where the weakest and most vulnerable network points can be located, in a simple way, either by their accessibility, by the value they contain or by the anonymity with which an attacker can enter. In any case, the risk of an intentional attack basically depends on these three very specific variables that must be quantified: value, accessibility and anonymity (all of them from the point of view of the attacker, in the same way as in the cases of game theory).
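As a purely illustrative toy example (this is not the model presented in the course; the multiplicative combination and all the numbers below are assumptions made only to show how the three ingredients can be compared and how a control moves the score), one could rank access routes like this:

```python
# Toy example: scoring routes by value at the end point, accessibility of
# the route and anonymity of the attacker. Invented numbers; the product is
# just one simple way of making the three quantities comparable.
from dataclasses import dataclass

@dataclass
class Route:
    name: str
    value: float          # worth of the data reachable at the end point
    accessibility: float  # 0 = very hard to reach .. 1 = trivially reachable
    anonymity: float      # 0 = fully identified  .. 1 = fully anonymous

    def risk_score(self) -> float:
        return self.value * self.accessibility * self.anonymity

routes = [
    Route("internet -> web app -> card database", 100.0, 0.6, 0.9),
    Route("branch LAN -> file server", 30.0, 0.8, 0.3),
]
for r in routes:
    print(f"{r.name}: {r.risk_score():.1f}")

# A control acts on one ingredient: e.g. tokenizing card numbers cuts the
# value at the end point, and the route's score drops accordingly.
routes[0].value = 10.0
print("after tokenization:", routes[0].risk_score())
```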
26. 26. Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR26 Methodology of Intentional Risk Analysis on Internet and Complex Networks 2013 Summer Course

We identified two types of risk in this environment: static risk, in which the authorized routes are used to access the value and which derives from the overconfidence placed in someone who can access certain privileged data or certain values; and dynamic risk, which has to do with the use of unauthorized routes.

Víctor Chapela

The second problem that arose, therefore, was how we could manage these different risks. Here we learned a great deal when Santiago Moral brought game theory to the table as a way of understanding the risks of an organization. What we have done is to advance along another line, different from the one originally modeled in, for instance, Casandra. We worked on trying to implement some of those same elements through small variations in mathematical models that would enable us to understand the risks in a data network.

PROBLEM 3: "REDUCTIONISM AND DETERMINISM DO NOT APPLY IN DIGITAL RISK"

As for the third problem, we tend to try to reduce digital risk to something deterministic, so that we can manage it with existing tools, when, on the contrary, we are facing a complex problem. The difference between a problem that can be reduced to a rule, or a series of deterministic rules, and a complex one is that complexity normally does not lend itself to reduction. At this point there is a saying that I really like. George E. P. Box says that "in essence all models are wrong but some are useful". I believe that the useful ones are those that best represent what we see in the reality with which we have to interact.

We are entering an era of complex problems to which mathematics is applied; most of the simple problems already have a model and we know how to manage them. Complexity lies elsewhere: the economy and computer security are being analyzed as complex problems from the mathematical point of view, where the variables are many and the relationships among them are not linear. We live in a world equipped with hardware that has its limits, and with complex systems (climate, etc.) that are difficult to understand and model with conventional equipment. In the beginning, human beings tended to deal with phenomena governed by very deterministic physical laws, which were even very easy to calculate (the architecture of 3,000 years ago and that of today are based on the same principles: different materials but the same principles). When the digital world changed that, human beings were for the first time faced with a new world of high complexity because of the relationships among its billions of elements. The number of interrelationships is very large, and technological globalization has made everything complex (even Economics and Computer Science have become complex sciences), and that is very interesting.

I grew up with the illusion that computing was something predictable, something controllable, and it was controllable because I lived in my little world (my computer). But we have lost control of that world: when we connected it with the rest of the universe we went from having deterministic computers, where the
27. 27. 2013 Summer Course Methodology of Intentional Risk Analysis on Internet and Complex Networks Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR 27

source of a failure could be located easily when it happened, and linear computers, where no processes ran in parallel, to a non-deterministic world with so many more variables that today we don't really know what happens inside our machine (for example, PowerPoint fails and we don't know why). One of the ways in which we manage this new type of uncertainty is by rebooting the machine, trying to return to known initial conditions. If that goes wrong, we reinstall everything. When reinstalling or restarting, as in chaos theory, we can sometimes predict the first oscillations, and that is what happens to us: we turn on our machine and try to restore it to a known state. This is my own rather everyday demonstration that machines are chaotic in themselves. And if, on top of that, all the machines are interconnected and we consider the different levels (even the socket level)... Basically, as the number of connected digital machines grows linearly, the number of applications and routes or connections, the ways in which I can be hacked, increases exponentially.

The reductionism in which we had lived until today no longer works. What's more, it has never worked very well, at least in computing. What we have always wanted to do is analyze. We classified the dinosaurs, not in the sophisticated way we classify genes today, but as the classifier saw fit. That was an attempt to analyze things exhaustively. But if we take a Volvo made of Lego and I give you the 201,000 parts that make up the car, you won't be able to reconstruct it. It doesn't matter how perfectly I understand each piece; I have to know how each one is related to the others. These relationships can be represented perfectly by a graph or a complex network. If that structure also shifts over time, and the relationships are not linear as they are in the Lego model, so that one piece can be the exponential of another, then the synthesis of these different pieces becomes much more complex. Today those pieces are already sliced and diced; we already have the data, in the banks, in the cloud. What we have to do now is to re-synthesize those elements into new knowledge.

And it is precisely there that the digital world puts us to the test, because, being exponential, it gives us a new order of things: scale-freeness, which starts precisely with fractals, where any network fragment looks the same as any other fragment at another scale. I think that in fields such as digital networks or economic networks the scale is free, as in fractals. That is why we originally called today's lecture "scale-free risk", because in the end we believe that the studies we are working on can be applied from what is happening inside a computer, or in a home network, up to what is happening in a network such as the Internet. This will allow us to measure, both relatively and absolutely, the risk differences among all these networks. What is the main difference? That the probability (red line, slide 87) of something deviating four standard deviations is much greater than in the other curve, where it is practically infinitesimal. The usual way of measuring digital risk is also based on a normal distribution, which is the basis of the mathematical theories used today in the economy, the health sector, etc., and it tries to normalize
28. 28. Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR28 Methodology of Intentional Risk Analysis on Internet and Complex Networks 2013 Summer Course

something that is exponential. Therein really lies our problem. As the number of nodes grows exponentially, the value and the complexity grow. Traditional statistics are not enough, and it is not possible to predict the risk. In the models that we consider, we assume that the greater anonymity of the Internet, together with the accumulation of value and the complexity of a growing number of access points to that information, causes the risk to rise exponentially; and with these new risks we need new tools.

REDUCTIONISM (linear and non-linear systems)

Regino Criado

How do we respond to complexity? We respond by working with complex networks, performing a rigorous analysis of the various components that make up the network and of their interactions, connections, etc. Young children are used to the world being linear: the whole is the sum of the parts. That way of thinking, known as reductionism and opposed to a holistic vision, prevailed in science up to around the 1970s and 1980s. The idea is that you can analyze a system by breaking it down into its parts, analyzing each one separately and then, when putting them back together, obtain a more or less correct picture of the system's global behavior. That is true when the system is linear. In this sense, there are certain phenomena, such as a salt crystal or star clusters, that allow a fairly simplified analysis from the mathematical point of view: in the salt crystal, for example, each component interacts only with its neighbors, so it forms a mesh, and we can carry out that analysis because all the particles behave in the same way.

There are other models, the non-linear ones, where analyzing each element of the system separately is not of much use. I can understand very well how a neuron performs its synapses and how it is connected to the others, but going from there to understanding the mechanisms that allow us to speak, or how memory works, is a huge leap that has to do with the emergent behavior of the network structure. Another example is the difference between genomes. If we compare the genomes of different organisms, genome meaning the set of genes, humans and other primates share 99% of the genome (of the number or set of genes), but not the connections that allow some genes to inhibit the behavior of others when the network is configured. It is in the complexity of that network where the big difference lies. On the Internet, with 800 million nodes interacting among themselves, the behavior is non-linear, and that is why complex network analysis is so important.

COMPLEX NETWORKS AND GRAPH THEORY

First we have to refer to the difference between a complex network and a graph. A graph is understood as a set of elements that we represent as nodes (points) together with edges (links) that represent the interactions between those nodes.
29. 29. 2013 Summer Course Methodology of Intentional Risk Analysis on Internet and Complex Networks Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR 29

Graph theory was born at the hands of Leonhard Euler, when he tried to solve mathematically whether it was possible to traverse the seven bridges of the city of Königsberg (today Kaliningrad) and return to the starting point without crossing the same bridge twice. Euler is still recognized today as one of the most important mathematicians of all time; in fact, he is the one who published the greatest number of original pages of mathematics over his lifetime. Only Paul Erdős comes close to him: Erdős, who died in 1996, wrote 1,475 original papers with 493 co-authors. In 1960, Paul Erdős and Alfréd Rényi proposed a theory to explain the evolution of graphs.

When we speak of complex networks, the essential thing is the scientific revolution that is taking place around this concept, because an increasing number of technological and natural networks, despite being very different, show great similarity in their structure, and this structure is linked to their function. These networks are characterized by a few highly connected nodes and many poorly connected ones. If, instead of plotting this in logarithmic coordinates, we do so in the normal plane of the variables, the chart resembles a branch of a hyperbola in the first quadrant. In addition, these types of networks, both natural and technological, have a very similar structure characterized by what are called "small-world structures", which respond in a similar way to small perturbations. Such networks consist of small nuclei (worlds) that are highly connected internally and weakly connected with the rest (small communities), and the distance between them is relatively small (think of the six degrees of separation in the universal network of contacts). That is, the number of hops we need to take to get from one node to another is very small.

However, the fundamental difference between complex networks and classical graph theory lies in the network size: given the huge amount of data, the computational tools that must be used require optimizing the algorithms employed to calculate the various parameters (slide 106). Interest in the use of complex networks covers fields such as technological networks and biological, economic and social networks. All of them respond to this same structure (slides 108-109). From the mathematical point of view, a graph can be represented by a matrix of ones and zeroes (see slide 115). On slide 116 you can find the most important definitions relating to the parameters of a complex network, such as the degree of a node, the length of a path and the geodesic distance or shortest path between nodes, among others.

Víctor Chapela

What we achieve by integrating complex networks is to address precisely the problem of complexity, that is, the millions of interrelations among the nodes. We also had to integrate elements of game theory, and we needed to do something practical.
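As a concrete illustration of the graph parameters just listed (the adjacency matrix of ones and zeroes, the degree of a node, and the geodesic or shortest-path distance), here is a small self-contained sketch. The toy graph and the use of the networkx library are illustrative assumptions and are not taken from the course material.

```python
# Small toy graph to illustrate adjacency matrix, node degree and geodesic distance.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"), ("D", "E")])

# Adjacency matrix of ones and zeroes (rows and columns follow `nodes` order).
nodes = sorted(G.nodes())
adjacency = [[1 if G.has_edge(u, v) else 0 for v in nodes] for u in nodes]
print(nodes)
print(adjacency)

# Degree of a node: the number of edges incident to it.
print(dict(G.degree()))

# Geodesic distance: the length of the shortest path between two nodes.
print(nx.shortest_path_length(G, "A", "E"))
```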
30. 30. Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR30 Methodology of Intentional Risk Analysis on Internet and Complex Networks 2013 Summer Course

PROBLEM 4: "WE MANAGE DIGITAL RISK IN A TRADITIONAL AND SUBJECTIVE WAY"

A human being would not be able to classify everything in the case of, for example, the assessment of impact and probability based on the different security dimensions (confidentiality, integrity and availability). A good example of this takes us back to the early stages of the Internet, when there were two different ways of classifying web search results: the Yahoo style, which was manual, with hundreds or thousands of people classifying different web sites based on keywords specified by the users, and Google, which used mathematics based on graphs, or rather complex networks, to work out which site the Internet user most likely wanted. To determine which sites were the most popular, Google used all the hyperlinks pointing to each page, which reveal the popularity and influence of each site within a group. Of course this was much more accurate than manual classification. We wanted to achieve that same effect at the level of risk, so that nobody had to classify control by control, machine by machine, software by software, etc.

To do this, we divide the risk into two types: static and dynamic. Static risk relates to any person who has authorized access to the information, that is, an employee, a customer, a system administrator or a provider who accesses through an external network. The likelihood that such information will be stolen has to do with the anonymity of the accessor and the value of the information. As for dynamic risk, we return to the original problem that gave rise to all this, since it has to do with the probability that someone without authorized access could hack in or steal an identity to reach the valuable information by finding the vulnerabilities inside the network.

In order to determine the static risk, we saw that there was a lot of information in the logs of an organization's devices. Of course we faced the problem that this information was very partial: a server's logs refer only to that server, and since we really needed all of them in order to correlate data, it would have been relatively difficult to consolidate all that information. So we opted for network sniffing, a process through which a computer sees the packets that pass through a network segment. This allowed us to see all the routes that were in use, all the permitted ones, and therefore the existing controls were already implicit in the network: nothing moving through that network bypassed those controls. We then found that there were two types of nodes that could be classified automatically: the IPs of the source connections and the target applications where the connections arrived (IP address and port number). Taking these two types of access into account, we had an access frequency both for end users and for technical users, which gave us two different weights for each edge (i.e. how often a normal user gains access and how often a technical user does, the latter usually having the option of taking all of the value in the application or system). With these two types of user, we proceeded to the construction of the graph; a small sketch of this construction is shown below.
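A minimal sketch of how such a graph could be assembled once the sniffed traffic has been reduced to (source IP, destination IP, destination port) records. The port list used to separate technical users from end users follows the SSH/HTTP rule of thumb Víctor Chapela mentions below; the function name, the example data and the use of networkx are illustrative assumptions.

```python
# Sketch: build a directed access graph from sniffed (src_ip, dst_ip, dst_port) records.
from collections import Counter
import networkx as nx

TECHNICAL_PORTS = {22, 3389}   # e.g. SSH, RDP -> technical users (assumption)
# Any other port is treated here as an end-user application (e.g. 80/443 for HTTP/S).

def build_access_graph(connections):
    """connections: iterable of (source_ip, dest_ip, dest_port) tuples."""
    frequencies = Counter(connections)
    graph = nx.DiGraph()
    for (src, dst, port), count in frequencies.items():
        target = f"{dst}:{port}"                 # one IP:port node per target application
        kind = "technical" if port in TECHNICAL_PORTS else "end_user"
        # Edge weight = observed access frequency; user type stored on the edge.
        graph.add_edge(src, target, weight=count, user_type=kind)
    return graph

sniffed = [("10.0.0.1", "10.0.0.5", 443),
           ("10.0.0.1", "10.0.0.5", 443),
           ("10.0.0.2", "10.0.0.5", 22)]
print(build_access_graph(sniffed).edges(data=True))
```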
31. 31. 2013 Summer Course Methodology of Intentional Risk Analysis on Internet and Complex Networks Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR 31

Regino Criado

As Víctor has said, we have IPs and IP-ports, two types of nodes that we will represent in a graph. And we will do it with an example that brings together all the complex features involved in setting the attributes of accessibility, anonymity and value that must be assigned to both nodes and edges. From the network sniffing we see that certain IPs connect to certain IP-ports. If we sniff for a period of time, we see how many times access has taken place from one site to another. Each source IP is a node, and we add another node for each destination IP-port (second column on slide 129).

Víctor Chapela

One way to distinguish technical from non-technical users is the type of port to which they connect. For example, if it is an SSH port we assume it is a technical user; if it is an HTTP port we assume it is an end user.

Regino Criado

Our goal is to provide all the elements of the graph with anonymity, accessibility and value. The value is initially assigned to the vaults, those nodes where the valuable information lies. With the software we developed we try to find the elements where the risk is highest.

Static risk: Anonymity

Anonymity is based on the concept of deindividuation that Víctor discussed before: it is considered that the more surrounded by people an individual is, the more anonymous he feels. We have modeled this using a type of collapse that is defined on slide 135. A value of anonymity is assigned to the nodes that collapse, that is, the nodes that are connected to the same elements. For example, on slide 136, nodes IP1 and IP2 both have access to IP5:p1; therefore they collapse and are assigned an anonymity value equal to 2. In addition, node IP1 does not necessarily belong to a single collapse (it is not an equivalence relation), as you can see on the slide: node IP1 is also connected to IP6:p3, just like nodes IP2, IP3 and IP13.

Víctor Chapela

There is an important aspect that cannot be seen in the model shown on the slide. Here we are trying to measure scale: consider an ERP, a company's resource management system. If a thousand people connect to that same system, they are much more anonymous than the ten, out of those thousand, who connect to the treasury system. In reality, when a large number of people access the same application, they perceive themselves as more anonymous.

Regino Criado

The frequencies of the collapsed nodes are added together; see slide 137.

Static risk: Value Assignment
32. 32. Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR32 Methodology of Intentional Risk Analysis on Internet and Complex Networks 2013 Summer Course

Once we distribute the anonymities through the collapsed graph, we look for ways of distributing the value in the case of static risk (see slide 142). To do this, we understand that if valuable information lies in a node and there are multiple accesses to it, each access reaches only part of that value, so the value is weighted by the inverse of the anonymity. Last year we explained that, starting from the vault, we originally used the exponential of the matrix to distribute the value. However, we detected a certain redundancy that had to be removed in a concrete example. In any case, there is a way to distribute the value; we have called the algorithm "max-path" (see slides 139-153).

Static risk: Accessibility Allocation

For the accessibility parameter, a value has to be associated with edges and nodes. To do this we use the paradigm of the random walker, the PageRank algorithm used by Google to calculate the centrality of the most important nodes. What do we mean by the random walker? If we release a walker that moves across the network, there are certain nodes through which it passes more often, and those will be the most important. In this sense, we consider a damping factor of 0.15 for the user graph and 0.25 for the administrators' graph (that is, roughly 6 connections before making a random hop in the first case and 4 connections in the second). Finally, we have a value distributed over each source node, an anonymity for each edge and an accessibility for each edge.

DYNAMIC RISK: MODELING

Víctor Chapela

In order to assess dynamic risk, in addition to taking the sniffing into account, we also perform a vulnerability scan that gives us new potential routes by which a sophisticated attacker could access a machine with a certain vulnerability. Each vulnerability thus turns into a new access, to which we must add another type of access that a hacker manages to obtain: access by similarity. If I get into an HP-UX machine that has a certain configuration and a certain set of administrator users, the chances of gaining access to other HP-UX computers with the same configuration are very high (for example, chances are that all of these HP-UX machines are managed by the same group of people or have the same kind of bugs or configuration problems). That is, if you have access to a computer with certain characteristics, it will be easier for you to access other similar equipment. Vulnerability scanning provides us with the information we need: it tells us which versions are installed and which ports are open, whether in use or not.

Dynamic risk: Value Assignment

In the value assignment we use the same method as in the static risk. The problem with dynamic risk was that all the hacker's accesses are administrative and potential; they are not real.
33. 33. 2013 Summer Course Methodology of Intentional Risk Analysis on Internet and Complex Networks Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR 33

In other words, if I have a database and a web server, and that web server accesses the database as administrator, then a person with access to the web server could potentially reach the data through that interface, since it queries the database every time. In our model, the value in my database therefore moves to the web server, or moves to the administrator's machine. That is how we know how much value the administrator can access: we calculate how much of that value reaches his machine. With this same idea, if we connect everything with everything via vulnerabilities and via similarities, we suddenly find that the value of the whole network is the same everywhere, so the value no longer tells us anything. We therefore return to the original value we calculated and assume that the attacker will want to reach that value through the shortest path. We will see that now.

Dynamic risk: Anonymity

In the dynamic risk we do not recalculate the value, but the anonymity does change. Anonymity is important for the attacker, and it is important only for his first connection, because once he connects somewhere he can usually take a username and password and reconnect with them, or use any of the existing connections within that machine. So the moment at which someone is initially anonymous is the only moment in which the attacker faces a higher or lower risk. Thus, we consider three groups or collectives of anonymity:

1. If I come from the Internet, the risk for the attacker is very low because it is very difficult to identify him. For example, from the logs we will never know who used a given IP at eight o'clock in the evening at a Starbucks in Moscow, so in these cases the anonymity is very high. There is the added problem that he can come from any country, so there may be no consequences if, for example, there is no extradition treaty; even if we knew who the attacker was, there would be no way to take measures against him with personal consequences.

2. Then there is the internal Wi-Fi and the access of external suppliers or third parties. These carry an intermediate risk: if you are connected to a wireless network that leads into the internal network (if it is a wireless network that goes out to the Internet, it is measured as if it were the Internet) you still have high anonymity, because you might be off-premise and connecting to the internal network; and if you are working at a supplier, you might not have a direct contract with those ultimately affected.

3. Finally, on the internal network we assume that, if there is a contractual relationship, you are identified by your username and password and we know to whom the machine belongs, so the attacker's anonymity is much reduced.

Dynamic risk: Accessibility

Regino Criado

Accessibility is important because it is what makes the difference between the dynamic and the static risk.
34. 34. Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR34 Methodology of Intentional Risk Analysis on Internet and Complex Networks 2013 Summer Course

Therefore, in order to calculate accessibility in the context of dynamic risk, we first construct a new graph that contains the authorized routes and the nodes already established, plus new links that are reconnections by similarity (if two systems have the same operating system, the same administrator, etc., we set a new link between them); they are evolving networks, so to speak. What we need, following the same example of the random walker (the PageRank algorithm used by Google), is this: I am at a node and that node has five connections, that is, five possible places to go. First I roll a die to see whether I use one of those five connections or take a random hop, and then I roll another die to see which of those nodes I go to. What is the problem? That by taking a random hop I could end up anywhere, so we have to use PageRank with a personalization vector. It is actually a Markov chain, a stochastic process, in which, when we are at node i, the probability of hopping from node i to node j is given by the expression pij shown on slide 169, where f(i,j) takes two values: α if it is a connection that already exists in the static risk, and β if it is a potential connection, with 0 < β < α, and β must be proportional to the entropy of the system (a small sketch of this kind of biased walk is given below).

Víctor Chapela

At the end of the day, since it is no longer enough to know which path I could take, because now there are potential routes that have never been used, we give preference to the routes that have already been used by authorized persons; if a route exists, chances are that an attacker who comes along will take an existing route. Second, we give preference to the shorter paths in the network. The attacker wants to reach the value (which, as we said, we do not recalculate), so he will want to take the existing routes and the shorter paths. Now, if there is no short path or no existing route, he may want to hop directly to the server if there is a vulnerability, and maybe he can do it without going the long way around. That possibility is still expressed in the random walker, but with a lower probability than taking the best-known routes.

Regino Criado

And as Víctor said, another important aspect to take into account, when considering a possible hop, is the knowledge that the hacker may have about the network structure, which is measured as the distance from the node to the set of vaults; the logical thing for the hacker is to follow the path that brings him closer to the value he wants to obtain.

Víctor Chapela

The most valuable part of everything Regino has mentioned is that we arrived at mathematical formulas that then became an application. With standard information it gives us a view of our network risk; the value we assign to each application is the only thing we put in, and everything else, both static risk and dynamic risk, is calculated on top of that.
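As a rough illustration of the biased walk described above, the sketch below gives weight α to routes already observed in the static graph and weight β to merely potential routes (links added by vulnerability or similarity), with 0 < β < α, and normalizes them into one row of the Markov transition matrix. The concrete numbers, node names and the suggestion of feeding these weights into a personalized PageRank are assumptions; the exact expression on slide 169 is not reproduced here.

```python
# Sketch of the biased transition probabilities p_ij: existing routes are
# preferred (alpha) over potential routes opened by vulnerabilities or
# similarity (beta), with 0 < beta < alpha.
import networkx as nx

ALPHA, BETA = 1.0, 0.2   # illustrative values only (assumption)

def transition_row(graph, node):
    """Return the transition probabilities from `node` to each neighbor."""
    weights = {}
    for _, neighbor, data in graph.out_edges(node, data=True):
        weights[neighbor] = ALPHA if data.get("kind") == "existing" else BETA
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()} if total else {}

g = nx.DiGraph()
g.add_edge("web_server", "database", kind="existing")   # authorized, observed route
g.add_edge("web_server", "backup", kind="potential")    # route opened by a vulnerability
print(transition_row(g, "web_server"))

# A fuller model would feed such weights into a personalized PageRank,
# e.g. nx.pagerank(g, alpha=0.85, personalization={...}), biasing the random
# hop toward nodes closer to the vaults, as described in the text.
```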
35. 35. 2013 Summer Course Methodology of Intentional Risk Analysis on Internet and Complex Networks Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR 35

Using the intrinsic properties of the network to calculate anonymity, accessibility and the dispersion of value within the network allows this approach to scale very well, because we only have to say where the repositories of value are. And there is another part that we see as a challenge for the coming year, which I will mention now: modeling the additional controls that exist within the network.

PROBLEM 5: "IN GENERAL, WE ENGINEERS MANAGE DIGITAL RISK"

We engineers are taking care of security and risk management. I think we have many virtues, and some faults too. For example, if we consider a glass of water like the one shown on slide 172, optimists say the glass is half full, pessimists say it is half empty, and engineers say the glass is twice the size it needs to be; that is, it should be smaller, it is a problem of poor glass design. From their point of view, engineers look at the solutions and do not necessarily look at whether the glass is full or not, since that does not matter to them: it is oversized. Fortunately, since engineers are involved, we can build good software. I think that the most important value of this project is that we are not just technologists working in isolation. On the one hand we have incorporated the experience of the Bank itself, and on the other hand highly valuable people like Regino, Miguel and even Angel, who has been involved almost since the beginning; what they have done is to enhance and leverage things that we would perhaps not have been able to do on our own.

Regino Criado

This is the result of the collaboration of four entities: Smart, Rey Juan Carlos University, the Research Center for Technological Risk Management and BBVA. Our intention is to extend these efforts to other areas and to emphasize the multidisciplinary approach in solving this kind of problem, involving areas such as discrete mathematics, computer science, probability, statistical mechanics and nonlinear differential equations. The working group consists of theoretical physicists, computer experts, biologists, sociologists and, of course, mathematicians. And in any case it is clear that this leads to a paradigm shift in the way of conceiving risk.
37. 37. 2013 Summer Course Methodology of Intentional Risk Analysis on Internet and Complex Networks Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR 37

Today we have something approximating one smartphone, or one device that can connect to the Internet, for every person on the planet, give or take seven billion. Did Thomas Watson get it wrong? He did: "there is a world market for maybe five computers"; today it is closer to five computers per person. Now the scary thing is that when someone buys a new car, it has about thirty computers built in, and most run Windows. So the amount of computing power that goes into everything we use today, constantly being added to everything, built even into objects none of us would consider to be computers, is immense and getting bigger. You only have to look at how far things have changed. It took twenty-four days for Google+ to reach 20 million users, while Facebook, which now has approximately 1 billion users, took two years to reach 20 million. The pace at which things are changing, the pace of the adoption of technology, is getting quicker. Last year a German user made the 25 billionth download from iTunes, which means an average of almost four downloads for every person on the planet, and the number is getting bigger. However, it is not only the pace of change; the pace of innovation is fast as well. That is why predicting is so difficult and, at the same time, so much fun. But do we do it well?

Adrian Davis
Principal Research Analyst. Information Security Forum (ISF)

Contents of this presentation are available on the official webpage of CIGTR www.cigtr.info. Both slides and videos can be looked up on CIGTR official channels in YouTube and SlideShare.

THREAT HORIZON 2015: MORE DANGER FROM KNOWN THREATS
38. 38. Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR38 Methodology of Intentional Risk Analysis on Internet and Complex Networks 2013 Summer Course

What's the point? We do it for a particular reason: organizations, organizations like yours, are also looking to the future, just to know what the world will look like in a couple of years, what the technologies will be and what problems may come out of them. Without a doubt, technology is only a small piece of the picture in our companies. The most important thing is cash. If the company doesn't have cash, it can't do anything; it's as simple as that. Likewise, information is the nervous system of the business: if you don't know what is happening, which is information, it is not possible to react, to plan or to manage.

In this sense, all businesses have a reputation to protect. Let's see some examples of mistakes in managing that reputation. About five to ten years ago there were cases in Belgium of people becoming ill after drinking Coca-Cola. The company ignored it, thinking it wasn't their problem, and it got worse. The trace of the contamination led to a bottling plant in Belgium, which was a franchise operation. Because of the way they handled that problem, the consumption of Coke in Belgium went from approximately half of the market to under ten percent. It has not yet recovered to even a quarter of the Belgian market. So that's one of the reasons why we do the Threat Horizon. What will affect my reputation today and tomorrow? What can I do today to protect my reputation tomorrow?

We call it our crystal ball. Every year about this time my team and I get a chance to dream, to ask ourselves: what is the world going to be like in two years' time? What will happen with the economics and politics of the world? Let's think about the growth of Africa, for example. If Africa grows at the rate it's currently doing, there'll be five hundred million more mobile phone subscribers in two years' time. Now, if anybody could write an app that can predict the African weather, I sense they're going to make a huge amount of money, because all you have to do is sell it at 1 cent… and five hundred million times one cent is still a lot of money. We're trying to think about the things that will impact the business, the information security, the information risk or the risk professionals, and that you may need to plan for over the next 24 to 36 plus months. And we use the PLEST model (political, legal and regulatory, economic, socio-cultural and technical), which many business professionals are very familiar with.

Like I said before, we ask multiple questions: How will the Arab Spring play out? What is going to be the effect of, for example, bringing half a billion new mobile phone users on board? What are the socio-cultural impacts of declining birth rates in China as well as in the West or Russia? What kinds of technologies will break through over the next couple of years? To answer those questions we talk to a lot of people: experts from the ISF's member organizations; we ask questions on the Internet, we search out people called futurologists, we go to academics like INSEAD… And then, based on what we know and our own expertise, we draw out the big ideas and concepts, and then the ten of us all sit in
39. 39. 2013 Summer Course Methodology of Intentional Risk Analysis on Internet and Complex Networks Centro de Investigación para la Gestión Tecnológica del Riesgo CIGTR 39

a room and just throw the ideas around. Then we take them out and give them back to our membership, to our experts in the real world. We get something in the region of six or seven hundred people helping. We normally get two to three thousand ideas a year about how the world will change, and from that we condense it down into twenty or thirty pages, into ten big ideas that form the Threat Horizon. So it's not just me; it's a big multidisciplinary, global effort to produce this report. And that's what we are going to talk about.

So let's start with our predictions for 2013, which is this year, and see how they went. One of the things we really picked up on was how much more regulation was coming. So, for example, we now have the big issues of the European Data Protection directive, the directives on cyberspace and cybersecurity, and of course the American Dodd-Frank Act and many others still to come. We thought governments were going to get involved in everything. States are going to start attacking non-states. States are going to start hacking states, which actually is pretty obvious, because states have done that for thousands of years. We talked about things like breach notification laws and the idea of digital human rights, such as the right to be forgotten or the right to privacy, which we are starting to see coming through in laws now. We talked about IPv6 being an issue. But actually it isn't yet; IPv6 is one of the big things that are still going to happen. And we talked about the rise of Africa. Those are some of the big trends that we thought were coming through. Did we get them right? I would probably say we were about 60% or 70% right. Some haven't matured as quickly as we thought they would. Some have rocketed in from nowhere.

In terms of information security, one of the problems we worried about is data leakage. Somebody posts something on the Internet and the information my company wished to keep secret is suddenly distributed globally. In fact, I would probably say that the most famous example of data leakage is a gentleman called Mister Edward Snowden, of no fixed abode or country yet. We talked about new e-crime opportunities. If criminals became business people, we'd be out of a job, because the criminal fraternity is incredibly good at dreaming up new ideas to make money. The only problem is that they don't have to worry about a legal framework to do it, which is why they can move a lot quicker than we can. We also talked about security in the supply chain, the revolution in devices, and things such as hacktivism.

So we came up with what we call the Threat Radar. We had this idea of what I can manage: the things I know about and can do something about, such as deploying information security controls and teaching people not to do stupid things. But there are also things I can't do anything about, such as espionage by governments. You know they do it, but you may not have the resources or the skills to stop them. Then there are things that I don't know about and can't do anything about either, for example when you
