Social and economical networks from (big-)data - Esteban Moro II

COMPLEX NETWORKS: THEORY, METHODS, AND APPLICATIONS (2ND EDITION)
May 16-20, 2016

Published in: Science

  1. Social and economical networks from (big-)data. Esteban Moro, @estebanmoro. NTMB, Lake Como School, May 2016
  2. Summary: 1. Intro to Social/Geo Big Data; 2. Sources of Social/Geo Big Data; 3. Tools for Social/Geo Big Data; 4. Applications of Big Data in Social and Economical problems; 5. Outlook
  3. Mobile phone data. 4. Applications of Big Data in Social/Geo Problems
  4. Understanding human behavior. [Diagram labels: Behavior, Observation]
  5. Human behaviors spread in networks: social contagion.
     Excerpt from Dasgupta et al. (2008), cropped on the slide: "[...] measure the relative topological overlap of the neighborhood of two users A and B, representing the proportion of their common friends, as O_AB = N_AB / ((K_A - 1) + (K_B - 1) - N_AB), where N_AB is the number of common neighbors of A and B, and K_A (K_B) denotes the degree of node A (B). Fig. 3(d) demonstrates the effect of removing links in order of strongest (or weakest) overlaps. In both cases, we find that removing ties in rank order of weakest to strongest ties will lead to a sudden disintegration of the network. In contrast, reversing the order shrinks the network without precipitously breaking it apart. [...] locally disintegrate a community, while the removal of the weak links will delete bridges that connect different communities, leading to a network collapse. Further, we believe that the observed local relationship between network topology and tie strength affects any global information diffusion process (like churn). In fact, we opine that churn as a behavior can be viewed less as a dyadic phenomenon (affected only by strong churner-churner ties), but more as a diffusion process where both strong and weak ties play a significant role in spreading the influence through the network topology."
     [Figure (a): probability vs. number of churner neighbours, for May, June and July churners. A second panel has the legend 3, 4, 5 and 6 churners.]
     "4. PREDICTING CHURNERS IN THE CALL GRAPH. We next discuss how to exploit social ties to identify potential churners in an operator's network. Our approach is as follows. We start with a set of churners (e.g. for April) and their social relationships (ties) captured in the call graph (for March). Using the underlying topology of the call graph, we then initiate a diffusion process with the churners as seeds. Effectively, we model a "word-of-mouth" scenario where a churner influences one of his neighbors to churn, from where the influence spreads to some other neighbor, and so on. At the end of the diffusion process, we inspect the amount of influence received by each node. Using a threshold-based technique, a node that is currently not a churner can be declared to be a potential future one, based on the influence that has been accumulated. Finally, we measure the number of correct predictions by tallying with the actual set of churners that were recorded for a subsequent month (e.g. for May). The diffusion model is based on Spreading Activation (SPA) techniques proposed in cognitive psychology and later used for trust metric computations [32]. In essence, SPA is similar to performing a breadth-first search on the call graph G_March = (V, E). The basic steps are outlined below: Node Activation: During each iterative step i, there is a set of active nodes. Let X be an active node which has associated energy [...]"
     Dasgupta, K. et al., 2008. Social ties and their relevance to churn in mobile telecom networks.
     Sundsøy, P., Bjelland, J., Canright, G., Engø-Monsen, K., & Ling, R. (2010). 2010 International Conference on Advances in Social Networks Analysis and Mining.
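The overlap measure quoted on slide 5 can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the toy adjacency dictionary `adj` and the function name `topological_overlap` are hypothetical and not taken from Dasgupta et al.; only the formula O_AB = N_AB / ((K_A - 1) + (K_B - 1) - N_AB) comes from the slide.

```python
def topological_overlap(adj, a, b):
    """Overlap of two linked nodes a and b, following the slide's formula.

    adj maps each node to the set of its neighbours (undirected graph).
    Returns 0.0 when the denominator is zero (neither node has other
    neighbours); that convention is an assumption of this sketch.
    """
    common = len(adj[a] & adj[b])  # N_AB: number of common neighbours
    denom = (len(adj[a]) - 1) + (len(adj[b]) - 1) - common
    return common / denom if denom > 0 else 0.0


# Hypothetical toy call graph: A and B share one friend (C).
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "E"},
    "C": {"A", "B"},
    "D": {"A"},
    "E": {"B"},
}

overlap_ab = topological_overlap(adj, "A", "B")  # one common friend out of three
overlap_ad = topological_overlap(adj, "A", "D")  # no common friends
```

Sorting links by this value and removing them in increasing or decreasing order is one way to reproduce the kind of link-removal experiment the excerpt describes.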
  6. Human behaviors spread in networks: social contagion.
     Excerpt from Bond et al. (2012), cropped on the slide: "[...] received the social message were 0.39% (s.e.m., 0.17%; t-test, [...] .02) more likely to vote than users who received no message at [...] Figure 2 shows that the observed per-friend treatment effect [...] as tie-strength increases. All of the observed treatment effects [...]"
     [Figure 1 | The experiment and direct effects. a, b, Examples of the informational message and social message Facebook treatments (a) and their direct [...] behaviour (b). Vertical lines indicate s.e.m. (they are too small to be seen for the first two bars). Panel a shows the Facebook banner: "Today is Election Day. Find your polling place on the U.S. Politics Page and click the "I Voted" button to tell your friends you voted." Panel b plots the direct effect of treatment on own behaviour (%) for self-reported voting, search for polling place and validated voting, comparing social message versus control and social message versus informational message.]
     Bond, R. M., Fariss, C. J., Jones, J. J., Kramer, A. D. I., Marlow, C., Settle, J. E., & Fowler, J. H. (2012). A 61-million-person experiment in social influence and political mobilization. Nature, 489(7415), 295–298. http://doi.org/10.1038/nature11421
  7. Homophily: the greater the similarity between individuals, the more likely they are to establish a connection.
     Table 5 (Leskovec & Horvitz): Correlation coefficients for random pairs of people and pairs of people who communicate. We compare the degree of homophily of random pairs of users with pairs of users that communicate.
       Attribute   Random    Communicate
       Age         -0.0001    0.297
       Gender       0.0001   -0.032
       ZIP         -0.0003    0.557
       County       0.0005    0.704
       Language    -0.0001    0.694
     Figure 21: Number of pairs of people of different ages. We plot ages of two people and color corresponds to the number of such pairs. (a) Ages of randomly selected pairs of people; we note there is little correlation. (b) Ages of people who communicate with one another, i.e., ages of people at the endpoints of links in the communication network. The high correlation is captured by the diagonal trend.
     Excerpt, cropped on the slide: "We contrast this statistic with the correlation coefficient where we choose users via a process of uniform random sampling across 1.3 billion users. We also consider two measures of similarity: the correlation coefficient and the probabil[...]"
     Slide annotations: Correlation coefficient; Number of pairs of people at different ages.
     Leskovec, J. & Horvitz, E., 2008. Planetary-scale views on a large instant-messaging network. pp.915–924.
  8. Contagion or Homophily
     • Contagion = Homophily?
     • Influence and homophily are usually confounded in observational social network studies
     Excerpt from Aral et al. (2009); the right margin is cropped on the slide: "[...] network registered >14 billion page views and sent 3.9 b[...] messages over 89.3 million distinct relationships. For details a[...] the service, the data, and descriptive statistics see the Data se[...] of the SI. Evidence of Assortative Mixing and Temporal Clustering. We observe strong evidence of both assortative mixing and temporal clustering in Go adoption. At the end of the 5-month period, adopters have a 5-fold higher percentage of adopters in their [...] networks (t-stat = 100.12, p < 0.001; k.s.-stat = 0.06, p < 0[...]) and receive a 5-fold higher percentage of messages from adopters than nonadopters (t-stat = 88.30, p < 0.001; k.s.-stat = [...], p < 0.001). Both the number and percentage of one's local network who have adopted are highly predictive of one's propensity to adopt (Logistic: β(#) = 0.153, p < 0.001; β(%) = 1.268, p < 0.001), a[...] adopt earlier (Hazard Rate: β(#) = 0.10, p < 0.001; β(%) = 0[...], p < 0.001). The likelihood of adoption increases dramatically [...] the number of adopter friends (Fig. 2C), and correspondingly adopters are more likely to have more adopter friends (Fig. [...]), mirroring prior evidence on product adoption in networks (2[...]). Adoption decisions among friends also cluster in time. [...] randomly reassigned all Go adoption times (while maintaining [...] adoption frequency distribution over time) and compared obse[...]"
     Fig. 1. Diffusion of Yahoo! Go over time. (A–C and D–F) Two subgraphs of the Yahoo! IM network colored by adoption states on July 4 (the Go launch date), August 10, and October 29, 2007. For animations of the diffusion of Yahoo! Go over time see Movies S1 and S2.
     Fig. 3. Distinguishing homophily and influence. (A and B) The fraction of observed treated to untreated adopters (n+/n-) under random (A) and propensity score (B) matching over time. The dotted line shows a ratio of 1, when treatment has no effect. The Right Inset in B graphs the average marginal influence effects of having 1, 2, 3, or 4 adopter friends implied by random (open circles) and propensity score (filled circles) matching. The Left Inset graphs the average cosine distance of attribute and behavior vectors of adopters to adopter friends as the number of adopters in the local network increases (Σ_{i,j} cos(x_i^a, x_j^a)/n). (C) Graphs the cosine distances of adopters to their adopter friends cos(x_it^a, x_jt^a), their nonadopter friends cos(x_it^a, x_jt), and a random alter cos(x_it^a, x_rt) over time with trend lines fitted by ordinary least squares. (D) The fraction of treated and untreated adopters, where treatment is defined as having a friend who adopted within a certain time period (or recency) (Δt ≡ t_i^a - t_j^a = R), under random matching (open circles) and propensity score matching (filled circles). The Inset graphs the cosine distances of dyads of adopters cos(x_it^a, x_jt^a) by the time [...]
     Aral, S., et al. 2009. Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences, 106(51), p.21544.
  9. Granovetter (weak ties): strong ties happen within communities, weak ties across communities. Onnela, J.-P., et al. PNAS 2007.
     [Figure panels A–D: degree distribution P(k) vs. degree k; tie strength distribution P(w) vs. link weight w (s); illustration of overlap values O_ij = 0, 1/3, 2/3 and 1 between nodes v_i and v_j; average overlap <O>_w, <O>_b vs. P_cum(w), P_cum(b).]
     Fig. 1. Characterizing the large-scale structure and the tie strengths of the mobile call graph. (A and B) Vertex degree (A) and tie strength distribution (B).
  10. Granovetter (weak ties): strong ties happen within communities, weak ties across communities. RTs (retweets) happen between groups; MTs (mentions) within groups.
     Figure 1. Groups and links. (A) Sample of Twitter network: nodes represent users and links, interactions. The follower connections are plotted as gray arrows, mentions in red, and retweets in green. The width of the arrows is proportional to the number of times that the link has been used for mentions. We display three groups (yellow, purple and turquoise) and a user (blue star) belonging to two groups. (B) Different types of links depending on their position with respect to the groups' structure: internal, between groups, intermediary links and no-group links. doi:10.1371/journal.pone.0029358.g001
     Excerpt, right margin cropped on the slide: "[...] network (followers and followees), while the second consists in retrieval of the user activity from the stream of Twitter (p[...] tweets, mentions and retweets). In the first stage, the dire[...] unweighted network is obtained from the information on followers and followees of each user. The data was collected u[...] a breadth-first search technique: Starting from several se[...] followers and followees of the seeds were retrieved. Then the s[...] procedure was repeated for the newly discovered users obtaini[...]"
     Figure 5. Intermediary links. (A) Ratio r between the number o[...] the links in the follower network (black curve), those with mentions [...] groups of the users connected by the link. Inset, ratios between t[...] doi:10.1371/journal.pone.0029358.g005
     Grabowicz, P. A., Ramasco, J. J., Moro, E., Pujol, J. M., & Eguiluz, V. M. (2012). The Strength of Intermediary Ties in Social Media. PLoS ONE, 7(1), e29358. http://doi.org/10.1371/journal.pone.0029358
  11. Temporal networks. Problem: how do users manage their social contacts? Problem: characterising/predicting social turnover. Answer: study CDRs to detect new/old social relationships. [Figure: neighbour id vs. year for one user, annotated with 1.1 persons/day and 0.6 persons/day.]
  12. Temporal networks. Problem: how do users manage their social contacts? Problem: characterizing/predicting social turnover. Answer: study CDRs to detect new/old social relationships. Real example: 700M relationships, 23M people.
     Excerpt, heavily cropped on the slide: "[...] that a large fraction of the revealed aggregated social connectivity k_i(T) is given by newly formed or removed connections. Thus, k_i(T) usually overestimates the instantaneous human social capacity of maintaining social ties. The imbalance between the number of added/removed ties measures how social capacity changes. [...]"
     Figure 2. From communication activity to tie dynamics: Panel (A) shows the communication events of a given individual in our database with [...] [Panels A–C: tie id vs. days (0 to 210); number of ties over time.]
  13. Temporal networks. Problem: how do users manage their social contacts? We create and destroy relationships at the same pace! Users' social capacity remains constant! [Example: Capacity = 5, Links created/destroyed = 4]
  14. Temporal networks. Problem: how do users manage their social contacts? Users have different social strategies: Social explorer (Links created = 23, Capacity = 4); Social keeper (Links created = 3, Capacity = 24).
  15. Temporal networks. Problem: how do users manage their social contacts? Social strategy changes with age. [Figure panels A and B: capacity and links created/destroyed as a function of age (16 to 68), by gender (F, M).] Miritello, G., Lara, R., Cebrian, M., & Moro, E. (2013).
  16. Human mobility: our mobility is highly predictable.
     Excerpt from Song et al. (2010), with column breaks cropped on the slide: "[...] and that their average call frequency f is ≥0.5 hour^-1 [(22) sections S1 and S2]. The trajectories of two users with widely different mobility patterns are shown in Fig. 1A: The first user moves in the vicinity of N = 22 towers in a 30-km region, whereas the second visits as many as N = 76 towers spanning approximately a 90-km neighborhood. To understand the recurrent nature of individual mobility, we assigned to each user a mobility network (23) (Fig. 1B), in which nodes are the locations visited by the user (each location corresponding to a [...] where N_i is the number of distinct locations visited by user i, capturing the degree of predictability of the user's whereabouts if each location is visited with equal probability; (ii) the temporal-uncorrelated entropy S_i^unc ≡ -Σ_{j=1}^{N_i} p_i(j) log2 p_i(j), where p_i(j) is the historical probability that location j was visited by the user i, characterizing the heterogeneity of visitation patterns; (iii) the actual entropy, S_i, which depends not only on the frequency of visitation, but also the order in which the nodes were visited and the time spent at each [...] activity, during which we have no information about the user's location (Fig. 1C). This incompleteness of the collected data is captured by the parameter q, representing the fraction of hour-long intervals when the user's location is unknown to us. As Fig. 1E shows, P(q) across our user base peaked around q = 0.7, which indicated that, for a typical user, we have no location update for about 70% of the hourly intervals, which masks the user's real entropy S_i. We therefore studied the dependence of the entropy S(q) on the incompleteness q, which [...]"
     Fig. 1. (A) Trajectories of two anonymized mobile phone users who visited the vicinity of N = 22 and 76 different towers during the 3-month-long observational period. Each dot corresponds to a mobile phone tower, and each time a user makes a call, the closest tower that routes the call is recorded, pinpointing the user's approximate location. The gray lines represent the Voronoi lattice, approximating each tower's area of reception. The colored lines represent the recorded move[...] [Panels B–E: mobility networks with visit percentages, weekly activity (Mon to Sun), distance in km.]
     Song, C., Qu, Z., Blumm, N., & Barabasi, A.-L. (2010). Limits of predictability in human mobility. Science, 327(5968), 1018.
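The first two entropy measures mentioned on slide 16 are easy to compute from a list of visited locations. The sketch below is illustrative only: the visit sequence is made up, the function names are hypothetical, and the random-entropy formula (log2 of the number of distinct locations) is assumed here because its definition is cropped on the slide. The actual entropy S_i, which depends on the order of visits, is not covered.

```python
import math
from collections import Counter


def random_entropy(visits):
    """log2 of the number of distinct locations (assumed form, see lead-in)."""
    return math.log2(len(set(visits)))


def uncorrelated_entropy(visits):
    """Temporal-uncorrelated entropy: -sum_j p(j) * log2 p(j),
    where p(j) is the historical frequency of location j."""
    counts = Counter(visits)
    total = len(visits)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical sequence of tower/location labels for one user.
visits = ["home", "work", "home", "work", "home", "gym", "home", "work"]

s_rand = random_entropy(visits)
s_unc = uncorrelated_entropy(visits)
```

Because the user above visits "home" far more often than "gym", the uncorrelated entropy is lower than the random entropy; the two coincide only when all locations are visited equally often.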
  17. Human mobility: our mobility is highly predictable. One shop gets 20% of the use of a credit card. Including other people's choices, we can reach 30% accuracy. Krumme, C. et al., 2013. The predictability of consumer visitation patterns. Scientific Reports, 3.
  18. Geography of social networks: most of our social connections happen in our neighbourhood (Gravity Law), P_ij ∝ 1/(d_ij)^α. Liben-Nowell, D., et al. PNAS 2005.
  19. Geography of social networks: most of our social connections happen in our neighbourhood (Gravity Law).
     [Figure caption, left margin cropped on the slide: Histogram of physical distances between egos and alters. The graph shows the number of ties by distance, in 200 km bins (for example, Ne[...] counted towards the 5400 km bin). The total number of ties in each of the two simulations is the same as in the observed data. Based on [...]]
     Takhteyev, Y., Gruzd, A., & Wellman, B. (2012). Geography of Twitter networks. Social Networks, 34(1), 73–81. http://doi.org/10.1016/j.socnet.2011.05.006
  20. Geography of mobility networks: most of our geographical movements happen locally (Gravity Law), P_ij ∝ 1/(d_ij)^α.
     Fig 1. A) Map of the mobility fluxes T_ij between municipalities based on Twitter inferred trips (white). Infomap communities detected on the network T_ij [...] colored under the mobility fluxes (blue colors). B) Mobility fluxes T_ij between municipalities i and j are constructed by aggregating the number of trips b[...] them. C) Correspondence between the observed fluxes T_ij and the fitted gravity model fluxes. Dashed line is T_ij = T_ij^grav while the (blue) solid line is [...] conditional average of T_ij^grav for values of T_ij. Maps were created using the maptools and sp packages in the R environment. doi:10.1371/journal.pone.0128692.g001
     Excerpt from Llorente et al. (2015); both columns are cropped on the slide: "[...] find them not suitable to study socio-economical activity: the administrative boundaries between municipalities reflect [...] and historical decisions, while economical activity happens [...] those boundaries. The result is that municipalities in Spain [...] diverse, ranging from municipalities with only 7 in[...]s to others with 3.2 million population. [...] we have used our own data to detect eco[...] areas. In particular, we have used user daily trips between [...] in our database [...]. We say that there is a daily trip between municipality i [...] a user has tweeted in place i and j consecutively within the [...]. In our database we find 1.9 million trips by 0.22 million [...]. With those trips we construct the daily mobility flux network [...] between municipalities as the number of trips between place i [...]. Remarkably, the statistical properties of trips and of the mobil[...] T_ij coincide with those of other mobility data (see Supp. [...]). For example, trip distance and elapsed time are power law dis[...] with exponents very similar to those found in the literature. [...] mobility fluxes T_ij are well described by the Gravity Law ([...] .69) T_ij ≃ T_ij^grav = P_i^(α_i) P_j^(α_j) / d_ij^β [...]" The second column, which introduces the "Social media fingerprints" section of the paper, is too cropped to reconstruct.
     Fitted exponents shown on the slide: α_i ≈ α_j = 0.42, β = 0.89 (the symbol for the distance exponent is lost in the transcript; β is assumed).
     Llorente, A., et al. (2015). PLoS ONE, 10(5), e0128692.
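The gravity law on slide 20 can be turned into a small numerical sketch. The populations and distances below are invented for illustration, the function name `gravity_flux` and the normalisation constant `k` are hypothetical, and the name `beta` for the distance exponent is an assumption (the symbol is lost in the transcript); only the values 0.42 and 0.89 come from the slide.

```python
def gravity_flux(pop_i, pop_j, dist_ij, alpha=0.42, beta=0.89, k=1.0):
    """Gravity-law flux T_ij = k * P_i^alpha * P_j^alpha / d_ij^beta.

    alpha and beta default to the exponents shown on the slide;
    k is a hypothetical normalisation constant that would be fitted to data.
    """
    return k * (pop_i ** alpha) * (pop_j ** alpha) / (dist_ij ** beta)


# Invented example: two municipalities at 10 km and at 20 km.
near = gravity_flux(pop_i=50_000, pop_j=200_000, dist_ij=10.0)
far = gravity_flux(pop_i=50_000, pop_j=200_000, dist_ij=20.0)

# Doubling the distance divides the expected flux by 2**beta (about 1.85).
ratio = near / far
```

In practice the exponents would be estimated by regressing log T_ij on log P_i, log P_j and log d_ij; the sketch only evaluates the model for given parameters.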
  21. Geography of networks: social/mobility communities coincide with geographical communities.
     Excerpt from Ratti et al. (2010), two columns cropped on the slide: "Results and Discussion. The question naturally arises: What is the best way to group these pixels into larger regions? A similar question has been a focus of network research over the past decade; there one seeks the best way to partition a network into separate, non-overlapping communities [13–18]. The leading approach is based on optimizing the network's "modularity" [15]. High modularity values occur when the network is subdivided such that there are many links within communities and few between them, as compared to a randomly generated network with otherwise similar characteristics. However, we are not trying to partition the network itself, but rather to use the network's characteristics to partition the geographic space underneath the network's topology while guaranteeing spatial adjacency, one of the essential features of a geographic region. [...] analysis as it allowed us to correctly represent the human network from which we started (see Text S2). After two iterations of the algorithm, a surprisingly accurate map of the Greater London region emerged, along with an area corresponding to Scotland, with just a few detached pixels scattered across the rest of Great Britain (Fig. 2 (a) and (b)). With subsequent iterations the modularity increased, ultimately converging to a maximum of 0.58, indicative of a good partitioning compared to the randomized network, as mentioned in [15,20]. The resulting subdivision had 23 communities, 13 of which were clearly delineated geographically, although some scattered pixels and fuzzy boundaries remained. To determine if these artefacts were due to noise produced by the heuristics of spectral partitioning, we next fine-tuned the spectral partitioning algorithm in a manner suggested by Newman [16], iteratively moving pixels from one region to another to maximize overall modularity (see Text S3). When applied to our data, this process [...]"
     Figure 1. The geography of talk in Great Britain. This figure shows the strongest 80% of links, as measured by total talk time, between areas within Britain. The opacity of each link is proportional to the total call time between two areas and the different colours represent regions identified using network modularity optimisation analysis. doi:10.1371/journal.pone.0014248.g001
     Ratti, C. et al. (2010). PLoS ONE, 5(12), e14248.
     Llorente, A., et al. (2015). Social Media Fingerprints of Unemployment. PLoS ONE, 10(5), e0128692.
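To make the notion of modularity on slide 21 concrete, the sketch below evaluates the standard Newman modularity Q for a given partition of a small unweighted, undirected graph. It is an illustration under assumptions: the toy edge list and the function name are hypothetical, and the paper works with a weighted call network and a spectral optimisation procedure that is not reproduced here.

```python
def modularity(edges, community):
    """Newman modularity Q = sum_c [ L_c / m - (d_c / (2 m))^2 ].

    edges: list of undirected edges (u, v), no duplicates.
    community: dict mapping each node to its community label.
    L_c is the number of edges inside community c, d_c the total degree
    of its nodes, and m the total number of edges.
    """
    m = len(edges)
    intra = {}       # L_c per community
    degree_sum = {}  # d_c per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


# Hypothetical graph: two triangles joined by a single bridge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {n: "all" for n in range(6)}

q_split = modularity(edges, two_groups)   # 5/14, about 0.357
q_single = modularity(edges, one_group)   # 0.0
```

Splitting the graph along the bridge gives a clearly positive Q, while putting every node in one community gives zero, which is the contrast the excerpt describes between many links within communities and few between them.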
  22. Applications to industry / economy. [Diagram labels: Situation, Behavior, Observation]
  23. User analytics. Problem: household occupancy. Answer: study the fixed line social network. Why does it work? Social interactions are clustered! 300k fixed-line clients, South-America telco. [Figure: network of fixed-line clients, nodes labelled with numeric ids.]
  24. Product adoption. Problem: improve targeting in product adoption. Answer: study "network" data from CDRs, CRMs, etc. to detect social influence. 92% trust a friend's recommendation; 47% trust ads on TV; 33% trust display ads on mobile devices.
  25. Product adoption. Problem: improve targeting in product adoption. Answer: "network" data from CDRs, CRMs, etc. to detect social influence. [Diagram labels: Direct relationship, Indirect relationship]
  26. Product adoption. Problem: improve targeting in product adoption. Real example: 17 million relationships, 6M users.
  27. Product adoption. Problem: improve targeting in product adoption. Which users do we have to target? [Diagram: users selected by three strategies: Social, CRM, Random]
  28. Product adoption. Problem: improve targeting in product adoption. Adoption rate increased 3x. Targeting strategies compared: Random, CRM based, CRM + social based.
  29. Product adoption. Sundsøy, P., Bjelland, J., Canright, G., Engø-Monsen, K., & Ling, R. (2010). Product adoption networks and their growth in a large mobile phone network. 2010 International Conference on Advances in Social Networks Analysis and Mining, 208–216.
     Figure 2. Time evolution of the iPhone adoption network. One node represents one subscriber. Node color represents iPhone model: red = 2G, green = iPhone 3G, yellow = 3GS. Node size, link width, and node shape (attributes which are visible in Q3 2007) represent, respectively, internet volume, weighted sum of SMS and voice traffic, and subscription type. Round node shape represents business users, while square represents consumers.
     Excerpt, cropped on the slide: "[...] adoption network is diffusing over the underlying social network. In particular we will often focus on the time evolution of the LCC of the adoption network, which may or may not form a social network monster. We recall from Figure 1 that the other components are often rather small compared to the LCC. Hence we argue that studying the evolution of the LCC itself gives useful insight into the strength of the network spreading mechanisms in operation. It also gives insight into the broader context of adoption. As described in [8], two friends adopting together does not necessarily imply social influence; there might also be external factors that control the [...] the underlying mechanism. A. The iPhone case. The iPhone 2G was officially released in the US in late Q2 2007 followed by 3G in early Q3 2008 and 3GS late Q2 2009. It was released on the Telenor net in 2009. Despite the existence of various models, we have chosen to look at the iPhone as one distinct product, since (as we will see) the older models are naturally substituted in our network. Figure 2 shows the development of the iPhone monster in one particular market. We observe how the 2G phone is gradually substituted [...]"
  30. Organizational analysis. [Diagram labels: Team, managers]
     • How do we detect hidden leaders inside a company?
     • Find real leaders inside the company and measure their centrality, connectivity, and number of communities and their diversity
     • Train a model to detect them
     • Find other people with similar roles inside the company
  31. Insurance pricing / Credit risk.
     • Whom would you lend money to in this network?
     Use of mobile phone data or social networks to assess credit risk in microcredit approval (Lenddo, Cignifi). Granovetter: larger diversity of contacts, more opportunities, more job offers, etc.
     BIG DATA, SMALL CREDIT: The Digital Revolution and Its Impact on Emerging Market Consumers, Omidyar Network.
  32. Economic development: areas with larger diversity of contacts have more economic development (Granovetter). The deprivation index of an area decreases with:
     • Number of social contacts
     • Diversity of social contacts
     Eagle, N., Macy, M., & Claxton, R. (2010). Science, 328(5981), 1029–1031. 1186605
     Excerpt, cropped on the slide: "[...] tie formation. Previous studies have found that individuals benefit from having social ties that bridge between communities. These benefits include access to jobs and promotions (5–13), greater job mobility (14, 15), higher salaries (9, 16, 17), opportunities for entrepreneurship (18, 19), and increased power in negotiations (20, 21). Although these studies suggest the possibility that the individual-level benefits of having a diverse social network may scale to the population level, the relation between network structure and community economic development has never been directly tested (22). As policy-makers struggle to revive ailing economies, understanding this relation between network structure and economic development may provide insights into social alternatives to traditional stimulus policies. To that end, we analyzed the most complete record of a national communication network studied to date and coupled this social network data with detailed socioeconomic indicators to measure this relation directly, at the population level. The communication network data were collected during the month of August 2005 in the UK. The data contain more than 90% of the mobile phones and greater than 99% of the residential and business landlines in the country. The resulting network has 65 × 10^6 nodes, 368 × 10^6 reciprocated social ties, a mean geodesic distance (minimum number of direct or indirect edges connecting two nodes) of 9.4, an average degree of 10.1 network neighbors, and a giant component (the largest connected subgraph) containing 99.5% of all nodes (23). Although the nature of this communication data limits causal inference, we were able to test the hypothesized correspondence between social network structure and economic development using the 2004 UK government's Index of Multiple Deprivation (IMD), a composite measure of relative prosperity of 32,482 communities encompassing the entire country (24), based on income, employment, education, health, crime, housing, and the environmental quality of each region (25). Each residential landline number was associated with the IMD rank of the exchange in which it was [...] We then compared the IMD rank of each com[...] entropy associated with individual i's communi[...]"
     Fig. 1. An image of regional communication diversity and socioeconomic ranking for the UK. We find that communities with diverse communication patterns tend to rank higher (represented from light blue to dark blue) than the regions with more insular communication. This result implies that communication diversity is a key indicator of an economically healthy community. [(29) Crown copyright material is reproduced with the permission of the Controller of Her Majesty's Stationery Office]
  33. 33. @estebanmoro Economic development: areas with • a larger diversity of mobility or • users with a larger radius of gyration are more economically developed. Smith, C., Quercia, D., & Capra, L. (2013). Finger on the pulse: identifying deprivation using transit flow analysis, 683–692.
 [Figure 4 (Smith et al.): (a) census areas containing stations coloured according to real composite IMD score (scale 0.0 to 1.0); (b) stations which fall in the 1st or 4th quartile for composite IMD, classified as high or low (true class); (c) predicted classifications for the same areas.]
 Excerpt from Smith et al. (2013): [...] include exploiting other variables available in the Oyster data, such as ticket price and card type (e.g., standard, student, elderly and disabled). We have also begun to develop methods [...] By combining different datasets we can build a multiplex network (i.e. a network with multiple types of edge), which may offer additional insights into the relationship between [...]
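The slide names the radius of gyration without defining it. A minimal sketch, assuming the usual definition (root-mean-square distance of a user's visited points from their centroid): the function name, the planar coordinates in km and the two sample users are illustrative assumptions, not taken from Smith et al.

```python
import math

def radius_of_gyration(points):
    """Root-mean-square distance of visited points from their centroid.

    `points` is a list of (x, y) positions in a planar projection (e.g. km).
    This is an assumption: real lat/lon data would need projecting first.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)

# Hypothetical users: one stays near home, one moves across the city.
local_user = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
commuter = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]

rg_local = radius_of_gyration(local_user)
rg_commuter = radius_of_gyration(commuter)
```

Averaging this quantity over the users of an area would give one possible area-level mobility indicator to compare against the IMD score.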
  34. 34. @estebanmoro Geographical models to estimate unemployment
  35. 35. @estebanmoro (Functional) geographical areas in Spain. Which are the real economic areas?
  36. 36. @estebanmoro (Functional) geographic areas in Spain Madrid Barcelona “The piece is absolutely useless, even ridiculous, outside Spain, because the audience cannot hope to understand its significance, nor the performers to play it as it should be played.”
  37. 37. Content Social Interaction Mobility Penetration Activity
  38. 38. @estebanmoro Twitter penetration • Is Twitter penetration related to the economic development of areas? • At country scale, Twitter penetration ~ GDP • At small scale it is the opposite! Twitter penetration ~ unemployment. Hawelka, B. et al., 2013. Geo-located Twitter as the proxy for global mobility patterns.
 [Cropped excerpt from Hawelka et al.: the ratio between the number of Twitter users and [...] does not distribute uniformly across the globe [...] a country's economic development approximated by [...]; countries with a penetration rate below 0.05‰ are excluded, as are [...] smaller than 10,000. Figure 3. Users and GDP per capita: (A) spatial distribution of the index; (B) correlation between country-level Twitter penetration and per capita GDP of a country, R² = 0.65.]
 [Figure: penetration rate index versus % unemployment, ρ = 0.70 [0.6, 0.77].]
  39. 39. @estebanmoro Twitter social interactions • Granovetter: diversity of interactions leads to more opportunities • Diversity of interactions between cities is correlated with economic development • We construct the graph of social interactions • Measure diversity with entropy (Eagle et al., Science 2010):
 $w_{ij}$ = number of @-mentions between areas $i$ and $j$
 $p_{ij} = w_{ij} / \sum_{j=1}^{k_i} w_{ij}$
 $S_i = -\sum_{j=1}^{k_i} p_{ij} \log p_{ij}$
 [Figure: entropy (%) versus % unemployment, ρ = −0.21 [−0.37, −0.04].]
 [Figure: (A) interaction graphs of two example areas, labelled entropy 0.72 with unemployment rate 11% and entropy 0.42 with unemployment rate 23%; a second version of the panel reads entropy 0.72 with unemployment 8.8% and entropy 0.42 with unemployment 20.3%. (B) Activity: proportion of tweets by hour for low and high unemployment rate areas. (C) Detected misspellings: in “Alguien se viene con migo aver la vida de PI??”, “con migo” instead of “conmigo” (“with me” in Spanish) and “aver” instead of “a ver” (“aver” is not a Spanish word); in “La quiero mucho y la hecho de menos”, “hecho de menos” instead of “echo de menos” (“I miss her” in Spanish). All 618 expressions such as “con migo”, “aver” or “hecho de menos” have been searched literally within the text of the whole dataset of tweets.]
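The entropy measure on this slide can be sketched in a few lines. This is an illustration only: the dictionary layout, the optional normalization by log(k_i) (as used by Eagle et al. to keep values in [0, 1]) and the sample counts are assumptions, not the authors' code or data.

```python
import math
from collections import defaultdict

def interaction_entropy(mentions, normalize=False):
    """Shannon entropy S_i = -sum_j p_ij log p_ij of each area's interactions.

    `mentions` maps (area_i, area_j) -> w_ij, the number of @-mentions
    between the two areas. With normalize=True the entropy is divided by
    log(k_i), where k_i is the number of areas that i interacts with
    (an assumption, following Eagle et al.).
    """
    weights = defaultdict(dict)
    for (i, j), w in mentions.items():
        if w > 0:
            weights[i][j] = weights[i].get(j, 0) + w
    entropy = {}
    for i, row in weights.items():
        total = sum(row.values())
        s = -sum((w / total) * math.log(w / total) for w in row.values())
        if normalize and len(row) > 1:
            s /= math.log(len(row))
        entropy[i] = s
    return entropy

# Hypothetical counts: area "A" spreads its mentions evenly, "B" concentrates them.
mentions = {("A", "B"): 10, ("A", "C"): 10, ("A", "D"): 10,
            ("B", "A"): 28, ("B", "C"): 1, ("B", "D"): 1}
S = interaction_entropy(mentions, normalize=True)
```

In this toy input the evenly spread area reaches the maximum normalized entropy, while the concentrated one scores much lower, which is the contrast the slide relates to unemployment.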
  40. 40. @estebanmoro Twitter geographical interactions • Diversity of geographical mobility is correlated with development • We use the graph of flows $T_{ij}$ between areas $i$ and $j$ • Measure diversity with entropy:
 $\tilde{p}_{ij} = T_{ij} / \sum_{j=1}^{\tilde{k}_i} T_{ij}$
 $\tilde{S}_i = -\sum_{j=1}^{\tilde{k}_i} \tilde{p}_{ij} \log \tilde{p}_{ij}$
 [Figure: entropy (%) versus % unemployment, ρ = −0.023 [−0.19, 0.14].]
 Smith, C., Mashhadi, A. & Capra, L., 2013. Ubiquitous sensing for mapping poverty in developing countries. Smith, C., Quercia, D. & Capra, L., 2013. Finger on the pulse: identifying deprivation using transit flow analysis. pp.683–692.
  41. 41. @estebanmoro Twitter content • Two different approaches • Classical approach: NLP applied to detect mentions of “unemployment”, “job”, “economy”, … Antenucci, D. et al., 2014. Using Social Media to Measure Labor Market Flows.
 [Figure (Antenucci et al.): initial claims in thousands (left scale) and social media factor (right scale), 2011 to 2013.]
 [Figure: number of mentions of employment versus % unemployment, ρ = 0.33 [0.17, 0.47].]
  42. 42. @estebanmoro Twitter content • Our approach: NLP applied to detect lexical complexity (as a proxy for educational level) • Readability (Gunning index) • Serious misspellings • We construct a list of more than 600 incorrect expressions of this type, validated by Spanish language linguistic experts. • We do not take into account misspellings due to different Spanish accents and IM abbreviations • We compute for each area the fraction of users that make a number of serious misspellings
 Tweet: Alguien se viene con migo aver la vida de PI?? Correct spelling: Alguien se viene conmigo a ver la vida de PI??
 Tweet: La quiero mucho y la hecho de menos. Correct spelling: La quiero mucho y la echo de menos
 Tweet: Yo llendo a trabajar con este tiempo. Correct spelling: Yo yendo a trabajar con este tiempo
 J. Davenport et al, The Readability of Tweets and their Geographic Correlation with Education, arXiv:1401.6058, 2014
 [Figure: number of misspellers versus % unemployment.]
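The misspeller rate described on this slide could be computed along the following lines. This is a sketch under stated assumptions: the function name, the `(area, user, text)` tuple layout, the `min_errors` threshold and the four-item expression list are hypothetical, and whole-word matching is a design choice made here to avoid flagging correct words that merely contain an expression (for example "aver" inside a longer word); the slides only say the expressions were searched literally.

```python
import re
from collections import defaultdict

def misspeller_rate(tweets, expressions, min_errors=1):
    """Fraction of users per area with at least `min_errors` flagged tweets.

    `tweets` is an iterable of (area, user, text) tuples; `expressions` is
    the list of incorrect expressions, matched as whole words, ignoring case.
    """
    pattern = re.compile(
        "|".join(r"\b" + re.escape(e) + r"\b" for e in expressions),
        re.IGNORECASE,
    )
    flagged = defaultdict(int)   # (area, user) -> number of flagged tweets
    users = defaultdict(set)     # area -> users observed in that area
    for area, user, text in tweets:
        users[area].add(user)
        if pattern.search(text):
            flagged[(area, user)] += 1
    return {
        area: sum(flagged[(area, u)] >= min_errors for u in members) / len(members)
        for area, members in users.items()
    }

# Example tweets from the slide, assigned to hypothetical areas and users.
tweets = [
    ("area1", "u1", "Alguien se viene con migo aver la vida de PI??"),
    ("area1", "u2", "Alguien se viene conmigo a ver la vida de PI??"),
    ("area2", "u3", "La quiero mucho y la hecho de menos"),
    ("area2", "u4", "Yo llendo a trabajar con este tiempo"),
]
expressions = ["con migo", "aver", "hecho de menos", "llendo"]
rates = misspeller_rate(tweets, expressions)
```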
  43. 43. @estebanmoro Twitter activity • Is unemployment reflected in Twitter daily patterns? Example tweet: “Just arrived to work, mondays are too hard…”
 [Figure: activity (% of tweets) by hour of day for low and high unemployment rate areas; example areas labelled entropy 0.72, unemployment rate 11%.]
 [Figure: correlation with % unemployment of each variable: Penetration rate, Entropy1 (geo), Entropy2 (geo), Entropy1 (social), Entropy2 (social), Activity (morning), Activity (afternoon), Activity (night), Misspellers rate, #Job tweets, #Employment tweets, #Unemployment tweets, #Economy tweets. Scatter panels show penetration rate and activity (morning) versus % unemployment; for activity (morning), ρ = 0.48 [0.34, 0.60].]
  44. 44. @estebanmoro Summary of the variables • Social/geo variables have low correlation with unemployment • Penetration rate, activity and content are highly correlated with unemployment
 [Figure: correlation with % unemployment of each variable: Penetration rate, Entropy1 (geo), Entropy2 (geo), Entropy1 (social), Entropy2 (social), Activity (morning), Activity (afternoon), Activity (night), Misspellers rate, #Job tweets, #Employment tweets, #Unemployment tweets, #Economy tweets. Scatter panels a) to e) show entropy1 (social), misspellers rate, penetration rate and activity (morning) versus % unemployment.]
  45. 45. @estebanmoro Explanatory power of Twitter variables • Simple linear regression, R² = 0.64
 [Figure: % unemployment (predicted) versus % unemployment (real); bar chart of the % weight in the model of each variable: Penetration, Entropy (social), Activity (morning), #misspellers, mentions of “unemployment”.]
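To make the "simple linear regression" step concrete, here is a dependency-free sketch of ordinary least squares with an intercept and its R². It assumes nothing about the authors' actual fitting procedure; the helper names and the tiny synthetic dataset (two made-up predictors with an exact linear response) are hypothetical and chosen only so the result is easy to check by hand.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_linear(X, y):
    """Ordinary least squares with intercept; returns (coefficients, R^2)."""
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)               # normal equations (X'X) beta = X'y
    pred = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot

# Synthetic example: y = 2 + 3*x1 - 1*x2 exactly, so the fit should be perfect.
X = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [3, 7, 6, 11, 9]
beta, r2 = fit_linear(X, y)
```

With real area-level data the predictors would be the Twitter variables listed on the slide and the response the official unemployment rate; the slide reports R² = 0.64 for that model.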
  46. 46. @estebanmoro Researcher Data scientist Policy maker Editor Referee Researcher
  47. 47. @estebanmoro Nowcasting of shadow economy • Do we detect less or more unemployment than is officially registered? • Error = Unemployment_model - Unemployment_registered • Dataset: 19.6 million geolocated tweets • The model predicts less unemployment than the official figures in areas with a larger shadow economy
 A. Llorente, EM, et al, 2015 http://arxiv.org/abs/1411.3140
 [Figure: model error (from -30% to 30%) versus % shadow economy (15 to 35).]
  48. 48. @estebanmoro Use the model in under-developed countries • 250 million people • 30 million Twitter users • 50% smartphone penetration in 2015
  49. 49. @estebanmoro Predicting economic models for disaster damage
  50. 50. @estebanmoro Predicting economic models for disaster damage • Hurricane Sandy, 29 October 2012 • 1/2 $Billion impact in FEMA grants and insurance claims • 1-2 years to assess the economic damage. Kryvasheyeu, Moro, E., et al (2016). Science Advances
  51. 51. @estebanmoro Urban economic models for housing prices • Hurricane Sandy, 29 October 2012 • 1/2 $Billion impact in FEMA grants and insurance claims • 1-2 years to assess the economic damage
 [Figure: map of tweets per 100k (colour scale from 30 to 1800).]
  52. 52. @estebanmoro Predicting economic models for disaster damage • Hurricane Sandy, 29 October 2012 • 1/2 $Billion impact in FEMA grants and insurance claims • 1-2 years to assess the economic damage
 [Figure: correlation with economic damage versus hours since hurricane landing, for activity and sentiment; maps of #tweets (tweets per 100k, colour scale from 30 to 1800), tweet sentiment, grants (FEMA) and insurance claims.]
  53. 53. @estebanmoro Predicting economic models for disaster damage: other natural disasters
 Table 1. Activity-damage correlation (Kendall τ, Spearman ρ, and Pearson r) for additional events. Disasters are sorted in order of increasing strength of the Pearson correlation coefficient. All disasters demonstrate moderate to strong levels of statistically significant correlations (P < 0.05) [with the exception of Alaska floods (DR-4122)].
 Event ID   Type        Kendall τ   P              Spearman ρ   P              Pearson r   P
 DR4116     Floods      0.15        9.04 × 10^-5   0.21         1.87 × 10^-4   0.18        9.71 × 10^-4
 DR4117     Tornadoes   0.17        0.05           0.26         0.05           0.24        0.06
 DR4176     Tornadoes   0.18        8.92 × 10^-3   0.28         6.68 × 10^-3   0.27        9.60 × 10^-3
 Sandy      Hurricane   0.16        3.30 × 10^-13  0.24         5.04 × 10^-13  0.30        5.99 × 10^-20
 DR4145     Floods      0.33        3.54 × 10^-8   0.47         2.42 × 10^-8   0.45        1.08 × 10^-7
 DR4177     Floods      0.36        4.44 × 10^-4   0.52         2.33 × 10^-4   0.45        1.53 × 10^-3
 DR4175     Tornadoes   0.34        0.02           0.46         0.03           0.46        0.03
 DR4195     Floods      0.32        1.28 × 10^-8   0.47         3.35 × 10^-9   0.46        6.32 × 10^-9
 DR4174     Tornadoes   0.56        5.24 × 10^-3   0.69         6.07 × 10^-3   0.68        6.93 × 10^-3
 DR4157     Tornadoes   0.51        9.70 × 10^-4   0.71         2.38 × 10^-4   0.72        1.71 × 10^-4
 DR4168     Mudslide    0.44        0.04           0.59         0.03           0.86        1.84 × 10^-4
 DR4193     Earthquake  0.74        3.80 × 10^-5   0.90         7.50 × 10^-7   0.88        3.92 × 10^-6
 DR4122     Floods      1.00        n/a            1.00         n/a            1.00        n/a
 Excerpt from the Discussion of Kryvasheyeu et al. (2016): We found that Twitter activity during a large-scale natural disaster (in this instance Hurricane Sandy) is related to the proximity of the region to the path of the hurricane. Activity drops as the distance from the hurricane increases; after a distance of approximately 1200 to 1500 km, the influence of proximity disappears. High-level analysis of the composition of the message stream reveals additional findings. Geo-enriched data (with location of tweets inferred from users’ profiles) show that the areas close to the disaster generate more original content, characterized by a lower fraction of retweets. This extends the previous understanding of retweeting behavior in crisis (31, 32) and confirms other studies (41). Finally, we find that messages from disaster regions generate more interest globally, with a higher normalized count of retweet sources. In the first study of its kind based on the actual ex-post damage assessments, we demonstrated that the per-capita number of Twitter messages corresponds directly to disaster-inflicted monetary damage. The correlation is especially pronounced for persistent postdisaster activity and is weakest at the peak of the disaster. We established that per-capita activity and per-capita damage both have an approximately log-normal distribution and that the Pearson correlation coefficient between the two can reach 0.6 for a carefully selected observation period in the aftermath of the landfall. This makes social media a viable platform for preliminary rapid damage assessment in the chaotic time immediately after a disaster. Our results suggest that, during a disaster, officials should pay attention to normalized activity levels, rates of [...]
 Fig. 5. Distribution of activity-damage correlations (Pearson correlation coefficients) across all disasters considered in the study. In terms of damage, disasters appear to group according to their type, with cost increasing from tornado storms, to floods, and eventually to hurricanes. The correlation between activity and damage is very strong for small-scale (low-cost) disasters, then it weakens and remains, on average, at the same level across moderate-cost to high-cost events.
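The three coefficients reported in Table 1 can be illustrated with a short, dependency-free sketch. The implementations below are textbook versions (Kendall's tau here is the simple tau-a, without tie corrections), and the per-area activity and damage values are invented for illustration; none of this is the authors' analysis code or data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Average 1-based ranks, so tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant pairs) / total number of pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

# Hypothetical per-capita Twitter activity and damage for five areas.
activity = [10.0, 20.0, 30.0, 40.0, 50.0]
damage = [2.0, 1.0, 4.0, 3.0, 5.0]
tau = kendall_tau(activity, damage)
rho = spearman(activity, damage)
r = pearson(activity, damage)
```

Since the excerpt notes that both quantities are approximately log-normal, applying `math.log` to each series before calling `pearson` may be a sensible variant for real data.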
  54. 54. @estebanmoro 5. Outlook
  55. 55. @estebanmoro Problems
  56. 56. @estebanmoro Problems • Data-based societies/governments • Transparency: data-driven decisions • Responsibility: decisions backed up by data and algorithms • Policy making with A/B testing
 • http://www.wired.com/2012/04/ff_abtesting/all/1 • http://www.fastcompany.com/3042630/first-us-chief-data-scientist-dj-patilscientist-dj-patil
  57. 57. @estebanmoro Problems • N ≠ ALL • Some social sectors might not be well represented • Potential biases towards the youngest, richest, etc. • We need sampling techniques to ensure the representativeness of the data. • Biases everywhere!
  58. 58. @estebanmoro Problems • N ≠ ALL • Biases everywhere http://www.pewglobal.org/2016/02/22/smartphone-ownership-and-internet-usage-continues-to-climb-in-emerging-economies/
  59. 59. @estebanmoro Problems • N ≠ ALL • Biases everywhere http://www.pewinternet.org/2015/08/19/the-demographics-of-social-media-users/
  60. 60. @estebanmoro Problems • Privacy ~ 1 / Value • Traceability: who/where/how is accessing our data? • Value: most data is proprietary, but who owns its value? • Measure: how much privacy is lost when our data is used? How much is our data valued? FT.com http://on.ft.com/14yjj65
  61. 61. @estebanmoro Privacy / Data value • Which data has more value? Staiano, J., Oliver, N., Lepri, B., de Oliveira, R., Caraviello, M., & Sebe, N. (2014). Money walks: a human-centric study on the economics of personal mobile data.
 Excerpt from Staiano et al. (2014), Table 3: Questions asked in the EoS questionnaire.
 [...] anonymized/aggregated form?
 Q2. On day {dd/MM} you assigned a value of {min-bid per category} to the information [{least valued info per category}]. This was your minimum bid. Why? (multi-choice*)
 Q3. On day {dd/MM} you assigned a value of {max-bid per category} to the information [{most valued info per category}]. This was your maximum bid. Why? (multi-choice*)
 Q4. Imagine there was a market in which you could sell your personal information (e.g. information about people you called, places you’ve been, applications you’ve used, songs you’ve listened to, etc.). Who would you trust to handle your information? Please, order the following entities from most to least trusted. (rank**)
 Q5. The category {locations | communications | apps | media} is the one that you refused to sell the most ({percentage of opt-outs}). Why? (free-text)
 *included: Fair value, Test/Mistake, Other (free text). For minimum-bid related questions additional options were To win the auction, Info not important; conversely, for maximum-bid related questions, the additional option was To prevent selling. **entities to be ranked included: banks, government, insurance companies, telcos, yourself.
 [...] are concerned about mobile PII protection (Q1) but do not tend to read the Terms of Service (Q4) nor are aware of current legislation on data protection (Q5). Moreover, they do not seem to trust how neither application providers (Q2) nor telecom operators (Q3) use their data. The EoS survey was designed to gather additional quantitative and qualitative information from our participants after the data collection was complete. In particular, we asked participants to put a value (under the same auction game constraints) on category-specific bulk information, i.e. all the data gathered in the study for each category. For instance, in the case of location information, a visualization of a participant’s mobility data collected over the 6-weeks period was shown in the Web questionnaire (as depicted in Figure 1) and the participant was asked to assign it a monetary value. Furthermore, for each category, we asked participants about the minimum/maximum valuations given during the study, in order to understand the reasons why they gave these valuations. Table 3 contains all the questions of the EoS survey. The EoS questionnaire was administered through a slightly modified version of the same Web application used for the daily surveys. The main difference are the visualizations of the collected data.
 Figure 1: Location-specific bulk information question in the EoS survey.
  62. 62. @estebanmoro Privacy (?) [Diagram: value versus privacy/anonymity, with the labels LPD, TOS, EULA, Group, Social net, Activity, Mobility, PII*.] * PII = personally identifiable information
  63. 63. @estebanmoro Privacy / Data value • Which data has more value? Unique in the shopping mall: On the reidentifiability of credit card metadata. Yves-Alexandre de Montjoye, Laura Radaelli, Vivek Kumar Singh, Alex “Sandy” Pentland, Science 2015
 Excerpt from de Montjoye et al. (2015): [...] survey shows that financial and credit card data sets are considered the most sensitive personal data worldwide (25). Among Americans, 87% consider credit card data as moderately or extremely private, whereas only 68% consider health and genetic information private, and 62% consider location data private. At the same time, financial data sets have been used extensively for credit scoring (26), fraud detection (27), and understanding the predictability of shopping patterns (28). Financial metadata have great potential, but they are also personal and highly sensitive. There are obvious benefits to having metadata data sets broadly available, but this first requires a solid understanding of their privacy. To provide a quantitative assessment of the likelihood of identification from financial data, we used a data set D of 3 months of credit card transactions for 1.1 million users in 10,000 shops in an Organisation for Economic Co-operation and Development country (Fig. 1). The data set was simply anonymized, which means that it did not contain any names, account numbers, or obvious identifiers. Each transaction was time-stamped with a resolution of 1 day and associated with one shop. Shops are distributed throughout the country, and the number of shops in a district scales with population density (r² = 0.51, P < 0.001) (fig. S1).
 Fig. 1. Financial traces in a simply anonymized data set such as the one we use for this work. Arrows represent the temporal sequence of transactions for user 7abc1a23 and the prices are grouped in bins of increasing size (29).
 We quantified the risk of reidentification of D by means of unicity ε (19). Unicity is the risk of reidentification knowing p pieces of outside information about a user (29). We evaluate ε_p of D as the percentage of its users who are reidentified with p randomly selected points from their financial trace. For each user, we extracted the subset S(I_p) of traces that match the p known points (I_p). A user was considered reidentified in this correlation attack if |S(I_p)| = 1. Figure 2 shows that the unicity of financial traces is high (ε_4 > 0.9, green bars). This means that knowing four random spatiotemporal points or tuples is enough to uniquely reidentify [...] of the individuals and to uncover all of their records. Simply anonymized large-scale [...] metadata can be easily reidentified via spatiotemporal information.
 Fig. 2. The unicity ε of the credit card data set given p points. The green bars represent unicity when spatiotemporal tuples are known. This shows [...] spatiotemporal points taken at random (p = 4) are enough to uniquely characterize [...] individuals. The blue bars represent unicity when using spatial-temporal-price triples [...] 0.50) and show that adding the approximate price of a transaction significantly increases the likelihood of reidentification. Error bars denote the 95% confidence interval on the mean.
 Furthermore, financial traces contain one additional column that can be used to reidentify an individual: the price of a transaction. A piece of outside information, a spatiotemporal tuple, can become a triple: space, time, and the approximate price of the transaction. The data set contains the exact price of each transaction, but we assume that we only [...]
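The unicity measure described in the excerpt can be approximated with a small Monte Carlo sketch. It assumes traces are stored as sets of (shop, day) tuples and samples users at random rather than evaluating every user as the paper does; the function name, parameters and toy traces are hypothetical.

```python
import random

def unicity(traces, p, trials=200, seed=0):
    """Estimate unicity: share of sampled users uniquely matched by p random points.

    `traces` maps user -> set of (shop, day) tuples. A sampled user counts
    as reidentified when theirs is the only trace containing all p points,
    i.e. |S(I_p)| = 1 in the notation of the excerpt.
    """
    rng = random.Random(seed)
    eligible = [u for u, t in traces.items() if len(t) >= p]
    hits = 0
    for _ in range(trials):
        user = rng.choice(eligible)
        known = set(rng.sample(sorted(traces[user]), p))   # outside information I_p
        matches = [u for u, t in traces.items() if known <= t]
        hits += len(matches) == 1
    return hits / trials

# Toy data: u1 and u2 share two of their three transactions, u3 shares none.
traces = {
    "u1": {("shopA", 1), ("shopB", 2), ("shopC", 3)},
    "u2": {("shopA", 1), ("shopB", 2), ("shopD", 4)},
    "u3": {("shopE", 5), ("shopF", 6), ("shopG", 7)},
}
eps_1 = unicity(traces, p=1)
eps_3 = unicity(traces, p=3)
```

Even in this toy case, knowing more points makes a trace easier to single out, which is the pattern Fig. 2 of the paper reports at scale.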
  64. 64. @estebanmoro @estebanmoro emoro@math.uc3m.es THANK YOU!
  65. 65. @estebanmoro References • Reviews about applications of mobile phone data • Blondel, V. D., Decuyper, A., & Krings, G. (2015). A survey of results on mobile phone datasets analysis. EPJ Data Science, 4(1), 10. http://doi.org/10.1140/epjds/s13688-015-0046-0 • Mobile Phone Network Data for Development. (2013). UN Global Pulse • Saramaki, J., & Moro, E. (2015). From seconds to months: an overview of multi-scale dynamics of mobile telephone calls. The European Physical Journal B, 88(6). http://doi.org/10.1140/epjb/e2015-60106-6 • Naboulsi, D., Fiore, M., Ribot, S., & Stanica, R. (n.d.). Large-scale Mobile Traffic Analysis: a Survey. IEEE Communications Surveys & Tutorials, 1–1. http://doi.org/10.1109/COMST.2015.2491361 • Netmob book of abstracts: • Oral: http://netmob.org/assets/img/netmob15_book_of_abstracts_oral.pdf • Posters: http://netmob.org/assets/img/netmob15_book_of_abstracts_posters.pdf
  66. 66. @estebanmoro References • Books about applications of Network Analysis to industry • Cross, R., Thomas, R. J., Singer, J., Colella, S., and Silverstone, Y. 2010. The Organizational Network Fieldbook. Jossey-Bass, San Francisco, California. • Van den Bulte, C., and Wuyts, S. 2007. Social networks and marketing. Marketing Science Institute. • Pinheiro, C. A. R. 2011. Social network analysis in telecommunications. John Wiley & Sons. • Baesens, B., Van Vlasselaer, V., and Verbeke, W. 2015. Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection. Wiley. • Articles about applications of Network Analysis to industry • [Credit Scoring] San Pedro, Jose, Davide Proserpio, and Nuria Oliver. "MobiScore: towards universal credit scoring from mobile phone data." User Modeling, Adaptation and Personalization. Springer International Publishing, 2015. 195-207.
  67. 67. @estebanmoro References • Some articles about applications of Network Analysis to industry/government • [Poverty] Blumenstock, J., Cadamuro, G., & On, R. (2015). Predicting poverty and wealth from mobile phone metadata. Science, 350(6264), 1073–1076. http://doi.org/10.1126/science.aac4420 • [Census] Deville, P., Linard, C., Martin, S., Gilbert, M., Stevens, F. R., Gaughan, A. E., et al. (2014). Dynamic population mapping using mobile phone data. Proceedings of the National Academy of Sciences, 111(45), 15888–15893. http://doi.org/10.1073/pnas.1408439111 • [Credit Scoring] San Pedro, Jose, Davide Proserpio, and Nuria Oliver. "MobiScore: towards universal credit scoring from mobile phone data." User Modeling, Adaptation and Personalization. Springer International Publishing, 2015. 195-207. • [GDP] Guidotti, R., Coscia, M., Pedreschi, D., & Pennacchioli, D. (2016). Going Beyond GDP to Nowcast Well-Being Using Retail Market Data. NetSci-X, 9564(Chapter 3), 29–42. doi:10.1007/978-3-319-28361-6_3 • [Energy] Bogomolov, A., Lepri, B., Larcher, R., Antonelli, F., Pianesi, F., & Pentland, A. (2016). Energy consumption prediction using people dynamics derived from cellular network data. EPJ Data Science, 5(1), 1. doi:10.1140/epjds/s13688-016-0075-3
  68. 68. @estebanmoro References • Some datasets: • Reality mining dataset: http://realitycommons.media.mit.edu/realitymining.html • Mobile Data Challenge 2012 (2012), http://research.nokia.com/page/12000 • Data for Development Challenge (2014), http://d4d.orange.com • Big Data Challenge 2014 (2014), telecomitalia.com/tit/en/bigdatachallenge.html • OpenBigData (2015), http://theodi.fbk.eu/openbigdata/
