Social and economical networks from (big-)data - Esteban Moro II
Social and economical networks from
(big-)data
Esteban Moro
@estebanmoro
NTMB Lake Como School May 2016
Summary
1. Intro to Social/Geo Big Data
2. Sources of Social/Geo Big Data
3. Tools for Social/Geo Big Data
4. Applications of Big Data in Social and Economical problems
5. Outlook
Human behaviors spread in networks
Social contagion
measure the relative topological overlap of the neighborhood of
two users A and B, representing the proportion of their common
friends, as $O_{AB} = N_{AB}/((K_A-1)+(K_B-1)-N_{AB})$, where $N_{AB}$ is the
number of common neighbors of A and B, and $K_A$ ($K_B$) denotes
the degree of node A (B).
Fig. 3(d) demonstrates the effect of
removing links in order of strongest (or weakest) overlaps. In both
cases, we find that removing ties in rank order of weakest to
strongest ties will lead to a sudden disintegration of the network.
In contrast, reversing the order shrinks the network without
precipitously breaking it apart.
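The overlap measure quoted above can be sketched in a few lines. This is only an illustration: the adjacency sets and the function name `topological_overlap` are hypothetical, and the zero-denominator convention is an assumption that the excerpt does not discuss.

```python
def topological_overlap(adj, a, b):
    """O_AB = N_AB / ((K_A - 1) + (K_B - 1) - N_AB) for two linked users A and B."""
    common = len(adj[a] & adj[b])                 # N_AB: common neighbours
    denom = (len(adj[a]) - 1) + (len(adj[b]) - 1) - common
    # Assumption: return 0 when A and B have no neighbours besides each other.
    return common / denom if denom > 0 else 0.0

# Hypothetical undirected call graph stored as adjacency sets.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "E"},
    "C": {"A", "B"},
    "D": {"A"},
    "E": {"B"},
}
overlap_ab = topological_overlap(adj, "A", "B")   # one common friend (C)
overlap_ad = topological_overlap(adj, "A", "D")   # no common friends
```

Ranking links by this value and removing them from weakest to strongest (or the reverse) is the experiment described in the excerpt.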
[Figure: (a) probability of churning versus number of churner neighbours, for May, June and July churners; a second panel shows curves for 3, 4, 5 and 6 churners.]
locally disintegrate a community, while the removal of the weak
links will delete bridges that connect different communities,
leading to a network collapse. Further, we believe that the
observed local relationship, between network topology and tie
strength affects any global information diffusion process (like
churn). In fact, we opine that churn as a behavior can be viewed
less as a dyadic phenomenon (affected only by strong churner-
churner ties), but more as a diffusion process where both strong
and weak ties play a significant role in spreading the influence
through the network topology.
4. PREDICTING CHURNERS IN THE CALL GRAPH
We next discuss how to exploit social ties to identify potential
churners in an operator’s network. Our approach is as follows. We
start with a set of churners (e.g. for April) and their social
relationships (ties) captured in the call graph (for March). Using
the underlying topology of the call graph, we then initiate a
diffusion process with the churners as seeds. Effectively, we
model a “word-of-mouth” scenario where a churner influences
one of his neighbors to churn, from where the influence spreads to
some other neighbor, and so on. At the end of the diffusion
process, we inspect the amount of influence received by each
node. Using a threshold-based technique, a node that is currently
not a churner can be declared to be a potential future one, based
on the influence that has been accumulated. Finally, we measure
the number of correct predictions by tallying with the actual set of
churners that were recorded for a subsequent month (e.g. for
May). The diffusion model is based on Spreading Activation
(SPA) techniques proposed in cognitive psychology and later used
for trust metric computations [32]. In essence, SPA is similar to
performing a breadth-first search on the call graph $G_{March} = (V, E)$.
The basic steps are outlined below:
Node Activation: During each iterative step i, there is a set of
active nodes. Let X be an active node which has associated energy
Dasgupta, K. et al., 2008. Social ties and their relevance
to churn in mobile telecom networks.
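The diffusion-and-threshold idea described in the excerpt can be sketched as follows. The excerpt is cut off before the energy-transfer rule, so the rule used here is an assumption, not the authors' exact method: each active node keeps a fraction `1 - spreading_factor` of its energy and passes the rest to its neighbours in proportion to tie weight. The graph, the function names and the parameter values are hypothetical.

```python
def spread_activation(graph, seeds, spreading_factor=0.5, steps=1):
    """graph: {node: {neighbour: tie_weight}}; seeds: known churners."""
    energy = {node: 0.0 for node in graph}
    for s in seeds:
        energy[s] = 1.0                      # assumed initial energy of a churner
    active = set(seeds)
    for _ in range(steps):
        incoming = {node: 0.0 for node in graph}
        next_active = set()
        for x in active:
            total_weight = sum(graph[x].values())
            if total_weight == 0 or energy[x] == 0:
                continue
            share = spreading_factor * energy[x]
            for y, w in graph[x].items():
                incoming[y] += share * w / total_weight
                next_active.add(y)
            energy[x] -= share               # the rest stays with x
        for node, e in incoming.items():
            energy[node] += e
        active = next_active                 # breadth-first style frontier
    return energy

def predict_churners(energy, seeds, threshold):
    """Non-churners whose accumulated influence reaches the threshold."""
    return {n for n, e in energy.items() if n not in seeds and e >= threshold}

# Hypothetical weighted call graph (weights could be call minutes).
call_graph = {
    "s": {"a": 3.0, "b": 1.0},
    "a": {"s": 3.0, "c": 1.0},
    "b": {"s": 1.0},
    "c": {"a": 1.0},
}
seeds = {"s"}
energy = spread_activation(call_graph, seeds, spreading_factor=0.5, steps=1)
predicted = predict_churners(energy, seeds, threshold=0.3)
```

In this toy run the strongly tied neighbour `a` receives most of the influence and is the only node flagged; the predictions would then be tallied against the churners actually recorded in a later month.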
Sundsøy, P., Bjelland, J., Canright, G., Engø-Monsen, K., & Ling, R. (2010).
2010 International Conference on Advances in Social Networks Analysis and
Mining
Human behaviors spread in networks
Social contagion
received the social message were 0.39% (s.e.m., 0.17%; t-test,
.02) more likely to vote than users who received no message at
Figure 2 shows that the observed per-friend treatment effect
as tie-strength increases. All of the observed treatment effects f
[Figure 1: The experiment and direct effects. (a) Examples of the informational message and social message Facebook treatments ("Today is Election Day", "Find your polling place on the U.S. Politics Page and click the 'I Voted' button to tell your friends you voted"). (b) Direct effect of treatment on own behaviour (%), for self-reported voting, search for polling place and validated voting, comparing social message versus control and social message versus informational message. Vertical lines indicate s.e.m. (they are too small to be seen for the first two bars).]
Bond, R. M., Fariss, C. J., Jones, J. J., Kramer, A. D. I., Marlow, C., Settle, J. E., & Fowler, J. H. (2012). A 61-million-person
experiment in social influence and political mobilization. Nature, 489(7415), 295–298. http://doi.org/10.1038/nature11421
The greater the similarity between individuals the more likely they are to
establish a connection
Homophily
Attribute   Random    Communicate
Age         -0.0001    0.297
Gender       0.0001   -0.032
ZIP         -0.0003    0.557
County       0.0005    0.704
Language    -0.0001    0.694
Table 5: Correlation coefficients for random pairs of people and pairs of people who communicate.
We compare the degree of homophily of random pairs of users with pairs of users that communicate.
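The comparison in the table (attribute correlation across communicating pairs versus random pairs) can be sketched as below. The ages, the links and the number of random pairs are made-up toy data, and `pearson` is a plain textbook implementation rather than the authors' code.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0                      # convention for constant input
    return cov / math.sqrt(var_x * var_y)

# Hypothetical users with an age attribute and a few communication links.
ages = {"u1": 20, "u2": 22, "u3": 40, "u4": 43, "u5": 60, "u6": 58}
links = [("u1", "u2"), ("u3", "u4"), ("u5", "u6")]

# Correlation of ages at the two endpoints of communication links.
r_communicate = pearson([ages[a] for a, b in links],
                        [ages[b] for a, b in links])

# Baseline: pairs of distinct users drawn uniformly at random.
rng = random.Random(0)
users = sorted(ages)
random_pairs = [tuple(rng.sample(users, 2)) for _ in range(200)]
r_random = pearson([ages[a] for a, b in random_pairs],
                   [ages[b] for a, b in random_pairs])
```

With homophilous links, `r_communicate` comes out close to 1, while the random baseline stays far below it, which is the qualitative pattern the table reports for age, ZIP, county and language.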
[Figure 21, panels (a) Random and (b) Communicate: age of one person versus age of the other, both axes from 10 to 80.]
Figure 21: Number of pairs of people of different ages. We plot ages of two people and color
corresponds to the number of such pairs. (a) Ages of randomly selected pairs of people; we note
there is little correlation. (b) Ages of people who communicate with one another, i.e., ages of people
at the endpoints of links in the communication network. The high correlation is captured by the
diagonal trend.
We contrast this statistic with the correlation coefficient where we choose users via a process
of uniform random sampling across 1.3 billion users.
We also consider two measures of similarity: the correlation coefficient and the probabil-
Correlation coefficient
Number of pairs of people at different ages
Leskovec, J. & Horvitz, E., 2008. Planetary-scale views
on a large instant-messaging network. pp.915–924.
Contagion or Homophily
• Contagion = Homophily?
• Influence and homophily are usually confounded in observational social
network studies
network registered >14 billion page views and sent 3.9 b
messages over 89.3 million distinct relationships. For details a
the service, the data, and descriptive statistics see the Data se
of the SI.
Evidence of Assortative Mixing and Temporal Clustering
We observe strong evidence of both assortative mixing and
poral clustering in Go adoption. At the end of the 5-month pe
adopters have a 5-fold higher percentage of adopters in their
networks (t-stat = 100.12, p < 0.001; k.s.-stat = 0.06, p < 0
and receive a 5-fold higher percentage of messages from ado
than nonadopters (t-stat = 88.30, p < 0.001; k.s.-stat =
p < 0.001). Both the number and percentage of one’s local net
who have adopted are highly predictive of one’s propensity to a
(Logistic: (#) = 0.153, p < 0.001; (%) = 1.268, p < 0.001), a
adopt earlier (Hazard Rate: (#) = 0.10, p < 0.001; (%) = 0
p < 0.001). The likelihood of adoption increases dramatically
the number of adopter friends (Fig. 2C), and correspondi
adopters are more likely to have more adopter friends (Fig.
mirroring prior evidence on product adoption in networks (2
Adoption decisions among friends also cluster in time.
randomly reassigned all Go adoption times (while maintainin
adoption frequency distribution over time) and compared obse
Fig. 1. Diffusion of Yahoo! Go over time. (A–C and D–F) Two subgraphs of the
Yahoo! IM network colored by adoption states on July 4 (the Go launch date),
August 10, and October 29, 2007. For animations of the diffusion of Yahoo! Go
over time see Movies S1 and S2.
Fig. 3. Distinguishing homophily and influence. (A and B) The fraction of observed treated to untreated adopters ($n_+/n_-$) under random (A) and propensity score
(B) matching over time. The dotted line shows a ratio of 1, when treatment has no effect. The Right Inset in B graphs the average marginal influence effects of having
1, 2, 3, or 4 adopter friends implied by random (open circles) and propensity score (filled circles) matching. The Left Inset graphs the average cosine distance of attribute
and behavior vectors of adopters to adopter friends as the number of adopters in the local network increases ($\sum_{i,j}^{n} \cos(x_i^a, x_j^a)/n$). (C) Graphs the cosine distances of adopters
to their adopter friends $\cos(x_{it}^a, x_{jt}^a)$, their nonadopter friends $\cos(x_{it}^a, x_{jt})$, and a random alter $\cos(x_{it}^a, x_{rt})$ over time with trend lines fitted by ordinary least squares. (D)
The fraction of treated and untreated adopters, where treatment is defined as having a friend who adopted within a certain time period (or recency) ($\Delta t \equiv t_i^a - t_j^a = R$),
under random matching (open circles) and propensity score matching (filled circles). The Inset graphs the cosine distances of dyads of adopters $\cos(x_{it}^a, x_{jt}^a)$ by the time
Aral, S., et al. 2009. Distinguishing influence-based contagion from homophily-driven
diffusion in dynamic networks. Proceedings of the National Academy of Sciences,
106(51), p.21544.
Granovetter (weak ties)
Strong ties happen within communities. Weak ties across communities
Onnela, J.-P., et al PNAS 2007
[Figure: (A) degree distribution P(k) versus degree k; (B) tie strength distribution P(w) versus link weight w (s); (C) examples of link overlap $O_{ij}$ = 0, 1/3, 2/3 and 1 between nodes $v_i$ and $v_j$; (D) average overlap $\langle O \rangle_w$, $\langle O \rangle_b$ versus $P_{cum}(w)$, $P_{cum}(b)$.]
Fig. 1. Characterizing the large-scale structure and the tie strengths of the
mobile call graph. (A and B) Vertex degree (A) and tie strength distribution (B).
Granovetter (weak ties)
Strong ties happen within communities. Weak ties across communities
Figure 1. Groups and links. (A) Sample of Twitter network: nodes represent users and links, interactions. The follower connections are plotted as
gray arrows, mentions in red, and retweets in green. The width of the arrows is proportional to the number of times that the link has been used for
mentions. We display three groups (yellow, purple and turquoise) and a user (blue star) belonging to two groups. (B) Different types of links
depending on their position with respect to the groups’ structure: internal, between groups, intermediary links and no-group links.
doi:10.1371/journal.pone.0029358.g001
The Strength of Intermediary Ties in Social Media
network (followers and followees), while the second consists in
retrieval of the user activity from the stream of Twitter (p
tweets, mentions and retweets). In the first stage, the dire
unweighted network is obtained from the information on
followers and followees of each user. The data was collected u
a breadth-first search technique: Starting from several se
followers and followees of the seeds were retrieved. Then the s
procedure was repeated for the newly discovered users obtaini
Figure 5. Intermediary links. (A) Ratio r between the number o
the links in the follower network (black curve), those with mentions
groups of the users connected by the link. Inset, ratios between t
doi:10.1371/journal.pone.0029358.g005
Retweets (RTs) happen between groups.
Mentions (MTs) happen within groups.
Grabowicz, P. A., Ramasco, J. J., Moro, E., Pujol, J. M., & Eguiluz, V. M. (2012).
PLoS ONE, 7(1), e29358. http://doi.org/10.1371/journal.pone.0029358
Problem: how do users manage their social contacts?
Problem: characterising/predicting social turnover
Answer: study CDRs to detect new/old social relationships
[Figure: neighbor id versus year, with annotated rates of 1.1 persons/day and 0.6 persons/day.]
Temporal networks
Problem: how do users manage their social contacts?
Problem: characterizing/predicting social turnover
Answer: study CDRs to detect new/old social relationships
[...] suggests that a large fraction of the revealed aggregated social
connectivity $k_i(T)$ is given by newly formed or removed connections.
Thus, $k_i(T)$ usually overestimates the instantaneous
human social capacity of maintaining social ties.
The imbalance between the number of added/removed
ties measures how social capacity changes. At the end of
[Figure 2, panels A to C: tie id and number of ties $k_i(t)$ versus time (days, 0 to 210).]
Figure 2. From communication activity to tie dynamics:
Panel (A) shows the communication events of a given individual in our database with
Real example:
700M relationships
23M people
Temporal networks
Problem: how do users manage their social contacts?
We create and destroy relationships at the same pace!
Users’ social capacity remains constant!
Capacity = 5
Links created/destroyed = 4
Temporal networks
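A minimal sketch of how capacity and tie turnover could be counted once each relationship has been reduced to a formation time and a decay time. The interval representation, the function name `social_strategy` and the toy numbers are assumptions for illustration; the slides do not give the detection procedure in recoverable form.

```python
def social_strategy(ties, t_start, t_end):
    """ties: list of (formed, decayed) times for one user, with formed < decayed.

    Returns (created, destroyed, capacity_at_start, capacity_at_end) for the
    observation window (t_start, t_end].
    """
    created = sum(1 for formed, decayed in ties if t_start < formed <= t_end)
    destroyed = sum(1 for formed, decayed in ties if t_start < decayed <= t_end)
    capacity_start = sum(1 for formed, decayed in ties if formed <= t_start < decayed)
    capacity_end = sum(1 for formed, decayed in ties if formed <= t_end < decayed)
    return created, destroyed, capacity_start, capacity_end

# Hypothetical user: two stable ties and a rotating third slot (times in days).
ties = [(0, 100), (0, 100), (0, 40), (10, 60), (50, 100)]
created, destroyed, cap_start, cap_end = social_strategy(ties, t_start=5, t_end=90)
```

Here two ties are created and two are destroyed while the number of simultaneously open ties stays at three, which mirrors the slide's point that creation and destruction balance so that capacity remains constant. By construction, `cap_end = cap_start + created - destroyed`.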
Problem: how do users manage their social contacts?
Users have different social strategies:
Social explorer: Links created = 23, Capacity = 4
Social keeper: Links created = 3, Capacity = 24
Temporal networks
Problem: how do users manage their social contacts?
Social strategy changes with age
[Figure, panels A and B: capacity and links created/destroyed as a function of age (16 to 68), by gender (F, M).]
Miritello, G., Lara, R., Cebrian, M., & Moro, E. (2013).
Temporal networks
Our mobility is highly predictable
Human mobility
and that their average call frequency $f$ is $\geq 0.5$ hour$^{-1}$
[(22) sections S1 and S2].
The trajectories of two users with widely
different mobility patterns are shown in Fig. 1A:
The first user moves in the vicinity of N = 22
towers in a 30-km region, whereas the second
visits as many as N = 76 towers spanning
approximately a 90-km neighborhood. To under-
stand the recurrent nature of individual mobility,
we assigned to each user a mobility network (23)
(Fig. 1B), in which nodes are the locations visited
by the user (each location corresponding to a
where $N_i$ is the number of distinct locations
visited by user $i$, capturing the degree of
predictability of the user’s whereabouts if each
location is visited with equal probability; (ii)
the temporal-uncorrelated entropy
$S_i^{unc} \equiv -\sum_{j=1}^{N_i} p_i(j) \log_2 p_i(j)$, where $p_i(j)$ is the
historical probability that location $j$ was visited
by the user $i$, characterizing the heterogeneity
of visitation patterns; (iii) the actual entropy,
$S_i$, which depends not only on the frequency
of visitation, but also the order in which the
nodes were visited and the time spent at each
activity, during which we have no information
about the user’s location (Fig. 1C). This incom-
pleteness of the collected data is captured by the
parameter q, representing the fraction of hour-
long intervals when the user’s location is
unknown to us. As Fig. 1E shows, P(q) across
our user base peaked around q = 0.7, which
indicated that, for a typical user, we have no
location update for about 70% of the hourly
intervals, which masks the user’s real entropy Si.
We therefore studied the dependence of the
entropy S(q) on the incompleteness q, which
Fig. 1. (A) Trajectories of two anonymized
mobile phone users who visited the vi-
cinity of N = 22 and 76 different towers
during the 3-month-long observational period. Each dot
corresponds to a mobile phone tower, and each time a user
makes a call, the closest tower that routes the call is recorded,
pinpointing the user’s approximate location. The gray lines
represent the Voronoi lattice, approximating each tower’s area
of reception. The colored lines represent the recorded move-
Song, C., Qu, Z., Blumm, N., & Barabasi, A.-L. (2010). Limits of predictability in human
mobility. Science, 327(5968), 1018.
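The first two entropies in the excerpt can be computed directly from a sequence of visited locations. The excerpt is cut off before the first definition, so taking it to be $\log_2 N_i$ (all $N_i$ locations equally likely) is an assumption based on the surviving description; the visit sequence and function names are hypothetical. The actual entropy $S_i$, which depends on visit order, is not covered here.

```python
import math
from collections import Counter

def random_entropy(visits):
    """log2 of the number of distinct locations (each assumed equally likely)."""
    return math.log2(len(set(visits)))

def uncorrelated_entropy(visits):
    """-sum_j p(j) log2 p(j), with p(j) the historical visit frequency of j."""
    counts = Counter(visits)
    total = len(visits)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical tower/location sequence for one user.
visits = ["home", "work", "home", "home", "gym", "work", "home", "home"]
s_rand = random_entropy(visits)
s_unc = uncorrelated_entropy(visits)
```

Because this user spends most of the time at one location, `s_unc` is lower than `s_rand`; the two coincide only when every location is visited equally often.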
Our mobility is highly predictable
Human mobility
One shop gets 20% of the use of a credit card.
Including other people’s choices, we can reach 30% accuracy.
Krumme, C. et al., 2013. The predictability of consumer visitation patterns. Scientific Reports, 3.
Most of our social connections happen in our neighbourhood (Gravity Law)
Geography of social networks
$P_{ij} \propto \frac{1}{(d_{ij})^{\alpha}}$
Liben-Nowell D et al PNAS 2005
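One simple way to estimate the exponent $\alpha$ in $P_{ij} \propto d_{ij}^{-\alpha}$ is a least-squares fit in log-log space. This is a sketch on synthetic data, not the estimation procedure of the cited paper; the binned distances, the prefactor 0.1 and the exponent 1.2 are made up for the example.

```python
import math

def fit_power_law_exponent(distances, probabilities):
    """Least-squares slope of log P versus log d; returns alpha = -slope."""
    xs = [math.log(d) for d in distances]
    ys = [math.log(p) for p in probabilities]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope

# Synthetic friendship probabilities that decay exactly as d^(-1.2).
distances_km = [1, 2, 4, 8, 16, 32]
probabilities = [0.1 * d ** -1.2 for d in distances_km]
alpha = fit_power_law_exponent(distances_km, probabilities)
```

On noisy empirical data the choice of distance bins affects the estimate, so a fit like this is best treated as a first look.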
Most of our social connections happen in our neighbourhood (Gravity Law)
Y. Takhteyev et al. / Social Networks 34 (2012) 73–81
am of physical distances between egos and alters. The graph shows the number of ties by distance, in 200 km bins (for example, Ne
unted towards the 5400 km bin). The total number of ties in each of the two simulations is the same as in the observed data. Based on
Takhteyev, Y., Gruzd, A., & Wellman, B. (2012). Geography of Twitter networks. Social Networks, 34(1), 73–81.
http://doi.org/10.1016/j.socnet.2011.05.006
Geography of social networks
Most of our geographical movements happen locally (Gravity Law)
$P_{ij} \propto \frac{1}{(d_{ij})^{\alpha}}$
Fig 1. A) Map of the mobility fluxes Tij between municipalities based on Twitter inferred trips (white). Infomap communities detected on the network Ti
colored under the mobility fluxes (blue colors). B) Mobility fluxes Tij between municipalities i and j are constructed by aggregating the number of trips b
them. C) Correspondence between the observed fluxes $T_{ij}$ and the fitted gravity model fluxes. Dashed line is the $T_{ij} = T_{ij}^{grav}$ while the (blue) solid line is
conditional average of $T_{ij}^{grav}$ for values of $T_{ij}$. Maps were created using the maptools and sp packages in the R environment.
doi:10.1371/journal.pone.0128692.g001
Llorente, A., et al. (2015). PLoS ONE, 10(5), e0128692
find them not suitable to study socio-economical activity:
the administrative boundaries between municipalities reflect
and historical decisions, while economical activity happens
hose boundaries. The result is that municipalities in Spain
cially diverse, ranging from municipalities with only 7 in-
s to others with 3.2 million population. Although there exists
ggregations of municipalities in provinces (regions) or sta-
metropolitan areas, we have used our own data to detect eco-
areas. In particular, we have used user daily trips between
alities in our database to detect those which are economi-
ated. We say that there is a daily trip between municipality i
a user has tweeted in place i and j consecutively within the
y. In our database we find 1.9 million trips by 0.22 million
With those trips we construct the daily mobility flux network
ween municipalities as the number of trips between place i
emarkably, the statistical properties of trips and of the mobil-
x Tij coincide with those of other mobility data (see Supp.
or example, trip distance and elapsed time are power law dis-
with exponents very similar to those found in the literature.
mobility fluxes Tij are well described by the Gravity Law
.69)
$T_{ij} \simeq T_{ij}^{grav} = \frac{P_i^{\alpha_i} P_j^{\alpha_j}}{d_{ij}^{\beta}}$
relations between municipalities. This resu
mobility detected from geo-located tweets
tained are a good description of economic
paper, we restrict our analysis to the geog
the Infomap detected communities (see fig
munities which are not formed by at least 5
this, 99% of the total country of the popul
analysis. Similar (although statistically w
for municipalites or provinces.
Social media fingerprints
The goal of this work is to quantify how t
be extracted from social media and then
economical level of cities. To this end, we
have been widely explored in other fields l
ences. All these four measures rely on the
where users live. Instead of using informa
analyze the places where the user has twe
town of the user, the municipality where h
highest frequency, a method usually emplo
social media. To this end we select those u
located tweets in our period and which h
$\alpha_i \approx \alpha_j = 0.42$, $\beta = 0.89$
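The gravity model for mobility fluxes can be evaluated directly with the fitted exponents quoted on the slide (0.42 for the population exponents, 0.89 for the distance exponent). The symbol for the distance exponent was lost in extraction, so calling it `beta` is an assumption, as are the toy populations and distances. The result is only proportional to a flux, since the slide gives no normalisation constant.

```python
def gravity_flux(pop_i, pop_j, distance_km,
                 alpha_i=0.42, alpha_j=0.42, beta=0.89):
    """Unnormalised gravity-model flux P_i^alpha_i * P_j^alpha_j / d_ij^beta."""
    return (pop_i ** alpha_i) * (pop_j ** alpha_j) / (distance_km ** beta)

# Hypothetical pair of municipalities at two different separations.
near = gravity_flux(1_000_000, 100_000, distance_km=100)
far = gravity_flux(1_000_000, 100_000, distance_km=200)
ratio = near / far   # doubling the distance divides the flux by 2**beta
```

With a distance exponent below 1, doubling the distance reduces the predicted flux by slightly less than half.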
Geography of mobility networks
Social/mobility communities coincide with geographical communities
Results and Discussion
The question naturally arises: What is the best way to group
these pixels into larger regions? A similar question has been a focus
of network research over the past decade; there one seeks the best
way to partition a network into separate, non-overlapping
communities [13–18]. The leading approach is based on
optimizing the network’s ‘‘modularity’’ [15]. High modularity
values occur when the network is subdivided such that there are
many links within communities and few between them, as
compared to a randomly generated network with otherwise
similar characteristics.
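For reference, the modularity of a given partition can be computed as below. This is the standard Newman modularity for an unweighted, undirected graph, written as a sum over communities, $Q = \sum_c [L_c/m - (d_c/2m)^2]$, where $L_c$ is the number of internal links and $d_c$ the total degree of community $c$. The cited work weights links by call time, so this unweighted version and the toy graph are simplifying assumptions.

```python
def modularity(edges, community):
    """edges: list of undirected (u, v) pairs; community: {node: label}."""
    m = len(edges)
    internal = {}     # L_c: links with both ends in community c
    degree_sum = {}   # d_c: sum of node degrees in community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

# Toy network: two triangles joined by a single bridge link.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_groups = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
one_group = {n: "A" for n in range(1, 7)}
q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

Splitting the network at the bridge gives a clearly positive value, while putting every node in one community gives zero, in line with the excerpt's description of many links within communities and few between them.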
However, we are not trying to partition the network itself, but
rather to use the network’s characteristics to partition the
geographic space underneath the network’s topology while
guaranteeing spatial adjacency, one of the essential features of a
geographic region.
analysis as it allowed us to correctly represent the human network
from which we started (see Text S2).
After two iterations of the algorithm, a surprisingly accurate
map of the Greater London region emerged, along with an area
corresponding to Scotland, with just a few detached pixels
scattered across the rest of Great Britain (Fig. 2 (a) and (b)).
With subsequent iterations the modularity increased, ultimately
converging to a maximum of 0.58, indicative of a good
partitioning compared to the randomized network, as mentioned
in [15,20]. The resulting subdivision had 23 communities, 13 of
which were clearly delineated geographically, although some
scattered pixels and fuzzy boundaries remained. To determine if
these artefacts were due to noise produced by the heuristics of
spectral partitioning, we next fine-tuned the spectral partitioning
algorithm in a manner suggested by Newman [16], iteratively
moving pixels from one region to another to maximize overall
modularity (see Text S3). When applied to our data, this process
Figure 1. The geography of talk in Great Britain. This figure shows the strongest 80% of links, as measured by total talk time, between areas
within Britain. The opacity of each link is proportional to the total call time between two areas and the different colours represent regions identified
using network modularity optimisation analysis.
doi:10.1371/journal.pone.0014248.g001
Ratti, C. et al. (2010). PLoS ONE, 5(12), e14248.
Social Media Fingerprints of Unemplo
Llorente, A., et al. (2015). PLoS ONE, 10(5), e0128692
Geography of networks
Product adoption
Problem: Improve targeting in product adoption
Answer: study “network” data from CDRs, CRMs, etc. to detect social influence
92% trust a friend’s recommendation
47% trust ads on TV
33% trust display ads on mobile devices
Problem: Improve targeting in product adoption
Answer: “network” data from CDRs, CRMs, etc. to detect social influence
Direct relationship
Indirect relationship
Product adoption
Product adoption
Sundsøy, P., Bjelland, J., Canright, G., Engø-Monsen, K., & Ling, R. (2010). Product adoption
networks and their growth in a large mobile phone network. 2010 International Conference on
Advances in Social Networks Analysis and Mining, 208–216.
Figure 2. Time evolution of the iPhone adoption network. One node represents one subscriber. Node color: represents iPhone model: red=2G, green=iPhone 3G,
yellow=3GS. Node size, link width, and node shape (attributes which are visible in Q3 2007) represent, respectively, internet volume, weighted sum of SMS and
voice traffic, and subscription type. Round node shape represents business users, while square represents consumers.
adoption network is diffusing over the underlying social
network. In particular we will often focus on the time
evolution of the LCC of the adoption network – which may or
may not form a social network monster. We recall from Figure
1 that the other components are often rather small compared to
the LCC. Hence we argue that studying the evolution of the
LCC itself gives useful insight into the strength of the network
spreading mechanisms in operation. It also gives insight into
the broader context of adoption. As described in [8], two
friends adopting together does not necessarily imply social
influence – there might also be external factors that control the
the underlying mechanism.
A. The iPhone case
The iPhone 2G was officially released in the US in late Q2
2007 followed by 3G in early Q3 2008 and 3GS late Q2 2009.
It was released on the Telenor net in 2009. Despite the
existence of various models, we have chosen to look at the
iPhone as one distinct product, since (as we will see) the older
models are naturally substituted in our network. Figure 2 shows
the development of the iPhone monster in one particular
market. We observe how the 2G phone is gradually substituted
Organizational analysis
Team
managers
• How do we detect hidden
leaders inside a company?
• Find real leaders inside the
company and measure their
• Centrality
• Connectivity
• Number of communities
and their diversity
• Train a model to detect
them
• Find other people with
similar roles inside the
company
Insurance pricing / Credit risk
• Whom would you lend money to in this network?
Use of mobile phone data or social networks to assess credit risk in microcredit approval
(Lenddo, Cignifi)
Granovetter: larger diversity of contacts, more opportunities, more job offers, etc.
BIG DATA, SMALL CREDIT: The Digital Revolution and Its Impact on Emerging Market Consumers, Omidyar Network
Areas with larger diversity of
contacts have more economic
development (Granovetter)
Deprivation index of an area
decreases with:
• Number of social contacts
• Diversity of social contacts
Economic development
Eagle, N., Macy, M., & Claxton, R. (2010).
Science, 328(5981), 1029–1031. 1186605
tie formation. Previous studies have found that in-
dividuals benefit from having social ties that bridge
between communities. These benefits include access
to jobs and promotions (5–13), greater job mobility
(14, 15), higher salaries (9, 16, 17), opportunities for
entrepreneurship (18, 19), and increased power in
negotiations (20, 21). Although these studies sug-
gest the possibility that the individual-level bene-
fits of having a diverse social network may scale to
the population level, the relation between network
structure and community economic development
has never been directly tested (22).
As policy-makers struggle to revive ailing econ-
omies, understanding this relation between net-
work structure and economic development may
provide insights into social alternatives to traditional
stimulus policies. To that end, we analyzed the most
complete record of a national communication network
studied to date and coupled this social network
data with detailed socioeconomic indicators to measure
this relation directly, at the population level. The
communication network data were collected during
the month of August 2005 in the UK. The data
contain more than 90% of the mobile phones and
greater than 99% of the residential and business
landlines in the country. The resulting network has
$65 \times 10^6$ nodes, $368 \times 10^6$ reciprocated social ties, a
mean geodesic distance (minimum number of direct
or indirect edges connecting two nodes) of 9.4, an
average degree of 10.1 network neighbors, and a
giant component (the largest connected subgraph)
containing 99.5% of all nodes (23).
Although the nature of this communication
data limits causal inference, we were able to test
the hypothesized correspondence between social
network structure and economic development
using the 2004 UK government’s Index of Mul-
tiple Deprivation (IMD), a composite measure of
relative prosperity of 32,482 communities encom-
passing the entire country (24), based on income,
employment, education, health, crime, housing,
and the environmental quality of each region (25).
Each residential landline number was associated
with the IMD rank of the exchange in which it was
We then compared the IMD rank of each com-
entropy associated with individual i’s communi-
Fig. 1. An image of regional communication diversity and socioeconomic ranking for the UK. We find
that communities with diverse communication patterns tend to rank higher (represented from light blue
to dark blue) than the regions with more insular communication. This result implies that communication
diversity is a key indicator of an economically healthy community. [(29) Crown copyright material is
reproduced with the permission of the Controller of Her Majesty’s Stationery Office]
Areas with
• larger diversity of mobility or
• users with larger radius of gyration
have more economic development
Economic development
Smith, C., Quercia, D., & Capra, L. (2013). Finger on the pulse: identifying deprivation
using transit flow analysis, 683–692.
[Figure 4 panels: (a) Real composite IMD score (scale 0.0 to 1.0), (b) True class, (c) Predicted class.]
Figure 4. a) Census areas containing stations coloured according to real composite IMD score; b) stations which fall in the 1st or 4th quartile for
composite IMD, classified as high or low; and c) predicted classifications for the same areas.
clude exploiting other variables available in the Oyster data,
such as ticket price and card type (e.g., standard, student, el-
derly and disabled). We have also begun to develop methods
By combining different datasets we can build a multiplex
network (i.e. network with multiple types of edge), which
may offer additional insights into the relationship between
(Functional) geographic areas in Spain
Madrid Barcelona
“The piece is absolutely useless, even ridiculous, outside Spain, because the audience cannot
hope to understand its significance, nor the performers to play it as it should be played.”
Twitter penetration
• Is Twitter penetration related to the economic development of areas?
• At country scale, Twitter penetration ~ GDP
• At small scale it is the opposite: Twitter penetration ~ unemployment
witter in a given area, a more illustrative metric is
o between the number of Twitter users and the
ratio does not distribute uniformly across the globe
a country economic development approximated by
ile this property has been already described e.g. by
fit of a power law approximation increased when
ather than all Twitter users appearing in a country.
th a penetration rate below 0.05‰ (we also exclude
s smaller than 10,000).
es of the world. (A) Spatial distribution of the index. (B)
capita GDP of a country. $R^2$ coefficient equals 0.65.
Hawelka, B. et al., 2013. Geo-located Twitter as the proxy for global mobility patterns.
[Figure: Twitter penetration rate index versus % unemployment; $\rho = 0.70$ [0.6, 0.77].]
Figure 3. Users and GDP per capita. Correlation between country level Twitter penetration and
GDP/capita.
Tweet: Alguien se viene con migo aver la vida de PI??
Detected misspellings:
- “Con migo” instead of “Conmigo” (“with me” in Spanish).
- “aver” instead of “a ver” (“aver” is not a Spanish word).
Tweet: La quiero mucho y la hecho de menos
Detected misspellings:
- “Hecho de menos” instead of “echo de menos” (“I miss her” in Spanish).
All the 618 expressions such as “Con migo”, “Aver” or “Hecho de menos” have been searched literally within the text of the whole dataset of tweets.
[Figure: example areas labelled with entropy and unemployment (entropy 0.72, unemployment rate 11%; entropy 0.42, unemployment rate 23%; entropy 0.42, unemployment 20.3%; entropy 0.72, unemployment 8.8%), and a plot of activity (% of tweets) by hour for areas with low and high unemployment rates]
Twitter social interactions
• Granovetter: diversity of interactions leads to more opportunities
• Diversity of interactions between cities is correlated with economic development
• We construct the graph of social interactions
• Measure diversity with entropy
Eagle et al., Science 2010
$w_{ij}$ = number of @ between areas $i$ and $j$
$p_{ij} = w_{ij} / \sum_{j=1}^{k_i} w_{ij}$
$S_i = -\sum_{j=1}^{k_i} p_{ij} \log p_{ij}$
[Plot: entropy (%) vs. % unemployment; ρ = -0.21 [-0.37, -0.04]]
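As a minimal sketch of the entropy definition above (not the authors' code), the diversity of each area can be computed from a table of mention counts. The function name `interaction_entropy`, the dictionary layout, and the toy counts are assumptions made for illustration; the optional division by log k_i is also an assumption, included only because it bounds the value between 0 and 1.

```python
import math
from collections import defaultdict


def interaction_entropy(weights, normalized=False):
    """Return {area i: S_i} with S_i = -sum_j p_ij * log(p_ij).

    weights: dict mapping (i, j) -> w_ij, the number of mentions from
    area i to area j (hypothetical input layout). Zero weights are ignored.
    """
    by_area = defaultdict(dict)
    for (i, j), w in weights.items():
        if w > 0:
            by_area[i][j] = w

    entropy = {}
    for i, neighbours in by_area.items():
        total = sum(neighbours.values())
        s = -sum((w / total) * math.log(w / total) for w in neighbours.values())
        k = len(neighbours)  # k_i: number of areas that i interacts with
        if normalized and k > 1:
            s /= math.log(k)  # assumption of this sketch: rescale to [0, 1]
        entropy[i] = s
    return entropy


# Toy data (invented): area "A" spreads its mentions evenly,
# area "B" concentrates almost all of them on a single partner.
toy_weights = {
    ("A", "B"): 10, ("A", "C"): 10, ("A", "D"): 10,
    ("B", "A"): 98, ("B", "C"): 1, ("B", "D"): 1,
}
S = interaction_entropy(toy_weights)
S_norm = interaction_entropy(toy_weights, normalized=True)
print(S)
print(S_norm)
```

In this toy example the evenly spread area gets the larger entropy. The same function could be applied to any other weighted graph between areas by passing different weights.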
Twitter geographical interactions
• Diversity of geographical mobility is correlated with development
• We use the graph of flows
• Measure diversity with entropy
Smith, C., Mashhadi, A. & Capra, L., 2013. Ubiquitous sensing for mapping poverty in developing countries.
Smith, C., Quercia, D. & Capra, L., 2013. Finger on the pulse: identifying deprivation using transit flow analysis. pp. 683–692.
$\tilde{p}_{ij} = T_{ij} / \sum_{j=1}^{\tilde{k}_i} T_{ij}$
$\tilde{S}_i = -\sum_{j=1}^{\tilde{k}_i} \tilde{p}_{ij} \log \tilde{p}_{ij}$
[Plot: entropy (%) vs. % unemployment; ρ = -0.023 [-0.19, 0.14]]
Twitter content
• Two different approaches
• Classical approach: NLP applied to detect mentions of “unemployment”, “job”, “economy”, …
Antenucci, D. et al., 2014. Using Social Media to Measure Labor Market Flows.
[Figure: Initial Claims (thousands, left scale) and Social Media (Factor 1, right scale), 2011 to 2013]
[Plot: # mentions of employment vs. % unemployment; ρ = 0.33 [0.17, 0.47]]
Twitter content
• Our approach: NLP applied to detect lexical complexity (as a proxy for educational level)
• Readability (Gunning index)
• Serious misspellings
• We construct a list of more than 600 incorrect expressions of this type, validated by Spanish-language linguistic experts.
• We do not take into account misspellings due to different Spanish accents or IM abbreviations.
• We compute for each area the fraction of users that make a number of serious misspellings.
Tweet: Alguien se viene con migo aver la vida de PI??
Correct spelling: Alguien se viene conmigo a ver la vida de PI??
Tweet: La quiero mucho y la hecho de menos
Correct spelling: La quiero mucho y la echo de menos
Tweet: Yo llendo a trabajar con este tiempo
Correct spelling: Yo yendo a trabajar con este tiempo
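The per-area misspelling measure described on this slide could be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function name `fraction_of_misspellers`, the `(area, user, text)` tuple layout, the `min_count` threshold and the whole-word, case-insensitive matching are all hypothetical choices, and the short expression list stands in for the full list of incorrect expressions.

```python
import re
from collections import defaultdict


def fraction_of_misspellers(tweets, expressions, min_count=1):
    """Return {area: fraction of users with at least min_count detections}.

    tweets: iterable of (area, user, text) tuples (hypothetical layout).
    expressions: incorrect expressions to search for in the tweet text.
    """
    # Whole-word, case-insensitive matching is an assumption of this sketch.
    patterns = [re.compile(r"\b" + re.escape(e) + r"\b", re.IGNORECASE)
                for e in expressions]
    hits = defaultdict(int)   # (area, user) -> number of detected expressions
    users = defaultdict(set)  # area -> users seen in that area
    for area, user, text in tweets:
        users[area].add(user)
        hits[(area, user)] += sum(len(p.findall(text)) for p in patterns)
    return {
        area: sum(hits[(area, u)] >= min_count for u in members) / len(members)
        for area, members in users.items()
    }


# Example tweets taken from the slide; areas and user ids are invented.
example_expressions = ["con migo", "aver", "hecho de menos", "llendo"]
example_tweets = [
    ("area_1", "u1", "Alguien se viene con migo aver la vida de PI??"),
    ("area_1", "u2", "Alguien se viene conmigo a ver la vida de PI??"),
    ("area_2", "u3", "La quiero mucho y la hecho de menos"),
    ("area_2", "u4", "Yo llendo a trabajar con este tiempo"),
    ("area_2", "u5", "Yo yendo a trabajar con este tiempo"),
]
fractions = fraction_of_misspellers(example_tweets, example_expressions)
print(fractions)
```

Matching on word boundaries avoids counting an expression such as “aver” when it only appears inside a longer word; a purely literal substring search would be a different design choice.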
J. Davenport et al, The Readability of Tweets and their
Geographic Correlation with Education, arXiv:1401.6058, 2014
[Plot: number of misspellers vs. % unemployment]
Nowcasting of shadow economy
Do we detect less or more unemployment than is officially registered?
Error = Unemployment_model - Unemployment_registered
(Geolocated tweets)
Dataset: 19.6 million geolocated tweets
A. Llorente, EM, et al., 2015
http://arxiv.org/abs/1411.3140
[Plot: Error (-30% to 30%) vs. % Shadow Economy (15 to 35)]
The model predicts less unemployment than the official figures in areas with a larger shadow economy.
Use model in under-developed countries
250 million people
30 million Twitter users
50% smartphone penetration in 2015
Sandy Hurricane, 29 October 2012
1/2 $Billion impact in FEMA grants and Insurance claims
1-2 years to assess the economic damage
Kryvasheyeu, Moro, E., et al. (2016). Science Advances
Predicting economic models for disaster damage
[Figure: correlation with economic damage, measured by grants (FEMA) and insurance claims, vs. hours since hurricane landing, for Twitter activity (# tweets) and tweet sentiment; map legend: tweets per 100k]
Other natural disasters
Table 1. Activity-damage correlation (Kendall τ, Spearman ρ, and Pearson ρ) for additional events. Disasters are sorted on the order of the increasing strength of the Pearson correlation coefficient. All disasters demonstrate moderate to strong levels of statistically significant correlations (P < 0.05) [with the exception of Alaska floods (DR-4122)].

Event ID  Type        Kendall τ  P               Spearman ρ  P               Pearson ρ  P
DR4116    Floods      0.15       9.04 × 10^-5    0.21        1.87 × 10^-4    0.18       9.71 × 10^-4
DR4117    Tornadoes   0.17       0.05            0.26        0.05            0.24       0.06
DR4176    Tornadoes   0.18       8.92 × 10^-3    0.28        6.68 × 10^-3    0.27       9.60 × 10^-3
Sandy     Hurricane   0.16       3.30 × 10^-13   0.24        5.04 × 10^-13   0.30       5.99 × 10^-20
DR4145    Floods      0.33       3.54 × 10^-8    0.47        2.42 × 10^-8    0.45       1.08 × 10^-7
DR4177    Floods      0.36       4.44 × 10^-4    0.52        2.33 × 10^-4    0.45       1.53 × 10^-3
DR4175    Tornadoes   0.34       0.02            0.46        0.03            0.46       0.03
DR4195    Floods      0.32       1.28 × 10^-8    0.47        3.35 × 10^-9    0.46       6.32 × 10^-9
DR4174    Tornadoes   0.56       5.24 × 10^-3    0.69        6.07 × 10^-3    0.68       6.93 × 10^-3
DR4157    Tornadoes   0.51       9.70 × 10^-4    0.71        2.38 × 10^-4    0.72       1.71 × 10^-4
DR4168    Mudslide    0.44       0.04            0.59        0.03            0.86       1.84 × 10^-4
DR4193    Earthquake  0.74       3.80 × 10^-5    0.90        7.50 × 10^-7    0.88       3.92 × 10^-6
DR4122    Floods      1.00       -               1.00        -               1.00       -
DISCUSSION
We found that Twitter activity during a large-scale natural disaster, in this instance Hurricane Sandy, is related to the proximity of the region to the path of the hurricane. Activity drops as the distance from the hurricane increases; after a distance of approximately 1200 to 1500 km, the influence of proximity disappears. High-level analysis of the composition of the message stream reveals additional findings. Geo-enriched data (with location of tweets inferred from users’ profiles) show that the areas close to the disaster generate more original content, characterized by a lower fraction of retweets. This extends the previous understanding of retweeting behavior in crisis (31, 32) and confirms other studies (41). Finally, we find that messages from disaster regions generate more interest globally, with a higher normalized count of retweet sources.
In the first study of its kind based on the actual ex-post damage assessments, we demonstrated that the per-capita number of Twitter messages corresponds directly to disaster-inflicted monetary damage. The correlation is especially pronounced for persistent postdisaster activity and is weakest at the peak of the disaster. We established that per-capita activity and per-capita damage both have an approximately log-normal distribution and that the Pearson correlation coefficient between the two can reach 0.6 for a carefully selected observation period in the aftermath of the landfall. This makes social media a viable platform for preliminary rapid damage assessment in the chaotic time immediately after a disaster. Our results suggest that, during a disaster, officials should pay attention to normalized activity levels, rates of …
Fig. 5. Distribution of activity-damage correlations (Pearson correlation coefficients) across all disasters considered in the study. In terms of damage, disasters appear to group according to their type, with cost increasing from tornado storms, to floods, and eventually to hurricanes. The correlation between activity and damage is very strong for small-scale (low-cost) disasters, then it weakens and remains, on average, at the same level across moderate-cost to high-cost events.
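The three correlation measures reported for activity and damage can be illustrated with a small pure-Python sketch. It is not the authors' analysis: the area records are invented, the Kendall coefficient is the simple tau-a variant (an assumption, since the variant used is not stated here), and taking logarithms before the Pearson correlation is a choice motivated only by the approximately log-normal distributions mentioned in the discussion.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)


# Invented records: tweets, population and monetary damage per area.
areas = [
    {"tweets": 120, "population": 10000, "damage": 5.0e5},
    {"tweets": 300, "population": 20000, "damage": 2.4e6},
    {"tweets": 80, "population": 16000, "damage": 1.6e5},
    {"tweets": 900, "population": 30000, "damage": 9.0e6},
]
activity = [a["tweets"] / a["population"] for a in areas]
damage = [a["damage"] / a["population"] for a in areas]

tau = kendall_tau_a(activity, damage)
rho = spearman(activity, damage)
r_log = pearson([math.log(v) for v in activity], [math.log(v) for v in damage])
print(tau, rho, r_log)
```

Both quantities are normalized per capita before being compared, so that large areas do not dominate the correlation simply because of their population.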
Problems
• Data-based societies/governments
• Transparency: data-driven decisions
• Responsibility: decisions backed up by data and algorithms
• Policy making with A/B testing
• http://www.wired.com/2012/04/ff_abtesting/all/1
• http://www.fastcompany.com/3042630/first-us-chief-data-scientist-dj-patilscientist-dj-patil
Problems
• N ≠ ALL
• Some social sectors might not be well represented
• Potential biases towards the youngest, richest, etc.
• We need sampling techniques to ensure the representativeness of the data.
• Biases everywhere!
Problems
• N ≠ ALL
• Biases everywhere
http://www.pewglobal.org/2016/02/22/smartphone-ownership-and-internet-usage-continues-to-climb-in-emerging-economies/
Problems
• N ≠ ALL
• Biases everywhere
http://www.pewinternet.org/2015/08/19/the-demographics-of-social-media-users/
Problems
• Privacy ~ 1 / Value
• Traceability: who/where/how is accessing our data?
• Value: most data is proprietary, but whose value is it?
• Measure: how much privacy is lost when our data is used? How much is our data valued?
FT.com http://on.ft.com/14yjj65
• Which data has more value?
Privacy / Data Value
anonymized/aggregated form?
Q2. On day {dd/MM} you assigned a value of {min-bid per category} to the information [{least valued info per category}]. This was your minimum bid. Why? (multi-choice*)
Q3. On day {dd/MM} you assigned a value of {max-bid per category} to the information [{most valued info per category}]. This was your maximum bid. Why? (multi-choice*)
Q4. Imagine there was a market in which you could sell your personal information (e.g. information about people you called, places you’ve been, applications you’ve used, songs you’ve listened to, etc.). Who would you trust to handle your information? Please, order the following entities from most to least trusted. (rank**)
Q5. The category {locations|communications|apps|media} is the one that you refused to sell the most ({percentage of opt-outs}). Why? (free-text)
Table 3: Questions asked in the EoS questionnaire. *included: Fair value, Test/Mistake, Other (free text). For minimum-bid related questions additional options were To win the auction, Info not important; conversely, for maximum-bid related questions, the additional option was To prevent selling. **entities to be ranked included: banks, government, insurance companies, telcos, yourself.
… are concerned about mobile PII protection (Q1) but do not tend to read the Terms of Service (Q4) nor are aware of current legislation on data protection (Q5). Moreover, they do not seem to trust how neither application providers (Q2) nor telecom operators (Q3) use their data.
The EoS survey was designed to gather additional quantitative and qualitative information from our participants after the data collection was complete. In particular, we asked participants to put a value (under the same auction game constraints) on category-specific bulk information – i.e. all the data gathered in the study for each category. For instance, in the case of location information, a visualization of a participant’s mobility data collected over the 6-weeks period was shown in the Web questionnaire (as depicted in Figure 1) and the participant was asked to assign it a monetary value. Furthermore, for each category, we asked participants about the minimum/maximum valuations given during the study, in order to understand the reasons why they gave these valuations. Table 3 contains all the questions of the EoS survey.
The EoS questionnaire was administered through a slightly modified version of the same Web application used for the daily surveys. The main difference are the visualizations of the collected data.
Figure 1: Location-specific bulk information question in the EoS survey.
Staiano, J., Oliver, N., Lepri, B., de Oliveira, R., Caraviello, M., & Sebe, N. (2014).
Money walks: a human-centric study on the economics of personal mobile data.
• Which data has more value?
Privacy / Data value
Unique in the shopping mall: On the reidentifiability of credit card metadata
Yves-Alexandre de Montjoye, Laura Radaelli, Vivek Kumar Singh, Alex “Sandy” Pentland, Science 2015
survey shows that financial and credit card data sets are considered the most sensitive
personal data worldwide (25). Among Americans, 87% consider credit card data as
moderately or extremely private, whereas only 68% consider health and genetic information
private, and 62% consider location data private. At the same time, financial data sets have
been used extensively for credit scoring (26), fraud detection (27), and understanding the
predictability of shopping patterns (28). Financial metadata have great potential, but they are
also personal and highly sensitive. There are obvious benefits to having metadata data sets
broadly available, but this first requires a solid understanding of their privacy.
To provide a quantitative assessment of the likelihood of identification from financial data, we
used a data set D of 3 months of credit card transactions for 1.1 million users in 10,000 shops
in an Organisation for Economic Co-operation and Development country (Fig. 1). The data set
was simply anonymized, which means that it did not contain any names, account numbers, or
obvious identifiers. Each transaction was time-stamped with a resolution of 1 day and
associated with one shop. Shops are distributed throughout the country, and the number of
shops in a district scales with population density ($r^2$ = 0.51, P < 0.001) (fig. S1).
Fig. 1 Financial traces in a simply anonymized data set such as the one we use for this work.
Arrows represent the temporal sequence of transactions for user 7abc1a23 and the prices are
grouped in bins of increasing size (29).
We quantified the risk of reidentification of D by means of unicity $\varepsilon$ (19). Unicity is the risk of
reidentification knowing p pieces of outside information about a user (29). We evaluate $\varepsilon_p$ of D
as the percentage of its users who are reidentified with p randomly selected points from their
financial trace. For each user, we extracted the subset $S(I_p)$ of traces that match the p known
points ($I_p$). A user was considered reidentified in this correlation attack if $|S(I_p)| = 1$.
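The unicity estimate described in this paragraph can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the function name `unicity`, the representation of a trace as a set of `(shop, day)` points, the fixed random seed, and the decision to skip users with fewer than p points are all assumptions of the sketch, and the traces are invented.

```python
import random


def unicity(traces, p, seed=0):
    """Estimate the fraction of users reidentified from p known points.

    traces: dict mapping user id -> set of (shop, day) points
    (hypothetical layout). Users with fewer than p points are skipped,
    which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    eligible = [user for user, points in traces.items() if len(points) >= p]
    if not eligible:
        return 0.0
    reidentified = 0
    for user in eligible:
        # I_p: p points drawn at random from this user's trace.
        known = set(rng.sample(sorted(traces[user]), p))
        # S(I_p): every trace in the data set that contains all known points.
        matching = [u for u, points in traces.items() if known <= points]
        if len(matching) == 1:  # |S(I_p)| = 1, so the user is reidentified
            reidentified += 1
    return reidentified / len(eligible)


# Invented traces: u1 and u2 are identical, u3 is fully distinct,
# u4 shares one point with u1 and u2.
toy_traces = {
    "u1": {("shop_1", 1), ("shop_2", 2)},
    "u2": {("shop_1", 1), ("shop_2", 2)},
    "u3": {("shop_3", 1), ("shop_4", 5)},
    "u4": {("shop_1", 1), ("shop_5", 3)},
}
unicity_1 = unicity(toy_traces, p=1)
unicity_2 = unicity(toy_traces, p=2)
print(unicity_1, unicity_2)
```

In the toy data, two known points always separate u3 and u4 from everyone else, while u1 and u2 can never be told apart, so the estimate with p = 2 is one half regardless of the random draw.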
Figure 2 shows that the unicity of financial traces is high ($\varepsilon_4 > 0.9$, green bars). T …
that knowing four random spatiotemporal points or tuples is enough to uniquely reid …
of the individuals and to uncover all of their records. Simply anonymized large-sca …
metadata can be easily reidentified via spatiotemporal information.
Fig. 2 The unicity $\varepsilon$ of the credit card data set given p points.
The green bars represent unicity when spatiotemporal tuples are known. This show …
spatiotemporal points taken at random (p = 4) are enough to uniquely character …
individuals. The blue bars represent unicity when using spatial-temporal-price t …
0.50) and show that adding the approximate price of a transaction significantly inc …
likelihood of reidentification. Error bars denote the 95% confidence interval on the m …
Furthermore, financial traces contain one additional column that can be used to re …
individual: the price of a transaction. A piece of outside information, a spatiotem …
can become a triple: space, time, and the approximate price of the transaction. Th …
contains the exact price of each transaction, but we assume that we only o …
References
• Reviews about applications of mobile phone data
• Blondel, V. D., Decuyper, A., & Krings, G. (2015). A survey of results on mobile phone datasets analysis. EPJ Data Science, 4(1), 10. http://doi.org/10.1140/epjds/s13688-015-0046-0
• Mobile Phone Network Data for Development. (2013). UN Global Pulse
• Saramaki, J., & Moro, E. (2015). From seconds to months: an overview of multi-scale dynamics of mobile telephone calls. The European Physical Journal B, 88(6). http://doi.org/10.1140/epjb/e2015-60106-6
• Naboulsi, D., Fiore, M., Ribot, S., & Stanica, R. (n.d.). Large-scale Mobile Traffic Analysis: a Survey. IEEE Communications Surveys & Tutorials, 1–1. http://doi.org/10.1109/COMST.2015.2491361
• Netmob book of abstracts:
• Oral: http://netmob.org/assets/img/netmob15_book_of_abstracts_oral.pdf
• Posters: http://netmob.org/assets/img/netmob15_book_of_abstracts_posters.pdf
References
• Books about applications of Network Analysis to industry
• Cross, R., Thomas, R. J., Singer, J., Colella, S., and Silverstone, Y. 2010. The Organizational Network Fieldbook. Jossey-Bass, San Francisco, California.
• Van den Bulte, C., and Wuyts, S. 2007. Social networks and marketing. Marketing Science Institute.
• Pinheiro, C. A. R. 2011. Social network analysis in telecommunications. John Wiley & Sons.
• Baesens, B., Van Vlasselaer, V., and Verbeke, W. 2015. Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection. Wiley.
• Articles about applications of Network Analysis to Industry
• [Credit Scoring] San Pedro, Jose, Davide Proserpio, and Nuria Oliver. "MobiScore:
towards universal credit scoring from mobile phone data." User Modeling, Adaptation and
Personalization. Springer International Publishing, 2015. 195-207.
References
• Some articles about applications of Network Analysis to Industry/Government
• [Poverty] Blumenstock, J., Cadamuro, G., & On, R. (2015). Predicting poverty and wealth from mobile phone metadata. Science, 350(6264), 1073–1076. http://doi.org/10.1126/science.aac4420
• [Census] Deville, P., Linard, C., Martin, S., Gilbert, M., Stevens, F. R., Gaughan, A. E., et al. (2014). Dynamic population mapping using mobile phone data. Proceedings of the National Academy of Sciences, 111(45), 15888–15893. http://doi.org/10.1073/pnas.1408439111
• [GDP] Guidotti, R., Coscia, M., Pedreschi, D., & Pennacchioli, D. (2016). Going Beyond GDP to Nowcast Well-Being Using Retail Market Data. NetSci-X, 9564(Chapter 3), 29–42. doi:10.1007/978-3-319-28361-6_3
• [Energy] Bogomolov, A., Lepri, B., Larcher, R., Antonelli, F., Pianesi, F., & Pentland, A. (2016). Energy consumption prediction using people dynamics derived from cellular network data. EPJ Data Science, 5(1), 1. doi:10.1140/epjds/s13688-016-0075-3
References
• Some datasets:
• Reality mining dataset: http://realitycommons.media.mit.edu/realitymining.html
• Mobile Data Challenge 2012 (2012), http://research.nokia.com/page/12000
• Data for Development Challenge (2014), http://d4d.orange.com
• Big Data Challenge 2014 (2014), telecomitalia.com/tit/en/bigdatachallenge.html
• OpenBigData (2015), http://theodi.fbk.eu/openbigdata/