We live in a world of data, of big data. A big portion of this data has been generated by humans, and particularly through their mobile phones. In fact, there are almost as many mobile phones in the world as humans. The mobile phone is the piece of technology with the highest levels of adoption in human history. We carry them with us all through the day (and night, in many cases), leaving digital traces of our physical interactions. Mobile phones have become sensors of human activity in the large scale and also the most personal devices.
In my talk, I will present some of the work that we are doing at Telefonica Research in the area of human behavior understanding from data captured with mobile phones, and particularly our work in the area of Big Data for Social Good. I will highlight opportunities but also challenges that we would need to address in order to truly leverage this opportunity.
Nuria Oliver is a computer scientist and Scientific Director at Telefónica. She holds a Ph.D. from the Media Lab at MIT. She is one of the most cited female computer scientist in Spain, with her research having been cited by more than 8900 publications. She is well known for her work in computational models of human behavior, human computer-interaction, intelligent user interfaces, mobile computing and big data for social good.
Search and Society: Reimagining Information Access for Radical Futures
The emergent opportunity of Big Data for Social Good - Nuria Oliver @ PAPIs Connect
1. Big Data for
Social Good
Nuria Oliver, PhD
Scientific Director
Telefonica R&D
2. 7+ billion mobile phones worldwide
97% of world’s population (ITU)
Mobile penetration of 120% to 89% of population (ITU)
Emerging and developed regions
More time spent on our phones than watching TV or with our with
our partner (US and UK)
3.
4.
5. Mobile Network Infrastructure
● A cell tower (also known as BTS or base station) provides an
approximate location
● Spacing between towers is 500m-1km in urban areas, 2-3 km in
rural areas
● Each tower has a “Voronoi cell”; this is the area to which the
tower provides best coverage i.e. tower is closest
6.
7. Typical Mobile Data
• CDR
• SMS
Consumption Social Network Mobility
Call duration In/Out Degree Radius of gyration
N. Events Delta w.r.t time window Travelled distance
Lapse between events Unique Calls per day
Rate of popular
antennas
Reciprocated events Unique SMS per day
Regularity of popular
antennas
… … …
HR_ORG TLFN_A TLFN_B CD_GEO_A CD_GEO_B DT_ORG CD_SNTD CD_ERB CD_CCC QT_DUR
20:05:31 XXX YYY 3 11 20140519 2 1562 568 33
… … … … … … … … … …
HR_ORG TLFN_A TLFN_B CD_GEO_A CD_GEO_B DT_ORG CD_SNTD QT_TRFG
15:53:54 XXX ZZZ 3 25 20140506 2 1
… … … … … … … …
11. Big Data for Social Good @ Telefonica
Crime Prediction
Analysis of impact of floods
http://www.wired.co.uk/news/archive/2013-10/17/nuria-oliver
70% accuracy
13. Flooding Through the Lens of Mobile
Phone Activity
David Pastor-Escuredo1, Alfredo Morales-Guzmán1, Yolanda
Torres-Fernández1, Jean-Martin Bauer2, Amit Wadhwa2, Carlos
Castro-Correa3, Liudmyla Romanoff4, Jong Gun Lee4, Alex
Rutherford4, Vanessa Frias-Martinez5, Nuria Oliver6, Enrique Frias-
Martinez6, Miguel Luengo-Oroz4
1. Universidad Politécnica de Madrid
2. Vulnerability Analysis and Mapping, World Food Programme
3. Coordinación de Estrategia Digital Nacional, Presidencia de la República de México
4. United Nations Global Pulse
5. University Maryland
6. Telefónica Research
14. UN Global Pulse is an innovation initiative of the United Nations Secretary-
General, driving a big data revolution for development.
Its vision is to see big data used safely and responsibly for public good; its
mission to accelerate discovery, development and adoption of big data
and real-time analytics for sustainable development and humanitarian
action.
15. Tabasco Floods
• Tabasco state frequently suffers flooding
• 800mm of rain 31st October – 3rd November 2009 (4x
normal)
• 214,000 people affected
• State of emergency declared
16. Reconstructing the Floods
• We combined mobile data with
NASA LANDSAT satellite imagery
data, governmental surveys and
census data
• NASA LANDSAT geolocation of
the most affected areas
• Census data representativeness
of mobile data
• Precipitations data and civil
protection reports timeline of
people’s awareness and response
17. CDR Representativeness
● Mobile user
representativeness
validated against census
data
● Population estimates
based on CDRs are strongly
correlated with official
Census data and
population statistics
● Mobile data proxy of
population distribution in
areas where other data
sources are not available
or reliable
R2=0.97
18. Event Detection
x(t) is raw number of unique phones
placing or receiving calls in each
antennae per day
● Quantify impact of floods by
comparing behavior of BTS against
baseline activity using z-score like
metric
● BTS with higher variations in the
number of calls made during the
floods were located in the most
affected regions
● Special events (e.g. Christmas) led
to similar changes in behavior wrt
baseline, but across the entire
region as opposed to in specific
locations
19.
20. Actionable Insights
● Patterns in BTS activity can be
used to measure the impact of
floodings
● Behavioral insights are important
for emergency services
● People communicated more as a
result of the initial impacts of the
floodings, rather than following the
recommendations of public
protection
● Civil protection warnings did not
seem to be an effective way to
raise awareness
● Spikes in activity are observed only
after the situation was critical
21. Big Data for Social Good @ Telefonica
Crime Prediction
Analysis of impact of floods
http://www.wired.co.uk/news/archive/2013-10/17/nuria-oliver
70% accuracy
27. Data Analysis
• Call Records from 1st Jan till 31st May 2009
• Compute mobility as different number of BTS visited
• Stages
• Medical Alert - Stage 1 (17th-27th April)
• Closing Schools - Stage 2 (28th-1st May)
• Suspension of Essential Activities - Stage 3 (1st May-6th May)
• Baselines
• same periods, different year (2008)
32. Big Data for Social Good @ Telefonica
Crime Prediction
Analysis of impact of floods
http://www.wired.co.uk/news/archive/2013-10/17/nuria-oliver
70% accuracy
34. Affects quality of life and economic development both
at the national and local level
Several studies explore relationships between crime
and socio-economic variables: education, income,
unemployment, ethnicity, …
Several studies have shown significant concentrations
of crime in small geographical areas: crime hotspots
Crime
35. T1: Natural surveillance as key deterrent for crime:
people moving around are eyes on the street
(Jacobs, 1961)
high diversity among the population and high
number of visitors -> less crime
T2: Defensible space theory (Newman, 1972)
high mix of people -> more crime
Crime and Urban Environment
36. People-centric perspective vs Place-
centric perspective
people-centric perspective used for
individual or collective criminal profiling
place-centric perspective used for
predicting crime hotspots
Crime Prediction
37. Data-driven and place-centric approach to
crime prediction
Multimodal approach: people dynamics
derived from mobile network data and
demographics
European metropolis: London
Prediction of crime hotspots and not criminals
profiling
Our Approach
38. Smartsteps Dataset:
for each of the Smartsteps cells a variety of demographic and
human dynamics variables were computed every hour for 3
weeks (from December 9 to December 15, 2012 and from
December 23, 2012 to January 5, 2013)
Criminal Cases Dataset:
criminal cases for December 2012 and for January 2013
London Borough Profiles Dataset:
open dataset containing 68 metrics about the population of a
particular geographic area
Multimodal Approach: Data
39. • Footfall count: Shows the trend in footfall in a
specified area hourly, daily, weekly and monthly.
Provides a basic profile of the crowd.
• Catchment area: Shows which postal sectors are
your customers coming from by hour, day, week
and month. Shows the “battleground” for two sites.
• Transport mode: Shows flows of crowds from any
two points, segmented by road, air, train, etc.
SmartSteps
40. For each cell and for each hour the dataset contains:
an estimation of how many people are in the cell
the percentage of these people at home, at work or
just visiting the cell
the gender splits (male vs. female)
the age splits (0-20 years, 21-30 years, 31-40 years, …)
SmartSteps Data
41. Crime geolocation for 2 months (December 2012 –
January 2013)
All reported crimes in UK specifying month and year and
not specific day/time
Median crime value (=5) used as threshold
Spatial granularity of borough profiles is at LSOA levels:
LSOA are small geographical areas defined by UK Office
for National Statistics (mean population: 1500)
Crime Data
42. 68 metrics about the population of a specific geographical area:
demographics, households, migrant population, employment,
earnings, life expectancy, happiness levels, house prices, etc.
Spatial granularity of borough profiles is at LSOA levels:
LSOA are small geographical areas defined by UK Office for
National Statistics (mean population: 1500)
London Borough Profiles Data
43. From Smartsteps data we extract
1st order features (mean, median, min., max.,
entropy, etc.)
2nd order features on sliding windows of variable
length (1 hour, 4 hours, 1 day, etc.) to account for
temporal patterns
Feature Extraction
44. Feature Selection
Mean decrease in Gini coefficient of inequality
the feature with maximum mean decrease in Gini
coefficient is expected to have the maximum
influence in minimizing the out-of-the-bag error
The feature selection process produced a
reduced subset of 68 features (from an initial pool
of about 6000 features)
45. Classification Approach
Binary classification task: high crime area vs low crime
area
10-fold cross-validation approach
Classifier: Random Forest (RF)
RF overcomes logistic regression, support vector
machines, neural networks, decision trees
48. Features encoding daily dynamics have more predictive power
than features extracted on a monthly basis
Relevance of high number of residents to predict crime areas
increased ratio of residents -> more crime (in contrast with
Newman’s thesis)
Entropy-based features are useful for predicting the crime
hotspots
high diversity of functions (home vs work) and high diversity of
people (gender and age) act as eyes on street decreasing
crime (in line with Jacobs’ thesis)
Relevant Features
49. Only 6 out of 68 features in the joint model are
London Borough features, namely
%working population claiming out of work benefits
Largest migrant population
% overseas nationals entering the UK
% resident population born abroad
Relevant Features
50. Our method captures the dynamics of a place rather
than making extrapolations from previous crime
histories. We can use it in areas where people are less
inclined to report crimes
Our method provides new ways of describing
geographical areas: novel risk-inducing or risk-reducing
features of geographical areas
Implications
51. Relevant Publications
“A gender-centric analysis of calling behavior in a developing country using
CDRS”, Vanessa Frias-Martinez, Enrique Frias-Martinez and Oliver, N.,
Proceedings of AAAI, 2009
"Prediction of Socioeconomic Levels using Cell Phone Records", Victor Soto
and Vanessa Frias-Martinez and Jesus Virseda and Enrique Frias-Martinez,
International Conference on User Modeling, Adaptation and Personalization,
UMAP'11, Industrial Track, Girona, Spain, 2011
"An Agent-Based Model Of Epidemic Spread Using Human Mobility and Social
Network Information", Vanessa Frias-Martinez et al, 3rd International
Conference on Social Computing, SocialCom '11, Boston, USA, 2011.
Talk at WIRED 2013. London UK
http://www.wired.co.uk/news/archive/2013-10/17/nuria-oliver
52. Relevant Publications
“Moves on the street: Predicting Crime Hotspots using aggregated
anonymized data on people dynamics” - A. Bogomolov, B. Lepri, J. Staiano,
Leouze, E., N. Oliver, F. Pianesi, A. Pentland Journal of Big Data (Big Data
Journal 2015)
“Once Upon a Crime: Towards Crime Prediction from Demographics and
Mobile Data” - A. Bogomolov, B. Lepri, J. Staiano, N. Oliver, F. Pianesi, A.
Pentland 16th ACM International Conference on Multimodal Interaction
(ICMI 2014)
"Flooding through the Lens of Mobile Phone Activity"
Pastor-Escuredo, D., Torres Fernandez, Y., Bauer, J.M., Wadhwa, A., Castro-
Correa, C., Romanoff, L., Lee, J.G., Rutherford, A., Frias-Martinez, V., Oliver,
N., Frias-Martinez, E. and Luengo-Oroz, M. Proceedins of IEEE Global
Humanitarian Technology Conference, GHTC 2014, Silicon Valley, CA, Oct
2014
54. Temporal and Spatial Granularity
• Big Data can be available in real-time or if not in real time much more frequently than how
data is typically collected (every 5-10 years for example for census data);
• Some kinds of Big Data (e.g. data about the city, collected by sensors placed in the urban
infrastructure) can be collected with significantly finer grained spatial granularity than with
traditional methods;
Cost and Effort
Accuracy
• It could be argued that some kinds of data that are relevant for the public sector (e.g.
migrations) can be collected more accurately by automatic means through Big Data
platforms than by manual means as it is the state of the art.
• In addition, given that there isn’t a human-in-the-loop, the data is less prone to human errors
and potential biases introduced by humans.
• Most of the Big Data that could be used for the public sector is data that
has been collected already for other purposes. In addition, Big Data is
typically collected by automatic means which makes its collection very
cost-efficient;
57. Risk/Benefit Analysis
• Even while adopting all sorts of precautions, if by any chance there is an error in the data
extraction such that there could be a privacy leak, the consequences for the business could
be devastating
• Hence, the potential benefit would have to be really large to compensate for the risk.
• As this is an emergent research area, the benefits are still to be defined
Regulatory/Social
• Lack of updated regulation
• Lack of clear guidelines regarding safe data handling, processing and sharing for
humanitarian purposes
• Risk of potential unintended social consequences
• Risk of creating a digital divide, unbalanced access to data and-or expertise on how to
analyze it and make sense of it
Internal Barriers
• Big Data for Development/Social Good projects are typically not part of
any business unit in the Telcos;
58. Technical
• Representativeness of the data, generalization
• Combination of data from multiple sources
• Real-time analysis and prediction
• Lack of ground truth intervention to validate
• Significant vs substantially significant
• Correlations vs causality
Privacy/Security
• Potential privacy risks need to be minimized and understood.
Control and transparency
• Security and traceability of the data
• Clear code of conduct and ethical principles when dealing
with data
• Strict access control when appropriate
59. Framework for Data Sharing
Remote
Access
Question and
Answers
Limited
License
Pre-computed
Indicators &
Synthetic
Data
Security
Secure access to the data
Anonymization
Limited or aggregated data release
Dataprotectedthrough
ApplicativeExploratory
Low to medium
number of users
Medium to large number
of users and open data
Development stage
60. What can we all do to
responsibly turn this
opportunity into a reality ?