Social and economical networks from (big-)data - Esteban Moro
1. Social and economical networks from
(big-)data
Esteban Moro
@estebanmoro
Master City Science, April 2016
2. @estebanmoro
Summary
1. Intro to Social/Geo Big Data
2. Sources of Social/Geo Big Data
3. Tools for Social/Geo Big Data
4. Applications of Big Data in Social and
Economical problems
5. Outlook
14. @estebanmoro
McKinsey Global Institute Big Data Report 2011
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
18. @estebanmoro
> Big Data, Better answers
Improve our understanding of well-known problems
Different geo/temporal scales: real time (nowcasting/
forecasting), small areas
> Big data, Big new questions
Unknown/unsolved problems
21. @estebanmoro
Frequency
Semantics
• Social networks:
• Twitter, Facebook,
Foursquare, etc.
• Google:
• Points of interest,
searches, etc.
• Financial data
• Transfers
• Credit card transactions
• Mobile phone:
• CDRs (calls/SMS),
network events, etc.
• Phone sensors
Geo and Social Data Sources
29. @estebanmoro
How much does
BigData cost?
Sources of data
• Free APIs (http://dev.twitter.com)
• Data vendors
• GNIP
• Datasift
• Data cost is a function of volume and
query complexity.
• Volume: 10k tweets = $1
• Complexity: 1 unit = $0.20
• Typical queries (a word/hashtag) over a
week cost hundreds of dollars
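As a rough illustration of the pricing above, the sketch below assumes that the volume and complexity charges simply add up. That combination rule, the function name `estimate_cost`, and the example query are assumptions made for illustration, not vendor pricing.

```python
# Rates quoted on the slide; how vendors combine them is an assumption here.
PRICE_PER_10K_TWEETS = 1.00        # dollars per 10,000 tweets
PRICE_PER_COMPLEXITY_UNIT = 0.20   # dollars per unit of query complexity


def estimate_cost(n_tweets, complexity_units):
    """Estimate a query's cost in dollars, assuming the two charges are additive."""
    volume_cost = n_tweets / 10_000 * PRICE_PER_10K_TWEETS
    complexity_cost = complexity_units * PRICE_PER_COMPLEXITY_UNIT
    return volume_cost + complexity_cost


# Hypothetical query: one hashtag matching a million tweets over a week.
weekly_cost = estimate_cost(n_tweets=1_000_000, complexity_units=5)
print(f"Estimated cost: ${weekly_cost:.2f}")
```

Under these assumptions the volume term dominates, which is consistent with the slide's order of magnitude for a week-long query.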
30. @estebanmoro
Other sources of BigData
http://insights.wired.com/profiles/blogs/monetizing-data-milking-the-new-cash-cow
Data monetization
31. @estebanmoro
Other sources of BigData
Data monetization
https://www.commerce360.es http://dynamicinsights.telefonica.com
34. @estebanmoro
Other sources of BigData
Other sources of data (pictures, Flickr)
https://www.flickr.com/photos/walkingsf/sets/72157627140310742/with/5925795351/
35. @estebanmoro
Other sources of BigData
Other sources of data (pictures, NASA)
http://www.citylab.com/tech/2014/05/the-economic-data-hidden-in-satellite-views-of-city-lights/371660/
37. @estebanmoro
What can we do with social/geo bigdata
• Basically:
• a) Build models of user behavior:
• Geo-social activity
• Geo-individual recommendation
• Geomarketing
• Fraud detection
• Dynamic insurance pricing
• b) Build models of area activity:
• Optimal distribution of resources (retail,
banks)
• Event detection
• Measure fluxes between areas (traffic,
transport, health, etc.)
• Macroeconomic indices of areas
38. @estebanmoro
IV Conference on the scientific analysis
of mobile phone datasets (2015)
250 participants :: 140 institutions, 32 countries, 5 continents
39. @estebanmoro
2015 Crowds:
Real time event detection in cities
Estimating attendance of events
Cities:
Energy consumption
Predicting crime hotspots
Health catchment areas
Census estimation
Economies:
Loan Repayment
Food consumption and
poverty indices
Microcredit approval
Labor market
Societies:
Spread of diseases
Social influence
Privacy
Product adoption
Marketing
Mobility:
Mobility prediction
Impact of Sharing Economy
Optimization of public
transportation
Mobility
Content
Activity
Social
40. @estebanmoro
DIGITAL IN SPAIN
A snapshot of the country’s key digital statistical indicators, January 2015 (We Are Social, @wearesocialsg)
• Total population: 46.5 million (urbanisation: 77%). Figure represents total national population, including children
• Active internet users: 35.7 million (penetration: 77%). Figure includes access via fixed and mobile connections
• Active social media accounts: 22.0 million (penetration: 47%). Figure represents active user accounts, not unique users
• Mobile connections: 50.3 million (108% vs. population). Figure represents mobile subscriptions, not unique users
• Active mobile social accounts: 17.8 million (penetration: 38%). Figure represents active user accounts, not unique users
• Sources: Wikipedia; InternetLiveStats, InternetWorldStats; Facebook, Tencent, VKontakte, LiveInternet; GSMA Intelligence
41. @estebanmoro
TOP ACTIVE SOCIAL PLATFORMS, January 2015 (We Are Social, @wearesocialsg)
Survey-based data: figures represent users’ own claimed / reported activity
Legend: social network; messenger / chat app / VoIP
• WhatsApp: 42%
• Facebook: 33%
• Facebook Messenger: 20%
• Twitter: 17%
• Skype: 12%
• Google+: 11%
• Instagram: 10%
• Shazam: 9%
• LinkedIn: 9%
• Pinterest: 7%
• Source: GlobalWebIndex, Q4 2014. Figures represent percentage of the total national population using the platform in the past month.
42. @estebanmoro
TIME SPENT WITH MEDIA, January 2015 (We Are Social, @wearesocialsg)
Survey-based data: figures represent users’ own claimed / reported activity
• Average daily use of the internet via a PC or tablet (internet users): 3h 58m
• Average daily use of the internet via a mobile phone (mobile internet users): 1h 51m
• Average daily use of social media via any device (social media users): 1h 54m
• Average daily television viewing time (internet users who watch TV): 2h 31m
• Source: GlobalWebIndex, Q4 2014. Based on a survey of internet users aged 16-64.
Note that average times are based solely on people who use each medium and do not factor in non-users
43. @estebanmoro
Opinion:
Political opinion
Product/Brand opinion
Cities:
Tourism activity
Event detection
Economies:
Unemployment
Microcredit approval
Human resources
Social:
Influencer detection
Community analysis
Social mobilization
Mobility:
Tourism in cities
World-wide transport
Mobility
Content
Activity
Social
57. @estebanmoro
Health
Correlation between content in social networks and symptoms
[Figure: incidence (per 100k users) of mentions of flu, allergy, headache and fever, plotted against weeks since Jan 2012; legend: incidence high / medium / low]
58. @estebanmoro
Health
Correlation between content in social networks and symptoms
group within the correlation matrix. There is a second group with a peak
seasonality during the hottest months of the year; this group is mostly formed by
pollens and O3. Finally, there is a third cluster whose variables correlate
very strongly with each other, while their correlation with the rest of the time
series is zero or very small.
Figure 4. Geo Spatial representation of the logarithmic transformation of total mentions of health related time
59. @estebanmoro
Health
Correlation between content in social networks and symptoms
[Figure: incidence (per 100,000 users) by day of the week, Monday to Sunday, for headache and backache]
60. @estebanmoro
Political opinion
• Example: identifying party supporters during political campaigns
Catalan elections 2010
[Figure: two-dimensional plot with labels for the parties PSC, CiU, ERC, PPC, ICV, C's, SOL, PPT, PACMA and CORI]
62. @estebanmoro
References
• Reviews on mobile phone applications
• Blondel, V. D., Decuyper, A., & Krings, G. (2015). A survey of results on mobile phone datasets analysis. EPJ Data Science, 4(1), 10. http://doi.org/10.1140/epjds/s13688-015-0046-0
• Mobile Phone Network Data for Development. (2013). UN Global Pulse
• Saramaki, J., & Moro, E. (2015). From seconds to months: an overview of multi-scale dynamics of mobile telephone calls. The European Physical Journal B, 88(6). http://doi.org/10.1140/epjb/e2015-60106-6
• Naboulsi, D., Fiore, M., Ribot, S., & Stanica, R. (n.d.). Large-scale Mobile Traffic Analysis: a Survey. IEEE Communications Surveys & Tutorials, 1–1. http://doi.org/10.1109/COMST.2015.2491361
• Conferences
• NetMob http://netmob.org
• NetSci http://netsci2016.net
63. @estebanmoro
References
• Mining the Social web, O’Reilly
http://shop.oreilly.com/product/0636920030195.do
• Applications
• Pinheiro, C. A. R. 2011. Social network analysis in telecommunications. John
Wiley & Sons.
• Morselli, C., ed. 2013. Crime and Networks. Routledge.
65. @estebanmoro
• There are many frameworks to study social networks
• In general we have:
• Analysis platforms: they implement most of the
algorithms for graph analysis:
• Local metrics (degree, clustering)
• Centrality metrics (betweenness, closeness, etc.)
• Community finding algorithms
• Visualization libraries
• Display graphs in different forms (layout, colors,
etc.)
• Graph databases: allow the storage (distributed),
queries and some type of analysis for (big) graph
data.
Libraries
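To make the local and centrality metrics listed above concrete, here is a minimal pure-Python sketch of degree, local clustering and closeness on a toy graph. The graph and the function names are hypothetical; the libraries covered in these slides ship their own optimized implementations, which may differ in normalization.

```python
from collections import deque

# Hypothetical toy undirected graph stored as an adjacency dictionary.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}


def degree(g, node):
    """Number of neighbors of a node."""
    return len(g[node])


def clustering(g, node):
    """Fraction of pairs of neighbors that are themselves connected."""
    nbrs = list(g[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1 for i in range(k) for j in range(i + 1, k) if nbrs[j] in g[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))


def closeness(g, node):
    """Inverse of the average shortest-path distance to the reachable nodes."""
    dist = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search from the node
        current = queue.popleft()
        for nxt in g[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0


for name in sorted(graph):
    print(name, degree(graph, name), clustering(graph, name), closeness(graph, name))
```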
67. @estebanmoro
• Network data can be stored in many databases
• However, in recent years interest in graph databases has grown steadily
Graph databases
http://db-engines.com/en/ranking_categories
68. @estebanmoro
• They are databases that use graph structures
for queries. Data is represented using
nodes, edges and their properties
• Each node knows its neighbors
• They make graph queries very easy
to implement, such as:
• Find the neighbors of a node
• Find the path between two nodes
• In a typical relational database, those queries require several “joins”
Graph databases
http://neo4j.com/developer/graph-db-vs-rdbms/
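The contrast between the two storage models can be sketched in plain Python. The toy friendship graph, the table name `edge` and the helper names below are hypothetical; the adjacency dictionary only mimics the idea that each node knows its neighbors, and SQLite stands in for a generic relational database.

```python
import sqlite3
from collections import deque

# Hypothetical toy friendship graph (not from the slides).
edges = [("ana", "bob"), ("bob", "carl"), ("carl", "dana")]

# Graph-style storage: each node keeps a direct reference to its neighbors.
adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)


def neighbors(node):
    """Neighbors are a single lookup away."""
    return sorted(adjacency.get(node, ()))


def shortest_path(source, target):
    """Breadth-first search; returns a list of nodes, or None if unreachable."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            path = []
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]
        for nxt in adjacency.get(current, ()):
            if nxt not in parents:
                parents[nxt] = current
                queue.append(nxt)
    return None


# Relational storage: every extra hop needs another self-join on the edge table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edge VALUES (?, ?)", edges + [(b, a) for a, b in edges]
)
two_hops = conn.execute(
    "SELECT DISTINCT e2.dst FROM edge e1 JOIN edge e2 ON e1.dst = e2.src "
    "WHERE e1.src = ? AND e2.dst != ?",
    ("ana", "ana"),
).fetchall()

print(neighbors("bob"), shortest_path("ana", "dana"), two_hops)
```

A path of unknown length would need one join per hop (or a recursive query) in SQL, whereas the traversal code stays the same for any length.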
69. @estebanmoro
• Some examples
• Neo4j (commercial/open-source): probably
the most widely used. It has its own query
language (Cypher). It can be accessed from
many other languages (R, Python, Java)
http://neo4j.com
• Sparksee (commercial): built for high
performance and scalability. http://sparsity-technologies.com
• Titan (Apache): distributed graph database,
built to store and query graphs with billions of
nodes and edges. http://thinkaurelius.github.io/titan/
Graph databases
70. @estebanmoro
• It can be accessed through a REST API (HTTP)
• It has its own query language: Cypher
Neo4J
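A minimal sketch of sending Cypher over HTTP, using only the Python standard library. The server address, the legacy `/db/data/transaction/commit` endpoint, the `name` property and the `$name` parameter syntax are assumptions that depend on the Neo4j version (older releases write parameters as `{name}`, newer ones expose a different endpoint path); authentication is omitted. The request is built but not sent.

```python
import json
import urllib.request

# Assumed local server and legacy transactional endpoint; adjust for your version.
NEO4J_URL = "http://localhost:7474/db/data/transaction/commit"


def build_cypher_request(statement, parameters=None, url=NEO4J_URL):
    """Wrap a Cypher statement in a JSON POST request (the request is not sent)."""
    payload = {
        "statements": [{"statement": statement, "parameters": parameters or {}}]
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


# Cypher version of "find the neighbors of a node"; the `name` property is hypothetical.
request = build_cypher_request(
    "MATCH (n {name: $name})--(neighbor) RETURN neighbor",
    {"name": "Alice"},
)
print(request.get_method(), request.full_url)
# To run it against a live server: urllib.request.urlopen(request)
```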
72. @estebanmoro
1. Download the full Panama papers database in Neo4J format: https://offshoreleaks.icij.org/pages/database
2. Count the number of nodes and the number of relationships
Application: Panama papers
73. @estebanmoro
1. Show the relationships of the President of Azerbaijan (Ilham Aliyev) and his children: https://panamapapers.icij.org/20160404-azerbaijan-hidden-wealth.html
2. Search for all the officers named “Aliyev”
Application: Panama papers
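The Panama papers exercises (counting nodes and relationships, searching officers by name) could be written as the Cypher statements below, kept as Python strings so they can be passed to any Neo4j client. The `Officer` label, the `name` property and the `$surname` parameter syntax are assumptions about the downloaded database and the Neo4j version; check them against the actual schema before relying on the results.

```python
# Cypher statements stored as plain strings; nothing is executed here.
COUNT_NODES = "MATCH (n) RETURN count(n) AS nodes"
COUNT_RELATIONSHIPS = "MATCH ()-[r]->() RETURN count(r) AS relationships"

# Assumes officers carry an `Officer` label and a `name` property.
FIND_OFFICERS = (
    "MATCH (o:Officer) WHERE o.name CONTAINS $surname "
    "RETURN o.name AS name"
)

# Everything one hop away from the matched officers, with the relationship type.
OFFICER_NEIGHBORHOOD = (
    "MATCH (o:Officer)-[r]-(other) WHERE o.name CONTAINS $surname "
    "RETURN o.name AS officer, type(r) AS relationship, other.name AS other"
)


def statements_for(surname):
    """Pair each statement with its parameters, ready for a client's run call."""
    params = {"surname": surname}
    return [
        (COUNT_NODES, {}),
        (COUNT_RELATIONSHIPS, {}),
        (FIND_OFFICERS, params),
        (OFFICER_NEIGHBORHOOD, params),
    ]


for statement, params in statements_for("Aliyev"):
    print(statement, params)
```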
75. @estebanmoro
• Available in many programming languages
• Boost Graph Library (BGL) is
probably the oldest and best known. Written
in C++ and optimized to be general,
fast and efficient.
• SNAP (Stanford Network Analysis Platform),
written in C++ and optimized for
massive graphs. (Jure Leskovec)
• NetworkX (Python): library for
studying graphs and networks.
Reasonably efficient for large networks
and their visualization
https://networkx.github.io
Analysis Libraries
76. @estebanmoro
• graph-tool (Python): module for manipulation and
statistical analysis of graphs. Relies heavily on BGL
to achieve comparable performance. (Tiago P. Peixoto)
https://graph-tool.skewed.de
• igraph (Python, C and R): library written in C and also
available as Python and R packages. It implements
most algorithms. http://igraph.org
• networkDynamic (R): to analyze temporal
networks
Analysis libraries
77. @estebanmoro
• Other platforms for the analysis of massive graphs
(distributed)
• Giraph (Apache): graph processing with high
scalability. Used by Facebook, compatible with
Hadoop. http://giraph.apache.org
• Pregel (Commercial): Google’s graph platform
• GraphLab (Commercial): graph-based, high
performance, distributed computational
framework (including Machine Learning Toolkits)
https://graphlab.org
• GraphX (Apache): distributed graph processing
framework on top of Apache Spark. Has many
powerful algorithms for graph analysis.
http://spark.apache.org/graphx/
Analysis libraries
78. @estebanmoro
• Most of the analysis libraries contain visualization tools or
modules to visualize graphs.
• Apart from those, there are other tools specialized in the
visualization of graphs
• Gephi is probably the best known: it is an
interactive visualization program (includes some
analysis metrics). Works on Windows, Linux and
Mac OS X. It is the “Photoshop” for graphs ☺ http://gephi.org
• Pajek is a Windows program to visualize and
analyze big graphs.
http://vlado.fmf.uni-lj.si/pub/networks/pajek/
• Linkurious: graph visualization on top of Neo4j
http://linkurio.us
Visualization libraries
79. @estebanmoro
• Graphviz: open-source library to visualize graph data http://www.graphviz.org
• Sigma.js is a JavaScript library to visualize graphs on the web.
http://sigmajs.org
• Vis.js is a general JavaScript visualization library that also has tools
to visualize graphs. http://visjs.org/
• lightning-viz.org provides API-based access to reproducible
web visualizations
• D3.js also has some graph visualization tools. Examples:
• http://christophergandrud.github.io/networkD3/
• http://bl.ocks.org/mbostock/4062045
• https://flowingdata.com/2012/08/02/how-to-make-an-interactive-network-visualization/
Visualization libraries
80. @estebanmoro
• Lets you modify and
customize the visualization of
graphs interactively
• It has many layout algorithms
• Contains some graph
metrics:
• Centrality
• PageRank
• Connected components
• Etc.
• Imports and exports
graphs in many different
formats.
Gephi
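Two of the metrics listed for Gephi, connected components and PageRank, can be sketched in a few lines of pure Python. This is an illustrative version on a hypothetical toy graph, not Gephi's implementation; the damping factor of 0.85 and the fixed number of iterations are conventional choices assumed here.

```python
def connected_components(g):
    """Group the nodes of an undirected graph into components (depth-first search)."""
    seen, components = set(), []
    for start in g:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(g[node] - component)
        seen |= component
        components.append(component)
    return components


def pagerank(g, damping=0.85, iterations=100):
    """Power iteration on an undirected graph; isolated nodes spread rank uniformly."""
    n = len(g)
    rank = {node: 1.0 / n for node in g}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in g if not g[node])
        new_rank = {}
        for node in g:
            incoming = sum(rank[nbr] / len(g[nbr]) for nbr in g[node])
            new_rank[node] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


# Hypothetical toy graph: a star around "hub" plus a separate pair of nodes.
toy = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
    "x": {"y"},
    "y": {"x"},
}
components = connected_components(toy)
ranks = pagerank(toy)
print(len(components), max(ranks, key=ranks.get))
```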
81. @estebanmoro
• About graph databases
• Wikipedia: http://en.wikipedia.org/wiki/Graph_database
• Book: Graph Databases (O’Reilly) http://graphdatabases.com
• Graph database ranking: http://db-engines.com/en/ranking/graph+dbms
• About Neo4J
• Learning Neo4j (book): http://neo4j.com/book-learning-neo4j/
• GraphAcademy (by Neo4j): http://neo4j.com/graphacademy/ has some online courses
• About analysis libraries
• igraph: Statistical Analysis of Network Data with R (book) http://www.amazon.com/Statistical-Analysis-Network-Data-Use/dp/1493909827/
• GraphX: A gentle introduction to GraphX in Spark http://www.sparktutorials.net/analyzing-flight-data:-a-gentle-introduction-to-graphx-in-spark
Some references
82. @estebanmoro
• About visualization
• Gephi:
• Learn how to use Gephi https://gephi.org/users/
• Introduction to Network Analysis and Visualization
http://www.martingrandjean.ch/gephi-introduction/
Some references
90. @estebanmoro
References
• Online material
▪ The igraph book (incomplete)
▪ igraph wikidot
▪ Simple manual in Spanish
• Books
▪ Statistical Analysis of Network Data with R
106. @estebanmoro
networkDynamic
References
About temporal networks
▪ Holme, P., & Saramaki, J. (2012). Temporal networks. Physics Reports, 519(3), 97–125.
▪ Saramaki, J., & Moro, E. (2015). From seconds to months: an overview of multi-scale dynamics of mobile telephone calls. The European Physical Journal B, 88(6). http://doi.org/10.1140/epjb/e2015-60106-6
About the networkDynamic, tsna and ndtv packages
▪ Package examples for networkDynamic https://cran.r-project.org/web/packages/networkDynamic/vignettes/networkDynamic.pdf
▪ Package vignette for ndtv https://cran.r-project.org/web/packages/ndtv/vignettes/ndtv.pdf
▪ Package vignette for tsna https://cran.r-project.org/web/packages/tsna/vignettes/tsna_vignette.html
Tutorials
▪ Temporal network tools in statnet: networkDynamic, ndtv and tsna