Social and economical networks from (big-)data - Esteban Moro


COMPLEX NETWORKS: THEORY, METHODS, AND APPLICATIONS (2ND EDITION)
May 16-20, 2016

Published in: Science

  1. 1. Social and economical networks from (big-)data Esteban Moro @estebanmoro Master City Science, April 2016
  2. 2. @estebanmoro Summary 1. Intro to Social/Geo Big Data 2. Sources of Social/Geo Big Data 3. Tools for Social/Geo Big Data 4. Applications of Big Data in Social and Economic Problems 5. Outlook
  3. 3. @estebanmoro Mobile phone data 1.Intro to Social Geo Big Data
  4. 4. @estebanmoro The data explosion
  5. 5. @estebanmoro
  6. 6. @estebanmoro The three V’s
  7. 7. @estebanmoro 90% of the data today was created in the last 2 years Volume
  8. 8. @estebanmoro Volume http://blogs.msdn.com/b/data__knowledge__intelligence/archive/2013/02/18/big-data-big-deal.aspx
  9. 9. @estebanmoro Velocity
  10. 10. @estebanmoro Variety
  11. 11. @estebanmoro The three layers of resources
  12. 12. @estebanmoro Data is not information, nor is it value. Data → Information → Knowledge → Decision → Action (ML, SNA, NLP)
  13. 13. @estebanmoro Data is not information, nor is it value. Example (NLP, SNA): tweets about an event, brand, or person → linguistic analysis of their content → content classification and alert generation
  14. 14. @estebanmoro McKinsey Global Institute Big Data Report 2011 http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
  15. 15. @estebanmoro
  16. 16. @estebanmoro We are what we repeatedly do Situation Behavior Observation
  17. 17. @estebanmoro > Big data, better answers: improve our understanding of well-known problems at different geographic and temporal scales (real time for nowcasting/forecasting, small areas) > Big data, big new questions: unknown or unsolved problems
  18. 18. @estebanmoro Mobile phone data 2.Sources of social / geo big data
  19. 19. @estebanmoro
  20. 20. @estebanmoro Geo and Social Data Sources (axes: Frequency, Semantics) • Social networks: Twitter, Facebook, Foursquare, etc. • Google: points of interest, searches, etc. • Financial data: transfers, credit card transactions • Mobile phone: CDRs (calls/SMS), network events, etc.; phone sensors
  21. 21. @estebanmoro Maps • Raster images • Googlemaps & OpenStreetMap • Static maps + routes • http://maps.google.com/maps/api/staticmap • http://open.mapquestapi.com/guidance/v1/ • Cartodb • https://cartodb.com
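A static map is requested by encoding the view in the query string of an HTTP GET. The sketch below builds such a URL for the first endpoint on the slide. The parameter names `center`, `zoom` and `size` are the ones I recall from the Google Static Maps API, and the service may additionally require an API key, so treat this as an illustration rather than a working client. The coordinates are an assumed location in central Madrid.

```python
from urllib.parse import urlencode

# Endpoint as listed on the slide.
STATIC_MAP_ENDPOINT = "http://maps.google.com/maps/api/staticmap"


def static_map_url(lat, lon, zoom=12, width=640, height=400):
    """Build the request URL for a raster map centred on (lat, lon)."""
    params = {
        "center": f"{lat},{lon}",      # map centre as "lat,lon"
        "zoom": zoom,                  # higher values zoom in
        "size": f"{width}x{height}",   # image size in pixels
    }
    return f"{STATIC_MAP_ENDPOINT}?{urlencode(params)}"


# Assumed example coordinates (central Madrid).
url = static_map_url(40.4168, -3.7038)
print(url)
```

The URL can then be fetched with any HTTP client and the returned image used as a background layer for plotting points or routes.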
  22. 22. @estebanmoro Social media data sources: Who, What, Where, With whom, When. Mining the Social Web, O’Reilly http://shop.oreilly.com/product/0636920030195.do
  23. 23. @estebanmoro Social media data sources 2M tweets geolocalized in Madrid
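Selecting the geolocated tweets that fall inside a city usually starts with a bounding-box filter. A minimal sketch follows; the `Tweet` record, the sample tweets and the bounding-box values for Madrid are all illustrative assumptions, not the data set shown on the slide.

```python
from typing import NamedTuple


class Tweet(NamedTuple):
    user: str
    lat: float
    lon: float
    text: str


# Rough, assumed bounding box around Madrid: (south, west, north, east).
MADRID_BBOX = (40.30, -3.90, 40.57, -3.50)


def in_bbox(tweet, bbox):
    """True if the tweet's coordinates fall inside the bounding box."""
    south, west, north, east = bbox
    return south <= tweet.lat <= north and west <= tweet.lon <= east


def filter_by_bbox(tweets, bbox):
    """Keep only the tweets located inside the bounding box."""
    return [t for t in tweets if in_bbox(t, bbox)]


# Hypothetical sample records.
sample = [
    Tweet("u1", 40.4168, -3.7038, "inside the box"),
    Tweet("u2", 41.3874, 2.1686, "outside the box"),
]
madrid_tweets = filter_by_bbox(sample, MADRID_BBOX)
print([t.user for t in madrid_tweets])
```

A real city boundary is a polygon, so a bounding box is only a first, cheap cut before a point-in-polygon test.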
  24. 24. @estebanmoro Social media data sources: Where, Who, How, What. Mining the Social Web, O’Reilly http://shop.oreilly.com/product/0636920030195.do
  25. 25. @estebanmoro Shops & Services Food Professional Social media data sources
  26. 26. @estebanmoro Mobile phone data Where When With whom
  27. 27. @estebanmoro Credit card Where When What
  28. 28. @estebanmoro How much does Big Data cost? Sources of data • Free APIs (http://dev.twitter.com) • Data vendors: GNIP, Datasift • Data cost is a function of volume and query complexity • Volume: 10k tweets = $1 • Complexity: 1 unit = $0.20 • A typical query (a word or hashtag) over one week costs hundreds of dollars
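As a back-of-the-envelope check of the prices quoted above, the sketch below adds a volume charge and a complexity charge. The slide does not say how complexity units are billed (per query, per hour, and so on), so the flat per-unit charge used here is an assumption.

```python
# Prices quoted on the slide.
PRICE_PER_10K_TWEETS = 1.00        # dollars per 10,000 tweets
PRICE_PER_COMPLEXITY_UNIT = 0.20   # dollars per complexity unit


def estimate_cost(n_tweets, complexity_units):
    """Estimate the cost in dollars, assuming a flat charge per complexity unit."""
    volume_cost = n_tweets / 10_000 * PRICE_PER_10K_TWEETS
    complexity_cost = complexity_units * PRICE_PER_COMPLEXITY_UNIT
    return volume_cost + complexity_cost


# Hypothetical query: 2 million tweets and 5 complexity units.
cost = estimate_cost(2_000_000, 5)
print(f"${cost:.2f}")
```

Under these assumptions the volume term dominates, which matches the order of magnitude given for a typical one-week query.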
  29. 29. @estebanmoro Other sources of BigData http://insights.wired.com/profiles/blogs/monetizing-data-milking-the-new-cash-cow Data monetization
  30. 30. @estebanmoro Other sources of BigData Data monetization https://www.commerce360.es http://dynamicinsights.telefonica.com
  31. 31. @estebanmoro Other sources of BigData Other sources of data http://insideairbnb.com
  32. 32. @estebanmoro Other sources of BigData Other sources of data (pictures, Panoramio) http://www.sightsmap.com
  33. 33. @estebanmoro Other sources of BigData Other sources of data (pictures, Flickr) https://www.flickr.com/photos/walkingsf/sets/72157627140310742/with/5925795351/
  34. 34. @estebanmoro Other sources of BigData Other sources of data (pictures, NASA) http://www.citylab.com/tech/2014/05/the-economic-data-hidden-in-satellite-views-of-city-lights/371660/
  35. 35. @estebanmoro Mobile phone data 3.Applications of Big Data
  36. 36. @estebanmoro What can we do with social/geo big data? • Basically: • a) Build models of user behavior: • Geo-social activity • Geo-individual recommendation • Geomarketing • Fraud detection • Dynamic insurance pricing • b) Build models of area activity: • Optimal distribution of resources (retail, banks) • Event detection • Measure flows between areas (traffic, transport, health, etc.) • Macroeconomic indices of areas
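Measuring flows between areas can be sketched as counting, for each user, the transitions between consecutive observations in different areas. The record layout `(user, timestamp, area)` and the sample data are hypothetical; real CDR or card data would need cleaning and a rule for what counts as a trip.

```python
from collections import Counter, defaultdict


def od_flows(records):
    """Count origin-destination transitions from (user, timestamp, area) records."""
    visits_by_user = defaultdict(list)
    for user, timestamp, area in records:
        visits_by_user[user].append((timestamp, area))

    flows = Counter()
    for visits in visits_by_user.values():
        visits.sort()  # order each user's observations in time
        for (_, origin), (_, destination) in zip(visits, visits[1:]):
            if origin != destination:  # ignore consecutive records in the same area
                flows[(origin, destination)] += 1
    return flows


# Hypothetical observations for two users and two areas.
records = [
    ("a", 1, "area_1"), ("a", 2, "area_1"), ("a", 3, "area_2"),
    ("b", 1, "area_1"), ("b", 5, "area_2"), ("b", 9, "area_1"),
]
flows = od_flows(records)
print(dict(flows))
```

The resulting counts form an origin-destination matrix, which is itself a weighted directed network between areas.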
  37. 37. @estebanmoro IV Conference on the scientific analysis of mobile phone datasets (2015) • 250 participants :: 140 institutions, 32 countries, 5 continents
  38. 38. @estebanmoro 2015 • Crowds: real-time event detection in cities; estimating attendance of events • Cities: energy consumption; predicting crime hotspots; health catchment areas; census estimation • Economies: loan repayment; food consumption and poverty indices; microcredit approval; labor market • Societies: spread of diseases; social influence; privacy; product adoption; marketing • Mobility: mobility prediction; impact of the sharing economy; optimization of public transportation (diagram labels: Mobility, Content, Activity, Social)
  39. 39. @estebanmoro Digital in Spain, Jan 2015 (We Are Social, @wearesocialsg): a snapshot of the country’s key digital statistical indicators • Total population: 46.5 million (urbanisation: 77%; total national population, including children) • Active internet users: 35.7 million (penetration: 77%; includes access via fixed and mobile connections) • Active social media accounts: 22.0 million (penetration: 47%; active user accounts, not unique users) • Mobile connections: 50.3 million (108% of the population; mobile subscriptions, not unique users) • Active mobile social accounts: 17.8 million (penetration: 38%; active user accounts, not unique users) • Sources: Wikipedia; InternetLiveStats, InternetWorldStats; Facebook, Tencent, VKontakte, LiveInternet; GSMA Intelligence
  40. 40. @estebanmoro Top active social platforms, Jan 2015 (We Are Social, @wearesocialsg) • WhatsApp 42% • Facebook 33% • Facebook Messenger 20% • Twitter 17% • Skype 12% • Google+ 11% • Instagram 10% • Shazam 9% • LinkedIn 9% • Pinterest 7% • Survey-based data: figures represent users’ own claimed/reported activity, as a percentage of the total national population using the platform in the past month • Source: GlobalWebIndex, Q4 2014
  41. 41. @estebanmoro Time spent with media, Jan 2015 (We Are Social, @wearesocialsg) • Average daily use of the internet via a PC or tablet (internet users): 3h 58m • Average daily use of the internet via a mobile phone (mobile internet users): 1h 51m • Average daily use of social media via any device (social media users): 1h 54m • Average daily television viewing time (internet users who watch TV): 2h 31m • Survey-based data: figures represent users’ own claimed/reported activity; averages are based solely on people who use each medium and do not factor in non-users • Source: GlobalWebIndex, Q4 2014, based on a survey of internet users aged 16-64
  42. 42. @estebanmoro • Opinion: political opinion; product/brand opinion • Cities: tourism activity; event detection • Economies: unemployment; microcredit approval; human resources • Social: influencer detection; community analysis; social mobilization • Mobility: tourism in cities; world-wide transport (diagram labels: Mobility, Content, Activity, Social)
  43. 43. @estebanmoro Dynamic population estimation Deville, P, et al. (2014). Dynamic population mapping using mobile phone data. 
 PNAS 111(45), 15888–15893. http://doi.org/10.1073/pnas.1408439111
  44. 44. @estebanmoro Purchasing behavior during holidays BBVA + MIT
  45. 45. @estebanmoro Mobility inside cities Habidatum
  46. 46. @estebanmoro Mobility inside cities Habidatum
  47. 47. @estebanmoro Mobility between cities A. Llorente, E. Moro et al (2014)
  48. 48. @estebanmoro Event detection Orange
  49. 49. @estebanmoro Transport http://cargocollective.com/juanfrans
  50. 50. @estebanmoro Tourism http://www.centrodeinnovacionbbva.com/bbvatourism
  51. 51. @estebanmoro Real estate http://www.urbandataanalytics.com/2014/03/12/las-edades-de-madrid/
  52. 52. @estebanmoro Real estate https://mcorella.cartodb.com/viz/2858ca72-e1ec-11e5-bfd8-0ea31932ec1d/public_map http://analytics.afi.es/AfiAnalytics/noticias/1503332/1491511/0/es-tu-casa-grande-o-pequena-y-las-de-tu-barrio.html
  53. 53. @estebanmoro Real estate http://www.datanami.com/2015/08/12/inside-the-zestimate-data-science-at-zillow/
  54. 54. @estebanmoro Real estate http://www.amazon.com/Zillow-Talk-Rules-Real-Estate/dp/1455574740
  55. 55. @estebanmoro Health Prediction of air quality in cities (http://www.bsc.es/caliope/es)
  56. 56. @estebanmoro Health Correlation between content in social networks and symptoms [Figure: incidence (per 100k users) versus weeks since Jan 2012, with panels for flu, allergy, headache, and fever; legend: incidence high, medium, low]
  57. 57. @estebanmoro Health Correlation between content in social networks and symptoms ... group within the correlation matrix. There is a second group with peak seasonality during the hottest months of the year; this group is mostly formed by pollens and O3. Finally, there is a third cluster whose variables correlate very strongly with each other, while their correlation with the rest of the time series is zero or very small. Figure 4. Geospatial representation of the logarithmic transformation of total mentions of health-related time
  58. 58. @estebanmoro Health Correlation between content in social networks and symptoms [Figure: incidence (per 100,000 users) by day of the week, Monday to Sunday, with panels for headache and backache]
  59. 59. @estebanmoro Political opinion • Example: identifying supporters during political campaigns • Catalan elections 2010 [Figure: plot with both axes running from -1.0 to 1.0 and groups labelled by party: PSC, CiU, ERC, PPC, ICV, C's, SOL, PPT, PACMA, CORI]
  60. 60. @estebanmoro Political opinion General Strike Spain March 12
  61. 61. @estebanmoro References • Reviews on mobile phone applications • Blondel, V. D., Decuyper, A., & Krings, G. (2015). A survey of results on mobile phone datasets analysis. EPJ Data Science, 4(1), 10. http://doi.org/10.1140/epjds/s13688-015-0046-0 • Mobile Phone Network Data for Development. (2013). UN Global Pulse • Saramaki, J., & Moro, E. (2015). From seconds to months: an overview of multi-scale dynamics of mobile telephone calls. The European Physical Journal B, 88(6). http://doi.org/10.1140/epjb/e2015-60106-6 • Naboulsi, D., Fiore, M., Ribot, S., & Stanica, R. (n.d.). Large-scale Mobile Traffic Analysis: a Survey. IEEE Communications Surveys & Tutorials, 1–1. http://doi.org/10.1109/COMST.2015.2491361 • Conferences • NetMob http://netmob.org • NetSci http://netsci2016.net
  62. 62. @estebanmoro References • Mining the Social Web, O’Reilly http://shop.oreilly.com/product/0636920030195.do • Applications • Pinheiro, C. A. R. 2011. Social network analysis in telecommunications. John Wiley & Sons. • Morselli, C., ed. 2013. Crime and Networks. Routledge.
  63. 63. @estebanmoro Mobile phone data 3.Tools for social/geo big data
  64. 64. @estebanmoro • There are many frameworks to study social networks • In general we have: • Analysis platforms: they implement most of the algorithms for graph analysis: • Local metrics (degree, clustering) • Centrality metrics (betweenness, closeness, etc.) • Community finding algorithms • Visualization libraries • Display graphs in different forms (layout, colors, etc.) • Graph databases: allow the storage (distributed), queries and some type of analysis for (big) graph data. Libraries
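To make the local and centrality metrics listed above concrete, here is a minimal standard-library sketch of degree, local clustering and closeness on a small undirected graph. It is not the implementation used by any of the libraries discussed in the talk, and the toy edge list is made up for illustration.

```python
from collections import deque


def build_graph(edges):
    """Undirected graph stored as a dict mapping each node to its set of neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree(adj, node):
    """Number of neighbours of a node."""
    return len(adj[node])


def local_clustering(adj, node):
    """Fraction of pairs of neighbours that are themselves connected."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u in neighbours for v in neighbours if u < v and v in adj[u])
    return 2.0 * links / (k * (k - 1))


def closeness(adj, node):
    """(Reachable nodes) / (sum of shortest-path distances), computed with BFS."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
g = build_graph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")])
for n in sorted(g):
    print(n, degree(g, n), round(local_clustering(g, n), 3), round(closeness(g, n), 3))
```

Analysis platforms provide the same quantities as one-line calls, together with faster algorithms for large graphs.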
  65. 65. @estebanmoro 3 layers of graph technologies
  66. 66. @estebanmoro Graph databases • Network data can be stored in many kinds of databases • However, in recent years interest in graph databases has grown steadily http://db-engines.com/en/ranking_categories
  67. 67. @estebanmoro Graph databases • They are databases that use graph structures for queries. Data is represented using nodes, edges, and their properties • Each node knows its neighbors • They make graph queries very easy to express, such as: • Find the neighbors of a node • Find the path between two nodes • In a typical relational database, those queries require several “joins” http://neo4j.com/developer/graph-db-vs-rdbms/
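The contrast between joins and neighbour lookups can be sketched with the standard library alone: a relational edge table in `sqlite3` needs one self-join per extra hop, while an adjacency structure lets a traversal walk from node to node. The table layout and node names are invented for illustration, and the in-memory dictionary only stands in for what a graph database does on disk.

```python
import sqlite3
from collections import deque

# Hypothetical directed edges.
edges = [("ana", "ben"), ("ben", "cai"), ("cai", "dee")]

# Relational version: a single edge table, with one self-join per additional hop.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edge (src TEXT, dst TEXT)")
con.executemany("INSERT INTO edge VALUES (?, ?)", edges)
two_hops = con.execute(
    "SELECT e2.dst FROM edge e1 JOIN edge e2 ON e1.dst = e2.src WHERE e1.src = ?",
    ("ana",),
).fetchall()

# Graph version: every node keeps a direct list of its neighbours.
adj = {}
for src, dst in edges:
    adj.setdefault(src, []).append(dst)
    adj.setdefault(dst, [])


def shortest_path(adj, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


print(two_hops)
print(adj["ana"], shortest_path(adj, "ana", "dee"))
```

A path of unknown length has no fixed number of joins, which is why path queries are awkward in plain SQL and natural in a graph model.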
  68. 68. @estebanmoro Graph databases • Some examples • Neo4j (commercial/open-source): probably the most widely used. It has its own query language (Cypher) and can be accessed from many other languages (R, Python, Java) http://neo4j.com • Sparksee (commercial): built for high performance and scalability http://sparsity-technologies.com • Titan (Apache): distributed graph database built to store and query graphs with billions of nodes and edges http://thinkaurelius.github.io/titan/
  69. 69. @estebanmoro Neo4j • It can be accessed through a REST API (HTTP) • It has its own query language: Cypher
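Accessing Neo4j over HTTP amounts to POSTing a JSON document that wraps one or more Cypher statements. The sketch below only builds that request body, in the shape I recall from Neo4j's transactional HTTP endpoint (`statements` holding `statement` and `parameters`). The endpoint path and authentication differ between server versions, so they are deliberately left out, and the helper name `cypher_payload` is hypothetical.

```python
import json


def cypher_payload(statement, parameters=None):
    """Wrap a Cypher statement in the JSON body expected by the HTTP endpoint (assumed shape)."""
    body = {
        "statements": [
            {"statement": statement, "parameters": parameters or {}}
        ]
    }
    return json.dumps(body)


# A query that counts every node in the database.
payload = cypher_payload("MATCH (n) RETURN count(n) AS nodes")
print(payload)
```

The string would be sent with a `Content-Type: application/json` header; the official drivers hide this step behind a session object.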
  70. 70. @estebanmoro Neo4j Cypher http://neo4j.com/developer/cypher-query-language/
  71. 71. @estebanmoro Application: Panama papers 1. You can download the full Panama papers database in Neo4j format: https://offshoreleaks.icij.org/pages/database 2. Count the number of nodes and the number of relationships
  72. 72. @estebanmoro Application: Panama papers 1. Show the relationships of the President of Azerbaijan (Ilham Aliyev) and his children: https://panamapapers.icij.org/20160404-azerbaijan-hidden-wealth.html 2. Search for all the officers named “Aliyev”
  73. 73. @estebanmoro Application: Panama papers 1. Show all the companies (entities) related to them
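The Panama papers steps can be written as Cypher queries, kept here as Python strings so they can be passed to whichever client is in use. The node labels `Officer` and `Entity` and the `name` property are assumptions about the schema of the ICIJ dump and should be checked against the downloaded database; these are not the queries shown in the talk.

```python
# Cypher queries for the steps above (schema names are assumptions).
QUERIES = {
    # Count all nodes.
    "count_nodes": "MATCH (n) RETURN count(n) AS nodes",
    # Count all relationships.
    "count_relationships": "MATCH ()-[r]->() RETURN count(r) AS relationships",
    # Officers whose name contains the search string.
    "officers_by_name": (
        "MATCH (o:Officer) WHERE o.name CONTAINS 'Aliyev' RETURN o"
    ),
    # Entities (companies) connected to those officers, in either direction.
    "related_entities": (
        "MATCH (o:Officer)-[r]-(e:Entity) "
        "WHERE o.name CONTAINS 'Aliyev' RETURN o, r, e"
    ),
}

for name, query in QUERIES.items():
    print(f"{name}: {query}")
```

Pattern matching on `(o)-[r]-(e)` replaces what would be a join between an officers table, a relationships table and an entities table.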
  74. 74. @estebanmoro Analysis libraries • Built in many programming languages • Boost Graph Library (BGL) is probably the oldest and best known. Built in C++ and optimized to be general, fast, and efficient • SNAP (Stanford Network Analysis), written in C++ and optimized for massive graphs (Jure Leskovec) • NetworkX (Python): library for studying graphs and networks. Reasonably efficient for large networks and their visualization https://networkx.github.io
  75. 75. @estebanmoro Analysis libraries • graph-tool (Python): module for manipulation and statistical analysis of graphs. Based heavily on BGL to achieve the same performance (Tiago P. Peixoto) https://graph-tool.skewed.de • igraph (Python, C, and R): library written in C that is also available as Python and R packages. It implements most algorithms http://igraph.org • networkDynamic (R): package to analyze temporal networks
  76. 76. @estebanmoro Analysis libraries • Other platforms for the (distributed) analysis of massive graphs • Giraph (Apache): highly scalable graph processing. Used by Facebook, compatible with Hadoop http://giraph.apache.org • Pregel (commercial): Google’s graph platform • GraphLab (commercial): graph-based, high-performance, distributed computation framework (including machine learning toolkits) https://graphlab.org • GraphX (Apache): distributed graph processing framework on top of Apache Spark. Has many powerful algorithms for graph analysis http://spark.apache.org/graphx/
  77. 77. @estebanmoro Visualization libraries • Most of the analysis libraries contain tools or modules to visualize graphs • Apart from those, there are other tools specialized in graph visualization • Gephi is probably the best known: an interactive visualization program (it includes some analysis metrics). Works on Windows, Linux, and Mac OS X. It is the “Photoshop” for graphs ☺ http://gephi.org • Pajek is a Windows program to visualize and analyze big graphs http://vlado.fmf.uni-lj.si/pub/networks/pajek/ • Linkurious: graph visualization on top of Neo4j http://linkurio.us
  78. 78. @estebanmoro Visualization libraries • Graphviz: open-source library to visualize graph data http://www.graphviz.org • Sigma.js is a JavaScript library to visualize graphs on the web http://sigmajs.org • Vis.js is a general JavaScript visualization library that also has tools to visualize graphs http://visjs.org/ • lightning-viz.org provides API-based access to reproducible web visualizations • D3.js also has some graph visualization tools. Examples: • http://christophergandrud.github.io/networkD3/ • http://bl.ocks.org/mbostock/4062045 • https://flowingdata.com/2012/08/02/how-to-make-an-interactive-network-visualization/
  79. 79. @estebanmoro Gephi • Lets you modify and customize the visualization of graphs interactively • It has many layout algorithms • Contains some graph metrics: • Centrality • PageRank • Connected components • Etc. • Can import and export graphs in many different formats
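Two of the metrics Gephi reports, PageRank and connected components, are short enough to sketch with the standard library. This is a plain power-iteration PageRank and a depth-first component search on a made-up graph, not Gephi's own implementation; the damping factor of 0.85 is the conventional default and an assumption here.

```python
def pagerank(adj, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict mapping each node to its set of successors."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            successors = adj[v]
            if successors:
                share = damping * rank[v] / len(successors)
                for w in successors:
                    new_rank[w] += share
            else:
                # A node with no outgoing links spreads its rank evenly.
                for w in nodes:
                    new_rank[w] += damping * rank[v] / n
        rank = new_rank
    return rank


def connected_components(adj):
    """Connected components of an undirected graph, found by depth-first search."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adj[v] - component)
        seen |= component
        components.append(component)
    return components


# Toy undirected graph: a star a-b, a-c plus a separate pair d-e.
graph = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}, "d": {"e"}, "e": {"d"}}
rank = pagerank(graph)
components = connected_components(graph)
print({v: round(r, 3) for v, r in rank.items()})
print(components)
```

In the toy graph the hub `a` ends up with the highest rank, and the two components are found independently of how the nodes are ordered.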
  80. 80. @estebanmoro Some references • About graph databases • Wikipedia: http://en.wikipedia.org/wiki/Graph_database • Book: Graph Databases (O’Reilly) http://graphdatabases.com • Graph database ranking: http://db-engines.com/en/ranking/graph+dbms • About Neo4j • Learning Neo4j (book): http://neo4j.com/book-learning-neo4j/ • GraphAcademy (by Neo4j): http://neo4j.com/graphacademy/ has some online courses • About analysis libraries • igraph: Statistical Analysis of Network Data with R (book) http://www.amazon.com/Statistical-Analysis-Network-Data-Use/dp/1493909827/ • GraphX: A gentle introduction to GraphX in Spark http://www.sparktutorials.net/analyzing-flight-data:-a-gentle-introduction-to-graphx-in-spark
  81. 81. @estebanmoro Some references • About visualization • Gephi: • Learn how to use Gephi https://gephi.org/users/ • Introduction to Network Analysis and Visualization http://www.martingrandjean.ch/gephi-introduction/
  82. 82. @estebanmoro Simple examples
  83. 83. @estebanmoro igraph
  84. 84. @estebanmoro igraph
  85. 85. @estebanmoro igraph
  86. 86. @estebanmoro igraph
  87. 87. @estebanmoro igraph
  88. 88. @estebanmoro igraph
  89. 89. @estebanmoro References • Online material ▪ The igraph book (incomplete) ▪ igraph wikidot ▪ A simple manual in Spanish • Books ▪ Statistical Analysis of Network Data with R
  90. 90. @estebanmoro NetworkX
  91. 91. @estebanmoro NetworkX
  92. 92. @estebanmoro NetworkX
  93. 93. @estebanmoro NetworkX
  94. 94. @estebanmoro NetworkX
  95. 95. @estebanmoro networkDynamic
  96. 96. @estebanmoro networkDynamic
  97. 97. @estebanmoro networkDynamic
  98. 98. @estebanmoro networkDynamic
  99. 99. @estebanmoro networkDynamic
  100. 100. @estebanmoro networkDynamic
  101. 101. @estebanmoro networkDynamic
  102. 102. @estebanmoro networkDynamic
  103. 103. @estebanmoro networkDynamic
  104. 104. @estebanmoro networkDynamic
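The core idea behind temporal-network tools such as networkDynamic is that each edge carries activity spells with an onset and a terminus. The sketch below mimics that idea in plain Python rather than calling the R package; the tuple layout, the half-open interval convention `[onset, terminus)` and the sample spells are assumptions made for illustration.

```python
def active_edges(spells, at):
    """Edges active at time `at`, treating each spell as the interval [onset, terminus)."""
    return {(u, v) for u, v, onset, terminus in spells if onset <= at < terminus}


def aggregate(spells, start, end):
    """Static snapshot: edges with any activity overlapping the window [start, end)."""
    return {(u, v) for u, v, onset, terminus in spells if onset < end and terminus > start}


# Hypothetical edge spells: (node, node, onset, terminus).
spells = [
    ("a", "b", 0, 5),
    ("b", "c", 3, 8),
    ("a", "c", 10, 12),
]

print(active_edges(spells, 4))
print(active_edges(spells, 5))
print(aggregate(spells, 0, 3))
```

Aggregating over windows of different lengths shows how the same contact data yields different static networks, which is the multi-scale issue discussed in the temporal-network references.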
  105. 105. @estebanmoro networkDynamic References • About temporal networks ▪ Holme, P., & Saramaki, J. (2012). Temporal networks. Physics Reports, 519(3), 97–125. ▪ Saramaki, J., & Moro, E. (2015). From seconds to months: an overview of multi-scale dynamics of mobile telephone calls. The European Physical Journal B, 88(6). http://doi.org/10.1140/epjb/e2015-60106-6 • About the networkDynamic, tsna and ndtv packages ▪ Package examples for networkDynamic https://cran.r-project.org/web/packages/networkDynamic/vignettes/networkDynamic.pdf ▪ Package vignette for ndtv https://cran.r-project.org/web/packages/ndtv/vignettes/ndtv.pdf ▪ Package vignette for tsna https://cran.r-project.org/web/packages/tsna/vignettes/tsna_vignette.html • Tutorials ▪ Temporal network tools in statnet: networkDynamic, ndtv and tsna
