Social Network Analysis: State of the Art and Challenges
Noshir Contractor, UIUC & NCSA
Katy Börner, Indiana University

Network Data
Network Data Collection & Extraction
Network Data Coding
Network Data Analysis: Methods
Network Data Visual-Analytics: Implementations and Algorithms
Network Simulation
Network Visualization
Tools for Networking and Referrals
Network Workflows
Network Data

Description:
Multi-dimensional, multi-relational, multi-layer, multi-rater, ego-centric data sets or streams.
Entities could be people, text, documents, organizations, websites, blogs.
Datasets vary in number of nodes, relations/dimensions, and subnetworks.
A culture of shared benchmark datasets: Holland & Leinhardt, Sampson, Krackhardt, Enron … many included with software packages such as UCINET.

Examples:
Global Information Sector, Blogosphere, Political networks, Emergency Multiorganizational Networks, AP News, Tobacco Control, Multidisciplinary virtual teams, Scientometric networks, Bibliometric Co-author and Co-citation networks, Funding/collaboration networks, (Cell) Phone usage logs, U.S. Census data, GIS data, Capacity of medical facilities, Transit route schedules, etc.

Challenges:
Intellectual: How do networks explain large social phenomena such as the diffusion of ideas, creativity, political movements and action, infectious diseases (AIDS, Avian flu), online retailing, and mobilization to prepare for, respond to, and recover from disasters?
Network Data: Size and Complexity are Growing Fast

There are now 1 billion books cataloged in WorldCat (http://www.oclc.org/worldcat/).
Brewster Kahle’s Internet Archive captures about 600TB of data (http://www.archive.org/) – Michael Macy’s group (Cornell U, Cybertools).
Google no longer lists how many web pages it indexes.
Indiana University receives 650 million email messages per year (excluding four times this amount from sites that are non-existent and are blocked).
Global Information Sector longitudinal dataset of international IT companies by David Knoke, U Minnesota.
Public Health Epidemiology, Evaluation and Surveillance Data (Gary Giovino, Roswell Park Cancer Center; Brad Hesse/NCI; Pamela Clark/Battelle).
Instrumenting human interaction (video, speech, and non-verbal interactions) and cognition: Corman/ASU, Cox/NCSA, Pentland/MIT, Bennett Bertenthal/U of Chicago (Cybertools).
Infectious Disease Epidemiology data – Marathe/Eubank, Virginia Tech.
A large number of small data sets dealing with the same variables or collected using the same instruments that are typically not analyzed collectively, but could be. For instance, ego-centric network datasets.
Network Data Collection: Entity and Affiliation Extraction

Description:
Manual (self-report, interviews, observation), Sensors, Web-crawling.
Automated network extraction methods: entity extraction, relation or affiliation extraction.

Examples:
EgoNet, CI as a source of data (provenance graphs), Bibliographic data, CITESEER, Web 2.0 technologies (Blogs, RSS feeds, etc.), GATE, Crawdad (Corman), D2K & T2K (Welge), VIAS (Craig)

Challenges:
Validation of automated (or computer-augmented) approaches to entity and affiliation extraction.
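Affiliation extraction from bibliographic data can be illustrated with a small sketch. It is only an assumption-laden example, not the method of any tool listed above: the records are hypothetical, the function name `coauthor_network` is invented for illustration, and the name matching is a naive lowercase/strip normalization, which is exactly the kind of step whose validation the slide flags as a challenge.

```python
from collections import Counter
from itertools import combinations

def coauthor_network(records):
    """Project paper-author affiliations onto a weighted co-author network.

    `records` is a list of author lists, one per paper (hypothetical input).
    Returns a Counter mapping (author_a, author_b) pairs to the number of
    papers the two authors share.
    """
    weights = Counter()
    for authors in records:
        # Naive entity resolution: trim whitespace and lowercase the name.
        unique = sorted({a.strip().lower() for a in authors if a.strip()})
        # Every pair of co-listed authors gets one more shared paper.
        for pair in combinations(unique, 2):
            weights[pair] += 1
    return weights

# Hypothetical bibliographic records.
records = [
    ["Author A", "Author B"],
    ["author a ", "Author C", "Author B"],
    ["Author C"],
]
network = coauthor_network(records)
```

A real pipeline would replace the normalization step with a proper entity-resolution method and would validate the extracted ties against hand-coded data.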
Network Data Coding

Description:
Manual, Tagging, Semantic, From Taxonomy to “Folksonomies”

Examples:
Flickr, Tag Cloud, Semantic Grid, Provenance, Social bookmarking, Credentialling

Challenges:
Self-organizing standards – Metadata, Semantic Grid.
Incentive structures for mobilizing communities to contribute.
Adaptiveness to changing Folksonomies.
“Conservative” tendency towards the mean or majority view.
Network Data Analysis: Methods

Description:
ERGM – Exponential Random Graph Models (Wasserman, Pattison, Robins, Snijders, et al.)
Network Evolution – Actor-oriented models (Snijders, Steglich)
Positional analysis – generalized block modeling (Batagelj)
Autocorrelation Models (Leenders)
Spectral analysis (Richards, Seary)
Multi-relational, multi-rater networks (Koehly, Corman)
Sampling, missing data (Wasserman, Butts)

Examples:
Monte Carlo techniques for Maximum Likelihood Estimation of ERGM: simulate a distribution of random graphs from a starting set of parameter values, then refine these estimated parameter values by comparing the distribution of simulated graphs with the observed graph until the parameters stabilize.

Challenges:
“However, Monte Carlo approaches to MLE can be computer intensive, so estimation for networks with a large number of nodes, or for a complex model may not be possible or may take an unacceptably long time” (p. 157, Wasserman & Robins, 2005).
Scalability in terms of size of networks (nodes, number of relationships) and complexity of model (number of parameters to be estimated).
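The simulate-compare-refine loop in the Examples entry can be sketched for the simplest possible case. This is a minimal sketch under stated assumptions, not the estimation routine of any package: it assumes an edges-only ERGM (chosen because its MLE has a closed form, the log-odds of the density, which gives something to check against), a plain Metropolis dyad-toggle sampler, and a Robbins-Monro style 1/k gain. The function names, the gain constant, and the toy observed graph are all hypothetical.

```python
import math
import random

def metropolis_toggle(state, theta, steps, rng):
    """Run Metropolis dyad toggles for P(G) proportional to exp(theta * edges(G)).

    `state` is a list of booleans, one per dyad; it is mutated in place.
    Returns the edge count of the graph after `steps` proposals.
    """
    for _ in range(steps):
        i = rng.randrange(len(state))
        delta = -1 if state[i] else 1  # change in edge count if dyad i is toggled
        # Accept with probability min(1, exp(theta * delta)).
        if rng.random() < math.exp(min(0.0, theta * delta)):
            state[i] = not state[i]
    return sum(state)

def estimate_theta(observed, iterations=300, steps=500, seed=0):
    """Stochastic-approximation estimate of the edge parameter."""
    rng = random.Random(seed)
    state = list(observed)           # start the chain at the observed graph
    s_obs = sum(observed)            # observed statistic: number of edges
    theta = 0.0                      # starting parameter value
    scale = 4.0 / len(observed)      # assumed gain: inverse of the largest possible variance
    for k in range(1, iterations + 1):
        s_sim = metropolis_toggle(state, theta, steps, rng)
        # Move theta toward values whose simulated graphs match the observed count.
        theta += (scale / k) * (s_obs - s_sim)
    return theta

# Hypothetical observed graph: 10 nodes, 45 dyads, 15 of them tied.
observed = [True] * 15 + [False] * 30
theta_hat = estimate_theta(observed)
closed_form = math.log(15 / 30)      # logit of the density, for comparison
```

For models with dependence terms such as triangles or stars there is no closed form, each proposal needs a change-statistic computation, and the chain mixes more slowly, which is where the scalability concern quoted above arises.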
Network Data Visual-Analytics: Implementations and Algorithms

Description:
Descriptive methods to calculate (simple) network statistics (e.g., centrality or transitivity).
Procedure-based analysis for more complex (iterative) algorithms (e.g., cluster analysis or eigen decomposition).
Statistical modeling based on probability distributions (e.g., Exponential Random Graph Modeling, Quadratic Assignment Procedures).

Examples:
Pajek (Batagelj), SNA-R/StatNet (Butts), Multinet/PSPAR (Richards/Seary), VOSON (Ackland), StocNet/SIENA (Snijders/Steglich), NetVis (Cummings), JUNG (Fisher)

Challenges:
Data format interoperability, data integration, data management.
Open code; doing “bake-offs” using simulation models on benchmark data to explain variance.
Evaluating and assessing different Markov Chain Monte Carlo (MCMC) algorithms (such as Robbins-Monro, Gibbs Sampler, Metropolis-Hastings).
Extensible and scalable software frameworks to ‘plug & play’ diverse algorithms.
Algorithm and code documentation and learning modules.
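The descriptive statistics named in the Description (centrality, transitivity) are simple enough to show directly. The sketch below is illustrative only and does not reproduce any listed package: it assumes an undirected network stored as a dictionary of neighbor sets, uses normalized degree as the centrality measure, and runs on a hypothetical four-node network.

```python
from itertools import combinations

def degree_centrality(adj):
    """Normalized degree centrality: degree divided by (n - 1)."""
    n = len(adj)
    if n < 2:
        return {v: 0.0 for v in adj}
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

def transitivity(adj):
    """Global transitivity: share of two-paths whose end points are tied."""
    closed = 0
    two_paths = 0
    for v, nbrs in adj.items():
        # Each pair of neighbors of v forms a two-path centered on v.
        for a, b in combinations(sorted(nbrs), 2):
            two_paths += 1
            if b in adj[a]:
                closed += 1
    return closed / two_paths if two_paths else 0.0

# Hypothetical toy network: a triangle (a, b, c) plus a pendant node d tied to c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

centrality = degree_centrality(adj)
clustering = transitivity(adj)
```

Here node c is tied to every other node, and three of the five two-paths in the network are closed by the single triangle.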
Software for Network Analysis

Source: Huisman, M. & Van Duijn, M. A. J. (2005). Software for Social Network Analysis. In P. J. Carrington, J. Scott, & S. Wasserman (Eds.), Models and Methods in Social Network Analysis (pp. 270-316). New York: Cambridge University Press.
Software Toolkits for Network Analysis

Source: Huisman, M. & Van Duijn, M. A. J. (2005). Software for Social Network Analysis. In P. J. Carrington, J. Scott, & S. Wasserman (Eds.), Models and Methods in Social Network Analysis (pp. 270-316). New York: Cambridge University Press.
Network Simulation

Description:
Systems Dynamics
Agent-Based Models
Computational Network Models

Examples:
Diffusion of information, emergence of norms, coordination of conventions, or participation in collective action (Macy)
Spread of epidemics: EPISIMS (Marathe/Eubank)
Tobacco control (Clark/Hesse)
Computational modeling environments: Repast (Sallach), Blanche (Contractor)

Challenges:
Reusability, transparency, docking, multi-scale simulations, distributed data, doing “bake-offs” on simulation models on benchmark data to explain variance.
Theoretical testing and empirical validation.
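A tiny agent-based model shows what a diffusion simulation on a network involves. This sketch is an assumption, not a reconstruction of any environment named above: it uses a deterministic fractional-threshold adoption rule (each agent adopts once the adopting share of its neighbors reaches a threshold), and the ring network, seed set, and function name `threshold_diffusion` are hypothetical.

```python
def threshold_diffusion(adj, seeds, threshold):
    """Simulate threshold-based adoption on a network.

    adj: dict mapping each node to the set of its neighbors.
    seeds: nodes that have adopted at time 0.
    threshold: minimum share of adopting neighbors needed to adopt.
    Returns the list of adopter sets, one per round, until nothing changes.
    """
    adopted = set(seeds)
    history = [set(adopted)]
    while True:
        # All agents update simultaneously from the previous round's state.
        new = {
            v for v, nbrs in adj.items()
            if v not in adopted and nbrs
            and len(nbrs & adopted) / len(nbrs) >= threshold
        }
        if not new:
            return history
        adopted |= new
        history.append(set(adopted))

# Hypothetical network: six agents on a ring, each tied to two neighbors.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}

full_cascade = threshold_diffusion(ring, seeds={0, 1}, threshold=0.5)
no_cascade = threshold_diffusion(ring, seeds={0, 1}, threshold=0.6)
```

With a threshold of 0.5 the adoption spreads around the ring in two rounds; raising it to 0.6 stops the cascade at the seeds, a small instance of how such models are used to test theoretical claims before empirical validation.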
Network Visualization

Description:
An (animated) image of a network is sometimes worth a list of millions of (dynamically changing) node-node pairs.
In many cases, visuals are the major means to represent and communicate scientific results across scientific boundaries.

Examples:
Pajek (Batagelj), GUESS (Adar), JUNG (Fisher), MatrixVis (Chris Mueller), TraceEncounters (Paley), TreePlus (Plaisant)

Challenges:
Eye candy vs. highly readable and effective visualizations.
Visualization of data origin, provenance, accuracy, (un)certainty.
Tight coupling of data analysis and visualization to help people make sense of very large, dynamically evolving datasets.
Scalable, interactive/iterative specification of data analysis and data mappings.
http://www.visualcomplexity.com/vc/
Tools for Networking and Referrals

Description:
Search-based recommendations
Category-based recommendations
Collaborative filtering: Memory-based, Model-based (clustering)
Association rules (or item-to-item collaborative filtering)
Content-based methods
Recommendation support
Hybrid methods

Examples:
TraceEncounters (Paley), Intellibadge (Donna Cox), I-neighbors (Hampton), AOL (Karahalios), LEEP (Haythornthwaite), CI-KNOW (Contractor), NetExpert (Ramon Sanguesa), Adaptive Referral Systems (Singh)

Challenges:
Algorithms for capturing data automatically and in close to real time.
Agreement on and implementation of metadata, provenance, and
Developing theoretically grounded and statistically defensible mechanisms for referrals.
Implementation of scalable algorithms.
Privacy-benefit trade-offs.
Access to large bibliographic databases.
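Item-to-item collaborative filtering, one of the approaches listed in the Description, can be sketched in a few lines. This is a hedged illustration rather than the algorithm of any system named above: it assumes binary "user has consulted item" data, cosine similarity between items, and a score that sums similarities to the user's existing items. The data and the names `item_similarities` and `recommend` are hypothetical.

```python
import math
from collections import defaultdict

def item_similarities(user_items):
    """Cosine similarity between items over binary user-item data."""
    users_of = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            users_of[item].add(user)
    sims = defaultdict(dict)
    for a in users_of:
        for b in users_of:
            if a != b:
                overlap = len(users_of[a] & users_of[b])
                if overlap:
                    norm = math.sqrt(len(users_of[a]) * len(users_of[b]))
                    sims[a][b] = overlap / norm
    return sims

def recommend(user, user_items, sims, top_n=3):
    """Rank items the user lacks by summed similarity to items they have."""
    owned = user_items[user]
    scores = defaultdict(float)
    for item in owned:
        for other, s in sims.get(item, {}).items():
            if other not in owned:
                scores[other] += s
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

# Hypothetical referral data: which experts (x, y, z) each person has consulted.
user_items = {
    "ann": {"x", "y"},
    "bob": {"x", "y", "z"},
    "cat": {"y", "z"},
    "dan": {"x"},
}
sims = item_similarities(user_items)
referrals_for_dan = recommend("dan", user_items, sims)
```

The pairwise similarity loop is quadratic in the number of items, which is one concrete form of the scalability challenge listed for referral tools.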
Network Workflows: Tying it all together

Description:
Itineraries – Supporting power users who want to thread multiple programs, data sets

Examples:
Kepler and D2K (Welge)

Challenges:
Agreement on and implementation of standards.
Interfacing with existing and new visual-analytic tools, data sets.
What Other Challenges Exist?
