NE7012- SOCIAL NETWORK ANALYSIS

PREPARED BY: A.RATHNADEVI A.V.C COLLEGE OF
ENGINEERING
UNIT 1-INTRODUCTION

Introduction to Web - Limitations of current Web – Development of Semantic Web – Emergence
of the Social Web – Statistical Properties of Social Networks -Network analysis - Development
of Social Network Analysis - Key concepts and measures in network analysis - Discussion
networks -Blogs and online communities - Web-based networks
1.1 INTRODUCTION TO WEB
 The first web browser was invented in 1990 by Tim Berners-Lee.
 It was called World Wide Web and was later renamed Nexus.
 In 1993, Marc Andreessen created a browser that was easy to install and use, released as
Mosaic (later Netscape).
1.2 LIMITATIONS OF THE CURRENT WEB
 There is a general consensus that the Web is one of the greatest inventions of the 20th
century. But could it be better?
 The reason that we do not often raise this question any more has to do with our unusual
ability to adapt to the limitations of our information systems. In the case of the Web this
means adaptation to our primary interface to the vast information that constitutes the
Web: the search engine.
 In the following we list four questions that search engines currently cannot answer
satisfactorily, or at all.
1.2.1 What’s wrong with the Web?
The questions below are specific for the sake of example, but they represent very general
categories of search tasks.
1. Who is Frank van Harmelen?
 To answer such a question using the Web, one would go to a search engine and enter
the most logical keyword: Harmelen.
 The results returned by Google are shown in Figure 1.1. (Note that the results are slightly
different depending on whether one enters Google through the main site or a localized
version.)

 If this question and answer were part of a conversation, the dialogue would sound
like this:
Q: Who is Frank van Harmelen?
A: I don’t know, but there are over a million documents with the word “harmelen” on them and
I found them all really fast (0.31s). Further, you can buy Harmelen at Amazon. Free Delivery on
Orders Over 15.
 Not only does the advertisement make little sense, but of the top ten results only six are
related to the Frank van Harmelen we are interested in. Upon closer inspection the
problem becomes clear: the word Harmelen means a number of things. It is the name of a
number of people, including the (unrelated) Frank van Harmelen and Mark van
Harmelen.
 Six of the hits from the top ten are related to the first person, one to the latter. Harmelen
is also a small town in the Netherlands (one hit) and the place for a tragic train accident
(one hit).
 The problem is thus that the keyword Harmelen (and even the term Frank van Harmelen)
is polysemous. The reason for the variety of the returned results is that designers of search
engines know that users are not likely to look at more than the top ten results. Search
engines are thus programmed so that the first page shows a diverse selection of the
most relevant links related to the keyword.

2. Show me photos of Paris
 The most straightforward solution to this search task is typing in “Paris photos” in the
search bar of our favorite search engine.
 Most advanced search engines, however, have specific facilities for image search where
we can drop the term photo from the query. Some of the results returned by Google
Image Search are shown in Figure 1.2.
 Again, what we immediately notice is that the search engine fails to discriminate between two
categories of images: those related to the city of Paris and those showing Paris Hilton, the
heiress to the Hilton fortune whose popularity on the Web could hardly be disputed.
 More striking is the quality of search results in general. While the search engine does a
good job with retrieving documents, the results of image searches in general are
disappointing.

 For the keyword Paris most of us would expect photos of places in Paris or maps of the
city.
 In reality only about half of the photos on the first page, a quarter of the photos on the
second page and a fifth on the third page are directly related to our concept of Paris. The
rest are about clouds, people, signs, diagrams etc.
 The problem is that associating photos with keywords is a much more difficult task than
simply looking for keywords in the texts of documents. Automatic image recognition is
currently a largely unsolved research problem, which means that our computers cannot
“see” what kind of object is on the photo.
 Search engines attempt to understand the meaning of the image solely from its context,
e.g. based on the name of the file and the text that surrounds the image. Inevitably, this
leads to rather poor results.
3. Find new music that I (might) like
 This query is at an even higher level of difficulty, so much so that most of us
wouldn’t even think of posing it to a search engine. First, from the perspective of
automation, music retrieval is just as problematic as image search.

 As in the previous case, a search engine could avoid the problem of understanding the
content of music and look at the filename and the text of the web page for clues about the
performer or the genre.
 We suspect that such search engines do not exist for different reasons: most music on the
internet is shared illegally through peer-to-peer systems that are completely out of reach
for search engines.
 Music is also a fast-moving good; search engines typically index the Web once a month
and are therefore too slow for the fast-moving world of music releases.
 But the main reason we would not attempt to pose this query has to do with the difficulty of
formulating the music we like. Most likely we would search for the names of our favorite bands or
music styles as a proxy, e.g. “new release”.
4. Tell me about music players with a capacity of at least 4GB.
 This is a typical e-commerce query: we are looking for a product with certain
characteristics.
 One of the immediate concerns is that translating this query from natural language to the
Boolean language of search engines is almost impossible. We could try the search “music
player” “4GB” but it is clear that the search engine will not know that 4GB is the
capacity of the music player and we are interested in all players with at least that much
memory (not just those that have exactly 4GB).
 Such a query would return only pages where these terms occur as they
are. The problem is that general-purpose search engines do not know anything about music
players or their properties, or how to compare such properties.
 They are good at searching for specific information (e.g. the model number of an MP3
player), but not at searching for descriptions of items.
1.2.2 Diagnosis: A lack of knowledge
 The questions above are arbitrary in their specificity, but they illustrate a general problem
in accessing the vast amounts of information on the Web. Namely, in all four cases we
deal with a knowledge gap: what the computer understands and is able to work with is
much more limited than the knowledge of the user.

 The handicap of the computer is mostly due to technological difficulties in getting our
computers to understand natural language or to “see” the content of images and other
multimedia.
 Even if the information is there, and is blatantly obvious to a human reader, the computer
may not be able to see anything else of it other than a string of characters.
 This problem affects all of the above queries to some extent. A human can
quickly skim the returned snippets (showing the context in which the keyword occurs)
and realize that the different references to the word Harmelen do not all refer to persons,
and that even the persons named Harmelen cannot all be the same.
 In the second query, it is also blatantly obvious for the human observer that not all
pictures are of cities. However, even telling cities and celebrities apart is a difficult task
when it comes to image recognition.
 In the case of the second query, an important piece of knowledge that the computer
doesn’t possess is the common knowledge that there is a city named Paris and there is a
famous person named Paris Hilton (who is also different from the Hilton in Paris).
 Answering the third query requires the kind of extensive background knowledge about
musical styles, genres etc. that shop assistants and experts in music possess. This kind of
knowledge is well beyond the information that is in the database of a typical music store.
 The third case is also interesting because background knowledge about the user is
lacking as well. There has to be a way of providing this knowledge to the search engine in
a form that it understands.
 The fourth query is noteworthy because it highlights the problem of aggregating
information.
1.2.3 The semantic solution
 The idea of the Semantic Web is to apply advanced knowledge technologies in order to
fill the knowledge gap between human and machine.
 This knowledge can either be information that is already described in the content of the
Web pages but difficult to extract or additional background knowledge that can help to
answer queries in some way.

 In the following we describe the improvement one could expect in case of our four
queries based on examples of existing tools and applications that have been implemented
for specific domains or organizational settings.
 In the case of the first query the situation can be greatly improved by providing personal
information in a semantic format.
 The solution is to attach to personal web pages a semantic profile that describes the same
information that appears in the text of the web page, but in a machine-processable format.
 The Friend-of-a-Friend (FOAF) project provides a widely accepted vocabulary for such
descriptions. FOAF profiles listing attributes such as the name, address, interests of the
user can be linked to the web page or even encoded in the text of the page.
 As we will see, several profiles describing the same person may also exist on the Web. As
all profiles are readable and comparable by machines, all knowledge about a person can
be combined automatically.
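The automatic combination of profiles can be sketched as follows. This is only an illustration, not the FOAF specification: the profiles are plain Python dictionaries, the key names (mbox, name, interests) loosely mirror FOAF properties, and treating the e-mail address as the key that identifies a person is an assumption made for the example.

```python
def merge_profiles(profiles):
    """Merge profile dictionaries that share the same 'mbox' (e-mail) value."""
    merged = {}
    for profile in profiles:
        key = profile["mbox"]  # assumption: the e-mail address identifies the person
        person = merged.setdefault(
            key, {"mbox": key, "name": None, "interests": set()}
        )
        # Keep the first name we encounter for this person.
        if person["name"] is None:
            person["name"] = profile.get("name")
        # Collect interests from every profile of the same person.
        person["interests"].update(profile.get("interests", []))
    return merged


# Hypothetical profiles found on different web pages.
profiles = [
    {"mbox": "mailto:alice@example.org", "name": "Alice Example",
     "interests": ["semantic web"]},
    {"mbox": "mailto:alice@example.org", "interests": ["social networks"]},
    {"mbox": "mailto:bob@example.org", "name": "Bob Example", "interests": []},
]
people = merge_profiles(profiles)
```

Here the two profiles with the same mbox value are folded into a single record, so the interests listed on different pages end up attached to one person.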
 The solution in the second case is to attach metadata to the images in question. For
example, the online photo sharing site Flickr allows annotating images using geographic
coordinates.
 After uploading some photos users can add keywords to describe their images (e.g.
“Paris, Eiffel-tower”) and drag and drop the images on a geographic map to indicate the
location where the photo was taken. In the background the system computes the latitude
and longitude of the place where the user pointed and attaches this information to the
image.
 Although in this case the system is not even aware that Paris is a city, minimal additional
information about photos (the geo-coordinates) enables a kind of visualization that makes
the searching task much easier.
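A small sketch of why geo-coordinates help: once every photo carries a latitude and longitude, a search for "photos of Paris" can become a simple numeric filter. The photo records and the bounding box below are illustrative assumptions (the box is only a rough rectangle around the city), and this is not how Flickr itself implements its map search.

```python
def photos_in_box(photos, south, west, north, east):
    """Return the titles of photos whose coordinates fall inside the box."""
    return [
        p["title"]
        for p in photos
        if south <= p["lat"] <= north and west <= p["lon"] <= east
    ]


# Hypothetical annotated photos: keywords alone cannot separate them,
# because all of them would match the keyword "Paris".
photos = [
    {"title": "Eiffel Tower at dusk", "lat": 48.8584, "lon": 2.2945},
    {"title": "Paris Hilton at a premiere", "lat": 34.0522, "lon": -118.2437},
    {"title": "Louvre courtyard", "lat": 48.8606, "lon": 2.3376},
]

# Assumed rough bounding box around the city of Paris (south, west, north, east).
paris_box = (48.80, 2.22, 48.91, 2.47)
in_paris = photos_in_box(photos, *paris_box)
```

The filter needs no knowledge that Paris is a city; the coordinates alone are enough to exclude the photo taken elsewhere.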
 In the third case, the background knowledge required for recommending music is already at
work behind the online radio called Pandora. Pandora is based on the Music Genome
Project, an attempt to create a vocabulary to describe characteristics of music from
melody, harmony and rhythm, to instrumentation, orchestration, arrangement, lyrics, and
the rich world of singing and vocal harmony.

 Over several years thousands of songs have been annotated by experts in music theory.
This knowledge is now used by the system to recommend unknown music to users based
on their existing favorites.
 Our fourth problem, the aggregation of product catalogs, can also be directly addressed
using semantic technology.
 The problem in this case is the difficulty of maintaining a unified catalog
in a way that does not require an exclusive commitment from the providers of product
information. (In practice, information providers often have their own product databases
with a proprietary classification system.) Further, we would like to keep the catalog open to
data providers adding new, emerging categories of products and their descriptions (e.g. MP3
players as a subclass of music players with specific attributes such as capacity, size, color etc.).
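To make the contrast with keyword search concrete, the sketch below shows how the fourth query becomes trivial once capacity is stored as a typed attribute rather than as text on a page. The MusicPlayer class, its field names and the catalog entries are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass
class MusicPlayer:
    model: str          # hypothetical model name
    capacity_gb: float  # capacity as a number, not as the string "4GB"


def players_with_min_capacity(catalog, min_gb):
    """Return the models whose capacity is at least min_gb gigabytes."""
    return [p.model for p in catalog if p.capacity_gb >= min_gb]


catalog = [
    MusicPlayer("Player A", 2),
    MusicPlayer("Player B", 4),
    MusicPlayer("Player C", 8),
]
result = players_with_min_capacity(catalog, 4)
```

A keyword search for "4GB" would find only Player B; the numeric comparison also returns Player C, which is what "at least 4GB" asks for.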
1.3 DEVELOPMENT OF THE SEMANTIC WEB
1.3.1 Research, development and standardization
 The vision of extending the current human-focused Web with machine-processable
descriptions of web content was first formulated in 1996 by Tim Berners-Lee, the
original inventor of the Web [BLFD99].
 The Semantic Web has been actively promoted since by the World Wide Web
Consortium (also led by Berners-Lee), the organization that is chiefly responsible for
setting technical standards on the Web.
 The Semantic Web has quickly attracted significant interest from funding agencies on
both sides of the Atlantic, reshaping much of the AI research agenda in a relatively short
period of time.
 In particular, the field of Knowledge Representation and Reasoning took center stage, but
outcomes from other fields of AI have also been put to use to support the move
towards the Semantic Web: for example, Natural Language Processing and Information
Retrieval have been applied to acquiring knowledge from the World Wide Web.

 The complete list of individuals in this community consists of 608 researchers mostly
from academia (79%) and to a lesser degree from industry (21%). Geographically, the
community covers much of the United States, Europe, with some activity in Japan and
Australia.
 As Figure 1.5 shows, the participation rate at the individual ISWC (International Semantic
Web Conference) events has quickly reached the level typical of large, established
conferences and remained at that level even for the last year of data (2004), when the
conference was organized in Hiroshima, Japan. The number of publications written by the
members of the community that contain the keyword “Semantic Web” has been rising sharply
since the beginning.
 The core technologies of the Semantic Web, logic-based languages for knowledge
representation and reasoning, have been developed in the research field of Artificial
Intelligence.
 Tools for creating, storing and reasoning with ontologies have been primarily developed
by university-affiliated technology startups (for example, Aduna, Ontotext and
Ontoprise) and at research labs of large corporations (see for example the work of the
advanced technology groups at IBM and Hewlett-Packard).
 Most of these tools are available as open source, because at the current stage vendors expect to
make a profit primarily by developing complete solutions and providing support for other
developers.
 The World Wide Web Consortium still plays a key role in standardization where the
interoperability of tools necessitates mediation between various developer and user
communities, as in the case of the development of a standard query language and
protocol to access ontology stores across the Web.
1.3.2 Technology adoption
 The Semantic Web was originally conceptualized as an extension of the current Web, i.e.
as the application of metadata for describing Web content. In this vision, the content that
is already on the Web (text, but also multimedia) would be enriched in a collaborative
effort by the users of the Web.
 The Semantic Web suffers from what the economist Kevin Kelly calls the fax-effect.
Kelly notes that when the first fax machines were introduced, they came with a very hefty
price tag.
 Yet they were almost useless, because the usefulness of a fax machine comes from being able to
communicate with other fax users.
 In this sense every fax unit sold increases the value of all fax machines in use. While
traditional goods such as land or precious metals become more valuable the less of them is
produced (the law of scarcity), the fax machine network exhibits the opposite,
which is called the law of plenitude.
 What makes the case of the Semantic Web more difficult, however, is an additional cost
factor. Returning to the example of the fax network, we can say that it required a certain
kind of agreement to get the system working on a global scale: all fax machines needed to
adopt the same protocol for communicating over the telephone line. This is similar to the
case of the Web where global interoperability is guaranteed by the standard protocol for
communication (HTTP).
1.4 THE EMERGENCE OF THE SOCIAL WEB
 The first wave of socialization on the Web was due to the appearance of blogs, wikis and
other forms of web-based communication and collaboration.
 Blogs and wikis attracted mass popularity from around 2003. What they have
in common is that they both significantly lower the requirements for adding content to the
Web: editing blogs and wikis no longer requires any knowledge of HTML. Blogs
and wikis allowed individuals and groups to claim their personal space on the Web and
fill it with content with relative ease.
 Although the example of Wikipedia, the online encyclopedia, is outstanding,
wikis large and small are used by groups of various sizes as an effective knowledge
management tool for keeping records, describing best practices or jointly developing
ideas.
 The first online social networks (also referred to as social networking services) entered
the field at the same time as blogging and wikis started to take off. In 2003, the
first-mover Friendster attracted over five million registered users in the span of a few
months, which was followed by Google and Microsoft starting or announcing similar
services.
1.4.1 Web 2.0 + Semantic Web = Web 3.0?
 Web 2.0 is often contrasted to the Semantic Web, which is a more conscious and
carefully orchestrated effort on the side of the W3C to trigger a new stage of
developments using semantic technologies.

 In practice the ideas of Web 2.0 and the Semantic Web are not exclusive alternatives:
while Web 2.0 mostly affects how users interact with the Web, the Semantic Web
opens new technological opportunities for web developers in combining data and services
from different sources.
 What the Semantic Web can offer to the Web 2.0 community is a standard infrastructure for
building creative combinations of data and services.
 Standard formats for exchanging data and schema information, support for data
integration, along with standard query languages and protocols for querying remote data
sources provide a platform for the easy development of mashups.
1.5 STATISTICAL PROPERTIES OF SOCIAL NETWORKS
 To study social networks, we first represent them as graphs. We want to understand the
structural patterns and properties of these graphs.
Two types of properties:
 Static properties: describing the structure of snapshots of graphs.
 Dynamic properties: describing how the structure evolves over time.
 These properties may be for unweighted or weighted graphs, where weights may represent
multi-edges (e.g. multiple phone calls from one person to another), or edge weights (e.g.
monetary amounts between a donor and a recipient in a political donation network).
 Properties to Understand:
a) What do social networks look like, on a large scale?
b) How do networks behave over time?
c) How do the different components of an entire network form?
d) How do the non-giant weakly connected components behave over time?
e) What distributions and patterns do weighted graphs maintain?
f) What happens when we take into account multiple edges and weighted edges?
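The two kinds of weight mentioned above can be illustrated with a short sketch. It assumes a hypothetical list of (source, target, amount) records; the multi-edge weight of a pair is the number of records between them, and the edge weight is the sum of the amounts.

```python
from collections import defaultdict


def build_weighted_graph(records):
    """Aggregate (source, target, amount) records into two edge-weight maps."""
    multi_edge_counts = defaultdict(int)  # how many times each edge occurs
    total_amounts = defaultdict(float)    # summed amount carried by each edge
    for source, target, amount in records:
        multi_edge_counts[(source, target)] += 1
        total_amounts[(source, target)] += amount
    return dict(multi_edge_counts), dict(total_amounts)


# Hypothetical donation records: (donor, recipient, amount).
records = [
    ("ann", "bob", 10.0),
    ("ann", "bob", 5.0),
    ("bob", "cat", 2.5),
]
counts, totals = build_weighted_graph(records)
```

Dropping both maps and keeping only the set of keys gives the unweighted version of the same graph.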
1.5.1 Static Properties

 While all networks we examine are evolving over time, there are properties that are
measured at single points in time, that is, static snapshots of the graphs. For the purposes
of organization we will further divide these properties into those applying to unweighted
graphs and to weighted graphs.
1.5.1.1 Static Unweighted Graphs
 Here, we present the ‘laws’ that apply to static snapshots of real graphs
without considering the weights on the edges. These include patterns in
degree distributions, the number of hops in which pairs of nodes can reach each other, the local
number of triangles, eigenvalues and communities. Next, we describe these patterns
in more detail.
1) S-1: Heavy-tailed Degree Distribution
 The degree distribution of many real graphs obeys a power law $f(d) \propto d^{-\alpha}$, with
$\alpha > 0$, where $f(d)$ is the fraction of nodes with degree $d$.
 This means that real graphs contain many low-degree nodes but only a few high-degree
nodes.
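As a small illustration, f(d) can be computed directly from an edge list. The sketch below assumes an undirected graph given as a list of node pairs; the star graph is a toy example, far too small to show a real power law.

```python
from collections import Counter


def degree_distribution(edges):
    """Return {degree: fraction of nodes with that degree} for an undirected graph.

    Nodes that appear in no edge (degree 0) are not represented.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    node_count = len(degree)
    nodes_per_degree = Counter(degree.values())
    return {d: c / node_count for d, c in sorted(nodes_per_degree.items())}


# Toy star graph: hub 0 connected to four leaves.
edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
f = degree_distribution(edges)
```

Even in this toy graph most nodes have low degree and a single node has high degree; on real data one would plot f(d) against d on log-log axes and look for an approximately straight line.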
2) S-2: Small Diameter
 The diameter of a static graph is the maximum distance between any two nodes.
 Real world graphs often have small diameters.
 This is known as the ‘small-world phenomenon’ or the ‘six degrees of separation’.
 The diameter can be hijacked (inflated) by long chains.
 Therefore we use the effective diameter which is the minimum number of hops in which
some fraction (usually 90%) of all connected node pairs can be reached.
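The definition of the effective diameter can be sketched with breadth-first search. This is a minimal, quadratic-time illustration for small unweighted graphs; the adjacency-dictionary format and the function names are assumptions made for the example, and large graphs would need approximation methods.

```python
from collections import deque
import math


def hop_counts(adj):
    """Sorted shortest-path lengths over all connected ordered node pairs."""
    dists = []
    for source in adj:
        seen = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        dists.extend(d for node, d in seen.items() if node != source)
    return sorted(dists)


def effective_diameter(adj, fraction=0.9):
    """Smallest hop count within which `fraction` of connected pairs are reachable."""
    dists = hop_counts(adj)
    if not dists:
        return 0
    index = math.ceil(fraction * len(dists)) - 1
    return dists[index]


# Toy chain 0-1-2-3-4: every node listed with its neighbours.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
diameter = hop_counts(adj)[-1]
eff = effective_diameter(adj)
```

In the chain the diameter is 4 hops, while 90% of the connected pairs are already within 3 hops of each other, which is why the effective diameter is less sensitive to a few long chains.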
3) S-3: Triangle Power Law (TPL)
 The number of nodes $f(\Delta)$ that participate in $\Delta$ triangles follows a
power law of the form $f(\Delta) \propto \Delta^{\sigma}$, with the exponent $\sigma < 0$.
 TPL means that
 many nodes have only a few triangles in their neighborhoods, and
 a few nodes participate in many triangles with their neighbors.
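The quantity behind the TPL, the number of triangles each node participates in, can be counted as in the sketch below. It assumes an undirected graph stored as a dictionary of neighbour sets; the four-node graph is a toy example.

```python
from itertools import combinations


def triangles_per_node(adj):
    """Count, for every node, the triangles it forms with pairs of its neighbours."""
    counts = {}
    for node, neighbors in adj.items():
        # A pair of neighbours (a, b) closes a triangle if a and b are linked.
        counts[node] = sum(
            1 for a, b in combinations(sorted(neighbors), 2) if b in adj[a]
        )
    return counts


# Toy graph with edges a-b, a-c, a-d, b-c, c-d (two triangles: abc and acd).
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}
counts = triangles_per_node(adj)
total_triangles = sum(counts.values()) // 3  # each triangle is seen from 3 nodes
```

A histogram of these per-node counts is what the TPL describes for real graphs.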
4) S-4: Eigenvalue Power Law (EPL)
 The eigenvalues of a graph are defined as the eigenvalues of its adjacency matrix. The set
of eigenvalues of a graph is called a graph spectrum.
 (For a matrix $A$, if there is a nonzero vector $X$ such that $AX = \lambda X$ for some
scalar $\lambda$, then $\lambda$ is an eigenvalue of $A$ associated with the eigenvector $X$.)
 EPL states that the 20 or so largest eigenvalues of the Internet graph are power-law
distributed. It has been shown that the Eigenvalue Power Law is a consequence of the
Degree Power Law.
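To make the definition concrete, the largest (principal) eigenvalue of a small adjacency matrix can be approximated by power iteration. The sketch below is an illustration in pure Python, not the method used in the studies cited; it iterates with A + I rather than A (an implementation choice so that the iteration also settles on bipartite graphs) and subtracts 1 at the end.

```python
def principal_eigenvalue(adj_matrix, iterations=200):
    """Approximate the largest eigenvalue of a 0-1 symmetric adjacency matrix."""
    n = len(adj_matrix)
    x = [1.0] * n
    shifted_value = 1.0
    for _ in range(iterations):
        # y = (A + I) x, computed row by row.
        y = [
            x[i] + sum(adj_matrix[i][j] * x[j] for j in range(n))
            for i in range(n)
        ]
        shifted_value = max(abs(v) for v in y)  # estimate of lambda_1 + 1
        x = [v / shifted_value for v in y]      # renormalise to avoid overflow
    return shifted_value - 1.0


# Toy graphs: a triangle and a star with four leaves.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
star = [
    [0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
lambda_triangle = principal_eigenvalue(triangle)
lambda_star = principal_eigenvalue(star)
```

Both toy graphs have principal eigenvalue 2: for the triangle every node has two neighbours, and for a star with k leaves the value is the square root of k.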
5) S-5: Community Structure
 Real-world graphs exhibit a modular structure, with nodes forming groups, and possibly
groups within groups.
 In other words, the nodes form communities in which nodes of the same
community are more tightly connected to each other than to nodes outside the
community.
1.5.1.2 Static Weighted Graphs

 We consider weighted directed graphs.
 Data set: records in the form (IP-source, IP-destination, timestamp, number-of-packets).
 We can have multi-edges and weights.
Notations:
 W(t): the total weight up to time t
 E(t): the number of distinct edges up to time t
 Ed(t): the number of multi-edges (d stands for duplicate edges) up to time t
 N(t): the number of nodes up to time t
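The four quantities can be computed from such records as in the sketch below. The record values are made up, and the reading of Ed(t) as the count of all edge records up to time t, duplicates included, is an assumption, since the notes do not spell out how duplicates are counted.

```python
def snapshot_counts(records, t):
    """Compute W(t), E(t), Ed(t) and N(t) from (src, dst, timestamp, packets) records."""
    distinct_edges = set()
    nodes = set()
    total_weight = 0
    multi_edges = 0
    for src, dst, timestamp, packets in records:
        if timestamp > t:
            continue  # only records up to time t belong to the snapshot
        total_weight += packets
        multi_edges += 1  # assumption: Ed(t) counts every record, duplicates included
        distinct_edges.add((src, dst))
        nodes.update((src, dst))
    return {
        "W": total_weight,
        "E": len(distinct_edges),
        "Ed": multi_edges,
        "N": len(nodes),
    }


# Hypothetical traffic records: (IP-source, IP-destination, timestamp, packets).
records = [
    ("10.0.0.1", "10.0.0.2", 1, 5),
    ("10.0.0.1", "10.0.0.2", 2, 3),
    ("10.0.0.2", "10.0.0.3", 3, 7),
    ("10.0.0.3", "10.0.0.1", 5, 1),
]
at_t3 = snapshot_counts(records, 3)
at_t5 = snapshot_counts(records, 5)
```

Evaluating the function at increasing values of t yields the series of snapshots on which the weighted power laws are measured.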
1) SW-1: Weight Power Law (WPL)
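 Between $W(t)$ and $E(t)$ we observe that $W(t) = E(t)^{w}$, where $w$ ranges from 1.01 to 1.5.
 This means that more edges in the graph imply superlinearly higher total weight.
 We also have
$N(t) = E(t)^{n}$, $E_d(t) = E(t)^{dupe}$,
$N_{src}(t) = E(t)^{nsrc}$, $N_{dst}(t) = E(t)^{dst}$,
where $N_{src}(t)$ and $N_{dst}(t)$ are the numbers of source and destination nodes up to time t.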
2) SW-2: Edge Weights Power Law
 Given a real-world graph with nodes i and j of weights $w_i$ and $w_j$, and the edge
$e_{i,j}$ between them with weight $w_{i,j}$, the edge weight follows a power law in the
weights of its two end nodes.
 This means that the weight of a given edge and weights of its neighboring two nodes are
correlated (similar to Newton’s Gravitational Law).
3) SW-3: Snapshot Power Laws (SPL)
 Consider the i-th node of a weighted graph at time t (a snapshot), and let $out_i$ and $outw_i$
be its out-degree and out-weight. Then
$outw_i \propto out_i^{ow}$,
 where $ow$ is the out-weight exponent of the SPL. The same holds for the in-degree, with
in-weight exponent $iw$.
 The exponents iw and ow take values in the range [0.9-1.2] and [0.95-1.35], respectively.
 The exponent over time remains almost constant.
1.5.2 Dynamic Properties
 These are typically studied by looking at a series of static snapshots and seeing how
measurements of these snapshots compare. Like the static properties we presented
previously, we also divide these into properties that take into account weights and those
that don’t.

1.5.2.1 Dynamic Unweighted Graphs
 The patterns in dynamic, time-evolving graphs that do not consider edge
weights include the shrinking diameter property, the densification law, the secondary
connected components whose size oscillates around a constant, the largest eigenvalue law and
the bursty and self-similar edge additions over time. We next describe these laws in
detail.
1) D-1: Shrinking Diameter
 It can be observed that not only is the diameter of real graphs small, but it also shrinks
and then stabilizes over time.
 There is a ‘gelling point’ at which many small disconnected components merge and form
the largest connected component in the graph.
 This is like the ‘coalescence’ of the graph at which point the diameter ‘spikes’.
 Afterwards, with new edges the diameter keeps shrinking until it reaches an equilibrium.
 The vertical line marks the gelling point.
2) D-2: Densification Power Law (DPL)
 The relationship between $E(t)$ and $N(t)$ (the number of edges and nodes at time t) follows
the Densification Power Law
$E(t) \propto N(t)^{\beta}$,
 where $\beta$ is the densification exponent, with values between 1.03 and 1.7.
 This indicates a superlinear relationship between the number of nodes and the number of edges,
which explains the densification effect.
 In the accompanying plots, the good linear fit in panel (c) agrees with the DPL, and
panel (d) shows the corresponding component sizes.
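One way to estimate the densification exponent β from a series of snapshots is a least-squares line fit on log-log axes, assuming the DPL has the form E(t) ∝ N(t)^β. The sketch below uses synthetic snapshot sizes generated with exponent 1.2, so it only demonstrates the fitting step, not a measurement on real data.

```python
import math


def fit_power_law_exponent(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    variance = sum((a - mean_x) ** 2 for a in log_x)
    return covariance / variance


# Synthetic snapshots: node counts and edge counts following E = N^1.2 exactly.
node_counts = [100, 200, 400, 800]
edge_counts = [n ** 1.2 for n in node_counts]
beta = fit_power_law_exponent(node_counts, edge_counts)
```

The same function can be reused for the other power laws in this unit, for example to estimate the exponent relating W(t) to E(t).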
3) D-3: Diameter-plot and Gelling point
 Real graphs exhibit a gelling point, at which the diameter spikes and (several)
disconnected components gel into a giant component.
 Before that point, the graph is more or less in an establishment period, typically
consisting of a collection of small, disconnected components.
 After the gelling point, the graph obeys the expected rules.
 Example: the PostNet data set.
4) D-4: Constant/Oscillating NLCCs
 After the gelling point, the secondary and tertiary connected components (the next-largest
connected components, NLCCs) remain of approximately constant size, with small oscillations.
 New nodes typically link to the giant connected component (GCC).
 Very few of the newcomers link to the 2nd (or 3rd) connected component (CC), helping them
to grow slowly.
 In very rare cases, a newcomer links both to an NLCC and to the GCC, thus leading to the
absorption of the NLCC into the GCC.
 At that point, we have a drop in the size of the 2nd CC.
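The behaviour of the largest and next-largest components can be followed by recomputing component sizes after each batch of new edges. The sketch below uses a simple union-find structure on an undirected toy graph; the node labels and the three edge batches are invented for illustration.

```python
def component_sizes(nodes, edges):
    """Return the sizes of the connected components, largest first."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the root, shortening the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merge the two components

    sizes = {}
    for n in nodes:
        root = find(n)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


nodes = [1, 2, 3, 4, 5, 6, 7]
early = [(1, 2), (2, 3), (3, 4), (5, 6)]
later = early + [(6, 7)]   # a newcomer joins the 2nd component
merged = later + [(4, 5)]  # a link between the 2nd component and the largest one

sizes_early = component_sizes(nodes, early)
sizes_later = component_sizes(nodes, later)
sizes_merged = component_sizes(nodes, merged)
```

In the last step the second component is absorbed into the largest one, so the size of the 2nd CC drops from 3 to nothing.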

5) D-5: LPL: Principal eigenvalue over time
 The principal eigenvalue $\lambda_1(t)$ of the 0-1 adjacency matrix $A$ and the number of edges
$E(t)$ over time follow a power law with exponent $\alpha$ less than 0.5, especially after the
‘gelling point’, i.e. $\lambda_1(t) \propto E(t)^{\alpha}$.
1.5.2.2 Dynamic Weighted Graphs
1) DW-1: Bursty/self-similar weight additions
 This pattern is found by tracking how much weight a graph gains in each time interval
(i.e. $\Delta W(t)$) and looking at the entropy plots.
 The weight additions over time show self-similarity.
 If the edge weight is the number of reoccurrences of that edge, the slope of the plot is greater
than 0.95 (more uniform). For other features used as edge weight, the weight additions are more
bursty, with the slope as low as 0.6 for the Network Traffic dataset.
2) DW-2: LWPL: Weighted principal eigenvalue over time
 ($\lambda_{1,w}$ Power Law, LWPL) Weighted real graphs exhibit a power law between the largest
eigenvalue $\lambda_{1,w}(t)$ of the weighted adjacency matrix $A_w$ and the number of edges
$E(t)$ over time, that is, $\lambda_{1,w}(t) \propto E(t)^{\beta}$. The exponent $\beta$ ranges
from 0.5 to 1.6.

Applications of these Laws
 These patterns are helpful for
1. Spotting anomalous graphs and sub-graphs,
2. Answering questions about entities in a network, and
3. Answering questions about what-if scenarios.
 Spotting anomalies is vital for
1. Determining abuse of networks
2. Fraudulent reputation building (in e-auction systems)
3. Detection of dwindling/abnormal social sub-groups
4. Network intrusion detection
 Analyzing network properties is also useful for
1. Identifying authorities and search algorithms,
2. Discovering the “network value” of customers
3. Improving recommendation systems
 What-if scenarios are vital for
1. Extrapolation,
2. Provisioning and
3. Algorithm design
1.6 NETWORK ANALYSIS
 Social Network Analysis (SNA) is the study of social relations among a set of actors.
 The key difference between network analysis and other approaches to social science is
the focus on relationships between actors rather than the attributes of individual actors.
 Network analysis takes a global view of social structures, based on the belief that types
and patterns of relationships emerge from individual connectivity and that the presence
(or absence) of such types and patterns has substantial effects on the network and its
constituents.

 The network structure provides opportunities and imposes constraints on the individual
actors by determining the transfer or flow of resources (material or immaterial) across the
network.
 SNA is thus a different approach to social phenomena and therefore requires a new set of
concepts and new methods for data collection and analysis.
 Network analysis provides a vocabulary for describing social structures, provides formal
models that capture the common properties of all (social) networks and a set of methods
applicable to the analysis of networks in general.
 The concepts and methods of network analysis are grounded in a formal description of
networks as graphs. Methods of analysis primarily originate from graph theory as these
are applied to the graph representation of social network data.
 The methods of data collection in network analysis are aimed at collecting relational data
in a reliable manner. Data collection is typically carried out using standard questionnaires
and observation techniques that aim to ensure the correctness and completeness of
network data.
 Often records of social interaction (publication databases, meeting notes, newspaper
articles, documents and databases of different sorts) are used to build a model of social
networks.
1.7 DEVELOPMENT OF SOCIAL NETWORK ANALYSIS
 The field of Social Network Analysis today is the result of the convergence of several
streams of applied research in sociology, social psychology and anthropology.
 Many of the concepts of network analysis have been developed independently by various
researchers often through empirical studies of various social settings.
 For example, many social psychologists of the 1940s found a formal description of social
groups useful in depicting communication channels in the group when trying to explain
processes of group communication.
 Already in the mid-1950s anthropologists found network representations useful in
generalizing actual field observations, for example when comparing the level of
reciprocity in marriage and other social exchanges across different cultures.

 Despite the various efforts, each of the early studies used a different set of concepts and
different methods of representation and analysis of social networks.
 The term “social network” was introduced by Barnes in 1954.
 This convergence was facilitated by the adoption of a graph representation of social
networks usually credited to Moreno.
 What Moreno called a sociogram was a visual representation of social networks as a set of
nodes connected by directed links.
 The nodes represented individuals in Moreno’s work, while the edges stood for personal
relations.
 However, similar representations can be used to depict a set of relationships between any
kind of social unit such as groups, organizations, nations etc.
 For example, a sociogram can depict the network among workers (W), solderers (S) and
inspectors (I).
 While 2D and 3D visual modeling is still an important technique of network analysis, the
sociogram is honored mostly for opening the way to a formal treatment of network
analysis based on graph theory.
 One of the relatively new areas of network analysis is the analysis of networks in
entrepreneurship, an active area of research that builds and contributes to organization
and management science.

 The vocabulary, models and methods of network analysis also expand continuously
through applications that require handling ever more complex data sets.
 An example of this process is the advances in dealing with longitudinal data. New
probabilistic models are capable of modelling the evolution of social networks and
answering questions regarding the dynamics of communities.
 Formalizing an increasing set of concepts in terms of networks also contributes to both
developing and testing theories in more theoretical branches of sociology.
 The increasing variety of applications and related advances in methodology can be best
observed at the yearly Sunbelt Social Networks Conference series, which started in 1980.
The field of Social Network Analysis has also had a dedicated journal, Social Networks, since 1978.
 While the field of network analysis has been growing steadily from the beginning, there
have been two developments in the last two decades that led to an explosion in network
literature.
 First, advances in information technology brought a wealth of electronic data and
significantly increased analytical power.
 Second, the methods of SNA are increasingly applied to networks other than social
networks such as the hyperlink structure on the Web or the electric grid.
1.8 KEY CONCEPTS AND MEASURES IN NETWORK ANALYSIS
1.8.1 Network components
 Actors (nodes, points, vertices):
1. Individuals, Organizations, Events …
2. Can have properties (attributes)
 Relations (lines, arcs, edges, ties): between pairs of actors.
1. Undirected (symmetric) / Directed (asymmetric)
2. Binary / Valued
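As a minimal sketch of these distinctions, a relation can be stored as a mapping from ordered pairs of actors to tie values. Dropping the values gives a binary relation, and dropping the direction gives an undirected one. The actor names, attributes and tie values below are made up for illustration, and the helper names `to_binary` and `to_undirected` are hypothetical.

```python
# Hypothetical example data: actors with one attribute each.
actors = {
    "ann": {"type": "individual"},
    "bob": {"type": "individual"},
    "acme": {"type": "organization"},
}

# Directed, valued relation: (sender, receiver) -> tie strength.
directed_valued = {
    ("ann", "bob"): 3,
    ("bob", "ann"): 1,
    ("ann", "acme"): 2,
}


def to_binary(valued_ties, threshold=1):
    """Dichotomize a valued relation: keep pairs whose value is >= threshold."""
    return {pair for pair, value in valued_ties.items() if value >= threshold}


def to_undirected(directed_ties):
    """Symmetrize a directed binary relation: a tie in either direction counts."""
    return {frozenset(pair) for pair in directed_ties}


binary = to_binary(directed_valued, threshold=2)
undirected = to_undirected(to_binary(directed_valued))
```

Here `binary` keeps only the two ties with strength 2 or more, while `undirected` merges the two ann/bob ties into a single symmetric tie.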

 Most network analysis methods work on an abstract, graph based representation of real
world networks.
 The units of interest in a network are the combined sets of actors and their relations.
 We represent actors with points and relations with lines.
 In general, a relation can be:
 Undirected / Directed
 Binary / Valued

1.8.2 Types of networks
 We can examine networks across multiple levels:
1. Ego network
2. Partial network
3. Complete or “Whole” network
1. Ego network
 Have data on a respondent (ego) and the people they are connected to (alters).
 May include estimates of connections among alters
 Measures:
1. Size
2. Types of relations
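A small sketch of how an ego network might be pulled out of a larger undirected network, assuming the network is stored as a dictionary that maps each actor to the set of its neighbours. The data and the function name `ego_network` are hypothetical.

```python
def ego_network(adjacency, ego):
    """Return ego's alters and the ties that exist among those alters."""
    alters = set(adjacency[ego])
    alter_ties = {
        frozenset((a, b))
        for a in alters
        for b in adjacency[a]
        if b in alters
    }
    return alters, alter_ties


# Hypothetical undirected network: actor -> set of neighbours.
adjacency = {
    "ego": {"a", "b", "c"},
    "a": {"ego", "b"},
    "b": {"ego", "a"},
    "c": {"ego", "d"},
    "d": {"c"},
}

alters, alter_ties = ego_network(adjacency, "ego")
size = len(alters)  # the "size" measure of the ego network
```

Actor "d" is a contact of a contact, so it is left out of the ego network; tracing such contacts would lead to a partial network.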

2. Partial network
 Ego networks plus some amount of tracing to reach contacts of contacts
 Something less than a full account of connections among all pairs of actors in the relevant
population
3. Complete or “Whole” network
 Connections among all members of a population.
 Data on all actors within a particular (relevant) boundary.
 Never exactly complete (due to missing data), but boundaries are set
 E.g.: Friendships among workers in a company
 Measures:
1. Graph properties
2. Density
3. Sub-groups
4. Positions
1.8.3 Basic data structures
1. From pictures to matrices

2. From matrices to lists
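The two conversions can be sketched on a small made-up network, since the original figures are not reproduced here. The sketch assumes an undirected binary network stored as a symmetric 0/1 adjacency matrix; the labels and function names are hypothetical.

```python
# Hypothetical undirected network a - b - c - d as an adjacency matrix.
labels = ["a", "b", "c", "d"]
matrix = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]


def matrix_to_edge_list(matrix, labels):
    """List each undirected tie once by scanning the upper triangle."""
    n = len(labels)
    return [
        (labels[i], labels[j])
        for i in range(n)
        for j in range(i + 1, n)
        if matrix[i][j]
    ]


def matrix_to_adjacency_list(matrix, labels):
    """Map every actor to the list of actors it is tied to."""
    return {
        labels[i]: [labels[j] for j, value in enumerate(row) if value]
        for i, row in enumerate(matrix)
    }


edge_list = matrix_to_edge_list(matrix, labels)
adjacency_list = matrix_to_adjacency_list(matrix, labels)
```

For sparse networks the list forms are usually much smaller than the matrix, which grows with the square of the number of actors.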
1.8.4 Measuring networks
1. Connectivity
 Indirect connections are what make networks systems. One actor can reach another if
there is a path in the graph connecting them.

Basic elements:
 A path is a sequence of nodes and edges starting with one node and ending with another,
tracing the indirect connection between the two. On a path, you never go backwards or
revisit the same node twice.
Example: a → b → c → d
 A walk is any sequence of nodes and edges, and may go backwards.
Example: a → b → c → b → c → d
 A cycle is a path that starts and ends with the same node.
Example: a → b → c → a
 If you can trace a sequence of relations from one actor to another, then the two are
connected. If there is at least one path connecting every pair of actors in the graph, the
graph is connected; a maximal set of mutually reachable actors is called a component.
 Intuitively, a component is the set of people who are all connected by a chain of relations.
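Components can be found by repeatedly following ties outward from an actor that has not been visited yet. The sketch below uses breadth-first search on a hypothetical undirected network stored as a dictionary of neighbour sets; it is one common way to do this, not necessarily the one used by any particular package.

```python
from collections import deque


def components(adjacency):
    """Split an undirected network into its connected components."""
    seen = set()
    result = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        result.append(component)
    return result


# Hypothetical network with two separate groups.
adjacency = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b"},
    "d": {"e"},
    "e": {"d"},
}

parts = components(adjacency)
is_connected = len(parts) == 1
```

In this example "a" can reach "c" through "b", but nobody in the first group can reach "d" or "e", so the graph is not connected.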
2. Distance and number of paths
 Distance is measured by the (weighted) number of relations separating a pair, using the
shortest path.

Actor “a” is:
1 step from 4 actors
2 steps from 5 actors
3 steps from 4 actors
4 steps from 3 actors
5 steps from 1 actor
 Paths are the different routes one can take. Node-independent paths are particularly
important.
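A minimal sketch of measuring distance, assuming an unweighted, undirected network so that every relation counts as one step (the weighted case would need a different algorithm such as Dijkstra's). The network is hypothetical; `reach` tallies how many actors sit at each distance from the starting actor.

```python
from collections import Counter, deque


def geodesic_distances(adjacency, source):
    """Breadth-first search: shortest-path length from source to each reachable actor."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


# Hypothetical undirected network.
adjacency = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

dist = geodesic_distances(adjacency, "a")
# Number of actors at each distance from "a" (excluding "a" itself).
reach = Counter(d for node, d in dist.items() if node != "a")
```

Here "a" reaches "d" by two different routes of length 2 (through "b" or through "c"), which illustrates that the distance is a single number while the paths realizing it can be several.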

3. Centrality
 Centrality refers to (one dimension of) location, identifying where an actor resides in a
network.
 Centrality is fairly straightforward: we want to identify which nodes are in the ‘center’ of
the network, in the sense that they have many and important connections.
 Three standard centrality measures capture a wide range of “importance” in a network:
1. Degree
2. Closeness
3. Betweenness
3.1. Degree centrality
 No. of nodes adjacent to given node
 Often used as measure of a node’s degree of connectedness and hence also influence
and/or popularity
 Useful in assessing which nodes are central with respect to spreading information and
influencing others in their immediate ‘neighborhood’
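 Nodes 3 and 4 have the highest degree (4).
 Formula: CD(pk) = Σi a(pi, pk), where a(pi, pk) = 1 if pi and pk are adjacent and 0 otherwise.
Normalized: C’D(pk) = CD(pk) / (n-1)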
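A short sketch of degree centrality in the sense of Freeman (1979): the raw value is the number of adjacent nodes, and dividing by n-1 gives a value between 0 and 1. The four-node network below is hypothetical and is not the one from the missing figure.

```python
def degree_centrality(adjacency):
    """Return (raw, normalized) degree centrality for every node."""
    n = len(adjacency)
    raw = {node: len(neighbours) for node, neighbours in adjacency.items()}
    normalized = {node: degree / (n - 1) for node, degree in raw.items()}
    return raw, normalized


# Hypothetical undirected network: a triangle 1-2-3 with node 4 attached to 3.
adjacency = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2, 4},
    4: {3},
}

raw, normalized = degree_centrality(adjacency)
```

Node 3 is adjacent to every other node, so its normalized degree is 1.0, the largest possible value.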

3.2. Closeness centrality
 An actor is considered important if he/she is relatively close to all other actors.
 Sum of geodesic distances to all other nodes.
 Inverse measure of centrality
 It is a measure of reach, i.e. the speed with which information can reach other nodes from
a given starting node

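 Nodes 3 and 5 have the highest closeness, while node 2 fares almost as well.
 Formula: CC(pk) = 1 / Σi d(pi, pk), where d(pi, pk) is the geodesic distance between pi and pk.
Normalized: C’C(pk) = (n-1) / Σi d(pi, pk)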
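A sketch of normalized closeness, computed as n-1 divided by the sum of geodesic distances to all other nodes. It assumes a connected, unweighted, undirected network (in a disconnected network some distances are undefined). The example network is hypothetical.

```python
from collections import deque


def geodesic_distances(adjacency, source):
    """Breadth-first search distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def closeness_centrality(adjacency):
    """Normalized closeness: (n - 1) / sum of distances; assumes a connected graph."""
    n = len(adjacency)
    result = {}
    for node in adjacency:
        total = sum(geodesic_distances(adjacency, node).values())
        result[node] = (n - 1) / total
    return result


# Hypothetical undirected network: a triangle 1-2-3 with node 4 attached to 3.
adjacency = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2, 4},
    4: {3},
}

closeness = closeness_centrality(adjacency)
```

Because the raw sum of distances is an inverse measure (small sums mean central nodes), taking the reciprocal makes larger values mean "more central".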
3.3. Betweenness centrality
 Number of times a node lies along the shortest path between two others
 Shows which nodes are more likely to be in communication paths between other nodes
 Also useful in determining points where the network would break apart

 Node 5 has the highest betweenness centrality, followed by node 3.
 Betweenness centrality can be defined in terms of probability (1/gij),
gij = number of geodesics that connect actors pi and pj.
gij(pk) = number of geodesics that connect pi and pj and contain pk.
bij(pk) = gij(pk) / gij = probability that actor pk is on a geodesic randomly chosen among the
ones that join pi and pj.
 Betweenness centrality is the sum of these probabilities over all pairs of other actors:
CB(pk) = Σi<j bij(pk) (Freeman, 1979).
 Normalized: C’B(pk) = CB(pk) / [(n-1)(n-2)/2]
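The definition can be turned into code fairly directly. The sketch below counts geodesics with breadth-first search and sums, for each node pk, the share of geodesics between every other pair that pass through pk. It assumes an unweighted, undirected network and favours readability over speed; the example network and the function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def shortest_path_counts(adjacency, source):
    """Return (distance, number of geodesics) from source to each reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                sigma[neighbour] = 0
                queue.append(neighbour)
            if dist[neighbour] == dist[node] + 1:
                sigma[neighbour] += sigma[node]
    return dist, sigma


def betweenness_centrality(adjacency):
    """Return (raw, normalized) betweenness for every node."""
    nodes = list(adjacency)
    info = {s: shortest_path_counts(adjacency, s) for s in nodes}
    raw = {k: 0.0 for k in nodes}
    for i, j in combinations(nodes, 2):
        dist_i, sigma_i = info[i]
        if j not in dist_i:
            continue  # i and j are not connected
        g_ij = sigma_i[j]
        for k in nodes:
            if k == i or k == j:
                continue
            dist_k, sigma_k = info[k]
            # k lies on a geodesic from i to j if the distances add up.
            if k in dist_i and j in dist_k and dist_i[k] + dist_k[j] == dist_i[j]:
                raw[k] += sigma_i[k] * sigma_k[j] / g_ij
    n = len(nodes)
    scale = (n - 1) * (n - 2) / 2
    return raw, {k: value / scale for k, value in raw.items()}


# Hypothetical network: a square 1-2-4-3-1 with node 5 attached to 4.
adjacency = {
    1: {2, 3},
    2: {1, 4},
    3: {1, 4},
    4: {2, 3, 5},
    5: {4},
}

raw, normalized = betweenness_centrality(adjacency)
```

Node 4 scores highest because every route to node 5 passes through it, so removing it would break the network apart.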
3.3.1 Centralization
 If we want to measure the degree to which the graph as a whole is centralized, we look at
the dispersion of centrality
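 Freeman’s general formula for centralization (which ranges from 0 to 1):
CX = Σi [CX(p*) - CX(pi)] / max Σi [CX(p*) - CX(pi)]
where CX(p*) is the largest centrality value in the graph and the maximum in the denominator
is taken over all graphs with n nodes.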
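As an illustration for the degree case, the sum of differences between the largest degree and every other degree is divided by its largest possible value, which for raw degree is (n-1)(n-2) and is reached by a star. The sketch assumes an undirected network with at least three nodes; both example networks are hypothetical.

```python
def degree_centralization(adjacency):
    """Freeman degree centralization of an undirected network (needs n >= 3)."""
    n = len(adjacency)
    degrees = [len(neighbours) for neighbours in adjacency.values()]
    top = max(degrees)
    return sum(top - degree for degree in degrees) / ((n - 1) * (n - 2))


# Hypothetical star: one hub tied to four otherwise unconnected actors.
star = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
    "d": {"hub"},
}

# Hypothetical ring: every actor has exactly two ties.
ring = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}

star_score = degree_centralization(star)
ring_score = degree_centralization(ring)
```

The star is maximally centralized and the ring, where all actors are equally central, is not centralized at all.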

3.3.2 Density
 The more actors are connected to one another, the denser the network will be.
 Density is the number of ties present divided by the number of possible ties.
 Undirected network: n(n-1)/2 possible pairs of actors.
 Directed network: n(n-1) possible lines.
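A worked sketch of density as the number of ties present divided by the number of possible ties. The function name and the example counts are hypothetical.

```python
def density(n_actors, n_ties, directed=False):
    """Share of possible ties that are actually present."""
    possible = n_actors * (n_actors - 1)  # ordered pairs of distinct actors
    if not directed:
        possible //= 2  # each unordered pair is counted once
    return n_ties / possible


undirected_density = density(5, 4)               # 4 of 10 possible pairs
directed_density = density(5, 4, directed=True)  # 4 of 20 possible lines
```

The same four ties give half the density in the directed case, because a directed network of five actors has twice as many possible lines.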
1.8.5 Comparing across centrality values
 Generally, the 3 centrality types will be positively correlated
 When they are not correlated, it probably tells you something interesting about the
network.
High Degree
   Low Closeness: Embedded in cluster that is far from the rest of the network
   Low Betweenness: Ego's connections are redundant - communication bypasses him/her
High Closeness
   Low Degree: Key player tied to important/active alters
   Low Betweenness: Probably multiple paths in the network, ego is near many people, but
   so are many others
High Betweenness
   Low Degree: Ego's few ties are crucial for network flow
   Low Closeness: Very rare cell. Would mean that ego monopolizes the ties from a small
   number of people to many others
1.8.6 Social network software
1. UCINET
 The standard network analysis program, runs in Windows
 Good for computing measures of network topography for single nets
 Data input/output uses a special 2-file format, but it can now read PAJEK files
directly.
 Not optimal for large networks
 Available from: Analytic Technologies
2. PAJEK
 Program for analyzing and plotting very large networks
 Intuitive Windows interface
 Started mainly as a graphics program, but has expanded to a wide range of analytic
capabilities
 Can link to the R statistical package
 Free
 Available from: http://vlado.fmf.uni-lj.si/pub/networks/pajek/
3. NetDraw
 Also very new, but by one of the best known names in network analysis software.
 Free
1.9 DISCUSSION NETWORKS
1.9.1 Electronic discussion networks

 Tyler, Wilkinson and Huberman analyze communication among employees of their own
lab by using the corporate email archive. A tie is drawn between two individuals if they had
exchanged at least a minimum number of total emails in a given period, filtering out
one-way relationships.
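The tie rule described here can be sketched as follows. The message log, the threshold of 3 emails and the function name are assumptions made for illustration; the source only says that a minimum number of total emails was required.

```python
from collections import Counter


def discussion_ties(messages, min_total=3):
    """Undirected ties between pairs who emailed each other in both directions
    and exchanged at least min_total messages in total."""
    sent = Counter(messages)  # (sender, recipient) -> number of emails
    ties = set()
    for (a, b), count_ab in sent.items():
        if a == b:
            continue  # ignore messages to oneself
        count_ba = sent.get((b, a), 0)
        if count_ba == 0:
            continue  # one-way relationship, filtered out
        if count_ab + count_ba >= min_total:
            ties.add(frozenset((a, b)))
    return ties


# Hypothetical log of (sender, recipient) pairs for one period.
messages = (
    [("ann", "bob")] * 2
    + [("bob", "ann")]
    + [("ann", "cat")] * 5          # many emails, but never answered
    + [("cat", "dan"), ("dan", "cat")]  # reciprocal, but below the threshold
)

ties = discussion_ties(messages)
```

Only the ann/bob pair survives: ann's five emails to cat are one-way, and cat and dan exchanged too few messages.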
 Adamic and Adar revisit one of the oldest problems of network research, namely the
question of local search.
 How do people find short paths in social networks based on only local information about
their immediate contacts?
 Their findings support earlier results that additional knowledge on contacts such as their
physical location and position in the organization allows employees to conduct their
search much more efficiently than using the simple strategy of always passing the
message to the most connected neighbor.
 Discussions are largely in email and to a smaller part on the phone and in face-to-face
meetings.
 Group communication and collective decision taking in various settings are traditionally
studied using much more limited written information such as transcripts and records of
attendance and voting.
 The main technical contribution of Gloor is a dynamic visualization of the discussion
network that allows one to quickly identify the moments when key discussions take place that
activate the entire group and not just a few select members.
 Gloor also performs a comparative study across the various groups based on the
structures that emerge over time.
1.9.2 Blogs and online communities
 Content analysis has also been the most commonly used tool in the computer-aided
analysis of blogs (web logs), primarily with the intention of trend analysis for the
purposes of marketing.
 While blogs are often considered as “personal publishing” or a “digital diary”, bloggers
themselves know that blogs are much more than that: modern blogging tools make it easy to
comment on and react to the comments of other bloggers, resulting in webs of
communication among bloggers.
 The figure shows some of the features of blogs that have been used in various studies to
establish the networks of bloggers.

 Blogs make a particularly appealing research target due to the availability of
structured electronic data in the form of RSS (Rich Site Summary) feeds.
 RSS feeds contain the text of the blog posts as well as valuable metadata such as the
timestamp of posts, which is the basis of dynamic analysis.
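As a rough sketch of how such metadata might be read, the snippet below parses a made-up RSS 2.0 fragment with Python's standard library and orders the posts by their `pubDate` timestamp. The feed content and URLs are hypothetical, and real feeds vary (for example Atom feeds use different element names), so this is only an illustration.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 feed with two posts, listed newest first.
RSS_SAMPLE = """
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>Second post</title>
      <link>http://blog.example.org/2</link>
      <pubDate>Tue, 02 Nov 2004 18:00:00 GMT</pubDate>
      <description>Text of the second post.</description>
    </item>
    <item>
      <title>First post</title>
      <link>http://blog.example.org/1</link>
      <pubDate>Mon, 01 Nov 2004 09:30:00 GMT</pubDate>
      <description>Text of the first post.</description>
    </item>
  </channel>
</rss>
"""


def read_posts(rss_text):
    """Extract title, link, text and timestamp of each item, oldest first."""
    root = ET.fromstring(rss_text)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "text": item.findtext("description"),
            # RSS 2.0 dates use the RFC 822 format handled by email.utils.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        })
    return sorted(posts, key=lambda post: post["published"])


posts = read_posts(RSS_SAMPLE)
titles_in_order = [post["title"] for post in posts]
```

With posts ordered in time, links or comments between blogs can be assigned to time windows, which is what makes a dynamic analysis possible.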
 The 2004 US election campaign represented a turning point in blog research,
as it was the first major electoral contest where blogs were exploited as a
method of building networks among individual activists and supporters.
 Online community spaces and social networking services such as MySpace,
LiveJournal cater to socialization even more directly than blogs with features such as
social networking (maintaining lists of friends, joining groups), messaging and photo
sharing.
 Most online social networking services (Friendster, Orkut, LinkedIn and the like)
closely guard their data even from their own users.
 A technological alternative to these centralized services is the FOAF network.

 FOAF profiles are stored on the web site of the users and linked together using
hyperlinks.
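To make this concrete, the sketch below reads a tiny hand-written FOAF profile in RDF/XML and lists the people its owner knows, together with the hyperlink to each contact's own profile. The names and URLs are made up, and parsing RDF as plain XML only works for this particular layout; a general solution would use a proper RDF parser.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

# Hypothetical FOAF profile published on Alice's own web site.
FOAF_SAMPLE = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:name>Alice Example</foaf:name>
    <foaf:knows>
      <foaf:Person>
        <foaf:name>Bob Example</foaf:name>
        <rdfs:seeAlso rdf:resource="http://bob.example.org/foaf.rdf"/>
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>
"""


def read_profile(foaf_xml):
    """Return the profile owner's name and a list of (contact name, profile URL)."""
    root = ET.fromstring(foaf_xml)
    person = root.find("foaf:Person", NS)
    owner = person.findtext("foaf:name", namespaces=NS)
    contacts = []
    for friend in person.findall("foaf:knows/foaf:Person", NS):
        see_also = friend.find("rdfs:seeAlso", NS)
        link = None
        if see_also is not None:
            link = see_also.get(f"{{{NS['rdf']}}}resource")
        contacts.append((friend.findtext("foaf:name", namespaces=NS), link))
    return owner, contacts


owner, contacts = read_profile(FOAF_SAMPLE)
```

Following each `rdfs:seeAlso` link to the next profile and repeating the step is how a crawler could assemble the distributed network.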
 The drawback of FOAF is that at the moment there is a lack of tools for creating and
maintaining profiles as well as useful services for exploiting this network.
 Advantages
1. Easy to create and fast
2. Easy to add links, photos, videos
3. It can be used to create community
 Disadvantage
1. Generally one author
2. Used for personal opinions and reflection
1.9.3 Web-based networks
 The content of Web pages is the most inexhaustible source of information for social
network analysis.
 This content is not only vast, diverse and free to access but also in many cases more up to
date than any specialized database.
 On the downside, the quality of information varies significantly and reusing it for
network analysis poses significant technical challenges.
 There are two features of web pages that are considered as the basis of extracting social
relations: links and co-occurrences.
 The linking structure of the Web is considered as proxy for real world relationships as
links are chosen by the author of the page and connect to other information sources that
are considered authoritative and relevant enough to be mentioned.
 The biggest drawback of this approach is that such direct links between personal pages
are very sparse: due to the increasing size of the Web searching has taken over browsing
as the primary mode of navigation on the Web.
 As a result, most individuals put little effort into creating new links and updating link targets
or have given up linking to other personal pages altogether.

 Features in web pages that can be used for social network extraction.
 Co-occurrences of names in web pages can also be taken as evidence of relationships and
are a more frequent phenomenon.
 On the other hand, extracting relationships based on co-occurrence of the names of
individuals or institutions requires web mining as names are typically embedded in the
natural text of web pages.
 The techniques employed here are statistical methods possibly combined with an analysis
of the contents of web pages.
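One simple statistical approach can be sketched as follows: count the pages on which each name appears, and treat two people as related when the overlap of their page sets is large relative to the union (the Jaccard coefficient). The choice of Jaccard, the threshold of 0.5, the naive substring matching and the page texts are all assumptions for illustration; the text above does not specify a particular measure.

```python
from itertools import combinations


def cooccurrence_ties(pages, names, threshold=0.5):
    """Return {(name_a, name_b): jaccard} for pairs whose score reaches the threshold."""
    # Set of page indices mentioning each name (naive substring match).
    mentions = {
        name: {index for index, text in enumerate(pages) if name in text}
        for name in names
    }
    ties = {}
    for a, b in combinations(names, 2):
        union = mentions[a] | mentions[b]
        if not union:
            continue  # neither name occurs anywhere
        jaccard = len(mentions[a] & mentions[b]) / len(union)
        if jaccard >= threshold:
            ties[(a, b)] = jaccard
    return ties


# Hypothetical page texts.
pages = [
    "Ann Lee and Bob Ray presented a joint paper.",
    "Workshop organizers: Bob Ray, Ann Lee.",
    "Ann Lee gave a keynote talk.",
    "Cat Poe joined the department.",
]
names = ["Ann Lee", "Bob Ray", "Cat Poe"]

ties = cooccurrence_ties(pages, names)
```

In practice the hard part is the web mining itself: recognizing names in free text and telling apart different people with the same name, which this sketch ignores.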
