The Construction and Analysis of
Online Social Networks
Discourse into the use of data mining techniques for the construction and
analysis of online communities.
Kobi Omenaka
@00287065
List of Figures
List of Tables
Introduction
Aims and Objectives
A Social Network
Social Network Theory and Social Network Analysis
Data Mining Evaluation and Research Methods
Extracting Data from Social Media Sources
Text mining using software such as Spinn3r
Programming
NodeXL
Netvizz
Gephi
Method
Results and Analysis
Flickr Results
YouTube Results
Twitter Results
Facebook Results
Shrinkage
Capoeira Vibe Case
Ethics
Quantitative vs. Qualitative
Conclusion
Appendix 1: Table to show the Number, Name, Group ID and Number of Nodes as
Extracted by Netvizz
Appendix 2: Python GitHub Repositories for Social Media Sites
Appendix 3: Other Web Resources
Appendix 3 Other Network Images
Bibliography
List of Figures
Figure 1 Example of Network Diagram showing the Interactions in a Karate Club
(Easley & Kleinberg 2010)
Figure 2 Network Diagram showing the Nodes and Edges on "ARPANET"
Figure 3 Showing the ARPANET network in Relation to its Real World Position in
the US
Figure 4 Image showing screenshot of NodeXL starting page
Figure 5 Image showing the NodeXL imports menu. Including the options to
report from Flickr, Twitter and YouTube
Figure 6 Image showing the imports menu for a Twitter user network within
NodeXL
Figure 7 Screenshot of imported data from NodeXL from YouTube
Figure 8 Image showing the Netvizz User Interface within Facebook
Figure 9 Image showing that the Netvizz data extraction has completed and can
be downloaded
Figure 10 Image Showing the Group Joining Process on Facebook
Figure 11 Network graph produced in Gephi showing a Manchester based
capoeira group
Figure 12 Network Graph Showing the Keywords Associated with the Term
"capoeira" on Flickr
Figure 13 Network Graph created from the "capoeira" tag within YouTube
Figure 14 The Most Frequent Posters of Capoeira Videos on YouTube
Figure 15 Twitter network produced from the hashtag “capoeira” (from 100
tweets)
Figure 16 Twitter network produced from the hashtag “capoeira” (from 1000
tweets)
Figure 17 Network Graph showing 75453 nodes constructed from 241 Capoeira
Groups
Figure 18 Image to show the Top 10 most Connected Nodes from the
Constructed Capoeira Network
Figure 19 Network Graph Showing the "Capoeira Vibe" Network Community
on Flickr
Figure 20 Capoeira Vibe Level 1.0 YouTube Network
Figure 21 Capoeira Vibe Level 2.0 YouTube Network
Figure 22 Capoeira Vibe Twitter Network
Figure 23 Social Technographics Ladder based on U.S. Adults (Li & Bernoff 2011,
p.43)
Figure 24 Network Graph of 241 Capoeira Facebook Groups with Labels
Figure 25 Capoeira Vibe Level 1.5 Network on YouTube
Figure 26 Image showing the Giant Component from the Facebook capoeira
network. In spite of the different colours depicting different groups, each
node can be connected to every other. The longest distance between any two
nodes is 22. This contains 68448 nodes compared to 75453 in the complete
network
List of Tables
Table 1 Key Values from Netvizz Data Extraction of Facebook Capoeira Groups
Introduction
“The web is more a social creation than a technical one. I designed it for social
effect - to help people work together - and not as a technical toy. The ultimate
goal of the web is to support and improve our weblike existence in the world. We
clump into families, associations, and companies. We develop trust across the
miles and distrust around the corner” (Berners-Lee 1999).
This quote from Sir Tim Berners-Lee, the inventor of the World Wide Web, gives
a poignant insight into the direction of the research that will be summarised in
the following pages.
The intention of this research is to provide an independent introduction to the
world of networks, both online and offline, including key theoretical concepts.
It then assesses methods that can be used to construct these networks
from online resources via data mining. This is presented in the form of research
based on both theoretical and real-world networks.
Once the networks have been established, the key questions then follow:
• How can this information be used?
• Are there any ethical concerns related to data mining?
These are discussed towards the end of this report followed by an overall
conclusion.
Aims and Objectives
The aims and objectives of this research project are listed below:
1. Introduce basic social network analysis (SNA) and networking theory.
2. Evaluate the use of various data mining techniques to establish a social
network. This includes the use of programming as well as the more
accessible and freely available forms of SNA software.
3. Use the keyword “capoeira” to construct a social network based on the
data mining techniques discussed and use network analysis theory to
evaluate them.
4. Confirm whether a giant component exists within the capoeira network.
5. Establish a new concept, “shrinkage”, within the network of capoeira on
social networking sites (SNSs).
6. Discuss the ethics of these techniques.
7. Evaluate how valuable the results are without a qualitative aspect.
8. Case study: evaluate the research techniques using the online capoeira
community “Capoeira Vibe”, which has accounts on Facebook, Twitter, Flickr
and YouTube.
By nature human beings are social creatures. Any one individual may form part
of any number of networks and groups connected by ties of family, school,
university friends, teammates, colleagues and so on. An individual’s role within
each group may vary, as will the strength of the ties.
It is interesting to see Berners-Lee’s vision of the Internet as being one of
social concern rather than technical and as being designed, in a sense, to
magnify the social experiences that preceded it. The premise of this research is
to investigate social networks on the Internet.
In order to do this we need to understand networking theory, which is based on
anthropological and socio-geographical research.
A Social Network
A social network is a social arrangement made up of “nodes”, which are typically
individual people who interact with other nodes. The interactions between nodes,
such as family ties, friendship, academic links and business connections, are
known as “edges” and give each network its structure.
On a basic level any network can be defined by the number of nodes and by the
edges that occur within it.
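As a minimal sketch of this definition, a network can be held as a set of nodes and a collection of edges. The names and tie types below are hypothetical and purely illustrative; `degree` is a helper introduced here, not something taken from any of the tools discussed later.

```python
# Illustrative only: the people and ties below are hypothetical.
nodes = {"Ana", "Ben", "Caio", "Dora"}

# Each edge is an unordered pair of nodes, labelled with the type of tie.
edges = {
    frozenset({"Ana", "Ben"}): "family tie",
    frozenset({"Ben", "Caio"}): "friendship",
    frozenset({"Caio", "Dora"}): "business connection",
}


def degree(node, edges):
    """Count how many edges touch the given node."""
    return sum(1 for pair in edges if node in pair)


print(f"{len(nodes)} nodes, {len(edges)} edges")
for name in sorted(nodes):
    print(name, "has", degree(name, edges), "tie(s)")
```

Using `frozenset` for each pair treats the ties as undirected, which suits friendship-style relationships; a directed network (such as Twitter followers) would use ordered tuples instead.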
The concept of a social network is not a new one, but it has come to
particular prominence in the last decade with the advent of Web 2.0 and the
communications made possible via social networking sites (SNSs) on the Internet.
Social networks, of course, predate the Internet. By its very nature the Internet
is perhaps the world’s largest social network and was, in fact, created as a
“network of networks” (Kurose & Ross 2003, p.2) that linked educational
facilities together.
Using the Internet as an oversimplified example of a social network, each
computer could be seen as a node, whilst the connections to other computers
complete the analogy as the edges.
The Internet has facilitated the formation of connections and, to some extent,
the growth of networks. This is because “networks are now freed from the
restriction of a geographical location” (Easley & Kleinberg 2010, p.1).
As people become more web literate and web services improve, it becomes easier
for like-minded people and institutions across the world to find each
other. People who were previously isolated can now find it easier to join
networks.
“What makes social network sites unique is not that they allow
individuals to meet strangers, but rather that they enable users to
articulate and make visible their social networks. This can result in
connections between individuals that would not otherwise be
made,” (boyd & Ellison 2007).
It is difficult to determine how the Internet has affected the number of nodes
within a particular network. What can be seen, however, are the virtual
connections made to real-world networks. An example of this would be the
association with brands such as Coca-Cola. The advent of SNSs doesn't
necessarily mean that there are more Coca-Cola drinkers. Those drinkers can,
however, show their affiliation easily by joining groups or “liking” Coca-Cola’s
page on Facebook, for example.
SNSs, however, aren't restricted to commercial brands: “one finds
networks appearing in discussion and commentary on an enormous range of
topics” (Easley & Kleinberg 2010, p.1). At the time of writing Coca-Cola’s
Facebook page is at number 10 based on the number of “likes” and is the highest
placed tangible product in the list. Also in the top 10 are musicians, sports
stars, TV shows and the brands Facebook and YouTube. Unsurprisingly, Facebook
comes in at number one in the list of Facebook fan pages (Fanpagelist.com 2012).
All of these have something to sell and benefit directly from having contact with
their fans on Facebook, Twitter et al. SNSs can be described as
“web-based services that allow individuals to construct a public or
semi-public profile, articulate a list of other users with whom they share a
connection, and view and traverse their list of connections and those
made by others within the system” (boyd & Ellison 2007).
It’s also easy to support causes such as Greenpeace or Oxfam using “likes” or
“tags” on SNSs. Whilst these global names take up the lion’s share of attention,
smaller groups and fan pages from community centres, small and start-up
businesses, local charities, amateur sports teams and the like should not be
ignored. More and more groups of all shapes and sizes are seeing the benefits of
integrating SNSs, and a common sign of this is the links to SNSs that now appear
on their websites. It is becoming more and more important for small
groups to become involved with the activities on SNSs and be familiar with their
social networks.
The importance of SNSs and networking lies in the fact that any party involved
can initiate dialogues easily. It is this two-way interaction that differentiates
social media from traditional broadcast media such as TV, newspapers and radio.
“individualized messages can simultaneously be delivered to an infinite
number of people, and, each of the people involved shares reciprocal
control over that content.” (Crosbie 2002).
This presents opportunities to exploit the new medium. Both profit and
non-profit organisations can benefit from being in contact with their potential
customers and fan base. Social media is about “reaching and connecting people”
(Trompeter 2010). The fact that it is a low cost way to reach certain audiences
means that non-profits and commercial groups alike can start to experiment
without too much financial investment. The use of “likes”, keywords, tags and
user forums can make it easier to identify target markets and people who might
be interested in supporting causes.
The problem with social media is that it is quite labour-intensive. In a small
organisation the people who will be using social media to connect with people
may not have the time or the ability to make best use of the medium. For this
reason the wisest use of social media would be to identify the potential
audience before initiating the conversation. This approach will be more cost and
time effective and, as a result, better rewards will be reaped. The scattergun
approach of traditional media has now, in a sense, been refined to pinpointing.
This, however, is easier said than done. With around 7 billion people on the
planet, how do you pinpoint a target audience? Do some people in the target
audience have more effect or influence than others?
What about the unidentified people who do use your products but don’t visibly
“like” them on the Internet? According to comScore’s white paper “The Power of
Like”, friends of fans on Facebook typically represent a much larger set of
consumers than the fans themselves: 34 times larger, on average, for the 100
brands studied (Lipsman et al. 2012, p.2).
These questions might be difficult to answer but are important to consider. There
are clearly economic gains to be made from utilising social media and social
networks. “Simple online economics: on the Internet, traffic equals money” (Li &
Bernoff 2011, p.11).
This quote, taken from Li & Bernoff’s book “Groundswell”, could be a strapline
advocating the use of social media for promotion. In this rapidly changing
technological environment, establishing and understanding online social
identities could prove vital for success. It is important to note that, as well
as bringing economic gain, these techniques also apply to academic and
socio-economic research.
This research will address some of the questions highlighted above by looking
further and deeper into how online social networks can be identified and
constructed. Once constructed, the networks can be analysed using SNA
techniques. The purpose of the analysis is to try to determine the size or
extent of the networks created and then establish the key people and
relationships, i.e. the “nodes and edges”. The argument is that knowledge and
understanding of these network qualities can be a key factor in realising
organisational goals.
Not restricting the research to a local area means that a network can be
constructed without having to filter for location. This means, however, that it
might be more difficult to construct a network that is truly representative. In
order to establish meaningful and true relationships on a global level, the
amount of data collected must be significantly higher than for a local area
such as a town or city. This calls for extremely robust and accurate data
collection techniques.
Social Network Theory and Social Network Analysis
Social network analysis (SNA) focuses on the study of patterns of connection in a
wide range of physical and social phenomena (Hansen et al. 2010). It is argued
that analysis of networks is more concerned with the social interactions than
with the individual qualities of the nodes. Traditional social study emphasised
the qualities of the individual, whereas SNA takes an alternative viewpoint.
Individuals are seen as being less important than their actual relationships with
other nodes in the network.
Research has explored everything from foundational physical systems created by
chemical and genetic connections and animal food pyramids to distinctly human
characteristics such as social cohesion, privacy, markets and trust (Hansen et al.
2010). SNA has been used in the field of disease
research, helping to understand how patterns of human contact help or hinder
the spread of contagious diseases in a population. SNA techniques have also been
used in mass surveillance, helping to determine whether any
member of a population is a potential terror threat.
The above shows that the uses of SNA are varied and important. SNA has a history
dating back to 1954, when the social anthropologist J. A. Barnes first coined the
term “Social Network” in his journal article “Class and committees in a Norwegian
island parish”.
The early pioneers of SNA placed much emphasis on the gathering of qualitative
and quantitative data by surveys and interviewing techniques. The quantitative
data, as described above, focuses on the relational ties within any
particular social network.
Qualitative approaches use interviews and open survey questions to generate
data to construct social networks. They are concerned more with how the ties and
relationship bonds were formed within a network and with the experience of it.
This kind of research comes from a more anthropological viewpoint and often
focuses on personal networks including communities, neighbourhoods and
friendship groups. As a result of the data gathering techniques used, the sample
sizes are usually a lot smaller than with quantitative analysis, and the studies
may run over extended periods of time.
Quantitative methods may involve the collection of vast amounts of data before
mathematical equations and algorithms are applied to it. This data can then be
processed using software or spreadsheets such as Microsoft Excel to produce
network diagrams.
This research will focus primarily on the methods for collection and organisation
of quantitative SNA data. It will also compare and contrast qualitative versus
quantitative data as a means of determining contextual usage.
Much recent SNA has focused on the visualisation of these social
networks, such as Figure 1 below, and on applying graph theory to analyse them.
It is the use of graph theory that can highlight certain relationships within
networks.
Figure 1 Example of Network Diagram showing the Interactions in a Karate Club
(Easley & Kleinberg 2010)
It is worth talking about graph theory before moving to the next sections of this
research. This will better explain the results produced from the different
methods.
In the network above the nodes 1 to 34 are aggregated into cliques that are
connected to one another by relations. This positional approach focuses on the
pattern of relations in which individuals are involved. A node’s connections to
other nodes in the network define its position (Burt 1978). In the karate club
shown in Figure 1 it can be seen that nodes 1 and 34 represent two key nodes
within the relationships of the club, but they are not directly connected to
each other.
In graph theory nodes 1 and 34 are not neighbouring nodes, as they are not
connected by a single edge. They are, however, connected by a “path”, which is
simply a chain of nodes joined by edges. A path from node 1 to 34 can be defined
as 1 to 14 to 34. Longer paths can be defined; however, the average shortest path
between all pairs of nodes is often used as a quality of the network. The length
of a path is the number of edges between the two nodes in question. This means
the path described above has a length of two (Jackson 2008, p.25). Consider how
this is important with respect to the spread of contagious diseases.
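The idea of a shortest path can be sketched with a breadth-first search. The adjacency list below is a hypothetical fragment: only the edges 1 to 14 and 14 to 34 come from the text above, while nodes 2 and 3 and their edges are invented to provide a longer alternative route. The function name `shortest_path` is an assumption made for this illustration.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Return one shortest path from start to goal as a list of nodes, or None."""
    previous = {start: None}          # remembers how each node was reached
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:   # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in adjacency.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None                       # the two nodes lie in different components


# Hypothetical fragment of a club network (see the assumptions above).
adjacency = {
    1: [2, 14],
    2: [1, 3],
    3: [2, 34],
    14: [1, 34],
    34: [3, 14],
}

path = shortest_path(adjacency, 1, 34)
path_length = len(path) - 1           # length = number of edges, not nodes
print(path, "has length", path_length)
```

Breadth-first search explores the network one step at a time from the start node, so the first route it finds to the goal is guaranteed to use the fewest edges.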
Another quality of a network is how well “connected” it is. A graph is referred
to as connected if there is a path between every pair of nodes in the social
network. This essentially means that there are no floating sets of nodes and
everything is linked together as one mass “component”. Constructed networks
aspire to exist as connected networks so that dataflow to all nodes is possible,
if not easy. There is no reason, however, to suggest that all networks,
especially social networks, should exist as completely connected ones. It is
common when processing networking data to find separate, distinct components
within the data sets. A graph can then be separated into its components as one
of its characteristics before going further to describe the characteristics of
each component separately. Figure 1 shows an example of a connected network.
In networks that are constructed from large data sets it is common to find a set
of distinct components, one of which is much larger than the rest and is
referred to as a “giant component” (Erdős & Rényi 1960, p.55).
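A minimal sketch of how components, and the largest among them, might be identified is given below. The seven-node network is hypothetical and the function name `connected_components` is chosen for this illustration; it is not presented as the method used by NodeXL, Netvizz or Gephi.

```python
def connected_components(adjacency):
    """Split an undirected network into its connected components (sets of nodes)."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        component = {start}
        stack = [start]
        while stack:                       # explore everything reachable from start
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    stack.append(neighbour)
        seen |= component
        components.append(component)
    return components


# Hypothetical network: one group of four, one pair and one isolated node.
adjacency = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"},
    "e": {"f"}, "f": {"e"},
    "g": set(),
}

components = connected_components(adjacency)
giant = max(components, key=len)           # the largest component
share = len(giant) / len(adjacency)        # fraction of all nodes it contains
print(len(components), "components; largest holds", f"{share:.0%}", "of nodes")
```

The same ratio can be read off the figures quoted in the caption of Figure 26: 68448 of 75453 nodes is roughly 91% of the Facebook capoeira network.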
To consider how this might work a thought experiment can be performed.
Consider a network made up of all the individuals in a university including the
undergraduate students, graduate students, academic staff, administration staff
and facility staff etc. An edge is formed between people who know each other by
name. When considering this network it is conceivable to think of many groups
of people, i.e. components, that cannot be linked together via a
path. This is likely to be the case even if the network is constructed at the
end of the academic year. It is conceivable, for example, that the caretaking
staff in the science faculties could be disconnected from the admin staff in the
geography department. Hence the creation of distinct component parts.
The existence of a “giant component” is relatively easy to explain, as the only
requirement is for two large components to be joined by just one edge. In the
case of universities, students live together, study together, play
sports together, socialise together etc. and so would form a large component in
themselves. The students’ contacts with the lecturing staff would bring academic
staff and their faculties into this component. The academic staff of course have
contact with the administration staff, who are then brought into this component.
It becomes difficult to see who wouldn’t be part of this giant component.
One common attribute of such giant components is how short the paths are from
one node to another. This is referred to as “The Small World Phenomenon”
(Milgram 1967). In an experiment conducted in the mid 1960s, Stanley Milgram
tested the theory that any two people can be linked by short paths. This is a
good early example of data collection in order to produce networking data. At
the time of the experiment the researchers had a limited budget of $680 (a little
over £3,000 today, adjusted for inflation) and no access to any of the networking
data that they would have today. The experiment ran by giving 296 randomly chosen
people a letter that had to be forwarded to a target person. Each of the 296
people was given some information about the target, including his address and
occupation. They were asked to give the letter to someone they knew
personally, with the same instructions to forward it on to the next person. In
terms of graph theory each person was a node and each “forwarding of the letter”
can be referred to as an edge, forming a path from the start person to the
target. Of the 296 starting letters 64 successfully made it to the target. Of
the 64 that made it the average path length was six. This experiment is the
source of the well-known phrase “six degrees of separation”. There are many
problems with this experiment and its reported results, the first being that 232
letters did not reach the target. In addition this experiment was conducted in a
small area around Boston in the United States. It is certainly not conclusive
evidence that there are only six degrees of separation between any two people on
the planet.
Whilst Milgram’s experiment was not conclusive evidence of the small world
phenomenon, this is something that has been tested in numerous ways
over the years. When a large data set exhibits a giant component it is often
found that the number of steps between nodes is small. Leskovec and Horvitz
conducted a more recent experiment, published in 2007, with a larger dataset
using Instant Messenger. After analysing the accounts of 240 million active
Instant Messenger users they found a giant component consisting of most of the
nodes, and that giant component had a mean path length of 6.6 and a median of 7
(Leskovec & Horvitz 2008), close to that found by Milgram in the 1960s.
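As a hedged illustration of how a mean and median path length can be obtained, the sketch below measures the shortest path between every pair of nodes in a small, hypothetical five-node chain. The helper names `distances_from` and `all_pair_distances` are assumptions for this example, and running a search from every node in this way would not scale to a data set of 240 million users without sampling or specialised software.

```python
from collections import deque
from statistics import mean, median


def distances_from(adjacency, source):
    """Breadth-first search: shortest distance (in edges) from source to each node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


def all_pair_distances(adjacency):
    """Shortest path length for every unordered pair of mutually reachable nodes."""
    nodes = sorted(adjacency)
    lengths = []
    for index, source in enumerate(nodes):
        distance = distances_from(adjacency, source)
        for target in nodes[index + 1:]:
            if target in distance:        # skip pairs in different components
                lengths.append(distance[target])
    return lengths


# Hypothetical chain of five people: a - b - c - d - e
adjacency = {
    "a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"],
}

lengths = all_pair_distances(adjacency)
print("pairs:", len(lengths), "mean:", mean(lengths), "median:", median(lengths))
```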
The Instant Messenger experiment is an example of the type of research on
large-scale networks that has become possible in light of access to large data
sets on the Internet. These kinds of data sets were difficult to analyse
previously and, at the scales involved, remain extremely difficult to process
without computing power.
Capoeira as the Basis for Networking Research
In order to add value to a particular type of research it is worth considering
the distinct reasons why a particular data set might be studied (Easley &
Kleinberg 2010):
1. One may care about the subject matter in question.
2. The data set is being used as a proxy for a related network that might be
difficult to measure.
3. There is a search for properties that appear to be common across many
different subjects.
This research is focused on capoeira, a martial art from Brazil that combines
elements of dance, acrobatics, fighting and
music. Capoeira has spread all over the world, with many Brazilian instructors
teaching in countries throughout the Americas and on every other
continent. In addition it is common for capoeira practitioners to travel outside
their home country to attend events. The spread of capoeira in real terms is
large, and the contacts that “capoeiristas” (practitioners of capoeira) have with
each other are frequent, varied and strong enough to form links on SNSs such as
Facebook (Essien 2008; Green & Svinth 2010). The spread of capoeira itself has
been in large part due to the advent of social media and in particular the
sharing of videos on sites such as YouTube. Research into the structure of
capoeira should be an interesting case study, giving insight into the structure
of martial arts, dance, music and national communities and clubs as well as
emerging activities such as free running and parkour.
In relation to the three points noted above:
1. The subject matter of capoeira is a key interest to the researcher.
2. The results from this can be related to other emerging pastimes and
hobbies.
3. It is of interest to see how attributes such as giant components and the small
world phenomenon manifest themselves within this network.
From a research point of view, the word “capoeira” is a strong keyword.
The affiliation with the word “capoeira” is strong and it is likely that only
people with more than a passing interest, i.e. practitioners, will search for it,
let alone affiliate themselves with it as part of a network. Capoeira is a single
word that can be tricky to spell, but those that know it are unlikely to spell it
incorrectly. It would be hard to accidentally type “capoeira” as a keyword. In
spite of each social media service being different, the word “capoeira” should
highlight interested participants across all of them. This will identify social
networks in
each social media service (Jones 2011). One key point worth noticing is that
capoeira is a singular term, so the number of variables involved in searching for
this keyword is limited. This should vastly reduce the amount of irrelevant,
inconsequential data that is collected.
It is interesting to conclude this section by considering the earliest example of an
online network i.e. the Internet. The Internet was first started in 1970 as
ARPANET, which was a network of 13 U.S.-based universities. If each
university is considered a node, the networking diagram can be drawn as
shown below in figures 2 and 3.
Figure 2 Network Diagram showing the Nodes and Edges on "ARPANET"
Figure 2 shows a networking diagram that we now know to be connected, as
every node is joined by an edge to one or more of the other nodes. This
was the intention when ARPANET was first developed: to ensure that in the event
of one edge going down, information contained in one node would still be
accessible.
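The idea of a connected network that survives the loss of one edge can be illustrated with a short Python sketch. The five-node ring and chain used here are hypothetical examples chosen for illustration and are not the real ARPANET topology; the function names are likewise assumptions.

```python
from collections import deque


def is_connected(nodes, edges):
    """Return True if every node can be reached from every other node."""
    nodes = set(nodes)
    if not nodes:
        return True
    neighbours = {node: set() for node in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search from an arbitrary start node
        current = queue.popleft()
        for nxt in neighbours[current] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen == nodes


def survives_any_single_edge_failure(nodes, edges):
    """Return True if the network stays connected whichever one edge is removed."""
    edges = list(edges)
    return all(
        is_connected(nodes, edges[:i] + edges[i + 1:])
        for i in range(len(edges))
    )


# Hypothetical topologies, NOT the real ARPANET layout.
NODES = ["A", "B", "C", "D", "E"]
RING = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "A")]
CHAIN = RING[:-1]  # the same ring with the closing edge left out

if __name__ == "__main__":
    print(survives_any_single_edge_failure(NODES, RING))   # True
    print(survives_any_single_edge_failure(NODES, CHAIN))  # False
```

Both example networks are connected, but only the ring keeps every node reachable after any single edge fails, which is the property described above.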
Figure 3 Showing the ARPANET network in Relation to its Real World Position in the US
Figure 3 above shows the same ARPANET network but set against its real
location on the US map. This helps to provide context to the networking diagram
as shown in figure 2. If we viewed the Internet as it stands today it would be
extremely difficult to navigate the networking diagram without the aid of
computer software. The Internet has grown from its humble beginnings
with just 13 nodes in the US to an estimated 361 million nodes worldwide at the
end of 2011. It would be interesting to see just how large the giant component
would be in this network.
Data Mining Evaluation and Research Methods
This research was inspired by the discovery of research advocating the use of
data mining in order to extract information from online sources. Data mining can
be defined as:
“a business process for exploring large amounts of data to discover meaningful
patterns and rules”(Linoff & Berry 2011)
The term “business process” within this definition perhaps makes data mining
sound “cold-hearted”, however, all organisations, profit making or otherwise, are
involved in business activities. Data mining, if performed well, will help increase
profit margins as well as help increase donations, active participation and
support to charities, and improve academic research.
The Internet, and SNSs in particular, fit the definition of “large amounts of data”
as given above. The problem is not the amount of data but rather the filtering of it to
produce meaningful results.
Extracting Data from Social Media Sources
A number of methods were considered for the data mining parts of this research.
The initial ideas included the use of search engines to look for trends and data,
combinations of proprietary software through to tailor-made coding using freely
available programming languages. Each method has a different level of usability
that relates to the time and cost investment required to produce usable results.
It was perceived that the ability to code and tie in with the APIs and databases of SNSs
would produce the most meaningful results. An API, or Application Programming
Interface, is the hidden side of many large websites and programs and allows
users with additional knowledge to access data.
A breakdown of the techniques evaluated is given below.
Text mining using software such as Spinn3r.
As discussed earlier, SNA can be useful for tracking the spread of contagious and
infectious diseases. A recent study tracked the word “influenza” in blogs (Corley
et al. 2010) using a web service called Spinn3r (http://spinn3r.com/) to crawl
and index content. The authors note that:
“text mining has proven to be more difficult than data mining, as the source data
consists of unstructured collections of documents rather than structured
databases” (Corley et al. 2010, p.600).
Spinn3r is limited to searching and indexing blogs and cannot gather information
from SNSs. Text mining software does have its merits and can be used as a key
indicator to track trends within particular subjects over time. The time required
to produce results that can be used to establish patterns is, however, prohibitive
over the course of this research.
It was decided at the preliminary stages not to use text mining as a source of
information for this research.
Programming
Resources such as “Mining the Social Web” (Russell 2011) propose that with
very little programming ability it is possible to extract data from Facebook,
Twitter and LinkedIn. This concept is interesting and provided much of the basis
for this research.
Mining the Social Web advocates the use of the programming language Python for
coding the data extraction. It jumps quickly to the use of example
Python code as a means of introducing the subject matter. As the programming
ability prior to commencing this research was limited, a decision was made early
on in conception that a foundation level should be established. The intention was
that any data mining code used should be understood to the level that
modifications could be made.
In the months before this research was started a basic level of programming
knowledge was acquired with a combination of self-learning books such as
“Hello World” (W. Sande & C. Sande 2009) and established, professional level
online training resources such as “Foundations of Programming:
Fundamentals” (Allardice 2011). These helped to establish a good level of
understanding of programming such as the syntax, methodology and coding
strategy.
After completing these courses and books, attention then turned to specific
coding for data mining in Python.
There are various sources of code for data mining in Python, which can be used
as the basis for research. Books are an important start and help get a good
grounding but can quickly become out-dated especially when trying to connect
with fast-paced technology such as the Internet. SNSs continually update their
APIs, meaning that programs that interact with them have to be updated too.
One way of doing this is by using social coding repositories such as “GitHub”.
GitHub provides a place for programmers to leave their code, which can be
updated and modified by other users as they see fit. This higher level of
transparency aims to provide the highest quality of code, from a number of
different sources, that can constantly evolve. In such places code is constantly
changing, being scrutinised and evolving for the better (Dabbish et al. 2012).
This research investigated a number of Python GitHub resources, as shown in
Appendix 2 at the end. This includes the GitHub resource related to the book “Mining
the Social Web” (https://github.com/ptwobrussell/Mining-the-Social-Web).
“Mining the Social Web” was published in January 2011. When the book was
published it contained details of how to data mine “Google Buzz”, which was
Google’s SNS at the time. Google Buzz has now, of course, been replaced by
Google+, which provides a good clear example of how quickly technologies change
in this industry.
Progress with this route was initially very promising and the understanding of
programming after the foundation and self-training period seemed strong. However,
despite a good understanding of how programming syntax should
work, a lack of experience exposed the weakness in this method of approach.
Different approaches were used to try to readjust the research and
maintain progress so that results from this method of data mining could be gained.
Whilst proceeding along this route of investigation problems were experienced
in the following ways:
1. Modifying code in order to interact with APIs
2. Lack of understanding of the social media APIs
3. Online user forums could not clarify problems
4. Suggested methods of visualisation were incompatible with the computing
platform
During this period progress was slow and discouraging. It became clear that
despite understanding how the programming code should work, the lack of
experience with programming and using API nomenclature would set too many
roadblocks for a project of this scale.
The decision therefore had to be made not to continue using Python programming to
mine data, so that some of the objectives highlighted could still be
achieved.
NodeXL
The focus turned to other ways to explore and mine online data. It was accepted that
other methods might not be able to produce specific results; however, the
stability and reliability that they could provide was vital at this stage.
NodeXL is a free downloadable template that integrates with Microsoft Excel. Its
advantages compared to core programming are numerous.
1. The interface is a template within Microsoft Excel so those familiar with
spreadsheets should be able to use it.
2. Due to its integration with another program it is very stable to use.
3. It has built-in functionality with the Twitter, YouTube and Flickr APIs.
4. NodeXL is open source software, which means that anyone with
programming knowledge has the ability to add
additional functionality to the software. This should keep the template up-to-date
with respect to the APIs but also promises more for the future.
5. The data collected within NodeXL can be exported to other
visualisation programs.
Once installed, the NodeXL interface is understandable and intuitive. The
NodeXL template exists entirely as a tab within the newer versions of
Microsoft Excel. Clicking the NodeXL tab presents a dedicated “ribbon” complete
with required icons and menus. See Figure 4 below.
Figure 4 Image showing screenshot of NodeXL starting page.
The ribbon is dominated by the import and export icons that lead you to further
menus for importing mined data. See figure 5 below, showing the import menu
which includes importing options and several data mining options for Flickr,
Twitter and YouTube as discussed previously. For each of these SNSs, searches can be made
based on a particular user’s network or on a particular keyword.
With the ability to search a particular user it was decided to perform a case study
on the user name “Capoeira Vibe” that exists on Twitter, Flickr, YouTube and
Facebook. The intention is to compare and contrast the networks of this one user
across the four main social media networking platforms.
Figure 5 Image showing the NodeXL import menu, including the options to import from Flickr,
Twitter and YouTube
The method for extracting data from YouTube, Flickr and Twitter is similar.
Each is a case of entering the username or keyword of the network that you wish
to investigate, the kind of relationships to report and a limit on the number
of nodes to inspect. See figure 6 below. One difference is that Twitter
imposes a data retrieval rate limit of 350 requests per hour (Hansen et al. 2010, p.152). This
sounds like a considerable number but is quickly reached if the network is
anything other than small.
It should be noted that this restriction is not there to limit the absolute amount of data
collected from Twitter; it is there to restrict bandwidth, as the number of calls
for data made to the service is very high. This can have a considerable impact on
the time taken to collect data, with larger networks taking several days to complete
during the course of this research. Planning is required if large Twitter networks
are to be investigated.
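The planning involved can be made concrete with a rough estimate of collection time under the 350-per-hour limit. The Python sketch below assumes, purely for illustration, that the number of requests needed is known in advance; how many requests NodeXL actually issues per user is not stated in this research, and the function name is hypothetical.

```python
import math

# Rate limit quoted in the text (Hansen et al. 2010, p.152).
TWITTER_REQUESTS_PER_HOUR = 350


def estimated_collection_hours(requests_needed,
                               limit_per_hour=TWITTER_REQUESTS_PER_HOUR):
    """Return the whole number of hours needed to issue the given requests."""
    if requests_needed < 0:
        raise ValueError("requests_needed must not be negative")
    return math.ceil(requests_needed / limit_per_hour)


if __name__ == "__main__":
    # Hypothetical workload of 10,000 requests.
    hours = estimated_collection_hours(10_000)
    print(hours)                  # 29
    print(round(hours / 24, 1))   # 1.2 (days)
```

Even a modest workload of this size runs past a full day, which is consistent with the multi-day collections reported above.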
Figure 6 Image showing the import menu for a Twitter user network within NodeXL
It can be seen in Figure 6 above that there are various options for investigating
each SNS. This shows an example of an import based on Twitter username. The
options selected determine the extent of the network, where nodes (referred to
as vertices here) can be created just for the follower, the person being followed
or both. How the edges are defined during the data gathering can also be
determined at this stage. Edges can be added in Twitter for a “following”
relationship as well as responses to tweets between people that don’t necessarily
follow each other.
The final key point to choose is how “deep” the network should go. A network of
level 1.0 looks at the user and the links to their followers and tweet relationships.
A network of level 1.5 also establishes the links between those followers, as well as
back to the user under investigation. Taking a step further, a network of level 2.0
looks at the network of the followers of followers.
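The three network levels can be sketched in Python as follows. This is an interpretation of the description above and not NodeXL's own implementation: links are treated as undirected pairs, and the function name and sample data are hypothetical.

```python
def ego_network(links, ego, level):
    """Return the set of links included at network level 1.0, 1.5 or 2.0."""
    if level not in (1.0, 1.5, 2.0):
        raise ValueError("level must be 1.0, 1.5 or 2.0")
    links = {frozenset(link) for link in links}

    # Level 1.0: only links that touch the user under investigation.
    first = {link for link in links if ego in link}
    if level == 1.0:
        return first

    # Level 1.5: add the links that run between the user's own contacts.
    contacts = set().union(*first) - {ego}
    among_contacts = {link for link in links if link <= contacts}
    if level == 1.5:
        return first | among_contacts

    # Level 2.0: add every link touching a contact, which reaches the
    # contacts of contacts (followers of followers).
    return first | {link for link in links if link & contacts}


# Hypothetical sample data.
SAMPLE_LINKS = [("ego", "a"), ("ego", "b"), ("a", "b"), ("a", "c"), ("c", "d")]

if __name__ == "__main__":
    for level in (1.0, 1.5, 2.0):
        print(level, len(ego_network(SAMPLE_LINKS, "ego", level)))
```

In the sample data the link between "c" and "d" is never included, because neither end is a direct contact of the user. Each step up in level adds links, which is why the level chosen affects the volume of data requested.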
The amount of information requested at this stage, determined by the number of
edges and vertices or the level of the network, has a direct impact on the amount of data
requested by NodeXL from each social network.
This is worth noting for two key reasons.
1. The amount of data requested has an effect on the processing required
once collected. The data requested should relate to the requirement, as
more data takes more processing power to gather but also more power to
analyse once gathered.
2. The amount of information requested has an impact on the time frame for
collection. This is particularly relevant to the limitations imposed by
Twitter as discussed above.
The raw output from NodeXL is a list of the nodes and the relationships between
them, but can also include items such as thumbnails, tweets and hyperlinks.
Figure 7 below shows a typical screen that is presented after an import has been
completed. This depicts an import after a keyword search on YouTube for the
term “capoeira”. Presented on the right-hand side is a raw version of the network
graph before any manipulation.
Figure 7 Screenshot of imported data from NodeXL from YouTube
NodeXL proves extremely useful for the extraction of data but less useful when
analysing the networks produced. One major problem is that during data
extraction the visualisation of existing data is not possible. This limits the
production and processing of data to a "one way" system, impeding progress.
Another problem was that the graphical visualisation and network statistics capabilities
of NodeXL did not match those of other software such as Gephi. It was decided to export
the files created within NodeXL and manipulate them within Gephi.
Netvizz
The main omission in NodeXL’s data mining arsenal is the fact that it does not
currently link in with Facebook. Whilst Facebook is only one of many SNSs
available it is clearly the most well known and widely used. During the course of this
research the number of monthly active Facebook users surpassed 1 billion
(1 000 000 000) (Zuckerberg 2012). Its impact on the social media landscape
cannot be ignored.
As such this analysis would have felt incomplete without endeavouring to use
data from Facebook as a source for network construction.
Research into extracting data from Facebook led to “Netvizz” which is a small
Facebook app. There is little formal information about Netvizz other than it is a
program created and maintained by Professor Bernhard Rieder at the University
of Amsterdam (Rieder 2012). The application can be accessed by typing
“Netvizz” into the search box within Facebook.
Once Netvizz is accessed a screen such as the one below is presented.
Figure 8 Image showing the Netvizz User Interface within Facebook
This page shows that Netvizz can be used to extract data from users’ personal
networks as well as the Facebook groups they are members of. Data from Facebook
brand pages cannot currently be extracted.
Clicking on a group name then opens a page showing that the extraction
algorithm is running and tracks progress. A typical page is shown in the image
below.
Figure 9 Image showing that the Netvizz data extraction has completed and can be downloaded.
In order to fulfil the aims and objectives set up earlier in this research it was
decided that the data from a number of clubs should be extracted and
amalgamated.
Users can only extract data for Facebook groups that they are members of so in
order to complete this research a number of capoeira Facebook groups had to be
joined. These groups were found simply by typing the word “capoeira” into the
Facebook search box and requesting to join. A request to join all groups
presented by this simple search was seen as the best way to keep this research
impartial and random. The more random the sample taken, the greater the degree of
accuracy and confidence that can be had in the results (Kotrlik & Higgins 2001). It is
anticipated, however, that as Facebook presents news and information to a
particular user based on their friends, a similar presentation will occur
when searching for groups. This is governed by Facebook’s “Edgerank”
algorithm (Bucher 2012), which seems to limit the sources and visibility of
information received rather than opening up the social media landscape in a
panoptic fashion (Foucault 1977). This leads to the real concern that achieving
randomness online, and in Facebook in particular, is difficult. This has led to the
introduction of the phrase “Filter Bubble”, which relates to how the algorithms and
software that govern the Internet actually serve to hide information from users by
aiming to provide them with the information that is most relevant to them
(Pariser 2012).
Figure 10 Image Showing the Group Joining Process on Facebook
The image above gives a typical example of a search page produced after typing the
word “capoeira” into Facebook. There are three distinct stages for the groups displayed:
1. The groups that have been found by Facebook search and have yet to
have a join request sent
2. Those where the join request is still pending approval by a group
administrator
3. Groups that a user is a member of
It should be noted that there are two distinct types of groups on Facebook, open
groups and closed groups. Both require approval to join, so it was felt that requesting
to join a closed group would be the same ethically as requesting to join an open
group, as admission depends on the administrators within the group.
Once a join request was sent it could take anywhere between one minute and
three months for an approval to be given. Indeed, many club admins approved the
join request during the time this research was being performed.
The request to join was sent to over 300 capoeira clubs. The number of clubs
joined was expected to be limited only by the time constraints of the research period. It was,
however, found that Facebook started to restrict the number of groups that were
shown with each search. This started to occur once around 200 join requests
had been made in a relatively short space of time, approximately one week.
This is another indication that the group selection process was not performed as
randomly as possible.
Once the request to join portion had finished the next stage was to use Netvizz to
extract data from each group. Once the extraction is completed the data can be
downloaded as a .gdf file for use within Gephi for analysis.
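To show what such a file contains, the Python sketch below reads the node and edge sections of a small .gdf file. It assumes the usual GDF layout of a "nodedef>" header followed by node rows and an "edgedef>" header followed by edge rows; the column names and sample values are hypothetical and are not taken from the Netvizz output used in this research.

```python
import csv
import io


def parse_gdf(text):
    """Return (nodes, edges) parsed from the text of a GDF file."""
    nodes, edges = [], []
    section, columns = None, []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        first = row[0]
        if first.startswith(("nodedef>", "edgedef>")):
            # Header row: remember the section and strip the type names,
            # e.g. "name VARCHAR" becomes "name".
            section, row[0] = first.split(">", 1)
            columns = [cell.split()[0] for cell in row]
            continue
        record = dict(zip(columns, row))
        if section == "nodedef":
            nodes.append(record)
        elif section == "edgedef":
            edges.append((record["node1"], record["node2"]))
    return nodes, edges


# Hypothetical file content for illustration only.
SAMPLE_GDF = """nodedef>name VARCHAR,label VARCHAR
1,Member One
2,Member Two
3,Member Three
edgedef>node1 VARCHAR,node2 VARCHAR
1,2
2,3
"""

if __name__ == "__main__":
    sample_nodes, sample_edges = parse_gdf(SAMPLE_GDF)
    print(len(sample_nodes), len(sample_edges))  # 3 2
```

Gephi reads this format directly, so a parser like this is only needed when the raw node and edge lists are to be inspected or counted outside Gephi.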
The extraction time can vary, depending mainly on the number of users within a
Facebook group. In some cases the extraction fails. This is something that has to
be accounted for when undertaking analysis of large
groups. It is not a labour-intensive process; however, time is required to monitor
the groups, and this should be considered for similar undertakings.
Gephi
Gephi is open source network manipulation software. It uses algorithms and
tools more prevalent in the gaming industry to filter, visualise and
export networks of all sizes. Gephi uses the computer’s graphics card to do
processing, which means that other operations can occur whilst graphs are
calculated. This does mean, however, that a more powerful graphics-orientated
computer is beneficial for optimal usage (Bastian et al. 2009).
Gephi does not have the ability to mine the data itself. It can, however, import
a number of different types of files, including the file types generated by NodeXL and
Netvizz as discussed above.
The Gephi interface, graphs produced and outputs such as path length mean that
meaningful analysis can be performed once a network has been loaded in. As
graph theory algorithms are built in to Gephi there is a short learning curve for
those that are well versed in network science. The graph theory and algorithms
were built in from the ground up along with the visualisation aspects of Gephi.
Those unfamiliar with network analysis but shown a Gephi network graph
should be able to understand key characteristics. Figure 11 below shows a
network graph produced in Gephi.
Figure 11 Network graph produced in Gephi showing a Manchester based capoeira group
The graph shown above in figure 11 is produced from a capoeira group,
“Capoeira Cordão de Ouro North West UK (Manchester and Liverpool)”, as
identified in Facebook. This network graph is a slightly cropped down version of
the group with the unconnected nodes (leaves) removed.
Based on the import of the data into Gephi there are 1001 nodes in this capoeira
group. Once the unconnected nodes are removed there are 879 (87.8%) visible.
By looking at the graph, with the leaves removed, it can be seen that there is
one large connected network component here.
It is clear to see that C.M.Parente is the most connected node based on the size of
the node shown and the text presented. It should come as no surprise that
C.M.Parente is the instructor of this capoeira group.
This capoeira group is extremely well connected, with an average path length of
2.43. This means that, on average, there are fewer than 3 degrees of separation
between any two people in this group. This is well below the 6.6 discovered by the
research on the instant messenger platform MSN Messenger.
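For readers unfamiliar with the metric, the Python sketch below shows one way an average path length can be computed: a breadth-first search from every node, averaging the shortest distances over all reachable pairs. This is an illustrative definition and is not claimed to be the exact routine Gephi uses; the star-shaped example network is hypothetical.

```python
from collections import deque


def average_path_length(edges):
    """Return the mean shortest-path length over all reachable node pairs."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    total, pairs = 0, 0
    for source in neighbours:
        distance = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search gives shortest hop counts
            current = queue.popleft()
            for nxt in neighbours[current]:
                if nxt not in distance:
                    distance[nxt] = distance[current] + 1
                    queue.append(nxt)
        total += sum(distance.values())
        pairs += len(distance) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0


# Hypothetical group: one instructor linked to four students who are not
# linked to each other.
STAR = [("instructor", s) for s in ("s1", "s2", "s3", "s4")]

if __name__ == "__main__":
    print(average_path_length(STAR))  # 1.6
```

In the star example every student is one step from the instructor and two steps from every other student, giving an average of 1.6. A single highly connected node keeps the average path length low in the same way.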
From research within this capoeira group it is known that there are
approximately 100 students that train regularly across the different locations
during the week. So how can the 1001 nodes be explained?
The first point to make is that this figure of 100 students relates to the current
number of students that train within the group. This group has been running in
Manchester since 2002 and will shortly be celebrating its 10-year anniversary. It
is difficult to determine when this group was launched on Facebook; however,
Facebook itself came to the UK in October 2005 (Form 2012). It would,
therefore, be a conservative estimate to say that the group had been on Facebook
since 2007: approximately 5 years, or half of the group’s existence in
Manchester.
In this time many people would have come and gone, so whilst there are
currently 100 students training, the number of students that trained in the
five-year period could be much larger.
The second point to make is that not all of these nodes represent the students of
the group. Other visiting teachers and friends of the instructor make up a lot of
these. As discussed earlier, capoeira students do travel to other events around
the country and the world. A large proportion of the 1001 nodes
within this group consists of students from other groups that have come to visit and
subsequently joined this group on Facebook.
A number of the nodes will be made up of people who are perhaps looking to join
the group or are curious about its activities and want to get to know more. A lot of
these nodes would have been removed as leaves and are not represented within the
network graph shown in figure 11.
Referring back to figure 11, it can be seen that the graph is divided into four
distinct areas as depicted by the different colours. The red areas generally depict
nodes from unrelated capoeira groups that are based outside the UK. The orange area at
the top of the middle represents people from other capoeira groups based within
the UK. The large purple section consists of people from related capoeira groups
based outside the UK. The remaining green section shows the UK-based
students, the vast majority of whom are from this Manchester capoeira group
over the years.
Method
In order to construct the capoeira networks and perform the case study a
combination of NodeXL, Netvizz and Gephi was used. NodeXL was used to extract
data from Twitter, YouTube and Flickr, whilst Netvizz was used to extract data
from Facebook. The data from both of these programs was then visualised within
Gephi.
One key feature of Gephi is the ability to append data. This means that data from
multiple sources can be combined into one file and the same nodes and edges
are not counted more than once, preventing overlap.
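The effect of appending data without double counting can be sketched in Python with sets. This illustrates the idea only and is not Gephi's own append routine; the function name and the two sample groups are hypothetical.

```python
def append_group(graph, nodes, edges):
    """Add one group's nodes and edges to a combined graph, ignoring repeats."""
    graph["nodes"].update(nodes)
    # frozenset makes ("a", "b") and ("b", "a") count as the same edge.
    graph["edges"].update(frozenset(edge) for edge in edges)
    return graph


# Hypothetical groups sharing one member ("carla") and one edge.
GROUP_ONE = (["ana", "bruno", "carla"], [("ana", "bruno"), ("bruno", "carla")])
GROUP_TWO = (["carla", "bruno", "dani"], [("carla", "bruno"), ("carla", "dani")])

if __name__ == "__main__":
    combined = {"nodes": set(), "edges": set()}
    for group_nodes, group_edges in (GROUP_ONE, GROUP_TWO):
        append_group(combined, group_nodes, group_edges)
    print(len(combined["nodes"]), len(combined["edges"]))  # 4 3
```

The two groups list six members and four edges between them, but the combined graph holds four unique nodes and three unique edges.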
With time, larger networks can be constructed from smaller groups. This is how
the capoeira network was constructed during this research. After requesting to
join over 300 groups, data was extracted from 245. Four groups would not converge.
The breakdown of the groups is shown in Appendix 1. It can be seen
from Table 1 below that there is a total of 121,753 nodes across the 241 groups
that converged.
Results and Analysis
Flickr Results
Figure 12 Network Graph Showing the Keywords Associated with the Term "capoeira" on Flickr
Flickr is an SNS based around the display of photographs. Using NodeXL to
perform a keyword search of “capoeira” on Flickr produces a graph that relates
it to the other keywords that occur most frequently in the photos uploaded.
This shows the keywords associated with capoeira, both those directly related to the
martial art itself and those related to the art form of photography. It can be seen that the
capoeira node is the largest and that it has strong ties with Brazil. The fact that the word
Brazil features twice, spelt in both its English and Portuguese variations, highlights this.
Other keywords not related to capoeira but to the medium of presentation
are also prominent. These include the words Nikon and Canon, which relate to
photography in particular.
This kind of result is useful for determining marketing strategy or for
working out how to get the best impact online with keywords. If societies, charities
and companies were looking for keywords to make their presence more
pronounced online, this kind of extraction could be a good starting point.
YouTube Results
Figure 13 Network Graph created from the "capoeira" tag within YouTube
The graph shown in figure 13 is more difficult to navigate than the graph produced
from Flickr. Each node here represents a video that
people have commented on or linked to, and the label is a string of letters and
numbers not readily understandable by humans, e.g. “MFgXmiL78Pg”.
When metrics from within Gephi are applied, however, the network becomes more
usable. For example, this graph represents 878 videos, and the YouTube names of
the most frequent posters can be found as shown in figure 14 below. These can
then be used as the target for further research on YouTube.
Figure 14 The Most Frequent Posters of Capoeira Videos on YouTube
Twitter Results
Figure 15 Twitter network produced from the hashtag “capoeira” (from 100 tweets)
NodeXL was used to mine data from the last 100 and 1000 Twitter users that
used the word capoeira in a tweet. The networks that result from this are shown
in the figures above and below.
Figure 16 Twitter network produced from the hashtag “capoeira” (from 1000 tweets)
Edges are created in this case when a tweet is retweeted or replied to. It can be
seen that there is a large difference between the 100-tweet and the 1000-tweet
network graphs. This is due to enough time passing for people to interact with
tweets and highlight relationships.
Of the 11 joined components in the 100-tweet graph, two have a group size of
more than three. This is not enough to gauge how big an influence these nodes
have in the capoeira network on Facebook.
The 1000-tweet graph shows a lot more detail that is useful for social network
analysts. There are three sizeable components that are independent from each
other. The green nodes indicate Twitter users that have more connections whilst
the pink nodes attached to them show how far the green node’s influence
spreads with each tweet.
Facebook Results
Figure 17 Network Graph showing 75453 nodes constructed from 241 Capoeira Groups
The network graph shown in figure 17 above is the culmination of data mining from
Facebook via Netvizz. This is an amplified version of the network graph shown in
figure 11, which was based on one capoeira group. Graphs of this size are very
difficult to analyse visually and this is where programs such as Gephi come
into their own. From this diagram many different groups, as represented by the
different coloured areas, can be seen, but even then it can be difficult to distinguish
one group from another and indeed to tell how many groups there are. Each Facebook
capoeira user is represented by a circle, the largest belonging to the nodes that
are most connected.
If the labels are included in this diagram it quickly becomes difficult to read. This
is shown in the appendices in figure 24.
When running a variety of statistical metrics on Gephi the data we are looking
for can be found. Figure 18 below shows the top 10 most connected nodes within
the network above.
Figure 18 Image to show the Top 10 most Connected Nodes from the Constructed Capoeira Network
Gephi is also able to confirm the existence of a giant component that consists of
68448 nodes (91% of the population). This agrees with the introduction to network
theory described earlier. The average path length here is 6.9, which is in agreement
with the work performed on Microsoft Messenger by Leskovec and Horvitz.
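The giant component figure can be illustrated with a short Python sketch that finds the connected components of a graph and reports the share of nodes in the largest one. This is a generic component search and not Gephi's own implementation; the small sample graph is hypothetical, while the final line uses the node counts reported above.

```python
def connected_components(nodes, edges):
    """Return a list of node sets, one per connected component."""
    neighbours = {node: set() for node in nodes}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    seen, components = set(), []
    for node in neighbours:
        if node in seen:
            continue
        component, stack = set(), [node]
        while stack:  # depth-first walk of everything reachable from node
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        components.append(component)
    return components


def giant_component_share(nodes, edges):
    """Return the fraction of all nodes that sit in the largest component."""
    components = connected_components(nodes, edges)
    if not components:
        return 0.0
    sizes = [len(component) for component in components]
    return max(sizes) / sum(sizes)


# Hypothetical graph: four linked nodes plus two isolated ones.
SAMPLE_NODES = [1, 2, 3, 4, 5, 6]
SAMPLE_EDGES = [(1, 2), (2, 3), (3, 4)]

if __name__ == "__main__":
    print(round(giant_component_share(SAMPLE_NODES, SAMPLE_EDGES), 2))  # 0.67
    # Share reported in this research: 68448 of 75453 nodes.
    print(round(68448 / 75453, 2))  # 0.91
```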
Table 1 Key Values from Netvizz Data Extraction of Facebook Capoeira Groups
Simple Node Count          121753
Number of Unique Nodes     75453
Number of Groups Joined    245
Groups Converged           241
Shrinkage                  0.38
Raw Average Size           512
Unique Average Size        317
When looking at the list of the most influential users within the capoeira
network on Facebook evidence can be seen that selective algorithms guided
group selection. The node list of the large Facebook network shows that
C.M.Parente is the most connected node among the 75,000 other nodes. He is followed
by C.M.Papa-Leguas. This is the same as in the Manchester capoeira group shown in
figure 11.
It should be noted that the Manchester capoeira group was one of the first
groups to be joined as part of this research. Out of the 245 groups this research
became a member of, the Manchester group was the 4th. It is also the first group to
have a substantial number of members, with 1001 nodes compared to 150, 68
and 106 for the first, second and third groups joined. It is, therefore, conceivable
that Facebook’s Edgerank algorithm has “guided” the rest of the capoeira
groups that were displayed during the search.
In order to assess the impact of this, a separate data collection exercise would have
to be performed. This time care would have to be taken to ensure the groups
found did not contain the members C.M.Parente and C.M.Papa-Leguas. If their
names showed up under similar circumstances this would indicate that they
are well connected within the Facebook capoeira community as a whole. If,
however, they did not show up or if their presence was much smaller it could be
suggested that the Edgerank had a profound effect on this research.
Shrinkage
Table 1 above shows some key data taken during the course of this research.
One of the most interesting points to notice is that 75453 represents the total number
of nodes in the final networking diagram but is less than the simple node
count (121753). The difference is captured by “shrinkage”, a concept that is
introduced during this research as no information can be found about it elsewhere.
This is explained by the fact that people are free to join as many groups as they
wish on Facebook, limited only by the approval process. This means that the
same person may be a member of several groups on Facebook and as such
will be counted more than once in the raw data. During this research only
one Facebook account was used, so in this case one user is a member of 245
capoeira groups. The fact that software such as Gephi only counts each user once
exposes this “shrinkage”.
This is defined here as:
“the fraction by which the simple node count exceeds the actual number
of nodes”
In this case the simple node count is 121753 that feeds into a graph of 75453.
The shrinkage number is given here by 1 – (75453/121753) = 0.38.
A shrinkage number of 0 indicates that each node is a member of a different
group and that each group is discrete, whilst a number close to 1 suggests that
everyone is a member of everyone else’s group. Having a high or low shrinkage
number is not necessarily a good and bad thing but is an indicator of the
dynamics within the network. If one wanted to transmit a certain piece of
information from group to group it would certainly be easier with the networks
that consist of groups with higher shrinkage numbers.
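The calculation above can be sketched in a few lines of Python. This is a minimal illustration that assumes each group is available as a set of member IDs; the function name shrinkage and the toy groups are hypothetical, and only the two totals 121753 and 75453 come from this research.

```python
def shrinkage(groups):
    """Shrinkage = 1 - (unique nodes / simple node count).

    groups -- list of sets, each holding the member IDs of one group
    """
    simple_count = sum(len(members) for members in groups)
    if simple_count == 0:
        raise ValueError("at least one group must contain a member")
    unique_count = len(set().union(*groups))
    return 1 - unique_count / simple_count


# Toy data (hypothetical): "b" and "c" each belong to two groups.
toy_groups = [{"a", "b", "c"}, {"b", "c", "d"}, {"e"}]
print(shrinkage(toy_groups))  # 7 memberships, 5 unique members

# The figures reported in this research.
reported = round(1 - 75453 / 121753, 2)
print(reported)
```

With completely discrete groups the function returns 0, and with many identical groups it approaches 1, which matches the interpretation given above.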
Capoeira Vibe Case
The final part of the results analysis is a brief case study based on the capoeira
community Capoeira Vibe, which exists on all four SNSs. The intention here is to
give a real, high-level example of the information that can be gained from the
data mining process in this paper. Facebook currently does not allow fan pages to
be mined. It should be noted for balance, however, that Capoeira Vibe has well
over 11,000 likes on Facebook, which would represent its largest network of the four.
Figure 19 Network Graph of the "Capoeira Vibe" Network Community on Flickr
The Flickr group is the smallest of the Capoeira Vibe networks, followed by YouTube.
This may be because these sites focus on the production and posting of material.
Based on these findings, the recommendation would be to expand their networks on
these sites prior to posting any more material. We have seen that the capoeira
network on Facebook has a shrinkage number of 0.38, which suggests that
information should pass quite freely within this network. In order to facilitate
the passing of information, i.e. new photos and videos, more links should be
made.
The Twitter group shown in figure 22 is made up of 1359 nodes; however, they
have only 249 followers. This shows they are simply not engaging with the audience
as much as possible. The advantage of the graph shown in figure 22 is that they can
now identify key members of the network, prioritise which users to form
allegiances with, and use things like instant messaging and retweets to reach them.
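As an illustration of how key members might be shortlisted from such a graph, the Python sketch below ranks nodes by their number of connections (degree). Degree is used here only as one simple, assumed measure of importance; the edge list, the user names and the function name rank_by_degree are hypothetical and are not taken from the Capoeira Vibe data.

```python
from collections import Counter


def rank_by_degree(edges, top_n=3):
    """Return the top_n (node, degree) pairs from an undirected edge list."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    # Highest degree first; ties are broken alphabetically for repeatability.
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical follower/mention links, invented for illustration only.
edges = [
    ("vibe", "ana"),
    ("vibe", "ben"),
    ("ana", "ben"),
    ("ana", "caio"),
    ("caio", "dora"),
]

top_users = rank_by_degree(edges, top_n=2)
print(top_users)
```

Tools such as Gephi and NodeXL report this and other measures directly, so a script of this kind is only needed when working from a raw edge list.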
Figure 20 Capoeira Vibe Level 1.0 YouTube Network
Figure 21 Capoeira Vibe Level 2.0 YouTube Network
Figure 22 Capoeira Vibe Twitter Network
Ethics
It can be seen above that large amounts of data can be created, collected and
analysed quickly. This has obvious merits when compared to more traditional
methods of data collection such as surveying and interviewing, especially when
there are time and economic constraints. When data is collected by surveys,
however, people are consenting to its usage in this manner. One key concern
with data mining is that people allow their data to be used online in one context
but the data is extracted and used for another (Van Wel & Royakkers 2004).
When users sign up to an SNS they typically agree to an end user license
agreement (EULA). The vast majority of people don’t read these and agree in
haste simply to start using a service. The first line of Facebook’s EULA is “Your
privacy is very important to us” (Herman & Ullyot 2012). It then moves on to
explaining how users own their own content but allow Facebook to use
it as it wishes. This is where the information from data mining can be exploited.
Data mining happens because people, perhaps unwittingly, allow it to happen,
maybe due to ignorance or by forgoing privacy simply to be able to use services.
Of course the ability to do something does not suggest that it is ethically right. By
setting up the EULAs the social networks effectively absolve themselves from
the ethical issues of data protection.
The data collected during the course of this research was only restricted by time
and, in the case of Twitter, the concern that accessing data might adversely affect
their bandwidth.
It is clear that those wanting to use data mining are the ones that are carrying
the moral flag. Any information collected during the course of this research has
been handled with care so that private data cannot be exploited.
The data collected here has been for the purposes of establishing patterns and
structure. It is arguably the next steps following the data mining that pose the
largest threat. If data mining was used to establish whom to contact for further
research then this becomes little more than a telephone directory. As the
discussion moves into the qualitative phases the concern with ethics does start
to lessen. Here communication is on a much more personal level and individual
permission simply has to be granted. Otherwise this can be seen as spam
contact, which can be deemed unethical.
Data mining from SNSs is still in its infancy and will no doubt continue to grow
given its merits. Its usage needs to be monitored now to ensure that it is not
exploited in a negative way in the future. It is clear that at this stage the moral
implications are very much in the hands of those extracting the data.
Quantitative vs. Qualitative
The data mining processes and the network graphs and analysis produced are all
quantitative. A lot of data can be collected in a short period of time and this can
be used to establish trends that help steer the direction that organisations may
follow. The keyword analysis from Flickr, for example, can be used to set up
a website’s metatag infrastructure, which is important for search engine
optimisation (Jones 2011).
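A minimal Python sketch of this idea is given below. It assumes the tags attached to each photo have already been extracted as lists of strings; the sample tags and the function names top_keywords and meta_keywords_tag are hypothetical and serve only to illustrate counting keywords and turning the most frequent ones into a metatag.

```python
from collections import Counter
from html import escape


def top_keywords(tag_lists, top_n=5):
    """Count tags across all photos (case-insensitive) and return the top_n."""
    counts = Counter(
        tag.strip().lower()
        for tags in tag_lists
        for tag in tags
        if tag.strip()
    )
    # Most frequent first; ties are broken alphabetically for repeatability.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:top_n]


def meta_keywords_tag(keywords):
    """Build an HTML keywords metatag from a list of keyword strings."""
    content = escape(", ".join(keywords), quote=True)
    return '<meta name="keywords" content="{}">'.format(content)


# Hypothetical photo tags, invented for illustration only.
photo_tags = [
    ["Capoeira", "roda", "Brazil"],
    ["capoeira", "berimbau"],
    ["capoeira", "roda", "music"],
]

keywords = top_keywords(photo_tags, top_n=3)
tag = meta_keywords_tag([word for word, _ in keywords])
print(keywords)
print(tag)
```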
The problem with pure quantitative analysis is that, although patterns and trends
can be seen and predicted, it can be very difficult to understand what is
happening and why. In the analysis of the Manchester capoeira group
(Figure 11) the breakdown of the different regions of the graph was only
possible after conducting a short interview with C.M.Parente. Without this it
would be very difficult to understand why the communities have collected in the
manner they did.
Indeed the main outcome of the Capoeira Vibe case study is that they now, in the
case of Twitter, YouTube and Flickr, have a much clearer picture of the networks
around them. They can also of course draw information from the “general”
network analysis created from Facebook, Twitter and Flickr. In the research
performed by Trompeter on social media tactics (Trompeter 2010) a key thread
emerged about reaching and connecting with people. This data mining helps to
find those people and the next stage would be to connect with them.
In the book “Groundswell” the authors’ research produced the “Social
Technographics Ladder”, which is reproduced below.
Figure 23 Social Technographics Ladder based on U.S. Adults (Li & Bernoff 2011, p.43)
Each step higher on the ladder indicates a greater level of involvement in social
media. The use of data mining for SNA helps to identify these people and where
they are. Qualitative measures can then be used to connect with these people and
further understanding, which may yield results.
This research will not go further into the discourse of qualitative versus
quantitative methods. The output from this paper has been quantitative, but the
two approaches should not be treated as mutually exclusive. There is a line of
thought that suggests a mixed-method analysis would be the most suitable for SNA.
“Network structure is not the whole story… we need to supplement
methods of formal network analysis with qualitative observations about
what is “going on” within a network.” (Crossley 2010, p.18).
“Qualitative approaches add an awareness of context which aids the
interpretation of network maps and measures; they add an appreciation
of the perception of the network from the inside; and an appreciation of the
content of ties in terms of quality, meaning, and changes over time.”
(Edwards 2010, p.24)
In real cases it is suggested that both methods are used, to the extent that
the time, resources and knowledge base within each organisation allow.
Conclusion
It can be seen during the course of this research that there are several ways to
collect data from SNSs. The type of data collected and the sites chosen to mine
from should be well considered, as each will return different types of usable
information. For those inexperienced with computer programming, software such
as NodeXL and Netvizz are good options for data mining purposes.
There are often time and budgetary constraints when conducting market research
but, given the short learning curve with these programs, even the uninitiated
can obtain meaningful results. This ties in with one of the key benefits of
marketing by social media, i.e. that it is cost-effective (Trompeter 2010).
This is particularly true given that it can be used for research purposes prelaunch
to help establish marketing strategy on sites such as YouTube and Flickr, and
for finding keywords to use online. Once a company or product has been
established these data mining metrics and networking techniques can be used to
monitor progress online and plan future steps.
When it comes to this research, aims and objectives were presented at the outset
of this paper. Each has been addressed to varying levels, constrained by the
binding parameters of this research and the conflicting intention of producing a
standalone paper. Many of the aims could form the basis of their own research. It
is felt in this case, however, that an overview would be more beneficial at this stage.
Appendix 1: Table to show the Number, Name, Group ID and Number of Nodes as Extracted by Netvizz
CAPOEIRA GROUP NUMBER | FACEBOOK CAPOEIRA GROUP NAME | FACEBOOK GROUP ID | NUMBER OF MEMBERS | NODES FROM UNMINEABLE GROUPS
245 CAPOEIRA MUZENZA KOREA 188938991204060 298
244 Cordao de Ouro Moscow 217904894908696 28
243 CAPOEIRA MANDINGA MEXICO 73554666612 403
242 Capoeira Meeting Copenhagen 61563872205 302
241 ABADA Capoeira UK 117537624347 373
240 Malandragem Capoeira 14463235598199 1468
239 Ryerson Capoeira Club 257873730977742 510
238 Real Capoeira 87587742228 250
237 Núcleo Abaeté Rede Anca Capoeira 385200501490881 284
236 Grupo Capoeira Brasil - Professora Mariana "Potiguara" 217949151486 379
235 Grupo Axé Capoeira 2248929446 1377
234 Capoeira sul da bahia CM MAXUEL, Monitor dexter 382019125175541 542
233 GRUPOS DE CAPOEIRA EN MEXICO 136198343113436 622
232 Grupo Caymã Capoeira 282348378525310 1494
231 Capoeira SDB Contra Mestre Maxuel, Profesor Estagiário Eddu.. 58338428093 325
230 I love capoeira cdo 182311485177255 487
229 CAPOEIRA Y ACONDICIONAMIENTO FISICO BOGOTA 106388989487731 296
228 GRUPO BERIBAZU E AMIGOS DA CAPOEIRAGEM 187972871213976 826
227 Capoeira Sul da Bahía - C Mestre Maxuel, Prof Estagaria Alena CHILE 36077561796 302
226 Capoeira Angola in Manila 124826510863308 231
225 Capoeira "Body & Soul" di Mario Collina 253970458014301 554
224 Porao Capoeira Vienna/Austria 130157855172 477
223 Capoeira Karkara 110588072355664 290
222 Capoeira Q Roda 223415177693922 194
221 Axé Capoeira Bursa 148746176508 451
220 CAPOEIRA - Oficina da Capoeira Internacional "Venezuela" 114404815258982 259
219 Capoeira Senzala (professor Palhaco-Belgrade) 41992525089 369
218 Capoeira Friends 102512993162401 325
217 Capoeira Nagô Malta 23503931496 934
216 Capoeira Videos 217041721655963 527
215 Escola Nestor Capoeira 175658252458533 270
214 Casa Da Capoeira Knysna 21790256882 666
213 Comunidade Capoeira 148020055323480 278
212 Capoeira Brooklyn - Raizes do Brasil 12292780174 203
211 Capoeira Malungos Bayonne 39969985608 592
210 Aú Capoeira New Zealand 19690984056 288
209 Cantigas e Documentarios de Capoeira 279741352133380 1104
208 Capoeira (Finland) 6205335078 203
207 Capoeira Polska 473542129336661 515
206 CAPOEIRA AGUA DE BEBER VENEZUELA 41172345168 519
205 Escola Brasileira de Capoeira, Philippines 27691384580 677
204 XANGO Capoeira Australia 7436426726 614
203 SENZALA MACEDONIA-MESTRE PULMAO ( Capoeira Skopje ) 44283313844 Data extraction not converged 4959
202 Capoeira Kościerzyna 175941112541870 270
201 Capoeira Senzala Genève 23039013657 204
200 Ginga Firme Capoeira Makassar 80141919030 384
199 CAPOEIRA MEU DEUS RANCAGUA 253128554776427 307
198 Capoeira Vida 188271654538689 935
197 Capoeira Training Brighton 131365310324937 287
196 Abadá Capoeira Milano 28377257445 438
195 Capoeira Filhos de Angola ~ Lefkada 175698712558150 255
194 Capoeira Cordao de Ouro - South Africa 11284290815 227
193 Filhos De Bimba Escola De Capoeira Lebanon 61878158913 588
192 Capoeira raça 156914091014346 239
191 Capoeira Senzala Scotland 4959664847 356
190 Capoeira Senzala Lara (Venezuela) 56762910082 232
189 Abadá Capoeira Bogotá 174064432704 448
188 FILHOS DE BIMBA ESCOLA DE CAPOEIRA STUTTGART 194794585908 597
187 Capoeira Sobreviventes 2345044102 407
186 Capoeira Cordão De Ouro Costa Rica 136361619738448 337
185 Capoeira Nago Brasil 168077533231588 308
184 Capoeira TRIARTE Genova 113640695314427 227
183 Capoeira Angola Center Finland 11110057026 427
182 Association Swedendê & Capoeira Senzala - Professor Coqueirinho 54700188474 569
181 Capoeira Amazonas Split 16832446806 212
180 Axé Capoeira—Czech Republic 132752446753272 247
179 Grupo Capoeira Males - Burlington 125210507601369 Data extraction not converged 1095
178 Grupo de Capoeira da Angola Istanbul 128728250490174 360
177 Rodas e Eventos de Capoeira em SP 154380194685843 862
176 Gruppo Capoeira San Marino 307586172594387 338
175 i want to learn capoeira (Malaysia) 165425803469895 268
174 Grupo de Capoeira Angola Menino Quem Foi Seu Mestre - London 340454615983173 343
173 Soluna Capoeira 28969814448 738
172 GRUPO CAPOEIRA CHALKIDA 39546508542 400
171 Mundo do Capoeira 343881525679887 1004
170 Grupo Capoeira Brasil- New York: Formanda Colibri 311052822323721 809
169 Abadá-Capoeira ::: Designs 361378350570801 737
168 Capoeiraskolen Senzala 46510288650 283
167 Capoeira Ringsted Grupo Malungos 56416035736 329
166 CAPOEIRA RAÇA E FUNDAMENTO "SERRA NEGRA" PRFº JAÚ 239679632776568 261
165 Escola de CAPOEIRA GINGA CARIOCA 28634226177 175
164 CAPOEIRA om formiddagen på St. Kongensgade 79 329466953808780 190
163 CAPOEIRA 150509121659291 467
162 CAPOEIRA BELGRADE-CMPULMAO 10219317260 1009
161 Capoeira Brasil - Bogotá 13283446207 510
160 Capoeira Puerto Rico Senzala 111639975573687 219
159 Capoeira Aché Brasil Malaysia 135483606485823 292
158 Capoeira Kuwait-‫را‬ ‫وي‬ ‫اب‬ ‫ك‬ ‫ت‬ ‫وي‬ ‫ك‬ 48035048877 279
157 Capoeira Belfast 181860001000 209
156 Capoeira Brasil Hermosillo 174108045980534 230
155 Capoeira Cyprus 297572139539 247
154 Group Capoeira Brasil - New Zealand - 278509383840 654
153 Capoeira Senzala de Santos 225993607412207 469
152 CAPOEIRA 177360938985380 307
151 Capoeira Ijexá 173148365299 213
150 Capoeira Gerais Madrid 143121215744995 1920
149 CAPOEIRA - UFRJ 171081326282080 1144
148 Capoeira Mersin 114280165362620 307
147 Capoeira Malungos Saint Etienne - France 141196022583911 709
146 Capoeira Senzala Montreal 250074501433 216
145 Capoeira Brinquedo de Angola 155219664543731 257
144 Capoeira Raizes do Brasil Milano 33398315109 643
143 CAPOEIRA MANDOU CHAMAR ULM GERMANY 449570235082920 312
142 Capoeira Calédonia, Energia da Bahia - Nouméa - 35682791319 453
141 CapoeirArab 156542164439065 393
140 Capoeira Senzala do Caribe 94715705127 894
139 ASOCIACION ARGENTINA DE CAPOEIRA 8716951310 1713
138 Capoeira in Michigan 2589775177 935
137 Capoeira Angola Israel - ‫קפואירא‬ ‫אנגולה‬ ‫ישראל‬ 297434566996612 709
136 Professor Cebolinha CORDÃO DE OURO CAPOEIRA - Newark-NJ-USA 406607306047829 1203
135 Capoeira Cordão de Ouro - Bonneuil 522128287802986 283
134 Capoeira Conviver 4125218546 310
133 Capoeira Volta ao Mundo brazil sweden 2407568628 283
132 Capoeira Brasil Cayman Islands - Instructor Koé 236552959187 374
131 Capoeira Malungos Paris - Association Senzaleiros 7538366468 436
130 Capoeira BALI 129610587101010 239
129 CAPOEIRA 269552989741226 266
128 ECAM (Escuela de Capoeira y Artes Mixtas) 315767145181306 1173
127 Capoeira in Toronto 2213115016 474
126 CAPOEIRA NATIVOS TUNJA - GARAGOA 263013087148341 274
125 Capoeira Malungos Ambérieu-en-Bugey 405396279522761 129
124 Capoeira ALEGRIA 179497432074997 255
123 Capoeira de Camaçari 375681459155880 443
122 CAPOEIRA GINGA DE MAPUTO 92676097159 654
121 Capoeira 198617790151751 399
120 Capoeira Força Natural 27637657908 246
119 Capoeira Passo a Frente 150336321678724 250
118 Grupo MATUMBÉ Capoeira - BARCELONA-SPAIN 368368470689 673
117 Capoeira Sul da Bahia Vienna Professor David 100311471224 279
116 Mundo Capoeira Türkiye - Mundo Capoeira Turquia 5745007875 1093
115 Capoeira "N" Surf Morocco 124449746100 277
114 Capoeira Estilo Livre 367768209937829 561
113 Capoeira 321810367891948 955
112 CAPOEIRA MUSIC ! 175857075765440 860
111 CAPOEIRA MAROC 13963475108 949
110 Capoeira Mandinga Taiwan 158763090818287 286
109 Capoeiranagô Berlin alemanha Professor Rogérinho 339334099467054 577
108 CAPOEIRA ARTES DAS GERAIS BRASIL MESTRE MUSEU FICAG 116408905097958 304
107 Capoeira Del Bruto Genève 251727598259443 368
106 Capoeira Ache Brasil Whistler 305421630796 248
105 Capoeira India 24905600976 476
104 Capoeira Cordão de Ouro Indonesia 15750918795 1095
103 CAPOEIRA - FORÇA JOVEM (CAMPO GRANDE - MS) 385510051471770 368
102 Capoeira 310291139019831 344
101 Capoeira Angola 132756643447610 247
100 Músicas da Abadá-Capoeira 246876338619 1954
99 Capoeira 182151925232852 437
98 Capoeira 173728919335311 1479
97 Capoeira Natural Do Brasil 57416365662 826
96 Capoeira Angola Center of Mestre João Grande - Oakland 144322595611492 353
95 CAPOEIRA DUBAI 202860073566 609
94 Capoeira Street Rodas Club 88497857586 Data extraction not converged 344
93 Capoeira Lapinha 416658591683370 490
92 Capoeira Filhos Da Bahia Australia 2362794567 1565
91 CAPOEIRA SENZALA NOVI SAD, SERBIA 63798471319 790
90 CAPOEIRA CIDADÃ 131298673593799 320
89 CAPOEIRA MOVIMENTO 224446160984111 301
88 Anauê Capoeira Internacional 136632315120 985
87 Cordao de Ouro Argentina 165325643526339 52
86 CAPOEIRA SUL DA BAHIA KØBENHAVN DANMARK 121672330541 309
85 Capoeira Batuque 49838989530 912
84 Capoeira Nova Era Mexico 113239375415803 376
83 Capoeira Aguascalientes 120202567998906 298
82 Capoeira Bem-Vindo 168608506486901 338
81 Axe Capoeira Türkiye 5464803047 1686
80 capoeira TwisT ponorogo 146387224975 Data extraction not converged 375
79 Capoeira Ballymena 382801128443913 68
78 Capoeira en Querétaro 345555565532083 105
77 Capoeira in armenia 129554777132684 1125
76 Capoeira Senzala Lyon 42847059088 297
75 CORDAO DE OURO COLOMBIA 45725838389 685
74 Capoeira Sul da Bahia - Washington D.C. 289485185402 363
73 CAPOEIRA NOVA ARTE 467158856636578 288
72 Capoeira Rijeka, Jacobina Arte Croatia 47105079044 355
71 Capoeira & break dance 231690590188039 233
70 Capoeira Angola FICA - Bogotá 10742924786 661
69 Capoeira Brasileira Montreal 2627110574 449
68 Capoeira Acrebrasil Milagro 156360181121850 408
67 Capoeira Senzala - Contramestre Steen - Professor Axé Canarinho 7481173828 643
66 Capoeira LDMUNAM (Cabeleira) 146698528752388 386
65 Capoeira Guerreiro Orixas 103209925238 363
64 Capoeira Angola Nottingham 60638711667 339
63 Capoeira 345054108914651 56
62 Capoeira Angola Center Italia 94821955561 368
61 Capoeira Angola Center Sérvia - Contra-Mestre Marquinho 132801996810260 432
60 Capoeira Ioannina - Companhia Pernas Pro Ar 53432940124 716
59 Capoeira De Nazareth Cordao De Ouro Israel 15886395195 699
58 Capoeira Nativos de Minas 4386281365 417
57 Capoeira Raça İtabuna - Bahia - Brasil 386805337998363 776
56 Capoeira Bulgaria 53282511878 720
55 Capoeira Infantil 200096850038709 1646
54 Capoeira Moldova 433028646709900 560
53 Capoeira AMAZONAS - Hrvatska 6074204063 584
52 capoeira surfista - ‫קפוארה‬ ‫סורפיסטה‬ 38643985389 326
51 Capoeira-Music.net 225102747556546 570
50 Capoeira Origens do Brasil UK - Southampton- Bournemouth - Portsmouth 19775543856 528
49 Capoeira Jerusalem ‫קפוארה‬ ‫ירושלים‬ 27033528099 934
48 Capoeira Angola Center of Mestre Joao Grande - New York 33918702846 1119
47 Capoeira Mineira 170275553055302 619
46 cordao de ouro sapporo do japao 162369700451798 12
45 Capoeira Sevgisi 229788467082160 323
44 Capoeira Topazio Rieti 51748450901 861
43 Capoeira Associação Sérvia - Professor Touro Branco e Alunos 189648850746 923
42 Capoeira Bristol- Claudio Campos 2333889907 642
41 Cordao de Ouro Norwich Capoeira 7421940457 268
40 Volta por Cima, Cordao de Ouro, Turku 113084175538 73
39 Capoeira Plantando Dendê 414665775257111 241
38 Capoeirando 313048142084018 437
37 roda 153865851333143 58
36 μπαράκι Μαρίας στις Γούβες (ή αλλιώς πειρατικό ή αλλιώς λεωφορείο ή αλλιώς) 112277615476421 756
35 Contra Mestre Turbina - Norway 232402870103870 407
34 Capoeira Picture 437611929590499 704
33 Capoeira World 353301284719697 1394
32 Capoeira Iowa - CORDÃO DE OURO C. M. Cabeção and Intrutora Tiririca 300009700073805 771
31 Capoeira Birmingham Cordao de Ouro 365929986778646 81
30 Capoeira Malungos Landes et Béarn 278612768888080 1089
29 CAPOEIRA JERICOACOARA 226519449716 203
28 Come and Play 138134466263978 139
27 Negoteta Capoeira -Guardioes Brasileiros 174957519203487 134
26 Cordao De Ouro Scotland 199800916760341 146
25 Capoeira Cordao de Ouro Milano -BAMBU- (www.capoeiracdo.it) 190470224308554 483
24 Capoeira Cordao De Ouro Manchester Children's classes 138546192907722 54
23 Capoeira i Bergen :) 242864262413466 52
22 Cordao de Ouro Wirral 216025318428937 44
21 capoeira cdo Marseille 202338545265 252
20 Capoeira Malungos Edinburgh - Scotland 100830409973949 252
19 Capoeira Cordão de Ouro Cheshire 140611855957168 115
18 Capoeira Ceara 2230017764 580
17 Afro Ritmo CDO - Capoeira 120447254654796 196
16 Cordão De Ouro Livorno 118805631471068 530
15 Cordao de Ouro Athens 42033326349 1076
14 Capoeira in Lancaster ( UK ) 57711742048 45
13 Cordão de Ouro Oslo- Instrutor Pirucão 442088410724 205
12
‫לוח‬ ‫ההודעות‬ ‫של‬ ‫המרכז‬ ‫הישראלי‬ ‫לקפוארה‬ - Cordão de ouro Israel
Mestre Edan 133660927258 736
11 CAPOEIRA IN CYPRUS "Cordao de ouro Cyprus" 255023815847 1566
10 Cordão de Ouro Barcelona 54752581524 161
9 Italia Centro Di Capoeira 163960295258 365
8 Cordao de Ouro Capoeira Sheffield 2259263842 219
7 Capoeira Nottingham CDO 47201706570 240
6 Cordão de Ouro Capoeira Crete 34564813057 1447
5 Capoeira Batizados & Workshops in Europe 55370608206 894
4 Capoeira Cordão de Ouro North West UK (Manchester and Liverpool) 2333577653 1001
3 Capoeira CBF 38593533333 106
2 University of York Capoeira 32824463126 68
1 Cordao de Ouro Capoeira Derby 2221308745 150
Total Number of Nodes 121753
Number of Unique Nodes 75453
Number of Groups Joined 245
Groups converged 241
Shrinkage 0.38
Appendix 2: Python GitHub Repositories for Social Media Sites
The official online compendium for Mining the Social Web (O'Reilly, 2011)
https://github.com/ptwobrussell/Mining-the-Social-Web
A Python library for accessing the Twitter API
https://github.com/tweepy/tweepy
Facebook.py is a Python Client Library for the Facebook APIs. The goal is to
support all Facebook APIs located at http://developers.facebook.com using only
standard python libraries.
https://github.com/semyazza/Facebook.py
Facebook Platform Python SDK
https://github.com/pythonforfacebook/facebook-sdk
Python client lib for Facebook's new Graph API
https://github.com/iplatform/pyFaceGraph/
Facepy makes it really easy to interact with Facebook's Graph API
https://github.com/jgorset/facepy
A micro api client for writing scripts against the Facebook Graph API
https://github.com/facebook/fbconsole
Appendix 3: Other Web Resources
NodeXL Download for Windows
http://nodexl.codeplex.com/
Netvizz Facebook App
http://apps.facebook.com/netvizz/?fb_source=search&ref=ts
Gephi Download and Documentation
http://gephi.org/
Appendix 4: Other Network Images
Figure 24 Network Graph of 241 Capoeira Facebook Groups with Labels
Figure 25 Capoeira Vibe Level 1.5 Network on YouTube
Figure 26 Image showing the Giant Component from the Facebook capoeira network. In spite of the
different colours depicting different groups, each node can be connected to every other node. The longest
distance between any two nodes is 22. This contains 68448 nodes compared to 75453 in the
complete network.
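For readers working from a raw edge list rather than Gephi, the giant component can be found with a breadth-first search, as in the Python sketch below. The edge list and the function name giant_component are hypothetical; only the node counts 68448 and 75453 come from this research, and this is not the method Gephi itself is claimed to use.

```python
from collections import defaultdict, deque


def giant_component(edges):
    """Return the largest connected set of nodes in an undirected edge list."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    visited = set()
    largest = set()
    for start in adjacency:
        if start in visited:
            continue
        # Breadth-first search collects every node reachable from start.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        visited |= component
        if len(component) > len(largest):
            largest = component
    return largest


# Hypothetical edges: one chain of four nodes and one separate pair.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]
largest = giant_component(edges)
print(sorted(largest))

# Share of the complete network inside the giant component reported above.
share = round(68448 / 75453, 2)
print(share)
```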
Bibliography
Allardice, S., 2011. Foundations of Programming: Fundamentals | Video Tutorial
from lynda.com. Available at: http://www.lynda.com/tutorial/83603
[Accessed July 22, 2012].
Bastian, M., Heymann, S. & Jacomy, M., 2009. Gephi : An Open Source Software for
Exploring and Manipulating Networks. In International AAAI Conference
on Weblogs and Social Media 2009.
Berners-Lee, T., 1999. Weaving the Web : the origins and future of the World Wide
Web, London: Orion Business.
boyd, danah & Ellison, N., 2007. Social Network Sites: Definition, History, and
Scholarship. Journal of Computer-Mediated Communication, 13(1),
p.Article 11.
Bucher, T., 2012. Want to be on the top? Algorithmic power and the threat of
invisibility on Facebook. New Media & Society, 14(6).
Burt, R., 1978. Applied Network Analysis: An Overview. SOCIOLOGICAL
METHODS AND RESEARCH, 7(2).
Corley, C.D. et al., 2010. Text and Structural Data Mining of Influenza Mentions in
Web and Social Media. International Journal of Environmental Research
and Public Health, 7(2), pp.596–615.
Crosbie, V., 2002. What is New Media?
Crossley, N., 2010. The Social World of the Network. Combining Qualitative and
Quantitative Elements in Social Network Analysis. Sociologica, 4(1), pp.0–
0.
Dabbish, L. et al., 2012. Social coding in GitHub: transparency and collaboration
in an open software repository. In Proceedings of the ACM 2012 conference
on Computer Supported Cooperative Work. pp. 1277–1286. Available at:
http://dl.acm.org/citation.cfm?id=2145204.2145396 [Accessed
September 25, 2012].
Easley, D. & Kleinberg, J., 2010. Networks, Crowds, and Markets ; Reasoning about
a Highly Connected World., [S.l.]: Cambridge University Press.
Edwards, G., 2010. Mixed Method Approach to Social Network Analysis. ESRC
National Centre for Research Methods Review Paper, NCRM/015.
Erdös, P. & Rényi, A., 1960. On the evolution of random graphs, Akad. Kiad’o.
Available at: http://bolyai.math-inst.hu/~p_erdos/1960-10.pdf [Accessed
September 18, 2012].
Essien, A., 2008. Capoeira beyond Brazil : from a slave tradition to an
international way of life, Berkeley, Calif.: Blue Snake Books.
Fanpagelist.com, 2012. Top 100 Facebook Fan Pages. Fanpagelist.com. Available
at: http://fanpagelist.com/category/top_users/ [Accessed September 15,
2012].
Form, S., 2012. Facebook, Inc. REGISTRATION STATEMENT Under The Securities
Act of 1933. February, 1, p.2010.
Foucault, M., 1977. Discipline and punish the birth of the prison, New York:
Random House.
Green, T.A. & Svinth, J.R., 2010. Martial arts of the world an encyclopedia of history
and innovation, Santa Barbara, Calif.: ABC-CLIO. Available at:
http://www.credoreference.com/book/abcmlarts [Accessed September
20, 2012].
Hansen, D.L., Shneiderman, B. & Smith, M.A., 2010. Analyzing social media
networks with NodeXL insights from a connected world, Amsterdam;
Boston: M. Kaufmann. Available at:
http://www.sciencedirect.com/science/book/9780123822291
[Accessed September 14, 2012].
Herman, C. & Ullyot, T., 2012. Facebook. Available at:
http://www.facebook.com/legal/terms [Accessed October 7, 2012].
Jackson, M.O., 2008. Social and economic networks, Princeton Univ Pr. Available
at:
http://books.google.com/books?hl=en&lr=&id=rFzHinVAq7gC&oi=fnd&p
g=PR11&dq=%22The+Symmetric+Connections+Model+.+.%22+%22Exer
cises+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.%22+&ots=vZkgGWWOhW&si
g=mrXT_J9WHs8qFdEjzPQGxI0LeHQ [Accessed September 18, 2012].
Jones, R., 2011. Keyword intelligence : keyword research for search, social, and
beyond, Hoboken, N.J.; Chichester: Wiley ; John Wiley [distributor].
Kotrlik, J.W. & Higgins, C.C., 2001. Organizational Research: Determining
Appropriate Sample Size in Survey Research. Information Technology,
Learning, and Performance Journal, 19(1), p.43.
Kurose, J.F. & Ross, K.W., 2003. Computer networking : a top-down approach
featuring the Internet, Boston: Addison-Wesley.
Leskovec, J. & Horvitz, E., 2008. Planetary-scale views on a large instant-
messaging network. In Proceeding of the 17th international conference on
World Wide Web. pp. 915–924. Available at:
http://dl.acm.org/citation.cfm?id=1367620 [Accessed September 19,
2012].
Li, C. & Bernoff, J., 2011. Groundswell : winning in a world transformed by social
technologies, Boston: Harvard Business Review Press.
Linoff, G. & Berry, M.J.A., 2011. Data mining techniques : for marketing, sales, and
customer relationship management, Indianapolis, IN: Wiley Pub.
Lipsman, A. et al., 2012. The Power of Like: How Brands Reach and Influence
Fans Through Social Media Marketing. In comScore.
Milgram, S., 1967. The Small World Problem. Psychology Today, 1(1), pp.61–67.
Pariser, E., 2012. The filter bubble : how the new personalized Web is changing
what we read and how we think, New York, N.Y.: Penguin Books/Penguin
Press.
Rieder, B., 2012. Bernhard Rieder - Programming. Available at:
http://rieder.polsys.net/programming/ [Accessed October 6, 2012].
Russell, M.A., 2011. Mining the social web, Beijing; Sebastopol, CA: O’Reilly.
Sande, W. & Sande, C., 2009. Hello world! : computer programming for kids and
other beginners, Greenwich, Conn.: Manning.
Trompeter, F., 2010. How NGOs can use Social Media. Available at:
http://www.un.org/esa/socdev/ngo/docs/2010/Farra.pdf.
Veldhuizen, T.L., 2007. Dynamic multilevel graph visualization. arXiv preprint
arXiv:0712.1549. Available at: http://arxiv.org/abs/0712.1549 [Accessed
September 25, 2012].
Van Wel, L. & Royakkers, L., 2004. Ethical issues in web data mining. Ethics and
Information Technology, 6(2), pp.129–140.
Zuckerberg, M., 2012. One Billion People on Facebook - Facebook Newsroom.
Available at: http://newsroom.fb.com/News/One-Billion-People-on-
Facebook-1c9.aspx [Accessed October 6, 2012].

Kobi Omenaka Social Media Dissertation Practicality of Data Mining for Online Social Network Analysis

The Construction and Analysis of Online Social Networks
Discourse into the use of data mining techniques for the construction and analysis of online communities.
Kobi Omenaka
@00287065
List of Figures
List of Tables
Introduction
Aims and Objectives
A Social Network
Social Network Theory and Social Network Analysis
Data Mining Evaluation and Research Methods
Extracting Data from Social Media Sources
Text mining using software such as Spinn3r
Programming
NodeXL
Netvizz
Gephi
Method
Results and Analysis
Flickr Results
YouTube Results
Twitter Results
Facebook Results
Shrinkage
Capoeira Vibe Case
Ethics
Quantitative vs. Qualitative
Conclusion
Appendix 1: Table to show the Number, Name, Group ID and Number of Nodes as Extracted by Netvizz
Appendix 2: Python GitHub Repositories for Social Media Sites
Appendix 3: Other Web Resources
Appendix 3 Other Network Images
Bibliography

List of Figures
Figure 1 Example of Network Diagram showing the Interactions in a Karate Club (Easley & Kleinberg 2010)
Figure 2 Network Diagram showing the Nodes and Edges on "ARPANET"
Figure 3 Showing the ARPANET network in Relation to its Real World Position in the US
Figure 4 Image showing screenshot of NodeXL starting page
Figure 5 Image showing the NodeXL imports menu. Including the options to report from Flickr, Twitter and YouTube
Figure 6 Image showing the imports menu for a Twitter user network within NodeXL
Figure 7 Screenshot of imported data from NodeXL from YouTube
Figure 8 Image showing the Netvizz User Interface within Facebook
Figure 9 Image showing that the Netvizz data extraction has completed and can be downloaded
Figure 10 Image Showing the Group Joining Process on Facebook
Figure 11 Network graph produced in Gephi showing a Manchester based capoeira group
Figure 12 Network Graph Showing the Keywords Associated with the Term "capoeira" on Flickr
Figure 13 Network Graph created from the "capoeira" tag within YouTube
Figure 14 The Most Frequent Posters of Capoeira Videos on YouTube
Figure 15 Twitter network produced from the hashtag “capoeira” (from 100 tweets)
Figure 16 Twitter network produced from the hashtag “capoeira” (from 1000 tweets)
Figure 17 Network Graph showing 75453 nodes constructed from 241 Capoeira Groups
Figure 18 Image to show the Top 10 most Connected Nodes from the Constructed Capoeira Network
Figure 19 Network Graph Showing of the "Capoeira Vibe" Network Community on Flickr
Figure 20 Capoeira Vibe level 1.0 YouTube Network
Figure 21 Capoeira Vibe Level 2.0 YouTube Network
Figure 22 Capoeira Vibe Twitter Network
Figure 23 Social Technographics Ladder based on U.S. Adults (Li & Bernoff 2011, p.43)
Figure 24 Network Graph of 241 Capoeira Facebook Groups with Labels
Figure 25 Capoeira Vibe Level 1.5 Network on YouTube
Figure 26 Image showing the Giant Component from the Facebook capoeira network. In spite of the different colours depicting different groups each node can be connected to each other. The longest distance between any two nodes is 22. This contains 68448 nodes compared to 75453 in the complete network

List of Tables
Table 1 Key Values from Netvizz Data Extraction of Facebook Capoeira Groups
Introduction

“The web is more a social creation than a technical one. I designed it for social effect - to help people work together - and not as a technical toy. The ultimate goal of the web is to support and improve our weblike existence of the world. We clump into families, associations, and companies. We develop trust across the miles and distrusts around the corner” (Berners-Lee 1999).

This quote from Sir Tim Berners-Lee, the inventor of the World Wide Web, gives a poignant insight into the direction of the research summarised in the following pages. The research is intended as an independent introduction to the world of networks, both online and offline, including key theoretical concepts. It then assesses methods that can be used to construct these networks from online resources via data mining. This is presented in the form of research based on both theoretical and real-world networks. Once the networks have been established, the key questions follow:

• How can this information be used?
• Are there any ethical concerns related to data mining?

These are discussed towards the end of this report, followed by an overall conclusion.

Aims and Objectives

The aims and objectives of this research project are set out below:

1. Introduce basic social network analysis (SNA) and networking theory.
2. Evaluate the use of various data mining techniques to establish social networks. This includes the use of programming as well as the more accessible and freely available forms of SNA software.
3. Use the keyword “capoeira” to construct a social network based on the data mining techniques discussed, and use network analysis theory to evaluate it.
4. Confirm whether a giant component exists within the capoeira network.
5. Establish a new concept, “shrinkage”, within the network of capoeira on SNSs.
6. Discuss the ethics of these techniques.
7. Evaluate how valuable the results are without a qualitative aspect.
8. Case study: evaluate the research techniques with the online capoeira community “Capoeira Vibe”, which has accounts on Facebook, Twitter, Flickr and YouTube.

By nature human beings are social creatures. Any one individual may form part of any number of networks and groups, connected by ties of family, school, university friends, teammates, colleagues and so on. An individual’s role within each group may vary, as will the strength of the ties. It is interesting that Berners-Lee’s vision of the Internet is one of social rather than technical concern, designed, in a sense, to magnify the social experiences that preceded it.

The premise of this research is to investigate social networks on the Internet. In order to do this we need to understand networking theory, which is based on anthropological and socio-geographical research.

A Social Network

A social network is a social arrangement made up of “nodes”, each typically one person, that interact with other nodes. The interactions between nodes, such as family ties, friendship, academic links and business connections, are known as “edges” and give particular networks their structure.
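The node-and-edge description above can be made concrete with a few lines of Python. The following is only a minimal sketch and is not taken from the research itself: the names in `ties` are invented for illustration, and `build_network` is a hypothetical helper that stores each person's neighbours in a set.

```python
from collections import defaultdict


def build_network(edges):
    """Return an undirected network as {node: set of neighbouring nodes}.

    Hypothetical helper for illustration; each edge is a pair of node names.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        # A tie such as friendship is mutual, so record it in both directions.
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


# Invented example data: four people and the ties between them.
ties = [("Ana", "Ben"), ("Ben", "Cara"), ("Ana", "Cara"), ("Cara", "Dev")]
network = build_network(ties)

node_count = len(network)
# Every edge is stored twice (once per endpoint), so halve the total.
edge_count = sum(len(neighbours) for neighbours in network.values()) // 2

print(node_count, edge_count)
```

Storing the network as a mapping from each node to its neighbours keeps the two defining quantities, the number of nodes and the number of edges, easy to recover.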
On a basic level any network can be defined by the number of nodes and by the edges that occur within it. The concept of a social network is not a new one, but it has come to particular prominence in the last decade with the advent of Web 2.0 and the communications made possible via social networking sites (SNSs) on the Internet. Of course, social networks predate the Internet. By its very nature the Internet is perhaps the world’s largest social network and was, in fact, created as a “network of networks” (Kurose & Ross 2003, p.2) that linked educational facilities together. Using the Internet as an oversimplified example of a social network, each computer could be seen as a node, whilst the connections to other computers play the role of the edges.

The Internet has facilitated the formation of connections and, to some extent, the growth of networks. This is because “networks are now freed from the restriction of a geographical location” (Easley & Kleinberg 2010, p.1). As people become more web literate and web services improve, it becomes easier for like-minded people and institutions across the world to find each other. People who were previously isolated can now join networks more easily. “What makes social network sites unique is not that they allow individuals to meet strangers, but rather that they enable users to articulate and make visible their social networks. This can result in connections between individuals that would not otherwise be made” (boyd & Ellison 2007).

It is difficult to determine how the Internet has affected the number of nodes within a particular network. What can be seen, however, are the virtual connections made to real-world networks. An example of this would be the association with brands such as Coca-Cola. The advent of SNSs doesn't necessarily mean there are more Coca-Cola drinkers. They can, however, show their affiliation easily, for example by joining groups or “liking” Coca-Cola’s page on Facebook. SNSs, however, are not restricted to commercial brands: “one finds networks appearing in discussion and commentary on an enormous range of topics” (Easley & Kleinberg 2010, p.1).

At the time of writing Coca-Cola’s Facebook page is at number 10 based on the number of “likes” and is the highest placed tangible product in the list. Also in the top 10 are musicians, sports stars, TV shows and the brands Facebook and YouTube. Unsurprisingly, Facebook comes first in the list of Facebook fan pages (Fanpagelist.com 2012). All of these have something to sell and benefit directly from having contact with their fans on Facebook, Twitter et al. SNSs can be described as “web-based services that allow individuals to construct a public or semi-public profile, articulate a list of other users with whom they share a connection, and view and traverse their list of connections and those made by others within the system” (boyd & Ellison 2007).

It is also easy to support causes such as Greenpeace or Oxfam using “likes” or “tags” on SNSs. Whilst these global names take up the lion’s share of attention, smaller groups and fan pages from community centres, small and start-up businesses, local charities, amateur sports teams and the like should not be ignored. More and more groups of all shapes and sizes are seeing the benefits of integrating SNSs; a common example is the way SNSs are linked from their websites. It is becoming increasingly important for small groups to become involved with activities on SNSs and to be familiar with their social networks. The importance of SNSs and networking lies in the fact that any party involved can initiate dialogues easily.
It is this two-way interaction that differentiates social media from traditional broadcast media such as TV, newspapers and radio.
“individualized messages can simultaneously be delivered to an infinite number of people, and, each of the people involved shares reciprocal control over that content.” (Crosbie 2002).

This presents opportunities to exploit the new medium. Both profit and non-profit organisations can benefit from being in contact with their potential customers and fan base. Social media is about “reaching and connecting people” (Trompeter 2010). Because it is a low-cost way to reach certain audiences, non-profits and commercial groups alike can start to experiment without much financial investment. The use of “likes”, keywords, tags and user forums can make it easier to identify target markets and people who might be interested in supporting causes.

The problem with social media is that it is quite labour-intensive. In a small organisation the people who will be using social media to connect with others may not have the time or the ability to make best use of the medium. For this reason the wisest use of social media would be to identify the potential audience before initiating the conversation. This approach will be more cost- and time-effective and, as a result, better rewards will be reaped. The scattergun approach of traditional media has now, in a sense, been refined to pinpointing.

This, however, is easier said than done. With around 7 billion people on the planet, how do you pinpoint a target audience? Do some people in the target audience have more effect or influence than others? What about the unidentified people who use your products but don’t visibly “like” them on the Internet? According to comScore’s white paper “The Power of Like”, friends of fans on Facebook typically represent a much larger set of consumers, 34 times larger on average for the 100 brands (Lipsman et al. 2012, p.2). These questions might be difficult to answer but are important to consider.
There are clearly economic gains to be made from utilising social media and social
  • 9. Kobi Omenaka @00287065 9 networks. “Simple online economics: on the Internet, traffic equals money”(Li & Bernoff 2011, p.11). This quote taken from Li & Bernoff’s book “groundswell” could be a strapline advocating the use of social media for promotion. In this rapidly changing technological environment establishing and understanding online social identities could prove vital for success. It is important to note that as well as economic gain, these techniques will also apply to academic and socio-economic research. This research will address some of the questions highlighted above by looking further and deeper into how online social networks can be identified and constructed. Once constructed the networks can be analysed using SNA techniques. The purpose of the analysis is to try and determine the size or extents of the networks created and then establish the key people and relationships i.e. “nodes and edges”. The argument is that knowledge and understanding these network qualities can be a key factor in realising organisational goals. By not restricting research to a local area means that and network can be constructed without having to filter for location. This means, however, that it might be more difficult to construct a network that is truly representative. In order to establish meaningful and true relationships on a global level the amount of data collected must be significantly higher, when compared to a local area such as a town or city. This calls extremely robust and accurate data collection techniques. Social Network Theory and Social Network Analysis Social network analysis (SNA) focuses on the study of patterns of connection in a wide range of physical and social phenomena (Hansen et al. 2010). It is argued
  • 10. that analysis of networks is more concerned with social interactions than with the individual qualities of the nodes. Traditional social study emphasised the qualities of the individual, whereas SNA takes an alternative viewpoint. Individuals are seen as being less important than their actual relationships with other nodes in the network. Research has explored everything from foundational physical systems created by chemical and genetic connections and animal food pyramids to distinctly human characteristics such as social cohesion, privacy, markets and trust (Hansen et al. 2010). SNA has been used in the field of disease research, helping to understand how patterns of human contact help or hinder the spread of contagious diseases in a population. SNA techniques have also been used in mass surveillance, helping to determine whether any member of a population is a potential terror threat. The above shows that the uses of SNA are varied and important. SNA has a history dating back to 1954, when social anthropologist J.A. Barnes first coined the term “Social Network” in his journal article “Class and committees in a Norwegian island parish”. The early pioneers of SNA placed much emphasis on the gathering of qualitative and quantitative data by surveys and interviewing techniques. The quantitative approach, as described above, focuses on the relational ties within any particular social network. Qualitative approaches use interviews and open survey questions to generate data to construct social networks. They are concerned more with how those ties and relationship bonds were formed within a network and with the experience of it. This kind of research comes from a more anthropological viewpoint and often focuses on personal networks including communities, neighbourhoods and friendship groups. As a result of the data gathering techniques used, the sample
  • 11. sizes are usually a lot smaller than with quantitative analysis, or the studies may run over extended periods of time. Quantitative methods may involve the collection of vast amounts of data before mathematical equations and algorithms are applied to it. This data can then be processed using software or spreadsheets such as Microsoft Excel to produce network diagrams. This research will focus primarily on the methods for collection and organisation of quantitative SNA data. It will also compare and contrast qualitative versus quantitative data as a means of determining contextual usage. Much of recent SNA has been focused on the visualisation of these social networks, such as Figure 1 below, and applying graph theory to analyse them. It is the use of graph theory that can highlight certain relationships within networks. Figure 1 Example of Network Diagram showing the Interactions in a Karate Club (Easley & Kleinberg 2010) It is worth talking about graph theory before moving to the next sections of this research. This will better explain the results produced from the different methods.
  • 12. In the network above the nodes 1 to 34 are aggregated into cliques that are connected to one another by relations. This positional approach focuses on the pattern of relations in which individuals are involved. A node’s connections to other nodes in the network define its position (Burt 1978). In the karate club shown in figure 1 it can be seen that nodes 1 and 34 represent two key nodes within the relationships of the club, but they are not connected to each other. In graph theory nodes 1 and 34 are not neighbouring nodes as they are not connected by a single edge. They are, however, connected by a “path”, which is simply a chain of nodes linked by edges. A path from node 1 to 34 can be defined as 1 to 14 to 34. Longer paths can be defined; however, the average shortest path between all nodes is often used as a quality of the network. The length of a path is defined by the number of edges between the two nodes in question. This means the path described above has a length of two (Jackson 2008, p.25). Consider how this is important with respect to the spread of contagious diseases. Another quality of a network is how well “connected” it is. A graph is referred to as connected if there is a path between every pair of nodes in the social network. This essentially means that there are no floating sets of nodes and everything is linked together as one mass “component”. Constructed networks aspire to exist as connected networks so that dataflow to all nodes is possible, if not easy. There is no reason, however, to suggest that all networks, especially social networks, should exist as completely connected ones. It is common to process networking data and find separate, distinct components within the data sets. A graph can then be separated into its components as one of its characteristics before going further to describe the characteristics of each component separately. Figure 1 shows an example of a connected network. 
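The path and average shortest path ideas described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the five-node graph and the function names `shortest_path_lengths` and `average_path_length` are hypothetical and are not taken from the karate club data or from any tool used in this research.

```python
from collections import deque

# Hypothetical undirected network stored as an adjacency list.
# This is NOT the karate club data, just a small example.
graph = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}

def shortest_path_lengths(graph, source):
    """Breadth-first search: number of edges from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def average_path_length(graph):
    """Mean shortest path length over all ordered pairs of distinct, reachable nodes."""
    total, pairs = 0, 0
    for source in graph:
        for target, d in shortest_path_lengths(graph, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0

print(shortest_path_lengths(graph, "A")["E"])  # path length from A to E
print(average_path_length(graph))
```

In this toy graph A and E are not neighbours, but they are joined by a path of three edges, in the same way that nodes 1 and 34 are joined by a path of two edges in figure 1.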
In networks that are constructed from large data sets it is common to find sets of distinct components with one large one in particular that is referred to as a “giant component” (Erdös & Rényi 1960, p.55).
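As a rough sketch of how components, and the largest of them, might be identified, the following Python example walks a small hypothetical graph. The graph `toy` and the function name `connected_components` are assumptions made for illustration only; tools such as Gephi report components through their own built-in statistics.

```python
def connected_components(graph):
    """Return the components of an undirected graph, largest first."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        # Depth-first walk collecting every node reachable from `start`.
        component = set()
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical network: one larger group, one pair and one isolated node.
toy = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3},
    5: {6}, 6: {5},
    7: set(),
}

components = connected_components(toy)
giant = components[0]
print(len(components), sorted(giant), len(giant) / len(toy))
```

Here the largest component holds four of the seven nodes. In a real data set the interest lies in how large this share becomes as more edges are added.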
  • 13. To consider how this might work a thought experiment can be performed. Consider a network made up of all the individuals in a university, including the undergraduate students, graduate students, academic staff, administration staff, facility staff etc. An edge is formed between people who know each other by name. When considering this network it is conceivable to think of many groups of people, i.e. components, which it is impossible to link together via a path. This is likely to be the case even if the network is constructed at the end of the academic year. It is conceivable, for example, that the caretaking staff in the science faculties could be disjointed from the admin staff in the geography department. Hence the creation of distinct component parts. The existence of a “giant component” is relatively easy to explain, as the only requirement is for two large components to be joined by just one edge. In the case of universities, students, as they live together, study together, play sports together, socialise together etc., would form a large component in themselves. The students’ contacts with the lecturing staff would bring academic staff and their faculties into this component. The academic staff of course have contact with the administration staff, who are then brought into this component. It becomes difficult to see who wouldn’t be part of this giant component. One common attribute of such “giant networks” is how short the paths are from one node to another. This is referred to as “The Small World Phenomenon” (Milgram 1967). In an experiment conducted in the mid-1960s, Stanley Milgram tested the theory that any two people can be linked by short paths. This is a good early example of data collection in order to produce networking data. 
At the time of the experiment the researchers had a limited budget of $680 (a little over £3000 inflation adjusted for today) and no access to any of the networking data that they would have today. The experiment ran by giving 296 randomly chosen people a letter that had to be forwarded to a target person. Each of the 296 people was given some information about the target, including his address and occupation. They were asked to give the letter to someone they knew personally, with the same instructions to forward it on to the next person. In terms of graph theory each person was a node and each “forwarding of the letter”
  • 14. was referred to as an edge, forming a path from the start person to the target. Of the 296 starting letters, 64 successfully made it to the target. Of the 64 that made it, the average path length was six. This experiment is the source of the well-known phrase “six degrees of separation”. There are many problems with this experiment and its announced results, the first being that 232 letters did not reach the target. In addition, this experiment was conducted in a small area around Boston in the United States. It is certainly not conclusive evidence that there are only six degrees of separation between any two people on the planet. Whilst Milgram’s experiment was not conclusive evidence of the small world phenomenon, it is something that has been tested in numerous ways over the years. When a large data set exhibits a giant component it is often found that the number of steps between nodes is small. Leskovec and Horvitz conducted a more recent experiment, published in 2007, with a larger dataset using Instant Messenger. After analysing the accounts of 240 million active Instant Messenger users they found a giant component consisting of most of the nodes, and that giant component had a mean path length of 6.6 (Leskovec & Horvitz 2008) and a median of 7. This is close to that found by Milgram in the 1960s. The Instant Messenger experiment is an example of the type of research on large-scale networks that has become possible in light of access to large data sets on the Internet. These kinds of data sets were difficult to analyse previously and, with the scales involved, are extremely difficult to process now without computer power. Capoeira as the Basis for Networking Research In order to add value to a particular type of research it is worth considering the distinct reasons why a particular data set might be studied (Easley & Kleinberg 2010). 1. One may care about the subject matter in question. 2. 
The data set is being used as a proxy for related networks that might be difficult to measure.
  • 15. 3. There is a search for properties that appear to be common across many different subjects. This research is focused on capoeira, a martial art from Brazil that combines elements of dance, acrobatics, fighting and music. Capoeira has spread all over the world, with many Brazilian instructors teaching throughout the Americas and on all the continents. In addition it is common for capoeira practitioners to travel outside of their home country to attend events. The spread of capoeira in real terms is large, and the contacts that “capoeiristas” (practitioners of capoeira) have with each other are frequent, varied and strong enough to form links on SNSs such as Facebook (Essien 2008; Green & Svinth 2010). The spread of capoeira itself has been in large part due to the advent of social media and in particular the sharing of videos on sites such as YouTube. The research into the structure of capoeira should be an interesting case study, giving insight into the structure of martial arts, dance, music and national communities and clubs as well as emerging activities such as free running and parkour. In relation to the three points noted above: 1. The subject matter of capoeira is a key interest to the researcher. 2. The results from this can be related to other emerging pastimes and hobbies. 3. It is of interest to see how attributes such as giant components and the small world phenomenon manifest themselves within this network. From a research point of view, the word “capoeira” is a strong keyword. The affiliation with the word “capoeira” is strong and it is likely that only people with more than a passing interest, i.e. practitioners, will search for it, let alone affiliate themselves with it as part of a network. Capoeira is a single word that can be tricky to spell, but those that know it are unlikely to spell it incorrectly. 
It would be hard to accidentally type “capoeira” as a keyword. In spite of each social media service being different, the word “capoeira” should highlight interested participants across all of them. This will identify social networks in
  • 16. each social media service (Jones 2011). One key point worth noticing is that capoeira is a singular term, so the number of variables involved in searching for this keyword is limited. This should vastly reduce the amount of irrelevant, inconsequential data that is collected. It is interesting to conclude this section by considering the earliest example of an online network, i.e. the Internet. The Internet began as ARPANET, which by the end of 1970 was a network of 13 U.S.-based sites, most of them universities and research institutions. With each site considered a node, the networking diagram can be drawn as shown below in figures 2 and 3. Figure 2 Network Diagram showing the Nodes and Edges on "ARPANET" Figure 2 shows a networking diagram that we now know to be connected, as every node is connected by an edge to one or more of the other nodes. This was the intention when ARPANET was first developed: to ensure that in the event of one edge going down, information contained in one node would still be accessible.
  • 17. Figure 3 Showing the ARPANET network in Relation to its Real World Position in the US Figure 3 above shows the same ARPANET network but set against its real location on the US map. This helps to provide context to the networking diagram shown in figure 2. If we viewed the Internet as it stands today it would be extremely difficult to navigate the networking diagram without the aid of computer software. The Internet has grown from its humble beginnings, with just 13 nodes in the US, to an estimated 361 million nodes worldwide at the end of 2011. It would be interesting to see just how large the giant component would be in this network. Data Mining Evaluation and Research Methods This research was inspired by the discovery of research advocating the use of data mining in order to extract information from online sources. Data mining can be defined as: “a business process for exploring large amounts of data to discover meaningful patterns and rules” (Linoff & Berry 2011)
  • 18. The term “business process” within this definition perhaps makes data mining sound “cold-hearted”; however, all organisations, profit making or otherwise, are involved in business activities. Data mining, if performed well, will help increase profit margins as well as help increase donations, active participation and support for charities, and improve academic research. The Internet, and SNSs in particular, fit the definition of “large amounts of data” as given above. The problem is not the amount, but rather the filtering of data to produce meaningful results. Extracting Data from Social Media Sources A number of methods were considered for the data mining parts of this research. The initial ideas ranged from the use of search engines to look for trends and data, through combinations of proprietary software, to tailor-made coding using freely available programming languages. Each method has a different level of usability that relates to the time and cost investment required to produce usable results. It was perceived that the ability to code and tie in with SNSs’ APIs and databases would produce the most meaningful results. An API, or Application Programming Interface, is the hidden side of many large websites and programs and allows users with additional knowledge access to data. A breakdown of the techniques evaluated is given below. Text mining using software such as Spinn3r. As discussed earlier, SNA can be useful for tracking the spread of contagious and infectious diseases. A recent study tracked the word “influenza” in blogs (Corley et al. 2010) using a web service called Spinn3r (http://spinn3r.com/) to crawl and index content
  • 19. “text mining has proven to be more difficult than data mining, as the source data consists of unstructured collections of documents rather than structured databases” (Corley et al. 2010, p.600). Spinn3r is limited to searching and indexing blogs and cannot gather information from SNSs. Text mining software does have its merits and can be used as a key indicator to track trends within particular subjects over time. The time required to produce results that can be used to establish patterns is prohibitive over the course of this research. It was decided at the preliminary stages not to use text mining as a source of information for this research. Programming Resources such as “Mining the Social Web” (Russell 2011) propose that with very little programming ability it is possible to extract data from Facebook, Twitter and LinkedIn. This concept is interesting and provided much of the basis for this research. Mining the Social Web advocates the use of the Python programming language for coding the data extraction. It jumps quickly to the use of example Python code as a means of introducing the subject matter. As the researcher’s programming ability prior to commencing this research was limited, a decision was made early on in conception that a foundation level should be established. The intention was that any data mining code used should be understood to the level that modifications could be made. In the months before this research was started a basic level of programming knowledge was acquired with a combination of self-learning books such as “Hello World” (W. Sande & C. Sande 2009) and established, professional-level online training resources such as “Foundations of Programming: Fundamentals” (Allardice 2011). These helped to establish a good level of
  • 20. understanding of programming, such as the syntax, methodology and coding strategy. After completing these courses and books, progress then moved on to specific coding for data mining in Python. There are various sources of code for data mining in Python which can be used as the basis for research. Books are an important start and help to provide a good grounding but can quickly become out-dated, especially when trying to connect with fast-paced technology such as the Internet. SNSs continually update their APIs, meaning that programs that want to interact with them have to be updated. One way of keeping up is by using social coding repositories such as “GitHub”. GitHub provides a place for programmers to leave their code, which can be updated and modified by other users as they see fit. This higher level of transparency aims to provide the highest quality of code, from a number of different sources, that can constantly evolve. In such places code is constantly changing, being scrutinised and evolving for the better (Dabbish et al. 2012). This research investigated a number of Python GitHub resources, as shown in Appendix 2 at the end. This includes the GitHub resource related to the book “Mining the Social Web” (https://github.com/ptwobrussell/Mining-the-Social-Web). “Mining the Social Web” was published in January 2011. When the book was published it contained details of how to data mine “Google Buzz”, which was Google’s SNS at the time. Google Buzz has now, of course, been replaced by Google+, which provides a good, clear example of how quickly technologies change in this industry. Progress with this route was initially very promising and the understanding of programming after the foundation and self-training period seemed strong. However, despite a good understanding of how programming syntax should work, a lack of experience exposed the weakness in this method of approach. 
Different approaches were used to try to readjust the research and maintain progress so that results could be gained from this method of data mining.
  • 21. Whilst proceeding along this route of investigation, problems were experienced in the following ways: 1. Modifying code in order to interact with APIs 2. Lack of understanding of the social media APIs 3. Online user forums could not clarify problems 4. Suggested methods of visualisation were incompatible with the computing platform During this period progress was slow and discouraging. It became clear that despite understanding how the programming code should work, the lack of experience with programming and with API nomenclature would set too many roadblocks for a project of this scale. The decision had to be made not to continue using Python programming to mine data, in order to continue to try to achieve some of the objectives highlighted. NodeXL The focus turned to other ways to explore and mine online data. It was accepted that other methods might not be able to produce such specific results; however, the stability and reliability that they could provide was vital at this stage. NodeXL is a free downloadable template that integrates with Microsoft Excel. Its advantages compared to core programming are numerous. 1. The interface is a template within Microsoft Excel, so those familiar with spreadsheets should be able to use it. 2. Due to its integration with another program it is very stable to use. 3. Built-in functionality with the Twitter, YouTube and Flickr APIs. 4. NodeXL is open source software, which means that anyone with programming knowledge has the ability to add additional functionality to the software. This should keep the template up-to-date with respect to the APIs but also promises more for the future.
  • 22. 5. The data collected within NodeXL can be exported to other visualisation programs. Once installed, the NodeXL interface is understandable and intuitive. The NodeXL template exists completely as a tab within the newer versions of Microsoft Excel. Clicking the NodeXL tab presents a dedicated “ribbon” complete with the required icons and menus. See Figure 4 below. Figure 4 Image showing screenshot of NodeXL starting page. The ribbon is dominated by the import and export icons that lead to further menus for importing mined data. See figure 5 below, showing the import menu, which includes importing options and several data mining options for Flickr, Twitter and YouTube as discussed previously. On each of these SNSs, searches can be made based on a particular user’s network and on a particular keyword. With the ability to search for a particular user, it was decided to perform a case study on the user name “Capoeira Vibe” that exists on Twitter, Flickr, YouTube and Facebook. The intention is to compare and contrast the networks of this one user across the four main social media networking platforms.
  • 23. Figure 5 Image showing the NodeXL imports menu, including the options to import from Flickr, Twitter and YouTube The method for extracting data from the YouTube, Flickr and Twitter sites is similar. Each is a case of entering the username or keyword of the network to be investigated, the kind of relationships to report and a limit on the number of nodes to inspect. See figure 6 below. One difference is that Twitter imposes a data retrieval rate limit of 350 requests per hour (Hansen et al. 2010, p.152). This sounds like a considerable number but is quickly reached if the network is anything other than small. It should be noted that this restriction is not intended to limit the absolute amount of data collected from Twitter; rather, it restricts bandwidth, as the number of calls for data made to the service is very high. This can have a considerable time impact on the collection of data, with larger networks taking several days to complete during the course of this research. Planning is required if large Twitter networks are to be investigated.
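The effect of the hourly limit on collection time can be estimated with simple arithmetic. The sketch below is illustrative only: the function name `hours_to_collect` is hypothetical, and it assumes that the number of requests needed for a given network is already known, which in practice depends on the import options chosen.

```python
import math

def hours_to_collect(num_requests, requests_per_hour=350):
    """Lower bound, in whole hours, on the time needed under an hourly request limit."""
    return math.ceil(num_requests / requests_per_hour)

# Assumed example: a network that needs 10,000 requests to retrieve.
print(hours_to_collect(10_000))        # hours at 350 requests per hour
print(hours_to_collect(10_000) / 24)   # the same figure expressed in days
```

Under these assumptions a network needing 10,000 requests would take at least 29 hours to retrieve, which is consistent with larger networks taking more than a day to collect.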
  • 24. Figure 6 Image showing the imports menu for a Twitter user network within NodeXL It can be seen in Figure 6 above that there are various options for investigating each SNS. This shows an example of an import based on a Twitter username. The options selected determine the extent of the network, where nodes (referred to as vertices here) can be created just for the followers, for the people being followed, or for both. How the edges are defined during the data gathering can also be determined at this stage. Edges can be added in Twitter for a “following” relationship as well as for responses to tweets between people that don’t necessarily follow each other. The final key point to choose is how “deep” the network should go. A network of level 1.0 looks at the user and the links to their followers and tweet relationships. A network of level 1.5 also establishes the links between the followers themselves, as well as back to the user under investigation. Taking a step further, a network of level 2.0 looks at the networks of the followers of followers. The amount of information requested at this stage, determined by the number of edges and vertices or the level of the network, has a direct impact on the amount of data requested by NodeXL from each social network.
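The difference between network levels 1.0, 1.5 and 2.0 can be illustrated with a short Python sketch. This is an interpretation of the levels as described above and not NodeXL's own implementation: the function name `ego_network` and the edge list are hypothetical, and edges are treated as undirected for simplicity.

```python
def ego_network(edges, ego, level):
    """Select the edges that belong to the ego network at level 1.0, 1.5 or 2.0."""
    if level not in (1.0, 1.5, 2.0):
        raise ValueError("level must be 1.0, 1.5 or 2.0")
    # Direct contacts of the user under investigation.
    neighbours = {b if a == ego else a for a, b in edges if ego in (a, b)}
    selected = set()
    for a, b in edges:
        if ego in (a, b):
            selected.add((a, b))                      # level 1.0: user to contacts
        elif level >= 1.5 and a in neighbours and b in neighbours:
            selected.add((a, b))                      # level 1.5: contact to contact
        elif level == 2.0 and (a in neighbours or b in neighbours):
            selected.add((a, b))                      # level 2.0: contacts of contacts
    return selected

# Hypothetical relationships around a user called "ego".
edges = {("ego", "a"), ("ego", "b"), ("a", "b"), ("b", "c"), ("c", "d")}

for level in (1.0, 1.5, 2.0):
    print(level, len(ego_network(edges, "ego", level)))
```

Even in this tiny example the number of edges grows with each level, which is why the chosen depth has such a strong effect on the volume of data requested.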
  • 25. The amount of data requested is worth noting for two key reasons. 1. The amount of data requested has an effect on the processing required once collected. The data requested should relate to the requirement, as more data takes more time to collect and also more processing power to analyse once gathered. 2. The more information requested, the greater the impact on the time frame for collection. This is particularly relevant to the limitations imposed by Twitter as discussed above. The raw output from NodeXL is a list of the nodes and the relationships between them, but can also include items such as thumbnails, tweets and hyperlinks. Figure 7 below shows a typical screen that is presented after an import has been completed. This depicts an import after a keyword search on YouTube for the term “capoeira”. Presented on the right hand side is a raw version of the network graph before any manipulation. Figure 7 Screenshot of imported data from NodeXL from YouTube NodeXL proves extremely useful for the extraction of data but less useful when analysing the networks produced. One major problem is that during data extraction the visualisation of existing data is not possible. This limits the production and processing of data to a "one way" system, impeding progress.
  • 26. Another problem was that the graphical visualisation and network statistics capabilities of NodeXL did not match those of other software such as Gephi. It was decided to export the files created within NodeXL and manipulate them within Gephi. Netvizz The main omission in NodeXL’s data mining arsenal is the fact that it does not currently link in with Facebook. Whilst Facebook is only one of many SNSs available, it is clearly the most well known and used. During the course of this research the number of monthly active Facebook users surpassed 1 billion (1 000 000 000) (Zuckerberg 2012). Its impact on the social media landscape cannot be ignored. As such, this analysis would have felt incomplete without endeavouring to use data from Facebook as a source for network construction. Research into extracting data from Facebook led to “Netvizz”, which is a small Facebook app. There is little formal information about Netvizz other than that it is a program created and maintained by Professor Bernhard Rieder at the University of Amsterdam (Rieder 2012). The application can be accessed by typing “Netvizz” into the search box within Facebook. Once Netvizz is accessed, a screen such as the one below is presented.
  • 27. Figure 8 Image showing the Netvizz User Interface within Facebook This page shows that Netvizz can be used to extract data from users’ personal networks as well as from the Facebook groups they are members of. Data from Facebook brand pages cannot currently be extracted. Clicking on a group name opens a page showing that the extraction algorithm is running and tracking its progress. A typical page is shown in the image below.
  • 28. Figure 9 Image showing that the Netvizz data extraction has completed and can be downloaded. In order to fulfil the aims and objectives set out earlier in this research it was decided that the data from a number of clubs should be extracted and amalgamated. Users can only extract data for Facebook groups that they are members of, so in order to complete this research a number of capoeira Facebook groups had to be joined. These groups were simply found by typing the word “capoeira” into the Facebook search box and requesting to join. A request to join all groups presented by this simple search was seen as the best way to keep this research impartial and random. The more random the sample taken, the greater the degree of accuracy and confidence that can be had in the results (Kotrlik & Higgins 2001). It is anticipated, however, that as Facebook presents news and information to a particular user based on their friends, a similar presentation will occur when searching for groups. This is governed by Facebook’s “EdgeRank” algorithm (Bucher 2012), which seems to limit the sources and visibility of information received rather than opening up the social media landscape in a
  • 29. panoptic fashion (Foucault 1977). This leads to the real concern that achieving randomness online, and in Facebook in particular, is difficult. This has led to the introduction of the phrase “Filter Bubble”, which relates to how the algorithms and software that govern the Internet actually serve to hide information from users by aiming to provide them with the information deemed most relevant to them (Pariser 2012). Figure 10 Image Showing the Group Joining Process on Facebook The image above gives a typical example of a search page produced after typing the word capoeira into Facebook. There are three distinct states for the groups displayed. 1. Groups that have been found by the Facebook search and have yet to have a join request sent 2. Those where the join request is still pending approval by a group administrator 3. Groups that the user is a member of It should be noted that there are two distinct types of groups on Facebook, open groups and closed groups. Both require approval to join, so it was felt that requesting to join a closed group would be the same ethically as requesting to join an open group, as admission depends on the administrators within the group. Once a join request was sent it could take anywhere between one minute and three months for an approval to be given. Indeed, many club admins approved the join request during the time this research was being performed.
  • 30. The request to join was sent to over 300 capoeira clubs. The number of clubs joined was expected to be limited only by the time constraints of the research period. It was, however, found that Facebook started to restrict the number of groups that were shown with each search. This started to occur once the number of join requests reached around 200 in a relatively short space of time, approximately one week. This is another indication that the group selection process was not as random as it could have been. Once the request to join portion had finished, the next stage was to use Netvizz to extract data from each group. Once the extraction is completed the data can be downloaded as a .gdf file for use within Gephi for analysis. The extraction time can vary, mainly depending on the number of users within a Facebook group. In some cases the extraction fails. This is something that has to be accounted for and accommodated when undertaking analysis of large groups. It is not a labour-intensive process; however, time is required to monitor the groups, and this should be considered for similar undertakings. Gephi Gephi is an open source network manipulation software package. It uses algorithms and tools more prevalent in the gaming industry to filter, visualise and export networks of all sizes. Gephi uses the computer’s graphics card to do processing, which means that other operations can occur whilst graphs are calculated. This does mean, however, that a more powerful graphics-orientated computer is beneficial for optimal usage (Bastian et al. 2009). Gephi does not have the ability to collect data itself. It can, however, import a number of different types of files, including the file types generated by NodeXL and Netvizz as discussed above.
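A .gdf file is plain text, with a node section introduced by a `nodedef>` line and an edge section introduced by an `edgedef>` line. The Python sketch below shows one way such a file might be read outside Gephi. It is a minimal illustration only: the function name `parse_gdf` is hypothetical, and the column names and rows in `sample` are invented, so a real Netvizz export may carry different columns.

```python
import csv
import io

def parse_gdf(text):
    """Split GDF text into a list of node records and a list of edge records."""
    nodes, edges = [], []
    section, columns = None, []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        if row[0].startswith(("nodedef>", "edgedef>")):
            section, first = row[0].split(">", 1)
            # Declarations look like "name VARCHAR"; keep only the column name.
            columns = [c.strip().split(" ")[0] for c in [first] + row[1:]]
            continue
        record = dict(zip(columns, row))
        if section == "nodedef":
            nodes.append(record)
        elif section == "edgedef":
            edges.append(record)
    return nodes, edges

# Invented sample in GDF layout (not real group data).
sample = "\n".join([
    "nodedef>name VARCHAR,label VARCHAR",
    "1,Alice",
    "2,Bob",
    "3,Carla",
    "edgedef>node1 VARCHAR,node2 VARCHAR",
    "1,2",
    "2,3",
])

nodes, edges = parse_gdf(sample)
print(len(nodes), len(edges))
```

Reading the file this way makes it possible to check node and edge counts before loading a group into Gephi.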
  • 31. The Gephi interface, the graphs produced and outputs such as path length mean that meaningful analysis can be performed once a network has been loaded in. As graph theory algorithms are built in to Gephi, there is a short learning curve for those that are well versed in network science. The graph theory and algorithms were built in from the ground up along with the visualisation aspects of Gephi. Those unfamiliar with network analysis but shown a Gephi network graph should be able to understand its key characteristics. Figure 11 below shows a network graph produced in Gephi. Figure 11 Network graph produced in Gephi showing a Manchester based capoeira group The graph shown in figure 11 is produced from a capoeira group, “Capoeira Cordão de Ouro North West UK (Manchester and Liverpool)”, as identified in Facebook. This network graph is a slightly cropped down version of the group with the unconnected nodes (referred to here as leaves) removed. Based on the import of the data into Gephi there are 1001 nodes in the capoeira group. Once the unconnected nodes are removed there are 879 (87.8%) visible.
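The removal of unconnected nodes and the resulting percentage can be checked with a small Python sketch. The node and edge lists are hypothetical and the function name `remove_unconnected` is an assumption; only the figures 1001 and 879 come from the text above.

```python
def remove_unconnected(nodes, edges):
    """Keep only the nodes that appear in at least one edge."""
    connected = {node for edge in edges for node in edge}
    return [node for node in nodes if node in connected]

# Hypothetical group: "d" and "e" have no edges and are dropped.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c")]
kept = remove_unconnected(nodes, edges)
print(kept, len(kept) / len(nodes))

# Share of nodes remaining in the capoeira group discussed above.
print(round(879 / 1001 * 100, 1))
```

The second figure reproduces the 87.8% quoted for the Manchester group.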
  • 32. By looking at the graph, with the leaves removed, it can be seen that there is one large connected network component. It is clear that C.M.Parente is the most connected node, based on the size of the node and its label. It should come as no surprise that C.M.Parente is the instructor of this capoeira group. This capoeira group is extremely well connected, with an average path length of 2.43. This means that, on average, there are fewer than 3 degrees of separation between any two people in this group. This is well below the 6.6 found by research on the instant messaging platform MSN Messenger. From research within this capoeira group it is known that approximately 100 students train regularly across the different locations during the week. So how can the 1001 nodes be explained? The first point to make is that the figure of 100 students relates to the number of students currently training within the group. This group has been running in Manchester since 2002 and will shortly be celebrating its 10-year anniversary. It is difficult to determine when this group was launched on Facebook; however, Facebook itself came to the UK in October 2005 (Form 2012). It would, therefore, be a conservative estimate to say that the group has been on Facebook since 2007: approximately 5 years, or half of the group’s existence in Manchester. In this time many people will have come and gone, so whilst there are currently 100 students training, the number of students that trained over the five-year period could be much larger. The second point to make is that not all of these nodes represent students of the group. Visiting teachers and friends of the instructor make up a lot of them. As discussed earlier, capoeira students travel to events around the country and the world. A large proportion of the 1001 nodes
  • 33. within this group consists of students from other groups who have come to visit and subsequently joined this group on Facebook. A number of the nodes will be people who are perhaps looking to join the group, or who are curious about its activities and want to know more. Many of these nodes will have been removed as leaves and are not represented in the network graph shown in figure 11. Referring back to figure 11, it can be seen that the graph is divided into four distinct areas, depicted by the different colours. The red areas generally depict nodes from unrelated capoeira groups that are based outside. The orange area at the top of the middle represents people from other capoeira groups based within the UK. The large purple section consists of people from related capoeira groups based outside the UK. The remaining green section shows the UK based students, the vast majority of whom have belonged to this Manchester capoeira group over the years.
    Method
    In order to construct the capoeira networks and perform the case study, a combination of NodeXL, Netvizz and Gephi was used. NodeXL was used to extract data from Twitter, YouTube and Flickr, whilst Netvizz was used to extract data from Facebook. The data from both of these programs was then visualised within Gephi. One key feature of Gephi is the ability to append data. This means that data from multiple sources can be combined into one file, with identical nodes and edges counted only once, which prevents double counting. Over time, larger networks can be constructed from smaller groups. This is how the capoeira network was constructed during this research. After requests were sent to over 300 groups, data extraction was attempted for the 245 groups joined. Four of these would not converge.
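The append behaviour described above can be sketched in a few lines of Python. This is not Gephi's implementation; it only illustrates the idea, assuming that each user carries a stable identifier so that the same person appearing in two groups collapses into a single node. The group data and the function name append_group are hypothetical.

```python
def append_group(network, members, edges):
    """Merge one group's members and edges into the combined network."""
    network["nodes"].update(members)
    for a, b in edges:
        # frozenset makes the edge undirected, so (a, b) and (b, a) match.
        network["edges"].add(frozenset((a, b)))


network = {"nodes": set(), "edges": set()}

# Two hypothetical groups that share the member "u3" and the edge u2-u3.
group_one = (["u1", "u2", "u3"], [("u1", "u2"), ("u2", "u3")])
group_two = (["u3", "u2", "u4"], [("u3", "u2"), ("u3", "u4")])

for members, edges in (group_one, group_two):
    append_group(network, members, edges)

print(len(network["nodes"]), "unique nodes,", len(network["edges"]), "unique edges")
```

Six memberships and four edge records go in, but only four nodes and three edges come out, which is the effect that keeps the combined node count below the sum of the group sizes.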
  • 34. The breakdown of the groups is shown in Appendix 1. It can be seen from Table 1 below that there is a total of 121,753 nodes across the 241 groups that converged.
    Results and Analysis
    Flickr Results
    Figure 12 Network Graph Showing the Keywords Associated with the Term "capoeira" on Flickr
    Flickr is an SNS based around the display of photographs. Using NodeXL to perform a keyword search for “capoeira” on Flickr produces a graph that relates the term to the other keywords that most frequently accompany it in uploaded photos.
  • 35. This shows the keywords associated with capoeira, both those directly related to the martial art itself and those related to photography. It can be seen that the capoeira node is the largest and that it has strong ties with Brazil. The fact that the word Brazil features twice, in both its English and Portuguese spellings, highlights this. Other keywords, related not to capoeira but to the medium of presentation, are also prominent. These include Nikon and Canon, which relate to photography in particular. This kind of result is useful for determining marketing strategy or for working out which keywords have the greatest impact online. If societies, charities and companies were looking for keywords to make their presence more pronounced online, this kind of extraction could be a good starting point.
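A keyword network of this kind can be approximated by counting how often pairs of tags appear on the same photo. The Python sketch below shows that counting step only. It is an assumption about how such a graph could be built, not a description of NodeXL's internals, and the photo tag lists are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tag lists, one per photo.
photos = [
    ["capoeira", "brazil", "nikon"],
    ["capoeira", "brasil", "roda"],
    ["capoeira", "brazil", "canon"],
]


def cooccurrence(tag_lists):
    """Count how many photos each unordered pair of tags shares."""
    pairs = Counter()
    for tags in tag_lists:
        # sorted(set(...)) gives each pair once per photo, in a fixed order.
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs


pairs = cooccurrence(photos)
for (a, b), weight in pairs.most_common(3):
    print(a, b, weight)
```

Each pair would become an edge whose weight is the count, so frequently paired keywords such as the two spellings of Brazil end up closely tied to the capoeira node.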
  • 36. YouTube Results
    Figure 13 Network Graph created from the "capoeira" tag within YouTube
    The graph shown in figure 13 is more difficult to navigate than the graph produced from Flickr. Each node here represents a video that people have commented on or linked to, and its label is a string of letters and numbers not readily understandable by humans, e.g. “MFgXmiL78Pg”. When metrics from within Gephi are applied, however, the network becomes more usable. For example, this graph represents 878 videos, and the YouTube names of the most frequent posters can be found, as shown in figure 14 below. These can then be used as targets for further research on YouTube.
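Finding the most frequent posters amounts to tallying uploader names across the video nodes. A minimal Python sketch is given below, assuming each video ID is paired with the name of its uploader. The IDs and names are placeholders, not values from the extracted network.

```python
from collections import Counter

# Hypothetical mapping of video ID to uploader name.
videos = {
    "vid001": "uploader_a",
    "vid002": "uploader_b",
    "vid003": "uploader_a",
    "vid004": "uploader_c",
    "vid005": "uploader_a",
}

# Tally videos per uploader and list the two most frequent posters.
top_posters = Counter(videos.values()).most_common(2)
print(top_posters)
```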
  • 37. Figure 14 The Most Frequent Posters of Capoeira Videos on YouTube
    Twitter Results
    Figure 15 Twitter network produced from the hashtag “capoeira” (from 100 tweets)
  • 38. NodeXL was used to mine data from the last 100 and the last 1000 Twitter users that used the word capoeira in a tweet. The resulting networks are shown in the figures above and below.
    Figure 16 Twitter network produced from the hashtag “capoeira” (from 1000 tweets)
    Edges are created in this case when a tweet is retweeted or replied to. It can be seen that there is a large difference between the 100-tweet and the 1000-tweet network graphs. This is due to enough time passing for people to interact with tweets and so highlight relationships. Of the 11 connected components in the 100-tweet graph, two have a group size of more than three. This is not enough to gauge how big an influence these nodes have in the capoeira network on Facebook.
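The component counts quoted above can be reproduced in principle by treating each retweet or reply as an undirected edge and grouping users that are reachable from one another. The following Python sketch shows one way of doing so. The interaction list and user names are invented, and the approach is a generic connected-components search rather than the specific routine used by NodeXL or Gephi.

```python
from collections import defaultdict

# Hypothetical (user, user) pairs, one per retweet or reply.
interactions = [
    ("ana", "bruno"),
    ("bruno", "carla"),
    ("carla", "dani"),
    ("eva", "fabio"),
]


def components(edges):
    """Return the connected components of an undirected edge list."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, result = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        result.append(component)
    return result


all_components = components(interactions)
large = [c for c in all_components if len(c) > 3]
print(len(all_components), "components,", len(large), "with more than three users")
```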
  • 39. The 1000-tweet graph shows a lot more detail that is useful for social network analysts. There are three sizeable components that are independent of each other. The green nodes indicate Twitter users that have more connections, whilst the pink nodes attached to them show how far each green node’s influence spreads with each tweet.
    Facebook Results
    Figure 17 Network Graph showing 75453 nodes constructed from 241 Capoeira Groups
    The network graph shown in the figure above is the culmination of data mining from Facebook via Netvizz. This is a scaled-up version of the network graph shown in
  • 40. figure 11, which was based on one capoeira group. Graphs of this size are very difficult to analyse visually, and this is where programmes such as Gephi come into their own. In this diagram many different groups, represented by the different coloured areas, can be seen, but even then it can be difficult to distinguish one group from another, or indeed to tell how many groups there are. Each Facebook capoeira user is represented by a circle, the largest circles belonging to the nodes that are most connected. If the labels are included in this diagram it quickly becomes difficult to read. This is shown in the appendices in figure 24. By running a variety of statistical metrics in Gephi, the data being sought can be found. Figure 18 below shows the top 10 most connected nodes within the network above.
    Figure 18 Image to show the Top 10 most Connected Nodes from the Constructed Capoeira Network
    Gephi is also able to confirm the existence of a giant component that consists of 68448 nodes (91% of the population). This is consistent with the network theory introduced earlier. The average path length here is 6.9, which is in agreement with the work performed on Microsoft Messenger by Leskovec and Horvitz.
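The two metrics reported here, the size of the giant component and the average path length, can be illustrated on a toy network with a breadth-first search. The Python sketch below is a plain textbook approach under the assumption of an undirected, unweighted graph. It is not Gephi's code, the edge list is hypothetical, and an exhaustive search like this would be slow on a network of 75453 nodes.

```python
from collections import defaultdict, deque

# Hypothetical network: a chain a-b-c-d plus a separate pair e-f.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f")]

adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)


def bfs_distances(source):
    """Shortest hop counts from source to every node it can reach."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def giant_component():
    """Largest set of mutually reachable nodes."""
    best, seen = set(), set()
    for node in list(adjacency):
        if node not in seen:
            reached = set(bfs_distances(node))
            seen |= reached
            if len(reached) > len(best):
                best = reached
    return best


def average_path_length(nodes):
    """Mean shortest-path length over all ordered pairs within nodes."""
    total = pairs = 0
    for source in nodes:
        for target, d in bfs_distances(source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs


giant = giant_component()
giant_share = len(giant) / len(adjacency)
apl = average_path_length(giant)
print(sorted(giant), round(giant_share, 2), round(apl, 2))

# Share of the capoeira network in the giant component, from the reported counts.
reported_share = round(68448 / 75453, 2)
print(reported_share)
```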
  • 41. Table 1 Key Values from Netvizz Data Extraction of Facebook Capoeira Groups
    Simple Node Count: 121753
    Number of Unique Nodes: 75453
    Number of Groups Joined: 245
    Groups Converged: 241
    Shrinkage: 0.38
    Raw Average Size: 512
    Unique Average Size: 317
    When looking at the list of the most influential users within the capoeira network on Facebook, evidence can be seen that selective algorithms guided group selection. The node list of the large Facebook network shows that C.M.Parente is the most connected node among some 75,000 others. He is followed by C.M.Papa-Leguas. This matches the Manchester capoeira group shown in figure 11. It should be noted that the Manchester capoeira group was one of the first groups to be joined as part of this research: of the 245 groups joined, the Manchester group was the 4th. It was also the first group joined to have a substantial number of members, with 1001 nodes compared to 150, 68 and 106 for the first, second and third groups joined. It is, therefore, conceivable that Facebook’s EdgeRank algorithm “guided” the rest of the capoeira groups that were displayed during the search. In order to assess the impact of this, a separate data collection exercise would have to be performed. This time care would have to be taken to ensure the groups found did not contain the members C.M.Parente and C.M.Papa-Leguas. If their names showed up under similar circumstances, this would indicate that they are well connected within the capoeira community on Facebook as a whole. If,
  • 42. however, they did not show up, or if their presence was much smaller, it could be suggested that EdgeRank had a profound effect on this research.
    Shrinkage
    Table 1 above shows some key data taken during the course of this research. One of the most interesting points to notice is that the number of nodes in the final network diagram (75453) is less than the simple node count (121753). This difference is captured by a concept, shrinkage, that is introduced in this research, as no information could be found about it elsewhere. It is explained by the fact that people are free to join as many groups as they wish on Facebook, limited only by the approval process. This means that the same person may be a member of several groups on Facebook and, as such, will be counted more than once in the raw data. During this research only one Facebook account was used, so in this case that one user is a member of all 245 capoeira groups. The fact that software such as Gephi only counts each user once exposes this “shrinkage”. This is defined here as: “the fraction by which the simple node count exceeds the actual number of nodes”. In this case the simple node count is 121753, which feeds into a graph of 75453 nodes. The shrinkage number is given here by 1 - (75453/121753) = 0.38. A shrinkage number of 0 indicates that each node is a member of only one group and that each group is discrete, whilst a number close to 1 suggests that everyone is a member of everyone else’s group. Having a high or low shrinkage number is not necessarily a good or bad thing, but it is an indicator of the dynamics within the network. If one wanted to transmit a certain piece of information from group to group, it would certainly be easier in networks that consist of groups with higher shrinkage numbers.
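The shrinkage calculation can be expressed as a short Python function. The sketch below follows the definition given above (one minus the ratio of unique nodes to the simple node count) and assumes each group is available as a collection of member identifiers. The function name and the toy groups are hypothetical, while the final figure uses the two counts reported in Table 1.

```python
def shrinkage(groups):
    """Return 1 - (unique members / sum of group sizes)."""
    simple_count = sum(len(set(members)) for members in groups)
    unique_count = len(set().union(*groups))
    return 1 - unique_count / simple_count


# Hypothetical groups: "u3" and "u4" each belong to two groups.
groups = [
    {"u1", "u2", "u3"},
    {"u3", "u4"},
    {"u4", "u5", "u6"},
]

example = shrinkage(groups)        # 8 memberships, 6 unique users
reported = 1 - 75453 / 121753      # counts reported in Table 1
print(round(example, 2), round(reported, 2))
```

Under this definition, a network of G groups with identical membership gives a shrinkage of 1 - 1/G, so the value approaches 1 only as the number of fully overlapping groups grows.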
  • 43. Capoeira Vibe Case
    The final part of the results analysis is a brief case study based on the capoeira community Capoeira Vibe, which exists on all four SNSs. The intention here is to give a real, high-level example of the information that can be gained from the data mining processes in this paper. Facebook currently does not allow fan pages to be mined. It should be noted for balance, however, that Capoeira Vibe has well over 11,000 likes, which would represent its largest network of the four.
    Figure 19 Network Graph Showing the "Capoeira Vibe" Network Community on Flickr
    The Flickr group is the smallest of the Capoeira Vibe networks, followed by YouTube. This may be because these sites focus on the production and posting of material. Based on these findings, the recommendation would be to expand their networks on these sites prior to posting any more material. We have seen that the capoeira network on Facebook has a shrinkage number of 0.38, which means that information should pass quite freely within this network. In order to facilitate the passing of information, i.e. new photos and videos, more links should be made.
  • 44. The Twitter group shown in figure 22 is made up of 1359 nodes; however, they have only 249 followers. This shows that they are not engaging with the audience as much as possible. The advantage of the graph shown in figure 22 is that they can now target key members of the network, and can use things like instant messaging and retweets to prioritise which users to form allegiances with.
    Figure 20 Capoeira Vibe Level 1.0 YouTube Network
  • 45. Figure 21 Capoeira Vibe Level 2.0 YouTube Network
  • 46. Figure 22 Capoeira Vibe Twitter Network
    Ethics
    It can be seen above that large amounts of data can be created, collected and analysed quickly. This has obvious merits when compared to more traditional methods of data collection such as surveying and interviewing, especially when there are time and economic constraints. When data is collected by surveys, however, people are consenting to its usage in this manner. One key concern with data mining is that people allow their data to be used online in one context, but the data is extracted and used for another (Van Wel & Royakkers 2004).
  • 47. When users sign up to an SNS they typically agree to an end user license agreement (EULA). The vast majority of people do not read these and agree in haste simply to start using a service. The first line of Facebook’s EULA is “Your privacy is very important to us” (Herman & Ullyot 2012). It then moves on to explaining how users own their own content but allow Facebook to use it as it wishes. This is where the information from data mining can be exploited. Data mining happens because people, perhaps unwittingly, allow it to happen, maybe due to ignorance or through forgoing privacy simply to be able to use services. Of course, the ability to do something does not mean that it is ethically right. By setting up the EULAs, the social networks effectively absolve themselves from the ethical issues of data protection. The data collected during the course of this research was only restricted by time and, in the case of Twitter, the concern that accessing data might adversely affect their bandwidth. It is clear that those wanting to use data mining are the ones carrying the moral flag. Any information collected during the course of this research has been handled such that private data cannot be exploited. The data collected here has been for the purposes of establishing patterns and structure. It is arguably the next steps following the data mining that pose the largest threat. If data mining were used to establish whom to contact for further research, then it becomes little more than a telephone directory. As the discussion moves into the qualitative phases, the concern with ethics starts to lessen. Here communication is on a much more personal level and individual permission simply has to be granted. Without permission, this can be seen as spam contact, which can be deemed unethical. Data mining from SNSs is still in its infancy and will no doubt continue to grow given its merits. 
Its usage needs to be monitored now to ensure that it is not exploited in a negative way in the future. It is clear that at this stage moral implications are very much in the hands of those extracting the data.
  • 48. Quantitative vs. Qualitative
    The data mining processes and the network graphs and analysis produced are all quantitative. A lot of data can be collected in a short period of time, and this can be used to establish trends and help steer the direction that organisations may follow. The keyword analysis from Flickr, for example, can be used to set up a website's meta tag infrastructure, which is important for search engine optimisation (Jones 2011). With pure quantitative analysis, patterns and trends can be seen and predicted. The problem is that it can be very difficult to understand what is happening and why. In the analysis of the Manchester capoeira group (figure 11), the breakdown of the different regions of the graph was only possible after conducting a short interview with C.M.Parente. Without this it would be very difficult to understand why the communities have clustered in the manner they did. Indeed, the main outcome of the Capoeira Vibe case study is that they now, in the case of Twitter, YouTube and Flickr, have a much clearer picture of the networks around them. They can also, of course, draw information from the “general” network analysis created from Facebook, Twitter and Flickr. In the research performed by Trompeter on social media tactics (Trompeter 2010), a key thread emerged about reaching and connecting with people. Data mining helps to find those people, and the next stage would be to connect with them. In the book “Groundswell”, the authors' research produced the “Social Technographics Ladder”, which is reproduced below.
  • 49. Figure 23 Social Technographics Ladder based on U.S. Adults (Li & Bernoff 2011, p.43)
    Each step higher on the ladder indicates a greater level of involvement in social media. The use of data mining for SNA helps to identify these people and where they are. Qualitative measures can then be used to connect with these people and further understanding, which may yield results. This research will not go further into the discourse of qualitative versus quantitative. The output from this paper has been quantitative, but the two approaches should not be treated as mutually exclusive. There is a line of thought that suggests a mixed-method analysis would be most suitable for SNA. “Network structure is not the whole story… we need to supplement methods of formal network analysis with qualitative observations about what is “going on” within a network.” (Crossley 2010, p.18). “Qualitative approaches add an awareness of context which aids the interpretation of network maps and measures; they add an appreciation of the perception of the network from the inside; and an appreciation of the
  • 50. content of ties in terms of quality, meaning, and changes over time.” (Edwards 2010, p.24) In real cases it is suggested that both methods are used, to the extent that the time, resources and knowledge base within each organisation allow.
    Conclusion
    It can be seen during the course of this research that there are several ways to collect data from SNSs. The type of data collected and the sites chosen to mine should be well considered, as each will return different types of usable information. For those inexperienced with computer programming, software such as NodeXL and Netvizz is a good option for data mining purposes. There are often time and budgetary constraints when conducting market research, but given the short learning curve with these programs even the uninitiated can obtain meaningful results. This ties in with one of the key benefits of marketing by social media, i.e. that it is cost-effective (Trompeter 2010). This is particularly true given that it can be used for research purposes pre-launch, to help establish marketing strategy on sites such as YouTube and Flickr and to find keywords to use online. Once a company or product has been established, these data mining metrics and networking techniques can be used to monitor progress online and plan future steps. When it comes to this research, aims and objectives were presented at the outset of this paper. Each has been addressed to varying levels, constrained by the binding parameters of this research and the conflicting intention of producing a standalone paper. Many of the aims could form the basis of their own research. It is felt in this case, however, that an overview would be more beneficial at this stage.
  • 51. Appendix 1: Table to show the Number, Name, Group ID and Number of Nodes as Extracted by Netvizz
    Capoeira Group Number | Facebook Capoeira Group Name | Facebook Group ID | Number of Members | Nodes from Unmineable Groups
    245 | CAPOEIRA MUZENZA KOREA | 188938991204060 | 298
    244 | Cordao de Ouro Moscow | 217904894908696 | 28
    243 | CAPOEIRA MANDINGA MEXICO | 73554666612 | 403
    242 | Capoeira Meeting Copenhagen | 61563872205 | 302
    241 | ABADA Capoeira UK | 117537624347 | 373
    240 | Malandragem Capoeira | 14463235598199 | 1468
    239 | Ryerson Capoeira Club | 257873730977742 | 510
    238 | Real Capoeira | 87587742228 | 250
    237 | Núcleo Abaeté Rede Anca Capoeira | 385200501490881 | 284
    236 | Grupo Capoeira Brasil - Professora Mariana "Potiguara" | 217949151486 | 379
    235 | Grupo Axé Capoeira | 2248929446 | 1377
    234 | Capoeira sul da bahia CM MAXUEL, Monitor dexter | 382019125175541 | 542
    233 | GRUPOS DE CAPOEIRA EN MEXICO | 136198343113436 | 622
    232 | Grupo Caymã Capoeira | 282348378525310 | 1494
    231 | Capoeira SDB Contra Mestre Maxuel, Profesor Estagiário Eddu.. | 58338428093 | 325
    230 | I love capoeira cdo | 182311485177255 | 487
    229 | CAPOEIRA Y ACONDICIONAMIENTO FISICO BOGOTA | 106388989487731 | 296
    228 | GRUPO BERIBAZU E AMIGOS DA CAPOEIRAGEM | 187972871213976 | 826
    227 | Capoeira Sul da Bahía - C Mestre Maxuel, Prof Estagaria Alena | 36077561796 | 302
  • 52. Kobi Omenaka @00287065 52 CHILE 226 Capoeira Angola in Manila 124826510863308 231 225 Capoeira "Body & Soul" di Mario Collina 253970458014301 554 224 Porao Capoeira Vienna/Austria 130157855172 477 223 Capoeira Karkara 110588072355664 290 222 Capoeira Q Roda 223415177693922 194 221 Axé Capoeira Bursa 148746176508 451 220 CAPOEIRA - Oficina da Capoeira Internacional "Venezuela" 114404815258982 259 219 Capoeira Senzala (professor Palhaco-Belgrade) 41992525089 369 218 Capoeira Friends 102512993162401 325 217 Capoeira Nagô Malta 23503931496 934 216 Capoeira Videos 217041721655963 527 215 Escola Nestor Capoeira 175658252458533 270 214 Casa Da Capoeira Knysna 21790256882 666 213 Comunidade Capoeira 148020055323480 278 212 Capoeira Brooklyn - Raizes do Brasil 12292780174 203 211 Capoeira Malungos Bayonne 39969985608 592 210 Aú Capoeira New Zealand 19690984056 288 209 Cantigas e Documentarios de Capoeira 279741352133380 1104 208 Capoeira (Finland) 6205335078 203 207 Capoeira Polska 473542129336661 515 206 CAPOEIRA AGUA DE BEBER VENEZUELA 41172345168 519 205 Escola Brasileira de Capoeira, Philippines 27691384580 677 204 XANGO Capoeira Australia 7436426726 614 203 SENZALA MACEDONIA-MESTRE PULMAO ( Capoeira Skopje ) 44283313844 Data extraction not converged 4959
  • 53. Kobi Omenaka @00287065 53 202 Capoeira Kościerzyna 175941112541870 270 201 Capoeira Senzala Genève 23039013657 204 200 Ginga Firme Capoeira Makassar 80141919030 384 199 CAPOEIRA MEU DEUS RANCAGUA 253128554776427 307 198 Capoeira Vida 188271654538689 935 197 Capoeira Training Brighton 131365310324937 287 196 Abadá Capoeira Milano 28377257445 438 195 Capoeira Filhos de Angola ~ Lefkada 175698712558150 255 194 Capoeira Cordao de Ouro - South Africa 11284290815 227 193 Filhos De Bimba Escola De Capoeira Lebanon 61878158913 588 192 Capoeira raça 156914091014346 239 191 Capoeira Senzala Scotland 4959664847 356 190 Capoeira Senzala Lara (Venezuela) 56762910082 232 189 Abadá Capoeira Bogotá 174064432704 448 188 FILHOS DE BIMBA ESCOLA DE CAPOEIRA STUTTGART 194794585908 597 187 Capoeira Sobreviventes 2345044102 407 186 Capoeira Cordão De Ouro Costa Rica 136361619738448 337 185 Capoeira Nago Brasil 168077533231588 308 184 Capoeira TRIARTE Genova 113640695314427 227 183 Capoeira Angola Center Finland 11110057026 427 182 Association Swedendê & Capoeira Senzala - Professor Coqueirinho 54700188474 569 181 Capoeira Amazonas Split 16832446806 212 180 Axé Capoeira—Czech Republic 132752446753272 247 179 Grupo Capoeira Males - Burlington 125210507601369 Data extraction not converged 1095 178 Grupo de Capoeira da Angola Istanbul 128728250490174 360
  • 54. Kobi Omenaka @00287065 54 177 Rodas e Eventos de Capoeira em SP 154380194685843 862 176 Gruppo Capoeira San Marino 307586172594387 338 175 i want to learn capoeira (Malaysia) 165425803469895 268 174 Grupo de Capoeira Angola Menino Quem Foi Seu Mestre - London 340454615983173 343 173 Soluna Capoeira 28969814448 738 172 GRUPO CAPOEIRA CHALKIDA 39546508542 400 171 Mundo do Capoeira 343881525679887 1004 170 Grupo Capoeira Brasil- New York: Formanda Colibri 311052822323721 809 169 Abadá-Capoeira ::: Designs 361378350570801 737 168 Capoeiraskolen Senzala 46510288650 283 167 Capoeira Ringsted Grupo Malungos 56416035736 329 166 CAPOEIRA RAÇA E FUNDAMENTO "SERRA NEGRA" PRFº JAÚ 239679632776568 261 165 Escola de CAPOEIRA GINGA CARIOCA 28634226177 175 164 CAPOEIRA om formiddagen på St. Kongensgade 79 329466953808780 190 163 CAPOEIRA 150509121659291 467 162 CAPOEIRA BELGRADE-CMPULMAO 10219317260 1009 161 Capoeira Brasil - Bogotá 13283446207 510 160 Capoeira Puerto Rico Senzala 111639975573687 219 159 Capoeira Aché Brasil Malaysia 135483606485823 292 158 Capoeira Kuwait-‫را‬ ‫وي‬ ‫اب‬ ‫ك‬ ‫ت‬ ‫وي‬ ‫ك‬ 48035048877 279 157 Capoeira Belfast 181860001000 209 156 Capoeira Brasil Hermosillo 174108045980534 230 155 Capoeira Cyprus 297572139539 247 154 Group Capoeira Brasil - New Zealand - 278509383840 654 153 Capoeira Senzala de Santos 225993607412207 469 152 CAPOEIRA 177360938985380 307
  • 55. Kobi Omenaka @00287065 55 151 Capoeira Ijexá 173148365299 213 150 Capoeira Gerais Madrid 143121215744995 1920 149 CAPOEIRA - UFRJ 171081326282080 1144 148 Capoeira Mersin 114280165362620 307 147 Capoeira Malungos Saint Etienne - France 141196022583911 709 146 Capoeira Senzala Montreal 250074501433 216 145 Capoeira Brinquedo de Angola 155219664543731 257 144 Capoeira Raizes do Brasil Milano 33398315109 643 143 CAPOEIRA MANDOU CHAMAR ULM GERMANY 449570235082920 312 142 Capoeira Calédonia, Energia da Bahia - Nouméa - 35682791319 453 141 CapoeirArab 156542164439065 393 140 Capoeira Senzala do Caribe 94715705127 894 139 ASOCIACION ARGENTINA DE CAPOEIRA 8716951310 1713 138 Capoeira in Michigan 2589775177 935 137 Capoeira Angola Israel - ‫קפואירא‬ ‫אנגולה‬ ‫ישראל‬ 297434566996612 709 136 Professor Cebolinha CORDÃO DE OURO CAPOEIRA - Newark-NJ- USA 406607306047829 1203 135 Capoeira Cordão de Ouro - Bonneuil 522128287802986 283 134 Capoeira Conviver 4125218546 310 133 Capoeira Volta ao Mundo brazil sweden 2407568628 283 132 Capoeira Brasil Cayman Islands - Instructor Koé 236552959187 374 131 Capoeira Malungos Paris - Association Senzaleiros 7538366468 436 130 Capoeira BALI 129610587101010 239 129 CAPOEIRA 269552989741226 266 128 ECAM (Escuela de Capoeira y Artes Mixtas) 315767145181306 1173 127 Capoeira in Toronto 2213115016 474
  • 56. Kobi Omenaka @00287065 56 126 CAPOEIRA NATIVOS TUNJA - GARAGOA 263013087148341 274 125 Capoeira Malungos Ambérieu-en-Bugey 405396279522761 129 124 Capoeira ALEGRIA 179497432074997 255 123 Capoeira de Camaçari 375681459155880 443 122 CAPOEIRA GINGA DE MAPUTO 92676097159 654 121 Capoeira 198617790151751 399 120 Capoeira Força Natural 27637657908 246 119 Capoeira Passo a Frente 150336321678724 250 118 Grupo MATUMBÉ Capoeira - BARCELONA-SPAIN 368368470689 673 117 Capoeira Sul da Bahia Vienna Professor David 100311471224 279 116 Mundo Capoeira Türkiye - Mundo Capoeira Turquia 5745007875 1093 115 Capoeira "N" Surf Morocco 124449746100 277 114 Capoeira Estilo Livre 367768209937829 561 113 Capoeira 321810367891948 955 112 CAPOEIRA MUSIC ! 175857075765440 860 111 CAPOEIRA MAROC 13963475108 949 110 Capoeira Mandinga Taiwan 158763090818287 286 109 Capoeiranagô Berlin alemanha Professor Rogérinho 339334099467054 577 108 CAPOEIRA ARTES DAS GERAIS BRASIL MESTRE MUSEU FICAG 116408905097958 304 107 Capoeira Del Bruto Genève 251727598259443 368 106 Capoeira Ache Brasil Whistler 305421630796 248 105 Capoeira India 24905600976 476 104 Capoeira Cordão de Ouro Indonesia 15750918795 1095 103 CAPOEIRA - FORÇA JOVEM (CAMPO GRANDE - MS) 385510051471770 368 102 Capoeira 310291139019831 344 101 Capoeira Angola 132756643447610 247
  • 57. Kobi Omenaka @00287065 57 100 Músicas da Abadá-Capoeira 246876338619 1954 99 Capoeira 182151925232852 437 98 Capoeira 173728919335311 1479 97 Capoeira Natural Do Brasil 57416365662 826 96 Capoeira Angola Center of Mestre João Grande - Oakland 144322595611492 353 95 CAPOEIRA DUBAI 202860073566 609 94 Capoeira Street Rodas Club 88497857586 Data extraction not converged 344 93 Capoeira Lapinha 416658591683370 490 92 Capoeira Filhos Da Bahia Australia 2362794567 1565 91 CAPOEIRA SENZALA NOVI SAD, SERBIA 63798471319 790 90 CAPOEIRA CIDADÃ 131298673593799 320 89 CAPOEIRA MOVIMENTO 224446160984111 301 88 Anauê Capoeira Internacional 136632315120 985 87 Cordao de Ouro Argentina 165325643526339 52 86 CAPOEIRA SUL DA BAHIA KØBENHAVN DANMARK 121672330541 309 85 Capoeira Batuque 49838989530 912 84 Capoeira Nova Era Mexico 113239375415803 376 83 Capoeira Aguascalientes 120202567998906 298 82 Capoeira Bem-Vindo 168608506486901 338 81 Axe Capoeira Türkiye 5464803047 1686 80 capoeira TwisT ponorogo 146387224975 Data extraction not converged 375 79 Capoeira Ballymena 382801128443913 68 78 Capoeira en Querétaro 345555565532083 105 77 Capoeira in armenia 129554777132684 1125
  • 58. Kobi Omenaka @00287065 58 76 Capoeira Senzala Lyon 42847059088 297 75 CORDAO DE OURO COLOMBIA 45725838389 685 74 Capoeira Sul da Bahia - Washington D.C. 289485185402 363 73 CAPOEIRA NOVA ARTE 467158856636578 288 72 Capoeira Rijeka, Jacobina Arte Croatia 47105079044 355 71 Capoeira & break dance 231690590188039 233 70 Capoeira Angola FICA - Bogotá 10742924786 661 69 Capoeira Brasileira Montreal 2627110574 449 68 Capoeira Acrebrasil Milagro 156360181121850 408 67 Capoeira Senzala - Contramestre Steen - Professor Axé Canarinho 7481173828 643 66 Capoeira LDMUNAM (Cabeleira) 146698528752388 386 65 Capoeira Guerreiro Orixas 103209925238 363 64 Capoeira Angola Nottingham 60638711667 339 63 Capoeira 345054108914651 56 62 Capoeira Angola Center Italia 94821955561 368 61 Capoeira Angola Center Sérvia - Contra-Mestre Marquinho 132801996810260 432 60 Capoeira Ioannina - Companhia Pernas Pro Ar 53432940124 716 59 Capoeira De Nazareth Cordao De Ouro Israel 15886395195 699 58 Capoeira Nativos de Minas 4386281365 417 57 Capoeira Raça İtabuna - Bahia - Brasil 386805337998363 776 56 Capoeira Bulgaria 53282511878 720 55 Capoeira Infantil 200096850038709 1646 54 Capoeira Moldova 433028646709900 560 53 Capoeira AMAZONAS - Hrvatska 6074204063 584 52 capoeira surfista - ‫קפוארה‬ ‫סורפיסטה‬ 38643985389 326 51 Capoeira-Music.net 225102747556546 570
  • 59. Kobi Omenaka @00287065 59 50 Capoeira Origens do Brasil UK - Southampton- Bournemouth - Portsmouth 19775543856 528 49 Capoeira Jerusalem ‫קפוארה‬ ‫ירושלים‬ 27033528099 934 48 Capoeira Angola Center of Mestre Joao Grande - New York 33918702846 1119 47 Capoeira Mineira 170275553055302 619 46 cordao de ouro sapporo do japao 162369700451798 12 45 Capoeira Sevgisi 229788467082160 323 44 Capoeira Topazio Rieti 51748450901 861 43 Capoeira Associação Sérvia - Professor Touro Branco e Alunos 189648850746 923 42 Capoeira Bristol- Claudio Campos 2333889907 642 41 Cordao de Ouro Norwich Capoeira 7421940457 268 40 Volta por Cima, Cordao de Ouro, Turku 113084175538 73 39 Capoeira Plantando Dendê 414665775257111 241 38 Capoeirando 313048142084018 437 37 roda 153865851333143 58 36 μπαράκι Μαρίας στις Γούβες (ή αλλιώς πειρατικό ή αλλιώς λεωφορείο ή αλλιώς) 112277615476421 756 35 Contra Mestre Turbina - Norway 232402870103870 407 34 Capoeira Picture 437611929590499 704 33 Capoeira World 353301284719697 1394 32 Capoeira Iowa - CORDÃO DE OURO C. M. Cabeção and Intrutora Tiririca 300009700073805 771 31 Capoeira Birmingham Cordao de Ouro 365929986778646 81 30 Capoeira Malungos Landes et Béarn 278612768888080 1089 29 CAPOEIRA JERICOACOARA 226519449716 203 28 Come and Play 138134466263978 139 27 Negoteta Capoeira -Guardioes Brasileiros 174957519203487 134
  • 60.
    26 Cordao De Ouro Scotland 199800916760341 146
    25 Capoeira Cordao de Ouro Milano -BAMBU- (www.capoeiracdo.it) 190470224308554 483
    24 Capoeira Cordao De Ouro Manchester Children's classes 138546192907722 54
    23 Capoeira i Bergen :) 242864262413466 52
    22 Cordao de Ouro Wirral 216025318428937 44
    21 capoeira cdo Marseille 202338545265 252
    20 Capoeira Malungos Edinburgh - Scotland 100830409973949 252
    19 Capoeira Cordão de Ouro Cheshire 140611855957168 115
    18 Capoeira Ceara 2230017764 580
    17 Afro Ritmo CDO - Capoeira 120447254654796 196
    16 Cordão De Ouro Livorno 118805631471068 530
    15 Cordao de Ouro Athens 42033326349 1076
    14 Capoeira in Lancaster ( UK ) 57711742048 45
    13 Cordão de Ouro Oslo- Instrutor Pirucão 442088410724 205
    12 לוח ההודעות של המרכז הישראלי לקפוארה - Cordão de ouro Israel Mestre Edan 133660927258 736
    11 CAPOEIRA IN CYPRUS "Cordao de ouro Cyprus" 255023815847 1566
    10 Cordão de Ouro Barcelona 54752581524 161
    9 Italia Centro Di Capoeira 163960295258 365
    8 Cordao de Ouro Capoeira Sheffield 2259263842 219
    7 Capoeira Nottingham CDO 47201706570 240
    6 Cordão de Ouro Capoeira Crete 34564813057 1447
    5 Capoeira Batizados & Workshops in Europe 55370608206 894
    4 Capoeira Cordão de Ouro North West UK (Manchester and Liverpool) 2333577653 1001
    3 Capoeira CBF 38593533333 106
  • 61.
    2 University of York Capoeira 32824463126 68
    1 Cordao de Ouro Capoeira Derby 2221308745 150
    Total Number of Nodes: 121753
    Number of Unique Nodes: 75453
    Number of Groups Joined: 245
    Groups converged: 241
    Shrinkage: 0.38
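The reported shrinkage of 0.38 is consistent with the two node counts above if shrinkage is taken to be the fraction of node records that disappear once duplicate members are merged, i.e. 1 - unique / total. That definition is an assumption inferred from the figures, not one stated in this appendix. A minimal Python sketch under that assumption follows; the function name `shrinkage` and the toy group data are hypothetical.

```python
def shrinkage(total_nodes: int, unique_nodes: int) -> float:
    """Fraction of node records removed when duplicates are merged.

    Assumed definition: 1 - unique_nodes / total_nodes.
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    if not 0 <= unique_nodes <= total_nodes:
        raise ValueError("unique_nodes must lie between 0 and total_nodes")
    return 1 - unique_nodes / total_nodes


# Node counts reported in the summary above.
reported = shrinkage(121753, 75453)
print(f"{reported:.2f}")  # prints 0.38

# Hypothetical toy data: two groups that share two members.
groups = {
    "group_a": ["u1", "u2", "u3"],
    "group_b": ["u2", "u3", "u4"],
}
total = sum(len(members) for members in groups.values())  # every membership counted
unique = len(set().union(*groups.values()))               # each member counted once
print(total, unique, round(shrinkage(total, unique), 2))
```

The toy case shows where the difference between the two counts comes from: a member who belongs to several groups is counted once per group in the total but only once in the merged network.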
  • 62. Appendix 2: Python GitHub Repositories for Social Media Sites
    The official online compendium for Mining the Social Web (O'Reilly, 2011)
    https://github.com/ptwobrussell/Mining-the-Social-Web
    A Python library for accessing the Twitter API
    https://github.com/tweepy/tweepy
    Facebook.py is a Python Client Library for the Facebook APIs. The goal is to support all Facebook APIs located at http://developers.facebook.com using only standard Python libraries.
    https://github.com/semyazza/Facebook.py
    Facebook Platform Python SDK
    https://github.com/pythonforfacebook/facebook-sdk
    Python client lib for Facebook's new Graph API
    https://github.com/iplatform/pyFaceGraph/
    Facepy makes it really easy to interact with Facebook's Graph API
    https://github.com/jgorset/facepy
    A micro api client for writing scripts against the Facebook Graph API
    https://github.com/facebook/fbconsole
    Appendix 3: Other Web Resources
    NodeXL Download for Windows
    http://nodexl.codeplex.com/
    Netvizz Facebook App
    http://apps.facebook.com/netvizz/?fb_source=search&ref=ts
  • 63. Gephi Download and Documentation
    http://gephi.org/
    Appendix 3 Other Network Images
    Figure 24 Network Graph of 241 Capoeira Facebook Groups with Labels
  • 64. Figure 25 Capoeira Vibe Level 1.5 Network on YouTube
  • 65. Figure 26 Image showing the Giant Component from the Facebook capoeira network. Although the different colours depict different groups, every node can be reached from every other node. The longest shortest path between any two nodes is 22. The giant component contains 68448 nodes, compared to 75453 in the complete network.
    Bibliography
    Allardice, S., 2011. Foundations of Programming: Fundamentals | Video Tutorial from lynda.com. Available at: http://www.lynda.com/tutorial/83603 [Accessed July 22, 2012].
    Bastian, M., Heymann, S. & Jacomy, M., 2009. Gephi: An Open Source Software for Exploring and Manipulating Networks. In International AAAI Conference on Weblogs and Social Media 2009.
    Berners-Lee, T., 1999. Weaving the Web: the origins and future of the World Wide Web, London: Orion Business.
  • 66. boyd, danah & Ellison, N., 2007. Social Network Sites: Definition, History, and Scholarship. Journal of Computer-Mediated Communication, 13(1), Article 11.
    Bucher, T., 2012. Want to be on the top? Algorithmic power and the threat of invisibility on Facebook. New Media & Society, 14(6).
    Burt, R., 1978. Applied Network Analysis: An Overview. Sociological Methods and Research, 7(2).
    Corley, C.D. et al., 2010. Text and Structural Data Mining of Influenza Mentions in Web and Social Media. International Journal of Environmental Research and Public Health, 7(2), pp.596–615.
    Crosbie, V., 2002. What is New Media?
    Crossley, N., 2010. The Social World of the Network. Combining Qualitative and Quantitative Elements in Social Network Analysis. Sociologica, 4(1).
    Dabbish, L. et al., 2012. Social coding in GitHub: transparency and collaboration in an open software repository. In Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work. pp. 1277–1286. Available at: http://dl.acm.org/citation.cfm?id=2145204.2145396 [Accessed September 25, 2012].
    Easley, D. & Kleinberg, J., 2010. Networks, Crowds, and Markets: Reasoning about a Highly Connected World, [S.l.]: Cambridge University Press.
    Edwards, G., 2010. Mixed Method Approach to Social Network Analysis. ESRC National Centre for Research Methods Review Paper, NCRM/015.
    Erdős, P. & Rényi, A., 1960. On the evolution of random graphs, Akad. Kiadó. Available at: http://bolyai.math-inst.hu/~p_erdos/1960-10.pdf [Accessed September 18, 2012].
    Essien, A., 2008. Capoeira beyond Brazil: from a slave tradition to an international way of life, Berkeley, Calif.: Blue Snake Books.
    Fanpagelist.com, 2012. Top 100 Facebook Fan Pages. Fanpagelist.com. Available at: http://fanpagelist.com/category/top_users/ [Accessed September 15, 2012].
    Form, S., 2012. Facebook, Inc. Registration Statement Under The Securities Act of 1933. February, 1, p.2010.
    Foucault, M., 1977. Discipline and punish: the birth of the prison, New York: Random House.
    Green, T.A. & Svinth, J.R., 2010. Martial arts of the world: an encyclopedia of history and innovation, Santa Barbara, Calif.: ABC-CLIO. Available at: http://www.credoreference.com/book/abcmlarts [Accessed September 20, 2012].
  • 67. Hansen, D.L., Shneiderman, B. & Smith, M.A., 2010. Analyzing social media networks with NodeXL: insights from a connected world, Amsterdam; Boston: Morgan Kaufmann. Available at: http://www.sciencedirect.com/science/book/9780123822291 [Accessed September 14, 2012].
    Herman, C. & Ullyot, T., 2012. Facebook. Available at: http://www.facebook.com/legal/terms [Accessed October 7, 2012].
    Jackson, M.O., 2008. Social and economic networks, Princeton University Press. Available at: http://books.google.com/books?hl=en&lr=&id=rFzHinVAq7gC&oi=fnd&pg=PR11&dq=%22The+Symmetric+Connections+Model+.+.%22+%22Exercises+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.%22+&ots=vZkgGWWOhW&sig=mrXT_J9WHs8qFdEjzPQGxI0LeHQ [Accessed September 18, 2012].
    Jones, R., 2011. Keyword intelligence: keyword research for search, social, and beyond, Hoboken, N.J.; Chichester: Wiley; John Wiley [distributor].
    Kotrlik, J.W. & Higgins, C.C., 2001. Organizational Research: Determining Appropriate Sample Size in Survey Research. Information Technology, Learning, and Performance Journal, 19(1), p.43.
    Kurose, J.F. & Ross, K.W., 2003. Computer networking: a top-down approach featuring the Internet, Boston: Addison-Wesley.
    Leskovec, J. & Horvitz, E., 2008. Planetary-scale views on a large instant-messaging network. In Proceeding of the 17th international conference on World Wide Web. pp. 915–924. Available at: http://dl.acm.org/citation.cfm?id=1367620 [Accessed September 19, 2012].
    Li, C. & Bernoff, J., 2011. Groundswell: winning in a world transformed by social technologies, Boston: Harvard Business Review Press.
    Linoff, G. & Berry, M.J.A., 2011. Data mining techniques: for marketing, sales, and customer relationship management, Indianapolis, IN: Wiley Pub.
    Lipsman, A. et al., 2012. The Power of Like: How Brands Reach and Influence Fans Through Social Media Marketing. In comScore.
    Milgram, S., 1967. The Small World Problem. Psychology Today, 1(1), pp.61–67.
    Pariser, E., 2012. The filter bubble: how the new personalized Web is changing what we read and how we think, New York, N.Y.: Penguin Books/Penguin Press.
  • 68. Rieder, B., 2012. Bernhard Rieder - Programming. Available at: http://rieder.polsys.net/programming/ [Accessed October 6, 2012].
    Russell, M.A., 2011. Mining the social web, Beijing; Sebastopol, CA: O’Reilly.
    Sande, W. & Sande, C., 2009. Hello world!: computer programming for kids and other beginners, Greenwich, Conn.: Manning.
    Trompeter, F., 2010. How NGOs can use Social Media. Available at: http://www.un.org/esa/socdev/ngo/docs/2010/Farra.pdf.
    Veldhuizen, T.L., 2007. Dynamic multilevel graph visualization. arXiv preprint arXiv:0712.1549. Available at: http://arxiv.org/abs/0712.1549 [Accessed September 25, 2012].
    Van Wel, L. & Royakkers, L., 2004. Ethical issues in web data mining. Ethics and Information Technology, 6(2), pp.129–140.
    Zuckerberg, M., 2012. One Billion People on Facebook - Facebook Newsroom. Available at: http://newsroom.fb.com/News/One-Billion-People-on-Facebook-1c9.aspx [Accessed October 6, 2012].