Crawling Big Data in a New Frontier for Socioeconomic Research: Testing with Social Tagging



  1. Crawling Big Data in a New Frontier for Socioeconomic Research: Testing with Social Tagging. Juan Diego Borrero, Estrella Gualda, University of Huelva. Seminários CIEO, Universidade do Algarve, Faro, 31 October 2012
  2. Table of Contents
     • 1. Introduction
     • 2. Theoretical perspective
       – Web 2.0 and collaborative tagging
       – Tagging and folksonomy
       – The collective knowledge inherent in social tags
       – Tagging and social networks
       – Social Web and its impact on Information Retrieval (IR) and Recommender Systems (RS)
     • 3. Methodology
       – 3.1. Data collection procedure
       – 3.2. Analysis procedure: SNA
     • 4. Results
       – 4.1. Centralization: Authority
       – 4.2. Node Tags: Users producing Tags
     • 5. Discussion
       – 5.1. Centrality and Power
       – 5.2. Central Tags: Users producing Tags
     • 6. Conclusions and future research
  3. 1. Introduction: What puzzles?
     1. The era of Big Data and Social Media has begun! E.g., Twitter, Facebook, Tumblr, Delicious, YouTube, Flickr, Wikipedia…
     2. Will it transform how we study human communication and social relations?
     3. Will it alter what ‘research’ means?
     Some or all of the above?
  4. 1. Introduction: What puzzles?
     1. Big Data is notable not because of its size, but because of its relationality to other data. Big Data is fundamentally networked. Its value comes from the patterns that can be derived by making connections between pieces of data, about an individual, about individuals in relation to others, about groups of people, or simply about the structure of information itself.
     2. Big Data is important because it refers to an analytic phenomenon playing out in academia.
     3. Big Data is important because of its popular salience.
  5. 1. Introduction: Tagging
     • New technologies have made it possible for a wide range of people to produce, share, interact with, and organize data.
     • People can classify the huge amount of information at their disposal in the form of tags.
  6. 1. Introduction: Tagging in Delicious. Keywords freely chosen by users, or suggested by Delicious, employed to annotate various types of digital content.
  7. 1. Introduction: Social Tagging Systems. Many users add metadata in the form of tags; the result is a collective tag structure. Source: the-crowds-in-the-audiovisual-archive-domain/
  8. 1. Introduction: Delicious. Delicious is a free social bookmarking website for storing, sharing and discovering web bookmarks.
  9. 1. Introduction: Our Assumption
     • Big Data offers the humanistic disciplines a new way to work on the quantitative side, and it also offers another kind of objective method for analysis.
     • In reality, however, working with Big Data is still subjective.
     • For this reason, it is crucial to begin asking questions about the analytic assumptions, methodological frameworks, and underlying biases embedded in the Big Data phenomenon.
  10. 1. Introduction: Our Objectives
      1. To propose a methodology for using big data from Web 2.0 in social research,
      2. To apply it to automatically extract data from the Delicious social bookmarking website, and
      3. To show the type of results that this kind of analysis can offer to social scientists.
      4. We focus our study on the globalization-of-agriculture community, and pay special attention to social network analysis (SNA).
  11. 2. Theoretical perspective: Web 2.0… and collaborative tagging
      • Web 2.0 is the business revolution in the computer industry caused by the move to the Internet as platform, and an attempt to understand the rules for success on that new platform (O’Reilly, 2007).
      • Collaborative (or social) tagging is the activity in the Web 2.0 of annotating digital resources with keywords, i.e. tags (Golder and Huberman, 2006; Trant, 2009).
  12. 2. Theoretical perspective: … collaborative tagging
      • Collaborative (or social) tagging is the activity in the Web 2.0 of annotating digital resources (web pages, photos, videos…) with keywords, i.e. tags (Golder and Huberman, 2006; Trant, 2009).
      • A collaborative tagging system is mainly composed of three interconnected components: users, tags, and resources (Smith, 2008).
  13. 2. Theoretical perspective: … collaborative tagging and folksonomy
      • Social tagging systems aggregate the tags of all users and describe the resources in a so-called folksonomy (Vander Wal, 2004).
      • Problems:
        – Synonyms: global warming = climate change
        – Term variations: globalization = globalisation; poor = poors
  14. 2. Theoretical perspective: … folksonomy and collective knowledge. A bottom-up process: the tags of many different users are aggregated, and the resulting collective tag structure (such as a tag cloud) depicts the collective knowledge of Web users (Cress et al., 2012).
  15. 2. Theoretical perspective: Tagging and social networks
      • The structure of social tagging websites can be viewed as a network of three different node types: the users U, the resources R (websites, URLs) and the tags T that the users U deploy to tag the websites R.
      • A particular class of networks is the bipartite network, whose nodes are divided into two sets, e.g. users and tags.
      • An opinion network (Maslov and Zhang, 2001; Blattner et al., 2007) is a network in which users connect to the objects that they gather.
      Figure 1. A bipartite network made of three users U=(u,u’,u’’), three tags T=(t,t’,t’’) and two kinds of links: between users RU (straight lines), and between users and tags RT (dashed lines). Source: Authors
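The bipartite structure described on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the sample pairs are hypothetical, and the one-mode projection (linking two users when they share a tag) is an assumed construction, since the slides do not state how the user-user networks were derived.

```python
from collections import defaultdict
from itertools import combinations


def bipartite_adjacency(pairs):
    """Map each user to the tags they used, and each tag to the users who used it."""
    user_tags = defaultdict(set)
    tag_users = defaultdict(set)
    for user, tag in pairs:
        user_tags[user].add(tag)
        tag_users[tag].add(user)
    return user_tags, tag_users


def project_users(tag_users):
    """One-mode projection (assumption): weight of (u, v) = number of shared tags."""
    weights = defaultdict(int)
    for users in tag_users.values():
        for u, v in combinations(sorted(users), 2):
            weights[(u, v)] += 1
    return dict(weights)


# Hypothetical user-tag links, named after the nodes in Figure 1.
pairs = [("u", "t"), ("u", "t'"), ("u'", "t'"), ("u'", "t''"), ("u''", "t''")]
user_tags, tag_users = bipartite_adjacency(pairs)
user_links = project_users(tag_users)
```

Here `user_links` holds one weighted link for each pair of users who share at least one tag.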
  16. 2. Theoretical perspective: Social web and its impact on Information Retrieval (IR) and Recommender Systems (RS)
      1. Social IR, i.e. IR that uses folksonomies, creates algorithms for folksonomies in order to identify which information is relevant and to identify communities according to their needs. From this point of view, this paper aims to present a methodology to retrieve big data from the Web 2.0 environment.
      2. We introduce social tagging as a basis for recommendations, focusing on the ternary relation between users, resources, and tags, in order to discover latent patterns linked to the activity of collaborative tagging. These patterns could be fundamental to providing effective recommendations to different actors.
  17. 3. Methodology
      • Data set from: Delicious
      • Delicious = social bookmarking system whose:
        – Content is created, annotated and viewed by its users.
        – Classification system is non-hierarchical: users can tag each of their bookmarks on the Delicious website, which provides knowledge about the bookmarked URL.
        – Nature is collective: users can view bookmarks added or annotated by other users, and organize existing tags into groups (tag bundles).
  18. 3.1. Data Collection procedure
      Annotations made in social bookmarking services were collected. Each annotation has at least four parts:
      • 1. Link to the resource (website…)
      • 2. One or more tags
      • 3. User who makes the annotation
      • 4. Moment/time when the annotation is made
      This article focuses on the co-occurrence of users, resources and tags (user, resource, tag).
      Dataset collected: U = {u1, u2, ..., uK}, R = {r1, r2, ..., rM}, and T = {t1, t2, ..., tN}
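As a rough illustration of the data model on this slide, the four-part annotation can be held in a small record and flattened into (user, resource, tag) triples, from which the sets U, R and T follow. The `Annotation` type, the field names and the sample values are hypothetical and are not taken from the authors' programs.

```python
from typing import NamedTuple, Tuple


class Annotation(NamedTuple):
    """One bookmarking event with the four parts listed on the slide."""
    user: str
    resource: str
    tags: Tuple[str, ...]
    timestamp: str


def to_triples(annotations):
    """Flatten annotations into (user, resource, tag) co-occurrence triples."""
    return [(a.user, a.resource, tag) for a in annotations for tag in a.tags]


def component_sets(triples):
    """Derive the sets U (users), R (resources) and T (tags) from the triples."""
    users = {u for u, _, _ in triples}
    resources = {r for _, r, _ in triples}
    tags = {t for _, _, t in triples}
    return users, resources, tags


# Hypothetical sample annotations.
annotations = [
    Annotation("u1", "r1", ("food", "trade"), "2011-04-22T10:00:00"),
    Annotation("u2", "r1", ("food",), "2011-05-01T09:30:00"),
    Annotation("u2", "r2", ("economics",), "2011-05-21T18:45:00"),
]
triples = to_triples(annotations)
U, R, T = component_sets(triples)
```

Each triple corresponds to one tagging, so the length of `triples` is the number of taggings in the sample.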
  19. 3.1. Process to retrieve the data. Figure 2. Data Collection Procedure. Source: Authors
      (A) Start point. Identify the search attributes. An authoritative source is used as a baseline to find keywords connected to the idea of ‘globalization of agriculture’:
        – Wikipedia definition of “critics of globalization” (popular, high reputation)
        – Other starting points (future)
        – Main concepts selected manually (researcher expertise) from the website homepages, tag clouds or topics
        – Identified the 5 seed keywords (globalization + agriculture, food, organic, and GMO)
        – Other concepts rejected
      (B) Web crawling was carried out with a Perl program, gathering the sample of users, URLs and tags:
        – For globalization+agriculture; globalization+food; globalization+organic; globalization+GMO
        – Between 22 April 2011 and 21 May 2011 (one complete month)
        – Results: 10,220 taggings that involved 851 users on 1,077 URLs and 1,720 tags
      (C) A Haskell program reduced the amount of data by cutting the URLs and using keywords, including the identification of synonyms and the elimination of words with capital letters and of derivatives such as plurals.
      (D) Dataset for analysis
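Step (C) can be pictured with a short sketch. It is written in Python for illustration and is not the authors' Haskell program. It assumes that "cutting the URLs" means reducing each URL to its host name, and that synonyms and variants are merged through a hand-built lookup table; the table entries and the sample triples are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical lookup table mapping variants and synonyms to one canonical tag.
TAG_VARIANTS = {
    "globalisation": "globalization",
    "poors": "poor",
    "climatechange": "globalwarming",
}


def normalize_tag(tag):
    """Lower-case the tag, drop spaces, then merge known variants and synonyms."""
    key = tag.strip().lower().replace(" ", "")
    return TAG_VARIANTS.get(key, key)


def cut_url(url):
    """Reduce a URL to its host name (an assumed reading of 'cutting the URLs')."""
    host = urlsplit(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def reduce_triples(triples):
    """Apply both reductions and drop the duplicates they create."""
    return sorted({(u, cut_url(r), normalize_tag(t)) for u, r, t in triples})


# Hypothetical raw (user, URL, tag) triples.
raw = [
    ("user_a", "http://www.example.org/2011/04/food-prices.html", "Globalisation"),
    ("user_a", "http://example.org/2011/05/trade.html", "globalization"),
    ("user_b", "http://www.example.org/opinion", "Climate Change"),
]
reduced = reduce_triples(raw)
```

A lookup table is used rather than automatic stemming because the final tag list keeps pairs such as "economics" and "economy" apart.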
  20. Example: final dataset. 851 users, 526 URLs, 1,700 tags. Source: Authors
  21. Table 1. Keywords used in the topic “Globalization of agriculture”. Source: Authors
      Search attributes (I+II) | Number of resulting tags | More frequent tags used / main tags
      Globalization (I) + agriculture (II) | 1,116 | Food (268), economics (176), environment (145), politics (85), trade (81), sustainability (70)
      Globalization (I) + food (II) | 1,682 | Economy (180), economics (171), environment (122), sustainability (78), politics (60)
      Globalization (I) + organic (II) | 22 | Business (3), fair-trade (3)
      Globalization (I) + GMO (II) | 54 | Food (13), agriculture (12)
  22. 3.2. Analysis procedure: SNA (network analysis)
      • Node centrality: identification of the nodes that are more “central” than others. Network-level property = an idea of the node’s social power based on how well it “connects” to the network.
      • Degree of a node = number of direct connections an individual has with others in the group. Highest degree = exerts influence (or authority).
        – In-degree = number of incoming ties, which reflects the popularity of a website. As a result, the prominent, well-connected members (those with a high degree of centrality) are usually the opinion leaders.
        – Out-degree = number of outgoing ties, which determines whether a particular user is an active or passive participant within the network.
      • Software: Pajek (large data series). A user of the Delicious bookmarking system is simply using Delicious; the analysis looks for latent structures and the power that emerges from the network…
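The two degree measures can be illustrated with a minimal sketch. It assumes each tagging is stored as a directed (user, URL) link; the function name and the sample links are hypothetical, and the authors' own analysis was run in Pajek, not with this code.

```python
from collections import Counter


def degree_counts(links):
    """Return (in-degree per URL, out-degree per user) for directed user -> URL links."""
    out_degree = Counter(user for user, _ in links)
    in_degree = Counter(url for _, url in links)
    return in_degree, out_degree


# Hypothetical user -> URL links.
links = [
    ("user_a", "site-1.example"),
    ("user_a", "site-2.example"),
    ("user_b", "site-1.example"),
    ("user_c", "site-1.example"),
]
in_degree, out_degree = degree_counts(links)
most_cited = in_degree.most_common(1)[0]    # URL with the most inbound links
most_active = out_degree.most_common(1)[0]  # user with the most outbound links
```

Ranking URLs by in-degree and users by out-degree in this way gives the kind of ordering reported in the results.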
  23. Figure 3. Hyperlink Network. Energy Kamada-Kawai Map. Bipartite Network user-URL. Source: Authors, by Pajek
  24. Results 4.1. Centralization (Authority)
      Centralization: user-URL
      • URL’s indegree: sum of total inbound links
      • User’s outdegree: sum of total outbound links
      The network is highly centralized within a few nodes:
      • Only 10 URLs out of 526 (1.90%) account for 32.19% of the links to URLs: these 10 URLs received 3,290 inbound links out of a total of 10,219.
      • Only 10 users out of 851 (1.17%) account for 14.05% of the links to URLs: these 10 users produced 1,436 outbound links out of a total of 10,219.
      The 10 most central websites: nine of them were media-based (online newspapers such as The New York Times, BBC, The Guardian, Washington Post, Financial Times, Reason, The Nation, Spiegel and The Economist) (Table 2).
      Identification of users with a greater degree of centrality: the user Mritiunjoy plays a very important role in the network. Mritiunjoy joined Delicious on 12 March 2007 and to date has 10,020 links and is following 38 users. Mritiunjoy Mohanty is a professor at the Indian Institute of Management Calcutta, India, and his research interests are the political economy of growth and development.
  25. Table 2. Top Authoritative Sites in the hyperlink network. Source: Authors
      Rank | Indegree | Outdegree | User
      1 | 1203 | 433 | /mritiunjoy
      2 | 674 | 195 | /laura208
      3 | 365 | 127 | /rd108
      4 | 186 | 112 | /amaah
      5 | 158 | 111 | /thepouncer
      6 | 154 | 100 | /anilius
      7 | 147 | 100 | /emmarlyb
      8 | 137 | 87 | /adorngeography
      9 | 136 | 86 | /pagolnari
      10 | 130 | 85 | /freemanlc
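The concentration of links in the top 10 nodes can be re-derived from the Table 2 columns and the total of 10,219 links given in the results. The sketch below is only an arithmetic check; the variable and function names are illustrative.

```python
# Degree values copied from Table 2; the total number of links is from the results slide.
TOP10_URL_INDEGREE = [1203, 674, 365, 186, 158, 154, 147, 137, 136, 130]
TOP10_USER_OUTDEGREE = [433, 195, 127, 112, 111, 100, 100, 87, 86, 85]
TOTAL_LINKS = 10219


def share_of_links(degrees, total):
    """Percentage of all links accounted for by the given nodes."""
    return 100 * sum(degrees) / total


url_share = share_of_links(TOP10_URL_INDEGREE, TOTAL_LINKS)
user_share = share_of_links(TOP10_USER_OUTDEGREE, TOTAL_LINKS)
```

The sums of the two columns reproduce the 3,290 inbound and 1,436 outbound links quoted for the top 10 URLs and users.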
  26. Figure 4. User-user Unipartite Network. Energy Kamada-Kawai Map. Degree cut-off = 1. Size: Degree. Source: Authors, by Pajek
  27. Figure 5. User-user Unipartite Network. Energy Kamada-Kawai Map. Degree cut-off = 30. Nodes = 211. Size: Betweenness. Source: Authors, by Pajek
  28. Figure 6. User-user Unipartite Network. Energy Kamada-Kawai Map. Degree cut-off = 30. Nodes = 211. Size: Closeness. Source: Authors, by Pajek
  29. Figure 7. User-user Unipartite Network. Energy Kamada-Kawai Map. Degree cut-off = 30. Nodes = 211. Size: Degree. Source: Authors, by Pajek
  30. Figure 8. Hyperlink Network. 851 users arranged in rank order by number of outbound links and 1,077 URLs arranged in rank order by number of inbound links. Source: Authors
      Why are a few users and websites better connected than the majority?
  31. Value of the identified nodes (websites) is due to:
      • The links that they receive (their instrumental nature)
      • The profile of these organizations (newspapers that channel large quantities of resources, i.e. information) (quality of the links) = central URLs with authority.
  32. Results. 4.2. Node Tags: Users producing Tags
      • Collective tag structure produced with Wordle, excluding the search keywords (globalization, agriculture, food, organic, and GMO).
      • The sizes of the terms in the tag clouds are proportional to their weights; the clouds show the top 25 highest-weighted tags.
      • Tag clouds: identifying the topical groupings in a tag network
        – Identification of topics around globalization of agriculture
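The weighting behind such a tag cloud can be sketched as a frequency count. This is not Wordle's layout algorithm: it only illustrates excluding the seed keywords, keeping the top-k tags, and scaling a display size in proportion to each tag's count. The function name, the `max_size` parameter and the sample tags are hypothetical.

```python
from collections import Counter

# Seed keywords excluded from the cloud, as listed on the slide.
SEED_KEYWORDS = {"globalization", "agriculture", "food", "organic", "gmo"}


def top_tag_weights(tags, k=25, max_size=60):
    """Return (tag, count, size) for the k most frequent non-seed tags.

    The size is proportional to the count; the heaviest tag gets max_size.
    """
    counts = Counter(t.lower() for t in tags if t.lower() not in SEED_KEYWORDS)
    top = counts.most_common(k)
    if not top:
        return []
    heaviest = top[0][1]
    return [(tag, n, max_size * n / heaviest) for tag, n in top]


# Hypothetical sample of tag occurrences.
tags = ["economics"] * 4 + ["environment"] * 2 + ["food"] * 5 + ["Trade"]
cloud = top_tag_weights(tags, k=2, max_size=40)
```

In this sample "food" is dropped as a seed keyword even though it is the most frequent tag.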
  33. Figure 9. Tag Cloud for the Agriculture Globalization Network Identified on the Delicious Data Set. Source: Authors, by Wordle
      The resulting main key topics were economics and the environment: the main keywords used by users in Delicious to describe or characterise the topic ‘globalization of agriculture’.
  34. 50 most frequent tags (tags used more than 20 times)
      Economics 350      | World 47         | BBC 30
      Environment 274    | Global 46        | Future 30
      Sustainability 153 | Capitalism 45    | Geography 30
      Politics 152       | Green 43         | Water 30
      Economy 144        | Research 42      | Nutrition 29
      Trade 131          | Crisis 41        | Government 27
      Business 99        | International 41 | Wto 27
      Poverty 97         | Oil 38           | Agribusiness 26
      Culture 84         | Prices 37        | Ecology 25
      Farming 84         | Activism 35      | Europe 25
      Africa 83          | News 35          | Globalwarming 23
      Health 78          | Science 35       | Reference 22
      Development 76     | Hunger 34        | Technology 22
      Energy 76          | Usa 34           | Biofuel 21
      India 65           | Inflation 32     | Corporations 21
      China 59           | History 31       | Farmers 21
      Policy 55          | Local 31         |
  35. Discussion: 5.1. Centrality and Power
      • In this network of globalization of agriculture in Delicious, The New York Times far surpasses the other URLs (with 1,203 inbound links, followed by the BBC website with 674).
      • The websites most cited, recommended or considered with regard to a topic occupy a central place and play an important role in the dissemination of news, events, trending topics, ideology, culture, etc.
      • Identifying key collective actors (represented here through URLs) allows a better understanding of leadership, influence processes, and power-related structures.
      • For social practitioners, this is a good way to identify key informants in a community through whom to disseminate useful and important information.
      • Very unequal distribution of power among the URLs cited by users in the topic globalization of agriculture: an important accumulation of inlinks.
      ADVANTAGES OF THIS TYPE OF KNOWLEDGE FOR RESEARCHING AND INTERVENING
  36. Discussion. 5.1. Centrality and Power
      • FOCUS ON users: identification of key actors that disseminate and share URLs, such as the previously cited Mritiunjoy
        – Determine where the key elements that structure the network emerge from.
      • Why is ‘that’ actor so important in the network of globalization of agriculture?
        – Key actors in this type of network could configure and reconfigure the evolution of the network (TIME), and structure and even manipulate the type of interchange of resources in Delicious or in similar bookmarking sites.
      • Is it by chance? Do the most prominent actors on a website like Delicious correspond to a profile of very active and participative people? Do they usually work in this area (or have it as a hobby), and is this why they accumulate and tag so many URLs in Delicious?
        – Further steps of the research.
  37. 5.2. Central Tags: Users producing Tags
      • Tags suggested by the website + new tags added in a creative way
      • ‘Tag cloud’: a visual approach to the language used by users
      • Out of a total of 1,700 tags, two words were the main ones.
      • Each user could label a URL with an unlimited number of tags (average 12 tags per user, max 433 and min 2).
      • The most frequently used tags were ‘economics’ (350 citations out of 1,700 tags, 20.6%) and ‘environment’ (273, 16%).
      • Other very frequent tags were sustainability (153), politics (152), economy (144), trade (131), business (99), poverty (97), culture (84), farming (84), africa (83), health (78), and development (76); in relative terms, these 13 tags represent one out of four labelled tags around the topic (25.9%).
      Questions:
      • What are the reasons for the prominence of the first two tags around the globalization of agriculture?
      • Are some of the 1,700 tags found used interchangeably?
        – Why is the word ‘economics’ used sometimes and ‘economy’ at other times?
        – Are they used in the same way when classifying the URLs?
  38. Conclusions: achieved goals
      Presenting this methodology for using big data from Web 2.0 in socioeconomic research, with the illustration from a social bookmarking site (Delicious), is:
      • A first step towards the development of empirical techniques capable of automatically differentiating groups of individuals with common interests, and individuals who occupy a more central position.
      • A first stone in the difficult process of understanding and discovering patterns in the processes that characterize users tagging URLs for collaborative reasons.
      • Utility: discovering latent patterns = providing effective recommendations to different actors.
      • Understanding the community of more than a thousand links.
      • Retrieval and analysis of information: complex, but made easier by working in interdisciplinary teams.
  39. Other topics for research: Future
      • Improvements are necessary in retrieval methods and in the implementation of Information Retrieval and Recommender Systems techniques.
      • Influence of the first tags on the following ones. Role of innovation and creativity in tagging.
      • Evolution and usage of language around an issue over time.
      • Ideological and terminological approaches in the national/international arena.
      • Use of some tags when classifying URLs, and the distinction among users in the way they use some words/tags
        – Distinction between scientists and other professionals or users?
        – Identify users with the same tagging patterns, or URLs that were similarly labelled: study structural equivalences
      • Other possible studies based on retrieving the pages and carrying out content analysis.
      • Why are some labels present/absent?
      • Are there “traditions”/“fashions” in tagging in the Web 2.0?
      • Comparing results from Delicious with those from other social bookmarking sites.
      • Going in depth about users (if possible).
      • Other explorations, other starting points, other bookmarking sites and other indicators, complementary to those used in this illustration.
  40. Possible Applications
      • Producing and manipulating public opinion (when recommending and describing websites) and markets
        – If we know the interests of users belonging to a network, we could also make recommendations.
      • Recommender Systems: the change to a ternary relation between users, resources, and tags is more complex to manage.
      • Important for researchers interested in formulating strategies for intervention and mobilisation; practitioners and companies could also make use of this.
      • Discovering the central elements in a network (users and URLs), together with the tags used by users, could be key to designing future strategies for the dissemination of messages and to achieving more successful communication, for instance by using important keywords to attract more attention.
      • Implementation of Information Retrieval and Recommender Systems techniques in social commerce and social media contexts.
      • Applications in advertising, mobilising, etc.
      • Security, social studies, market studies, consumers
      • Time: longitudinal analysis
      • Etc.