Developing the korean_internet_network_miner_change
Korean Data Analysis Society Fall Conference 2009


Usage Rights

CC Attribution License

Presentation Transcript

  • Developing the Korean Internet Network Miner (KINM): E-research Tool for Social Network Analysis of the Blogosphere in South Korea* Anatoliy Gruzd1), Chung Joo Chung2), Jaeeun (Angela) Yoo3), Han Woo Park (박한우)4). Korean Data Analysis Society Fall Conference, 2009
  • Authors
    1) Anatoliy Gruzd, Assistant Professor, School of Information Management, Dalhousie University, Canada. E-mail: agruzd@gmail.com
    2) Chung Joo Chung, Ph.D. Candidate, Department of Communication, State University of New York at Buffalo, USA. E-mail: idream4you@gmail.com
    3) Jaeeun (Angela) Yoo, B.S. Student, Division of Engineering Science, University of Toronto, Canada. E-mail: angela.yoo@utoronto.ca
    4) Han Woo Park (박한우) (Corresponding Author), Associate Professor, Department of Media and Information, YeungNam University, Korea. E-mail: hanpark@ynu.ac.kr
  • Contents  Introduction  Related Studies: Tools for Blog Network Analysis  Section 1. Development of the Korean Internet Network Miner  Section2. Evaluation of the Name Network Discovery Algorithm  Conclusion Page  3 2009년 한국자료분석학회 가을철 학술대회 료분석학회 가을철 학술대회 View slide
  • Introduction: The growing adoption of e-research tools has led to changes in social and communication research (Jankowski, 2009). Typical technologies in this domain include LexiURL (Thelwall, 2009), the Virtual Observatory for the Study of Online Networks (Ackland, 2009), and the Internet Community Text Analyzer (ICTA; Gruzd, 2009a).
  • Introduction: In contrast to the e-research developments in North America and Europe, Soon and Park (2009) note that digital tools to support e-research are rare in Asia, even in South Korea (Internet World Stats, 2008). Thus, we attempt to develop an e-research tool for the automatic discovery of online communication networks on the Korean Web.
    ● The first section reviews prior studies on large-scale blog network analysis using automatic tools.
    ● The second section illustrates the process of developing our analytic tool, the Korean Internet Network Miner (KINM).
  • Related Studies: Tools for Blog Network Analysis. The structure of networks can be measured mathematically and visualized graphically. The shape of a network emerging from online users' writing and linking choices reflects interest trends. Research on blog networks has confirmed that large-scale online communities are structurally reflected in higher-density network neighborhoods formed through linking (Kelly and Etling, 2008).
  • Related Studies: Tools for Blog Network Analysis. Traditional data mining research
    ● Traditional data mining research focuses largely on algorithms for inferring association rules and other statistical correlation measures in a given data set (Kumar et al., 1999; Jung, 2009).
    For example:
    ● Kelly and Etling (2008) used the research firm Morningside Analytics for blog selection and data mapping, along with the Fruchterman-Reingold "physics model" algorithm, to understand the blog networks of the Iranian blogosphere.
    ● Gryc and his colleagues (2008) developed categories for the blog networks they studied by analyzing keywords, post classification, and linking patterns of blogs.
  • Related Studies: Tools for Blog Network Analysis. Current research
    ● The current research uses a web-based system for automated text analysis to discover and understand social networks from blog data.
    ● It focuses not only on chain networks (social networks based on the number of messages exchanged between individuals) but also on name networks (social networks built by mining personal names and nicknames).
  • Section 1. Development of the Korean Internet Network Miner. As an initial framework for the KINM, we used some of the social network discovery and visualization tools and techniques previously developed by Gruzd (2009). These tools and techniques were developed as part of ICTA, a general-purpose web system for content and network analysis of computer-mediated communication in English (available at http://textanalytics.net).
    Barriers:
    1. Since ICTA only works with texts in English, we had to modify all text processing functions to support Korean texts.
    2. ICTA requires the data to be stored in a machine-readable format such as an RSS feed. However, after a manual examination of a number of Korean blogs, we noticed that the majority of them do not provide RSS feeds for their comments data.
  • Section 1. Development of the Korean Internet Network Miner. We were especially interested in analyzing comments data because comments contain most of the social interactions on a blog, which makes them a good source for mining social connections among blog readers. To address the lack of RSS feeds for comments, we created a script using the Kapow Mashup Server (http://www.kapowtech.com) to retrieve comments from a selected blog automatically and output them as an RSS feed.
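The Kapow script itself is not shown in the slides. As a rough illustration only, the sketch below shows how already-scraped comments could be written out as a minimal RSS 2.0 feed in Python. The function name `comments_to_rss` and the comment fields (`author`, `text`, `date`) are hypothetical and are not part of KINM or Kapow.

```python
import xml.etree.ElementTree as ET


def comments_to_rss(blog_title, blog_url, comments):
    """Serialize scraped comments as a minimal RSS 2.0 feed (illustrative only)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = blog_title
    ET.SubElement(channel, "link").text = blog_url
    ET.SubElement(channel, "description").text = "Comments retrieved from " + blog_url
    for comment in comments:
        item = ET.SubElement(channel, "item")
        # RSS 2.0 defines <author> as an e-mail address; a nickname is
        # stored here only to keep the sketch short.
        ET.SubElement(item, "author").text = comment["author"]
        ET.SubElement(item, "description").text = comment["text"]
        ET.SubElement(item, "pubDate").text = comment["date"]
    return ET.tostring(rss, encoding="unicode")
```

A downstream tool that expects RSS input could then read one `<item>` per comment, with the commenter and the comment body kept in separate fields.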
  • Section 1. Development of the Korean Internet Network Miner. After the blog data was retrieved, it was processed to build two types of networks.
    ● First, a chain network was extracted. In the chain network, one commentator is connected to another if the first commentator directly replied to the second commentator by clicking on the "reply-to" button.
    ● However, after manually examining a number of comments on several blogs, we found that some comments are not "reply-to" comments but still address or reference a previous poster. This observation is in line with a previous empirical study of online learning communities by Gruzd (2009a), which found that the chain network misses on average 40% of possible connections.
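The chain network described above can be sketched in a few lines of Python. This is a minimal illustration, not the KINM implementation: the comment fields (`id`, `author`, `reply_to`) are assumed names, and skipping self-replies is an assumption made here for simplicity.

```python
from collections import Counter


def build_chain_network(comments):
    """Return weighted directed edges: (replier, replied_to_author) -> count."""
    author_of = {c["id"]: c["author"] for c in comments}
    edges = Counter()
    for c in comments:
        parent = c.get("reply_to")
        if parent is None or parent not in author_of:
            continue  # not a "reply-to" comment, so no chain tie
        target = author_of[parent]
        if target != c["author"]:  # assumption: ignore replies to oneself
            edges[(c["author"], target)] += 1
    return edges


comments = [
    {"id": 1, "author": "A", "reply_to": None},
    {"id": 2, "author": "B", "reply_to": 1},
    {"id": 3, "author": "C", "reply_to": 1},
    {"id": 4, "author": "A", "reply_to": 2},
]
chain_edges = build_chain_network(comments)
```

A comment that addresses someone in its text without using the "reply-to" button produces no edge here, which is exactly the kind of connection the chain network misses.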
  • Section 1. Development of the Korean Internet Network Miner. Name Network:
    ● Instead of relying only on information about who replied to whom, the name network method starts by automatically finding all mentions of personal names or nicknames in comments and uses them as nodes in a social network.
    ● Next, to discover ties between nodes, the method connects the sender of a comment to all names found in his/her comment. (A more detailed description of this method can be found in Gruzd (2009b).)
    ● Although the name network approach provides additional information about connections among blog commentators, it has its own challenges. One is distinguishing between a personal name/nickname and a word that merely appears to be one.
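A deliberately naive sketch of the name network idea follows. It assumes that the set of known nicknames is simply the set of comment authors and that a mention is any substring match; both are simplifications chosen for illustration and are not the algorithm described in Gruzd (2009b). The sample comments are invented.

```python
from collections import Counter


def build_name_network(comments):
    """Connect each comment's sender to every known nickname found in its text."""
    known_names = {c["author"] for c in comments}
    edges = Counter()
    for c in comments:
        for name in known_names:
            # Naive substring matching: no tokenization or disambiguation.
            if name != c["author"] and name in c["text"]:
                edges[(c["author"], name)] += 1
    return edges


comments = [
    {"author": "너구리", "text": "좋은 글 감사합니다"},
    {"author": "사람", "text": "너구리님 말씀에 동의합니다"},
    {"author": "jihwaja", "text": "많은 사람이 이 글을 읽었으면 합니다"},
]
name_edges = build_name_network(comments)
```

The second comment yields a genuine tie (사람 addresses 너구리), while the third yields a spurious tie from jihwaja to 사람, because 사람 appears there as the ordinary noun "people". Naive matching of this kind therefore needs a disambiguation step.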
  • Section 1. Development of the Korean Internet Network Miner. For example, the algorithm marked the word 사람 (people) as a reference to another person on the blog. This happened because at least one comment in the dataset was posted by a person with the nickname "사람".
    <Figure 1> Sample comment
    However, in the sample comment, this word does not refer to another online participant; it is used as a noun that means "people".
  • Section 1. Development of the Korean Internet Network Miner. Name Network: Another good example of the name/nickname disambiguation problem in comments is the word "2mb".
    1. It can be used as a nickname by one of the blog commentators.
    2. It could refer to an amount of computer memory (2 megabytes).
    3. It could be an alias of the current Korean president, Lee Myung-Bak.
    To address these challenges and develop recommendations for the next generation of the name network discovery algorithm, we conducted a semi-automated analysis of all names/nicknames discovered in a sample dataset using our initial algorithm.
  • Section 2. Evaluation of the Name Network Discovery Algorithm. To evaluate our automated approach for analyzing communication networks from blog comments, and specifically the accuracy of the name network discovery algorithm, we selected a single blog authored by 방짜 (bangzza) at http://blog.ohmynews.com/bangzza
  • Section 2. Evaluation of the Name Network Discovery Algorithm. OhMyNews was ranked as one of the top three web sites in terms of blog users in 2009 (Rankey.com, 2009). It is frequently ranked as the most popular blog site in Korea, registering over 20 million page views per day during the presidential election. Users of OhMyNews, known as "news guerrillas", contribute news articles to the web site. OhMyNews allows individuals in far-flung locations to come together, share, and build strong ties and a sense of community, united in ideology even if separated by geographic distance, that fosters a true grassroots movement (Streitmatter, 2001).
  • Section 2. Evaluation of the Name Network Discovery Algorithm. For our tests, we retrieved and analyzed a sample set of 943 comments (posted between April 2008 and April 2009) from the selected blog. In the study, we relied on an interactive tag cloud feature available in KINM to explore and evaluate all names and nicknames that were found automatically (see Figure 2).
    <Figure 2> An interactive tag cloud showing the 50 most frequently used name/nickname candidates found in the sample dataset
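The ranking behind a tag cloud like the one in Figure 2 can be approximated with a simple frequency count. The sketch below is an assumption about how such a list could be produced, not a description of the KINM feature; `top_candidates` and the sample mentions are hypothetical.

```python
from collections import Counter


def top_candidates(candidate_mentions, n=50):
    """Return the n most frequent name/nickname candidates with their counts."""
    return Counter(candidate_mentions).most_common(n)


# One entry per automatically detected mention (invented sample data).
mentions = ["2MB", "너구리", "2MB", "사람", "2MB", "너구리"]
ranked = top_candidates(mentions, n=2)
```

In a tag cloud, the count of each candidate would typically control its font size.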
  • Section 2. Evaluation of the Name Network Discovery Algorithm. The evaluation procedure involved clicking on each word found by the name network algorithm and exploring the context in which each instance of the word was used (see Figure 3). The purpose of this semi-automated analysis was to discover which name/nickname candidates were identified incorrectly and why.
    <Figure 3> A list of messages containing "2MB"
  • Section 2. Evaluation of the Name Network Discovery Algorithm. The first set includes clues suggesting that a word is likely to be a nickname:
    ● a word candidate is followed by a context word such as "님" = an honorific or "씨" = Mr./Ms.; other possibilities, although rare, include "군" = Mr. for younger males or "양" = Miss/Ms. for younger females at the end of a word candidate, and "미스터" = Mr. or "미스" = Miss at the beginning;
    ● a word candidate contains a combination of characters (Korean, English and/or Chinese), symbols (e.g., underscores, hyphens) and numbers;
    ● a word candidate appears to be a real name, which is almost always three characters: a single-character last name followed by a two-character first name, which may be found in a dictionary of first names and/or of common characters used in them;
    ● a word candidate is a less common, non-topic word (e.g., "너구리" = raccoon);
    ● a word candidate is followed by punctuation indicating that someone is being addressed (e.g., "/" or ":");
    ● a word candidate contains patterns indicative of non-native words: phonetic Koreanization of English (e.g., "미디어몽골" = mediamogul = Media Mogul) or phonetic romanization of Korean (e.g., "jihwaja" = 지화자).
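Some of the clues above lend themselves to simple pattern checks. The sketch below covers only a subset (honorifics, title prefixes, mixed character classes, the three-character name shape, and addressing punctuation). All names in it are hypothetical, and the three-syllable test is a crude stand-in for the name dictionary mentioned in the slide, so it will also flag ordinary three-syllable words.

```python
import re

HONORIFIC_SUFFIXES = ("님", "씨", "군", "양")
TITLE_PREFIXES = ("미스터", "미스")
HANGUL = re.compile(r"[가-힣]")
LATIN = re.compile(r"[A-Za-z]")
DIGIT_OR_SYMBOL = re.compile(r"[0-9_\-]")
THREE_SYLLABLE_NAME = re.compile(r"^[가-힣]{3}$")


def nickname_clues(candidate, following=""):
    """List the positive clues matched by a candidate.

    `following` is the text that comes immediately after the candidate
    in the comment.
    """
    clues = []
    if following.startswith(HONORIFIC_SUFFIXES):
        clues.append("honorific")
    if candidate.startswith(TITLE_PREFIXES):
        clues.append("title_prefix")
    classes = sum(
        bool(p.search(candidate)) for p in (HANGUL, LATIN, DIGIT_OR_SYMBOL)
    )
    if classes >= 2:
        clues.append("mixed_characters")
    if THREE_SYLLABLE_NAME.match(candidate):
        clues.append("three_syllable_name")
    if following[:1] in ("/", ":"):
        clues.append("address_punctuation")
    return clues
```

For instance, `nickname_clues("너구리", "님 말씀")` matches the honorific and three-syllable clues, while `nickname_clues("대한민국")` matches none. How the clues should be weighted or combined is left open here.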
  • Section 2. Evaluation of the Name Network Discovery Algorithm. The second set includes clues suggesting that a word is NOT likely to be used as a nickname:
    ● a word candidate is a phrase, for example when the nickname input (the "FROM" field) is used more like a subject line (possible indicators include white space and length);
    ● a word candidate consists of a single character (e.g., "a" or "ㄱ");
    ● a word candidate consists of netspeak, including emoticons (e.g., "=_="), slang and abbreviations (e.g., using "2MB" to refer to the current Korean president), and onomatopoeia (e.g., "ㅉㅉ" = tsk tsk, "ㅋㅋ" = heehee, "하하" = haha, "음" = hmm);
    ● a word candidate appears more than once in the comment;
    ● a word candidate consists of random characters (e.g., "ㅁㄴㅇㄹ" or "asdf");
    ● a word candidate is a short, conversational word or phrase (e.g., "나" = me, "아이고" = oh no, "그래서" = so/therefore);
    ● a word candidate is a common word or idea in the given context/topic (e.g., "대한민국" = Republic of Korea, "쥐체사상" = a newly created word used to refer to political fanatics).
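The negative clues can be sketched the same way. The word lists below contain only the examples given in the slide and are placeholders; the length threshold, the function name, and the jamo-only test for random characters are assumptions. Keyboard mashing in Latin letters such as "asdf" is not covered by this sketch.

```python
import re

NETSPEAK = {"2MB", "ㅉㅉ", "ㅋㅋ", "하하", "음", "=_="}
CONVERSATIONAL = {"나", "아이고", "그래서"}
TOPIC_WORDS = {"대한민국", "쥐체사상"}
JAMO_ONLY = re.compile(r"^[ㄱ-ㅎㅏ-ㅣ]+$")  # isolated consonants/vowels only


def non_nickname_clues(candidate, comment_text="", max_length=15):
    """List the negative clues matched by a candidate (illustrative subset)."""
    word = candidate.strip()
    clues = []
    if " " in word or len(word) > max_length:
        clues.append("phrase")
    if len(word) == 1:
        clues.append("single_character")
    if word.upper() in NETSPEAK:
        clues.append("netspeak")
    if word and comment_text.count(word) > 1:
        clues.append("repeated_in_comment")
    if len(word) > 1 and JAMO_ONLY.match(word) and word.upper() not in NETSPEAK:
        clues.append("random_characters")
    if word in CONVERSATIONAL:
        clues.append("conversational")
    if word in TOPIC_WORDS:
        clues.append("common_topic_word")
    return clues
```

A candidate such as "2mb" is flagged as netspeak regardless of case, whereas "너구리" in "너구리님 안녕하세요" triggers no negative clue. In practice the word lists would have to be built from the dataset rather than hard-coded.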
  • Conclusion: This research briefly reviewed some of the studies related to large-scale blog analysis using automatic tools and described the process of developing our own analysis tool, KINM. The main goal of KINM is to automate the process of finding communication networks in the Korean blogosphere that accurately represent social interactions among blog readers. To find these networks, the system relies on a set of text mining techniques to look for personal names and nicknames in users' comments.
  • Conclusion: To address some of the challenges associated with the automated discovery of names and nicknames in Korean texts, this paper also presented an exploratory study of a sample dataset. The study suggests a set of additional rules to improve the accuracy of the current name/nickname discovery algorithm used in KINM. These additional rules will be incorporated into KINM and evaluated in a subsequent study.
  • Thank you.