CS 8803 Social Computing
Data Mini-Project
Harish Kanakaraju
Prashanth Palanthandalam

Problem I

Method:
The goal is to analyze the prominence of the people who follow a particular celebrity. The three celebrities analyzed are:
Britney Spears
Mariah Carey
Ashley Tisdale
All three are singers and rank among the top 11 most influential celebrities on Twitter. Britney Spears has close to 7.7 million followers, while Ashley Tisdale and Mariah Carey have approximately 4.3 million each.
Samples of each celebrity's followers were analyzed to find out how many of them were prominent. The prominence of each follower was computed with the formula “No. of followers / No. of following”: the higher the value, the higher the prominence.
We used sample sizes of 1500, 2000 and 3000. For the sample size of 3000, the confidence interval is ±1.8 percentage points at a 95% confidence level, considering the total population of the celebrity's followers.
The initial analysis with a sample size of 1500 was done to find the effect of sample size on the prominence ratio.

Results:

SS = 1500
Prominence Ratio    Mean     Median   SD       Chi square P-value
Britney Spears      0.288    0.056    2.047
Mariah Carey        0.265    0.132    1.383
Ashley Tisdale      0.239    0.115    0.880

SS = 2000
Prominence Ratio    Mean     Median   SD       Chi square P-value
Britney Spears      0.546    0.111    3.067
Mariah Carey        0.289    0.163    1.230
Ashley Tisdale      0.406    0.130    7.007

SS = 3000
Prominence Ratio    Mean     Median   SD       Chi square P-value
Britney Spears      0.493    0.081    3.403
Mariah Carey        0.258    0.154    1.014
Ashley Tisdale      0.348    0.133    5.734

Basic Analysis:
The mean and the standard deviation may swing either way from sample to sample because of outliers. If the sample contains one very prominent person, that person boosts the mean and SD values. The median trend, however, stays the same across all three sample sizes.
Using Median: Mariah Carey has more prominent followers than Ashley Tisdale, and Ashley Tisdale has more prominent followers than Britney Spears.
From Fig 1, we can see that Britney Spears has a relatively high number of low-prominence followers (ratio close to zero). Ashley and Mariah have a large number of followers with a decent prominence value, while Britney has few followers in this region. That is why her median is the lowest among the three.
From Fig 2, we can see that Britney Spears has relatively more very prominent followers than Ashley and Mariah. However, very prominent followers are very few in number compared to the whole population.

R Commands used:
The sequence below was executed for each of the three celebrities:

    library(twitteR)                                   # provides getUser(), followersCount(), friendsCount()
    at4 <- getUser("ashleytisdale")
    at4Fl <- at4$getFollowers(n = 3000)                # sample of 3000 followers
    at4FFl <- sapply(at4Fl, followersCount)            # number of followers of each follower
    at4FFd <- sapply(at4Fl, friendsCount)              # number of accounts each follower is following
    at4Ratio <- mapply("/", at4FFl, at4FFd)            # prominence ratio per follower
    med <- median(sort(at4Ratio))                      # sort() also drops NA/NaN values
    stad <- sd(at4Ratio)
    meanRatio <- mean(at4Ratio)
    at4sum <- sum(at4Ratio)

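The two numerical points made in Problem I (the ±1.8 interval at n = 3000, and the sensitivity of the mean and SD to a single outlier) can be checked with a minimal R sketch. The helper names margin_of_error() and summarise_prominence() and the follower counts are hypothetical, introduced only for illustration; the worst-case proportion p = 0.5 and the dropping of accounts that follow nobody are assumptions, not steps confirmed by the report.

```r
# Margin of error, in percentage points, for a proportion estimated from a
# simple random sample of size n (assumption: worst case p = 0.5; the
# finite-population correction is ignored because N is in the millions).
margin_of_error <- function(n, conf_level = 0.95, p = 0.5) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  100 * z * sqrt(p * (1 - p) / n)
}

# Prominence ratio = followers / following. Accounts that follow nobody give
# Inf or NaN, so (as an assumption) they are dropped before summarising.
summarise_prominence <- function(followers, following) {
  ratio <- followers / following
  ratio <- ratio[is.finite(ratio)]
  c(mean = mean(ratio), median = median(ratio), sd = sd(ratio))
}

# Hypothetical counts, not real Twitter data.
followers <- c(12, 40, 5, 90, 30, 0, 18)
following <- c(200, 310, 95, 400, 150, 0, 120)
base <- summarise_prominence(followers, following)

# Same sample plus one very prominent account (1,000,000 followers, following 50).
with_outlier <- summarise_prominence(c(followers, 1e6), c(following, 50))

print(round(margin_of_error(c(1500, 2000, 3000)), 2))
print(round(rbind(base, with_outlier), 3))
```

For n = 3000 the margin evaluates to roughly 1.8 percentage points, in line with the interval quoted in the Method. In the toy sample the single prominent account inflates the mean and SD by orders of magnitude while the median barely moves, which is the behaviour described in the Basic Analysis.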
Problem II

Method:
The goal is to extract tweets from two different geographic locations in the world and select the tweets that contain the phrase “I want”. The preferences of Twitter users in the two locations are then compared with respect to the phrases “I want a pizza” and “I want to sleep”. The mood of the users on Monday and Friday is also studied, by extracting tweets with the terms “Monday” and “I hate”, and “Friday” and “Thank God”.
The searchTwitter() function of the twitteR package for R, run in RStudio, was used.
The two cities chosen were Seattle, Washington and Southampton, UK.
1000 tweets with the phrase “I want” were extracted within a 20 mile radius of each of the two cities.

    # geocode is a single "latitude,longitude,radius" string
    southamTweets = searchTwitter("I want", n = 1000, geocode = "50.903,-1.40625,20mi")

The list of 1000 tweets is then converted into text form by using the lapply() command.

    southamTweets.text = lapply(southamTweets, function(southampton) southampton$getText())

The grep() command is used to extract occurrences of the term “pizza” in the tweet list; it returns the indices of the matching tweets.

    southamTweets.spec = grep("pizza", southamTweets.text, ignore.case = TRUE)

The procedure is repeated for Seattle:

    seattleTweets = searchTwitter("I want", n = 1000, geocode = "47.606,-122.299,20mi")
    seattleTweets.text = lapply(seattleTweets, function(seattle) seattle$getText())
    seattle.spec = grep("pizza", seattleTweets.text, ignore.case = TRUE)

Variations of the “I want a pizza” phrase have also been tried.

    seattleSpecific.spec = grep("I want pizza", seattleTweets.text, ignore.case = TRUE)

Instead of “pizza”, the tweets containing the term “sleep” or the phrase “I want to sleep” were then selected.

    southamTweetsSleep.spec = grep("sleep", southamTweets.text, ignore.case = TRUE)
    southamTweetsSleepSpecific.spec = grep("I want to sleep", southamTweets.text, ignore.case = TRUE)
    seattleSleep.spec = grep("sleep", seattleTweets.text, ignore.case = TRUE)
    seattleSleepSpecific.spec = grep("I want to sleep", seattleTweets.text, ignore.case = TRUE)
    # stored under a separate name so the previous result is not overwritten
    seattleWantSleep.spec = grep("I want sleep", seattleTweets.text, ignore.case = TRUE)

Another variant of the above experiment was done with the terms “Monday” and “Friday”, combined respectively with the phrases “I hate” and “Thank God”.

    seattleMonday = searchTwitter("Monday", n = 1000, geocode = "47.606,-122.299,20mi")
    seattleFriday = searchTwitter("Friday", n = 1000, geocode = "47.606,-122.299,20mi")
    southamMonday = searchTwitter("Monday", n = 1000, geocode = "50.903,-1.40625,20mi")
    southamFriday = searchTwitter("Friday", n = 1000, geocode = "50.903,-1.40625,20mi")
    southamMonday.text = lapply(southamMonday, function(southampton) southampton$getText())
    southamFriday.text = lapply(southamFriday, function(southampton) southampton$getText())
    seattleFriday.text = lapply(seattleFriday, function(seattle) seattle$getText())
    seattleMonday.text = lapply(seattleMonday, function(seattle) seattle$getText())
    seattleMonday.spec = grep("I hate", seattleMonday.text, ignore.case = TRUE)
    seattleFriday.spec = grep("Thank God", seattleFriday.text, ignore.case = TRUE)
    southamFriday.spec = grep("Thank God", southamFriday.text, ignore.case = TRUE)
    southamMonday.spec = grep("I hate", southamMonday.text, ignore.case = TRUE)

The Chi-Square statistical test was then done on the data obtained, using the chisq.test() command.
The results obtained were plotted using the following commands:

    # with a vector as first argument, rchisq() uses its length as the number of draws;
    # the second argument supplies the degrees of freedom (recycled)
    x <- rchisq(southamFriday.spec, southamMonday.spec)
    hist(x, prob = TRUE)
    curve(dchisq(x, df = 5), col = "green", add = TRUE)
    curve(dchisq(x, df = 10), col = "red", add = TRUE)
    lines(density(x), col = "orange")

Both histogram and density line plots have been used to depict the results.

Result:
Broadly, it was found that the terms “I want” and “pizza” featured together in only six out of 1000 tweets in Seattle, and the single phrase “I want pizza” returned three tweets.
The issue with searchTwitter() is that “I want” is not treated as a contiguous phrase, so the command also returned tweets such as “I really think I want…” or “I don’t think he wants..”
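Because the search itself does not enforce a contiguous phrase, the exact-phrase check has to happen locally on the tweet text. The following is a minimal sketch of that filtering step, not the commands used in the report: the tweet texts are invented stand-ins for the output of getText(), and count_phrase() and count_all_terms() are hypothetical helper names.

```r
# Hypothetical tweet texts standing in for the output of getText().
tweets <- c(
  "I want pizza right now",
  "I really think I want a pizza tonight",
  "I don't think he wants pizza",
  "Pizza night! I just want to sleep afterwards",
  "i want to sleep"
)

# Count tweets containing a literal, contiguous phrase, ignoring case.
# fixed = TRUE treats the phrase as plain text rather than a regular
# expression; fixed matching ignores ignore.case, so both sides are
# lower-cased first.
count_phrase <- function(texts, phrase) {
  sum(grepl(tolower(phrase), tolower(texts), fixed = TRUE))
}

# Count tweets containing every one of several terms, in any order.
count_all_terms <- function(texts, terms) {
  texts <- tolower(texts)
  hits <- lapply(tolower(terms), function(term) grepl(term, texts, fixed = TRUE))
  sum(Reduce(`&`, hits))
}

print(count_phrase(tweets, "I want pizza"))
print(count_all_terms(tweets, c("I want", "pizza")))
print(count_phrase(tweets, "sleep"))
```

In this toy list, only two tweets contain both “I want” and “pizza”, and only one contains the exact phrase “I want pizza”; the tweet with “he wants” is excluded because the phrase “I want” never appears in it contiguously.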
Seattle returned 10 tweets out of 1000 with the term “sleep”. However, “I want to sleep” did not return any results, and “I want sleep” returned just one.
In Southampton, only one tweet out of 1000 expressed the desire to have pizza: there was only one tweet that contained both “I want” and “pizza”, while “I want a pizza” returned no results. It appears that pizza is more popular in cosmopolitan Seattle than in the relatively more conservative Southampton.
The query for the term “sleep” returned 23 tweets in Southampton, and “I want to sleep” returned two, which is marginally higher than the results for Seattle.
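One way to judge whether these city-to-city differences are larger than chance is to compare the counts with an exact test. The sketch below uses Fisher's exact test as a swapped-in technique (the report itself mentions only chisq.test()); compare_cities() is a hypothetical helper, and the sketch assumes that each count is out of exactly 1000 tweets per city.

```r
# Counts reported above: tweets (out of 1000 per city) mentioning each term.
n_tweets <- 1000
counts <- data.frame(
  term        = c("pizza", "sleep"),
  seattle     = c(6, 10),
  southampton = c(1, 23)
)

# Build a 2x2 table (city x match / no match) for one term and run
# Fisher's exact test, which does not rely on large expected counts.
compare_cities <- function(seattle_hits, southampton_hits, n = n_tweets) {
  tab <- matrix(
    c(seattle_hits, n - seattle_hits,
      southampton_hits, n - southampton_hits),
    nrow = 2, byrow = TRUE,
    dimnames = list(city = c("Seattle", "Southampton"),
                    match = c("yes", "no"))
  )
  fisher.test(tab)
}

results <- Map(compare_cities, counts$seattle, counts$southampton)
names(results) <- counts$term

# p-value per term; the odds ratio is > 1 when Seattle has the higher rate.
p_values <- sapply(results, function(r) r$p.value)
print(p_values)
```

With counts this small, the p-values should be read as a rough guide rather than firm evidence, particularly because the 1000 tweets per city also include non-contiguous matches for “I want”.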
In the experiment with tweets posted on Mondays and Fridays, it appears that citizens of both cities rant more on Mondays than they feel thankful on Fridays. The search for “I hate” and “Monday” returned 54 tweets in Seattle, while “Thank God” and “Friday” returned just one, which is surprising. Southampton returned 8 tweets for the former query (Monday), and two for the latter.
Thus, Southampton returns an almost symmetric plot compared to Seattle, where the difference between Monday and Friday is more substantial.
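The Monday/Friday counts can be arranged as a 2x2 contingency table and passed to chisq.test(), the command named in the Method. The sketch below is one possible way to set this up, not necessarily the call the authors ran; the Monte Carlo option and the seed are assumptions added because several expected counts are very small.

```r
# Matching tweets per city and query, as reported in the results.
mood <- matrix(
  c(54, 1,
     8, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(city = c("Seattle", "Southampton"),
                  query = c("Monday + I hate", "Friday + Thank God"))
)

# Share of each city's matches that fall on the Monday query.
monday_share <- prop.table(mood, margin = 1)[, 1]

# Expected counts under independence; several are below 5, so the usual
# chi-square approximation is unreliable for this table.
expected <- suppressWarnings(chisq.test(mood)$expected)

# Monte Carlo p-value (simulate.p.value = TRUE) avoids that approximation.
set.seed(1)  # arbitrary seed, only for reproducibility
test <- chisq.test(mood, simulate.p.value = TRUE, B = 10000)

print(round(monday_share, 3))
print(round(expected, 2))
print(test$p.value)
```

The Monday share is 54 of 55 matches in Seattle and 8 of 10 in Southampton, which puts numbers on the contrast between the two cities described above.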