Management and analysis of social media data

A case study based on Sina Weibo

Published in: Technology, Business

Management and analysis of social media data

  1. 1. Management and analysis of social media data
      A case study based on Sina Weibo
      Weining Qian
      Center for Cloud Computing and Big Data
      East China Normal University
      wnqian@sei.ecnu.edu.cn
      database.ecnu.edu.cn
  2. 2. Outline
      • Social media
      • Data
        ◦ Data collecting
        ◦ Modeling microblogs
      • Management
        ◦ Schema
        ◦ Queries
      • Data generator: On-going work
      • Benchmarking social media data analytical queries
      • Applications
  3. 3. What is social media?
      A group of Internet-based applications that build on the ideological and technological foundations of Web 2.0, and that allow the creation and exchange of user-generated content.
      Andreas M. Kaplan, Michael Haenlein. “Users of the world, unite! The challenges and opportunities of Social Media”. Business Horizons 53(1), 2010.
  4. 4. Why social media?
      Sense the world!
  5. 5. Financial index and mood on social media
  6. 6. Financial index and mood on social media
  7. 7. Why a case study based on Sina Weibo?
      • “Real-world” data (valuable for universities)
      • Related to many real applications
      • (Relatively) easy to get the data
      • Big data?
        ◦ Unstructured data
        ◦ Time-evolving data
        ◦ Fast arriving (if we crawl the data on-line)
        ◦ Low quality (abbreviations, smileys, typos, multiple languages, . . . )
      • Intuition helps (everyone understands social media nowadays!)
  8. 8. Outline (repeated; next section: Data collecting)
  9. 9. Data collecting: Distributed crawler
  10. 10. Data: Gradually updating
      Followship network
      • Seed users: 11 lawyers and opinion leaders and 21 researchers
      • 2nd level users from seeds: 120,000+ users
      • 3rd level users from seeds: 1.7+ million users
      • 4th level users from seeds: 18+ million users (incomplete)
      • More than 1 billion following relationships
      Tweets from 1.6+ million users
      • From Aug. 2009 to Jun. 2012
      • 480+ million tweets (about 51.11% of them are retweets; the others are original tweets)
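The "levels from seeds" on this slide suggest a breadth-first expansion of the follow graph. The following is a minimal sketch of that idea, not the crawler used in the study: `users_by_level` and the `followees_of` callback are hypothetical names, seeds are counted as level 1, and expanding along followee edges (rather than follower edges) is an assumption.

```python
def users_by_level(seeds, followees_of, max_level):
    """Group users by the level at which a breadth-first expansion first reaches them.

    seeds        -- iterable of seed user ids (treated as level 1)
    followees_of -- hypothetical callable: uid -> iterable of uids that uid follows
    max_level    -- deepest level to expand to
    Returns {level: set of uids first reached at that level}.
    """
    levels = {1: set(seeds)}
    seen = set(seeds)
    frontier = set(seeds)
    for level in range(2, max_level + 1):
        reached = set()
        for uid in frontier:
            for friend in followees_of(uid):
                if friend not in seen:  # keep only the first level a user appears at
                    seen.add(friend)
                    reached.add(friend)
        if not reached:  # graph exhausted before max_level
            break
        levels[level] = reached
        frontier = reached
    return levels


# Toy follow graph: s follows a and b, a follows b and c, c follows s.
toy_graph = {"s": ["a", "b"], "a": ["b", "c"], "b": [], "c": ["s"]}
toy_levels = users_by_level(["s"], lambda uid: toy_graph.get(uid, []), 4)
```

In a real crawl `followees_of` would call a rate-limited API, which is one plausible reason the deepest level is reported as incomplete.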
  11. 11. Data: Two dimensions
      • Timeline
      • Followship network
  12. 12. Outline (repeated; next section: Modeling microblogs)
  13. 13. The challenge of modeling
      What do we expect?
  14. 14. The challenge of modeling
      External events
  15. 15. The challenge of modeling
      Bursts/tipping points
  16. 16. Modeling
      It’s difficult to model a long-term time series in social media
      • Affected by external events
      Is it possible to model the life-cycle of a single tweet?
      To predict its
      • retweet path
      • #retweet
      • impression
  17. 17. Various measurements
      [Figure, left panel: log-log plot of Frequency against #Retweet/#Hashtag; series: #Hashtag, #Retweet]
      [Figure, right panel: bar chart over #Retweet bins [1,10), [10,100), [100,1000), [1000,); series: The Percentage of #Tweet, The Percentage of #Tweet*#Retweet]
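The binned measurement on this slide can be reproduced from a list of per-tweet retweet counts. The sketch below is one reading of the two series, taken as an assumption: "The Percentage of #Tweet" as the share of tweets in each bin, and "The Percentage of #Tweet*#Retweet" as each bin's share of the total retweet volume. The function names are hypothetical.

```python
from collections import Counter

# Retweet-count bins as labeled on the slide; None means "no upper bound".
BINS = [(1, 10), (10, 100), (100, 1000), (1000, None)]


def bin_label(lo, hi):
    return f"[{lo},{hi})" if hi is not None else f"[{lo},)"


def retweet_shares(retweet_counts):
    """Return {bin label: (percent of tweets, percent of retweet volume)}.

    Tweets with zero retweets fall outside every bin and are ignored.
    """
    tweets = Counter()
    volume = Counter()
    for n in retweet_counts:
        for lo, hi in BINS:
            if n >= lo and (hi is None or n < hi):
                label = bin_label(lo, hi)
                tweets[label] += 1
                volume[label] += n
                break
    total_tweets = sum(tweets.values())
    total_volume = sum(volume.values())
    if total_tweets == 0:
        raise ValueError("no tweet has at least one retweet")
    return {
        bin_label(lo, hi): (
            100.0 * tweets[bin_label(lo, hi)] / total_tweets,
            100.0 * volume[bin_label(lo, hi)] / total_volume,
        )
        for lo, hi in BINS
    }


shares = retweet_shares([0, 1, 2, 5, 20, 200, 2000])
```

On skewed data the two columns diverge sharply: most tweets sit in the lowest bin while a few heavily retweeted tweets carry most of the volume.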
  18. 18. The life-cycle of a tweet
  19. 19. Sigmoid function: S-Curve
      $$F(x) = \frac{N}{1 + a \cdot e^{-b(x - c)}}$$
      [Figure: y against x for the parameter settings (a=100, b=0.2), (a=1000, b=0.2), (a=100000, b=0.2), (a=1000, b=0.1), (a=1000, b=0.3)]
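A short sketch of the S-curve on this slide, F(x) = N / (1 + a·e^(−b(x−c))), may help when reading the parameter plot. The function name `s_curve` and the default `c = 0` are assumptions; the slide does not state which values of N and c were used for the plotted curves.

```python
import math


def s_curve(x, N, a, b, c=0.0):
    """Evaluate F(x) = N / (1 + a * exp(-b * (x - c))).

    N -- saturation level (the value approached as x grows)
    a -- controls how late the curve takes off
    b -- controls how steep the rise is
    c -- horizontal shift
    """
    return N / (1.0 + a * math.exp(-b * (x - c)))


# With N = 100: a larger `a` delays the rise, a larger `b` sharpens it.
early = s_curve(50, N=100, a=100, b=0.2)
late = s_curve(50, N=100, a=100000, b=0.2)
```

At x = c the curve passes through N / (1 + a), and it approaches N as x grows, which matches the reading of N as the final popularity of a tweet.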
  20. 20. Modeling tweets popularity with S-Curve
  21. 21. Bursts of a tweet (and its retweets)
  22. 22. Tipping points
      1. $\Gamma(t + \varepsilon) - \Gamma(t) > \kappa$
      2. $\Gamma(t) - \Gamma(t - \varepsilon) < \kappa$
      3. $\Gamma(t + \varepsilon) - \Gamma(t) > \mu \cdot (\Gamma(t) - \Gamma(t - \varepsilon))$
      4. $\Gamma(t + \varepsilon) - \Gamma(t) > N / \log(N)$
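The four tipping-point conditions compare growth just after time t with growth just before it. The sketch below evaluates each condition separately, because the slide does not say how they are combined. It assumes Γ(t) is the cumulative number of retweets up to time t and that log is the natural logarithm; neither is stated on the slide, and the function name is hypothetical.

```python
import math


def tipping_conditions(gamma, t, eps, kappa, mu, N):
    """Evaluate the four conditions from the slide at time t.

    gamma -- callable giving the (assumed) cumulative retweet count at a time
    eps   -- window width, kappa/mu -- thresholds, N -- total count (N > 1)
    Returns a tuple of four booleans, one per condition, in slide order.
    """
    after = gamma(t + eps) - gamma(t)    # growth in the window after t
    before = gamma(t) - gamma(t - eps)   # growth in the window before t
    return (
        after > kappa,
        before < kappa,
        after > mu * before,
        after > N / math.log(N),
    )


def toy_gamma(t):
    # Slow growth until t = 10, then a burst of 50 retweets per time unit.
    return t if t <= 10 else 10 + 50 * (t - 10)


at_burst = tipping_conditions(toy_gamma, t=10, eps=1, kappa=5, mu=10, N=100)
before_burst = tipping_conditions(toy_gamma, t=5, eps=1, kappa=5, mu=10, N=100)
```

Returning the individual results leaves the choice of combination (all four, or some subset) to the caller.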
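  23. 23. Piece-wise Sigmoid function
      $$F(x) = \begin{cases} \dfrac{N_1}{1 + a_0 \cdot e^{-b_0 (x - c_0)}}, & x \le x_1 \\[2ex] N_{i-1} + \dfrac{N_i - N_{i-1}}{1 + a_i \cdot e^{-b_i (x - c_i)}}, & x_{i-1} < x \le x_i,\ 2 \le i \le \lambda \end{cases} \tag{1}$$
      where
      $$\sum_{i=1}^{\lambda} N_i = N. \tag{2}$$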
  24. 24. Result of modeling
      [Figure: scatter plot of R² (Multi S-Curve) against R² (Single S-Curve), both axes from 0.5 to 1, with the reference line y = x]
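The comparison on this slide is expressed in R². The slide does not define it, so the sketch below assumes the usual coefficient of determination, 1 − SS_res / SS_tot, computed between an observed popularity series and the values predicted by a fitted curve. The name `r_squared` is hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination between observations and model predictions."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(observed) / len(observed)
    ss_tot = sum((y - mean) ** 2 for y in observed)  # total variation
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))  # unexplained part
    if ss_tot == 0:
        raise ValueError("observed series is constant; R² is undefined")
    return 1.0 - ss_res / ss_tot


observed = [1.0, 2.0, 4.0, 8.0]
perfect_fit = r_squared(observed, observed)
mean_only_fit = r_squared(observed, [3.75] * 4)
```

A point above the line y = x in the scatter plot then corresponds to a tweet whose multi-curve fit leaves less residual variation than its single-curve fit.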
  25. 25. What causes a burst in social media?
  26. 26. Intuitive illustration
  27. 27. Outline (repeated; next section: Management, Schema)
  28. 28. Schema: Tweets
      Table: The microblog table
      Attribute | Data type | Description
      MID       | ID        | Message identifier
      UID       | ID        | Author’s user identifier
      TIME      | DATE/TIME | Time that the tweet is posted
      CONTENT   | TEXT      | Content of the tweet
      Table: The retweet table
      Attribute | Data type | Description
      MID       | ID        | Message identifier of the retweet
      REMID     | ID        | MID of the tweet that is retweeted
  29. 29. Schema: Content
      Table: The mention table
      Attribute | Data type | Description
      MID       | ID        | Message identifier
      UID       | ID        | A user identifier that is mentioned in the message
      Table: The topic table
      Attribute | Data type | Description
      MID       | ID        | Message identifier
      TAG       | TEXT      | The hashtag of a topic
      Could be extended for links, images, video, etc.
  30. 30. Schema: Users
      Table: The user table
      Attribute | Data type | Description
      UID       | ID        | User identifier
      Email     | TEXT      | Email of the user
      Name      | TEXT      | Name of the user
      . . .     | . . .     | Profile attributes
      Table: The friendlist table
      Attribute | Data type | Description
      UID       | ID        | User identifier
      FRIENDID  | ID        | A user that is followed by UID
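The six tables on slides 28 to 30 can be written down as DDL. The sketch below uses SQLite through Python's `sqlite3` module purely so that it runs anywhere; the original system is not named on the slides. Mapping the ID type to TEXT, the primary keys, and the NOT NULL constraints are assumptions, and the unspecified profile attributes of the user table are left out.

```python
import sqlite3

# Table and column names follow the slides; types and keys are assumptions.
SCHEMA = """
CREATE TABLE microblog (
    mid     TEXT PRIMARY KEY,  -- message identifier
    uid     TEXT NOT NULL,     -- author's user identifier
    time    TEXT NOT NULL,     -- posting time, e.g. 'YYYY-MM-DD HH:MM:SS'
    content TEXT               -- content of the tweet
);
CREATE TABLE retweet (
    mid   TEXT PRIMARY KEY,    -- message identifier of the retweet
    remid TEXT NOT NULL        -- mid of the tweet that is retweeted
);
CREATE TABLE mention (
    mid TEXT NOT NULL,         -- message identifier
    uid TEXT NOT NULL          -- user mentioned in the message
);
CREATE TABLE topic (
    mid TEXT NOT NULL,         -- message identifier
    tag TEXT NOT NULL          -- hashtag of a topic
);
CREATE TABLE "user" (
    uid   TEXT PRIMARY KEY,    -- user identifier
    email TEXT,
    name  TEXT
);
CREATE TABLE friendList (
    uid      TEXT NOT NULL,    -- user identifier
    friendID TEXT NOT NULL,    -- a user that is followed by uid
    PRIMARY KEY (uid, friendID)
);
"""


def create_schema(conn):
    """Create all six tables on an open sqlite3 connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

Keeping retweets, mentions, and topics in separate tables keeps the microblog table narrow, at the price of the large joins that the query slides discuss.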
  31. 31. Outline (repeated; next section: Management, Queries)
  32. 32. Queries
      Q: Rank tweets appearing in my followees’ timelines according to the number of retweets.
          -- x pairs each retweet (mid) with the original tweet it repeats (remid)
          SELECT x.remid
          FROM microblog,
               (SELECT retweet.mid AS mid, retweet.remid AS remid
                FROM microblog, retweet
                WHERE microblog.mid = retweet.remid) AS x
          WHERE microblog.mid = x.mid
            -- authors: the followees of user A and the followees of those followees
            AND microblog.uid IN
                (SELECT friendID FROM friendList
                 WHERE uid = 'A'
                    OR uid IN (SELECT friendID FROM friendList WHERE uid = 'A'))
            -- one-hour window starting at the given timestamp
            AND microblog.time BETWEEN 'YYYY-MM-DD HH:MM:SS'
                                   AND DATE_ADD('YYYY-MM-DD HH:MM:SS', INTERVAL 1 HOUR)
          GROUP BY x.remid
          ORDER BY COUNT(*) DESC
          LIMIT 10;
  33. 33. Difficulties
      Joins of very large tables
      • self-join of friendList
      • join of microblog and retweet
  34. 34. Queries
      Q: Find the set of people who share the same followee with the specified user.
          SELECT f1.uid
          FROM friendList AS f1,
               (SELECT friendID
                FROM friendList
                WHERE uid = 'A') AS f2
          WHERE f1.uid <> 'A'
            AND f1.friendID = f2.friendID
            AND f1.uid <> f2.friendID
          GROUP BY f1.uid
          ORDER BY COUNT(f1.friendID) DESC
          LIMIT 10;
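To see what the shared-followee query returns, it can be run against a handful of rows. The sketch below executes the query from this slide in an in-memory SQLite database; the only changes are that the user identifier is passed as a named parameter instead of the literal "A", and that the toy rows and the helper name `co_followers` are invented for illustration.

```python
import sqlite3

# The query from the slide, with the user id bound as the parameter :me.
QUERY = """
SELECT f1.uid
FROM friendList AS f1,
     (SELECT friendID FROM friendList WHERE uid = :me) AS f2
WHERE f1.uid <> :me
  AND f1.friendID = f2.friendID
  AND f1.uid <> f2.friendID
GROUP BY f1.uid
ORDER BY COUNT(f1.friendID) DESC
LIMIT 10
"""


def co_followers(conn, uid):
    """Users ranked by how many followees they share with `uid` (top 10)."""
    return [row[0] for row in conn.execute(QUERY, {"me": uid})]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friendList (uid TEXT, friendID TEXT)")
conn.executemany(
    "INSERT INTO friendList VALUES (?, ?)",
    [
        ("A", "X"), ("A", "Y"), ("A", "Z"),  # A follows X, Y, Z
        ("B", "X"), ("B", "Y"),              # B shares two followees with A
        ("C", "X"),                          # C shares one
        ("D", "W"),                          # D shares none
    ],
)
result = co_followers(conn, "A")
```

Users with the same number of shared followees come back in an unspecified order, since the query has no tie-breaker; the toy data avoids ties for that reason.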
  35. 35. Difficulties
      Power-law distribution
      • The size of results from the inner subquery may vary a lot!
      [Figure, two log-log panels: Frequency (Normalized) against #Followees, and Frequency (Normalized) against #Followers; series: Twitter, Sina Weibo]
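The distributions plotted on this slide can be computed directly from the friendList edges. The following sketch counts followees (or followers) per user and normalizes the histogram to fractions of users; that normalization is an assumption, since the slide only labels the axis "Frequency (Normalized)". The function name is hypothetical.

```python
from collections import Counter


def degree_distribution(edges, column=0):
    """Normalized degree histogram from (uid, friendID) pairs.

    column=0 counts followees per uid; column=1 counts followers per friendID.
    Returns {degree: fraction of users with that degree}, sorted by degree.
    Users that never appear in the chosen column (degree 0) are not counted.
    """
    per_user = Counter(edge[column] for edge in edges)
    histogram = Counter(per_user.values())
    total = sum(histogram.values())
    return {degree: count / total for degree, count in sorted(histogram.items())}


edges = [
    ("A", "X"), ("A", "Y"), ("A", "Z"),
    ("B", "X"), ("B", "Y"),
    ("C", "X"),
    ("D", "W"),
]
followee_dist = degree_distribution(edges, column=0)
follower_dist = degree_distribution(edges, column=1)
```

Under a heavy-tailed distribution a subquery such as `SELECT friendID FROM friendList WHERE uid = ...` returns a handful of rows for most users and a very large set for a few, which is the variation the slide points out.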
  36. 36. Outline (repeated; next section: Data generator: On-going work)
  37. 37. Why is a data generator needed?
      • Useful in benchmarks
        ◦ For scalability issues
        ◦ For privacy issues
        ◦ For diversity issues
      • Though social media data from different services tend to follow similar distributions, they are different.
  38. 38. Distribution of real data vs. generated data
      [Figure, four panels, each with series SIB and BSMA: Frequency against Number of Comments per post (log-log); Frequency against Number of Friends (log-log); Number of Post against Day; Frequency against Number of Posts (log-log)]
  39. 39. Outline (repeated; next section: Benchmarking social media data analytical queries)
  40. 40. Measurements
      • Throughput
      • Latency
      • Scalability
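Throughput and latency can be illustrated with a very small timing harness. This is only a single-client sketch under assumed definitions (throughput as completed operations per second, latency as wall-clock time per operation); it is not the benchmark's actual driver, and `measure` and `run_query` are hypothetical names.

```python
import time


def measure(run_query, repetitions):
    """Run `run_query` repeatedly and report simple performance figures.

    run_query   -- hypothetical zero-argument callable that executes one query
    repetitions -- number of times to run it (must be positive)
    """
    latencies = []
    start = time.perf_counter()
    for _ in range(repetitions):
        t0 = time.perf_counter()
        run_query()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "throughput_ops": repetitions / elapsed,               # operations per second
        "mean_latency_ms": 1000.0 * sum(latencies) / repetitions,
        "max_latency_ms": 1000.0 * max(latencies),
    }


stats = measure(lambda: sum(range(1000)), repetitions=50)
```

Scalability would then be assessed by repeating such measurements while the data size or the number of machines grows; how the benchmark itself quantifies it is not stated on the slide.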
  41. 41. Workloads
      • 19 queries in 3 categories
        ◦ Social network queries (joins of very large tables)
        ◦ Timeline queries (order-preserving)
        ◦ Hotspot queries (skewed data)
  42. 42. Preliminary results
      [Figure, two bar charts over queries Q1–Q5, Q8–Q17, and Q19: Average Highest Throughput (ops) per query; Average Highest Latency (ms) per query]
  43. 43. Preliminary results
      [Figure: Scalability per query (log scale) over queries Q1–Q5, Q8–Q17, and Q19; series: Team1, Team2, Team3, Team4]
  44. 44. On-going work
      BSMA: http://github.com/xiafan68/BSMA
      • Data generator
      • Queries related to content of tweets
      • More queries
      • Performance testing of more systems
  45. 45. Collective behavior analysis
      What is collective behavior?
      Three kinds of actions:
      • Conforming: actors follow prevailing norms
      • Deviant: actors violate those norms
      • Collective behavior: a third form of action, which takes place when norms are absent or unclear, or when they contradict each other
  46. 46. What is collective behavior?
      Four forms of collective behavior
      • The crowd
      • The public
      • The mass
      • The social movement
  47. 47. Mood analysis
      Essentially time series
  48. 48. Mood analysis
      Essentially time series
      • Disasters have a strong effect on the “death” mood (up-down-up pattern)
      • The “death” mood is strongly correlated with the moods of anxiety and calm
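Since the mood signals are time series, a claim such as "strongly correlated" can be checked with a correlation coefficient. The slide does not say which measure was used; the sketch below assumes the Pearson coefficient over two equally sampled series, and the example numbers are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily mood scores, not data from the study.
death_mood = [0.10, 0.30, 0.20, 0.50, 0.40]
anxiety_mood = [0.20, 0.50, 0.35, 0.90, 0.70]
correlation = pearson(death_mood, anxiety_mood)
```

A value near +1 or −1 indicates a strong linear relationship; a value near 0 indicates none, although non-linear dependence could still be present.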
  49. 49. On-going work
      A shared dataset of hotspots on Sina Weibo
      • Events and descriptions
      • Evolutions of hotspots
      • Information propagation
      • Spatial attributes
      • Users’ involvement
      By-products
      • Spamming detection
      • Fake IDs
      • . . .
  50. 50. Spamming?
      [Figure: visualization labeled with Sina Weibo account names (repeated labels listed once): 创意工坊, 冷笑话精选, 作业本, 团800网, 微博经典语录, 微博搞笑排行榜, 时尚经典语录, 电影工厂, 最音乐, 全球热门段子, 全球创意搜罗, 全球时尚最前线, 全球奇闻趣事, 星座爱情001, 全球热门排行榜, 胡椒蓓蓓网, 新浪数码, 新浪科技, 头条新闻, 新浪财经, 任志强, 微群小助手, 黄健翔, 环球音乐榜, 当时我震惊了, 薛蛮子, 徐小平, 邓飞, 老榕, 李开复, 袁岳]
  51. 51. Summary
      • Data collecting/pre-processing is dirty work
        ◦ Topic/semantic entity extraction
        ◦ Mood detection
        ◦ . . .
      • Real-life data depict interesting patterns
        ◦ even with simple exploratory analysis
      • Modeling is difficult
        ◦ yet possible under certain circumstances
        ◦ Monitoring is possible
        ◦ Prediction remains an open problem
      • Building a system for analyzing social media data is a challenge
      • A benchmark is a basis for better understanding social media analytics
  52. 52. Contributing students
      • MA Haixin
      • XIA Fan
      • WEI Jinxian
      • YU Chengcheng
      • ZHANG Qunyan
  53. 53. Thanks!
