Social Media Analytics: The Value Proposition


Rohini K. Srihari delivers her powerful presentation at the KDD 2010 Workshop on Social Media Analytics.

-What is Social Media?
-Value Proposition: Why mine social media?
-Business Analytics
-Technology, Challenges
-Multilingual social media mining

Published in: Education, Business, Technology

  1. 1. Social Media Analytics: the Value Proposition Rohini K. Srihari KDD 2010 Workshop on Social Media Analytics July 25, 2010
  2. 2. Outline What is Social Media? Value Proposition: Why mine social media?  Business Analytics  Counterterrorism Challenges Technology, Challenges Multilingual social media mining Future
  3. 3. Social Media Data; Actionable Intelligence; Consumer Generated, Not Edited, Not Authenticated
  4. 4. Data/Text Mining: Extracting useful information from large data sets. Analyze Observational Data to find unsuspected relationships and Summarize data in novel ways that are understandable and useful to data owner. Information Discovery: non-trivial, implicit, previously unknown relationships (Ex of Trivial: Those who are pregnant are female). Summarize as Patterns and Models (usually probabilistic). Usefulness: meaningful: lead to some advantage, usually economic. Analysis: Automatic/Semi-Automatic Process (Knowledge Extraction)
  5. 5. Value Proposition
  6. 6. Market Size  Business Analytics market projected to be $28 billion in 2011 (IDC Report)  Social Analytics taking leading position of interest within organizations  Integrating Social Media Analytics and Business Intelligence. Source: HCL India
  7. 7. Customer Relationship Management Data sources are primarily internal  Call center transcripts  E-mail  Customer feedback Cost avoidance  Product exchange mitigation  Early warning detection on new products Increase in customer satisfaction and loyalty Insight towards new products, product features Identification of possible marketing opportunities
  8. 8. e-Service Chat Monitoring
     Operator: How can I assist you today?
     Customer: I need help with operating your coffee maker I bought from yesterday.
     Operator: Certainly. What problem are you facing?
     Customer: I fill in the coffee powder, water, and then press the red button on the side, and nothing happens.
     Operator: The red button enables the ‘clean coffee maker’ process. You will need to use the white knob on the other side to brew coffee.
     Customer: I see.
     Customer: BTW, in the Nespresso cappuccino machine I recently bought, it was the red button for start.
     Operator (draft reply): Is there anything else I can assist with today? [SEND]
     Alert: COMPETITOR PRODUCT MENTION
  9. 9. Reputation Management Data sources are primarily external, e.g.    (travel related website) Consumer Brand Analytics  What are people saying about our brand? Marketing Communications  Significant spending on marketing, advertising: companies trying to position their products  Brand analytics helps to determine whether such campaigns are effective
  10. 10. Mining Product Reviews Application is Industrial Design  Automatically mine product reviews for information on product features, new requests, etc.  Focus on wheelchairs Features Extracted Easy to use Fit into a car Comfortable chair Light weight Convenient to fold Sturdy Good price
  11. 11. Viral Marketing
     Jure Leskovec (Stanford), Lada Adamic (U of Michigan), Bernardo A. Huberman (HP Labs)
     • Personalized recommendations
     • Viral marketing
     • Cross-selling: “people who bought x also bought y”
     • Collaborative filtering: “based on ratings of users like you…” Delicious,
     • 68% of consumers consult friends and family before purchasing home electronics (Burke 2003)
     • Success rate: # of purchases following a recommendation / # recommenders
     • Books overall have a 3% success rate
  12. 12. 500 million active users!
     ▪ More than 20 million users update their status at least once each day
     ▪ More than 850 million photos uploaded to the site each month
     ▪ >1 billion pieces of content (web links, blog posts, photos, etc.) shared each week
     Many different groups clamoring for data and text analytics:
     ▪ FB Engineers
     ▪ Advertisers
     ▪ Page owners
     ▪ Platform/Connect developers
     ▪ Marketers
     ▪ Academics
  13. 13. An aside: Social Media Marketing Lead Generation  Breakdown of respondents’ top benefits of social networking:  50%: Generating leads  45%: Keeping up with the industry  44%: Monitoring online conversation  38%: Finding vendors/suppliers Online Forum Users Are Enthusiastic Brand Advocates  79.2% of forum contributors help a friend or family member make a decision about a product purchase – compared with 47.6% of non-contributors and 53.8% overall.  65% of forum contributors share advice (offline and in person) based on information that they’ve read online – compared with 35% of non-contributors and 40.8% overall.  57.7% of forum contributors proactively recommend someone make a particular purchase – compared with 16.9% of non-contributors and 24.9% overall. Only 47% of Companies Experimenting With Social Media  Gartner study predicts that by the end of 2010, more than 60% of Fortune 1000 companies will manage an online community.  ComBlu’s study, The State of Online Branded Communities, shows that most companies do not understand how to engage within online communities and have no real idea of what their customers want on these sites.
  14. 14. Citizen Response E-RuleMaking  the use of digital technologies by government agencies in rulemaking, decision making processes  solicit citizen feedback on bills being debated in Congress  What new issues are being raised, what aspects of bill are popular, unpopular  Better to mine social media than using focus groups? Political Campaigns  Why do people support a candidate- is it really based on issues?
  15. 15. Use Case: Understanding and Visualizing Consumer Responses. Extracting Entities and Sentiment to Power Alerting, Link Diagrams, and Geo-Mapping
  16. 16. Twitter: Real-Time Citizen Journalism • Mumbai terror attack regarded as coming of age of Twitter • citizen journalism provided more valuable information than wire services, broadcast news • information about places to avoid, well being of relatives, friends, etc. • many redundant posts, users have to wade through hundreds of posts to locate useful information • Goal: to mine this data in real-time and produce well organized summaries
  17. 17. Law Enforcement, Homeland Security • Facebook: gang members frequently boast about their activities on their Facebook pages • Chat rooms: stalkers, pedophiles • Twitter: protest rallies being planned (G20 Summit Protest); who, what, where, when • Craigslist
  18. 18. Human Behaviour Analysis  Process social media content, provide tools for analysts to:  Identify social networks: groups, members  Identify topics of discussion and sentiment • E.g. angry at govt., wanting retaliation, peacemakers • Thought influencers  Identify social goals through analysis of verbal communication • Manipulation: Persuasion, threats, coercion • Religious supremacy: religious analogues • Recruitment  Diagram labels: Social Media Content, Link Diagrams, Predictive Modeling
  19. 19. Technology, Challenges
  20. 20. Analyzing Social Media Data Content Analysis  Text analysis, multimedia analysis Structure Analysis Usage Analysis  Search engine optimization  What keywords are driving customers to your site, competitor sites  Query logs, site traffic. Ideally combine all three of these!
  21. 21. Solution Framework (diagram; component labels: Mark Logic, Thetus, Kapow, Oracle, MySQL, I2, Attensity, RDF Triple Stores, Enterprise Content, Palantir, Themis, CouchDB, Autonomy, Jodange, Lexalytics, Cymfony, Blogpulse)
  22. 22. Content Acquisition Pre-selected, validated sites ,, NYT blogs, reader comments Search Service , Craigslist  Twitter, Facebook Blog Search Engines  Google Blog Search  Technorati  Blogpulse BoardReader Lucene Index Storage   Spidering
  24. 24. Data Collection: Spidering • “Dark Web”: the portion of the World Wide Web used to help achieve the sinister objectives of terrorists and extremists. • Spider uses breadth and depth first (BFS and DFS) traversal for crawl space URL ordering based on URL tokens, anchor text, and link levels. • Automated discovery of proxy servers to distribute collection and increase reliability.
  25. 25. Content Analysis • Model Based  Develop models that generalize characteristics of data  Machine learning: Supervised, semi-supervised, unsupervised  E.g., sequence labeling, classification  N-gram language models  Linguistic: based on rules of English grammar  Information Extraction • Pattern Mining  frequency analysis, local patterns  Google n-gram data: What words are used in conjunction with Buffalo, Buffalo Sabres, University at Buffalo  Query log analysis: Learn spelling corrections, Learn lists of named entities, Learn relationships  Discover trends: Flu, cough, fever: frequency of queries in certain regions, change from the norm • Combine both approaches
  25. 25. Reliability of Data How much trust in data? (Forrester)  Email from people you know: 77%  Consumer product ratings/reviews: 60%  Message board posts: 21%  Personal blog: 18%, company blog: 16% Splog: Spam in weblogs  UK has lawful intercept program  What about results of data mining? Off-topic posts  Comments on blog posts, forums quickly turn into personal rants, completely off-topic Possible Remedies  Focus on sites where data is known to be more reliable  Use technology to filter out spam, splog and off-topic posts
  26. 26. Informal Language
     • Loss of Functional Indicators: Missing punctuation; Missing or raNDOm case information
     • Whole phrases reduced to acronyms
     • Casual, Phonetic Spelling: tha, teh = the
     • Explicit Sentiment Commentary: Happy Birthdaaaayyyy!!!1!1!; must go <sigh>; :-P grrr…..
     • Mistaken auto-correction or replacement: Co-operation = Cupertino; The Queen = Queen Elizabeth, “hundreds of worker bees commanded by Queen Elizabeth”
     • Twitter Conventions: RT, hashtags #, url shortening, e.g. alanbr82 RT @royjwells: New Blog Post - Will Old Spice Achieve a ROI? #oldspice #sm #socialmedia
     • Word Inventions: refudiate, wee-wee’d up, momager, rickRoll, L33t, IMHO, meh
     Solutions:
     • spelling correction
     • acronym look-up
     • machine learning: treat it as a machine translation problem!
  27. 27. Legal Issues Privacy of data  UK has lawful intercept program  What about results of data mining? Liability  Major issue for pharmaceutical companies: if they discover report of side effect of drug, they are required to report it  Analysts making positive public statements about company earnings, yet contradicting this on blogs, facebook pages Workplace Issues  Time spent on social media sites during work hours leading to lower productivity
  28. 28. Accuracy of Analysis
     Text analysis is based on natural language processing which is a useful, but imperfect technology
     “Bill Gates, the CEO of Microsoft was initially very happy about its site location in Seattle, but now he has other thoughts. He is very displeased with the pollution…. Also, its employees are upset with the construction work…around its vicinity. In all, he wants to abandon the current site…..”
     • Who is expressing an opinion?
     • What is the opinion about?
     • Is it positive or negative?
     Validate performance accuracy through benchmarks on specially constructed data sets
  29. 29. Sentiment Analysis: Aims to determine the attitude of a speaker or a writer with respect to some target or topic. Example: I think, Obama needs to begin to take the blame for his failed policies -- his statement "that his policies are getting us out of this mess" are a big lie. Annotation: 1. SENTIMENT Attributes ID:ex1, TargetID:t1, Polarity: Negative; labeled spans: Opinion Holder, Topic, Target 1
  30. 30. Opinion summary
     In product reviews, we are interested in generating a feature-based summary for a product.
     Digital_camera_1:
       Feature: picture quality
         Positive: 253 <individual review sentences>
         Negative: 6 <individual review sentences>
       Feature: size
         Positive: 134 <individual review sentences>
         Negative: 10 <individual review sentences>
       …
  31. 31. Scalability: Massively Distributed/Parallel Computing Hadoop  Open-source framework for running Map-Reduce on a cluster of commodity machines, as well as a distributed file system for long-term storage  Map-Reduce (invented at Google) provides a way to process large data sets that scales linearly with the number of machines in the cluster....if your data doubles in size, just buy twice as many computers  Hadoop now an Apache project led by the Grid Computing team at Yahoo! HIVE  SQL-like query language, table partitioning schema, and metadata store built on top of Hadoop  Developed at Facebook, now an Apache subproject. Facebook Analytics: How many people are discussing being laid off; plot percentage of total posts by state
  32. 32. Multilingual Applications
  33. 33. Language Usage Statistics [1]: English is not the only language on the internet. Urdu speaking Internet users - 12,000,000 (2006) ~ 1.6% of 42.4%. [1] Source: Internet World Stats. Based on 1,733,993,741 estimated internet users for Sept 30, 2009. Copyright 2009, Miniwatts Marketing Group
  34. 34. Multilingual Social Media Mining: How did people in Egypt, Israel and Pakistan react to the latest presidential speech? Opinion Extraction  Topic: What is the opinion about?  Opinion Holder: Who is expressing it?  What is the intensity of the opinion?  In what context is it being expressed? Emotion Detection  What kind of emotion is being expressed? – goes beyond just the positive or negative emotion Required to perform behavioral analysis, cross cultural analysis
  35. 35. Faceted Search: Sentiment about Topic. “People are filled with anger and sorrow because of the policies made by Musharaf.” OPINION HOLDER – Writer, People; TARGET – Musharaf’s policies (Musharaf is an implied target)
  36. 36. Multilingual Text Analysis Dealing with script, coding variations Even low-level text analysis becomes difficult  Chinese: no white space between words  Arabic: complex diacriticals Language Training Resources  Lexicons, annotated corpora, etc.  If sufficient training data exists, new languages can be adapted to fairly easily  E.g. core Russian in 3 weeks! Treat language porting as a special case of domain porting  Ideally, should involve creation of new data sources, not new code
  37. 37. Chinese Text Analysis
  38. 38. Context Aware Translation
     Source: 斯洛文尼亚总理扬沙,欧洲委员会主席巴罗佐和欧盟外交政策负责人索拉纳与梅德韦杰夫共进非正式晚餐
     Babelfish Translation: Slovenia premier the sand blowing, Council of Europe President Baluozuo and European Union foreign policy person in charge Solana and Medvedev have the unofficial supper.
     Name translation output: <NeGPE english="Slovenia"> 斯洛文尼亚 </NeGPE> 总理 <NePer english="Jansa"> 扬沙 </NePer> , <NeOrg english="European Commission"> 欧洲 委员会 </NeOrg> 主席 <NePer english="Barroso"> 巴罗佐 </NePer> 和 <NeGPE english="European Union"> 欧盟 </NeGPE> 外交 政策 负责人 <NePer english="Solana"> 索拉纳 </NePer> 与 <NePer english="Medvedev"> 梅德韦杰夫 </NePer> 共 进 非正式 晚餐 。
     Context Aware Translation: Powered by Semantex™ extracted entities, Babelfish translates as: Slovenia Premier Jansa, Council of Europe President Barroso and European Union foreign policy person in charge Solana and Medvedev have the unofficial supper.
  39. 39. Mining Wikipedia for Lexicons • Translation lexicons automatically extracted from Chinese Wikipedia, use cross language links to add English translations • Easy to regenerate with new versions of Wikipedia • Chinese Wikipedia is constantly growing
  40. 40. COLABA: Colloquial Arabic Blog Analysis – Proliferation of open source, social media – Dominance of non-English content – Use of dialects and colloquial language – Limited supply of multilingual analysts
  41. 41. Tools made for MSA fail on Arabic dialects
     Human translation for all Arabic variants below is the same: “There is no electricity, what happened?”
     Arabic Variant | Arabic Source Text | Google Translate
     Egyptian | الكهربا اتقطعت، ليه كده بس؟ | Atqtat electrical wires, Why are Posted?
     Levantine | شكلو مفيش كهربا، ليش هيك؟ | Cklo Mafeesh كهربا, Lech heck?
     Iraqi | شو ماكو كهرباء، خير؟ | Xu MACON electricity, good?
     MSA | لا يوجد كهرباء، ماذا حصل؟ | Does not have electricity, what happened?
     Arabic Dialects are not handled well in current machine translation systems.
     COLABA enables MSA tools to interpret dialects correctly.
  42. 42. Code Mixing, Switching
     Use of Latin script: lack of transliteration standards makes it difficult to process Spanglish, Hinglish, Urdish, etc.
     Afsoos key baat hai . kal tak jo batain Non Muslim bhi kartay hoay dartay thay abhi this man has brought it out in the open.
     [It is sad to see that those words that even a non muslim would fear to utter until yesterday, this man has brought it out in the open]
     Solutions:
     • Apply “romanized” POS tagger, English tagger in tandem: use machine learning to combine evidence and generate final tag, language ID
     • For longer English spans, use English NLP system
  43. 43. Resource Poor Languages
     Bootstrap Learning: process of improving the performance of a trained classifier by iteratively adding data that is labeled by the classifier itself to the training set, and retraining the classifier
     Useful when there is not enough annotated data
     Requirement: NEEDS SEED DATA
     Diagram labels: SEED DATA, TRAINING DATA, corrections, CORRECT SAMPLES
  44. 44. The Road Ahead?
     Strengths
     • free form facilitates capturing the true voice of customer, wisdom of crowd
     • can be expressed through voice, text messaging on mobile phones, etc.
     Weaknesses
     • language analysis and mining are challenging
     • susceptible to spam, self-serving use by companies
     • Behaviour, predictive models need more research
     Threats
     • privacy and security issues: possible to assimilate detailed knowledge about person’s activities, whereabouts
     • can lead to anti-social behaviour!
     Opportunities
     • promise of collective problem solving: coordination, cooperation
     • mobile use supports dealing with societal problems, disaster situations: social network is geospatial proximity
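
Slide 24 (Data Collection: Spidering) mentions breadth-first traversal with URL ordering based on URL tokens, anchor text, and link levels. The slides give no implementation, so the following is only a minimal sketch under stated assumptions: the link graph is an in-memory dictionary rather than live pages, and the scoring rule (count of keyword matches) and the names `keyword_hits` and `crawl_order` are hypothetical.

```python
import heapq
import re

def keyword_hits(url, anchor_text, keywords):
    """Count keywords found among the tokens of a URL and its anchor text."""
    tokens = set(re.findall(r"[a-z0-9]+", (url + " " + anchor_text).lower()))
    return len(tokens & keywords)

def crawl_order(link_graph, seeds, keywords, max_pages=100):
    """Visit pages level by level (breadth first); inside a level,
    visit links with more keyword hits first.

    link_graph maps a URL to a list of (target_url, anchor_text) pairs.
    This is an assumed stand-in for fetching and parsing real pages.
    """
    keywords = {k.lower() for k in keywords}
    # Heap entries are (link level, negated hit count, url).
    frontier = [(0, 0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        level, _, url = heapq.heappop(frontier)
        visited.append(url)
        for target, anchor in link_graph.get(url, []):
            if target not in seen:
                seen.add(target)
                hits = keyword_hits(target, anchor, keywords)
                heapq.heappush(frontier, (level + 1, -hits, target))
    return visited

graph = {
    "http://example.org/": [
        ("http://example.org/about", "About us"),
        ("http://example.org/forum", "Discussion forum"),
    ],
    "http://example.org/forum": [
        ("http://example.org/forum/thread1", "forum thread"),
    ],
}
print(crawl_order(graph, ["http://example.org/"], {"forum"}))
```

Putting the link level first in the heap key keeps the traversal breadth first; swapping the first two fields would turn it into a best-first crawl that can dive deep along promising links.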
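
Slide 26 (Informal Language) lists spelling correction and acronym look-up as remedies for informal text. A rough sketch of the look-up idea follows. The two tables hold only examples taken from the slide (tha, teh, IMHO), and the letter-collapsing heuristic and the names `normalize_token` and `normalize` are assumptions for illustration, not the presenter's method.

```python
import re

# Tiny illustrative tables; a real system would need far larger ones.
SPELLING = {"tha": "the", "teh": "the"}
ACRONYMS = {"imho": "in my humble opinion"}

def normalize_token(token):
    """Lower-case a token, squeeze stretched letters, drop trailing
    punctuation or digits, then apply the look-up tables."""
    word = token.lower()
    # Collapse runs of 3+ identical letters to one ("daaaayyyy" -> "day").
    # This is a crude heuristic and will damage some legitimate words.
    word = re.sub(r"([a-z])\1{2,}", r"\1", word)
    # Strip trailing non-letters such as "!!!1!1!".
    word = re.sub(r"[^a-z]+$", "", word)
    word = SPELLING.get(word, word)
    return ACRONYMS.get(word, word)

def normalize(text):
    """Normalize every whitespace-separated token, dropping empty results."""
    return " ".join(w for w in map(normalize_token, text.split()) if w)

print(normalize("Happy Birthdaaaayyyy!!!1!1! teh cake IMHO"))
```

The slide's third remedy, treating normalization as a machine translation problem, would replace these hand-written tables with a learned model and is not sketched here.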
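
Slide 30 (Opinion summary) shows a feature-based summary with positive and negative counts per product feature. Assuming an upstream step has already tagged each review sentence with a feature and a polarity (that step is not shown, and the tuple layout and function names below are hypothetical), the aggregation itself could look like this:

```python
from collections import defaultdict

def summarize(opinions):
    """Group (feature, polarity, sentence) tuples by feature and polarity."""
    summary = defaultdict(lambda: {"positive": [], "negative": []})
    for feature, polarity, sentence in opinions:
        summary[feature][polarity].append(sentence)
    return dict(summary)

def render(product, summary):
    """Format the summary in the layout used on the slide."""
    lines = [f"{product}:"]
    for feature, by_polarity in summary.items():
        lines.append(f"  Feature: {feature}")
        for polarity in ("positive", "negative"):
            count = len(by_polarity[polarity])
            lines.append(f"    {polarity.capitalize()}: {count}")
    return "\n".join(lines)

# Made-up sample sentences, for illustration only.
opinions = [
    ("picture quality", "positive", "The photos are sharp."),
    ("picture quality", "positive", "Colors look great."),
    ("picture quality", "negative", "Night shots are grainy."),
    ("size", "positive", "Fits in my pocket."),
]
print(render("Digital_camera_1", summarize(opinions)))
```

Keeping the sentences, not just the counts, lets a reader drill down from a number to the individual review sentences behind it, as the slide's placeholders suggest.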
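
Slide 31 (Scalability) gives a Map-Reduce example: how many people are discussing being laid off, as a percentage of posts by state. The single-process sketch below imitates the map, shuffle, and reduce phases in plain Python; it does not use Hadoop or Hive. It assumes each post is a `(state, text)` pair, reads "percentage" as the share of each state's own posts, and uses a naive substring match; all function names are hypothetical.

```python
from collections import defaultdict

def map_post(post):
    """Map phase: emit (state, (mentions_layoff, 1)) for one post."""
    state, text = post
    mentions = 1 if "laid off" in text.lower() else 0
    yield state, (mentions, 1)

def reduce_state(state, values):
    """Reduce phase: turn per-post pairs into a percentage for one state."""
    mentions = sum(m for m, _ in values)
    total = sum(t for _, t in values)
    return state, 100.0 * mentions / total

def run(posts):
    """Shuffle phase in miniature: group mapped values by key, then reduce."""
    groups = defaultdict(list)
    for post in posts:
        for key, value in map_post(post):
            groups[key].append(value)
    return dict(reduce_state(k, v) for k, v in groups.items())

# Made-up posts, for illustration only.
posts = [
    ("NY", "I was laid off today"),
    ("NY", "Nice weather"),
    ("CA", "Laid off again"),
    ("CA", "Lunch time"),
    ("CA", "Traffic is bad"),
    ("CA", "New phone"),
]
print(run(posts))
```

Because each post is mapped independently and each state is reduced independently, both phases can be spread across machines, which is the property the slide attributes to Map-Reduce.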
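
Slide 43 (Resource Poor Languages) defines bootstrap learning as retraining a classifier on data it has labeled itself, starting from seed data. The loop can be sketched with a deliberately simple word-count classifier. The classifier, the confidence margin, and every name below are assumptions chosen to keep the example short; the human "corrections" step from the slide's diagram is left out.

```python
from collections import Counter

def train(labeled):
    """Count how often each word occurs under each label."""
    counts = {}
    for text, label in labeled:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts

def predict(counts, text):
    """Return (label, margin): the best label and its lead over the runner-up.
    Assumes at least two labels are present in the model."""
    tokens = text.lower().split()
    scores = sorted(
        ((sum(c[t] for t in tokens), label) for label, c in counts.items()),
        reverse=True,
    )
    (best, label), (second, _) = scores[0], scores[1]
    return label, best - second

def bootstrap(seed, unlabeled, min_margin=1, max_rounds=10):
    """Repeatedly add confidently self-labeled texts to the training set."""
    labeled, pool, rounds = list(seed), list(unlabeled), 0
    while pool and rounds < max_rounds:
        model = train(labeled)
        newly = []
        for text in pool:
            label, margin = predict(model, text)
            if margin >= min_margin:
                newly.append((text, label))
        if not newly:
            break  # nothing confident enough; stop early
        labeled.extend(newly)
        added = {text for text, _ in newly}
        pool = [text for text in pool if text not in added]
        rounds += 1
    return labeled, pool, rounds

seed = [("great camera love it", "pos"), ("terrible battery hate it", "neg")]
unlabeled = [
    "love the picture quality",
    "hate the battery life",
    "sharp picture",
    "short life",
]
labeled, leftover, rounds = bootstrap(seed, unlabeled)
print(dict(labeled), leftover, rounds)
```

In this toy run, "sharp picture" and "short life" share no words with the seed data, so they can only be labeled in the second round, after the first round's self-labeled sentences have extended the vocabulary. The same mechanism also lets early mistakes reinforce themselves, which is one reason the slide's diagram includes a corrections step.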