Social Media Analytics: the Value Proposition




                   Rohini K. Srihari
         KDD 2010 Workshop on Social Media Analytics
                        July 25, 2010
Outline
 What is Social Media?
 Value Proposition: Why mine social media?
   Business Analytics

   Counterterrorism


 Challenges
 Technology, Challenges
 Multilingual social media mining
 Future
Social Media Data                                 Actionable Intelligence




Consumer Generated, Not Edited, Not Authenticated
Data/Text Mining
 Extracting useful information from large data sets
Analyze Observational Data to find unsuspected relationships
and Summarize data in novel ways that are understandable
and useful to data owner
  Information Discovery
         non-trivial, implicit, previously unknown relationships
         Example of a trivial relationship: those who are pregnant are female

  Summarize
         as Patterns and Models (usually probabilistic)

  Usefulness:
         meaningful: lead to some advantage, usually economic
  Analysis:
         Automatic/Semi-Automatic Process (Knowledge Extraction)
Value Proposition
Market Size
     Business Analytics market projected to be $28 billion in
      2011 (IDC Report)
        Social Analytics taking leading position of interest within
         organizations
     Integrating Social Media Analytics and Business
      Intelligence




Source: HCL India
Customer Relationship Management
 Data sources are primarily internal
   Call center transcripts

   E-mail

   Customer feedback


 Cost avoidance
   Product exchange mitigation

   Early warning detection on new products


 Increase in customer satisfaction and loyalty
 Insight towards new products, product features
 Identification of possible marketing
  opportunities
e-Service Chat Monitoring

Operator: How can I assist you today?
Customer: I need help with operating your coffee maker I bought
from Amazon.com yesterday.
Operator: Certainly. What problem are you facing?
Customer: I fill in the coffee powder, water, and then press the red
button on the side, and nothing happens.
Operator: The red button enables the ‘clean coffee maker’ process.
You will need to use the white knob on the other side to brew
coffee.
Customer: I see.
Customer: BTW, in the Nespresso cappuccino machine I recently
bought, it was the red button for start.


  Is there anything else I can assist with today?         SEND
  Alert: COMPETITOR PRODUCT MENTION
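The alert in this chat example can be approximated with a simple watch-list match. The sketch below is only an illustration, not the system shown on the slide: the `COMPETITOR_TERMS` list and the `competitor_mentions` helper are hypothetical names, and a production system would more likely rely on entity extraction than on keyword matching.

```python
import re

# Hypothetical watch list; a real deployment would load this from a product catalog.
COMPETITOR_TERMS = ["Nespresso", "Keurig"]

def competitor_mentions(message, terms=COMPETITOR_TERMS):
    """Return the watch-list terms found in a message (case-insensitive, whole words)."""
    found = []
    for term in terms:
        if re.search(r"\b" + re.escape(term) + r"\b", message, flags=re.IGNORECASE):
            found.append(term)
    return found

chat = [
    ("Customer", "I see."),
    ("Customer", "BTW, in the Nespresso cappuccino machine I recently "
                 "bought, it was the red button for start."),
]

# One alert per (speaker, competitor term) pair found in the transcript.
alerts = [(speaker, hit) for speaker, text in chat for hit in competitor_mentions(text)]
print(alerts)
```

Matching on whole words avoids firing on substrings, but it will still miss misspellings and informal product nicknames.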
Reputation Management
 Data sources are primarily external, e.g.
      www.youtube.com
      www.epinions.com
      tripadvisor.com (travel related website)
 Consumer Brand Analytics
   What are people saying about our brand?
 Marketing Communications
   Significant spending on marketing, advertising: companies trying to position their products
   Brand analytics helps to determine whether such campaigns are effective
Mining Product Reviews
 Application is Industrial Design
      Automatically mine product reviews for information on
       product features, new requests, etc.
      Focus on wheelchairs

                                Features Extracted
                               Easy to use
                               Fit into a car
                               Comfortable chair
                               Light weight
                               Convenient to fold
                               Sturdy
                               Good price
Viral Marketing
Jure Leskovec (Stanford), Lada Adamic (U of Michigan),
   Bernardo A. Huberman (HP Labs)
Personalized recommendations
                                                          Viral marketing
Cross-selling
“people who bought x also bought y”

Collaborative filtering
“based on ratings of users like you…”
Delicious, Digg.com



                                           68% of consumers consult friends and family
                                           before purchasing home electronics
                                           (Burke 2003)

Success rate: # of purchases following a
recommendation / # recommenders
Books overall have a 3% success rate
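As a rough illustration of the two ideas on this slide, the sketch below counts co-purchases to produce "people who bought x also bought y" suggestions and computes the success rate as defined above. The basket data and the function names (`co_purchase_counts`, `also_bought`, `success_rate`) are invented for the example; this is not the method used in the cited study.

```python
from collections import Counter
from itertools import combinations

def co_purchase_counts(baskets):
    """Count how often each unordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
    return counts

def also_bought(item, baskets, top_n=3):
    """Items most often bought together with `item`, most frequent first."""
    scores = Counter()
    for (a, b), n in co_purchase_counts(baskets).items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common(top_n)]

def success_rate(purchases_following_recommendation, recommenders):
    """Success rate as defined on the slide: purchases / recommenders."""
    if recommenders == 0:
        return 0.0
    return purchases_following_recommendation / recommenders

# Made-up purchase baskets for illustration only.
baskets = [
    ["camera", "sd card"],
    ["camera", "sd card", "tripod"],
    ["camera", "sd card"],
    ["book"],
]
print(also_bought("camera", baskets, top_n=1))
print(success_rate(3, 100))
```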
500 million active users!
▪ More than 20 million users update their status at least once each day
▪ More than 850 million photos uploaded to the site each month
▪ >1 billion pieces of content (web links, blog posts, photos, etc.) shared each week
Many different groups clamoring for data and text analytics:
▪ FB Engineers
▪ Advertisers
▪ Page owners
▪ Platform/Connect developers
▪ Marketers
▪ Academics
An aside: Social Media Marketing
http://www.socialmediaexaminer.com/new-studies-show-value-of-social-media/
   Lead Generation
        Breakdown of respondents’ top benefits of social networking:
               50%: Generating leads
               45%: Keeping up with the industry
               44%: Monitoring online conversation
               38%: Finding vendors/suppliers
   Online Forum Users Are Enthusiastic Brand Advocates
           79.2% of forum contributors help a friend or family member make a decision about
            a product purchase – compared with 47.6% of non-contributors and 53.8% overall.
           65% of forum contributors share advice (offline and in person) based on
            information that they’ve read online – compared with 35% of non-contributors and
            40.8% overall.
           57.7% of forum contributors proactively recommend someone make a particular
            purchase – compared with 16.9% of non-contributors and 24.9% overall.
   Only 47% of Companies Experimenting With Social Media
           Gartner study predicts that by the end of 2010, more than 60% of Fortune 1000
            companies will manage an online community.
           ComBlu’s study, The State of Online Branded Communities, shows that most
            companies do not understand how to engage within online communities and have
            no real idea of what their customers want on these sites.
Citizen Response
 E-RuleMaking
   the use of digital technologies by government agencies in rulemaking, decision making processes
   solicit citizen feedback on bills being debated in Congress
   What new issues are being raised, what aspects of a bill are popular, unpopular
   Better to mine social media than using focus groups?
 Political Campaigns
   Why do people support a candidate: is it really based on issues?
Use Case: Understanding and Visualizing Consumer Responses
Extracting Entities and Sentiment to Power Alerting, Link Diagrams, and Geo-Mapping




Twitter: Real-Time Citizen Journalism
           • Mumbai terror attack regarded as coming
           of age of Twitter
           • citizen journalism provided more valuable
           information than wire services, broadcast
           news
           • information about places to avoid, well
           being of relatives, friends, etc.
           • many redundant posts, users have to wade
           through hundreds of posts to locate useful
           information
           • Goal: to mine this data in real-time and
           produce well organized summaries




Law Enforcement, Homeland Security

• Facebook
    • gang members frequently boast about their activities on their Facebook pages
• Chat rooms
    • Stalkers, pedophiles
• Twitter
    • protest rallies being planned (example shown: G20 Summit Protest)
    • who, what, where, when
• Craigslist
Human Behaviour Analysis
     Process social media content, provide tools for analysts to:
        Identify social networks: groups, members
        Identify topics of discussion and sentiment
           • E.g. angry at govt., wanting retaliation, peacemakers
           • Thought influencers
        Identify social goals through analysis of verbal communication
           • Manipulation: Persuasion, threats, coercion
           • Religious supremacy: religious analogues
           • recruitment
     [Diagram labels: Social Media Content; Predictive Modeling; Link Diagrams]
Technology, Challenges
Analyzing Social Media Data
   Content Analysis
       Text analysis, multimedia analysis
   Structure Analysis




   Usage Analysis
        Search engine optimization
        What keywords are driving customers to your site,
          competitor sites
        Query logs, site traffic
Ideally combine all three of these!
Solution Framework
[Diagram labels, grouped by column, left to right]
  Enterprise Content
  Kapow
  Attensity, Themis, Autonomy, Jodange, Lexalytics, Cymfony, Blogpulse
  Mark Logic, Oracle, MySQL, RDF Triple Stores, CouchDB
  Thetus, I2, Palantir
Content Acquisition
 Pre-selected, validated sites
      Epinions.com, Amazon.com, NYT blogs, reader comments
      Tripadvisor.com, Craigslist
      Twitter, Facebook
 Blog Search Engines
      Google Blog Search http://blogsearch.google.com/
      Technorati http://technorati.com/
      Blogpulse http://blogpulse.com/
 BoardReader
      http://boardreader.com/
      http://www.omgili.com/
 Spidering
 [Diagram labels: Search Service; Lucene Index Storage]
Data Collection: Spidering
    “Dark Web”: the portion of the World Wide Web used to help achieve the sinister
    objectives of terrorists and extremists.
• Spider uses breadth-first and depth-first (BFS and DFS) traversal for crawl space URL
   ordering based on URL tokens, anchor text, and link levels.
• Automated discovery of proxy servers to distribute collection and increase reliability.
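To make the URL-ordering idea concrete, here is a minimal sketch of a crawl frontier that ranks unvisited links by keyword hits in URL tokens and anchor text, with a small penalty per link level. It runs on a toy in-memory link graph (`TOY_WEB`, invented here) rather than the network, and the scoring weights are assumptions, not the spider described on the slide. With no keywords the ordering reduces to breadth-first traversal by link level.

```python
import heapq
import re

# Toy link graph standing in for fetched pages: url -> [(anchor_text, target_url)].
TOY_WEB = {
    "http://a.example/": [("forum", "http://a.example/forum"),
                          ("about", "http://a.example/about")],
    "http://a.example/forum": [("thread one", "http://a.example/forum/t1")],
    "http://a.example/about": [],
    "http://a.example/forum/t1": [],
}

def score(url, anchor, depth, keywords):
    """Keyword hits in URL tokens and anchor text, minus a penalty per link level."""
    tokens = set(re.split(r"[^a-z0-9]+", (url + " " + anchor).lower()))
    return sum(1 for k in keywords if k in tokens) - 0.1 * depth

def crawl(seed, keywords, max_pages=10, web=TOY_WEB):
    """Visit pages best-score-first; returns the visit order."""
    frontier = [(-score(seed, "", 0, keywords), 0, seed)]  # min-heap on negated score
    seen = {seed}
    order = []
    while frontier and len(order) < max_pages:
        _, depth, url = heapq.heappop(frontier)
        order.append(url)
        for anchor, target in web.get(url, []):
            if target not in seen:
                seen.add(target)
                priority = -score(target, anchor, depth + 1, keywords)
                heapq.heappush(frontier, (priority, depth + 1, target))
    return order

print(crawl("http://a.example/", ["forum"]))
```

A real spider would add fetching, politeness delays, and robots.txt handling; only the ordering logic is shown.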
Content Analysis
 Model Based
      Develop models that generalize characteristics of data
      Machine learning: Supervised, semi-supervised, unsupervised
        E.g., sequence labeling, classification
        N-gram language models
      Linguistic: based on rules of English grammar
      Information Extraction
• Pattern Mining
  • frequency analysis, local patterns
  Google n-gram data
       What words are used in conjunction with Buffalo, Buffalo Sabres, University at Buffalo
  Query log analysis
       Learn spelling corrections, Learn lists of named entities, Learn relationships
       Discover trends
            Flu, cough, fever : frequency of queries in certain regions, change from the norm


            Combine both approaches
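The "change from the norm" idea for query trends can be sketched as a z-score of today's query count against a region's recent history. The counts, region names, and threshold below are made up for illustration; real query-log trend detection would also need to handle seasonality and overall traffic growth.

```python
from statistics import mean, stdev

def change_from_norm(history, today):
    """Standard deviations by which today's count departs from the historical mean."""
    sigma = stdev(history)
    if sigma == 0:
        return 0.0  # flat history: no meaningful scale, so report no deviation
    return (today - mean(history)) / sigma

def flag_regions(counts_by_region, threshold=3.0):
    """Regions whose count for today is at least `threshold` deviations above normal."""
    flagged = []
    for region, (history, today) in counts_by_region.items():
        if change_from_norm(history, today) >= threshold:
            flagged.append(region)
    return sorted(flagged)

# Hypothetical daily counts of queries such as "flu", "cough", "fever".
counts = {
    "Buffalo": ([100, 110, 90, 105, 95], 180),
    "Rochester": ([80, 85, 75, 82, 78], 84),
}
print(flag_regions(counts))
```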
Reliability of Data
 How much trust in data? (Forrester)
      Email from people you know: 77%
      Consumer product ratings/reviews: 60%
      Message board posts: 21%
      Personal blog: 18%, company blog: 16%
 Splog: Spam in weblogs
      UK has lawful intercept program
      What about results of data mining?
 Off-topic posts
      Comments on blog posts, forums quickly turn into personal
       rants, completely off-topic
 Possible Remedies
      Focus on sites where data is known to be more reliable
      Use technology to filter out spam, splog and off-topic posts
Informal Language
Loss of Functional Indicators
       Missing punctuation
       Missing or raNDOm case information
       Whole phrases reduced to acronyms

Casual, Phonetic Spelling
       tha, teh = the

Explicit Sentiment Commentary
       Happy Birthdaaaayyyy!!!1!1!
       must go <sigh>
       :-P grrr…..

Mistaken auto-correction or replacement
       Co-operation = Cupertino
       The Queen = Queen Elizabeth, “hundreds of worker bees commanded by
           Queen Elizabeth”

Twitter Conventions
       alanbr82 RT @royjwells: New Blog Post - Will Old Spice Achieve a ROI?
            http://ow.ly/2dZf7 #oldspice #sm #socialmedia
       RT, hashtags #, url shortening

Word Inventions
       refudiate, wee-wee’d up
       momager, rickRoll
       L33t, IMHO, meh

Solutions:
       • spelling correction
       • acronym look-up
       • machine learning: treat it as a machine translation problem!
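The spelling-correction and acronym look-up remedies named on this slide can be sketched with two small lookup tables plus a rule that collapses letter elongation. The tables and the `squeeze` and `normalize` helpers are illustrative assumptions; real systems would learn or curate much larger resources, and the machine-translation approach is not shown here.

```python
import re

# Tiny illustrative lookup tables, not a real lexicon.
SPELLING = {"tha": "the", "teh": "the"}
ACRONYMS = {"imho": "in my humble opinion", "btw": "by the way"}

def squeeze(word):
    """Collapse runs of three or more identical characters to a single one."""
    return re.sub(r"(.)\1{2,}", r"\1", word)

def normalize(text):
    """Lowercase, de-elongate, and apply spelling and acronym look-ups word by word."""
    out = []
    for word in re.findall(r"[A-Za-z']+", text):
        word = squeeze(word).lower()
        word = SPELLING.get(word, word)
        word = ACRONYMS.get(word, word)
        out.append(word)
    return " ".join(out)

print(normalize("Happy Birthdaaaayyyy!!!1!1!"))
print(normalize("IMHO teh movie was meh"))
```

Note that this normalization discards punctuation and emoticons, which the slide identifies as explicit sentiment cues, so a sentiment pipeline would want to record them before normalizing.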
Legal Issues
 Privacy of data
      UK has lawful intercept program
      What about results of data mining?
 Liability
      Major issue for pharmaceutical companies: if they discover
       report of side effect of drug, they are required to report it
      Analysts making positive public statements about company
       earnings, yet contradicting this on blogs, Facebook pages
 Workplace Issues
      Time spent on social media sites during work hours leading
       to lower productivity
Accuracy of Analysis
 Text analysis is based on natural language
  processing which is a useful, but imperfect
  technology
“Bill Gates, the CEO of Microsoft was initially very
   happy about its site location in Seattle, but now
   he has other thoughts. He is very displeased with
   the pollution…. Also, its employees are upset with
   the construction work…around its vicinity. In
   all, he wants to abandon the current site…..”

Who is expressing an opinion?
What is the opinion about?
Is it positive or negative?

Validate performance accuracy through benchmarks on specially
constructed data sets
Sentiment Analysis
Aims to determine the attitude of a speaker or a writer with respect to some target or topic.

      I think, Obama needs to begin to take the
      blame for his failed policies -- his statement
      "that his policies are getting us out of this
      mess" are a big lie1.

      [Annotation labels on the sentence: Opinion Holder; Topic; Target]
      SENTIMENT attributes: ID: ex1, TargetID: t1, Polarity: Negative



     1 - http://gretawire.blogs.foxnews.com/ouch-this-is-not-fair-to-president-obama-yes-an-accident-but-one-that-needs-
Opinion summary
   In product reviews, we are interested in generating a
    feature-based summary for a product.

Digital_camera_1:
    Feature: picture quality
        Positive: 253
               <individual review sentences>
        Negative: 6
               <individual review sentences>
    Feature: size
        Positive: 134
               <individual review sentences>
        Negative: 10
               <individual review sentences>
    …
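A feature-based summary of this shape can be produced by grouping already-extracted opinions by feature and polarity. The sketch below assumes the hard part (extracting the feature and polarity of each sentence) has been done upstream; the input tuples and the `summarize` and `render` names are hypothetical.

```python
from collections import defaultdict

def summarize(opinions):
    """Group review sentences by feature and polarity.

    opinions: iterable of (feature, polarity, sentence),
    where polarity is 'positive' or 'negative'.
    """
    summary = defaultdict(lambda: {"positive": [], "negative": []})
    for feature, polarity, sentence in opinions:
        summary[feature][polarity].append(sentence)
    return summary

def render(product, summary):
    """Format the summary in the layout used on the slide (counts only)."""
    lines = [f"{product}:"]
    for feature, groups in summary.items():
        lines.append(f"    Feature: {feature}")
        lines.append(f"        Positive: {len(groups['positive'])}")
        lines.append(f"        Negative: {len(groups['negative'])}")
    return "\n".join(lines)

# Made-up extracted opinions for illustration.
opinions = [
    ("picture quality", "positive", "The pictures are sharp."),
    ("picture quality", "positive", "Great photos in low light."),
    ("picture quality", "negative", "Colors look washed out."),
    ("size", "positive", "Fits in my pocket."),
]
summary = summarize(opinions)
print(render("Digital_camera_1", summary))
```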
Scalability: Massively Distributed/Parallel Computing
    Hadoop
         Open-source framework for running Map-Reduce on a cluster of commodity
          machines, as well as a distributed file system for long-term storage
         Map-Reduce (invented at Google) provides a way to process large data sets
          that scales linearly with the number of machines in the cluster....if your data
          doubles in size, just buy twice as many computers
         Hadoop now an Apache project led by the Grid Computing team at Yahoo!
    HIVE
         SQL-like query language, table partitioning schema, and metadata store built
          on top of Hadoop
         Developed at Facebook, now an Apache subproject



Facebook Analytics:
How many people are
discussing being laid off; plot
percentage of total posts by
state
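The "laid off" question maps naturally onto a map and reduce pair: the mapper emits, per post, a key (state) and a pair (mention count, 1), and the reducer sums both and computes a percentage. The sketch below imitates that flow in a single Python process purely to show the shape of the computation; it does not use the Hadoop or Hive APIs, and the posts and phrase match are invented.

```python
from collections import defaultdict

def map_post(post):
    """Mapper: emit (state, (mentions_layoff, 1)) for one post."""
    state, text = post
    mentions = 1 if "laid off" in text.lower() else 0
    yield state, (mentions, 1)

def reduce_state(state, values):
    """Reducer: percentage of a state's posts that mention being laid off."""
    hits = sum(v[0] for v in values)
    total = sum(v[1] for v in values)
    return state, 100.0 * hits / total

def run(posts):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)  # shuffle: group mapper output by key
    for post in posts:
        for key, value in map_post(post):
            groups[key].append(value)
    return dict(reduce_state(key, values) for key, values in groups.items())

# Made-up (state, post text) records.
posts = [
    ("NY", "Just got laid off today"),
    ("NY", "Great lunch"),
    ("CA", "Laid off again"),
    ("CA", "Beach day"),
    ("CA", "Traffic is awful"),
    ("CA", "New phone"),
]
result = run(posts)
print(result)
```

Because each mapper call sees one post and each reducer call sees one state, both phases can be spread across machines, which is the property the slide attributes to Map-Reduce.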
Multilingual Applications
Language Usage Statistics[1]
[Figure: internet language usage statistics]
 English is not the only language on the internet
 Urdu speaking Internet users: 12,000,000 (2006), ~ 1.6% of 42.4%




[1] Source:Internet World Stats. Based on 1,733,993,741 estimated internet users for Sept 30, 2009
Copyright 2009, Miniwatts Marketing Group
Multilingual Social Media Mining
How did people in Egypt, Israel and Pakistan react to the
  latest presidential speech?
Opinion Extraction
     Topic: What is the opinion about?
     Opinion Holder: Who is expressing it?
     What is the intensity of the opinion?
     In what context is it being expressed?


Emotion Detection
     What kind of emotion is being expressed? – goes beyond
      just the positive or negative emotion

    Required to perform behavioral analysis, cross cultural
      analysis
Faceted Search: Sentiment about Topic




People are filled with anger and sorrow because of the policies made by Musharaf.
                       OPINION HOLDER – Writer, People
         TARGET –Musharaf’s policies (Musharaf is an implied target)
Multilingual Text Analysis
 Dealing with script, coding variations
 Even low-level text analysis becomes difficult
   Chinese: no white space between words

   Arabic: complex diacriticals

   Language Training Resources
       Lexicons, annotated corpora, etc.
       If sufficient training data exists, new languages
        can be adapted to fairly easily
         E.g. core Russian in 3 weeks!
 Treat language porting as a special case of
  domain porting
      Ideally, should involve creation of new data
       sources, not new code
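To show why the absence of white space makes even tokenization non-trivial for Chinese, here is a sketch of greedy forward maximum matching, a classic lexicon-based baseline. It is offered as an illustration only and is not claimed to be the method behind the systems in this talk; the two-entry lexicon is made up. Note that supporting a new vocabulary means changing the lexicon data, not the code.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching.

    At each position take the longest lexicon entry that matches;
    fall back to a single character when nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

# Toy lexicon for illustration.
lexicon = {"非正式", "晚餐"}
print(segment("共进非正式晚餐", lexicon))
```

Greedy matching fails on genuinely ambiguous strings, which is one reason statistical sequence labeling is commonly used instead.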
Chinese Text Analysis




Context Aware Translation
斯洛文尼亚总理扬沙,欧洲委员会主席巴罗佐和欧盟外交政策
    负责人索拉纳与梅德韦杰夫共进非正式晚餐

Babelfish Translation:
  Slovenia premier the sand blowing, Council of Europe President Baluozuo
  and European Union foreign policy person in charge Solana and
  Medvedev have the unofficial supper.

Context Aware Translation, name translation output:
  <NeGPE english="Slovenia"> 斯洛文尼亚 </NeGPE> 总理
  <NePer english="Jansa"> 扬沙 </NePer> ,
  <NeOrg english="European Commission"> 欧洲 委员会 </NeOrg> 主席
  <NePer english="Barroso"> 巴罗佐 </NePer> 和
  <NeGPE english="European Union"> 欧盟 </NeGPE> 外交 政策 负责人
  <NePer english="Solana"> 索拉纳 </NePer> 与
  <NePer english="Medvedev"> 梅德韦杰夫 </NePer>
  共 进 非正式 晚餐 。




  Powered by Semantex™ extracted entities, Babelfish translates as:
  Slovenia Premier Jansa, Council of Europe President Barroso and
  European Union foreign policy person in charge Solana and
  Medvedev have the unofficial supper.
Mining Wikipedia for Lexicons




• Translation lexicons automatically extracted from Chinese Wikipedia, use cross language
links to add English translations
• Easy to regenerate with new versions of Wikipedia
• Chinese Wikipedia is constantly growing
COLABA: Colloquial Arabic Blog Analysis
– Proliferation of open source, social media
– Dominance of non-English content
– Use of dialects and colloquial language
– Limited supply of multilingual analysts
Tools made for MSA fail on Arabic dialects

Human translation for all Arabic variants below is the same:
“There is no electricity, what happened?”
Arabic Variant   Arabic Source Text            Google Translate
Egyptian         الكهربا اتقطعت، ليه كده بس؟    Atqtat electrical wires, Why are Posted?
Levantine        شكلو مفيش كهربا، ليش هيك؟      Cklo Mafeesh ?كهربا Lech heck ,
Iraqi            شو ماكو كهرباء، خير؟           Xu MACON electricity, good?
MSA              ليوجد كهرباء، ماذا حصل؟        Does not have electricity, what happened?

Arabic Dialects are not handled well in current machine translation systems.
COLABA enables MSA tools to interpret dialects correctly.
Code Mixing, Switching
 Use of Latin script: lack of transliteration
  standards makes it difficult to process
 Spanglish, Hinglish, Urdish, etc.

Afsoos key baat hai . kal tak jo batain Non Muslim bhi kartay
hoay dartay thay abhi this man has brought it out in the open.
 [It is sad to see that those words that even a non muslim would
fear to utter until yesterday, this man has brought it out in the
open]


Solutions:
• Apply “romanized” POS tagger, English tagger in tandem: use machine learning
to combine evidence and generate final tag, language ID
• For longer English spans, use English NLP system
Resource Poor Languages
      Bootstrap Learning: process of improving the performance of a trained
      classifier by iteratively adding data that is labeled by the classifier itself
      to the training set, and retraining the classifier
Useful when there is not enough annotated data
Requirement: NEEDS SEED DATA
[Diagram labels: SEED DATA; TRAINING; corrections; CORRECT SAMPLES]
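The bootstrap loop defined on this slide can be sketched as self-training: train on the seed data, label the unlabeled pool, keep only confident predictions, and retrain. The word-vote classifier, the confidence threshold, and the example texts below are all stand-ins chosen to keep the sketch self-contained, and the human "corrections" step from the diagram is omitted.

```python
from collections import Counter, defaultdict

def train(labeled):
    """Build per-label word counts from (text, label) pairs."""
    counts = defaultdict(Counter)
    for text, label in labeled:
        counts[label].update(text.lower().split())
    return counts

def predict(model, text):
    """Return (label, confidence); confidence is the winner's share of word votes."""
    words = text.lower().split()
    votes = {label: sum(c[w] for w in words) for label, c in model.items()}
    total = sum(votes.values())
    if total == 0:
        return None, 0.0
    label = max(votes, key=votes.get)
    return label, votes[label] / total

def bootstrap(seed, unlabeled, threshold=0.8, rounds=5):
    """Iteratively add confidently self-labeled examples and retrain."""
    labeled = list(seed)
    pool = list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        confident, rest = [], []
        for text in pool:
            label, conf = predict(model, text)
            if conf >= threshold:
                confident.append((text, label))
            else:
                rest.append(text)
        if not confident:
            break  # nothing new was learned this round
        labeled.extend(confident)
        pool = rest
    return train(labeled), labeled

seed = [("great camera", "pos"), ("terrible battery", "neg")]
unlabeled = ["great lens", "terrible lens cap", "lens cap"]
model, labeled = bootstrap(seed, unlabeled)
print(labeled)
```

The threshold matters: set it too low and early mistakes are fed back into training and reinforced, which is presumably why the diagram includes a correction step.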
The Road Ahead?
Strengths
 free form facilitates capturing the true voice of customer, wisdom of crowd
 can be expressed through voice, text messaging on mobile phones, etc.
Weaknesses
 language analysis and mining are challenging
 susceptible to spam, self-serving use by companies
 Behaviour, predictive models need more research
Threats
 privacy and security issues: possible to assimilate detailed knowledge about person’s activities, whereabouts
 can lead to anti-social behaviour!
Opportunities
 promise of collective problem solving: coordination, cooperation
 mobile use supports dealing with societal problems, disaster situations: social network is geospatial proximity
THANKS! QUESTIONS?

Social Media Analytics: The Value Proposition

  • 1.
    Social Media Analytics:the Value Proposition Rohini K. Srihari KDD 2010 Workshop on Social Media Analytics July 25, 2010
  • 2.
    Outline  What isSocial Media?  Value Proposition: Why mine social media?  Business Analytics  Counterterrorism  Challenges  Technology, Challenges  Multilingual social media mining  Future
  • 3.
    Social Media Data Actionable Intelligence Consumer Generated, Not Edited, Not Authenticated
  • 4.
    Data/Text Mining Extractinguseful information from large data sets Analyze Observational Data to find unsuspected relationships and Summarize data in novel ways that are understandable and useful to data owner Information Discovery non-trivial, implicit, previously unknown relationships Ex of Trivial: Those who are pregnant are female Summarize as Patterns and Models (usually probabilistic) Usefulness: meaningful: lead to some advantage, usually economic Analysis: Automatic/Semi-Automatic Process (Knowledge Extraction)
  • 5.
  • 6.
    Market Size  Business Analytics market projected to be $28 billion in 2011 (IDC Report)  Social Analytics taking leading position of interest within organizations  Integrating Social Media Analytics and Business Intelligence Source: HCL India
  • 7.
    Customer Relationship Management Data sources are primarily internal  Call center transcripts  E-mail  Customer feedback  Cost avoidance  Product exchange mitigation  Early warning detection on new products  Increase in customer satisfaction and loyalty  Insight towards new products, product features  Identification of possible marketing opportunities
  • 8.
    e-Service Chat Monitoring Operator:How can I assist you today? Customer: I need help with operating your coffee maker I bought from Amazon.com yesterday. Operator: Certainly. What problem are you facing? Customer: I fill in the coffee powder, water, and then press the red button on the side, and nothing happens. Operator: The red button enables the ‘clean coffee maker’ process. You will need to use the white knob on the other side to brew coffee. Customer: I see. Customer: BTW, in the Nespresso cappuccino machine I recently bought, it was the red button for start. Is there anything else I can assist with today? SEND Alert: COMPETITOR PRODUCT MENTION
  • 9.
    Reputation Management  Datasources are primarily external, e.g.  www.youtube.com  www.epinions.com  tripadvisor.com (travel related website)  Consumer Brand Analytics  What are people saying about our brand?  Marketing Communications  Significant spending on marketing, advertising: companies trying to position their products  Brand analytics helps to determine whether such campaigns are effective
  • 10.
    Mining Product Reviews Application is Industrial Design  Automatically mine product reviews for information on product features, new requests, etc.  Focus on wheelchairs Features Extracted Easy to use Fit into a car Comfortable chair Light weight Convenient to fold Sturdy Good price
  • 11.
    Viral Marketing Jure Leskovec(Stanford), Lada Adamic (U of Michigan), Bernardo A. Huberman (HP Labs) Personalized recommendations Viral marketing Cross-selling “people who bought x also bought y” Collaborative filtering “based on ratings of users like you…” Delicious, Digg.com 68% of consumers consult friends and family before purchasing home electronics (Burke 2003) Success rate: # of purchases following a recommendation / # recommenders Books overall have a 3% success rate
  • 12.
    500 million activeusers! Many different groups clamoring for ▪ More than 20 million users update their status data and text analytics: at least once each day ▪ FB Engineers ▪ More than 850 million photos uploaded to the ▪ Advertisers site each month ▪ Page owners ▪ >1 billion pieces of content (web links, blog ▪ Platform/Connect developers posts, photos, etc.) shared each week ▪ Marketers ▪ Academics
  • 13.
    An aside: SocialMedia Marketing http://www.socialmediaexaminer.com/new-studies-show-value-of-social-media/  Lead Generation  Breakdown of respondents’ top benefits of social networking:  50%: Generating leads  45%: Keeping up with the industry  44%: Monitoring online conversation  38%: Finding vendors/suppliers  Online Forum Users Are Enthusiastic Brand Advocates  79.2% of forum contributors help a friend or family member make a decision about a product purchase – compared with 47.6% of non-contributors and 53.8% overall.  65% of forum contributors share advice (offline and in person) based on information that they’ve read online – compared with 35% of non-contributors and 40.8% overall.  57.7% of forum contributors proactively recommend someone make a particular purchase – compared with 16.9% of non-contributors and 24.9% overall.  Only 47% of Companies Experimenting With Social Media  Gartner study predicts that by the end of 2010, more than 60% of Fortune 1000 companies will manage an online community.  ComBlu’s study, The State of Online Branded Communities, shows that most companies do not understand how to engage within online communities and have no real idea of what their customers want on these sites.
  • 14.
    Citizen Response  E-RuleMaking  the use of digital technologies by government agencies in rulemaking, decision making processes  solicit citizen feedback on bills being debated in Congress  What new issues are being raised, what aspects of bill are popular, unpopular  Better to mine social media than using focus groups?  Political Campaigns  Why do people support a candidate- is it really based on issues?
  • 15.
    Use Case: Understandingand Visualizing Consumer Responses Extracting Entities and Sentiment to Power Alerting, Link Diagrams, and Geo-Mapping 15
  • 16.
    Twitter: Real-Time CitizenJournalism • Mumbai terror attack regarded as coming of age of Twitter • citizen journalism provided more valuable information than wire services, broadcast news • information about places to avoid, well being of relatives, friends, etc. • many redundant posts, users have to wade through hundreds of posts to locate useful information • Goal: to mine this data in real-time and produce well organized summaries 16
  • 17.
    Law Enforcement, HomelandSecurity • Facebook • gang members frequently boast about their activities on their facebook pages • Chat rooms • Stalkers, pedophiles • Twitter • protest rallies being planned G20 Summit Protest • who, what, where, when • Craigslist 17
  • 18.
    Human Behaviour Analysis  Process social media content, provide tools for analysts to: Predictive  Identify social networks: groups, members  Identify topics of discussion and sentiment Modeling • E.g. angry at govt., wanting retaliation, peacemakers • Thought influencers Link Diagrams  Identify social goals through analysis of verbal communication • Manipulation: Persuasion, threats, coercion • Religious supremacy: religious analogues • recruitment Social Media Content
  • 19.
  • 20.
    Analyzing Social MediaData  Content Analysis  Text analysis, multimedia analysis  Structure Analysis  Usage Analysis  Search engine optimization  What keywords are driving customers to your site, competitor sites  Query logs, site traffic Ideally combine all three of these!
  • 21.
    Solution Framework Mark Logic Thetus Kapow Oracle, MySQL I2 Attensity RDF Triple Stores Enterprise Palantir Themis CouchDB Content Autonomy Jodange, Lexalytics, Cymfony, Blogpulse
  • 22.
    Content Acquisition  Pre-selected,validated sites  Epinions.com, Amazon.com, NYT blogs, reader comments Search Service  Tripadvisor.com, Craigslist  Twitter, Facebook  Blog Search Engines  Google Blog Search http://blogsearch.google.com/  Technorati http://technorati.com/  Blogpulse http://blogpulse.com/  BoardReader Lucene Index Storage  http://boardreader.com/  http://www.omgili.com/  Spidering
  • 23.
    Data Collection: Spidering “Dark Web” : the portion of the WorldWideWeb used to help achieve the sinister objectives of terrorists and extremists.  Spider uses breadth and depth first (BFS and DFS) traversal for crawl space URL ordering based on URL tokens, anchor text, and link levels. • Automated discovery of proxy servers to distribute collection and increase reliability. •
  • 24.
    Content Analysis  ModelBased  Develop models that generalize characteristics of data  Machine learning: Supervised, semi-supervised, unsupervised  E.g., sequence labeling, classification  N-gram language models  Linguistic: based on rules of English grammar  Information Extraction • Pattern Mining • frequency analysis, local patterns Google n-gram data What words are used in conjunction with Buffalo, Buffalo Sabres, University at Buffalo Query log analysis Learn spelling corrections, Learn lists of named entities, Learn relationships Discover trends Flu, cough, fever : frequency of queries in certain regions, change from the norm Combine both approaches
  • 25.
    Reliability of Data
    How much trust in data? (Forrester)
    • Email from people you know: 77%
    • Consumer product ratings/reviews: 60%
    • Message board posts: 21%
    • Personal blog: 18%, company blog: 16%
    • Splog: spam in weblogs
    • UK has lawful intercept program
    • What about results of data mining?
    • Off-topic posts
        – Comments on blog posts, forums quickly turn into personal rants, completely off-topic
    • Possible Remedies
        – Focus on sites where data is known to be more reliable
        – Use technology to filter out spam, splog and off-topic posts
  • 27.
    Informal Language
    • Loss of Functional Indicators
        – Missing punctuation
        – Missing or raNDOm case information
    • Whole phrases reduced to acronyms
    • Casual, Phonetic Spelling
        – tha, teh = the
    • Explicit Sentiment Commentary
        – Happy Birthdaaaayyyy!!!1!1!
        – must go <sigh>
        – :-P grrr…..
    • Mistaken auto-correction or replacement
        – Co-operation = Cupertino
        – The Queen = Queen Elizabeth, “hundreds of worker bees commanded by Queen Elizabeth”
    • Twitter Conventions
        – alanbr82 RT @royjwells: New Blog Post - Will Old Spice Achieve a ROI? http://ow.ly/2dZf7 #oldspice #sm #socialmedia
        – RT, hashtags #, url shortening
    • Word Inventions
        – refudiate, wee-wee’d up, momager, rickRoll
        – L33t, IMHO, meh
    Solutions:
    • Spelling correction
    • Acronym look-up
    • Machine learning: treat it as a machine translation problem!
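    The first two remedies (spelling correction and acronym look-up) can be approximated with simple table look-ups, as in the sketch below. The `SPELLING` entries come from the slide; the `ACRONYMS` table and the character-collapsing rule are assumptions for illustration, and the collapsing heuristic can overcorrect (it would turn "goooood" into "god").

```python
import re

SPELLING = {"tha": "the", "teh": "the"}       # examples from the slide
ACRONYMS = {"imho": "in my humble opinion"}   # assumed expansion table


def normalize(text):
    """Lowercase, collapse elongated characters, then apply look-up tables."""
    out = []
    for token in text.split():
        low = token.lower()
        # Collapse any run of 3+ identical characters to one character.
        low = re.sub(r"(.)\1{2,}", r"\1", low)
        low = SPELLING.get(low, low)
        low = ACRONYMS.get(low, low)
        out.append(low)
    return " ".join(out)


if __name__ == "__main__":
    print(normalize("Happy Birthdaaaayyyy teh end"))
```

    A statistical approach, such as the machine-translation framing mentioned on the slide, would replace these hand-built tables with mappings learned from data.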
  • 28.
    Legal Issues
    • Privacy of data
        – UK has lawful intercept program
        – What about results of data mining?
    • Liability
        – Major issue for pharmaceutical companies: if they discover report of side effect of drug, they are required to report it
        – Analysts making positive public statements about company earnings, yet contradicting this on blogs, Facebook pages
    • Workplace Issues
        – Time spent on social media sites during work hours leading to lower productivity
  • 29.
    Accuracy of Analysis
    Text analysis is based on natural language processing, which is a useful but imperfect technology
    “Bill Gates, the CEO of Microsoft was initially very happy about its site location in Seattle, but now he has other thoughts. He is very displeased with the pollution…. Also, its employees are upset with the construction work…around its vicinity. In all, he wants to abandon the current site…..”
    • Who is expressing an opinion?
    • What is the opinion about?
    • Is it positive or negative?
    Validate performance accuracy through benchmarks on specially constructed data sets
  • 30.
    Sentiment Analysis
    Aims to determine the attitude of a speaker or a writer with respect to some target or topic.
    I think, Obama needs to begin to take the blame for his failed policies -- his statement "that his policies are getting us out of this mess" are a big lie [1]
    SENTIMENT attributes: ID: ex1, TargetID: t1, Polarity: Negative
    [Annotation labels: Opinion Holder, Topic, Target]
    [1] http://gretawire.blogs.foxnews.com/ouch-this-is-not-fair-to-president-obama-yes-an-accident-but-one-that-needs-
  • 31.
    Opinion summary
    • In product reviews, we are interested in generating a feature-based summary for a product.
    Digital_camera_1:
        Feature: picture quality
            Positive: 253 <individual review sentences>
            Negative: 6 <individual review sentences>
        Feature: size
            Positive: 134 <individual review sentences>
            Negative: 10 <individual review sentences>
        …
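    A feature-based summary of this shape could be assembled as sketched below, assuming an upstream step has already labeled each review sentence with a product feature and a polarity. The function names and sample sentences are hypothetical; the feature and sentiment extraction itself is not shown.

```python
from collections import defaultdict


def summarize(labeled_sentences):
    """Group (feature, polarity, sentence) triples by feature and polarity."""
    summary = defaultdict(lambda: {"positive": [], "negative": []})
    for feature, polarity, sentence in labeled_sentences:
        summary[feature][polarity].append(sentence)
    return summary


def render(product, summary):
    """Format the summary in the layout shown on the slide."""
    lines = [f"{product}:"]
    for feature, groups in summary.items():
        lines.append(f"  Feature: {feature}")
        lines.append(f"    Positive: {len(groups['positive'])}")
        lines.append(f"    Negative: {len(groups['negative'])}")
    return "\n".join(lines)


# Hypothetical pre-labeled review sentences.
REVIEWS = [
    ("picture quality", "positive", "The photos are sharp."),
    ("picture quality", "positive", "Great colors."),
    ("picture quality", "negative", "Blurry in low light."),
    ("size", "positive", "Fits in my pocket."),
]

if __name__ == "__main__":
    print(render("Digital_camera_1", summarize(REVIEWS)))
```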
  • 32.
    Scalability: Massively Distributed/Parallel Computing
    • Hadoop
        – Open-source framework for running Map-Reduce on a cluster of commodity machines, as well as a distributed file system for long-term storage
        – Map-Reduce (invented at Google) provides a way to process large data sets that scales linearly with the number of machines in the cluster... if your data doubles in size, just buy twice as many computers
        – Hadoop now an Apache project led by the Grid Computing team at Yahoo!
    • HIVE
        – SQL-like query language, table partitioning schema, and metadata store built on top of Hadoop
        – Developed at Facebook, now an Apache subproject
    Facebook Analytics: How many people are discussing being laid off; plot percentage of total posts by state
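    The Map-Reduce pattern behind the "laid off" example can be imitated on a single machine, as in the sketch below. It is plain Python rather than Hadoop or Hive code; the `POSTS` records are invented, and reading "percentage of total posts by state" as the share of each state's posts that mention a layoff is an assumption.

```python
from collections import defaultdict

# Hypothetical (state, text) records standing in for a large post log.
POSTS = [
    ("NY", "just got laid off today"),
    ("NY", "great weather in Buffalo"),
    ("CA", "laid off again"),
    ("CA", "new job"),
    ("CA", "lunch"),
    ("CA", "beach day"),
]


def map_post(state, text):
    """Map step: emit (state, (mentions_layoff, 1)) for one post."""
    yield state, (1 if "laid off" in text.lower() else 0, 1)


def shuffle(pairs):
    """Group mapped values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_state(state, values):
    """Reduce step: percentage of the state's posts that mention a layoff."""
    hits = sum(hit for hit, _ in values)
    total = sum(count for _, count in values)
    return state, 100.0 * hits / total


def run(posts):
    mapped = (pair for state, text in posts for pair in map_post(state, text))
    return dict(reduce_state(key, vals) for key, vals in shuffle(mapped).items())
```

    Because each map call looks at one post and each reduce call looks at one state, both phases can in principle be spread across machines, which is the property the slide attributes to Hadoop.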
  • 33.
  • 34.
    Language Usage Statistics [1]
    English is not the only language on the internet
    Urdu speaking Internet users - 12,000,000 (2006) ~ 1.6% of 42.4%
    [1] Source: Internet World Stats. Based on 1,733,993,741 estimated internet users for Sept 30, 2009. Copyright 2009, Miniwatts Marketing Group
  • 35.
    Multilingual Social Media Mining
    How did people in Egypt, Israel and Pakistan react to the latest presidential speech?
    Opinion Extraction
    • Topic: What is the opinion about?
    • Opinion Holder: Who is expressing it?
    • What is the intensity of the opinion?
    • In what context is it being expressed?
    Emotion Detection
    • What kind of emotion is being expressed? Goes beyond just the positive or negative emotion
    Required to perform behavioral analysis, cross cultural analysis
  • 36.
    Faceted Search: Sentiment about Topic
    People are filled with anger and sorrow because of the policies made by Musharaf.
    OPINION HOLDER – Writer, People
    TARGET – Musharaf’s policies (Musharaf is an implied target)
  • 37.
    Multilingual Text Analysis
    • Dealing with script, coding variations
        – Even low-level text analysis becomes difficult
        – Chinese: no white space between words
        – Arabic: complex diacriticals
    • Language Training Resources
        – Lexicons, annotated corpora, etc.
        – If sufficient training data exists, new languages can be adapted to fairly easily
        – E.g. core Russian in 3 weeks!
    • Treat language porting as a special case of domain porting
        – Ideally, should involve creation of new data sources, not new code
  • 38.
  • 39.
    Context Aware Translation
    Source: 斯洛文尼亚总理扬沙,欧洲委员会主席巴罗佐和欧盟外交政策负责人索拉纳与梅德韦杰夫共进非正式晚餐
    Babelfish Translation:
        Slovenia premier the sand blowing, Council of Europe President Baluozuo and European Union foreign policy person in charge Solana and Medvedev have the unofficial supper.
    Name translation output:
        <NeGPE english="Slovenia"> 斯洛文尼亚 </NeGPE> 总理 <NePer english="Jansa"> 扬沙 </NePer> , <NeOrg english="European Commission"> 欧洲 委员会 </NeOrg> 主席 <NePer english="Barroso"> 巴罗佐 </NePer> 和 <NeGPE english="European Union"> 欧盟 </NeGPE> 外交 政策 负责人 <NePer english="Solana"> 索拉纳 </NePer> 与 <NePer english="Medvedev"> 梅德韦杰夫 </NePer> 共 进 非正式 晚餐 。
    Context Aware Translation:
        Powered by Semantex™ extracted entities, Babelfish translates as: Slovenia Premier Jansa, Council of Europe President Barroso and European Union foreign policy person in charge Solana and Medvedev have the unofficial supper.
  • 40.
    Mining Wikipedia for Lexicons
    • Translation lexicons automatically extracted from Chinese Wikipedia, use cross language links to add English translations
    • Easy to regenerate with new versions of Wikipedia
    • Chinese Wikipedia is constantly growing
  • 41.
    COLABA: Colloquial Arabic Blog Analysis
    – Proliferation of open source, social media
    – Dominance of non-English content
    – Use of dialects and colloquial language
    – Limited supply of multilingual analysts
  • 42.
    Tools made for MSA fail on Arabic dialects
    Human translation for all Arabic variants below is the same: “There is no electricity, what happened?”
    Arabic Variant | Arabic Source Text | Google Translate
    Egyptian | الكهربا اتقطعت، ليه كده بس؟ | Atqtat electrical wires, Why are Posted?
    Levantine | شكلو مفيش كهربا، ليش هيك؟ | Cklo Mafeesh كهربا, Lech heck?
    Iraqi | شو ماكو كهرباء، خير؟ | Xu MACON electricity, good?
    MSA | لا يوجد كهرباء، ماذا حصل؟ | Does not have electricity, what happened?
    Arabic dialects are not handled well in current machine translation systems.
    COLABA enables MSA tools to interpret dialects correctly.
  • 43.
    Code Mixing, Switching
    Use of Latin script: lack of transliteration standards makes it difficult to process
    • Spanglish, Hinglish, Urdish, etc.
    Afsoos key baat hai . kal tak jo batain Non Muslim bhi kartay hoay dartay thay abhi this man has brought it out in the open.
    [It is sad to see that those words that even a non muslim would fear to utter until yesterday, this man has brought it out in the open]
    Solutions:
    • Apply “romanized” POS tagger, English tagger in tandem: use machine learning to combine evidence and generate final tag, language ID
    • For longer English spans, use English NLP system
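    The second remedy, routing longer English spans to an English NLP system, can be illustrated with a deliberately simple stand-in for language ID. The sketch below uses a tiny hand-made word list (`ENGLISH`) instead of the machine-learned tagger combination the slide describes; that substitution, the word list, and the `min_len` threshold are all assumptions.

```python
# Tiny illustrative English word list; a real system would use a learned
# language identifier rather than a fixed lexicon.
ENGLISH = {
    "this", "man", "has", "brought", "it", "out", "in", "the", "open",
    "non", "muslim",
}


def tag_languages(tokens):
    """Label each token 'en' or 'other' by lexicon look-up."""
    return ["en" if t.lower().strip(".,") in ENGLISH else "other" for t in tokens]


def english_spans(tokens, min_len=4):
    """Return runs of at least `min_len` consecutive English tokens."""
    spans, start = [], None
    tags = tag_languages(tokens) + ["other"]  # sentinel closes a trailing span
    for i, tag in enumerate(tags):
        if tag == "en" and start is None:
            start = i
        elif tag != "en" and start is not None:
            if i - start >= min_len:
                spans.append(" ".join(tokens[start:i]))
            start = None
    return spans


TEXT = ("Afsoos key baat hai . kal tak jo batain Non Muslim bhi kartay hoay "
        "dartay thay abhi this man has brought it out in the open.")

if __name__ == "__main__":
    print(english_spans(TEXT.split()))
```

    On the slide's example, the two-word run "Non Muslim" falls below the threshold and stays with the romanized tagger, while the final clause is long enough to hand to an English pipeline.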
  • 44.
    Resource Poor Languages
    Bootstrap Learning: process of improving the performance of a trained classifier by iteratively adding data that is labeled by the classifier itself to the training set, and retraining the classifier
    • Useful when there is not enough annotated data
    • Requirement: NEEDS SEED DATA
    [Diagram labels: SEED, TRAINING DATA, corrections, CORRECT SAMPLES]
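    The bootstrap loop defined above can be sketched as follows. To stay self-contained, the example uses a toy nearest-centroid classifier on one-dimensional features; the classifier, the `min_margin` confidence threshold, and the sample data are assumptions, and the human "corrections" step shown in the slide's diagram is omitted.

```python
def train(labeled):
    """Nearest-centroid model on 1-D features: label -> mean feature value."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(model, x):
    """Return (label, margin): margin is the distance gap to the runner-up."""
    ranked = sorted(model, key=lambda y: abs(x - model[y]))
    best = ranked[0]
    if len(ranked) == 1:
        return best, float("inf")
    return best, abs(x - model[ranked[1]]) - abs(x - model[best])


def bootstrap(seed, unlabeled, min_margin=2.0, max_rounds=10):
    """Iteratively add confidently self-labeled items and retrain."""
    labeled, pool = list(seed), list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)
        scored = [(x, predict(model, x)) for x in pool]
        accepted = [(x, label) for x, (label, margin) in scored
                    if margin >= min_margin]
        if not accepted:
            break  # nothing confident enough to add; stop early
        labeled.extend(accepted)
        accepted_xs = {x for x, _ in accepted}
        pool = [x for x in pool if x not in accepted_xs]
    return train(labeled), labeled
```

    The confidence threshold is what keeps the loop from reinforcing its own mistakes too quickly: items the current model is unsure about stay in the unlabeled pool.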
  • 45.
    The Road Ahead?
    Strengths
    • Free form facilitates capturing the true voice of customer, wisdom of crowd
    • Can be expressed through voice, text messaging on mobile phones, etc.
    Weaknesses
    • Language analysis and mining are challenging
    • Susceptible to spam, self-serving use by companies
    Behaviour, predictive models need more research
    Threats
    • Privacy and security issues: possible to assimilate detailed knowledge about person’s activities, whereabouts
    • Can lead to anti-social behaviour!
    Opportunities
    • Promise of collective problem solving: coordination, cooperation
    • Mobile use supports dealing with societal problems, disaster situations: social network is geospatial proximity
  • 46.

Editor's Notes

  • #30 Copyright Janya Inc, Strictly Confidential