Filter keywords and majority class
   strategies for company name
     disambiguation on Twitter
   Damiano Spina, Enrique Amigó and Julio Gonzalo
        {damiano,enrique,julio}@lsi.uned.es
               UNED NLP & IR Group




                CLEF 2011 Conference
             September 19-22, Amsterdam
Goal
• Two signals coming from intuition:
  – Filter keywords
  – Majority Class
• Do they help to characterize and solve the
  problem?
WePS-3 Online Reputation Management Task
Tweets for query
       «jaguar»




• related tweets=8
• unrelated tweets=2
• Related ratio = 8/(8+2) = 0.8
Tweets for query
       «orange»




• related tweets=0
• unrelated tweets=10
• Related ratio = 0
Tweets for query
       «apple»




• related tweets=5
• unrelated tweets=5
• Related ratio = 0.5
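
The related ratio in the three examples above is simply the share of tweets annotated as related. A minimal sketch, assuming each tweet's annotation is stored as a boolean (the function name `related_ratio` is illustrative, not from the slides):

```python
def related_ratio(labels):
    """Fraction of tweets annotated as related (True)."""
    if not labels:
        raise ValueError("no tweets for this query")
    return sum(labels) / len(labels)

# Annotation counts taken from the three example queries.
examples = {
    "jaguar": [True] * 8 + [False] * 2,
    "orange": [False] * 10,
    "apple": [True] * 5 + [False] * 5,
}
ratios = {query: related_ratio(labels) for query, labels in examples.items()}
```

A ratio near 0 or 1 marks a skewed test case, while a ratio near 0.5 marks one where the majority class says little.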
Fingerprint representation
WePS-3 Task 2 Systems
Filter keywords
Tweets for query
       «apple»


• positive keyword: store
   • 4 tweets annotated as
       «related»



• negative keyword: eating
   • 2 tweets annotated as
      «unrelated»




    • Accuracy= 1.0
    • Recall=60%
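
The «apple» example can be reproduced with a small sketch. The ten tweets below are hypothetical, written only so that the counts match the slide (4 tweets with "store", 2 with "eating", 5 related and 5 unrelated overall). Recall is read here as the fraction of tweets covered by some keyword, which is an interpretation of the slide's 6-out-of-10 figure.

```python
def classify_with_keywords(text, positive, negative):
    """Return True (related), False (unrelated) or None (not covered)."""
    tokens = set(text.lower().split())
    has_pos = bool(tokens & positive)
    has_neg = bool(tokens & negative)
    if has_pos == has_neg:  # no keyword at all, or conflicting keywords
        return None
    return has_pos

# Hypothetical tweets with their gold annotation (True = related).
tweets = [
    ("new apple store opening downtown", True),
    ("queue outside the apple store today", True),
    ("apple store genius bar appointment", True),
    ("bought my phone at the apple store", True),
    ("eating an apple for breakfast", False),
    ("eating apple pie with friends", False),
    ("apple announces new laptop", True),
    ("apple tree in the garden", False),
    ("apple juice is the best", False),
    ("picked a red apple yesterday", False),
]
positive, negative = {"store"}, {"eating"}

predictions = [classify_with_keywords(text, positive, negative) for text, _ in tweets]
covered = [(pred, gold) for pred, (_, gold) in zip(predictions, tweets) if pred is not None]
accuracy = sum(pred == gold for pred, gold in covered) / len(covered)
recall = len(covered) / len(tweets)
```

Accuracy is computed only over the tweets that a keyword covers, which is why two keywords can be fully accurate and still leave 40% of the tweets undecided.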
Manual keywords (perfect for a Web user)
Company name  Positive keywords                                                                  Negative keywords
amazon        electronics, books, apparel, computers, buy                                        river, rainforest, deforestation, bolivian, brazilian
fox           tv, broadcast, shows, episodes, fringe, bones                                      animal, terrier, hunting, volkswagen, racing
ford          motor, cars, hybrids, crossovers, mondeo, focus, fiesta, prices, dealer, electric  tom, harrison, henry, glenn, gucci

Oracle keywords (perfect on Twitter)
Company name  Positive keywords                            Negative keywords
amazon        sale, books, deal, deals, gift               followdaibosyu, pest, plug, brothers, pirotta
fox           money, weather, leader, denouncing, viewers  megan, matthew, lazy, valley, michael
ford          mustang, focus, hybrid, motor, truck         tom, harrison, rob, bring, coppola
Upper bound of Filter Keywords
• Oracle keywords
  – 5 oracle keywords: ≈ 30% recall
  – 20 oracle keywords: ≈ 50% recall
• Manual keywords
  – ≈ 10 per company
  – 14.61% recall (vs. 39.97% with 10 oracle keywords)
  – 0.86 accuracy
• Twitter ≠ Web
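
The slides do not spell out how oracle keywords are chosen. One possible reading, sketched below, treats them as terms that occur in tweets of a single class only, ranked by how many tweets each covers. The functions `oracle_keywords` and `coverage` and the six tweets are hypothetical and serve only as illustration.

```python
from collections import defaultdict

def oracle_keywords(tweets, n):
    """Pick the n single-class terms that occur in the most tweets."""
    # term -> [ids of unrelated tweets, ids of related tweets]
    seen = defaultdict(lambda: [set(), set()])
    for i, (text, related) in enumerate(tweets):
        for term in set(text.lower().split()):
            seen[term][int(related)].add(i)
    pure = {}
    for term, (unrel, rel) in seen.items():
        if rel and not unrel:
            pure[term] = (True, rel)     # positive keyword
        elif unrel and not rel:
            pure[term] = (False, unrel)  # negative keyword
    # Most covering terms first; ties broken alphabetically.
    ranked = sorted(pure.items(), key=lambda kv: (-len(kv[1][1]), kv[0]))
    return ranked[:n]

def coverage(tweets, keywords):
    """Fraction of tweets containing at least one selected keyword."""
    covered = set()
    for _, (_, ids) in keywords:
        covered |= ids
    return len(covered) / len(tweets)

# Hypothetical annotated tweets (True = related to the company).
tweets = [
    ("apple store opening", True),
    ("apple store queue", True),
    ("apple laptop launch", True),
    ("eating an apple", False),
    ("eating apple pie", False),
    ("apple tree garden", False),
]
top2 = oracle_keywords(tweets, 2)
top2_coverage = coverage(tweets, top2)
```

Because the selection uses the gold annotations, it gives an upper bound on what keyword filtering can cover rather than a usable system.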
Majority Class
Tweets for query
       «jaguar»




• related tweets=8
• unrelated tweets=2
• Related ratio = 8/(8+2) = 0.8




    • Accuracy= 0.80
    • Recall=100%
Upper bound of Majority Class: winner-takes-all
• For each test case (company name)
   – label all tweets unrelated or all related
• Optimal decision
   – 0.80 accuracy
      • ≈ best manual system (0.83)
      • > best automatic system (0.75)
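
With the optimal all-or-nothing decision, the accuracy for one company is the larger of its related ratio and its complement. A minimal sketch using the three example queries from earlier in the deck (the 0.80 reported on the slide is measured on the WePS-3 test cases, so this toy average is not expected to match it):

```python
def winner_takes_all_accuracy(related_ratio):
    """Accuracy when every tweet of a company gets its majority label."""
    return max(related_ratio, 1.0 - related_ratio)

# Related ratios of the three example queries.
ratios = {"jaguar": 0.8, "orange": 0.0, "apple": 0.5}
per_company = {name: winner_takes_all_accuracy(r) for name, r in ratios.items()}
mean_accuracy = sum(per_company.values()) / len(per_company)
```

The more skewed the company names are, the higher this bound gets, which is why it can compete with full systems.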
Filter keywords + majority class: upper bound
[Diagram: filter keywords (oracle or manual) → majority class? → tweets]
(1) winner-takes-all
[Diagram: filter keywords (oracle or manual) → majority class → tweets]
(2) winner-takes-remainder
[Diagram: filter keywords (oracle or manual) → majority class → tweets]
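
The diagrams for strategies (1) and (2) did not survive extraction, so the following is one reading of their names rather than the authors' exact procedure. It assumes the tweets labelled by filter keywords form a seed, that winner-takes-all gives the seed's majority label to every tweet, and that winner-takes-remainder keeps the keyword labels and gives the majority label only to uncovered tweets. The seed and gold labels are hypothetical.

```python
from collections import Counter

def majority(labels):
    """Majority label of the seed; ties and empty seeds default to related."""
    counts = Counter(labels)
    return counts[True] >= counts[False]

def winner_takes_all(seed):
    """seed holds True/False for keyword-covered tweets and None otherwise."""
    winner = majority([s for s in seed if s is not None])
    return [winner] * len(seed)

def winner_takes_remainder(seed):
    """Keep keyword labels; give the majority label to uncovered tweets."""
    winner = majority([s for s in seed if s is not None])
    return [winner if s is None else s for s in seed]

def accuracy(predicted, gold):
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Hypothetical case: 4 tweets hit a positive keyword, 2 a negative one.
seed = [True, True, True, True, False, False, None, None, None, None]
gold = [True, True, True, True, False, False, True, False, False, False]

wta_accuracy = accuracy(winner_takes_all(seed), gold)
wtr_accuracy = accuracy(winner_takes_remainder(seed), gold)
```

On a balanced case like this one, keeping the keyword decisions helps; on a heavily skewed case the two strategies give nearly the same result.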
(3) bootstrapping
[Diagram: filter keywords (oracle or manual) → training → machine learning → application → tweets]
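
The bootstrapping strategy can be sketched as follows: tweets labelled by filter keywords become training data, and the learned classifier is applied to the remaining tweets. The slide only says "machine learning", so the Naive Bayes learner, the function names and the tweets below are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_nb(texts, labels):
    """Multinomial Naive Bayes over whitespace tokens."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, class_counts, vocab

def predict_nb(model, text):
    """Return True (related) or False (unrelated); ties go to related."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in (True, False):
        # Add-one smoothing on the prior and on the word likelihoods.
        score = math.log((class_counts[label] + 1) / (total + 2))
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            if token in vocab:  # tokens unseen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        scores[label] = score
    return scores[True] >= scores[False]

def bootstrap(texts, seed):
    """Train on keyword-labelled tweets, then classify the uncovered ones."""
    train = [(t, s) for t, s in zip(texts, seed) if s is not None]
    model = train_nb([t for t, _ in train], [s for _, s in train])
    return [predict_nb(model, t) if s is None else s for t, s in zip(texts, seed)]

# Hypothetical tweets; seed labels come from the keywords "store" and "eating".
texts = [
    "new apple store opening with iphone launch",
    "apple store iphone queue",
    "eating an apple pie",
    "eating fresh apple fruit",
    "apple iphone launch event",
    "fresh apple pie recipe",
]
seed = [True, True, False, False, None, None]
labels = bootstrap(texts, seed)
```

Unlike the majority-class strategies, this lets the uncovered tweets of one company receive different labels.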
Filter keywords + majority class
[Figure; annotation: ≈ ‘all related’ baseline]
Filter keywords + majority class baseline
• Automatic discovery of filter keywords:
  Terms → Keyword classification → Filter keywords (automatic)
   – 13 term features:
      • 3 collection-based features
      • 6 Web-based features
      • 4 features expanded by co-occurrence
   – 3 classification methods:
      • Machine learning (neural net + all features)
      • Heuristic (2 features: col_c_specificity + cooc_om_assoc)
      • Hybrid (neural net + the heuristic's features)
Automatic Tweet Classification
[Bar chart of accuracy. Series: WePS-3 systems (manual), WePS-3 systems (automatic), filter keywords + majority class baseline. Plotted values: 0.83, 0.75, 0.73, 0.63, 0.56, 0.48]
Conclusions
• Fingerprint representation
   – Behaviour of binary classification systems on skewed
     datasets
   – Baselines independent of corpus
• Twitter ≠ Web
   – Oracle keywords ≠ Manual keywords
• Filter keywords & majority class strategies
   – Useful signals that help solve the problem
   – Each signal alone already gives competitive
     performance