Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Textometry and Information Discovery : A New Approach to Mining Textual Data on the Web         Erin MacMurray*, Marguerit...
In a nutshell             •   Introduction & background             •   Textometry and Web Mining: why?             •   Te...
Introduction & background                 Structure ?                          Man versus machine ?    Seth Grimes sees « ...
Textometry and Web Mining: why ?• Improving Linguistic Models   – Semantic complexity of simple units such as NE   – Ident...
Textometry and Web Mining: why ?         •   Text is considered having its own internal structure         •   Application ...
Textometry and Web Mining: how?                                                                                           ...
Textometry and Web Mining: how?Two words or more that appear at the same time in a predetermined span of text- lexicalrela...
Textometry and Web Mining : how?       1/ POINT OF ENTRY                                  2/ CORPUS                       ...
Textometry and Web Mining: results?   Observing forms and repeted segments of « Nicolas Sarkozy »   allows identifying pol...
Textometry and Web Mining: results?             Figure - Monthly variation of specificness for paraphrases for the NE « Ni...
Textometry and Web Mining: results?As a current event is discussed in the media, the lexical network produced by the co-oc...
Textometry and Web Mining: results?22/07/2011   E. MacMurray & M. Leenhardt   ICAI’11 Workshop on Intelligent Linguistic T...
Textometry and Web Mining: results?22/07/2011   E. MacMurray & M. Leenhardt   ICAI’11 Workshop on Intelligent Linguistic T...
Conclusion• Two intelligence use-cases on Le Monde and The New York Times• Two complementary approaches : specificness and...
ReferencesBloom K., Stein S. & Argamon S., Appraisal extraction for news opinion analysis at NTCIR-6, Proceedings of NTCIR...
Upcoming SlideShare
Loading in …5
×

Textometry and Information Discovery : A New Approach to Mining Textual Data on the Web

1,981 views

Published on

Presentation displayed at the ICAI'2011 - ILINTEC Workshop on Intelligent Linguistic Technologies

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Textometry and Information Discovery : A New Approach to Mining Textual Data on the Web

  1. 1. Textometry and Information Discovery : A New Approach to Mining Textual Data on the Web Erin MacMurray*, Marguerite Leenhardt ** SYLED/CLA2T EA2290, UFR ILPGA, Université Sorbonne Nouvelle Paris 3 *erin.macmurray@gmail.com ** marguerite.leenhardt@gmail.com ICAI’11 Workshop on Intelligent Linguistic Technologies
  2. 2. In a nutshell • Introduction & background • Textometry and Web Mining: why? • Textometry and Web Mining: how? • Textometry and Web Mining: application? • Conclusion22/07/2011 E. MacMurray & M. Leenhardt ICAI’11 Workshop on Intelligent Linguistic Technologies 2
  3. 3. Introduction & background Structure ? Man versus machine ? Seth Grimes sees « three categories of Neil Glassman « between those on one data : (i) Quantities, whether measured, side who feel the accuracy of automated observed, or computed (ii) Content, which [content analysis] is sufficient and those I’ll characterize as non-quantitative on the other side who feel we can only rely information (iii) Metadata describing on human analysis […] most in the field quantities and content. concur with the idea that we need to Structured/unstructured is a false define a methodology where the software dichotomy. » and the analyst collaborate to get over the noise and deliver accurate analysis. » (July 2011 – IKS Semantic Workshop, France) (May 2011 – Sentiment Analysis Symposium review)22/07/2011 E. MacMurray & M. Leenhardt ICAI’11 Workshop on Intelligent Linguistic Technologies 3
  4. 4. Textometry and Web Mining: why ?• Improving Linguistic Models – Semantic complexity of simple units such as NE – Identifying paraphrases of NE INTEL Paris Le président de la République Gone with the wind Sarko Harry Potter JIF Peanut Butter Nicolas Sarkozy Sarkoland lyzozym The 4th of July 20GB M. Sarkozy Dulles International Airport Sarkozyste Le Tour de France Mr Sarkozy Sarkozysme www.nytimes.com NE : an heterogeneous category Ehrmann M. (2008) les EN de la linguistique au TAL statut Paraphrases of a single NE théorique et méthodes de désambiguïsation.22/07/2011 E. MacMurray & M. Leenhardt ICAI’11 Workshop on Intelligent Linguistic Technologies 4
  5. 5. Textometry and Web Mining: why ? • Text is considered having its own internal structure • Application of statistical and probabilistic calculations directly to the textual units of comparable texts in a corpus22/07/2011 E. MacMurray & M. Leenhardt ICAI’11 Workshop on Intelligent Linguistic Technologies 5
  6. 6. Textometry and Web Mining: how? Form Specificness b 23.43July 4th 2011 b 12.68 b 5.57 b 5.66 Hypergeometric Distribution Form Specificness d 13.73July 5th 2011 d 21.86 d 7.75 d 6.55 22/07/2011 E. MacMurray & M. Leenhardt ICAI’11 Workshop on Intelligent Linguistic Technologies 6
  7. 7. Textometry and Web Mining: how?Two words or more that appear at the same time in a predetermined span of text- lexicalrelationships around a pivot-form (William Martinez, 2003)Result: network of associative relationships A ---A---C---B---D. ---B---C---H---E. ---B-- C --A---E. B C E ---E---B---D---F. ---C---A---D---H. A B C ---F---C---B---D. ---E---B---D---A. E 22/07/2011 E. MacMurray & M. Leenhardt ICAI’11 Workshop on Intelligent Linguistic Technologies 7
  8. 8. Textometry and Web Mining : how? 1/ POINT OF ENTRY 2/ CORPUS 184,761 occurrences / 13,075 forms / 5,194 hapaxNE (companies Article 160 articlesand people) selection 197,341 occurrences / 17,807 formes / 9,416 hapax 103 articles Company NE = Xerox People NE = Nicolas Sarkozy 3/ TEXTOMETRIC ANALYSIS 4/ INTERPRETATION OF RESULTS Hypergeometric Disribution Quantitative information to formulate qualitative interpretations. Specificness Cooccurrences 22/07/2011 E. MacMurray & M. Leenhardt ICAI’11 Workshop on Intelligent Linguistic Technologies 8
  9. 9. Textometry and Web Mining: results? Observing forms and repeted segments of « Nicolas Sarkozy » allows identifying polarities of opinion in paraphrases, providing clues for determining how the NE is perceived.contextuallydependant { negative {22/07/2011 E. MacMurray & M. Leenhardt ICAI’11 Workshop on Intelligent Linguistic Technologies 9
  10. 10. Textometry and Web Mining: results? Figure - Monthly variation of specificness for paraphrases for the NE « Nicolas Sarkozy ».22/07/2011 E. MacMurray & M. Leenhardt ICAI’11 Workshop on Intelligent Linguistic Technologies 10
  11. 11. Textometry and Web Mining: results?As a current event is discussed in the media, the lexical network produced by the co-occurrence calculation will be greater during an event than during periods of calmor low activity of the NE ( « buzz effect »)22/07/2011 E. MacMurray & M. Leenhardt ICAI’11 Workshop on Intelligent Linguistic Technologies 11
  12. 12. Textometry and Web Mining: results?22/07/2011 E. MacMurray & M. Leenhardt ICAI’11 Workshop on Intelligent Linguistic Technologies 12
  13. 13. Textometry and Web Mining: results?22/07/2011 E. MacMurray & M. Leenhardt ICAI’11 Workshop on Intelligent Linguistic Technologies 13
  14. 14. Conclusion• Two intelligence use-cases on Le Monde and The New York Times• Two complementary approaches : specificness and co-occurrence analysis• Three main contributions : – Building corpus-driven linguistic ressources (time and cost-cutting) – Identifying trends with specificness calculation – Targeting zones of activity or events through co-occurrence networks• In sum, this method : – Help derive knowledge from corpora without predefined information models – Provides adequate functions enabling interaction between the expertise of the user and processing tools22/07/2011 E. MacMurray & M. Leenhardt ICAI’11 Workshop on Intelligent Linguistic Technologies 14
  15. 15. ReferencesBloom K., Stein S. & Argamon S., Appraisal extraction for news opinion analysis at NTCIR-6, Proceedings of NTCIR-6, 2007, p 279-289.Bollier, D. The Promise and Peril of Big Data. Washington, DC : The Aspen Institute, 2010.Delanoë, A. 2010. Statistique textuelle et series chronologiques sur un corpus de presse écrite. Le cas de la mise en application du principe de précaution. Proceedings, JADT’2010.Delaplace R., Leenhardt M. & Wu L-C., Méthode de conception d’une application de veille et d’Analyse Linguistique Assistée par Ordinateur, VSST Conference, Toulouse, France, 2010.Fayyard, U.M, Piatesky, G., Smyth, P. & Uthurusamy, R. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.Feldman R. & Sanger J., The Text Mining Handbook : Advanced Approaches in Analyzing Unstructured Data, Cambrigde University Press, 2006, 422 p. Firth, J.R. A Synopsis of Linguistic Theory 1930-1955, Linguistic Analysis Philological Society, Oxford, 1957.Grishman, R. & Sundheim, B. Message Understanding Conference- 6 : A Brief History. Proceedings of the 16th International Conference on Computational Linguistics (COLING), I. Kopenhagen, 1996 p.466–471,.Kodratoff, Y. Knowledge discovery in texts: A definition and applications, Proceedings of the International Symposium on Methodologies for Intelligent Systems, 1999, volume LNAI 1609, p. 16–29.Lebart, L. & Salem, A. Statistique textuelle. Paris, Dunod, 1994. Lent, B., Agrawal, R., & Srikant, R. Discovering trends in text databases, Proceedings KDD’1997, AAAI Press, 14–17 p. 227–230.MacMurray E. & Shen L., Textual Statistics and Information Discovery: Using Co-occurrences to Detect Events, VSST Conference, Toulouse, France, 2010.Martin J.R. & White P.R.R., The language of evaluation: appraisal in English, Palgrave, London, 2005.Martinez, W. Contribution à une méthodologie de l’analyse des cooccurrences lexicales multiples dans les corpus textuels, Thèse pour le doctorat en Sciences du Langage, Université de la Sorbonne nouvelle - Paris 3, 2003.Née, E. Insécurité et élections presidentielles dans le journal Le Monde, Lexicometrica numéro thématique « Explorations Textuelles », S. Fleury, A. Salem. 2008Poibeau T. Extraction automatique d’information. Du texte brut au web sémantique. Paris : Hermès Sciences, 2003.Poibeau, T. Sur le statut référentiel des entités nommées, Proceedings TALN’05. Dourdan, France, 2005.Salem A., Introduction à la résonance textuelle, In Actes des JADT 2004 (7 èmes Journées internationales d’Analyse Statistique des Données Textuelles), 2004, p 986-992.Sandhaus, E. The New York Times Annotated Corpus. Philadelphia: Linguistic Data Consortium, 2008.Tufféry, S. Data mining et statistique décisionnelle: lintelligence des données. Paris : Editions Technip, 2007.Wright, K. Using Open Source Common Sense Reasoning Tools in Text Mining Research, the International Journal of Applied Management and Technology, 2006 vol 4 n°2 p.349-387.22/07/2011 E. MacMurray & M. Leenhardt ICAI’11 Workshop on Intelligent Linguistic Technologies 15

×