This document presents a new approach called textometry and information discovery for mining textual data on the web. It discusses using textometry and web mining to improve linguistic models by analyzing the semantic complexity of named entities and identifying paraphrases. The approach involves selecting documents based on named entities, performing textometric analysis on the corpus, and interpreting quantitative results to form qualitative insights. Examples are provided of analyzing news articles about people and companies to identify trends, events, and sentiment about entities over time based on variations in specificity and co-occurrence networks. The approach aims to derive knowledge from corpora without predefined models and provide interactive functions between user expertise and tools.
Textometry and Information Discovery : A New Approach to Mining Textual Data on the Web
1. Textometry and Information Discovery: A New Approach to Mining Textual Data on the Web
Erin MacMurray*, Marguerite Leenhardt**
SYLED/CLA2T EA2290, UFR ILPGA, Université Sorbonne Nouvelle Paris 3
*erin.macmurray@gmail.com
**marguerite.leenhardt@gmail.com
ICAI’11 Workshop on Intelligent Linguistic Technologies
2. In a nutshell
• Introduction & background
• Textometry and Web Mining: why?
• Textometry and Web Mining: how?
• Textometry and Web Mining: application?
• Conclusion
22/07/2011 E. MacMurray & M. Leenhardt ICAI’11 Workshop on Intelligent Linguistic Technologies 2
3. Introduction & background
Structure?
Seth Grimes sees « three categories of data: (i) Quantities, whether measured, observed, or computed (ii) Content, which I’ll characterize as non-quantitative information (iii) Metadata describing quantities and content. Structured/unstructured is a false dichotomy. »
(July 2011 – IKS Semantic Workshop, France)
Man versus machine?
Neil Glassman: « between those on one side who feel the accuracy of automated [content analysis] is sufficient and those on the other side who feel we can only rely on human analysis […] most in the field concur with the idea that we need to define a methodology where the software and the analyst collaborate to get over the noise and deliver accurate analysis. »
(May 2011 – Sentiment Analysis Symposium review)
4. Textometry and Web Mining: why ?
• Improving Linguistic Models
– Semantic complexity of simple units such as named entities (NE)
– Identifying paraphrases of NE
NE: a heterogeneous category
INTEL, Paris, Gone with the wind, Harry Potter, JIF Peanut Butter, lyzozym, The 4th of July, 20GB, Dulles International Airport, Le Tour de France, www.nytimes.com
Ehrmann M. (2008) Les EN de la linguistique au TAL : statut théorique et méthodes de désambiguïsation.
Paraphrases of a single NE
Le président de la République, Sarko, Nicolas Sarkozy, Sarkoland, M. Sarkozy, Sarkozyste, Mr Sarkozy, Sarkozysme
5. Textometry and Web Mining: why ?
• Text is considered to have its own internal structure
• Statistical and probabilistic calculations are applied directly to the textual units of comparable texts in a corpus
6. Textometry and Web Mining: how?
Specificness of forms (hypergeometric distribution)
July 4th 2011
Form    Specificness
b       23.43
b       12.68
b       5.57
b       5.66
July 5th 2011
Form    Specificness
d       13.73
d       21.86
d       7.75
d       6.55
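The slide names the hypergeometric distribution but does not give the formula behind the scores. The sketch below shows one way such a score can be obtained, assuming a convention found in textometry tools: the signed negative base-10 logarithm of the tail probability of the observed frequency. The function names and the parameters `T` (corpus size), `F` (frequency of the form in the corpus), `t` (size of the part) and `f` (frequency of the form in the part) are illustrative assumptions, not the authors' implementation.

```python
from math import comb, inf, log10


def hypergeom_pmf(k, T, F, t):
    """Probability of exactly k occurrences of a form in a part of t tokens,
    when the form occurs F times in a corpus of T tokens."""
    return comb(F, k) * comb(T - F, t - k) / comb(T, t)


def specificity(f, T, F, t):
    """Signed score (assumed convention): positive when the form is
    over-represented in the part, negative when it is under-represented."""
    lowest = max(0, t - (T - F))   # smallest count the part can contain
    highest = min(F, t)            # largest count the part can contain
    expected = F * t / T           # mean of the hypergeometric distribution
    if f >= expected:
        # Upper tail: probability of a count at least as high as observed.
        p = sum(hypergeom_pmf(k, T, F, t) for k in range(f, highest + 1))
        sign = 1
    else:
        # Lower tail: probability of a count at least as low as observed.
        p = sum(hypergeom_pmf(k, T, F, t) for k in range(lowest, f + 1))
        sign = -1
    p = min(p, 1.0)                # guard against floating-point overshoot
    return sign * (-log10(p)) if p > 0 else sign * inf


# Hypothetical figures: a form seen 20 times in a 1,000-token corpus,
# 10 of them in a 100-token part (for example, one day of articles).
score = specificity(f=10, T=1000, F=20, t=100)
```

Under this convention, a larger positive score means the form is more characteristic of the part than chance alone would suggest.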
7. Textometry and Web Mining: how?
Co-occurrence: two or more words that appear together within a predetermined span of text; lexical relationships around a pivot-form (William Martinez, 2003)
Result: network of associative relationships
---A---C---B---D.
---B---C---H---E.
---B---C---A---E.
---E---B---D---F.
---C---A---D---H.
---F---C---B---D.
---E---B---D---A.
[Figure: network of associative relationships between the forms A, B, C and E]
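The counting step behind such a network can be sketched as follows, using the seven toy sentences from this slide. The sentence is taken as the span, and the helper names `cooccurrents` and `network_edges` and the `min_count` threshold are assumptions made for illustration; the slides do not specify how edges are filtered.

```python
from collections import Counter

# Toy corpus from the slide: each sentence reduced to its forms.
SENTENCES = [
    ["A", "C", "B", "D"],
    ["B", "C", "H", "E"],
    ["B", "C", "A", "E"],
    ["E", "B", "D", "F"],
    ["C", "A", "D", "H"],
    ["F", "C", "B", "D"],
    ["E", "B", "D", "A"],
]


def cooccurrents(sentences, pivot):
    """Count, for each form, the sentences it shares with the pivot-form."""
    counts = Counter()
    for sentence in sentences:
        forms = set(sentence)
        if pivot in forms:
            counts.update(forms - {pivot})
    return counts


def network_edges(sentences, pivot, min_count=3):
    """Keep only pivot-form links seen in at least min_count sentences
    (the threshold is a hypothetical filter, not taken from the slides)."""
    counts = cooccurrents(sentences, pivot)
    return {(pivot, form): n for form, n in counts.items() if n >= min_count}


edges = network_edges(SENTENCES, "C")
```

Repeating the same count with each retained co-occurrent as a new pivot is one way to grow the network beyond its first level.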
8. Textometry and Web Mining : how?
1/ POINT OF ENTRY
NE (companies and people)
Company NE = Xerox
People NE = Nicolas Sarkozy
Article selection
2/ CORPUS
160 articles: 184,761 occurrences / 13,075 forms / 5,194 hapax
103 articles: 197,341 occurrences / 17,807 forms / 9,416 hapax
3/ TEXTOMETRIC ANALYSIS
Hypergeometric distribution
Specificness
Cooccurrences
4/ INTERPRETATION OF RESULTS
Quantitative information to formulate qualitative interpretations.
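The first two steps (article selection by NE, then basic corpus counts) can be sketched as below. The tokenisation rule (lowercased `\w+` matches), the case-insensitive substring matching and the names `select_articles` and `corpus_stats` are assumptions for illustration; the slides do not describe how the authors tokenised or selected their articles.

```python
import re
from collections import Counter


def select_articles(articles, ne_variants):
    """Keep the articles that mention at least one variant of the NE."""
    variants = [v.lower() for v in ne_variants]
    return [a for a in articles if any(v in a.lower() for v in variants)]


def corpus_stats(texts):
    """Occurrences (tokens), forms (distinct tokens) and hapax
    (forms that occur exactly once) for a list of texts."""
    tokens = [tok for text in texts for tok in re.findall(r"\w+", text.lower())]
    freq = Counter(tokens)
    return {
        "occurrences": len(tokens),
        "forms": len(freq),
        "hapax": sum(1 for n in freq.values() if n == 1),
    }


# Hypothetical miniature corpus.
ARTICLES = [
    "Xerox announces results. Xerox shares rise.",
    "Nicolas Sarkozy visits Berlin.",
    "Sarko speaks while markets wait.",
]
selected = select_articles(ARTICLES, ["Nicolas Sarkozy", "Sarko"])
stats = corpus_stats(selected)
```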
9. Textometry and Web Mining: results?
Observing the forms and repeated segments of « Nicolas Sarkozy » makes it possible to identify polarities of opinion in paraphrases, providing clues for determining how the NE is perceived.
[Figure: paraphrases grouped as contextually dependent and as negative]
10. Textometry and Web Mining: results?
Figure - Monthly variation of specificness for paraphrases for the NE « Nicolas Sarkozy ».
11. Textometry and Web Mining: results?
When a current event is discussed in the media, the lexical network produced by the co-occurrence calculation is larger than during periods of calm or low activity of the NE (the « buzz effect »).
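One way to observe this effect is to measure the size of the pivot's network per period and flag the periods that stand out. In the sketch below the data are hypothetical, the network size is simply the number of distinct co-occurrents of the pivot, and the rule "at least twice the median size" is an assumed heuristic; the slides do not define a threshold.

```python
from collections import defaultdict
from statistics import median


def network_size_by_period(dated_sentences, pivot):
    """Number of distinct forms co-occurring with the pivot in each period."""
    neighbours = defaultdict(set)
    for period, forms in dated_sentences:
        bucket = neighbours[period]  # creates an empty set for quiet periods too
        if pivot in forms:
            bucket.update(form for form in forms if form != pivot)
    return {period: len(forms) for period, forms in neighbours.items()}


def buzz_periods(sizes, factor=2.0):
    """Periods whose network is at least `factor` times the median size
    (assumed heuristic for illustration)."""
    baseline = median(sizes.values())
    return [p for p, n in sizes.items() if n > baseline and n >= factor * baseline]


# Hypothetical dated sentences, each reduced to its forms; "NE" is the pivot.
DATA = [
    ("2011-05", ["NE", "visit"]),
    ("2011-06", ["NE", "summit", "debt", "crisis"]),
    ("2011-06", ["NE", "crisis", "vote", "reform"]),
    ("2011-06", ["markets", "fall"]),
    ("2011-07", ["NE", "holiday"]),
]
sizes = network_size_by_period(DATA, "NE")
peaks = buzz_periods(sizes)
```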
12. Textometry and Web Mining: results?
13. Textometry and Web Mining: results?
14. Conclusion
• Two intelligence use-cases on Le Monde and The New York Times
• Two complementary approaches : specificness and co-occurrence analysis
• Three main contributions :
– Building corpus-driven linguistic resources (time and cost-cutting)
– Identifying trends with specificness calculation
– Targeting zones of activity or events through co-occurrence networks
• In sum, this method:
– Helps derive knowledge from corpora without predefined information models
– Provides adequate functions enabling interaction between the expertise of the user and processing tools
15. References
Bloom K., Stein S. & Argamon S., Appraisal extraction for news opinion analysis at NTCIR-6, Proceedings of NTCIR-6, 2007, p. 279-289.
Bollier, D. The Promise and Peril of Big Data. Washington, DC: The Aspen Institute, 2010.
Delanoë, A. Statistique textuelle et séries chronologiques sur un corpus de presse écrite. Le cas de la mise en application du principe de précaution. Proceedings, JADT’2010, 2010.
Delaplace R., Leenhardt M. & Wu L-C., Méthode de conception d’une application de veille et d’Analyse Linguistique Assistée par Ordinateur, VSST Conference, Toulouse, France, 2010.
Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. & Uthurusamy, R. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
Feldman R. & Sanger J., The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006, 422 p.
Firth, J.R. A Synopsis of Linguistic Theory 1930-1955, Linguistic Analysis Philological Society, Oxford, 1957.
Grishman, R. & Sundheim, B. Message Understanding Conference-6: A Brief History. Proceedings of the 16th International Conference on Computational Linguistics (COLING), I, Copenhagen, 1996, p. 466–471.
Kodratoff, Y. Knowledge discovery in texts: A definition and applications, Proceedings of the International Symposium on Methodologies for Intelligent Systems, 1999, volume LNAI 1609, p. 16–29.
Lebart, L. & Salem, A. Statistique textuelle. Paris, Dunod, 1994.
Lent, B., Agrawal, R., & Srikant, R. Discovering trends in text databases, Proceedings KDD’1997, AAAI Press, 14–17 p. 227–230.
MacMurray E. & Shen L., Textual Statistics and Information Discovery: Using Co-occurrences to Detect Events, VSST Conference, Toulouse, France, 2010.
Martin J.R. & White P.R.R., The language of evaluation: appraisal in English, Palgrave, London, 2005.
Martinez, W. Contribution à une méthodologie de l’analyse des cooccurrences lexicales multiples dans les corpus textuels, Thèse pour le doctorat en Sciences du Langage, Université de la Sorbonne nouvelle - Paris 3, 2003.
Née, E. Insécurité et élections présidentielles dans le journal Le Monde, Lexicometrica numéro thématique « Explorations Textuelles », S. Fleury, A. Salem. 2008.
Poibeau T. Extraction automatique d’information. Du texte brut au web sémantique. Paris : Hermès Sciences, 2003.
Poibeau, T. Sur le statut référentiel des entités nommées, Proceedings TALN’05. Dourdan, France, 2005.
Salem A., Introduction à la résonance textuelle, In Actes des JADT 2004 (7èmes Journées internationales d’Analyse Statistique des Données Textuelles), 2004, p. 986-992.
Sandhaus, E. The New York Times Annotated Corpus. Philadelphia: Linguistic Data Consortium, 2008.
Tufféry, S. Data mining et statistique décisionnelle: l'intelligence des données. Paris : Editions Technip, 2007.
Wright, K. Using Open Source Common Sense Reasoning Tools in Text Mining Research, the International Journal of Applied Management and Technology, 2006, vol. 4 n°2, p. 349-387.