Slides for the class, From Pattern Matching to Knowledge Discovery Using Text Mining and Visualization Techniques, presented June 13, 2010, at the Special Libraries Association 2010 annual meeting.
Text Mining and Visualization
1. From Pattern Matching to Knowledge Discovery Using Text Mining and Visualization Techniques
Seth Grimes
Alta Plana Corporation
@sethgrimes – 301-270-0795 -- http://altaplana.com
Special Libraries Association 2010
June 13, 2010
2. Introduction
Seth Grimes –
Principal Consultant, Alta Plana Corporation.
Contributing Editor, IntelligentEnterprise.com.
Channel Expert, BeyeNETWORK.com.
Instructor, The Data Warehousing Institute, tdwi.org.
Founding Chair, Sentiment Analysis Symposium.
Founding Chair, Text Analytics Summit.
2010 Special Libraries Association
3. Perspectives
Perspective #1: You support research or work in IT.
You help end users who have lots of text.
Perspective #2: You’re a researcher, business analyst, or other “end user.”
You have lots of text. You want an automated way to deal with it.
Perspective #3: You work for a solution provider.
Perspective #4: Other?
--------------------
Perspective A: Your focus is Information Retrieval.
Perspective B: Your focus is Data Analysis.
4. Agenda
1. The “Unstructured” Data Challenge.
2. Text analytics for information retrieval and BI.
3. Text analysis technologies and processes.
4. Applications.
5. Software and tools.
6. Text visualization for exploratory analysis.
Note:
I will not cover the agenda in a linear fashion. Text mining and viz are intermixed.
Class coverage is for both information analysts and end users.
Text analytics ≈ text mining ≈ text data mining.
5. Value in Text
It’s a truism that 80% of enterprise-relevant information originates in “unstructured” form:
E-mail and messages.
Web pages, news & blog articles, forum postings, and other social media.
Contact-center notes and transcripts.
Surveys, feedback forms, warranty claims.
Scientific literature, books, legal documents.
...
Non-text “unstructured” forms?
http://upload.wikimedia.org/wikipedia/commons/thumb/9/90/LOC_Brooklyn_Bridge_and_East_River_3.png/753px-LOC_Brooklyn_Bridge_and_East_River_3.png
6. Unstructured Sources
These sources may contain “traditional” data.
The Dow fell 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85.
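As a rough illustration of recovering "traditional" data from a sentence like the one above, here is a minimal Python sketch. The regular expression and the name extract_moves are assumptions made for this example. The pattern covers only the "fell/rose X, or Y percent, to Z" phrasing and is not a general financial-news parser.

```python
import re

# Sentence quoted on the slide.
TEXT = (
    "The Dow fell 46.58, or 0.42 percent, to 11,002.14. "
    "The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85."
)

# Hypothetical pattern for one phrasing only:
# "The <index> fell|rose <points>, or <pct> percent, to <close>"
PATTERN = re.compile(
    r"The (?P<index>.+?) (?P<verb>fell|rose) (?P<points>[\d,]+\.\d+), "
    r"or (?P<percent>\d+\.\d+) percent, to (?P<close>[\d,]+\.\d+)"
)


def extract_moves(text):
    """Return one record per index movement found in the text."""
    rows = []
    for m in PATTERN.finditer(text):
        # "fell" means a negative change, "rose" a positive one.
        sign = -1.0 if m.group("verb") == "fell" else 1.0
        rows.append({
            "index": m.group("index"),
            "change": sign * float(m.group("points").replace(",", "")),
            "percent": sign * float(m.group("percent")),
            "close": float(m.group("close").replace(",", "")),
        })
    return rows


rows = extract_moves(TEXT)
for row in rows:
    print(row)
```

Each record could then be loaded into a table alongside conventional numeric data. Any sentence worded differently would be missed, which is one reason pattern matching alone does not go far.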
And they may not.
Axin and Frat1 interact with dvl and GSK, bridging Dvl to GSK in Wnt-mediated regulation of LEF-1.
Wnt proteins transduce their signals through dishevelled (Dvl) proteins to inhibit glycogen synthase kinase 3beta (GSK), leading to the accumulation of cytosolic beta-catenin and activation of TCF/LEF-1 transcription factors. To understand the mechanism by which Dvl acts through GSK to regulate LEF-1, we investigated the roles of Axin and Frat1 in Wnt-mediated activation of LEF-1 in mammalian cells. We found that Dvl interacts with Axin and with Frat1, both of which interact with GSK. Similarly, the Frat1 homolog GBP binds Xenopus Dishevelled in an interaction that requires GSK.
www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed&cmd=Retrieve&list_uids=10428961&dopt=Abstract
www.stanford.edu/%7ernusse/wntwindow.html
7. Unstructured Sources
Sources may mix fact and sentiment –
When you walk in the foyer of the hotel it seems quite inviting but the room was very basis and smelt very badly of stale cigarette smoke, it would have been nice to be asked if we wanted a non smoking room, I know the room was very cheap but I found this very off putting to have to sleep with the smell, and it was to cold to leave the window open.
Overall I would never sell/buy a Motorola V3 unless it is demanded. My life would be way better without this phone being around (I am being 100% serious) Motorola should pay me directly for all the problems I have had with these phones. :-(
– and contain information that isn’t either…
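To make the fact/sentiment mix concrete, here is a minimal word-list scoring sketch in Python. The two tiny lexicons and the function sentiment_counts are hypothetical and hand-picked for this illustration. Real sentiment analysis needs far more than word counting, as the second call shows.

```python
import re

# Hypothetical mini-lexicons, chosen by hand for illustration only.
POSITIVE = {"inviting", "nice", "better"}
NEGATIVE = {"badly", "stale", "cold", "problems"}


def sentiment_counts(text):
    """Count lexicon hits in the text; returns (positive, negative)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos, neg


# Opening of the hotel review quoted on the slide (left verbatim).
hotel = (
    "When you walk in the foyer of the hotel it seems quite inviting "
    "but the room was very basis and smelt very badly of stale "
    "cigarette smoke"
)
print(sentiment_counts(hotel))

# A clearly negative remark that naive counting scores as positive.
phone = "My life would be way better without this phone being around"
print(sentiment_counts(phone))
```

The phone remark contains only the "positive" word "better", so counting alone gets it wrong. Handling negation, counterfactuals ("it would have been nice to be asked"), and context is where linguistic analysis comes in.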
8. Unstructured Sources
Neither fact nor fiction, but definitely narrative:
9. Text Mining and Visualization 9
Unstructured Sources
Sources contain/provide metadata:
10. Text Mining and Visualization 10
Unstructured Sources
Sources may intermix content and “noise”:
11. Text Mining and Visualization 11
Unstructured Sources
“Unstructured” materials likely contain structure:
12. Text Mining and Visualization 12
The “Unstructured” Data Challenge
The task:
1. Find the right information.
2. Use inherent structure and latent semantics to –
Infer meaning.
Discern information content.
Structure content for machine use.
3. Apply analytical methods to generate insight.
4. Present findings.
13. Text Mining and Visualization 13
From the Analytics/Business Perspective
1. If you are not analyzing text – if you're analyzing only
transactional information – you're missing opportunity
or incurring risk.
2. Text analytics can boost business results –
“Organizations embracing text analytics all report having an
epiphany moment when they suddenly knew more than
before.”
-- Philip Russom, the Data Warehousing Institute
– via established BI / data-mining programs, or
independently.
14. Text Mining and Visualization 14
From the Analytics/Business Perspective
Some folks may need to expand their views of what BI and
business analytics are about.
Others can do text analytics without worrying about BI or
data mining.
Let’s deal with text-BI first...
“The bulk of information value is perceived as coming from data
in relational tables. The reason is that data that is structured is
easy to mine and analyze.”
-- Prabhakar Raghavan, Yahoo Research
15. Text Mining and Visualization 15
Business Intelligence
Conventional BI feeds off:
"SUMLEV","STATE","COUNTY","STNAME","CTYNAME","YEAR","POPESTIMATE",
50,19,1,"Iowa","Adair County",1,8243,4036,4207,446,225,221,994,509
50,19,1,"Iowa","Adair County",2,8243,4036,4207,446,225,221,994,509
50,19,1,"Iowa","Adair County",3,8212,4020,4192,442,222,220,987,505
50,19,1,"Iowa","Adair County",4,8095,3967,4128,432,208,224,935,488
50,19,1,"Iowa","Adair County",5,8003,3924,4079,405,186,219,928,495
50,19,1,"Iowa","Adair County",6,7961,3892,4069,384,183,201,907,472
50,19,1,"Iowa","Adair County",7,7875,3855,4020,366,179,187,871,454
50,19,1,"Iowa","Adair County",8,7795,3817,3978,343,162,181,841,439
50,19,1,"Iowa","Adair County",9,7714,3777,3937,338,159,179,805,417
It runs off:
16. Text Mining and Visualization 16
Business Intelligence
Conventional BI produces:
17. Text Mining and Visualization 17
Text-BI: Back to the Future
Note that business intelligence (BI) was first defined in 1958:
“In this paper, business is a collection of activities carried on for
whatever purpose, be it science, technology, commerce,
industry, law, government, defense, et cetera... The notion of
intelligence is also defined here... as ‘the ability to apprehend
the interrelationships of presented facts in such a way as to
guide action towards a desired goal.’”
-- Hans Peter Luhn
“A Business Intelligence System”
IBM Journal, October 1958
What was IT like in the ‘50s?
18.
Document input and processing
Knowledge handling is key
Desk Set (1957): Computer engineer Richard Sumner (Spencer Tracy)
and television network librarian Bunny Watson (Katharine Hepburn)
and the "electronic brain" EMERAC.
19. Text Mining and Visualization 19
From the Information Retrieval Perspective
What do people do with electronic documents?
1. Publish, Manage, and Archive.
2. Index and Search.
3. Categorize and Classify according to metadata &
contents.
4. Information Extraction.
For textual documents, text analytics enhances #1 & #2 and
enables #3 & #4.
You need linguistics to do #1 & #4 well, to deal with meaning
(a.k.a. semantics).
Search is not enough...
20. Text Mining and Visualization 20
From the Information Retrieval Perspective
Keyword search is just a start, the dumbest form of pattern
matching.
It doesn’t help you discover things you’re unaware of.
Results often lack relevance.
Basic search finds documents, not knowledge.
Articles from a forum site
Articles from 1987
21. Text Mining and Visualization 21
Semantics
Text analytics adds semantic understanding of –
Named entities: people, companies, places, etc.
Pattern-based entities: e-mail addresses, phone numbers, etc.
Concepts: abstractions of entities.
Facts and relationships.
Concrete and abstract attributes (e.g., 10-year, expensive,
comfortable).
Subjectivity in the forms of opinions, sentiments, and
emotions: attitudinal data.
Call these elements, collectively, features.
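Pattern-based entities are the simplest feature class to illustrate. A minimal sketch, assuming deliberately simplified regular expressions and a made-up sample sentence (neither comes from the slides; production-grade patterns for e-mail addresses and phone numbers are considerably more involved):

```python
import re

# Deliberately simplified patterns, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
US_PHONE = re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]\d{4}")

def pattern_entities(text):
    """Return (type, matched text) pairs for each pattern hit."""
    found = [("email", m) for m in EMAIL.findall(text)]
    found += [("phone", m) for m in US_PHONE.findall(text)]
    return found

# Hypothetical sample text.
sample = "Write to info@example.com or call 301-555-0123."
print(pattern_entities(sample))
```

Named entities, concepts, and sentiment generally need lexicons, linguistic rules, or trained models rather than fixed patterns like these.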
22. Text Mining and Visualization 22
Semantics, Analytics, and IR
In a sense, text analytics, by generating semantics, bridges
search and BI to turn Information Retrieval into
Information Access.
[Diagram labels: Information Access; Search; BI; Text analytics; Semantic search; Integrated analytics]
23. Text Mining and Visualization 23
Information Access
Text analytics transforms Information Retrieval (IR) into
Information Access (IA).
• Search terms become queries.
• Retrieved material is mined for larger-scale structure.
• Retrieved material is mined for features such as entities and
topics or themes.
• Retrieved material is mined for smaller-scale structure such
as facts and relationships.
• Results are presented intelligently, for instance, grouping
on mined topics-themes.
• Extracted information may be visualized and explored.
24. Text Mining and Visualization 24
Information Access
Text analytics enables results that suit the information and
the user, e.g., answers –
25. Text Mining and Visualization 25
Intelligent Results Presentation
E.g., results clustering on extracted topics:
http://touchgraph.com/TGGoogleBrowser.php?start=text%20analytics
26. Text Mining and Visualization 26
Text Data Mining
Data Mining = Knowledge Discovery in Data.
Text Mining = Data Mining of textual sources.
Clustering and Classification.
Link Analysis.
Association Rules.
Predictive Modelling.
Regression.
Forecasting.
27. Text Mining and Visualization 27
Text Data Mining Enables Content Exploration
28. Text Mining and Visualization 28
Text Analytics Definition
Text analytics automates what researchers, writers,
scholars, and all the rest of us have been doing for years.
Text analytics –
Applies linguistic and/or statistical techniques to extract
concepts and patterns that can be applied to categorize and
classify documents, audio, video, images.
Transforms “unstructured” information into data for
application of traditional analysis techniques.
Discerns meaning and relationships in large volumes of
information that were previously unprocessable by
computer.
29. Text Mining and Visualization 29
Glossary: Information in Text
Information Extraction (IE) involves pulling features –
entities & their attributes, facts, relationships, etc. – out
of textual sources.
Entity: Typically a name (person, place, organization, etc.) or
a patterned composite (phone number, e-mail address).
Concept: An abstract entity or collection of entities.
Co-reference: Multiple expressions that describe the same
thing. Anaphora including pronoun use is an example:
John pushed Max. He fell.
John pushed Max. He laughed.
-- Laure Vieu and Patrick Saint-Dizier
Feature: An element of interest, e.g., an entity, concept,
topic, event, etc.
30. Text Mining and Visualization 30
Glossary: Information in Text
Fact: A relationship between two entities or an entity and an
attribute.
Sentiment: A valuation at the entity or higher level.
Polarity/valence/tone and intensity are sentiment
attributes.
Opinion: A statement that involves a sentiment. Opinion
holder is typically different from the opinion object.
Semantics: Meaning, typically contextually dependent and
hinted at by...
Syntax: The arrangement of words and terms.
31. Text Mining and Visualization 31
Glossary: Methods
Natural Language Processing (NLP): Computers hear humans.
Parsing: Evaluating the content of a document or text.
Tokenization: Identification of distinct elements, e.g.,
words, punctuation marks, n-grams.
Stemming/Lemmatization: Reducing word variants
(conjugation, declension, case, pluralization) to bases.
Term reduction: Use of synonyms, taxonomy, similarity
measures to group like terms.
Tagging: Wrapping XML tags around distinct features, a.k.a.
text augmentation. May involve text enrichment.
POS Tagging: Specifically identifying parts of speech.
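The first few methods in this glossary can be sketched in a handful of lines. This is an illustration under stated assumptions: the regular-expression tokenizer and the suffix-stripping `crude_stem` function are toy stand-ins invented here, not a real stemmer or lemmatizer.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def crude_stem(word):
    """Toy suffix stripper; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

tokens = tokenize("John pushed Max. He laughed.")
print(tokens)
print(ngrams(tokens, 2)[:2])
print([crude_stem(t.lower()) for t in tokens if t.isalpha()])
```

Toolkits such as GATE or NLTK provide far more robust versions of each step, along with the POS tagging that this sketch leaves out.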
32. Text Mining and Visualization 32
Glossary: Organizing and Structuring
Categorization: Specification of feature groupings.
Clustering: Creating categories according to outcome-
similarity criteria.
Taxonomy: An exhaustive, hierarchical categorization of
entities and concepts, either specified or generated by
clustering.
Classification: Assigning an item to a category, perhaps
using a taxonomy.
Ontology: In practice, a classification of a set of items in a
way that represents knowledge.
An oak is a tree. A rose is a flower. A deer is an animal. A sparrow is a
bird. Russia is our fatherland. Death is inevitable.
-- P. Smirnovskii, A Textbook of Russian Grammar
33. Text Mining and Visualization 33
Glossary: Evaluation
Precision: The proportion of decisions made (e.g., items assigned
to a class) that are correct.
Recall: The proportion of all the correct decisions that could have
been made (e.g., all true members of a class) that were
actually made.
Find the even numbers:
9 17 12 4 1 6 2 20 7 3 8 10
What is my Precision? What is my Recall?
Accuracy: How well an IE or IR task has been performed,
computed as an F-score weighting Precision & Recall,
typically:
f = 2*(precision * recall) / (precision + recall)
Relevance: Do results match the individual user’s needs?
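A worked version of the even-numbers exercise, assuming a hypothetical answer set (`selected` below is invented for illustration; the slide does not say which numbers were picked):

```python
numbers = [9, 17, 12, 4, 1, 6, 2, 20, 7, 3, 8, 10]
relevant = {n for n in numbers if n % 2 == 0}   # the true even numbers
selected = {12, 4, 6, 2, 7}                     # hypothetical answer with one mistake

true_positives = len(selected & relevant)
precision = true_positives / len(selected)      # correct picks / all picks
recall = true_positives / len(relevant)         # correct picks / all even numbers
f_score = 2 * (precision * recall) / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f={f_score:.2f}")
```

With these picks, 4 of the 5 selections are even (precision 0.80) and 4 of the 7 even numbers were found (recall about 0.57), giving an F-score of about 0.67.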
34. Text Mining and Visualization 34
Text Analytics Pipeline
Typical steps in text analytics include –
1. Identify and retrieve documents for analysis.
2. Apply statistical and/or linguistic and/or structural techniques to
discern, tag, and extract entities, concepts, relationships,
and events (features) within document sets.
3. Apply statistical pattern-matching & similarity techniques
to classify documents and organize extracted features
according to a specified or generated categorization /
taxonomy.
– via a pipeline of statistical & linguistic steps.
Let’s look at them, at steps to model text...
35. Text Mining and Visualization 35
Modelling Text
Metadata.
E.g., title, author, date.
Statistics.
Typically via vector space methods.
E.g., term frequency, co-occurrence, proximity.
Linguistics.
Lexicons, gazetteers, phrase books.
Word morphology, parts of speech, syntactic rules.
Semantic networks.
Larger-scale structure including discourse.
Machine learning.
36.
37.
38. Text Mining and Visualization 38
http://wordle.net
“Statistical information derived from word frequency and distribution is
used by the machine to compute a relative measure of significance, first
for individual words and then for sentences. Sentences scoring highest in
significance are extracted and printed out to become the auto-abstract.”
-- H.P. Luhn, The Automatic Creation of Literature Abstracts, IBM Journal, 1958.
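The idea in the quotation can be sketched as frequency-based sentence scoring. This is a simplification, not Luhn's actual significance measure: the stop list, the scoring rule (sum of word frequencies), and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

# Small hypothetical stop list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for"}

def auto_abstract(text, n_sentences=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words_per_sentence = [
        [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    freq = Counter(w for words in words_per_sentence for w in words)
    # Sentence score = sum of the text-wide frequencies of its words.
    scores = [sum(freq[w] for w in words) for words in words_per_sentence]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(top[:n_sentences])]

text = ("Text mining finds patterns in text. "
        "The weather was pleasant. "
        "Mining text at scale needs text analytics tools.")
print(auto_abstract(text))
```

Summing frequencies favors long sentences; normalizing by sentence length is one common adjustment.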
39. Text Mining and Visualization 39
Modelling Text
The text content of a document can be considered an
unordered “bag of words.”
Particular documents are points in a high-dimensional vector
space.
Salton, Wong & Yang, “A Vector Space Model for Automatic Indexing,” November 1975.
40. Text Mining and Visualization 40
Modelling Text
We might construct a document-term matrix...
D1 = “I like databases”
D2 = “I hate hate databases”
      I    like    hate    databases
D1    1    1       0       1
D2    1    0       2       1
http://en.wikipedia.org/wiki/Term-document_matrix
and use a weighting such as TF-IDF (term frequency–inverse
document frequency)…
in computing the cosine of the angle between weighted
doc-vectors to determine similarity.
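The steps on this slide can be sketched as follows, using the two example documents. The IDF formula used here, log(N / document frequency), is one of several common variants and is an assumption; the slide does not specify one.

```python
import math
from collections import Counter

docs = {"D1": "I like databases", "D2": "I hate hate databases"}
counts = {name: Counter(text.split()) for name, text in docs.items()}
vocab = ["I", "like", "hate", "databases"]

# Document-term matrix of raw term frequencies, one row per document.
tf = {name: [c[t] for t in vocab] for name, c in counts.items()}

# Inverse document frequency: log(N / number of documents containing the term).
n_docs = len(docs)
idf = [math.log(n_docs / sum(1 for c in counts.values() if t in c)) for t in vocab]
tfidf = {name: [f * w for f, w in zip(row, idf)] for name, row in tf.items()}

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(tf["D1"], tf["D2"])
print(round(cosine(tf["D1"], tf["D2"]), 3))        # raw counts
print(round(cosine(tfidf["D1"], tfidf["D2"]), 3))  # TF-IDF weighted
```

On raw counts the two documents look fairly similar (cosine about 0.471). With this IDF variant, "I" and "databases" occur in every document and so receive zero weight, leaving only "like" versus "hate" and a cosine of 0.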
41. Text Mining and Visualization 41
Modelling Text
Analytical methods make text tractable.
Latent semantic indexing utilizing singular value
decomposition for term reduction / feature selection.
Creates a new, reduced concept space.
Takes care of synonymy, polysemy, stemming, etc.
Classification technologies / methods:
Naive Bayes.
Support Vector Machine.
K-nearest neighbor.
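As an illustration of the first classification method listed, here is a minimal multinomial Naive Bayes sketch with add-one smoothing. The training sentences and labels are hypothetical, and whitespace splitting stands in for real tokenization.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for c in word_counts.values() for w in c}
    return word_counts, class_counts, vocab

def classify(text, model):
    """Pick the class with the highest log posterior (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            if w in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data, not from the slides.
training = [
    ("great phone love it", "positive"),
    ("love the screen", "positive"),
    ("terrible battery hate it", "negative"),
    ("hate the stale smell", "negative"),
]
model = train(training)
print(classify("love this phone", model))
print(classify("hate the battery", model))
```

The same bag-of-words counts could feed a support vector machine or a k-nearest-neighbor classifier; only the decision rule changes.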
42. Text Mining and Visualization 42
Modelling Text
In the form of query-document similarity, this is Information
Retrieval 101.
See, for instance, Salton & Buckley, “Term-Weighting
Approaches in Automatic Text Retrieval,” 1988.
A useful basic tech paper: Russ Albright, SAS, “Taming Text
with the SVD,” 2004.
Given the complexity of human language, statistical models
may fall short.
“Reading from text in general is a hard problem, because it
involves all of common sense knowledge.”
-- Expert systems pioneer Edward A. Feigenbaum
43. Text Mining and Visualization 43
“Tri-grams” here are pretty good at describing the Whatness of the source text. Yet...
“This rather unsophisticated argument on ‘significance’
avoids such linguistic implications as grammar and syntax...
No attention is paid to the logical and semantic
relationships the author has established.”
-- Hans Peter Luhn, 1958
44. Text Mining and Visualization 44
New York Times, September 8, 1957
Anaphora / coreference: “They”
45. Text Mining and Visualization 45
Advanced Term Counting
Counting term hits, in one source, at the doc level, doesn’t
take you far...
Good or bad? What’s behind the posts?
46. Text Mining and Visualization 46
Why Do We Need Linguistics?
To get more out of text than can be delivered by a
bag/vector of words and term counting.
The Dow fell 46.58, or 0.42 percent, to 11,002.14. The Standard
& Poor's 500 index gained 1.44, or 0.11 percent, to 1,263.85.
The Dow gained 46.58, or 0.42 percent, to 11,002.14. The
Standard & Poor's 500 index fell 1.44, or 0.11 percent, to
1,263.85.
-- Luca Scagliarini, Expert System
Time flies like an arrow. Fruit flies like a banana.
-- Groucho Marx
(Statistical co-occurrence to build a model for analysis of
text such as these is possible but still limited.)
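A quick check of the market example: the two reports say opposite things, yet they reduce to the same bag of words. This sketch assumes plain whitespace tokenization.

```python
from collections import Counter

def bag_of_words(text):
    """Lower-cased token counts with word order discarded."""
    return Counter(text.lower().split())

a = ("The Dow fell 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's "
     "500 index gained 1.44, or 0.11 percent, to 1,263.85.")
b = ("The Dow gained 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's "
     "500 index fell 1.44, or 0.11 percent, to 1,263.85.")

print(bag_of_words(a) == bag_of_words(b))  # True: identical bags
print(a == b)                              # False: different statements
```

Any method that sees only term counts cannot tell these two reports apart; word order and grammatical roles carry the difference.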
47. Text Mining and Visualization 47
Parts of Speech
48. Text Mining and Visualization 48
Parts of Speech
49. Text Mining and Visualization 49
Parts of Speech
50. Text Mining and Visualization 50
From POS to Relationships
When we understand, for instance, parts of speech (POS),
e.g. –
<subject> <verb> <object>
– we’re in a position to discern facts and relationships...
Semantic networks such as WordNet are an asset for word-
sense disambiguation.
“WordNet is a large lexical database of English... Nouns, verbs,
adjectives and adverbs are grouped into sets of cognitive
synonyms (synsets), each expressing a distinct concept. Synsets
are interlinked by means of conceptual-semantic and lexical
relations…WordNet's structure makes it a useful tool for
computational linguistics and natural language processing.”
http://wordnet.princeton.edu/
51.
52. Text Mining and Visualization 52
Tagging with GATE
Annotation (tagging) in action via GATE, an open-source
tool:
53. Text Mining and Visualization 53
GATE Language Processing Pipeline
54. Text Mining and Visualization 54
GATE Text Annotation
56. Text Mining and Visualization 56
Information Extraction
For content analysis, key in on extracting information.
Text features are typically marked up (annotated) in-place
with XML.
Entities and concepts may correspond to dimensions in a
standard BI model.
Both classes of object are hierarchically organized and have
attributes.
We can have both discovered and predetermined classifications
(taxonomies) of text features.
Dimensional modelling facilitates extraction to databases...
57. Text Mining and Visualization 57
Database Insert
Illustrated via an IBM example:
“The standard features are stored in the STANDARD_KW table,
keywords with their occurrences in the KEYWORD_KW_OCC
table, and the text list features in the TEXTLIST_TEXT table.
Every feature table contains the DOC_ID as a reference to the
DOCUMENT table.”
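A rough sketch of the pattern the quotation describes, using SQLite for convenience. The table names DOCUMENT and KEYWORD_KW_OCC and the DOC_ID reference come from the quotation; every other column, and all of the data, are assumptions, and this is not IBM's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Table names follow the quotation; every column except DOC_ID is a guess.
conn.executescript("""
    CREATE TABLE DOCUMENT (DOC_ID INTEGER PRIMARY KEY, TITLE TEXT);
    CREATE TABLE KEYWORD_KW_OCC (
        DOC_ID INTEGER REFERENCES DOCUMENT(DOC_ID),
        KEYWORD TEXT,
        OCCURRENCES INTEGER
    );
""")

conn.execute("INSERT INTO DOCUMENT VALUES (1, 'Hypothetical hotel review')")
extracted = {"room": 2, "smoke": 1}  # made-up output of a keyword extractor
conn.executemany(
    "INSERT INTO KEYWORD_KW_OCC VALUES (1, ?, ?)", list(extracted.items())
)

rows = conn.execute("""
    SELECT d.TITLE, k.KEYWORD, k.OCCURRENCES
    FROM KEYWORD_KW_OCC AS k JOIN DOCUMENT AS d ON d.DOC_ID = k.DOC_ID
    ORDER BY k.OCCURRENCES DESC
""").fetchall()
print(rows)
```

Once features sit in tables keyed by DOC_ID, ordinary SQL and BI tools can join them to document metadata.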
58. Text Mining and Visualization 58
Sophisticated Pattern Matching
Lexicons and language rules boost accuracy. An example –
a bit complicated – GATE Extension via JAPE Rules...
/* locationcontext2.jape
* Dhavalkumar Thakker, Nottingham Trent University/PA Photos 15 Sept 2008*/
Phase: locationcontext2
Input: Lookup Token
Options: control = all debug = false
//Manchester, UK
Rule: locationcontext2
Priority:50
({Token.string == "at"})
( ({Token.string =~ "[Tt]he"})?
(
(
{Token.kind == word, Token.category == NNP, Token.orth == upperInitial}
({Token.kind == punctuation})?
{Token.kind == word, Token.category == NNP, Token.orth == upperInitial}
({Token.kind == punctuation})?
{Token.kind == word, Token.category == NNP, Token.orth == upperInitial} ) |
( {Token.kind == word, Token.category == NNP, Token.orth == upperInitial}
({Token.kind == punctuation})?
( {Token.kind == word, Token.category == NNP, Token.orth == allCaps} |
{Token.kind == word, Token.category == NNP, Token.orth == upperInitial} )
) |
...
59. Text Mining and Visualization 59
Predictive Modeling
Another processing pipeline and more rules…
60. Text Mining and Visualization 60
Predictive Modeling
In the text context, predictive analytics is mostly about
classification and automated processing.
Modeling also helps, operationally, with:
Completion
http://en.wikipedia.org/wiki/File:ITap_on_Motorola_C350.jpg
Disambiguation: use dictionaries, context
Error correction
61. Text Mining and Visualization 61
Error Correction
“Search logs suggest that from 10-15% of queries contain
spelling or typographical errors. Fittingly, one important
query reformulation tool is spelling suggestions or
corrections.”
-- Marti Hearst, Search User Interfaces
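One simple way to generate spelling suggestions is string similarity against a known vocabulary. A minimal sketch using Python's standard `difflib` module; the vocabulary, the `suggest` helper, and the 0.75 cutoff are assumptions chosen for illustration, and search engines typically rely on query logs and language models instead.

```python
import difflib

# Hypothetical vocabulary, e.g. terms drawn from an index or query log.
VOCABULARY = ["analytics", "sentiment", "semantics", "taxonomy", "visualization"]

def suggest(term, vocabulary=VOCABULARY, n=3, cutoff=0.75):
    """Return up to n vocabulary terms similar to a possibly misspelled term."""
    if term in vocabulary:
        return []  # spelled as indexed; nothing to suggest
    return difflib.get_close_matches(term, vocabulary, n=n, cutoff=cutoff)

print(suggest("sentimant"))
print(suggest("visualisation"))
```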
62. Text Mining and Visualization 62
Accuracy and Semi-Structured Sources
An e-mail message is “semi-structured,” which facilitates
extracting metadata --
Date: Sun, 13 Mar 2005 19:58:39 -0500
From: Adam L. Buchsbaum <alb@research.att.com>
To: Seth Grimes <grimes@altaplana.com>
Subject: Re: Papers on analysis on streaming data
seth, you should contact divesh srivastava,
divesh@research.att.com regarding at&t labs data streaming
technology.
Adam
“Reading from text in structured domains I don’t think is as
hard.”
-- Edward A. Feigenbaum
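The structured part of the message above can be read with Python's standard `email` package, leaving the body as free text. A minimal sketch; the `metadata` dictionary and its key names are choices made for this example.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

raw = "\n".join([
    "Date: Sun, 13 Mar 2005 19:58:39 -0500",
    "From: Adam L. Buchsbaum <alb@research.att.com>",
    "To: Seth Grimes <grimes@altaplana.com>",
    "Subject: Re: Papers on analysis on streaming data",
    "",
    "seth, you should contact divesh srivastava,",
    "divesh@research.att.com regarding at&t labs data streaming",
    "technology.",
])

msg = message_from_string(raw)
metadata = {
    "sent": parsedate_to_datetime(msg["Date"]),
    "from": parseaddr(msg["From"])[1],  # address part only
    "to": parseaddr(msg["To"])[1],
    "subject": msg["Subject"],
}
body = msg.get_payload()  # the unstructured remainder

print(metadata["sent"].isoformat(), metadata["from"], metadata["subject"])
```

The headers parse reliably because their format is standardized; the person and organization mentioned in the body still require information extraction.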
Surveys are also typically semi-structured, in a different way...
63. Text Mining and Visualization 63
Structured & ‘Unstructured’
The respondent is invited to explain his/her attitude:
64. Text Mining and Visualization 64
Structured & ‘Unstructured’
We typically look at frequencies and distributions of coded-
response questions:
Linkage of responses to
coded ratings helps in
analyses of free text:
65. Text Mining and Visualization 65
Sentiment Analysis
“Sentiment analysis is the task of identifying positive and
negative opinions, emotions, and evaluations.”
-- Wilson, Wiebe & Hoffmann, 2005, “Recognizing Contextual
Polarity in Phrase-Level Sentiment Analysis”
“Sentiment analysis or opinion mining is the computational
study of opinions, sentiments and emotions expressed in
text… An opinion on a feature f is a positive or negative
view, attitude, emotion or appraisal on f from an opinion
holder.”
-- Bing Liu, 2010, “Sentiment Analysis and Subjectivity,” in Handbook
of Natural Language Processing
“Dell really... REALLY need to stop overcharging... and when i
say overcharing... i mean atleast double what you would
pay to pick up the ram yourself.”
-- From Dell’s IdeaStorm.com
66. Text Mining and Visualization 66
Sentiment Analysis
Applications include:
Brand / Reputation Management.
Competitive intelligence.
Customer Experience Management.
Enterprise Feedback Management.
Quality improvement.
Trend spotting.
67. Text Mining and Visualization 67
Steps in the Right Direction
70. Text Mining and Visualization 70
... and Missteps
“Kind” = type, variety, not a sentiment.
Complete misclassification
External reference
Unfiltered duplicates
71. Text Mining and Visualization 71
Sentiment Complications
There are many complications.
Sentiment may be of interest at multiple levels.
Corpus / data space, i.e., across multiple sources.
Document.
Statement / sentence.
Entity / topic / concept.
Human language is noisy and chaotic!
Jargon, slang, irony, ambiguity, anaphora, polysemy, synonymy,
etc.
Context is key. Discourse analysis comes into play.
Must distinguish the sentiment holder from the object:
“Geithner said the recession may worsen.”
72. Text Mining and Visualization 72
Beyond Polarity
“We present a system that adds an emotional dimension to an
activity that Internet users engage in frequently, search.”
-- Sood, Vasserman & Hoffman, 2009, “ESSE: Exploring Mood on
the Web”
73. Text Mining and Visualization 73
Happy        Sad            Angry
Energetic    Confused       Aggravated
Bouncy       Crappy         Angry
Happy        Crushed        Bitchy
Hyper        Depressed      Enraged
Cheerful     Distressed     Infuriated
Ecstatic     Envious        Irate
Excited      Gloomy         Pissed off
Jubilant     Guilty
Giddy        Intimidated
Giggly       Jealous
             Lonely
             Rejected
             Sad
             Scared
-----------------------
The three prominent mood groups that emerged from K-Means
Clustering on the set of LiveJournal mood labels.
74. Text Mining and Visualization 74
Applications
Text analytics has applications in –
• Intelligence & law enforcement.
• Life sciences.
• Media & publishing including social-media analysis and
contextual advertising.
• Competitive intelligence.
• Voice of the Customer: CRM, product management &
marketing.
• Legal, tax & regulatory (LTR) including compliance.
• Recruiting.
75. Text Mining and Visualization 75
Online Commerce
Text analytics is applied for marketing, search optimization,
competitive intelligence.
Analyze social media and enterprise feedback to understand
the Voice of the Market:
• Opportunities
• Threats
• Trends
Categorize product and service offerings for on-site search
and faceted navigation and to enrich content delivery.
Annotate pages to enhance Web-search findability, ranking.
Scrape competitor sites for offers and pricing.
Analyze social and news media for competitive information.
76. Text Mining and Visualization 76
Voice of the Customer
Text analytics is applied to enhance customer service and
satisfaction.
Analyze customer interactions and opinions –
• E-mail, contact-center notes, survey responses
• Forum & blog posting and other social media
– to –
• Address customer product & service issues
• Improve quality
• Manage brand & reputation
If you can link qualitative information from text you can –
• Link feedback to transactions
• Assess customer value
• Understand root causes
• Mine data for measures such as churn likelihood
77. Text Mining and Visualization 77
E-Discovery and Compliance
Text analytics is applied for compliance, fraud and risk, and
e-discovery.
Regulatory mandates and corporate practices dictate –
• Monitoring corporate communications
• Managing electronically stored information for production in
event of litigation
Sources include e-mail (!!), news, social media
Risk avoidance and fraud detection are key to effective
decision making
• Text analytics mines critical data from unstructured sources
• Integrated text-transactional analytics provides rich insights
78. Text Mining and Visualization 78
The Semantic Web vision
An open-standards architecture,
coordinated by the W3C (World
Wide Web Consortium).
Linked Data: “exposing, sharing, and
connecting pieces of data, information,
and knowledge on the Semantic Web.”
79. Text Mining and Visualization 79
Getting Started
A best practices approach…
Assess:
• Assess business goals.
• Understand information sources.
• Consult and educate stakeholders.
Evaluate:
• Evaluate installed, hosted/SaaS, database-integrated options.
• Determine performance and business requirements.
• Match methods to goals, sources, and work practices.
Implement:
• Start with basic functions such as search, modest goals, or
with a single information source.
• Go for clear wins to gain support.
• Build out applications, capacity, BI/research integration.
80. Text Mining and Visualization 80
Users’ Perspectives
I estimate a $425 million global market in 2009, up from $350
million in 2008. I foresee 25% growth in 2010.
Last year, I published a study report, “Text Analytics 2009:
User Perspectives on Solutions and Providers.”
http://www.slideshare.net/SethGrimes/text-analytics-2009-user-perspectives-on-solutions-and-providers
I relayed findings from a survey that asked…
81. Text Mining and Visualization 81
Primary Applications
What are your primary applications where text comes into
play?
Brand / product / reputation management 40%
Competitive intelligence 37%
Voice of the Customer / Customer Experience… 33%
Research (not listed) 33%
Customer service 22%
Content management or publishing 19%
Life sciences or clinical medicine 18%
Insurance, risk management, or fraud 17%
Financial services 15%
E-discovery 15%
Product/service design, quality assurance,… 14%
Other 13%
Compliance 8%
Law enforcement 7%
82. Text Mining and Visualization 82
Analyzed Textual Information
What textual information are you analyzing or do you plan
to analyze?
Current users responded:
blogs and other social media (twitter, social-network sites, etc.) 62%
news articles 55%
on-line forums 41%
e-mail and correspondence 38%
customer/market surveys 35%
83. Text Mining and Visualization 83
Extracted Information
Do you need (or expect to need) to extract or analyze:
Other 15%
Other entities – phone numbers, e-mail & street addresses 40%
Metadata such as document author, publication date, title, headers, etc. 53%
Events, relationships, and/or facts 55%
Concepts, that is, abstract groups of entities 58%
Sentiment, opinions, attitudes, emotions 60%
Topics and themes 65%
Named entities – people, companies, geographic locations, brands, ticker… 71%
84. Text Mining and Visualization 84
Software & Platform Options
Text-analytics options may be grouped in general classes.
• Installed text-analysis application, whether desktop or
server or deployed in-database.
• Data mining workbench.
• Hosted.
• Programming tool.
• As-a-service, via an application programming interface
(API).
• Code library or component of a business/vertical
application, for instance for CRM, e-discovery, search.
The slides that follow next will present leading options in
each category except Hosted…
85. Text Mining and Visualization 85
Text Analysis Applications
Vendors:
Attensity, Clarabridge, IBM Cognos, Linguamatics, Provalis
Research, Nstein (Open Text), SAP, SAS Teragram, SRA
NetOwl, TEMIS Luxid.
Typical uses:
Customer experience management (CEM), survey analysis,
social-media analysis, law enforcement.
Typical characteristics:
Interface that allows the user to configure a processing
pipeline.
Interface for text exploration and visualization.
Export to databases
86. Text Mining and Visualization 86
Data Mining Workbench
Vendors:
IBM SPSS Modeler, Megaputer PolyAnalyst, Rapid-I
RapidMiner, SAS Text Miner.
Typical uses:
Customer experience management (CEM), marketing
analytics, survey analysis, social-media analysis, law
enforcement.
Predictive modeling.
Typical characteristics:
Same as text-analysis applications, but with more
sophisticated modeling and analysis capabilities.
87. Text Mining and Visualization 87
Programming/Development Tool
Vendors:
GATE, Python NLTK, R – open source.
NooJ – free, non-open source.
IBM LanguageWare.
Typical uses:
Language modeling.
Data exploration.
Up to the programmer.
Typical characteristics:
Text is an add-in to a programming language/environment.
88. Text Mining and Visualization 88
As a Service, API
Vendors:
Lexalytics, Open Amplify, Orchestr8 Alchemy API, Thomson
Reuters Open Calais.
GeeYee, Jodange, Sentimentrix
Typical uses:
Annotation and content enrichment with the application
domain up to the user.
Typical characteristics:
Relies on remote, server-resident processing resources.
May or, more likely, may not be end-user customizable.
89. Text Mining and Visualization 89
Code Library or Application Component
Vendors:
Alias-I LingPipe, GATE.
Basis Technology Rosette, SAP Inxight, SAS Teragram, TEMIS.
Typical uses:
Information extraction in support of business applications.
Typical characteristics:
Same as text-analysis applications, but with more
sophisticated modeling and analysis capabilities.
90. Text Mining and Visualization 90
Text Visualizations
There are many text-visualization options.
We’ve seen –
• Structured search-generated information, slide 24.
• Clustered search results, slide 25.
• Document-set exploration via tabular & graphically rendered
facets, slide 27.
• Mined feature-relationship network, slides 27, 51.
• Annotated documents, slides 37, 54.
• Word cloud, slide 38.
• Trend lines, extracted terms, topics, and sentiment, slides 45, 68, 69.
• Sentence diagrams with parts of speech, slides 48, 49.
91. Text Mining and Visualization 91
Text Visualizations
Text visualizations operate at –
The document level, visualizations in-situ.
The document-set level, organizing documents based on
metadata and/or extracted information.
At the feature level, organizing extracted information.
Examples to this point have been –
Freely usable on the Web.
Part of products or solutions.
IBM’s Many Eyes is a great, free resource, supporting text
viz types:
Word Tree, Tag Cloud, Phrase Net, Word Cloud Generator
The illustrations that follow complement the ones we’ve
seen…
92. Text Mining and Visualization 92
Visualization: Relationship Network
“Using a program called Sitkis, we extracted data on citations
from the Web of Science… To map the network of citations,
we imported the data into a Microsoft Excel spreadsheet
running a social-network analysis extension called NodeXL.”
-- Peter Aldhous, “The stem cell wars: data, methods and results,” New Scientist, 2 June 2010.
http://www.newscientist.com/article/dn18996-the-stem-cell-wars-data-methods-and-results.html
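A relationship network like this one comes down to a list of edges (who cites whom) plus per-node counts. The following is a minimal sketch in plain Python, not the Sitkis/NodeXL workflow quoted above. The function name and the paper identifiers are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def build_citation_network(citations):
    """Build a directed citation graph from (citing, cited) pairs.

    Returns (out_edges, in_degree): out_edges maps each citing paper to the
    set of papers it cites; in_degree maps every paper to the number of
    distinct papers that cite it.
    """
    out_edges = defaultdict(set)
    in_degree = defaultdict(int)
    for citing, cited in citations:
        if cited not in out_edges[citing]:
            out_edges[citing].add(cited)
            in_degree[cited] += 1
        # Make sure papers that are never cited still appear, with count 0.
        in_degree.setdefault(citing, 0)
    return dict(out_edges), dict(in_degree)


# Hypothetical paper identifiers, for illustration only.
sample_citations = [
    ("paper_B", "paper_A"),
    ("paper_C", "paper_A"),
    ("paper_C", "paper_B"),
    ("paper_C", "paper_B"),  # duplicate record, counted once
]

out_edges, in_degree = build_citation_network(sample_citations)

# Rank papers by how often they are cited, breaking ties alphabetically.
most_cited = sorted(in_degree.items(), key=lambda item: (-item[1], item[0]))
```

A layout tool would then draw the nodes and edges. The in-degree ranking is one simple way to decide which nodes to draw largest.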
93. Text Mining and Visualization 93
Visualization: Content Network
Operating on text-extracted features & themes…
Decisive Analytics
http://www.dac.us/
TouchGraph Navigator (commercial),
http://www.touchgraph.com/navigator.html
94. Text Mining and Visualization 94
Visualization: Document Set Categorization
http://manybills.researchlabs.ibm.com/collections/114
95. Text Mining and Visualization 95
Visualization: Word Tree
http://www.nytimes.com/2009/12/22/opinion/22viegas.ready.html
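A word tree starts from a chosen root word and branches over the words that follow each occurrence of it. The sketch below builds that branching structure as nested dictionaries. It is an illustrative approximation, not the Many Eyes implementation. The function name, the `depth` parameter, and the sample text are assumptions.

```python
import re


def build_word_tree(text, root, depth=3):
    """Collect the words that follow each occurrence of `root`.

    Returns a nested dict in which each word maps to a [count, children]
    pair; children has the same shape, up to `depth` words after the root.
    """
    root = root.lower()
    words = re.findall(r"[a-z']+", text.lower())
    tree = {}
    for i, word in enumerate(words):
        if word != root:
            continue
        node = tree
        # Walk down one branch, adding or incrementing a node per word.
        for following in words[i + 1 : i + 1 + depth]:
            entry = node.setdefault(following, [0, {}])
            entry[0] += 1
            node = entry[1]
    return tree


# Hypothetical sample text, for illustration only.
sample = "I have a dream today. I have a plan. I have no doubt."
tree = build_word_tree(sample, "have", depth=2)
```

Here the branch for "a" is shared by two occurrences and then splits into "dream" and "plan". A renderer would typically size each branch by its count.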
96. Text Mining and Visualization 96
Visualization: Phrase Net
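A phrase net links pairs of words that appear on either side of a connecting pattern such as "X and Y". As a rough sketch of the data behind such a diagram, the code below counts word pairs joined by a single connector word. The function name, the default connector, and the sample text are assumptions. Sentence boundaries are ignored for simplicity.

```python
import re
from collections import Counter


def phrase_net_edges(text, connector="and"):
    """Count (left, right) word pairs joined by `connector`."""
    words = re.findall(r"[a-z']+", text.lower())
    edges = Counter()
    # Slide a three-word window over the text and keep matches.
    for left, middle, right in zip(words, words[1:], words[2:]):
        if middle == connector:
            edges[(left, right)] += 1
    return edges


# Hypothetical sample text, for illustration only.
sample = "Pride and prejudice. Sense and sensibility. Pride and prejudice again."
edges = phrase_net_edges(sample)
```

Each counted pair becomes a directed edge in the diagram, and the count can set the edge's thickness.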
97. Text Mining and Visualization 97
Visualization: Discourse Analysis
98. Text Mining and Visualization 98
Visualization: Text Statistics
Literary analysis: “Stefanie Posavec’s… maps capture
regularities and patterns within a literary
space…highlighting and noting sentence length, prosody
and themes.”
http://www.notcot.com/archives/2008/04/stefanie-posave.php
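Sentence length, one of the statistics mentioned in the quote, is straightforward to compute. The sketch below is a simple approximation and is not a description of how Posavec produced her maps. The function name and sample text are assumptions, and the sentence splitter is naive: it breaks on ".", "!" or "?" followed by whitespace, so abbreviations will be split incorrectly.

```python
import re


def sentence_lengths(text):
    """Return the number of words in each sentence of `text`."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(re.findall(r"\w+", s)) for s in sentences]


# Hypothetical sample text, for illustration only.
sample = "It rained. The river rose over the old stone bridge. Nobody slept!"
lengths = sentence_lengths(sample)
mean_length = sum(lengths) / len(lengths)
```

Plotting the lengths in order gives a simple picture of the rhythm of a passage, for example one bar or line segment per sentence.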
99. Text Mining and Visualization 99
Visualization: Stream Graph
Acknowledges narrative structure.
http://www.neoformix.com/2008/TomSawyer.html
100. Text Mining and Visualization 100
Selected Resources
Christopher D. Manning, Prabhakar Raghavan and Hinrich
Schütze, Introduction to Information Retrieval, 2008.
http://nlp.stanford.edu/IR-book/information-retrieval-book.html
Bo Pang and Lillian Lee, “Opinion mining and sentiment
analysis,” 2008.
http://www.cs.cornell.edu/home/llee/opinion-mining-sentiment-analysis-survey.html
Tim Showers, “Visualization Strategies: Text & Documents,”
2008.
http://www.timshowers.com/2008/08/visualization-strategies-text-documents/
IBM Many Eyes visualization site.
http://manyeyes.alphaworks.ibm.com/manyeyes/
Neoformix: discovering & illustrating patterns in data.
http://neoformix.com/
Seth Grimes, various material.
http://www.slideshare.net/SethGrimes/, http://sethgrimes.com
101. From Pattern Matching to
Knowledge Discovery Using
Text Mining and Visualization
Techniques
Seth Grimes
Alta Plana Corporation
@sethgrimes – 301-270-0795 -- http://altaplana.com
Special Libraries Association 2010
June 13, 2010