white paper
Conceptualizing LSI-Based
Text Analytics
John Felahi
Senior Vice President of Products, Content Analyst, LLC.
Trevor J. Morgan, Ph.D.
Relevance Analytics
A Better Way to Find Information
Corporations are undeniably overloaded with information.
They thrive on the innovation and intellectual property
captured in their corporate documents. However, when
document sets grow at explosive rates, as they have over the past five years, critical information becomes buried
and lost. The unfortunate reality is that lost information
is useless information. Companies lose a key competitive
advantage by not being able to find and exploit valuable
information embedded within large document sets.1
The traditional solution to this problem is Boolean
(keyword) search or keyword-based document tagging
and organization. These Boolean search and analytics
tools may work just fine when the user knows exactly
what word or words the desired documents must contain,
but this is rarely the case. Nobody can ever anticipate
what specific words or phrases are in a document—
finding the right query is laborious and often futile.
The goal, then, is to make information more findable
by complementing keyword search systems with
advanced text analytics technologies. Advanced text
analytics encompasses the complex convergence
of linguistics, mathematics, probability theory, and
software development. Advanced text analytics software
employs sophisticated algorithms in an attempt to “read”
document content and figure out what that content
actually means. These solutions provide users with a rich
array of features, including concept-based search and
document organization functions. Advanced text analytics
software tries to determine document content and
meaning in the same way humans do, except on a scale
of volume far beyond human capabilities. The purpose is
not to replace humans but rather to refocus humans on
what they do best.2
Differentiators in Text Analytics
Text analytics solutions distinguish themselves in two
ways:
1.	 The manner in which the engine discovers the
meaning (concepts) of text in a document set. This
is essentially the content discovery aspect of text
analytics.
2.	 The variety of features that end users can leverage
once the engine has discovered all document content.
The importance of the content discovery phase cannot be overstated. The following sections focus on how
different technologies approach the daunting challenge
of discovering the meaning embedded within text
documents. The discussion shows the natural evolution
of indexing technologies to the present generation
of advanced text analytics engines, which deal with
document conceptuality. It also points to an extremely
viable advanced text analytics technology known as
Latent Semantic Indexing (LSI), discusses how LSI
functions, and compares it to alternative technologies.
LSI is a very powerful and scalable text analytics
technology, but the key to unleashing its potential is
understanding how it works and what problems it solves.
Content Discovery is the Major Differentiator
As mentioned earlier, all text analytics technologies need
to discover the contents of the documents presented to
them. This indexing process involves contextual term
analysis and is the first step in enabling users to work
with the information in a document set. Without this step,
the text analytics engine cannot possibly execute a search
or categorize documents: how could it when the contents
of the documents are unknown? The way in which an
analytics engine discovers document content is critical to
overall functionality and is a major distinguishing factor.
Many competing indexing technologies have been
developed and brought to market. Comprehensively
exploring all the different technical solutions and how
they function would be time consuming. However, a
quick look at the evolution of text analytics is instructive
in comparing the relative strengths of the predominant
text analytics approaches. When you view text analytics
technologies from the most general perspective, you
will find that each platform falls into one of the following
types:
»» Lexical, focusing primarily on the linguistic and
semantic indicators in text.
»» Probabilistic, focusing primarily on the statistical
potentialities in text.
»» LSI-based, focusing primarily on the holistic co-
occurrences of unique terms in text.
»» Hybridization, combining elements of any of the
previous three.
The First Generation—Term Occurrence and
Keyword Indexing
Boolean-based search engines initially employed a
simple linguistic method to index documents. These platforms' semantic discovery amounted to counting all the individual words (and word frequencies) found in a document set. The resultant index, therefore, was a comprehensive term look-up list, with varying ranks applied depending on term frequency within the document set. Over time, content enrichment methods have been added, but these systems remain lexically based by nature.
With these systems, when a user submits a search
term or phrase (known as the query), the search engine
compares the query to the contents of the look-up list.
A match occurs if a document in the index meets the
conditional logic of the query. For example, a query for
“dog” returns all documents containing the word “dog”
one or more times in a ranked order—documents with
many instances of the word rank higher in the results list
than ones with fewer instances. A query for “dog NOT
cat” returns all documents containing the word “dog”
but not the word “cat” because of the conditional logic
conveyed by NOT. The advent of this keyword indexing
and search methodology revolutionized the way users
found the documents for which they were searching. Keyword technology still solves many problems in information management today, especially when the precise query is known to the user. Indeed, when the user is looking for the presence (or absence) of very
specific words or phrases, keyword search is still the
most effective discovery tool.
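To make the mechanism concrete, the following minimal sketch (in Python, with an invented three-document corpus) illustrates the term look-up list and the conditional logic described above; it is a toy illustration, not any vendor's implementation.

```python
# Toy inverted index illustrating first-generation keyword search.
# Documents and contents are invented for this example.
from collections import Counter, defaultdict

docs = {
    "doc1": "the dog chased the cat",
    "doc2": "a dog and another dog played",
    "doc3": "the cat slept",
}

# Build the term look-up list: term -> {doc_id: term frequency}.
index = defaultdict(Counter)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term][doc_id] += 1

def search(term):
    """Rank documents containing `term` by term frequency."""
    return sorted(index[term].items(), key=lambda kv: -kv[1])

def search_not(term, excluded):
    """Boolean 'term NOT excluded': hits for `term`, minus any
    document that also contains `excluded`."""
    return [(d, n) for d, n in search(term) if d not in index[excluded]]

print(search("dog"))             # doc2 (two occurrences) ranks above doc1
print(search_not("dog", "cat"))  # only doc2; doc1 also contains "cat"
```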
What to Like About Keyword Technology
»» Simple approach to indexing the content of
documents.
»» Easy to understand resultant document list—
prevalence of query terms affects rank.
»» A good way to find very specific or uniquely worded
information in smaller document sets.
While technically a linguistic approach (it did, after all,
focus on the language within the indexed documents),
keyword indexing was not designed to discover word
and document meaning. Furthermore, it is prone to false positives and negatives: the presence of the query word (or words) in a document does not guarantee that document's relevance.
For example, a user looking for documents about “fraud” quickly finds that many documents contain the word “fraud” but are not necessarily about fraudulence. The user is then forced
to construct more and more complex queries strung
together with conditional operators (such as AND, OR) in
order to work within the search engine’s method to find
what is truly being sought. The problem is that most of
the time what is being sought is not a word or series of
words at all. Users really want to find documents in the
same way their own human brains work with information
every hour of every day: they want to find concepts
without having to know the exact terms to use in order to
convey that meaning to the analytics engine.
Shortcomings of Keyword Technology
»» Likelihood of introducing false positives and negatives
with vague queries.
»» Overly complex query construction when searching
for more expansive ideas or themes.
The Next Generation: Linguistic Analysis
and Indexing
A concept is different from a word or series of words, no
matter how much conditional logic is thrown in to make
the keyword query more nuanced. According to the online
American Heritage Dictionary of the English Language, a
concept is “a general idea or understanding of something.”
In a document, a concept might be an expressed idea
or thematic content employing any number of different
words to articulate it. A word is a unique entity with a
finite number of restricted meanings. A concept is a larger idea not restricted to any particular terminology for its expression.
To distinguish between the two, consider the word (or keyword) “music” versus the idea of auditory stimulation that excites the senses and the mind. The concept of
music could encompass thousands of different ideas—
rock and roll, blues, Woodstock, Beethoven, Justin
Bieber. Undeniably, human thoughts occur in the form
of ideas and concepts most of the time, not specific
words. Another fact is that rarely will a keyword or series
of keywords fully express all the facets of a concept.
Therefore, keyword search can be inherently inadequate
for most users’ needs when the “right” query is unknown.
This realization spurred researchers and software
developers past first-generation keyword text analytics.
The race was on to figure out how to find the concepts
embedded within documents, the “aboutness” rather than
just the individual words that composed them.3
The next generation of text analytics platforms continued
to rely on the analysis of language to find conceptual
content within documents, just as earlier keyword
technologies had. These linguistic indexing technologies,
though, went beyond the keyword lookup process. They
incorporated algorithms and ancillary reference tools
in order to interpret the complexities of language found
within documents. Such lexical analytics software
came pre-programmed with the static rules of language
(grammar) and word use (semantics). Reference
cartridges or modules such as dictionaries and thesauri
fed into this linguistic indexing methodology.
This approach was certainly more sophisticated than
keyword indexing, but the problem was and still is
that the rules and conventions of language—grammar,
semantics, and colloquial usage—are so fluid and
changeable that early indexing software either could not derive concepts accurately or effectively, or it could not keep up with the ever-changing dynamics of modern
language. In the 21st century, a word can come into being
overnight, or an existing word can be imbued with vastly
different meaning and propagated around the world
within hours. Linguistic indexing engines, relying on
laborious human pre-programming and updating, could
not keep pace.
Another problem with the linguistic approach is that
it was language dependent, not language agnostic.
A dictionary is a reference tool for words in a given
language, so if a linguistic indexing engine encountered
a document in a language not supported by its reference
dictionary, it could not derive meaning from that
document. Some other technique or technology was
required to resolve these shortcomings.
The Irony Is…
Language, when viewed from a linguistic perspective,
seems hopelessly complex and impenetrable due to its
rich variety and dynamic state. Language, when treated from a mathematical perspective, becomes elegantly understandable. The importance of mathematics in
language analysis continues to gain traction.4
Advanced Text Analytics: Mathematical
Analysis and Indexing
In a very counter-intuitive fashion, researchers and
innovators turned away from the analysis of language
rules in order to figure out conceptuality within text.
Instead, they approached the content indexing problem
from a mathematical perspective. Employing an
impressive array of mathematical approaches and
maneuvers during the indexing process, text analytics
vendors were able to create even more advanced text
analytics software. No longer were indexing engines
dependent on static linguistic references necessitating
frequent updating as language usage changed.
Advanced text analytics engines now can rely on
statistical analyses of probable meaning within a
document—known as a probabilistic indexing approach—
or on linear algebraic analyses of total word co-
occurrence and correlation—known as an LSI-based
indexing approach—to figure out the concepts contained
within a document. Other approaches include a
hybridization of these two techniques. With these math-
based advanced analytics techniques, conceptual search
and classification across large volumes of documents is
possible with a very high degree of reliability, flexibility,
and adaptability. Sophisticated text analytics has finally
arrived.
The technology behind probabilistic text analysis
leverages research based in statistical computations.
It builds upon the ideas of probable confidence in
an assumption and the continuous updating of that
confidence based on evidence. Applied to text analysis,
a probabilistic approach—relying on algorithms rooted
in statistical analysis—analyzes local term patterns
and computes probable or likely meanings. These
calculations in part depend on previous assumptions of
meaning, which means that faulty assumptions introduce
error into the textual analysis. The length of the text
also influences the accuracy of probabilistic indexing,
with shorter texts presenting a particular problem in
conceptual derivation.5
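The Bayesian updating described above can be illustrated with a short sketch. The topics, terms, and probabilities below are invented for illustration; they show how a posterior belief shifts with each observed term, and why short texts leave the result dominated by the prior.

```python
# Toy Bayesian updating over two hypothetical topics. The likelihoods
# P(term | topic) and the prior are invented for illustration.
likelihood = {
    "finance": {"fraud": 0.04, "audit": 0.03, "statue": 0.0001},
    "art":     {"fraud": 0.002, "audit": 0.0005, "statue": 0.05},
}
prior = {"finance": 0.5, "art": 0.5}  # the initial assumption

def update(belief, term):
    """One step of Bayes' rule: posterior is likelihood times prior,
    renormalized over the topics."""
    posterior = {t: likelihood[t].get(term, 1e-6) * p
                 for t, p in belief.items()}
    total = sum(posterior.values())
    return {t: p / total for t, p in posterior.items()}

belief = dict(prior)
for term in ["fraud", "audit"]:  # evidence from a (very short) document
    belief = update(belief, term)

# Confidence shifts strongly toward "finance". With so few observed
# terms, a faulty prior would dominate the outcome -- the short-text
# weakness noted above.
print(belief)
```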
Latent Semantic Indexing, or LSI, is a linear algebraic
approach to deriving conceptuality out of the textual
content of a document. LSI uses sophisticated linear
algebraic computations in order to assess term co-
occurrence and contextual associations. While the scale
of calculations “under the hood” is extensive, the overall
approach can be explained in comprehensible and non-
technical language. Once understood, the utility of the LSI technique and the myriad features it facilitates become quite apparent.
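For readers who want to see the single equation at the heart of the approach, LSI rests on the truncated singular value decomposition of the term-document matrix (the standard formulation in the LSI literature, not a vendor-specific algorithm):

$$ A \;\approx\; A_k \;=\; U_k \, \Sigma_k \, V_k^{\top} $$

Here $A$ is the $m \times n$ term-document matrix, the rows of $U_k$ and $V_k$ give term and document coordinates in a shared $k$-dimensional concept space, $\Sigma_k$ holds the $k$ largest singular values, and choosing $k \ll \min(m, n)$ retains only the dominant co-occurrence patterns while discarding noise.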
LSI and Concept-Based Text Analytics
In order to figure out what concepts are contained within
a document, an LSI-based text analytics engine first must
acquire an understanding of term interrelationships.
Keep in mind that, as a math-based indexing technique,
an LSI engine does not have ancillary linguistic references
or any pre-programmed understanding of language at
all, so this understanding is mathematical rather than
linguistic. Using algorithms based in linear algebra, the
LSI technique generates this understanding by assessing
all the words within a document and within an entire
document set. The engine then calculates word co-
occurrences across the entire document set—accounting
for different emphases on certain word prevalence—to
figure out the interrelationships between words and the logical combinations of word co-occurrence that lead to comprehensible concepts. This ability to derive
conceptuality from text is one of its most valuable
commercial traits.6
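A minimal end-to-end sketch of this workflow, assuming a Python environment with scikit-learn available: the tiny corpus, the TF-IDF weighting, and the two-dimensional concept space are illustrative choices, not Content Analyst's implementation.

```python
# Toy LSI pipeline: TF-IDF term-document matrix, truncated SVD into a
# low-rank concept space, then conceptual search by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the dog chased a ball in the park",
    "a spaniel is a dog bred for hunting",
    "the spaniel and the husky played in the park",
    "the orchestra performed a beethoven symphony",
    "the symphony hall hosted a rock concert",
]

# Weighted term-document matrix (term prevalence via TF-IDF).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Truncated SVD projects documents into a shared low-rank concept
# space derived from global term co-occurrence patterns.
lsi = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsi.fit_transform(X)

# A conceptual query, matched by position in concept space rather
# than by literal keyword overlap.
query = lsi.transform(vectorizer.transform(["husky"]))
scores = cosine_similarity(query, doc_vectors)[0]
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:+.3f}  {doc}")
# The dog-related documents tend to rank highest even though only one
# of them contains the literal word "husky".
```

Even at this toy scale, the query is matched in concept space rather than by shared keywords, which is the behavior the paper attributes to LSI.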
To state it another way, we can compare LSI to human
thought and communication. A human must use logical
and accepted word combinations in order to convey a
thought, regardless of the language used. Too many
illogical word co-occurrences create incomprehensible
gibberish. LSI uses advanced math to figure out
these inherent combinations based on the documents
themselves, allowing it to respond to a conceptual query
with documents containing the same concept—again, not
the same words, but the same or similar concept. In a
way, LSI mimics the human brain in its attempt to make
sense out of word co-occurrences and figure out what text
really means. If it cannot figure out what the words mean,
that probably indicates that the word combinations are
meaningless. As Bradford states, concepts derived from LSI “correspond remarkably well with judgments of conceptual similarity made by human beings.”7
LSI is not new technology. As a matter of fact, it uses
individual mathematical maneuvers that have been
known to scientists and mathematicians for decades. In
the 1980s, a group of Bellcore scientists applied these
mathematical principles to their research in language
and subsequently patented the LSI technique. This
technology changed hands in the 1990s and the first decade of this century until a new organization—Content
Analyst Company—was created in 2004 to advance the
technology in several markets. Along the way, numerous
other patents have been granted around the original LSI
approach, allowing it to grow into a full-blown advanced
text analytics platform.
LSI and Some Misconceptions
Throughout the different stages of LSI development
and evolution, the challenges which its inventors
and developers had to overcome gave rise to some
misconceptions about the LSI indexing technique. Most of
these misconceptions derive from its earliest days when
the technique was just beginning to evolve. Some of
these misconceptions include:
»» LSI is slow and does not scale
»» LSI is expensive to implement and maintain
»» LSI is not defensible
»» LSI does not differentiate semantic nuances
»» LSI cannot replace human inspection of documents
One of the biggest misconceptions about LSI technology
is that it is slow and non-scalable when presented
with large volumes of documents. We can trace this
misconception back to the days of less powerful,
more expensive hardware. Sometimes a technology is ahead of its time, relying on techniques that the available hardware has yet to catch up with. The reality is that
LSI was invented and patented during a time when
the sophisticated math which the technique requires
consumed vast resources on the limited hardware
available. Hardware constraints admittedly slowed down
not only the indexing process but also the text analytics
functions carried out post-indexing.
As already touched upon, though, the rapid reduction in
hardware costs and huge gains in performance witnessed
over the past five years (resulting in inexpensive many-
core processors and cheap memory) have eliminated
these problems. Not only does an LSI engine now
have vast hardware resources available to it on a wide
range of servers, but its ongoing evolution has resulted
in distributed indexing capabilities and load-sharing
deployments. Concept searches typically return results in under a second, and hundreds of thousands of documents can be classified, organized, and tagged in mere minutes, as opposed to the days or weeks humans would need to assess the same volume of information.
LSI performs its functions rapidly on very affordable
hardware.8
The associated perception that LSI is expensive to install
and maintain is refuted by the same explanation. Cheap
hardware, extensible features, and deployment best
practices learned along the way all contribute to an
economical answer for anybody’s advanced text analytics
needs. For the value it provides, LSI is a compelling
indexing technology for concept-based analytics.
Microchip flashback
In 1990, Intel® introduced the 33 MHz 486 microprocessor chip with a processing speed of 27 MIPS. Today's Intel Xeon® chip runs at up to 4.4 GHz; by clock speed alone, the Xeon is 133 times faster than the state-of-the-art 486 of 1990.
Another misconception plays more upon the human fears
of automation by questioning how defensible the results
of LSI indexing and concept-based analytics are. Humans
are always suspicious when “the machine” encroaches
upon tasks and abilities thought to be best performed by intelligent, well-trained humans. Because LSI
effectively removes humans from the process of reading
and assessing the content of documents, critics can easily
play upon the fear of the unknown. After all, if no humans
read all the documents, who really knows what’s in them?
This fear is understandable but has no basis in research
or documented observation. The algorithms and
proprietary technology that power LSI indexing are well
documented and can be defended by the principles of
advanced mathematics. The reality is that the math of
LSI and its approach, driven by sophisticated analysis of total word co-occurrence, can be defended from both a mathematical and a linguistic perspective.
A particularly erroneous claim is that LSI cannot
detect word similarities or semantic variations. This
misconception insists that LSI cannot distinguish between
“cool” as an indicator of temperature versus “cool” as a
qualitative judgment of relevance. This claim is patently
false. As a matter of fact, LSI is ideal for penetrating
the mysteries of semantics—including synonymy and
polysemy—and based upon its core approach (term co-
occurrence) actually figures out semantic quandaries in
the same way humans do.9
If one person says to another, “this is cool,” the
recipient might not immediately understand what is
being indicated, especially if the speaker has touched
something that might actually be cold. The hearer might
ask for clarification with something elegant like “what
is?” The first person might then elaborate with, “This
bronze statue is really cool. It’s quite post-modern.” With
the additional accompanying words in the follow up, the
speaker provides enough term co-occurrence for the
hearer to understand the meaning, which has nothing to
do with temperature. Conversely, a metalsmith casting the same bronze statue might also say “this is cool,” referring to the temperature and indicating that it is ready to be handled without getting burned. LSI analysis of text
works exactly like this and mirrors the human ability to
interpret meaning based on term co-occurrence.
A final misconception involves the question of technology
replacing human assessment of document content. It is
human nature to resist technologies and processes that
we don’t fully understand or that we feel are replacing
us, but that does not mean that the technology itself
is not effective or appropriate to implement. With the
nearly exponential explosion in volume of enterprise
data, technology must replace the inadequate solutions
provided by earlier text analytics techniques and
expensive human activity. The knowledge management
market has indeed reached the inflection point where low hardware costs coupled with the advanced capabilities of CAAT™ create a compelling counter-argument for those who are technophobic.
LSI and CAAT™
As with all other vendors of search and text analytics
technologies, Content Analyst Company has had to
focus on the tenets of precision, speed, and flexibility.
Overcoming the inherent obstacles within text that distract CAAT™ and hamper its ability to determine document conceptuality was also a necessity. For example, header and footer information in emails must be filtered out to extract the useful authored content of those documents.
Considerable research has gone into preparing document
text for more precise conceptual analysis.
In the earlier days of the technology, the inventors and
developers had the additional problem of less powerful
but much more expensive hardware. The math required
to perform LSI indexing is not insignificant, so the
workstations and servers of the 1990s and early 2000s
had to be robust, with as much memory (RAM) and
CPU horsepower as possible. Furthermore, the 32-bit processors and operating systems prevalent at that time were not capable of addressing large amounts of
RAM. Until the inflection point of more powerful but
less expensive hardware occurred in the mid-2000s,
the ongoing development of the CAAT engine focused on
refinement of the code base to allow for more accurate
and speedier functionality.
Distributed subsystem deployment and text filtering
capabilities were also incorporated along the way, the
latter of which suppresses extraneous or “garbage
text” during indexing. Finally, the evolution of CAAT
over the years grew to encompass not only traditional
search (concept search) and concept-based document
classification (multiple techniques and optimizing
algorithms), but also dynamic clustering, document
summarization, primary language identification, and
advanced text comparisons (for thread detection in
emails and for identifying duplicate text). CAAT is now an
advanced text analytics platform with dozens of discrete
analytical applications which developers can assemble
in combination to creatively add concept-based text
analytics to larger software solutions.
Key Capabilities
The key features of CAAT include concept-based search
and document organization/classification. Concept
search and concept-based document classification are
far more powerful than keyword-based approaches, for
reasons discussed previously. Now, the user does not
have to know the “right” words to use when submitting
queries, and documents that are related to each other
conceptually can be grouped together regardless of
whether they share the same terminology.
Document classification is particularly attractive for software vendors who need workflow automation and document routing within their solutions, such as vendors of enterprise content management, enterprise archiving, compliance, and e-discovery software. The
ability to rapidly organize large volumes of documents
based on their relevance to each other or to an
overarching category reduces or eliminates costly
human document inspection. With the reduction of
human document inspection come the benefits of more
precise classification results. Properly trained text
analytics software such as CAAT more objectively and
consistently assesses the conceptual and thematic
content of documents—it does not get tired (and therefore
increasingly inconsistent), and it does not get tripped up
over the interpretive nuances of language that plague
human inspection.
Because of the importance of classification, CAAT includes
two major ways to organize documents: clustering
and categorization. These two modes of document
classification differ in the amount of human intervention
required to establish and define the organizational
structure, known as the taxonomy. In theory, a taxonomy is nothing more than a set of discrete organizational units—arranged in either a flat or hierarchical manner—into
which documents can be placed, along with the rules
dictating the type of content a document must contain
in order to qualify for a category. In practice, taxonomy
development and maintenance is an enormous
undertaking for enterprises, requiring highly specialized
knowledge workers (known as taxonomists) and their
support staff. CAAT’s ability to cluster documents into an automatically generated taxonomy, or alternatively to accommodate the more refined process of human-trained document categorization, means that nearly any
automated workflow requiring classification can be
supported.
The value propositions for these two classification
methods are compelling. Automated clustering
provides rapid organization of documents so that users
can quickly understand the conceptual composition
and distribution spread within large document sets.
Clustering automatically classifies documents based on
each document’s predominant concept or theme—it also
creates the taxonomy structure and category naming
scheme without human intervention. Because the
unsupervised clustering feature is dynamic and can be
just-in-time within a solution’s workflow, it is ideal when
quick and general insight into a document set is required.
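A minimal sketch of such unsupervised clustering, reusing the `doc_vectors`, `vectorizer`, and `lsi` objects from the earlier LSI sketch; the cluster count and term-based naming scheme are illustrative assumptions, not CAAT's algorithm.

```python
# Toy unsupervised clustering over the LSI document vectors, with
# cluster names generated automatically from high-weight terms.
# Reuses `doc_vectors`, `vectorizer`, and `lsi` from the LSI sketch.
import numpy as np
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(doc_vectors)

terms = vectorizer.get_feature_names_out()
for c in range(kmeans.n_clusters):
    # Map the cluster centroid back into term space and take its
    # top-weighted terms as an automatic category name.
    centroid_terms = lsi.inverse_transform(
        kmeans.cluster_centers_[c][None, :])[0]
    top = [terms[i] for i in np.argsort(centroid_terms)[::-1][:3]]
    members = [i for i, cid in enumerate(cluster_ids) if cid == c]
    print(f"cluster '{'/'.join(top)}': documents {members}")
```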
Categorization, on the other hand, allows users to
participate in the taxonomy development and analytics
training process. As with all other types of supervised
document classification technologies, categorization
demands more upfront effort during taxonomy
development and the requisite learning process to define
the categories for CAAT. The compelling benefit is a
much more refined and accurate result set placed into
pre-defined categories of interest. Categorization is a
powerful solution where the best of human insight and
software efficiency can be combined to yield the most
accurate classification results possible.
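Figure 1.2 describes categorization in terms of example documents and a threshold defining “hit spheres.” A minimal sketch of that idea, reusing `doc_vectors` from the earlier LSI sketch; the mean-of-examples category center and the 0.8 threshold are illustrative assumptions.

```python
# Toy "hit sphere" categorization: human-chosen example documents
# define a category center; indexed documents within a cosine-
# similarity threshold of that center are assigned to the category.
from sklearn.metrics.pairwise import cosine_similarity

def categorize(doc_vectors, example_ids, threshold=0.8):
    """Return indices of documents falling inside the hit sphere."""
    center = doc_vectors[example_ids].mean(axis=0, keepdims=True)
    scores = cosine_similarity(doc_vectors, center)[:, 0]
    return [i for i, s in enumerate(scores) if s >= threshold]

# Train a "dogs" category from two example documents (indices 0 and 1).
print(categorize(doc_vectors, example_ids=[0, 1]))
```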
CAAT also provides a term expansion feature, which, for any word, detects all the other terms in the indexed document set that are highly correlated or synonymous with it. Using the “dog”
example from earlier, CAAT would also identify “husky,”
“spaniel,” “mutt,” “pup,” and “man’s best friend,” as highly
correlated or synonymous terms.
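A minimal sketch of term expansion in the same LSI concept space, reusing the fitted `lsi` and `vectorizer` from the earlier sketch. On a large real corpus this is where neighbors like “husky” and “spaniel” would surface; a toy corpus can only yield its own small vocabulary.

```python
# Toy term expansion: every term has its own vector in concept space
# (rows of the transposed SVD components), and a term's nearest
# neighbors are its highly correlated or synonymous candidates.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

terms = vectorizer.get_feature_names_out()
term_vectors = lsi.components_.T  # one concept-space vector per term

def expand(term, top_n=5):
    """Return the terms closest to `term` in concept space."""
    i = list(terms).index(term)
    scores = cosine_similarity(term_vectors[i : i + 1], term_vectors)[0]
    ranked = np.argsort(scores)[::-1]
    return [terms[j] for j in ranked if j != i][:top_n]

print(expand("dog"))
```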
Being a math-based indexing engine, CAAT has no
native understanding of language, the benefit of which
is complete language agnosticism. Indexing of German
documents enables the same analytics features and
yields the same accurate results as the indexing of
documents in English, Chinese, or Arabic. Despite being
language agnostic, CAAT does have the ability to detect
the differences between languages due to its term
analysis. Therefore, it can identify the primary language
of a document.
All of these features can be accessed whenever needed
in a larger software platform to increase the findability
of documents and improve the accuracy and relevance of
document classification.
Learning More About CAAT
The power of CAAT and its LSI indexing technology has
been integrated into dozens of software solutions in a
number of different markets. To learn more about these
success stories, go to www.contentanalyst.com.
Figure 1.1. CAAT finds tight groupings of concepts; groups and subgroups are then selected based on settings. Numeric values help interpret results, and document scores indicate closeness to the centers of the clusters.
Figure 1.2. Example documents plus a threshold define “hit spheres”; documents in the search index that fall within the “hit spheres” are categorized.
About Content Analyst Company
We provide powerful and proven Advanced Analytics
that exponentially reduce the time needed to discern
relevant information from unstructured data. CAAT, our
dynamic suite of text analytics technologies, delivers
significant value wherever knowledge workers need to
extract insights from large amounts of unstructured data.
Our capabilities are easily integrated into any software
solution, and our support strategy for our partners is
second to none.
© 2013 Content Analyst Company, LLC. All rights
reserved. Content Analyst, CAAT and the Content Analyst
and CAAT logos are registered trademarks of Content
Analyst, LLC in the United States. All other marks are the
property of their respective owners.
References
1 Frank Ohlhorst, “The Promise of Big Data,” InfoWorld, September 2010.
2 John Markoff, “Armies of Expensive Lawyers, Replaced by Cheaper Software,” New York Times, March 4, 2011.
3 C. Korycinski and Alan F. Newell, “Natural Language Processing and Automatic Indexing,” The Indexer, April 1990.
4 Mark Liberman, “Linguists Who Count,” Language Log, May 28, 2009.
5 Yangqiu Song et al., “Short Text Conceptualization Using a Probabilistic Knowledgebase,” Proceedings of the 22nd International Joint Conference on Artificial Intelligence, 2011.
6 Roger Bradford, “Comparability of LSI and Human Judgment in Text Analysis Tasks,” Proceedings of the Applied Computing Conference, September 2009.
7 Bradford, “Comparability of LSI and Human Judgment in Text Analysis Tasks.”
8 Roger Bradford, “Implementation Techniques for Large-Scale Latent Semantic Indexing Applications,” Proceedings of the 20th ACM International Conference on Information and Knowledge Management, October 2011.
9 Bradford, “Comparability of LSI and Human Judgment in Text Analysis Tasks.”

More Related Content

What's hot

Text Analytics Overview, 2011
Text Analytics Overview, 2011Text Analytics Overview, 2011
Text Analytics Overview, 2011
Seth Grimes
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012
Peter Mika
 
An Introduction to Text Analytics: 2013 Workshop presentation
An Introduction to Text Analytics: 2013 Workshop presentationAn Introduction to Text Analytics: 2013 Workshop presentation
An Introduction to Text Analytics: 2013 Workshop presentation
Seth Grimes
 
Lexalytics Text Analytics Workshop: Perfect Text Analytics
Lexalytics Text Analytics Workshop: Perfect Text AnalyticsLexalytics Text Analytics Workshop: Perfect Text Analytics
Lexalytics Text Analytics Workshop: Perfect Text Analytics
Lexalytics
 
Text Analytics Presentation
Text Analytics PresentationText Analytics Presentation
Text Analytics PresentationSkylar Ritchie
 
What IA, UX and SEO Can Learn from Each Other
What IA, UX and SEO Can Learn from Each OtherWhat IA, UX and SEO Can Learn from Each Other
What IA, UX and SEO Can Learn from Each Other
Ian Lurie
 
SemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorialSemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorial
Peter Mika
 
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Trey Grainger
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)
Trey Grainger
 
Information Retrieval Fundamentals - An introduction
Information Retrieval Fundamentals - An introduction Information Retrieval Fundamentals - An introduction
Information Retrieval Fundamentals - An introduction
Grace Hui Yang
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
Trey Grainger
 
Semantic Search on the Rise
Semantic Search on the RiseSemantic Search on the Rise
Semantic Search on the Rise
Peter Mika
 
Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey  Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey
dannyijwest
 
Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)
muzzy4friends
 
Competitive Intelligence Made easy
Competitive Intelligence Made easyCompetitive Intelligence Made easy
Competitive Intelligence Made easy
Raghav Shaligram
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
OpenSource Connections
 
Brave new search world
Brave new search worldBrave new search world
Brave new search world
voginip
 
A survey on various architectures, models and methodologies for information r...
A survey on various architectures, models and methodologies for information r...A survey on various architectures, models and methodologies for information r...
A survey on various architectures, models and methodologies for information r...IAEME Publication
 

What's hot (19)

Text Analytics Overview, 2011
Text Analytics Overview, 2011Text Analytics Overview, 2011
Text Analytics Overview, 2011
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012
 
An Introduction to Text Analytics: 2013 Workshop presentation
An Introduction to Text Analytics: 2013 Workshop presentationAn Introduction to Text Analytics: 2013 Workshop presentation
An Introduction to Text Analytics: 2013 Workshop presentation
 
Lexalytics Text Analytics Workshop: Perfect Text Analytics
Lexalytics Text Analytics Workshop: Perfect Text AnalyticsLexalytics Text Analytics Workshop: Perfect Text Analytics
Lexalytics Text Analytics Workshop: Perfect Text Analytics
 
Text Analytics Presentation
Text Analytics PresentationText Analytics Presentation
Text Analytics Presentation
 
What IA, UX and SEO Can Learn from Each Other
What IA, UX and SEO Can Learn from Each OtherWhat IA, UX and SEO Can Learn from Each Other
What IA, UX and SEO Can Learn from Each Other
 
SemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorialSemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorial
 
Locating sources and search techniques
Locating sources and search techniquesLocating sources and search techniques
Locating sources and search techniques
 
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)
 
Information Retrieval Fundamentals - An introduction
Information Retrieval Fundamentals - An introduction Information Retrieval Fundamentals - An introduction
Information Retrieval Fundamentals - An introduction
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
 
Semantic Search on the Rise
Semantic Search on the RiseSemantic Search on the Rise
Semantic Search on the Rise
 
Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey  Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey
 
Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)
 
Competitive Intelligence Made easy
Competitive Intelligence Made easyCompetitive Intelligence Made easy
Competitive Intelligence Made easy
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
 
Brave new search world
Brave new search worldBrave new search world
Brave new search world
 
A survey on various architectures, models and methodologies for information r...
A survey on various architectures, models and methodologies for information r...A survey on various architectures, models and methodologies for information r...
A survey on various architectures, models and methodologies for information r...
 

Viewers also liked

Corporate Presentation: example
Corporate Presentation: exampleCorporate Presentation: example
Corporate Presentation: example
Luiz Fernando Lizardo Rodrigues
 
3 Music Video Analyses by Thomas Griffiths - 0601 FINAL
3 Music Video Analyses by Thomas Griffiths - 0601 FINAL3 Music Video Analyses by Thomas Griffiths - 0601 FINAL
3 Music Video Analyses by Thomas Griffiths - 0601 FINALThomas Griffiths
 
An Empirical Characterization of Touch-Gesture Input-Force on Mobile Devices
An Empirical Characterization of Touch-Gesture Input-Force on Mobile DevicesAn Empirical Characterization of Touch-Gesture Input-Force on Mobile Devices
An Empirical Characterization of Touch-Gesture Input-Force on Mobile Devices
University of Sussex
 
Ash edu 695 week 5 dq 2
Ash edu 695 week 5 dq 2Ash edu 695 week 5 dq 2
Ash edu 695 week 5 dq 2
robertesparza1011
 
Like tears in the rain’ postmodern media
Like tears in the rain’ postmodern mediaLike tears in the rain’ postmodern media
Like tears in the rain’ postmodern mediaThomas Griffiths
 
Big Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social MediaBig Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social Media
R A Akerkar
 
Muhammad ali
Muhammad aliMuhammad ali
Muhammad ali
blancaales
 
Muhammad Ali
Muhammad AliMuhammad Ali
Muhammad AliAlen_99
 
Top 3 design patterns in Map Reduce
Top 3 design patterns in Map ReduceTop 3 design patterns in Map Reduce
Top 3 design patterns in Map Reduce
Edureka!
 
Enterprise Demand Management Framework
Enterprise Demand Management FrameworkEnterprise Demand Management Framework
Enterprise Demand Management Framework
Luiz Fernando Lizardo Rodrigues
 
Ali
AliAli
Ali
Nick535
 
The Prospect of IoT in the Oil & Gas
The Prospect of IoT in the Oil & Gas The Prospect of IoT in the Oil & Gas
The Prospect of IoT in the Oil & Gas
Ghazi Wadi, PMP
 
Big Data For Flight Delay Report
Big Data For Flight Delay ReportBig Data For Flight Delay Report
Big Data For Flight Delay Report
JSPM's JSCOE , Pune Maharashtra.
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyonddatasalt
 

Viewers also liked (15)

Corporate Presentation: example
Corporate Presentation: exampleCorporate Presentation: example
Corporate Presentation: example
 
3 Music Video Analyses by Thomas Griffiths - 0601 FINAL
3 Music Video Analyses by Thomas Griffiths - 0601 FINAL3 Music Video Analyses by Thomas Griffiths - 0601 FINAL
3 Music Video Analyses by Thomas Griffiths - 0601 FINAL
 
An Empirical Characterization of Touch-Gesture Input-Force on Mobile Devices
An Empirical Characterization of Touch-Gesture Input-Force on Mobile DevicesAn Empirical Characterization of Touch-Gesture Input-Force on Mobile Devices
An Empirical Characterization of Touch-Gesture Input-Force on Mobile Devices
 
Ash edu 695 week 5 dq 2
Ash edu 695 week 5 dq 2Ash edu 695 week 5 dq 2
Ash edu 695 week 5 dq 2
 
Like tears in the rain’ postmodern media
Like tears in the rain’ postmodern mediaLike tears in the rain’ postmodern media
Like tears in the rain’ postmodern media
 
예측
예측예측
예측
 
Big Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social MediaBig Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social Media
 
Muhammad ali
Muhammad aliMuhammad ali
Muhammad ali
 
Muhammad Ali
Muhammad AliMuhammad Ali
Muhammad Ali
 
Top 3 design patterns in Map Reduce
Top 3 design patterns in Map ReduceTop 3 design patterns in Map Reduce
Top 3 design patterns in Map Reduce
 
Enterprise Demand Management Framework
Enterprise Demand Management FrameworkEnterprise Demand Management Framework
Enterprise Demand Management Framework
 
Ali
AliAli
Ali
 
The Prospect of IoT in the Oil & Gas
The Prospect of IoT in the Oil & Gas The Prospect of IoT in the Oil & Gas
The Prospect of IoT in the Oil & Gas
 
Big Data For Flight Delay Report
Big Data For Flight Delay ReportBig Data For Flight Delay Report
Big Data For Flight Delay Report
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyond
 

Similar to Content Analyst - Conceptualizing LSI Based Text Analytics White Paper

Empowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic EnrichmentEmpowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic Enrichment
The Digital Group
 
Technical Whitepaper: A Knowledge Correlation Search Engine
Technical Whitepaper: A Knowledge Correlation Search EngineTechnical Whitepaper: A Knowledge Correlation Search Engine
Technical Whitepaper: A Knowledge Correlation Search Engine
s0P5a41b
 
Extracting and Reducing the Semantic Information Content of Web Documents to ...
Extracting and Reducing the Semantic Information Content of Web Documents to ...Extracting and Reducing the Semantic Information Content of Web Documents to ...
Extracting and Reducing the Semantic Information Content of Web Documents to ...
ijsrd.com
 
Information Architecture Primer - Integrating search,tagging, taxonomy and us...
Information Architecture Primer - Integrating search,tagging, taxonomy and us...Information Architecture Primer - Integrating search,tagging, taxonomy and us...
Information Architecture Primer - Integrating search,tagging, taxonomy and us...Dan Keldsen
 
Tovek Presentation by Livio Costantini
Tovek Presentation by Livio CostantiniTovek Presentation by Livio Costantini
Tovek Presentation by Livio Costantinimaxfalc
 
Metaphic or the art of looking another way.
Metaphic or the art of looking another way.Metaphic or the art of looking another way.
Metaphic or the art of looking another way.
Suresh Manian
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern Mining
IOSR Journals
 
16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx
RAJU852744
 
16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx
herminaprocter
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
Derek Kane
 
XXIX Charleston 2009 Silverchair Kerner
XXIX Charleston 2009 Silverchair KernerXXIX Charleston 2009 Silverchair Kerner
XXIX Charleston 2009 Silverchair Kerner
Darrell W. Gunter
 
Henry stewart dam2010_taxonomicsearch_markohurst
Henry stewart dam2010_taxonomicsearch_markohurstHenry stewart dam2010_taxonomicsearch_markohurst
Henry stewart dam2010_taxonomicsearch_markohurstWIKOLO
 
II-SDV 2012 Patent Prior-Art Searching with Latent Semantic Analysis
II-SDV 2012 Patent Prior-Art Searching with Latent Semantic AnalysisII-SDV 2012 Patent Prior-Art Searching with Latent Semantic Analysis
II-SDV 2012 Patent Prior-Art Searching with Latent Semantic AnalysisDr. Haxel Consult
 
Return to the Materials Digital Humanities Conference 2013
Return to the Materials Digital Humanities Conference 2013Return to the Materials Digital Humanities Conference 2013
Return to the Materials Digital Humanities Conference 2013Sean Connolly
 
TEXT MINING-TAPPING HIDDEN KERNELS OF WISDOM
TEXT MINING-TAPPING HIDDEN KERNELS OF WISDOMTEXT MINING-TAPPING HIDDEN KERNELS OF WISDOM
TEXT MINING-TAPPING HIDDEN KERNELS OF WISDOM
ITC Infotech
 
Search Solutions 2011: Successful Enterprise Search By Design
Search Solutions 2011: Successful Enterprise Search By DesignSearch Solutions 2011: Successful Enterprise Search By Design
Search Solutions 2011: Successful Enterprise Search By Design
Marianne Sweeny
 
Information Retrieval on Text using Concept Similarity
Information Retrieval on Text using Concept SimilarityInformation Retrieval on Text using Concept Similarity
Information Retrieval on Text using Concept Similarity
rahulmonikasharma
 
NLP and its applications
NLP and its applicationsNLP and its applications
NLP and its applicationsUtphala P
 
Rule Legal Services, General Counsel, And Miscellaneous Claims Service Organi...
Rule Legal Services, General Counsel, And Miscellaneous Claims Service Organi...Rule Legal Services, General Counsel, And Miscellaneous Claims Service Organi...
Rule Legal Services, General Counsel, And Miscellaneous Claims Service Organi...legalservices
 

Similar to Content Analyst - Conceptualizing LSI Based Text Analytics White Paper (20)

Empowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic EnrichmentEmpowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic Enrichment
 
Technical Whitepaper: A Knowledge Correlation Search Engine
Technical Whitepaper: A Knowledge Correlation Search EngineTechnical Whitepaper: A Knowledge Correlation Search Engine
Technical Whitepaper: A Knowledge Correlation Search Engine
 
Extracting and Reducing the Semantic Information Content of Web Documents to ...
Extracting and Reducing the Semantic Information Content of Web Documents to ...Extracting and Reducing the Semantic Information Content of Web Documents to ...
Extracting and Reducing the Semantic Information Content of Web Documents to ...
 
Information Architecture Primer - Integrating search,tagging, taxonomy and us...
Information Architecture Primer - Integrating search,tagging, taxonomy and us...Information Architecture Primer - Integrating search,tagging, taxonomy and us...
Information Architecture Primer - Integrating search,tagging, taxonomy and us...
 
Tovek Presentation by Livio Costantini
Tovek Presentation by Livio CostantiniTovek Presentation by Livio Costantini
Tovek Presentation by Livio Costantini
 
Metaphic or the art of looking another way.
Metaphic or the art of looking another way.Metaphic or the art of looking another way.
Metaphic or the art of looking another way.
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern Mining
 
16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx
 
16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
XXIX Charleston 2009 Silverchair Kerner
XXIX Charleston 2009 Silverchair KernerXXIX Charleston 2009 Silverchair Kerner
XXIX Charleston 2009 Silverchair Kerner
 
Henry stewart dam2010_taxonomicsearch_markohurst
Henry stewart dam2010_taxonomicsearch_markohurstHenry stewart dam2010_taxonomicsearch_markohurst
Henry stewart dam2010_taxonomicsearch_markohurst
 
II-SDV 2012 Patent Prior-Art Searching with Latent Semantic Analysis
II-SDV 2012 Patent Prior-Art Searching with Latent Semantic AnalysisII-SDV 2012 Patent Prior-Art Searching with Latent Semantic Analysis
II-SDV 2012 Patent Prior-Art Searching with Latent Semantic Analysis
 
Return to the Materials Digital Humanities Conference 2013
Return to the Materials Digital Humanities Conference 2013Return to the Materials Digital Humanities Conference 2013
Return to the Materials Digital Humanities Conference 2013
 
TEXT MINING-TAPPING HIDDEN KERNELS OF WISDOM
TEXT MINING-TAPPING HIDDEN KERNELS OF WISDOMTEXT MINING-TAPPING HIDDEN KERNELS OF WISDOM
TEXT MINING-TAPPING HIDDEN KERNELS OF WISDOM
 
Search Solutions 2011: Successful Enterprise Search By Design
Search Solutions 2011: Successful Enterprise Search By DesignSearch Solutions 2011: Successful Enterprise Search By Design
Search Solutions 2011: Successful Enterprise Search By Design
 
Information Retrieval on Text using Concept Similarity
Information Retrieval on Text using Concept SimilarityInformation Retrieval on Text using Concept Similarity
Information Retrieval on Text using Concept Similarity
 
Word Embedding In IR
Word Embedding In IRWord Embedding In IR
Word Embedding In IR
 
NLP and its applications
NLP and its applicationsNLP and its applications
NLP and its applications
 
Rule Legal Services, General Counsel, And Miscellaneous Claims Service Organi...
Rule Legal Services, General Counsel, And Miscellaneous Claims Service Organi...Rule Legal Services, General Counsel, And Miscellaneous Claims Service Organi...
Rule Legal Services, General Counsel, And Miscellaneous Claims Service Organi...
 

Content Analyst - Conceptualizing LSI Based Text Analytics White Paper

  • 1. white paper Conceptualizing LSI-Based Text Analytics John Felahi Senior Vice President of Products, Content Analyst, LLC. Trevor J. Morgan, Ph.D. Relevance Analytics
  • 2. 1© 2013 Content Analyst, LLC. All rights reserved. Content Analyst, CAAT and the Content Analyst and CAAT logos are registered trademarks of Content Analyst, LLC in the United States. All other marks are the property of their respective owners. A Better Way to Find Information Corporations are undeniably overloaded with information. They thrive on the innovation and intellectual property captured in their corporate documents. However, when document sets grow at explosive rates like they have over the past five years, critical information becomes buried and lost. The unfortunate reality is that lost information is useless information. Companies lose a key competitive advantage by not being able to find and exploit valuable information embedded within large document sets.1 The traditional solution to this problem is Boolean (keyword) search or keyword-based document tagging and organization. These Boolean search and analytics tools may work just fine when the user knows exactly what word or words the desired documents must contain, but this is rarely the case. Nobody can ever anticipate what specific words or phrases are in a document— finding the right query is laborious and often futile. The goal, then, is to make information more findable by complementing keyword search systems with advanced text analytics technologies. Advanced text analytics encompasses the complex convergence of linguistics, mathematics, probability theory, and software development. Advanced text analytics software employs sophisticated algorithms in an attempt to “read” document content and figure out what that content actually means. These solutions provide users with a rich array of features, including concept-based search and document organization functions. Advanced text analytics software tries to determine document content and meaning in the same way humans do, except on a scale of volume far beyond human capabilities. The purpose is not to replace humans but rather to refocus humans on what they do best.2 “Text ANALYTICS SOFTWARE EMPLOYS SOPHISTICATED ALGORITHMS IN AN ATTEMPT TO "READ" DOCUMENT CONTENT AND FIGURE OUT WHAT THAT CONTENT ACTUALLY MEANS. ”
LSI is a very powerful and scalable text analytics technology, but the key to unleashing its potential is understanding how it works and what problems it solves.

Content Discovery is the Major Differentiator

As mentioned earlier, all text analytics technologies need to discover the contents of the documents presented to them. This indexing process involves contextual term analysis and is the first step in enabling users to work with the information in a document set. Without this step, the text analytics engine cannot possibly execute a search or categorize documents: how could it when the contents of the documents are unknown? The way in which an analytics engine discovers document content is therefore critical to overall functionality and is a major distinguishing factor.

Many competing indexing technologies have been developed and brought to market. Comprehensively exploring all of them and how they function would be time consuming; however, a quick look at the evolution of text analytics is instructive in comparing the relative strengths of the predominant approaches. Viewed from the most general perspective, every text analytics platform falls into one of the following types:

»» Lexical, focusing primarily on the linguistic and semantic indicators in text.
»» Probabilistic, focusing primarily on the statistical potentialities in text.
»» LSI-based, focusing primarily on the holistic co-occurrences of unique terms in text.
»» Hybrid, combining elements of any of the previous three.

The First Generation—Term Occurrence and Keyword Indexing

Boolean-based search engines initially employed a simple linguistic method to index documents. These platforms performed semantic discovery that amounted to counting all the individual words (and word frequencies) found in a document set. The resultant index, therefore, was a comprehensive term look-up list, with varying ranks applied depending on term frequency within the document set. Over time, content enrichment methods have been added, but these remain lexically based given the nature of the system. With these systems, when a user submits a search term or phrase (known as the query), the search engine compares the query to the contents of the look-up list.
A match occurs when a document in the index satisfies the conditional logic of the query. For example, a query for “dog” returns all documents containing the word “dog” one or more times, in ranked order—documents with many instances of the word rank higher in the results list than ones with fewer instances. A query for “dog NOT cat” returns all documents containing the word “dog” but not the word “cat,” because of the conditional logic conveyed by NOT. The advent of this keyword indexing and search methodology revolutionized the way users found the documents for which they were searching. Keyword technology still solves many problems in information management today, especially when the precise query is known to the user.
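The mechanics of this first generation are simple enough to sketch. The toy example below (invented documents and function names; real engines add stemming, stop-word handling, and far richer relevance ranking) builds the term look-up list and answers the “dog” and “dog NOT cat” queries just described:

```python
# A minimal sketch of first-generation keyword indexing (hypothetical
# data and names, not any vendor's engine).
from collections import Counter, defaultdict

docs = {
    1: "the dog barked at the mail carrier",
    2: "cat and dog owners visited the clinic",
    3: "the cat slept all afternoon",
}

# The "look-up list": for each term, which documents contain it and how often.
index = defaultdict(Counter)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term][doc_id] += 1

def search(term):
    """Rank matching documents by term frequency, highest first."""
    return sorted(index[term].items(), key=lambda kv: -kv[1])

def search_not(term, excluded):
    """Boolean 'term NOT excluded': documents with term but not excluded."""
    return [(d, n) for d, n in search(term) if d not in index[excluded]]

print(search("dog"))             # [(1, 1), (2, 1)]
print(search_not("dog", "cat"))  # [(1, 1)] -- doc 2 also contains "cat"
```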
Indeed, when the user is looking for the presence (or absence) of very specific words or phrases, keyword search is still the most effective discovery tool.

What to Like About Keyword Technology

»» A simple approach to indexing the content of documents.
»» An easy-to-understand results list—prevalence of query terms affects rank.
»» A good way to find very specific or uniquely worded information in smaller document sets.

While technically a linguistic approach (it does, after all, focus on the language within the indexed documents), keyword indexing was not designed to discover word and document meaning. Furthermore, it is prone to false positives and false negatives. One problematic issue with keyword indexing and search is that the presence of the query word (or words) in a document does not guarantee that document’s relevance. For example, a user looking for documents about “fraud” quickly finds that lots of documents contain the word “fraud,” but those documents are not necessarily about fraudulence. The user is then forced to construct ever more complex queries strung together with conditional operators (such as AND and OR) in order to work within the search engine’s method and find what is truly being sought.

The problem is that most of the time what is being sought is not a word or series of words at all. Users really want to find documents in the same way their own brains work with information every hour of every day: they want to find concepts, without having to know the exact terms that convey that meaning to the analytics engine.

Shortcomings of Keyword Technology

»» Likelihood of introducing false positives and negatives with vague queries.
»» Overly complex query construction when searching for more expansive ideas or themes.

The Next Generation: Linguistic Analysis and Indexing

A concept is different from a word or series of words, no matter how much conditional logic is thrown in to make the keyword query more nuanced. According to the online American Heritage Dictionary of the English Language, a concept is “a general idea or understanding of something.” In a document, a concept might be an expressed idea or thematic content articulated through any number of different words. A word is a unique entity with a finite number of restricted meanings; a concept is a larger idea not restricted to any particular terminology. To distinguish between the two, consider the word (or keyword) “music” versus the idea of auditory stimulation that excites the senses and the mind. The concept of music could encompass thousands of different ideas—rock and roll, blues, Woodstock, Beethoven, Justin Bieber.

Undeniably, human thoughts occur most of the time in the form of ideas and concepts, not specific words. Rarely will a keyword or series of keywords fully express all the facets of a concept. Keyword search, therefore, can be inherently inadequate for most users’ needs when the “right” query is unknown. This realization spurred researchers and software developers past first-generation keyword text analytics.
The race was on to figure out how to find the concepts embedded within documents, the “aboutness,” rather than just the individual words that composed them.3

The next generation of text analytics platforms continued to rely on the analysis of language to find conceptual content within documents, just as earlier keyword technologies had. These linguistic indexing technologies, though, went beyond the keyword look-up process. They incorporated algorithms and ancillary reference tools in order to interpret the complexities of language found within documents.

“Keyword indexing was not designed to discover word and document meaning.”
Such lexical analytics software came pre-programmed with the static rules of language (grammar) and word use (semantics). Reference cartridges or modules such as dictionaries and thesauri fed into this linguistic indexing methodology. The approach was certainly more sophisticated than keyword indexing, but the problem was and still is that the rules and conventions of language—grammar, semantics, and colloquial usage—are so fluid and changeable that early indexing software either could not derive concepts accurately and effectively or could not keep up with the ever-changing dynamics of modern language. In the 21st century, a word can come into being overnight, or an existing word can be imbued with vastly different meaning and propagated around the world within hours. Linguistic indexing engines, relying on laborious human pre-programming and updating, could not keep pace.

Another problem with the linguistic approach is that it was language dependent, not language agnostic. A dictionary is a reference tool for words in a given language, so if a linguistic indexing engine encountered a document in a language not supported by its reference dictionary, it could not derive meaning from that document. Some other technique or technology was required to resolve these shortcomings.

The Irony Is…

Language, when viewed from a linguistic perspective, seems hopelessly complex and impenetrable due to its rich variety and dynamic state. Language, when treated from a mathematical perspective, becomes elegantly understandable. The importance of mathematics in language analysis continues to gain traction.4

Advanced Text Analytics: Mathematical Analysis and Indexing

In a counter-intuitive fashion, researchers and innovators turned away from the analysis of language rules in order to figure out conceptuality within text. Instead, they approached the content indexing problem from a mathematical perspective. Employing an impressive array of mathematical techniques during the indexing process, text analytics vendors were able to create even more advanced text analytics software. No longer were indexing engines dependent on static linguistic references that required frequent updating as language usage changed.

Advanced text analytics engines now rely either on statistical analyses of probable meaning within a document—known as a probabilistic indexing approach—or on linear algebraic analyses of total word co-occurrence and correlation—known as an LSI-based indexing approach—to figure out the concepts contained within a document. Other approaches hybridize these two techniques. With these math-based techniques, conceptual search and classification across large volumes of documents is possible with a very high degree of reliability, flexibility, and adaptability. Sophisticated text analytics has finally arrived.

The technology behind probabilistic text analysis leverages research based in statistical computation. It builds upon the ideas of probable confidence in an assumption and the continuous updating of that confidence based on evidence. Applied to text analysis, a probabilistic approach—relying on algorithms rooted in statistical analysis—analyzes local term patterns and computes probable or likely meanings.
These calculations depend in part on previous assumptions of meaning, which means that faulty assumptions introduce error into the textual analysis. The length of the text also influences the accuracy of probabilistic indexing, with shorter texts presenting a particular problem in conceptual derivation.5

“Language, when treated from a mathematical perspective, becomes elegantly understandable.”
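To illustrate the updating idea in miniature (a toy only: the topics, priors, and per-topic term likelihoods are invented, and no vendor's actual model is implied), consider a belief about a document's topic revised term by term via Bayes' rule:

```python
# Toy probabilistic topic inference: start with a prior belief and
# update the confidence as each term is observed (invented numbers).
priors = {"finance": 0.5, "pets": 0.5}

# Assumed term likelihoods P(term | topic); real systems estimate these.
likelihood = {
    "finance": {"stocks": 0.20, "market": 0.15, "dog": 0.01},
    "pets":    {"stocks": 0.01, "market": 0.02, "dog": 0.25},
}

def update(beliefs, term, floor=1e-3):
    # posterior(topic) is proportional to prior(topic) * P(term | topic)
    post = {t: p * likelihood[t].get(term, floor) for t, p in beliefs.items()}
    z = sum(post.values())
    return {t: p / z for t, p in post.items()}

beliefs = dict(priors)
for term in "stocks market market".split():
    beliefs = update(beliefs, term)
print(beliefs)  # confidence shifts strongly toward "finance"
```

The two caveats above fall straight out of the sketch: skewed priors or likelihoods propagate directly into the posterior, and a short text supplies too few updates for the belief to settle.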
Latent Semantic Indexing, or LSI, is a linear algebraic approach to deriving conceptuality from the textual content of a document. LSI uses sophisticated linear algebraic computations to assess term co-occurrence and contextual associations. While the scale of the calculations “under the hood” is extensive, the overall approach can be explained in comprehensible and non-technical language. Once understood, the utility of the LSI technique and the myriad features it enables become quite apparent.

LSI and Concept-Based Text Analytics

In order to figure out what concepts are contained within a document, an LSI-based text analytics engine first must acquire an understanding of term interrelationships. Keep in mind that, as a math-based indexing technique, an LSI engine has no ancillary linguistic references or pre-programmed understanding of language at all, so this understanding is mathematical rather than linguistic. Using algorithms based in linear algebra, the LSI technique generates this understanding by assessing all the words within a document and within an entire document set. The engine then calculates word co-occurrences across the entire document set—accounting for different emphases on certain word prevalence—to figure out the interrelationships between words and the logical combinations of word co-occurrence that lead to comprehensible concepts. This ability to derive conceptuality from text is one of LSI’s most valuable commercial traits.6

To state it another way, we can compare LSI to human thought and communication. A human must use logical and accepted word combinations in order to convey a thought, regardless of the language used. Too many illogical word co-occurrences create incomprehensible gibberish. LSI uses advanced math to figure out these inherent combinations based on the documents themselves, allowing it to respond to a conceptual query with documents containing the same concept—again, not the same words, but the same or similar concept. In a way, LSI mimics the human brain in its attempt to make sense out of word co-occurrences and figure out what text really means. If it cannot figure out what the words mean, that probably indicates that the word combinations are meaningless. As Bradford states, concepts derived from LSI “correspond remarkably well with judgments of conceptual similarity made by human beings.”7

LSI is not new technology. As a matter of fact, it uses individual mathematical maneuvers that have been known to scientists and mathematicians for decades. In the 1980s, a group of Bellcore scientists applied these mathematical principles to their research in language and subsequently patented the LSI technique. The technology changed hands in the 1990s and again in the first decade of this century, until a new organization—Content Analyst Company—was created in 2004 to advance the technology in several markets. Along the way, numerous other patents have been granted around the original LSI approach, allowing it to grow into a full-blown advanced text analytics platform.
“In a way, LSI mimics the human brain in its attempt to make sense out of word co-occurrences and figure out what text really means.”
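The linear algebra behind this explanation can be sketched compactly. The toy below (a hypothetical four-document corpus; CAAT's production engine is proprietary and far more elaborate) builds a term-document matrix, truncates its singular value decomposition, and answers a conceptual query by cosine similarity in the reduced space:

```python
# A minimal LSI sketch on a toy corpus (illustrative data and names only).
import numpy as np

docs = [
    "dog barks cat runs",         # pets
    "young dog puppy plays",      # pets
    "stocks fell market closed",  # finance
    "market rally lifted stocks", # finance
]

# Term-document matrix A: A[i, j] counts occurrences of term i in document j.
vocab = sorted({w for d in docs for w in d.split()})
row = {t: i for i, t in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[row[w], j] += 1.0

# Truncated SVD: keeping the top k singular triplets projects terms and
# documents into a k-dimensional concept space where co-occurring terms
# (and the documents that use them) land close together.
k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
doc_vecs = (np.diag(S[:k]) @ Vt[:k]).T   # one k-dimensional vector per doc

def fold_in(query):
    """Project a query into the concept space: q_hat = q^T U_k S_k^-1."""
    q = np.zeros(len(vocab))
    for w in query.split():
        if w in row:
            q[row[w]] += 1.0
    return (q @ U[:, :k]) / S[:k]

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

q = fold_in("puppy")
for j, d in enumerate(docs):
    print(f"{cosine(q, doc_vecs[j]):+.2f}  {d}")
# Both pet documents score near +1.0 -- including "dog barks cat runs",
# which never contains the word "puppy" -- while the finance documents
# score near 0. That is concept matching rather than keyword matching.
```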
LSI and Some Misconceptions

Throughout the different stages of LSI’s development and evolution, the challenges its inventors and developers had to overcome gave rise to some misconceptions about the LSI indexing technique. Most of these misconceptions date from its earliest days, when the technique was just beginning to evolve. They include:

»» LSI is slow and does not scale.
»» LSI is expensive to implement and maintain.
»» LSI is not defensible.
»» LSI does not differentiate semantic nuances.
»» LSI cannot replace human inspection of documents.

One of the biggest misconceptions about LSI technology is that it is slow and non-scalable when presented with large volumes of documents. We can trace this misconception back to the days of less powerful, more expensive hardware. Sometimes a technology is ahead of its time and uses techniques to which the surrounding hardware must catch up. The reality is that LSI was invented and patented at a time when the sophisticated math the technique requires consumed vast resources on the limited hardware available. Hardware constraints admittedly slowed down not only the indexing process but also the text analytics functions carried out post-indexing. As already touched upon, though, the rapid reduction in hardware costs and the huge gains in performance witnessed over the past five years (resulting in inexpensive many-core processors and cheap memory) have eliminated these problems. Not only does an LSI engine now have vast hardware resources available to it on a wide range of servers, but its ongoing evolution has also produced distributed indexing capabilities and load-sharing deployments. Concept searches typically return results in under a second, and hundreds of thousands of documents can be classified, organized, and tagged in mere minutes, as opposed to the days or weeks required for humans to assess large volumes of information. LSI performs its functions rapidly on very affordable hardware.8

The associated perception that LSI is expensive to install and maintain is refuted by the same explanation. Cheap hardware, extensible features, and deployment best practices learned along the way all contribute to an economical answer to anybody’s advanced text analytics needs. For the value it provides, LSI is a compelling indexing technology for concept-based analytics.

Microchip Flashback

In 1990, Intel® introduced the 33 MHz 486 microprocessor with a processing speed of 27 MIPS. Today’s Intel Xeon® chip runs at 4.4 GHz, roughly 133 times the clock rate of its 1990 ancestor, and is orders of magnitude more powerful than the state-of-the-art 486 was at the time.
Another misconception plays upon human fears of automation by questioning how defensible the results of LSI indexing and concept-based analytics are. Humans are always suspicious when “the machine” encroaches upon tasks and abilities thought to be best performed by intelligent, well-trained people. Because LSI effectively removes humans from the process of reading and assessing the content of documents, critics can easily play upon the fear of the unknown: after all, if no humans read all the documents, who really knows what’s in them? This fear is understandable but has no basis in research or documented observation. The algorithms and proprietary technology that power LSI indexing are well documented and can be defended by the principles of advanced mathematics. The reality is that LSI’s approach, driven by sophisticated analysis of total word co-occurrence, can be defended from both a mathematical and a linguistic perspective.

A particularly erroneous claim is that LSI cannot detect word similarities or semantic variations. This misconception insists that LSI cannot distinguish between “cool” as an indicator of temperature and “cool” as a qualitative judgment. The claim is patently false. As a matter of fact, LSI is ideal for penetrating the mysteries of semantics—including synonymy and polysemy—and, based on its core approach (term co-occurrence), actually resolves semantic quandaries in the same way humans do.9 If one person says to another, “this is cool,” the recipient might not immediately understand what is being indicated, especially if the speaker has touched something that might actually be cold. The hearer might ask for clarification with something elegant like “what is?” The first person might then elaborate: “This bronze statue is really cool. It’s quite post-modern.” With the additional words in the follow-up, the speaker provides enough term co-occurrence for the hearer to understand the meaning, which has nothing to do with temperature. Conversely, a metal smith casting the same bronze statue might also say “this is cool” to refer to temperature, indicating that the statue is ready to be handled without getting burned. LSI analysis of text works exactly like this and mirrors the human ability to interpret meaning based on term co-occurrence.

A final misconception involves the question of technology replacing human assessment of document content. It is human nature to resist technologies and processes that we don’t fully understand or that we feel are replacing us, but that does not mean the technology itself is not effective or appropriate to implement. With the nearly exponential explosion in the volume of enterprise data, technology must replace the inadequate solutions provided by earlier text analytics techniques and expensive human activity. The knowledge management market has indeed reached the inflection point where inexpensive hardware coupled with the advanced capabilities of CAAT™ creates a compelling counter-argument for the technophobic.

LSI and CAAT™

As with all other vendors of search and text analytics technologies, Content Analyst Company has had to focus on the tenets of precision, speed, and flexibility.
Overcoming the inherent obstacles found within text that distract CAAT™ and hamper its ability to determine document conceptuality was also a necessity. For example, filtering out header and footer information in emails is necessary in order to extract the useful authored content in these types of documents. Considerable research has gone into preparing document text for more precise conceptual analysis.

In the earlier days of the technology, the inventors and developers had the additional problem of less powerful but much more expensive hardware. The math required to perform LSI indexing is not insignificant, so the workstations and servers of the 1990s and early 2000s had to be robust, with as much memory (RAM) and CPU horsepower as possible. Furthermore, the 32-bit processors and operating systems prevalent at that time could not address large amounts of RAM. Until the inflection point of more powerful but less expensive hardware arrived in the mid-2000s, the ongoing development of the CAAT engine focused on refining the code base for more accurate and speedier functionality. Distributed subsystem deployment and text filtering capabilities were also incorporated along the way, the latter of which suppresses extraneous or “garbage text” during indexing.
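A simplified stand-in for that filtering step might look like the following (the header fields and footer markers here are invented examples, not CAAT's actual rules):

```python
# Toy pre-index filter: suppress email headers and boilerplate footers so
# that only the authored content reaches the indexing engine.
import re

HEADER = re.compile(r"^(From|To|Cc|Subject|Date|Sent):", re.IGNORECASE)
FOOTER_MARKERS = ("-- ", "CONFIDENTIALITY NOTICE", "This email and any attachments")

def authored_content(raw_email: str) -> str:
    body = []
    for line in raw_email.splitlines():
        if HEADER.match(line):
            continue                  # drop header fields
        if line.startswith(FOOTER_MARKERS):
            break                     # stop at the boilerplate footer
        body.append(line)
    return "\n".join(body).strip()
```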
Finally, over the years the evolution of CAAT grew to encompass not only traditional concept search and concept-based document classification (multiple techniques and optimizing algorithms), but also dynamic clustering, document summarization, primary language identification, and advanced text comparisons (for thread detection in emails and for identifying duplicate text). CAAT is now an advanced text analytics platform with dozens of discrete analytical capabilities that developers can assemble in combination to creatively add concept-based text analytics to larger software solutions.

Key Capabilities

The key features of CAAT include concept-based search and document organization/classification. Concept search and concept-based document classification are far more powerful than keyword-based approaches, for the reasons discussed previously. The user no longer has to know the “right” words to use when submitting queries, and documents that are related to each other conceptually can be grouped together regardless of whether they share the same terminology. Document classification is particularly attractive for software vendors who need workflow automation and document routing within their solutions—such as enterprise content management, enterprise archiving, compliance, and e-discovery vendors. The ability to rapidly organize large volumes of documents based on their relevance to each other, or to an overarching category, reduces or eliminates costly human document inspection.

With the reduction of human document inspection come the benefits of more precise classification results. Properly trained text analytics software such as CAAT assesses the conceptual and thematic content of documents more objectively and consistently than people do—it does not get tired (and therefore increasingly inconsistent), and it does not get tripped up by the interpretive nuances of language that plague human inspection.

Because of the importance of classification, CAAT includes two major ways to organize documents: clustering and categorization. These two modes of document classification differ in the amount of human intervention required to establish and define the organizational structure, known as the taxonomy. In theory, a taxonomy is nothing more than a set of discrete organizational units—arranged in either a flat or a hierarchical manner—into which documents can be placed, along with the rules dictating the type of content a document must contain in order to qualify for a category. In practice, taxonomy development and maintenance is an enormous undertaking for enterprises, requiring highly specialized knowledge workers (known as taxonomists) and their support staff. CAAT’s ability to cluster documents into an automatically generated taxonomy, or alternately to accommodate the more refined process of human-trained document categorization, means that nearly any automated workflow requiring classification can be supported.

The value propositions for these two classification methods are compelling. Automated clustering provides rapid organization of documents so that users can quickly understand the conceptual composition and distribution within large document sets.
Clustering automatically classifies documents based on each document’s predominant concept or theme—it also creates the taxonomy structure and category naming scheme without human intervention. Because the unsupervised clustering feature is dynamic and can run just-in-time within a solution’s workflow, it is ideal when quick, general insight into a document set is required.

“Documents that are related to each other conceptually can be grouped together regardless of whether they share the same terminology.”
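In spirit, though certainly not in implementation detail (which Content Analyst does not publish), unsupervised clustering can be pictured as grouping document vectors in the LSI concept space. This sketch reuses doc_vecs from the earlier toy LSI example and applies a plain k-means loop; the distance computed at the end corresponds to the “document score” idea captioned in Figure 1.1 later in the paper:

```python
# Toy unsupervised clustering in the LSI concept space (reuses doc_vecs
# from the earlier sketch; illustrative only).
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each document to its nearest cluster center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned documents.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

labels, centers = kmeans(doc_vecs, k=2)
# Distance to the cluster center is the "closeness" score of Figure 1.1:
# a smaller distance means the document is more central to the concept.
scores = np.linalg.norm(doc_vecs - centers[labels], axis=1)
```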
Categorization, on the other hand, allows users to participate in the taxonomy development and analytics training process. As with all other types of supervised document classification technology, categorization demands more upfront effort during taxonomy development and the requisite learning process that defines the categories for CAAT. The compelling benefit is a much more refined and accurate result set placed into pre-defined categories of interest. Categorization is a powerful solution in which the best of human insight and software efficiency can be combined to yield the most accurate classification results possible.

CAAT also provides a term expansion feature, which can detect, for any word, all the other terms in the indexed document set that are either highly correlated or synonymous with it. Using the “dog” example from earlier, CAAT would also identify “husky,” “spaniel,” “mutt,” “pup,” and “man’s best friend” as highly correlated or synonymous terms.

Being a math-based indexing engine, CAAT has no native understanding of language; the benefit of this is complete language agnosticism. Indexing German documents enables the same analytics features and yields the same accurate results as indexing documents in English, Chinese, or Arabic. Despite being language agnostic, CAAT can detect the differences between languages through its term analysis, and it can therefore identify the primary language of a document. All of these features can be invoked whenever needed in a larger software platform to increase the findability of documents and improve the accuracy and relevance of document classification.

Learning More About CAAT

The power of CAAT and its LSI indexing technology has been integrated into dozens of software solutions in a number of different markets. To learn more about these success stories, go to www.contentanalyst.com.

Figure 1.1: CAAT finds tight groupings of concepts; groups and subgroups are then picked based on settings. Numeric values help interpret results, and document scores indicate closeness to the center of the clusters.

Figure 1.2: Example documents plus a threshold define “hit spheres.” Documents in the search index that fall within the “hit spheres” get categorized.
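The “hit sphere” categorization of Figure 1.2 and the term expansion feature both reduce, conceptually, to neighborhood tests in such a space. The sketch below continues the earlier toy LSI example (the function names and the 0.8 threshold are illustrative inventions, not CAAT parameters):

```python
# Continues the earlier LSI sketch (reuses doc_vecs, U, S, k, vocab, row).
import numpy as np

def categorize(doc_vecs, example_ids, threshold=0.8):
    """Hit-sphere categorization: example documents define a category
    center; documents whose cosine similarity to that center clears the
    threshold fall inside the sphere and receive the category."""
    center = doc_vecs[example_ids].mean(axis=0)
    center /= np.linalg.norm(center)
    sims = doc_vecs @ center / (np.linalg.norm(doc_vecs, axis=1) + 1e-12)
    return [(j, s) for j, s in enumerate(sims) if s >= threshold]

def expand_term(term, top_n=3):
    """Term expansion: terms whose reduced-space vectors point in nearly
    the same direction as the query term are its close correlates."""
    term_vecs = U[:, :k] * S[:k]          # one k-dimensional vector per term
    v = term_vecs[row[term]]
    sims = term_vecs @ v / (
        np.linalg.norm(term_vecs, axis=1) * np.linalg.norm(v) + 1e-12)
    ranked = sorted((s, t) for t, s in zip(vocab, sims) if t != term)
    return [t for s, t in reversed(ranked)][:top_n]

print(categorize(doc_vecs, example_ids=[2]))  # both finance docs qualify
print(expand_term("dog"))                     # e.g., other pet-context terms
```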
About Content Analyst Company

We provide powerful and proven advanced analytics that dramatically reduce the time needed to discern relevant information from unstructured data. CAAT, our dynamic suite of text analytics technologies, delivers significant value wherever knowledge workers need to extract insights from large amounts of unstructured data. Our capabilities are easily integrated into any software solution, and our support strategy for our partners is second to none.

References

1 Frank Ohlhorst, “The Promise of Big Data,” InfoWorld, September 2010.
2 John Markoff, “Armies of Expensive Lawyers, Replaced by Cheaper Software,” New York Times, March 4, 2011.
3 C. Korycinski and Alan F. Newell, “Natural Language Processing and Automatic Indexing,” The Indexer, April 1990.
4 Mark Liberman, “Linguists Who Count,” Language Log, May 28, 2009.
5 Yangqiu Song et al., “Short Text Conceptualization Using a Probabilistic Knowledgebase,” Proceedings of the 22nd International Joint Conference on Artificial Intelligence, 2011.
6 Roger Bradford, “Comparability of LSI and Human Judgment in Text Analysis Tasks,” Proceedings of the Applied Computing Conference, September 2009.
7 Bradford, “Comparability of LSI and Human Judgment in Text Analysis Tasks.”
8 Roger Bradford, “Implementation Techniques for Large-Scale Latent Semantic Indexing Applications,” Proceedings of the 20th ACM International Conference on Information and Knowledge Management, October 2011.
9 Bradford, “Comparability of LSI and Human Judgment in Text Analysis Tasks.”