white paper
Conceptualizing LSI-Based
Text Analytics
John Felahi
Senior Vice President of Products, Content Analyst, LLC.
Trevor J. Morgan, Ph.D.
Relevance Analytics
A Better Way to Find Information
Corporations are undeniably overloaded with information.
They thrive on the innovation and intellectual property
captured in their corporate documents. However, when
document sets grow at explosive rates, as they have over the past five years, critical information becomes buried
and lost. The unfortunate reality is that lost information
is useless information. Companies lose a key competitive
advantage by not being able to find and exploit valuable
information embedded within large document sets.1
The traditional solution to this problem is Boolean
(keyword) search or keyword-based document tagging
and organization. These Boolean search and analytics
tools may work just fine when the user knows exactly
what word or words the desired documents must contain,
but this is rarely the case. Nobody can ever anticipate
what specific words or phrases are in a document—
finding the right query is laborious and often futile.
The goal, then, is to make information more findable
by complementing keyword search systems with
advanced text analytics technologies. Advanced text
analytics encompasses the complex convergence
of linguistics, mathematics, probability theory, and
software development. Advanced text analytics software
employs sophisticated algorithms in an attempt to “read”
document content and figure out what that content
actually means. These solutions provide users with a rich
array of features, including concept-based search and
document organization functions. Advanced text analytics
software tries to determine document content and
meaning in the same way humans do, except on a scale
of volume far beyond human capabilities. The purpose is
not to replace humans but rather to refocus humans on
what they do best.2
Differentiators in Text Analytics
Text analytics solutions distinguish themselves in two
ways:
1.	 The manner in which the engine discovers the
meaning (concepts) of text in a document set. This
is essentially the content discovery aspect of text
analytics.
2.	 The variety of features that end users can leverage
once the engine has discovered all document content.
The importance of the content discovery phase cannot be overstated. The following sections focus on how
different technologies approach the daunting challenge
of discovering the meaning embedded within text
documents. The discussion shows the natural evolution
of indexing technologies to the present generation
of advanced text analytics engines, which deal with
document conceptuality. It also points to an extremely
viable advanced text analytics technology known as
Latent Semantic Indexing (LSI), discusses how LSI
functions, and compares it to alternative technologies.
LSI is a very powerful and scalable text analytics
technology, but the key to unleashing its potential is
understanding how it works and what problems it solves.
Content Discovery is the Major Differentiator
As mentioned earlier, all text analytics technologies need
to discover the contents of the documents presented to
them. This indexing process involves contextual term
analysis and is the first step in enabling users to work
with the information in a document set. Without this step,
the text analytics engine cannot possibly execute a search
or categorize documents: how could it when the contents
of the documents are unknown? The way in which an
analytics engine discovers document content is critical to
overall functionality and is a major distinguishing factor.
Many competing indexing technologies have been
developed and brought to market. Comprehensively
exploring all the different technical solutions and how
they function would be time consuming. However, a
quick look at the evolution of text analytics is instructive
in comparing the relative strengths of the predominant
text analytics approaches. When you view text analytics
technologies from the most general perspective, you
will find that each platform falls into one of the following
types:
»» Lexical, focusing primarily on the linguistic and
semantic indicators in text.
»» Probabilistic, focusing primarily on the statistical
potentialities in text.
»» LSI-based, focusing primarily on the holistic co-
occurrences of unique terms in text.
»» Hybridization, combining elements of any of the
previous three.
The First Generation—Term Occurrence and
Keyword Indexing
Boolean-based search engines initially employed a
simple linguistic method to index documents. These platforms' semantic discovery amounted to counting all the individual words (and word frequencies) found in a document set. The resultant index, therefore, was a comprehensive term look-up list, with varying ranks applied depending on term frequency within the document set. Over time, content enrichment methods have been added, but these systems remain lexically based by nature.
With these systems, when a user submits a search
term or phrase (known as the query), the search engine
compares the query to the contents of the look-up list.
A match occurs if a document in the index meets the
conditional logic of the query. For example, a query for
“dog” returns all documents containing the word “dog”
one or more times in a ranked order—documents with
many instances of the word rank higher in the results list
than ones with fewer instances. A query for “dog NOT
cat” returns all documents containing the word “dog”
but not the word “cat” because of the conditional logic
conveyed by NOT. The advent of this keyword indexing
and search methodology revolutionized the way users
found the documents for which they were searching. Keyword technology still solves many problems in information management today, especially when the precise query is known to the user. Indeed, when the user is looking for the presence (or absence) of very
specific words or phrases, keyword search is still the
most effective discovery tool.
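To make the mechanism concrete, the following minimal sketch (in Python, with an invented three-document corpus) illustrates the term look-up list and the conditional logic described above; it is a toy illustration, not any vendor's implementation.

```python
# Toy inverted index illustrating first-generation keyword search.
# Documents and contents are invented for this example.
from collections import Counter, defaultdict

docs = {
    "doc1": "the dog chased the cat",
    "doc2": "a dog and another dog played",
    "doc3": "the cat slept",
}

# Build the term look-up list: term -> {doc_id: term frequency}.
index = defaultdict(Counter)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term][doc_id] += 1

def search(term):
    """Rank documents containing `term` by term frequency."""
    return sorted(index[term].items(), key=lambda kv: -kv[1])

def search_not(term, excluded):
    """Boolean 'term NOT excluded': hits for `term`, minus any
    document that also contains `excluded`."""
    return [(d, n) for d, n in search(term) if d not in index[excluded]]

print(search("dog"))             # doc2 (two occurrences) ranks above doc1
print(search_not("dog", "cat"))  # only doc2; doc1 also contains "cat"
```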
What to Like About Keyword Technology
»» Simple approach to indexing the content of
documents.
»» Easy to understand resultant document list—
prevalence of query terms affects rank.
»» A good way to find very specific or uniquely worded
information in smaller document sets.
While technically a linguistic approach (it did, after all,
focus on the language within the indexed documents),
keyword indexing was not designed to discover word
and document meaning. Furthermore, it is prone to false positives and negatives: the presence of the query word (or words) in a document does not guarantee that document's relevance.
For example, a user looking for documents about “fraud” quickly finds that many documents contain the word “fraud” but are not necessarily about fraudulence. The user is then forced
to construct more and more complex queries strung
together with conditional operators (such as AND, OR) in
order to work within the search engine’s method to find
what is truly being sought. The problem is that most of
the time what is being sought is not a word or series of
words at all. Users really want to find documents in the
same way their own human brains work with information
every hour of every day: they want to find concepts
without having to know the exact terms to use in order to
convey that meaning to the analytics engine.
Shortcomings of Keyword Technology
»» Likelihood of introducing false positives and negatives
with vague queries.
»» Overly complex query construction when searching
for more expansive ideas or themes.
The Next Generation: Linguistic Analysis
and Indexing
A concept is different from a word or series of words, no
matter how much conditional logic is thrown in to make
the keyword query more nuanced. According to the online
American Heritage Dictionary of the English Language, a
concept is “a general idea or understanding of something.”
In a document, a concept might be an expressed idea
or thematic content employing any number of different
words to articulate it. A word is a unique entity with a
finite number of restricted meanings. A concept is a larger idea not restricted to any particular terminology for its expression.
To distinguish between the two, consider the word (or keyword) “music” versus the idea of auditory stimulation that excites the senses and the mind. The concept of
music could encompass thousands of different ideas—
rock and roll, blues, Woodstock, Beethoven, Justin
Bieber. Undeniably, human thoughts occur in the form
of ideas and concepts most of the time, not specific
words. Another fact is that rarely will a keyword or series
of keywords fully express all the facets of a concept.
Therefore, keyword search can be inherently inadequate
for most users’ needs when the “right” query is unknown.
This realization spurred researchers and software
developers past first-generation keyword text analytics.
The race was on to figure out how to find the concepts
embedded within documents, the “aboutness” rather than
just the individual words that composed them.3
The next generation of text analytics platforms continued
to rely on the analysis of language to find conceptual
content within documents, just as earlier keyword
technologies had. These linguistic indexing technologies,
though, went beyond the keyword lookup process. They
incorporated algorithms and ancillary reference tools
in order to interpret the complexities of language found
within documents. Such lexical analytics software
came pre-programmed with the static rules of language
(grammar) and word use (semantics). Reference
cartridges or modules such as dictionaries and thesauri
fed into this linguistic indexing methodology.
This approach was certainly more sophisticated than
keyword indexing, but the problem was and still is
that the rules and conventions of language—grammar,
semantics, and colloquial usage—are so fluid and
changeable that early indexing software either could not derive concepts accurately or effectively, or it could not keep up with the ever-changing dynamics of modern
language. In the 21st century, a word can come into being
overnight, or an existing word can be imbued with vastly
different meaning and propagated around the world
within hours. Linguistic indexing engines, relying on
laborious human pre-programming and updating, could
not keep pace.
Another problem with the linguistic approach is that
it was language dependent, not language agnostic.
A dictionary is a reference tool for words in a given
language, so if a linguistic indexing engine encountered
a document in a language not supported by its reference
dictionary, it could not derive meaning from that
document. Some other technique or technology was
required to resolve these shortcomings.
The Irony Is…
Language, when viewed from a linguistic perspective,
seems hopelessly complex and impenetrable due to its
rich variety and dynamic state. Language, when treated from a mathematical perspective, becomes elegantly understandable. The importance of mathematics in
language analysis continues to gain traction.4
Advanced Text Analytics: Mathematical
Analysis and Indexing
In a very counter-intuitive fashion, researchers and
innovators turned away from the analysis of language
rules in order to figure out conceptuality within text.
Instead, they approached the content indexing problem
from a mathematical perspective. Employing an
impressive array of mathematical approaches and
maneuvers during the indexing process, text analytics
vendors were able to create even more advanced text
analytics software. No longer were indexing engines
dependent on static linguistic references necessitating
frequent updating as language usage changed.
Advanced text analytics engines now can rely on
statistical analyses of probable meaning within a
document—known as a probabilistic indexing approach—
or on linear algebraic analyses of total word co-
occurrence and correlation—known as an LSI-based
indexing approach—to figure out the concepts contained
within a document. Other approaches include a
hybridization of these two techniques. With these math-
based advanced analytics techniques, conceptual search
and classification across large volumes of documents is
possible with a very high degree of reliability, flexibility,
and adaptability. Sophisticated text analytics has finally
arrived.
The technology behind probabilistic text analysis
leverages research based in statistical computations.
It builds upon the ideas of probable confidence in
an assumption and the continuous updating of that
confidence based on evidence. Applied to text analysis,
a probabilistic approach—relying on algorithms rooted
in statistical analysis—analyzes local term patterns
and computes probable or likely meanings. These
calculations in part depend on previous assumptions of
meaning, which means that faulty assumptions introduce
error into the textual analysis. The length of the text
also influences the accuracy of probabilistic indexing,
with shorter texts presenting a particular problem in
conceptual derivation.5
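The Bayesian updating described above can be illustrated with a short sketch. The topics, terms, and probabilities below are invented for illustration; they show how a posterior belief shifts with each observed term, and why short texts leave the result dominated by the prior.

```python
# Toy Bayesian updating over two hypothetical topics. The likelihoods
# P(term | topic) and the prior are invented for illustration.
likelihood = {
    "finance": {"fraud": 0.04, "audit": 0.03, "statue": 0.0001},
    "art":     {"fraud": 0.002, "audit": 0.0005, "statue": 0.05},
}
prior = {"finance": 0.5, "art": 0.5}  # the initial assumption

def update(belief, term):
    """One step of Bayes' rule: posterior is likelihood times prior,
    renormalized over the topics."""
    posterior = {t: likelihood[t].get(term, 1e-6) * p
                 for t, p in belief.items()}
    total = sum(posterior.values())
    return {t: p / total for t, p in posterior.items()}

belief = dict(prior)
for term in ["fraud", "audit"]:  # evidence from a (very short) document
    belief = update(belief, term)

# Confidence shifts strongly toward "finance". With so few observed
# terms, a faulty prior would dominate the outcome -- the short-text
# weakness noted above.
print(belief)
```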
Latent Semantic Indexing, or LSI, is a linear algebraic
approach to deriving conceptuality out of the textual
content of a document. LSI uses sophisticated linear
algebraic computations in order to assess term co-
occurrence and contextual associations. While the scale
of calculations “under the hood” is extensive, the overall
approach can be explained in comprehensible and non-
technical language. Once understood, the utility of the LSI technique and the myriad features it facilitates become quite apparent.
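For readers who want to see the single equation at the heart of the approach, LSI rests on the truncated singular value decomposition of the term-document matrix (the standard formulation in the LSI literature, not a vendor-specific algorithm):

$$ A \;\approx\; A_k \;=\; U_k \, \Sigma_k \, V_k^{\top} $$

Here $A$ is the $m \times n$ term-document matrix, the rows of $U_k$ and $V_k$ give term and document coordinates in a shared $k$-dimensional concept space, $\Sigma_k$ holds the $k$ largest singular values, and choosing $k \ll \min(m, n)$ retains only the dominant co-occurrence patterns while discarding noise.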
LSI and Concept-Based Text Analytics
In order to figure out what concepts are contained within
a document, an LSI-based text analytics engine first must
acquire an understanding of term interrelationships.
Keep in mind that, as a math-based indexing technique,
an LSI engine does not have ancillary linguistic references
or any pre-programmed understanding of language at
all, so this understanding is mathematical rather than
linguistic. Using algorithms based in linear algebra, the
LSI technique generates this understanding by assessing
all the words within a document and within an entire
document set. The engine then calculates word co-
occurrences across the entire document set—accounting
for different emphases on certain word prevalence—to
figure out the interrelationships between words and the logical combinations of word co-occurrence that lead to comprehensible concepts. This ability to derive
conceptuality from text is one of its most valuable
commercial traits.6
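A minimal end-to-end sketch of this workflow, assuming a Python environment with scikit-learn available: the tiny corpus, the TF-IDF weighting, and the two-dimensional concept space are illustrative choices, not Content Analyst's implementation.

```python
# Toy LSI pipeline: TF-IDF term-document matrix, truncated SVD into a
# low-rank concept space, then conceptual search by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the dog chased a ball in the park",
    "a spaniel is a dog bred for hunting",
    "the spaniel and the husky played in the park",
    "the orchestra performed a beethoven symphony",
    "the symphony hall hosted a rock concert",
]

# Weighted term-document matrix (term prevalence via TF-IDF).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Truncated SVD projects documents into a shared low-rank concept
# space derived from global term co-occurrence patterns.
lsi = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsi.fit_transform(X)

# A conceptual query, matched by position in concept space rather
# than by literal keyword overlap.
query = lsi.transform(vectorizer.transform(["husky"]))
scores = cosine_similarity(query, doc_vectors)[0]
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:+.3f}  {doc}")
# The dog-related documents tend to rank highest even though only one
# of them contains the literal word "husky".
```

Even at this toy scale, the query is matched in concept space rather than by shared keywords, which is the behavior the paper attributes to LSI.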
To state it another way, we can compare LSI to human
thought and communication. A human must use logical
and accepted word combinations in order to convey a
thought, regardless of the language used. Too many
illogical word co-occurrences create incomprehensible
gibberish. LSI uses advanced math to figure out
these inherent combinations based on the documents
themselves, allowing it to respond to a conceptual query
with documents containing the same concept—again, not
the same words, but the same or similar concept. In a
way, LSI mimics the human brain in its attempt to make
sense out of word co-occurrences and figure out what text
really means. If it cannot figure out what the words mean,
that probably indicates that the word combinations are
meaningless. As Bradford states, concepts derived from LSI “correspond remarkably well with judgments of conceptual similarity made by human beings.”7
LSI is not new technology. As a matter of fact, it uses
individual mathematical maneuvers that have been
known to scientists and mathematicians for decades. In
the 1980s, a group of Bellcore scientists applied these
mathematical principles to their research in language
and subsequently patented the LSI technique. This
technology changed hands in the 1990s and the first decade of this century until a new organization—Content
Analyst Company—was created in 2004 to advance the
technology in several markets. Along the way, numerous
other patents have been granted around the original LSI
approach, allowing it to grow into a full-blown advanced
text analytics platform.
LSI and Some Misconceptions
Throughout the different stages of LSI development
and evolution, the challenges which its inventors
and developers had to overcome gave rise to some
misconceptions about the LSI indexing technique. Most of
these misconceptions derive from its earliest days when
the technique was just beginning to evolve. Some of
these misconceptions include:
»» LSI is slow and does not scale
»» LSI is expensive to implement and maintain
»» LSI is not defensible
»» LSI does not differentiate semantic nuances
»» LSI cannot replace human inspection of documents
One of the biggest misconceptions about LSI technology
is that it is slow and non-scalable when presented
with large volumes of documents. We can trace this
misconception back to the days of less powerful,
more expensive hardware. Sometimes a technology is ahead of its time, relying on techniques that the available hardware has yet to catch up with. The reality is that
LSI was invented and patented during a time when
the sophisticated math which the technique requires
consumed vast resources on the limited hardware
available. Hardware constraints admittedly slowed down
not only the indexing process but also the text analytics
functions carried out post-indexing.
As already touched upon, though, the rapid reduction in
hardware costs and huge gains in performance witnessed
over the past five years (resulting in inexpensive many-
core processors and cheap memory) have eliminated
these problems. Not only does an LSI engine now
have vast hardware resources available to it on a wide
range of servers, but its ongoing evolution has resulted
in distributed indexing capabilities and load-sharing
deployments. Concept searches typically return results in under a second, and hundreds of thousands of documents can be classified, organized, and tagged in mere minutes, as opposed to the days or weeks humans would need to assess the same volume of information.
LSI performs its functions rapidly on very affordable
hardware.8
The associated perception that LSI is expensive to install
and maintain is refuted by the same explanation. Cheap
hardware, extensible features, and deployment best
practices learned along the way all contribute to an
economical answer for anybody’s advanced text analytics
needs. For the value it provides, LSI is a compelling
indexing technology for concept-based analytics.
Microchip flashback
In 1990, Intel® introduced the 33 MHz 486 microprocessor chip with a processing speed of 27 MIPS. Today's Intel Xeon® chip runs at up to 4.4 GHz; by clock speed alone, the Xeon is 133 times faster than the state-of-the-art 486 of 1990.
Another misconception plays more upon the human fears
of automation by questioning how defensible the results
of LSI indexing and concept-based analytics are. Humans
are always suspicious when “the machine” encroaches
upon tasks and abilities thought to be best performed by intelligent, well-trained humans. Because LSI
effectively removes humans from the process of reading
and assessing the content of documents, critics can easily
play upon the fear of the unknown. After all, if no humans
read all the documents, who really knows what’s in them?
This fear is understandable but has no basis in research
or documented observation. The algorithms and
proprietary technology that power LSI indexing are well
documented and can be defended by the principles of
advanced mathematics. The reality is that the math of
LSI and its approach, driven by sophisticated analysis of total word co-occurrence, can be defended from both a mathematical and a linguistic perspective.
A particularly erroneous claim is that LSI cannot
detect word similarities or semantic variations. This
misconception insists that LSI cannot distinguish between
“cool” as an indicator of temperature versus “cool” as a
qualitative judgment of relevance. This claim is patently
false. As a matter of fact, LSI is ideal for penetrating
the mysteries of semantics—including synonymy and
polysemy—and based upon its core approach (term co-
occurrence) actually figures out semantic quandaries in
the same way humans do.9
If one person says to another, “this is cool,” the
recipient might not immediately understand what is
being indicated, especially if the speaker has touched
something that might actually be cold. The hearer might
ask for clarification with something elegant like “what
is?” The first person might then elaborate with, “This
bronze statue is really cool. It’s quite post-modern.” With
the additional accompanying words in the follow up, the
speaker provides enough term co-occurrence for the
hearer to understand the meaning, which has nothing to
do with temperature. Conversely, a metalsmith casting the same bronze statue might also say “this is cool,” referring to the temperature and indicating that it is ready to be handled without getting burned. LSI analysis of text
works exactly like this and mirrors the human ability to
interpret meaning based on term co-occurrence.
A final misconception involves the question of technology
replacing human assessment of document content. It is
human nature to resist technologies and processes that
we don’t fully understand or that we feel are replacing
us, but that does not mean that the technology itself
is not effective or appropriate to implement. With the
nearly exponential explosion in volume of enterprise
data, technology must replace the inadequate solutions
provided by earlier text analytics techniques and
expensive human activity. The knowledge management
market has indeed reached the inflection point where low hardware costs coupled with the advanced capabilities of CAAT™ create a compelling counter-argument for those who are technophobic.
LSI and CAAT™
As with all other vendors of search and text analytics
technologies, Content Analyst Company has had to
focus on the tenets of precision, speed, and flexibility.
Overcoming the inherent obstacles within text that distract CAAT™ and hamper its ability to determine document conceptuality was also a necessity. For example, header and footer information in emails must be filtered out to extract the useful authored content of those documents.
Considerable research has gone into preparing document
text for more precise conceptual analysis.
In the earlier days of the technology, the inventors and
developers had the additional problem of less powerful
but much more expensive hardware. The math required
to perform LSI indexing is not insignificant, so the
workstations and servers of the 1990s and early 2000s
had to be robust, with as much memory (RAM) and
CPU horsepower as possible. Furthermore, the 32-bit processors and operating systems prevalent at that time were not capable of addressing large amounts of
RAM. Until the inflection point of more powerful but
less expensive hardware occurred in the mid-2000s,
the ongoing development of the CAAT engine focused on
refinement of the code base to allow for more accurate
and speedier functionality.
Distributed subsystem deployment and text filtering
capabilities were also incorporated along the way, the
latter of which suppresses extraneous or “garbage
text” during indexing. Finally, the evolution of CAAT
over the years grew to encompass not only traditional
search (concept search) and concept-based document
classification (multiple techniques and optimizing
algorithms), but also dynamic clustering, document
summarization, primary language identification, and
advanced text comparisons (for thread detection in
emails and for identifying duplicate text). CAAT is now an
advanced text analytics platform with dozens of discrete
analytical applications which developers can assemble
in combination to creatively add concept-based text
analytics to larger software solutions.
Key Capabilities
The key features of CAAT include concept-based search
and document organization/classification. Concept
search and concept-based document classification are
far more powerful than keyword-based approaches, for
reasons discussed previously. Now, the user does not
have to know the “right” words to use when submitting
queries, and documents that are related to each other
conceptually can be grouped together regardless of
whether they share the same terminology.
Document classification is particularly attractive for software vendors who need workflow automation and document routing within their solutions, such as vendors of enterprise content management, enterprise archiving, compliance, and e-discovery software. The
ability to rapidly organize large volumes of documents
based on their relevance to each other or to an
overarching category reduces or eliminates costly
human document inspection. With the reduction of
human document inspection come the benefits of more
precise classification results. Properly trained text
analytics software such as CAAT more objectively and
consistently assesses the conceptual and thematic
content of documents—it does not get tired (and therefore
increasingly inconsistent), and it does not get tripped up
over the interpretive nuances of language that plague
human inspection.
Because of the importance of classification, CAAT includes
two major ways to organize documents: clustering
and categorization. These two modes of document
classification differ in the amount of human intervention
required to establish and define the organizational
structure, known as the taxonomy. In theory, a taxonomy is nothing more than a set of discrete organizational units—arranged in either a flat or hierarchical manner—into
which documents can be placed, along with the rules
dictating the type of content a document must contain
in order to qualify for a category. In practice, taxonomy
development and maintenance is an enormous
undertaking for enterprises, requiring highly specialized
knowledge workers (known as taxonomists) and their
support staff. CAAT’s ability to cluster documents into an automatically generated taxonomy, or alternatively to accommodate the more refined process of human-trained document categorization, means that nearly any
automated workflow requiring classification can be
supported.
The value propositions for these two classification
methods are compelling. Automated clustering
provides rapid organization of documents so that users
can quickly understand the conceptual composition
and distribution spread within large document sets.
Clustering automatically classifies documents based on
each document’s predominant concept or theme—it also
creates the taxonomy structure and category naming
scheme without human intervention. Because the
unsupervised clustering feature is dynamic and can be
just-in-time within a solution’s workflow, it is ideal when
quick and general insight into a document set is required.
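A minimal sketch of such unsupervised clustering, reusing the `doc_vectors`, `vectorizer`, and `lsi` objects from the earlier LSI sketch; the cluster count and term-based naming scheme are illustrative assumptions, not CAAT's algorithm.

```python
# Toy unsupervised clustering over the LSI document vectors, with
# cluster names generated automatically from high-weight terms.
# Reuses `doc_vectors`, `vectorizer`, and `lsi` from the LSI sketch.
import numpy as np
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(doc_vectors)

terms = vectorizer.get_feature_names_out()
for c in range(kmeans.n_clusters):
    # Map the cluster centroid back into term space and take its
    # top-weighted terms as an automatic category name.
    centroid_terms = lsi.inverse_transform(
        kmeans.cluster_centers_[c][None, :])[0]
    top = [terms[i] for i in np.argsort(centroid_terms)[::-1][:3]]
    members = [i for i, cid in enumerate(cluster_ids) if cid == c]
    print(f"cluster '{'/'.join(top)}': documents {members}")
```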
Categorization, on the other hand, allows users to
participate in the taxonomy development and analytics
training process. As with all other types of supervised
document classification technologies, categorization
demands more upfront effort during taxonomy
development and the requisite learning process to define
the categories for CAAT. The compelling benefit is a
much more refined and accurate result set placed into
pre-defined categories of interest. Categorization is a
powerful solution where the best of human insight and
software efficiency can be combined to yield the most
accurate classification results possible.
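Figure 1.2 describes categorization in terms of example documents and a threshold defining “hit spheres.” A minimal sketch of that idea, reusing `doc_vectors` from the earlier LSI sketch; the mean-of-examples category center and the 0.8 threshold are illustrative assumptions.

```python
# Toy "hit sphere" categorization: human-chosen example documents
# define a category center; indexed documents within a cosine-
# similarity threshold of that center are assigned to the category.
from sklearn.metrics.pairwise import cosine_similarity

def categorize(doc_vectors, example_ids, threshold=0.8):
    """Return indices of documents falling inside the hit sphere."""
    center = doc_vectors[example_ids].mean(axis=0, keepdims=True)
    scores = cosine_similarity(doc_vectors, center)[:, 0]
    return [i for i, s in enumerate(scores) if s >= threshold]

# Train a "dogs" category from two example documents (indices 0 and 1).
print(categorize(doc_vectors, example_ids=[0, 1]))
```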
CAAT also provides a term expansion feature, which, for any word, detects all the other terms in the indexed document set that are highly correlated or synonymous with it. Using the “dog”
example from earlier, CAAT would also identify “husky,”
“spaniel,” “mutt,” “pup,” and “man’s best friend,” as highly
correlated or synonymous terms.
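A minimal sketch of term expansion in the same LSI concept space, reusing the fitted `lsi` and `vectorizer` from the earlier sketch. On a large real corpus this is where neighbors like “husky” and “spaniel” would surface; a toy corpus can only yield its own small vocabulary.

```python
# Toy term expansion: every term has its own vector in concept space
# (rows of the transposed SVD components), and a term's nearest
# neighbors are its highly correlated or synonymous candidates.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

terms = vectorizer.get_feature_names_out()
term_vectors = lsi.components_.T  # one concept-space vector per term

def expand(term, top_n=5):
    """Return the terms closest to `term` in concept space."""
    i = list(terms).index(term)
    scores = cosine_similarity(term_vectors[i : i + 1], term_vectors)[0]
    ranked = np.argsort(scores)[::-1]
    return [terms[j] for j in ranked if j != i][:top_n]

print(expand("dog"))
```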
Being a math-based indexing engine, CAAT has no
native understanding of language, the benefit of which
is complete language agnosticism. Indexing of German
documents enables the same analytics features and
yields the same accurate results as the indexing of
documents in English, Chinese, or Arabic. Despite being
language agnostic, CAAT does have the ability to detect
the differences between languages due to its term
analysis. Therefore, it can identify the primary language
of a document.
All of these features can be accessed whenever needed
in a larger software platform to increase the findability
of documents and improve the accuracy and relevance of
document classification.
Learning More About CAAT
The power of CAAT and its LSI indexing technology has
been integrated into dozens of software solutions in a
number of different markets. To learn more about these
success stories, go to www.contentanalyst.com.
Figure 1.1. CAAT finds tight groupings of concepts; groups and subgroups are then selected based on settings. Numeric values help interpret results, and document scores indicate closeness to the centers of the clusters.
Figure 1.2. Example documents plus a threshold define “hit spheres”; documents in the search index that fall within the “hit spheres” are categorized.
About Content Analyst Company
We provide powerful and proven Advanced Analytics
that exponentially reduce the time needed to discern
relevant information from unstructured data. CAAT, our
dynamic suite of text analytics technologies, delivers
significant value wherever knowledge workers need to
extract insights from large amounts of unstructured data.
Our capabilities are easily integrated into any software
solution, and our support strategy for our partners is
second to none.
© 2013 Content Analyst Company, LLC. All rights
reserved. Content Analyst, CAAT and the Content Analyst
and CAAT logos are registered trademarks of Content
Analyst, LLC in the United States. All other marks are the
property of their respective owners.
References
1 Frank Ohlhorst, “The Promise of Big Data,” InfoWorld, September 2010.
2 John Markoff, “Armies of Expensive Lawyers, Replaced by Cheaper Software,” New York Times, March 4, 2011.
3 C. Korycinski and Alan F. Newell, “Natural Language Processing and Automatic Indexing,” The Indexer, April 1990.
4 Mark Liberman, “Linguists Who Count,” Language Log, May 28, 2009.
5 Yangqiu Song et al., “Short Text Conceptualization Using a Probabilistic Knowledgebase,” Proceedings of the 22nd International Joint Conference on Artificial Intelligence, 2011.
6 Roger Bradford, “Comparability of LSI and Human Judgment in Text Analysis Tasks,” Proceedings of the Applied Computing Conference, September 2009.
7 Bradford, “Comparability of LSI and Human Judgment in Text Analysis Tasks.”
8 Roger Bradford, “Implementation Techniques for Large-Scale Latent Semantic Indexing Applications,” Proceedings of the 20th ACM International Conference on Information and Knowledge Management, October 2011.
9 Bradford, “Comparability of LSI and Human Judgment in Text Analysis Tasks.”

More Related Content

What's hot

Text Analytics Overview, 2011
Text Analytics Overview, 2011Text Analytics Overview, 2011
Text Analytics Overview, 2011
Seth Grimes
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012
Peter Mika
 
An Introduction to Text Analytics: 2013 Workshop presentation
An Introduction to Text Analytics: 2013 Workshop presentationAn Introduction to Text Analytics: 2013 Workshop presentation
An Introduction to Text Analytics: 2013 Workshop presentation
Seth Grimes
 
Lexalytics Text Analytics Workshop: Perfect Text Analytics
Lexalytics Text Analytics Workshop: Perfect Text AnalyticsLexalytics Text Analytics Workshop: Perfect Text Analytics
Lexalytics Text Analytics Workshop: Perfect Text Analytics
Lexalytics
 
Text Analytics Presentation
Text Analytics PresentationText Analytics Presentation
Text Analytics PresentationSkylar Ritchie
 
What IA, UX and SEO Can Learn from Each Other
What IA, UX and SEO Can Learn from Each OtherWhat IA, UX and SEO Can Learn from Each Other
What IA, UX and SEO Can Learn from Each Other
Ian Lurie
 
SemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorialSemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorial
Peter Mika
 
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Trey Grainger
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)
Trey Grainger
 
Information Retrieval Fundamentals - An introduction
Information Retrieval Fundamentals - An introduction Information Retrieval Fundamentals - An introduction
Information Retrieval Fundamentals - An introduction
Grace Hui Yang
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
Trey Grainger
 
Semantic Search on the Rise
Semantic Search on the RiseSemantic Search on the Rise
Semantic Search on the Rise
Peter Mika
 
Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey  Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey
dannyijwest
 
Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)
muzzy4friends
 
Competitive Intelligence Made easy
Competitive Intelligence Made easyCompetitive Intelligence Made easy
Competitive Intelligence Made easy
Raghav Shaligram
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
OpenSource Connections
 
Brave new search world
Brave new search worldBrave new search world
Brave new search world
voginip
 
A survey on various architectures, models and methodologies for information r...
A survey on various architectures, models and methodologies for information r...A survey on various architectures, models and methodologies for information r...
A survey on various architectures, models and methodologies for information r...IAEME Publication
 

What's hot (19)

Text Analytics Overview, 2011
Text Analytics Overview, 2011Text Analytics Overview, 2011
Text Analytics Overview, 2011
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012
 
An Introduction to Text Analytics: 2013 Workshop presentation
An Introduction to Text Analytics: 2013 Workshop presentationAn Introduction to Text Analytics: 2013 Workshop presentation
An Introduction to Text Analytics: 2013 Workshop presentation
 
Lexalytics Text Analytics Workshop: Perfect Text Analytics
Lexalytics Text Analytics Workshop: Perfect Text AnalyticsLexalytics Text Analytics Workshop: Perfect Text Analytics
Lexalytics Text Analytics Workshop: Perfect Text Analytics
 
Text Analytics Presentation
Text Analytics PresentationText Analytics Presentation
Text Analytics Presentation
 
What IA, UX and SEO Can Learn from Each Other
What IA, UX and SEO Can Learn from Each OtherWhat IA, UX and SEO Can Learn from Each Other
What IA, UX and SEO Can Learn from Each Other
 
SemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorialSemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorial
 
Locating sources and search techniques
Locating sources and search techniquesLocating sources and search techniques
Locating sources and search techniques
 
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)
 
Information Retrieval Fundamentals - An introduction
Information Retrieval Fundamentals - An introduction Information Retrieval Fundamentals - An introduction
Information Retrieval Fundamentals - An introduction
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
 
Semantic Search on the Rise
Semantic Search on the RiseSemantic Search on the Rise
Semantic Search on the Rise
 
Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey  Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey
 
Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)
 
Competitive Intelligence Made easy
Competitive Intelligence Made easyCompetitive Intelligence Made easy
Competitive Intelligence Made easy
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
 
Brave new search world
Brave new search worldBrave new search world
Brave new search world
 
A survey on various architectures, models and methodologies for information r...
A survey on various architectures, models and methodologies for information r...A survey on various architectures, models and methodologies for information r...
A survey on various architectures, models and methodologies for information r...
 

Viewers also liked

Corporate Presentation: example
Corporate Presentation: exampleCorporate Presentation: example
Corporate Presentation: example
Luiz Fernando Lizardo Rodrigues
 
3 Music Video Analyses by Thomas Griffiths - 0601 FINAL
3 Music Video Analyses by Thomas Griffiths - 0601 FINAL3 Music Video Analyses by Thomas Griffiths - 0601 FINAL
3 Music Video Analyses by Thomas Griffiths - 0601 FINALThomas Griffiths
 
An Empirical Characterization of Touch-Gesture Input-Force on Mobile Devices
An Empirical Characterization of Touch-Gesture Input-Force on Mobile DevicesAn Empirical Characterization of Touch-Gesture Input-Force on Mobile Devices
An Empirical Characterization of Touch-Gesture Input-Force on Mobile Devices
University of Sussex
 
Ash edu 695 week 5 dq 2
Ash edu 695 week 5 dq 2Ash edu 695 week 5 dq 2
Ash edu 695 week 5 dq 2
robertesparza1011
 
Like tears in the rain’ postmodern media
Like tears in the rain’ postmodern mediaLike tears in the rain’ postmodern media
Like tears in the rain’ postmodern mediaThomas Griffiths
 
Big Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social MediaBig Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social Media
R A Akerkar
 
Muhammad ali
Muhammad aliMuhammad ali
Muhammad ali
blancaales
 
Muhammad Ali
Muhammad AliMuhammad Ali
Muhammad AliAlen_99
 
Top 3 design patterns in Map Reduce
Top 3 design patterns in Map ReduceTop 3 design patterns in Map Reduce
Top 3 design patterns in Map Reduce
Edureka!
 
Enterprise Demand Management Framework
Enterprise Demand Management FrameworkEnterprise Demand Management Framework
Enterprise Demand Management Framework
Luiz Fernando Lizardo Rodrigues
 
Ali
AliAli
Ali
Nick535
 
The Prospect of IoT in the Oil & Gas
The Prospect of IoT in the Oil & Gas The Prospect of IoT in the Oil & Gas
The Prospect of IoT in the Oil & Gas
Ghazi Wadi, PMP
 
Big Data For Flight Delay Report
Big Data For Flight Delay ReportBig Data For Flight Delay Report
Big Data For Flight Delay Report
JSPM's JSCOE , Pune Maharashtra.
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyonddatasalt
 

Viewers also liked (15)

Corporate Presentation: example
Corporate Presentation: exampleCorporate Presentation: example
Corporate Presentation: example
 
3 Music Video Analyses by Thomas Griffiths - 0601 FINAL
3 Music Video Analyses by Thomas Griffiths - 0601 FINAL3 Music Video Analyses by Thomas Griffiths - 0601 FINAL
3 Music Video Analyses by Thomas Griffiths - 0601 FINAL
 
An Empirical Characterization of Touch-Gesture Input-Force on Mobile Devices
An Empirical Characterization of Touch-Gesture Input-Force on Mobile DevicesAn Empirical Characterization of Touch-Gesture Input-Force on Mobile Devices
An Empirical Characterization of Touch-Gesture Input-Force on Mobile Devices
 
Ash edu 695 week 5 dq 2
Ash edu 695 week 5 dq 2Ash edu 695 week 5 dq 2
Ash edu 695 week 5 dq 2
 
Like tears in the rain’ postmodern media
Like tears in the rain’ postmodern mediaLike tears in the rain’ postmodern media
Like tears in the rain’ postmodern media
 
예측
예측예측
예측
 
Big Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social MediaBig Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social Media
 
Muhammad ali
Muhammad aliMuhammad ali
Muhammad ali
 
Muhammad Ali
Muhammad AliMuhammad Ali
Muhammad Ali
 
Top 3 design patterns in Map Reduce
Top 3 design patterns in Map ReduceTop 3 design patterns in Map Reduce
Top 3 design patterns in Map Reduce
 
Enterprise Demand Management Framework
Enterprise Demand Management FrameworkEnterprise Demand Management Framework
Enterprise Demand Management Framework
 
Ali
AliAli
Ali
 
The Prospect of IoT in the Oil & Gas
The Prospect of IoT in the Oil & Gas The Prospect of IoT in the Oil & Gas
The Prospect of IoT in the Oil & Gas
 
Big Data For Flight Delay Report
Big Data For Flight Delay ReportBig Data For Flight Delay Report
Big Data For Flight Delay Report
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyond
 

Similar to Content Analyst - Conceptualizing LSI Based Text Analytics White Paper

Empowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic EnrichmentEmpowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic Enrichment
The Digital Group
 
Technical Whitepaper: A Knowledge Correlation Search Engine
Technical Whitepaper: A Knowledge Correlation Search EngineTechnical Whitepaper: A Knowledge Correlation Search Engine
Technical Whitepaper: A Knowledge Correlation Search Engine
s0P5a41b
 
Extracting and Reducing the Semantic Information Content of Web Documents to ...
Extracting and Reducing the Semantic Information Content of Web Documents to ...Extracting and Reducing the Semantic Information Content of Web Documents to ...
Extracting and Reducing the Semantic Information Content of Web Documents to ...
ijsrd.com
 
Information Architecture Primer - Integrating search,tagging, taxonomy and us...
Information Architecture Primer - Integrating search,tagging, taxonomy and us...Information Architecture Primer - Integrating search,tagging, taxonomy and us...
Information Architecture Primer - Integrating search,tagging, taxonomy and us...Dan Keldsen
 
Tovek Presentation by Livio Costantini
Tovek Presentation by Livio CostantiniTovek Presentation by Livio Costantini
Tovek Presentation by Livio Costantinimaxfalc
 
Metaphic or the art of looking another way.
Metaphic or the art of looking another way.Metaphic or the art of looking another way.
Metaphic or the art of looking another way.
Suresh Manian
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern Mining
IOSR Journals
 
16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx
RAJU852744
 
16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx
herminaprocter
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
Derek Kane
 
XXIX Charleston 2009 Silverchair Kerner
XXIX Charleston 2009 Silverchair KernerXXIX Charleston 2009 Silverchair Kerner
XXIX Charleston 2009 Silverchair Kerner
Darrell W. Gunter
 
Henry stewart dam2010_taxonomicsearch_markohurst
Henry stewart dam2010_taxonomicsearch_markohurstHenry stewart dam2010_taxonomicsearch_markohurst
Henry stewart dam2010_taxonomicsearch_markohurstWIKOLO
 
II-SDV 2012 Patent Prior-Art Searching with Latent Semantic Analysis
II-SDV 2012 Patent Prior-Art Searching with Latent Semantic AnalysisII-SDV 2012 Patent Prior-Art Searching with Latent Semantic Analysis
II-SDV 2012 Patent Prior-Art Searching with Latent Semantic AnalysisDr. Haxel Consult
 
Return to the Materials Digital Humanities Conference 2013
Return to the Materials Digital Humanities Conference 2013Return to the Materials Digital Humanities Conference 2013
Return to the Materials Digital Humanities Conference 2013Sean Connolly
 
TEXT MINING-TAPPING HIDDEN KERNELS OF WISDOM
TEXT MINING-TAPPING HIDDEN KERNELS OF WISDOMTEXT MINING-TAPPING HIDDEN KERNELS OF WISDOM
TEXT MINING-TAPPING HIDDEN KERNELS OF WISDOM
ITC Infotech
 
Search Solutions 2011: Successful Enterprise Search By Design
Search Solutions 2011: Successful Enterprise Search By DesignSearch Solutions 2011: Successful Enterprise Search By Design
Search Solutions 2011: Successful Enterprise Search By Design
Marianne Sweeny
 
Information Retrieval on Text using Concept Similarity
Information Retrieval on Text using Concept SimilarityInformation Retrieval on Text using Concept Similarity
Information Retrieval on Text using Concept Similarity
rahulmonikasharma
 
NLP and its applications
NLP and its applicationsNLP and its applications
NLP and its applicationsUtphala P
 
Rule Legal Services, General Counsel, And Miscellaneous Claims Service Organi...
Rule Legal Services, General Counsel, And Miscellaneous Claims Service Organi...Rule Legal Services, General Counsel, And Miscellaneous Claims Service Organi...
Rule Legal Services, General Counsel, And Miscellaneous Claims Service Organi...legalservices
 

Similar to Content Analyst - Conceptualizing LSI Based Text Analytics White Paper (20)

Empowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic EnrichmentEmpowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic Enrichment
 
Technical Whitepaper: A Knowledge Correlation Search Engine
Technical Whitepaper: A Knowledge Correlation Search EngineTechnical Whitepaper: A Knowledge Correlation Search Engine
Technical Whitepaper: A Knowledge Correlation Search Engine
 
Extracting and Reducing the Semantic Information Content of Web Documents to ...
Extracting and Reducing the Semantic Information Content of Web Documents to ...Extracting and Reducing the Semantic Information Content of Web Documents to ...
Extracting and Reducing the Semantic Information Content of Web Documents to ...
 
Information Architecture Primer - Integrating search,tagging, taxonomy and us...
Information Architecture Primer - Integrating search,tagging, taxonomy and us...Information Architecture Primer - Integrating search,tagging, taxonomy and us...
Information Architecture Primer - Integrating search,tagging, taxonomy and us...
 
Tovek Presentation by Livio Costantini
Tovek Presentation by Livio CostantiniTovek Presentation by Livio Costantini
Tovek Presentation by Livio Costantini
 
Metaphic or the art of looking another way.
Metaphic or the art of looking another way.Metaphic or the art of looking another way.
Metaphic or the art of looking another way.
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern Mining
 
16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx
 
16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
XXIX Charleston 2009 Silverchair Kerner
XXIX Charleston 2009 Silverchair KernerXXIX Charleston 2009 Silverchair Kerner
XXIX Charleston 2009 Silverchair Kerner
 
Henry stewart dam2010_taxonomicsearch_markohurst
Henry stewart dam2010_taxonomicsearch_markohurstHenry stewart dam2010_taxonomicsearch_markohurst
Henry stewart dam2010_taxonomicsearch_markohurst
 
II-SDV 2012 Patent Prior-Art Searching with Latent Semantic Analysis
II-SDV 2012 Patent Prior-Art Searching with Latent Semantic AnalysisII-SDV 2012 Patent Prior-Art Searching with Latent Semantic Analysis
II-SDV 2012 Patent Prior-Art Searching with Latent Semantic Analysis
 
Return to the Materials Digital Humanities Conference 2013
Return to the Materials Digital Humanities Conference 2013Return to the Materials Digital Humanities Conference 2013
Return to the Materials Digital Humanities Conference 2013
 
TEXT MINING-TAPPING HIDDEN KERNELS OF WISDOM
TEXT MINING-TAPPING HIDDEN KERNELS OF WISDOMTEXT MINING-TAPPING HIDDEN KERNELS OF WISDOM
TEXT MINING-TAPPING HIDDEN KERNELS OF WISDOM
 
Search Solutions 2011: Successful Enterprise Search By Design
Search Solutions 2011: Successful Enterprise Search By DesignSearch Solutions 2011: Successful Enterprise Search By Design
Search Solutions 2011: Successful Enterprise Search By Design
 
Information Retrieval on Text using Concept Similarity
Information Retrieval on Text using Concept SimilarityInformation Retrieval on Text using Concept Similarity
Information Retrieval on Text using Concept Similarity
 
Word Embedding In IR
Word Embedding In IRWord Embedding In IR
Word Embedding In IR
 
NLP and its applications
NLP and its applicationsNLP and its applications
NLP and its applications
 
Rule Legal Services, General Counsel, And Miscellaneous Claims Service Organi...
Rule Legal Services, General Counsel, And Miscellaneous Claims Service Organi...Rule Legal Services, General Counsel, And Miscellaneous Claims Service Organi...
Rule Legal Services, General Counsel, And Miscellaneous Claims Service Organi...
 

Content Analyst - Conceptualizing LSI Based Text Analytics White Paper

  • 1. white paper Conceptualizing LSI-Based Text Analytics John Felahi Senior Vice President of Products, Content Analyst, LLC. Trevor J. Morgan, Ph.D. Relevance Analytics
  • 2. 1© 2013 Content Analyst, LLC. All rights reserved. Content Analyst, CAAT and the Content Analyst and CAAT logos are registered trademarks of Content Analyst, LLC in the United States. All other marks are the property of their respective owners. A Better Way to Find Information Corporations are undeniably overloaded with information. They thrive on the innovation and intellectual property captured in their corporate documents. However, when document sets grow at explosive rates like they have over the past five years, critical information becomes buried and lost. The unfortunate reality is that lost information is useless information. Companies lose a key competitive advantage by not being able to find and exploit valuable information embedded within large document sets.1 The traditional solution to this problem is Boolean (keyword) search or keyword-based document tagging and organization. These Boolean search and analytics tools may work just fine when the user knows exactly what word or words the desired documents must contain, but this is rarely the case. Nobody can ever anticipate what specific words or phrases are in a document— finding the right query is laborious and often futile. The goal, then, is to make information more findable by complementing keyword search systems with advanced text analytics technologies. Advanced text analytics encompasses the complex convergence of linguistics, mathematics, probability theory, and software development. Advanced text analytics software employs sophisticated algorithms in an attempt to “read” document content and figure out what that content actually means. These solutions provide users with a rich array of features, including concept-based search and document organization functions. Advanced text analytics software tries to determine document content and meaning in the same way humans do, except on a scale of volume far beyond human capabilities. The purpose is not to replace humans but rather to refocus humans on what they do best.2 “Text ANALYTICS SOFTWARE EMPLOYS SOPHISTICATED ALGORITHMS IN AN ATTEMPT TO "READ" DOCUMENT CONTENT AND FIGURE OUT WHAT THAT CONTENT ACTUALLY MEANS. ”
LSI is a very powerful and scalable text analytics technology, but the key to unleashing its potential is understanding how it works and what problems it solves.

Content Discovery is the Major Differentiator

As mentioned earlier, all text analytics technologies need to discover the contents of the documents presented to them. This indexing process involves contextual term analysis and is the first step in enabling users to work with the information in a document set. Without this step, the text analytics engine cannot possibly execute a search or categorize documents: how could it when the contents of the documents are unknown? The way in which an analytics engine discovers document content is therefore critical to overall functionality and is a major distinguishing factor.

Many competing indexing technologies have been developed and brought to market. Comprehensively exploring all of them and how they function would be time consuming; however, a quick look at the evolution of text analytics is instructive in comparing the relative strengths of the predominant approaches. Viewed from the most general perspective, every text analytics platform falls into one of the following types:

»» Lexical, focusing primarily on the linguistic and semantic indicators in text.
»» Probabilistic, focusing primarily on the statistical potentialities in text.
»» LSI-based, focusing primarily on the holistic co-occurrences of unique terms in text.
»» Hybrid, combining elements of any of the previous three.

The First Generation—Term Occurrence and Keyword Indexing

Boolean-based search engines initially employed a simple linguistic method to index documents. These platforms performed semantic discovery that amounted to counting all the individual words (and word frequencies) found in a document set. The resultant index, therefore, was a comprehensive term look-up list, with varying ranks applied depending on term frequency within the document set. Over time, content enrichment methods have been added, but these remain lexically based given the nature of the system. With these systems, when a user submits a search term or phrase (known as the query), the search engine compares the query to the contents of the look-up list.
A match occurs when a document in the index satisfies the conditional logic of the query. For example, a query for “dog” returns all documents containing the word “dog” one or more times, in ranked order—documents with many instances of the word rank higher in the results list than ones with fewer instances. A query for “dog NOT cat” returns all documents containing the word “dog” but not the word “cat,” because of the conditional logic conveyed by NOT. The advent of this keyword indexing and search methodology revolutionized the way users found the documents for which they were searching. Keyword technology still solves many problems in information management today, especially when the precise query is known to the user.
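The mechanics of this first generation are simple enough to sketch. The toy example below (invented documents and function names; real engines add stemming, stop-word handling, and far richer relevance ranking) builds the term look-up list and answers the “dog” and “dog NOT cat” queries just described:

```python
# A minimal sketch of first-generation keyword indexing (hypothetical
# data and names, not any vendor's engine).
from collections import Counter, defaultdict

docs = {
    1: "the dog barked at the mail carrier",
    2: "cat and dog owners visited the clinic",
    3: "the cat slept all afternoon",
}

# The "look-up list": for each term, which documents contain it and how often.
index = defaultdict(Counter)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term][doc_id] += 1

def search(term):
    """Rank matching documents by term frequency, highest first."""
    return sorted(index[term].items(), key=lambda kv: -kv[1])

def search_not(term, excluded):
    """Boolean 'term NOT excluded': documents with term but not excluded."""
    return [(d, n) for d, n in search(term) if d not in index[excluded]]

print(search("dog"))             # [(1, 1), (2, 1)]
print(search_not("dog", "cat"))  # [(1, 1)] -- doc 2 also contains "cat"
```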
Indeed, when the user is looking for the presence (or absence) of very specific words or phrases, keyword search is still the most effective discovery tool.

What to Like About Keyword Technology

»» A simple approach to indexing the content of documents.
»» An easy-to-understand results list—prevalence of query terms affects rank.
»» A good way to find very specific or uniquely worded information in smaller document sets.

While technically a linguistic approach (it does, after all, focus on the language within the indexed documents), keyword indexing was not designed to discover word and document meaning. Furthermore, it is prone to false positives and false negatives. One problematic issue with keyword indexing and search is that the presence of the query word (or words) in a document does not guarantee that document’s relevance. For example, a user looking for documents about “fraud” quickly finds that lots of documents contain the word “fraud,” but those documents are not necessarily about fraudulence. The user is then forced to construct ever more complex queries strung together with conditional operators (such as AND and OR) in order to work within the search engine’s method and find what is truly being sought.

The problem is that most of the time what is being sought is not a word or series of words at all. Users really want to find documents in the same way their own brains work with information every hour of every day: they want to find concepts, without having to know the exact terms that convey that meaning to the analytics engine.

Shortcomings of Keyword Technology

»» Likelihood of introducing false positives and negatives with vague queries.
»» Overly complex query construction when searching for more expansive ideas or themes.

The Next Generation: Linguistic Analysis and Indexing

A concept is different from a word or series of words, no matter how much conditional logic is thrown in to make the keyword query more nuanced. According to the online American Heritage Dictionary of the English Language, a concept is “a general idea or understanding of something.” In a document, a concept might be an expressed idea or thematic content articulated through any number of different words. A word is a unique entity with a finite number of restricted meanings; a concept is a larger idea not restricted to any particular terminology. To distinguish between the two, consider the word (or keyword) “music” versus the idea of auditory stimulation that excites the senses and the mind. The concept of music could encompass thousands of different ideas—rock and roll, blues, Woodstock, Beethoven, Justin Bieber.

Undeniably, human thoughts occur most of the time in the form of ideas and concepts, not specific words. Rarely will a keyword or series of keywords fully express all the facets of a concept. Keyword search, therefore, can be inherently inadequate for most users’ needs when the “right” query is unknown. This realization spurred researchers and software developers past first-generation keyword text analytics.
The race was on to figure out how to find the concepts embedded within documents, the “aboutness,” rather than just the individual words that composed them.3

The next generation of text analytics platforms continued to rely on the analysis of language to find conceptual content within documents, just as earlier keyword technologies had. These linguistic indexing technologies, though, went beyond the keyword look-up process. They incorporated algorithms and ancillary reference tools in order to interpret the complexities of language found within documents.

“Keyword indexing was not designed to discover word and document meaning.”
Such lexical analytics software came pre-programmed with the static rules of language (grammar) and word use (semantics). Reference cartridges or modules such as dictionaries and thesauri fed into this linguistic indexing methodology. The approach was certainly more sophisticated than keyword indexing, but the problem was and still is that the rules and conventions of language—grammar, semantics, and colloquial usage—are so fluid and changeable that early indexing software either could not derive concepts accurately and effectively or could not keep up with the ever-changing dynamics of modern language. In the 21st century, a word can come into being overnight, or an existing word can be imbued with vastly different meaning and propagated around the world within hours. Linguistic indexing engines, relying on laborious human pre-programming and updating, could not keep pace.

Another problem with the linguistic approach is that it was language dependent, not language agnostic. A dictionary is a reference tool for words in a given language, so if a linguistic indexing engine encountered a document in a language not supported by its reference dictionary, it could not derive meaning from that document. Some other technique or technology was required to resolve these shortcomings.

The Irony Is…

Language, when viewed from a linguistic perspective, seems hopelessly complex and impenetrable due to its rich variety and dynamic state. Language, when treated from a mathematical perspective, becomes elegantly understandable. The importance of mathematics in language analysis continues to gain traction.4

Advanced Text Analytics: Mathematical Analysis and Indexing

In a counter-intuitive fashion, researchers and innovators turned away from the analysis of language rules in order to figure out conceptuality within text. Instead, they approached the content indexing problem from a mathematical perspective. Employing an impressive array of mathematical techniques during the indexing process, text analytics vendors were able to create even more advanced text analytics software. No longer were indexing engines dependent on static linguistic references that required frequent updating as language usage changed.

Advanced text analytics engines now rely either on statistical analyses of probable meaning within a document—known as a probabilistic indexing approach—or on linear algebraic analyses of total word co-occurrence and correlation—known as an LSI-based indexing approach—to figure out the concepts contained within a document. Other approaches hybridize these two techniques. With these math-based techniques, conceptual search and classification across large volumes of documents is possible with a very high degree of reliability, flexibility, and adaptability. Sophisticated text analytics has finally arrived.

The technology behind probabilistic text analysis leverages research based in statistical computation. It builds upon the ideas of probable confidence in an assumption and the continuous updating of that confidence based on evidence. Applied to text analysis, a probabilistic approach—relying on algorithms rooted in statistical analysis—analyzes local term patterns and computes probable or likely meanings.
These calculations depend in part on previous assumptions of meaning, which means that faulty assumptions introduce error into the textual analysis. The length of the text also influences the accuracy of probabilistic indexing, with shorter texts presenting a particular problem in conceptual derivation.5

“Language, when treated from a mathematical perspective, becomes elegantly understandable.”
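To illustrate the updating idea in miniature (a toy only: the topics, priors, and per-topic term likelihoods are invented, and no vendor's actual model is implied), consider a belief about a document's topic revised term by term via Bayes' rule:

```python
# Toy probabilistic topic inference: start with a prior belief and
# update the confidence as each term is observed (invented numbers).
priors = {"finance": 0.5, "pets": 0.5}

# Assumed term likelihoods P(term | topic); real systems estimate these.
likelihood = {
    "finance": {"stocks": 0.20, "market": 0.15, "dog": 0.01},
    "pets":    {"stocks": 0.01, "market": 0.02, "dog": 0.25},
}

def update(beliefs, term, floor=1e-3):
    # posterior(topic) is proportional to prior(topic) * P(term | topic)
    post = {t: p * likelihood[t].get(term, floor) for t, p in beliefs.items()}
    z = sum(post.values())
    return {t: p / z for t, p in post.items()}

beliefs = dict(priors)
for term in "stocks market market".split():
    beliefs = update(beliefs, term)
print(beliefs)  # confidence shifts strongly toward "finance"
```

The two caveats above fall straight out of the sketch: skewed priors or likelihoods propagate directly into the posterior, and a short text supplies too few updates for the belief to settle.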
Latent Semantic Indexing, or LSI, is a linear algebraic approach to deriving conceptuality from the textual content of a document. LSI uses sophisticated linear algebraic computations to assess term co-occurrence and contextual associations. While the scale of the calculations “under the hood” is extensive, the overall approach can be explained in comprehensible and non-technical language. Once understood, the utility of the LSI technique and the myriad features it enables become quite apparent.

LSI and Concept-Based Text Analytics

In order to figure out what concepts are contained within a document, an LSI-based text analytics engine first must acquire an understanding of term interrelationships. Keep in mind that, as a math-based indexing technique, an LSI engine has no ancillary linguistic references or pre-programmed understanding of language at all, so this understanding is mathematical rather than linguistic. Using algorithms based in linear algebra, the LSI technique generates this understanding by assessing all the words within a document and within an entire document set. The engine then calculates word co-occurrences across the entire document set—accounting for different emphases on certain word prevalence—to figure out the interrelationships between words and the logical combinations of word co-occurrence that lead to comprehensible concepts. This ability to derive conceptuality from text is one of LSI’s most valuable commercial traits.6

To state it another way, we can compare LSI to human thought and communication. A human must use logical and accepted word combinations in order to convey a thought, regardless of the language used. Too many illogical word co-occurrences create incomprehensible gibberish. LSI uses advanced math to figure out these inherent combinations based on the documents themselves, allowing it to respond to a conceptual query with documents containing the same concept—again, not the same words, but the same or similar concept. In a way, LSI mimics the human brain in its attempt to make sense out of word co-occurrences and figure out what text really means. If it cannot figure out what the words mean, that probably indicates that the word combinations are meaningless. As Bradford states, concepts derived from LSI “correspond remarkably well with judgments of conceptual similarity made by human beings.”7

LSI is not new technology. As a matter of fact, it uses individual mathematical maneuvers that have been known to scientists and mathematicians for decades. In the 1980s, a group of Bellcore scientists applied these mathematical principles to their research in language and subsequently patented the LSI technique. The technology changed hands in the 1990s and again in the first decade of this century, until a new organization—Content Analyst Company—was created in 2004 to advance the technology in several markets. Along the way, numerous other patents have been granted around the original LSI approach, allowing it to grow into a full-blown advanced text analytics platform.
“In a way, LSI mimics the human brain in its attempt to make sense out of word co-occurrences and figure out what text really means.”
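The linear algebra behind this explanation can be sketched compactly. The toy below (a hypothetical four-document corpus; CAAT's production engine is proprietary and far more elaborate) builds a term-document matrix, truncates its singular value decomposition, and answers a conceptual query by cosine similarity in the reduced space:

```python
# A minimal LSI sketch on a toy corpus (illustrative data and names only).
import numpy as np

docs = [
    "dog barks cat runs",         # pets
    "young dog puppy plays",      # pets
    "stocks fell market closed",  # finance
    "market rally lifted stocks", # finance
]

# Term-document matrix A: A[i, j] counts occurrences of term i in document j.
vocab = sorted({w for d in docs for w in d.split()})
row = {t: i for i, t in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[row[w], j] += 1.0

# Truncated SVD: keeping the top k singular triplets projects terms and
# documents into a k-dimensional concept space where co-occurring terms
# (and the documents that use them) land close together.
k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
doc_vecs = (np.diag(S[:k]) @ Vt[:k]).T   # one k-dimensional vector per doc

def fold_in(query):
    """Project a query into the concept space: q_hat = q^T U_k S_k^-1."""
    q = np.zeros(len(vocab))
    for w in query.split():
        if w in row:
            q[row[w]] += 1.0
    return (q @ U[:, :k]) / S[:k]

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

q = fold_in("puppy")
for j, d in enumerate(docs):
    print(f"{cosine(q, doc_vecs[j]):+.2f}  {d}")
# Both pet documents score near +1.0 -- including "dog barks cat runs",
# which never contains the word "puppy" -- while the finance documents
# score near 0. That is concept matching rather than keyword matching.
```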
LSI and Some Misconceptions

Throughout the different stages of LSI’s development and evolution, the challenges its inventors and developers had to overcome gave rise to some misconceptions about the LSI indexing technique. Most of these misconceptions date from its earliest days, when the technique was just beginning to evolve. They include:

»» LSI is slow and does not scale.
»» LSI is expensive to implement and maintain.
»» LSI is not defensible.
»» LSI does not differentiate semantic nuances.
»» LSI cannot replace human inspection of documents.

One of the biggest misconceptions about LSI technology is that it is slow and non-scalable when presented with large volumes of documents. We can trace this misconception back to the days of less powerful, more expensive hardware. Sometimes a technology is ahead of its time and uses techniques to which the surrounding hardware must catch up. The reality is that LSI was invented and patented at a time when the sophisticated math the technique requires consumed vast resources on the limited hardware available. Hardware constraints admittedly slowed down not only the indexing process but also the text analytics functions carried out post-indexing. As already touched upon, though, the rapid reduction in hardware costs and the huge gains in performance witnessed over the past five years (resulting in inexpensive many-core processors and cheap memory) have eliminated these problems. Not only does an LSI engine now have vast hardware resources available to it on a wide range of servers, but its ongoing evolution has also produced distributed indexing capabilities and load-sharing deployments. Concept searches typically return results in under a second, and hundreds of thousands of documents can be classified, organized, and tagged in mere minutes, as opposed to the days or weeks required for humans to assess large volumes of information. LSI performs its functions rapidly on very affordable hardware.8

The associated perception that LSI is expensive to install and maintain is refuted by the same explanation. Cheap hardware, extensible features, and deployment best practices learned along the way all contribute to an economical answer to anybody’s advanced text analytics needs. For the value it provides, LSI is a compelling indexing technology for concept-based analytics.

Microchip Flashback

In 1990, Intel® introduced the 33 MHz 486 microprocessor with a processing speed of 27 MIPS. Today’s Intel Xeon® chip runs at 4.4 GHz, roughly 133 times the clock rate of its 1990 ancestor, and is orders of magnitude more powerful than the state-of-the-art 486 was at the time.
Another misconception plays upon human fears of automation by questioning how defensible the results of LSI indexing and concept-based analytics are. Humans are always suspicious when “the machine” encroaches upon tasks and abilities thought to be best performed by intelligent, well-trained people. Because LSI effectively removes humans from the process of reading and assessing the content of documents, critics can easily play upon the fear of the unknown: after all, if no humans read all the documents, who really knows what’s in them? This fear is understandable but has no basis in research or documented observation. The algorithms and proprietary technology that power LSI indexing are well documented and can be defended by the principles of advanced mathematics. The reality is that LSI’s approach, driven by sophisticated analysis of total word co-occurrence, can be defended from both a mathematical and a linguistic perspective.

A particularly erroneous claim is that LSI cannot detect word similarities or semantic variations. This misconception insists that LSI cannot distinguish between “cool” as an indicator of temperature and “cool” as a qualitative judgment. The claim is patently false. As a matter of fact, LSI is ideal for penetrating the mysteries of semantics—including synonymy and polysemy—and, based on its core approach (term co-occurrence), actually resolves semantic quandaries in the same way humans do.9 If one person says to another, “this is cool,” the recipient might not immediately understand what is being indicated, especially if the speaker has touched something that might actually be cold. The hearer might ask for clarification with something elegant like “what is?” The first person might then elaborate: “This bronze statue is really cool. It’s quite post-modern.” With the additional words in the follow-up, the speaker provides enough term co-occurrence for the hearer to understand the meaning, which has nothing to do with temperature. Conversely, a metal smith casting the same bronze statue might also say “this is cool” to refer to temperature, indicating that the statue is ready to be handled without getting burned. LSI analysis of text works exactly like this and mirrors the human ability to interpret meaning based on term co-occurrence.

A final misconception involves the question of technology replacing human assessment of document content. It is human nature to resist technologies and processes that we don’t fully understand or that we feel are replacing us, but that does not mean the technology itself is not effective or appropriate to implement. With the nearly exponential explosion in the volume of enterprise data, technology must replace the inadequate solutions provided by earlier text analytics techniques and expensive human activity. The knowledge management market has indeed reached the inflection point where inexpensive hardware coupled with the advanced capabilities of CAAT™ creates a compelling counter-argument for the technophobic.

LSI and CAAT™

As with all other vendors of search and text analytics technologies, Content Analyst Company has had to focus on the tenets of precision, speed, and flexibility.
Overcoming the inherent obstacles found within text that distract CAAT™ and hamper its ability to determine document conceptuality was also a necessity. For example, filtering out header and footer information in emails is necessary in order to extract the useful authored content in these types of documents. Considerable research has gone into preparing document text for more precise conceptual analysis.

In the earlier days of the technology, the inventors and developers had the additional problem of less powerful but much more expensive hardware. The math required to perform LSI indexing is not insignificant, so the workstations and servers of the 1990s and early 2000s had to be robust, with as much memory (RAM) and CPU horsepower as possible. Furthermore, the 32-bit processors and operating systems prevalent at that time could not address large amounts of RAM. Until the inflection point of more powerful but less expensive hardware arrived in the mid-2000s, the ongoing development of the CAAT engine focused on refining the code base for more accurate and speedier functionality. Distributed subsystem deployment and text filtering capabilities were also incorporated along the way, the latter of which suppresses extraneous or “garbage text” during indexing.
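A simplified stand-in for that filtering step might look like the following (the header fields and footer markers here are invented examples, not CAAT's actual rules):

```python
# Toy pre-index filter: suppress email headers and boilerplate footers so
# that only the authored content reaches the indexing engine.
import re

HEADER = re.compile(r"^(From|To|Cc|Subject|Date|Sent):", re.IGNORECASE)
FOOTER_MARKERS = ("-- ", "CONFIDENTIALITY NOTICE", "This email and any attachments")

def authored_content(raw_email: str) -> str:
    body = []
    for line in raw_email.splitlines():
        if HEADER.match(line):
            continue                  # drop header fields
        if line.startswith(FOOTER_MARKERS):
            break                     # stop at the boilerplate footer
        body.append(line)
    return "\n".join(body).strip()
```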
Finally, over the years the evolution of CAAT grew to encompass not only traditional concept search and concept-based document classification (multiple techniques and optimizing algorithms), but also dynamic clustering, document summarization, primary language identification, and advanced text comparisons (for thread detection in emails and for identifying duplicate text). CAAT is now an advanced text analytics platform with dozens of discrete analytical capabilities that developers can assemble in combination to creatively add concept-based text analytics to larger software solutions.

Key Capabilities

The key features of CAAT include concept-based search and document organization/classification. Concept search and concept-based document classification are far more powerful than keyword-based approaches, for the reasons discussed previously. The user no longer has to know the “right” words to use when submitting queries, and documents that are related to each other conceptually can be grouped together regardless of whether they share the same terminology. Document classification is particularly attractive for software vendors who need workflow automation and document routing within their solutions—such as enterprise content management, enterprise archiving, compliance, and e-discovery vendors. The ability to rapidly organize large volumes of documents based on their relevance to each other, or to an overarching category, reduces or eliminates costly human document inspection.

With the reduction of human document inspection come the benefits of more precise classification results. Properly trained text analytics software such as CAAT assesses the conceptual and thematic content of documents more objectively and consistently than people do—it does not get tired (and therefore increasingly inconsistent), and it does not get tripped up by the interpretive nuances of language that plague human inspection.

Because of the importance of classification, CAAT includes two major ways to organize documents: clustering and categorization. These two modes of document classification differ in the amount of human intervention required to establish and define the organizational structure, known as the taxonomy. In theory, a taxonomy is nothing more than a set of discrete organizational units—arranged in either a flat or a hierarchical manner—into which documents can be placed, along with the rules dictating the type of content a document must contain in order to qualify for a category. In practice, taxonomy development and maintenance is an enormous undertaking for enterprises, requiring highly specialized knowledge workers (known as taxonomists) and their support staff. CAAT’s ability to cluster documents into an automatically generated taxonomy, or alternately to accommodate the more refined process of human-trained document categorization, means that nearly any automated workflow requiring classification can be supported.

The value propositions for these two classification methods are compelling. Automated clustering provides rapid organization of documents so that users can quickly understand the conceptual composition and distribution within large document sets.
Clustering automatically classifies documents based on each document’s predominant concept or theme—it also creates the taxonomy structure and category naming scheme without human intervention. Because the unsupervised clustering feature is dynamic and can run just-in-time within a solution’s workflow, it is ideal when quick, general insight into a document set is required.

“Documents that are related to each other conceptually can be grouped together regardless of whether they share the same terminology.”
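In spirit, though certainly not in implementation detail (which Content Analyst does not publish), unsupervised clustering can be pictured as grouping document vectors in the LSI concept space. This sketch reuses doc_vecs from the earlier toy LSI example and applies a plain k-means loop; the distance computed at the end corresponds to the “document score” idea captioned in Figure 1.1 later in the paper:

```python
# Toy unsupervised clustering in the LSI concept space (reuses doc_vecs
# from the earlier sketch; illustrative only).
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each document to its nearest cluster center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned documents.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

labels, centers = kmeans(doc_vecs, k=2)
# Distance to the cluster center is the "closeness" score of Figure 1.1:
# a smaller distance means the document is more central to the concept.
scores = np.linalg.norm(doc_vecs - centers[labels], axis=1)
```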
Categorization, on the other hand, allows users to participate in the taxonomy development and analytics training process. As with all other types of supervised document classification technology, categorization demands more upfront effort during taxonomy development and the requisite learning process that defines the categories for CAAT. The compelling benefit is a much more refined and accurate result set placed into pre-defined categories of interest. Categorization is a powerful solution in which the best of human insight and software efficiency can be combined to yield the most accurate classification results possible.

CAAT also provides a term expansion feature, which can detect, for any word, all the other terms in the indexed document set that are either highly correlated or synonymous with it. Using the “dog” example from earlier, CAAT would also identify “husky,” “spaniel,” “mutt,” “pup,” and “man’s best friend” as highly correlated or synonymous terms.

Being a math-based indexing engine, CAAT has no native understanding of language; the benefit of this is complete language agnosticism. Indexing German documents enables the same analytics features and yields the same accurate results as indexing documents in English, Chinese, or Arabic. Despite being language agnostic, CAAT can detect the differences between languages through its term analysis, and it can therefore identify the primary language of a document. All of these features can be invoked whenever needed in a larger software platform to increase the findability of documents and improve the accuracy and relevance of document classification.

Learning More About CAAT

The power of CAAT and its LSI indexing technology has been integrated into dozens of software solutions in a number of different markets. To learn more about these success stories, go to www.contentanalyst.com.

Figure 1.1: CAAT finds tight groupings of concepts; groups and subgroups are then picked based on settings. Numeric values help interpret results, and document scores indicate closeness to the center of the clusters.

Figure 1.2: Example documents plus a threshold define “hit spheres.” Documents in the search index that fall within the “hit spheres” get categorized.
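The “hit sphere” categorization of Figure 1.2 and the term expansion feature both reduce, conceptually, to neighborhood tests in such a space. The sketch below continues the earlier toy LSI example (the function names and the 0.8 threshold are illustrative inventions, not CAAT parameters):

```python
# Continues the earlier LSI sketch (reuses doc_vecs, U, S, k, vocab, row).
import numpy as np

def categorize(doc_vecs, example_ids, threshold=0.8):
    """Hit-sphere categorization: example documents define a category
    center; documents whose cosine similarity to that center clears the
    threshold fall inside the sphere and receive the category."""
    center = doc_vecs[example_ids].mean(axis=0)
    center /= np.linalg.norm(center)
    sims = doc_vecs @ center / (np.linalg.norm(doc_vecs, axis=1) + 1e-12)
    return [(j, s) for j, s in enumerate(sims) if s >= threshold]

def expand_term(term, top_n=3):
    """Term expansion: terms whose reduced-space vectors point in nearly
    the same direction as the query term are its close correlates."""
    term_vecs = U[:, :k] * S[:k]          # one k-dimensional vector per term
    v = term_vecs[row[term]]
    sims = term_vecs @ v / (
        np.linalg.norm(term_vecs, axis=1) * np.linalg.norm(v) + 1e-12)
    ranked = sorted((s, t) for t, s in zip(vocab, sims) if t != term)
    return [t for s, t in reversed(ranked)][:top_n]

print(categorize(doc_vecs, example_ids=[2]))  # both finance docs qualify
print(expand_term("dog"))                     # e.g., other pet-context terms
```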
About Content Analyst Company

We provide powerful and proven advanced analytics that dramatically reduce the time needed to discern relevant information from unstructured data. CAAT, our dynamic suite of text analytics technologies, delivers significant value wherever knowledge workers need to extract insights from large amounts of unstructured data. Our capabilities are easily integrated into any software solution, and our support strategy for our partners is second to none.

References

1 Frank Ohlhorst, “The Promise of Big Data,” InfoWorld, September 2010.
2 John Markoff, “Armies of Expensive Lawyers, Replaced by Cheaper Software,” New York Times, March 4, 2011.
3 C. Korycinski and Alan F. Newell, “Natural Language Processing and Automatic Indexing,” The Indexer, April 1990.
4 Mark Liberman, “Linguists Who Count,” Language Log, May 28, 2009.
5 Yangqiu Song et al., “Short Text Conceptualization Using a Probabilistic Knowledgebase,” Proceedings of the 22nd International Joint Conference on Artificial Intelligence, 2011.
6 Roger Bradford, “Comparability of LSI and Human Judgment in Text Analysis Tasks,” Proceedings of the Applied Computing Conference, September 2009.
7 Bradford, “Comparability of LSI and Human Judgment in Text Analysis Tasks.”
8 Roger Bradford, “Implementation Techniques for Large-Scale Latent Semantic Indexing Applications,” Proceedings of the 20th ACM International Conference on Information and Knowledge Management, October 2011.
9 Bradford, “Comparability of LSI and Human Judgment in Text Analysis Tasks.”