Search vs Text Classification



Is search always the right solution? There are many things you can do with a hammer, but it’s not so great if you need to turn a screw.

Text Classification is an alternative to search that may be more appropriate for social media data analysis. Text classification is the task of assigning predefined categories to free-text documents. It can provide conceptual views of document collections and has important applications in the real world. Using text classification as the foundation for analysis – i.e., teaching a machine to categorize posts the way humans do – can dramatically improve your ability to gather the right data and, ultimately, increase the chances that you’ll uncover what you need to know.

Published in: Technology, Business


  1. White Paper: Search vs. Text Classification. Increasing the signal, decreasing the noise. 1 West Street, New York, NY 10004 | 646-545-3900
  2. White Paper, Networked Insights: Search vs. Text Classification. Increasing the signal, decreasing the noise

Since the advent of the World Wide Web, businesses and consumers have used a variety of ways to find information. These various methods of discovery have trained us to think and behave in ways that make understanding analytics challenging. In fact, what makes retrieving information easy for individuals is not the manner in which we should examine social data. Confused?

In the infancy of the commercial public Web, navigation was nearly impossible without directories and then information portals. With the explosion of the Web in the late 1990s, keyword searching and search engines have become as ubiquitous as the Internet itself. While the underlying methods of search have evolved over the years, its primary use has stayed constant since the early days of companies like Yahoo!, Altavista, Lycos, Excite and Google. Reflecting its mass popularity and understanding, search is often the first tool applied to a wide variety of data challenges.

But is search always the right solution? There are many things you can do with a hammer, but it’s not so great if you need to turn a screw.

To learn what customers think about your products and services, you may need to apply sentiment analysis across millions of social media posts. Or, to guide your media buying, you might use topic discovery to uncover market trends in the social conversation.

In either case, using search to identify the set of posts you’ll submit to scrutiny could send your social media analysis down the wrong path from the start. Your approach to conducting sentiment analysis or topic discovery could be spot on. But if it’s based on a number of posts that aren’t actually about what you think they are, which typically happens with search, the noise created can flaw the inferences and conclusions you ultimately draw.

Text classification is an alternative to search that may be more appropriate for social media data analysis. Text classification is the task of assigning predefined categories to free-text documents. It can provide conceptual views of document collections and has important applications in the real world. Using text classification as the foundation for analysis (that is, teaching a machine to categorize posts the way humans do) can dramatically improve your ability to gather the right data and, ultimately, increase the chances that you’ll uncover what you need to know.

Sidebar. Topic discovery: letting data speak for itself

Topic discovery is a valuable type of semantic analysis based on text classification. Whereas sentiment analysis simply reveals people’s likes and dislikes, semantic analysis refers to a group of methods that allow machines to discover the fundamental patterns of words or phrases that act as building blocks in a large set of text. Topics, themes, sentiment and similar elements of meaning appear as intricate weavings of those fundamental patterns. So semantic analysis is the summarization of large amounts of text by automatically discovering the topics and themes within.

By grouping social media posts based on semantic similarity, rather than preset sentiment categories such as positive, negative and neutral, topic discovery can help companies uncover important information: for example, what exactly people are saying about a product or service; where and how they use it; the features they use most; and the enhancements or new offerings they’re interested in. All of this information can ultimately drive product development, new revenue streams and strategies for marketing, advertising and media planning.
  3. The impact of bad data

A look at several related but distinct topics illustrates how seriously the problems of search can impact analysis.

A Networked Insights analyst designed search queries for five topics that moms typically discuss: pregnancy and newborns; school-aged children; food, nutrition and health; shopping and money; and illness and injury. Searches were run on the five topics, then another analyst reviewed the results under two test scenarios to determine how well the search delivered posts fitting the intended criteria as defined by the query.

In the first test, the analyst reviewed only the top 20 results returned by each search as ordered by the search engine. In the second test, the analyst reviewed a random sample of 200 results returned by the search. In each case, the analyst was asked to judge whether each resulting post was appropriate for the intended category or if it fit better in a different one. The percent of appropriate posts is a measure of the “precision” of the search.

The test results (Table 1) reveal search’s severe limitations. Precision was high when only the top 20 results were examined (90 percent or higher), but fell precipitously when a larger number of randomly sampled posts was examined. In only one search, pregnancy and newborns, did the results yield a somewhat reliable level of precision (86.5 percent). In three of the five searches, precision rates were under 50 percent.

In practical terms, these results mean there’s a greater chance that a randomly selected search result will not meet the intended criteria than that it will. Said another way, search might be used to support other analyses by returning a large number of posts assumed to cover the same basic topic. The problem: the majority of the data isn’t relevant to the topic you want to understand.

Table 1. Keyword Search Precision

Desired Topic              Top 20 Results Only   Random Sample
Pregnancy and newborns     95%                   86.5%
School-aged children       95%                   19.5%
Food, nutrition, health    90%                   39.5%
Shopping and money         100%                  57.5%
Illness and injury         100%                  41%
Overall                    96%                   48.8%
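The precision figures in Table 1 can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes each random sample held exactly 200 reviewed posts per topic and back-calculates the number of on-topic posts from the published percentages. The names `precision`, `on_topic_counts` and `topics_below_half` are hypothetical and do not come from the white paper.

```python
SAMPLE_SIZE = 200  # assumption: 200 reviewed posts per topic

# Random-sample precision (percent) as published in Table 1.
random_sample_pct = {
    "Pregnancy and newborns": 86.5,
    "School-aged children": 19.5,
    "Food, nutrition, health": 39.5,
    "Shopping and money": 57.5,
    "Illness and injury": 41.0,
}


def precision(on_topic, reviewed):
    """Fraction of reviewed posts judged appropriate for the intended topic."""
    return on_topic / reviewed


# Back-calculated counts of on-topic posts (an assumption, not published data).
on_topic_counts = {
    topic: round(pct / 100 * SAMPLE_SIZE)
    for topic, pct in random_sample_pct.items()
}

# Pooled precision across all five samples.
overall = precision(
    sum(on_topic_counts.values()),
    SAMPLE_SIZE * len(on_topic_counts),
)

# Topics where a randomly chosen result is more likely off-topic than on-topic.
topics_below_half = [
    topic
    for topic, count in on_topic_counts.items()
    if precision(count, SAMPLE_SIZE) < 0.5
]

print(f"Overall random-sample precision: {overall:.1%}")
print(f"Topics under 50% precision: {len(topics_below_half)}")
```

Under these assumptions the pooled figure is 48.8 percent, which agrees with the Overall row of the table, and three topics fall below 50 percent.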
  4. The shortcomings of search

By definition, the intent of search is to uncover the best responses to a query. A search engine goes out and grabs hundreds of thousands of posts that match the word or phrase programmed into the query and attempts to rank them in order of relevance. Its goal is to put the post most likely to be the one you’re looking for at the top of the list. The search engine does this effectively, as seen in the first column of results in Table 1.

Significant problems arise with search when you’re after a broad collection of similar posts, not a handful of the best ones. This is often the case in social media analysis, when the goal is to analyze millions of posts to identify trends that can inform marketing decisions or uncover insights that can reveal business opportunities. Simply stated, more data points are sometimes much better than a few. In these cases, search will undermine your efforts. The first 20, or even 200, posts might be great matches. But the last 20 or 200 might not match at all, as seen in the second results column of Table 1.

Search methodology has other significant shortcomings, which are more apparent when it’s applied to social media data than when used with other, more structured forms of text. For example, search struggles when you’re looking for something more complicated than whether or not a document contains a particular word or phrase. Search cannot contemplate the context of how words and phrases are used in relationship to one another; it can only identify whether or not that word or phrase is present.

Search also suffers a bias problem. If the searcher uses words that are not a direct reflection of the words that millions of other people use for a given topic, search can’t accommodate the differences.

To sum up the problems, search does not inherently provide a mechanism for determining which results should belong to the desired group and which should not. The norm is to simply say that all posts that match a query belong to the desired topic and use all of them in further analyses.

A better way: the power of classification

In contrast to search, text classification uses machine-learning algorithms to learn from a set of examples how to separate posts into topics. If an algorithm, or program, is presented with examples of how a human would separate posts based on topic, it can learn to mimic that person’s process on new, previously unseen posts. One major advantage of this approach is that the program can scale up to perform its process on millions of documents. People do not scale up so easily.

Classification offers the potential to produce a dataset in which all of the posts are relevant to the topics being analyzed. The last 20 are as valuable to the analysis as the first 20.
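As a rough illustration of learning from labeled examples, the sketch below trains a tiny multinomial Naive Bayes classifier with add-one smoothing in plain Python. Everything in it is an assumption made for illustration: the six training posts, the label names and the functions `tokenize`, `train` and `classify` are hypothetical, and Naive Bayes is simply one commonly used algorithm, not necessarily the method behind the white paper’s results.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training posts
    vocab = set()
    for text, label in examples:
        doc_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            vocab.add(word)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the post."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in doc_counts.items():
        score = math.log(n_docs / total_docs)  # prior for this label
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-labeled posts: relevant and irrelevant examples.
training_posts = [
    ("our little one came home from the hospital today", "pregnancy_newborns"),
    ("packing the hospital bag for the baby", "pregnancy_newborns"),
    ("the newborn slept all night", "pregnancy_newborns"),
    ("visited grandpa in the hospital after his surgery", "other"),
    ("great sale on shoes at the mall today", "other"),
    ("my car broke down again", "other"),
]

model = train(training_posts)
print(classify("bringing our little one home from the hospital", model))
print(classify("great sale on shoes", model))
```

A keyword search for “hospital” would return the post about grandpa’s surgery alongside the newborn posts. The classifier instead weighs all the words in a post together, so the irrelevant training examples help it tell the two apart.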
  5. The classification process begins with a human analyst selecting a sampling of posts that relate to a specific topic, such as pregnancy and newborns. The analyst also selects posts that are irrelevant, so the algorithm being used can detect the difference. These posts serve as the training examples from which the machine will learn.

A variety of algorithms can be used for classification, including artificial neural networks, support vector machines and Naive Bayes algorithms. Selecting the right algorithm and tuning it are critical, as some do well at certain problems and not so well at others.

In the next step, the algorithm learns how to categorize new posts by reading the example posts and identifying general rules that differentiate the relevant and irrelevant posts. For example, when the program sees the phrases “little one” and “hospital” together in a post, it might notice that the probability the post belongs to the pregnancy and newborns category increases significantly. It then uses this knowledge in categorizing other posts. The goal is not to memorize the training examples, but to find general characteristics that help the algorithm categorize new posts.

Table 2 adds a third column to Table 1 that shows the result of using classification instead of search to identify posts presumably related to the five mom topics. The analysis approach for classification was the same as that applied to the search precision test. An independent analyst reviewed 200 randomly sampled results from classification and determined whether or not they matched the intended topic. The improvement over the search precision test is dramatic. The overall precision of using classification was 86 percent vs. 49 percent using search across all posts. For one topic (food, nutrition and health) precision rose from 39.5 percent with search to 100 percent through classification.

Table 2. Precision of Using Classification to Identify Posts in Comparison to Search

Desired Topic              Top 20 Results Only   Random Sample   Classification
Pregnancy and newborns     95%                   86.5%           88%
School-aged children       95%                   19.5%           72%
Food, nutrition, health    90%                   39.5%           100%
Shopping and money         100%                  57.5%           87%
Illness and injury         100%                  41%             83%
Overall                    96%                   48.8%           86%

Classification clearly provides greater precision in social data analysis. It offers deeper insights, both on a broad scale and when drilling into specific topics, than can be gleaned from standard search techniques.

Sidebar. Creating a stronger signal

Millions of people use search every day to find what they’re looking for online. But search can send you off into the social media wilderness if you’re using traditional monitoring tools to discover conversations and trends. So stop searching. Instead, start asking how real-time data can support your existing decision-making processes and then use classification techniques to cut through the noise and sharpen your social analysis.

Questions about this report? Want a free consultation on how social data can improve your media planning and other marketing? Contact us: 646-545-3900, info@networkedinsights.com

© 2011 Networked Insights, Inc. All rights reserved.