IDC Analyst Connection Connotate Diving Deep Outside the Firewall for Market Research Insights


Published: October 2012
For many enterprises, Big Data is now a mainstream concern, as evidenced by changes in organizational structure and budgets to focus on this area. However, most enterprises have yet to tap into the vast resource of data outside the firewall to incorporate Web-based Big Data in real time. The Web provides a lot of data that can be useful to market research efforts, particularly if organizations go beyond analyzing quantitative data such as statistics or demographics and look at customer sentiment as revealed in comments on product reviews as well as posts on social networks.
The following questions were posed by Connotate to David Schubmehl, research manager at IDC, on behalf of Connotate's customers.

David Schubmehl, Research Manager
October 2012

Q. How are enterprises missing out by failing to tap into the Web?

A. The Web has become a global repository that contains over 8 billion pages of unstructured information, ranging from news and social media to research and philosophical treatises. The Web is a tremendous source of information about an enterprise's prospects, customers, and competitors, which is why leading organizations are making heavy use of the Web as a research tool. Survey research indicates that global CEOs are looking to Big Data on the Web to understand their customers and build engagement models with existing and prospective customers. Where enterprises are missing out is by failing to tap into the tremendous amount of social media information on the Web. Many organizations are beginning to understand that their customers are out there talking about them on the Web and on social media sites, yet they don't have a very good handle on how to collect all of that information.
As a result, many companies are missing opportunities because they aren't aware of, or don't understand, the conversations, both good and bad, that are going on about them, particularly in focused blogs and online user group communities. By tapping into these specialized online sources (not just Twitter and Facebook), companies can better understand what their customers are saying, thinking, or looking for regarding specific products and services. Just think of all the product reviews that are posted on the Web. Companies can gain a lot of insight into customer sentiment by tapping into this information.
On a similar note, organizations can make use of the wealth of competitive information on the Web. Competitor product data, prices, reviews, and even comparisons can be found there. In the same manner that organizations can tap into the "voice of the customer," they can also tap into their competitors' data to understand and compete more effectively. These are just a few examples of the valuable data that is out there waiting for organizations willing to go find it and collect it.

Q. What are the benefits, and challenges, of using Web-based data to fuel customer sentiment analysis in market research?

A. The benefits of Web-based data revolve around three factors: timeliness, legitimacy, and aggregation. Data collected from social media sites, product review sites, and other sources can be very current and can even provide up-to-the-minute feedback. Still, it is a challenge to collect that information in a manner that is as close to real time as possible, and to determine which kinds of feedback will evolve over time and therefore be more valuable for trend analysis. For many organizations, trend analysis is extremely valuable and can provide long-term benefits.

Legitimacy is also a major factor. Are the review and the sentiment real? Is someone posting to share true feelings about a product, or is it a competitor looking to sabotage reviews? Perhaps a reviewer is being paid to say something positive, which could skew results, so how can an organization identify the unpaid reviews? All of these factors can be challenging to quantify.

Finally, a wide variety of customer reviews and opinions needs to be collected in order to accurately gauge customer sentiment, especially if the collection is being done automatically. Small samples can skew results and analysis.
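The small-sample caveat above can be made concrete by putting a confidence interval around the share of positive reviews rather than reporting a bare percentage. A minimal sketch, using the Wilson score interval (a standard statistical choice; the source does not prescribe a particular method, and the review counts here are made up):

```python
import math

def positive_share_interval(positive, total, z=1.96):
    """Wilson score interval for the share of positive reviews.

    Wider intervals on small samples make explicit how easily a
    handful of reviews can skew an apparent sentiment score.
    """
    if total == 0:
        raise ValueError("no reviews collected")
    p = positive / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(
        p * (1 - p) / total + z * z / (4 * total * total)
    )
    return max(0.0, centre - margin), min(1.0, centre + margin)

# 8 of 10 reviews positive vs. 800 of 1,000: the same 80% point
# estimate, but very different confidence about true sentiment.
small = positive_share_interval(8, 10)
large = positive_share_interval(800, 1000)
```

Reporting the interval alongside the point estimate is one way an automated collector can flag sites where too few reviews have been gathered to support a conclusion.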
Most organizations would like to collect as many comments, from as many relevant sites, as possible. The problem is that the number of sites that may have valuable content is expanding at a tremendous rate. It's a challenge for an organization to collect all this information and pull it together in a way that is useful, which is why aggregation into a single structure is important. It's relatively easy to pull things from a Twitter stream or a Facebook feed, but organizations often have to contend with all of the other sites that are out there, and this is where this type of data collection can become complicated.

The fragility and the rate of change of content on the Web pose an additional challenge. Web sites change constantly: pages are moved or modified, and content is added or deleted on a regular basis. Less robust approaches to collecting Web data will "break" and cease to return valid output when a change is made to the target Web page. A fragile system delivers only a fragment of the value when Web content changes and doesn't allow for time series analytics. A more robust solution is resilient to change and, in the long run, delivers higher value.

Q. What is "deep" Web data, and why is it more valuable than "surface" Web data?

A. Deep Web data, which builds on IDC's traditional definition of Web data, is typically data that can't be crawled or accessed at all except through some kind of authentication process. A typical place for such data is a document management system that is available via the Web, but only after authentication. However, many organizations now view the deep Web as the layers below the surface of a typical Web site. For example, the comments section of a Web-based ecommerce site might be buried 30 or 40 levels deep within the organization's Web site; some types of crawlers and aggregators would not easily find this type of information.
Organizations often want to look at this information because there can be real value in it. What is hidden deep within the system can often reveal more insights than data at the surface level. The ability to ferret out all of the information contained in the deep Web will be more valuable to organizations than looking only at what is easily crawled at the surface.

Q. How can enterprises tap into deep Web data, and what are the stumbling blocks to doing this? What complementary technologies should they consider, and/or how can they simplify this process?

A. The barriers to accessing deep Web data typically involve the inability to obtain that data through a standard RSS feed or a standard Twitter API feed. Organizations may collect information at this surface level, but extra processing is often required, as in the case of shortened URLs. Different shortening techniques are used to compress Web site locators into the 140-character maximum length of a Twitter or RSS stream. One approach is to use technology that can resolve these shortened URLs and then follow them down 20, 30, or 40 levels, however many levels it takes to get at the relevant information. Technologies are available today that can help automate this process, and they are worth considering for extracting value from deep Web data. There are also technologies that include an authentication method for sites that require a user ID and a password. The actual crawling is then automated in the system, as if an end user were pulling up the information and manipulating and extracting it. The data can then be handed off to another system, such as sentiment analysis or content analytics, to actually understand what is being said on that page or in that set of comments.
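The URL-resolution step described above amounts to following a chain of redirects with sensible safety limits. A minimal sketch; the `lookup` callable and the shortener table are hypothetical stand-ins for live HTTP requests (in a real collector, `lookup` would issue a HEAD request and read the `Location` header):

```python
def resolve_short_url(url, lookup, max_hops=10):
    """Follow a chain of shortened/redirecting URLs to the final target.

    `lookup` maps a URL to its redirect target, or None when the URL
    is final. A hop limit and a seen-set guard against redirect loops,
    which do occur in the wild on link-shortening services.
    """
    seen = set()
    for _ in range(max_hops):
        if url in seen:
            raise ValueError(f"redirect loop at {url}")
        seen.add(url)
        target = lookup(url)
        if target is None:
            return url
        url = target
    raise ValueError("too many redirects")

# Simulated shortener table standing in for live HTTP redirects.
table = {
    "http://sho.rt/a": "http://sho.rt/b",
    "http://sho.rt/b": "http://example.com/reviews?page=3",
}
final = resolve_short_url("http://sho.rt/a", table.get)
```

Separating the lookup from the traversal logic also makes the resolver easy to test without network access, which matters when the collection runs unattended at scale.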
Once you have identified relevant data sources and the technologies required to access them, the next step is to identify the technologies needed to extract the valuable information, such as product numbers, prices, descriptions, comments, and other fields; normalize that information; and then place it into some kind of structured repository such as a database or search system. These tools often have to be tailored to the kinds of Web data being collected, but they are absolutely essential to the process of deep Web data collection.

Q. What are some specific use cases and vertical market applications for deep Web data?

A. From a market research standpoint, there are many different applications where deep Web data can be used to gain insights. A manufacturer of 35-inch large-screen TVs, for example, could use deep Web extraction technology to pull pricing information from other Web sites or from Web-based catalogs. This software can automatically collect product and pricing information from vendors such as Wal-Mart, Target, Best Buy, and many others. These types of applications collect all of the relevant information, extract it, aggregate it, and then place the data in one or more relational database tables. A TV manufacturer using this type of system could find out the current prices for TVs and could also go back to previous months or even years to determine pricing trends.

Another potentially interesting application is in the pharmaceuticals industry. A pharmaceutical manufacturer can see what prices are charged for its products on targeted Web sites anywhere in the world. If products are being sold below market value in one part of the world, this can indicate black market– or potentially white market–type sales activity. A manufacturer can look at these sites and at the aggregated data to try to understand why some locales are selling products at prices that seem to be below market level.
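One way the extract-normalize-store pipeline described above might look in code, using Python's built-in sqlite3 as the structured repository; the retailer names, model numbers, and prices here are made up for illustration:

```python
import re
import sqlite3

def normalize_price(text):
    """Turn scraped price strings like '$1,299.00' into a float."""
    cleaned = re.sub(r"[^\d.]", "", text)
    return float(cleaned) if cleaned else None

# Records as they might come off retailer pages, with inconsistent
# price formatting that normalization must smooth out.
scraped = [
    {"retailer": "RetailerA", "model": "TV-100", "price": "$1,299.00"},
    {"retailer": "RetailerB", "model": "TV-100", "price": "1199.99"},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (retailer TEXT, model TEXT, price REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (:retailer, :model, :price)",
    [{**r, "price": normalize_price(r["price"])} for r in scraped],
)

# With everything in one relational table, cross-retailer questions
# become simple queries.
lowest = conn.execute(
    "SELECT MIN(price) FROM prices WHERE model = 'TV-100'"
).fetchone()[0]
```

Adding a collection date column to such a table is what makes the historical pricing-trend queries mentioned above possible later on.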
Appliances are another common use case where deep Web data can be very useful. Perhaps consumers are looking at reviews for washing machines in an effort to weigh reliability against price before making a purchase. It would certainly be helpful for the manufacturers to understand what consumers are saying about their washing machines and what potential buyers might see if they went to these sites. Manufacturers can collect this deep Web information from all of these different sites, whether retail sites, repair sites, review sites, or competitor sites, to find out what people are saying about washers with regard to reliability, price, and even ease of use. Many similar use cases fall into this category.

Another market research use is trying to understand future buying patterns by conducting trend analysis. What's trending in terms of hot new smartphones, best-selling books, or video games? What are people talking about on Twitter and on social media Web sites? Are they talking about the latest weight loss medication approved by the FDA? Who is spending money, and where? IDC is seeing a lot of companies starting to think about trend analysis. Data supporting trend analysis can be used to design future products. A manufacturer can look to the Web to find out which features of a new phone are being discussed and which are being disparaged. This type of information is valuable to designers and engineers because it provides a view into what customers are actually thinking about when they use a product. Deep Web data has many uses in market research, and IDC expects that more and more organizations will make deep Web data collection and use part of their ongoing market research strategy.

ABOUT THIS ANALYST

Dave Schubmehl is research manager for IDC's search, content analytics, and discovery research. His research covers information access technologies, including content analytics, search systems, unstructured information representation, unified access to structured and unstructured information, Big Data, visualization, and rich media search.
This research analyzes the trends and dynamics of the content analytics, search, and discovery software markets and the costs, benefits, and workflow impacts of solutions that use these technologies.

ABOUT THIS PUBLICATION

This publication was produced by IDC Go-to-Market Services. The opinion, analysis, and research results presented herein are drawn from more detailed research and analysis independently conducted and published by IDC, unless specific vendor sponsorship is noted. IDC Go-to-Market Services makes IDC content available in a wide range of formats for distribution by various companies. A license to distribute IDC content does not imply endorsement of or opinion about the licensee.

COPYRIGHT AND RESTRICTIONS

Any IDC information or reference to IDC that is to be used in advertising, press releases, or promotional materials requires prior written approval from IDC. For permission requests, contact the GMS information line at 508-988-7610. Reproduction and/or localization of this document requires an additional license from IDC. For more information on IDC, visit the IDC Web site; for more information on IDC GMS, visit the IDC GMS Web site.

Global Headquarters: 5 Speen Street, Framingham, MA 01701 USA P.508.872.8200 F.508.935.4015