How to Mine Unstructured Data
In 2003, Information Management published a series of articles, starting with a missive titled “The Problem with
Unstructured Data.” In what seems to be a generation ago, the big problem identified was the inability of many
tools and techniques to make sense of unstructured data. Not surprisingly, much has changed in the last decade.
On one side, the explosion in data has accelerated, thanks to a combination of innovation (YouTube started in 2005
and today, users upload 35 hours of video every minute), falling costs of data storage and transmission (cheap Web
access over the phone lines has brought millions more to the Internet) and, perhaps most significantly, the
emergence of mobile technologies (mobile data traffic alone in 2010 was three times the entire size of the global
Internet in 2000). Most of this growth has been in the form of unstructured data, such as video files, audio clips,
blogs, posts and conversations on social networking platforms.
Spurred in part by this explosion in data, there has also been an impressive influx of technological innovations
aimed at making sense of unstructured data. As these technologies go mainstream, it has become increasingly
evident that unstructured data presents a new set of opportunities. Some of the possibilities:


Up until now, customer care centers have focused on traditional metrics, such as call volumes, call lengths
and other operational aspects, in order to better manage and improve call handling capabilities. But with
sophisticated natural language processing algorithms, call transcripts can now be mined to generate
important insights into what the customers are thinking about products and services.



Traditionally, customer surveys have been the only means for companies to reach out and understand
perceptions about their products and services. As any statistician worth his or her salt knows, any sample
runs the risk of incorrectly representing the overall population, in addition to the fact that surveys rarely
capture the customer experience at the moment of consumption. Enter the world of blogs, tweets and
social networking; these tools have given people a channel to express their perceptions about products and
services in near real time. Social media data is proving to be an invaluable source of information for
companies to understand customer sentiment as a feedback loop. In addition, it is a well-known fact that
social networks are dominated by a small set of key influencers. For companies trying to improve brand
perception, focusing on these key influencers is turning out to be far more effective than traditional
advertising. What used to be a mass of unstructured data, in the form of conversations and comments, is
turning out to be a highly effective window into the minds of customers.

Where Do We Go from Here?
It is no longer an option for companies to ignore the vast amounts of unstructured data that are being accumulated
within and outside their firewalls. And as technologies and data management techniques have evolved, mining for
gold in unstructured data is no longer an “occult science”; companies can approach it in a structured manner.
Create a platform (infrastructure and algorithms) to process large quantities of data. Unstructured data typically
appears along with its sibling, big data. It is very important to choose data sources with care. Companies typically
identify and store only data that solves business problems. However, there are benefits to conducting exploratory
analysis of unstructured data without a business purpose. For example, Twitter users generate 12 terabytes of data
every day. Given this scale, it is important to explore new data sources.
To handle unstructured data, many companies try to use modified versions of an RDBMS. RDBMS technology is more
than 30 years old and is not sufficient to handle either the complexity or the scale of unstructured data. Big data
needs scale: scalable storage, scalable computing, etc. The current favorite with most companies is the open source
Hadoop, one of many available tools. A host of vendors have created integrated offerings to process large amounts
of data.
Because creating scale is very expensive, building platforms in the cloud should be seriously
considered. Some companies have already announced support for Hadoop on their cloud offerings. There is also a
lot of focus on in-memory analytics that can support big data.
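As a rough illustration of the map/shuffle/reduce programming model that Hadoop popularized, the sketch below counts words across a handful of documents in plain Python. It does not use Hadoop or any of its APIs; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for this example. On a real cluster each phase would run in parallel across many machines.

```python
from collections import defaultdict


def map_phase(doc):
    """Emit a (word, 1) pair for every word in one document."""
    for word in doc.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values collected for each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(docs):
    """Run the three phases in sequence on an in-memory list of documents."""
    pairs = (pair for doc in docs for pair in map_phase(doc))
    return reduce_phase(shuffle(pairs))
```

Because each document is mapped independently and each key is reduced independently, the same logic can be spread over as many machines as the data requires, which is the scaling property the paragraph above refers to.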
Establish a machine-processable structure. Because the data is unstructured, it is important to apply some kind of
structure so analysis can take place. For example, a company collecting tweets may want to capture additional
information about the tweets, such as language, geography or user ID, to help in further processing. Since we are
dealing with unstructured data, the structure imposed on it may change drastically, based on initial results and new
data. It is important to note that the data may never be complete and organizations have to make tradeoffs in
accuracy. Establishing a structure can be achieved via algorithms or manually, through heuristics. Algorithms
reduce the accuracy but achieve scale. Humans, on the other hand, increase accuracy but reduce scale. It is
important to have a data model that is flexible, as well as capable of constantly evolving.
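One possible way to impose a light, evolvable structure on collected tweets is sketched below. The field names follow the examples in the text (language, geography, user ID); the TweetRecord class, the to_record helper and the "extras" catch-all are assumptions made for illustration, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class TweetRecord:
    """One raw tweet plus the metadata the text suggests capturing."""
    text: str
    user_id: str
    language: Optional[str] = None   # None when the language is unknown
    geography: Optional[str] = None  # None when the location is unknown
    # Catch-all for attributes discovered later, so the model can evolve
    # without a schema change.
    extras: Dict[str, Any] = field(default_factory=dict)


def to_record(raw: Dict[str, Any]) -> TweetRecord:
    """Map a loosely structured dict onto the record, tolerating gaps."""
    known = {"text", "user_id", "language", "geography"}
    return TweetRecord(
        text=raw.get("text", ""),
        user_id=str(raw.get("user_id", "unknown")),
        language=raw.get("language"),
        geography=raw.get("geography"),
        extras={k: v for k, v in raw.items() if k not in known},
    )
```

Missing fields are left empty rather than rejected, which reflects the point that the data may never be complete, and unrecognized fields are retained so that they can be promoted to first-class attributes as the analysis matures.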
Create smaller representative data sets. Not every stakeholder will have the capability to work on unstructured
data. It is important to democratize the results by creating smaller structured data sets that address specific
business needs. These data sets can then be queried using the regular methods of data analysis. Companies
should also take advantage of sampling. Many analyses can be performed to a reasonable level of accuracy with
much shorter turnaround times. A random sample can cut data size and cost by an order of magnitude and still
give reasonably accurate results. Often, the analyses performed on smaller data sets lead to insights
that can then be explored in depth using the complete data.
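The sampling idea can be sketched as follows: estimate a mean from a 10 percent simple random sample instead of scanning every value. The function name sample_mean, the default fraction and the fixed seed are assumptions for illustration; real data may call for stratified or weighted sampling instead.

```python
import random
import statistics


def sample_mean(values, fraction=0.1, seed=0):
    """Estimate the mean of `values` from a simple random sample.

    Returns the estimate and the number of items actually sampled.
    """
    rng = random.Random(seed)                 # fixed seed for repeatability
    k = max(1, int(len(values) * fraction))   # sample size, at least one item
    sample = rng.sample(values, k)            # sampling without replacement
    return statistics.mean(sample), k
```

For a roughly bell-shaped metric, the error of such an estimate shrinks with the square root of the sample size, which is why a tenth of the data is often enough for an exploratory answer before committing to a full run.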
Develop/acquire algorithms. There are several kinds of approaches to process unstructured data. Companies can
use heuristics, as well as machine learning techniques, to process data. For text mining, natural language
processing, combined with neural networks, allows companies to capture the sentiment of social media feeds.
Support vector machines can classify images, enabling companies to identify, for example, whether visitors are
customers or store associates. Techniques such as Bayesian networks can help discover patterns across multiple
dimensions, allowing organizations to learn by mining existing data. There are several algorithms that are freely
available through open source communities. However, it is important to note that you will almost never get it right
the first time. Companies should also invest in good visualization techniques. Given the volume of data,
visualization is a key factor in ensuring better consumption of insight from the data; often, visualization becomes
as important as the analysis.
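As a minimal example of the heuristic end of this spectrum, the sketch below labels short texts by counting words from two small hand-made lists. It is deliberately not the neural-network approach mentioned above; the word lists and the function names sentiment_score and label are hypothetical, and a real lexicon would be far larger and domain-specific.

```python
import re

# Tiny illustrative lexicons; assumptions for this sketch only.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}


def sentiment_score(text):
    """Return (# positive words) minus (# negative words) in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


def label(text):
    """Turn the numeric score into a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A heuristic like this misses negation and sarcasm, which illustrates why a first attempt is rarely right and why it is usually refined or replaced by a trained model once labeled examples are available.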
Develop policies to constantly review and purge data. Figuring out what data to delete is almost as important as
acquiring data. There are several startups that have created a business model out of data destruction. Companies
need to brutally prioritize what data they should retain. They should develop and implement strict data purge and
archival policies. The focus should be on increasing ROI and efficiency. Learning from the most recent information
should be a priority, so as to avoid historical biases.
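A purge-and-archive policy can be reduced to a simple, testable rule. The sketch below splits records by age against a retention window; the 90-day default, the "created_at" key and the function name split_by_retention are assumptions for illustration, and a real policy would also weigh legal and business requirements.

```python
from datetime import datetime, timedelta


def split_by_retention(records, now, retain_days=90):
    """Separate records to keep from records to purge or archive.

    Each record is a dict with a datetime under the key "created_at".
    Records created on or after the cutoff are kept.
    """
    cutoff = now - timedelta(days=retain_days)
    keep = [r for r in records if r["created_at"] >= cutoff]
    purge = [r for r in records if r["created_at"] < cutoff]
    return keep, purge
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps the rule deterministic and easy to review before any data is actually deleted.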
Gaining insight from unstructured data is difficult but not impossible. Recent interest and developments in the field
have made it easier for companies to work on this data and generate insight. Companies should start exploring
unstructured data as the next step in information management.
Krishna Rupanagunta is a Delivery Unit Head for Mu Sigma. In this role he is responsible for client delivery, team
development and providing thought leadership across projects. His background is in Business Consulting, servicing
Fortune 500 clients across multiple industries with specific focus on Supply Chain Optimization. Prior to joining Mu
Sigma, he was part of a non-profit that helped the Indian government in the conceptualization and design of the
largest citizen identity project ever attempted in the world. He has a deep interest in Behavioral Economics and
actively contributes to the Mu Sigma blog.
David Zakkam is a Senior Delivery Manager for Mu Sigma. He has 10 years of experience working in the analytics
industry. His current focus areas include Rapid Impact Analytics, Search Engine Monetization, Big Data and CRM. He
has an MBA from the Indian Institute of Management, Calcutta and received an Engineering degree from Indian
Institute of Technology, Delhi.
Harsha Rao is a Senior Manager at Mu Sigma where he oversees delivery for clients in the financial services and
technology space. His current interests lie in the space of Big Data as well as Risk Management. Prior to Mu Sigma,
Harsha was at a leading U.S. bank where he focused on credit risk management. He earned his Ph.D. in Statistics
from North Carolina State University.

Whitepaper on How to Mine Unstructured Data - Analytics Solutions from Mu Sigma

  • 1.
  • 2. How to Mine Unstructured Data In 2003, Information Management published a series of articles, starting with a missive titled “The Problem with Unstructured Data.” In what seems to be a generation ago, the big problem identified was many tools and techniques’ inability to make sense of unstructured data. Not surprisingly, much has changed in the last decade. On one side, the explosion in data has accelerated, thanks to a combination of innovation (YouTube started in 2005 and today, users upload 35 hours of video every minute), falling costs of data storage and transmission (cheap Web access over the phone lines has brought millions more to the Internet) and, perhaps most significantly, the emergence of mobile technologies (mobile data traffic alone in 2010 was three times the entire size of the global Internet in 2000). Most of this growth has been in the form of unstructured data, such as video files, audio clips, blogs and posts and conversations on social networking platforms. Spurred in part by this explosion in data, there has also been an impressive influx of technological innovations aimed at making sense of unstructured data. As these technologies go mainstream, it has becoming increasingly evident that unstructured data presents a new set of opportunities. Some of the possibilities:  Up until now, customer care centers have focused on traditional metrics, such as call volumes, call lengths and other operational aspects, in order to better manage and improve call handling capabilities. But with sophisticated natural language processing algorithms, call transcripts can now be mined to generate important insights into what the customers are thinking about products and services.  Traditionally, customer surveys have been the only means for companies to reach out and understand perceptions about their products and services. 
As any statistician worth his or her salt knows, any sample runs the risk of incorrectly representing the overall population, in addition to the fact that surveys rarely capture the customer experience at the moment of consumption. Enter the world of blogs, tweets and social networking; these tools have given people a channel to express their perceptions about products and services in near real time. Social media data is proving to be an invaluable source of information for companies to understand customer sentiment as a feedback loop. In addition, it is a well-known fact that social networks are dominated by a small set of key influencers. For companies trying to improve brand perception, focusing on these key influencers is turning out to be a far more effective than traditional advertising. What used to be a mass of unstructured data, in the form of conversations and comments, is turning out to be a highly effective window into the minds of customers. Where Do We Go from Here? It is no longer an option for companies to ignore the vast amounts of unstructured data that is being accumulated, within and outside their firewalls. And as technologies and data management techniques have evolved, mining for gold in unstructured data is no longer an ”occult science”; companies can approach it in a structured manner. Create a platform (infrastructure and algorithms) to process large quantities of data. Unstructured data typically appears along with its sibling, big data. It is very important to choose data sources with care. Companies typically
  • 3. identify and store only data that solves business problems. However, there are benefits to conducting exploratory analysis of unstructured data without a business purpose. For example, Twitter users generate 12 terabytes of data every day. Given this scale, it is important to explore new data sources.

To handle unstructured data, many companies try to use a modified version of an RDBMS. RDBMS technology is more than 30 years old and is not sufficient to handle either the complexity or the scale of unstructured data. Big data needs scale: scalable storage, scalable computing, etc. The current favorite with most companies is the open source Hadoop, one of many available tools. A host of vendors have created integrated offerings to process large amounts of data. Because it is very expensive to create scale, building platforms in the cloud should be seriously considered. Some companies have already announced support for Hadoop on their cloud offerings. There is also a lot of focus on in-memory analytics that can support big data.

Establish a machine-processable structure. Because the data is unstructured, it is important to apply some kind of structure so analysis can take place. For example, a company collecting tweets may want to capture additional information about the tweets, such as language, geography or user ID, to help in further processing. Since we are dealing with unstructured data, the structure imposed on it may change drastically, based on initial results and new data. It is important to note that the data may never be complete and organizations have to make tradeoffs in accuracy. Establishing a structure can be achieved via algorithms or manually, through heuristics. Algorithms reduce accuracy but achieve scale. Humans, on the other hand, increase accuracy but reduce scale. It is important to have a data model that is flexible, as well as capable of constantly evolving.

Create smaller representative data sets.
Not every stakeholder will have the capability to work on unstructured data. It is important to democratize the results by creating smaller structured data sets that address specific business needs. These data sets can then be queried using the regular methods of data analysis. Companies should also utilize the benefits of sampling. Many analyses can be performed to a reasonable level of accuracy, with much shorter turnaround times. A random sample can cut data size and cost by an order of magnitude and still give reasonably accurate results. Often, analyses performed on smaller data sets lead to insights that can then be explored in depth using the complete data.

Develop/acquire algorithms. There are several kinds of approaches to process unstructured data. Companies can use heuristics, as well as machine learning techniques, to process data. For text mining, natural language processing combined with neural networks allows companies to capture the sentiment of social media feeds. Support vector machines allow classification of graphics, which enables companies to identify whether visitors are customers or store associates. Techniques like Bayesian networks can help discover patterns across multiple dimensions, allowing organizations to learn by mining existing data. There are several algorithms that are freely available through open source communities. However, it is important to note that you will almost never get it right the first time.

Companies should also invest in good visualization techniques. Given the volume of data, visualization is a key factor in ensuring better consumption of insight from the data; often, visualization becomes as important as the analysis.
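
The two data-preparation steps described earlier, imposing a machine-processable structure on tweets and then working from a smaller random sample, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `TweetRecord` fields, the helper names, the product name "Acme" and the synthetic tweets are all made up, and a real pipeline would read from an actual feed and run on far more data.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class TweetRecord:
    """Hypothetical structured view of one raw tweet."""
    user_id: str
    language: str
    geography: Optional[str]
    mentions_product: bool


def to_record(raw: dict, product: str) -> TweetRecord:
    """Impose a simple structure on a raw tweet; missing fields get defaults."""
    text = raw.get("text", "")
    return TweetRecord(
        user_id=str(raw.get("user_id", "unknown")),
        language=raw.get("lang", "unknown"),
        geography=raw.get("geo"),  # stays None when the tweet has no location
        mentions_product=product.lower() in text.lower(),
    )


def mention_rate(records) -> float:
    """Share of records that mention the product."""
    return sum(r.mentions_product for r in records) / len(records)


def make_synthetic_tweets(n: int, rng: random.Random) -> list:
    """Generate made-up raw tweets; roughly 20% mention the hypothetical product."""
    tweets = []
    for i in range(n):
        if rng.random() < 0.2:
            text = "loving my new Acme phone"
        else:
            text = "nothing to see here"
        tweets.append({"user_id": i, "lang": rng.choice(["en", "es", "hi"]), "text": text})
    return tweets


rng = random.Random(42)  # fixed seed so the run is repeatable
raw_tweets = make_synthetic_tweets(100_000, rng)
records = [to_record(t, product="Acme") for t in raw_tweets]

# A 10% simple random sample: one order of magnitude less data to process.
sample = rng.sample(records, k=len(records) // 10)

full_rate = mention_rate(records)
sample_rate = mention_rate(sample)
print(f"full data: {full_rate:.4f}  10% sample: {sample_rate:.4f}")
```

On data like this, the sample estimate should land close to the full-data figure, which is the tradeoff the article describes: a small loss in accuracy for a large cut in data size and turnaround time. The synthetic tweets carry no location, so `geography` stays empty, a small reminder that the structure imposed on such data may never be complete.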
  • 4. Develop policies to constantly review and purge data. Figuring out what data to delete is almost as important as acquiring data. There are several startups that have created a business model out of data destruction. Companies need to brutally prioritize what data they should retain. They should develop and implement strict data purge and archival policies. The focus should be on increasing ROI and efficiency. Learning from the most recent information should be a priority, so as to avoid historical biases.

Gaining insight from unstructured data is difficult but not impossible. Recent interest and developments in the field have made it easier for companies to work on this data and generate insight. Companies should start exploring unstructured data as the next step in information management.

Krishna Rupanagunta is a Delivery Unit Head for Mu Sigma. In this role he is responsible for client delivery, team development and providing thought leadership across projects. His background is in Business Consulting, servicing Fortune 500 clients across multiple industries with specific focus on Supply Chain Optimization. Prior to joining Mu Sigma, he was part of a non-profit that helped the Indian government in the conceptualization and design of the largest citizen identity project ever attempted in the world. He has a deep interest in Behavioral Economics and actively contributes to the Mu Sigma blog.

David Zakkam is a Senior Delivery Manager for Mu Sigma. He has 10 years of experience working in the analytics industry. His current focus areas include Rapid Impact Analytics, Search Engine Monetization, Big Data and CRM. He has an MBA from the Indian Institute of Management, Calcutta and received an Engineering degree from Indian Institute of Technology, Delhi.

Harsha Rao is a Senior Manager at Mu Sigma where he oversees delivery for clients in the financial services and technology space. His current interests lie in the space of Big Data as well as Risk Management.
Prior to Mu Sigma, Harsha was at a leading U.S. bank where he focused on credit risk management. He earned his Ph.D. in Statistics from North Carolina State University.