• Web is no longer a static library that people passively browse
• Web is a place where people:
o Consume and create content
o Interact with other people:
Internet forums, Blogs, Social networks, Twitter, Wikis, Podcasts, Slide
sharing, Bookmark sharing, Product reviews, Comments, …
• DATA POINT: Facebook traffic tops Google (for USA)
• March 2010: FB > 7% of US traffic
Social Media : Big Change
• Rich and big data:
• Billions of users, billions of content items
• Textual, Multimedia (image, videos, etc.)
• Billions of connections
• Behaviours, preferences, trends...
• Data is open and easy to access
• It’s easy to get data from Social Media
• Developer APIs
• Spidering the Web
Social Media : Rich and Big data
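As a rough illustration of the "spidering" route to data, the sketch below pulls the outgoing links from one page, which is the basic step a crawler repeats to discover new pages. It uses only the Python standard library. The page content, the URLs, and the names LinkExtractor and extract_links are assumptions made up for this example; a real spider would also download pages over HTTP, respect robots.txt, and track visited URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in the given HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content; a real spider would fetch this over HTTP.
page = '<p><a href="/post/1">First</a> <a href="http://example.org/x">Other</a></p>'
links = extract_links(page, "http://example.com/blog/")
print(links)
```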
Social Media : Opportunities
Any user can share and contribute content, express opinions, link to others
This means: we can data-mine the opinions and behaviours of millions of users to gain insight into:
• Human behaviour
• Marketing analytics
• Product sentiment
• Consumer Brand Analytics
• What are people saying about our brand?
• Marketing Communications
• Significant spending on marketing, advertising:
• Companies trying to position their products
• Brand analytics helps to determine whether such campaigns are effective
• Product reviews
• Automatically mine product reviews for information on product features, new
• Easy to use, Comfortable chair, Light weight, Sturdy, Good price
Applications: Reputation Management
• Citizen response
• Solicit citizen feedback on bills debated in Congress
• What new issues are being raised, what aspects of bill are popular, unpopular
• Political Campaigns
• Why do people support a candidate?
• Law enforcement
• Gang members boast about their activities on Facebook
• Protests being planned through Twitter
• NYT: Sending the Police Before There’s a Crime
Applications: Citizen Response
• Viral marketing:
• Personalized recommendations
• Online forum users are brand advocates:
• 79.2% of forum contributors help a friend to make a decision about a product
purchase (47.6% of non-contributors).
• 65% of forum contributors share advice (offline and in person) based on
information that they’ve read online (35% of non-contributors)
Applications: Social Media Marketing
How do we capture and model
the flow of information?
Given that social media generate a wealth of consumer data, how can brands turn raw
social media comment data from Twitter, Facebook, blogs, and forums into actionable
business insights? The answer lies in the application of text-mining and semantic
technology to these new sources of unstructured data.
How does it work?
• Text mining is similar to data mining in that it is aimed at identifying interesting patterns
• The first step in any text-mining effort is to identify the text-based sources to be
analysed and gather this material through information retrieval or selecting the corpus
that comprises the set of textual files and content of interest.
• Extensive NLP is deployed that invokes "part of speech tagging" and text sequencing to
parse for syntax (that is, tokenizing text) and applying Named Entity Recognition (that is,
identifying the mention of brands, people's names, places, common abbreviations, and
Text mining and semantic methods
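The tokenizing and entity-spotting steps described above can be sketched in a few lines of Python. This is a minimal illustration only: it replaces statistical Named Entity Recognition with a simple dictionary (gazetteer) lookup, it does not do part-of-speech tagging, and the GAZETTEER entries and function names are assumptions invented for the example.

```python
import re

# Hypothetical gazetteer of entities of interest (assumption for illustration).
GAZETTEER = {
    "facebook": "BRAND",
    "twitter": "BRAND",
    "new york": "PLACE",
}


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_entities(tokens, gazetteer=GAZETTEER, max_len=2):
    """Greedy longest-match lookup of token n-grams in the gazetteer."""
    entities = []
    i = 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            if i + n > len(tokens):
                continue  # not enough tokens left for an n-gram this long
            candidate = " ".join(tokens[i:i + n])
            if candidate in gazetteer:
                entities.append((candidate, gazetteer[candidate]))
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return entities


tokens = tokenize("Protests in New York were planned on Twitter.")
entities = find_entities(tokens)
print(entities)
```

A dictionary lookup like this misses misspellings and unseen names, which is one reason production systems tend to rely on trained NER models instead.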
Unique challenges exist when setting out to apply text mining to social media
data. The data that social networking sites, blogs, and forums generate falls in
the category of what is commonly referred to as big data. The data is
unstructured and semi-structured, petabytes are generated around larger
brands on a daily basis, and traditional relational databases cannot efficiently
scale to support real-time analytics based on the data. Big data and NoSQL
database solutions are therefore required.
Social media datamarts and big data
There are several commercial and open source options for text-mining software and
Of the open source text mining tools, RapidMiner and R appear to be two of the most
popular. R has a wider user base and a large selection of algorithms, but it is a
programming language, so analyses have to be written as source code. Scalability is
also an issue with R, so it is not ideal for large datasets without workarounds.
RapidMiner has a smaller user base, but it does not require writing source code and
has a powerful user interface (UI).
Embedded is a list of other Text Mining tools:
Text mining tools
Spinn3r is a web service that provides raw access to posts, articles, tweets, status
updates, etc. being published - in real or near real time, allowing you to focus on building
your application, mashup, or search engine. We find the sources, index their content and
take care of all the heavy lifting around delivering large amounts of relevant data.
They publish an API that companies can use to build analytics products on top of this data.
• Spinn3r Dataset: http://spinn3r.com
• 30 million articles/day (50GB of data)
• 20,000 news sources + millions of blogs and forums
• And lots of Tweets and public Facebook posts
Gnip and DataSift are among the many other providers of this kind of dataset.
Many product companies use these datasets to build analytical products.
With InsideView CRM+, your marketing, sales, and service teams can:
• Research market, company, contact, and competitor information
• Use real-time news and social network connections to target new leads and engage with
• Enrich leads to help sales move from lead to win
• One-click integration with CRM to update leads and contacts into your CRM
Tealeaf's Customer Behavior Analysis Suite
• Improving online customer experience is a top priority for many organizations and
Tealeaf's Customer Behavior Analysis Suite was created with this goal in mind. By
utilizing cxImpact, cxResults and cxView in concert, companies have both the
quantitative data and the qualitative experience information necessary to
understand customers' true experiences.
Further product companies that provide analytical tools built on these datasets:
And many more...
Sentiment analysis depends on an appropriate subjectivity lexicon that understands the
relative positive, neutral or negative context of a word or expression. It is both language
and context specific.
A good example can be seen below:
I find PRODUCTX to be very good and useful, but it is a bit too expensive.
The expression (and therefore PRODUCTX) is rated as positive, since there are two
positive words “good” and “useful” – and one negative word “expensive”. In addition, one
of the positive words is enhanced with the word “very” while the negative word is put
into perspective by the qualifier “a bit”. The more advanced the lexica, the more detailed
the analysis and the findings can be.
Sentiment analysis is a well-established, stand-alone predictive analytic technique.
Sentiment Analysis: Predictive Analytic Technique
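The PRODUCTX example can be reproduced with a very small lexicon-based scorer. This is only a sketch under stated assumptions: the word scores in LEXICON and the multipliers in MODIFIERS (1.5 for "very", 0.5 for "a bit") are invented for illustration and are not taken from any real subjectivity lexicon. A modifier is applied to the next sentiment word that follows it and is then used up.

```python
import re

# Hypothetical subjectivity lexicon: word -> polarity score (assumed values).
LEXICON = {"good": 1.0, "useful": 1.0, "expensive": -1.0}

# Hypothetical modifiers: phrase -> multiplier for the next sentiment word.
MODIFIERS = {("very",): 1.5, ("a", "bit"): 0.5}


def score(text):
    """Sum lexicon scores, scaling each by the most recent pending modifier."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    pending = 1.0  # multiplier waiting for the next sentiment word
    i = 0
    while i < len(tokens):
        two = tuple(tokens[i:i + 2])
        one = (tokens[i],)
        if len(two) == 2 and two in MODIFIERS:
            pending = MODIFIERS[two]
            i += 2
            continue
        if one in MODIFIERS:
            pending = MODIFIERS[one]
        elif tokens[i] in LEXICON:
            total += LEXICON[tokens[i]] * pending
            pending = 1.0  # the modifier is consumed by this word
        i += 1
    return total


def label(total):
    """Map a numeric score to a coarse sentiment label."""
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


sentence = "I find PRODUCTX to be very good and useful, but it is a bit too expensive."
result = score(sentence)
print(result, label(result))
```

With these assumed weights the sentence scores 1.5 ("very good") + 1.0 ("useful") - 0.5 ("a bit too expensive") = 2.0, so it is labelled positive, matching the reasoning in the example.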
These tools are generally cloud-based applications that pull many different social media
data sources (datasets) together including communities and blogs. They are able to do
this because they generally incorporate a massive back end infrastructure that constantly
crawls and captures new data from the APIs as it appears.
They all provide an interface to filter the data and enter selection criteria to look across a
broad range of channel choices. The results usually take some form of a visual scorecard
that combines different graphical and tabular techniques for displaying the summarized
information. Many allow an interactive “drill down” to see further details, and most
let you drill right through to the original source of the data.
Social Media Scorecards
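The filter, summarize, and drill-down behaviour of such a scorecard can be sketched as a small aggregation over collected mentions. Everything here is hypothetical: the MENTIONS records, their field names, and the scorecard function are assumptions for illustration, and a real tool would read from its back-end data store rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical, hand-made records standing in for collected mentions.
MENTIONS = [
    {"channel": "twitter", "brand": "PRODUCTX", "sentiment": 1.0,
     "url": "http://example.com/1"},
    {"channel": "twitter", "brand": "PRODUCTX", "sentiment": -1.0,
     "url": "http://example.com/2"},
    {"channel": "blog", "brand": "PRODUCTX", "sentiment": 1.0,
     "url": "http://example.com/3"},
    {"channel": "forum", "brand": "OTHER", "sentiment": 0.0,
     "url": "http://example.com/4"},
]


def scorecard(mentions, brand, channels=None):
    """Filter mentions by brand (and optionally channel), then summarize per channel."""
    groups = defaultdict(list)
    for mention in mentions:
        if mention["brand"] != brand:
            continue
        if channels is not None and mention["channel"] not in channels:
            continue
        groups[mention["channel"]].append(mention)

    summary = {}
    for channel, items in groups.items():
        summary[channel] = {
            "mentions": len(items),
            "mean_sentiment": sum(m["sentiment"] for m in items) / len(items),
            # Keeping the source URLs is what makes "drill through" possible.
            "sources": [m["url"] for m in items],
        }
    return summary


card = scorecard(MENTIONS, "PRODUCTX")
for channel, row in sorted(card.items()):
    print(channel, row["mentions"], row["mean_sentiment"])
```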
Technologies Used by these Product Companies
Big Data Technologies:
• Hadoop frameworks (HDFS, Pig, Hive, Oozie, HBase, Mahout)
• Cloudera (CDH3 & CDH4) distributions
• PostgreSQL + PostGIS
Cloud computing technologies:
• Amazon Web Services (AWS) / Amazon EC2
• Amazon S3
• Amazon EMR
• Amazon CloudWatch