Preparing Social Media Data for Advanced Analytics
Social media data, like many other data sets, is largely unstructured and very large. Before advanced analytics can yield insights, the data needs to be preprocessed.

The major preparation work for social media data includes:

• Filtering duplicates, spam, blacklists and whitelists
• Detecting author language and country
• Analyzing sentiment by content tone and brand references
• Measuring author influence
• Indexing content and metadata

Published in: Technology, Business


  • 1. Preparing Social Data for Advanced Analytics
      Jason Xue, Director of Engineering
  • 2. In order to measure what matters to your business, we need to prepare/pre-process social media data before it can be used to gather insights.
  • 3. Here’s How We Do It
      • Filter duplicates, spam, blacklists, whitelists
      • Detect author language
      • Detect author country
      • Analyze sentiment – content tone and brand references
      • Measure author influence
      • Index content and metadata
      • Store content in distributed systems
  • 4. Filter Content
      • Because we collect social media data from several sources, the same content can arrive more than once. We identify duplicates by permalink and remove them.
      • Filter out spam to save computing resources. Spam can be detected by URL, title and content.
      • Define a blacklist and a whitelist to filter content.
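The filtering steps above can be sketched as follows. This is a minimal illustration, not the pipeline described in the slides: the `permalink` field name, the trailing-slash normalization, the domain-level blacklist and whitelist, and the `is_spam` callback are all assumptions made for the example.

```python
from urllib.parse import urlsplit


def filter_posts(posts, blacklist=frozenset(), whitelist=frozenset(),
                 is_spam=lambda post: False):
    """Keep each permalink once; drop blacklisted domains and spam.

    Whitelisted domains skip the blacklist and spam checks (an assumed policy).
    `is_spam` is a hypothetical predicate over URL, title and content.
    """
    seen = set()
    kept = []
    for post in posts:
        # Assumed normalization: ignore a trailing slash when comparing permalinks.
        permalink = post["permalink"].rstrip("/")
        if permalink in seen:
            continue  # duplicate delivered by another data source
        seen.add(permalink)
        domain = urlsplit(permalink).hostname or ""
        if domain in whitelist:
            kept.append(post)
        elif domain not in blacklist and not is_spam(post):
            kept.append(post)
    return kept
```

Deduplicating before the spam check means the (potentially expensive) spam predicate runs at most once per permalink.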
  • 5. Detect Language
      • Approaches (see details at of-language-identification-methods):
          • Common words and unique letter combinations
          • N-gram approach by Cavnar
          • Statistical approach by Ted Dunning
          • Compression-based approach with PPM by Teahan
          • Language identification and character sets by Kikui
      • Software:
          • Google’s Compact Language Detector from Chrome (library)
          • Google’s Translate APIs
          • SDL BeGlobal APIs
          • Microsoft language detection
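To make the n-gram approach concrete, here is a small sketch in the spirit of Cavnar's method: each language is represented by a ranked list of its most frequent character n-grams, and a document is assigned to the language whose ranking is closest by an "out-of-place" distance. The function names, the `top_k` and `max_n` values, and the padding scheme are assumptions for illustration; a usable detector would build its profiles from large training corpora.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the top_k most frequent character n-grams (n = 1..max_n), most frequent first."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    return [gram for gram, _ in counts.most_common(top_k)]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile get a maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def detect_language(text, lang_profiles):
    """Pick the language whose profile has the smallest out-of-place distance to the text."""
    doc_profile = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))
```

Here `lang_profiles` is a dictionary from a language code to the profile produced by `ngram_profile` on training text for that language.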
  • 6. Detect Author Country – by User-Provided Location
      When a user has provided location information, we can use it to detect the user’s country.
  • 7. The Challenges with User-Provided Location
      • Data inaccuracy
      • Data specificity
      • Location level
      • Multiple locations with the same name
      • Alternative spellings or abbreviations
  • 8. Detect Author Country – by URL
      • We can sometimes use URL domains and sub-domains to detect the author’s country.
      • Challenges with URLs:
          • Improperly used country code domains
          • Domain hacks
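A minimal sketch of the URL signal, under stated assumptions: the mapping table is a tiny illustrative sample, and the set of "unreliable" country-code domains is a hand-picked example of codes often used generically or in domain hacks, not an authoritative list. The function returns `None` when the URL gives no trustworthy hint, so other signals can take over.

```python
from urllib.parse import urlsplit

# Illustrative sample only; a real table would cover every country code.
CCTLD_TO_COUNTRY = {"uk": "GB", "de": "DE", "fr": "FR", "jp": "JP", "br": "BR"}

# Assumed examples of country codes frequently used without any tie to the country.
UNRELIABLE_CCTLDS = {"ly", "tv", "me", "co", "io", "fm"}


def country_from_url(url):
    """Guess an ISO country code from the top-level domain, then from the leftmost sub-domain."""
    host = urlsplit(url).hostname
    if not host:
        return None
    labels = host.split(".")
    tld = labels[-1]
    if tld in CCTLD_TO_COUNTRY and tld not in UNRELIABLE_CCTLDS:
        return CCTLD_TO_COUNTRY[tld]
    # Fall back to a country-style sub-domain such as de.example.com.
    if len(labels) > 2 and labels[0] in CCTLD_TO_COUNTRY:
        return CCTLD_TO_COUNTRY[labels[0]]
    return None
```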
  • 9. Detect Author Country – by Language
      • When country information is absent, we can use the result of language detection as a signal for the author’s country.
      • Challenges with language:
          • Ideally, the language detection result also carries geography.
          • If not, we make a “best guess” based on a list of default countries by language.
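The fallback logic can be sketched as below. It assumes the language detector returns a tag such as `pt-BR` or `ja`; the default-country table is a made-up sample, and each default is only a best guess.

```python
# Hypothetical defaults; each entry is a guess, not a fact about any given author.
DEFAULT_COUNTRY_BY_LANGUAGE = {"ja": "JP", "de": "DE", "pt": "BR", "en": "US"}


def country_from_language(lang_tag):
    """Use the region part of a language tag if present, else a default country for the language."""
    parts = lang_tag.replace("_", "-").split("-")
    region = parts[-1]
    if len(parts) > 1 and len(region) == 2 and region.isalpha():
        return region.upper()  # e.g. "pt-BR" already carries geography
    return DEFAULT_COUNTRY_BY_LANGUAGE.get(parts[0].lower())
```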
  • 10. Measure Author Influence
      • Author influence, or authority ranking, can be measured by the following factors:
          • Facebook – friends, profiles, likes, replies
          • Twitter – followers, retweets, mentions, replies
          • YouTube – view counts
          • Flickr – view counts
          • Identi.ca – subscribers
          • Wikipedia – rankings
      • Platforms/websites for measuring influence:
  • 11. Calculate Content Tone
      Content tone defines the overall sentiment of a conversation.
  • 12. How to Calculate Content Tone
      • Content tone can be measured against predefined keywords for positive and negative emotions (Posemo, Negemo).
      • Content tone can be calculated as the difference between the counts of positive and negative words, divided by the total number of words in a conversation, and then converted to a Likert scale.
      • The content tone calculation can be improved with machine learning.
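As a sketch of the calculation just described: the score is (positive words − negative words) / total words, which is then bucketed into a five-point Likert scale. The tiny word lists and the cut points are assumptions chosen for the example; the slides do not specify either.

```python
import bisect
import re

# Tiny illustrative keyword lists; real Posemo/Negemo lexicons are far larger.
POSEMO = {"good", "great", "love", "happy", "excellent"}
NEGEMO = {"bad", "hate", "awful", "terrible", "sad"}

# Assumed cut points separating Likert levels 1..5.
LIKERT_CUTS = [-0.10, -0.02, 0.02, 0.10]


def tone_score(text):
    """Return (positive - negative) / total words, a value in [-1, 1]."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    positive = sum(word in POSEMO for word in words)
    negative = sum(word in NEGEMO for word in words)
    return (positive - negative) / len(words)


def to_likert(score):
    """Map a tone score to 1 (very negative) .. 5 (very positive)."""
    return bisect.bisect_right(LIKERT_CUTS, score) + 1
```

For example, "I love this great phone" has two positive words out of five, so its score is 0.4, which falls in the top bucket.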
  • 13. Calculate Brand References
      Brand References analyzes the positive and negative words surrounding a brand keyword within a conversation. Results are scored as Negative, Neutral or Positive.
  • 14. How to Measure Brand References
      • The analysis considers the proximity of a Posemo or Negemo keyword to the queried brand keyword. This identifies the phrase as carrying positive or negative sentiment.
      • It also considers any negating keyword close to the brand keyword, which inverts the sentiment of the phrase.
      • An overall label of Positive or Negative is applied depending on which kind of phrase has the larger count in the content.
      • If no positive or negative phrases are found, or if there are the same number of each, the content is labeled Neutral.
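The rules above can be sketched as follows. Several details are assumptions, since the slides do not define them: a "phrase" is taken to be a window of `window` words on each side of a brand mention within one sentence, the brand is a single word, and the keyword and negator lists are small samples.

```python
import re

# Small illustrative lexicons (assumptions, not the lists used in the slides).
POSEMO = {"good", "great", "love", "happy", "excellent"}
NEGEMO = {"bad", "hate", "awful", "terrible", "sad"}
NEGATORS = {"not", "no", "never", "don't", "isn't"}


def brand_reference(text, brand, window=4):
    """Label content Positive, Negative or Neutral with respect to a single-word brand."""
    brand = brand.lower()
    positive = negative = 0
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z0-9']+", sentence)
        for i, word in enumerate(words):
            if word != brand:
                continue
            # The phrase: words within `window` positions of the brand mention.
            nearby = words[max(0, i - window): i + window + 1]
            polarity = (sum(w in POSEMO for w in nearby)
                        - sum(w in NEGEMO for w in nearby))
            if any(w in NEGATORS for w in nearby):
                polarity = -polarity  # a nearby negator inverts the phrase
            if polarity > 0:
                positive += 1
            elif polarity < 0:
                negative += 1
    if positive > negative:
        return "Positive"
    if negative > positive:
        return "Negative"
    return "Neutral"
```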
  • 15. Index Social Media Data
      • Social media data needs to be indexed so that it can be searched and analyzed.
      • Preparation:
          • Convert conversations to searchable terms by removing stop words
          • Stop words are defined for different languages
      • Index:
          • Conversations are indexed by publication date and content type
          • Terms and metadata are mapped to document IDs (permalinks) and then to shard locations on machine nodes
          • The shard location is chosen by hashing the document ID
          • The permalink of a conversation (document) is stored on a primary shard, and optionally on one or more replica shards
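A toy in-memory sketch of these indexing steps: stop words are removed per language, each document is routed to a shard by hashing its permalink, and each shard keeps an inverted index from term to permalinks. The stop-word sample, the choice of SHA-256, and the shard count are assumptions; replicas and indexing by publication date or content type are left out.

```python
import hashlib
import re
from collections import defaultdict

# Sample stop words per language (illustrative only).
STOP_WORDS = {"en": {"the", "a", "an", "is", "of", "and", "to", "in"}}


def to_terms(text, lang="en"):
    """Lowercase, tokenize, and drop the stop words defined for the language."""
    stop = STOP_WORDS.get(lang, set())
    return [word for word in re.findall(r"\w+", text.lower()) if word not in stop]


def shard_for(doc_id, num_shards):
    """Choose a shard by hashing the document ID (here, the permalink)."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def build_index(docs, num_shards=4, lang="en"):
    """Build one inverted index (term -> set of permalinks) per shard."""
    shards = [defaultdict(set) for _ in range(num_shards)]
    for permalink, text in docs.items():
        postings = shards[shard_for(permalink, num_shards)]
        for term in to_terms(text, lang):
            postings[term].add(permalink)
    return shards
```

A stable hash is used rather than Python's built-in `hash`, so that a given permalink maps to the same shard across processes and restarts.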
  • 16. Storing Social Media Data
      • Social media data is unstructured and very large. It needs to be stored in distributed systems to be scalable and computable.
      • HBase is a key/value store. Specifically, it is a sparse, consistent, distributed, multidimensional, sorted map. We can use permalinks as HBase keys for social content.
      • Hadoop is an open-source software framework that supports data-intensive distributed applications. It supports running advanced social analytics on large clusters of commodity hardware.