Preparing Social Media Data for Advanced Analytics

Social media data is largely unstructured and very large in volume. Before advanced analytics can yield insights, the data needs to be preprocessed.

The major preparation work for social media data includes:

• Filtering duplicates, spam, blacklists and whitelists
• Detecting author language and country
• Analyzing sentiment by content tone and brand references
• Measuring author influence
• Indexing content and metadata



Usage Rights

CC Attribution License

Presentation Transcript

  • Preparing Social Data for Advanced Analytics. Jason Xue, Director of Engineering
  • To measure what matters to your business, we need to prepare (preprocess) social media data before it can be used to gather insights.
  • Here’s How We Do It
      • Filter duplicates and spam; apply blacklists and whitelists
      • Detect author language
      • Detect author country
      • Analyze sentiment: content tone and brand references
      • Measure author influence
      • Index content and metadata
      • Store content in distributed systems
  • Filter Content
      • Because we collect social media data from several data sources, we identify duplicated content by its permalink and remove the duplicates.
      • Filter out spam to save computing resources. Spam can be detected by URL, title and content.
      • Define a blacklist and a whitelist to filter content.
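The filtering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Post` record, the `filter_posts` function and the simple substring matching are hypothetical and are not taken from the presentation. Spam detection is reduced here to blacklist terms; a real system would likely use richer signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    """Hypothetical record for one piece of social content."""
    permalink: str
    title: str
    content: str


def filter_posts(posts, blacklist=(), whitelist=()):
    """Drop duplicate permalinks, then apply blacklist and whitelist terms.

    blacklist: a post containing any of these terms is dropped.
    whitelist: if non-empty, a post must contain at least one of these terms.
    """
    seen = set()
    kept = []
    for post in posts:
        # Assumption: permalinks differing only by a trailing slash are duplicates.
        key = post.permalink.strip().rstrip("/")
        if key in seen:
            continue  # same content delivered by another data source
        seen.add(key)
        text = f"{post.title} {post.content}".lower()
        if any(term.lower() in text for term in blacklist):
            continue
        if whitelist and not any(term.lower() in text for term in whitelist):
            continue
        kept.append(post)
    return kept


posts = [
    Post("http://example.com/a", "Great phone", "I love it"),
    Post("http://example.com/a/", "Great phone", "I love it"),
    Post("http://example.com/b", "Cheap pills", "buy now"),
]
print([p.permalink for p in filter_posts(posts, blacklist=["pills"])])
```

Deduplicating before the term checks means a duplicate never costs more than one set lookup, which fits the stated goal of saving computing resources.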
  • Detect Language
      • Approaches (see details at http://www.slideshare.net/edma2/evaluation-of-language-identification-methods):
          • Common words and unique letter combinations
          • N-gram approach by Cavnar
          • Statistical approach by Ted Dunning
          • Compression-based approach with PPM by Teahan
          • Language identification and character sets by Kikui
      • Software:
          • Google's Compact Language Detector from Chrome (library)
          • Google’s Translate APIs
          • SDL BeGlobal APIs
          • Microsoft language detection
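To make the n-gram approach more concrete, here is a minimal sketch of rank-order profile matching in the spirit of Cavnar's method: each language gets a ranked list of its most frequent character n-grams, and a document is assigned to the language whose profile is least "out of place" relative to the document's own ranking. The function names, the tiny training samples and the parameter values (`max_n=3`, `top_k=300`) are assumptions chosen for illustration; the presentation does not say which method or library is used in practice.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the most frequent character n-grams (lengths 1..max_n), best first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency, then alphabetically so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language get a maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    total = 0
    for i, gram in enumerate(doc_profile):
        total += abs(i - rank[gram]) if gram in rank else max_penalty
    return total


def detect_language(text, profiles):
    """Pick the language whose profile has the smallest out-of-place distance."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training samples (assumption); real profiles are built from large corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y el gato se sentó",
}
profiles = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}
print(detect_language("the dog and the cat", profiles))
```

With samples this small the result is only indicative; short social posts are a known weak spot for profile-based methods, which is one reason to consider the library and API options listed above.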
  • Detect Author Country – by User-Provided Location
      • When a user provides location information, we can use it to detect the user’s country.
  • The Challenges with User-Provided Location
      • Data inaccuracy
      • Data specificity
      • Location level
      • Multiple locations with the same place name
      • Alternative spellings or abbreviations
  • Detect Author Country – By URL
      • We can sometimes use URL domains and subdomains to detect an author’s country.
      • Challenges with URLs:
          • Improperly used country code domains
          • Domain hacks
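As a rough illustration of the URL signal, the sketch below reads the country-code top-level domain, or failing that a country-like subdomain, from a URL. The lookup tables are small illustrative subsets, and the list of country-code domains that are commonly used generically (the "domain hack" problem) is an assumption made for this example, not something the slides specify.

```python
from urllib.parse import urlparse

# Illustrative subset: country-code label -> ISO 3166-1 alpha-2 country code.
CCTLD_TO_COUNTRY = {"de": "DE", "fr": "FR", "jp": "JP", "nl": "NL", "uk": "GB"}

# Assumption: country-code TLDs often used for domain hacks or generic sites,
# so they are not treated as evidence of the author's country.
GENERIC_USE_CCTLDS = {"co", "fm", "io", "ly", "me", "tv"}


def country_from_url(url):
    """Return a country code inferred from the URL's host, or None if unclear."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    if len(labels) < 2:
        return None
    tld = labels[-1]
    if tld not in GENERIC_USE_CCTLDS and tld in CCTLD_TO_COUNTRY:
        return CCTLD_TO_COUNTRY[tld]
    # Fall back to a leading subdomain such as "fr.example.com".
    if len(labels) >= 3 and labels[0] in CCTLD_TO_COUNTRY:
        return CCTLD_TO_COUNTRY[labels[0]]
    return None


for url in ("http://www.example.de/post/1", "http://bit.ly/abc", "http://fr.example.com/"):
    print(url, country_from_url(url))
```

Returning `None` instead of guessing lets a caller fall through to the next signal, such as the detected language.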
  • Detect Author Country – By Language
      • When country information is absent, we can use the result of language detection as a signal for the author’s country.
      • Challenges with language:
          • Ideally, the language detection result includes a geographic (regional) component.
          • If not, we make a “best guess” based on a list of default countries per language.
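The language fallback can be sketched as follows. The example assumes the detector returns a tag such as `pt-BR` (language plus region) or just `pt`; the default-country table is a made-up illustration of the "best guess" list and does not reflect the actual defaults used by the author.

```python
# Hypothetical defaults: the most likely country when only the language is known.
DEFAULT_COUNTRY_BY_LANGUAGE = {"de": "DE", "en": "US", "ja": "JP", "nl": "NL", "pt": "BR"}


def country_from_language(lang_tag):
    """Infer a country code from a language tag such as 'pt-BR', 'en_GB' or 'ja'."""
    if not lang_tag:
        return None
    parts = lang_tag.replace("_", "-").split("-")
    region = parts[-1]
    # Use the region when the detector supplied one (two letters, e.g. 'BR').
    if len(parts) > 1 and len(region) == 2 and region.isalpha():
        return region.upper()
    # Otherwise fall back to the default country for the bare language.
    return DEFAULT_COUNTRY_BY_LANGUAGE.get(parts[0].lower())


for tag in ("pt-PT", "pt", "en_GB", "xx"):
    print(tag, country_from_language(tag))
```

A default such as `en` to `US` will often be wrong, so it seems sensible to treat this result as the weakest of the country signals.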
  • Measure Author Influence
      • Author influence, or authority ranking, can be measured by the following factors:
          • Facebook: friends, profiles, likes, replies
          • Twitter: followers, retweets, mentions, replies
          • YouTube: watch view counts
          • Flickr: view counts
          • Identi.ca: subscribers
          • Wikipedia: rankings
      • Platforms and websites for measuring influence:
          • http://klout.com
          • http://traackr.com/
  • Calculate Content Tone
      • Content tone describes the overall sentiment of a conversation.
  • How to Calculate Content Tone
      • Content tone can be measured against predefined keywords for positive and negative emotions (Posemo, Negemo).
      • Content tone can be calculated as the difference between the number of positive and negative words, divided by the total number of words in a conversation, and then converted to a Likert scale.
      • The content tone calculation can be improved by machine learning.
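The calculation described above amounts to tone = (positive words − negative words) / total words, which falls between −1 and 1. A minimal sketch follows; the word lists are tiny placeholders for the Posemo and Negemo dictionaries, and the cut points used to map the score onto a five-point Likert scale are assumptions, since the slides do not give them.

```python
import re

# Placeholder keyword sets (assumption); real Posemo/Negemo lists are much larger.
POSEMO = {"excellent", "good", "great", "happy", "love"}
NEGEMO = {"awful", "bad", "hate", "poor", "terrible"}


def tone_score(text):
    """(positive - negative) / total words; 0.0 for empty text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    positive = sum(word in POSEMO for word in words)
    negative = sum(word in NEGEMO for word in words)
    return (positive - negative) / len(words)


def to_likert(score, mild=0.03, strong=0.15):
    """Map a tone score to 1 (very negative) .. 5 (very positive).

    The thresholds are illustrative assumptions, not values from the slides.
    """
    if score <= -strong:
        return 1
    if score <= -mild:
        return 2
    if score < mild:
        return 3
    if score < strong:
        return 4
    return 5


for text in ("I love this great phone", "The service was bad", "It arrived on Tuesday"):
    print(text, "->", to_likert(tone_score(text)))
```

Dividing by the total word count keeps long and short conversations comparable, at the cost of diluting the score of long, mostly neutral texts.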
  • Calculate Brand References
      • Brand References analyzes the positive and negative words surrounding a brand keyword within a conversation. Results are scored as Negative, Neutral or Positive.
  • How to Measure Brand References
      • It considers the proximity of a Posemo or Negemo keyword to the queried brand keyword. This identifies the phrase as having positive or negative sentiment.
      • It considers any negating keyword close to the brand keyword and inverts the sentiment of the phrase.
      • An overall label of Positive or Negative is applied, depending on which kind of phrase has the larger count in the content.
      • If no positive or negative phrases are found, or if there is the same number of each, the content is labeled Neutral.
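The rules above can be approximated with a token window around each brand mention. In this sketch the window size, the keyword lists and the negation list are all assumptions, the brand is assumed to be a single word, and "close to the brand keyword" is interpreted as "within the same window"; the presentation does not define these details.

```python
import re

# Placeholder keyword sets (assumption).
POSEMO = {"excellent", "good", "great", "happy", "love"}
NEGEMO = {"awful", "bad", "hate", "poor", "terrible"}
NEGATIONS = {"don't", "isn't", "never", "no", "not"}


def brand_reference(text, brand, window=4):
    """Label content as Positive, Negative or Neutral with respect to a brand keyword."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    brand = brand.lower()
    positive = negative = 0
    for i, token in enumerate(tokens):
        if token != brand:
            continue
        # Tokens within `window` positions on either side of the brand mention.
        nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        negated = any(t in NEGATIONS for t in nearby)
        for t in nearby:
            if t in POSEMO:
                polarity = 1
            elif t in NEGEMO:
                polarity = -1
            else:
                continue
            if negated:
                polarity = -polarity  # a nearby negation inverts the sentiment
            if polarity > 0:
                positive += 1
            else:
                negative += 1
    if positive > negative:
        return "Positive"
    if negative > positive:
        return "Negative"
    return "Neutral"


print(brand_reference("I do not love my Acme phone", "Acme"))
```

Unlike content tone, this score ignores emotional words that are far from the brand mention, so a post can have a negative overall tone and still be Neutral for a given brand.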
  • Index Social Media Data
      • Social media data needs to be indexed so that it can be searched and analyzed.
      • Preparation:
          • Convert conversations to searchable terms by removing stop words
          • Stop words are defined for different languages
      • Index:
          • Conversations are indexed by publication date and content type
          • Terms and metadata are mapped to document IDs (permalinks) and then to shard locations on machine nodes
          • The shard location is chosen by hashing the document ID
          • The permalink of a conversation (document) is stored on a primary shard and, optionally, on one or more replica shards
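A toy, single-process version of this indexing scheme might look like the following. The stop word lists, the `TinyIndex` class, the use of SHA-256 as the hash and the in-memory shards are all illustrative assumptions; a production search engine would place shards on separate nodes and also maintain the replica shards mentioned above, which this sketch omits.

```python
import hashlib
import re
from collections import defaultdict

# Tiny per-language stop word lists (assumption).
STOP_WORDS = {
    "en": {"a", "an", "and", "is", "of", "the", "to"},
    "es": {"de", "el", "la", "los", "y"},
}


def to_terms(text, language):
    """Lowercase, tokenize and drop the stop words defined for the language."""
    stop = STOP_WORDS.get(language, set())
    return [word for word in re.findall(r"\w+", text.lower()) if word not in stop]


def shard_for(doc_id, num_shards):
    """Choose a shard by hashing the document ID (here, the permalink)."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class TinyIndex:
    """In-memory inverted index split across shards: term -> set of permalinks."""

    def __init__(self, num_shards=4):
        self.shards = [defaultdict(set) for _ in range(num_shards)]

    def add(self, permalink, text, language="en"):
        shard = self.shards[shard_for(permalink, len(self.shards))]
        for term in to_terms(text, language):
            shard[term].add(permalink)

    def search(self, term):
        hits = set()
        for shard in self.shards:  # a query fans out to every shard
            hits |= shard.get(term.lower(), set())
        return hits


index = TinyIndex()
index.add("http://example.com/1", "The price of the phone is great")
index.add("http://example.com/2", "El precio de la cámara", language="es")
print(index.search("phone"))
```

Hashing the document ID spreads documents evenly without a central lookup table, but it also means that changing the number of shards moves most documents, so the shard count is usually fixed up front.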
  • Storing Social Media Data
      • Social media data is unstructured and very large, so it needs to be stored in distributed systems to be scalable and computable.
      • HBase is a key/value store; specifically, it is a sparse, consistent, distributed, multidimensional, sorted map. We can use permalinks as HBase keys for social content.
      • Hadoop is an open-source software framework that supports data-intensive distributed applications. It supports running advanced social analytics on large clusters of commodity hardware.
  • To learn more, visit: sdl.com/si