Klout as an Example Application of Topics-oriented NLP APIs


Klout, across its iterations, is a prime example of applying large-scale NLP data science to topical assignment. Klout makes this available through its website and through its developer API.

Published in: Data & Analytics

  1. 1. Topics-oriented APIs May 2015 – APIdays Barcelona Tyler Singletary - @harmophone Director of Platform
  2. 2. HI.
  4. 4. A Practical Application of Social Media Machine Learning and NLP 1
  5. 5. WHAT IS KLOUT, REALLY? • Klout is an API client application of the social web. • Federated identity across platforms • Macro and micro understanding of profile, conversation, and content. • Content and people linked by Topics.
  6. 6. UNIFYING PRINCIPLE: TOPICS • TBs of Social Interactions a Day • NLP applied to posts • Aggregated to profiles: – effects are Klout Score, topical strengths – The what becomes topics – The why becomes TopicSets • Links crawled, NLP summarization • Content and people linked by Topics.
  7. 7. TOPIC SETS + USERS + SCORING • Allow for time-series slicing • Aggregate counting • Slicing of set to create ordered list Topic-oriented view
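The three operations on this slide (time-series slicing, aggregate counting, and slicing a set into an ordered list) can be sketched as follows. This is a minimal illustration, not Klout's implementation: the observation tuples, the `topic_view` function, and the sample users and scores are all hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical observations: (user, topic, day, score contribution).
observations = [
    ("alice", "apis", date(2015, 5, 1), 0.5),
    ("alice", "apis", date(2015, 5, 3), 0.25),
    ("alice", "marketing", date(2015, 5, 2), 0.5),
    ("alice", "twitter", date(2015, 4, 1), 1.0),  # outside the May window
]

def topic_view(observations, user, start, end, limit=10):
    """Slice by date window, aggregate per topic, return an ordered list."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for obs_user, topic, day, score in observations:
        if obs_user == user and start <= day <= end:
            totals[topic] += score   # aggregate score per topic
            counts[topic] += 1       # aggregate count per topic
    ranked = sorted(totals, key=lambda t: totals[t], reverse=True)
    # Slice the ranked set to produce the ordered, topic-oriented view.
    return [(t, totals[t], counts[t]) for t in ranked[:limit]]

may_view = topic_view(observations, "alice", date(2015, 5, 1), date(2015, 5, 31))
print(may_view)
```

Changing the `start` and `end` arguments gives a different time slice over the same underlying data.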
  8. 8. NLP-based Building Blocks 2
  9. 9. KLOUT DEALS WITH RIDICULOUS AMOUNTS OF DATA o Topic assignment at scale: o ~650 M new pieces of data daily o hundreds of millions of profiles o ~10,000 topics in 3-level hierarchy o Daily update o Multiple Social networks and various data sources: o Twitter, Facebook, LinkedIn, Google+, Wikipedia o User activity, profiles, connections o Topics normalized to an evolving, managed ontology
  10. 10. WEIGHTING, NORMALIZATION, CALIBRATION Signals are weighted and normalized to mirror real-world influence. – Machine-learned weighting based on regression analysis of survey data. Advanced algorithm based on 1,500 signal combinations of relationships and ratios: – Where: Which network is the action taking place? – What: What action was taken? – Who: Who acted on your content? – How much: How many actions and unique actors? – When: When was the action performed?
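One way to picture how the five questions (where, what, who, how much, when) could combine into a single weighted signal is sketched below. Everything here is an assumption for illustration: the weight tables, the multiplicative form, the log damping, and the 30-day half-life are invented, whereas the slide says the real weights are machine-learned from survey data.

```python
import math

# Hypothetical weights, for illustration only.
NETWORK_WEIGHT = {"twitter": 1.0, "facebook": 0.8, "linkedin": 0.7}
ACTION_WEIGHT = {"retweet": 1.0, "comment": 0.8, "like": 0.4}

def signal_score(network, action, actor_score, actions, unique_actors,
                 age_days, half_life_days=30.0):
    """Combine where/what/who/how much/when into one number."""
    if actions <= 0:
        return 0.0
    where = NETWORK_WEIGHT[network]      # which network
    what = ACTION_WEIGHT[action]         # which action
    who = actor_score                    # influence of the actor, 0..1
    # How much: damp raw volume, reward a high share of unique actors.
    how_much = math.log1p(actions) * (unique_actors / actions)
    # When: exponential decay with the assumed half-life.
    when = 0.5 ** (age_days / half_life_days)
    return where * what * who * how_much * when

fresh = signal_score("twitter", "retweet", 0.8, 10, 5, age_days=0)
stale = signal_score("twitter", "retweet", 0.8, 10, 5, age_days=30)
print(fresh, stale)
```

With these assumed parameters, a signal that is one half-life old contributes half as much as a fresh one.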
  11. 11. TOPIC SETS FOR CONTEXT • User’s Influence: with various Scores • User’s Interests: with various Scores • User’s Self-selection: based on registered self-declared interest • Audience Influence: rollup of User’s Influence within a user’s downlevel and uplevel networks • Audience Interests: rollup of User’s Interests within a User’s downlevel (and uplevel) networks
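The audience topic sets are described as rollups of per-user topic sets across a network. A minimal sketch of such a rollup, assuming a simple mean over the audience (the slides do not say which aggregation Klout uses) and a hypothetical `{topic: score}` representation:

```python
def rollup(topic_sets):
    """Average topic scores across a list of per-user {topic: score} dicts.

    A topic missing from a user's set counts as 0 for that user.
    """
    if not topic_sets:
        return {}
    totals = {}
    for topic_set in topic_sets:
        for topic, score in topic_set.items():
            totals[topic] = totals.get(topic, 0.0) + score
    return {topic: total / len(topic_sets) for topic, total in totals.items()}

# Hypothetical downlevel network: two followers with their interest scores.
followers = [
    {"apis": 1.0, "twitter": 0.5},
    {"apis": 0.5},
]
audience_interests = rollup(followers)
print(audience_interests)
```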
  12. 12. CHALLENGES IN BIG DATA ● Message size: Overall data size may be huge, but message size per user may be small. ● Text Sparsity: Many users may be passive consumers of content. ● Noise: colloquial language, slang, grammatical errors, abbreviations. ● Context: Need to expand context to get more information ● False positives are embarrassing when user-facing
  13. 13. CHALLENGES TO SCALE NLP* ● Speed Matters (650M messages a day): ○ Stanford Named Entity Extraction - 10.959 ms (82.0 CPU days) ○ Dictionary - 0.056 ms (0.42 CPU days) ● Corpus ○ Stanford Named Entity Extraction: ■ {‘the rule of law’=1.0} ○ Dictionary based: ■ {‘the rule of law’=1.0, ‘nsa’=1.0, ‘eff’=1.0} * StanfordNLP model: english.conll.4class.distsim.crf.ser.gz
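The CPU-day figures follow from multiplying per-message latency by the daily volume, and the dictionary result comes from plain phrase lookup. The sketch below reproduces both under stated assumptions: the tokenizer, the `max_len` window, and the function names are hypothetical, and the three dictionary entries are the ones shown on the slide. The arithmetic gives roughly 82.4 CPU days for the Stanford figure, close to the slide's 82.0.

```python
import re

MESSAGES_PER_DAY = 650_000_000
MS_PER_DAY = 24 * 60 * 60 * 1000

def cpu_days(ms_per_message, messages=MESSAGES_PER_DAY):
    """CPU days needed to process one day of messages."""
    return ms_per_message * messages / MS_PER_DAY

print(cpu_days(10.959))  # Stanford NER latency from the slide
print(cpu_days(0.056))   # dictionary lookup latency from the slide

# Dictionary entries taken from the slide's example output.
TOPIC_DICTIONARY = {"the rule of law", "nsa", "eff"}

def dictionary_entities(text, dictionary=TOPIC_DICTIONARY, max_len=4):
    """Return every dictionary phrase of up to max_len tokens found in text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    found = {}
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            phrase = " ".join(tokens[start:start + length])
            if phrase in dictionary:
                found[phrase] = 1.0
    return found

entities = dictionary_entities("The NSA, the EFF and the rule of law")
print(entities)
```

The lookup is a set membership test per candidate phrase, which is why it is orders of magnitude cheaper than running a CRF model, and why it can also pick up lowercase acronyms that appear in the dictionary.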
  14. 14. WEBSTER
  15. 15. MACHINE LEARNING AT KLOUT We leverage our past machine learning and NLP classification assets to: • Train new models for adding additional data sources • Retrain Topics classification • Predict “actionability” of support • Predict virality of content [macro and micro] • Predict the “personhood” of a social media account • Target content based on downlevel predictions
  16. 16. How do you productize this in APIs? 3
  17. 17. INPUTS AND OUTPUTS • People-Specific Insights: Input: People(s); Output: TopicSet(s); GET user.json/[id]/insights/influence-topics • People-Aggregate Insights: Input: User(s); Output: Metadata, Aggregate Sets; GET user.json/insights/aggregated/influence-topics?userIds=1,2,3 • Topic-Specific People: Input: Topic(s); Output: People; GET topic.json/[ids]/people • Topic-Aggregate Insights: Input: Topic(s); Output: Metadata, Aggregation; GET topic.json/[ids]/insights
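The four request shapes can be assembled with small helper functions. This is a sketch only: the paths come from the slide, but `BASE_URL` is a placeholder, the helper names are hypothetical, and joining multiple topic ids with commas is an assumption about what `[ids]` means. Authentication is not covered by the slides and is omitted.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.example.com"  # placeholder, not the real host

def user_influence_topics(user_id):
    """People-specific insights: one user in, TopicSet out."""
    return f"{BASE_URL}/user.json/{user_id}/insights/influence-topics"

def aggregated_influence_topics(user_ids):
    """People-aggregate insights: several users in, aggregate sets out."""
    query = urlencode({"userIds": ",".join(map(str, user_ids))}, safe=",")
    return f"{BASE_URL}/user.json/insights/aggregated/influence-topics?{query}"

def topic_people(topic_ids):
    """Topic-specific people (comma-joined ids are an assumption)."""
    return f"{BASE_URL}/topic.json/{','.join(map(str, topic_ids))}/people"

def topic_insights(topic_ids):
    """Topic-aggregate insights (comma-joined ids are an assumption)."""
    return f"{BASE_URL}/topic.json/{','.join(map(str, topic_ids))}/insights"

print(aggregated_influence_topics([1, 2, 3]))
```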
  18. 18. PAYLOADS
      { topicSetType: "expertise", topicSet: [
        { topicId: "7516448513106795305", score: 0.999596145670965, strength: "strong", displayName: "APIs", name: "APIs", slug: "api", imageUrl: "", displayType: "visible", topicType: "entity" },
        { topicId: "10000000000000008253", score: 0.9992839644220868, strength: "strong", displayName: "Twitter", name: "Twitter", slug: "twitter", imageUrl: "", displayType: "visible", topicType: "entity" },
        { topicId: "8961164588331655920", score: 0.9992326280041798, strength: "strong", displayName: "Klout", name: "Klout", slug: "klout", imageUrl: "", displayType: "visible", topicType: "entity" }
      topicSetType: "interest", topicSet: [
        { topicId: "10000000000000008253", score: 0.9946672348339362, strength: "strong", displayName: "Twitter", name: "Twitter", slug: "twitter", imageUrl: "", displayType: "visible", topicType: "entity" },
        { topicId: "6485494992525344250", score: 0.9918719149780779, strength: "strong", displayName: "Marketing", name: "Marketing", slug: "marketing", imageUrl: "", displayType: "visible", topicType: "sub" },
        { topicId: "7516448513106795305", score: 0.9888798650771197, strength: "strong", displayName: "APIs", name: "APIs", slug: "api", imageUrl: "", displayType: "visible", topicType: "entity" },
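A client might consume these topic sets as sketched below. The ids, scores, strengths, and slugs are copied from the slide, with other fields trimmed for brevity. The helper names and the 0.99 threshold are hypothetical, and the sketch assumes the response has already been parsed into Python dicts (the slide shows unquoted keys, so the wire format may not be strict JSON).

```python
expertise = {
    "topicSetType": "expertise",
    "topicSet": [
        {"topicId": "7516448513106795305", "score": 0.999596145670965,
         "strength": "strong", "slug": "api"},
        {"topicId": "10000000000000008253", "score": 0.9992839644220868,
         "strength": "strong", "slug": "twitter"},
        {"topicId": "8961164588331655920", "score": 0.9992326280041798,
         "strength": "strong", "slug": "klout"},
    ],
}
interest = {
    "topicSetType": "interest",
    "topicSet": [
        {"topicId": "10000000000000008253", "score": 0.9946672348339362,
         "strength": "strong", "slug": "twitter"},
        {"topicId": "6485494992525344250", "score": 0.9918719149780779,
         "strength": "strong", "slug": "marketing"},
        {"topicId": "7516448513106795305", "score": 0.9888798650771197,
         "strength": "strong", "slug": "api"},
    ],
}

def slugs_above(topic_set, min_score):
    """Slugs of topics whose score meets the threshold, in payload order."""
    return [t["slug"] for t in topic_set["topicSet"] if t["score"] >= min_score]

def shared_slugs(first, second):
    """Slugs of topics present in both sets, matched on topicId."""
    second_ids = {t["topicId"] for t in second["topicSet"]}
    return [t["slug"] for t in first["topicSet"] if t["topicId"] in second_ids]

print(slugs_above(interest, 0.99))
print(shared_slugs(expertise, interest))
```

Topic ids are kept as strings here because several of them exceed the range a double-precision number can represent exactly.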
  19. 19. Let’s get practical, prescriptive and talk about the future 4
  20. 20. PARAMETERIZATION • Topics Scoring uses different models in each topic set • Overall Topic Scoring is based on hundreds of features, weights, decays, spanning short and long term • Parameterize scoring for different contexts
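Parameterized scoring with short- and long-term decays might look like the sketch below. The parameter names, the two-half-life blend, and the `TRENDING` and `EVERGREEN` presets are assumptions made for illustration. The slide only states that scoring uses hundreds of features, weights, and decays, and that it should be parameterized per context.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScoringParams:
    short_half_life_days: float
    long_half_life_days: float
    short_weight: float  # the long-term weight is 1 - short_weight

# Hypothetical presets for two different contexts.
TRENDING = ScoringParams(short_half_life_days=7, long_half_life_days=180, short_weight=0.9)
EVERGREEN = ScoringParams(short_half_life_days=7, long_half_life_days=180, short_weight=0.1)

def topic_score(events, params):
    """Blend short- and long-term decayed sums of (age_days, value) events."""
    short = sum(v * 0.5 ** (age / params.short_half_life_days) for age, v in events)
    long_term = sum(v * 0.5 ** (age / params.long_half_life_days) for age, v in events)
    return params.short_weight * short + (1 - params.short_weight) * long_term

old_activity = [(90, 1.0)]  # a single signal from 90 days ago
print(topic_score(old_activity, TRENDING))
print(topic_score(old_activity, EVERGREEN))
```

The same events produce different scores depending on which preset is passed, which is the point of exposing the parameters instead of hard-coding one model.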
  21. 21. EXAMPLES Use interchangeable, specified models with rule modifiers
  22. 22. EXAMPLES • When an API is treated like a product, you must think through the implementations others would make. • Maybe even make them your own.
  23. 23. POLICY • Data is great. • Representation of data is hard. • Raw data rarely if ever needs to be displayed. • Balance innovation on data assets with brand and utility, allowed use cases.
  25. 25. Bye! May 2015 – APIdays Tyler Singletary - @harmophone Director of Platform