Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Successfully reported this slideshow.

Like this presentation? Why not share!

- The worst jobs in history industria... by mrsbreedsclassshs 382 views
- #NewWayToEngage Influencers by IBM Watson Commerce 1746 views
- DATEV - Java Made with z Systems by IBM_CICS 1514 views
- Tour de cyclopath v10 by Thomas Erickson 884 views
- Three Critical Elements for Buildin... by IBM BPM 2277 views
- The Power of Cognitive Commerce for... by Lynn Kesterson-To... 1997 views

470 views

Published on

Invited talks at UC Santa Barbara and UC Santa Cruz, May 2013.

Published in:
Data & Analytics

No Downloads

Total views

470

On SlideShare

0

From Embeds

0

Number of Embeds

2

Shares

0

Downloads

11

Comments

0

Likes

1

No embeds

No notes for slide

popular resource for marketing campaigns to monitor the opinions of consumers on particular products and to launch viral advertising. Identifying key influencers in microblogs is required for such marketing activities.

To learn the various distributions in the FLDA model, we use the Gibbs sampling method. Gibbs sampling is a Markov chain Monte Carlo algorithm to approximate distributions of latent variables based on the observed data. The Gibbs sampling process usually starts with some initial value for each variable, then iteratively sample each variable conditioned on the current values of the remaining variables, and then update the vairable with its new value. This process will repeat for 100s of iterations. And at the end, the produced samples can be used to approximate the distributions of latent variablw. The key and also the most challenging part of Gibbs sampling is to derive the conditional distributions of each latent variable. Here are the derived conditional probabilities for the FLDA model. They look very complicated. But essentially, what we derived are..

Gibbs sampling is a widely used approach to Bayesian Inference. We also use it here to learn the various distributions.

Gibbs sampling is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations which are approximately from a specified multivariate probability distribution，This sequence can be used to approximate the joint distribution and approximate the marginal distribution of latent variables. The Gibbs sampling algorithm generates an instance from the distribution of each variable in turn, conditional on the current values of the other variables. We begin with some initial value for each variable, for each sample, sample each varibale from the conditional distribution. That is sample each variable from the distribution of that variable condidtioned on all other variables, making use of the most recent values and updating the varibales with its new value as soon as it has been sampled.. Then samples then approximate the join distribution of all variables. The idea is that observed data is incorporated into the sampling process by creating separate variables for each piece of observed data and fixing the variables in question to their observed values, rather than sampling from those variables. The distribution of the remaining variables is then effectively a posterior distribution conditioned on the observed data. A collapsed Gibbs sampler integrates out (marginalizes over) one or more variables when sampling for some other variable.

current workers.

. We explore a wide range of sizes (from 12.5% all the way up to 100%), number of topics (from 25 to 200) and number of workers (from 25 to 200). The figure shows that the distributed FLDA scales well along all dimensions.

- 1. Scalable Topic-Specific Influence Analysis on © 2013 IBM Corporation Microblogs Yuanyuan Tian IBM Almaden Research Center
- 2. Motivation Huge amount of textual and social information produced by popular microblogging sites. • Twitter had over 500 million users creating over 340 million tweets daily, reported in March 2012. A popular resource for marketing campaigns • Monitor opinions of consumers • Launch viral advertising Identifying key influencers is crucial for market analysis © 2 2013 IBM Corporation
- 3. Existing Social Influence Analysis Majority existing influence analyses are only based on network structures • E.g. influence maximization work by Kleinberg et al • Valuable textual content is ignored • Only global influencers are identified Topic-specific influence analysis • Network + Content • Differentiate influence in different aspects of life (topics) © 3 2013 IBM Corporation
- 4. Existing Topic-Specific Influence Analysis Separate analysis of content and analysis of networks • E.g. Topic-sensitive PageRank (TSPR), TwitterRank • Text analysis topics • For each topic, apply influence analysis on the network structure • But in microblog networks, content and links are often related A user tends to follow another who tweets in similar topics Analysis of content and network together • E.g. Link-LDA • But designed for citation and hyperlink networks • Assumption: content is the only cause of links • But in microblog networks, a user sometimes follows another for a content-independent reason A user follows celebrities simply because of their fame and stardom © 4 2013 IBM Corporation
- 5. © 2013 IBM Corporation Overview Goal: Search for topic-specific key influencers on microblogs Model topic-specific influence in microblog networks Learn topic-specific influence efficiently Put it all together in a search framework healthy food Michelle Obama Jimmy Oliver … SSKKIITT
- 6. Road Map Followship-LDA (FLDA) Model Scalable Gibbs Sampling Algorithm for FLDA A General Search Framework for Topic-Specific Influencers Experimental Results © 6 2013 IBM Corporation
- 7. © 2013 IBM Corporation Intuition Each microblog user has a bag of words (from tweets) and a set of followees Content: web, organic, veggie, cookie, cloud, … Followees: Michelle Obama, Mark Zuckerberg, Barack Obama Alice A user tweets in multiple topics • Alice tweets about technology and food A topic is a mixture of words • “web” and “cloud” are more likely to appear in the technology topic A user follows another for different reasons • Content-based: follow for topics • Content-independent: follow for popularity A topic has a mixture of followees • Mark Zuckerberg is more likely to be followed for the technology topic • Topic-specific influence: the probability of a user u being followed for a topic t
- 8. © 2013 IBM Corporation Followship-LDA (FLDA) FLDA: A Bayesian generative model that extends Latent Dirichlet Allocation (LDA) • Specify a probabilistic procedure by which the content and links of each user are generated in a microblog network • Latent variables (hidden structure) are introduced to explain the observed data Topics, reasons of a user following another, the topic-specific influence, etc • Inference: “reverse” the generative process What hidden structure is most likely to have generated the observed data?
- 9. © 2013 IBM Corporation Hidden Structure in FLDA Tech Food Poli Alice 0.8 0.19 0.01 Bob 0.6 0.3 0.1 Mark … … … Michelle … … … Barack … … … Per-user topic distribution Per-topic word distribution Per-user followship preference Per-topic followee distribution Global followee distribution web cookie veggie organic congress … Tech 0.3 0.1 0.001 0.001 0.001 … Food 0.001 0.15 0.3 0.1 0.001 … Poli 0.005 0.001 0.001 0.002 0.25 … Alice Bob Mark Michelle Barack Tech 0.1 0.05 0.7 0.05 0.1 Food 0.04 0.1 0.06 0.75 0.05 Poli 0.03 0.02 0.05 0.1 0.8 Y N Alice 0.75 0.25 Bob 0.5 0.5 Mark … … Michelle … … Barack … … Alice Bob Mark Michelle Barack Global 0.005 0.001 0.244 0.25 0.5 User Topic Topic Topic Word User User Preference User Topic-specific influence Global popularity
- 10. Generative Process of FLDA For the mth user: Pick a per-user topic distribution θm Pick a per-user followship preference μm For the nth word • Pick a topic based on topic distribution θm • Pick a word based on per-topic word distribution φt For the lth followee • Choose the cause based on followship preference μm • If content-related Pick a topic based on topic distribution θm Pick a followee based on per-topic followee distribution σt • Otherwise Pick a followee based on global followee distribution π Alice Plate Notation Content: Followees: user link word Tech Tech: 0.8, Food: 0.19, Poli: 0.01 Follow for content: content 0.75, not: 0.25 0.2 web , organic, … Michelle Obama , Barack Obama, … © 10 2013 IBM Corporation
- 11. Road Map Followship-LDA (FLDA) Model Scalable Gibbs Sampling Algorithm for FLDA A General Search Framework for Topic-Specific Influencers Experimental Results © 11 2013 IBM Corporation
- 12. Inference of FLDA Model Gibbs Sampling: A Markov chain Monte Carlo algorithm to approximate the distribution of latent variables based on the observed data. Gibbs Sampling Process • Begin with some initial value for each variable • Iteratively sample each variable conditioned on the current values of the other variables, then update the variable with its new value (100s of iterations) • The samples are used to approximate distributions of latent variables Derived conditional distributions for FLDA Gibbs Sampling p(zm,n=t|-zm,n): prob. topic of nth word of mth user is t given the current values of all others p(ym,l=r|-ym,l): prob. preference of lth link of mth user is r given the current values of all others p(xm,l=t|-xm,l, ym,l=1): prob. topic of lth link of mth user is t given the current values of all others © 12 2013 IBM Corporation
- 13. © 2013 IBM Corporation Gibbs Sampling for FLDA In each pass of data, for the mth user • For the nth word (observed value w), Sample a new topic assignment t’, based on p(zm,n|-zm,n) • For lth followee (observed value e), Sample a new preference r’, based on p(ym,l|-ym,l) If r’=1 (content based), sample a new topic t’, based on p(xm,l|-xm,l, ym,l=1) • Keep counters while sampling cm w,t: # times w is assigned to t for mth user dm e,t : # times e is assigned to t for mth user Estimate posterior distributions for latent variables per-user topic distribution per-topic followee distribution (influence)
- 14. Distributed Gibbs Sampling for FLDA Challenge: Gibbs Sampling process is sequential • Each sample step relies on the most recent values of all other variables. • Sequential algorithm would run for 21 days, on a high end server (192 GB RAM, 3.2GHz processor) for a Twitter data set with 1.8M users, 2.4B words and 183M links. Observation: dependency between variable assignments is relatively weak, given the abundance of words and links Solution: Distributed Gibbs Sampling for FLDA • Relax the sequential requirement of Gibbs Sampling • Implemented on Spark – a distributed processing framework for iterative machine learning workloads © 14 2013 IBM Corporation
- 15. Distributed Gibbs Sampling for FLDA global counters Local counters User 1 User 2 … Local counters global counters User 13 User 14 … Global Counter: ce,t, … * User 25 User 26 … Word W1 W2 W3 Topic T5 T4 T5 Link L1 L2 Prefer Y N Topic T4 . Local Counter: cw,t, de,t, … m m w,t, d* master global counters Local counters worker worker worker © 15 2013 IBM Corporation
- 16. © 2013 IBM Corporation Issues with Distributed Gibbs Sampling Check-pointing for fault tolerance • Spark used lineage for fault tolerance Frequency of global synchronization • Even every 10 iterations won’t affect the quality of result Random number generator in a distributed environment • Provably independent multiple streams of uniform numbers • A long-period jump ahead F2-Linear Random Number Generator Efficient local computation • Aware of the local memory hierachy • Avoid random memory access by sampling in order
- 17. Road Map Followship-LDA (FLDA) Model Scalable Gibbs Sampling Algorithm for FLDA A General Search Framework for Topic-Specific Influencers Experimental Results © 17 2013 IBM Corporation
- 18. Search Framework For Topic-Specific Influencers SKIT: search framework for topic-spefic key influencers • Input: key words • Output: ranked list of key influencers • Can plug in various key influencer methods FLDA, Link-LDA, Topic-Sensitive PageRank (TSPR), TwitterRank Step 1: Derivation of interested topics SSKKIITT from key words • Treat the key words as the content of a new user • “Fold in” to the existing model • Result: a topic distribution of the input <p1, p2, …, pt, …> Step 2: Compute influence score for each user • Per-topic influence score for each user from FLDA: infu t • Combine the score infu=Ʃ pt x infu t healthy food Michelle Obama Jimmy Oliver … Step 3: Sort users by influence scores and return the top influencers © 18 2013 IBM Corporation
- 19. Road Map Followship-LDA (FLDA) Model Scalable Gibbs Sampling Algorithm for FLDA A General Search Framework for Topic-Specific Influencers Experimental Results © 19 2013 IBM Corporation
- 20. Effectiveness on Twitter Dataset Twitter Dataset crawled in 2010 • 1.76M users, 2.36B words, 183M links, 159K distinct words Topic Top-10 key words Top-5 Influencers “Information Technology” data, web, cloud, software, open, windows, Microsoft, server, security, code “Cycling and Running” Bike, ride, race, training, running, miles, team, workout, marathon, fitness “Jobs” Business, job, jobs, management, manger, sales, services, company, service, small, hiring Interesting findings Tim O’ Reilly, Gartner Inc, Scott Hanselman (software blogger), Jeff Atwood (software blogger, co-founder of stackoverflow.com), Elijah Manor (software blogger) Lance Armstrong, Levi Leipheimer, Geroge Hincapie (all 3 US Postal pro cycling team members), Johan Bruyeel (US Postal team director), RSLT (a pro cycling team) Job-hunt.org, jobsguy.com, integritystaffing.com/blog (job search Ninja), jobConcierge.com, careerealism.com • Overall ~ 15% of followships are content-independent • Top-5 global popular users: Pete Wentz (singer), Ashton Kutcher (actor), Greg Grunberg (actor), Britney Spears (singer), Ellen Degeneres (comedian) © 20 2013 IBM Corporation
- 21. Effectiveness on Tencent Weibo Dataset Tencent Weibo Dataset form KDD Cup 2012 • 2.33M users, 492M words, 51M links, 714K distinct words • VIP users – used as “ground truth” Organized in categories Manually labeled “key influencers” in the corresponding categories • Quality evaluation Search Input: content of the VIP user Precision: how many of the top-k results are the other VIP users of the same category Measure: Mean Average Precision (MAP) for all the VIP users across all categories FLDA is significantly better than previous approaches • 2x better than TSPR and TwitterRank • 1.6x better than Link-LDA Distributed FLDA achieves similar quality results as sequential FLDA © 21 2013 IBM Corporation
- 22. Efficiency of Distributed FLDA Setup • Sequential: a high end server (192 GB RAM, 3.2GHz) • Distributed: 27 servers (32GB RAM, 2.5GHz, 8 cores) connected by 1Gbit Ethernet, 1 master and 200 workers Dataset • Tencent Weibo: 2.33M users, 492M words, 51M links • Twitter: 1.76M users, 2.36B words, 183M links Sequential vs. Distributed • 100 topics, 500 iterations Sequential Distributed Weibo 4.6 days 8 hours Twitter 21 days 1.5 days © 22 2013 IBM Corporation
- 23. © 2013 IBM Corporation Scalability of Distributed FLDA Scale up experiment on Twitter Dataset 23 corpus size=12.5% corpus size=25% corpus size=50% corpus size=100% 3000 2000 1000 0 3000 2000 1000 0 25 50 100 200 25 50 100 200 #concurrent workers Wall clock time per Iteration (sec) nTopics 25 50 100 200
- 24. Conclusion FLDA model for topic-specific influence analysis • Content + Links Distributed Gibbs Sampling algorithm for FLDA • Comparable results as sequential Gibbs Sampling algorithm • Scales nicely with # workers and data sizes General search framework for topic-specific key influencers • Can plug in various key influencer methods Significantly higher quality results than previous work • 2x better than PageRank based methods • 1.6x better than Link-LDA © 24 2013 IBM Corporation

No public clipboards found for this slide

Be the first to comment