Churnbar topic modelling (Asda)


An exploration of the value of topic modelling (Latent Semantic Indexing and Latent Dirichlet Allocation) in drawing insight from Social Media.


Topic Models: Posts about Asda, December 2011. Comparative Report.
Period: 01/12/2011 - 31/12/2011
© Churnbar 2012

Introduction.
Topic models are statistical techniques used to find abstract topics within a set of documents.

We have created Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) models from a set of social media posts mentioning the British supermarket chain Asda (a subsidiary of Walmart). This set contains 106,319 posts, and was not separated or classified in any way.

We hope that using topic models in client reporting will allow us to provide greater insight into the conversations occurring around a client's brand or company name in social media.

Methods.
The models were created using Python with the Gensim library. First, a dictionary containing all words that occurred more than once was created, after which the texts were converted into a numeric vector corpus.

It is not necessary here to go into the mathematics of the models, except to say that both derive topics from patterns of words occurring together in the same documents (LSI through a matrix factorisation, LDA through a probabilistic model), and that the resulting models can then be used to assign new documents to one of these topics.

LSI and LDA models were then created, each with 10 topics so that they are comparable. The models were saved to disk so that they can be reused in future.

Latent Semantic Indexing.
The topics produced by this model are as follows:

topic #0: "i" ,"asda," ,"my" ,"you" ,"@" ,"at" ,"is" ,"was" ,"get" ,"on"
topic #1: "zayn:" ,"harry:" ,"louis:" ,"smell" ,"like?""" ,"hair" ,"""what" ,"""loreal" ,"""apples.""" ,"worth"
topic #2: "asda," ,"@glasgowfort:" ,"@" ,"cd/dvd" ,"up" ,"our" ,"chance" ,"choice" ,"win" ,"living,"
topic #3: "asda," ,"@glasgowfort:" ,"cd/dvd" ,"our" ,"chance" ,"living," ,"choice" ,"win" ,"giveaway" ,"£10"
topic #4: "hamper" ,"#win" ,"christmas" ,"extra" ,"special" ,"an" ,"#competition" ,"@fussfreeflavour" ,"middletons" ,"awkward."
topic #5: "middletons" ,"awkward." ,"sandringham.," ,"value" ,"sent" ,"@queen_uk:" ,"#win" ,"very" ,"christmas" ,"well"
topic #6: "@" ,"asda," ,"christmas" ,"my" ,"#win" ,"shopping" ,"masturbating" ,"at" ,"scratch" ,"card.,"
topic #7: "@" ,"masturbating" ,"scratch" ,"turns" ,"card.," ,"sitting" ,"earlier" ,"today." ,"outside" ,"playing"
topic #8: "today’s" ,"cd" ,"£10." ,"winner" ,"dvd" ,"give" ,"away" ,"living" ,"cd/dvd" ,"announced"
topic #9: "buyz" ,"justine" ,"fukin" ,"her" ,"sket," ,"@itstracybeaker:" ,"wot" ,"clothes" ,"lol" ,"sing,"

Commentary.
Topics #4 and #5 are a result of the following tweet from the @queen_uk account being retweeted:

    RT @queen_uk: Well this is very awkward. The Middletons have sent an Asda value hamper up to Sandringham.

Topics #6 and #7, which again seem unlikely, were a result of:

    I thought I saw a woman sitting in her car outside Asda masturbating earlier today. Turns out she was only playing a scratch card.

This method appears to bring out some of the major topics, though heavily retweeted posts like those above can swamp the analysis.

Latent Dirichlet Allocation.
The topics produced by this model are as follows:

topic #0: i ,asda ,my ,was ,me ,get ,just ,have ,so ,go
topic #1: asda ,@asda ,opened ,is ,harry ,potter ,will ,behind ,kobo ,announced
topic #2: asda ,is ,on ,at ,as ,that ,they ,by ,this ,are
topic #3: asda ,christmas ,advent ,at ,hamper ,an ,found ,on ,jmebbk ,@
topic #4: asda ,asda, ,are ,is ,tesco ,supermarket ,no ,as ,has ,tesco,
topic #5: asda ,from ,@ ,special ,extra ,an ,#competition ,ago ,comment ,on
topic #6: asda ,at ,is ,"im ,checkout ,hours ,this ,with ,£10. ,earlier
topic #7: asda ,you ,@ ,at ,your ,or ,them ,have ,be ,i
topic #8: "rt ,because ,im ,does ,worth ,"" ,your ,hair ,@ ,ice
topic #9: asda ,from ,up ,our ,your ,& ,at ,living ,choice ,on
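
One rough way to gauge how distinct the LDA topics listed above are is to count, for each word, how many of the ten top-word lists it appears in. The sketch below does this in plain Python. The word lists are copied from the output above; the function name `topic_spread` is illustrative and not part of the original analysis.

```python
from collections import Counter

# Top-10 word lists copied from the LDA output above (tokens kept verbatim).
LDA_TOPICS = [
    ["i", "asda", "my", "was", "me", "get", "just", "have", "so", "go"],
    ["asda", "@asda", "opened", "is", "harry", "potter", "will", "behind", "kobo", "announced"],
    ["asda", "is", "on", "at", "as", "that", "they", "by", "this", "are"],
    ["asda", "christmas", "advent", "at", "hamper", "an", "found", "on", "jmebbk", "@"],
    ["asda", "asda,", "are", "is", "tesco", "supermarket", "no", "as", "has", "tesco,"],
    ["asda", "from", "@", "special", "extra", "an", "#competition", "ago", "comment", "on"],
    ["asda", "at", "is", '"im', "checkout", "hours", "this", "with", "£10.", "earlier"],
    ["asda", "you", "@", "at", "your", "or", "them", "have", "be", "i"],
    ['"rt', "because", "im", "does", "worth", '""', "your", "hair", "@", "ice"],
    ["asda", "from", "up", "our", "your", "&", "at", "living", "choice", "on"],
]


def topic_spread(topics):
    """Count, for each word, the number of topics whose word list contains it."""
    spread = Counter()
    for words in topics:
        # set() so a word is counted at most once per topic
        spread.update(set(words))
    return spread


if __name__ == "__main__":
    # Words shared by several topics do little to tell the topics apart.
    for word, n_topics in topic_spread(LDA_TOPICS).most_common(5):
        print(word, n_topics)
```

On these lists, "asda" appears in nine of the ten topics, so it contributes almost nothing to telling them apart.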
Commentary.
These topics do not appear to show much differentiation. Topic #1 is probably related to the release of a Harry Potter film on DVD, while topic #4 appears to cover comparisons between Asda and Tesco.

Additional Analysis.
In order to bring out more meaningful variation across the topics found by both algorithms, additional words were added to the stopword list [1] and the analysis was re-run. This produced the following topic lists.

[1] The stopword list is a list of words which are removed from the text prior to analysis. These are common words that do not add differentiation. For the current evaluation, the amended stopword list is: for a of the and to in http # : - A? he she it rt I you we. “A?” is an artefact produced by our systems when they encounter non-Latin characters.

Latent Semantic Indexing.

topic #0: "i" ,"asda," ,"my" ,"@" ,"at" ,"is" ,"was" ,"on" ,"just" ,"from"
  - Not possible to discern any particular posts that would exemplify this topic.
topic #1: "zayn:" ,"harry:" ,"louis:" ,"smell" ,"like?""" ,"hair" ,"""what" ,"""loreal" ,"""apples.""" ,"worth"
  - This topic appears to be mostly associated with “RT @ wowzayn : #1DQuotes "What does your hair smell like?" Harry; Apples. Zayn; Asda s anti-dandruff shampoo. Louis; LOreal because Im worth it.”
topic #2: "asda," ,"@glasgowfort:" ,"@" ,"cd/dvd" ,"up" ,"our" ,"chance" ,"choice" ,"win" ,"from"
topic #3: "@glasgowfort:" ,"asda," ,"cd/dvd" ,"our" ,"chance" ,"living," ,"choice" ,"win" ,"giveaway" ,"£10"
  - These two topics (and topic #8) appear to be related to a CD/DVD competition or giveaway organised by Asda.
topic #4: "hamper" ,"#win" ,"christmas" ,"extra" ,"special" ,"an" ,"#competition" ,"@fussfreeflavour" ,"#maisoncupcake" ,"#comp"
  - Like topics #2 & #3, this is related to a competition, this time for a hamper of food.
topic #5: "middletons" ,"awkward." ,"sandringham.," ,"value" ,"sent" ,"@queen_uk:" ,"#win" ,"very" ,"well" ,"is"
  - As in the previous analysis, this topic relates to the tweet from the @queen_uk account regarding a gift of an Asda hamper.
topic #6: "@" ,"asda," ,"christmas" ,"masturbating" ,"scratch" ,"card.," ,"turns" ,"earlier" ,"sitting" ,"my"
topic #7: "@" ,"masturbating" ,"scratch" ,"turns" ,"card.," ,"sitting" ,"earlier" ,"today." ,"outside" ,"playing"
  - See previous analysis.
topic #8: "today’s" ,"cd" ,"winner" ,"£10." ,"dvd" ,"give" ,"away" ,"living" ,"cd/dvd" ,"announced"
  - See topics #2 & #3.
topic #9: "buyz" ,"justine" ,"fukin" ,"sket," ,"her" ,"@itstracybeaker:" ,"wot" ,"clothes" ,"lol" ,"sing,"

This analysis appears to bring out some stronger topics, though it might be necessary to increase the number of topics to cover all possibilities. Retweets and shares of posts appear to affect the content of the topics found by the Latent Semantic Indexing process. If we remove these, however, we do not get a fair view of the real topics surrounding the brand.

While meant satirically, the tweet that gave rise to topic #5 could be seen as a backhanded endorsement of the brand, although the company might not want to be associated with topics #6 & #7.

Latent Dirichlet Allocation.
The topics output by the LDA analysis are not as easy to interpret as those from the LSI, even after the expansion of the stopword list:

topic #0: i ,asda ,was ,have ,@ ,that ,from ,got ,but ,so
topic #1: asda ,from ,christmas ,an ,special ,up ,price ,hamper ,extra ,at
topic #2: asda ,cd ,on ,behind ,no, ,& ,watch ,working ,with ,wine
topic #3: asda ,on ,her ,just ,at ,was ,that ,with ,as ,has
topic #4: "rt ,because ,your ,@ ,asda ,win ,does ,worth ,curry ,sauce
topic #5: asda ,at ,@ ,dvd ,opened ,& ,chance ,harry ,number ,on
topic #6: asda ,them ,i ,have ,advent ,is ,@ ,at ,#win ,all
topic #7: asda ,i ,get ,my ,go ,im ,at ,me ,going ,with
topic #8: asda ,is ,on ,my ,with ,car ,at ,this ,food ,be
topic #9: asda ,my ,at ,me ,is ,like ,i ,@ ,from ,be

Topic #1 contains the word “hamper”, and could therefore be an amalgamation of topics #4 & #5 from the LSI above.
Topic #4, containing “curry sauce”, could have picked up on the AsdaCurry sauce account.

My feeling is that LSI gives us a much better picture of the structure of the text than LDA, at least in this case. While I will reserve judgement and carry out both analyses for projects where topic modelling is required, I will only provide a commentary or discussion on LSI until I have further data regarding the LDA output.
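
The pipeline described under Methods, together with the amended stopword list, might be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for this report. Whitespace tokenisation is assumed (tokens such as "asda," in the output keep their punctuation), "occurred more than once" is read as total occurrences across the corpus, the models use Gensim's default settings apart from the topic count, and the file prefix `asda_dec2011` is hypothetical.

```python
from collections import defaultdict

# Amended stopword list, copied verbatim from the footnote. Entries are not
# normalised, so uppercase entries such as "I" will not match lowercased tokens.
AMENDED_STOPLIST = set(
    "for a of the and to in http # : - A? he she it rt I you we".split()
)


def tokenize(posts, stoplist):
    """Lowercase each post, split on whitespace (assumed) and drop stopwords."""
    return [
        [word for word in post.lower().split() if word not in stoplist]
        for post in posts
    ]


def drop_singletons(texts):
    """Keep only words that occur more than once across the whole corpus."""
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1
    return [[token for token in text if frequency[token] > 1] for text in texts]


def build_models(texts, num_topics=10, prefix="asda_dec2011"):
    """Build the dictionary, vector corpus, LSI and LDA models, and save them."""
    # Imported here so the helpers above can be used without Gensim installed.
    from gensim import corpora, models

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]  # numeric vector corpus
    lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=num_topics)
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics)

    # Save to disk so the models can be reused later (hypothetical file names).
    dictionary.save(prefix + ".dict")
    lsi.save(prefix + ".lsi")
    lda.save(prefix + ".lda")
    return dictionary, lsi, lda
```

Given a list of post strings `posts`, the steps chain as `build_models(drop_singletons(tokenize(posts, AMENDED_STOPLIST)))`. A new post can then be scored against the LSI topics with `lsi[dictionary.doc2bow(tokens)]`, where `tokens` is the post tokenised in the same way.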