An Introduction to gensim: "Topic Modelling for Humans"

An introduction to gensim, a free Python framework for topic modelling and semantic similarity using LSA/LSI and other statistical techniques.


  1. Topic modeling for humans. William Bert, DC Python Meetup, 1 May 2012
  2. Please go to http://ADDRESS and enter a sentence.
     Interesting relationships?
     gensim generated the data for those visualizations by computing the semantic similarity of the input.
  3. Who am I?
     William Bert, developer at Carney Labs (teamcarney.com), user of gensim, still new to the world of topic modelling, semantic similarity, etc.
  4. gensim: “topic modeling for humans”
     Topic modeling attempts to uncover the underlying semantic structure of a set of data by identifying recurring patterns of terms (topics).
     Topic modelling does not parse sentences, does not care about word order, and does not “understand” grammar or syntax.
  5. gensim: “topic modeling for humans”
     >>> lsi_model.show_topics()
     -0.203*"smith" + 0.166*"jan" + 0.132*"soccer" + 0.132*"software" + 0.119*"fort" + -0.119*"nov" + 0.116*"miss" + -0.114*"opera" + -0.112*"oct" + -0.105*"water",
     0.179*"squadron" + 0.158*"smith" + -0.140*"creek" + 0.135*"chess" + -0.130*"air" + 0.128*"en" + -0.122*"nov" + -0.120*"fr" + 0.119*"jan" + -0.115*"wales",
     0.373*"jan" + -0.236*"chess" + -0.234*"nov" + -0.208*"oct" + 0.151*"dec" + -0.106*"pennsylvania" + 0.096*"view" + -0.092*"fort" + -0.091*"feb" + -0.090*"engineering",
  6. gensim isn't about topic modeling (for me, anyway). It's about similarity.
     What is similarity? Some types:
     • String matching
     • Stylometry
     • Term frequency
     • Semantic (meaning)
  7. Is
     “A seven-year quest to collect samples from the solar system's formation ended in triumph in a dark and wet Utah desert this weekend.”
     similar in meaning to
     “For a month, a huge storm with massive lightning has been raging on Jupiter under the watchful eye of an orbiting spacecraft.”
     more or less than it is similar to
     “One of Saturn's moons is spewing a giant plume of water vapour that is feeding the planet's rings, scientists say.”
     ?
  8. Who cares about semantic similarity? Some use cases:
     • Query large collections of text
     • Automatic metadata
     • Recommendations
     • Better human-computer interaction
  9. gensim.corpora: TextCorpus and other kinds of corpus classes
     >>> corpus = TextCorpus(file_like_object)
     >>> [doc for doc in corpus]
     [[(40, 1), (6, 1), (78, 2)], [(39, 1), (58, 1), ...]
     A corpus is a stream of vectors of document feature ids. For example, words in documents are features (“bag of words”).
  10. gensim.corpora: TextCorpus and other kinds of corpus classes
     >>> corpus = TextCorpus(file_like_object)
     >>> [doc for doc in corpus]
     [[(40, 1), (6, 1), (78, 2)], [(39, 1), (58, 1), ...]
     The Dictionary class:
     >>> print corpus.dictionary
     Dictionary(8472 unique tokens)
     A dictionary maps features (words) to feature ids (numbers).
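
To make the corpus and dictionary ideas concrete, here is a minimal sketch that builds both from a few in-memory documents rather than a file-like object; the toy sentences, tokenization, and variable names are invented for illustration, not taken from the talk.

from gensim import corpora

# three toy documents; real corpora would be streamed from disk
documents = [
    "a saturn moon spews a giant plume of water vapour",
    "a huge storm has been raging on jupiter",
    "massive lightning storm raging on jupiter",
]

# naive tokenization; a real pipeline would also lowercase, drop stopwords, etc.
tokenized = [doc.split() for doc in documents]

# the Dictionary maps each word (feature) to an integer id
dictionary = corpora.Dictionary(tokenized)

# a corpus is a stream of sparse vectors: (feature id, count) pairs per document
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
print(corpus[0])  # e.g. [(0, 2), (1, 1), (2, 1), ...]
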
  11. Need a massive collection of documents that ostensibly has meaning. Sounds like a job for Wikipedia.
     >>> wiki_corpus = WikiCorpus(articles)  # articles is a Wikipedia text dump bz2 file. several hours.
     >>> wiki_corpus.dictionary.save("wiki_dict.dict")  # persist dictionary
     >>> MmCorpus.serialize("wiki_corpus.mm", wiki_corpus)  # uses numpy to persist corpus in Matrix Market format. several GBs. can be bz2'ed.
     >>> wiki_corpus = MmCorpus("wiki_corpus.mm")  # revive a corpus
  12. gensim.models: transform corpora using model classes.
     For example, the term frequency/inverse document frequency (TFIDF) transformation reflects the importance of a term, not just its presence/absence.
  13. gensim.models
     >>> tfidf_trans = models.TfidfModel(wiki_corpus, id2word=dictionary)  # TFIDF computes frequencies of all document features in the corpus. several hours.
     TfidfModel(num_docs=3430645, num_nnz=547534266)
     >>> tfidf_trans[documents]  # emits documents in TFIDF representation. documents must be in the same BOW vector space as wiki_corpus.
     [[(40, 0.23), (6, 0.12), (78, 0.65)], [(39, ...]
     >>> MmCorpus.serialize("tfidf_corpus.mm", tfidf_trans[wiki_corpus], id2word=dictionary)  # builds a new corpus by iterating over documents transformed to TFIDF
     >>> tfidf_corpus = MmCorpus("tfidf_corpus.mm")
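
Scaled down to the toy corpus from the sketch above (so it runs in milliseconds rather than hours), the same TFIDF step looks roughly like this; the variable names carry over from that sketch.

from gensim import models

# learn document frequencies from the toy bag-of-words corpus
tfidf = models.TfidfModel(corpus)

# transformation is lazy: indexing the model with a corpus yields TFIDF-weighted vectors
tfidf_corpus = tfidf[corpus]
print(list(tfidf_corpus)[0])  # e.g. [(0, 0.58), (2, 0.58), ...] -- weights, not raw counts
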
  14. gensim.models
     >>> lsi_trans = models.LsiModel(corpus=tfidf_corpus, id2word=dictionary, num_topics=400)  # creates an LSI transformation model from the TFIDF corpus representation
  15. Topics again for a bit
     >>> lsi_model.show_topics()
     -0.203*"smith" + 0.166*"jan" + 0.132*"soccer" + 0.132*"software" + 0.119*"fort" + -0.119*"nov" + 0.116*"miss" + -0.114*"opera" + -0.112*"oct" + -0.105*"water",
     0.179*"squadron" + 0.158*"smith" + -0.140*"creek" + 0.135*"chess" + -0.130*"air" + 0.128*"en" + -0.122*"nov" + -0.120*"fr" + 0.119*"jan" + -0.115*"wales",
     0.373*"jan" + -0.236*"chess" + -0.234*"nov" + -0.208*"oct" + 0.151*"dec" + -0.106*"pennsylvania" + 0.096*"view" + -0.092*"fort" + -0.091*"feb" + -0.090*"engineering",
  16. Topics again for a bit
     • SVD decomposes a matrix into three simpler matrices
     • full-rank SVD would be able to recreate the underlying matrix exactly from those three matrices
     • lower-rank SVD provides the best (least-squares error) approximation of the matrix
     • this approximation can find interesting relationships among data
     • it preserves most information while reducing noise and merging dimensions associated with terms that have similar meanings
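
A small numpy sketch of the low-rank idea behind LSI, with an invented term-document matrix: the full SVD recreates the matrix exactly, while a truncated SVD keeps only the strongest dimensions.

import numpy as np

# tiny made-up term-document matrix: rows are terms, columns are documents
A = np.array([
    [3.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 3.0],
])

# full SVD: A factors into U, singular values s, and Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U.dot(np.diag(s)).dot(Vt)))  # True: full rank recreates A exactly

# keep only the k largest singular values: the best rank-k (least-squares) approximation
k = 2
A_k = U[:, :k].dot(np.diag(s[:k])).dot(Vt[:k, :])
print(np.round(A_k, 2))  # close to A, with noise reduced and correlated dimensions merged
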
  17. Topics again for a bit
     • SVD: alias-i.com/lingpipe/demos/tutorial/svd/read-me.html
     • Original paper: www.cob.unt.edu/itds/faculty/evangelopoulos/dsci5910/LSA_Deerwester1990.pdf
     • General explanation: tottdp.googlecode.com/files/LandauerFoltz-Laham1998.pdf
     • Many more
  18. gensim.models
     >>> lsi_trans = models.LsiModel(corpus=tfidf_corpus, id2word=dictionary, num_topics=400, decay=1.0, chunksize=20000)  # creates an LSI transformation model from the TFIDF corpus representation
     >>> print lsi_trans
     LsiModel(num_terms=100000, num_topics=400, decay=1.0, chunksize=20000)
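
On the toy corpus from the earlier sketches, the same step is roughly as follows; num_topics=2 is chosen only because the corpus is tiny.

from gensim import models

# fold the TFIDF-weighted toy corpus into a 2-dimensional LSI space
lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)
print(lsi.show_topics())  # each topic is a weighted combination of words
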
  19. gensim.similarities (the best part)
     >>> index = Similarity(corpus=lsi_trans[tfidf_trans[index_corpus]], num_features=400, output_prefix="/tmp/shard")
     >>> index[lsi_trans[tfidf_trans[dictionary.doc2bow(tokenize(query))]]]  # similarity of each document in the index corpus to a new query document
     >>> [s for s in index]  # a matrix of each document's similarities to all other documents
     [array([ 1.  ,  0.  ,  0.08,  0.01]), array([ 0.  ,  1.  ,  0.02, -0.02]), array([ 0.08,  0.02,  1.  ,  0.15]), array([ 0.01, -0.02,  0.15,  1.  ])]
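
Continuing the toy sketch, a minimal in-memory version of the same query flow might look like this; MatrixSimilarity is used here instead of the sharded Similarity class on the slide because the corpus is tiny, and the query sentence is invented.

from gensim import similarities

# index the LSI vectors of the toy corpus (in memory; Similarity shards to disk instead)
index = similarities.MatrixSimilarity(lsi[tfidf[corpus]], num_features=2)

# a new query goes through the same chain of transformations as the indexed documents
query = "lightning storm raging on jupiter"
query_bow = dictionary.doc2bow(query.split())
sims = index[lsi[tfidf[query_bow]]]
print(list(enumerate(sims)))  # cosine similarity of the query to each indexed document
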
  20. About gensim
     Four additional models available.
     Dependencies: numpy, scipy. Optional: Pyro, Pattern.
     Created by Radim Rehurek.
     • radimrehurek.com/gensim
     • github.com/piskvorky/gensim
     • groups.google.com/group/gensim
  21. Thank you
     Example code, visualization code, and ppt: github.com/sandinmyjoints
     Interview with Radim: williamjohnbert.com
  22. (additional slides)
  23. gensim.models
     • term frequency/inverse document frequency (TFIDF)
     • log entropy
     • random projections
     • latent Dirichlet allocation (LDA)
     • hierarchical Dirichlet process (HDP)
     • latent semantic analysis/indexing (LSA/LSI)
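
These models share broadly the same transformation interface, so swapping one in is mostly a matter of changing the class. A hedged sketch on the toy corpus using LdaModel; num_topics and passes are illustrative values, not recommendations.

from gensim import models

# LDA works directly on bag-of-words counts rather than TFIDF weights
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
print(lda.show_topics())  # each topic is a probability-weighted mix of words
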
  24. Slightly more about gensim
     Dependencies: numpy and scipy, and optionally Pyro for distributed computing and Pattern for lemmatization.
     Data from Lee 2005 and other papers is available in gensim for tests.
  25. gensim: “topic modelling for humans”
     >>> lda_model.show_topics()
     [0.083*bridge + 0.034*dam + 0.034*river + 0.027*canal + 0.026*construction + 0.014*ferry + 0.013*bridges + 0.013*tunnel + 0.012*trail + 0.012*reservoir,
     0.044*fight + 0.029*bout + 0.029*via + 0.028*martial + 0.025*boxing + 0.024*submission + 0.021*loss + 0.021*mixed + 0.020*arts + 0.020*fighting,
     0.086*italian + 0.062*italy + 0.048*di + 0.024*milan + 0.019*rome + 0.014*venice + 0.013*giovanni + 0.012*della + 0.011*florence + 0.011*francesco]
