Your SlideShare is downloading. ×
An Introduction to gensim: "Topic Modelling for Humans"
Upcoming SlideShare
Loading in...5

Thanks for flagging this SlideShare!

Oops! An error has occurred.

Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

An Introduction to gensim: "Topic Modelling for Humans"


Published on

An introduction to gensim, a free Python framework for topic modelling and semantic similarity using LSA/LSI and other statistical techniques.

An introduction to gensim, a free Python framework for topic modelling and semantic similarity using LSA/LSI and other statistical techniques.

Published in: Technology, Education
  • Be the first to comment

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide
  • Hi everyone, thanks for coming.I’m going to start off with a quick demo app. Please go to the address you see up there. Hookbox somtimes takes several seconds to connect to the channel, so give it some if it’s red. It will turn green. You’ll be invited to submit a sentence, particularly a statement or fact that has a mixture of nouns, verbs, and adjectives. There’s some examples of the kinds of sentence that might work well with this demo, but take a moment to think of a sentence of your own and go ahead and submit. We should see them pop up here on the visualization screen.The idea is to provide a bit of a concrete grounding for the talk, and this will serve as an example of one thing you can do with gensim, or at least the data generated by it.What do we see? We have a table comparing a number of submitted sentences with what I’m going to call similarity scores between them. The darker the green, the higher the score. Are there any interesting results?Here we also have some clustering visualizations that attempt to group the inputs that were found to have highest scores together. How do they cluster together?clickHopefully that worked and showed some interesting relationships among the input. (If not, well, I blame the input.)gensim, which I’ll be talking about today, was generating all the underlying similarity scores, measuring how similar each sentence was to the other ones. I’m going to explain how to get results like this from gensim.
  • A few quick words about me:William BertDeveloper at Carney Labs for about seven monthsCarney Labs is basically a startup wholly owned by a larger company in Alexandria called Team Carney.I use gensim at work developing a conversational tutoring web app.Topic modelling is still pretty new to me and I’m constantly learning more about it so my knowledge is still growing, but I’m really fascinated by it and trying to learn more by working with it a lot, and by doing things like this presentation.
  • gensim is a free Python framework for doing topic modelling. I’m going to blaze through a quick overview of topic modelling, then discuss how gensim uses it to do semantic similarity, generating data like what we saw.Topic modelling attempts to uncover the underlying semantic structure of text (or other data) by using statistical techniques to identify abstract, recurring patterns of terms in a set of data. These patterns are called topics. They may or may not correspond to our intuitive notion of a topic. Topic modelling models documents as collections of features, representing the documents as long vectors that indicate the presence/absence of important features, for example, the presence or absence of words in a document. We can use those vectors to create spaces and plot the locations of documents in those spaces and use that as a kind of proxy for their meaning.What isn't topic modelling?Topic modelling: does not parse sentences. in fact, knows nothing aboutword order. makes no attempt to "understand" grammar or languagesyntax. What does a topic look like?
  • Let’s take a quick look at some topics now, and we’ll come back to them again after I walk through how to generate them.clickThese three abbreviated topics were extracted from a large corpora of texts by gensim using a technique called latent semantic analysis (LSA). Just quickly note how they are collections of words that don’t necessarily/intuitively seem to belong together. There are also positive and negative scalar factors for each word, which get smaller in magnitude as the topic goes on. We don't see it here, but each of these topics actually has thousands more terms—these are just the first ten. So that’s what a topic looks like when I talk about topics, but the truth is...
  • gensim isn’t really about topic modeling, for me anyway. clickIt’s really about similarity. Topics are a means to an end.clickA few wordsabout similarity, because it can be elusive.clickThere are different kinds of similarity: String matching – how many characters strings have in commonStylometry – which is the similarity of style that looks at, say, length of words or sentences, use of function words, ratio of nouns to verbs, etc. Used to identify authors, for example.Term frequency – do the documents use the same words the same number of times (when scaled and normalized)?The kind of similarity I'm interested in is semantic similarity. Similarity of meanings.But what is that?
  • Take a moment to read these three sentences.They might be said to share certain elements: non-earth planets, weather, duration, research and data collectionHow would you quantify their similarity? How would you decide that two are more similar to each other than to the third? A study done in Australia in 2005 skipped over the question of defining semantic similarity formally and abstractly and instead defined it as what a sample of Australian college students think is similar. They had students read hundreds of paired short excerpts from news articles (and these sentences are excerpted from some of those)and rank the pairwise similarity on a scale. They then examined all the classifications, and found that they had a correlation of0.6. That’s obviously a positive correlation, but not terribly high. So, humans doesn’t necessarily agree with each other about semantic similarity—it’s kind of a fuzzy notion.That said, we’re going to try to put a number on it. In fact, the study I mentioned found that a particular topic modelling technique called latent semantic analysis (LSA) could achieve also a 0.6 correlation with the human ratings—correlating with the study participant’s choice about as well as they correlated with each other.
  • Why do we care about semantic similarity? Some use cases for document similarity comparison:-Traditionally used on large document collections. Legal discovery.Answer questions or aid searching huge corpora like government regulations, manuals, patent databases, etc.-Automatic metadata: system can intelligently suggest tags and categories for documents based on other documents they’re similar too. -Something that came up recently on the gensim google group: in a CMS, when a user creates a new post, they want to see posts that may be similar in content. We can do that with semantic similarity , and in fact, someone actuallly made this into a plugin for plone using gensim.-Recommendations, plagiarism dectection, exam scoring. And there are a number of other use cases. There are some new and fast online algorithms that work in realtime, whereas aspreviously, work was often done in batches. This brings up another potential use:-better HCI. Matching on similarityrather than words or with regexes allows us to accept broader ranges of input, in theory. So to make it happen, enter...gensim
  • To get our topics that we can then use to compute semantic similarity, we’re going to start by turning a large set of documents, which we’ll call a training/background corpus, into numeric vectors. We’ll use the tools in gensim’s corpora package.When I say document, a document can be as short as one word, or as long as many pages of text, or anywhere in between. My examples and the demo app are mostly sentence-size documents.In gensim, a corpus is an iterable that returns its documents as sparse vectors. (A sparse vector is just a compact way of storing large vectors that are mostly zeroes.)Corpus can made from a file, database query, a network stream, etc, as long as you can stream the documents and emit vectors. gensim iterates over the documents in a corpus with generators, so it uses constant memory, which means you can have enormous corpora and indexes.How you generate those vectors from the documents is up to you. (So a corpus isn't inherently tied to words or even language, it could be constructed from features of anything such as music or video, if you can figure out what to use as features [like amplitude or frequencies?] and how to extract them.)If your features are the presence or absence of words, your corpus is in what's called "bucket of words" (BOW) format. Gensim provides a convenience class called TextCorpus for creating a such corpus from a text file.So here we have a list of documents, each document is a list of (feature id, count) tuples. Feature #40 appears one time in document #0, etc.For BOW, we also need a dictionary...
  • Dictionary maps feature ids back to features (words). The corpus class will generate this for me.So the vectors indicate the presence of words in particular documents, and the resulting matrix containing these vectors will represent all the words appearing in all the documents.
  • To do interesting and useful things with semantic similarity, we need a good training or background corpus. Finding a good training corpus is something of an art. You want a large collection of documents (at least ten of thousands) that are representative of your problem domain. it can be difficult to find or build such a corpus. clickor, you can just use wikipedia. Helpfullyfor experimenting, gensim comes with a WikiCorpus class and other code for building a corpus from awikipedia article dump. clickWikicorpus makes two passes, one to extract the dictionary, and another to create and store the sparse vectors. It takes about 10 hours on an i7 to generate and serializecorpus and dictionary, though it uses constant memory. The resulting output vectors are about 15 GB uncompressed, about 5 compressed.So after these operations, wiki_corpus is now a BOW vector space representation of Wikipedia, embodied in a large corpus file in the Matrix Market format (popular matrix file format) and a several megabyte dictionary mapping ids to tokens (words). What can we do with our corpus?
  • We can transform corpora from one vector space to another using models. Transformationscan bring out hidden structure in the corpus, such as revealing relationships between words and documents. They can also represent the corpus in a more compact way, preserving much information while consuming fewer resources.A gensim 'transformation' is any object which accepts a sparse document via dictionary notation and returns another sparse document.”One useful transformation that we can generate from our BOW corpus is term frequency/inverse document frequency (TFIDF). Instead of a count of word appearances in a document, we get a score for each word that also takes into account the global frequency of that word. So a word’s TFIDF value in a given document increases proportionally to the number of times a word appears in that particular document, but is offset by the frequency of the word in the entire corpus, which helps to control for the fact that some words are generally more common than others.
  • Transformations are initialized with a training corpus,so we realize a TFIDF transformation from the corpus we just generated,wiki_corpus. This also takes several hours to generate for wikipedia.num_docs is number of documents in dictionary.num_nnz is number of non-zeroes in the matrix.clickOnce our model is generated, we can transform documents represented in one vector space model—the wiki_corpus BOW space--and emit them in another—the wiki_corpus TFIDF space as (word_id, word_weight) tuples where weight is a positive, normalized float. These documents can be anything, new and unseen, as long as they have been tokenized and put into the BOW representation using the same tokenizer and dictionary word->id mappings that were used for the wiki We can emit these new representations right into a fresh MmCorpus, which could also be serialized and persisted on disk (also requiring several GBs). However,
  • TFIDF corpus itself is not all that interesting except as a stepping stone to another model called LSI.Latent semantic indexing/analysis (LSI/LSA) —grandaddy of topic modelling similarity techniques. Original paper is from 1990. It produced the results we saw in the visualization.We can generate an LSI model almost the same way we did the TFIDF model, but we do need to provide an extra parameter called num_features, which brings us back to topics...
  • num_features is a parameter to LSI telling it how many topics to make. Here again are the topics we saw. These were generated by LSI from Wikipedia articles. What are these topics? It’s hard to say, exactly. The “themes” are unclear. But they are in some sense the corpora’s "principal components” (And in fact, principal component analysis is similar, if you know what that is.)Here’s a brief rundown of how LSI works to calculate these topics...
  • LSI uses a technique called singular value decomposition (SVD) to reduce the original term/document matrix’s number of dimensions and keep the most information for a given number of topics. I understand the technique conceptually but I’m not going to try to get into the math behind it because I don’t really understand it well enough to explain it, and there are plenty of resources online that explain it in great and accurate detail. Nonetheless, I’ll at least describe it briefly: SVD decomposes the word/document matrix into three simpler matrices. Full rank SVD will recreate the underlying matrix exactly, but LSA uses lower-order SVD, which provides the best (in the sense of least square error) approximation of the matrix at lower dimensions. By lowering the rank, dimensions associated with terms that have similar meanings are merged together. This preserves the most important semantic information in the text while reducing noise, and can uncover interesting relationships among the data of the underlying matrix. Still, the meaning of the terms and topics is not really apparent to us. This is because a single LSI topic is not about a single thing; they work as a set, The topics contain both positive and negative values, which cancel each other out delicately when generating vectors for documents.This is one of the reasons LSI topics are hard to interpret.---The original matrix can be too large for the computing resources; in this case, the approximated low rank matrix is interpreted as an approximation (a "least and necessary evil").The original matrix can be noisy: for example, anecdotal instances of terms are to be eliminated. From this point of view, the approximated matrix is interpreted as a de-noisified matrix (a better matrix than the original).The original term-document matrix is presumed overly sparse relative to the "true" term-document matrix. That is, the original matrix lists only the words actually in each document, whereas we might be interested in all words related to each document—generally a much larger set due to synonymy.The consequence of the rank lowering is that some dimensions are combined and depend on more than one term:{(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)}This mitigates the problem of identifying synonymy, as the rank lowering is expected to merge the dimensions associated with terms that have similar meanings. It also mitigates the problem with polysemy, since components of polysemous words that point in the "right" direction are added to the components of words that share a similar meaning. Conversely, components that point in other directions tend to either simply cancel out, or, at worst, to be smaller than components in the directions corresponding to the intended sense.---A = m*n matrixA = U * S * V^twhere U is an m*k matrix, V is an n*k matrix, and S is a k*k matrix, and k is the rank of the matrix A.
  • As I said, there are plenty of resources online to explain it, so for now I will direct your questions there.
  • So we generate our somewhat mysterious LSI model, asking for 400 topics. Interesting note: no one has really figured out how to determine the best numer of topics for a given corpus for LSI, but experimentally people have found good results between 200 and 500.This will also take several hours. LSI model generation can be distributed to multiple CPUs/machines through a library called Python Remote Objects (Pyro), leading to faster model generation times. When it’s done, we have an LSIModel with 100,000 terms—that’s the size of the dictionary we created. The decay parameter gives more emphasis to new documents if any are added to the model after initial generation.Because the SVD algorithm is incremental, the memory load isconstant and can be controlled by a chunksize parameter thatsays how many documents are to be loaded into RAM at once. Larger chunks speed things up, but also require more RAM.
  • Now we get to the best part. With our LSI transformation, we can now use the classes in gensim.similarities to create an index of all the documents that we want to compare subsequent queries against. The Similarity class uses fixed memory by splitting the index across shards on disk and mmap'ing them in as necessary. output_prefix is for the filenames of the shards.What is index_corpus? Could be my original training/universe corpus--Wikipedia. Then the index would tell me which Wikipedia document any new queries are most similar to. But index corpus could also be a set of entirely different documents, for example arbitrary sentences typed in by a group of Python programmers,and the index will determine which of those documents my query is most similar to. You can even add new documents to an index in realtime.clickSo to calculate similarity of a query, we tokenize and preprocess query the same we treated wiki corpus, then convert to BOW, then do TFIDF transform, then LSI tranform, give that to the index, and we’ll get a list of similarity scores between the query and each document in the index. clickYou can also calculate the similarity scores between all documents in the index and get back a 2-dimensional matrix, which is what the visualization app was doing every time a new document was added to the index. This is what makes realtime similarity comparisons possible, for some value of similar.
  • A few more things about gensim before I wrap up:TFIDF and LSI are only two of six models it implements. There are a couple more weighting models and a couple more dimensionality reduction models, each with different properties, but I haven’t had a chance to work with those very clickgensim’s dependencies are numpy and scipy, and optionally Pyro for distributed model generation and Pattern for additional input processing.clickTo give credit where credit’s due, I want to say a few words about where gensim comes from.It was created by a Czech guy named Radim Rehurek, and his work to make the algorithms scalable and online contributed to his Phd thesis. He is an active developer and is very helpful on the mailing list. He’s working hard to build a community around gensim and make it into a robust open source project (LGPL license).I asked him a few questions about his work on gensim; the questions and the answers are up on my personal site, says: "Gensim has no ambition to become an all-encompassing production level tool, with robust failure handling and error recoveries.”But in my experience it has performed well, and Radim mentioned several commercial applications that are using it in addition to universities
  • Thanks for listening. This presentation, some sample, and the demo app are available on my github page, should note that the demo web app and the visualization are actually not part of gensim. gensim generated the data but the app is Flask and hookbox, the clustering is scipy and scikit-learn,and the visualization is d3.questions?
  • In addition to TFIDF, gensim has implemented several VSM algorithms, most of which I know nothing about, but to do justice to gensim’s capabilities:-TFIDF —weights tokens according to importance (local vs global).preserves dimensionality -Log Entropy- another term weighting function that uses log entropy normalization. preserves dimensionality. -Random Projections —approximates tfidf distances but less computationally expensive. reduces dimensionality -Latent Dirichlet Allocations (LDA) — a generative model that produces more human-readable topics. reduces dimensionality.-Hierarchical Dirichlet Process (HDP) — very new, first described in a paper from 2006, but not all operations are fully implemented in gensim yet.-Latent semantic indexing/analysis (LSI/LSA) —grandaddy of topic modelling similarity techniques. reduces dimensionality. Original paper is Deerwester et al 1990. I have used it most.
  • Dependencies: numpy and scipy, and optionally Pyro for distributed and Pattern for lemmatizationdata from Lee 2005 and other papers is available in gensim for tests
  • These recurring patterns called topics may or may not correspond to our intuitive notion of a topic. The abbreviated ones printed here were extracted from a large corpora of texts by gensim using a technique called latent dirichlet allocation (LDA) that actually does tend to produce human-readable topics (but not all the techniques do that, and latent semantic analysis, which is what the demo app used and what we’ll be looking at soon, does not)Some themes—they appear to be “about” something, with the terms having a decreasing weighting
  • Transcript

    • 1. topic modelingfor humansWilliam BertDC Python Meetup1 May 2012
    • 2. please go to http://ADDRESSand enter a sentenceinteresting relationships?gensim generated the data for those visualizationsby computing the semantic similarity of the input
    • 3. who am I?William Bertdeveloper at Carney Labs ( of gensimstill new to world of topic modelling,semantic similarity, etc
    • 4. gensim: “topic modeling for humans”topic modeling attempts to uncover theunderlying semantic structure of by identifyingrecurring patterns of terms in a set of data(topics).topic modellingdoes not parse sentences,does not care about word order, anddoes not "understand" grammar or syntax.
    • 5. gensim: “topic modeling for humans”>>> lsi_model.show_topics()-0.203*"smith" + 0.166*"jan" + 0.132*"soccer" +0.132*"software" + 0.119*"fort" + -0.119*"nov" +0.116*"miss" + -0.114*"opera" + -0.112*"oct" + -0.105*"water",0.179*"squadron" + 0.158*"smith" + -0.140*"creek" + 0.135*"chess" + -0.130*"air" +0.128*"en" + -0.122*"nov" + -0.120*"fr" +0.119*"jan" + -0.115*"wales",0.373*"jan" + -0.236*"chess" + -0.234*"nov" + -0.208*"oct" + 0.151*"dec" + -0.106*"pennsylvania"+ 0.096*"view" + -0.092*"fort" + -0.091*"feb" + -0.090*"engineering",
    • 6. gensim isnt about topic modeling(for me, anyway)Its about similarity.What is similarity?Some types:• String matching• Stylometry• Term frequency• Semantic (meaning)
    • 7. IsA seven-year quest to collect samples from thesolar systems formation ended in triumph in adark and wet Utah desert this weekend.similar in meaning toFor a month, a huge storm with massivelightning has been raging on Jupiter under thewatchful eye of an orbiting spacecraft.more or less than it is similar toOne of Saturns moons is spewing a giant plumeof water vapour that is feeding the planetsrings, scientists say.?
    • 8. Who cares about semantic similarity?Some use cases:• Query large collections of text• Automatic metadata• Recommendations• Better human-computer interaction
    • 9. gensim.corporaTextCorpus and other kinds of corpus classes>>> corpus = TextCorpus(file_like_object)>>> [doc for doc in corpus][[(40, 1), (6, 1), (78, 2)], [(39, 1), (58, 1),...]corpus = stream of vectors of documentfeature idsfor example, words in documents arefeatures (“bucket of words”)
    • 10. gensim.corporaTextCorpus and other kinds of corpus classes>>> corpus = TextCorpus(file_like_object)>>> [doc for doc in corpus][[(40, 1), (6, 1), (78, 2)], [(39, 1), (58, 1),...]Dictionary class>>> print corpus.dictionaryDictionary(8472 unique tokens)dictionary maps features (words) to feature ids(numbers)
    • 11. need massive collection of documents thatostensibly has meaningsounds like a job for wikipedia>>>wiki_corpus= WikiCorpus(articles) # articlesis Wikipedia text dump bz2 file. several hours.>>>"wiki_dict.dict")# persist dictionary>>>MmCorpus.serialize("", wiki_corpus) # uses numpy to persist corpus in MatrixMarket format. several GBs. can be BZ2’ed.>>>wiki_corpus= MmCorpus("") #revive a corpus
    • 12. gensim.modelstransform corpora using models classesfor example, term frequency/inverse documentfrequency (TFIDF) transformationreflects importance of a term, not justpresence/absence
    • 13. gensim.models>>> tfidf_trans =models.TfidfModel(wiki_corpus, id2word=dictionary) # TFIDF computes frequencies of all documentfeatures in the corpus. several hours.TfidfModel(num_docs=3430645, num_nnz=547534266)>>> tfidf_trans[documents] # emits documentsin TFIDF representation. documents must be inthe same BOW vector space as wiki_corpus.[[(40, 0.23), (6, 0.12), (78, 0.65)], [(39, ...]>>> tfidf_corpus = MmCorpus(corpus=tfidf_trans[wiki_corpus], id2word=dictionary) # builds newcorpus by iterating over documents transformedto TFIDF
    • 14. gensim.models>>> lsi_trans =models.LsiModel(corpus=tfidf_corpus, id2word=dictionary, num_features=400) # creates LSItransformation model from tfidf corpusrepresentation
    • 15. topics again for a bit>>> lsi_model.show_topics()-0.203*"smith" + 0.166*"jan" + 0.132*"soccer" +0.132*"software" + 0.119*"fort" + -0.119*"nov" +0.116*"miss" + -0.114*"opera" + -0.112*"oct" + -0.105*"water",0.179*"squadron" + 0.158*"smith" + -0.140*"creek" + 0.135*"chess" + -0.130*"air" +0.128*"en" + -0.122*"nov" + -0.120*"fr" +0.119*"jan" + -0.115*"wales",0.373*"jan" + -0.236*"chess" + -0.234*"nov" + -0.208*"oct" + 0.151*"dec" + -0.106*"pennsylvania"+ 0.096*"view" + -0.092*"fort" + -0.091*"feb" + -0.090*"engineering",
    • 16. topics again for a bit• SVD decomposes a matrix into three simpler matrices• full rank SVD would be able to recreate the underlyingmatrix exactly from those three matrices• lower-rank SVD provides the best (least square error)approximation of the matrix• this approximation can find interesting relationships amongdata• it preserves most information while reducing noise andmerging dimensions associated with terms that have similarmeanings
    • 17. topics again for a bit••Original• General• Many more
    • 18. gensim.models>>> lsi_trans =models.LsiModel(corpus=tfidf_corpus, id2word=dictionary, num_features=400, decay=1.0, chunksize=20000) # creates LSI transformation model from tfidfcorpus representation>>> print lsi_transLsiModel(num_terms=100000, num_topics=400, decay=1.0, chunksize=20000)
    • 19. gensim.similarities(the best part)>>> index =Similarity(corpus=lsi_transformation[tfidf_transformation[index_corpus]], num_features=400, output_prefix=”/tmp/shard”)>>>index[lsi_trans[tfidf_trans[dictionary.doc2bow(tokenize(query))]]] # similarity of each documentin the index corpus to a new query document>>> [s for s in index] # a matrix of eachdocument’s similarities to all other documents[array([ 1. , 0. , 0.08, 0.01]), array([ 0. , 1. , 0.02, -0.02]), array([ 0.08, 0.02, 1. , 0.15]), array([ 0.01, -0.02, 0.15, 1. ])]
    • 20. about gensimfour additional models availabledependencies: optional:numpy Pyroscipy Patterncreated by Radim Rehurek•••
    • 21. thank youexample code, visualization code, and with
    • 22. (additional slides)
    • 23. gensim.models• term frequency/inverse document frequency(TFIDF)• log entropy• random projections• latent dirichlet allocation (LDA)• hierarchical dirichlet process (HDP)• latent semantic analysis/indexing (LSA/LSI)
    • 24. slightly more about gensimDependencies: numpy and scipy, and optionallyPyro for distributed and Pattern for lemmatizationdata from Lee 2005 and other papers is availablein gensim for tests
    • 25. gensim: “topic modelling for humans”>>> lda_model.show_topics()[0.083*bridge + 0.034*dam + 0.034*river +0.027*canal + 0.026*construction + 0.014*ferry +0.013*bridges + 0.013*tunnel + 0.012*trail +0.012*reservoir, 0.044*fight + 0.029*bout + 0.029*via +0.028*martial + 0.025*boxing + 0.024*submission +0.021*loss + 0.021*mixed + 0.020*arts +0.020*fighting, 0.086*italian + 0.062*italy + 0.048*di +0.024*milan + 0.019*rome + 0.014*venice +0.013*giovanni + 0.012*della + 0.011*florence +0.011*francesco’]