
Natural Language Processing in Action: teaching computers to interpret text

Most humans are pretty good at reading and interpreting text; computers...not so much. Natural Language Processing (NLP) is the discipline of teaching computers to read more like people, and you see examples of it in everything from chatbots to the speech-recognition software on your phone.
Natural Language Processing in Action is your guide to creating machines that understand human language.

Save 42% off Natural Language Processing in Action with code sllane at: https://www.manning.com/books/natural-language-processing-in-action


  1. Analyzing meaning using the “Bag of Words” vector. Save 42% off Natural Language Processing in Action with code sllane at manning.com.
  2. Counting the frequency of words to analyze meaning. Bag of Words is a useful vector representation that counts the number of occurrences, or “frequency,” of each word in a given text. It can be used to estimate the probable topic and sentiment of a document from the frequency of the words in it. For example, if a document frequently mentions “engines” and “wheels,” it is likely to have something to do with cars, and if it mentions “happy” and “smiling,” it is likely to have a positive sentiment.
  3. Still, you can imagine how such an algorithm might be prone to error. Let’s look at an example where counting occurrences of words is useful:

     from nltk.tokenize import TreebankWordTokenizer

     sentence = "The faster Harry got to the store, the faster Harry, the faster, would get home."
     tokenizer = TreebankWordTokenizer()
     token_sequence = tokenizer.tokenize(sentence.lower())
     print(token_sequence)
     ['the', 'faster', 'harry', 'got', 'to', 'the', 'store', ',', 'the', 'faster', 'harry', ',', 'the', 'faster', ',', 'would', 'get', 'home', '.']

     Next, we’ll need a Python dictionary and a Counter – a special kind of dictionary that bins objects and counts them how we want (see the sketch just below).
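     The next slide refers to a bag_of_words Counter without showing its construction; a minimal sketch of that missing step, reusing the token_sequence from above:

     from collections import Counter

     # Bin the tokens and count each one; this Counter is our bag of words
     bag_of_words = Counter(token_sequence)
     print(bag_of_words['harry'])  # 2 - "harry" occurs twice in the sentence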
  4. Despite its flaws, our bag of words still contains enough information about the original intent of the sentence, and is able to do some powerful things like detect spam, compute sentiment, and even detect subtle intent (like sarcasm). So it may be a bag, but it’s full of meaning and information. Let’s go ahead and get these words sorted into some order that’s easier to think about. The Counter object has a handy method, most_common, for just this purpose:

     word_list = bag_of_words.most_common()  # Passing an integer argument will give you that many from the top of the list
     print(word_list)
     [('the', 4), (',', 3), ('faster', 3), ('harry', 2), ('to', 1), ('.', 1), ('would', 1), ('home', 1), ('got', 1), ('store', 1), ('get', 1)]
  5. The ratio of the number of times a word occurs in a given document to the total word count of that document is referred to as term frequency (TF). Our top three words from the previous slide are “the,” “faster,” and “harry.” “the” doesn’t help us determine meaning or sentiment, so we’ll ignore it. That leaves us with “faster” and “harry.” “harry” appears twice, so its term frequency for this document (normalized here by the number of unique tokens rather than the raw token count) will be:

     times_harry_appears = bag_of_words['harry']
     total_words = len(word_list)  # The number of unique tokens in our vocabulary (11), not the raw token count
     tf = times_harry_appears/total_words
     print(tf)
     0.18181818181818182
  6. Let’s pause for a second and look a little deeper at Term Frequency, as it is pretty important. It is basically the word count tempered by how long the document is. But why “temper” it at all? Let’s say you find the word “dog” 3 times in document A and 100 times in document B. Clearly “dog” is more important to document B. But what if you find out document A is a 30-word email to a veterinarian and document B is War & Peace (~580,000 words!)? Our first analysis was a bit off, so let’s take the document length into account:

     TF('dog', documentA) = 3 / 30 = 0.1
     TF('dog', documentB) = 100 / 580,000 ≈ 0.00017
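     A quick check of that arithmetic in Python (the counts are the slide’s illustrative numbers, not measured values):

     # Hypothetical word counts from the example above
     dog_count_a, total_a = 3, 30         # a 30-word email to a veterinarian
     dog_count_b, total_b = 100, 580_000  # roughly the length of War & Peace
     print(dog_count_a / total_a)  # 0.1
     print(dog_count_b / total_b)  # 0.000172... (~0.00017)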
  7. Now we have something that describes “something” about the two documents and their relationship to the word “dog” and each other. So, instead of raw word counts in the vectors, we will use Term Frequencies. Going back to our original example (with the words “harry” and “faster”), we could calculate the TF of each word and get its relative “importance” to the document, as sketched below. Our protagonist and his need for speed are clearly central to the story, and we’ve made some progress in turning text into numbers. This is a contrived example, but one can quickly see how meaningful results could come from this approach. Let’s look at a bigger piece of text (on the next two slides), taken from the Wikipedia entry on kites:
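     Before the kite text, a minimal sketch of using Term Frequencies instead of raw counts, assuming the bag_of_words Counter built earlier (variable names are ours):

     # Divide each raw count by the total token count to get a TF vector
     total_tokens = sum(bag_of_words.values())
     tf_vector = {word: count / total_tokens for word, count in bag_of_words.items()}
     print(tf_vector['faster'])  # 3/19 = 0.157...
     print(tf_vector['harry'])   # 2/19 = 0.105...

     Note this divides by the total token count (19), matching the definition on slide 5; the slide 5 code divides by the number of unique tokens instead, which scales the values but preserves their ranking.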
  8. 8. Wikipedia Kites (https://en.wikipedia.org/wiki/Kite) A kite is traditionally a tethered heavier-than-air craft with wing surfaces that react against the air to create lift and drag. A kite consists of wings, tethers, and anchors. Kites often have a bridle to guide the face of the kite at the correct angle so the wind can lift it. A kite’s wing also may be so designed so a bridle is not needed; when kiting a sailplane for launch, the tether meets the wing at a single point. A kite may have fixed or moving anchors. Untraditionally in technical kiting, a kite consists of tether-set-coupled wing sets; even in technical kiting, though, a wing in the system is still often called the kite. The lift that sustains the kite in flight is generated when air flows around the kite’s surface, producing low pressure above and high pressure below the wings. The interaction with the wind also generates horizontal drag along the direction of the wind. The resultant force vector from the lift and drag force components is opposed by the tension of one or more of the lines or tethers to which the kite is attached.
  9. 9. Wikipedia Kites (https://en.wikipedia.org/wiki/Kite) continued: The anchor point of the kite line may be static or moving (e.g. the towing of a kite by a running person, boat, free-falling anchors as in paragliders and fugitive parakites or vehicle). The same principles of fluid flow apply in liquids and kites are also used under water. A hybrid tethered craft comprising both a lighter-than-air balloon as well as a kite lifting surface is called a kytoon. Kites have a long and varied history and many different types are flown individually and at festivals worldwide. Kites may be flown for recreation, art or other practical uses. Sport kites can be flown in aerial ballet, sometimes as part of a competition. Power kites are multi-line steerable kites designed to generate large forces which can be used to power activities such as kite surfing, kite landboarding, kite fishing, kite buggying and a new trend snow kiting. Even Man-lifting kites have been made.
  10. Let’s assign the Wikipedia kite text to a variable:

     from collections import Counter
     from nltk.tokenize import TreebankWordTokenizer

     tokenizer = TreebankWordTokenizer()
     # kite_text = "A kite is traditionally ..."  # Step left to the reader, so we aren't repeating ourselves
     tokens = tokenizer.tokenize(kite_text.lower())
     token_sequence = Counter(tokens)
     print(token_sequence)
     Counter({'the': 26, 'a': 20, 'kite': 16, ',': 15, 'of': 10, 'and': 10, 'kites': 8, 'is': 7, ... 'below': 1, 'from': 1, 'fluid': 1, ')': 1, 'lighter-than-air': 1})  # Compressed for brevity

     Sidenote: Interestingly, the tokenizer returns “kite.” as a token. Each tokenizer (such as RegexpTokenizer) treats punctuation differently, and you will get similar but different results. They each have their advantages and we encourage experimenting. Just a nice reminder that NLP is hard!
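     To see that sidenote in action, here is a small sketch comparing two NLTK tokenizers on a made-up sentence of our own (exact outputs depend on the NLTK version):

     from nltk.tokenize import TreebankWordTokenizer, RegexpTokenizer

     text = "The kite flew away. What a day!"
     # Treebank expects one sentence per call, so the mid-text period
     # tends to stay attached to its word, e.g. the token 'away.'
     print(TreebankWordTokenizer().tokenize(text.lower()))
     # A \w+ pattern keeps only word characters, dropping punctuation entirely
     print(RegexpTokenizer(r'\w+').tokenize(text.lower()))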
  11. If we look at our example, we can see a whole lot of stopwords. If we are just looking at raw word counts (we are), articles, conjunctions, and prepositions (“the,” “a,” “and,” “of”) aren’t going to tell us a great deal, so let’s ditch them:

     import nltk
     nltk.download('stopwords')
     stopwords = nltk.corpus.stopwords.words('english')
     tokens = [x for x in tokens if x not in stopwords]
     kite_count = Counter(tokens)
     print(kite_count)
     Counter({'kite': 16, ',': 15, 'kites': 8, 'wing': 5, 'lift': 4, 'may': 4, 'flown': 3, 'also': 3, 'kiting': 3, 'force': 2, 'tethers': 2, ... 'lighter-than-air': 1, 'still': 1, 'sets': 1, ')': 1, 'sport': 1})  # Compressed for brevity
  12. Just by looking at the number of times words occur in this document, we are learning something about it. The terms kite(s), wing, and lift are all very important. If we didn’t actually know what this document was about, and we just happened across it in our vast database of Google-like knowledge, we might “programmatically” be able to infer that it has something to do with “flight,” “lift,” or “kites.” Across multiple documents in a corpus, things get a little more interesting. A set of documents may all be about, say, kite flying. You would imagine all of the documents would refer to string and wind quite often, and TF(“string”) and TF(“wind”) would therefore rank very highly in all of the documents. A rough sketch of that corpus-level view follows.
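     The three short “documents” below are made up purely for illustration:

     from collections import Counter
     from nltk.tokenize import TreebankWordTokenizer

     # Hypothetical mini-corpus of kite-flying documents
     corpus = [
         "The wind caught the string and the kite shot upward.",
         "Keep tension on the string when the wind gusts.",
         "A steady wind makes the kite string sing.",
     ]
     tokenizer = TreebankWordTokenizer()
     for doc in corpus:
         tokens = tokenizer.tokenize(doc.lower())
         counts = Counter(tokens)
         # TF = count / total tokens in the document; "string" and "wind"
         # score consistently high across every document in this corpus
         print({w: round(counts[w] / len(tokens), 3) for w in ('string', 'wind')})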
  13. That’s all for now! We hope you found this presentation enjoyable and informative. Save 42% off Natural Language Processing in Action with code sllane at manning.com.
