- 1. Discovering Trending Topics in News, Kory Becker, October 2017, http://primaryobjects.com
- 3. Data Science vs Machine Learning Data Science Generalizable extraction of knowledge from data Predictive models and patterns Machine Learning Algorithms for modeling data SVM, neural network, clustering Artificial Intelligence Creating intelligent machines, human-like, or not Data science, machine learning, and a lot more
- 4. Unsupervised Learning Exploratory data analysis Discovers patterns in unlabeled data No training set No error rate for potential solution K-means Clustering, Markov Chains, Feature Extraction, Principal Component Analysis (Dimensionality Reduction)
- 5. K-Means Clustering Popular clustering algorithm Groups data into k clusters Data points belong to the cluster with closest mean Each cluster has a centroid (center)
- 6. k-Means Algorithm Choose a value for k (number of clusters) Guess Rule of thumb: ~~(Math.sqrt(points.length * 0.5)) Initialize centroids Random Farthest Point K-means++ Assign data points to closest centroid Move centroids to center of assigned points Demo: https://goo.gl/AjNEJk
- 12. What About Text? Natural language processing Term document matrix Digitize text into an array of 0’s and 1’s by term Remove sparse terms (non-frequently occurring terms) Reduced dimensionality Compressed data Speed
- 13. Natural Language Processing Convert text into a numerical representation Find commonalities within data Clustering Make predictions from data Classification Category, Popularity, Sentiment, Relationships
- 14. Bag of Words Model Corpus Cats like to chase mice. Dogs like to eat big bones.
- 15. Create a Dictionary Dictionary 0 - cats 1 - like 2 - chase 3 - mice 4 - dogs 5 - eat 6 - big 7 - bones Cats like to chase mice. Dogs like to eat big bones. Corpus
- 16. Digitize Text Cats like to chase mice. 1 1 1 1 0 0 0 0 Dogs like to eat big bones. 0 1 0 0 1 1 1 1 Vector Length = 8 Corpus Dictionary 0 - cats 1 - like 2 - chase 3 - mice 4 - dogs 5 - eat 6 - big 7 - bones
- 17. Classify Documents (eating) Cats like to chase mice. 1 1 1 1 0 0 0 0 Dogs like to eat big bones. 0 1 0 0 1 1 1 1 0 1 Corpus Dictionary 0 - cats 1 - like 2 - chase 3 - mice 4 - dogs 5 - eat 6 - big 7 - bones
- 18. Predict on New Data Cats like to chase mice. 1 1 1 1 0 0 0 0 Dogs like to eat big bones. 0 1 0 0 1 1 1 1 Bats eat bugs. 0 0 0 0 0 1 0 0 0 1 ? Dictionary 0 - cats 1 - like 2 - chase 3 - mice 4 - dogs 5 - eat 6 - big 7 - bones
- 19. Predict on New Data Cats like to chase mice. 1 1 1 1 0 0 0 0 Dogs like to eat big bones. 0 1 0 0 1 1 1 1 Bats eat bugs. 0 0 0 0 0 1 0 0 0 1 ? Dictionary 0 - cats 1 - like 2 - chase 3 - mice 4 - dogs 5 - eat 6 - big 7 - bones
- 20. Predict on New Data Cats like to chase mice. 1 1 1 1 0 0 0 0 Dogs like to eat big bones. 0 1 0 0 1 1 1 1 Bats eat bugs. 0 0 0 0 0 1 0 0 0 1 1 Dictionary 0 - cats 1 - like 2 - chase 3 - mice 4 - dogs 5 - eat 6 - big 7 - bones
- 21. Unigrams vs Bigrams Unigrams: “George”, “Bush”, “Clooney” Bigrams: “George Bush”, “George Clooney” N-grams?
- 22. ML + News + ??? = Profit! Extract news stories Build corpus of headlines Use bigrams (word pairs) Strip sparse terms Apply k-means clustering .. and what do we get?
- 23. Visualizing News Clusters October 6, 2014
- 24. Visualizing News Clusters November 5, 2014
- 25. Visualizing News Clusters December 1, 2014
- 26. Additional Reading Discovering Trending Topics in News http://primaryobjects.com/CMS/Article162 Mirroring Your Twitter Persona with Intelligence http://primaryobjects.com/CMS/Article160 TF*IDF with .NET http://primaryobjects.com/CMS/Article157

- Trending topics are popular on social media sites: Twitter, Facebook, Google Plus, and news aggregators. These sites collect a large volume of posts and group them into a list of hashtags or topics. One method for doing this is clustering with machine learning, specifically unsupervised learning, which is part of machine learning, which in turn falls under data science and artificial intelligence.
- There has been an increasing prominence of data science and AI due to several major factors: 1. The enormous amount of data being produced every second 2. Internet usage 3. Hardware speed increases 4. Recent algorithm breakthroughs such as deep learning and deep belief networks. Companies have been stockpiling data and now realize they can’t keep up with it – the human brain can’t even keep up with it. What can they actually do with all of this data? In this presentation, we’ll be using the R programming language, which is popular among data scientists, to analyze news headlines. But first, what is data science vs machine learning vs artificial intelligence? Data science encompasses the general extraction of knowledge from data. This includes predictive models and patterns, and usually involves statistics and machine learning. Machine learning is a set of algorithms for modeling data, usually based on statistical methods: SVM, neural networks, linear regression, logistic regression, clustering. Both data science and machine learning are part of artificial intelligence. AI is the creation of intelligent machines.
- Unsupervised learning is a type of exploratory data analysis. Unlike supervised learning, it doesn’t require labeled outputs, a training set, cross-validation, or test sets. Give it a bunch of data and the algorithm will make sense of it and discover patterns. Unsupervised learning is also a key idea behind deep learning (layers of unsupervised neural networks learn to recognize abstract patterns and feed into a supervised layer for fine-tune training).
- One of the most common algorithms used for unsupervised learning is “k-Means Clustering”. This algorithm works by grouping data into a specified number of groups, also called “clusters”. Each data point within the data-set belongs to the closest cluster. Each cluster has a centroid (i.e., the center of the cluster). K-means is a really simple, yet powerful algorithm, for automatically clustering and grouping data. In fact, it can often be used as a first go-to algorithm for any data exploration project. Let’s take a look at how this algorithm works.
- k-Means Algorithm The k-means algorithm is a relatively straightforward process. The first step is to choose a starting value for k. That is, choose the number of clusters that you’d like your data to be separated into. A rule of thumb for choosing the number of clusters is to take the square root of the number of data points divided by 2. You can, of course, choose any number of clusters that makes sense. Initializing centroids can be done using several methods. The simplest is to use random centroid locations. One problem with randomly initialized centroids is that the centroid locations may be too close together, possibly having 2 or more centroids fall within the same “true” cluster – thus dividing it in half. Another option for initializing centroids is to use a “farthest point” distance heuristic. With this type of initialization, the first centroid is randomly initialized. The second centroid is initialized to the data point farthest away from it. Each subsequent centroid is initialized to the data point farthest from the existing centroids. In this manner, centroids are well spread out from each other, offering a better chance at locating the true clusters within the data. A third option is called k-means++. Similar to the farthest point heuristic, each centroid after the first one is initialized to a data point location with a probability proportional to the square of its distance to the nearest preceding centroid. For a great visual on this algorithm, see: https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
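- The assign-and-update loop described above can be sketched in a few lines of JavaScript (matching the language of the slide’s square-root snippet; the talk’s own analysis uses R). The function names and the simple first-k-points initialization here are illustrative, not from the talk’s demo code:

```javascript
// Minimal k-means sketch on 2D points. Names and the naive
// initialization are illustrative, not from the talk's demo.
function distance(a, b) {
  return Math.hypot(a[0] - b[0], a[1] - b[1]);
}

// Index of the centroid nearest to point p.
function closest(p, centroids) {
  let best = 0;
  for (let i = 1; i < centroids.length; i++) {
    if (distance(p, centroids[i]) < distance(p, centroids[best])) best = i;
  }
  return best;
}

function kmeans(points, k, maxIter = 100) {
  // Initialize centroids to the first k points (use random or
  // farthest-point initialization in practice, as discussed above).
  let centroids = points.slice(0, k).map(p => p.slice());
  let assignments = [];
  for (let iter = 0; iter < maxIter; iter++) {
    // Assignment step: each point joins its nearest centroid's cluster.
    const next = points.map(p => closest(p, centroids));
    if (next.every((c, i) => c === assignments[i])) break; // converged
    assignments = next;
    // Update step: move each centroid to the mean of its assigned points.
    for (let c = 0; c < k; c++) {
      const members = points.filter((_, i) => assignments[i] === c);
      if (members.length === 0) continue;
      centroids[c] = [
        members.reduce((s, p) => s + p[0], 0) / members.length,
        members.reduce((s, p) => s + p[1], 0) / members.length,
      ];
    }
  }
  return { centroids, assignments };
}

// Two obvious groups: near (0,0) and near (10,10).
const pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]];
const result = kmeans(pts, 2);
```

Even with the naive initialization, the two obvious groups separate after a couple of iterations; a smarter initialization mainly matters when clusters are close together or k is large.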
- Now that we have an idea of how the algorithm works, let’s see an example! In the above picture we have a series of data points, scattered within the plot. The data seems to have some kind of pattern, but generally, the points are mostly random within them. Suppose we want to divide this data into 6 groups (or clusters). You can probably visually get an idea of where those boundaries would be, effectively dividing the data into 6 parts at each spoke. However, what if we want to cluster into 3 groups? What would that look like? Let’s run through the k-means algorithm and cluster this data into 3 groups. We’ll start by initializing 3 random centroids within our data.
- We’ve added 3 random centroids to the data. They actually appear pretty well spaced apart within the data, but in actuality, they are indeed randomly placed. Each point has been assigned to its closest centroid, thus coloring the area in its respective centroid’s color. For example, consider the blue area. Do you see that point to the far top-right, sitting right on the line of blue and green? You might think that point is closer to green, but it is indeed closer to the blue centroid. The same goes for all other points within their assigned cluster. With the data points assigned to a cluster, the next step is to move each centroid to the center of its assigned points. So for example, the blue centroid is going to shift slightly up and to the right, so that it sits squarely within the center of the blue area. Likewise, the green centroid will shift slightly to the right and down. The red centroid will shift slightly to the right. After shifting the centroids, some of the data points will be re-assigned. For example, when the blue centroid shifts to the right, some of the points that were assigned to the green centroid will now be closer to the blue centroid. Thus, they’ll be re-assigned to blue. We repeat this process until the centroids stop shifting or the data points stop changing clusters – meaning the k-means algorithm has completed.
- This image shows the final iteration of the k-means algorithm, effectively clustering our data into 3 clusters. You can see how the data is evenly divided with each point assigned to its respective cluster.
- Let’s see one more example. This time, we’ll use 6 clusters. In this image, it’s easy to see the randomness of the initial cluster placements. The groups are nowhere near equal. Let’s see what the final iteration of the k-means algorithm looks like with 6 clusters.
- You can see how the groups are now evenly divided, with 6 clusters displayed with their respective assigned data points.
- Text can be clustered too! First convert it to a bit string, using a bag-of-words / term-document matrix. This is the key part of natural language processing. Reduce text to an array of 1’s and 0’s by term (1 if the term exists in the document, 0 if not). Remove sparse terms (words not appearing in many documents) to reduce dimensionality and compress the data. Removing sparse terms reduced memory usage in the example data from 2GB to 91MB.
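- Sparse-term removal amounts to dropping any term whose document frequency falls below a threshold. A minimal JavaScript sketch (function name and threshold are illustrative; in R this is what `removeSparseTerms` from the tm package does):

```javascript
// Keep only terms that appear in at least `minDocs` documents.
// docs: array of token arrays, one per document.
function removeSparseTerms(docs, minDocs) {
  // Count document frequency per term (once per document via Set).
  const df = {};
  for (const tokens of docs) {
    for (const term of new Set(tokens)) {
      df[term] = (df[term] || 0) + 1;
    }
  }
  const keep = new Set(Object.keys(df).filter(t => df[t] >= minDocs));
  return docs.map(tokens => tokens.filter(t => keep.has(t)));
}

const docs = [
  ['cats', 'like', 'chase', 'mice'],
  ['dogs', 'like', 'eat', 'big', 'bones'],
];
// Only 'like' appears in both documents.
const dense = removeSparseTerms(docs, 2);
```

On the tiny two-document corpus this is extreme (only one term survives); on thousands of headlines it is what shrinks the matrix from gigabytes to megabytes.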
- Natural Language Processing The most basic form of natural language processing is to simply convert text into a numerical representation. This gives you an array of numbers. So, each document becomes a same-sized array of numbers. With this, you can apply machine learning algorithms, such as clustering and classification. This allows you to build unique insights into a set of documents, determining characteristics like category, popularity, sentiment, and relationships. This is the same type of processing that many popular online machine learning APIs use to classify data. For example, IBM Watson, Microsoft, Amazon, and Google, all include NLP APIs for working with data.
- Bag of Words Model Let’s take a look at a quick example. Here are two documents: “Cats like to chase mice.” and “Dogs like to eat big bones”. We’re going to try to categorize these documents as being about “eating”. To do this, we’ll build a bag-of-words model and then apply a classification algorithm. Now, the first thing to note is that the two documents are of different lengths. If you think about it, most documents will practically always be of different lengths. This is fine, because after we digitize the corpus, you’ll see that the resulting data fits neatly within same-sized vectors.
- Create a Dictionary So, the first step is to create a dictionary from our corpus. First, we remove stop words – this drops the word “to”. (A full pipeline would also apply a stemming algorithm to reduce words to their roots.) Next, we find each unique term and add it to our dictionary. You can see the resulting list on the right side of this slide. Our dictionary contains 8 terms.
- Digitize Text With our dictionary created, we can now digitize the documents. Since our dictionary has 8 terms, each document will be encoded into a vector of length 8. This ensures that all documents end up having the same length. This makes it easier to process with machine learning algorithms. Let’s look at the first document. We’ll take the first term in the dictionary and see if it exists in the first document. The term is “cats”, which does indeed exist in the first document. Therefore, we’ll set a 1 as the first bit. The next term is “like”. Again, it exists in the first document, so we’ll set a 1 as the next bit. This repeats until we see the term “dogs”. This does not exist in the first document, so we set a “0”. Finally, we run through all terms in the dictionary and end up with a vector of length 8 for the first document. We repeat the same steps for the second document, going through each term in the dictionary and checking if it exists in the document.
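- The dictionary-building and digitizing steps above can be sketched in JavaScript. The stop-word list is hard-coded to just “to” for this example; everything else follows the slides:

```javascript
// Bag-of-words sketch for the two example documents. The stop-word
// list is trimmed to this example; a real pipeline uses a full list.
const stopWords = new Set(['to']);

function tokenize(text) {
  return text.toLowerCase().replace(/[^a-z\s]/g, '').split(/\s+/)
    .filter(w => w && !stopWords.has(w));
}

// Build the dictionary: each unique term gets an index.
function buildDictionary(corpus) {
  const dict = [];
  for (const doc of corpus) {
    for (const term of tokenize(doc)) {
      if (!dict.includes(term)) dict.push(term);
    }
  }
  return dict;
}

// Encode a document as a 0/1 vector of the dictionary's length.
function digitize(doc, dict) {
  const terms = new Set(tokenize(doc));
  return dict.map(term => terms.has(term) ? 1 : 0);
}

const corpus = ['Cats like to chase mice.', 'Dogs like to eat big bones.'];
const dict = buildDictionary(corpus);
const vectors = corpus.map(d => digitize(d, dict));
// dict    -> ['cats','like','chase','mice','dogs','eat','big','bones']
// vectors -> [[1,1,1,1,0,0,0,0], [0,1,0,0,1,1,1,1]]
```

Note that both documents come out as length-8 vectors regardless of their original lengths, which is exactly what makes them usable as machine learning inputs.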
- Classify Documents (Eating) Once the data is digitized, we can classify the documents with regard to “eating”. Since the first document is about chasing mice, maybe playing, we’ll assign a 0. It doesn’t really have to do with eating. The second document is clearly about eating. So, we’ll assign it a 1. At this point, we can train the data with logistic regression, a neural network, a support vector machine, etc.
- Predict on New Data Once our model has finished training, we can try predicting on new data to see if it’s classified correctly. Here you can see we have a new document, “Bats eat bugs.”. This document has not yet been seen by our machine learning algorithm. We want to try and categorize it as being about “eating” or not. We’ll first digitize the document, just like we did with our training corpus. In this case, we only have 1 term found in the dictionary.
- Predict on New Data The machine learning algorithm is probably going to find a relationship with this particular bit, highlighted in red above. This bit corresponds to the term “eat”, and is found in the training document that was classified as 1 for the category “eating”. Based on this similarity, our model is probably going to predict our new document as … ?
- Predict on New Data So this is the general idea behind natural language processing. Now, we didn’t have to classify just on “eating”. We could have just as easily classified based upon sentiment. In fact, this is a common method for performing sentiment analysis with machine learning. (Another non-machine learning method for sentiment analysis is using the AFINN word-list approach). This was a very basic example of natural language processing. In a real-world case, you could have tens of thousands of documents, with perhaps, multiple classifications. There are also various ways to encode the corpus, such as the count of the term within the sentence, tf*idf, and more.
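- The talk trains a real classifier (logistic regression, a neural network, an SVM). As a minimal stand-in that still shows why the “eat” bit drives the prediction, this JavaScript sketch labels a new vector by its dot-product overlap with the most similar training vector; all names here are illustrative:

```javascript
// Nearest-neighbor-by-overlap sketch, standing in for a trained
// classifier. trainX: 0/1 vectors; trainY: labels for "eating".
function dot(a, b) {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

function predictNearest(trainX, trainY, x) {
  let best = 0;
  for (let i = 1; i < trainX.length; i++) {
    if (dot(trainX[i], x) > dot(trainX[best], x)) best = i;
  }
  return trainY[best];
}

// Training vectors from the slides, labeled 0/1 for "eating".
const trainX = [
  [1, 1, 1, 1, 0, 0, 0, 0], // Cats like to chase mice.
  [0, 1, 0, 0, 1, 1, 1, 1], // Dogs like to eat big bones.
];
const trainY = [0, 1];

// "Bats eat bugs." shares only the 'eat' bit with the second document.
const label = predictNearest(trainX, trainY, [0, 0, 0, 0, 0, 1, 0, 0]);
```

The single shared “eat” bit gives the new vector more overlap with the “eating” document than with the other, so the prediction comes out 1, matching the slide.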
- Which words should we include in our dictionary? I.e., how should we tokenize text? Take every word? “and”, “or”, “boy”, “dog”, etc.? No – we remove stop words and use the Porter stemmer to reduce longer words to their roots. Then we tokenize by either individual words (unigrams) or word pairs (bigrams). While bigrams give more unique clusters, one downside is that they match fewer documents in each one. This is because finding documents that contain the same pairs of words is less likely than finding documents with the same single words. You can go further with N-grams, but this reduces the number of items in clusters even further (although they will be more unique). The extreme case of N-grams will assign each headline to its own cluster.
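- A bigram tokenizer is just a sliding window of adjacent word pairs. A quick JavaScript sketch, using the George Bush / George Clooney example from the slides (headline wording beyond the names is made up for illustration):

```javascript
// Bigram tokenizer sketch: pairs of adjacent words.
function bigrams(words) {
  const pairs = [];
  for (let i = 0; i < words.length - 1; i++) {
    pairs.push(words[i] + ' ' + words[i + 1]);
  }
  return pairs;
}

// 'george bush' and 'george clooney' stay distinct as bigrams,
// while the unigram 'george' alone would conflate the two headlines.
const a = bigrams(['george', 'bush', 'visits', 'texas']);
const b = bigrams(['george', 'clooney', 'visits', 'texas']);
// a -> ['george bush', 'bush visits', 'visits texas']
// b -> ['george clooney', 'clooney visits', 'visits texas']
```

The sketch also shows the downside mentioned above: the two headlines now share only one term (“visits texas”) instead of three unigrams, so bigram clusters match fewer documents.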
- What can we do with news data? Read the news database and extract headlines. Use bigrams. Strip sparse terms. Apply k-means clustering. Get the highest-count terms in each cluster -> trending topics!
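- The last step – turning a cluster into a trending-topic label – is just a term-frequency count over the headlines assigned to that cluster. A small JavaScript sketch (the function name and sample headlines are illustrative, not from the talk’s data):

```javascript
// Name a cluster by its most frequent terms.
// headlines: array of headline strings already assigned to one
// cluster by k-means; n: how many top terms to return.
function topTerms(headlines, n) {
  const counts = {};
  for (const h of headlines) {
    for (const term of h.toLowerCase().split(/\s+/)) {
      counts[term] = (counts[term] || 0) + 1;
    }
  }
  return Object.keys(counts)
    .sort((x, y) => counts[y] - counts[x])
    .slice(0, n);
}

// Hypothetical cluster of headlines about one story.
const cluster = [
  'ebola outbreak spreads',
  'new ebola case confirmed',
  'ebola response criticized',
];
const label = topTerms(cluster, 1);
// label -> ['ebola']
```

In a real pipeline you would run this on stemmed bigrams rather than raw words, but the idea is the same: the dominant terms of a cluster become its topic name.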
- Examples of the results. Each word cloud corresponds to a set of news stories. If you assigned each cluster a trending topic name (by term popularity), you could for example, display a dropdown of trending topics. Selecting a result could take the user to a result page of news stories that correspond to that topic.