2017 CodeFest
$how Me the Money
Kory Becker
October 2017, http://primaryobjects.com
Unsupervised Learning
Exploratory data analysis
Discovers patterns in unlabeled data
No training set
No error rate for potential solution
K-means Clustering, Markov Chains,
Feature Extraction, Principal Component
Analysis (Dimensionality Reduction)
K-Means Clustering
Popular clustering algorithm
Groups data into k clusters
Data points belong to the cluster with the closest mean
Each cluster has a centroid (center)
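A minimal sketch of k-means in practice, assuming scikit-learn (the slides don’t name a library, and the data here is random, purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.random.rand(300, 2)           # 300 random 2-D data points

# Group the data into k = 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10).fit(points)

print(kmeans.cluster_centers_)            # each cluster's centroid (center)
print(kmeans.labels_[:10])                # closest-cluster assignment per point
```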
Clustering Example 1: the raw, unclustered data points
Clustering Example 1: three random centroids, with each point assigned to its closest centroid
Clustering Example 1: the final iteration, with the data divided into 3 clusters
Clustering Example 2: six randomly placed centroids
Clustering Example 2: the final iteration, with the data divided into 6 clusters
What About Text?
Natural language processing
Term document matrix
Digitize text into an array of 0’s and 1’s by term
Remove sparse terms (infrequently occurring terms)
Reduced dimensionality
Compressed data
Speed
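A hedged sketch of the idea above, assuming scikit-learn’s CountVectorizer (the talk doesn’t specify tooling; min_df is assumed here as the sparse-term cutoff, and the three documents are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Cats like to chase mice.",
        "Dogs like to eat big bones.",
        "Cats like mice."]

# binary=True digitizes each document into 0's and 1's by term;
# min_df=2 removes sparse terms appearing in fewer than 2 documents;
# stop_words="english" drops words like "to".
vectorizer = CountVectorizer(binary=True, min_df=2, stop_words="english")
matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the surviving (non-sparse) terms
print(matrix.toarray())                    # one fixed-length vector per document
```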
Natural Language
Processing
Convert text into a numerical representation
Find commonalities within data
Clustering
Make predictions from data
Classification
Category, Popularity, Sentiment,
Relationships
Bag of Words Model
Corpus
Cats like to chase mice.
Dogs like to eat big bones.
Create a Dictionary
Dictionary:
0 - cats
1 - like
2 - chase
3 - mice
4 - dogs
5 - eat
6 - big
7 - bones
Corpus:
Cats like to chase mice.
Dogs like to eat big bones.
Digitize Text
Corpus:
Cats like to chase mice.  ->  1 1 1 1 0 0 0 0
Dogs like to eat big bones.  ->  0 1 0 0 1 1 1 1
Vector Length = 8
Dictionary:
0 - cats
1 - like
2 - chase
3 - mice
4 - dogs
5 - eat
6 - big
7 - bones
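A from-scratch sketch of this digitization step in plain Python, reproducing the slide’s two vectors (“to” is treated as a stop word, as on the previous slide):

```python
dictionary = ["cats", "like", "chase", "mice", "dogs", "eat", "big", "bones"]

def digitize(sentence):
    words = sentence.lower().rstrip(".").split()
    # 1 if the dictionary term appears in the document, 0 if not.
    return [1 if term in words else 0 for term in dictionary]

print(digitize("Cats like to chase mice."))     # [1, 1, 1, 1, 0, 0, 0, 0]
print(digitize("Dogs like to eat big bones."))  # [0, 1, 0, 0, 1, 1, 1, 1]
```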
Unigrams vs Bigrams
Unigrams
George
Bush
Clooney
Bigrams
George Bush
George Clooney
N-grams?
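A quick sketch of the difference, again assuming scikit-learn’s CountVectorizer (ngram_range controls the tokenization; the two headlines are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

headlines = ["george bush speaks", "george clooney speaks"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(headlines)
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(headlines)

print(unigrams.get_feature_names_out())
# ['bush' 'clooney' 'george' 'speaks'] -- "george" alone is ambiguous
print(bigrams.get_feature_names_out())
# ['bush speaks' 'clooney speaks' 'george bush' 'george clooney']
```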
ML + News + ??? = Profit!
Extract news stories
Build corpus of headlines
Use bigrams (word pairs)
Strip sparse terms
Apply k-means clustering
... and what do we get?
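A minimal end-to-end sketch of the steps above, assuming scikit-learn and a small, made-up set of headlines (the talk’s actual news database, cluster count, and sparse-term cutoff are not specified):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

headlines = [
    "stocks rally as markets close higher",
    "markets close higher after stocks rally",
    "senate passes budget bill",
    "budget bill passes senate vote",
]

# Bigram term-document matrix; min_df=2 strips sparse terms.
vectorizer = CountVectorizer(ngram_range=(2, 2), min_df=2,
                             binary=True, stop_words="english")
matrix = vectorizer.fit_transform(headlines)

kmeans = KMeans(n_clusters=2, n_init=10).fit(matrix)

# Highest-count terms in each cluster -> candidate trending topics.
terms = vectorizer.get_feature_names_out()
for i in range(kmeans.n_clusters):
    counts = matrix[kmeans.labels_ == i].sum(axis=0).A1
    top = counts.argsort()[::-1][:3]
    print("Cluster", i, [terms[t] for t in top])
```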
Visualizing Clusters (four slides: a word cloud of the top terms in each news-story cluster)
Additional Reading
Discovering Trending Topics in News
http://primaryobjects.com/CMS/Article162
Mirroring Your Twitter Personality with Intelligence
http://primaryobjects.com/CMS/Article160
TF*IDF with .NET
http://primaryobjects.com/CMS/Article157
Thank you!
Kory Becker
http://primaryobjects.com
@primaryobjects

2017 CodeFest Wrap-up Presentation

Editor's Notes

  • #3 Unsupervised learning is a type of exploratory data analysis. Unlike supervised learning, it doesn’t require labeled outputs, training data, cross-validation, or test sets. Give it a bunch of data and the algorithm will make sense of it, discovering patterns. Unsupervised learning is also a key idea behind deep learning (layers of unsupervised neural networks learn to recognize abstract patterns and feed into a supervised layer for fine-tuning).
  • #4 One of the most common algorithms used for unsupervised learning is k-means clustering. It works by grouping data into a specified number of groups, also called “clusters”. Each data point within the data set belongs to the closest cluster, and each cluster has a centroid (i.e., the center of the cluster). K-means is a simple yet powerful algorithm for automatically clustering and grouping data; in fact, it can often be used as a first go-to algorithm for any data exploration project. Let’s take a look at how this algorithm works.
  • #5 Now that we have an idea of how the algorithm works, let’s see an example! In the above picture we have a series of data points scattered within the plot. The data seems to have some kind of pattern, but generally, the points are mostly random within it. Suppose we want to divide this data into 6 groups (or clusters). You can probably get a visual idea of where those boundaries would be, effectively dividing the data into 6 parts at each spoke. However, what if we want to cluster into 3 groups? What would that look like? Let’s run through the k-means algorithm and cluster this data into 3 groups. We’ll start by initializing 3 random centroids within our data.
  • #6 We’ve added 3 random centroids to the data. They actually appear pretty well spaced apart, but they are indeed randomly placed. Each point has been assigned to its closest centroid, coloring the area in its respective centroid’s color. For example, consider the blue area. Do you see that point to the far top-right, sitting right on the line of blue and green? You might think that point is closer to green, but it is indeed closer to the blue centroid. The same goes for all other points within their assigned cluster. With the data points assigned to a cluster, the next step is to move each centroid to the center of its assigned points. For example, the blue centroid is going to shift slightly up and to the right, so that it sits squarely within the center of the blue area. Likewise, the green centroid will shift slightly to the right and down, and the red centroid will shift slightly to the right. After shifting the centroids, some of the data points will be re-assigned. For example, when the blue centroid shifts to the right, some of the points that were assigned to the green centroid will now be closer to the blue centroid, so they’ll be re-assigned to blue. We repeat this process until the centroids stop shifting or the data points stop changing clusters, meaning the k-means algorithm has completed. (A minimal code sketch of this loop appears at the end of these notes.)
  • #7 This image shows the final iteration of the k-means algorithm, effectively clustering our data into 3 clusters. You can see how the data is evenly divided with each point assigned to its respective cluster.
  • #8 Let’s see one more example. This time, we’ll use 6 clusters. In this image, it’s easy to see the randomness of the initial cluster placements. The groups are nowhere near equal. Let’s see what the final iteration of the k-means algorithm looks like with 6 clusters.
  • #9 You can see how the groups are now evenly divided, with 6 clusters displayed with their respective assigned data points.
  • #10 Text can be clustered too! First convert it to a bit string, using a bag-of-words / term-document matrix. This is the key part of natural language processing. Reduce text into an array of 1’s and 0’s by term (1 if the dictionary term appears in the document, 0 if not). Remove sparse terms (words not appearing in many documents) to reduce dimensionality and compress data. Removing sparse terms reduced memory usage in the example data from 2GB to 91MB.
  • #11 Natural Language Processing The most basic form of natural language processing is to simply convert text into a numerical representation, so each document becomes a same-sized array of numbers. With this, you can apply machine learning algorithms, such as clustering and classification. This allows you to build unique insights into a set of documents, determining characteristics like category, popularity, sentiment, and relationships. This is the same type of processing that many popular online machine learning APIs use to classify data. For example, IBM Watson, Microsoft, Amazon, and Google all include NLP APIs for working with data.
  • #12 Bag of Words Model Let’s take a look at a quick example. Here are two documents: “Cats like to chase mice.” and “Dogs like to eat big bones.” We’re going to try to categorize these documents as being about “eating”. To do this, we’ll build a bag-of-words model and then apply a classification algorithm (a small classification sketch appears at the end of these notes). Now, the first thing to note is that the two documents are of different lengths. If you think about it, most documents will practically always be of different lengths. This is fine, because after we digitize the corpus, you’ll see that the resulting data fits neatly within same-sized vectors.
  • #13 Create a Dictionary So, the first step is to create a dictionary from our corpus. First, we preprocess the corpus, removing the stop word “to” and stemming the remaining terms. Next, we find each unique term and add it to our dictionary. You can see the resulting list on the right side of this slide. Our dictionary contains 8 terms.
  • #14 Digitize Text With our dictionary created, we can now digitize the documents. Since our dictionary has 8 terms, each document will be encoded into a vector of length 8. This ensures that all documents end up having the same length. This makes it easier to process with machine learning algorithms. Let’s look at the first document. We’ll take the first term in the dictionary and see if it exists in the first document. The term is “cats”, which does indeed exist in the first document. Therefore, we’ll set a 1 as the first bit. The next term is “like”. Again, it exists in the first document, so we’ll set a 1 as the next bit. This repeats until we see the term “dogs”. This does not exist in the first document, so we set a “0”. Finally, we run through all terms in the dictionary and end up with a vector of length 8 for the first document. We repeat the same steps for the second document, going through each term in the dictionary and checking if it exists in the document.
  • #15 Which words should we include in our dictionary? I.e., how should we tokenize text? Take every word? “and”, “or”, “boy”, “dog”, etc.? No; we remove stop words and apply the Porter stemmer to reduce longer words to their stems. Then we tokenize by either individual words (unigrams) or word pairs (bigrams). While bigrams give more unique clusters, one downside is that they match fewer documents in each one, because finding documents that contain the same pairs of words is less likely than finding documents with the same single words. You can go further with n-grams, but this reduces the number of items in clusters even further (although they will be more unique). In the extreme case, n-grams will assign each headline to its own cluster.
  • #16 What can we do with news data? Read the news database and extract headlines. Use bigrams. Strip sparse terms. Apply K-means clustering. Get highest count terms in each cluster -> trending topics!  
  • #17-#20 Examples of the results. Each word cloud corresponds to a set of news stories. If you assigned each cluster a trending-topic name (by term popularity), you could, for example, display a dropdown of trending topics. Selecting a result could take the user to a result page of news stories that correspond to that topic.
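The centroid-update loop described in note #6, as a minimal from-scratch sketch (numpy assumed; the data is random and purely illustrative, and the code assumes no cluster ever ends up empty):

```python
import numpy as np

def kmeans(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    while True:
        # Assign every point to its closest centroid.
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Shift each centroid to the center of its assigned points.
        new = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        # Stop once the centroids stop shifting.
        if np.allclose(new, centroids):
            return labels, centroids
        centroids = new

points = np.random.rand(200, 2)   # 200 random 2-D points
labels, centroids = kmeans(points, k=3)
```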
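And the classification step mentioned in note #12, as a hedged sketch: the “eating” labels and the naive Bayes classifier are assumptions for illustration; the talk doesn’t name a specific algorithm:

```python
from sklearn.naive_bayes import MultinomialNB

# Bag-of-words vectors from the slides (terms: cats, like, chase, mice,
# dogs, eat, big, bones).
X = [[1, 1, 1, 1, 0, 0, 0, 0],    # "Cats like to chase mice."
     [0, 1, 0, 0, 1, 1, 1, 1]]    # "Dogs like to eat big bones."
y = ["not eating", "eating"]      # assumed labels for this example

model = MultinomialNB().fit(X, y)

# A new document containing "dogs", "like", and "eat" should most
# likely be classified as "eating".
print(model.predict([[0, 1, 0, 0, 1, 1, 0, 0]]))
```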