2017 CodeFest
$how Me the Money
Kory Becker
October, 2017, http://primaryobjects.com
Unsupervised Learning
Exploratory data analysis
Discovers patterns in unlabeled data
No training set
No error rate for potential solution
K-means Clustering, Markov Chains, Feature Extraction, Principal Component Analysis (Dimensionality Reduction)
K-Means Clustering
Popular clustering algorithm
Groups data into k clusters
Data points belong to the cluster with closest mean
Each cluster has a centroid (center)
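As a rough illustration of the bullets above, here is a minimal k-means sketch in plain Python. It is not the code used in the talk; the function name `kmeans`, its parameters, and the choice of random data points as starting centroids are assumptions made for this example.

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Illustrative k-means over a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points picked at random.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster with the closest centroid
        # (squared Euclidean distance).
        new_assignments = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Made-up data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
```

On this toy data the first three points should end up in one cluster and the last three in the other; which cluster gets index 0 depends on the random start.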
Clustering Example 1
Clustering Example 1
Clustering Example 1
Clustering Example 2
Clustering Example 2
What About Text?
Natural language processing
Term document matrix
Digitize text into an array of 0’s and 1’s by term
Remove sparse terms (non-frequently occurring terms)
Reduced dimensionality
Compressed data
Speed
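Removing sparse terms can be sketched as dropping columns of the term-document matrix that are set in too few documents. The helper name `remove_sparse_terms` and the `min_doc_fraction` threshold are hypothetical, chosen only for this example, and the matrix is made-up data.

```python
def remove_sparse_terms(matrix, terms, min_doc_fraction):
    """Keep only the terms that occur in at least min_doc_fraction of the documents.

    matrix is a list of rows (one per document) of 0/1 values, one column per term.
    """
    n_docs = len(matrix)
    keep = [
        j for j in range(len(terms))
        if sum(row[j] for row in matrix) / n_docs >= min_doc_fraction
    ]
    reduced = [[row[j] for j in keep] for row in matrix]
    return reduced, [terms[j] for j in keep]

# Made-up example: "cats" appears in only 1 of 3 documents, so it is dropped.
terms = ["cats", "like", "dogs"]
matrix = [
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 1],
]
reduced, kept_terms = remove_sparse_terms(matrix, terms, min_doc_fraction=0.5)
# kept_terms is ["like", "dogs"]; each row shrinks from 3 values to 2
```

Fewer columns means shorter vectors, which is where the reduced dimensionality and the speed gain come from.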
Natural Language Processing
Convert text into a numerical representation
Find commonalities within data
Clustering
Make predictions from data
Classification
Category, Popularity, Sentiment, Relationships
Bag of Words Model
Corpus
Cats like to chase mice.
Dogs like to eat big bones.
Create a Dictionary
Corpus
Cats like to chase mice.
Dogs like to eat big bones.
Dictionary
0 - cats
1 - like
2 - chase
3 - mice
4 - dogs
5 - eat
6 - big
7 - bones
Digitize Text
Cats like to chase mice.
1 1 1 1 0 0 0 0
Dogs like to eat big bones.
0 1 0 0 1 1 1 1
Vector Length = 8
Corpus
Dictionary
0 - cats
1 - like
2 - chase
3 - mice
4 - dogs
5 - eat
6 - big
7 - bones
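The dictionary and digitizing steps on these slides can be reproduced with a short sketch. The function names and the one-word stop list are assumptions for illustration; no stemming is applied here, which matches the unstemmed terms shown in the slide's dictionary.

```python
import re

STOP_WORDS = {"to"}  # assumption: a tiny stop-word list, just enough for this corpus

def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def build_dictionary(corpus):
    """Map each unique term to an index, in order of first appearance."""
    dictionary = {}
    for doc in corpus:
        for term in tokenize(doc):
            dictionary.setdefault(term, len(dictionary))
    return dictionary

def digitize(doc, dictionary):
    """Binary vector: 1 if the dictionary term occurs in the document, else 0."""
    present = set(tokenize(doc))
    return [1 if term in present else 0 for term in dictionary]

corpus = ["Cats like to chase mice.", "Dogs like to eat big bones."]
dictionary = build_dictionary(corpus)
vectors = [digitize(doc, dictionary) for doc in corpus]
# vectors[0] is [1, 1, 1, 1, 0, 0, 0, 0]
# vectors[1] is [0, 1, 0, 0, 1, 1, 1, 1]
```

Both documents come out as vectors of length 8 even though the sentences differ in length, which is what lets a clustering or classification algorithm consume them.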
Unigrams vs Bigrams
Unigrams
George
Bush
Clooney
Bigrams
George Bush
George Clooney
N-grams?
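To make the unigram/bigram distinction concrete, here is a small sketch that slides a window of size n over a token list. The helper name `ngrams` and the sample headline are made up for this example.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["george", "clooney", "wins", "award"]  # hypothetical tokenized headline
unigrams = ngrams(tokens, 1)  # ["george", "clooney", "wins", "award"]
bigrams = ngrams(tokens, 2)   # ["george clooney", "clooney wins", "wins award"]
```

With unigrams, a headline about George Bush and one about George Clooney share the term "george"; with bigrams they share nothing, which is why bigrams tend to give more distinct but smaller clusters.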
ML + News + ??? = Profit!
Extract news stories
Build corpus of headlines
Use bigrams (word pairs)
Strip sparse terms
Apply k-means clustering
.. and what do we get?
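One way to turn clusters into topic labels is to count, per cluster, how many headlines contain each term and keep the most frequent ones. The sketch below assumes the headlines have already been tokenized into bigrams and assigned cluster labels; the function name, the tie-breaking rule, and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict

def top_terms_per_cluster(doc_terms, labels, top_n=3):
    """For each cluster label, return the top_n terms by document count."""
    counts = defaultdict(Counter)
    for terms, label in zip(doc_terms, labels):
        counts[label].update(set(terms))  # count each term once per document
    return {
        label: [term for term, _ in
                sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]]
        for label, counter in counts.items()
    }

# Made-up bigram lists for four headlines, with made-up cluster labels.
doc_terms = [
    ["tax reform", "white house"],
    ["tax reform", "senate vote"],
    ["hurricane relief", "puerto rico"],
    ["hurricane relief"],
]
labels = [0, 0, 1, 1]
topics = top_terms_per_cluster(doc_terms, labels, top_n=1)
# topics is {0: ["tax reform"], 1: ["hurricane relief"]}
```

Ties are broken alphabetically here so that the output is deterministic; any other rule would work equally well for a sketch.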
Visualizing Clusters
Visualizing Clusters
Visualizing Clusters
Visualizing Clusters
Additional Reading
Discovering Trending Topics in News
http://primaryobjects.com/CMS/Article162
Mirroring Your Twitter Persona with Intelligence
http://primaryobjects.com/CMS/Article160
TF*IDF with .NET
http://primaryobjects.com/CMS/Article157
Thank you!
Kory Becker
http://primaryobjects.com
@primaryobjects

2017 CodeFest Wrap-up Presentation

Editor's Notes

  1. Unsupervised learning is a type of exploratory data analysis. Unlike supervised learning, it doesn’t require labeled outputs, a training set, cross-validation, or test sets. Give it a bunch of data and the algorithm will look for patterns in it. Unsupervised learning also plays a role in deep learning: layers of unsupervised neural networks can learn to recognize abstract patterns and feed into a supervised layer for fine-tuning.
  2. One of the most common algorithms used for unsupervised learning is “k-means clustering”. This algorithm works by grouping data into a specified number of groups, also called “clusters”. Each cluster has a centroid (i.e., the center of the cluster), and each data point in the data set belongs to the cluster whose centroid is closest. K-means is a simple yet powerful algorithm for automatically clustering and grouping data. In fact, it can often be used as a first go-to algorithm for a data exploration project. Let’s take a look at how this algorithm works.
  3. Now that we have an idea of how the algorithm works, let’s see an example! In the above picture we have a series of data points scattered within the plot. The data seems to follow some kind of pattern, but within that pattern the points are mostly randomly placed. Suppose we want to divide this data into 6 groups (or clusters). You can probably get a visual idea of where those boundaries would be, effectively dividing the data into 6 parts at each spoke. However, what if we want to cluster into 3 groups? What would that look like? Let’s run through the k-means algorithm and cluster this data into 3 groups. We’ll start by initializing 3 random centroids within our data.
  4. We’ve added 3 random centroids to the data. They appear pretty well spaced apart, but they are indeed randomly placed. Each point has been assigned to its closest centroid, and each area is colored in its centroid’s color. For example, consider the blue area. Do you see that point to the far top-right, sitting right on the line between blue and green? You might think that point is closer to green, but it is indeed closer to the blue centroid. The same goes for all other points within their assigned cluster. With the data points assigned to a cluster, the next step is to move each centroid to the center of its assigned points. So for example, the blue centroid is going to shift slightly up and to the right, so that it sits squarely within the center of the blue area. Likewise, the green centroid will shift slightly to the right and down. The red centroid will shift slightly to the right. After shifting the centroids, some of the data points will be re-assigned. For example, when the blue centroid shifts to the right, some of the points that were assigned to the green centroid will now be closer to the blue centroid. Thus, they’ll be re-assigned to blue. We repeat this process until the centroids stop shifting or the data points stop changing clusters, at which point the k-means algorithm has completed.
  5. This image shows the final iteration of the k-means algorithm, effectively clustering our data into 3 clusters. You can see how the data is evenly divided with each point assigned to its respective cluster.
  6. Let’s see one more example. This time, we’ll use 6 clusters. In this image, it’s easy to see the randomness of the initial centroid placements. The groups are nowhere near equal. Let’s see what the final iteration of the k-means algorithm looks like with 6 clusters.
  7. You can see how the groups are now evenly divided, with 6 clusters displayed with their respective assigned data points.
  8. Text can be clustered too! First convert it to a bit string, using a bag-of-words / term-document matrix. This is a key part of natural language processing. Reduce each document to an array of 1’s and 0’s, one per dictionary term (1 if the term occurs in the document, 0 if it does not). Remove sparse terms (words not appearing in many documents) to reduce dimensionality and compress data. Removing sparse terms reduced memory usage in the example data from 2GB to 91MB.
  9. Natural Language Processing: The most basic form of natural language processing is to simply convert text into a numerical representation. This gives you an array of numbers, so each document becomes a same-sized array of numbers. With this, you can apply machine learning algorithms, such as clustering and classification. This allows you to build insights into a set of documents, determining characteristics like category, popularity, sentiment, and relationships. This is the same type of processing that many popular online machine learning APIs use to classify data. For example, IBM Watson, Microsoft, Amazon, and Google all include NLP APIs for working with data.
  10. Bag of Words Model: Let’s take a look at a quick example. Here are two documents: “Cats like to chase mice.” and “Dogs like to eat big bones.” We’re going to try to categorize these documents as being about “eating”. To do this, we’ll build a bag-of-words model and then apply a classification algorithm. Now, the first thing to note is that the two documents are of different lengths. If you think about it, most documents will practically always be of different lengths. This is fine, because after we digitize the corpus, you’ll see that the resulting data fits neatly within same-sized vectors.
  11. Create a Dictionary: The first step is to create a dictionary from our corpus. First, we clean up the corpus by removing stop words and, optionally, applying a stemming algorithm. Stop-word removal drops the word “to”. Next, we find each unique term and add it to our dictionary. You can see the resulting list on the right side of this slide. Our dictionary contains 8 terms.
  12. Digitize Text: With our dictionary created, we can now digitize the documents. Since our dictionary has 8 terms, each document will be encoded into a vector of length 8. This ensures that all documents end up having the same length, which makes them easier to process with machine learning algorithms. Let’s look at the first document. We’ll take the first term in the dictionary and see if it exists in the first document. The term is “cats”, which does indeed exist in the first document. Therefore, we’ll set a 1 as the first bit. The next term is “like”. Again, it exists in the first document, so we’ll set a 1 as the next bit. This repeats until we see the term “dogs”. This does not exist in the first document, so we set a 0. Finally, we run through all terms in the dictionary and end up with a vector of length 8 for the first document. We repeat the same steps for the second document, going through each term in the dictionary and checking if it exists in the document.
  13. Which words should we include in our dictionary? That is, how should we tokenize text? Take every word, such as “and”, “or”, “boy”, “dog”? No. We remove stop words and use the Porter stemmer to reduce longer words to their stems. Then we tokenize by either individual words (unigrams) or word pairs (bigrams). While bigrams give more distinct clusters, one downside is that each cluster matches fewer documents. This is because finding documents that contain the same pair of words is less likely than finding documents that contain the same single word. You can go further with N-grams, but this reduces the number of items in each cluster even further (although the clusters will be more distinct). In the extreme case, N-grams will assign each headline to its own cluster.
  14. What can we do with news data? Read the news database and extract headlines. Use bigrams. Strip sparse terms. Apply K-means clustering. Get highest count terms in each cluster -> trending topics!  
  15-18. Examples of the results. Each word cloud corresponds to a set of news stories. If you assigned each cluster a trending topic name (by term popularity), you could, for example, display a dropdown of trending topics. Selecting a topic could take the user to a results page of news stories that correspond to that topic.