SlideShare a Scribd company logo
Natural Language Processing
From Scratch
Noemi Derzsy
§ Datasets: 32089
§ Some of the data sets:
- Mars Rover sound data
- Hubble Telescope image collection
- NASA patents
- Picture of the Day of Earth
- etc.
http://data.nasa.gov/data.json
Open NASA Metadata
Metadata information:
• id
• type
• accessLevel
• accrualPeriodicity
• bureauCode
• contactPoint
• title
• description
• distribution
• identifier
• Issued
• keyword
• language
• modified
• programCode
• theme
• license
• location (HTML link)
• etc.
Format:json
§ Open NASA platform:https://open.nasa.gov
PyGotham 2017 New York City @NoemiDerzsy
Which of these features is “best” to tie together the data?
How do we label groupings in a meaningful manner?
How many groups/how to arrange/visualize them?
Are Descriptions, Keywords representative of the content?
PyGotham 2017 New York City @NoemiDerzsy
Natural Language Processing (NLP)
Python Libraries
• NLTK
• TextBlob
• spaCy
• gensim
• Stanford CoreNLP
Jupyter Notebooks:
https://github.com/nderzsy/NASADatanauts
PyGotham 2017 New York City @NoemiDerzsy
Word Clouds in Description
Word clouds of the over
32,000 open NASA
datasets and their
descriptions
PyGotham 2017 New York City @NoemiDerzsy
Word clouds of the over
32,000 open NASA
datasets and their titles
Word Clouds in Title
PyGotham 2017 New York City @NoemiDerzsy
Word Clouds in Keywords
Word clouds of the over
32,000 open NASA datasets
and their keywords
PyGotham 2017 New York City @NoemiDerzsy
How to Obtain Customized Word Clouds?
o get stencil (shape of your choice)
o get text
o https://github.com/amueller/word_cloud
PyGotham 2017 New York City @NoemiDerzsy
Text Preprocessing, Cleaning
• treat “Data” and “data” as identical words: covert all words to lowercase lower()
• remove special characters, codes, numbers: regular expressions
• check for misspelling
• stop words: this, and, for, where, etc.
• “system” vs. “systems”: lemmatize
• “compute”, “computer”, “computation” -> “comput”: stem
• tokenize: break down text to smallest parts (words)
• POS tagging
• dimensionality reduction (PCA)
PyGotham 2017 New York City @NoemiDerzsy
Stemming
• from nltk.stem.api import StemmerI
• from nltk.stem.regexp import RegexpStemmer
• from nltk.stem.lancaster import LancasterStemmer
• from nltk.stem.isri import ISRIStemmer
• from nltk.stem.porter import PorterStemmer
• from nltk.stem.snowball import SnowballStemmer
• from nltk.stem.wordnet import WordNetLemmatizer
• from nltk.stem.rslp import RSLPStemmer
• Lemmatization: similar to stemming, but with stems being valid words
• nltk.WordNetLemmatizer()
Lemmatization
PyGotham 2017 New York City @NoemiDerzsy
Term Frequency
Keywords Description Title
• the number of times a word occurs in text corpus
PyGotham 2017 New York City @NoemiDerzsy
TF-IDF
• term frequency – inverse document frequency
• measures term frequency / document frequency
Top 25 Highest TF-IDF
Words in
Description
Top 25 Highest MeanTF-IDF
Words in
Description
PyGotham 2017 New York City @NoemiDerzsy
Description TF-IDF and Keywords
Top 10 keywords:
1. EARTH SCIENCE: 14387
2. OCEANS: 10034
3. PROJECT: 7464
4. OCEAN OPTICS: 7325
5. ATMOSPHERE: 7324
6. OCEAN COLOR: 7271
7. COMPLETED: 6453
8. ATMOSPHERIC WATER VAPOR: 3143
9. LAND SURFACE: 2721
10.BIOSPHERE: 2450
Top 25 Keywords
PyGotham 2017 New York City @NoemiDerzsy
Word Co-Occurrence
§ Co-occurrence matrix of top most frequently co-occurring terms
Top 10
Word Co-Occurrence
Matrix
PyGotham 2017 New York City @NoemiDerzsy
Word Co-Occurrence
Top 20
Word Co-Occurrence
Matrix
PyGotham 2017 New York City @NoemiDerzsy
Top 50
Word Co-Occurrence
Matrix
PyGotham 2017 New York City @NoemiDerzsy
seaborn.clustermap
Top 20
Word Co-Occurrence
Matrix
Discovering Structure in Heatmap Data
standardization
PyGotham 2017 New York City @NoemiDerzsy
Are features pulled from text (such as title, description fields)
and/or human supplied-keywords
descriptive of the content?
Topic Modeling…
PyGotham 2017 New York City @NoemiDerzsy
What is Topic Modeling?
An efficient way to make sense of large volume of texts.
Identify topics within text corpus.
Categorize documents into topics.
Associate words with topics.
Who uses it?
Search engines, for marketing purpose, etc.
PyGotham 2017 New York City @NoemiDerzsy
Latent Dirichlet Allocation (LDA)
q several techniques, but LDA is the most common
q Bayesian inference model that associates each document with a probability distribution over topics
q topics are probability distributions over words (probability of the word being generated from that topic for that document)
q clusters words into topics
q clusters documents into mixture of topics
q scales well with growing corpus
q before running LDA algorithm,we have to specify the number of topics: how to choose beforehand the optimal number of topics?
PyGotham 2017 New York City @NoemiDerzsy
Topic Model Evaluation: Topic Coherence
Q: How to select the top topics?
A: Calculate the UMass topic coherence for each topic. Algorithm from Mimno, Wallach, Talley, Leenders, McCallum: Optimizing
Semantic Coherence in Topic Models, CEMNLP 2011.
Coherence = ! score(𝑤),	𝑤,)
).,
pairwise scores on the words used to describe the topic.
s𝑐𝑜𝑟𝑒34566 78,79
=	log
𝐷 𝑤), 𝑤, + 1
𝐷(𝑤))
D(wi)D(wi) as the count of documents containing the word wiwi, D(wi,wj)D(wi,wj) the count of documents containing both
words wiwi and wjwj, and DD the total number or documents in the corpus.
PyGotham 2017 New York City @NoemiDerzsy
openNASA Topics of Highest Coherence
PyGotham 2017 New York City @NoemiDerzsy
Keywords for Topics
• selected keywords with their most frequently occurring terms
PyGotham 2017 New York City @NoemiDerzsy
Other Clustering Method: K-Means
• using TF-IDF, the document vectors are put through a K-Means clustering algorithm which
computes the Euclidean distances amongst these documents and clusters nearby documents
together
• the algorithm generates cluster tags, known as cluster centers which represent the documents
within these clusters
• K-means distance:
• Euclidean
• Cosine
• Fuzzy
• Accuracy comparison:
• silhouette analysis can be used to study the separation distance between the resulting
clusters; can be used to determine the optimal number of clusters (silhouette score)
PyGotham 2017 New York City @NoemiDerzsy
K-Means Clustering
§ Top 10 terms per cluster:
Cluster 0
seawifs
local
collect
km
mission
data
orbview
seastar
broadcast
noon
§ Similar clusters with top words as found using LDA
Cluster 1
data
project
system
use
soil
high
contain
phase
set
gt
Cluster 2
modis
terra
aqua
earth
global
pass
south
north
equator
eos
Cluster 3
band
oct
color
adeos
czcs
nominal
sense
spacecraft
agency
thermal
Cluster 4
product
data
level
version
file
aquarius
set
daily
ml
standard
PyGotham 2017 New York City @NoemiDerzsy
Text Classification
§ Naïve Bayes probabilistic model
• Multinomial model
• data follows multinomial distribution
• each feature value is a count (word co-occurrence, weight, tf-idf, etc.)
• Take into account word importance
• Bernoulli model
• data follows a multivariate Bernoulli distribution
• bag of words (count word occurrence)
• each feature is a binary feature (word in text? True/False)
• ignores word significance
§ KNN (K-nearest neighbor) classifier
§ Decision trees
§ Support Vector Machine (SVM)
PyGotham 2017 New York City @NoemiDerzsy
What are the important connections between
NASA datasets and other important datasets outside of NASA
in the US government?
PyGotham 2017 New York City @NoemiDerzsy
Other Open Government Dataset Collections
http://data.nasa.gov/data.json
http://www.epa.gov/data.json
http://data.gov
http://www.nsf.gov/data.json
http://usda.gov/data.json
http://data.noaa.gov/data.json
http://www.commerce.gov/data.json
http://nist.gov/data.json
http://www.defense.gov/data.json
http://www2.ed.gov/data.json
http://www.dol.gov/data.json
http://www.state.gov/data.json
http://www.dot.gov/data.json
http://www.energy.gov/data.json
http://nrel.gov/data.json
http://healthdata.gov/data.json
http://www.hud.gov/data.json
http://www.doi.gov/data.json
http://www.justice.gov/data.json
http://www.archives.gov/data.json
http://www.nrc.gov/data.json
http://www.nsf.gov/data.json
http://www.opm.gov/data.json
https://www.sba.gov/sites/default/files/data.json
http://www.ssa.gov/data.json
http://www.consumerfinance.gov/data.json
http://www.fhfa.gov/data.json
http://www.imls.gov/data.json
http://data.mcc.gov/raw/index.json
http://www.nitrd.gov/data.json
http://www.ntsb.gov/data.json
http://www.sec.gov/data.json
https://open.whitehouse.gov/data.json
http://treasury.gov/data.json
http://www.usaid.gov/data.json
http://www.gsa.gov/data.json
https://www.economist.com/news/international/21678833-open-data-revolution-has-not-lived-up-expectations-it-only-getting
Open Government Data: Out of the Box
The Economist
PyGotham 2017 New York City @NoemiDerzsy
pyNASA and pyOpenGov Libraries
§ Python library that loads all the open NASA or other government metadata collection at once
pyNASA
https://github.com/bmtgoncalves/pyNASA
pyOpenGov
https://github.com/nderzsy/pyOpenGov
How to install:
>> pip install pyNASA
>> pip install pyOpenGov
PyGotham 2017 New York City @NoemiDerzsy
Takeaways
§ NLP enables understanding of structured and unstructured text
§ Topic modeling useful tool for understanding topics in large text corpus, documents
§ Topic models efficient for evaluating the accuracy of human-supplied descriptions, keywords
§ Tedious preprocessing a must before modeling (stop words, lemmatization, special characters, etc.)
§ Open government data enables citizens in understanding topics, areas of focus
PyGotham 2017 New York City @NoemiDerzsy
Contact
GitHub: https://github.com/nderzsy/NASADatanauts
Twitter: @NoemiDerzsy
Website: http://www.noemiderzsy.com

More Related Content

What's hot

Learning to assess Linked Data relationships using Genetic Programming
Learning to assess Linked Data relationships using Genetic ProgrammingLearning to assess Linked Data relationships using Genetic Programming
Learning to assess Linked Data relationships using Genetic Programming
Vrije Universiteit Amsterdam
 
Talk odysseyiswc2017
Talk odysseyiswc2017Talk odysseyiswc2017
Talk odysseyiswc2017
hala Skaf
 
Introduction to search engine-building with Lucene
Introduction to search engine-building with LuceneIntroduction to search engine-building with Lucene
Introduction to search engine-building with Lucene
Kai Chan
 
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
MOVING Project
 
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
DECK36
 
NLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationNLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic Classification
Eugene Nho
 
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLPDictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
Sujit Pal
 
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
Damiano Spina
 
SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data CloudSchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
Ansgar Scherp
 
ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.
Tatiana Tarasova
 
An Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning TechniquesAn Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning Techniques
Adnan Akhter
 
Chaya Liebeskind - 2015 - Integrating Query Performance Prediction in Term Sc...
Chaya Liebeskind - 2015 - Integrating Query Performance Prediction in Term Sc...Chaya Liebeskind - 2015 - Integrating Query Performance Prediction in Term Sc...
Chaya Liebeskind - 2015 - Integrating Query Performance Prediction in Term Sc...
Association for Computational Linguistics
 
search engine
search enginesearch engine
search engine
Musaib Khan
 
Grosof haley-talk-semtech2013-ver6-10-13
Grosof haley-talk-semtech2013-ver6-10-13Grosof haley-talk-semtech2013-ver6-10-13
Grosof haley-talk-semtech2013-ver6-10-13
Brian Ulicny
 
Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1
Tobias Wunner
 
RDF Stream Processing and the role of Semantics
RDF Stream Processing and the role of SemanticsRDF Stream Processing and the role of Semantics
RDF Stream Processing and the role of Semantics
Jean-Paul Calbimonte
 
Sigir 2011 proceedings
Sigir 2011 proceedingsSigir 2011 proceedings
Sigir 2011 proceedings
chetanagavankar
 
DCU Search Runs at MediaEval 2014 Search and Hyperlinking
DCU Search Runs at MediaEval 2014 Search and HyperlinkingDCU Search Runs at MediaEval 2014 Search and Hyperlinking
DCU Search Runs at MediaEval 2014 Search and Hyperlinking
multimediaeval
 
Time-aware Approaches to Information Retrieval
Time-aware Approaches to Information RetrievalTime-aware Approaches to Information Retrieval
Time-aware Approaches to Information Retrieval
Nattiya Kanhabua
 
More Complete Resultset Retrieval from Large Heterogeneous RDF Sources
More Complete Resultset Retrieval from Large Heterogeneous RDF SourcesMore Complete Resultset Retrieval from Large Heterogeneous RDF Sources
More Complete Resultset Retrieval from Large Heterogeneous RDF Sources
André Valdestilhas
 

What's hot (20)

Learning to assess Linked Data relationships using Genetic Programming
Learning to assess Linked Data relationships using Genetic ProgrammingLearning to assess Linked Data relationships using Genetic Programming
Learning to assess Linked Data relationships using Genetic Programming
 
Talk odysseyiswc2017
Talk odysseyiswc2017Talk odysseyiswc2017
Talk odysseyiswc2017
 
Introduction to search engine-building with Lucene
Introduction to search engine-building with LuceneIntroduction to search engine-building with Lucene
Introduction to search engine-building with Lucene
 
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...
 
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
 
NLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationNLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic Classification
 
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLPDictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
 
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
 
SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data CloudSchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
 
ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.
 
An Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning TechniquesAn Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning Techniques
 
Chaya Liebeskind - 2015 - Integrating Query Performance Prediction in Term Sc...
Chaya Liebeskind - 2015 - Integrating Query Performance Prediction in Term Sc...Chaya Liebeskind - 2015 - Integrating Query Performance Prediction in Term Sc...
Chaya Liebeskind - 2015 - Integrating Query Performance Prediction in Term Sc...
 
search engine
search enginesearch engine
search engine
 
Grosof haley-talk-semtech2013-ver6-10-13
Grosof haley-talk-semtech2013-ver6-10-13Grosof haley-talk-semtech2013-ver6-10-13
Grosof haley-talk-semtech2013-ver6-10-13
 
Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1
 
RDF Stream Processing and the role of Semantics
RDF Stream Processing and the role of SemanticsRDF Stream Processing and the role of Semantics
RDF Stream Processing and the role of Semantics
 
Sigir 2011 proceedings
Sigir 2011 proceedingsSigir 2011 proceedings
Sigir 2011 proceedings
 
DCU Search Runs at MediaEval 2014 Search and Hyperlinking
DCU Search Runs at MediaEval 2014 Search and HyperlinkingDCU Search Runs at MediaEval 2014 Search and Hyperlinking
DCU Search Runs at MediaEval 2014 Search and Hyperlinking
 
Time-aware Approaches to Information Retrieval
Time-aware Approaches to Information RetrievalTime-aware Approaches to Information Retrieval
Time-aware Approaches to Information Retrieval
 
More Complete Resultset Retrieval from Large Heterogeneous RDF Sources
More Complete Resultset Retrieval from Large Heterogeneous RDF SourcesMore Complete Resultset Retrieval from Large Heterogeneous RDF Sources
More Complete Resultset Retrieval from Large Heterogeneous RDF Sources
 

Similar to PyGotham NY 2017: Natural Language Processing from Scratch

Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Noemi Derzsy
 
Data Science Keys to Open Up OpenNASA Datasets
Data Science Keys to Open Up OpenNASA DatasetsData Science Keys to Open Up OpenNASA Datasets
Data Science Keys to Open Up OpenNASA Datasets
PyData
 
NLP & DBpedia
 NLP & DBpedia NLP & DBpedia
NLP & DBpedia
kelbedweihy
 
Text Mining with Node.js - Philipp Burckhardt, Carnegie Mellon University
Text Mining with Node.js - Philipp Burckhardt, Carnegie Mellon UniversityText Mining with Node.js - Philipp Burckhardt, Carnegie Mellon University
Text Mining with Node.js - Philipp Burckhardt, Carnegie Mellon University
NodejsFoundation
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptx
Kalpit Desai
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
Rahul Jain
 
Sharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsSharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data Lessons
George Stathis
 
Frontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisFrontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text Analysis
Jonathan Stray
 
Natural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4jNatural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4j
William Lyon
 
Web and text
Web and textWeb and text
Chapter 10 Data Mining Techniques
 Chapter 10 Data Mining Techniques Chapter 10 Data Mining Techniques
Chapter 10 Data Mining Techniques
Houw Liong The
 
Copy of 10text (2)
Copy of 10text (2)Copy of 10text (2)
Copy of 10text (2)
Uma Se
 
Big Data Palooza Talk: Aspects of Semantic Processing
Big Data Palooza Talk: Aspects of Semantic ProcessingBig Data Palooza Talk: Aspects of Semantic Processing
Big Data Palooza Talk: Aspects of Semantic Processing
Na'im Tyson
 
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
rchbeir
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Lucidworks
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
vincent683379
 
Text mining and Visualizations
Text mining  and VisualizationsText mining  and Visualizations
Text mining and Visualizations
Kasturi SR Narayana Murthy
 
Topic Extraction using Machine Learning
Topic Extraction using Machine LearningTopic Extraction using Machine Learning
Topic Extraction using Machine Learning
Sanjib Basak
 
Natural Language Processing with Graphs
Natural Language Processing with GraphsNatural Language Processing with Graphs
Natural Language Processing with Graphs
Neo4j
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
Ding Li
 

Similar to PyGotham NY 2017: Natural Language Processing from Scratch (20)

Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
 
Data Science Keys to Open Up OpenNASA Datasets
Data Science Keys to Open Up OpenNASA DatasetsData Science Keys to Open Up OpenNASA Datasets
Data Science Keys to Open Up OpenNASA Datasets
 
NLP & DBpedia
 NLP & DBpedia NLP & DBpedia
NLP & DBpedia
 
Text Mining with Node.js - Philipp Burckhardt, Carnegie Mellon University
Text Mining with Node.js - Philipp Burckhardt, Carnegie Mellon UniversityText Mining with Node.js - Philipp Burckhardt, Carnegie Mellon University
Text Mining with Node.js - Philipp Burckhardt, Carnegie Mellon University
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptx
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Sharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsSharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data Lessons
 
Frontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisFrontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text Analysis
 
Natural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4jNatural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4j
 
Web and text
Web and textWeb and text
Web and text
 
Chapter 10 Data Mining Techniques
 Chapter 10 Data Mining Techniques Chapter 10 Data Mining Techniques
Chapter 10 Data Mining Techniques
 
Copy of 10text (2)
Copy of 10text (2)Copy of 10text (2)
Copy of 10text (2)
 
Big Data Palooza Talk: Aspects of Semantic Processing
Big Data Palooza Talk: Aspects of Semantic ProcessingBig Data Palooza Talk: Aspects of Semantic Processing
Big Data Palooza Talk: Aspects of Semantic Processing
 
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
 
Text mining and Visualizations
Text mining  and VisualizationsText mining  and Visualizations
Text mining and Visualizations
 
Topic Extraction using Machine Learning
Topic Extraction using Machine LearningTopic Extraction using Machine Learning
Topic Extraction using Machine Learning
 
Natural Language Processing with Graphs
Natural Language Processing with GraphsNatural Language Processing with Graphs
Natural Language Processing with Graphs
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
 

Recently uploaded

一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 

Recently uploaded (20)

一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 

PyGotham NY 2017: Natural Language Processing from Scratch

  • 1. Natural Language Processing From Scratch Noemi Derzsy
  • 2. § Datasets: 32089 § Some of the data sets: - Mars Rover sound data - Hubble Telescope image collection - NASA patents - Picture of the Day of Earth - etc. http://data.nasa.gov/data.json Open NASA Metadata Metadata information: • id • type • accessLevel • accrualPeriodicity • bureauCode • contactPoint • title • description • distribution • identifier • Issued • keyword • language • modified • programCode • theme • license • location (HTML link) • etc. Format:json § Open NASA platform:https://open.nasa.gov PyGotham 2017 New York City @NoemiDerzsy
  • 3. Which of these features is “best” to tie together the data? How do we label groupings in a meaningful manner? How many groups/how to arrange/visualize them? Are Descriptions, Keywords representative of the content? PyGotham 2017 New York City @NoemiDerzsy
  • 4. Natural Language Processing (NLP) Python Libraries • NLTK • TextBlob • spaCy • gensim • Stanford CoreNLP Jupyter Notebooks: https://github.com/nderzsy/NASADatanauts PyGotham 2017 New York City @NoemiDerzsy
  • 5. Word Clouds in Description Word clouds of the over 32,000 open NASA datasets and their descriptions PyGotham 2017 New York City @NoemiDerzsy
  • 6. Word clouds of the over 32,000 open NASA datasets and their titles Word Clouds in Title PyGotham 2017 New York City @NoemiDerzsy
  • 7. Word Clouds in Keywords Word clouds of the over 32,000 open NASA datasets and their keywords PyGotham 2017 New York City @NoemiDerzsy
  • 8. How to Obtain Customized Word Clouds? o get stencil (shape of your choice) o get text o https://github.com/amueller/word_cloud PyGotham 2017 New York City @NoemiDerzsy
  • 9. Text Preprocessing, Cleaning • treat “Data” and “data” as identical words: covert all words to lowercase lower() • remove special characters, codes, numbers: regular expressions • check for misspelling • stop words: this, and, for, where, etc. • “system” vs. “systems”: lemmatize • “compute”, “computer”, “computation” -> “comput”: stem • tokenize: break down text to smallest parts (words) • POS tagging • dimensionality reduction (PCA) PyGotham 2017 New York City @NoemiDerzsy
  • 10. Stemming • from nltk.stem.api import StemmerI • from nltk.stem.regexp import RegexpStemmer • from nltk.stem.lancaster import LancasterStemmer • from nltk.stem.isri import ISRIStemmer • from nltk.stem.porter import PorterStemmer • from nltk.stem.snowball import SnowballStemmer • from nltk.stem.wordnet import WordNetLemmatizer • from nltk.stem.rslp import RSLPStemmer • Lemmatization: similar to stemming, but with stems being valid words • nltk.WordNetLemmatizer() Lemmatization PyGotham 2017 New York City @NoemiDerzsy
  • 11. Term Frequency Keywords Description Title • the number of times a word occurs in text corpus PyGotham 2017 New York City @NoemiDerzsy
  • 12. TF-IDF • term frequency – inverse document frequency • measures term frequency / document frequency Top 25 Highest TF-IDF Words in Description Top 25 Highest MeanTF-IDF Words in Description PyGotham 2017 New York City @NoemiDerzsy
  • 13. Description TF-IDF and Keywords Top 10 keywords: 1. EARTH SCIENCE: 14387 2. OCEANS: 10034 3. PROJECT: 7464 4. OCEAN OPTICS: 7325 5. ATMOSPHERE: 7324 6. OCEAN COLOR: 7271 7. COMPLETED: 6453 8. ATMOSPHERIC WATER VAPOR: 3143 9. LAND SURFACE: 2721 10.BIOSPHERE: 2450 Top 25 Keywords PyGotham 2017 New York City @NoemiDerzsy
  • 14. Word Co-Occurrence § Co-occurrence matrix of top most frequently co-occurring terms Top 10 Word Co-Occurrence Matrix PyGotham 2017 New York City @NoemiDerzsy
  • 15. Word Co-Occurrence Top 20 Word Co-Occurrence Matrix PyGotham 2017 New York City @NoemiDerzsy
  • 16. Top 50 Word Co-Occurrence Matrix PyGotham 2017 New York City @NoemiDerzsy
  • 17. seaborn.clustermap Top 20 Word Co-Occurrence Matrix Discovering Structure in Heatmap Data standardization PyGotham 2017 New York City @NoemiDerzsy
  • 18. Are features pulled from text (such as title, description fields) and/or human supplied-keywords descriptive of the content? Topic Modeling… PyGotham 2017 New York City @NoemiDerzsy
  • 19. What is Topic Modeling? An efficient way to make sense of large volume of texts. Identify topics within text corpus. Categorize documents into topics. Associate words with topics. Who uses it? Search engines, for marketing purpose, etc. PyGotham 2017 New York City @NoemiDerzsy
  • 20. Latent Dirichlet Allocation (LDA) q several techniques, but LDA is the most common q Bayesian inference model that associates each document with a probability distribution over topics q topics are probability distributions over words (probability of the word being generated from that topic for that document) q clusters words into topics q clusters documents into mixture of topics q scales well with growing corpus q before running LDA algorithm,we have to specify the number of topics: how to choose beforehand the optimal number of topics? PyGotham 2017 New York City @NoemiDerzsy
  • 21. Topic Model Evaluation: Topic Coherence Q: How to select the top topics? A: Calculate the UMass topic coherence for each topic. Algorithm from Mimno, Wallach, Talley, Leenders, McCallum: Optimizing Semantic Coherence in Topic Models, CEMNLP 2011. Coherence = ! score(𝑤), 𝑤,) )., pairwise scores on the words used to describe the topic. s𝑐𝑜𝑟𝑒34566 78,79 = log 𝐷 𝑤), 𝑤, + 1 𝐷(𝑤)) D(wi)D(wi) as the count of documents containing the word wiwi, D(wi,wj)D(wi,wj) the count of documents containing both words wiwi and wjwj, and DD the total number or documents in the corpus. PyGotham 2017 New York City @NoemiDerzsy
  • 22. openNASA Topics of Highest Coherence PyGotham 2017 New York City @NoemiDerzsy
  • 23. Keywords for Topics • selected keywords with their most frequently occurring terms PyGotham 2017 New York City @NoemiDerzsy
  • 24. Other Clustering Method: K-Means • using TF-IDF, the document vectors are put through a K-Means clustering algorithm which computes the Euclidean distances amongst these documents and clusters nearby documents together • the algorithm generates cluster tags, known as cluster centers which represent the documents within these clusters • K-means distance: • Euclidean • Cosine • Fuzzy • Accuracy comparison: • silhouette analysis can be used to study the separation distance between the resulting clusters; can be used to determine the optimal number of clusters (silhouette score) PyGotham 2017 New York City @NoemiDerzsy
  • 25. K-Means Clustering § Top 10 terms per cluster: Cluster 0 seawifs local collect km mission data orbview seastar broadcast noon § Similar clusters with top words as found using LDA Cluster 1 data project system use soil high contain phase set gt Cluster 2 modis terra aqua earth global pass south north equator eos Cluster 3 band oct color adeos czcs nominal sense spacecraft agency thermal Cluster 4 product data level version file aquarius set daily ml standard PyGotham 2017 New York City @NoemiDerzsy
  • 26. Text Classification § Naïve Bayes probabilistic model • Multinomial model • data follows multinomial distribution • each feature value is a count (word co-occurrence, weight, tf-idf, etc.) • Take into account word importance • Bernoulli model • data follows a multivariate Bernoulli distribution • bag of words (count word occurrence) • each feature is a binary feature (word in text? True/False) • ignores word significance § KNN (K-nearest neighbor) classifier § Decision trees § Support Vector Machine (SVM) PyGotham 2017 New York City @NoemiDerzsy
  • 27. What are the important connections between NASA datasets and other important datasets outside of NASA in the US government? PyGotham 2017 New York City @NoemiDerzsy
  • 28. Other Open Government Dataset Collections http://data.nasa.gov/data.json http://www.epa.gov/data.json http://data.gov http://www.nsf.gov/data.json http://usda.gov/data.json http://data.noaa.gov/data.json http://www.commerce.gov/data.json http://nist.gov/data.json http://www.defense.gov/data.json http://www2.ed.gov/data.json http://www.dol.gov/data.json http://www.state.gov/data.json http://www.dot.gov/data.json http://www.energy.gov/data.json http://nrel.gov/data.json http://healthdata.gov/data.json http://www.hud.gov/data.json http://www.doi.gov/data.json http://www.justice.gov/data.json http://www.archives.gov/data.json http://www.nrc.gov/data.json http://www.nsf.gov/data.json http://www.opm.gov/data.json https://www.sba.gov/sites/default/files/data.json http://www.ssa.gov/data.json http://www.consumerfinance.gov/data.json http://www.fhfa.gov/data.json http://www.imls.gov/data.json http://data.mcc.gov/raw/index.json http://www.nitrd.gov/data.json http://www.ntsb.gov/data.json http://www.sec.gov/data.json https://open.whitehouse.gov/data.json http://treasury.gov/data.json http://www.usaid.gov/data.json http://www.gsa.gov/data.json https://www.economist.com/news/international/21678833-open-data-revolution-has-not-lived-up-expectations-it-only-getting Open Government Data: Out of the Box The Economist PyGotham 2017 New York City @NoemiDerzsy
  • 29. pyNASA and pyOpenGov Libraries § Python library that loads all the open NASA or other government metadata collection at once pyNASA https://github.com/bmtgoncalves/pyNASA pyOpenGov https://github.com/nderzsy/pyOpenGov How to install: >> pip install pyNASA >> pip install pyOpenGov PyGotham 2017 New York City @NoemiDerzsy
  • 30. Takeaways § NLP enables understanding of structured and unstructured text § Topic modeling useful tool for understanding topics in large text corpus, documents § Topic models efficient for evaluating the accuracy of human-supplied descriptions, keywords § Tedious preprocessing a must before modeling (stop words, lemmatization, special characters, etc.) § Open government data enables citizens in understanding topics, areas of focus PyGotham 2017 New York City @NoemiDerzsy