SlideShare a Scribd company logo
PyData New York 2017
Noemi Derzsy
Data Science Keys to Open Up
OpenNASA Datasets
@NoemiDerzsy
PyData New York 2017
Open NASA Platform
• Data at NASA: 32 089
• Data at other government agencies: ~185 000
• NASA code repositories: 356
• NASA APIs: 51
Continuously growing…
https://open.nasa.gov/explore/datanauts/
@NoemiDerzsy
PyData New York 2017 @NoemiDerzsy
PyData New York 2017
§ Datasets: 32089
§ Some of the data sets:
- Mars Rover sound data
- Hubble Telescope image collection
- NASA patents
- Picture of the Day of Earth
- etc.
http://data.nasa.gov/data.json
Open NASA Metadata
Metadata information:
• id
• type
• accessLevel
• accrualPeriodicity
• bureauCode
• contactPoint
• title
• description
• distribution
• identifier
• Issued
• keyword
• language
• modified
• programCode
• theme
• license
• location (HTML link)
• Etc.
Format:JSON
@NoemiDerzsy
PyData New York 2017
Which of these features is “best” to tie together the data?
How do we label groupings in a meaningful manner?
How many groups/how to arrange/visualize them?
Are Descriptions, Keywords representative of the content?
@NoemiDerzsy
PyData New York 2017
Natural Language Processing (NLP)
Python Libraries
• NLTK
• TextBlob
• spaCy
• gensim
• Stanford CoreNLP
Jupyter Notebooks:
https://github.com/nderzsy/NASADatanauts
@NoemiDerzsy
PyData New York 2017
Word Clouds in Description
Word clouds of the over
32,000 open NASA
datasets and their
descriptions
@NoemiDerzsy
PyData New York 2017
Word clouds of the over
32,000 open NASA
datasets and their titles
Word Clouds in Title
@NoemiDerzsy
PyData New York 2017
Word Clouds in Keywords
Word clouds of the over
32,000 open NASA datasets
and their keywords
@NoemiDerzsy
PyData New York 2017
How to Obtain Customized Word Clouds?
o get stencil (shape of your choice)
o get text
o https://github.com/amueller/word_cloud
@NoemiDerzsy
PyData New York 2017
Text Preprocessing, Cleaning
• treat “Data” and “data” as identical words: covert all words to lowercase lower()
• remove special characters, codes, numbers: regular expressions
• check for misspelling
• stop words: this, and, for, where, etc.
• “system” vs. “systems”: lemmatize
• “compute”, “computer”, “computation” -> “comput”: stem
• tokenize: break down text to smallest parts (words)
• POS tagging
• dimensionality reduction (PCA)
@NoemiDerzsy
PyData New York 2017
Stemming
• from nltk.stem.api import StemmerI
• from nltk.stem.regexp import RegexpStemmer
• from nltk.stem.lancaster import LancasterStemmer
• from nltk.stem.isri import ISRIStemmer
• from nltk.stem.porter import PorterStemmer
• from nltk.stem.snowball import SnowballStemmer
• from nltk.stem.wordnet import WordNetLemmatizer
• from nltk.stem.rslp import RSLPStemmer
• Lemmatization: similar to stemming, but with stems being valid words
• nltk.WordNetLemmatizer()
Lemmatization
@NoemiDerzsy
PyData New York 2017
Term Frequency
Keywords Description Title
• the number of times a word occurs in text corpus
@NoemiDerzsy
PyData New York 2017
TF-IDF
• term frequency – inverse document frequency
• measures term frequency / document frequency
Top 25 Highest TF-IDF
Words in
Description
Top 25 Highest MeanTF-IDF
Words in
Description
@NoemiDerzsy
PyData New York 2017
Description TF-IDF and Keywords
Top 10 keywords:
1. EARTH SCIENCE: 14387
2. OCEANS: 10034
3. PROJECT: 7464
4. OCEAN OPTICS: 7325
5. ATMOSPHERE: 7324
6. OCEAN COLOR: 7271
7. COMPLETED: 6453
8. ATMOSPHERIC WATER VAPOR: 3143
9. LAND SURFACE: 2721
10.BIOSPHERE: 2450
Top 25 Keywords
@NoemiDerzsy
PyData New York 2017
Word Co-Occurrence
§ Co-occurrence matrix of top most frequently co-occurring terms
Top 10
Word Co-Occurrence
Matrix
@NoemiDerzsy
PyData New York 2017
Word Co-Occurrence
Top 20
Word Co-Occurrence
Matrix
@NoemiDerzsy
PyData New York 2017
Top 50
Word Co-Occurrence
Matrix
@NoemiDerzsy
PyData New York 2017
seaborn.clustermap
Top 20
Word Co-Occurrence
Matrix
Discovering Structure in Heatmap Data
standardization
@NoemiDerzsy
PyData New York 2017
Are features pulled from text (such as title, description fields)
and/or human supplied-keywords
descriptive of the content?
Topic Modeling…
@NoemiDerzsy
PyData New York 2017
What is Topic Modeling?
An efficient way to make sense of large volume of texts.
Identify topics within text corpus.
Categorize documents into topics.
Associate words with topics.
Who uses it?
Search engines, for marketing purpose, etc.
@NoemiDerzsy
PyData New York 2017
Latent Dirichlet Allocation (LDA)
q several techniques, but LDA is the most common
q Bayesian inference model that associates each document with a probability distribution over topics
q topics are probability distributions over words (probability of the word being generated from that topic for that document)
q clusters words into topics
q clusters documents into mixture of topics
q scales well with growing corpus
q before running LDA algorithm,we have to specify the number of topics: how to choose beforehand the optimal number of topics?
@NoemiDerzsy
PyData New York 2017
Topic Model Evaluation: Topic Coherence
Q: How to select the top topics?
A: Calculate the UMass topic coherence for each topic. Algorithm from Mimno, Wallach, Talley, Leenders, McCallum: Optimizing
Semantic Coherence in Topic Models, CEMNLP 2011.
Coherence = ! score(𝑤),	𝑤,)
).,
pairwise scores on the words used to describe the topic.
s𝑐𝑜𝑟𝑒34566 78,79
=	log
𝐷 𝑤), 𝑤, + 1
𝐷(𝑤))
D(wi)D(wi) as the count of documents containing the word wiwi, D(wi,wj)D(wi,wj) the count of documents containing both
words wiwi and wjwj, and DD the total number or documents in the corpus.
@NoemiDerzsy
PyData New York 2017
openNASA Topics of Highest Coherence
@NoemiDerzsy
PyData New York 2017
Keywords for Topics
• selected keywords with their most frequently occurring terms
@NoemiDerzsy
PyData New York 2017
Other Clustering Method: K-Means
• using TF-IDF, the document vectors are put through a K-Means clustering algorithm which
computes the Euclidean distances amongst these documents and clusters nearby documents
together
• the algorithm generates cluster tags, known as cluster centers which represent the documents
within these clusters
• K-means distance:
• Euclidean
• Cosine
• Fuzzy
• Accuracy comparison:
• silhouette analysis can be used to study the separation distance between the resulting
clusters; can be used to determine the optimal number of clusters
@NoemiDerzsy
PyData New York 2017
K-Means Clustering
§ Top 10 terms per cluster:
Cluster 0
seawifs
local
collect
km
mission
data
orbview
seastar
broadcast
noon
§ Similar clusters with top words as found using LDA
Cluster 1
data
project
system
use
soil
high
contain
phase
set
gt
Cluster 2
modis
terra
aqua
earth
global
pass
south
north
equator
eos
Cluster 3
band
oct
color
adeos
czcs
nominal
sense
spacecraft
agency
thermal
Cluster 4
product
data
level
version
file
aquarius
set
daily
ml
standard
@NoemiDerzsy
PyData New York 2017
K-Means Clustering
§ cluster size distribution for k = 5
Cluster 2: 19031
Cluster 0: 5805
Cluster 1: 4481
Cluster 3: 1968
Cluster 4: 804
§ silhouette score improves by increasing the number of clusters, best performance for K > 100
§ analysis of cluster size distributions with varying K can reveal additional information
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0 50 100 150 200 250 300 350
SilhouetteScore
K Number of Clusters
Silhouette Score for K-Means Clustering openNASA Datasets
@NoemiDerzsy
PyData New York 2017
Text Classification
§ Naïve Bayes probabilistic model
• Multinomial model
• data follows multinomial distribution
• each feature value is a count (word co-occurrence, weight, tf-idf, etc.)
• Take into account word importance
• Bernoulli model
• data follows a multivariate Bernoulli distribution
• bag of words (count word occurrence)
• each feature is a binary feature (word in text? True/False)
• ignores word significance
§ KNN (K-nearest neighbor) classifier
§ Decision trees
§ Support Vector Machine (SVM)
@NoemiDerzsy
PyData New York 2017
Network Science: Bipartite Network
§ Term in Description – Keyword:
- connect 2 words in description if they appear under same Keyword
- connect 2 keywords if they share common words in Description
§ All bipartite network projection possibilities:
- description – keyword
- description – title
- title – keyword
§ Type of interaction:
- link direction: directed/undirected
- link weight: how strong the connection (co-occurrence matrix)
@NoemiDerzsy
PyData New York 2017
Structural Network Properties
§ degree distribution: reveals connectivity pattern
§ do we have hubs (nodes with significantly higher number of connections)?
§ average degree (how interconnected the elements are)
§ clusters, community detection (subgroups with elements more densely connected among each
other)
@NoemiDerzsy
PyData New York 2017
Terms that occur > 1% of the documents (in more than 320 data set descriptions)
• Network size: # of nodes (words) = 1015
• Average degree: 890.319 (densely connected terms)
• Components: 1 connected component
• Modularity analysis: 2 clusters (pink, violet)
@NoemiDerzsy
PyData New York 2017
Top 100
most frequent words in
Description
• size/color hue ~ occurrence frequency
• outer circle: top 10 most frequent words
Nodes with the highest
connectivity are not all the
same as most frequently
co-occurring
@NoemiDerzsy
PyData New York 2017
Network structures:
powerful tool for
capturing additional information about
interconnected systems!
@NoemiDerzsy
PyData New York 2017
What are the important connections between
NASA datasets and other important datasets outside of NASA
in the US government?
@NoemiDerzsy
PyData New York 2017
Other Open Government Dataset Collections
http://data.nasa.gov/data.json
http://www.epa.gov/data.json
http://data.gov
http://www.nsf.gov/data.json
http://usda.gov/data.json
http://data.noaa.gov/data.json
http://www.commerce.gov/data.json
http://nist.gov/data.json
http://www.defense.gov/data.json
http://www2.ed.gov/data.json
http://www.dol.gov/data.json
http://www.state.gov/data.json
http://www.dot.gov/data.json
http://www.energy.gov/data.json
http://nrel.gov/data.json
http://healthdata.gov/data.json
http://www.hud.gov/data.json
http://www.doi.gov/data.json
http://www.justice.gov/data.json
http://www.archives.gov/data.json
http://www.nrc.gov/data.json
http://www.nsf.gov/data.json
http://www.opm.gov/data.json
https://www.sba.gov/sites/default/files/data.json
http://www.ssa.gov/data.json
http://www.consumerfinance.gov/data.json
http://www.fhfa.gov/data.json
http://www.imls.gov/data.json
http://data.mcc.gov/raw/index.json
http://www.nitrd.gov/data.json
http://www.ntsb.gov/data.json
http://www.sec.gov/data.json
https://open.whitehouse.gov/data.json
http://treasury.gov/data.json
http://www.usaid.gov/data.json
http://www.gsa.gov/data.json
https://www.economist.com/news/international/21678833-open-data-revolution-has-not-lived-up-expectations-it-only-getting
Open Government Data: Out of the Box
The Economist
@NoemiDerzsy
PyData New York 2017
pyNASA and pyOpenGov Libraries
§ Python library that loads all the open NASA or other government metadata collection at once
pyNASA
https://github.com/bmtgoncalves/pyNASA
pyOpenGov
https://github.com/nderzsy/pyOpenGov
How to install:
>> pip install pyNASA
>> pip install pyOpenGov
@NoemiDerzsy
PyData New York 2017
Takeaways
§ NLP enables understanding of structured and unstructured text
§ Topic modeling useful tool for understanding topics in large text corpus, documents
§ Topic models efficient for evaluating the accuracy of human-supplied descriptions, keywords
§ Tedious preprocessing a must before modeling (stop words, lemmatization, special characters, etc.)
§ Network (graph) structure can reveal more information, term/topic associations
§ Network projections allow to answer additional questions
§ Open government data enables citizens in understanding topics, areas of focus
@NoemiDerzsy
Contact
GitHub: https://github.com/nderzsy/NASADatanauts
Twitter: @NoemiDerzsy
Website: http://www.noemiderzsy.com

More Related Content

What's hot

A Survey on Efficient Privacy-Preserving Ranked Keyword Search Method
A Survey on Efficient Privacy-Preserving Ranked Keyword Search MethodA Survey on Efficient Privacy-Preserving Ranked Keyword Search Method
A Survey on Efficient Privacy-Preserving Ranked Keyword Search Method
IRJET Journal
 
About elasticsearch
About elasticsearchAbout elasticsearch
About elasticsearch
Minsoo Jun
 
A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapRe...
A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapRe...A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapRe...
A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapRe...
KamleshKumar394
 
Predicting the relevance of search results for e-commerce systems
Predicting the relevance of search results for e-commerce systemsPredicting the relevance of search results for e-commerce systems
Predicting the relevance of search results for e-commerce systems
Universiti Technologi Malaysia (UTM)
 
MongoDB San Francisco 2013: MongoDB for Collaborative Science presented by D...
MongoDB San Francisco 2013:  MongoDB for Collaborative Science presented by D...MongoDB San Francisco 2013:  MongoDB for Collaborative Science presented by D...
MongoDB San Francisco 2013: MongoDB for Collaborative Science presented by D...
MongoDB
 
An Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning TechniquesAn Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning Techniques
Adnan Akhter
 
Webinar: MongoDB + Hadoop
Webinar: MongoDB + HadoopWebinar: MongoDB + Hadoop
Webinar: MongoDB + Hadoop
MongoDB
 
Interpreting Relational Schema to Graphs
Interpreting Relational Schema to GraphsInterpreting Relational Schema to Graphs
Interpreting Relational Schema to Graphs
Neo4j
 
Breaking the Oracle Tie; High Performance OLTP and Analytics Using MongoDB
Breaking the Oracle Tie; High Performance OLTP and Analytics Using MongoDBBreaking the Oracle Tie; High Performance OLTP and Analytics Using MongoDB
Breaking the Oracle Tie; High Performance OLTP and Analytics Using MongoDB
MongoDB
 
Graph Libraries - Overview on Networkx
Graph Libraries - Overview on NetworkxGraph Libraries - Overview on Networkx
Graph Libraries - Overview on Networkx
鈺棻 曾
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"
NUS-ISS
 
Knowledge Graph for Cybersecurity: An Introduction By Kabul Kurniawan
Knowledge Graph for Cybersecurity: An Introduction By  Kabul KurniawanKnowledge Graph for Cybersecurity: An Introduction By  Kabul Kurniawan
Knowledge Graph for Cybersecurity: An Introduction By Kabul Kurniawan
Kabul Kurniawan
 
Virtual Knowledge Graphs for Federated Log Analysis
Virtual Knowledge Graphs for Federated Log AnalysisVirtual Knowledge Graphs for Federated Log Analysis
Virtual Knowledge Graphs for Federated Log Analysis
Kabul Kurniawan
 
Using MongoDB with Hadoop & Spark
Using MongoDB with Hadoop & SparkUsing MongoDB with Hadoop & Spark
Using MongoDB with Hadoop & Spark
MongoDB
 
DataXDay - Exploring graphs: looking for communities & leaders
DataXDay - Exploring graphs: looking for communities & leadersDataXDay - Exploring graphs: looking for communities & leaders
DataXDay - Exploring graphs: looking for communities & leaders
DataXDay Conference by Xebia
 
It Takes Two to Tango: an Exploration of Domain Pairs for Cross-Domain Collab...
It Takes Two to Tango: an Exploration of Domain Pairs for Cross-Domain Collab...It Takes Two to Tango: an Exploration of Domain Pairs for Cross-Domain Collab...
It Takes Two to Tango: an Exploration of Domain Pairs for Cross-Domain Collab...
Shaghayegh (Sherry) Sahebi
 
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLParikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
MLconf
 
Hpdw 2015-v10-paper
Hpdw 2015-v10-paperHpdw 2015-v10-paper
Hpdw 2015-v10-paper
restassure
 
Query Distributed RDF Graphs: The Effects of Partitioning Paper
Query Distributed RDF Graphs: The Effects of Partitioning PaperQuery Distributed RDF Graphs: The Effects of Partitioning Paper
Query Distributed RDF Graphs: The Effects of Partitioning Paper
DBOnto
 
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Ian Foster
 

What's hot (20)

A Survey on Efficient Privacy-Preserving Ranked Keyword Search Method
A Survey on Efficient Privacy-Preserving Ranked Keyword Search MethodA Survey on Efficient Privacy-Preserving Ranked Keyword Search Method
A Survey on Efficient Privacy-Preserving Ranked Keyword Search Method
 
About elasticsearch
About elasticsearchAbout elasticsearch
About elasticsearch
 
A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapRe...
A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapRe...A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapRe...
A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapRe...
 
Predicting the relevance of search results for e-commerce systems
Predicting the relevance of search results for e-commerce systemsPredicting the relevance of search results for e-commerce systems
Predicting the relevance of search results for e-commerce systems
 
MongoDB San Francisco 2013: MongoDB for Collaborative Science presented by D...
MongoDB San Francisco 2013:  MongoDB for Collaborative Science presented by D...MongoDB San Francisco 2013:  MongoDB for Collaborative Science presented by D...
MongoDB San Francisco 2013: MongoDB for Collaborative Science presented by D...
 
An Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning TechniquesAn Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning Techniques
 
Webinar: MongoDB + Hadoop
Webinar: MongoDB + HadoopWebinar: MongoDB + Hadoop
Webinar: MongoDB + Hadoop
 
Interpreting Relational Schema to Graphs
Interpreting Relational Schema to GraphsInterpreting Relational Schema to Graphs
Interpreting Relational Schema to Graphs
 
Breaking the Oracle Tie; High Performance OLTP and Analytics Using MongoDB
Breaking the Oracle Tie; High Performance OLTP and Analytics Using MongoDBBreaking the Oracle Tie; High Performance OLTP and Analytics Using MongoDB
Breaking the Oracle Tie; High Performance OLTP and Analytics Using MongoDB
 
Graph Libraries - Overview on Networkx
Graph Libraries - Overview on NetworkxGraph Libraries - Overview on Networkx
Graph Libraries - Overview on Networkx
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"
 
Knowledge Graph for Cybersecurity: An Introduction By Kabul Kurniawan
Knowledge Graph for Cybersecurity: An Introduction By  Kabul KurniawanKnowledge Graph for Cybersecurity: An Introduction By  Kabul Kurniawan
Knowledge Graph for Cybersecurity: An Introduction By Kabul Kurniawan
 
Virtual Knowledge Graphs for Federated Log Analysis
Virtual Knowledge Graphs for Federated Log AnalysisVirtual Knowledge Graphs for Federated Log Analysis
Virtual Knowledge Graphs for Federated Log Analysis
 
Using MongoDB with Hadoop & Spark
Using MongoDB with Hadoop & SparkUsing MongoDB with Hadoop & Spark
Using MongoDB with Hadoop & Spark
 
DataXDay - Exploring graphs: looking for communities & leaders
DataXDay - Exploring graphs: looking for communities & leadersDataXDay - Exploring graphs: looking for communities & leaders
DataXDay - Exploring graphs: looking for communities & leaders
 
It Takes Two to Tango: an Exploration of Domain Pairs for Cross-Domain Collab...
It Takes Two to Tango: an Exploration of Domain Pairs for Cross-Domain Collab...It Takes Two to Tango: an Exploration of Domain Pairs for Cross-Domain Collab...
It Takes Two to Tango: an Exploration of Domain Pairs for Cross-Domain Collab...
 
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLParikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
 
Hpdw 2015-v10-paper
Hpdw 2015-v10-paperHpdw 2015-v10-paper
Hpdw 2015-v10-paper
 
Query Distributed RDF Graphs: The Effects of Partitioning Paper
Query Distributed RDF Graphs: The Effects of Partitioning PaperQuery Distributed RDF Graphs: The Effects of Partitioning Paper
Query Distributed RDF Graphs: The Effects of Partitioning Paper
 
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
 

Similar to Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017

PyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from ScratchPyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from Scratch
Noemi Derzsy
 
Sharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsSharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data Lessons
George Stathis
 
Probabilistic Topic models
Probabilistic Topic modelsProbabilistic Topic models
Probabilistic Topic models
Carlos Badenes-Olmedo
 
03 interlinking-dass
03 interlinking-dass03 interlinking-dass
03 interlinking-dass
Diego Pessoa
 
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...
Parang Saraf
 
Duplicate Detection on Hoaxy Dataset
Duplicate Detection on Hoaxy DatasetDuplicate Detection on Hoaxy Dataset
Duplicate Detection on Hoaxy Dataset
Sameera Horawalavithana
 
NLP & DBpedia
 NLP & DBpedia NLP & DBpedia
NLP & DBpedia
kelbedweihy
 
Relationships Matter: Using Connected Data for Better Machine Learning
Relationships Matter: Using Connected Data for Better Machine LearningRelationships Matter: Using Connected Data for Better Machine Learning
Relationships Matter: Using Connected Data for Better Machine Learning
Neo4j
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
Rahul Jain
 
How Graph Databases used in Police Department?
How Graph Databases used in Police Department?How Graph Databases used in Police Department?
How Graph Databases used in Police Department?
Samet KILICTAS
 
Semantics-enhanced Cyberinfrastructure for ICMSE : Interoperability, Analyti...
Semantics-enhanced Cyberinfrastructure for ICMSE :  Interoperability, Analyti...Semantics-enhanced Cyberinfrastructure for ICMSE :  Interoperability, Analyti...
Semantics-enhanced Cyberinfrastructure for ICMSE : Interoperability, Analyti...
Artificial Intelligence Institute at UofSC
 
Graph databases and the #panamapapers
Graph databases and the #panamapapersGraph databases and the #panamapapers
Graph databases and the #panamapapers
darthvader42
 
6.1-Cassandra.ppt
6.1-Cassandra.ppt6.1-Cassandra.ppt
6.1-Cassandra.ppt
yashsharma863914
 
6.1-Cassandra.ppt
6.1-Cassandra.ppt6.1-Cassandra.ppt
6.1-Cassandra.ppt
DanBarcan2
 
Cassandra
CassandraCassandra
Cassandra
ssuserbad56d
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
Robert Grossman
 
ArXiv Literature Exploration using Social Network Analysis
ArXiv Literature Exploration using Social Network AnalysisArXiv Literature Exploration using Social Network Analysis
ArXiv Literature Exploration using Social Network Analysis
Tanat Iempreedee
 
Graph Analytics: Graph Algorithms Inside Neo4j
Graph Analytics: Graph Algorithms Inside Neo4jGraph Analytics: Graph Algorithms Inside Neo4j
Graph Analytics: Graph Algorithms Inside Neo4j
Neo4j
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Spark Summit
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
Subhas Kumar Ghosh
 

Similar to Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017 (20)

PyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from ScratchPyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from Scratch
 
Sharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsSharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data Lessons
 
Probabilistic Topic models
Probabilistic Topic modelsProbabilistic Topic models
Probabilistic Topic models
 
03 interlinking-dass
03 interlinking-dass03 interlinking-dass
03 interlinking-dass
 
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...
 
Duplicate Detection on Hoaxy Dataset
Duplicate Detection on Hoaxy DatasetDuplicate Detection on Hoaxy Dataset
Duplicate Detection on Hoaxy Dataset
 
NLP & DBpedia
 NLP & DBpedia NLP & DBpedia
NLP & DBpedia
 
Relationships Matter: Using Connected Data for Better Machine Learning
Relationships Matter: Using Connected Data for Better Machine LearningRelationships Matter: Using Connected Data for Better Machine Learning
Relationships Matter: Using Connected Data for Better Machine Learning
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
How Graph Databases used in Police Department?
How Graph Databases used in Police Department?How Graph Databases used in Police Department?
How Graph Databases used in Police Department?
 
Semantics-enhanced Cyberinfrastructure for ICMSE : Interoperability, Analyti...
Semantics-enhanced Cyberinfrastructure for ICMSE :  Interoperability, Analyti...Semantics-enhanced Cyberinfrastructure for ICMSE :  Interoperability, Analyti...
Semantics-enhanced Cyberinfrastructure for ICMSE : Interoperability, Analyti...
 
Graph databases and the #panamapapers
Graph databases and the #panamapapersGraph databases and the #panamapapers
Graph databases and the #panamapapers
 
6.1-Cassandra.ppt
6.1-Cassandra.ppt6.1-Cassandra.ppt
6.1-Cassandra.ppt
 
6.1-Cassandra.ppt
6.1-Cassandra.ppt6.1-Cassandra.ppt
6.1-Cassandra.ppt
 
Cassandra
CassandraCassandra
Cassandra
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
 
ArXiv Literature Exploration using Social Network Analysis
ArXiv Literature Exploration using Social Network AnalysisArXiv Literature Exploration using Social Network Analysis
ArXiv Literature Exploration using Social Network Analysis
 
Graph Analytics: Graph Algorithms Inside Neo4j
Graph Analytics: Graph Algorithms Inside Neo4jGraph Analytics: Graph Algorithms Inside Neo4j
Graph Analytics: Graph Algorithms Inside Neo4j
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 

Recently uploaded

一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 

Recently uploaded (20)

一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 

Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017

  • 1. PyData New York 2017 Noemi Derzsy Data Science Keys to Open Up OpenNASA Datasets @NoemiDerzsy
  • 2. PyData New York 2017 Open NASA Platform • Data at NASA: 32 089 • Data at other government agencies: ~185 000 • NASA code repositories: 356 • NASA APIs: 51 Continuously growing… https://open.nasa.gov/explore/datanauts/ @NoemiDerzsy
  • 3. PyData New York 2017 @NoemiDerzsy
  • 4. PyData New York 2017 § Datasets: 32089 § Some of the data sets: - Mars Rover sound data - Hubble Telescope image collection - NASA patents - Picture of the Day of Earth - etc. http://data.nasa.gov/data.json Open NASA Metadata Metadata information: • id • type • accessLevel • accrualPeriodicity • bureauCode • contactPoint • title • description • distribution • identifier • Issued • keyword • language • modified • programCode • theme • license • location (HTML link) • Etc. Format:JSON @NoemiDerzsy
  • 5. PyData New York 2017 Which of these features is “best” to tie together the data? How do we label groupings in a meaningful manner? How many groups/how to arrange/visualize them? Are Descriptions, Keywords representative of the content? @NoemiDerzsy
  • 6. PyData New York 2017 Natural Language Processing (NLP) Python Libraries • NLTK • TextBlob • spaCy • gensim • Stanford CoreNLP Jupyter Notebooks: https://github.com/nderzsy/NASADatanauts @NoemiDerzsy
  • 7. PyData New York 2017 Word Clouds in Description Word clouds of the over 32,000 open NASA datasets and their descriptions @NoemiDerzsy
  • 8. PyData New York 2017 Word clouds of the over 32,000 open NASA datasets and their titles Word Clouds in Title @NoemiDerzsy
  • 9. PyData New York 2017 Word Clouds in Keywords Word clouds of the over 32,000 open NASA datasets and their keywords @NoemiDerzsy
  • 10. PyData New York 2017 How to Obtain Customized Word Clouds? o get stencil (shape of your choice) o get text o https://github.com/amueller/word_cloud @NoemiDerzsy
  • 11. PyData New York 2017 Text Preprocessing, Cleaning • treat “Data” and “data” as identical words: covert all words to lowercase lower() • remove special characters, codes, numbers: regular expressions • check for misspelling • stop words: this, and, for, where, etc. • “system” vs. “systems”: lemmatize • “compute”, “computer”, “computation” -> “comput”: stem • tokenize: break down text to smallest parts (words) • POS tagging • dimensionality reduction (PCA) @NoemiDerzsy
  • 12. PyData New York 2017 Stemming • from nltk.stem.api import StemmerI • from nltk.stem.regexp import RegexpStemmer • from nltk.stem.lancaster import LancasterStemmer • from nltk.stem.isri import ISRIStemmer • from nltk.stem.porter import PorterStemmer • from nltk.stem.snowball import SnowballStemmer • from nltk.stem.wordnet import WordNetLemmatizer • from nltk.stem.rslp import RSLPStemmer • Lemmatization: similar to stemming, but with stems being valid words • nltk.WordNetLemmatizer() Lemmatization @NoemiDerzsy
  • 13. PyData New York 2017 Term Frequency Keywords Description Title • the number of times a word occurs in text corpus @NoemiDerzsy
  • 14. PyData New York 2017 TF-IDF • term frequency – inverse document frequency • measures term frequency / document frequency Top 25 Highest TF-IDF Words in Description Top 25 Highest MeanTF-IDF Words in Description @NoemiDerzsy
  • 15. PyData New York 2017 Description TF-IDF and Keywords Top 10 keywords: 1. EARTH SCIENCE: 14387 2. OCEANS: 10034 3. PROJECT: 7464 4. OCEAN OPTICS: 7325 5. ATMOSPHERE: 7324 6. OCEAN COLOR: 7271 7. COMPLETED: 6453 8. ATMOSPHERIC WATER VAPOR: 3143 9. LAND SURFACE: 2721 10.BIOSPHERE: 2450 Top 25 Keywords @NoemiDerzsy
  • 16. PyData New York 2017 Word Co-Occurrence § Co-occurrence matrix of top most frequently co-occurring terms Top 10 Word Co-Occurrence Matrix @NoemiDerzsy
  • 17. PyData New York 2017 Word Co-Occurrence Top 20 Word Co-Occurrence Matrix @NoemiDerzsy
  • 18. PyData New York 2017 Top 50 Word Co-Occurrence Matrix @NoemiDerzsy
  • 19. PyData New York 2017 seaborn.clustermap Top 20 Word Co-Occurrence Matrix Discovering Structure in Heatmap Data standardization @NoemiDerzsy
  • 20. PyData New York 2017 Are features pulled from text (such as title, description fields) and/or human supplied-keywords descriptive of the content? Topic Modeling… @NoemiDerzsy
  • 21. PyData New York 2017 What is Topic Modeling? An efficient way to make sense of large volume of texts. Identify topics within text corpus. Categorize documents into topics. Associate words with topics. Who uses it? Search engines, for marketing purpose, etc. @NoemiDerzsy
  • 22. PyData New York 2017 Latent Dirichlet Allocation (LDA) q several techniques, but LDA is the most common q Bayesian inference model that associates each document with a probability distribution over topics q topics are probability distributions over words (probability of the word being generated from that topic for that document) q clusters words into topics q clusters documents into mixture of topics q scales well with growing corpus q before running LDA algorithm,we have to specify the number of topics: how to choose beforehand the optimal number of topics? @NoemiDerzsy
  • 23. PyData New York 2017 Topic Model Evaluation: Topic Coherence Q: How to select the top topics? A: Calculate the UMass topic coherence for each topic. Algorithm from Mimno, Wallach, Talley, Leenders, McCallum: Optimizing Semantic Coherence in Topic Models, CEMNLP 2011. Coherence = ! score(𝑤), 𝑤,) )., pairwise scores on the words used to describe the topic. s𝑐𝑜𝑟𝑒34566 78,79 = log 𝐷 𝑤), 𝑤, + 1 𝐷(𝑤)) D(wi)D(wi) as the count of documents containing the word wiwi, D(wi,wj)D(wi,wj) the count of documents containing both words wiwi and wjwj, and DD the total number or documents in the corpus. @NoemiDerzsy
  • 24. PyData New York 2017 openNASA Topics of Highest Coherence @NoemiDerzsy
  • 25. PyData New York 2017 Keywords for Topics • selected keywords with their most frequently occurring terms @NoemiDerzsy
  • 26. PyData New York 2017 Other Clustering Method: K-Means • using TF-IDF, the document vectors are put through a K-Means clustering algorithm which computes the Euclidean distances amongst these documents and clusters nearby documents together • the algorithm generates cluster tags, known as cluster centers which represent the documents within these clusters • K-means distance: • Euclidean • Cosine • Fuzzy • Accuracy comparison: • silhouette analysis can be used to study the separation distance between the resulting clusters; can be used to determine the optimal number of clusters @NoemiDerzsy
  • 27. PyData New York 2017 K-Means Clustering § Top 10 terms per cluster: Cluster 0 seawifs local collect km mission data orbview seastar broadcast noon § Similar clusters with top words as found using LDA Cluster 1 data project system use soil high contain phase set gt Cluster 2 modis terra aqua earth global pass south north equator eos Cluster 3 band oct color adeos czcs nominal sense spacecraft agency thermal Cluster 4 product data level version file aquarius set daily ml standard @NoemiDerzsy
  • 28. PyData New York 2017 K-Means Clustering § cluster size distribution for k = 5 Cluster 2: 19031 Cluster 0: 5805 Cluster 1: 4481 Cluster 3: 1968 Cluster 4: 804 § silhouette score improves by increasing the number of clusters, best performance for K > 100 § analysis of cluster size distributions with varying K can reveal additional information 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0 50 100 150 200 250 300 350 SilhouetteScore K Number of Clusters Silhouette Score for K-Means Clustering openNASA Datasets @NoemiDerzsy
  • 29. PyData New York 2017 Text Classification § Naïve Bayes probabilistic model • Multinomial model • data follows multinomial distribution • each feature value is a count (word co-occurrence, weight, tf-idf, etc.) • Take into account word importance • Bernoulli model • data follows a multivariate Bernoulli distribution • bag of words (count word occurrence) • each feature is a binary feature (word in text? True/False) • ignores word significance § KNN (K-nearest neighbor) classifier § Decision trees § Support Vector Machine (SVM) @NoemiDerzsy
  • 30. PyData New York 2017 Network Science: Bipartite Network § Term in Description – Keyword: - connect 2 words in description if they appear under same Keyword - connect 2 keywords if they share common words in Description § All bipartite network projection possibilities: - description – keyword - description – title - title – keyword § Type of interaction: - link direction: directed/undirected - link weight: how strong the connection (co-occurrence matrix) @NoemiDerzsy
  • 31. PyData New York 2017 Structural Network Properties § degree distribution: reveals connectivity pattern § do we have hubs (nodes with significantly higher number of connections)? § average degree (how interconnected the elements are) § clusters, community detection (subgroups with elements more densely connected among each other) @NoemiDerzsy
  • 32. PyData New York 2017 Terms that occur > 1% of the documents (in more than 320 data set descriptions) • Network size: # of nodes (words) = 1015 • Average degree: 890.319 (densely connected terms) • Components: 1 connected component • Modularity analysis: 2 clusters (pink, violet) @NoemiDerzsy
  • 33. PyData New York 2017 Top 100 most frequent words in Description • size/color hue ~ occurrence frequency • outer circle: top 10 most frequent words Nodes with the highest connectivity are not all the same as most frequently co-occurring @NoemiDerzsy
  • 34. PyData New York 2017 Network structures: powerful tool for capturing additional information about interconnected systems! @NoemiDerzsy
  • 35. PyData New York 2017 What are the important connections between NASA datasets and other important datasets outside of NASA in the US government? @NoemiDerzsy
  • 36. PyData New York 2017 Other Open Government Dataset Collections http://data.nasa.gov/data.json http://www.epa.gov/data.json http://data.gov http://www.nsf.gov/data.json http://usda.gov/data.json http://data.noaa.gov/data.json http://www.commerce.gov/data.json http://nist.gov/data.json http://www.defense.gov/data.json http://www2.ed.gov/data.json http://www.dol.gov/data.json http://www.state.gov/data.json http://www.dot.gov/data.json http://www.energy.gov/data.json http://nrel.gov/data.json http://healthdata.gov/data.json http://www.hud.gov/data.json http://www.doi.gov/data.json http://www.justice.gov/data.json http://www.archives.gov/data.json http://www.nrc.gov/data.json http://www.nsf.gov/data.json http://www.opm.gov/data.json https://www.sba.gov/sites/default/files/data.json http://www.ssa.gov/data.json http://www.consumerfinance.gov/data.json http://www.fhfa.gov/data.json http://www.imls.gov/data.json http://data.mcc.gov/raw/index.json http://www.nitrd.gov/data.json http://www.ntsb.gov/data.json http://www.sec.gov/data.json https://open.whitehouse.gov/data.json http://treasury.gov/data.json http://www.usaid.gov/data.json http://www.gsa.gov/data.json https://www.economist.com/news/international/21678833-open-data-revolution-has-not-lived-up-expectations-it-only-getting Open Government Data: Out of the Box The Economist @NoemiDerzsy
  • 37. PyData New York 2017 pyNASA and pyOpenGov Libraries § Python library that loads all the open NASA or other government metadata collection at once pyNASA https://github.com/bmtgoncalves/pyNASA pyOpenGov https://github.com/nderzsy/pyOpenGov How to install: >> pip install pyNASA >> pip install pyOpenGov @NoemiDerzsy
  • 38. PyData New York 2017 Takeaways § NLP enables understanding of structured and unstructured text § Topic modeling useful tool for understanding topics in large text corpus, documents § Topic models efficient for evaluating the accuracy of human-supplied descriptions, keywords § Tedious preprocessing a must before modeling (stop words, lemmatization, special characters, etc.) § Network (graph) structure can reveal more information, term/topic associations § Network projections allow to answer additional questions § Open government data enables citizens in understanding topics, areas of focus @NoemiDerzsy