GraphAware®
SIGNALS FROM OUTER SPACE
Vlasta Kůs, Data Scientist @ GraphAware
graphaware.com
@graph_aware, @VlastaKus
How NASA Benefits from Graph-Powered NLP
‣ Database of learned knowledge across NASA’s programs & projects
‣ Unstructured text with basic metadata
‣ Collected since the late 1950s (hundreds of millions of documents)
‣ Public dataset: ~1600 documents
NASA’s Lessons Learned
GraphAware®
"1406",420,"Roberts, J “,
"VO'75 Pressure Regulator Leakage and Work-Around
Procedures (~1976)”,
"The pressure regulator in the Viking Orbiter Propulsion
Subsystem started leaking following a pyro firing that
occurred prior to the near-Mars TCM. Likely causes were
corrosion or residue from propellant migration or pyro
valve blowby, or particulate contamination. Recommendations
included using separate regulators for the fuel and
oxidizer sides, incorporating a bellows in the pyro valve
to eliminate blowby, and adding a isolation valve between
the regulator and propellant tank.“,
" The micro-scale effects of long-term propellant exposure
should be investigated in order to better critique
regulator design. “,
"JPL",1996-07-08,"",TRUE,"",1460,7,NA,"https://
nen.nasa.gov/web/11/viewall/-/viewall/420"
NASA’s Lessons Learned Database
GraphAware®
“673",1326,"Relvini, Kristine “,
"Lessons Learned Not Being Inputted Into Lessons
Learned Information System (LLIS) Database”,
“",
"If you don't document the lessons learned, you loose
knowledgeable, shared information and tracking capacity
across programs.“,
"KSC",2002-10-11,"Aeronautics Research, Science,
Exploration Systems, Space Operations, ",FALSE,"",
702,6,NA,"https://nen.nasa.gov/web/11/viewall/-/
viewall/1326"
NASA’s Lessons Learned
GraphAware®
Graph database: from isolated data silos to connected knowledge
‣ Efficient search
‣ Relationships among various areas
Apollo, Space Shuttle, Orion, …
‣ Pattern recognition (clusters, communities, correlations, …)
Example: correlation between corrosion of valves & topics involving batteries
‣ Useful for planning future projects and preventing/solving issues
NASA’s Lessons Learned
GraphAware®
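As a concrete illustration of querying such connected knowledge, here is a minimal sketch using the official Neo4j Python driver. The (:Lesson)-[:HAS_KEYWORD]->(:Keyword) schema, the program property, and the connection details are assumptions for illustration, not NASA's actual data model.

```python
# Hypothetical sketch: finding keywords that connect lessons across programs.
# Schema and credentials are illustrative assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (a:Lesson)-[:HAS_KEYWORD]->(k:Keyword)<-[:HAS_KEYWORD]-(b:Lesson)
WHERE a.program <> b.program
RETURN k.value AS keyword, count(*) AS strength
ORDER BY strength DESC LIMIT 10
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["keyword"], record["strength"])
driver.close()
```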
What is a Graph?
GraphAware®
G = (V, E)
WHY NEO4J?
GraphAware®
It is a proper graph database
It is a proper database
Graph-Based Architecture: Knowledge Graph
GraphAware®
EXAMPLE
GraphAware®
‣ NLP = machine learning tools allowing computers to process - and
perhaps understand - human languages
‣ Basic steps
Sentence segmentation
Tokenisation
Lemmatisation
Part of Speech (POS) tagging
Parsing
Named Entity Recognition (NER)
Sentiment analysis
…
Natural Language Processing
GraphAware®
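A minimal sketch of these basic steps from Python, using Stanford's stanza package as a stand-in (the talk itself runs CoreNLP/OpenNLP through the GraphAware Neo4j plugin):

```python
# Sentence segmentation, tokenisation, lemmatisation, POS tagging and NER
# with Stanford's stanza package.
import stanza

stanza.download("en")  # fetch pre-built English models (first run only)
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,ner")

doc = nlp("Mars Global Surveyor was launched in November 1996.")

for sentence in doc.sentences:               # sentence segmentation
    for word in sentence.words:              # tokenisation
        print(word.text, word.lemma, word.upos)  # lemma + POS tag

for ent in doc.ents:                         # named entity recognition
    print(ent.text, "->", ent.type)          # e.g. "November 1996 -> DATE"
```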
Currently supported toolkits for human language processing
‣ Stanford CoreNLP
‣ developed at Stanford University
‣ fast, robust, production ready
‣ many pre-built models
‣ license: GPL v3+
‣ Apache OpenNLP
‣ developed by volunteers
‣ many pre-built models
‣ license: Apache License v2.0
NLP: Text Processors
GraphAware®
‣ Named Entity Recognition (NER) = classification of words into predefined
classes
‣ Examples: Dr. Who -> Person, May 2018 -> Date, EU -> Country …
‣ Stanford NLP default entities: Person, Location, Date, Organisation,
Number, Money, Percentage
‣ Custom NE classes -> training on large tokenised & labeled corpus
‣ Wikipedia, Wikidata - rich sources of multilingual training data that can
be extracted automatically
Named Entity Recognition
GraphAware®
Custom Named Entities based on Wikipedia
GraphAware®
NASA use case: identify names of space missions
Training - crawling Wikipedia & identifying relevant information
EXAMPLE
GraphAware®
Universal Dependencies: cross-linguistically consistent grammatical relations
among words in a sentence
Examples:
‣ amod (adjectival modifier)
Matt likes red wine.
‣ appos (appositional modifier)
Mars Global Surveyor (MGS) was an American robotic spacecraft …
‣ conj (conjunct)
It failed to respond to messages and commands.
‣ …
Universal Dependencies
GraphAware®
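A sketch of extracting these relations with stanza's dependency parser (a stand-in of convenience; the talk uses the CoreNLP parser):

```python
# Print universal dependency relations as deprel(head, dependent) triples.
import stanza

nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")
doc = nlp("Mars Global Surveyor (MGS) was an American robotic spacecraft.")

for sent in doc.sentences:
    for word in sent.words:
        head = sent.words[word.head - 1].text if word.head > 0 else "ROOT"
        print(f"{word.deprel}({head}, {word.text})")
        # e.g. amod(spacecraft, robotic), appos(Surveyor, MGS)
```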
‣ Stanford CoreNLP: Dependency & Part of Speech analysis of a single sentence
Source: http://nlp.stanford.edu:8080/corenlp/process
Either find an efficient representation in some traditional database, or …
Graph-Powered NLP
GraphAware®
Graph-Powered NLP
GraphAware®
NLP and property graphs: natural fit
… use a property graph!
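A hand-rolled sketch of persisting dependency output into Neo4j as a property graph (the GraphAware NLP plugin does this via stored procedures; the (:Tag) schema here is an illustrative assumption):

```python
# Store a dependency triple as two Tag nodes and a typed relationship.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def store_dependency(tx, head, dependent, deprel):
    tx.run(
        """
        MERGE (h:Tag {value: $head})
        MERGE (d:Tag {value: $dep})
        MERGE (d)-[r:DEPENDS_ON]->(h)
        SET r.type = $deprel
        """,
        head=head, dep=dependent, deprel=deprel,
    )

with driver.session() as session:
    # amod(wine, red) from "Matt likes red wine."
    session.execute_write(store_dependency, "wine", "red", "amod")
driver.close()
```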
EXAMPLE
GraphAware®
Unsupervised techniques tend to be underestimated, yet:
‣ No need for time & money to get massive labeled training datasets
‣ Often faster to train & faster to predict
‣ Unsupervised deep learning
Unsupervised ML Algorithms
GraphAware®
PageRank
GraphAware®
PageRank = a measure of a web page's importance, based on the quality
of the links it receives from other pages
The formula reflects a model of a random surfer.
Source: https://en.wikipedia.org/wiki/PageRank
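The random-surfer formula: with damping factor d (typically 0.85) and N pages, PR(p_i) = (1 - d)/N + d * Σ PR(p_j)/L(p_j), summing over the pages p_j that link to p_i, where L(p_j) is the number of outgoing links of p_j. A minimal power-iteration sketch:

```python
# Power iteration for PageRank over a simple adjacency-list graph.
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each node to the list of nodes it links to."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - d) / n for node in nodes}
        for node, targets in links.items():
            for target in targets:
                new_rank[target] += d * rank[node] / len(targets)
        rank = new_rank
    return rank

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```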
Keyword Extraction: TextRank
GraphAware®
Keywords = words/phrases that capture the semantic essence of a text
Graph-Based Unsupervised Algorithm:
‣ Construct a graph of word co-occurrences
‣ Assess the importance of words with the PageRank algorithm
‣ Use top 1/3 of words as keyword candidates
‣ Use universal dependencies to construct key phrases
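A sketch of these steps with networkx (in practice, candidates are first filtered to nouns and adjectives, as in the original paper):

```python
# TextRank keyword candidates: co-occurrence graph + PageRank, top third kept.
import networkx as nx

def textrank_keywords(tokens, window=2):
    graph = nx.Graph()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window + 1]:  # co-occurrence window
            if word != other:
                graph.add_edge(word, other)
    scores = nx.pagerank(graph)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[: max(1, len(ranked) // 3)]  # top 1/3 as candidates

tokens = ["space", "shuttle", "launch", "vehicle", "flight",
          "hardware", "shuttle", "launch", "vehicle"]
print(textrank_keywords(tokens))
```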
GraphAware®
Rada Mihalcea, Paul Tarau. TextRank: Bringing Order into Texts. Proceedings of EMNLP 2004, pages 404–411, Barcelona,
Spain. Association for Computational Linguistics. http://www.aclweb.org/anthology/W04-3252.
Keyword Extraction: TextRank
Despite its simplicity, TextRank delivers
state-of-the-art results on a wide range of
unstructured texts.
Leveraging universal dependencies allowed
us to surpass the precision & recall of the
original TextRank paper.
NASA examples: “space shuttle”, “flight hardware”, “launch vehicle”, …
Automatic text summarisation
‣ Abstractive
‣ Extractive
TextRank can be adapted for efficient
sentence ranking for extractive summarisation.
Summarisation: TextRank
GraphAware®
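A sketch of sentence ranking for extractive summarisation: build a sentence-similarity graph (here TF-IDF cosine similarity, an assumption of convenience) and rank sentences with PageRank:

```python
# Extractive summarisation: PageRank over a sentence-similarity graph.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarise(sentences, top_k=2):
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)            # sentence-to-sentence weights
    graph = nx.from_numpy_array(sim)          # weighted similarity graph
    scores = nx.pagerank(graph, weight="weight")
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [sentences[i] for i in sorted(best)]  # keep original order

sentences = [
    "The pressure regulator started leaking after a pyro firing.",
    "Likely causes were corrosion or propellant migration.",
    "The team recommended separate regulators for fuel and oxidizer.",
]
print(summarise(sentences))
```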
EXAMPLE
GraphAware®
ConceptNet 5 = semantic network for understanding the meaning of words
‣ Relational knowledge from MIT’s Open Mind Common Sense project
‣ DBpedia (information from Wikipedia info-boxes)
‣ Wiktionary (free multilingual dictionary)
‣ …
Knowledge Enrichment: ConceptNet 5
GraphAware®
Microsoft Concept Graph = semantic network introducing knowledge
about concepts
‣ harnessed from billions of web pages and years’ worth of search logs
Expand the knowledge from external or other internal sources.
Knowledge Enrichment
GraphAware®
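ConceptNet 5 exposes a public REST API, which makes a minimal enrichment sketch straightforward (the printed output is illustrative):

```python
# Look up related concepts for a term via the public ConceptNet 5 API.
import requests

def related_concepts(term, limit=5):
    edges = requests.get(f"http://api.conceptnet.io/c/en/{term}").json()["edges"]
    return [(e["rel"]["label"], e["end"]["label"]) for e in edges[:limit]]

print(related_concepts("valve"))   # e.g. [('IsA', 'a device'), ...]
```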
‣ Latent Dirichlet Allocation (LDA) - generative statistical model that
describes documents as a probabilistic mixture of a small number of topics
‣ Each topic described by a list of most relevant words
‣ Sample of topics from the NASA dataset
["design", "failure", "test", "result", "flight", "hardware", "mission", "testing", "system", "due"]
["pressure", "system", "cause", "valve", "propellant", "leak", "operation", "shuttle", "space", "gas"]
["space", "shuttle", "NASA", "operation", "safety", "iss", "crew", "ISS", "astronaut", "program"]
Topic Extraction: Latent Dirichlet Allocation
GraphAware®
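A minimal LDA sketch with gensim on pre-tokenised documents (toy corpus; topics like the samples above need thousands of documents):

```python
# Topic extraction with gensim's LDA over a bag-of-words corpus.
from gensim import corpora, models

docs = [
    ["pressure", "valve", "propellant", "leak", "regulator"],
    ["shuttle", "crew", "safety", "operation", "astronaut"],
    ["design", "failure", "test", "flight", "hardware"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```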
EXAMPLE
GraphAware®
‣ Word embeddings = representation of words as multi-dimensional
semantic vectors which encode linguistic patterns
‣ Use cases: word sense disambiguation, new distance functions between
documents, starting point for further ML (e.g. NN classification)
‣ Word2vec = shallow two-layer neural network model for producing word
embeddings
‣ ConceptNet Numberbatch - a set of state-of-the-art word embeddings
Word Embeddings
GraphAware®
Word Embeddings: word2vec
GraphAware®
Tomas Mikolov et al.: https://arxiv.org/abs/1301.3781
Word Embeddings: word2vec
GraphAware®
Kusner et al.: http://mkusner.github.io/publications/WMD.pdf
Document distance (Word Mover's Distance): the minimum cumulative distance
the words of one document must travel to match the words of another
Semantic patterns representable as linear translations:
distance(Oslo -> Norway) similar to distance(Berlin -> Germany)
vec(Germany) - vec(Berlin) + vec(Oslo) ≈ vec(Norway)
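A sketch of training word2vec with gensim and testing the linear-translation property (toy corpus, API demonstration only; meaningful analogies need a large corpus):

```python
# Train skip-gram word2vec and query the analogy
# vec(germany) - vec(berlin) + vec(oslo) ≈ vec(norway).
from gensim.models import Word2Vec

sentences = [["berlin", "is", "the", "capital", "of", "germany"],
             ["oslo", "is", "the", "capital", "of", "norway"]]
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)

print(model.wv.most_similar(positive=["germany", "oslo"],
                            negative=["berlin"], topn=3))

# Word Mover's Distance between two token lists (needs an optional
# earth-mover's-distance dependency):
# model.wv.wmdistance(["pressure", "valve"], ["regulator", "leak"])
```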
Document Embeddings
GraphAware®
Q. Le, T. Mikolov: Distributed representations of sentences and documents, arXiv:1405.4053v2
Paragraph Vector (doc2vec): extension of word2vec
The additional paragraph node represents context (topic) of the current document.
Paragraph vectors behave the same way as word vectors under linear vector
translations.
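A doc2vec sketch with gensim, matching the 300-dimensional setup mentioned below (toy input; the tags are hypothetical lesson IDs):

```python
# Paragraph Vector (doc2vec): each document gets its own trained vector.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(["pressure", "regulator", "leak"], tags=["lesson_420"]),
    TaggedDocument(["shuttle", "crew", "safety"], tags=["lesson_1326"]),
]
model = Doc2Vec(docs, vector_size=300, min_count=1, epochs=40)

vector = model.dv["lesson_420"]              # learned paragraph vector
print(model.dv.most_similar("lesson_420"))   # nearest documents
```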
Document Embeddings
GraphAware®
doc2vec vectors of dimension 300, NASA sentences -> dimensionality reduction (PCA + t-SNE)
Document Embeddings
GraphAware®
doc2vec vectors of dimension 2000, 30k Wikipedia pages -> dimensionality reduction (PCA + t-SNE)
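A sketch of the two-stage reduction behind these plots: PCA first (to ~50 dimensions, cheap and denoising), then t-SNE down to 2 for plotting:

```python
# PCA + t-SNE dimensionality reduction for embedding visualisation.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

vectors = np.random.rand(1000, 300)   # stand-in for doc2vec vectors

reduced = PCA(n_components=50).fit_transform(vectors)
coords = TSNE(n_components=2, perplexity=30).fit_transform(reduced)
print(coords.shape)                   # (1000, 2), ready to scatter-plot
```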
Some of the neural networks applicable to text processing
‣ Shallow networks (word & document embeddings)
‣ Deep Auto-Encoders
‣ Convolutional Neural Networks
‣ Recurrent Neural Networks (LSTMs)
Deep Learning for Text Processing
GraphAware®
Self-supervised Auto-Encoders: useful for vector embeddings (images, texts)
DeepLearning4J - Java-based deep learning library
Example of auto-encoder (e.g. stacked RBMs) …
Deep Learning: Auto-encoders
GraphAware®
Works well for images, but problematic for texts (sparsity).
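A minimal dense auto-encoder sketch in Keras, standing in for the stacked-RBM variant on the slide (dimensions are illustrative):

```python
# Dense auto-encoder: compress bag-of-words vectors to a small code.
from tensorflow import keras

input_dim, code_dim = 2000, 30
inputs = keras.Input(shape=(input_dim,))
encoded = keras.layers.Dense(500, activation="relu")(inputs)
code = keras.layers.Dense(code_dim, activation="relu")(encoded)
decoded = keras.layers.Dense(500, activation="relu")(code)
outputs = keras.layers.Dense(input_dim, activation="sigmoid")(decoded)

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, code)    # embedding half, used after training
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# autoencoder.fit(X, X, epochs=20)     # train to reconstruct its own input
```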
Convolutional Neural Networks
GraphAware®
Y. Zhang, B. Wallace: arXiv:1510.03820
Classification of documents based on word embeddings and CNN
Deep Learning: Summarisation
GraphAware®
S. Narayan et al.: Ranking Sentences for Extractive Summarisation with Reinforcement learning, arXiv:1802.08636
Deep Learning: Summarisation
GraphAware®
Extractive summarisation (sentence ranking) notably outperforms abstractive.
S. Narayan et al.: Ranking Sentences for Extractive Summarisation with Reinforcement learning, arXiv:1802.08636
Knowledge Graphs are a powerful problem-solving tool
‣ Augmented search
‣ Actionable knowledge
‣ Machine Learning
‣ Chatbots and Question answering systems
‣ Foundational to AI
Conclusion
GraphAware®
www.graphaware.com @graph_aware
