Natural Language Processing with Neo4j


Recent natural language processing advancements have propelled search engine and information retrieval innovations into the public spotlight. People want to be able to interact with their devices in a natural way. In this talk I will be introducing you to natural language search using a Neo4j graph database. I will show you how to interact with an abstract graph data structure using natural language and how this approach is key to future innovations in the way we interact with our devices.

Published in: Technology, Education

  • Introduction
    My name is.., I work for.., My job is..
    Today I want to talk to you about NLP with Neo4j
    I’m from California, I live in the SF Bay Area.
  • These are the core ideas behind my research on NLP
  • My story about making a better search engine on top of Wikipedia.
    The problem was understanding unstructured text.
    I wanted to solve that problem.
    Wikipedia has so much valuable knowledge.
    Analyzing it on your own, document by document, would take a lifetime.
  • This process yields this basic graph structure.
  • Why I am here
    I am infatuated with the idea of machine learning
  • Anything that can store information can learn: it can back-reference, and it can assimilate knowledge from past experience.
  • The store part of learning is crucial
  • Machine learning is real. It isn’t magic. It is profoundly real, interesting, and simple. It is simple to articulate. It is the ability of machines to learn from prior experiences.
  • Machine learning algorithms form a hypothesis by studying data and use it to predict something meaningful.
  • When I say experience, I mean DATA.
    Machine learning is based on collecting DATA.
  • The problem with AI is that it has a lot in common with magic.
    A lot of people say it exists, and a lot of people say it doesn’t.
    There are groups, cults, movies, books, and endless fantasy stories based around AI.
    It’s a central theme in some ancient Greek mythology.
    It’s a wrapper term for loads of stuff.
  • Because we don’t really understand how intelligence works at the human level, or at least there is no easy way to describe it. Generally, intelligence is the act of perceiving an environment and then acting to maximize the chances of success.
  • So I wanted to build a better search engine for Wikipedia. So naturally I started by using Wikipedia to learn more about NLP and machine learning.
  • This process yields this basic graph structure.
  • I recorded my process; I observed myself. I would search for a term and read through the text. When I came to a term I didn’t recognize, I would open the term’s hyperlink in a new tab, then repeat the process until I made my way back up to the original topic I had searched for.
  • So I put together a diagram of my learning process as a recursive algorithm. Through that process I built a prototype. But it had no database!
  • The result of the algorithm was a graph. I needed to store that data as a graph. Naturally I found my way to Neo4j, which is a graph database.
  • Simple graph data model:
    Many different articles contain many different phrases,
    extracted from many sentences,
    which were in turn extracted from the articles.
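A minimal in-memory sketch of that data model, assuming hypothetical label and relationship names (the talk does not spell out its exact schema):

```python
# In-memory sketch of the graph model: an Article contains Phrases,
# Phrases are found in Sentences, and Sentences are extracted from Articles.
# The label and relationship names below are assumptions for illustration.

article = {"label": "Article", "title": "Sweden"}
phrase = {"label": "Phrase", "text": "the king of sweden"}
sentence = {"label": "Sentence", "text": "The King of Sweden lives in Stockholm."}

relationships = [
    (article, "CONTAINS", phrase),
    (phrase, "FOUND_IN", sentence),
    (sentence, "EXTRACTED_FROM", article),
]

for start, rel, end in relationships:
    print(f"({start['label']})-[:{rel}]->({end['label']})")
```

In a real Neo4j deployment these tuples would become nodes and typed relationships, which is what makes the traversal queries later in the talk possible.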
  • Visualizing the result in Gephi
  • Here is what the database looked like at 200k nodes and 1 million relationships when visualized in Gephi
  • Now with Cypher (Neo4j’s query language) I could traverse these nodes to do automatic summarization of Wikipedia text.
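To illustrate what that traversal enables, here is a pure-Python sketch of the summarization step: score each sentence by how many of the article's recurring phrases it contains, and keep the top-scoring sentences. The data and scoring rule are assumptions for illustration, not the talk's actual Cypher query.

```python
# Toy extractive summarization: rank sentences by how many recurring phrases
# they contain. In the real system this count would come from traversing
# (Article)->(Phrase)->(Sentence) relationships in Neo4j.

def summarize(sentences, phrases, top_n=2):
    def score(sentence):
        return sum(1 for p in phrases if p in sentence.lower())
    return sorted(sentences, key=score, reverse=True)[:top_n]

sentences = [
    "The King of Sweden lives in Stockholm.",
    "Stockholm is the capital of Sweden.",
    "It rained yesterday.",
]
phrases = ["king of sweden", "stockholm"]

print(summarize(sentences, phrases, top_n=1))
# → ['The King of Sweden lives in Stockholm.']
```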
  • How the algorithm works
  • You start with a seed article’s name, which sits in a queue waiting to be processed by one of the application’s worker roles (using Windows Azure Service Bus).
  • The article’s text and metadata are fetched from Wikipedia’s open search API.
  • The text is then analyzed using a sliding RegEx window. Each word has a look-behind and a look-ahead.
  • As each word is read, the bi-gram (2-word phrase) is matched against the entire text, looking ahead of or behind the current position.
  • If there is more than one match within the text being analyzed, then the multiple bi-grams turn into tri-grams by looking ahead one word for each match.
  • This process is repeated until the text returns no duplicate n-grams. At this point, any n-gram that has more than one match within the text of the article is stored in the Neo4j database as a phrase that is contained within the article’s node. Each sentence that contained at least one of the n-grams is also added to the database, with relationships pointing to both the article node and the phrase node that is contained within it.
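A rough Python sketch of this n-gram growing loop, assuming simple whitespace tokenization (the talk's actual RegEx-based implementation differs in its details):

```python
import re

# Sketch of the n-gram growing step: start from bi-grams, and whenever an
# n-gram repeats in the text, grow the window by one word and rescan,
# until no duplicated n-grams remain. Tokenization is a simplification.

def repeated_ngrams(text):
    words = re.findall(r"[a-z']+", text.lower())
    found = set()
    n = 2
    while n <= len(words):
        counts = {}
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            counts[gram] = counts.get(gram, 0) + 1
        dupes = {g for g, c in counts.items() if c > 1}
        if not dupes:
            break  # no duplicate n-grams at this length: stop growing
        found |= dupes
        n += 1  # extend the window by one word and rescan
    return found

text = ("The King of Sweden spoke. Later, the King of Sweden "
        "travelled to Stockholm.")
print(repeated_ngrams(text))
```

Every n-gram returned here would become a phrase node in the database, linked to the sentences and article it was found in.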
  • Furthermore, each phrase node can have an ancestry, because each phrase can be a derivative of some other phrase.
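A small sketch of that phrase ancestry, assuming a parent phrase is formed by dropping the final word — one plausible reading of "derivative"; the exact rule is not spelled out here:

```python
# Walk a phrase's ancestry by dropping the last word until only a bi-gram
# remains, e.g. "the king of sweden" -> "the king of" -> "the king".
# The dropping-the-last-word rule is an assumption for illustration.

def ancestry(phrase):
    words = phrase.split()
    chain = []
    while len(words) > 2:
        words = words[:-1]
        chain.append(" ".join(words))
    return chain

print(ancestry("the king of sweden"))
# → ['the king of', 'the king']
```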
  • Natural Language Processing with Neo4j

    1. Natural Language Processing with Neo4j. Kenny Bastani (@kennybastani)
    2. This is a hobby of mine. I’m passionate about it. It’s always a work in progress. I do it for fun.
    3. Machine Learning Focuses: • Text mining • Natural Language Processing • Automatic summarization • Graph databases • Commitment to unsupervised learning.
    4. Why NLP and Graphs?
    5. I wanted a better way to learn with less effort. I wanted something a little more zippy. I’m mostly self-taught, so I wanted something that made self-learning easier for others.
    6. The Idea. [Diagram: Articles contain Phrases; Phrases are found in Sentences; Sentences are found in Articles]
    7. Importance of NLP: • I’m inspired by the idea of machines learning from experience. • NLP is important for finding valuable information in noisy unstructured text. • I’m a Developer Evangelist for Neo4j, so I’m kind of a fan of graph databases.
    8. Algorithms can learn, as long as they can store information and retrieve it in enough time for it to be of any use.
    9. Learning requires storage. To learn, storage is required. For NLP, storage is sometimes a second-class citizen. Much focus is on the algorithm first, then storage second. But really, it’s the storage and retrieval of big data that is the problem.
    10. Machine learning. Machine learning isn’t magic or hard to understand. It’s real stuff. We know how to do it. It’s easily articulated. ML algorithms solve big computational problems today. It’s based on the idea of machines learning from prior experiences as data.
    11. Formulate a Hypothesis. When you analyze data, the outcome is usually a hypothesis. A hypothesis is a conclusion based on limited data. There are always more pieces needed to solve the puzzle.
    12. Build on Past Experience. By experience, I mean DATA. Machine learning techniques are entirely based on the collection and analysis of recorded data. So storage is really important if you want to do machine learning successfully. You cannot play baseball without your brain. Don’t try it.
    13. The Problem with AI. The problem with AI is that it seems like magic. Some people say strong AI is possible; others deny that it is possible. It is a central theme in many fictional fantasy films and book genres. It’s in Greek mythology.
    14. Is AI Misunderstood? Researchers admit to not fully understanding how intelligence works in the human brain. We generally understand how it works, but there is no consensus on how to recreate it in machines. AI is really just the act of perceiving an environment and maximizing the chances of success.
    15. You get the point. • Now why is a graph database useful for unsupervised machine learning? • Let’s consider the problem I stated earlier. • I wanted to build a better way to summarize and learn from Wikipedia’s combined knowledge.
    16. Unsupervised Learning on Wikipedia. [Diagram: Articles contain Phrases; Phrases are found in Sentences; Sentences are found in Articles]
    17. How do you learn about learning? I started by observing myself learning from reading Wikipedia articles. I searched for an interesting term on Google. I read through the article’s text word by word.
    18. The Learning Algorithm. As I read the article’s text, I would sometimes come across a phrase or term I had not seen before. Before continuing reading, I would open up a new tab and search for the unrecognized phrase. It was a well-defined recursive algorithm: I would drill down n times on unrecognized article terms until returning to the original article text.
    19. A Self-Learning Algorithm. In the computer’s world, this process would result in an ontology of labeled data, which looks a lot like a graph. But how would I store the results? If only there were a database for that…
    20. Neo4j is a graph database …and graphs are everywhere!
    21. Simple Clustering Model. [Diagram: Article contains Phrase; Phrase is found in Sentence; Sentence is found in Article]
    22. Summarizing Article Text
    23. What about the NLP stuff? This is how I did it.
    24. The seed article. You start with a seed article, which is the first article text to start the learning algorithm with.
    25. Fetch text from Wikipedia. Get the unstructured text and metadata from Wikipedia.
    26. Sliding text window. I formulated dynamic RegEx templates and treated them as a hypothesis. The RegEx template would slide word by word through the text, searching for unrecognized phrases (n known word matches + 1 wildcard word match).
    27. Looking for redundant phrases. As each unrecognized phrase is encountered, the dynamic RegEx is matched against the entire article’s text. The algorithm looks for more than 2 identical phrases within the article’s text. It appends a 3rd wildcard word match to the template and then rescans the text for redundant phrases until none are found.
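The "n known words + 1 wildcard word" template might look something like this with Python's re module; the talk's actual template syntax isn't shown, so the pattern below is a guess:

```python
import re

# Build a RegEx from n known words plus one wildcard word, then scan the
# whole text to see which word follows the known phrase at each occurrence.

def grow_phrase(known_words, text):
    pattern = r"\b" + r"\s+".join(map(re.escape, known_words)) + r"\s+(\w+)"
    return re.findall(pattern, text, flags=re.IGNORECASE)

text = "The King of Sweden spoke. The King of Sweden travelled."
print(grow_phrase(["king", "of"], text))
# → ['Sweden', 'Sweden']
```

Because the wildcard word captured is the same at both occurrences, the template can be extended by one more word and rescanned, which is the redundancy-growing loop the slide describes.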
    28. Identify Redundancy of Text. This recursive matching process within the local article’s text results in finding duplicate phrases of variable length. “The King of Sweden” appears twice in an article, so it must be important to the topic of Sweden. Better go search for an article stub on “The King of Sweden”.
    29. Graph Storage and Retrieval. Every time a phrase that doesn’t exist as a node in Neo4j is encountered, it becomes a target of investigation, kind of like a hypothesis. Each sentence that contains the extracted phrase is also added to Neo4j as a content node. Relationships are added between nodes, showing semantic relationships.
    30. Phrase inheritance. Phrases can be found within other phrases, denoting a grammatical inheritance hierarchy mapped to a variety of content nodes and articles.
    31. Phrase Inheritance Graph Data Model. [Diagram: Articles contain Sentences (“X MEN.”, “X Y Z.”); Phrases “X”, “X Y”, and “X Y Z” are found in those Sentences and in one another]
    32. Graphs are everywhere. Questions?
    33. Thanks for coming to my talk! Please look me up on Twitter and LinkedIn! Twitter: LinkedIn: