Document Classification with Neo4j

(graphs)-[:are]->(everywhere)
© All Rights Reserved 2014 | Neo Technology, Inc.
@kennybastani
Neo4j Developer Evangelist

Agenda
• Introduction to Neo4j
• Introduction to Graph-based Document Classification
• Graph-based Hierarchical Pattern Recognition
• Generating a Vector Space Model for Recommendations
• Graphify for Neo4j
• U.S. Presidential Speech Transcript Analysis

Introduction to Neo4j

The Property Graph Data Model

[Figure: nodes John and Sally, and the book Graph Databases]

[Figure: the same graph with properties]
Nodes:
• name: John, age: 27
• name: Sally, age: 32
• title: Graph Databases, authors: Ian Robinson, Jim Webber
Relationships:
• FRIEND_OF (since: 01/09/2013), shown twice
• HAS_READ (on: 2/03/2013, rating: 5)
• HAS_READ (on: 02/09/2013, rating: 4)

The Relational Table Model

[Figure: tables Customers, Customer_Accounts, and Accounts]

The Neo4j Browser

Neo4j Browser - finding help
http://localhost:7474/

Execute Cypher, Visualize

Introduction to Document Classification

Document Classification
Automatically assign a document to one or more classes
Documents may be classified according to their subjects or
according to other attributes
Automatically classify unlabeled documents to a set of relevant
classes using labeled training data

Example Use Cases for Document Classification

Sentiment Analysis for Movie Reviews
Scenario: A movie website allows users to submit reviews describing what they
either liked or disliked about a particular movie.
Problem: The user reviews are unstructured text.
How do I automatically generate a score indicating whether the review was
positive or negative?
Solution: Train a natural language parsing model on a dataset of previous
reviews that have been labeled as either positive or negative.

Recommend Relevant Tags
Scenario: A Q/A website allows users to submit questions and receive answers
from other users.
Problem: Users sometimes do not know which tags to apply to their questions in
order to increase discoverability and receive answers.
Solution: Automatically recommend the most relevant tags for a question by
classifying its text with a model trained on previous questions.

Recommend Similar Articles
Scenario: A news website provides hundreds of new articles a day to users on a
broad range of topics.
Problem: The site needs to increase user engagement and time spent on the site.
Solution: Train natural language parsing models for daily articles in order to
provide recommendations for highly relevant articles at the bottom of each page.

How Automated Document Classification Works

Step 1: Create a Training Dataset
Supervised Learning
Assign a set of labels that describes the document’s text
[Figure: four documents connected to the labels X, Y, and Z]

Step 2: Train a Natural Language Parsing Model
Deep Learning
p = State Machine
Deep feature representations are selected and learned using an evolutionary algorithm
State machines represent predicates that evaluate to 0 or 1 for a text match
State machines map to classes of document labels that matched text during training
[Figure: a hierarchy of state machines (p) mapped to classes X, Y, and Z]
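As a minimal sketch of a state machine acting as a predicate that returns 0 or 1 for a text match, the Python below uses regular expressions, which are themselves matched by finite state machines. The patterns, the class names, and the `matched_classes` helper are illustrative assumptions and are not taken from Graphify.

```python
import re

# Hypothetical features: each regular expression stands in for one state
# machine and maps to the classes whose training text it matched.
FEATURES = {
    r"\bgraph\s+database\b": {"X"},
    r"\bdocument\s+classification\b": {"Y", "Z"},
}


def matches(pattern: str, text: str) -> int:
    """Evaluate one state machine as a predicate: 1 on a match, else 0."""
    return 1 if re.search(pattern, text, flags=re.IGNORECASE) else 0


def matched_classes(text: str) -> set:
    """Collect the classes mapped to every feature that matches the text."""
    classes = set()
    for pattern, labels in FEATURES.items():
        if matches(pattern, text):
            classes |= labels
    return classes
```

Calling `matched_classes` on a new piece of text returns the union of the classes attached to each matching feature, which is the mapping from state machines to classes described on the slide.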

Step 3: Classify Unlabeled Documents
The natural language parsing model is used to classify other unlabeled documents
[Figure: an unlabeled document compared by cos(θ) with classes X, Y, and Z, scoring 0.99, 0.67, and 0.01]

Hierarchical Pattern Recognition (HPR)

What is Hierarchical Pattern Recognition (HPR)?
HPR is a graph-based deep learning algorithm I created that learns deep
feature representations in linear time.
The algorithm performs graph-based traversals using a hierarchy of finite
state machines (FSM).
Designed for scalable performance in P time: [equation not recoverable]

Influences & Inspirations
• Ray Kurzweil (Pattern Recognition Theory of Mind)
• Jeff Hawkins (Hierarchical Temporal Memory)
[Figure: the two influences combined (+, =) into a hierarchy of state machines (p) over classes X, Y, and Z]

How does feature extraction work?
“Deep” feature representations are learned and associated with labels, which
are mapped to the documents in which each feature was discovered.
The feature hierarchy is translated into a Vector Space Model for
classification on feature vectors generated from unlabeled text.
HPR uses a probabilistic model in combination with an evolutionary algorithm
to generate hierarchies of deep feature representations.
[Figure: a hierarchy of state machines (p) over classes X, Y, and Z]

Graph-based feature learning

Learning new features from matches on training data

Cost Function for the Generations of Features
Reproduction occurs after a threshold of matches has been exceeded for a
feature.
After replication, the cost function is applied to increase that threshold
every time the feature reproduces.
The cost function depends on the current threshold on the feature node and on
the minimum threshold, which I chose as 5 for new features.
Cost function: [equation not recoverable]

Vector Space Model

Generating Feature Vectors
The natural language parsing model created during training can be
turned into a global feature index.
This global feature index is a list of Neo4j internal IDs for every feature
in the hierarchy.
Using that global feature index, a multi-dimensional vector space is
created with a length equal to the number of features in the hierarchy.
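The construction described above can be sketched in Python as follows. The feature IDs and patterns are made-up stand-ins for Neo4j internal IDs and learned features, and using match counts as vector components is an assumption; the slides do not say how components are weighted.

```python
import re

# Hypothetical features keyed by a made-up stand-in for a Neo4j internal ID.
FEATURES = {
    17: r"\bgraph\b",
    42: r"\bdocument\b",
    99: r"\bneo4j\b",
}

# The global feature index: an ordered list of every feature ID.
FEATURE_INDEX = sorted(FEATURES)


def feature_vector(text: str) -> list:
    """Return one component per feature in the index, in index order.

    Each component is the number of times that feature matches the text
    (an assumed weighting; a 0/1 flag would work the same way).
    """
    return [
        len(re.findall(FEATURES[feature_id], text, flags=re.IGNORECASE))
        for feature_id in FEATURE_INDEX
    ]
```

Because every vector is built against the same index, two vectors always have the same length and position `i` always refers to the same feature, which is what makes them comparable.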

Relevance Rankings
“Relevance rankings of documents in a keyword search can be
calculated, using the assumptions of document similarities theory, by
comparing the deviation of angles between each document vector and
the original query vector where the query is represented as the same
kind of vector as the documents.” - Wikipedia

Vector-based Cosine Similarity Measure
In practice, it is easier to calculate the cosine of the angle between the
vectors, instead of the angle itself:
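For a document vector d and a query vector q with n components (symbol names chosen here for illustration), the standard cosine similarity is:

```latex
\cos\theta
  = \frac{\mathbf{d} \cdot \mathbf{q}}{\lVert \mathbf{d} \rVert \, \lVert \mathbf{q} \rVert}
  = \frac{\sum_{i=1}^{n} d_i q_i}{\sqrt{\sum_{i=1}^{n} d_i^{2}} \, \sqrt{\sum_{i=1}^{n} q_i^{2}}}
```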

Cosine Similarity & Vector Space Model

Vector-based Cosine Similarity Measure
“The resulting similarity ranges from -1 meaning exactly opposite, to 1
meaning exactly the same, with 0 usually indicating independence,
and in-between values indicating intermediate similarity or
dissimilarity.”
via Wikipedia
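A minimal Python sketch of the measure, assuming vectors are plain lists of numbers. The function names and the convention of returning 0.0 for a zero vector are choices made here for illustration, not part of Graphify.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def rank_classes(document: list, class_vectors: dict) -> list:
    """Sort (class, similarity) pairs from most to least similar."""
    scores = {
        name: cosine_similarity(document, vector)
        for name, vector in class_vectors.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

When every component is non-negative, as with match counts, the result stays between 0 and 1, so the negative half of the range does not occur.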

Graphify for Neo4j

Graphify for Neo4j
Graphify is a Neo4j unmanaged extension used for
document and text classification using graph-based
hierarchical pattern recognition.
https://github.com/kbastani/graphify

Example Project
Head over to the GitHub project page and clone it to your
local machine.
Follow the directions listed in the README.md to install the
extension.
Navigate to the /examples directory of the project.
Run:
examples/graphify-examples-author/src/java/org/neo4j/nlp/examples/author/main.java

U.S. Presidential Speech Transcript Analysis

Identify the Political Affiliation of a Presidential Speech
In the training phase, this example ingests a set of presidential speech
transcripts, each labeled with the author of that speech. After the training
models are built, unlabeled presidential speeches are classified in the test
phase.

The Presidents
• Ronald Reagan
  • labels: liberal, republican, ronald-reagan
• George H.W. Bush
  • labels: conservative, republican, bush41
• Bill Clinton
  • labels: liberal, democrat, bill-clinton
• George W. Bush
  • labels: conservative, republican, bush43
• Barack Obama
  • labels: liberal, democrat, barack-obama

Training
Each of the presidents in the example has 6 speeches to analyze.
4 of the speeches are used to build a natural language parsing model.
2 of the speeches are used to test the validity of that model.
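The 4/2 split can be sketched in Python as below. The file names are hypothetical, and taking the first four speeches for training is an assumption; the slides do not say how the speeches were chosen.

```python
def split_speeches(speeches: list, train_count: int = 4) -> tuple:
    """Split one president's speeches into (training, test) lists."""
    return speeches[:train_count], speeches[train_count:]


# Hypothetical file names for one president's six transcripts.
speeches = [f"ronald-reagan-{i}.txt" for i in range(1, 7)]
training, test = split_speeches(speeches)
```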

Get Similar Labels/Classes

Ronald Reagan
Class           Similarity
republican      0.7182046285385341
liberal         0.644281223102398
democrat        0.4854114595950056
conservative    0.4133639188595147
bill-clinton    0.4057969121945167
barack-obama    0.323947855372623
bush41          0.3222644898334092
bush43          0.3161309849153592

George H.W. Bush
Class           Similarity
conservative    0.7032274806766954
republican      0.6047256274615608
liberal         0.4439742461594541
democrat        0.39114918238853674
bill-clinton    0.3234223107986785
ronald-reagan   0.3222644898334092
barack-obama    0.2929260544514002
bush43          0.29106733975087984

Bill Clinton
Class           Similarity
democrat        0.8375678825642422
liberal         0.7847858060182163
republican      0.5561860529059708
conservative    0.45365774896422445
barack-obama    0.4507676679770066
ronald-reagan   0.4057969121945167
bush43          0.365042482383354
bush41          0.3234223107986785

George W. Bush
Class           Similarity
conservative    0.820636570272315
republican      0.7056890956512284
liberal         0.5075788396061254
democrat        0.4505424322086937
bill-clinton    0.365042482383354
barack-obama    0.33801949243378965
ronald-reagan   0.3161309849153592
bush41          0.29106733975087984

Barack Obama
Class           Similarity
democrat        0.7668017370739147
liberal         0.7184792203867296
republican      0.4847680475425114
bill-clinton    0.4507676679770066
conservative    0.4149264161292232
bush43          0.33801949243378965
ronald-reagan   0.323947855372623
bush41          0.2929260544514002

Get involved in the Neo4j community

http://stackoverflow.com/questions/tagged/neo4j

http://groups.google.com/group/neo4j

https://github.com/neo4j/neo4j/issues

http://neo4j.meetup.com/

(Thank You)

Get in touch
Twitter www.twitter.com/kennybastani
LinkedIn www.linkedin.com/in/kennybastani
GitHub www.github.com/kbastani

