Title: Using Graph Theory to understand User Intent
Subtitle: Graph-based Natural Language Processing applied to real-time Machine Learning
Abstract:
We are in a Graph Renaissance period. The advent of high-performance free/open-source software, combined with inexpensive Cloud computing platforms, enables graphs of information to be manipulated and utilised at scales never before seen. While use-cases like mining social and web data with graphs are commonplace, their use in Natural Language Processing has largely been overlooked. In this presentation Michael Cutler will describe how TUMRA have used graph-based NLP algorithms as a core component of their upcoming digital marketing product, TUMRA Optimize.
Presenter: Michael Cutler
Bio:
Michael is the CTO and co-founder of TUMRA, a Data Science startup based in Chiswick, West London. Having first discovered Hadoop back in 2008, Michael has been following the bleeding edge of ‘Big Data’ technology since before it was called ‘Big Data’, and has applied it to solve real-world problems.
Before starting TUMRA, Michael was a senior researcher in the R&D labs for British Sky Broadcasting, inventing new technologies and solutions for everything from Satellite, Video and Network systems through to Web and Mobile-based applications.
Website: http://tumra.com http://cotdp.com
Twitter: @tumra @cotdp
The document discusses using graphs and Neo4j for natural language processing tasks. It describes representing text as a graph by connecting adjacent words, and using this representation to find word associations and do opinion mining. Graph-based summarization and content recommendation are also covered. The resources provided give examples of opinion summarization using shortest path algorithms on the graph representation of reviews.
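As a rough illustration of the word-adjacency representation described above (not TUMRA's actual implementation), the following Python sketch links each word of a review to its neighbours, then uses breadth-first search to find a shortest path between an aspect word and an opinion word, the core idea behind the shortest-path opinion summarization mentioned in the resources:

```python
from collections import defaultdict, deque

def build_word_graph(text):
    """Build an undirected graph linking each word to its adjacent words."""
    words = text.lower().split()
    graph = defaultdict(set)
    for a, b in zip(words, words[1:]):
        graph[a].add(b)
        graph[b].add(a)
    return graph

def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest word-to-word path or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

review = "the battery life is great but the screen is dim"
g = build_word_graph(review)
print(shortest_path(g, "battery", "great"))
```

On a real corpus the edges would typically carry co-occurrence weights; this unweighted toy version only shows the structure of the approach.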
This document summarizes a webinar about importing crime data from Chicago into Neo4j. It discusses loading a CSV file of crime data into Neo4j using LOAD CSV and creating nodes and relationships. It also describes using Spark to preprocess the CSV into multiple Neo4j-formatted files and bulk loading them using the Neo4j Import tool. The document then covers enriching the graph with additional crime data from JSON and updating the graph with new crimes.
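The preprocessing step described above, splitting one raw CSV into separate node and relationship files shaped for the Neo4j Import tool's header conventions (`:ID`, `:START_ID`, `:END_ID`, `:TYPE`), can be sketched in plain Python. The column names below are invented stand-ins for the real Chicago dataset:

```python
import csv
import io

# Hypothetical rows resembling the Chicago crime extract (column names invented).
raw = io.StringIO(
    "id,primary_type,beat\n"
    "1001,THEFT,0423\n"
    "1002,BATTERY,0423\n"
)

crime_nodes, beat_nodes, rels = [], set(), []
for row in csv.DictReader(raw):
    # One node record per crime; beats are deduplicated into their own node set.
    crime_nodes.append((row["id"], row["primary_type"]))
    beat_nodes.add(row["beat"])
    # Relationship records reference the node IDs on either end.
    rels.append((row["id"], row["beat"], "ON_BEAT"))
```

In the webinar this splitting is done with Spark for scale; the output files would carry headers like `crimeId:ID(Crime)` and `:START_ID(Crime),:END_ID(Beat),:TYPE` before being handed to the bulk importer.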
This document provides a summary of using Apache Spark for continuous analytics and optimization. It discusses using Spark for collecting data from various sources, processing the data using Spark's capabilities for streaming, machine learning and SQL queries, and reporting insights. An example use case is presented for social media analysis using Spark Streaming to process a real-time data stream from Kafka and analyze the data using both the Spark SQL and core Spark APIs in Scala.
Using Neo4j for Enterprise Metadata Requirements (Neo4j)
Metadata is everywhere, yet traditional approaches to managing it have been disparate, siloed and often ineffective.
In this talk James will discuss the opportunities for using graph technology to address the fundamental challenges and questions of metadata management such as impact analysis, data lineage and definitions.
Data to Value are a Data Consultancy based in London that specialise in applying lean and agile techniques to complex data requirements. Connected Data is a particular focus for the firm which they see as the new frontier for data leaders.
James Phare has over 15 years' experience of creating and leading data teams in various roles in Financial Services. Prior to co-founding data consultancy Data to Value, he was Head of Information Management and Data Architecture at Man Group, one of the world's largest hedge funds. James started his career at Thomson Reuters after graduating in Economics from the University of York.
The document provides an introduction and agenda for a Neo4j workshop about modeling Game of Thrones data. It discusses loading Game of Thrones character, episode and house data into Neo4j and demonstrates different Cypher queries to analyze relationships between these entities such as determining the most prominent characters or how houses are related. The workshop also covers more advanced topics like aggregation queries using the WITH clause and deleting or constraining data.
Right now, in institutions around the world, some of the greatest minds in computer science and statistics are coming up with amazing new algorithms and mathematically beautiful solutions. However, it's entirely possible that the solutions they conceive will be impracticable in industry. The reason is simple: "the best answer is useless if it arrives too late to do anything with it". The key principle here is the trade-off between 'accuracy' and 'latency'. In this talk I will describe examples where this holds true, and how I am using real-time machine learning models to solve challenges in eCommerce, Financial Services and Media companies.
http://tumra.com/blog/real-time-machine-learning-at-industrial-scale
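The accuracy/latency compromise can be made concrete with a deliberately simple sketch (not one of the models from the talk): computing a mean over the full dataset versus estimating it from a random sample trades a small, bounded error for a large reduction in work, exactly the kind of bargain real-time systems make:

```python
import random

random.seed(7)
# A million records would be the realistic case; 200k keeps the sketch quick.
events = [random.gauss(100.0, 15.0) for _ in range(200_000)]

exact = sum(events) / len(events)      # full scan: accurate, touches every record
sample = random.sample(events, 10_000)
approx = sum(sample) / len(sample)     # 5% sample: bounded error, 1/20th the work

error = abs(exact - approx)
print(f"exact={exact:.3f} approx={approx:.3f} error={error:.3f}")
```

With a 10,000-element sample the standard error of the estimate is roughly 15/sqrt(10000) = 0.15, so the answer arrives much sooner and is still close enough for most online decisions.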
Use cases and examples using Apache Spark, presented at the Hadoop User Group (UK) November 2014 Hadoop Meetup
http://www.meetup.com/hadoop-users-group-uk/events/217791892/
This document summarizes a presentation about the graph database Neo4j. The presentation included an agenda that covered graphs and their power, how graphs change data views, and real-time recommendations with graphs. It introduced the presenters and discussed how data relationships unlock value. It described how Neo4j allows modeling data as a graph to unlock this value through relationship-based queries, evolution of applications, and high performance at scale. Examples showed how Neo4j outperforms relational and NoSQL databases when relationships are important. The presentation concluded with examples of how Neo4j customers have benefited.
Webinar: Large Scale Graph Processing with IBM Power Systems & Neo4j (Neo4j)
We live in a profoundly connected world. From supply chains to payment networks to digital business and complex portfolios, our ability to understand and navigate not just data, but the relationships inside the data, plays an increasingly important role in all aspects of business. Highly connected value chains that generate massive volumes of connected data create an opportunity for graph analysis, which Gartner describes as "the single most effective competitive differentiator for organizations pursuing data-driven operations and decisions." This talk will introduce the power of graph databases and share how the latest IBM Power Systems offerings featuring the POWER8 processor and CAPI-attached Flash enable unique scaling, performance and price-performance advantages for Neo4j workloads.
Relational databases were conceived to digitize paper forms and automate well-structured business processes, and still have their uses. But RDBMS cannot model or store data and its relationships without complexity, which means performance degrades with the increasing number and levels of data relationships and data size. Additionally, new types of data and data relationships require schema redesign that increases time to market.
A graph database like Neo4j naturally stores, manages, analyzes, and uses data within the context of its connections, meaning Neo4j provides faster query performance and vastly improved flexibility in handling complex hierarchies than SQL databases. Join this webinar to learn why companies are shifting away from RDBMS towards graphs to unlock the business value in their data relationships.
Ryan Boyd, Developer Relations at Neo4j
Ryan is a SF-based software engineer focused on helping developers understand the power of graph databases. Previously he was a product manager for architectural software, built applications and web hosting environments for higher education, and worked in developer relations for twenty products during his 8 years at Google. He enjoys cycling, sailing, skydiving, and many other adventures when not in front of his computer.
Deploying Massive Scale Graphs for Realtime Insights (Neo4j)
Graph databases have been at the forefront of helping organizations manage and generate insights from data relationships, and of applying those insights in real time to drive competitive advantage. As organizations gain value from deploying graph databases, the data volumes managed are growing exponentially, pushing the limits of large-scale in-memory graph processing. Neo4j and IBM Power Systems have combined forces to deliver a market-leading scalable graph database platform capable of affordably storing and processing graphs of extremely large size and offering real-time insights, using flash and FPGA accelerators. In this session we will cover the use cases driving the need for this platform and how it offers an easy-to-deploy model for extreme-scale graph databases.
Neo4j Graphs in the Real World - Graph Days D.C. - April 14, 2015 (Neo4j)
This document discusses several real-world use cases for graph databases across different industries:
1) It describes how graph databases have been used for master data management by companies like die Bayerische insurance and Classmates social network to create a unified view of customer and organizational data.
2) Graphs have also been applied to network and IT operations management by the Royal Netherlands Meteorological Institute to optimize infrastructure and by Telenor for identity and access management.
3) Fraud detection in industries like banking, insurance, and ecommerce is another common use case, with graphs helping to connect discrete user accounts and transactions to detect rings of fraudulent activity.
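The fraud-ring pattern in point 3 amounts to finding connected components among accounts that share identifiers. A minimal union-find sketch in Python (with invented account data) shows the idea without a graph database:

```python
class UnionFind:
    """Disjoint-set structure: accounts sharing any identifier end up in one set."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Hypothetical accounts sharing phone numbers / addresses (data invented).
accounts = {
    "acct1": {"phone:555-0100", "addr:12 Oak St"},
    "acct2": {"phone:555-0100"},
    "acct3": {"addr:12 Oak St"},
    "acct4": {"phone:555-0199"},
}

uf = UnionFind()
for acct, identifiers in accounts.items():
    for ident in identifiers:
        uf.union(acct, ident)

# All accounts in the same component as a flagged account form a candidate ring.
ring = {a for a in accounts if uf.find(a) == uf.find("acct1")}
print(ring)
```

A graph database makes this kind of linkage queryable interactively and incrementally; the sketch only shows why shared identifiers connect otherwise discrete accounts.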
This introduction to graph databases is specifically designed for Enterprise Architects who need to map business requirements to architectural components like graph databases. It explains how and why graphs matter for Enterprise Architecture and reviews the architectural differences between relational and graph models.
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb... (Trey Grainger)
Search engines frequently miss the mark when it comes to understanding user intent. This talk will walk through some of the key building blocks necessary to turn a search engine into a dynamically-learning "intent engine", able to interpret and search on meaning, not just keywords. We will walk through CareerBuilder's semantic search architecture, including semantic autocomplete, query and document interpretation, probabilistic query parsing, automatic taxonomy discovery, keyword disambiguation, and personalization based upon user context/behavior. We will also see how to leverage an inverted index (Lucene/Solr) as a knowledge graph that can be used as a dynamic ontology to extract phrases, understand and weight the semantic relationships between those phrases and known entities, and expand the query to include those additional conceptual relationships.
As an example, most search engines completely miss the mark at parsing a query like (Senior Java Developer Portland, OR Hadoop). We will show how to dynamically understand that "senior" designates an experience level, that "java developer" is a job title related to "software engineering", that "portland, or" is a city with a specific geographical boundary (as opposed to a keyword followed by a boolean operator), and that "hadoop" is the skill "Apache Hadoop", which is also related to other terms like "hbase", "hive", and "map/reduce". We will discuss how to train the search engine to parse the query into this intended understanding and how to reflect this understanding to the end user to provide an insightful, augmented search experience.
Topics: Semantic Search, Apache Solr, Finite State Transducers, Probabilistic Query Parsing, Bayes Theorem, Augmented Search, Recommendations, Query Disambiguation, NLP, Knowledge Graphs
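A toy version of the query parsing described above can be sketched with greedy longest-phrase matching against small hand-made dictionaries. The real system learns its taxonomies probabilistically from data; everything here is invented for illustration:

```python
# Toy dictionaries standing in for learned taxonomies (entries invented).
SENIORITY = {"senior", "junior", "lead"}
JOB_TITLES = {"java developer", "software engineer"}
CITIES = {"portland, or", "austin, tx"}
SKILLS = {"hadoop", "hive", "hbase"}

def parse_query(q):
    """Tag query tokens by trying the longest matching phrase first."""
    tokens = q.lower().split()
    tags, i = [], 0
    while i < len(tokens):
        for n in (3, 2, 1):  # greedy: 3-grams before 2-grams before unigrams
            phrase = " ".join(tokens[i:i + n])
            kind = ("seniority" if phrase in SENIORITY else
                    "title" if phrase in JOB_TITLES else
                    "city" if phrase in CITIES else
                    "skill" if phrase in SKILLS else None)
            if kind:
                tags.append((phrase, kind))
                i += n
                break
        else:
            tags.append((tokens[i], "keyword"))
            i += 1
    return tags

print(parse_query("Senior Java Developer Portland, OR Hadoop"))
```

Note how the 2-gram pass consumes "portland, or" as a city before the unigram pass could misread "or" as a boolean operator, which is the disambiguation the talk's example hinges on.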
This tutorial will provide you with a basic understanding of graph database technology and the ability to quickly begin development of a graph database application. You will have the capability to recognize graph-based problems and present the benefits of using graph technology for problem resolution.
The tutorial will give you an understanding of:
• Graph theory - origins and concepts
• Benefits of graph databases
• Different types of graph databases
• Typical graph database API
• Programming basics
• Use cases
Bring your laptop for a hands-on opportunity to work through some sample code. A basic understanding of Java programming is a recommended prerequisite for this course. This session is led by the InfiniteGraph technical team and the demonstration code will be drawn from InfiniteGraph examples; however, the broader educational presentation is product-neutral and not a commercial presentation of their products.
To participate in the hands-on portion of the graph tutorial users must have:
• Java programming experience
• Java Developer Kit (JDK)
• Current InfiniteGraph installed on laptop. (To download visit www.objectivity.com/infinitegraph)
• HelloGraph test – Upon installing IG, run HelloGraph to test the install. (HelloGraph can be found online at http://wiki.infinitegraph.com/2.1/w/index.php?title=Download_Sample_Code)
Leon Guzenda was one of the founding members of Objectivity in 1988 and one of the original architects of Objectivity/DB. He currently works with Objectivity's major customers to help them effectively develop and deploy complex applications and systems that use the industry's highest-performing, most reliable DBMS technology, Objectivity/DB. He also liaises with technology partners and industry groups to help ensure that Objectivity/DB remains at the forefront of database and distributed computing technology. Leon has more than 35 years' experience in the software industry. At Automation Technology Products, he managed the development of the ODBMS for the Cimplex solid modeling and numerical control system. Before that, he was Principal Project Director for International Computers Ltd. in the United Kingdom, delivering major projects for NATO and leading multinationals. He was also design and development manager for ICL's 2900 IDMS product. He spent the first 7 years of his career working in defense and government systems. Leon has a B.S. degree in Electronic Engineering from the University of Wales.
The document discusses using a graph database foundation for customer information management. It outlines how traditional approaches using rigid SQL databases lack agility and require long implementation cycles. A graph database like Neo4j allows for a more flexible data model that can be visually modeled to the business and provide multi-dimensional, contextual views of customer data. It also discusses how the graph database can integrate diverse data sources, apply data quality processes, and provide insights through querying and visualization of the knowledge graph.
An Introduction to NOSQL, Graph Databases and Neo4j (Debanjan Mahata)
Neo4j is a graph database that stores data in nodes and relationships. It allows for efficient querying of connected data through graph traversals. Key aspects include nodes that can contain properties, relationships that connect nodes and also contain properties, and the ability to navigate the graph through traversals. Neo4j provides APIs for common graph operations like creating and removing nodes/relationships, running traversals, and managing transactions. It is well suited for domains that involve connected, semi-structured data like social networks.
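The node/relationship/traversal model summarized above can be sketched in a few lines of Python. This is an illustration of the property-graph concept, not Neo4j's actual API; class and parameter names are invented:

```python
class Node:
    """A node with arbitrary properties and a list of outgoing relationships."""
    def __init__(self, **props):
        self.props = props
        self.rels = []

class Relationship:
    """A typed, directed edge that can itself carry properties."""
    def __init__(self, start, end, rel_type, **props):
        self.start, self.end, self.type, self.props = start, end, rel_type, props
        start.rels.append(self)

def traverse(node, rel_type, depth=1):
    """Return all nodes reachable along `rel_type` edges within `depth` hops."""
    frontier, found = {node}, set()
    for _ in range(depth):
        frontier = {r.end for n in frontier for r in n.rels if r.type == rel_type}
        found |= frontier
    return found

alice, bob, carol = Node(name="Alice"), Node(name="Bob"), Node(name="Carol")
Relationship(alice, bob, "KNOWS", since=2010)
Relationship(bob, carol, "KNOWS", since=2012)
print(sorted(n.props["name"] for n in traverse(alice, "KNOWS", depth=2)))  # ['Bob', 'Carol']
```

The point of the structure is that each hop is a direct pointer dereference from the current node, which is why traversals over connected, semi-structured data stay cheap regardless of total graph size.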
Relational databases power most applications, but new use-cases have requirements that they are not well suited for.
That's why new approaches like graph databases are used to handle join-heavy, highly-connected and realtime aspects of your applications.
This talk compares relational and graph databases, show similarities and important differences.
We do a hands-on, deep-dive into ease of data modeling and structural evolution, massive data import and high performance querying with Neo4j, the most popular graph database.
I demonstrate a useful tool which makes data import from existing relational databases with a non-denormalized ER-model a "one click"-experience.
The biggest challenge for people coming from a relational background is adapting some of their existing database experience to new ways of thinking.
Importing Data into Neo4j Quickly and Easily - StackOverflow (Neo4j)
In this GraphConnect presentation Mark and Michael show several ways to import large amounts of highly connected data from different formats into Neo4j. Both Cypher's LOAD CSV as well as the bulk importer is demonstrated along with many tips.
We use the well-known StackOverflow Q&A site data, which is interestingly very graphy.
This document introduces Neo4j, the world's leading graph database. It discusses Neo4j's product and company details, and how graph databases differ from other databases in their focus on the relationships between connected data. Common use cases for Neo4j are also summarized, such as recommendations, master data management, network operations, identity and access management, and fraud detection. The document provides examples of how customers use Neo4j and discusses patterns of fraud that Neo4j can help detect.
The document discusses using Neo4j and graph databases for fraud detection solutions. It describes how Neo4j allows for agile development, high productivity, and real-time response times when working with connected fraud data. The document outlines a fraud detection demo using Neo4j to load operational data, inject fraud cases, generate alerts, and export detected fraud data for investigation. It proposes using Neo4j as the foundation for a 360-degree fraud prevention solution integrated with other systems and data sources.
The document provides an overview of the internal workings of Neo4j. It describes how the graph data is stored on disk as linked lists of fixed size records and how two levels of caching work - a low-level filesystem cache and a high-level object cache that stores node and relationship data in a structure optimized for traversals. It also explains how traversals are implemented using relationship expanders and evaluators to iteratively expand paths through the graph, and how Cypher builds on this but uses graph pattern matching rather than the full traversal system.
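The store layout described above, fixed-size records chained into linked lists, can be mimicked in Python. This simplified sketch chains relationships only by their start node, whereas the real store chains each relationship record by both endpoints:

```python
# Each "record" has a fixed shape, mimicking fixed-size store files where a
# record's offset is simply its ID times the record size.
NIL = -1
node_store = []   # each entry: [first_rel_index]
rel_store = []    # each entry: (start_node, end_node, next_rel_for_start)

def create_node():
    node_store.append([NIL])
    return len(node_store) - 1

def create_rel(start, end):
    # New relationship becomes the head of the start node's relationship chain.
    rel_store.append((start, end, node_store[start][0]))
    node_store[start][0] = len(rel_store) - 1

def neighbours(node):
    """Walk the relationship chain in O(degree), independent of graph size."""
    out, r = [], node_store[node][0]
    while r != NIL:
        _, end, nxt = rel_store[r]
        out.append(end)
        r = nxt
    return out

a, b, c = create_node(), create_node(), create_node()
create_rel(a, b)
create_rel(a, c)
print(neighbours(a))  # [2, 1], newest relationship first
```

Because a record's position encodes its ID, finding a node or relationship never requires an index lookup, which is the property the traversal framework and Cypher's pattern matching both build on.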
This document contains the agenda for the Neo4j Partner Day event in Amsterdam on March 16th, 2017. The agenda includes sessions on the business potential for graph database partners, real-world Neo4j applications, an overview of the Neo4j partner program, and networking sessions.
Max De Marzi gave an introduction to graph databases using Neo4j as an example. He discussed trends in big, connected data and how NoSQL databases like key-value stores, column families, and document databases address these trends. However, graph databases are optimized for interconnected data by modeling it as nodes and relationships. Neo4j is a graph database that uses a property graph data model and allows querying and traversal through its Cypher query language and Gremlin scripting language. It is well-suited for domains involving highly connected data like social networks.
Neo4j Partner Day Berlin - Potential for System Integrators and Consultants (Neo4j)
This document summarizes a Neo4j partner event. It includes an agenda with sessions on the business potential of Neo4j for system integrators and consultants, the Neo4j partner program, and a case study on using Neo4j to analyze the Panama Papers. There are also sessions on quickly gaining value from Neo4j and on modeling logistics processes with Neo4j.
TrustArc Webinar - 2024 Global Privacy Survey (TrustArc)
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Webinar: Large Scale Graph Processing with IBM Power Systems & Neo4jNeo4j
We live in a profoundly connected world. From supply chains to payment networks to digital business and complex portfolios, our ability to understand and navigate not just data, but relationships inside the data, play an increasingly important role in all aspects of business. Highly connected value chains that generate massive volumes of connected data create an opportunity for graph analysis, which Gartner describes as "the single most single most effective competitive differentiator for organizations pursuing data-driven operations and decisions." This talk will introduce the power of graph databases and share how the latest IBM Power Systems offerings featuring the POWER8 processor and CAPI-attached Flash enable unique scaling, performance and price-performance advantages for Neo4j workloads.
Relational databases were conceived to digitize paper forms and automate well-structured business processes, and still have their uses. But RDBMS cannot model or store data and its relationships without complexity, which means performance degrades with the increasing number and levels of data relationships and data size. Additionally, new types of data and data relationships require schema redesign that increases time to market.
A graph database like Neo4j naturally stores, manages, analyzes, and uses data within the context of connections meaning Neo4j provides faster query performance and vastly improved flexibility in handling complex hierarchies than SQL. Join this webinar to learn why companies are shifting away from RDBMS towards graphs to unlock the business value in their data relationships.
Ryan Boyd, Developer Relations at Neo4j
Ryan is a SF-based software engineer focused on helping developers understand the power of graph databases. Previously he was a product manager for architectural software, built applications and web hosting environments for higher education, and worked in developer relations for twenty products during his 8 years at Google. He enjoys cycling, sailing, skydiving, and many other adventures when not in front of his computer.
Deploying Massive Scale Graphs for Realtime InsightsNeo4j
Graph databases have been at the forefront of helping organizations manage and generate insights from data relationships, and applying those insights in real-time to drive competitive advantage. As organizations gain value in deploying graph databases, the data volumes managed are growing exponentially pushing the limits of large-scale in-memory graph processing. Neo4j and IBM Power Systems combined forces to deliver a market leading scalable graph database platform capable of affordably storing and processing graphs of extremely large size and offering real-time insights, using flash and FPGA accelerators. In this session we will cover the use cases driving the need for this extremely scalable platform and how this platform offers an easy to deploy model for extreme scale graph databases.
Neo4j graphs in the real world - graph days d.c. - april 14, 2015Neo4j
This document discusses several real-world use cases for graph databases across different industries:
1) It describes how graph databases have been used for master data management by companies like die Bayerische insurance and Classmates social network to create a unified view of customer and organizational data.
2) Graphs have also been applied to network and IT operations management by the Royal Netherlands Meteorological Institute to optimize infrastructure and by Telenor for identity and access management.
3) Fraud detection in industries like banking, insurance, and ecommerce is another common use case, with graphs helping to connect discrete user accounts and transactions to detect rings of fraudulent activity.
This introduction to graph databases is specifically designed for Enterprise Architects who need to map business requirements to architectural components like graph databases. It explains how and why graphs matter for Enterprise Architecture and reviews the architectural differences between relational and graph models.
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Trey Grainger
Search engines frequently miss the mark when it comes to understanding user intent. This talk will walk through some of the key building blocks necessary to turn a search engine into a dynamically-learning "intent engine", able to interpret and search on meaning, not just keywords. We will walk through CareerBuilder's semantic search architecture, including semantic autocomplete, query and document interpretation, probabilistic query parsing, automatic taxonomy discovery, keyword disambiguation, and personalization based upon user context/behavior. We will also see how to leverage an inverted index (Lucene/Solr) as a knowledge graph that can be used as a dynamic ontology to extract phrases, understand and weight the semantic relationships between those phrases and known entities, and expand the query to include those additional conceptual relationships.
As an example, most search engines completely miss the mark at parsing a query like (Senior Java Developer Portland, OR Hadoop). We will show how to dynamically understand that "senior" designates an experience level, that "java developer" is a job title related to "software engineering", that "portland, or" is a city with a specific geographical boundary (as opposed to a keyword followed by a boolean operator), and that "hadoop" is the skill "Apache Hadoop", which is also related to other terms like "hbase", "hive", and "map/reduce". We will discuss how to train the search engine to parse the query into this intended understanding and how to reflect this understanding to the end user to provide an insightful, augmented search experience.
Topics: Semantic Search, Apache Solr, Finite State Transducers, Probabilistic Query Parsing, Bayes Theorem, Augmented Search, Recommendations, Query Disambiguation, NLP, Knowledge Graphs
Relational databases were conceived to digitize paper forms and automate well-structured business processes, and still have their uses. But RDBMS cannot model or store data and its relationships without complexity, which means performance degrades with the increasing number and levels of data relationships and data size. Additionally, new types of data and data relationships require schema redesign that increases time to market.
A graph database like Neo4j naturally stores, manages, analyzes, and uses data within the context of its connections, meaning Neo4j provides faster query performance and vastly improved flexibility in handling complex hierarchies compared to SQL. Join this webinar to learn why companies are shifting away from RDBMS towards graphs to unlock the business value in their data relationships.
This tutorial will provide you with a basic understanding of graph database technology and the ability to quickly begin development of a graph database application. You will have the capability to recognize graph-based problems and present the benefits of using graph technology for problem resolution.
The tutorial will give you an understanding of:
• Graph theory - origins and concepts
• Benefits of graph databases
• Different types of graph databases
• Typical graph database API
• Programming basics
• Use cases
Bring your laptops for a hands-on opportunity to work through some sample code. A basic understanding of Java programming is a recommended prerequisite for this course. This session is led by the InfiniteGraph technical team and the demonstration code will be drawn from InfiniteGraph examples; however, the broader educational presentation is product-neutral rather than a commercial presentation of their products.
To participate in the hands-on portion of the graph tutorial users must have:
• Java programming experience
• Java Developer Kit (JDK)
• Current InfiniteGraph installed on laptop. (To download visit www.objectivity.com/infinitegraph)
• HelloGraph test – Upon installing IG, run HelloGraph to test the install. (HelloGraph can be found online at http://wiki.infinitegraph.com/2.1/w/index.php?title=Download_Sample_Code)
Leon Guzenda was one of the founding members of Objectivity in 1988 and one of the original architects of Objectivity/DB. He currently works with Objectivity's major customers to help them effectively develop and deploy complex applications and systems that use the industry's highest-performing, most reliable DBMS technology, Objectivity/DB. He also liaises with technology partners and industry groups to help ensure that Objectivity/DB remains at the forefront of database and distributed computing technology. Leon has more than 35 years' experience in the software industry. At Automation Technology Products, he managed the development of the ODBMS for the Cimplex solid modeling and numerical control system. Before that, he was Principal Project Director for International Computers Ltd. in the United Kingdom, delivering major projects for NATO and leading multinationals. He was also design and development manager for ICL's 2900 IDMS product. He spent the first 7 years of his career working in defense and government systems. Leon has a B.S. degree in Electronic Engineering from the University of Wales.
The document discusses using a graph database foundation for customer information management. It outlines how traditional approaches using rigid SQL databases lack agility and require long implementation cycles. A graph database like Neo4j allows for a more flexible data model that can be visually modeled to the business and provide multi-dimensional, contextual views of customer data. It also discusses how the graph database can integrate diverse data sources, apply data quality processes, and provide insights through querying and visualization of the knowledge graph.
An Introduction to NOSQL, Graph Databases and Neo4j (Debanjan Mahata)
Neo4j is a graph database that stores data in nodes and relationships. It allows for efficient querying of connected data through graph traversals. Key aspects include nodes that can contain properties, relationships that connect nodes and also contain properties, and the ability to navigate the graph through traversals. Neo4j provides APIs for common graph operations like creating and removing nodes/relationships, running traversals, and managing transactions. It is well suited for domains that involve connected, semi-structured data like social networks.
Relational databases power most applications, but new use-cases have requirements that they are not well suited for.
That's why new approaches like graph databases are used to handle join-heavy, highly-connected and realtime aspects of your applications.
This talk compares relational and graph databases, showing similarities and important differences.
We do a hands-on, deep-dive into ease of data modeling and structural evolution, massive data import and high performance querying with Neo4j, the most popular graph database.
I demonstrate a useful tool which makes data import from existing relational databases with a non-denormalized ER-model a "one click"-experience.
The biggest challenge for people coming from a relational background is adapting some of their existing database experience to new ways of thinking.
Importing Data into Neo4j quickly and easily: StackOverflow (Neo4j)
In this GraphConnect presentation Mark and Michael show several ways to import large amounts of highly connected data from different formats into Neo4j. Both Cypher's LOAD CSV as well as the bulk importer is demonstrated along with many tips.
We use the well-known StackOverflow Q&A site data, which is interestingly very graphy.
This document introduces Neo4j, the world's leading graph database. It discusses Neo4j's product and company details and how graph databases differ from other databases by focusing on relationships between connected data. Common use cases for Neo4j are also summarized, such as recommendations, master data management, network operations, identity and access management, and fraud detection. The document provides examples of how customers use Neo4j and discusses patterns of fraud that Neo4j can help detect.
The document discusses using Neo4j and graph databases for fraud detection solutions. It describes how Neo4j allows for agile development, high productivity, and real-time response times when working with connected fraud data. The document outlines a fraud detection demo using Neo4j to load operational data, inject fraud cases, generate alerts, and export detected fraud data for investigation. It proposes using Neo4j as the foundation for a 360-degree fraud prevention solution integrated with other systems and data sources.
The document provides an overview of the internal workings of Neo4j. It describes how the graph data is stored on disk as linked lists of fixed size records and how two levels of caching work - a low-level filesystem cache and a high-level object cache that stores node and relationship data in a structure optimized for traversals. It also explains how traversals are implemented using relationship expanders and evaluators to iteratively expand paths through the graph, and how Cypher builds on this but uses graph pattern matching rather than the full traversal system.
This document contains the agenda for the Neo4j Partner Day event in Amsterdam on March 16th, 2017. The agenda includes sessions on the business potential for graph database partners, real-world Neo4j applications, an overview of the Neo4j partner program, and networking sessions.
Max De Marzi gave an introduction to graph databases using Neo4j as an example. He discussed trends in big, connected data and how NoSQL databases like key-value stores, column families, and document databases address these trends. However, graph databases are optimized for interconnected data by modeling it as nodes and relationships. Neo4j is a graph database that uses a property graph data model and allows querying and traversal through its Cypher query language and Gremlin scripting language. It is well-suited for domains involving highly connected data like social networks.
Neo4j Partner Day Berlin: Potential for System Integrators and Consultants (Neo4j)
This document summarizes a Neo4j partner event. It includes an agenda with sessions on the business potential of Neo4j for system integrators and consultants, the Neo4j partner program, and a case study on using Neo4j to analyze the Panama Papers. There are also sessions on quickly gaining value from Neo4j and on modeling logistics processes with Neo4j.
Using Graph theory to understand Intent & Concepts - Neo4j User Group (January 2013)
1. Using Graph Theory to understand Intent & Concepts – January 2013
tumra.com
2. UNDERSTANDING INTENT & CONCEPTS
• Use case:
- Enhancing Social TV user experience
- Matching users to content that interests them
• Topics we’ll cover:
- Natural Language Processing
- Graph Theory
- Machine Learning
tumra.com
3. USE CASE ENHANCED SOCIAL TV
• Objectives:
- Increase engagement with content
- Enhance multi-channel user experience
• We built a prototype solution:
- Mines unstructured data in real-time
- Understands:
- What interests individual users
- Entities & Concepts (People, Places, Events)
tumra.com
4. THE CHALLENGE
Help users to “follow the story” regardless of the
news outlet, integrated web / second-screen
tumra.com
Photo Credit: byrion on Flickr (cc)
6. THE PROBLEM
• Little useful data to work with…
- Streams of continuous live TV
- Have to create metadata
• Where did we start?
- Ingest several live news channels
- Extract whatever data was available:
- In-video text using OCR
- Subtitles / Closed Captions
tumra.com
7. STEP 1 NAMED ENTITY RECOGNITION
We used a simple N-Gram model for exact matches;
then Apache Lucene for everything else…
tumra.com
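The exact-match step described above can be sketched as a longest-first n-gram lookup against a known-entity table. This is a minimal illustration: the gazetteer contents and token cleanup are assumptions, and the Apache Lucene fallback for fuzzy matches is not shown.

```python
# Hypothetical sketch of exact-match NER: slide 1..3-gram windows over the
# token stream and look each candidate up in a known-entity table.
GAZETTEER = {
    "david cameron": "Person",
    "angela merkel": "Person",
    "german chancellor": "Concept",
    "eurozone": "Place",
}

def ngram_ner(text, max_n=3):
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    matches = []
    i = 0
    while i < len(tokens):
        # Prefer the longest n-gram starting at position i
        for n in range(max_n, 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in GAZETTEER:
                matches.append((candidate, GAZETTEER[candidate]))
                i += n
                break
        else:
            i += 1  # no match starting here; advance one token

    return matches

print(ngram_ner("David Cameron and the German Chancellor Angela Merkel "
                "meets to discuss the debt crisis."))
```

Anything that falls through this exact-match pass would then be handed to a fuzzy index such as Lucene.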
8. EXAMPLE N.E.R.
“David Cameron and the German
Chancellor Angela Merkel meets to
discuss the debt crisis and signal
their approval for greater eurozone
integration.”
tumra.com
9. EXAMPLE N.E.R.
“David Cameron and the German
Chancellor Angela Merkel meets to
discuss the debt crisis and signal
their approval for greater eurozone
integration.”
tumra.com
10. INITIAL SOLUTION
Unstructured Data → NER → NoSQL → Awesomeness!
tumra.com
12. DISAMBIGUATION
• Which “David Cameron”?
- We have many in our Knowledgebase
- Sportsmen, actors, painters & characters…
• Our initial simplistic approach was naïve
- Works great with unambiguous matches
- Best-case returns top-scoring entity
• We needed a smarter approach
tumra.com
13. RECAP
• We have an effectively ‘flat’ KB of Entities
- “David Cameron” -> Politician (Person)
- “Angela Merkel” -> Politician (Person)
- “German Chancellor” -> Political office (Concept)
- “Debt” -> Economic concept (Concept)
- “Eurozone” -> Economic area (Place)
• We needed a way to find relationships
between Entities
tumra.com
14. THE BIG IDEA
Graphs allow us to store relationships between entities, and
graph algorithms allow us to interrogate those connections…
15. GRAPH DATABASES
Neo4j, GraphLab, Apache Giraph, GoldenOrb
… of course there are many more open-source & proprietary ones
tumra.com
16. SO, WHICH ONE?
???
… it had to be fast, scalable, and under active development
tumra.com
17. STEP 2 BUILDING RELATIONSHIPS
We had 250 million Nodes, and 4 billion Edges…
great initial results but horrendously inefficient!
Example: “David Cameron” & “Angela Merkel”
tumra.com
20. INITIAL IMPROVEMENTS
• We didn’t need everything… just:
- People: “David Cameron”, “Angela Merkel”
- Places: “London”, “Downing Street”, “Eurozone”
- Concepts: “Debt”, “President”, “Eurozone”
- Things: Companies, Products etc.
• Pruned the graph using Map/Reduce
• This reduced the number of Entities…
- … but we still had billions of connections
tumra.com
21. EXAMPLE PEOPLE, PLACES, CONCEPTS
“David Cameron and the German
Chancellor Angela Merkel meets to
discuss the debt crisis and signal
their approval for greater eurozone
integration.”
tumra.com
22. EXAMPLE PEOPLE, PLACES, CONCEPTS
“David Cameron and the German
Chancellor Angela Merkel meets to
discuss the debt crisis and signal
their approval for greater eurozone
integration.”
(entities highlighted as People, Places and Concepts)
tumra.com
23. DISAMBIGUATION
[Graph diagram: “Angela Merkel” linked to the candidate “David Cameron” nodes (painter, footballer, actor, politician) via shared nodes such as Living Person, Politician and Head of State]
Possibilities: shortest path, number of common connections etc.
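The "number of common connections" option can be sketched in a few lines: score each candidate node by how many neighbours it shares with a context entity. The adjacency sets below are invented for illustration, not the actual knowledgebase.

```python
# Illustrative only: count neighbours shared between each candidate node
# and a context entity already present in the text.
GRAPH = {
    "Angela Merkel":              {"Politician", "Head of State", "Living Person"},
    "David Cameron (politician)": {"Politician", "Head of State", "Living Person"},
    "David Cameron (painter)":    {"Living Person"},
    "David Cameron (footballer)": {"Living Person"},
}

def common_connections(candidate, context):
    # Set intersection of the two neighbourhoods
    return len(GRAPH[candidate] & GRAPH[context])

best = max(
    (c for c in GRAPH if c.startswith("David Cameron")),
    key=lambda c: common_connections(c, "Angela Merkel"),
)
print(best)  # the politician shares the most connections
```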
24. STEP 3 SIMPLIFYING THE GRAPH
Sure all that extra metadata was tasty but we didn’t
need it all to solve the use-case…
So we used Map/Reduce to count the common
connections
tumra.com
25. SIMPLIFIED
[Graph diagram: “Angela Merkel” linked to the candidate “David Cameron” nodes, each edge labelled with its number of common connections (1, 3 and 1; the politician shares the most)]
Woah … that looks a lot like a Least Cost Routing problem
26. LEAST COST PATH
[Graph diagram: the same candidates, with the edge labels inverted into costs of 1/1 and 1/3]
1 / number of common connections = cost
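Treating 1 / common-connections as an edge cost turns disambiguation into a shortest-path search, which standard Dijkstra solves. This is a sketch under assumed weights taken from the slide (politician: 3 common connections, painter: 1); the graph itself is hypothetical.

```python
import heapq

# Edge cost = 1 / number of common connections, per the slide.
EDGES = {
    "Angela Merkel": [("David Cameron (politician)", 1 / 3),
                      ("David Cameron (painter)", 1 / 1)],
    "David Cameron (politician)": [],
    "David Cameron (painter)": [],
}

def cheapest_from(source):
    """Dijkstra: cheapest cost from source to every reachable node."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, cost in EDGES[node]:
            if d + cost < dist.get(nxt, float("inf")):
                dist[nxt] = d + cost
                heapq.heappush(queue, (d + cost, nxt))
    return dist

dist = cheapest_from("Angela Merkel")
best = min((n for n in dist if n != "Angela Merkel"), key=dist.get)
print(best, round(dist[best], 3))
```

The candidate with the most common connections gets the cheapest path, so the politician wins over the painter.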
27. UPDATED SOLUTION
Unstructured Data → NER → Disambiguation (Neo4j) → NoSQL → Awesomeness!
tumra.com
28. RECAP
• Graphs allow us to interrogate relationships
- Disambiguate when faced with multiple possibilities
- Infer more about the context of what’s happening
• Went through iterations of improvements
- Kept our Entity data in NoSQL = TB’s
- Used the Graph as an index of sorts = GB’s
• Neo4j was a great fit for our needs
tumra.com
29. STEP 4 MAKING IT WORK REAL-TIME
Some queries were taking ‘seconds’ and we needed
to go a lot faster because TV won’t wait for us …
Do we really need to check the Graph every time?
tumra.com
30. ENTER MACHINE LEARNING
• We can use simple predictors to estimate
the likelihood of Entities occurring
- i.e. every time we’ve looked for “David Cameron” in
the past the best match was the Politician
• Keeping a ‘probabilistic context’ of recent
Entities allows us to detect shifts in topics
- Works especially well on News channels
- Reduces the demand on Graph lookups
tumra.com
31. BAYES THEOREM
Looks complicated, but it’s basically just counting & division
Photo Credit: mattbuck007 on Flickr (cc)
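The "counting and division" idea can be sketched as a predictor that remembers which entity each surface form resolved to in the past, and skips the graph lookup whenever one candidate dominates. The counts and the confidence threshold below are invented for illustration.

```python
from collections import Counter

# Hypothetical resolution history: (surface form, entity) -> times chosen.
history = Counter({
    ("david cameron", "David Cameron (politician)"): 97,
    ("david cameron", "David Cameron (painter)"): 3,
})

def likely_entity(mention, threshold=0.9):
    """Return the dominant past resolution, or None if the graph should decide."""
    candidates = {e: n for (m, e), n in history.items() if m == mention}
    total = sum(candidates.values())
    if not total:
        return None  # never seen: must consult the graph
    entity, count = max(candidates.items(), key=lambda kv: kv[1])
    # count / total is the estimated P(entity | mention): counting & division
    return entity if count / total >= threshold else None

print(likely_entity("david cameron"))  # confident: no graph lookup needed
```

Keeping these counts per recent ‘probabilistic context’ rather than globally is what lets the system notice topic shifts.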
32. STEP 5 MAKING IT WORK WORLDWIDE
We solved the problem for English, but what about
other languages?
tumra.com
33. LANGUAGE
• Our core Entities of ‘People’, ‘Places’, &
‘Concepts’ are language agnostic…
• We needed a way to ditch ‘language’ and
jump straight to entities…
- The colour ‘Red’ means the same thing whether you
call it ‘Rot’, ‘Rouge’ or ‘赤’
• Again, Graphs could solve the problem
tumra.com
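Jumping straight from ‘language’ to entities can be sketched as a multilingual alias table that maps every known surface form to a single concept node; the entity IDs and forms below are invented for illustration.

```python
# Many surface forms, one language-agnostic concept node.
SURFACE_FORMS = {
    "red":   "concept/colour-red",
    "rot":   "concept/colour-red",
    "rouge": "concept/colour-red",
    "赤":    "concept/colour-red",
}

def to_entity(word):
    # Case-insensitive lookup; returns None for unknown forms
    return SURFACE_FORMS.get(word.lower())

# All four forms resolve to the same concept, regardless of language
assert to_entity("Rot") == to_entity("rouge") == to_entity("赤")
print(to_entity("Red"))
```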
35. PROBLEM SOLVED
Typical response time ~30ms … relevancy improves
over time and the system learns new entities ‘online’
tumra.com
36. FINAL SOLUTION
Unstructured Data → Language Model → NER → Machine Learning → Disambiguation (Neo4j) → NoSQL → Awesomeness!
tumra.com
37. ABOUT US
• We’ve built a product…
- Our ‘Digital Marketing Optimization’ platform
improves conversion rates & customer satisfaction
for eCommerce & Marketing campaigns
- Launches Q1 2013
• What else do we do?
- ‘Big Data’ & ‘Data Science’ professional services
- Bespoke prototype & solution development
“TUMRA” is a transliteration of the Sanskrit word for “BIG”;
we thought it was a great name … ( and the .COM was available )
tumra.com
38. TUMRA
You?
THANKS FOR LISTENING
We’re hiring!
Data Scientists & Developers
work@tumra.com
tumra.com
39. THANKS FOR LISTENING
Questions?
tumra.com
hello@tumra.com
twitter.com/tumra
tumra.com