Using Graph theory to understand Intent & Concepts - Neo4j User Group (January 2013)


Title: Using Graph Theory to understand User Intent

Subtitle: Graph-based Natural Language Processing applied to real-time Machine Learning


We are in a Graph Renaissance period. The advent of high-performance free/open-source software, combined with inexpensive Cloud computing platforms, enables graphs of information to be manipulated and utilised at scales never before seen. While use cases like mining social and web data with graphs are commonplace, their use in Natural Language Processing has largely been overlooked. In this presentation Michael Cutler will describe how TUMRA have used graph-based NLP algorithms as a core component of their upcoming digital marketing product TUMRA Optimize.

Presenter: Michael Cutler


Michael is the CTO and co-founder of TUMRA, a Data Science startup based in Chiswick, West London. Having first discovered Hadoop back in 2008, Michael has been following the bleeding edge of ‘Big Data’ technology since before it was called ‘Big Data’ and has applied it to solve real-world problems.

Before starting TUMRA, Michael was a senior researcher in the R&D labs for British Sky Broadcasting, inventing new technologies and solutions for everything from Satellite, Video and Network systems through to Web and Mobile-based applications.

Twitter: @tumra @cotdp

Published in: Technology


  • 1. Using Graph Theory to understand Intent & Concepts – January 2013  
  • 2. UNDERSTANDING INTENT & CONCEPTS
      •  Use case:
          -  Enhancing Social TV user experience
          -  Matching users to content that interests them
      •  Topics we’ll cover:
          -  Natural Language Processing
          -  Graph Theory
          -  Machine Learning
  • 3. USE CASE ENHANCED SOCIAL TV
      •  Objectives:
          -  Increase engagement with content
          -  Enhance multi-channel user experience
      •  We built a prototype solution:
          -  Mines unstructured data in real-time
          -  Understands:
              -  What interests individual users
              -  Entities & Concepts (People, Places, Events)
  • 4. THE CHALLENGE  Help users to “follow the story” regardless of the news outlet, integrated web / second-screen   Photo Credit: byrion on Flickr (cc)
  • 5. THE PROBLEM  Unstructured Data → Magic?!?! → Awesomeness!
  • 6. THE PROBLEM
      •  Little useful data to work with…
          -  Streams of continuous live TV
          -  Have to create metadata
      •  Where did we start?
          -  Ingest several live news channels
          -  Extract whatever data was available:
              -  In-video text using OCR
              -  Subtitles / Closed Captions
  • 7. STEP 1 NAMED ENTITY RECOGNITION  We used a simple N-Gram model for exact matches; then Apache Lucene for everything else…  
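The exact-match stage of slide 7 can be pictured as a dictionary lookup over word n-grams. The following is a minimal sketch under stated assumptions: the tiny `KNOWLEDGE_BASE`, the tokenizer and the greedy longest-match strategy are all illustrative choices, not TUMRA's actual implementation, and the Apache Lucene fallback for inexact matches is not shown.

```python
from typing import Dict, List, Tuple

# Hypothetical miniature knowledge base: lower-cased surface form -> entity type.
KNOWLEDGE_BASE: Dict[str, str] = {
    "david cameron": "Person",
    "angela merkel": "Person",
    "german chancellor": "Concept",
    "debt": "Concept",
    "eurozone": "Place",
}


def tokenize(text: str) -> List[str]:
    """Split on whitespace and strip surrounding punctuation from each word."""
    tokens = (word.strip(".,;:!?\"'“”()") for word in text.split())
    return [t for t in tokens if t]


def find_entities(text: str, max_n: int = 3) -> List[Tuple[str, str]]:
    """Greedy scan: at each position try the longest word n-gram first."""
    tokens = tokenize(text)
    found: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            label = KNOWLEDGE_BASE.get(candidate.lower())
            if label is not None:
                found.append((candidate, label))
                i += n  # skip past the matched n-gram
                break
        else:
            i += 1  # no n-gram starting here matched
    return found
```

Calling `find_entities` on the example sentence used later in the talk would pick out "David Cameron", "German Chancellor", "Angela Merkel", "debt" and "eurozone". Note that an exact match says nothing about which "David Cameron" is meant.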
  • 8. EXAMPLE N.E.R.   “David Cameron and the German Chancellor Angela Merkel meets to discuss the debt crisis and signal their approval for greater eurozone integration.”  
  • 9. EXAMPLE N.E.R.   “David Cameron and the German Chancellor Angela Merkel meets to discuss the debt crisis and signal their approval for greater eurozone integration.”  
  • 10. INITIAL SOLUTION   Unstructured Data → NER (NoSQL) → Awesomeness!
  • 11. OH NO!!! *facepalm*   Photo Credit: cesarastudillo on Flickr (cc)
  • 12. DISAMBIGUATION
      •  Which “David Cameron”?
          -  We have many in our Knowledgebase
          -  Sportsmen, actors, painters & characters…
      •  Our initial simplistic approach was naïve
          -  Works great with unambiguous matches
          -  Best-case returns top-scoring entity
      •  We needed a smarter approach
  • 13. RECAP
      •  We have an effectively ‘flat’ KB of Entities
          -  “David Cameron” -> Politician (Person)
          -  “Angela Merkel” -> Politician (Person)
          -  “German Chancellor” -> Political office (Concept)
          -  “Debt” -> Economic concept (Concept)
          -  “Eurozone” -> Economic area (Place)
      •  We needed a way to find relationships between Entities
  • 14. THE BIG IDEA  Graphs allow us to store relationships between entities, and graph algorithms allow us to interrogate those connections…  
  • 15. GRAPH DATABASES   GraphLab, Neo4J, Apache Giraph, GoldenOrb … of course there are many more open-source & proprietary ones  
  • 16. SO, WHICH ONE?   ??? … it had to be fast, scalable and under active development  
  • 17. STEP 2 BUILDING RELATIONSHIPS  We had 250 million Nodes and 4 billion Edges… great initial results but horrendously inefficient! Example: “David Cameron” & “Angela Merkel”  
  • 18. INITIAL IMPROVEMENTS
      •  We didn’t need everything… just:
          -  People: “David Cameron”, “Angela Merkel”
          -  Places: “London”, “Downing Street”, “Eurozone”
          -  Concepts: “Debt”, “President”, “Eurozone”
          -  Things: Companies, Products etc.
      •  Pruned the graph using Map/Reduce
      •  This reduced the number of Entities…
          -  … but we still had billions of connections
  • 19. EXAMPLE PEOPLE, PLACES, CONCEPTS   “David Cameron and the German Chancellor Angela Merkel meets to discuss the debt crisis and signal their approval for greater eurozone integration.”  
  • 20. EXAMPLE PEOPLE, PLACES, CONCEPTS     “David Cameron and the German Chancellor Angela Merkel meets to discuss the debt crisis and signal their approval for greater eurozone integration.”  Categories: Concepts, Places, People  
  • 21. DISAMBIGUATION
      •  Context entity: Angela Merkel
      •  Candidates: David Cameron (painter), David Cameron (footballer), David Cameron (actor), David Cameron (politician)
      •  Connecting nodes: Living Person, Politician, Head of State
      •  Possibilities: shortest path, number of common connections etc.
  • 22. STEP 3 SIMPLIFYING THE GRAPH  Sure all that extra metadata was tasty but we didn’t need it all to solve the use-case… So we used Map/Reduce to count the common connections  
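Counting common connections maps naturally onto a map step that emits entity pairs and a reduce step that sums them. The sketch below imitates that pattern in plain Python rather than on a Map/Reduce cluster; the `EDGES` list and both function names are invented for illustration and are not data or code from the talk.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical (entity, connected node) edges, loosely modelled on slide 21.
EDGES: List[Tuple[str, str]] = [
    ("Angela Merkel", "Living Person"),
    ("Angela Merkel", "Politician"),
    ("Angela Merkel", "Head of State"),
    ("David Cameron (politician)", "Living Person"),
    ("David Cameron (politician)", "Politician"),
    ("David Cameron (politician)", "Head of State"),
    ("David Cameron (footballer)", "Living Person"),
    ("David Cameron (actor)", "Living Person"),
]

Pair = Tuple[str, str]


def map_phase(edges: Iterable[Tuple[str, str]]) -> Iterator[Tuple[Pair, int]]:
    """Emit ((a, b), 1) for every pair of entities sharing a connected node."""
    by_connection: Dict[str, set] = defaultdict(set)
    for entity, connection in edges:
        by_connection[connection].add(entity)
    for entities in by_connection.values():
        for a, b in combinations(sorted(entities), 2):
            yield (a, b), 1


def reduce_phase(pairs: Iterable[Tuple[Pair, int]]) -> Dict[Pair, int]:
    """Sum the emitted ones per entity pair."""
    counts: Dict[Pair, int] = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


common_connections = reduce_phase(map_phase(EDGES))
```

With this toy data the politician shares three connections with Angela Merkel and the other candidates share one each, so the simplified graph only needs one weighted edge per entity pair.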
  • 23. SIMPLIFIED
      •  Context entity: Angela Merkel
      •  Candidates: David Cameron (painter), David Cameron (footballer), David Cameron (actor), David Cameron (politician)
      •  Edge weights (number of common connections): 1, 3, 1
      •  Woah … that looks a lot like a Least Cost Routing problem
  • 24. LEAST COST PATH
      •  Context entity: Angela Merkel
      •  Candidates: David Cameron (painter), David Cameron (footballer), David Cameron (actor), David Cameron (politician)
      •  Edge costs: 1/1, 1/3, 1/1
      •  1 / number of common connections = cost
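Once each edge carries a cost of 1 / (number of common connections), choosing among candidates becomes a least-cost path query. The sketch below uses Dijkstra's algorithm as one plausible way to answer it; the transcript does not say which algorithm or Neo4j feature was actually used, and the `COMMON` counts and function names are hypothetical.

```python
import heapq
from typing import Dict, Iterable, Tuple

# Hypothetical common-connection counts between entity pairs.
COMMON: Dict[Tuple[str, str], int] = {
    ("Angela Merkel", "David Cameron (politician)"): 3,
    ("Angela Merkel", "David Cameron (footballer)"): 1,
    ("Angela Merkel", "David Cameron (actor)"): 1,
}

Graph = Dict[str, Dict[str, float]]


def build_graph(common: Dict[Tuple[str, str], int]) -> Graph:
    """Undirected graph where cost = 1 / number of common connections."""
    graph: Graph = {}
    for (a, b), count in common.items():
        cost = 1.0 / count
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph


def least_costs(graph: Graph, source: str) -> Dict[str, float]:
    """Dijkstra's algorithm: cheapest total cost from source to each node."""
    costs = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > costs.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, edge_cost in graph.get(node, {}).items():
            new_cost = cost + edge_cost
            if new_cost < costs.get(neighbour, float("inf")):
                costs[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return costs


def disambiguate(graph: Graph, context: str, candidates: Iterable[str]) -> str:
    """Pick the candidate that is cheapest to reach from the context entity."""
    costs = least_costs(graph, context)
    return min(candidates, key=lambda c: costs.get(c, float("inf")))
```

Inverting the count means strongly related entities are "close", so a standard shortest-path routine prefers them without any special handling.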
  • 25. UPDATED SOLUTION   Unstructured Data → NER (NoSQL) → Disambiguation (Neo4J) → Awesomeness!
  • 26. RECAP
      •  Graphs allow us to interrogate relationships
          -  Disambiguate when faced with multiple possibilities
          -  Infer more about the context of what’s happening
      •  Went through iterations of improvements
          -  Kept our Entity data in NoSQL = TBs
          -  Used the Graph as an index of sorts = GBs
      •  Neo4j was a great fit for our needs
  • 27. STEP 4 MAKING IT WORK REAL-TIME  Some queries were taking ‘seconds’ and we needed to go a lot faster because TV won’t wait for us … Do we really need to check the Graph every time?  
  • 28. ENTER MACHINE LEARNING
      •  We can use simple predictors to estimate the likelihood of Entities occurring
          -  i.e. every time we’ve looked for “David Cameron” in the past the best match was the Politician
      •  Keeping a ‘probabilistic context’ of recent Entities allows us to detect shifts in topics
          -  Works especially well on News channels
          -  Reduces the demand on Graph lookups
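One way to read "simple predictors" is a frequency table of past resolutions that answers confidently when history is one-sided and otherwise defers to the graph. The class below is a minimal sketch on that assumption; `ResolutionHistory`, its methods and the 0.9 threshold are hypothetical names and values, not taken from the talk.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional


class ResolutionHistory:
    """Remembers which entity each surface form resolved to in the past."""

    def __init__(self) -> None:
        self._counts: Dict[str, Counter] = defaultdict(Counter)

    def record(self, surface: str, entity: str) -> None:
        """Store the outcome of one disambiguation (e.g. from a graph lookup)."""
        self._counts[surface][entity] += 1

    def predict(self, surface: str, threshold: float = 0.9) -> Optional[str]:
        """Return the usual entity if it dominates past results, else None.

        None signals that the caller should fall back to the graph query.
        """
        seen = self._counts.get(surface)
        if not seen:
            return None
        entity, count = seen.most_common(1)[0]
        share = count / sum(seen.values())
        return entity if share >= threshold else None
```

In use, every graph-backed answer would be passed to `record`, and `predict` would be tried first on each new mention, so graph lookups are only needed for unseen or genuinely ambiguous names.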
  • 29. BAYES THEOREM  Looks complicated, but it’s basically just counting & division   Photo Credit: mattbuck007 on Flickr (cc)
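The formula on the slide is not captured in this transcript. For reference, the standard statement of Bayes' theorem is given below, with E standing for a candidate entity and C for the observed context (both symbols are chosen here for illustration). The second form shows why it reduces to counting and division when the probabilities are estimated from observed frequencies.

```latex
P(E \mid C) = \frac{P(C \mid E)\, P(E)}{P(C)}
\qquad\text{and, from observed counts,}\qquad
P(E \mid C) \approx \frac{\mathrm{count}(E, C)}{\mathrm{count}(C)}
```

The approximation follows because the numerator P(C | E) P(E) is the joint probability P(E, C), and both it and P(C) can be estimated by dividing counts by the same total, which cancels.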
  • 30. STEP 5 MAKING IT WORK WORLDWIDE   We solved the problem for English, but what about other languages?  
  • 31. LANGUAGE
      •  Our core Entities of ‘People’, ‘Places’, & ‘Concepts’ are language agnostic…
      •  We needed a way to ditch ‘language’ and jump straight to entities…
          -  The colour ‘Red’ means the same thing regardless of whether you call it ‘Rot’, ‘Rouge’ or ‘赤’
      •  Again, Graphs could solve the problem
  • 32. LANGUAGE INDEPENDENT  Color:Red ← Red, أحمر, Rouge, 赤, Rot, Röd, Rojo, 紅
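The idea of jumping from words in any language straight to a shared entity can be sketched as an index from surface forms to a single concept identifier. In a graph this would be many label nodes pointing at one concept node; the dictionary below is a simplified stand-in, and `CONCEPT_LABELS`, `build_index` and `resolve` are hypothetical names.

```python
from typing import Dict, List, Optional

# One concept node with labels in several languages, as on the slide.
CONCEPT_LABELS: Dict[str, List[str]] = {
    "Color:Red": ["Red", "Rouge", "赤", "Rot", "Röd", "Rojo", "紅"],
}


def build_index(labels: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert concept -> labels into a case-insensitive label -> concept map."""
    return {
        label.casefold(): concept
        for concept, names in labels.items()
        for label in names
    }


def resolve(index: Dict[str, str], term: str) -> Optional[str]:
    """Return the language-independent concept id for a term, if known."""
    return index.get(term.strip().casefold())


INDEX = build_index(CONCEPT_LABELS)
```

Downstream steps such as disambiguation can then work on concept ids like `Color:Red` without caring which language the original text was in.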
  • 33. PROBLEM SOLVED  Typical response time ~30ms … relevancy improves over time and learns new entities ‘online’  
  • 34. FINAL SOLUTION   Unstructured Data → NER (NoSQL) → Disambiguation (Neo4J) → Language Model & Machine Learning → Awesomeness!
  • 35. ABOUT US
      •  We’ve built a product…
          -  Our ‘Digital Marketing Optimization’ platform improves conversion rates & customer satisfaction for eCommerce & Marketing campaigns
          -  Launches Q1 2013
      •  What else do we do?
          -  ‘Big Data’ & ‘Data Science’ professional services
          -  Bespoke prototype & solution development
      •  “TUMRA” is a transliteration of the Sanskrit word for “BIG”; we thought it was a great name … (and the .COM was available)
  • 36. TUMRA  You?   We’re hiring! Data Scientists & Developers  
  • 37. THANKS FOR LISTENING Questions?