3. NLP: Grammar
Template
$NAME, a $DEGREE graduate, works at $COMPANY
BNF
QUERY ::= NAME COMMA? a DEGREE graduate COMMA? works at COMPANY
NAME ::= GIRL | BOY
GIRL ::= maryna | dasha | tanya
BOY ::= michael | alex | dima
DEGREE ::= social sciences | physics | mathematics
COMPANY ::= google | microsoft | quora
COMMA ::= /,/
Example
{GIRL=Tanya}, a {DEGREE=social sciences} graduate, works at {COMPANY=Quora}
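To make the AST-over-RE argument concrete, here is a minimal sketch (assumed names, not the talk's actual code) of matching this BNF directly against a token stream, with no regex in sight:

```python
# Terminal vocabularies straight from the BNF above.
GIRL = {'maryna', 'dasha', 'tanya'}
BOY = {'michael', 'alex', 'dima'}
DEGREE = {'social sciences', 'physics', 'mathematics'}
COMPANY = {'google', 'microsoft', 'quora'}

def parse(query):
    """Return the captured terms for QUERY, or None if the query doesn't parse."""
    tokens = query.lower().replace(',', '').split()
    # NAME ::= GIRL | BOY
    if not tokens or tokens[0] not in GIRL | BOY:
        return None
    name, rest = tokens[0], tokens[1:]
    # 'a' DEGREE 'graduate'
    if not rest or rest[0] != 'a':
        return None
    rest = rest[1:]
    degree = None
    for d in DEGREE:
        d_toks = d.split()
        if rest[:len(d_toks)] == d_toks:
            degree, rest = d, rest[len(d_toks):]
            break
    if degree is None or rest[:1] != ['graduate']:
        return None
    rest = rest[1:]
    # 'works at' COMPANY
    if rest[:2] != ['works', 'at'] or len(rest) != 3 or rest[2] not in COMPANY:
        return None
    return {'NAME': name, 'DEGREE': degree, 'COMPANY': rest[2]}

parse('Tanya, a social sciences graduate, works at Quora')
# captures NAME, DEGREE, and COMPANY, just like the named groups in the RE below
```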
5. NLP: RE #dontpanic #undercontrol #nopixelswereharmed
re =
/^(?<query_7>(?<name_7_8>(?<girl_7_8_9>\smaryna|\sdasha|\stanya)|(?<boy_7_8_10>\smichael|\salex|\sdima))(?:\/(?:\s),(?=\s)\/)?\sa(?<degree_7_11>\ssocial\ssciences|\sphysics|\smathematics)\sgraduate(?:\/(?:\s),(?=\s)\/)?\sworks\sat(?<company_7_12>\sgoogle|\smicrosoft|\squora))(?:\s)$/
# A non-artificial example RE would be in the order of megabytes. -- D.K.
9. NLP: AST > RE
Troubles matching with regular expressions:
1. Performance.
2. Performance.
3. No extensibility.
4. No extensibility.
10. NLP: AST > RE cont’d
1. Performance.
RE generation bloats the input. Small BNF can expand to a large RE.
2. Performance.
RE application can be slow. Even a short RE can be O(exp(N)).
3. No extensibility.
Extending an inner term is an intolerable pain.
4. No extensibility.
How are we going to launch those machine learning features atop REs?
11. NLP: All Hail Regexes!
An obligatory disclaimer: We settled on a hybrid approach, and still use regexes.
Example: "2017-dec-20".
Regardless, an AST-powered grammar gave us a nearly 1000x speedup.
12. NLP: AST Implementation Highlights
● Compilers are among the most painful things to build.
○ For one, take human-readable and IDE-understandable error messages. Oh, and Unicode.
● Penalty-based matching is not how REs work. Because greediness.
○ Match “hello world” against the following grammar:
GREETINGS ::= hello | hello world
QUERY ::= GREETINGS world? # Compare with `world??` instead. Sucks to be greedy.
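The slide's grammar, transliterated into a regex sketch (illustration only, not production code): regex alternation is leftmost-first, not longest-match, so GREETINGS never gets a fair shot at the longer alternative.

```python
import re

# GREETINGS ::= hello | hello world ; QUERY ::= GREETINGS world?
greedy = re.match(r'^(?P<greetings>hello|hello world)(?: world)?$', 'hello world')
lazy = re.match(r'^(?P<greetings>hello|hello world)(?: world)??$', 'hello world')
reordered = re.match(r'^(?P<greetings>hello world|hello)(?: world)?$', 'hello world')

greedy.group('greetings')     # 'hello': first alternative wins, the optional mops up ' world'
lazy.group('greetings')       # 'hello': making the tail lazy doesn't reorder the alternation
reordered.group('greetings')  # 'hello world': only reordering the alternatives changes the capture
```

A penalty-based matcher can score both parses and pick the better one; a regex commits to whichever alternative is listed first.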
14. ML: Product Features
● Business need, not a fetish.
● I loosely define ML product features as anything that is data-driven.
● Heuristic: a feature is an ML one once it needs a regression test.
○ Because a unit test alone fulfills the engineer’s OCD, but doesn’t really bring business value.
15. ML: Product Features cont’d
● Obvious:
○ Spelling corrections.
○ Query suggestions.
● Less obvious:
○ Grammar-wide synonyms (“jargon”, “funding is $1M” == “raised $1M”).
● Moonshots:
○ Onboarding: Gently introduce the user to The Power, keeping their flow calm and peaceful.
19. Theory: Query Suggestions
● ML 101 refresher:
○ Pareto efficiency.
○ Precision, recall, log loss. Classification, regression, and ranking cost functions.
● TL;DR:
○ The quality of the suggestions engine is a continuum.
○ On the one hand: a trie of possible query terms, completed from the first term on, ignoring the grammar altogether.
■ Nearly 100% perfect suggestions, very low coverage.
○ On the other hand, query term frequency counting with some way of keeping context.
■ With no context, nearly 100% “coverage”, nearly 100% gibberish.
■ Feature engineering: what exactly that “some way” stands for becomes the key.
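The trie end of the continuum fits in a few lines. A toy illustration (assumed names, not the talk's code): suggestions drawn from a trie of known queries are always well-formed, but coverage is capped at what the corpus has seen.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # term -> TrieNode
        self.terminal = False  # True if a known query ends here

def build_trie(queries):
    root = TrieNode()
    for q in queries:
        node = root
        for term in q.split():
            node = node.children.setdefault(term, TrieNode())
        node.terminal = True
    return root

def suggest(root, prefix):
    """Terms that can follow the given prefix in some known query."""
    node = root
    for term in prefix.split():
        if term not in node.children:
            return []  # prefix never seen: zero coverage, by design
        node = node.children[term]
    return sorted(node.children)

corpus = ['tanya works at quora', 'tanya works at google', 'alex works at google']
trie = build_trie(corpus)
suggest(trie, 'tanya works at')  # only companies actually seen after this prefix
```

Near-100% precision, since every suggestion extends a real query; the price is that `suggest(trie, 'dima')` returns nothing at all.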
20. Practice: Query Suggestions
● We have a corpus of unlabeled queries.
○ And the privilege to proofread, filter, and label it ourselves.
● We have a good idea of what queries we want users to type.
○ The onboarding moonshot is also on the radar.
● Ideally, we want to prototype quickly and launch right away.
○ Which is exactly what happened.
23. Query Suggestions: Disclaimer
● To demonstrate how the above plays together, let’s refer to a synthetic example.
○ Note to those Californicated:
■ By no means do I imply someone with the name, say, Tanya is more likely to be a social sciences
graduate than someone who is, say, Michael. And by no means do I imply gender is the cause of it.
■ By no means do I imply that someone with the name Tanya is more likely to be employed by an
excessively politically correct company, as opposed to a company doing tangible engineering.
○ It’s the imbalances in data that we, data engineers, uncover for a living. Judgment calls are yours, not mine.
26. Query Suggestions: Implementation Highlights
● Machine learning:
○ There are three pillars of data engineering:
■ [ labeled ] Data.
■ [ extracted ] Features.
■ [ learning ] Algorithms.
○ TL;DR: No rocket science, but most ML is about carefully using simple features.
● Software engineering:
○ Effectively, the enumeration of queries to be suggested is the AST traversal.
○ “Trie” “prefix” generators are stateful, both wrt the current node and wrt the terms consumed.
○ To handle XXX QPS it has to be breadth-first search, not depth-first search.
○ Thus, the priority-queue-chained “calls” carry both the “local state” and the “global state”.
○ TL;DR: Quite an implementation exercise.
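A loose sketch of the shape described above (assumed structures, not the talk's implementation): best-first enumeration over a weighted grammar, where each heap entry is one chained "call" carrying the global state (score, phrase so far) and the local state (symbols still to expand).

```python
import heapq
import itertools

# Hypothetical toy grammar: nonterminal -> weighted alternatives, each a tuple of symbols.
GRAMMAR = {
    'QUERY': [(1.0, ('NAME', 'works', 'at', 'COMPANY'))],
    'NAME': [(0.6, ('tanya',)), (0.4, ('alex',))],
    'COMPANY': [(0.7, ('google',)), (0.3, ('quora',))],
}

def enumerate_suggestions(start='QUERY', limit=3):
    tie = itertools.count()  # tie-breaker so heapq never compares tuples of strings
    # Heap entry: (-score, tie, phrase_so_far, remaining_symbols).
    heap = [(-1.0, next(tie), (), (start,))]
    out = []
    while heap and len(out) < limit:
        neg, _, phrase, rest = heapq.heappop(heap)
        if not rest:  # a complete query: the best unfinished one can't beat it
            out.append((' '.join(phrase), -neg))
            continue
        head, tail = rest[0], rest[1:]
        if head in GRAMMAR:  # expand a nonterminal: one new "call" per alternative
            for w, expansion in GRAMMAR[head]:
                heapq.heappush(heap, (neg * w, next(tie), phrase, expansion + tail))
        else:                # terminal: consume the term
            heapq.heappush(heap, (neg, next(tie), phrase + (head,), tail))
    return out

enumerate_suggestions(limit=3)  # highest-scoring completions first
```

The priority queue is what makes this breadth-ish and incremental: a depth-first recursion would have to finish an entire subtree before yielding anything, while here the best partial query is always expanded next and the traversal can stop the moment `limit` suggestions are out.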