3. NLP: Grammar
Template
$NAME, a $DEGREE graduate, works at $COMPANY
BNF
QUERY ::= NAME COMMA? a DEGREE graduate COMMA? works at COMPANY
NAME ::= GIRL | BOY
GIRL ::= maryna | dasha | tanya
BOY ::= michael | alex | dima
DEGREE ::= social sciences | physics | mathematics
COMPANY ::= google | microsoft | quora
COMMA ::= /,/
Example
{GIRL=Tanya}, a {DEGREE=social sciences} graduate, works at {COMPANY=Quora}
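To make the AST-over-RE argument concrete, here is a minimal sketch (assumed names, not the talk's actual code) of matching this BNF directly against a token stream, with no regex in sight:

```python
# Terminal vocabularies straight from the BNF above.
GIRL = {'maryna', 'dasha', 'tanya'}
BOY = {'michael', 'alex', 'dima'}
DEGREE = {'social sciences', 'physics', 'mathematics'}
COMPANY = {'google', 'microsoft', 'quora'}

def parse(query):
    """Return the captured terms for QUERY, or None if the query doesn't parse."""
    tokens = query.lower().replace(',', '').split()
    # NAME ::= GIRL | BOY
    if not tokens or tokens[0] not in GIRL | BOY:
        return None
    name, rest = tokens[0], tokens[1:]
    # 'a' DEGREE 'graduate'
    if not rest or rest[0] != 'a':
        return None
    rest = rest[1:]
    degree = None
    for d in DEGREE:
        d_toks = d.split()
        if rest[:len(d_toks)] == d_toks:
            degree, rest = d, rest[len(d_toks):]
            break
    if degree is None or rest[:1] != ['graduate']:
        return None
    rest = rest[1:]
    # 'works at' COMPANY
    if rest[:2] != ['works', 'at'] or len(rest) != 3 or rest[2] not in COMPANY:
        return None
    return {'NAME': name, 'DEGREE': degree, 'COMPANY': rest[2]}

parse('Tanya, a social sciences graduate, works at Quora')
# captures NAME, DEGREE, and COMPANY, just like the named groups in the RE below
```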
5. NLP: RE #dontpanic #undercontrol #nopixelswereharmed
re =
/^(?<query_7>(?<name_7_8>(?<girl_7_8_9>\smaryna|\sdasha|\stanya)|(?<boy_7_8_10>\smichael|\salex|\sdima))(?:\/(?:\s),(?=\s)\/)?\sa(?<degree_7_11>\ssocial\ssciences|\sphysics|\smathematics)\sgraduate(?:\/(?:\s),(?=\s)\/)?\sworks\sat(?<company_7_12>\sgoogle|\smicrosoft|\squora))(?:\s)$/
# A non-artificial example RE would be in the order of megabytes. -- D.K.
9. NLP: AST > RE
Troubles matching with regular expressions:
1. Performance.
2. Performance.
3. No extensibility.
4. No extensibility.
10. NLP: AST > RE cont’d
1. Performance.
RE generation bloats the input. Small BNF can expand to a large RE.
2. Performance.
RE application can be slow. Even a short RE can be O(exp(N)).
3. No extensibility.
Extending an inner term is an intolerable pain.
4. No extensibility.
How are we going to launch those machine learning features atop REs?
11. NLP: All Hail Regexes!
An obligatory disclaimer: We settled on a hybrid approach, and still use regexes.
Example: "2017-dec-20".
Regardless, an AST-powered grammar gave us a nearly 1000x speedup.
12. NLP: AST Implementation Highlights
● Compilers are among the most painful things to build.
○ For one, take human-readable and IDE-understandable error messages. Oh, and Unicode.
● Penalty-based matching is not how REs work. Because greediness.
○ Match “hello world” against the following grammar:
GREETINGS ::= hello | hello world
QUERY ::= GREETINGS world? # Compare with `world??` instead. Sucks to be greedy.
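The slide's grammar, transliterated into a regex sketch (illustration only, not production code): regex alternation is leftmost-first, not longest-match, so GREETINGS never gets a fair shot at the longer alternative.

```python
import re

# GREETINGS ::= hello | hello world ; QUERY ::= GREETINGS world?
greedy = re.match(r'^(?P<greetings>hello|hello world)(?: world)?$', 'hello world')
lazy = re.match(r'^(?P<greetings>hello|hello world)(?: world)??$', 'hello world')
reordered = re.match(r'^(?P<greetings>hello world|hello)(?: world)?$', 'hello world')

greedy.group('greetings')     # 'hello': first alternative wins, the optional mops up ' world'
lazy.group('greetings')       # 'hello': making the tail lazy doesn't reorder the alternation
reordered.group('greetings')  # 'hello world': only reordering the alternatives changes the capture
```

A penalty-based matcher can score both parses and pick the better one; a regex commits to whichever alternative is listed first.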
14. ML: Product Features
● Business need, not a fetish.
● I loosely define ML product features as anything that is data-driven.
● Heuristic: a feature is an ML one once it needs a regression test.
○ Because a unit test alone fulfills the engineer’s OCD, but doesn’t really bring business value.
15. ML: Product Features cont’d
● Obvious:
○ Spelling corrections.
○ Query suggestions.
● Less obvious:
○ Grammar-wide synonyms (“jargon”, “funding is $1M” == “raised $1M”).
● Moonshots:
○ Onboarding: Gently introduce the user to The Power, keeping their flow calm and peaceful.
19. Theory: Query Suggestions
● ML 101 refresher:
○ Pareto efficiency.
○ Precision, recall, log loss. Classification, regression, and ranking cost functions.
● TL;DR:
○ The quality of the suggestions engine is a continuum.
○ On the one hand: a trie of possible query terms, completed from the first term on, ignoring the grammar altogether.
■ Nearly 100% perfect suggestions, very low coverage.
○ On the other hand, query term frequency counting with some way of keeping context.
■ With no context, nearly 100% “coverage”, nearly 100% gibberish.
■ Feature engineering: what exactly that “some way” stands for becomes the key.
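The trie end of the continuum fits in a few lines. A toy illustration (assumed names, not the talk's code): suggestions drawn from a trie of known queries are always well-formed, but coverage is capped at what the corpus has seen.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # term -> TrieNode
        self.terminal = False  # True if a known query ends here

def build_trie(queries):
    root = TrieNode()
    for q in queries:
        node = root
        for term in q.split():
            node = node.children.setdefault(term, TrieNode())
        node.terminal = True
    return root

def suggest(root, prefix):
    """Terms that can follow the given prefix in some known query."""
    node = root
    for term in prefix.split():
        if term not in node.children:
            return []  # prefix never seen: zero coverage, by design
        node = node.children[term]
    return sorted(node.children)

corpus = ['tanya works at quora', 'tanya works at google', 'alex works at google']
trie = build_trie(corpus)
suggest(trie, 'tanya works at')  # only companies actually seen after this prefix
```

Near-100% precision, since every suggestion extends a real query; the price is that `suggest(trie, 'dima')` returns nothing at all.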
20. Practice: Query Suggestions
● We have a corpus of unlabeled queries.
○ And the privilege to proofread, filter, and label it ourselves.
● We have a good idea of what queries we want users to type.
○ The onboarding moonshot is also on the radar.
● Ideally, we want to prototype quickly and launch right away.
○ Which is exactly what happened.
23. Query Suggestions: Disclaimer
● To demonstrate how the above plays together, let’s refer to a synthetic example.
○ Note to those Californicated:
■ By no means do I imply someone with the name, say, Tanya is more likely to be a social sciences
graduate than someone who is, say, Michael. And by no means do I imply gender is the cause of it.
■ By no means do I imply that someone with the name Tanya is more likely to be employed by an
excessively politically correct company, as opposed to a company doing tangible engineering.
○ It’s the imbalances in data that we, data engineers, uncover for a living. Judgment calls are yours, not mine.
26. Query Suggestions: Implementation Highlights
● Machine learning:
○ There are three pillars of data engineering:
■ [ labeled ] Data.
■ [ extracted ] Features.
■ [ learning ] Algorithms.
○ TL;DR: No rocket science, but most ML is about carefully using simple features.
● Software engineering:
○ Effectively, the enumeration of queries to be suggested is the AST traversal.
○ “Trie” “prefix” generators are stateful, both wrt the current node and wrt the terms consumed.
○ To handle XXX QPS it has to be breadth-first search, not depth-first search.
○ Thus, the priority-queue-chained “calls” carry both the “local state” and the “global state”.
○ TL;DR: Quite an implementation exercise.
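A loose sketch of the shape described above (assumed structures, not the talk's implementation): best-first enumeration over a weighted grammar, where each heap entry is one chained "call" carrying the global state (score, phrase so far) and the local state (symbols still to expand).

```python
import heapq
import itertools

# Hypothetical toy grammar: nonterminal -> weighted alternatives, each a tuple of symbols.
GRAMMAR = {
    'QUERY': [(1.0, ('NAME', 'works', 'at', 'COMPANY'))],
    'NAME': [(0.6, ('tanya',)), (0.4, ('alex',))],
    'COMPANY': [(0.7, ('google',)), (0.3, ('quora',))],
}

def enumerate_suggestions(start='QUERY', limit=3):
    tie = itertools.count()  # tie-breaker so heapq never compares tuples of strings
    # Heap entry: (-score, tie, phrase_so_far, remaining_symbols).
    heap = [(-1.0, next(tie), (), (start,))]
    out = []
    while heap and len(out) < limit:
        neg, _, phrase, rest = heapq.heappop(heap)
        if not rest:  # a complete query: the best unfinished one can't beat it
            out.append((' '.join(phrase), -neg))
            continue
        head, tail = rest[0], rest[1:]
        if head in GRAMMAR:  # expand a nonterminal: one new "call" per alternative
            for w, expansion in GRAMMAR[head]:
                heapq.heappush(heap, (neg * w, next(tie), phrase, expansion + tail))
        else:                # terminal: consume the term
            heapq.heappush(heap, (neg, next(tie), phrase + (head,), tail))
    return out

enumerate_suggestions(limit=3)  # highest-scoring completions first
```

The priority queue is what makes this breadth-ish and incremental: a depth-first recursion would have to finish an entire subtree before yielding anything, while here the best partial query is always expanded next and the traversal can stop the moment `limit` suggestions are out.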