NLP + ML =
Minsk, 2017.
Slides adjusted to be California-friendly.
Dima Korolev
NLP: Grammar
Tanya, a social sciences graduate, works at Quora.
NLP: Grammar
Template
$NAME, a $DEGREE graduate, works at $COMPANY
BNF
QUERY ::= NAME COMMA? a DEGREE graduate COMMA? works at COMPANY
NAME ::= GIRL | BOY
GIRL ::= maryna | dasha | tanya
BOY ::= michael | alex | dima
DEGREE ::= social sciences | physics | mathematics
COMPANY ::= google | microsoft | quora
COMMA ::= /,/
Example
{GIRL=Tanya}, a {DEGREE=social sciences} graduate, works at {COMPANY=Quora}
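To make the template concrete, here is a minimal sketch of how the grammar above could be matched and its slots extracted. The `GRAMMAR` dict, the `alternatives` helper, and the `parse` function are hypothetical names introduced for illustration, and case-insensitive matching and the optional trailing period are assumptions; this is not the production matcher.

```python
import re

# Hypothetical encoding of the slide's BNF: each slot maps to its alternatives.
GRAMMAR = {
    "GIRL": ["maryna", "dasha", "tanya"],
    "BOY": ["michael", "alex", "dima"],
    "DEGREE": ["social sciences", "physics", "mathematics"],
    "COMPANY": ["google", "microsoft", "quora"],
}


def alternatives(rule):
    """Join the terms of one rule into a regex alternation."""
    return "|".join(re.escape(term) for term in GRAMMAR[rule])


# QUERY ::= NAME COMMA? a DEGREE graduate COMMA? works at COMPANY
QUERY = re.compile(
    rf"(?:(?P<GIRL>{alternatives('GIRL')})|(?P<BOY>{alternatives('BOY')}))"
    rf",? a (?P<DEGREE>{alternatives('DEGREE')}) graduate"
    rf",? works at (?P<COMPANY>{alternatives('COMPANY')})\.?",
    re.IGNORECASE,  # assumption: queries are matched case-insensitively
)


def parse(query):
    """Return the matched slots as a dict, or None if the query does not fit."""
    m = QUERY.fullmatch(query.strip())
    if m is None:
        return None
    return {slot: text for slot, text in m.groupdict().items() if text is not None}


print(parse("Tanya, a social sciences graduate, works at Quora."))
```

The returned dict mirrors the `{GIRL=...}`, `{DEGREE=...}`, `{COMPANY=...}` annotation shown in the example.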
NLP, Approach One: The Regular Expression (RE)
NLP: RE #dontpanic #undercontrol #nopixelswereharmed
re =
/^(?<query_7>(?<name_7_8>(?<girl_7_8_9>\smaryna|\sdasha|\stanya)|(?<boy_7_8_10>\smichael|\salex|\sdima))(?:/(?:\s),(?=\s)/)?\sa(?<degree_7_11>\ssocial\ssciences|\sphysics|\smathematics)\sgraduate(?:/(?:\s),(?=\s)/)?\sworks\sat(?<company_7_12>\sgoogle|\smicrosoft|\squora))(?:\s)$/
# A non-artificial example RE would be on the order of megabytes. -- D.K.
NLP, Approach Two: Abstract Syntax Tree (AST)
AST
NLP: AST > RE
Troubles matching with regular expressions:
1. Performance.
2. Performance.
3. No extensibility.
4. No extensibility.
NLP: AST > RE cont’d
1. Performance.
RE generation bloats the input. Small BNF can expand to a large RE.
2. Performance.
RE application can be slow. Even a short RE can be O(exp(N)).
3. No extensibility.
Extending an inner term is an intolerable pain.
4. No extensibility.
How are we going to launch those machine learning features atop REs?
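Point 1 can be illustrated with a small sketch. It assumes a naive generator that inlines every non-terminal into the regular expression; `expand` and `chained_grammar` are hypothetical helpers, and the chained grammar is made up to show the effect. Each rule mentions the previous rule twice, so the BNF grows linearly with depth while the expanded RE roughly doubles at every level.

```python
def expand(rule, grammar):
    """Inline every non-terminal of `rule`, the way a naive BNF-to-RE generator would."""
    parts = []
    for alternative in grammar[rule]:
        tokens = [
            "(?:" + expand(tok, grammar) + ")" if tok in grammar else tok
            for tok in alternative
        ]
        parts.append(" ".join(tokens))
    return "|".join(parts)


def chained_grammar(depth):
    """Made-up grammar: R0 is a terminal choice, each R{i} mentions R{i-1} twice."""
    grammar = {"R0": [["x"], ["y"]]}
    for i in range(1, depth + 1):
        grammar[f"R{i}"] = [[f"R{i-1}", "and", f"R{i-1}"]]
    return grammar


for depth in (1, 5, 10):
    g = chained_grammar(depth)
    bnf_size = sum(len(tok) for alts in g.values() for alt in alts for tok in alt)
    re_size = len(expand(f"R{depth}", g))
    print(f"depth={depth} bnf_chars={bnf_size} re_chars={re_size}")
```

An AST keeps one node per rule and refers to it by reference, which is why it avoids this blow-up.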
NLP: All Hail Regexes!
An obligatory disclaimer: We settled for a hybrid approach, and still use regexes.
Example: "2017-dec-20".
Regardless, an AST-powered grammar gave us a nearly 1000x speedup.
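As a rough sketch of the hybrid idea, a regular expression can stay in charge of a single leaf token such as a date while the AST handles the structure around it. The date format below is an assumption inferred from the one example on the slide, and `match_date_leaf` is a hypothetical name.

```python
import re

# Assumed date format, inferred from the slide's single example "2017-dec-20".
DATE = re.compile(
    r"(?P<year>\d{4})-"
    r"(?P<month>jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)-"
    r"(?P<day>\d{2})"
)


def match_date_leaf(token):
    """Match one token against the date leaf; return its parts or None."""
    m = DATE.fullmatch(token.lower())
    return m.groupdict() if m else None


print(match_date_leaf("2017-dec-20"))
```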
NLP: AST Implementation Highlights
● Compilers are among the most painful things to build.
○ For one, take human-readable and IDE-understandable error messages. Oh, and Unicode.
● Penalty-based matching is not how REs work. Because greediness.
○ Match “hello world” against the following grammar:
GREETINGS ::= hello | hello world
QUERY ::= GREETINGS world? # Compare with `world??` instead. Sucks to be greedy.
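The greediness example can be reproduced with Python's `re` module. The sketch below uses the two patterns from the slide, then enumerates all parses by hand and picks one by penalty. The `penalty` function is a made-up scoring rule for illustration only; the slides do not say how penalties are actually assigned.

```python
import re

TEXT = "hello world"

# QUERY ::= GREETINGS world?   (greedy optional)
greedy = re.match(r"(?P<greetings>hello|hello world)(?P<tail> world)?", TEXT)
# QUERY ::= GREETINGS world??  (lazy optional)
lazy = re.match(r"(?P<greetings>hello|hello world)(?P<tail> world)??", TEXT)

print("greedy:", greedy.group("greetings"), "+", repr(greedy.group("tail")))
print("lazy:  ", lazy.group(0))


def all_parses(text):
    """Enumerate every way the grammar can cover the whole text."""
    parses = []
    for greetings in ("hello", "hello world"):
        for tail in ("", " world"):
            if greetings + tail == text:
                parses.append((greetings, tail))
    return parses


def penalty(parse):
    """Hypothetical scoring: charge one point for consuming the optional term."""
    _greetings, tail = parse
    return 1 if tail else 0


best = min(all_parses(TEXT), key=penalty)
print("penalty-based:", best)
```

The regex engine commits to the first alternative that works, so GREETINGS is always "hello"; the lazy variant even stops before the end of the input. A penalty-based matcher sees both full parses and can prefer GREETINGS = "hello world".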
NLP + ML =
Okay. Let’s talk machine learning now.
ML: Product Features
● Business need, not a fetish.
● I loosely define ML product features as anything that is data-driven.
● Heuristic: a feature is an ML one once it needs a regression test.
○ Because a unit test alone fulfills the engineer’s OCD, but doesn’t really bring business value.
ML: Product Features cont’d
● Obvious:
○ Spelling corrections.
○ Query suggestions.
● Less obvious:
○ Grammar-wide synonyms (“jargon”, “funding is $1M” == “raised $1M”).
● Moonshots:
○ Onboarding: Gently introduce the user to The Power, keeping their flow calm and peaceful.
Query Suggestions
Q: Why start from suggestions while spell checking is easier and cleaner?
Query Suggestions
A: Because good suggestions effectively fix spelling, but not vice versa.
Machine Learning 101
Theory: Query Suggestions
● ML 101 refresher:
○ Pareto efficiency.
○ Precision, recall, log loss. Classification, regression, and ranking cost functions.
● TL;DR:
○ The quality of the suggestions engine is a continuum.
○ At one extreme: a trie of known queries, matched term by term from the first one, ignoring the grammar entirely.
■ Nearly 100% perfect suggestions, very low coverage.
○ At the other extreme: counting query term frequencies, with some way of keeping context.
■ With no context, nearly 100% “coverage”, nearly 100% gibberish.
■ Feature engineering: what exactly “some” stands for becomes the key.
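The two ends of the continuum can be sketched as follows. The corpus is made up, and using only the single preceding term as "context" is an assumption chosen to keep the example short; all function names are hypothetical.

```python
from collections import Counter, defaultdict

# Made-up corpus, for illustration only.
CORPUS = [
    "tanya works at quora",
    "dima works at google",
    "alex works at google",
]


def build_trie(queries):
    """Trie over term sequences: only suggests continuations of queries seen verbatim."""
    trie = {}
    for query in queries:
        node = trie
        for term in query.split():
            node = node.setdefault(term, {})
    return trie


def trie_suggest(trie, prefix_terms):
    node = trie
    for term in prefix_terms:
        if term not in node:
            return []  # unseen prefix: no coverage at all
        node = node[term]
    return sorted(node)


def build_bigrams(queries):
    """Frequency counts of the next term, keeping one preceding term as context."""
    following = defaultdict(Counter)
    for query in queries:
        terms = query.split()
        for prev, nxt in zip(terms, terms[1:]):
            following[prev][nxt] += 1
    return following


def bigram_suggest(following, prefix_terms):
    counts = following.get(prefix_terms[-1], Counter())
    return [term for term, _ in counts.most_common()]


trie = build_trie(CORPUS)
bigrams = build_bigrams(CORPUS)
print(trie_suggest(trie, ["tanya", "works", "at"]))
print(trie_suggest(trie, ["maryna", "works", "at"]))
print(bigram_suggest(bigrams, ["maryna", "works", "at"]))
```

The trie is precise but silent on anything it has not seen. The frequency counter always has an answer, and how sensible that answer is depends on how much context it keeps.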
Practice: Query Suggestions
● We have a corpus of unlabeled queries.
○ And the privilege to proofread, filter, and label it ourselves.
● We have a good idea of which queries we want users to type.
○ The onboarding moonshot is also on the radar.
● Ideally, we want to prototype quickly and launch right away.
○ Which is exactly what happened.
Query Suggestions: Backstory
Q: What would a data engineer do first once they have an AST and a queryset?
Query Suggestions: Backstory
A: Generate random queries from a learned distribution to get a feel for it!
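A minimal sketch of that first step, under loud assumptions: the queryset is made up, each slot is learned independently with add-one smoothing (a real model would presumably condition on context), and `learn` and `generate` are hypothetical names.

```python
import random
from collections import Counter

# Slots of the example grammar, in the order they appear in a query.
SLOTS = {
    "NAME": ["maryna", "dasha", "tanya", "michael", "alex", "dima"],
    "DEGREE": ["social sciences", "physics", "mathematics"],
    "COMPANY": ["google", "microsoft", "quora"],
}

# Made-up queryset, purely for illustration: (NAME, DEGREE, COMPANY) per parsed query.
QUERYSET = [
    ("tanya", "physics", "google"),
    ("dima", "physics", "google"),
    ("alex", "mathematics", "quora"),
    ("tanya", "mathematics", "microsoft"),
]


def learn(queryset, smoothing=1.0):
    """Per-slot weights: observed counts plus smoothing, so unseen terms stay possible."""
    weights = {}
    for index, (slot, terms) in enumerate(SLOTS.items()):
        counts = Counter(row[index] for row in queryset)
        weights[slot] = [counts[term] + smoothing for term in terms]
    return weights


def generate(weights, rng):
    """Walk the template and sample one term per slot from the learned weights."""
    pick = {
        slot: rng.choices(SLOTS[slot], weights=weights[slot])[0] for slot in SLOTS
    }
    return f"{pick['NAME']}, a {pick['DEGREE']} graduate, works at {pick['COMPANY']}"


weights = learn(QUERYSET)
rng = random.Random(0)  # fixed seed so runs are repeatable
for _ in range(5):
    print(generate(weights, rng))
```

Reading a few dozen generated queries is a quick sanity check on both the grammar and the learned weights.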
Query Suggestions: Disclaimer
● To demonstrate how the above plays together, let’s refer to a synthetic example.
○ Note to those Californicated:
■ By no means do I imply that someone with the name, say, Tanya is more likely to be a social sciences
graduate than someone who is, say, Michael. And by no means do I imply gender is the cause of it.
■ By no means do I imply that someone with the name Tanya is more likely to be employed by an
excessively politically correct company, as opposed to a company doing tangible engineering.
○ It’s the imbalances in data that we, data engineers, uncover for a living. Judgement calls are yours, not mine.
Query Suggestions: Demo
— It’s creepy. I like it.
Andy B. (personal communication)
Query Suggestions: Oh well ...
Query Suggestions: Implementation Highlights
● Machine learning:
○ There are three pillars of data engineering:
■ [ labeled ] Data.
■ [ extracted ] Features.
■ [ learning ] Algorithms.
○ TL;DR: No rocket science, but most ML is about carefully using simple features.
● Software engineering:
○ Effectively, the enumeration of queries to be suggested is the AST traversal.
○ “Trie” “prefix” generators are stateful, both wrt the current node and wrt the terms consumed.
○ To handle XXX QPS it has to be breadth first search, not depth first search.
○ Thus, the priority-queue-chained “calls” carry both the “local state” and the “global state”.
○ TL;DR: Quite an implementation exercise.
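One way to read the software engineering bullets is as a best-first traversal driven by a priority queue. The sketch below flattens the AST into a fixed sequence of slots with made-up probabilities; `SLOTS` and `suggest` are hypothetical, and the real traversal over a full grammar is certainly more involved. Each heap entry carries the "global state" (terms consumed so far) and the "local state" (which slot comes next).

```python
import heapq
import math

# Hypothetical slots with made-up probabilities, in the order they appear in a query.
SLOTS = [
    [("tanya", 0.5), ("dima", 0.3), ("dasha", 0.2)],
    [("works at", 1.0)],
    [("google", 0.6), ("quora", 0.3), ("microsoft", 0.1)],
]


def suggest(prefix, limit=3):
    """Return up to `limit` completions of `prefix`, most probable first."""
    # Heap entries: (cost so far, terms consumed so far, index of the next slot).
    heap = [(0.0, (), 0)]
    results = []
    while heap and len(results) < limit:
        cost, terms, slot = heapq.heappop(heap)
        if slot == len(SLOTS):
            text = " ".join(terms)
            if text.startswith(prefix):
                results.append(text)
            continue
        for term, prob in SLOTS[slot]:
            candidate = " ".join(terms + (term,))
            # Keep only branches still compatible with what the user has typed.
            if candidate.startswith(prefix) or prefix.startswith(candidate):
                # Cost is the negative log-probability, so cheaper means more likely.
                heapq.heappush(
                    heap, (cost - math.log(prob), terms + (term,), slot + 1)
                )
    return results


print(suggest("d"))
print(suggest("dima works at q"))
```

Because costs never decrease along a path, complete queries leave the heap in order of decreasing probability, so the search can stop as soon as `limit` suggestions are found instead of exploring one branch to exhaustion as a depth-first search would.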
Q&A

FriendlyData - Natural Language Interface for Database
