RV College of
Engineering
Go, change the world
Dr. G. Shobha
Professor, CSE Department
RV College of Engineering, Bengaluru - 59
Natural Language to SQL Query conversion using
Machine Learning Techniques on HPCC Systems
Platform
PRESENTATION CONTENTS
• Introduction and Motivation
• Components involved in NLP for NL to SQL Conversion
• Rule Based Architecture for NL to SQL conversion
• Machine Learning Based Architecture to Enrich NL for SQL Conversion
• HPCC Systems Architecture
• Results & Conclusions
Introduction and Motivation
Key Factors of NL to SQL
• Databases are central to most systems today.
• Structured Query Language (SQL) is used to access and manipulate the data stored in a relational database.
• Most end users have limited knowledge of SQL and thus face difficulties in accessing such data.
• Access to the data is critical.
• Users would otherwise have to learn the query language and understand its syntax.
Components Involved in NLP for NL to SQL
Components of NLP
NLP: the part of computer science and artificial intelligence that deals with human languages.
Rule Based Architecture for NL to SQL Conversion
Preprocessor
• Tokenizes the natural language input.
• Removes redundant tokens.
• The output of the preprocessor is duplicated
and supplied to two major components
- Entity Recognizer
- Intent Recognizer
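The preprocessor step above can be sketched as follows. This is a minimal illustration, not the presenters' implementation: the slides do not say which tokens count as redundant, so the `REDUNDANT_TOKENS` set and the `preprocess` function are assumptions made for the example.

```python
import re

# Hypothetical set of redundant tokens; the slides do not list the real one.
REDUNDANT_TOKENS = {"the", "a", "an", "me", "of", "please"}


def preprocess(query):
    """Tokenize the natural language input and drop redundant tokens."""
    tokens = re.findall(r"[a-z0-9']+", query.lower())
    return [token for token in tokens if token not in REDUNDANT_TOKENS]


tokens = preprocess("Show me the names of the customers")
# The output is duplicated and supplied to the two downstream components.
entity_recognizer_input = list(tokens)
intent_recognizer_input = list(tokens)
print(tokens)
```

Copying the token list keeps the two recognizers independent, so neither can modify the other's input.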
Entity Recognizer
Consists of:
• an entity extractor
• a classifier
• a filter
Entity Extractor
• Uses part-of-speech tagging and a date parser to extract important keywords from the sentence.
• The extracted keywords are strong candidates for relation names, attribute names, or data values.
• These are then fed into a classifier along with the user-defined schema mappings of relation names and attribute names.
Classifier
• The classifier uses checks such as direct match, concatenation, n-gram, hypernym, and synonym matching to sort the keywords into relation names, attribute names, and residual keywords.
Filter
• The residual words are filtered to extract the words that form the data items of the SQL query.
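A rough sketch of the classifier checks, covering only the direct, concatenation, and synonym checks. The table and column names come from the result slides, but the mapping dictionaries, the synonym list, and the `classify` function are hypothetical stand-ins for the user-defined schema mappings.

```python
# Hypothetical user-defined schema mappings (keyword -> schema name).
RELATIONS = {"customers": "t_cstmrs", "products": "t_prds"}
ATTRIBUTES = {"fullname": "FullName", "gender": "Gender", "listprice": "ListPrice"}
# Hypothetical synonym table standing in for a WordNet-style lookup.
SYNONYMS = {"price": "listprice", "cost": "listprice"}


def classify(keywords):
    """Sort keywords into relation names, attribute names, and residual keywords."""
    relations, attributes, residual = [], [], []
    i = 0
    while i < len(keywords):
        word = keywords[i]
        # Concatenation check: "full" + "name" -> "fullname".
        if i + 1 < len(keywords) and word + keywords[i + 1] in ATTRIBUTES:
            attributes.append(ATTRIBUTES[word + keywords[i + 1]])
            i += 2
            continue
        if word in RELATIONS:        # direct check on relation names
            relations.append(RELATIONS[word])
        elif word in ATTRIBUTES:     # direct check on attribute names
            attributes.append(ATTRIBUTES[word])
        elif word in SYNONYMS:       # synonym check
            attributes.append(ATTRIBUTES[SYNONYMS[word]])
        else:                        # left for the filter stage
            residual.append(word)
        i += 1
    return relations, attributes, residual


print(classify(["full", "name", "customers", "male"]))
```

The residual list is what the filter stage would then inspect for data items.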
Intent Recognizer
• Creates a template of the SQL query by performing checks for each SQL clause.
• Techniques such as context identification, distance metrics, keyword spotting, and grammar rules are applied to check for the existence of a particular clause.
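Keyword spotting, one of the techniques listed above, could look roughly like this. The clause keywords and the placeholder names in the template are assumptions made for illustration; the slides do not give the actual rules.

```python
# Hypothetical keyword lists for a few SQL clauses and aggregates.
CLAUSE_KEYWORDS = {
    "SUM": {"total", "subtotal", "sum"},
    "COUNT": {"count", "number"},
    "ORDER BY": {"sorted", "ordered"},
}


def build_template(tokens, has_data_values):
    """Build an SQL template by checking the tokens for each clause."""
    present = {
        clause for clause, words in CLAUSE_KEYWORDS.items() if words & set(tokens)
    }
    if "SUM" in present:
        select = "SUM({columns})"
    elif "COUNT" in present:
        select = "COUNT({columns})"
    else:
        select = "{columns}"
    template = "SELECT " + select + " FROM {tables}"
    if has_data_values:  # data items found by the entity recognizer
        template += " WHERE {conditions}"
    if "ORDER BY" in present:
        template += " ORDER BY {order_columns}"
    return template


print(build_template(["show", "subtotal", "orders", "helmet"], has_data_values=True))
```

The placeholders in braces would be filled later with the relation names, attribute names, and data items from the entity recognizer.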
Challenges faced
• Dependence on a specific schema
• Identification of partial or implied data values
• Identification of descriptive values
Go-to solution: machine learning techniques for NL to SQL
Technologies Involved in Machine Learning for NL to SQL
Feedforward neural networks
Recurrent Neural Networks (RNNs)
• Networks with feedback loops (recurrent edges)
• The output at the current time step depends on the current input as well as the previous state (via recurrent edges)
Training RNNs
• Problem: plain RNNs cannot capture long-term dependencies because gradients vanish or explode during backpropagation
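The vanishing case can be seen in a toy example. The sketch below assumes a single-unit RNN with state update h_t = tanh(w * h_{t-1} + x); the weight, the initial state, and the function name are illustrative only. The gradient of the final state with respect to the initial state is a product of one factor per time step, and with a recurrent weight below 1 that product shrinks towards zero.

```python
import math


def gradient_through_time(w, steps, h0=0.5, x=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x)."""
    h = h0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t ** 2)
        grad *= w * (1.0 - h * h)
    return grad


for steps in (1, 5, 20, 50):
    print(steps, gradient_through_time(w=0.5, steps=steps))
```

Each factor is at most 0.5 here, so after 50 steps almost no gradient signal reaches the early time steps.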
Go-to solution: Long Short-Term Memory (LSTM) model
• A type of RNN architecture that addresses the vanishing/exploding gradient problem and allows learning of long-term dependencies
• Has recently risen to prominence with state-of-the-art performance in speech recognition, language modeling, translation, and image captioning
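For reference, one step of a standard LSTM cell can be written out with scalar values. This is the textbook formulation, not the specific model trained in this work; the parameter layout (one `(w_x, w_h, b)` triple per gate) and the all-zero weights are assumptions chosen to keep the example small.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params maps gate name -> (w_x, w_h, b)."""

    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much of the candidate to write
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # candidate cell content
    c = f * c_prev + i * g             # additive cell-state update
    h = o * math.tanh(c)
    return h, c


# Hypothetical parameters: all zeros, so every sigmoid gate outputs 0.5.
params = {name: (0.0, 0.0, 0.0) for name in ("forget", "input", "output", "candidate")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=params)
print(h, c)
```

The cell state is updated by addition rather than by repeated squashing, which is what lets gradients survive over long sequences.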
Machine Learning Based Architecture to Enrich NL for SQL Conversion
Data Set Extraction
• Data is extracted from the RDBMS.
• The Apache Commons CSV library is used to extract the dataset as a CSV file.
• Attributes that contain descriptive values (e.g., Experience, Description) are also provided as input.
• Three separate components work synchronously to extract the maximum latent information from the dataset, which can either be used to enrich the natural language query or be stored for use during conversion.
Machine Learning for Implied Data Values
Partial and Implied Values
• Pre-processing techniques
• Embedding layer
• Long Short-Term Memory (LSTM)
• Classification of inputs
Proposed Model – Implied Data Values
Classification of Inputs
• The input natural language query is tokenized and split into different sequences.
• Sequences of 1 word (1-gram) up to sequences of n words (n-gram, where n is determined by the number of tokens) are considered for prediction.
• The longest sequence and its classification are kept (i.e., its sub-sequences are ignored).
The final, high-confidence classifications given by the LSTM model can be used in multiple ways, a couple of which are outlined below:
• Enrich the natural language query
• Store the data values and attribute names
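The sequence generation and the longest-sequence rule can be sketched as below. The trained LSTM model is replaced by a hypothetical lookup table (`STUB_PREDICTIONS`), and the 0.9 confidence threshold is an assumed value; only the n-gram enumeration and the preference for longer sequences follow the slide.

```python
# Hypothetical stand-in for the LSTM classifier: phrase -> (attribute, confidence).
STUB_PREDICTIONS = {
    "new south wales": ("StateProvince", 0.97),
    "wales": ("CountryRegion", 0.95),
    "australia": ("CountryRegion", 0.99),
}


def stub_classify(phrase):
    return STUB_PREDICTIONS.get(phrase, (None, 0.0))


def classify_spans(tokens, classify, threshold=0.9):
    """Classify n-grams from longest to shortest, ignoring covered sub-sequences."""
    accepted = []
    covered = set()
    n = len(tokens)
    for length in range(n, 0, -1):            # n-gram down to 1-gram
        for start in range(n - length + 1):
            span = range(start, start + length)
            if covered.intersection(span):    # part of a longer accepted sequence
                continue
            phrase = " ".join(tokens[start:start + length])
            label, confidence = classify(phrase)
            if label is not None and confidence >= threshold:
                accepted.append((start, phrase, label))
                covered.update(span)
    return [(phrase, label) for _, phrase, label in sorted(accepted)]


tokens = "get orders from new south wales australia".split()
print(classify_spans(tokens, stub_classify))
```

Here "wales" is never reported on its own because it lies inside the longer accepted sequence "new south wales". The skip test is slightly broader than a strict sub-sequence check, since it also drops spans that only partly overlap an accepted one.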
Elastic Search – Descriptive Values
Elastic Search
Stop Analyzer: discards the stop words.
Ex:
Input: Get the doctors with masters degree
Analyzer: Get doctors masters degree
English Language Analyzer: converts the words of the input query to their root words.
Ex:
Input: Show all products which are red bikes.
Analyzer: Show all product which red bike
Components of Elastic Search
1. Analyzers
• The extracted CSV file is used to create an index in
Elastic Search.
• Elastic Search's Bulk API provides the functions needed to create and store large amounts of data in a single request.
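As a sketch of how the extracted CSV file could be turned into a Bulk API request body: the Bulk API takes newline-delimited JSON, with an action line followed by the document for each row. The index name, the sample row, and the `bulk_payload` function are hypothetical; sending the payload to the cluster is left out.

```python
import csv
import io
import json


def bulk_payload(csv_text, index_name):
    """Build a newline-delimited JSON body for the Bulk API from CSV text."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lines.append(json.dumps({"index": {"_index": index_name}}))  # action line
        lines.append(json.dumps(row))                                # document line
    return "\n".join(lines) + "\n"  # the body must end with a newline


# Hypothetical sample of the extracted dataset.
sample_csv = "ProductName,Description\nRoad Tire Tube,Conventional all-purpose tube.\n"
payload = bulk_payload(sample_csv, "products")
print(payload)
```

Indexing all rows in one request avoids a round trip per document, which matters when the whole dataset is loaded at once.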
Proposed Model – Descriptive Values
Components of Elastic Search
2. Searching through multiple attributes
Multiple columns can be searched in Elastic Search by using the "multi_match" keyword:
{ "query":
  { "multi_match":
    { "query": <input query>,
      "fields": [<list of descriptive column names>]
    }
  }
}
3. Generation of suitable fieldname-value pairs in the WHERE clause
WHERE fieldname1 = value1 AND fieldname2 = value2 AND ... AND fieldnameN = valueN
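The two steps above can be sketched together: building the "multi_match" request body and turning fieldname-value pairs into a WHERE clause. The function names are hypothetical, and the sketch assumes the pairs have already been read from the best search hit; quoting is limited to doubling single quotes.

```python
def multi_match_query(text, fields):
    """Request body that searches the input text across several columns."""
    return {"query": {"multi_match": {"query": text, "fields": list(fields)}}}


def where_clause(pairs):
    """Join fieldname-value pairs into a WHERE clause with AND."""
    conditions = []
    for field, value in pairs:
        escaped = value.replace("'", "''")  # escape single quotes for SQL
        conditions.append(f"{field} = '{escaped}'")
    return "WHERE " + " AND ".join(conditions)


body = multi_match_query("mountain wheel for entry-level rider", ["Description"])
clause = where_clause(
    [("t_prds.Description", "Replacement mountain wheel for entry-level rider.")]
)
print(body)
print(clause)
```

In a real system the values would be better passed as bound query parameters rather than spliced into the SQL string.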
HPCC Systems Platform
Key Factors of HPCC Systems Platform
Go-to solution: a synchronous combination of the hybrid machine learning model, Elastic Search, WordNet, and the HPCC Systems Platform
• Highly integrated system environment
- capabilities from raw data processing to high-performance queries and data analysis using a common language
• Optimized cluster approach
- provides high performance at a much lower system cost than other system alternatives
• Stable and reliable processing environment proven in production applications for varied organizations over a 15-year period
• Innovative data-centric programming language (ECL)
• High level of fault resilience and capabilities
• Suitable for a wide range of data-intensive applications
Results
Each example lists the input natural language query, the enriched natural language query, and the output SQL query.

Input Natural Language Query: show all unmarried customers who are men
Enriched Natural Language Query: show all single Gender 'male' customers
Output SQL Query: SELECT * FROM t_cstmrs WHERE LOWER( MaritalStatus ) = 'single' AND LOWER( Gender ) = 'male'

Input Natural Language Query: Names of customers who have graduated and from germany or france
Enriched Natural Language Query: FullName Names of customers who have Education 'graduate degree' and from CountryRegion 'germany' or CountryRegion 'france'
Output SQL Query: SELECT t_cstmrs.FullName FROM t_cstmrs INNER JOIN t_ggrphy ON t_ggrphy.GeographyKey = t_cstmrs.GeographyKey WHERE ( LOWER( t_ggrphy.CountryRegion ) = 'germany' OR LOWER( t_ggrphy.CountryRegion ) = 'france' ) AND ( LOWER( t_cstmrs.Education ) = 'graduate degree' )

Input Natural Language Query: get the price of red or dark helmet
Enriched Natural Language Query: get the price of Color 'red' or Color 'black' ProductSubCategoryName 'helmet'
Output SQL Query: SELECT ListPrice , Color FROM t_prdsubcat INNER JOIN t_prds ON t_prdsubcat.ProductSubCategoryKey = t_prds.ProductSubCategoryKey WHERE LOWER( Color ) = 'red' OR LOWER( Color ) = 'black'

Input Natural Language Query: how much does tire tube cost
Enriched Natural Language Query: how much does ProductName 'road tire tube' cost
Output SQL Query: SELECT ListPrice , ProductName FROM t_prds WHERE LOWER( ProductName ) = 'road tire tube'

Input Natural Language Query: get the orders from new south wales australia
Enriched Natural Language Query: get the orders from StateProvince 'new south wales' CountryRegion 'australia'
Output SQL Query: SELECT t_saldtls.OrderQuantity, t_ggrphy.CountryRegion, t_ t_cstmrs.FullName , t_ggrphy.StateProvince FROM t_ggrphy INNER JOIN t_cstmrs ON t_cstmrs.GeographyKey = t_ggrphy.GeographyKey INNER JOIN t_saldtls ON t_cstmrs.CustomerKey = t_saldtls.CustomerKey WHERE LOWER( t_cstmrs.StateProvince ) = 'new south wales' AND LOWER( t_ggrphy.CountryRegion ) = 'australia'

Input Natural Language Query: show subtotal of orders for helmet
Enriched Natural Language Query: show subtotal of orders for ProductSubCategoryName 'helmet'
Output SQL Query: SELECT SUM( t_saldtls.SalesOrderint ) FROM t_prds INNER JOIN t_saldtls ON t_prds.ProductKey = t_saldtls.ProductKey WHERE LOWER( t_prds.ProductName ) = 'helmet'
Results – Descriptive values
Input Natural Language Query: Select an item with mountain wheel for entry-level rider.
Output SQL Query: SELECT * FROM t_prds WHERE t_prds.Description = 'Replacement mountain wheel for entry-level rider.'

Input Natural Language Query: Name the items which have pioneering frame technology as the HQ steel frame.
Output SQL Query: SELECT t_prds.ProductName FROM t_prds WHERE t_prds.Description = 'The same pioneering frame technology is used to give you the highest value as the HQ steel frame.'
Conclusion
• Partial and implied data values in the natural language queries are identified by a trained hybrid ML model.
• WordNet is also used as a safety net to understand implied data values where the vocabulary of the input relational database is not expressive enough.
• Descriptive values are identified with the help of Elastic Search.
• The accuracy of the system is 91.7% on the IMDb database.
Acknowledgements
Students of RVCE
1. Shubham Phal
2. Yatish H R
3. Tanmay Hukkeri
4. Akshar Prasad
5. Sourabh S Badhya
6. Yashwanth YS
7. Shetty Rohan