Dialog Engine for Product
Information
Vamsee Chamakura - 201301243
Satyam Verma-201505604
Jitta Divya Sai-201225167
Problem Statement
The aim of the project is to implement a dialog system that answers users'
queries about product information.
The implementation has been divided into the following modules:
Crawling and scraping the product information from Flipkart.
Processing the scraped data.
Saving the data in MongoDB (a NoSQL database).
Preprocessing the query.
Querying the database and extracting the relevant results.
System Design
[Figure: system design diagram showing Flipkart, Crawler & Scraper, Database,
MongoDB Driver, Query Processor, User’s Query and Result]
Crawling and Scraping
Scrapy and BeautifulSoup were used to crawl the data from Flipkart’s website.
The categories scraped were mobiles, televisions, laptops, air conditioners,
refrigerators and cameras.
Around 3000 products were extracted from these categories.
Processing the Scraped Data
BeautifulSoup is a Python library for extracting data from HTML and XML pages.
We used BeautifulSoup to extract all the properties from the crawled web pages.
Different products had different numbers and types of properties, so we used
MongoDB for its flexible schema.
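import requests
from bs4 import BeautifulSoup

r = requests.get(url)
if r.status_code == 200:
    # Parse the product page and collect the specification table cells
    soup = BeautifulSoup(r.content, "lxml")
    keys = soup.find_all("td", {"class": "specsKey"})
    vals = soup.find_all("td", {"class": "specsValue"})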
Storing the Data in MongoDB
MongoDB is a NoSQL, document-oriented database.
The database has six collections, namely:
Mobiles
Laptops
Television
Air Conditioner
Camera
Refrigerator
Contd…
● MongoDB stores each record as a JSON-like document, which allows a lot of
flexibility in storage.
● The properties of each product are stored as key-value pairs, with the model
name as the primary key.
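As a rough illustration of the document layout described above, the sketch below builds one product document from scraped property names and values. The function name `build_document`, the choice of `_id` to hold the model name, the dot replacement, and the sample values are assumptions made for illustration, not taken from the project code.

```python
def build_document(model_name, keys, vals):
    """Build a MongoDB-style document from parallel lists of names and values."""
    # Assumption: the model name goes into `_id`, MongoDB's primary-key field.
    doc = {"_id": model_name}
    for key, val in zip(keys, vals):
        # Collapse stray whitespace left over from the HTML.
        # Dots are replaced because they are awkward in MongoDB field names.
        field = " ".join(key.split()).replace(".", "_")
        doc[field] = " ".join(val.split())
    return doc


# Hypothetical sample data
doc = build_document(
    "Apple iPhone 5s",
    ["Brand", "Internal  Storage"],
    ["Apple", "16 GB"],
)
```

Because each document is just a dictionary, products with different sets of properties can sit in the same collection without a fixed schema.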
Preprocessing the Query
We handled three types of queries:
Template Based Queries.
Natural Language Based Queries.
Comparison Based Queries.
Template for queries:
QUERY SYNTAX ERROR, REQUIRED: [PRODUCT NAME], [PROPERTY]
NL Queries:
What is the price of Apple iPhone 5s?
CB Queries:
Which among apple iphone, samsung galaxy has the best price?
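A template based query could be parsed roughly as follows. This is a minimal sketch that assumes the template is written as a product name, a comma, and a property name; the function name `parse_template_query` and the choice to split on the last comma are assumptions, and only the error message comes from the slide.

```python
SYNTAX_ERROR = "QUERY SYNTAX ERROR, REQUIRED: [PRODUCT NAME], [PROPERTY]"


def parse_template_query(query):
    """Split a query of the form '<product name>, <property>' into its parts."""
    # Assumption: the last comma separates the product name from the property.
    product, sep, prop = query.rpartition(",")
    product, prop = product.strip(), prop.strip()
    if not sep or not product or not prop:
        raise ValueError(SYNTAX_ERROR)
    return product, prop
```

For example, `parse_template_query("Apple iPhone 5s, price")` yields the product name and the property, while a query without a comma raises the syntax error shown above.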
Natural Language Based Queries:
In template based queries, we directly get the product and property names
that the user is interested in.
In NL queries, however, we need to extract them from the given sentence.
This requires a pre-processing step that removes stop words.
The probability that the property name comes before the product name is very
high due to syntactic constraints in the English language (for example,
"price of Apple iPhone 5s").
The property name and product name are extracted with this ordering in mind.
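The stop-word removal step can be sketched as below. The stop-word list is a small hypothetical one chosen for the example queries, and the tokenizer (lowercasing and keeping alphanumeric runs) is likewise an assumption rather than the project's actual pre-processing.

```python
import re

# Hypothetical stop-word list; a real system would likely use a larger one.
STOP_WORDS = {
    "what", "is", "the", "of", "a", "an", "which", "among",
    "has", "have", "does", "do", "for", "in", "me", "tell",
}


def remove_stop_words(query):
    """Lowercase the query, tokenize it, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [t for t in tokens if t not in STOP_WORDS]


tokens = remove_stop_words("What is the price of Apple iPhone 5s?")
# tokens == ["price", "apple", "iphone", "5s"]
```

In the remaining tokens the property ("price") appears before the product words, which matches the ordering assumption described above.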
Approach
The approach followed:
We maintain three lists: a product name list, a brand name list and a property list.
We extract the brand from the query by comparing it against each entry of the
brand name list using the edit distance algorithm, which also helps in handling
spelling errors and typos.
Each element of the product name list is a 3-tuple: brand, model name and category.
After extracting the brand name, we consider only the products from that
brand for further processing.
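The brand extraction step might look like the following sketch. The names `edit_distance` and `extract_brand`, the `max_distance` threshold of 2, and the sample brand list are assumptions for illustration; the slides do not state how a match is accepted or rejected.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b, using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def extract_brand(tokens, brands, max_distance=2):
    """Return the brand closest to any query token, or None if none is close."""
    best_brand, best_dist = None, max_distance + 1
    for brand in brands:
        for token in tokens:
            dist = edit_distance(token, brand.lower())
            if dist < best_dist:
                best_brand, best_dist = brand, dist
    return best_brand


# Hypothetical brand list and a query token containing a typo
brand = extract_brand(["price", "samsng", "galaxy"], ["apple", "samsung"])
```

Here the misspelled token "samsng" is one edit away from "samsung", so that brand is selected and the product list can then be filtered to it.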
Contd..
Approach continued:
To determine the exact model name and the property name, we use a similar
approach but combine the edit distance with a second similarity measure.
The second measure is the length of the longer string divided by the number of
characters the two strings have in common. It equals 1 when both strings
contain the same characters and grows as they differ.
We take the harmonic mean of the edit-distance score and this measure to get
the final similarity score.
We keep the top 10 results for products and the single best result for the property.
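Combining the two measures and keeping the best candidates can be sketched as follows. It is assumed here that the raw edit distance is used as the edit-distance score and that lower combined scores are better; the helper names `harmonic_mean` and `top_matches` are hypothetical, and the two measures are passed in as functions.

```python
import heapq


def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores (0 if both are 0)."""
    if a + b == 0:
        return 0.0
    return 2.0 * a * b / (a + b)


def top_matches(query, candidates, distance_fn, ratio_fn, k=10):
    """Return the k candidates with the lowest combined score."""
    scored = [
        (harmonic_mean(distance_fn(query, c), ratio_fn(query, c)), c)
        for c in candidates
    ]
    # Assumption: lower is better for both measures, so keep the smallest scores.
    return [c for _, c in heapq.nsmallest(k, scored)]
```

Calling `top_matches` with `k=10` over the product names of the extracted brand and with `k=1` over the property list would mirror the selection described above. Note that an exact match has an edit distance of 0, which makes the harmonic mean 0 and places it first.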
Similarity Measure - Edit Distance
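def editDistance(word1, word2):
    """Return the Levenshtein edit distance between word1 and word2."""
    len_1 = len(word1)
    len_2 = len(word2)
    # x[i][j] holds the distance between word1[:i] and word2[:j]
    x = [[0] * (len_2 + 1) for _ in range(len_1 + 1)]
    for i in range(len_1 + 1):
        x[i][0] = i  # delete all i characters of word1
    for j in range(len_2 + 1):
        x[0][j] = j  # insert all j characters of word2
    for i in range(1, len_1 + 1):
        for j in range(1, len_2 + 1):
            if word1[i - 1] == word2[j - 1]:
                x[i][j] = x[i - 1][j - 1]
            else:
                # cheapest of insertion, deletion and substitution
                x[i][j] = min(x[i][j - 1], x[i - 1][j], x[i - 1][j - 1]) + 1
    return x[len_1][len_2]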
Similarity Measure - 2
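def similarityMetric(word1, word2):
    """Return max(len) / (number of shared characters); 1.0 is the best score."""
    # Per-character counts; assumes every character has a code point below 256
    w1_c = [0] * 256
    w2_c = [0] * 256
    for i in word1:
        w1_c[ord(i)] += 1
    for i in word2:
        w2_c[ord(i)] += 1
    # Count the characters the two words have in common
    matched_chars = 0
    for i in range(256):
        matched_chars += min(w1_c[i], w2_c[i])
    if matched_chars == 0:
        return 99999  # no characters in common: treat as maximally dissimilar
    return max(len(word1), len(word2)) / float(matched_chars)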
Querying the Database
For each of the top 10 product names, we query the database and display the
corresponding results to the user.
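The lookup step could be sketched as below. It assumes each product document uses the model name as `_id` and that the collection object offers a `find_one(filter, projection)` method, as a pymongo collection does. `InMemoryCollection` is a hypothetical stand-in so the sketch runs without a MongoDB server, and the function names and sample data are illustrative only.

```python
class InMemoryCollection:
    """Hypothetical stand-in for a database collection, for trying the sketch."""

    def __init__(self, docs):
        self._docs = {d["_id"]: d for d in docs}

    def find_one(self, filter, projection=None):
        # Only supports lookups by `_id`; the projection is ignored here.
        return self._docs.get(filter["_id"])


def lookup_property(collection, model_name, property_name):
    """Return the stored value of one property for one product, or None."""
    doc = collection.find_one({"_id": model_name}, {property_name: 1})
    if doc is None:
        return None
    return doc.get(property_name)


def answer_query(collection, model_names, property_name):
    """Look up the property for every candidate model name."""
    return [(m, lookup_property(collection, m, property_name)) for m in model_names]


# Hypothetical sample data
mobiles = InMemoryCollection([{"_id": "Apple iPhone 5s", "Price": "Rs. 20,000"}])
answers = answer_query(mobiles, ["Apple iPhone 5s", "Apple iPhone 6"], "Price")
```

Candidates that are not found, or that lack the requested property, come back with `None` so the caller can decide whether to display them.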
Challenges
Scraping requires sending a large number of requests to a server, so there is a
chance that the server temporarily blocks or blacklists our IP. Fortunately,
we did not run into this problem.
Initially we hand-picked a few specific properties, but as we expanded the
domain we had to include all the available properties. This posed a problem
when a query asked for a particular property whose data was stored under a
synonymous or extended name. This can be handled using synsets (WordNet).
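The synonym problem could be approached along the lines of the sketch below. The `SYNONYMS` table is a hypothetical hand-written stand-in for synonym sets that might in practice be drawn from WordNet, and the substring check for extended names is an assumption, not the project's method.

```python
# Hypothetical synonym table; in practice it could be filled from WordNet synsets.
SYNONYMS = {
    "price": {"cost", "selling price"},
    "display": {"screen", "screen size"},
}


def resolve_property(requested, stored_keys, synonyms=SYNONYMS):
    """Map a requested property name onto a key that exists in a document."""
    wanted = requested.lower()
    candidates = {wanted} | synonyms.get(wanted, set())
    for key in stored_keys:
        k = key.lower()
        # Accept exact or synonym matches, and extended names containing one.
        if k in candidates or any(c in k for c in candidates):
            return key
    return None
```

With this table, a request for "price" resolves to a stored key such as "Selling Price". Substring matching is deliberately loose and may need tightening to avoid false matches.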
Results
Thank You
