The aim of the project is to implement a dialog system that answers users' queries related to product information.
Vamsee Chamakura - 201301243
Satyam Verma-201505604
Jitta Divya Sai-201225167
[Team 35] [April 2016]
Information Retrieval and Extraction (CSE474) Spring ‘16
Professor: Vasudeva Verma
Mentor: Anurag Tyagi
Dialog Engine for Product Information Query Processing
1. Dialog Engine for Product Information
Vamsee Chamakura - 201301243
Satyam Verma-201505604
Jitta Divya Sai-201225167
2. Problem Statement
The aim of the project is to implement a dialog system that answers users'
queries related to product information.
The implementation of this project has been divided into the following
modules:
Crawling and scraping the product information from Flipkart.
Processing the scraped data.
Saving the data in MongoDB (a NoSQL database).
Preprocessing the query.
Querying the database and extracting the relevant results.
4. Crawling and Scraping
We used Scrapy and BeautifulSoup to crawl and scrape data from Flipkart's
website.
The categories scraped were mobiles, televisions, laptops, air
conditioners, refrigerators and cameras.
Around 3000 products were extracted from the above-mentioned categories.
5. Processing the Scraped Data
BeautifulSoup is a Python library for extracting data from HTML and XML
pages.
We used BeautifulSoup to extract all the properties from the crawled web pages.
Different products had varied number and type of properties, so we used
MongoDB for flexibility.
import requests
from bs4 import BeautifulSoup

# url: the product page URL produced by the crawler
r = requests.get(url)
if r.status_code == 200:
    soup = BeautifulSoup(r.content, "lxml")
    # The specification table stores property names and values in
    # <td class="specsKey"> / <td class="specsValue"> cells.
    keys = soup.find_all("td", {"class": "specsKey"})
    vals = soup.find_all("td", {"class": "specsValue"})
6. Storing the Data in MongoDB
MongoDB is a NoSQL database commonly used for storing large amounts of data.
The database has six collections namely:
Mobiles
Laptops
Television
Air Conditioner
Camera
Refrigerator
7. Contd…
● It stores each record as a JSON document, which allows a lot of flexibility
in storage.
● The properties of each product are stored as key-value pairs, with the
model name as the primary key.
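A product document might then look like the following sketch (the field names and values are illustrative; the deck does not list the exact schema):

```json
{
  "_id": "Apple iPhone 5s",
  "brand": "Apple",
  "category": "Mobiles",
  "price": "Rs. 21,999",
  "display_size": "4 inch",
  "internal_storage": "16 GB"
}
```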
8. Preprocessing the Query
We handled three types of queries:
● Template Based Queries
● Natural Language (NL) Based Queries
● Comparison Based (CB) Queries
Template queries follow the syntax: [PRODUCT NAME], [PROPERTY]
(a malformed query gets the response: QUERY SYNTAX ERROR, REQUIRED:
[PRODUCT NAME], [PROPERTY])
● NL query example: What is the price of Apple iPhone 5s?
● CB query example: Which among apple iphone, samsung galaxy has the best price?
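The three query types could be told apart with simple surface heuristics. A minimal sketch follows; the deck does not describe its actual dispatch logic, so the rules below are assumptions:

```python
import re

def classify_query(q):
    """Crude dispatcher for the three query types (heuristics are assumptions)."""
    q = q.strip().lower()
    # Comparison queries mention several products joined by "among"/"between"/"vs".
    if any(w in q for w in ("among", "between", " vs ", "compare")):
        return "comparison"
    # Template queries follow the fixed "[product], [property]" shape:
    # exactly one comma and no question words or verbs.
    if q.count(",") == 1 and not re.search(r"\b(what|which|is|has|of)\b", q):
        return "template"
    return "natural_language"
```

For example, "apple iphone 5s, price" is routed to the template handler, while "What is the price of Apple iPhone 5s?" falls through to the NL handler.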
9. Natural Language Based Queries :
In template based queries, we directly get the product and property names
that the user is interested in, but in NL queries we need to extract them
from the given sentence.
This requires a preprocessing step that removes stop words.
Due to syntactic constraints of the English language, the probability that
the property name comes before the product name is very high (as in "the
price of Apple iPhone 5s").
Keeping this in mind, the property name and product name are extracted.
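That preprocessing step can be sketched as follows, assuming a small hand-written stop-word list (the deck does not say which list was used):

```python
# Assumption: a tiny illustrative stop-word list; the deck does not name its list.
STOP_WORDS = {"what", "is", "the", "of", "a", "an", "which", "has", "best"}

def preprocess(query):
    """Lowercase, strip punctuation and stop words, keep content tokens in order."""
    tokens = [t.strip("?.,!") for t in query.lower().split()]
    return [t for t in tokens if t and t not in STOP_WORDS]

# "What is the price of Apple iPhone 5s?" -> ["price", "apple", "iphone", "5s"]
# The property ("price") comes first, matching the property-before-product heuristic.
```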
10. Approach
The approach followed:
We maintain three lists: a product name list, a brand name list and a property list.
We extract the brand of the product from the given query by iterating through the
brand name list with the edit distance algorithm, which also helps handle spelling
errors and typos.
Elements of the product name list are tuples of size 3: (brand, model name, category).
After extracting the brand name, we consider only products from that
particular brand for further processing.
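As a sketch, the brand-matching step could look like the following. Here `difflib.SequenceMatcher` stands in for the deck's edit-distance comparison, and the brand list is illustrative:

```python
import difflib

# Illustrative brand list; the real list is built from the scraped data.
BRANDS = ["apple", "samsung", "sony", "lg", "lenovo", "canon"]

def extract_brand(tokens, brands=BRANDS, cutoff=0.8):
    """Return the brand whose name is closest to some query token.

    difflib's ratio() is used here as a stand-in for the edit-distance
    scan described in the deck; like edit distance, it tolerates typos.
    """
    best, best_score = None, cutoff
    for token in tokens:
        for brand in brands:
            score = difflib.SequenceMatcher(None, token, brand).ratio()
            if score > best_score:
                best, best_score = brand, score
    return best

# extract_brand(["price", "aple", "iphone", "5s"]) matches "apple" despite the typo.
```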
11. Contd..
Approach continued:
For determining the exact model name and the property name, we use a similar
approach, but add an additional similarity measure alongside the edit
distance mentioned before.
The second similarity measure is calculated by dividing the maximum of the two
string lengths by the number of matching characters between the strings, so
lower values indicate greater similarity.
We take the harmonic mean of the edit-distance score and this metric to get a
final similarity measure.
We keep the top 10 results for products and the single best match for the property.
12. Similarity Measure - Edit Distance
def editDistance(word1, word2):
    """Levenshtein distance between word1 and word2 (dynamic programming)."""
    len_1 = len(word1)
    len_2 = len(word2)
    # x[i][j] = edit distance between the first i chars of word1
    # and the first j chars of word2.
    x = [[0] * (len_2 + 1) for _ in range(len_1 + 1)]
    for i in range(len_1 + 1):
        x[i][0] = i  # deletions only
    for j in range(len_2 + 1):
        x[0][j] = j  # insertions only
    for i in range(1, len_1 + 1):
        for j in range(1, len_2 + 1):
            if word1[i - 1] == word2[j - 1]:
                x[i][j] = x[i - 1][j - 1]
            else:
                x[i][j] = min(x[i][j - 1], x[i - 1][j], x[i - 1][j - 1]) + 1
    return x[len_1][len_2]
13. Similarity Measure - 2
def similarityMetric(word1, word2):
    """max(len) / number of matching characters; lower means more similar."""
    # Count character frequencies (assumes ASCII input).
    w1_c = [0] * 256
    w2_c = [0] * 256
    for ch in word1:
        w1_c[ord(ch)] += 1
    for ch in word2:
        w2_c[ord(ch)] += 1
    # A character "matches" as many times as it occurs in both words.
    matched_chars = 0
    for i in range(256):
        matched_chars += min(w1_c[i], w2_c[i])
    if matched_chars == 0:
        return 99999  # sentinel: nothing in common
    return max(len(word1), len(word2)) / float(matched_chars)
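Putting the two measures together, the final score could be computed as below. The deck does not show the exact combination code, so the plain harmonic mean of the two lower-is-better scores is an assumption; compact re-implementations of both measures are inlined to keep the sketch standalone:

```python
def edit_distance(a, b):
    # Compact Levenshtein distance (same recurrence as the deck's editDistance).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] if ca == cb
                       else min(cur[j - 1], prev[j], prev[j - 1]) + 1)
        prev = cur
    return prev[len(b)]

def similarity_metric(a, b):
    # max length over number of shared characters (counted with multiplicity).
    shared = sum(min(a.count(c), b.count(c)) for c in set(a))
    return 99999 if shared == 0 else max(len(a), len(b)) / float(shared)

def combined_score(a, b):
    """Harmonic mean of the two lower-is-better measures (lower = more similar)."""
    d = edit_distance(a, b)
    m = similarity_metric(a, b)
    if d + m == 0:
        return 0.0
    return 2.0 * d * m / (d + m)
```

An exact match scores 0, and near matches (one typo) score well below unrelated strings, which is what the top-10 ranking relies on.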
14. Querying the Database
For every product name in the top 10, we query the database, obtain the
respective results and display them to the user.
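A sketch of that lookup against an in-memory stand-in for the MongoDB collection (in the real system this would be a pymongo query on the relevant collection; the sample data below is illustrative):

```python
# In-memory stand-in for one MongoDB collection; keys are model names,
# values are the property documents (all data here is illustrative).
MOBILES = {
    "apple iphone 5s": {"price": "Rs. 21,999", "display size": "4 inch"},
    "samsung galaxy s5": {"price": "Rs. 19,500", "display size": "5.1 inch"},
}

def answer(top_products, prop, collection=MOBILES):
    """For every candidate product, look up the requested property."""
    results = []
    for name in top_products:
        doc = collection.get(name)
        if doc and prop in doc:
            results.append((name, doc[prop]))
    return results

# answer(["apple iphone 5s"], "price") -> [("apple iphone 5s", "Rs. 21,999")]
```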
15. Challenges
Since scraping the data requires sending a lot of requests to a server, there is a
chance that the server may temporarily block or blacklist our IP. Fortunately,
we did not come across this problem.
Initially we hand-picked a few specific properties, but as we expanded the
domain we had to include all the available properties. This posed a problem
when answering queries that asked for a particular property whose data was
stored under a synonymous or extended name. This can be handled using
synsets (WordNet).
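One lightweight way to close that gap is a synonym map over property names. The map below is a hand-written stand-in; the deck proposes deriving such synonyms from WordNet synsets (e.g. via NLTK), and all entries here are illustrative:

```python
# Hand-written property synonyms as a stand-in for WordNet-derived synsets.
# All entries are illustrative; a real table would be built per category.
PROPERTY_SYNONYMS = {
    "price": ["price", "cost", "mrp", "selling price"],
    "display size": ["display size", "screen size", "display"],
}

def resolve_property(asked, stored_keys, synonyms=PROPERTY_SYNONYMS):
    """Map the property the user asked for onto the key actually stored."""
    candidates = synonyms.get(asked, [asked])
    for key in stored_keys:
        if key in candidates:
            return key
    return None

# resolve_property("price", ["selling price", "display"]) -> "selling price"
```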