Predicting the Relevance of Search
Results for
E-Commerce Systems
Mohammed Zuhair Al-Taie
Joel Pinho Lucas
Siti Mariyam Shamsuddin
International Workshop Big Data Analytics 2015 “Multi Strategy Learning Analytics for Big Data”
Universiti Teknologi Malaysia, Kuala Lumpur, 17-18 Aug 2015
Study Published in
International Journal of Advances in Soft Computing and its Applications (IJASCA)
Introduction
► Search engines (e.g. Google.com, Yahoo.com, and
Bing.com) have become the dominant model of online
search
► Large businesses are able to hire the necessary skills to build
advanced search engines, while
♦ small online businesses still lack the ability to evaluate the
results of their search engines → losing the opportunity to
compete
► The purpose of this paper is to build an open-source model
that can:
♦ Measure the relevance of search results for online businesses
♦ Measure the accuracy of their underlying search algorithms
Data Pre-Processing
Data preprocessing usually consists of four steps:
1. Data cleaning (a.k.a. data cleansing or scrubbing): aims at
removing noise, filling in missing values and
correcting inconsistencies in data
♦ Involves the identification and removal of outliers
2. Data integration: seeks to combine data from varied
sources into a coherent data store
3. Data transformation: aims at converting data to a more
appropriate form.
♦ The goal is to have more efficient data mining operations
by making the data more understandable
4. Data reduction: aims at reducing the size of data while
minimizing any possible loss of information
Text Retrieval Systems
► Text retrieval involves retrieving the documents that
contain particular keywords for a given query
♦ Given a string T=t1 t2…tn, and a particular keyword pattern
P=p1 p2…pm, the goal is to verify whether P is a substring of
T
► Pattern matching can be applied in two ways:
♦ Forward pattern matching, where the text and pattern are
matched from left to right; and
♦ Backward pattern matching, where the matching process is
done from right to left.
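The two matching directions can be sketched as follows. This is a minimal, naive illustration rather than the method used in the study: the function names are hypothetical, and "backward" is read here as comparing each candidate window from its last character to its first.

```python
def forward_match(text, pattern):
    """Return the first index where pattern occurs in text, or -1.

    Each candidate window is compared from left to right.
    """
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:  # every character of the pattern matched
            return start
    return -1


def backward_match(text, pattern):
    """Same search, but each window is compared from right to left."""
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        j = m - 1
        while j >= 0 and text[start + j] == pattern[j]:
            j -= 1
        if j < 0:  # walked past the first character without a mismatch
            return start
    return -1


position = forward_match("wireless mouse pad", "mouse")
same_position = backward_match("wireless mouse pad", "mouse")
```

Both functions answer the same question (is P a substring of T, and where); only the order of character comparisons differs.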
CrowdFlower Dataset
► CrowdFlower created its dataset with the help of its crowd
♦ The crowd rated a list of products from a handful of
e-commerce websites on a scale from 1 to 4
♦ “4” indicates that the product completely satisfied the search
query and “1” indicates that the result did not match the
query
► 261 search terms were generated by CrowdFlower
► The dataset incorporates six attributes: id, query text,
product title, product description, median relevance and
relevance variance of the query.
♦ The train file contains 10,158 rows in total
♦ The test file contains another portion of the data with the same
attributes, except those related to query relevance
(median relevance and relevance variance); median relevance
is the label attribute to be predicted
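The slides do not say how the two relevance attributes were computed. Assuming they summarize several crowd ratings of the same query and product pair, a small worked example could look like this. The ratings are invented, and the choice of population variance is an assumption.

```python
from statistics import median, pvariance

# Hypothetical ratings (scale 1 to 4) given by five crowd workers
# to one query/product pair. Not taken from the real dataset.
ratings = [4, 4, 3, 4, 2]

# Median of the ratings: the label to be predicted.
median_relevance = median(ratings)

# Spread of the ratings; population variance is assumed here.
relevance_variance = pvariance(ratings)
```

A low variance would indicate that the raters largely agreed on the relevance of the result.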
CrowdFlower Dataset (cont.)
Data Cleaning and
Transformation
► The CrowdFlower dataset described in the previous section is just
a collection of raw tuples of text queries and product
descriptions, and it contains some noise
♦ We applied data preprocessing, including the removal of
undesired content, to improve data quality and make the data ready
for mining and analysis
► We also need to pre-process the data in order to extract
value from it and then transform it in order to:
♦ Provide an input for machine learning algorithms
► The first task to be accomplished is feature extraction
♦ We need to transform the raw text search attributes into valuable
features
♦ These features provide the input for the machine learning
algorithms
Word Match Counting
► The dataset encompasses text attributes related to textual
searches. To extract features:
♦ The first feature extraction method counts how many
words in each search text match the product title and
product description
► To implement word match counting, we took
advantage of the facilities and expressiveness of the Python
language,
♦ using widely employed packages for data
manipulation, including NumPy and SciPy
NumPy and SciPy
► NumPy is an extension to the
Python programming language,
adding support for large, multi-
dimensional arrays and matrices,
along with a large library of high-
level mathematical functions to
operate on these arrays (Wikipedia)
► SciPy (pronounced “Sigh Pie”) is
an open source Python library used
by scientists, analysts, and
engineers doing scientific
computing and technical computing
(Wikipedia)
Word Match Counting (cont.)
► The steps of word match counting were as follows:
1. First: we extracted word tokens from the text (i.e. by splitting the text
on spaces)
2. Second: we removed non-alphanumeric characters from the
text.
♦ Words with just one or two characters (mostly articles and
prepositions) were also removed
3. Third: we transformed each list of words (from the search text
attributes) into an array
► NumPy methods were applied to intersect the search
word vector with the product title and product description
vectors, respectively.
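The steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the helper names and sample strings are hypothetical, lowercasing is an added assumption (the slides do not mention case handling), and `np.intersect1d` is one NumPy routine that performs the intersection.

```python
import re

import numpy as np


def to_word_array(text):
    """Turn raw text into a NumPy array of cleaned word tokens."""
    tokens = text.lower().split()  # step 1: split on whitespace (lowercasing is an assumption)
    cleaned = [re.sub(r"[^a-z0-9]", "", t) for t in tokens]  # step 2: strip non-alphanumerics
    kept = [t for t in cleaned if len(t) > 2]  # drop words of one or two characters
    return np.array(kept, dtype=str)  # step 3: list of words -> array


def match_count(search_text, field_text):
    """Count the distinct words shared by the search text and a product field."""
    shared = np.intersect1d(to_word_array(search_text), to_word_array(field_text))
    return int(shared.size)


# Hypothetical row, for illustration only.
query = "wireless mouse"
title = "Logitech Wireless Mouse M185"
description = "A compact mouse, with USB receiver."

feature1 = match_count(query, title)        # words shared with the title
feature2 = match_count(query, description)  # words shared with the description
```

Note that `np.intersect1d` returns unique values, so a word repeated in the title is counted once.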
Word Match counting with tf-idf
► At this point we simply counted how many words
remained in each intersection,
♦ where S={t1,t2,…,tn} is the search words (terms) vector,
D={t1,t2,…,tn} the product description words vector and
T={t1,t2,…,tn} the product title words vector:
Feature1 = count(T ∩ S)
Feature2 = count(D ∩ S)
► Since tf-idf weight-based algorithms are broadly
employed by e-commerce and publishing companies for
text retrieval and searching →
♦ We decided to use tf-idf to enhance the
expressiveness of the two features we acquired so far
TF-IDF
► Term Frequency-Inverse Document Frequency (TF-IDF)
produces a weight for each term f_i in each document d_j.
► It is basically a combination of two earlier methods: Term
Frequency (TF) and Inverse Document Frequency (IDF).
tf-idf is defined as in Eq. 1:
tf-idf(f_i, d_j) = tf_ij × idf(f_i)    (1)
► In order to acquire word weights we used the
TfidfVectorizer class available in the scikit-learn Python
package
♦ scikit-learn is a machine learning library for Python to serve
different purposes such as classification, regression and
clustering
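A minimal sketch of obtaining per-word weights with `TfidfVectorizer` is shown below. The documents, the helper `weights_for` and the choice of `stop_words="english"` are assumptions made for illustration; the slides do not state which parameters were used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical product titles standing in for the real documents.
documents = [
    "the red leather wallet",
    "the red cotton shirt",
    "the red wool scarf",
]

# stop_words="english" drops common English words such as "the" (assumed setting).
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(documents)  # sparse matrix: documents x terms

# vocabulary_ maps each term to its column index; invert it to read terms back.
column_to_term = {col: term for term, col in vectorizer.vocabulary_.items()}


def weights_for(doc_index):
    """Return {word: tf-idf weight} for one document (hypothetical helper)."""
    start, end = matrix.indptr[doc_index], matrix.indptr[doc_index + 1]
    columns = matrix.indices[start:end]
    values = matrix.data[start:end]
    return {column_to_term[int(c)]: float(w) for c, w in zip(columns, values)}


wallet_weights = weights_for(0)
```

In this toy corpus "red" appears in every document while "leather" appears in only one, so "leather" receives the higher weight within the first document.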
Word Match counting with tf-idf
(cont.)
► Before calculating weights, stop words are ignored: words that are
irrelevant for searching, such as articles,
prepositions and pronouns
► The features we extract now are much more expressive,
since they take into account the weights of word terms,
♦ Higher weights are given to rare terms and lower weights
to terms that are highly frequent across all documents
► Feature values are now the sum of all intersection word
weights
♦ Instead of merely using strings to represent words in
vectors, we now use a Python dictionary to represent
words along with their weights, where words are the
keys and weights are the values
Word Match counting with tf-idf
(cont.)
► In this way, we have
♦ a search dictionary Sdict={t1: w1, t2: w2, …, tn: wn},
♦ a product description words dictionary Ddict={t1: w1, t2: w2,
…, tn: wn},
♦ a product title words dictionary
Tdict={t1: w1, t2: w2, …, tn: wn}
♦ We may also represent the keys of each dictionary as a vector, e.g.
Skeys={tk1, tk2, …, tkn}
► Thus, features are calculated as follows:
Feature1 = sum(values(Tkeys ∩ Skeys))
Feature2 = sum(values(Dkeys ∩ Skeys))
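The weighted features can be sketched with plain dictionaries. The formulas do not say whose weights are summed for the shared words; this sketch assumes the weights from the product field (title or description). The function name and all weights below are made up for illustration.

```python
def weighted_match(search_weights, field_weights):
    """Sum the field's weights over the words it shares with the search text.

    Assumption: the weights summed are those of the product field,
    not those of the search dictionary.
    """
    shared_words = search_weights.keys() & field_weights.keys()
    return sum(field_weights[word] for word in shared_words)


# Hypothetical word -> weight dictionaries.
s_dict = {"wireless": 0.7, "mouse": 0.4}
t_dict = {"logitech": 0.8, "wireless": 0.5, "mouse": 0.3}
d_dict = {"compact": 0.6, "mouse": 0.2, "receiver": 0.7}

feature1 = weighted_match(s_dict, t_dict)  # title: "wireless" and "mouse" are shared
feature2 = weighted_match(s_dict, d_dict)  # description: only "mouse" is shared
```

Compared with plain counting, a shared rare word now contributes more to the feature than a shared common word.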
Data Analysis and Prediction
► The main goal with CrowdFlower data is to predict the
relevance of search queries.
♦ The label attribute for this study will be the median relevance
of the search
► We first tried to predict the values of median relevance
for the CrowdFlower test set rows employing a Support Vector
Machine (SVM)
♦ Using the implementation from scikit-learn
► We also tried the CrowdFlower data with a more sophisticated
algorithm: Random Forest
♦ Random Forest is an ensemble learning method that is
widely employed in machine learning
♦ It is implemented in scikit-learn and widely used by its user community
scikit-learn
• scikit-learn is an open source machine learning library for
Python
• It features various classification, regression and clustering
algorithms including support vector machines, random
forests, gradient boosting, k-means and DBSCAN
• It is designed to interoperate with NumPy and SciPy.
Data Analysis and Prediction
(cont.)
► After acquiring the training data and storing it properly in a
data frame, learning in scikit-learn takes three
simple steps:
1. initializing the model,
2. fitting it to the training data, and
3. predicting new values
► We stored the train set provided by CrowdFlower,
represented by our extracted features, in a data frame and provided it
as input to both algorithms.
► After initializing the models and fitting them to the train data, we can predict
the label attribute (median relevance of the search terms)
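The three steps can be sketched for both algorithms as follows. This is a toy illustration, not the study's code: the feature values and labels are invented, the task is treated as classification (an assumption, since the slides do not say whether a classifier or a regressor was used), and NumPy arrays are used in place of a data frame, which scikit-learn estimators also accept.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Hypothetical rows of [Feature1, Feature2] with median relevance labels.
X_train = np.array([
    [0.0, 0.1], [0.1, 0.0],   # little overlap with the query
    [0.5, 0.4], [0.4, 0.6],   # partial overlap
    [0.9, 1.2], [1.1, 0.8],   # strong overlap
])
y_train = np.array([1, 1, 2, 2, 4, 4])
X_test = np.array([[1.0, 1.0], [0.05, 0.05]])

# Step 1: initialize the models (default SVC settings; the forest size
# and seed are arbitrary choices made for this sketch).
models = {
    "svm": SVC(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)                # step 2: fit to the training data
    predictions[name] = model.predict(X_test)  # step 3: predict new values
```

Because both estimators share the same `fit` and `predict` interface, swapping one algorithm for the other requires changing only the initialization line.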
Learning Models Benchmark
► We describe the scores obtained by running the
four combinations of methods we applied:
♦ SVM with word match counting based features,
♦ SVM with tf-idf based features,
♦ Random Forest with word match counting based
features, and
♦ Random Forest with tf-idf based features
Learning Models Benchmark
(cont.)
► The best score was obtained with Random Forest with
tf-idf based features.
► We can also notice that Random Forest obtained a better
score than SVM and,
► likewise, that tf-idf based features obtained better results than word
match counting based features.
► However, the gain from applying tf-idf in
preprocessing was substantially larger than the gain from using Random
Forest instead of SVM

                 Match Counting   tf-idf
SVM              0.51241          0.57654
Random Forest    0.53834          0.59211
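The comparison between the two kinds of gain follows directly from the table. The short calculation below only subtracts the reported scores; the variable names are hypothetical.

```python
# Scores copied from the benchmark table.
scores = {
    ("SVM", "match counting"): 0.51241,
    ("SVM", "tf-idf"): 0.57654,
    ("Random Forest", "match counting"): 0.53834,
    ("Random Forest", "tf-idf"): 0.59211,
}

# Gain from switching features (match counting -> tf-idf), per model.
tfidf_gain = {
    model: scores[(model, "tf-idf")] - scores[(model, "match counting")]
    for model in ("SVM", "Random Forest")
}

# Gain from switching model (SVM -> Random Forest), per feature set.
forest_gain = {
    features: scores[("Random Forest", features)] - scores[("SVM", features)]
    for features in ("match counting", "tf-idf")
}

for model, gain in tfidf_gain.items():
    print(f"tf-idf gain with {model}: {gain:.5f}")
for features, gain in forest_gain.items():
    print(f"Random Forest gain with {features}: {gain:.5f}")
```

Switching to tf-idf adds about 0.064 (SVM) and 0.054 (Random Forest), while switching to Random Forest adds about 0.026 (match counting) and 0.016 (tf-idf), so the smallest feature gain is still roughly twice the largest model gain.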
Conclusion
► With our benchmark on the CrowdFlower test set, we could
attest that Random Forest is an efficient machine-learning
algorithm.
► Random Forest showed better accuracy than a
simple SVM implementation. This is due, in part, to:
♦ The ensemble nature of Random Forest, which combines
multiple learners (decision trees)
♦ The nature of the dataset, which also works to the
disadvantage of SVM, since that algorithm is likely
to perform poorly when the number of
features is much greater than the number of samples
Conclusion (cont.)
► Employing tf-idf for feature preprocessing together with Random
Forest is a powerful and effective approach for predicting
and measuring the relevance of text search in e-commerce
scenarios.
► Such an approach suits small e-commerce businesses immersed
in big data needs.
► The use of the scikit-learn package, as well as other
Python packages (NumPy and SciPy), saves substantial
development time
Conclusion (cont.)
► The preprocessing step is crucial for the whole data
analysis because:
1. It consumes most of the time and implementation effort
needed for the whole analysis
2. Preprocessing may be more critical for precision than the
machine-learning algorithm itself →
♦ SVM combined with tf-idf showed a higher score than
Random Forest combined with word match counting
Thank you

7.pdf This presentation captures many uses and the significance of the number...
 
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
 
Russian Call Girls In Gurgaon ❤️8448577510 ⊹Best Escorts Service In 24/7 Delh...
Russian Call Girls In Gurgaon ❤️8448577510 ⊹Best Escorts Service In 24/7 Delh...Russian Call Girls In Gurgaon ❤️8448577510 ⊹Best Escorts Service In 24/7 Delh...
Russian Call Girls In Gurgaon ❤️8448577510 ⊹Best Escorts Service In 24/7 Delh...
 
VIP Call Girls In Saharaganj ( Lucknow ) 🔝 8923113531 🔝 Cash Payment (COD) 👒
VIP Call Girls In Saharaganj ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment (COD) 👒VIP Call Girls In Saharaganj ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment (COD) 👒
VIP Call Girls In Saharaganj ( Lucknow ) 🔝 8923113531 🔝 Cash Payment (COD) 👒
 
VVVIP Call Girls In Greater Kailash ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
VVVIP Call Girls In Greater Kailash ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...VVVIP Call Girls In Greater Kailash ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
VVVIP Call Girls In Greater Kailash ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
 
Monte Carlo simulation : Simulation using MCSM
Monte Carlo simulation : Simulation using MCSMMonte Carlo simulation : Simulation using MCSM
Monte Carlo simulation : Simulation using MCSM
 
M.C Lodges -- Guest House in Jhang.
M.C Lodges --  Guest House in Jhang.M.C Lodges --  Guest House in Jhang.
M.C Lodges -- Guest House in Jhang.
 
Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...
 
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
 
Organizational Transformation Lead with Culture
Organizational Transformation Lead with CultureOrganizational Transformation Lead with Culture
Organizational Transformation Lead with Culture
 

Predicting the relevance of search results for e-commerce systems

  • 1. Predicting the Relevance of Search Results for E-Commerce Systems Mohammed Zuhair Al-Taie Joel Pinho Lucas Siti Mariyam Shamsuddin International Workshop Big Data Analytics 2015 “Multi Strategy Learning Analytics for Big Data” Universiti Teknologi Malaysia, Kuala Lumpur, 17-18 Aug 2015 Study Published in International Journal of Advances in Soft Computing and its Applications (IJASCA)
  • 2. Introduction ► Search engines (e.g. Google.com, Yahoo.com, and Bing.com) have become the dominant model of online search ► Large businesses are able to hire the necessary skills to build advanced search engines, while ♦ small online businesses still lack the ability to evaluate the results of their search engines, thus losing the opportunity to compete ► The purpose of this paper is to build an open-source model that can: ♦ Measure the relevance of search results for online businesses ♦ Measure the accuracy of their underlying search algorithms
  • 3. Data Pre-Processing Data preprocessing usually consists of four steps: 1. Data cleaning (a.k.a. data cleansing or scrubbing): aims at removing noise, filling in missing values and correcting inconsistencies in data ♦ Involves the identification and removal of outliers 2. Data integration: seeks to combine data from varied sources into a coherent data store 3. Data transformation: aims at converting data to a more appropriate form ♦ The goal is to have more efficient data mining operations by making the data more understandable 4. Data reduction: aims at reducing the size of data while minimizing any possible loss of information
  • 4. Text Retrieval Systems ► Text retrieval involves retrieving the documents that contain particular keywords for a given query ♦ Given a string T=t1 t2…tn, and a particular keyword pattern P=p1 p2…pm, the goal is to verify whether P is a substring of T ► Pattern matching can be applied in two ways: ♦ forward pattern matching, where the text and pattern are matched in the forward direction; and ♦ Backward pattern matching, where the matching process is done from right to left.
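The substring check described in the slide above can be sketched in plain Python. This is a minimal, naive illustration and not the authors' implementation: the function names `forward_match` and `backward_match` are hypothetical, and both variants slide the pattern across the text from left to right, differing only in the direction in which each window is compared.

```python
def forward_match(text: str, pattern: str) -> int:
    """Return the first index where pattern occurs in text, or -1.

    Each candidate window is compared left to right.
    """
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:  # every character of the pattern matched
            return start
    return -1


def backward_match(text: str, pattern: str) -> int:
    """Same search, but each window is compared right to left."""
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        j = m - 1
        while j >= 0 and text[start + j] == pattern[j]:
            j -= 1
        if j < 0:  # walked past the first pattern character: full match
            return start
    return -1


position_forward = forward_match("red leather wallet", "leather")
position_backward = backward_match("red leather wallet", "leather")
```

Both calls locate the same occurrence; right-to-left comparison is the basis of faster algorithms that skip ahead on a mismatch, which this sketch does not attempt.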
  • 5. CrowdFlower Dataset ► CrowdFlower created its dataset with the help of its crowd ♦ The crowd rated a list of products from a handful of e-commerce websites on a scale from 1 to 4 ♦ “4” indicates that the product completely satisfied the search query and “1” indicates that the result did not match the search query ► 261 search terms were generated by CrowdFlower ► The dataset incorporated six attributes: id, query text, product title, product description, median relevance and relevance variance of the query. ♦ The train file contains 10,158 rows in total ♦ The test file contains another portion of data with the same attributes, except those related to query relevance (median relevance and relevance variance); median relevance is the label attribute to be predicted
  • 7. Data Cleaning and Transformation ► The CrowdFlower dataset described in the previous section is just a collection of raw tuples related to text queries and product descriptions, and it contains some noise ♦ We applied data preprocessing, including the removal of undesired content, to improve data quality and make the data ready for mining and analysis ► We also need to pre-process the data in order to extract some value and then transform it in order to: ♦ Provide an input for machine learning algorithms ► The first task to be accomplished is feature extraction ♦ We need to transform raw text search attributes into valuable features ♦ These features provide the input for the machine learning algorithms
  • 8. Word Match Counting ► The dataset encompasses text attributes related to textual searches. To extract features: ♦ The first feature extraction method counts how many words in each text search match the product title and the product description ► To implement word match counting, we took advantage of the facilities and expressiveness of the Python language ♦ We used widely employed Python packages for data manipulation, including NumPy and SciPy
  • 9. NumPy and SciPy ► NumPy is an extension to the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays (Wikipedia) ► SciPy (pronounced “Sigh Pie”) is an open source Python library used by scientists, analysts, and engineers doing scientific computing and technical computing (Wikipedia)
  • 10. Word Match Counting (cont.) ► The steps of word match counting were as follows: 1. First: we extracted word tokens from text (i.e., splitting text by spaces) 2. Second: we removed non-alphanumeric characters from text. ♦ Words with just one or two characters (mostly articles and prepositions) were also removed 3. Third: we transformed each list of words (from search text attributes) into an array ► NumPy methods were applied for intersecting the search word vector with the product title and product description vectors, respectively.
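The three steps and the NumPy intersection described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the helper names `tokenize` and `match_count` are hypothetical, lowercasing is an added assumption (the slides do not mention case handling), and `np.intersect1d` is one NumPy routine that fits the description, although the slides do not name the exact call that was used.

```python
import re

import numpy as np


def tokenize(text: str) -> np.ndarray:
    """Split on whitespace, strip non-alphanumerics, drop 1-2 character words."""
    words = []
    for raw in text.lower().split():  # lowercasing is an assumption
        word = re.sub(r"[^a-z0-9]", "", raw)
        if len(word) > 2:
            words.append(word)
    return np.array(words, dtype=str)


def match_count(query: str, field: str) -> int:
    """Count distinct query words that also appear in the field text."""
    shared = np.intersect1d(tokenize(query), tokenize(field))
    return int(shared.size)


query = "red leather wallet"
title = "Leather Wallet for Men (Red)"
description = "A slim wallet made of genuine leather."

title_matches = match_count(query, title)
description_matches = match_count(query, description)
```

Here the query shares three words with the title and two with the description, so the two counts differ even though both fields describe the same product.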
  • 11. Word Match counting with tf-idf ► At this point we simply counted how many words remained in each intersection ♦ With S={t1,t2,…,tn} the search words (terms) vector, D={t1,t2,…,tn} the product description words vector and T={t1,t2,…,tn} the product title words vector: Feature1 = count(T ∩ S), Feature2 = count(D ∩ S) ► Since tf-idf weight-based algorithms are broadly employed by e-commerce and publishing companies for text retrieval and searching, ♦ we decided to use tf-idf to enhance the expressiveness of the two features acquired so far
  • 12. TF-IDF ► Term Frequency-Inverse Document Frequency (TF-IDF) produces a weight for each term f_i in each document d_j. ► It is basically a combination of two earlier methods: Term Frequency (TF) and Inverse Document Frequency (IDF). tf-idf is defined as in Eq. 1: tf-idf(f_i, d_j) = tf_ij * idf(f_i) ► In order to acquire word weights we used the TfidfVectorizer class available in the scikit-learn Python package ♦ scikit-learn is a machine learning library for Python that serves different purposes such as classification, regression and clustering
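A small sketch of obtaining word weights with `TfidfVectorizer` may help here. The three toy documents are invented for illustration, and the parameters the authors passed are not stated in the slides; only `stop_words="english"` is set below. Note that scikit-learn's defaults use a smoothed IDF and L2-normalise each document row, so the numbers differ from the plain product in Eq. 1 while preserving the idea that rarer terms get higher weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "leather" appears in every document, "belt" in one.
docs = [
    "red leather wallet",
    "black leather wallet",
    "leather belt",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)  # sparse matrix: documents x terms

vocab = vectorizer.vocabulary_  # term -> column index
idf = vectorizer.idf_           # one IDF value per column

# Word weights of the first document as a {word: weight} dictionary.
weights = {
    term: float(matrix[0, col])
    for term, col in vocab.items()
    if matrix[0, col] > 0
}
```

In the first document, "red" receives the largest weight because it occurs in only one of the three documents, while "leather" receives the smallest.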
  • 13. Word Match counting with tf-idf (cont.) ► Before calculating weights, stop words are ignored, which means that words irrelevant for searching, such as articles, prepositions and pronouns, are discarded ► The features we extract now are much more expressive, since they take into account the weights of word terms ♦ Higher weights are related to rare terms and lower weights are related to highly frequent terms across all documents ► Feature values are now the sum of all intersection word weights ♦ Instead of merely using a string to represent words in vectors, we now use a Python dictionary to represent words along with their weights, where words are the keys and weights are the values
  • 14. Word Match counting with tf-idf (cont.) ► In this way, we have ♦ a search dictionary Sdict={t1: w1, t2: w2, …, tn: wn}, ♦ a product description words dictionary Ddict={t1: w1, t2: w2, …, tn: wn}, ♦ a product title words dictionary Tdict={t1: w1, t2: w2, …, tn: wn} ♦ We may also represent, as vectors, the keys of each dictionary, e.g. Skeys={tk1, tk2, …, tkn} ► Thus, features are calculated as follows: Feature1 = sum(values(Tkeys ∩ Skeys)), Feature2 = sum(values(Dkeys ∩ Skeys))
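The dictionary-based features can be sketched with plain Python dictionaries. The slides do not say which dictionary the summed weights are taken from, so this sketch assumes the weights of the title (or description) dictionary are summed over the shared keys; the function name `weighted_match` and the example weights are hypothetical.

```python
from typing import Mapping


def weighted_match(search: Mapping[str, float], field: Mapping[str, float]) -> float:
    """Sum the field's weights over the words shared with the search terms.

    Assumption: weights come from the field dictionary, not the search one.
    """
    shared_keys = search.keys() & field.keys()  # set intersection of the keys
    return sum(field[word] for word in shared_keys)


# Hypothetical tf-idf weights for one query and one product.
s_dict = {"red": 0.5, "leather": 0.3, "wallet": 0.4}
t_dict = {"leather": 0.2, "wallet": 0.6, "men": 0.7}
d_dict = {"slim": 0.5, "wallet": 0.3, "genuine": 0.6}

feature1 = weighted_match(s_dict, t_dict)  # title feature
feature2 = weighted_match(s_dict, d_dict)  # description feature
```

Compared with plain counting, two products that share the same number of words with the query can now receive different feature values, depending on how rare the shared words are.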
  • 15. Data Analysis and Prediction ► The main goal with CrowdFlower data is to predict the relevance of search queries. ♦ The label attribute for this study is the median relevance of the search ► We first tried to predict the median relevance values of the CrowdFlower test set rows using a Support Vector Machine (SVM) ♦ We used the implementation from scikit-learn ► We also tried CrowdFlower data with a more sophisticated algorithm: Random Forest ♦ Random Forest is an ensemble learning method that is widely employed in machine learning ♦ It is widely used in the scikit-learn user community
  • 16. scikit-learn • scikit-learn is an open source machine learning library for Python • It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN • It is designed to interoperate with NumPy and SciPy.
  • 17. Data Analysis and Prediction (cont.) ► After acquiring the training data and storing it properly in a data frame, learning in scikit-learn takes three simple steps: 1. initializing the model, 2. fitting it to the training data, and 3. predicting new values ► We stored the train set provided by CrowdFlower, but using our extracted features, in a data frame and provided it as input to both algorithms. ► After initializing the models and fitting them to the train data, we can predict the label attribute (search terms median relevance)
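The three steps can be sketched for both algorithms as below. Several things here are assumptions rather than details from the slides: the tiny feature table is invented, NumPy arrays stand in for the data frame, classifiers (`SVC`, `RandomForestClassifier`) are used because the labels are the discrete ratings 1 to 4, and the hyperparameters are scikit-learn defaults plus a fixed `random_state`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Hypothetical rows: [Feature1 (title match), Feature2 (description match)].
X_train = np.array(
    [[0, 0], [0, 1], [1, 0], [1, 1], [2, 1], [2, 2], [3, 2], [3, 3]],
    dtype=float,
)
y_train = np.array([1, 1, 2, 2, 3, 3, 4, 4])  # median relevance labels
X_test = np.array([[0, 0], [3, 3]], dtype=float)

# Step 1: initialise the models.
models = {
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # Step 2: fit to the training data
    predictions[name] = model.predict(X_test)   # Step 3: predict new values
```

Because both estimators expose the same `fit` and `predict` methods, swapping one algorithm for the other only changes the initialisation line.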
  • 18. Learning Models Benchmark ► We report the scores obtained with the four method combinations we applied: ♦ SVM with word match counting based features, ♦ SVM with tf-idf based features, ♦ Random Forest with word match counting based features, and ♦ Random Forest with tf-idf based features
  • 19. Learning Models Benchmark (cont.) ► The best score was obtained by Random Forest with tf-idf based features. ► Random Forest obtained better scores than SVM and, ► likewise, tf-idf based features obtained better results than word match counting based features. ► However, the gain from applying tf-idf in preprocessing was substantially higher than the gain from using Random Forest instead of SVM
                       Match Counting   tf-idf
      SVM              0.51241          0.57654
      Random Forest    0.53834          0.59211
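The comparison between the two kinds of gain can be checked with simple arithmetic on the four reported scores. The snippet below only restates the benchmark numbers from the slide; the variable names are illustrative.

```python
# Scores as reported in the benchmark slide.
scores = {
    ("SVM", "match counting"): 0.51241,
    ("SVM", "tf-idf"): 0.57654,
    ("Random Forest", "match counting"): 0.53834,
    ("Random Forest", "tf-idf"): 0.59211,
}

# Gain from switching features (match counting -> tf-idf), per model.
tfidf_gain = {
    model: round(scores[(model, "tf-idf")] - scores[(model, "match counting")], 5)
    for model in ("SVM", "Random Forest")
}

# Gain from switching models (SVM -> Random Forest), per feature set.
forest_gain = {
    feature: round(scores[("Random Forest", feature)] - scores[("SVM", feature)], 5)
    for feature in ("match counting", "tf-idf")
}
```

The tf-idf gains are 0.06413 (SVM) and 0.05377 (Random Forest), while the Random Forest gains are 0.02593 (match counting) and 0.01557 (tf-idf), so changing the features moved the score roughly two to four times as much as changing the model.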
  • 20. Conclusion ► With our benchmark on the CrowdFlower test set, we could attest that Random Forest is an efficient machine-learning algorithm. ► Random Forest showed better accuracy than a simple SVM implementation. This is due, in part, to: ♦ The ensemble nature of Random Forest, which allows multiple learning algorithms to be run ♦ The nature of the dataset, which is another reason for the SVM disadvantage, since that algorithm is likely to perform worse when the number of features is much greater than the number of samples
  • 21. Conclusion (cont.) ► Employing tf-idf for preprocessing features, together with Random Forest, is a powerful and effective approach for predicting and measuring the relevance of text search in e-commerce scenarios. ► Such an approach suits small e-commerce businesses facing big data needs. ► The use of the scikit-learn package, as well as other Python packages (NumPy and SciPy), saves substantial development time
  • 22. Conclusion (cont.) ► The preprocessing step is crucial for the whole data analysis because: 1. It consumes most of the time and implementation effort needed for the whole analysis 2. Preprocessing may be more critical for precision than the machine-learning algorithm itself ♦ SVM combined with tf-idf showed a higher score than Random Forest combined with word match counting