In this paper we compare two alternative machine-learning techniques from the Apache Mahout stable: Apache Spark's spark-itemsimilarity and its Apache Hadoop MapReduce counterpart. We compare them both qualitatively and quantitatively on two ecommerce stores with different behavior, to determine which one is more effective and efficient in a given context.
2. Background:
Subjects:
• The subjects under test are two ecommerce stores.
• Sample Store 1:
– Traffic: Approximately 3 M unique visitors per month
– Transactions: 2500 transactions daily
• Sample Store 2:
– Traffic: Approximately 1 M unique visitors per month
– Transactions: 250 transactions daily
Data Gathering and setup:
• Relevant clickstream data was collected for both subjects. It captures two
kinds of user behavior: view and buy. Based on this data, item-similarity
predictive analytics were run for both stores using Apache Spark (Spark) and
Apache Hadoop MapReduce log likelihood (LLR). The subjects were observed for
one week to gather both quantitative and qualitative results.
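To make the log-likelihood scoring concrete, here is a minimal sketch of the log-likelihood ratio (the G² statistic) for a pair of items. The source does not give the formula, so this assumes the standard formulation over a 2x2 co-occurrence table of user counts. The function and parameter names are illustrative and are not taken from Mahout.

```python
import math


def x_log_x(x):
    """Return x * ln(x), using the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy term: N ln N - sum(k ln k), where N = sum(counts)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 co-occurrence table (assumed formulation).

    k11: users who interacted with both item A and item B
    k12: users who interacted with A but not B
    k21: users who interacted with B but not A
    k22: users who interacted with neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))
```

A score near zero means the two items co-occur about as often as chance would predict. A larger score indicates a stronger association, which makes the pair a better candidate for an item-similarity recommendation.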
3. Quantitative Analysis:
• We gathered data for both stores and plotted the
following data points hourly over a one-week period.
The hourly resolution explains the peaks and troughs:
activity peaks during the day and drops at night.
• Total products viewed (blue)
• Recommendations available from Apache Hadoop
MapReduce log likelihood (LLR) (red)
• Recommendations available from Apache
Spark (Spark) (grey)
• In Sample Store 1, which has more traffic and a large number
of transactions, the total product views, the products for which
we have recommendations from LLR, and the products for which
we have recommendations from Spark are almost identical. We
therefore have recommendations from both Spark and LLR for
almost all products that are viewed. In Sample Store 2, which
has fewer visitors and fewer transactions, the total product
views and the products for which we have recommendations
from LLR are again almost identical, but the recommendations
from Spark lag behind significantly: Spark yields far fewer
recommendations here than in Sample Store 1.
• Inference:
• Hence we conclude that, when there is a large number of
transactions, Spark and LLR are almost equivalent
quantitatively, that is, in the number of recommendations
they yield. When there are fewer transactions, Spark yields
significantly fewer recommendations than LLR.
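The quantitative comparison above can be sketched as a per-hour coverage count. This is a minimal illustration, not the authors' pipeline. It assumes that "recommendations available" means the number of distinct viewed products for which an engine has at least one recommendation. The function name and the input shapes are hypothetical.

```python
def hourly_coverage(viewed_by_hour, recommendable):
    """Count, per hour, the products viewed and those with recommendations.

    viewed_by_hour: dict mapping an hour label to an iterable of viewed
                    product IDs (hypothetical input shape)
    recommendable:  set of product IDs for which one engine (Spark or LLR)
                    has at least one recommendation
    Returns a dict: hour -> (distinct products viewed,
                             distinct viewed products with recommendations).
    """
    result = {}
    for hour, viewed in viewed_by_hour.items():
        distinct = set(viewed)  # count each product once per hour
        result[hour] = (len(distinct), len(distinct & recommendable))
    return result
```

Running this once with the LLR product set and once with the Spark product set gives the two recommendation series. Plotting them against the first element of each pair reproduces the kind of three-line comparison described above.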
6. Qualitative Analysis:
• We gathered data for both stores and plotted the
following data points hourly over a one-week period.
• Total products bought (purple)
• Products recommended by Apache Hadoop
MapReduce log likelihood (LLR) that were bought
(blue)
• Products recommended by Apache Spark
(Spark) that were bought (grey)
Observations:
• Sample Store 1:
• Sample Store 2:
In Sample Store 1, the total products bought and the products that were recommended by
Spark and then bought are almost equal, which suggests that most purchases were of
products recommended by Spark. However, the products that were recommended by LLR and
then bought lag behind significantly.
In Sample Store 2, the total products bought and the products that were recommended by
Spark or LLR and then bought are further apart than in Sample Store 1, which suggests
that most purchases were of products that neither Spark nor LLR recommended. Spark still
does marginally better than LLR, but the two are comparable, and both deviate from the
products that were actually bought.
Inference:
Hence we conclude that, when there is a large number of transactions, Spark is
significantly better than LLR qualitatively: almost all products that are bought
were also recommended by Spark, while LLR lags behind significantly.
When there are fewer transactions, Spark is still marginally better than LLR
qualitatively, but the products that are actually bought differ from the ones
recommended by both Spark and LLR.
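The qualitative measure above can be expressed as the share of purchases that an engine had recommended. The following is a minimal sketch under that assumption. The function name, the engine labels, and the input shapes are hypothetical, and any values passed in would be the reader's own data, not results from this study.

```python
def recommended_purchase_share(bought, recommended_by_engine):
    """Return, per engine, the fraction of purchases that were recommended.

    bought:                iterable of purchased product IDs, one entry
                           per purchase (hypothetical input shape)
    recommended_by_engine: dict mapping an engine label (for example
                           "spark" or "llr") to the set of product IDs
                           that engine recommended
    """
    purchases = list(bought)
    shares = {}
    for engine, recommended in recommended_by_engine.items():
        hits = sum(1 for product in purchases if product in recommended)
        # Avoid division by zero when no purchases were recorded.
        shares[engine] = hits / len(purchases) if purchases else 0.0
    return shares
```

A share close to 1.0 corresponds to the Sample Store 1 pattern reported for Spark, where the recommended-and-bought line nearly coincides with total purchases. A low share for every engine corresponds to the Sample Store 2 pattern.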