Time series classification (search queries)

Time Series Trend Classification:
Classifying Google Search Queries
According to Their Search Volume
Trend Over Time
Esteban Ribero
Head of Strategy - Planning & Insights, at Performics (Publicis Groupe)
(eribero@gmail.com)
May 5, 2021
Abstract
People’s queries to Google are powerful behavioral traces of their intentions
and information needs. Google classifies the trends in search volume over time
of the queries submitted to its system into 4 distinct categories: Declining,
Sustained Growth, Fast Rising, and Emerging. This report describes 3 modeling
approaches explored to develop a classifier that replicates Google’s categorization
of time series: 1) conventional tree-based models such as random forest and
gradient boosting, 2) time-series-specific models such as time series forest and k-nearest
neighbors with dynamic time warping (DTW), and 3) deep learning models such as
recurrent neural networks and 1-d convolutional neural networks. We found
conventional tree-based models, especially gradient boosting (Xgboost), to be best
suited for this task, achieving a weighted F1 score of 92% with high precision and
decent recall.
Keywords: Google’s search queries, Time Series Classification, Search Queries
Trends, Deep Learning, Random Forest, Gradient Boosting, Time Series Trends,
RISE, BOSS, MiniRocket, RNNs, CNNs.
Google’s search data are a powerful source of information about people’s intentions and
interests in content. Google makes some of these data available to different users (the public,
advertising agencies, marketers, etc.) via different tools. A common use of the data is to look at
the trends in search volume over time to learn about the changing intentions and needs of
consumers of different product categories. Google classifies these trends into four distinct
categories: Declining, Sustained Growth, Fast Rising, and Emerging. Figure 1 shows examples
of Declining and Emerging search queries. This typification of time-series trends is a simple yet
powerful way to group search queries with similar trends over time, making the analysis and
identification of important trends faster and more meaningful.
Figure 1. Examples of Declining and Emerging search queries according to their search
volume trend over time.
The challenge is that these labels are available only in a few tools and only for certain
product categories, while other relevant data, including the search volume over time for a broader
set of search queries, are available in other tools. A classifier that labels search queries
according to their trends over time would be a powerful insights tool that can be combined with
other tools that have been developed to analyze search queries for topic extraction and intent
identification using Natural Language Processing (NLP). This report describes the method and
results of the development of such a classifier.
Time Series Classification Overview
Time Series Classification (TSC) is a special case of the classification problem in
supervised machine learning. The important difference between TSC and traditional
classification problems is that the attributes are ordered (Bagnall et al., 2017). This ‘temporal’
ordering is not limited to time; it applies to any situation where there might be
discriminatory features that depend on the ordering. In fact, ‘time series’ data are present in
almost every task requiring some sort of human cognition, making TSC an important and
challenging problem in data mining (Fawaz et al., 2019). Since this is an
active area of research in machine learning, a large number of algorithms and approaches have been
developed over the years (Bagnall et al., 2017; Fawaz et al., 2019). Following is a short
description of a few of the most popular approaches.
Conventional tree-based models. TSC problems can be cast as traditional classification
problems when the order of the features is not considered. In this approach, feature extraction and
feature engineering are an important part of the process: features are extracted or
constructed, and conventional classification algorithms such as Random Forest or Gradient
Boosting classifiers are used.
Time-series-specific non-deep learning models. To take advantage of the ordered
nature of the values in time series, time-series-specific classifiers have been developed. In this
section, a few non-deep learning time-series-specific models are described. Deep-learning
models are described in a separate section.
1-Nearest Neighbor with Dynamic Time Warping (1-NN-DTW) is one of the most
popular and long-standing benchmarks for time series classification (Bagnall et al., 2017). It is the
conventional k-nearest neighbors (KNN) algorithm, with k = 1, using Dynamic Time Warping (DTW)
as the distance measure instead of the Euclidean distance. DTW measures the similarity between
two sequences that may not align exactly in time, speed, or length.
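As an illustration of the idea, the following is a minimal sketch of a DTW distance and a 1-nearest-neighbor prediction built on it. It uses the classic dynamic programming formulation with an absolute-difference cost and no warping window; these choices, the function names, and the toy series are assumptions made for the example and are not taken from the study.

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping distance with an absolute-difference local cost."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three allowed moves
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def predict_1nn_dtw(train_series, train_labels, query):
    """Return the label of the training series closest to `query` under DTW."""
    best = min(range(len(train_series)),
               key=lambda k: dtw_distance(train_series[k], query))
    return train_labels[best]

# Toy data, made up for illustration only
train = [[0, 10, 20, 30, 40], [40, 30, 20, 10, 0]]
labels = ["rising", "declining"]
shifted = [0, 0, 10, 20, 30, 40]  # same shape as the first series, delayed by one step
prediction = predict_1nn_dtw(train, labels, shifted)
```

The delayed series has a different length and is offset by one step, yet its DTW distance to the first training series is zero because the warping path absorbs the delay.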
Time Series Forest Classifier (TSF) is an adaptation of the Random Forest classifier to
time series data. It splits the series into random intervals, extracts summary features such as the
mean, standard deviation, and slope from each interval, and trains a decision tree on the extracted
features. It repeats this process for a set number of trees and classifies the series according to
majority vote (Deng et al., 2013).
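The interval-based feature extraction described above can be sketched as follows. This is a simplified illustration only: the way intervals are sampled, the minimum interval length, and all function names are assumptions, and the decision tree that would be trained on each feature row is not shown.

```python
import random
import statistics

def interval_features(series, start, end):
    """Mean, standard deviation, and least-squares slope of series[start:end]."""
    window = series[start:end]
    n = len(window)
    mean = statistics.fmean(window)
    std = statistics.pstdev(window)
    x_mean = (n - 1) / 2
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - mean) for x, y in enumerate(window))
    slope = sxy / sxx if sxx else 0.0
    return mean, std, slope

def random_intervals(length, n_intervals, rng, min_len=3):
    """Draw random [start, end) intervals of at least `min_len` points."""
    intervals = []
    for _ in range(n_intervals):
        size = rng.randint(min_len, length)
        start = rng.randint(0, length - size)
        intervals.append((start, start + size))
    return intervals

def transform(series, intervals):
    """Concatenate the three summary features of every interval into one row."""
    row = []
    for start, end in intervals:
        row.extend(interval_features(series, start, end))
    return row

rng = random.Random(0)
intervals = random_intervals(24, 5, rng)        # one set of intervals per tree
row = transform([float(v) for v in range(24)], intervals)
```

In a full Time Series Forest, each tree would draw its own set of intervals, so different trees see different summaries of the same series.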
Random Interval Spectral Ensemble (RISE) is a popular variant of TSF. It uses a
single time interval per tree, instead of multiple intervals as TSF does, and it is trained using
spectral features extracted from the series instead of summary statistics. RISE uses several
series-to-series feature extraction transformers such as fitted auto-regressive coefficients,
estimated autocorrelation coefficients, and power spectrum coefficients.
BOSS Ensemble (Schäfer, 2015) is an ensemble of dictionary-based classifiers that
transform time series values into a sequence of discrete ‘words’ using a truncated Discrete
Fourier Transform. The distribution of the extracted words is then the basis of the classification.
Dictionary-based algorithms are useful when the frequency of repetition of subseries (words) is
more important than their presence or absence (Bagnall, et al., 2017).
Contractable BOSS (cBOSS) is a variation of the original BOSS Ensemble algorithm
with several improvements in terms of memory requirements and speed.
Deep-learning models for time series data. Popular Deep Neural Networks (DNNs) such
as Recurrent Neural Networks (RNNs) with the different variations of recurrent units
(simple RNN, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM)) and
1-Dimensional Convolutional Neural Networks (1D-CNNs) can be used for time series
classification (Chollet, 2018). Sophisticated architectures such as ResNet or hybrid approaches
such as ROCKET (RandOm Convolutional KErnel Transform) are now reaching state-of-the-art
performance for TSC tasks with fewer computational requirements (Dempster, et al., 2020a).
MiniRocket (Dempster, et al., 2020b) is the most current and optimized version of
ROCKET. The algorithm first transforms the time series using random convolutional kernels,
such as those used in a CNN, and then trains a linear classifier (usually a Ridge Classifier) with
these features. Unlike typical CNNs, ROCKET uses a large number of kernels of many different
kinds (10,000 is the default). The random lengths, dilations, paddings, weights, and biases of these
kernels allow ROCKET to capture a wide range of information, making it a formidable time series
classifier that is fast and achieves state-of-the-art accuracy.
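To make the transform step more concrete, here is a rough sketch of a ROCKET-style random kernel transform. It is a simplified illustration, not the reference implementation and not MiniRocket: the sampling choices below (kernel lengths of 7, 9, or 11, Gaussian weights, uniform bias, power-of-two style dilation, and two pooled features per kernel) are stated as assumptions, and the linear classifier trained on the features is not shown.

```python
import math
import random

def random_kernel(rng, series_length):
    """Draw one random kernel: (weights, bias, dilation, padding)."""
    if series_length < 11:
        raise ValueError("this sketch assumes at least 11 observations")
    length = rng.choice([7, 9, 11])
    weights = [rng.gauss(0.0, 1.0) for _ in range(length)]
    mean_w = sum(weights) / length
    weights = [w - mean_w for w in weights]  # centre the weights on zero
    bias = rng.uniform(-1.0, 1.0)
    # Largest exponent for which the dilated kernel still fits in the series
    max_exp = math.log2((series_length - 1) / (length - 1))
    dilation = int(2 ** rng.uniform(0.0, max_exp))
    padding = ((length - 1) * dilation) // 2 if rng.random() < 0.5 else 0
    return weights, bias, dilation, padding

def apply_kernel(series, kernel):
    """Convolve and pool: returns (max value, proportion of positive values)."""
    weights, bias, dilation, padding = kernel
    padded = [0.0] * padding + list(series) + [0.0] * padding
    span = (len(weights) - 1) * dilation
    outputs = [
        bias + sum(w * padded[i + j * dilation] for j, w in enumerate(weights))
        for i in range(len(padded) - span)
    ]
    ppv = sum(o > 0 for o in outputs) / len(outputs)
    return max(outputs), ppv

def rocket_features(series, kernels):
    """Two features per kernel, concatenated into one row."""
    features = []
    for kernel in kernels:
        features.extend(apply_kernel(series, kernel))
    return features

rng = random.Random(0)
kernels = [random_kernel(rng, 24) for _ in range(100)]  # far fewer than 10,000, for brevity
features = rocket_features([float(v) for v in range(24)], kernels)
```

The same set of kernels must be reused for every series so that each column of the resulting feature matrix means the same thing across queries.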

Data
The data set consists of 5642 search queries and their monthly search volume for 2 full years
(24 months each), collected from Google’s Insights Finder tool. It was assembled by selecting
search queries from 12 different product categories (shoes, cars, electronics, bicycles, pet
products, etc.) and sampling queries from different calendar months, so that highly seasonal months
such as December or November would not always be months 12 and 24 or 11 and 23. The data
are already scaled and indexed to the maximum, so for each period the search volume takes
values from 0 to 100. The original data came with 36 observations for each time series and the
corresponding label, as well as the average monthly search volume in absolute terms. Since the
data sets on which the classifier would be used in production often have 24 observations
for each time series (the last two years of monthly data), we discarded the first 12 months of data
for each time series, keeping the last 24. Since Google’s rules for classifying the trends use daily
search volume (in absolute terms, not indexed), it is possible that once the search volume is
aggregated to a monthly basis and indexed, several search queries originally in different classes
would appear to belong to the same class, introducing some noise to the data. So, some noise in
the data is expected. There is a lot of variability within each of the categories, as can be
observed in Figure 2, but their general trend can be easily identified. There are several outliers,
particularly for the Emerging category, and it is likely that those would be better represented by
the Fast Rising category.
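The two preparation steps mentioned above, indexing a series to its maximum and keeping the last 24 months, can be sketched as follows. The helper names are hypothetical, and whether the series were re-indexed after trimming is not stated in this report, so the sketch keeps the two steps separate.

```python
def index_to_max(volumes):
    """Rescale a series so that its largest value becomes 100."""
    peak = max(volumes)
    if peak == 0:
        return [0.0 for _ in volumes]
    return [100.0 * v / peak for v in volumes]

def keep_last(series, n_months=24):
    """Drop the oldest observations, keeping the most recent `n_months`."""
    return list(series[-n_months:])

# Toy values, made up for illustration only
indexed = index_to_max([50, 200, 100])
trimmed = keep_last(list(range(36)))
```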

Figure 2. Box and whisker plot for time series in each category. Although there is great
variability at each time period within each category and between time periods, the general
pattern represented by the median and the interquartile range is distinct for each category.
Feature Engineering. To identify potential features for the conventional Random Forest
and Gradient Boosting models, a thorough exploratory data analysis was performed. The
following three groups of features were extracted and used for modeling:
Group 1. Features exploiting differences across time periods (Figure 3):
• Last 3 months of most recent year vs prior year (Q4Y2vsQ4Y1).
• Last 3 months of most recent year vs prior 3 months (Q4Y2vsQ3Y2).
• Last 6 months of most recent year vs prior year (H2Y2vsH2Y1).
• Last 6 months of most recent year vs first 6 months (H2Y2vsH1Y2).
• Year 2 vs year 1 (Y1vsY2).
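A minimal sketch of how the Group 1 features could be computed from a 24-month series is shown below. The report does not state whether "vs" denotes a difference or a ratio; this sketch assumes a difference of period means, and the function name is hypothetical. Note that "Q4" and "H2" here refer to the last 3 and last 6 months of each 12-month window, not to calendar quarters.

```python
from statistics import fmean

def period_features(series):
    """Period-over-period differences for 24 monthly indexed values, oldest first."""
    if len(series) != 24:
        raise ValueError("expected 24 monthly values")
    y1, y2 = series[:12], series[12:]
    q4y1, q3y2, q4y2 = y1[9:], y2[6:9], y2[9:]
    h2y1, h1y2, h2y2 = y1[6:], y2[:6], y2[6:]
    # Assumption: each feature is a difference of means (a ratio is equally plausible)
    return {
        "Q4Y2vsQ4Y1": fmean(q4y2) - fmean(q4y1),
        "Q4Y2vsQ3Y2": fmean(q4y2) - fmean(q3y2),
        "H2Y2vsH2Y1": fmean(h2y2) - fmean(h2y1),
        "H2Y2vsH1Y2": fmean(h2y2) - fmean(h1y2),
        "Y1vsY2": fmean(y2) - fmean(y1),
    }

# Toy series rising by one index point per month, for illustration only
features = period_features(list(range(24)))
```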
Figure 3. Box and whisker plot for features taking advantage of indexed-search-volume
differences across different time periods. Declining search queries are the most distinctive,
followed by Sustained Growth. Fast Rising queries often overlap with Emerging and Sustained
Growth queries.
Group 2. Descriptive features (Figure 4):
• Mean across 24 months
• Median across 24 months
• Standard Deviation across 24 months
• Min across 24 months
• Average monthly searches in absolute numbers (Volume)

Figure 4. Box and whisker plot for descriptive features. Emerging search queries are the most
distinctive across these features followed by Fast Rising. Sustained Growth and Declining
queries overlap often.

Group 3. Trend shape over time (Figure 5). To capture the shapes of the trends over time
with a few attributes, a 3rd order polynomial trend line was fitted to each of the 5642 time series,
and each coefficient of the resulting polynomial was added as a
feature. Figure 6 shows some examples of the original time series and their corresponding
polynomial trend line.
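The trend-shape features can be illustrated with a small least-squares polynomial fit. The sketch below solves the normal equations in plain Python so that it has no dependencies; in practice a library routine such as numpy.polyfit would typically be used instead. Scaling the time axis to [0, 1] is an assumption made here for numerical stability, and the scaling used in the study is not stated.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients, highest power first."""
    n = degree + 1
    # Normal equations A c = b with A[i][j] = sum(x^(i+j)) and b[i] = sum(y * x^i)
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution; coeffs[i] multiplies x^i
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs[::-1]

# Toy series following an exact cubic, for illustration only
xs = [i / 23 for i in range(24)]           # 24 months mapped to [0, 1]
ys = [2 * x ** 3 - x + 5 for x in xs]
trend_features = polyfit(xs, ys, 3)        # four coefficients used as features
```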
Figure 5. Box and whisker plot for trendline-shape features. The coefficients of the 3rd order
polynomial trend line appear to be powerful for differentiating Emerging search queries, while they
seem to overlap often for the other categories.

Figure 6. Sample of time series and their 3rd order polynomial trendline by category. The
trendlines make it easier to visually identify the trend over time. For instance, the inflection
points appear more dramatic for Emerging and Fast Rising search queries than for the other ones.
The downward trend is also evident for Declining queries. However, the shape of the trends is
often similar, and only the rotation and relative position on the y-axis appear to be the key
differentiating characteristics.
Methods
Models. Different versions of the models described in the TSC overview section above
(except for ResNet) were trained and tested with different data inputs.
For the conventional tree-based models, 4 data inputs were used: 1) The original data set
with monthly data plus average monthly search volume (no feature engineering). 2) The same
data set but aggregated by quarters to reduce variability. 3) The set of engineered features
described above. 4) The combination of quarterly data and feature engineering.
For the time-series-specific non-deep learning models, 3 data inputs were used: 1) The
original monthly data (excluding the average search volume feature). 2) The monthly data
smoothed with a 3-month rolling average. 3) The monthly data smoothed with a 6-month rolling
average.
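The quarterly aggregation and rolling-average smoothing used for these data inputs can be sketched as below. How the edges of the rolling window were handled and whether quarters were aggregated by mean or by sum are not stated in this report; the sketch assumes a trailing window that drops incomplete windows and a mean per quarter. The function names are hypothetical.

```python
from statistics import fmean

def rolling_mean(series, window):
    """Moving average over complete windows; returns len(series) - window + 1 values."""
    return [fmean(series[i:i + window]) for i in range(len(series) - window + 1)]

def quarterly(series):
    """Average consecutive blocks of 3 months (24 months become 8 quarters)."""
    return [fmean(series[i:i + 3]) for i in range(0, len(series), 3)]

# Toy values, made up for illustration only
smoothed_3 = rolling_mean([1, 2, 3, 4, 5], 3)
quarters = quarterly(list(range(24)))
```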
For the deep-learning models, 3 data inputs were used: 1) The original monthly data
standardized to mean 0 and standard deviation 1. 2) The monthly data smoothed with a 3-month
rolling average, also standardized. 3) The standardized quarterly data. Additionally, two
conventional Neural Networks (Baseline NNs), one with 2 layers and another one with 3 layers
of 100 units each, were trained to serve as a baseline for the deep-learning models. These models
were trained with the original monthly data, standardized, as well as with the engineered features,
also standardized.
Train and test data sets for each of the data inputs described above were created using
75/25% splits. The train data set was further divided into train and validation sets for the deep
learning models using an 80/20% split.
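A minimal sketch of the two-stage split follows. It uses a plain shuffled split with a fixed seed; whether the original splits were stratified by class, and the exact resulting set sizes, are not stated in this report, so the sizes produced here are only those of this sketch.

```python
import random

def split(items, holdout_fraction, rng):
    """Shuffle and split into (kept, held_out) lists."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

rng = random.Random(42)                                  # fixed seed for repeatability
train_idx, test_idx = split(range(5642), 0.25, rng)      # 75/25 train/test
fit_idx, val_idx = split(train_idx, 0.20, rng)           # 80/20 train/validation
```

Splitting indices rather than the data itself makes it easy to reuse the same partition for every data input (monthly, quarterly, smoothed, engineered features).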
The best models from each of the 3 modeling approaches were combined in an ensemble
model to see if higher levels of performance were possible. These models were the Xgboost with
feature engineering + quarterly data, the Random Forest with original monthly data, the Time
Series Forest with original monthly data, and the MiniRocket_Ridge with smoothed data.
Finally, the best model of all, the Xgboost with feature engineering + quarterly data, was
further fine-tuned by calibrating its hyperparameters using Grid Search with 5-fold
cross-validation on the entire data set.
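The mechanics of grid search with k-fold cross-validation can be sketched as follows. This is a generic illustration: the hyperparameters and values actually searched are not listed in this report, so the grid below (max_depth, learning_rate) and the toy scoring function are placeholders, and `fit_and_score` stands in for training a model on the training indices and returning its weighted F1 on the validation indices.

```python
from itertools import product

def k_fold_indices(n_samples, k):
    """Split 0..n_samples-1 into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def grid_search(param_grid, n_samples, k, fit_and_score):
    """Return the parameter combination with the best mean validation score."""
    folds = k_fold_indices(n_samples, k)
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for i, val_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(fit_and_score(params, train_idx, val_idx))
        mean_score = sum(scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Placeholder scorer: pretends the best setting is max_depth=4, learning_rate=0.1
def toy_score(params, train_idx, val_idx):
    return -abs(params["max_depth"] - 4) - abs(params["learning_rate"] - 0.1)

grid = {"max_depth": [2, 4, 6], "learning_rate": [0.05, 0.1, 0.3]}
best_params, best_score = grid_search(grid, 5642, 5, toy_score)
```

The folds here are contiguous, so the rows should be shuffled beforehand if the data are ordered by class or category.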
Performance metric. To compare the models’ performance, a Precision and Recall
framework was used, with the weighted F1-Score as the main performance measure. The models were
also compared using Receiver Operating Characteristic (ROC) curves and their corresponding
area under the curve (AUC) measure for each class.
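For reference, the weighted F1 score averages the per-class F1 scores, weighting each class by its number of true instances (its support). A minimal sketch, with hypothetical names and toy labels:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        precision = tp / predicted if predicted else 0.0
        recall = tp / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)

# Toy labels, made up for illustration only
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
score = weighted_f1(y_true, y_pred)
```

In this toy case class "a" has precision 1 and recall 0.5 (F1 of 2/3), class "b" has precision 2/3 and recall 1 (F1 of 0.8), and both classes have equal support.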
Results and Discussion
Table 1 summarizes the results for each of the models trained. All models improve
substantially on a dummy classifier that always predicts the most frequent class. Except for the
1-NN-DTW and RISE models and some of the LSTM and 1D-CNN variants, the weighted F1
score is at or above 80%, and the best performer reaches 88%. The best-performing models are the
conventional tree-based models: regardless of the data input, the conventional Random Forest
and Xgboost models achieve > 85% weighted F1 on the test set. The MiniRocket_Ridge
model, with and without smoothed data, is the only other model that also reached 85%.
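To see why the dummy baseline scores so low on this metric, the arithmetic can be worked through with the class counts for the full data set reported in Table 3. This is only an approximation of the Table 1 baseline, which was computed on the train and test splits rather than on the full data set.

```python
# Class counts for the full data set, as reported in Table 3
support = {"Declining": 672, "Emerging": 674, "Fast Rising": 1946, "Sustained Growth": 2350}
total = sum(support.values())

# A dummy classifier predicts the majority class for every query
majority = max(support, key=support.get)
precision = support[majority] / total      # share of predictions that are correct
recall = 1.0                               # every majority-class query is recovered
f1_majority = 2 * precision * recall / (precision + recall)

# All other classes have F1 = 0, so only the majority class contributes
weighted = f1_majority * support[majority] / total
```

The result is roughly 0.245, in line with the 0.24 and 0.23 reported for the dummy classifier in Table 1.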

                                                            F1 Score (w)
Model                                                       Train set   Test set

Baseline Reference
Dummy Classifier                                            0.2432      0.2332

Conventional Tree-based Models
Random Forest (with monthly data)                           1.0000      0.8716
Random Forest (quarterly data)                              0.9962      0.8579
Random Forest (with feature engineering)                    0.9915      0.8554
Random Forest (with feature engineering + quarterly data)   0.9998      0.8690
Xgboost (with monthly data)                                 0.9055      0.8691
Xgboost (with quarterly data)                               0.8846      0.8574
Xgboost (with feature engineering)                          0.8975      0.8518
Xgboost (with feature engineering + quarterly data)         0.9115      0.8829 *

Time-Series-Specific Non-DL Models
1-NN-DTW                                                    0.9993      0.7498
1-NN-DTW (with smoothed data 3)                             0.9995      0.7676
1-NN-DTW (with smoothed data 6)                             0.9991      0.7775
Time Series Forest                                          0.9976      0.8497 *
Time Series Forest (with smoothed data 3)                   0.9995      0.8438
Time Series Forest (with smoothed data 6)                   0.9995      0.8382
RISE                                                        0.9969      0.7564
RISE (with smoothed data 3)                                 0.9962      0.7533
RISE (with smoothed data 6)                                 0.9986      0.7160
BOSSEnsemble                                                0.9960      0.8048
cBOSS                                                       0.9969      0.8054

Deep Learning Models
Baseline NN_100_100                                         0.8531      0.8228
Baseline NN_100_100_100_100                                 0.8756      0.8199
Baseline NN_100_100_100_100 (with feature engineering)      0.8361      0.8154
Simple_RNN_24_12_6_3                                        0.8294      0.8193
GRU_32_32_32_32                                             0.8184      0.8105
GRU_32_32_32_32 (with smoothed data 3)                      0.7991      0.8031
GRU_32_32_32_32_Q                                           0.8288      0.8261
LSTM_24_24_24_24                                            0.8049      0.8000
LSTM_24_24_24_24 (with smoothed data 3)                     0.8009      0.7969
LSTM_24_24_24_24_Q                                          0.7865      0.7807
1D-CNN_32_32_32_3_GlobalMax_Pooling                         0.8510      0.7906
1D-CNN_32_32_32_3_GlobalMax_Pooling (with smoothed 3)       0.8493      0.7932
1D-CNN_32_32_32_3_GlobalMax_Pooling_Q                       0.8561      0.8316
MiniRocket_Ridge                                            0.8998      0.8578
MiniRocket_Ridge (with smoothed data 3)                     0.8957      0.8581 *

Table 1. Weighted F1 Scores for all models trained. The models in each section with the
best scores on the test set are marked with an asterisk (*).

At first glance, it is surprising that the conventional machine learning models, not
designed specifically for time series, perform the best across the board. Even more surprising is
the fact that it is possible to achieve almost the highest performance by simply using the raw
monthly data with these conventional tree-based models. These models perform better even with
quarterly raw data than with features that have been engineered to take advantage of volume
differences between periods. Although the differences are minor, this suggests that the temporal
ordering of the values is less important for this challenge. Only when combining quarterly data
with the engineered features for the Xgboost can we achieve slightly higher performance.
The lack of strong importance of the temporal ordering of the values is also suggested by
the fact that the time-series-specific models are the ones performing the worst among these
models, except for the Time Series Forest, which barely missed the 85% mark. This is even true for
the deep learning models, where a basic 2-layer Neural Net (Baseline NN_100_100) slightly
outperforms most of the more sophisticated sequence-based models such as the traditional RNN
(Simple_RNN), the Gated Recurrent Unit (GRU), and the Long Short-Term Memory (LSTM)
models. The 1D-CNN did not perform better either, except with quarterly data. It is important to
note that only light hyperparameter tuning was performed for the deep learning models, so it may
be possible to improve their performance with more fine-tuning, but the gains are probably going
to be small. As mentioned before, only the MiniRocket_Ridge achieved performance similar to that
of the conventional tree-based models. MiniRocket is a promising model that combines the
sophistication of CNNs with the speed of the conventional tree-based models; however, in this
study, it trailed slightly.
The lack of strong importance of temporal patterns could be explained by the large
diversity of the time series within each category, as evidenced by the boxplots in Figure 2 and
the few samples of queries in Figure 6. These time series are similar to one another mostly in
their general long-term trend and not in specific short-term or cyclical patterns. For this precise
reason, we were not expecting the BOSSEnsemble or the cBOSS models to perform particularly
well, since they are best suited to pick up patterns that repeat frequently in time series. This may
also be the reason for the relatively poor performance of the other models.
Smoothing the data did not help much. Only the 1-NN-DTW model improved
noticeably with smoothed data (from 75% to 78%), and the more pronounced the smoothing the
better. This makes sense since reducing the temporal noise by smoothing makes the queries more
similar to one another, and similarity is the basic mechanism by which k-nearest neighbors models
work. The MiniRocket_Ridge model also performs better with smoothed data, but the difference is
so small that it may be due to chance.
Figure 7. Feature importance for two versions of the Random Forest and the Xgboost models.
Panels: Random Forest (raw monthly data), Random Forest (feature engineering), Xgboost (raw
monthly data), and Xgboost (feature engineering).
Looking at the feature importance of the best models in Figure 7, it is possible to see that
each model accomplishes its task differently, which suggested that combining the predictions
from these models could improve performance, as they rely on different aspects of the
data.
Figure 8. ROC curves and Confusion Matrices for top 2 models.
However, when looking at the confusion matrices and ROC curves in Figure 8, we can
observe that the models perform very similarly and struggle to differentiate Fast Rising
from Sustained Growth. There are still differences in the mistakes they make: the Random
Forest makes more mistakes misclassifying Emerging as Fast Rising but fewer mistakes
misclassifying Fast Rising as Sustained Growth. We hypothesized that combining these two
models would balance the mistakes and give us a boost in performance, but as seen in
Table 2, the gains are minimal.

                                                                  F1 Score (w)
Model                                                             Train set   Test set

Best individual models
Random Forest (with monthly data)                                 1.0000      0.8716
Xgboost (with feature engineering + quarterly data)               0.9115      0.8829 *
Time Series Forest                                                0.9976      0.8497
MiniRocket_Ridge (with smoothed data 3)                           0.8957      0.8581

Ensemble models
Random Forest + Xgboost                                           0.9794      0.8841 *
Random Forest + Xgboost + MiniRocket_Ridge                        0.9794      0.8841 *
Random Forest + Xgboost + MiniRocket_Ridge + Time Series Forest   0.9943      0.8790

Final model
Xgboost (with feature engineering + quarterly data, fine-tuned)   0.9270      0.9182 *

Table 2. Weighted F1 Scores for best individual models, ensemble models, and final
model after fine-tuning hyperparameters. The models in each section with the best scores
on the test set are marked with an asterisk (*).
It is worth noting that most of the best individual models here are already ensembles of
models picking up signals from different aspects of the data, so it may not be that surprising
after all that combining them does not add much to an already varied set of models. What is
surprising is that the MiniRocket_Ridge model did not add anything to the mix, despite being the
most different of all the models. The models were ensembled by averaging their predicted
probabilities for each class and then assigning the final prediction to the highest-probability
class. The MiniRocket_Ridge model was either predicting the same labels or overridden by the
other 2 models in the ensemble. Adding the Time Series Forest did the opposite: it switched some
correct predictions from the other three models into incorrect predictions, worsening the
performance.
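The averaging scheme described above (often called soft voting) can be sketched as follows. The probabilities are made up for illustration and the model names are placeholders; they are not outputs of the models in this study.

```python
def soft_vote(probabilities_per_model, classes):
    """Average class probabilities across models and pick the most probable class."""
    n_models = len(probabilities_per_model)
    n_samples = len(probabilities_per_model[0])
    predictions = []
    for i in range(n_samples):
        avg = [
            sum(model[i][c] for model in probabilities_per_model) / n_models
            for c in range(len(classes))
        ]
        predictions.append(classes[avg.index(max(avg))])
    return predictions

classes = ["Declining", "Emerging", "Fast Rising", "Sustained Growth"]
# One query, three hypothetical models (each row sums to 1)
model_a = [[0.10, 0.10, 0.60, 0.20]]   # favors Fast Rising
model_b = [[0.00, 0.10, 0.40, 0.50]]   # favors Sustained Growth
model_c = [[0.05, 0.05, 0.50, 0.40]]   # favors Fast Rising
two_models = soft_vote([model_a, model_c], classes)
three_models = soft_vote([model_a, model_b, model_c], classes)
```

In this toy case the dissenting model is outvoted, so adding it to the ensemble leaves the prediction unchanged, which is the kind of behavior described for MiniRocket_Ridge above.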
Given that combining the predictions from the Random Forest and the
Xgboost did not increase performance significantly, we decided to further fine-tune the
Xgboost with grid search as described in the Methods section. The calibrated model was then
trained and evaluated with the same train and test data as the others for a fair comparison. In this
case, we did break the 90% performance wall and achieved a weighted F1 score of 92%. To further
assess the performance of the final model and to identify potential drawbacks, the model trained via
grid search with cross-validation was used to predict the label of all the observations in the data
set. Table 3 shows the detailed precision/recall classification report for the final model using the
entire data set.
Classification Report
Class               precision   recall   f1-score   support
Declining           1.00        1.00     1.00       672
Emerging            0.95        0.98     0.96       674
Fast Rising         0.90        0.88     0.89       1946
Sustained Growth    0.92        0.92     0.92       2350

accuracy                                 0.92       5642
macro avg           0.94        0.95     0.94       5642
weighted avg        0.92        0.92     0.92       5642
Table 3. Precision/recall classification report for the fine-tuned Xgboost model across
the entire data set. All classes have precision scores equal to or higher than 90%.
Only the recall for Fast Rising queries is below the 90% mark.
The performance of the final model is satisfactory. Precision is 90% or higher for every
class. The model identifies all the Declining search queries without error. Similarly, its recall of
Emerging queries is 98%, and its Emerging predictions are wrong only about 1 out of 20 times.
The performance for search queries with Sustained Growth, the biggest class, is 92% in terms of
both recall and precision. The model is less accurate for Fast Rising search queries: the prediction
tends to be accurate 9 out of 10 times, but the model may miss about 12% of the search queries in
this category. This is acceptable because precision is the most important measure in this case: the
model will be used to identify search queries with clearly identified trends for insights purposes,
so making sure the prediction is accurate to a high degree is the main goal, even if the model
fails to identify a few queries that should have been recalled. The user would have to keep an
eye on the accuracy of the prediction for Fast Rising, knowing that some will tend to be
misclassified as Sustained Growth and a few as Emerging, as evidenced in the confusion matrix
shown in Figure 9. Also, some queries predicted as Fast Rising will actually be Sustained Growth.
Figure 9. Confusion Matrix for the fine-tuned Xgboost model across the entire data set.
Most mistakes are Fast Rising queries misclassified as Sustained Growth and
Sustained Growth queries misclassified as Fast Rising.
Conclusion. The results obtained above are satisfactory. The final model is predicting the
general trend of the search volume over time of queries submitted to search engines such as
Google with a high degree of precision. The model is missing the mark on a few of the
predictions for Fast Rising search queries, but this is an acceptable trade-off given the usefulness
of the model in automating the identification of the trends.
In terms of learnings from the modeling exercise, the lack of strong importance of the
temporal ordering of the values in these time series is surprising but not concerning. The long-
term pattern in each of the classes is not very complex and so it may be sufficient, even
preferable, for some of these models to rely on the raw values at each time. This may also be due
to the large variability of the values at each period for the queries in each of the classes. The
general long-term trend is not related to the micro patterns one may find when looking into
shorter intervals of time or for particular cyclical patterns that are frequently present in search
queries for seasonal products or high-demand periods such as holidays.
A final note regarding the best-performing models for this task: although we refer to
Random Forest and Xgboost as ‘conventional’ tree-based models, they
are sophisticated algorithms that tend to perform extremely well in general. Xgboost, in fact, has
won many machine learning competitions (Chollet, 2018) and is suggested as a good alternative
to deep learning models when structured data is available by the very same people who are at the
forefront of the deep learning revolution, such as François Chollet (2018), the developer of Keras.

References
Bagnall, Anthony, Jason Lines, Aaron Bostrom, James Large, and Eamonn Keogh. “The Great
Time Series Classification Bake Off: A Review and Experimental Evaluation of Recent
Algorithmic Advances.” Data Mining and Knowledge Discovery 31, no. 3 (2017): 606–60.
https://doi.org/10.1007/s10618-016-0483-9.
Chollet, François. Deep Learning with Python. Shelter Island, NY: Manning Publications Co.,
2018.
Dempster, Angus, François Petitjean, and Geoffrey I. Webb. “ROCKET: Exceptionally Fast and
Accurate Time Series Classification Using Random Convolutional Kernels.” Data Mining
and Knowledge Discovery 34, no. 5 (2020): 1454–95. https://doi.org/10.1007/s10618-020-
00701-z.
Dempster, Angus, Daniel F. Schmidt, and Geoffrey I. Webb. “MINIROCKET: A Very Fast
(Almost) Deterministic Transform for Time Series Classification.” arXiv.org, December
16, 2020. https://arxiv.org/abs/2012.08791v1.
Deng, Houtao, George Runger, Eugene Tuv, and Vladimir Martyanov. “A Time Series Forest for
Classification and Feature Extraction.” Information Sciences 239 (2013): 142–53.
https://doi.org/10.1016/j.ins.2013.02.030.
Fawaz, Hassan Ismail, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-
Alain Muller. “Deep Learning for Time Series Classification: a Review.” Data Mining and
Knowledge Discovery 33, no. 4 (2019): 917–63. https://doi.org/10.1007/s10618-019-
00619-1.
Schäfer, Patrick. “The BOSS Is Concerned with Time Series Classification in the Presence of
Noise.” Data Mining and Knowledge Discovery 29, no. 6 (2015): 1505–30.
https://doi.org/10.1007/s10618-014-0377-7.
