3. Terminology
A search query is a request made by an Internet user to
obtain information from a search engine; statistics on
search queries are available from services provided by
the search engines themselves:
https://www.google.ru/trends/
https://adwords.google.com/
http://wordstat.yandex.ru/.
A descriptor is a word or phrase that forms part of the
search queries entered by users;
INTRODUCTION
4. Terminology
Indicators are economic, social, demographic and other
measures that analysts and researchers analyze or
forecast;
A top-rated list of descriptors consists of the search queries
most highly correlated with a selected indicator;
A barometer is the mean value of the normalized dynamics
of the top-rated selection.
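The barometer definition can be illustrated with a minimal sketch. The slides do not say which normalization is used, so min-max scaling to [0, 1] is an assumption here, and the function name `barometer` and the sample frequencies are hypothetical.

```python
import numpy as np

def barometer(series):
    """Average the normalized dynamics of several descriptors.

    `series` has one row per descriptor and one column per time step.
    Assumption: each row is min-max scaled to [0, 1] before averaging.
    """
    data = np.asarray(series, dtype=float)
    lo = data.min(axis=1, keepdims=True)
    hi = data.max(axis=1, keepdims=True)
    # A constant series has zero range; divide by 1 so it maps to all zeros.
    span = np.where(hi > lo, hi - lo, 1.0)
    normalized = (data - lo) / span
    return normalized.mean(axis=0)

# Hypothetical monthly query frequencies for two top-rated descriptors.
top_rated_series = [
    [10, 20, 30, 40],
    [100, 300, 200, 500],
]
print(barometer(top_rated_series))
```

Normalizing first keeps a very frequent descriptor from dominating the mean, which is presumably why the definition refers to normalized dynamics.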
5. Hypothesis
There is a stable statistical dependence between the intensity of
search queries and real-world events and social processes.
Fig. 1. The dynamics of the descriptor ‘swimsuit’ in the
U.S.A.: peaks in February and May-June
6. INTRODUCTION
Relevance
We can use search queries
• to monitor the economic situation in regions in real time,
avoiding the difficulties caused by a lack of data, as explained
above;
• to cross-check official information in parallel, which makes it
possible to reveal distortions introduced by official institutions;
• to forecast economic, demographic and social parameters
during a crisis period;
• to forecast the dynamics of various socio-economic and
socio-political processes;
• to analyze other countries, since we do not need to wait for
official data that is published with a delay.
7. State of the art
2009 – Google launched a service showing disease outbreak
hotspots in real time based on the intensity of queries from
different regions;
2009 – H. Choi and H. Varian introduced the first model
predicting fluctuations in business cycles with the help of
search query statistics;
2011 – Z. Da, J. Engelberg and P. Gao demonstrated that
analyzing the dynamics of Google searches for companies
gives a 10% advantage to traders;
8. State of the art
2011 – Michael Stolbov (MGIMO-University) demonstrated
the feasibility of using Google search statistics to explain the
dynamics of aggregate financial indicators (for example,
deposits of individuals);
2013 – Tobias Preis presented the work "Complex
dynamics of our economic life on different scales: insights
from search engine query data".
The work concerns stock markets; he analyzed surges in
searches for "Subprime", "Lehman Brothers" and
"Financial Crisis", which were followed by a drop in the S&P 500 Index.
10. Domain-oriented databases of descriptors
• economic terms – 25000 SQ (search queries);
• legal terms – 4500 SQ;
• crime articles – 365 SQ;
• well-known brands and goods – 3013 SQ;
• emotions:
with positive tonality – 400 SQ
with negative tonality – 400 SQ;
• slang used in finance, computers and other fields – 3300 SQ;
• medical terms – 1600 SQ.
DATABASES
Technical databases
• lemmas – 18638 SQ;
• n-grams (n = 2, 3, …, 8) of letters and syllables ~ 90 000 SQ.
11. Lemma – the initial form of the word
Examples: avant-garde, sauna, drum, dune, velvet, bass, basketball,
battalion commander, comet, compass, icon, contour, piggy, mop,
cordon
n-gram
Examples: ев, ег, ед, ее, еж, ти, тк, тл, тм, тн, то, авв, авг, авп, бре, бри,
бро, век, вел, вес, лак, лал, лам, лан, лао, лап, греч, декс, сдел, кром
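A minimal sketch of how letter n-grams like those above could be generated from a word. It covers letter n-grams only, not syllable n-grams, and the function name `letter_ngrams` is hypothetical. The slides do not describe how the authors actually built their n-gram base.

```python
def letter_ngrams(word, n_min=2, n_max=8):
    """Return all contiguous letter n-grams of `word` for n_min <= n <= n_max."""
    word = word.lower()
    grams = []
    for n in range(n_min, n_max + 1):
        # A word of length L contains L - n + 1 n-grams of size n.
        for start in range(len(word) - n + 1):
            grams.append(word[start:start + n])
    return grams

print(letter_ngrams("век", 2, 3))  # ['ве', 'ек', 'век']
```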
Emotion words with positive tonality
Examples: good, great, beautiful, holiday, goodness, beauty, super, fun,
cool, happy, dream, luck, well, success, joy, laugh, nice
Emotion words with negative tonality
Examples: chaos, amoral, immoral, sabotage, punishment, violation,
cattle, schmuck, moron, hopeless, useless, helpless
12. Barometers
Examples of words included in the "barometer" with direct positive
correlation with the indicator "Consumer Price Index":
"treat" – 0.93
"okmarket" (hypermarket) – 0.91
"pariet" (ulcer drug) – 0.89
"patents" – 0.87
"mfbank" (commercial bank) – 0.87
"headhunter" (job search site) – 0.86
"pediashur" (baby food) – 0.86
"convenient" – 0.86
"often" – 0.85
"close" – 0.85
13. Barometers
Examples of words included in the "barometer" with direct negative
correlation with the indicator "Consumer Price Index":
"chemical" (British musical duo) – -0.92
"artofvar" (musical group of war veterans) – -0.91
"incest" – -0.87
"group" – -0.87
"babylon" (Italian clothing brand) – -0.87
"young child" – -0.86
"diprivan" (a sedative) – -0.86
"ilarauto" (van sales) – -0.86
"miss" – -0.86
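The selection of positively and negatively correlated descriptors can be sketched as follows. This is an illustration under assumptions, not the authors' program: it uses the Pearson coefficient from `numpy.corrcoef`, and the function name `top_rated`, the cutoff `k`, and the sample data are hypothetical.

```python
import numpy as np

def top_rated(descriptors, indicator, k=10):
    """Rank descriptors by Pearson correlation with an indicator.

    `descriptors` maps a query string to its time series. Returns two lists
    of (name, r): the k strongest positive and k strongest negative correlations.
    """
    indicator = np.asarray(indicator, dtype=float)
    scored = []
    for name, values in descriptors.items():
        r = np.corrcoef(np.asarray(values, dtype=float), indicator)[0, 1]
        scored.append((name, float(r)))
    positive = sorted((s for s in scored if s[1] > 0), key=lambda s: -s[1])[:k]
    negative = sorted((s for s in scored if s[1] < 0), key=lambda s: s[1])[:k]
    return positive, negative

# Hypothetical series: an indicator and three descriptors over five months.
indicator = [1, 2, 3, 4, 5]
descriptors = {
    "a": [2, 4, 6, 8, 10],   # rises with the indicator
    "b": [5, 4, 3, 2, 1],    # falls as the indicator rises
    "c": [1, 3, 2, 5, 4],    # noisy upward trend
}
positive, negative = top_rated(descriptors, indicator, k=2)
print(positive)
print(negative)
```

Keeping the two lists separate mirrors the slides, which report positive and negative barometers side by side.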
14. Bases of indicators
• Retail trade turnover (millions of roubles);
• Consumer Price Index;
• Producer Price Index for industrial products;
• Producer Price Index for minerals;
• Unemployment (thousands);
• Sales of new passenger cars and light commercial vehicles (units);
• Per capita income (thousands of roubles);
• The dollar/rouble exchange rate (USDTOM_UTS);
• Brent price (ICE.Brent), USD/barrel.
16. Programs
1. A program that collects the dynamics of search queries from
the Yandex statistics service;
2. A program that automatically processes the collected files and
builds an Excel spreadsheet;
3. A program that automatically processes the tables and
selects the top search queries.
18. Distribution of positively correlated queries by their correlation with
the indicator "Retail turnover"
Statistics of queries by regions
ANALYSIS
19. Values of the correlation coefficients are plotted on the vertical axis.
The number of positive descriptors with the corresponding level of
correlation with the indicator "Turnover of retail trade" is plotted
on the horizontal axis.
20. Query statistics on domain-oriented databases
Example: distribution of queries from the database "Brands and products"
across the indicators with which they are highly correlated.
Observation: newlyweds are buying more than young parents
21. Example: distribution of queries that are highly correlated with the
indicator "Sales of new cars" across thematic databases.
Observation: active use of slang and a wide variety of products/services
Consumer profiling
22. Excess frequency of search queries from the medical-terms database
in Leningrad Region compared with data for Russia
(data for Russia are taken as 100%):
Query                           Excess    Query                           Excess
diaphragm spasm                 +21%      vertigo                         +54%
loss of taste                   +108%     bitter taste in the mouth       +52%
doctor on duty                  +105%     pharmacy phone number           +46%
full pulse                      +91%      admissions department           +46%
cough with yellow sputum        +73%      manic phase                     +45%
neck numbness                   +70%      Naphthyzin                      +43%
loss of sensation               +70%      bleeding from the ears          +38%
night sweats                    +69%      call a doctor                   +38%
sterile bandages                +67%      Efferalgan                      +37%
abstinent                       +53%      buy medicines                   +36%
hospital on duty                +55%      wheezing                        +34%
Increased mortality in Leningrad Region
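The percentages above compare a regional figure with a national baseline taken as 100%. A minimal sketch of that comparison follows. It assumes the compared quantity is the share of a query among all queries in each territory, which the slides do not state. The function name `excess_frequency` and the sample shares are hypothetical.

```python
def excess_frequency(region_share, country_share):
    """Percentage by which a regional query share exceeds the national share.

    The national share is the 100% baseline, so a result of +50.0 means the
    query is 1.5 times as frequent in the region as in the country overall.
    """
    if country_share <= 0:
        raise ValueError("country_share must be positive")
    return (region_share / country_share - 1.0) * 100.0

# Hypothetical shares (in arbitrary but equal units) for one medical query.
print(excess_frequency(3.0, 2.0))  # 50.0
```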
24. Group method of data handling (GMDH)
selects a model of optimal complexity within a given class of
models to describe the available set of experimental data.
The class of models is polynomial,
where x = {x_i | i = 1, …, m} is a set of indicators
and w = (w_i, w_ij, w_ijk, … | i, j, k = 1, …, m) is a weight vector.
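The formula for the polynomial class did not survive in the text. The polynomial usually associated with GMDH is the Kolmogorov-Gabor polynomial, which matches the weights w_i, w_ij, w_ijk listed above. The form below is that standard expression, not a reconstruction of the slide itself. The constant term w_0 is an assumption, since it is not listed in the weight vector.

```latex
y(x, w) = w_0
        + \sum_{i=1}^{m} w_i x_i
        + \sum_{i=1}^{m} \sum_{j=i}^{m} w_{ij} x_i x_j
        + \sum_{i=1}^{m} \sum_{j=i}^{m} \sum_{k=j}^{m} w_{ijk} x_i x_j x_k
        + \dots
```

GMDH then searches this class for the truncation whose complexity is optimal for the data, rather than fitting all terms at once.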
FORECAST
25. GMDH Shell implements GMDH
Capabilities:
• Approximation
• Extrapolation
• Classification
http://www.gmdhshell.com
Lead developer: A.A. Koshulko, Candidate of Technical Sciences
Program GMDH Shell
26. 1st criterion: MAPE (mean absolute percentage error):
MAPE = \frac{1}{N} \sum_{t=1}^{N} \frac{|y_t - \hat{y}_t|}{y_t} \cdot 100\%,
where N is the sample size, y_t is the real value for t, and \hat{y}_t is the estimated value for t;
2nd criterion: P (one-month step forward forecast error):
P = \frac{|y_{N+1} - \hat{y}_{N+1}|}{y_{N+1}} \cdot 100\%.
Error evaluation
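The two error criteria can be computed directly, as in the minimal sketch below. The function names `mape` and `one_step_error` and the sample values are hypothetical. Taking the absolute value of the denominator is a safeguard added here, and it makes no difference for positive indicators such as turnover.

```python
def mape(actual, predicted):
    """Mean absolute percentage error over a sample, in percent."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(y - f) / abs(y) for y, f in zip(actual, predicted))
    return 100.0 * total / len(actual)

def one_step_error(actual_next, predicted_next):
    """Percentage error of the forecast for the first month after the sample."""
    return 100.0 * abs(actual_next - predicted_next) / abs(actual_next)

# Hypothetical values: two in-sample months and one out-of-sample month.
print(mape([100.0, 200.0], [110.0, 190.0]))
print(one_step_error(50.0, 49.0))
```

MAPE measures the in-sample fit, while P checks the model on a single month it has not seen, so a low MAPE with a high P would suggest overfitting.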
27. Observations are pseudo-randomly mixed;
Validation method is cross-validation with two parts;
Internal criterion is OLS;
External criterion is RMSE (root mean squared error) with a
penalty in the form of the difference between the RMSE
values on the training and testing parts of the sample;
Neuron function is linear;
The maximum number of layers is 6;
The initial layer width is 5.
Forecast settings
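The internal and external criteria in these settings can be illustrated with a minimal sketch of a single linear neuron: OLS fits the weights on the training part, and the external criterion scores the result. The slides do not say how GMDH Shell weights the penalty, so adding the absolute RMSE difference to the test RMSE is an assumption. The function names and the sample data are hypothetical.

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def external_criterion(X_train, y_train, X_test, y_test):
    """Score one linear neuron: test RMSE plus |test RMSE - train RMSE|.

    Internal criterion: ordinary least squares on the training part.
    Assumption: the penalty is added with unit weight.
    """
    # Prepend a column of ones so the neuron has an intercept.
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    w, *_ = np.linalg.lstsq(A_train, np.asarray(y_train, dtype=float), rcond=None)
    rmse_train = rmse(y_train, A_train @ w)
    rmse_test = rmse(y_test, A_test @ w)
    return rmse_test + abs(rmse_test - rmse_train)

# Hypothetical data lying exactly on y = 2x + 1.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])
X_test = np.array([[4.0], [5.0]])
y_test = np.array([9.0, 11.0])
print(external_criterion(X_train, y_train, X_test, y_test))
```

A criterion of this form favors neurons that perform equally well on both parts of the sample, which discourages fitting noise in the training part.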
28. Neural algorithm with linear barometers
MAPE = 1.0%,
One-month forward forecast error P=1.8%
Forecast of retail turnover
37. CONCLUSION
Scientific results – 1 (databases)
We have shown experimentally that the following can be used
effectively:
• Several domain-oriented databases instead of a single one;
• Databases of n-grams (n = 2, …, 8);
• Significantly negatively correlated descriptors along with
significantly positively correlated descriptors.
38. CONCLUSION
Scientific results – 2 (analysis)
We proposed interpretations of the results of statistical analysis:
• in evaluating the causes of increased
mortality at the beginning of 2015 in the regions of
Russia;
• in evaluating people's reaction to retail
trade turnover;
• in identifying groups of consumers.
39. CONCLUSION
Scientific results – 3 (forecast)
We have shown experimentally the high accuracy of GMDH
algorithms, with error levels of
~3%–6% in the best models of crime;
~1%–4% in the models of economic and social indicators.
40. As future work we consider
• proposing a technology for using mentions of descriptors in
social media;
• developing a procedure for processing queries that include
outliers related to force majeure circumstances;
• developing models for fuzzy forecasting that take into account
the qualitative dynamics of queries.
Future research
42. X_pol – barometer with strong direct positive correlation
with indicator X
X_otr – barometer with strong direct negative correlation
with indicator X
Cmim_pol – barometer with strong positive correlation at a
lag of i months relative to indicator X
Cmim_otr – barometer with strong negative correlation at a
lag of i months relative to indicator X
Notation
43. Latest research papers
• Boldyreva A., Alexandrov M., Koshulko O., Sobolevskiy O.: Queries to Internet
as a tool for analysis of regional police work and forecast of crimes in regions:
Proc. of 15th Mexican Intern. Conf. on Artificial Intelligence, Springer, LNCS,
2016, 12 p. [to be published]
• Boldyreva A., Sobolevskiy O., Alexandrov M., Danilova V.: Creating collections
of descriptors based on Internet queries: Proc. of 15th Mexican Intern. Conf. on
Artificial Intelligence, Springer, LNCS, 11 p. [to be published]
• Boldyreva A.: An integral method for investigating attitudes of Internet users
based on search queries. “Mathematical modeling of social processes”, Proc.
of Sociological Faculty of MSU, Publ. House MSU (Moscow State Lomonosov
Univ.), 2016, vol. 18, pp. 26-34, [rus]
• Boldyreva A.: Building predictive models of economic and social conditions
based on the intensity of search queries to the Internet. “Modern economics:
theory, policy, innovation. Collection of student research papers”, Moscow,
Publ. House RANEPA, 2016, pp. 36-61, [rus]
• Boldyreva A., Alexandrov M., Surkova D.: Words with negative sentiment in
search queries to the Internet as an indicator of per capita income in the
Federal Districts of Russia. Inductive modeling of complex systems, NAS of
Ukraine, Kyiv, 2015, vol. 7, pp. 77-92, [rus]