Processing queries to search engine of Yandex:
possibilities of analysis and forecast
AINL FRUCT 2016
Boldyreva Anna (RANEPA-MIPT)
Content
Introduction
Databases
Tasks of analysis
Tasks of forecast
Conclusion
INTRODUCTION
Terminology
A search query is a request made by an Internet user to
obtain information from a search engine. Statistics on
search queries are available from services provided by the
search engines themselves:
https://www.google.ru/trends/
https://adwords.google.com/
http://wordstat.yandex.ru/
A descriptor is a word or phrase that forms part of the
search queries entered by users.
Indicators are economic, social, demographic and other
measures that analysts and researchers analyze or forecast.
Top-rated lists of descriptors are the search queries most
highly correlated with the selected indicators.
A barometer is the mean of the normalized dynamics (time
series) of the descriptors in a top-rated list.
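The barometer definition above can be sketched in a few lines. This is only an illustration: the slides do not say which normalization is used, so min-max scaling is an assumption here, and the descriptor names and counts are made up.

```python
from statistics import mean


def min_max_normalize(series):
    """Scale a series to [0, 1]; min-max scaling is an assumed choice."""
    lo, hi = min(series), max(series)
    if hi == lo:
        # A constant series carries no dynamics; map it to zeros.
        return [0.0 for _ in series]
    return [(v - lo) / (hi - lo) for v in series]


def barometer(top_rated):
    """Mean of the normalized series of the top-rated descriptors.

    top_rated maps a descriptor name to its monthly query counts;
    all series are expected to cover the same months.
    """
    normalized = [min_max_normalize(s) for s in top_rated.values()]
    # zip(*...) walks the series month by month.
    return [mean(month_values) for month_values in zip(*normalized)]


# Hypothetical monthly counts for two descriptors.
monthly_counts = {
    "descriptor_a": [100, 150, 200],
    "descriptor_b": [10, 30, 20],
}
print(barometer(monthly_counts))
```

Averaging normalized series keeps a very frequent descriptor from dominating the barometer, which is presumably why normalization precedes the mean.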
Hypothesis
There is a stable statistical dependence between the intensity of
search queries and real-world events and social processes.
Fig. 1. Dynamics of the descriptor ‘swimsuit’ in the
U.S.A.: peaks in February and May-June
Relevance
Search queries can be used
• for monitoring the economic situation in regions in real time,
avoiding the difficulties caused by a lack of data;
• for parallel control of official information, which makes it possible
to reveal distortions introduced by official institutions;
• for forecasting economic, demographic and social parameters
during a crisis period;
• for forecasting the dynamics of various socio-economic and
socio-political processes;
• for analysis of other countries, where there is no need to wait for
official data published with a delay.
State of the art
2009: Google launched a service showing epidemic
hotspots in real time, based on the intensity of queries from
different regions;
2009: H. Choi and H. Varian introduced the first model
predicting fluctuations in business cycles with the help of
search query statistics;
2011: Z. Da, J. Engelberg and P. Gao demonstrated that
analyzing the dynamics of Google searches for companies
gives traders a 10% advantage;
2011: Michael Stolbov (MGIMO University) demonstrated
the feasibility of using Google search statistics to explain the
dynamics of aggregated financial indicators (for example,
deposits of individuals);
2013: Tobias Preis presented the work "Complex
dynamics of our economic life on different scales: insights
from search engine query data".
The work deals with the stock market: he analyzed
surges in searches for "Subprime", "Lehman Brothers" and
"Financial Crisis", which were followed by a drop in the S&P 500 index.
Domain-oriented databases of descriptors (SQ = search queries)
• economic terms: 25000 SQ;
• legal terms: 4500 SQ;
• crime articles: 365 SQ;
• well-known brands and goods: 3013 SQ;
• emotions:
    with positive tonality: 400 SQ;
    with negative tonality: 400 SQ;
• slang used in finance, computers and other fields: 3300 SQ;
• medical terms: 1600 SQ.
DATABASES
Technical databases
• lemmas: 18638 SQ;
• n-grams (n = 2, 3, ..., 8) of letters and syllables: ~90 000 SQ.
Lemma: the initial (dictionary) form of a word
Examples: avant-garde, sauna, drum, dune, velvet, bass, basketball,
battalion commander, comet, compass, icon, contour, piggy, mop,
cordon
n-gram
Examples: ев, ег, ед, ее, еж, ти, тк, тл, тм, тн, то, авв, авг, авп, бре, бри,
бро, век, вел, вес, лак, лал, лам, лан, лао, лап, греч, декс, сдел, кром
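A minimal sketch of how letter n-grams like the ones above could be generated from a word. It covers only contiguous letter n-grams; the syllable n-grams mentioned on the slide would need a syllabifier, which is not shown. The function name and the choice to emit every position in the word are assumptions.

```python
def letter_ngrams(word, n_min=2, n_max=8):
    """Return all contiguous letter n-grams of the word for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the word.
        for start in range(len(word) - n + 1):
            grams.append(word[start:start + n])
    return grams


# Bigrams and trigrams of one of the example fragments.
print(letter_ngrams("греч", 2, 3))
```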
Emotion words with positive tonality
Examples: good, great, beautiful, holiday, goodness, beauty, super, fun,
cool, happy, dream, luck, well, success, joy, laugh, nice
Emotion words with negative tonality
Examples: chaos, amoral, immoral, sabotage, punishment, violation,
cattle, schmuck, moron, hopeless, useless, helpless
Barometers
Examples of descriptors included in the "barometer" with a direct positive
correlation with the indicator "Consumer Price Index":
"treat" – 0.93
"okmarket" (hypermarket) – 0.91
"pariet" (ulcer drug) – 0.89
"patents" – 0.87
"mfbank" (commercial bank) – 0.87
"headhunter" (job search site) – 0.86
"pediashur" (baby food) – 0.86
"convenient" – 0.86
"often" – 0.85
"close" – 0.85
Examples of descriptors included in the "barometer" with a direct negative
correlation with the indicator "Consumer Price Index":
"chemical" (British musical duo) – -0.92
"artofvar" (musical group of war veterans) – -0.91
"incest" – -0.87
"group" – -0.87
"babylon" (Italian clothing brand) – -0.87
"young child" – -0.86
"diprivan" (a sedative) – -0.86
"ilarauto" (van sales) – -0.86
"miss" – -0.86
Bases of indicators
• Retail trade turnover (million rubles);
• Consumer Price Index;
• Entrepreneurs Price Index on industrial products;
• Entrepreneurs Price Index on minerals;
• Unemployment (thousands);
• Sales of new passenger cars and light commercial vehicles (units);
• Per capita income (thousands of rubles);
• Dollar/ruble exchange rate (USDTOM_UTS);
• Brent price (ICE.Brent), USD/barrel;
• Newborns (thousands);
• Marriages (thousands);
• Real activity (thousands);
• Deaths (thousands);
• Registered economic crimes.
Programs
1. A program that collects the dynamics of search queries from
the Yandex statistics service;
2. A program that automatically processes the collected files and
builds an Excel spreadsheet;
3. A program that automatically processes the tables and
selects the top search queries.
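The third step, selecting the top search queries, might look roughly like the sketch below. It is not the author's program: it assumes that "top" means descriptors whose Pearson correlation with an indicator exceeds a threshold in absolute value, and the 0.85 threshold, the function names and the sample data are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return 0.0  # A constant series has no defined correlation.
    return cov / (std_x * std_y)


def top_rated(descriptors, indicator, threshold=0.85):
    """Split descriptors into strongly positive and strongly negative lists."""
    scored = {name: pearson(series, indicator)
              for name, series in descriptors.items()}
    positive = sorted(((n, r) for n, r in scored.items() if r >= threshold),
                      key=lambda pair: -pair[1])
    negative = sorted(((n, r) for n, r in scored.items() if r <= -threshold),
                      key=lambda pair: pair[1])
    return positive, negative


# Made-up monthly data: one indicator and three descriptors.
indicator = [1.0, 2.0, 3.0, 4.0]
descriptors = {
    "rises_with_indicator": [10, 20, 30, 40],
    "falls_with_indicator": [40, 30, 20, 10],
    "unrelated": [5, 9, 5, 9],
}
positive, negative = top_rated(descriptors, indicator)
print([name for name, _ in positive])
print([name for name, _ in negative])
```

Keeping the negative list separate mirrors the slides, which build distinct barometers from positively and negatively correlated descriptors.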
Distribution of positively correlated descriptors by their correlation with
the indicator "Retail turnover"
Statistics of queries by regions
ANALYSIS
The vertical axis shows the values of the correlation coefficients.
The horizontal axis shows the number of positively correlated descriptors
at each level of correlation with the indicator "Retail trade turnover".
Query statistics on domain-oriented databases
Example: distribution of queries from the "Brands and products" database
across the indicators with which they are highly correlated.
Observation: newlyweds buy more than young parents.
Example: distribution of queries highly correlated with the
indicator "Sales of new cars" across the thematic databases.
Observation: active use of slang and a wide variety of products and services.
Consumer profiling
Excess frequency of search queries from the medical terms database in
Leningrad Region compared with the data for Russia (data for Russia are taken as 100%)
Query | Excess | Query | Excess
diaphragm spasm | +21% | vertigo | +54%
loss of taste | +108% | bitter taste in the mouth | +52%
doctor on duty | +105% | pharmacy phone number | +46%
full pulse | +91% | emergency room | +46%
cough with yellow sputum | +73% | manic phase | +45%
neck numbness | +70% | Naphthyzin | +43%
loss of sensation | +70% | bleeding from the ears | +38%
night sweats | +69% | call a doctor | +38%
sterile bandages | +67% | Efferalgan | +37%
abstinent | +53% | buy medicines | +36%
hospital on duty | +55% | wheezing | +34%
Increased mortality in Leningrad Region
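One way such excess percentages could be computed is sketched below. The slides do not describe the exact normalization, so comparing the query's share of all queries in the region with its share in the whole country is an assumption, as are the function name and the numbers.

```python
def excess_frequency(region_count, region_total, country_count, country_total):
    """Excess of a query's regional share over its national share, in percent.

    The national share is taken as 100%, so a return value of +50.0 means
    the query is 1.5 times as frequent in the region as in the country.
    """
    region_share = region_count / region_total
    country_share = country_count / country_total
    return (region_share / country_share - 1.0) * 100.0


# Hypothetical counts: 30 of 1000 regional queries vs 200 of 10000 national ones.
print(round(excess_frequency(30, 1000, 200, 10000), 1))
```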
Group method of data handling (GMDH)
makes it possible to select a model of optimal complexity within a given class of
models to describe the available set of experimental data.
Polynomial class of models:
where x = {x_i | i = 1, …, m} is a set of indicators
and w = (w_i, w_ij, w_ijk, … | i, j, k = 1, …, m) is a weight vector.
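The polynomial class referred to here is presumably the Kolmogorov-Gabor polynomial that is commonly used with GMDH. Under that assumption, and with a constant term w_0 that the slide does not list explicitly, it can be written as:

```latex
y(x, w) = w_0
        + \sum_{i=1}^{m} w_i x_i
        + \sum_{i=1}^{m} \sum_{j=1}^{m} w_{ij} x_i x_j
        + \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{m} w_{ijk} x_i x_j x_k
        + \dots
```

GMDH does not fit this full polynomial directly; it builds partial models from small subsets of the terms and keeps those that perform best on an external criterion.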
FORECAST
GMDH Shell implements GMDH
Capabilities:
• Approximation
• Extrapolation
• Classification
http://www.gmdhshell.com
Lead developer: A.A. Koshulko, Candidate of Technical Sciences
Program GMDH Shell
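1st criterion: MAPE (mean absolute percentage error):
$$\mathrm{MAPE} = \frac{1}{N} \sum_{t=1}^{N} \left| \frac{y_t - \hat{y}_t}{y_t} \right| \cdot 100\%,$$
where $N$ is the sample size, $y_t$ is the real value at time $t$, and $\hat{y}_t$ is the estimated value at time $t$;
2nd criterion: P (one-month step-forward forecast error):
$$P = \frac{y_{N+1} - \hat{y}_{N+1}}{y_{N+1}} \cdot 100\%.$$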
Error evaluation
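Both criteria are easy to compute directly. The sketch below assumes the sign convention "real minus estimated" for P, which the slide does not state unambiguously; the function names and sample values are illustrative.

```python
def mape(actual, predicted):
    """Mean absolute percentage error over the sample, in percent."""
    n = len(actual)
    return sum(abs((y - y_hat) / y) for y, y_hat in zip(actual, predicted)) / n * 100.0


def one_step_error(actual_next, predicted_next):
    """Signed one-month-ahead error P, in percent (assumed: real minus estimated)."""
    return (actual_next - predicted_next) / actual_next * 100.0


# Hypothetical values: a two-point sample and one out-of-sample month.
print(mape([100.0, 200.0], [110.0, 190.0]))
print(one_step_error(100.0, 98.0))
```

Unlike MAPE, P keeps its sign, which is why the comparison tables can show negative one-month-ahead deviations.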
Observations are pseudo-randomly mixed;
Validation method: cross-validation with two parts;
Internal criterion: OLS (ordinary least squares);
External criterion: RMSE (root mean squared error) with a
penalty in the form of the difference between the RMSE
values on the training and testing parts of the sample;
Neuron function: linear;
Maximum number of layers: 6;
Initial layer width: 5.
Forecast settings
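The external criterion in these settings can be illustrated as follows. The exact way the penalty is combined with the RMSE is not given on the slide; adding the absolute difference between the two RMSE values to the testing RMSE is an assumption, and this is not GMDH Shell's actual code.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error of a set of predictions."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def external_criterion(train_actual, train_pred, test_actual, test_pred):
    """Testing RMSE plus a penalty for the gap between training and testing RMSE.

    Assumed form: rmse_test + |rmse_test - rmse_train|.
    """
    rmse_train = rmse(train_actual, train_pred)
    rmse_test = rmse(test_actual, test_pred)
    return rmse_test + abs(rmse_test - rmse_train)


# A candidate that fits the training part perfectly but misses the testing part by 1.
print(external_criterion([1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [2.0, 3.0]))
```

A penalty of this kind favors candidate models whose accuracy is similar on both parts of the sample, which discourages overfitting.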
Neural algorithm with linear barometers
MAPE = 1.0%
One-month-ahead forecast error P = 1.8%
Forecast of retail turnover
Y1[t] = 64.4813 + Cm3m_pol[t-3]*40.7607 + N2*0.966131
N2[t] = -599.916 - Cm1m_pol[t-1]*410.022 + N3*1.3329
N3[t] = -45.4194 + N12*0.261196 + N4*0.759475
N4[t] = -99.6924 + ORT_PK_otr[t-2]*118.488 + N6*1.02163
N6[t] = 5.48719 - ORT_PK_pol[t-1]*181.667 + N10*1.0246
N10[t] = 1926.58 + Cm1m_pol[t-1]*1209.97 - Cm3m_otr[t-1]*241.064
N12[t] = 2327.24 + Cm3m_pol[t-3]*685.968 - Cm3m_otr[t-3]*581.729
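To show how the layered model above is evaluated, the sketch below transcribes the seven equations into a function. The coefficients and barometer names are taken from the slide; the data layout (one list of monthly values per barometer, indexed by month) and the zero-valued test input are assumptions made only for illustration.

```python
def retail_turnover_forecast(b, t):
    """Evaluate Y1[t] from barometer series b (name -> list of monthly values).

    A term such as Cm3m_pol[t-3] is read as the value of that barometer
    three months before the forecast month t, so t must be at least 3.
    """
    if t < 3:
        raise ValueError("t must be >= 3 because the model uses lags up to 3 months")
    # Inner neurons are computed first; each feeds the neuron above it.
    n12 = 2327.24 + b["Cm3m_pol"][t - 3] * 685.968 - b["Cm3m_otr"][t - 3] * 581.729
    n10 = 1926.58 + b["Cm1m_pol"][t - 1] * 1209.97 - b["Cm3m_otr"][t - 1] * 241.064
    n6 = 5.48719 - b["ORT_PK_pol"][t - 1] * 181.667 + n10 * 1.0246
    n4 = -99.6924 + b["ORT_PK_otr"][t - 2] * 118.488 + n6 * 1.02163
    n3 = -45.4194 + n12 * 0.261196 + n4 * 0.759475
    n2 = -599.916 - b["Cm1m_pol"][t - 1] * 410.022 + n3 * 1.3329
    return 64.4813 + b["Cm3m_pol"][t - 3] * 40.7607 + n2 * 0.966131


# Placeholder input: four months of zero-valued barometers (not real data).
names = ["Cm1m_pol", "Cm3m_pol", "Cm3m_otr", "ORT_PK_pol", "ORT_PK_otr"]
zero_barometers = {name: [0.0] * 4 for name in names}
print(retail_turnover_forecast(zero_barometers, 3))
```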
Neural algorithm with barometers in square-root form
MAPE = 1.4%
One-month-ahead forecast error P = -2%.
Forecast of USD/ruble exchange rate
Y1[t] = -2.49737e-10 + N2*1
N2[t] = -769.864 + "Cm2_otr[t-2], sqrt"*522.473 + N3*1.09103
N3[t] = 2267.23 - "Cm3_otr[t-3], sqrt"*1563.91 + N6*0.738373
N6[t] = -5926.55 + ""$_PK_otr"[t-2], sqrt"*4826.58 + N10*1.48421
N10[t] = 8962.17 - "Cm2_otr[t-2], sqrt"*3666.07 - "Cm3_otr[t-4], sqrt"*2607.62
Combinatorial algorithm with linear variables
MAPE = 4.5%
One-month-ahead forecast error P = -2.9%.
Forecast of economic crimes
Y[t] = 5368.54 + Cm2_pol[t-1]*8610.05 + Cm2_pol[t-2]*4452.71 + Cm2_otr[t-2]*(-11350) + Cm3_pol[t-2]*11285.4
Comparison of algorithms. MAPE
Indicator | Neuro, no roots | Neuro, with square roots | Combi, no roots | Combi, with square roots
Retail turnover in Russia | 1.0% | 2.1% | 2.3% | 1.4%
Unemployment | 0.8% | 0.5% | 0.9% | 0.6%
Marriages | 7.5% | 6.9% | 9.9% | 8.3%
Real activity | 0.1% | 0.1% | 0.1% | 0.1%
Consumer Price Index | 0.1% | 0.1% | 0.1% | 0.1%
Entrepreneurs Price Index on industrial products | 0.3% | 0.3% | 0.3% | 0.3%
Entrepreneurs Price Index on minerals | 1.4% | 1.4% | 1.4% | 1.4%
The dollar/ruble exchange rate | 0.8% | 1.4% | 1.1% | 1.2%
Newborns | 1.2% | 1.8% | 1.9% | 2.3%
Sales of new passenger cars and light commercial vehicles | 3.5% | 1.6% | 3.9% | 5.4%
Per capita income | 2.4% | 1.6% | 1.9% | 1.0%
Economic crimes | 3.4% | 6.4% | 4.5% | 6.5%
Oil prices | 1.0% | 1.7% | 1.7% | 1.6%
Comparison. One-month forward forecast deviations with barometers
Indicator | Neuro, no roots | Neuro, with square roots | Combi, no roots | Combi, with square roots
Retail turnover in Russia | 1.8% | -3.8% | 3.8% | -2.9%
Unemployment | -2.8% | 2.5% | -0.9% | 3.1%
Marriages | -33.2% | -20.4% | -52.1% | -20.2%
Real activity | -0.3% | -0.2% | -0.3% | -0.2%
Consumer Price Index | 0.1% | -0.2% | 0.1% | 0.7%
Entrepreneurs Price Index on industrial products | 2.3% | -0.2% | 4.1% | -0.2%
Entrepreneurs Price Index on minerals | 86.6% | 37.0% | 1.5% | 37.0%
The dollar/ruble exchange rate | 12.7% | -2.0% | 10.6% | 17.3%
Newborns | -0.7% | -0.7% | -0.6% | 7.6%
Sales of new passenger cars and light commercial vehicles | 7.8% | -16.9% | 24.9% | -100.0%
Per capita income | 6.7% | 10.8% | 13.8% | 23.6%
Economic crimes | -16.5% | 6.6% | -2.9% | 6.4%
Oil prices | -5.3% | -7.0% | -14.0% | -7.0%
CONCLUSION
Scientific results – 1 (databases)
We have shown experimentally that the following can be
applied effectively:
• Several domain-oriented databases instead of a single one;
• Databases of n-grams (n = 2, ..., 8);
• Significantly negatively correlated descriptors along with
significantly positively correlated descriptors.
Scientific results – 2 (analysis)
We proposed interpretations of the results of statistical analysis:
• in evaluating the causes of increased
mortality at the beginning of 2015 in the regions of
Russia;
• in evaluating people's reaction to retail
trade turnover;
• in identifying groups of consumers.
Scientific results – 3 (forecast)
We have shown experimentally that GMDH algorithms are highly
accurate, reaching error levels of
~3%–6% in the best models of crimes;
~1%–4% in the models for economic and social indicators.
As future work we consider
• proposing a technology that uses mentions of descriptors in
social media;
• developing a procedure for processing queries that include
outliers related to force majeure circumstances;
• developing models for fuzzy forecasting that take into account
the qualitative dynamics of queries.
Future research
Thank you!
anna.boldyreva@phystech.edu
+7-916-542-37-64
X_pol – barometer with a strong direct positive correlation
with indicator X
X_otr – barometer with a strong direct negative correlation
with indicator X
Cmim_pol – barometer with a strong positive correlation at a
lag of i months relative to indicator X
Cmim_otr – barometer with a strong negative correlation at a
lag of i months relative to indicator X
Notation
Latest research papers
• Boldyreva A., Alexandrov M., Koshulko O., Sobolevskiy O.: Queries to Internet
as a tool for analysis of regional police work and forecast of crimes in regions:
Proc. of 15th Mexican Intern. Conf. on Artificial Intelligence, Springer, LNCS,
2016, 12 p. [to be published]
• Boldyreva A., Sobolevskiy O., Alexandrov M., Danilova V.: Creating collections
of descriptors based on Internet queries: Proc. of 15th Mexican Intern. Conf. on
Artificial Intelligence, Springer, LNCS, 11 p. [to be published]
• Boldyreva A.: An integral method for investigating attitudes of Internet users
based on search queries. “Mathematical modeling of social processes”, Proc.
of Sociological Faculty of MSU, Publ. House MSU (Moscow State Lomonosov
Univ.), 2016, vol. 18, pp. 26-34, [rus]
• Boldyreva A.: Building predictive models of economic and social conditions
based on the intensity of search queries to the Internet. “Modern economics:
theory, policy, innovation. Collection of student research papers”, Moscow,
Publ. House RANEPA, 2016, pp. 36-61, [rus]
• Boldyreva A., Alexandrov M., Surkova D.: Words with negative sentiment in
search queries to the Internet as an indicator of per capita income in the
Federal Districts of Russia. Inductive modeling of complex systems, NAS of
Ukraine, Kyev, 2015, vol. 7, pp. 77-92, [rus]
