Session I - Big Data De Vitiis, Sampling schemes using scanner data for the consumer price index

Sampling schemes using scanner data for the
consumer price index
De Vitiis C., Guandalini A., Inglese F., Terribili M. D.
Workshop of the ADVISORY COMMITTEE ON STATISTICAL METHODS
Rome, 19th November 2018

Outline
1. ISTAT and international context
2. The ISTAT sampling experiments of SD
3. Experiments of sampling from SD
4. Experiments of static and dynamic approaches
5. The sample design for the outlets
6. Conclusions and open issues

1. Context: the ISTAT project on Scanner Data
 ISTAT started using scanner data (SD) in 2018 to compute the
consumer price index (CPI) for modern retail distribution (food and grocery)
 SD have been provided regularly to ISTAT since 2014 through the market research
company ACNielsen, for the main chains in Italy and for a sample of about
2,000 outlets scattered throughout the country
 Nielsen provides ISTAT with the Dictionary, which associates each item with its identifying
attributes and ECR classification codes, making it possible to link items to the COICOP classification
 For 2018 the compilation of the CPI using scanner data is based on a fixed
basket approach, but ISTAT's goal is a gradual transition to a flexible
basket
 At the moment, for the traditional retail distribution of food and grocery, prices
are collected by the current field survey; traditional and scanner data are
combined through weights derived from the expenditure survey
 The aim of this presentation is to show and discuss the experiments
carried out by our research group using SD from a sampling perspective, to
compare sampling schemes for different index calculation approaches

1. Context: The international use of SD
 SD are used in several countries for the compilation of the CPI:
Switzerland, Norway, the Netherlands and Sweden for some years,
Belgium and Denmark only since 2016
 Scanner data are exploited in different ways:
 Replacing price collection in the stores, without changing the traditional
principles of computing the price indices (Swiss Federal Statistical
Office)
 As a universe from which samples of references can be selected
following different methods (Norway and Sweden)
 Extensively to compile price indices, without a strict sample selection,
but with consequences for the theoretical definition of the index
(Netherlands)

1. ISTAT and international context
SCANNER DATA
 Electronic sales data, containing both prices and quantities sold for every
item purchased by consumers in modern distribution
Opportunities
 To improve the CPI also from the sampling
perspective, by considering a dynamic population approach and different index
formulas for the lowest aggregation level (elementary indices), which makes it possible to take
market dynamics into account
 To compute more accurate indices, since SD potentially allow
the expenditure share of each item to be included in the indices
Issues to be addressed
 Attrition of products, temporarily missing products, entry of new products,
relaunches, and volatility of prices and quantities, mainly due to sales.
These aspects need to be addressed from both a theoretical and a practical
point of view (de Haan et al., 2016)

OBJECTIVE
 The general aim was to evaluate the use of SD for the compilation of
elementary price indices (the lowest aggregation level of the price index calculation, on which
the subsequent aggregations are based; in the Italian case, the consumption segments)
 from a sampling perspective
 considering both static and dynamic populations
 following a simulation approach
The experiments were developed in two phases:
 In the first phase, following a fixed basket approach, probability and
nonprobability selection schemes of series (references identified by
EAN and outlet codes) were compared
 In the second phase, an experiment was carried out to evaluate the differences
between a fixed and a flexible basket approach, trying
to measure the magnitude of sampling error and bias for different index
formulas under both approaches

The DATA used for the experimental phase
 Data refer to the first Italian provinces for which ISTAT obtained data, for the food and
grocery sector (which covers 11% of the total index; modern distribution
covers 55% of the sector)
 Weekly data on turnover and quantities per EAN code (European Article
Number, or GTIN, Global Trade Item Number) and outlet make it possible to obtain
weekly unit values
 A series (= EAN by outlet) is considered present in a specific month if it has
non-zero turnover in at least one of the first three full
weeks
 Our experiments are based on a panel of PERMANENT SERIES
 Permanent series are defined as having non-zero turnover in at
least one relevant week (one of the first three full weeks) of every month for
13 months
 Permanent series generally provide good coverage of the total turnover
 Each experiment focused on one province, some consumption segments
(Italian COICOP6) or markets, and permanent series
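
The rule for presence and permanence can be sketched in a few lines. This is only an illustration: the record layout (EAN, outlet, month, week of the month, turnover), the function names and the toy data are assumptions, not the format of the Nielsen files.

```python
from collections import defaultdict

def unit_value(turnover, quantity):
    """Weekly unit value: turnover divided by quantity sold (None if nothing was sold)."""
    return turnover / quantity if quantity > 0 else None

def permanent_series(records, months):
    """Return the series (EAN, outlet) present in every month of `months`.

    records: iterable of (ean, outlet, month, week_in_month, turnover), where
    week_in_month numbers the full weeks of the month from 1 (hypothetical layout).
    A series is present in a month if it has non-zero turnover in at least one
    of the first three full weeks.
    """
    present = defaultdict(set)  # series -> months in which it is present
    for ean, outlet, month, week, turnover in records:
        if week in (1, 2, 3) and turnover > 0:
            present[(ean, outlet)].add(month)
    required = set(months)
    return {series for series, seen in present.items() if required <= seen}

# Toy data with two months only (the slides use 13 months).
records = [
    ("800001", "out1", "2013-12", 1, 120.0),
    ("800001", "out1", "2014-01", 3, 80.0),
    ("800002", "out1", "2013-12", 2, 50.0),
    ("800002", "out1", "2014-01", 4, 40.0),  # week 4 is not a relevant week
    ("800003", "out2", "2013-12", 1, 0.0),   # zero turnover does not count
    ("800003", "out2", "2014-01", 1, 30.0),
]
panel = permanent_series(records, ["2013-12", "2014-01"])
```

Only the first series survives in this toy panel: the second is sold only outside the relevant weeks in January, and the third has zero turnover in December.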

2. The ISTAT sampling experiments of SD: the theoretical context
FRAMEWORK of the experiments
 Two selection stages for each domain (the whole province, not only the chief
towns as in the current field survey)
o Outlets selected through probability sampling from the archive
o Items (references) selected either by probability sampling or by cut-off selection
 For the items, the definition of the population and of the parameter of interest is crucial:
o Static population → fixed approach – fixed sample / direct index (weighted or
unweighted)
o The indices at the lowest aggregation level are obtained from the monthly
price ratios (micro-indices) with a fixed base (December of the previous year)
o Dynamic population → dynamic approach – dynamic sample / chained index
(weighted or unweighted)
o The elementary indices are calculated on matched items: only price
relatives of items sold in two consecutive months enter the
index formulas (flexible basket)
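
The difference between the two approaches can be illustrated with an unweighted (Jevons) index. This is a minimal sketch with invented prices. In particular, the direct index here simply drops a base-month item once it disappears, whereas a production system would impute or replace it.

```python
import math

def jevons(p_from, p_to):
    """Unweighted geometric mean of price relatives over items present in both months."""
    matched = [k for k in p_from if k in p_to]
    if not matched:
        raise ValueError("no matched items")
    return math.exp(sum(math.log(p_to[k] / p_from[k]) for k in matched) / len(matched))

def direct_index(prices_by_month):
    """Static approach: each month is compared with the base month (first element),
    so only items of the base-month basket can enter the index."""
    base = prices_by_month[0]
    return [jevons(base, current) for current in prices_by_month]

def chained_index(prices_by_month):
    """Dynamic approach: month-on-month indices on matched items, multiplied together."""
    index = [1.0]
    for previous, current in zip(prices_by_month, prices_by_month[1:]):
        index.append(index[-1] * jevons(previous, current))
    return index

prices = [
    {"a": 1.00, "b": 2.00},             # base month (December)
    {"a": 1.10, "b": 2.00, "c": 5.00},  # item c enters the market
    {"a": 1.10, "c": 6.00},             # item b disappears, c becomes dearer
]
direct = direct_index(prices)
chained = chained_index(prices)
```

In the last month the direct index only sees item a, while the chained index also picks up the price change of the new item c.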

The bilateral indices in the Static and Dynamic approach
 Static population → fixed basket / direct index
 The loss of representativeness of the basket (shrinkage effect) is addressed through
the yearly change of the index base and the renewal of the basket
 The fixed basket approach does not exploit all the information provided by
scanner data and does not take population dynamics into account: new products are
ignored
 Dynamic population → matched model / chained index
 Chained indices take into account the movements of prices within the considered time
interval, and they solve the shrinkage effect over time caused by the attrition of a fixed basket of
products
 Period-on-period chaining of weighted indices introduces chain drift: a bias due to the
non-transitivity of chained weighted price indices, which arises because quantities
sold and expenditure shares respond asymmetrically before and after a sale (de Haan and van der Grient, 2011;
Ivancic et al., 2011)
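
A toy example may help to see chain drift. The prices and quantities below are invented: one item goes on sale, is stockpiled, and then prices and quantities return exactly to their starting values. A direct Törnqvist index between the first and last period is 1, while the chained one is not.

```python
import math

def tornqvist(p0, q0, p1, q1):
    """Bilateral Törnqvist index between two periods (lists aligned by item)."""
    e0 = [p * q for p, q in zip(p0, q0)]
    e1 = [p * q for p, q in zip(p1, q1)]
    s0 = [e / sum(e0) for e in e0]  # expenditure shares in the first period
    s1 = [e / sum(e1) for e in e1]  # expenditure shares in the second period
    log_index = sum(0.5 * (a + b) * math.log(y / x)
                    for a, b, x, y in zip(s0, s1, p0, p1))
    return math.exp(log_index)

# Two items over four periods; item 0 is on sale in period 1.
prices = [[1.0, 1.0], [0.5, 1.0], [1.0, 1.0], [1.0, 1.0]]
# Quantities spike during the sale, dip right after it (stockpiling),
# then return to their initial level.
quantities = [[10, 10], [50, 5], [2, 10], [10, 10]]

chained = 1.0
for t in range(1, len(prices)):
    chained *= tornqvist(prices[t - 1], quantities[t - 1], prices[t], quantities[t])

direct = tornqvist(prices[0], quantities[0], prices[-1], quantities[-1])
```

The sale price enters with a large expenditure share, the return to the regular price with a small one, so the chained index ends below 1 although nothing has changed between the first and last period.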

Aggregation formulas for the elementary indices
The price indices considered are:
 Fisher index (superlative, weighted): preferred by economic theory, it uses
quantities from both periods and allows for substitution effects
 Törnqvist index (superlative, weighted): a weighted geometric mean of price relatives, whose weights are
the average of the expenditure shares in the two periods
 Lowe index: a modified Laspeyres index in which the weights combine the prices of a base month (December
of the previous year) with the quantities of a base year (the previous
year)
 Jevons index: unweighted geometric mean of price relatives
 In the experiments, unbiased estimators of these indices were defined
through the sampling weights
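
The four formulas can be written compactly as follows. This is a sketch, not the estimators used in the experiments: the optional argument `w` holds sampling weights and gives a simple plug-in estimator, which is an assumption about how the weights could enter; with `w=None` the functions return the index of the units supplied.

```python
import math

def _wsum(w, a, b):
    """Weighted sum of products a_i * b_i."""
    return sum(wi * ai * bi for wi, ai, bi in zip(w, a, b))

def _weights(w, n):
    return [1.0] * n if w is None else list(w)

def jevons(p0, pt, w=None):
    """Geometric mean of price relatives (unweighted apart from sampling weights)."""
    w = _weights(w, len(p0))
    return math.exp(sum(wi * math.log(b / a) for wi, a, b in zip(w, p0, pt)) / sum(w))

def lowe(p0, pt, qb, w=None):
    """Base-month prices p0, current prices pt, base-year quantities qb."""
    w = _weights(w, len(p0))
    return _wsum(w, pt, qb) / _wsum(w, p0, qb)

def fisher(p0, q0, pt, qt, w=None):
    """Geometric mean of the Laspeyres and Paasche indices."""
    w = _weights(w, len(p0))
    laspeyres = _wsum(w, pt, q0) / _wsum(w, p0, q0)
    paasche = _wsum(w, pt, qt) / _wsum(w, p0, qt)
    return math.sqrt(laspeyres * paasche)

def tornqvist(p0, q0, pt, qt, w=None):
    """Geometric mean of price relatives weighted by average expenditure shares."""
    w = _weights(w, len(p0))
    e0 = [wi * p * q for wi, p, q in zip(w, p0, q0)]
    et = [wi * p * q for wi, p, q in zip(w, pt, qt)]
    return math.exp(sum(0.5 * (a / sum(e0) + b / sum(et)) * math.log(y / x)
                        for a, b, x, y in zip(e0, et, p0, pt)))

# Invented example: two items, the first one becomes 20% dearer.
p0, pt = [1.0, 2.0], [1.2, 2.0]
q0, qt, qb = [10, 5], [8, 5], [100, 60]
results = {
    "jevons": jevons(p0, pt),
    "lowe": lowe(p0, pt, qb),
    "fisher": fisher(p0, q0, pt, qt),
    "tornqvist": tornqvist(p0, q0, pt, qt),
}
```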

First experiments: Comparison of probability and nonprobability
sampling selection schemes
 Non-probability sample: cut-off selection of series based on thresholds
of covered turnover; the index is compiled using all series covering 60%
or 80% of the turnover of the previous year (2013) in each outlet for the consumption
segment. Reference method: most sold items in each outlet
for representative products (current fixed basket approach)
 Probability sampling: two-stage sampling design with stratification of the
first-stage units
 Stratification of outlets by chain (6 chains: Conad, Coop, Esselunga,
Auchan, Carrefour, Selex) and outlet type (hypermarket and
supermarket)
 The outlet sample size was fixed at 30
out of the 121 outlets available for Turin, selected with equal
probabilities
 The sample size for the second stage (EANs) is given by a 5%
sampling rate, with selection by Sampford pps sampling
 The universe is the set of panel series
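
The cut-off rule can be sketched as below, for one outlet and one consumption segment. The function name, the identifiers and the turnover figures are invented for illustration; how ties and the exact threshold boundary were handled in the experiments is not stated in the slides.

```python
def cutoff_selection(turnover, threshold):
    """Select the best-selling series until they cover `threshold` of total turnover.

    turnover: dict series_id -> previous-year turnover within one outlet
    and consumption segment. Returns the selected ids, largest first.
    """
    total = sum(turnover.values())
    selected, covered = [], 0.0
    ranked = sorted(turnover.items(), key=lambda kv: kv[1], reverse=True)
    for series_id, value in ranked:
        if covered >= threshold * total:
            break
        selected.append(series_id)
        covered += value
    return selected

turnover = {"ean_a": 400.0, "ean_b": 250.0, "ean_c": 200.0, "ean_d": 100.0, "ean_e": 50.0}
sel60 = cutoff_selection(turnover, 0.60)
sel80 = cutoff_selection(turnover, 0.80)
```

With these figures the 60% threshold keeps two series and the 80% threshold keeps three.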

Further experiments: Comparison among probability sampling schemes
 Three different sampling designs have been considered:
 Stratified one-stage sampling
 Cluster sampling (outlet)
 Two-stage sampling (outlet and EAN)
 The universe of EANs is stratified by market (ECR groups) in six consumption
segments (coffee, pasta, mineral water, olive oil, spumante and ice cream)
 Different criteria of sample allocation (proportional and Neyman) for both outlets
and elementary items (EANs), and different methods for the selection of sampling
units (SRS and PPS), have been considered
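
The two allocation criteria mentioned above can be compared on a toy stratification. The stratum sizes and standard deviations are invented; the functions return fractional sample sizes and leave rounding to the user.

```python
def proportional_allocation(n, sizes):
    """n_h proportional to the stratum size N_h."""
    total = sum(sizes.values())
    return {h: n * size / total for h, size in sizes.items()}

def neyman_allocation(n, sizes, std_devs):
    """n_h proportional to N_h * S_h, where S_h is the stratum standard deviation."""
    total = sum(sizes[h] * std_devs[h] for h in sizes)
    return {h: n * sizes[h] * std_devs[h] / total for h in sizes}

# Hypothetical strata: few hypermarkets with more variable indices,
# many supermarkets with less variable indices.
sizes = {"hyper": 21, "super": 100}
std_devs = {"hyper": 0.04, "super": 0.01}

prop = proportional_allocation(30, sizes)
neyman = neyman_allocation(30, sizes, std_devs)
```

Neyman allocation moves sample towards the stratum with the larger variability, here the hypermarkets.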
Results of the experiments in the first phase
The comparison between probability and nonprobability selection
schemes was made for each price index, taking the corresponding
universe index value (panel series SD) as the true reference value
Figure 1 – First experiments: Jevons, Lowe and Fisher indices computed on samples of series selected with
different selection schemes - Coffee and pasta segments, Turin province, year 2014
[Figure panels: COFFEE and PASTA; Jevons, Lowe and Fisher indices]

Results of the experiments in the first phase
Absolute Relative Bias (ARB) and Relative Sampling Error (RSE) distributions of monthly Lowe, Fisher and Jevons
indices by consumption segment (Sampford sampling)

Coffee consumption segment
Index   Measure  Min      Q1       Me       Q3       Max
Lowe    ARB      -0.0005  -0.0014  -0.0013  -0.0008  -0.0017
Lowe    RSE      1.32     1.38     1.41     1.43     1.52
Fisher  ARB      0.0005   0.0015   0.0021   0.0030   0.0039
Fisher  RSE      1.14     1.36     1.53     1.73     1.98
Jevons  ARB      -0.0199  -0.0485  -0.0400  -0.0353  -0.0609
Jevons  RSE      1.05     1.09     1.14     1.20     1.24

Pasta consumption segment
Index   Measure  Min      Q1       Me       Q3       Max
Lowe    ARB      0.0008   -0.0003  -0.0001  0.0005   -0.0007
Lowe    RSE      1.13     1.20     1.28     1.32     1.38
Fisher  ARB      -0.0009  -0.0043  -0.0034  -0.0025  -0.0051
Fisher  RSE      1.21     1.61     1.67     1.83     2.06
Jevons  ARB      -0.0017  0.0031   0.0105   0.0209   0.0314
Jevons  RSE      0.85     0.92     1.04     1.10     1.22
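
Bias and sampling error of an index estimator can be approximated by repeated sampling from a known universe. The sketch below uses conventional definitions (absolute relative bias and relative sampling error, both as fractions of the true value), simple random sampling and synthetic price relatives. These are assumptions: the exact definitions, units and sampling scheme behind the tables above may differ.

```python
import math
import random
import statistics

def jevons(relatives):
    """Geometric mean of price relatives."""
    return math.exp(sum(math.log(r) for r in relatives) / len(relatives))

def arb_and_rse(estimates, true_value):
    """Absolute relative bias and relative sampling error over simulation replicates."""
    mean_estimate = statistics.fmean(estimates)
    arb = abs(mean_estimate - true_value) / true_value
    rse = statistics.stdev(estimates) / true_value
    return arb, rse

rng = random.Random(2014)
universe = [rng.uniform(0.9, 1.2) for _ in range(1000)]  # synthetic price relatives
true_index = jevons(universe)                            # index on the whole universe

# 500 replicates of a simple random sample of 50 series.
estimates = [jevons(rng.sample(universe, 50)) for _ in range(500)]
arb, rse = arb_and_rse(estimates, true_index)
```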

First simulation study: summary of the evidence
The results highlight the following evidence on the performance of the different
series selection schemes
 Probability pps sampling produces approximately unbiased estimates for the indices that use
weights (Lowe and Fisher)
 Increasing the sampling rate markedly reduces the bias of all
indices, in addition to the expected reduction of sampling error
 Stratified one-stage sampling produces more efficient estimators, but for practical
reasons the selection needs to pass through outlets
 Cluster sampling produces estimators that perform well, but the high
number of products sold in each outlet rapidly increases the number of second-stage units (SSU)
 Two-stage sampling produces estimators that perform quite well, probably
because most of the variability is among outlets
 In general, performance depends on the shape of the turnover distribution of items within the
consumption segments

4. Experiments of Static and Dynamic approaches (second phase)
 The second experimental phase attempts to compare
the static and dynamic approaches by evaluating the total error of the
index estimates
 Simulation of alternative population scenarios
 Artificial population: starting from the panel, new products are
introduced according to monthly birth rates and old products are
removed according to a survival function (both the monthly birth and survival
rates were estimated on the real open population)
 From a sampling perspective the two approaches refer to different
sampling designs:
 STATIC: two-stage sampling (outlets, EANs)
 DYNAMIC: cluster sampling (outlets)
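
An artificial open population of this kind can be generated along the following lines. The sketch assumes a constant monthly birth rate and a constant monthly survival probability, whereas the experiments used rates estimated month by month on the real population; all names and values are hypothetical.

```python
import random

def simulate_population(initial_ids, months, birth_rate, survival_prob, seed=0):
    """Return the set of product ids on the market in each month.

    birth_rate: number of new products per month, as a share of the current population.
    survival_prob: probability that a product sold in one month is still sold
    in the next one (a constant-hazard survival function).
    """
    rng = random.Random(seed)
    alive = set(initial_ids)
    next_id = max(alive) + 1
    history = [set(alive)]
    for _ in range(months):
        # Iterate in sorted order so that a given seed always gives the same result.
        survivors = {i for i in sorted(alive) if rng.random() < survival_prob}
        n_births = round(birth_rate * len(alive))
        births = set(range(next_id, next_id + n_births))
        next_id += n_births
        alive = survivors | births
        history.append(set(alive))
    return history

history = simulate_population(range(100), months=12, birth_rate=0.03, survival_prob=0.97)
# Products matched in two consecutive months (the basis of the dynamic approach)
matched = [len(a & b) for a, b in zip(history, history[1:])]
# Products of the base month still on the market at the end (fixed basket attrition)
base_survivors = len(history[0] & history[-1])
```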

Scenarios implemented under Static approach
(A) Panel (Static approach)
All the products enter the
computation of indices
(B) Static approach without
disappearing products
The products present in the base month
(December) are followed throughout
the year. New products are not
considered by definition. However,
disappearing products are assumed to
be present
(C) “Pure” Static approach
Only the fixed basket approach is
applied. New products are not
considered

Scenarios implemented under Dynamic approach
(A) Panel (dynamic approach)
All the products enter the
computation of indices.
(D) Dynamic approach
Only the products matched in two
consecutive months enter the
computation of price indices.

STATIC vs DYNAMIC – Jevons with and without Threshold

Comments on the experiments on
static and dynamic approaches
 The simulation made it possible to test the two approaches and to
highlight their practical and theoretical implications
 Non-sampling error seems to affect the estimators of price indices
under the dynamic approach more than under the static
approach
 The sampling error of the estimator under the dynamic approach
is much lower than under the static approach because of the difference in
sample size

5. The sample design for the selection of outlets from the Nielsen universe
 For 2017 a random sample of outlets was selected from the Nielsen
universe of hypermarkets and supermarkets, covering all Italian provinces and 16
chains, to start experimenting with index production
 To allocate the sample of 2100 outlets efficiently, the variability in the
universe of the price indices for the initial set of 37 provinces and
6 chains was studied
 Monthly price indices with the Jevons, Laspeyres and Lowe
aggregation formulas were computed for each available outlet of the
37 provinces
o Models for analysing the variance of the outlet monthly indices
o Study of the distributions of the standard deviation of outlet monthly indices

Variability of the parameter of interest in the population of outlets
[Figure: standard deviation of monthly indices by province, 2015; panels: Jevons, Laspeyres, Lowe]
 The three aggregation formulas provide the same results.
 There is variability among provinces, but no clear pattern that could be used to improve the
allocation.

The allocation of the sample of outlets from the Nielsen universe
The Nielsen universe for 2017:
16 chains (out of 25) authorised Nielsen to provide data to
ISTAT, covering nearly 80% of total turnover (but not all outlets are
available)
 Stratification of stores into around 1000 strata, by cross-classification of
 province (107)
 chain (16)
 type (hypermarket/supermarket)
 No clear evidence was available to guide the allocation of the sample,
so for 2017 we chose a compromise allocation:
 2100 sample outlets distributed among strata as a compromise
between uniform allocation, allocation proportional to turnover and allocation proportional to the
number of outlets
 In 2018 a slightly reduced sample of outlets was used for the index compilation
following the current fixed basket approach
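
One way to build such a compromise is to average the three allocation criteria. The slides do not say how the criteria were combined, so the equal weights in `mix`, the cap at the number of available outlets and the toy strata are all assumptions.

```python
def compromise_allocation(n, turnover, n_outlets, mix=(1 / 3, 1 / 3, 1 / 3)):
    """Blend uniform, turnover-proportional and outlet-count-proportional shares.

    turnover, n_outlets: dicts keyed by stratum.
    mix: weight of each criterion (equal weights are an assumption).
    Sizes are rounded and capped at the outlets available in the stratum,
    so they may not add up exactly to n.
    """
    strata = list(turnover)
    total_turnover = sum(turnover.values())
    total_outlets = sum(n_outlets.values())
    allocation = {}
    for h in strata:
        share = (mix[0] / len(strata)
                 + mix[1] * turnover[h] / total_turnover
                 + mix[2] * n_outlets[h] / total_outlets)
        allocation[h] = min(round(n * share), n_outlets[h])
    return allocation

# Hypothetical strata (province, chain, type) with invented turnover and outlet counts.
turnover = {("RM", "chain1", "HYP"): 50.0,
            ("RM", "chain1", "SUP"): 30.0,
            ("RM", "chain2", "SUP"): 20.0}
n_outlets = {("RM", "chain1", "HYP"): 4,
             ("RM", "chain1", "SUP"): 40,
             ("RM", "chain2", "SUP"): 16}
allocation = compromise_allocation(12, turnover, n_outlets)
```

Compared with a purely proportional allocation, the uniform component guarantees some sample in small strata, while the turnover component favours the few large hypermarkets.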

Allocation of outlets sample
is a compromise between:
 a uniform allocation
 proportional allocation to turnover
 proportional allocation to the
number of outlets
ROME
Chain    Total  HYP  SUP
CHAIN 1     16    2   14
CHAIN 2      8    2    6
CHAIN 3      2    -    2
CHAIN 4      9    2    7
CHAIN 5     12    2   10
CHAIN 6      4    2    2
CHAIN 7      6    -    6
CHAIN 8      4    -    4
The aim of the first sample of outlets,
drawn for 2017 and 2018, is to gather
information on the variability of the
indices for allocating more efficiently
the sample of outlets for 2019.

Conclusion and open issues
 The experiments on sampling designs showed the feasibility of the
procedures for item selection and index compilation
 Computational and process issues related to the implementation of the static
and dynamic approaches should be taken into account
 We plan to use resampling methods to evaluate the variance of
the estimates
 One limitation of the current implementation of the dynamic approach is the use of
unweighted indices, whereas SD would allow the use of both prices and quantities
 Further studies are needed to evaluate the properties of multilateral
indices, which could overcome chain drift and other issues of
weighted indices in the dynamic approach
 We have not worked on this project since May 2018, but we plan to
follow some of the Committee's suggestions and study an optimal
design for the selection of outlets for 2019, for the implementation of the
dynamic approach, which is being tested by the survey sector

References
Chessa, A. G., Verburg, J. and Willenborg, L. (2017). A Comparison of Price Index Methods for Scanner Data.
de Haan, J., Opperdoes, E. and Schut, C. M. (1999). Item selection in the Consumer Price Index: Cut-off versus probability
sampling. Survey Methodology, 25(1), 31-41.
de Haan, J. and van der Grient, H. A. (2011). Eliminating chain drift in price indexes based on scanner data. Journal of
Econometrics, 161(1), 36-46.
de Haan, J., Willenborg, L. and Chessa, A. G. (2016). An overview of price index methods for scanner data.
Feldmann, B. (2015). Scanner-data-current-practice.
http://www.istat.it/en/files/2015/09/5-WS-Scanner-data-Rome-1-2-Oct_Feldmann-Scanner-data-current-pratice.pdf
Gábor, E. and Vermeulen, P. (2014). New evidence in elementary index bias.
ILO, IMF, OECD, Eurostat, United Nations, World Bank (2004). Consumer Price Index Manual: Theory and Practice.
Geneva: ILO Publications.
Ivancic, L., Diewert, W. E. and Fox, K. J. (2011). Scanner Data, Time Aggregation and the Construction of Price Indexes.
Journal of Econometrics, 161(1), 24-35.
Norberg, A. (2014). Sampling of scanner data products offers in the Swedish CPI. Draft version 8, Statistics Sweden.
Nygaard, R. (2010). Chain drift in a monthly chained superlative price index. Workshop on scanner data, Geneva, 10 May.
van der Grient, H. A. and de Haan, J. (2010). The Use of Supermarket Scanner Data in the Dutch CPI. Paper presented
at the Joint ECE/ILO Workshop on Scanner Data, Eurostat.
van der Grient, H. and de Haan, J. (2011). Scanner Data Price Indexes: The “Dutch” Method versus Rolling Year GEKS.
Paper presented at the twelfth meeting of the Ottawa Group, 4-6 May 2011, Wellington, New Zealand.
Vermeulen, B. C. and Herren, H. M. (2006). Rents in Switzerland: sampling and quality adjustment. 11th Meeting of the
Ottawa Group, Neuchâtel, 27-29 May.

Thank you for your attention !
