Machine learning, Kristian Harald Myklatun, Statistics Norway
1. Machine learning in the Norwegian CPI
10/18/2019
Kristian Harald Myklatun, Statistics Norway
2. Consumer price index
• Monthly statistic
• Changes in consumer prices
• Follows COICOP classification standard:
• Five-tier structure
• 12 divisions at second-tier level
• Many groups under that
3. New data sources
• Previously mostly web questionnaires and price collectors
• Increasing availability of new data sources: transaction data
• Problem: transaction data arrive unclassified
[Figure: pie chart of data sources. Categories: Web questionnaires, Scanner data, Internet, Rents, Other electronic data, Other. Shares shown: 28%, 22%, 17%, 16%, 10%, 7%.]
4. Index for food and non-alcoholic drinks
• 117 different groups
• Monthly chaining
• 400-1200 new items each month
5. Classification process
• Manual classification
• Receive only: text, internal classification code
• Use imperfect mapping catalogue (from the chains' internal classification)
• Requires manual checks on all new items
• Solution: supervised machine learning
6. Supervised machine learning
• Self-learning pattern recognition
• Use a sample of correctly labelled data (training set)
• Algorithm finds a mapping function between features (text) and labels (COICOP group)
• Feed new data, get:
• Hopefully correct label
• Likelihood of label being correct
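The train-then-predict loop described here can be sketched in a few lines. This is a minimal illustration only: it swaps in a small naive Bayes classifier (not the support vector machine used in the presentation) because it fits in pure Python and yields a probability directly. The item texts and the labels GROUP_PIZZA and GROUP_MILK are hypothetical placeholders, not real training data or COICOP codes.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, label) pairs. Returns the counts used for prediction."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    label_counts = Counter()             # label -> number of training items
    vocabulary = set()
    for text, label in examples:
        words = text.split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary

def predict(model, text):
    """Return (best_label, probability) for one new item text."""
    word_counts, label_counts, vocabulary = model
    total_items = sum(label_counts.values())
    log_scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_items)
        for word in text.split():
            # Laplace smoothing so an unseen word does not zero out the score.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocabulary)))
        log_scores[label] = score
    # Normalise the log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(weights.values())
    probabilities = {label: w / norm for label, w in weights.items()}
    best = max(probabilities, key=probabilities.get)
    return best, probabilities[best]

# Hypothetical labelled training set.
training_set = [
    ("PIZZA ITALIENSK PEPPERONI", "GROUP_PIZZA"),
    ("PIZZA ITAL CLASSICO", "GROUP_PIZZA"),
    ("MELK LETT 1L", "GROUP_MILK"),
    ("MELK HEL 1L", "GROUP_MILK"),
]
model = train(training_set)
label, probability = predict(model, "PIZZA HALAL")
print(label, round(probability, 2))
```

The new item shares the word PIZZA with one group only, so that group wins even though HALAL was never seen in training.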
7. Support vector machine
• Two-class problem: want to separate blues from reds
• Maximize the distance from the boundary to the closest item of each class
• New items on the left side of the boundary are labelled blue; new items on the right side, red
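The margin idea can be made concrete with a small sketch. It is not an SVM solver: it only evaluates the "distance to the closest item" criterion for a few hand-picked candidate lines and keeps the one with the largest margin. The points, the candidate lines, and all function names are assumptions made for illustration.

```python
import math

# Hypothetical 2-D training items: blue on the left, red on the right.
blue = [(1.0, 1.0), (1.0, 3.0), (2.0, 2.0)]
red = [(4.0, 2.0), (5.0, 1.0), (5.0, 3.0)]
points = blue + red
labels = [-1] * len(blue) + [+1] * len(red)   # -1 = blue, +1 = red

def signed_distance(w, b, point):
    """Signed distance from a point to the line w.x + b = 0."""
    return (w[0] * point[0] + w[1] * point[1] + b) / math.hypot(w[0], w[1])

def margin(w, b):
    """Smallest distance from the line to any item (negative if one is misclassified)."""
    return min(y * signed_distance(w, b, p) for p, y in zip(points, labels))

# Candidate separating lines (w, b), chosen by hand.
candidates = {
    "x = 3": ((1.0, 0.0), -3.0),
    "x = 3.5": ((1.0, 0.0), -3.5),
    "2x + y = 8": ((2.0, 1.0), -8.0),
}
best_name = max(candidates, key=lambda name: margin(*candidates[name]))
best_w, best_b = candidates[best_name]

def classify(point):
    """Label a new item by the side of the best line it falls on."""
    return "blue" if signed_distance(best_w, best_b, point) < 0 else "red"

for name, (w, b) in candidates.items():
    print(f"{name}: margin = {margin(w, b):.3f}")
print("best:", best_name)
print(classify((0.0, 2.0)), classify((6.0, 2.0)))
```

All three lines separate the two classes, but they differ in how close the nearest item sits to the boundary. A real support vector machine searches over all possible lines rather than a fixed shortlist.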
8. Bag of words
• Way of representing text as numbers
• Matrix format
• Each unique word is a column
• Each item/document is a row
• Frequency count for each word
Item text                           PIZZA  ENVA2917  ENVA2868  ITALIENSK  ITAL  PEPPERONI  VEGETARPIZZA  HALAL  CLASSICO
PIZZA ITALIENSK PEPPERONI ENVA2917  1      1         0         1          0     1          0             0      0
PIZZA ITAL CLASSICO ENVA2917        1      1         0         0          1     0          0             0      1
VEGETARPIZZA ENVA2917               0      1         0         0          0     0          1             0      0
PIZZA HALAL ENVA2868                1      0         1         0          0     0          0             1      0
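The matrix above can be reproduced with a short sketch. It assumes that words are separated by whitespace and that the column order is supplied explicitly (here copied from the slide); the function name bag_of_words is hypothetical.

```python
from collections import Counter

def bag_of_words(documents, vocabulary):
    """Return one row of word counts per document, one column per vocabulary word."""
    rows = []
    for text in documents:
        counts = Counter(text.split())
        rows.append([counts[word] for word in vocabulary])
    return rows

# Item texts and column order as shown on the slide.
items = [
    "PIZZA ITALIENSK PEPPERONI ENVA2917",
    "PIZZA ITAL CLASSICO ENVA2917",
    "VEGETARPIZZA ENVA2917",
    "PIZZA HALAL ENVA2868",
]
vocabulary = ["PIZZA", "ENVA2917", "ENVA2868", "ITALIENSK", "ITAL",
              "PEPPERONI", "VEGETARPIZZA", "HALAL", "CLASSICO"]

matrix = bag_of_words(items, vocabulary)
for text, row in zip(items, matrix):
    print(row, text)
```

Note that VEGETARPIZZA is its own column and does not count as PIZZA, because whole words are matched, not substrings.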
9. Performance: accuracy
• Able to classify around 90 percent of new items correctly
• Close to the realistic upper bound
• Training time varies with the size of the training set, and can be improved
10. Performance: certainty
• Probability: model-assigned likelihood of an item belonging to a given class
• Relative: relative probability between the model's first and second choice
• Trade-off between accuracy and time
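One way to use the two certainty measures is to accept confident predictions automatically and route the rest to manual checks. The sketch below assumes that "relative" means the ratio of the top probability to the runner-up, and the threshold values 0.8 and 3.0 are hypothetical, not figures from the presentation. The class names and probabilities are made up for illustration.

```python
def certainty(probabilities):
    """Return (best_label, top_probability, ratio of top to runner-up)."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    (best, p1), (_, p2) = ranked[0], ranked[1]
    relative = p1 / p2 if p2 > 0 else float("inf")
    return best, p1, relative

def needs_manual_check(probabilities, min_probability=0.8, min_relative=3.0):
    """True if either certainty measure falls below its (assumed) threshold."""
    _, p1, relative = certainty(probabilities)
    return p1 < min_probability or relative < min_relative

# Hypothetical model outputs for two new items.
confident = {"GROUP_A": 0.90, "GROUP_B": 0.06, "GROUP_C": 0.04}
uncertain = {"GROUP_A": 0.50, "GROUP_B": 0.45, "GROUP_C": 0.05}

for scores in (confident, uncertain):
    best, p1, relative = certainty(scores)
    action = "manual check" if needs_manual_check(scores) else "accept"
    print(best, p1, round(relative, 2), action)
```

Raising either threshold sends more items to manual checks, which costs time but should catch more errors; lowering them does the opposite.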
11. Conclusion
• Substantial reduction in time
• Probably better quality (humans make mistakes too)
• However: requires training, investment to implement
• Nonetheless: Try it