Standardized Annotations for Survey Datasets: Enabling Automated Quality Assurance and Evaluation

Dr. Georg Wittenburg ▪ Inspirient
Martin Rathje ▪ Kantar Public
General Online Research Conference 2023 (GOR 23)
Kassel, 21 September 2023

Martin Rathje
Senior Consultant at Kantar Public
Dr. Georg Wittenburg
Co-founder and CEO at Inspirient

AI-supported Survey Analytics
Six Reasons for AI / Automation
1) Improved Efficiency
Deliver detailed analytical evaluations in less time and
with more deep-dives
2) Quick Deliverables
Quickly deliver key results to clients, based on pre-produced stories and prioritized insights
3) Preview Results during Fieldwork (FW) Data Collection
Preview results with client to gather feedback and steer
project
4) Advanced Data Validation
Collected data is more thoroughly checked for data quality
issues
5) Analytical Methods, Applied at Scale
Valuable insights surfaced with automated advanced
methods
6) Paper Trail for Every Result
Detailed, audit-proof documentation of how each analytical
result was derived
Brief Recap
Survey raw data (SPSS, Excel, CSV, …) → AI-supported Survey Analytics → Client-facing deliverables (PowerPoint, Excel, …)

AI-supported Survey Analytics @ GOR
Four sessions at GOR 23 on 20-22 September
Analysis (qualitative / quantitative) · Synthesis with Generative AI · Control · Hands-on Experience · Community Discussion
• “Insights beyond Human Intuition” (Talk at GOR 22)
• “Trustworthy Analytics with Generative AI: Four Use Cases for ChatGPT/GPT-4” (Talk, 21 Sep 2023, 17:00, Rooms 1117/1118)
• “Trustworthy Analytics with Generative AI: ChatGPT/GPT-4 Hands-on” (Workshop, 20 Sep 2023, 13:30, Room 1124)
• Panel Discussion on AI / LLM (22 Sep 2023, 11:45, Room 1101)
• “Standardized Annotations for Survey Datasets: Enabling Automated Quality Assurance and Evaluation” (Talk, 21 Sep 2023, 12:00, Room 1124; current session)

What needs to be known about survey data to auto-generate crosstabs?
Quick Example (1/3)

Quick Example (2/3)
Required
• ‘Gender’ is an independent socio-demographic variable
• ‘Planning to vote’ is a dependent response variable
Optional
• Minimum number of samples required to display results as trustworthy
Focus of this talk

Quick Example (3/3)
• ‘Gender’ → DEMOGRAPHIC_VARIABLE
• ‘Planning to vote’ → SURVEY_RESPONSE
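The quick example can be sketched in a few lines of Python: once the two columns carry their annotations, a tool can pair every DEMOGRAPHIC_VARIABLE with every SURVEY_RESPONSE and count the combinations. This is a minimal sketch, not the authors' implementation. The column-to-annotations dictionary, the sample rows and the `min_n` threshold (standing in for the optional minimum sample size) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory annotation map: column name -> list of identifiers.
# (Not the DIN SPEC 32792 storage format, just a convenient shape for a sketch.)
annotations = {
    "Gender": ["DEMOGRAPHIC_VARIABLE"],
    "Planning to vote": ["SURVEY_RESPONSE"],
}

# Invented sample rows, for illustration only.
rows = [
    {"Gender": "female", "Planning to vote": "yes"},
    {"Gender": "female", "Planning to vote": "no"},
    {"Gender": "male", "Planning to vote": "yes"},
    {"Gender": "male", "Planning to vote": "yes"},
]


def columns_with(annotations, identifier):
    """Return all columns that carry the given annotation identifier."""
    return [col for col, ids in annotations.items() if identifier in ids]


def crosstab(rows, row_var, col_var):
    """Count respondents for each (row_var value, col_var value) pair."""
    counts = defaultdict(Counter)
    for r in rows:
        counts[r[row_var]][r[col_var]] += 1
    return {group: dict(cells) for group, cells in counts.items()}


def auto_crosstabs(rows, annotations, min_n=0):
    """Build one crosstab per (demographic, response) pair.

    Groups with fewer than min_n respondents are dropped (assumed rule).
    """
    tables = {}
    for demo in columns_with(annotations, "DEMOGRAPHIC_VARIABLE"):
        for resp in columns_with(annotations, "SURVEY_RESPONSE"):
            table = crosstab(rows, demo, resp)
            tables[(demo, resp)] = {
                group: cells
                for group, cells in table.items()
                if sum(cells.values()) >= min_n
            }
    return tables


tables = auto_crosstabs(rows, annotations)
```

No column names are hard-coded in `auto_crosstabs`; the annotations alone decide which tables are produced.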

Introducing DIN SPEC 32792
• Co-written by Inspirient GmbH, Kantar Public and Scheer
PAS Deutschland GmbH
• Funded (in part) by Deutsches Institut für Normung (DIN)
• Published in July 2023
• Contents:
  – Semantic data annotations, incl. for survey use case
  – Annotation syntax in JSON
  – Name space and vendor extensions
  – Examples and best practices
• Available free of charge at www.beuth.de/en/technical-rule/din-spec-32792/368566803
Semantic Data Annotations to Support AI-enabled Data Processing

Annotation Syntax and Storage/Exchange Definitions
Annotation Syntax
in EBNF
annotation = identifier, [ "(" , parameter list , ")" ] ;
identifier = uppercase letter , { uppercase letter | "_" } ;
parameter list = parameter, { "," , parameter } ;
parameter = numeric value | string value ;
numeric value = integer value | floating point value ;
integer value = [ "-" ] , digit , { digit } ;
floating point value = integer value , "." , digit , { digit } ;
string value = '"' , { all visible characters - '"' }, '"' ;
uppercase letter = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I"
                 | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R"
                 | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" ;
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
all visible characters = ? all visible Unicode characters ? ;
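To make the grammar concrete, here is a minimal Python sketch of a parser for a single annotation string. It is an illustration, not a reference implementation. It makes three assumptions beyond the EBNF above: whitespace around tokens is tolerated, a string value may contain any character except the double quote, and digits are accepted after the first letter of an identifier, because listed annotations such as FILTER_ON_TOP_3_BY_VALUE contain one. The function name `parse_annotation` is hypothetical.

```python
import re
from typing import List, Tuple, Union

Param = Union[int, float, str]

# identifier, optionally followed by "(" parameter list ")".
# Assumption: digits are allowed after the first uppercase letter.
_ANNOTATION = re.compile(r'([A-Z][A-Z0-9_]*)\s*(?:\((.*)\))?', re.DOTALL)

# One parameter: floating point value | integer value | string value.
# The floating point alternative comes first so that "1.5" is not read as "1".
_PARAM = re.compile(r'\s*(?:(-?\d+\.\d+)|(-?\d+)|"([^"]*)")\s*')


def parse_annotation(text: str) -> Tuple[str, List[Param]]:
    """Parse e.g. 'NAME(1, "x")' into ('NAME', [1, 'x'])."""
    m = _ANNOTATION.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"not a valid annotation: {text!r}")
    identifier, raw = m.group(1), m.group(2)
    params: List[Param] = []
    if raw is None:  # no parentheses at all, e.g. "SURVEY_DURATION"
        return identifier, params
    pos = 0
    while True:
        pm = _PARAM.match(raw, pos)
        if pm is None:  # also rejects an empty list such as "ID()"
            raise ValueError(f"invalid parameter list in {text!r}")
        float_s, int_s, str_s = pm.groups()
        if float_s is not None:
            params.append(float(float_s))
        elif int_s is not None:
            params.append(int(int_s))
        else:
            params.append(str_s)
        pos = pm.end()
        if pos == len(raw):
            break
        if raw[pos] != ",":
            raise ValueError(f"expected ',' at offset {pos} in {raw!r}")
        pos += 1
    return identifier, params
```

For example, `parse_annotation("SURVEY_DURATION")` yields the identifier with an empty parameter list. The parameters in `parse_annotation('DEFINE_AS_MISSING(-1, "n/a")')` are invented here to show mixed numeric and string values.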
JSON File Format for Storing / Exchanging
in JSON Schema [1]
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://din.de/norm/32792.schema.json",
  "title": "Semantic Data Annotation for AI-supported Data Processing (DIN SPEC 32792)",
  "description": "Schema for a single Semantic Data Annotation for AI-supported Data Processing (DIN SPEC 32792)",
  "type": "object",
  "properties": {
    "annotationID": {
      "description": "The unique identifier for an annotation",
      "type": "string"
    },
    "annotationParameter": {
      "description": "The parameter of the annotation",
      "type": "array",
      "items": {
        "anyOf": [
          { "type": "string" },
          { "type": "number" }
        ]
      },
      "minItems": 0,
      "uniqueItems": true
    }
  },
  "required": [ "annotationID" ]
}
as per Section 5 of DIN SPEC 32792
1. Alternatives exist for project-specific embedded JSON and for stand-alone processing
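The constraints in the schema above are small enough to check by hand. The following Python sketch does so using only the standard library: `annotationID` is required and must be a string, and `annotationParameter` is an optional array of unique strings or numbers. It is an illustration under stated assumptions, not a full JSON Schema validator. The function name `validate_annotation` and the sample records are hypothetical.

```python
import json


def validate_annotation(record):
    """Return a list of problems; an empty list means the record passed."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = []

    # "required": [ "annotationID" ], with "type": "string"
    if "annotationID" not in record:
        problems.append("annotationID is required")
    elif not isinstance(record["annotationID"], str):
        problems.append("annotationID must be a string")

    # "annotationParameter" is optional; "minItems": 0 allows an empty array.
    params = record.get("annotationParameter", [])
    if not isinstance(params, list):
        problems.append("annotationParameter must be an array")
        return problems

    seen = set()
    for p in params:
        is_str = isinstance(p, str)
        # JSON booleans are not numbers, although bool subclasses int in Python.
        is_num = isinstance(p, (int, float)) and not isinstance(p, bool)
        if not (is_str or is_num):
            problems.append(f"parameter {p!r} is neither string nor number")
            continue
        key = (is_str, p)  # keeps the string "1" distinct from the number 1
        if key in seen:    # "uniqueItems": true
            problems.append(f"duplicate parameter {p!r}")
        seen.add(key)
    return problems


# Hypothetical records, for illustration only.
ok = json.loads('{"annotationID": "SURVEY_DURATION"}')
bad = json.loads('{"annotationParameter": [1, 1]}')
```

Here `validate_annotation(ok)` returns an empty list, while `validate_annotation(bad)` reports the missing `annotationID` and the duplicated parameter.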

Independent/dependent variable-annotations enable regression analysis
Example (1/3): Automated Regression Analysis
• ‘Age Group’, ‘Party ID’, ‘Party Registration’, ‘Political ideology’, ‘Presidential vote choice’ → INDEPENDENT_VARIABLE
• ‘The cost of holding the recall election is a waste of taxpayer money’ → DEPENDENT_VARIABLE
Berkeley IGS Q4 2021 Survey Data
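One way to see how these two annotations drive a regression is to derive the model specification from them. The Python sketch below builds a dummy-coded design matrix: every INDEPENDENT_VARIABLE column becomes a set of 0/1 indicator columns, and the single DEPENDENT_VARIABLE column becomes the response vector. It is a minimal sketch under assumptions, not the method used for the slides. The rows are invented, the first level in sort order serves as the reference category, and model fitting and numeric coding of the response are left out.

```python
DEP = "The cost of holding the recall election is a waste of taxpayer money"

# Hypothetical annotation map. Values must be lists, so that the "in" test
# below is an exact match and not a substring match.
annotations = {
    "Age Group": ["INDEPENDENT_VARIABLE"],
    "Party ID": ["INDEPENDENT_VARIABLE"],
    DEP: ["DEPENDENT_VARIABLE"],
}

# Invented sample rows, for illustration only.
rows = [
    {"Age Group": "18-29", "Party ID": "Democrat", DEP: "disagree"},
    {"Age Group": "30-64", "Party ID": "Republican", DEP: "agree"},
    {"Age Group": "65+", "Party ID": "Democrat", DEP: "agree"},
]


def design_matrix(rows, annotations):
    """Return (header, X, y) for a regression specified purely by annotations."""
    dependents = [c for c, ids in annotations.items() if "DEPENDENT_VARIABLE" in ids]
    independents = [c for c, ids in annotations.items() if "INDEPENDENT_VARIABLE" in ids]
    if len(dependents) != 1:
        raise ValueError("expected exactly one DEPENDENT_VARIABLE column")

    header = ["intercept"]
    indicators = []  # (column, level) pairs, one per indicator column
    for col in independents:
        levels = sorted({r[col] for r in rows})
        # Assumption: the first level in sort order is the reference category.
        for level in levels[1:]:
            header.append(f"{col}={level}")
            indicators.append((col, level))

    X = [
        [1.0] + [1.0 if r[col] == level else 0.0 for col, level in indicators]
        for r in rows
    ]
    y = [r[dependents[0]] for r in rows]
    return header, X, y


header, X, y = design_matrix(rows, annotations)
```

Because the column roles come from the annotations, the same function applies to any annotated dataset without edits.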

Annotations on survey duration and interviewers enable quality checks
Example (2/3): Automated Survey Quality Control
• ‘dauer’ → SURVEY_DURATION
• ‘intnr’ → SURVEY_INTERVIEWER
Kantar Public Quality Test Dataset (sanitized)
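With the duration and interviewer columns identified by annotation, simple quality checks can be run without knowing the column names in advance. The Python sketch below shows two such checks. The specific rules are assumptions made for illustration (flagging interviews shorter than half the median duration, and comparing median durations per interviewer), since the checks shown on the original slides are not recoverable from the text. The rows are invented, and durations are assumed to be in seconds.

```python
from collections import defaultdict
from statistics import median

# Hypothetical annotation map using the column names from the example.
annotations = {
    "dauer": ["SURVEY_DURATION"],
    "intnr": ["SURVEY_INTERVIEWER"],
}

# Invented sample rows, for illustration only (durations assumed in seconds).
rows = [
    {"intnr": 101, "dauer": 1200},
    {"intnr": 101, "dauer": 1100},
    {"intnr": 102, "dauer": 300},
    {"intnr": 102, "dauer": 350},
    {"intnr": 103, "dauer": 1300},
]


def column_for(annotations, identifier):
    """Return the single column carrying the given annotation."""
    matches = [col for col, ids in annotations.items() if identifier in ids]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one {identifier} column")
    return matches[0]


def flag_short_interviews(rows, annotations, ratio=0.5):
    """Indices of interviews shorter than ratio * overall median (assumed rule)."""
    dur = column_for(annotations, "SURVEY_DURATION")
    threshold = ratio * median(r[dur] for r in rows)
    return [i for i, r in enumerate(rows) if r[dur] < threshold]


def median_duration_by_interviewer(rows, annotations):
    """Median interview duration for each interviewer."""
    dur = column_for(annotations, "SURVEY_DURATION")
    who = column_for(annotations, "SURVEY_INTERVIEWER")
    by_interviewer = defaultdict(list)
    for r in rows:
        by_interviewer[r[who]].append(r[dur])
    return {k: median(v) for k, v in by_interviewer.items()}


short = flag_short_interviews(rows, annotations)
per_interviewer = median_duration_by_interviewer(rows, annotations)
```

In the invented data, both interviews by interviewer 102 fall below the threshold, which is the kind of pattern a reviewer would then inspect by hand.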

Natural-language text annotation enables topic extraction, coding and analysis
Example (3/3): Automated Natural Language Text Analysis
• ‘Finally, what else should SFO be doing to help travelers feel their health is being protected when using SFO?’ → INDEPENDENT_VARIABLE and NATURAL_LANGUAGE_TEXT
San Francisco Airport COVID-19 Recovery (2020)

DIN SPEC 32792 defines 53 semantic data annotations
Annotations for Basic Data Properties
DEFAULT_VALUE, IGNORE_VALUE,
MORE_IS_BETTER, LESS_IS_BETTER,
HAS_SUBTOTALS, ID
Annotations for Basic Statistical Properties
NOMINAL, CATEGORICAL, ORDINAL,
SUMMABLE, MAXIMIZABLE, MINIMIZABLE
Annotations for Advanced Statistical Properties
DEPENDENT_VARIABLE,
INDEPENDENT_VARIABLE,
REFERENCE_CATEGORY,
WEIGHTING_FACTOR
Annotations for Data Processing
FILTER_ON, FILTER_ON_DOMINANT_DOMAIN,
FILTER_ON_DOMINANT_DOMAIN_BY_VALUE,
FILTER_ON_TOP_3_BY_VALUE,
FILTER_ON_TOP_10_BY_VALUE,
DRILL_DOWN,
DRILL_DOWN_ON_DOMINANT_DOMAIN,
JOINABLE_ID_VALUES, JOINABLE, USE_AS_IS
Annotations for Process Data (use case-specific)
PROCESS_VARIABLE, PROCESS_ID,
INSTANCE_VARIABLE, INSTANCE_ID,
EVENT_VARIABLE, EVENT_ID,
NEXT_EVENT_ID, PREVIOUS_EVENT_ID,
EVENT_CATEGORY,
EVENT_START_TIMESTAMP,
EVENT_END_TIMESTAMP, EVENT_DURATION,
EVENT_RESULT, EVENT_OWNER
Annotations for Surveys / Opinion Polling
(use case-specific)
DEFINE_AS_MISSING,
DEFINE_AS_NO_OPINION,
DEMOGRAPHIC_VARIABLE,
MULTIPLE_RESPONSE_VARIABLE,
MULTI_PUNCH_VARIABLE, SURVEY_CASE_ID,
SURVEY_DURATION, SURVEY_INTERVIEWER,
SURVEY_META, SURVEY_MODE,
SURVEY_RESPONSE, SURVEY_WAVE
Miscellaneous Annotations
ANONYMIZE
27 general-purpose and 26 use case-specific annotations
DIN SPEC 32792 also has
provisions for vendor-specific
extensions to support additional
semantics and/or use cases

Fully Automate the Analysis of Your Survey Data with DIN SPEC 32792!
Survey Raw Data + DIN SPEC 32792 → DIN SPEC 32792 Results by Inspirient

Thank you for your attention!
Martin Rathje
martin.rathje@kantar.com
Dr. Georg Wittenburg
georg.wittenburg@inspirient.com