Presented at GOR 23 on 21 September 2023 together with Kantar Public. We give a hands-on introduction to the recently published DIN SPEC 32792 "Semantic Data Annotations to Support AI-enabled Data Processing", which enables controlled use of Artificial Intelligence in data processing, with predictable, trustworthy results.
DIN SPEC 32792 is available online at https://www.beuth.de/en/technical-rule/din-spec-32792/368566803
This work is part of Kantar Public's Public Data Innovation Hub (https://www.kantarpublic.com/de/Unsere-Expertise/daten-und-fakten/public-data-innovation-hub) and was presented in the session on 'ML and AI in Surveys' at the General Online Research Conference 2023 (https://www.conftool.org/gor23/index.php?page=browseSessions&form_session=25). Supporting write-ups are available at https://www.inspirient.com/case-studies/survey_analysis_automation.php and https://www.inspirient.com/case-studies/survey_quality_assurance.php
Work on DIN SPEC 32792 was in part funded by DIN Deutsches Institut für Normung e. V.
-- Full abstract below --
Relevance & Research Question
To reliably automate the processing of survey datasets, as we introduced at last year’s GOR, it is critical to reduce the ambiguity inherent in raw survey data. This includes, for example, the question of whether a dimension is a meta, socio-demographic or response variable, or which time dimension can be used for wave comparisons. To address this problem, we propose a set of standardized data annotations that enable automated processing across heterogeneous datasets. Our research question is thus to what extent annotations can facilitate more efficient and accurate survey data processing.
Methods & Data
After introducing Inspirient’s AI system for automated survey data processing at Kantar Public last year, we evaluated which steps of automated processing are most susceptible to ambiguities, thus obstructing efficient processing and leading to costly manual readjustments. We distilled these experiences into a set of data annotations that address these ambiguities and formalized them in DIN SPEC 32792. The updated AI system now incorporates these annotations in its internal reasoning process, and we evaluated their effectiveness in multiple real-world survey data processing tasks.
Results
Our results show that the proposed annotations can help facilitate fully automated processing of survey data without the need for iterative manual readjustments, e.g., for quality assurance and analytical evaluation, while maintaining quality and consistency of results. Examples included in our talk cover basic and advanced use, incl. socio-demographic / response annotations to generate contingency tables as well as dependent / independent variable annotations to automate multivariate regression analyses.
Added Value
Our work provides a foundation upon which more efficient automation / AI initiatives in the field of social research / survey analytics [...]
Standardized Annotations for Survey Datasets: Enabling Automated Quality Assurance and Evaluation
1. Standardized Annotations for Survey Datasets: Enabling Automated Quality Assurance and Evaluation
Dr. Georg Wittenburg ▪ Inspirient
Martin Rathje ▪ Kantar Public
General Online Research Conference 2023 (GOR 23)
Kassel, 21 September 2023
2. 1
General Online Research Conference 2023, 21.09.2023
Martin Rathje
Senior Consultant at Kantar Public
Dr. Georg Wittenburg
Co-founder and CEO at Inspirient
3. AI-supported Survey Analytics
Six Reasons for AI / Automation
1) Improved Efficiency
Deliver detailed analytical evaluations in less time and with more deep-dives
2) Quick Deliverables
Quickly deliver key results to clients, based on pre-produced stories and prioritized insights
3) Preview Results during FW Data Collection
Preview results with client to gather feedback and steer project
4) Advanced Data Validation
Collected data is more thoroughly checked for data quality issues
5) Analytical Methods, Applied at Scale
Valuable insights surfaced with automated advanced methods
6) Paper Trail for Every Result
Detailed, audit-proof documentation of how each analytical result was derived
Brief Recap
Survey raw data (SPSS, Excel, CSV, …)
AI-supported Survey Analytics
Client-facing deliverables (PowerPoint, Excel, …)
4. AI-supported Survey Analytics @ GOR
Four sessions at GOR 23 on 20-22 September
Analysis (qualitative / quantitative)
Synthesis with Generative AI
Control
Hands-on Experience
Community Discussion
“Insights beyond Human Intuition” (Talk at GOR 22)
“Trustworthy Analytics with Generative AI: Four Use Cases for ChatGPT/GPT-4” (Talk, 21 Sep 2023, 17:00, Rooms 1117/1118)
“Trustworthy Analytics with Generative AI: ChatGPT/GPT-4 Hands-on” (Workshop, 20 Sep 2023, 13:30, Room 1124)
Panel Discussion on AI / LLM (22 Sep 2023, 11:45, Room 1101)
“Standardized Annotations for Survey Datasets: Enabling Automated Quality Assurance and Evaluation” (Talk, 21 Sep 2023, 12:00, Room 1124; current session)
5. What needs to be known about survey data to auto-generate crosstabs?
Quick Example (1/3)
6. What needs to be known about survey data to auto-generate crosstabs?
Quick Example (2/3)
Required
• ‘Gender’ is an independent socio-demographic variable
• ‘Planning to vote’ is a dependent response variable
Optional
• Number of samples needed to display results as trustworthy
Focus of this talk
7. What needs to be known about survey data to auto-generate crosstabs?
Quick Example (3/3)
• ‘Gender’: DEMOGRAPHIC_VARIABLE
• ‘Planning to vote’: SURVEY_RESPONSE
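To illustrate how these two annotations remove the ambiguity, here is a minimal sketch in Python. It is not Inspirient's implementation: the `annotations` mapping, the sample rows and all function names are assumptions made up for this example. It pairs every column annotated DEMOGRAPHIC_VARIABLE with every column annotated SURVEY_RESPONSE and counts the combinations.

```python
from collections import Counter, defaultdict

# Hypothetical annotation map: column name -> annotation identifiers.
annotations = {
    "Gender": ["DEMOGRAPHIC_VARIABLE"],
    "Planning to vote": ["SURVEY_RESPONSE"],
}

# Made-up respondent records, for illustration only.
rows = [
    {"Gender": "female", "Planning to vote": "yes"},
    {"Gender": "female", "Planning to vote": "no"},
    {"Gender": "male", "Planning to vote": "yes"},
    {"Gender": "male", "Planning to vote": "yes"},
]


def columns_with(annotation_id, annotations):
    """Return all columns that carry the given annotation identifier."""
    return [col for col, ids in annotations.items() if annotation_id in ids]


def crosstab(rows, row_var, col_var):
    """Count how often each (row_var, col_var) value pair occurs."""
    table = defaultdict(Counter)
    for record in rows:
        table[record[row_var]][record[col_var]] += 1
    return {key: dict(counts) for key, counts in table.items()}


def auto_crosstabs(rows, annotations):
    """Build one crosstab per (demographic, response) column pair."""
    demographics = columns_with("DEMOGRAPHIC_VARIABLE", annotations)
    responses = columns_with("SURVEY_RESPONSE", annotations)
    return {
        (d, r): crosstab(rows, d, r)
        for d in demographics
        for r in responses
    }


tables = auto_crosstabs(rows, annotations)
for (demo, resp), table in tables.items():
    print(f"{resp} by {demo}: {table}")
```

Without the annotations, the code would have to guess which column goes on which axis; with them, the pairing is a lookup.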
8. Introducing DIN SPEC 32792
Semantic Data Annotations to Support AI-enabled Data Processing
• Co-written by Inspirient GmbH, Kantar Public and Scheer PAS Deutschland GmbH
• Funded (in part) by Deutsches Institut für Normung (DIN)
• Published in July 2023
• Contents:
  - Semantic data annotations, incl. for survey use case
  - Annotation syntax in JSON
  - Name space and vendor extensions
  - Examples and best practices
• Available free of charge at www.beuth.de/en/technical-rule/din-spec-32792/368566803
9. Annotation Syntax and Storage/Exchange Definitions
Annotation Syntax in EBNF
annotation = identifier, [ "(" , parameter list , ")" ] ;
identifier = uppercase letter , { uppercase letter | "_" } ;
parameter list = parameter, { "," , parameter } ;
parameter = numeric value | string value ;
numeric value = integer value | floating point value ;
integer value = [ "-" ] , digit , { digit } ;
floating point value = integer value , "." , digit , { digit } ;
string value = '"' , { all visible characters - '"' }, '"' ;
uppercase letter = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I"
| "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R"
| "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" ;
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
all visible characters = ? all visible Unicode characters ? ;
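The grammar above is small enough to parse by hand. The following Python sketch is one possible reading of it, not a reference parser from the specification. Two assumptions go beyond the grammar as shown: whitespace around parameters is tolerated, and any character other than a double quote is accepted inside a string value. The function name `parse_annotation` and the example parameters are made up for illustration.

```python
import re

# identifier = uppercase letter , { uppercase letter | "_" } ;
# Note: taken literally, this rule admits no digits in identifiers.
IDENTIFIER = re.compile(r"[A-Z][A-Z_]*")

# parameter = numeric value | string value ;
# The floating point alternative comes first so "0.5" is not read as "0".
PARAMETER = re.compile(r'\s*(?:(-?\d+\.\d+)|(-?\d+)|"([^"]*)")\s*')


def parse_annotation(text):
    """Parse 'NAME' or 'NAME(p1, p2, ...)' into (identifier, parameters)."""
    match = IDENTIFIER.match(text)
    if match is None:
        raise ValueError(f"no identifier at start of {text!r}")
    identifier = match.group(0)
    rest = text[match.end():]
    if rest == "":
        return identifier, []
    if not (rest.startswith("(") and rest.endswith(")")):
        raise ValueError(f"malformed parameter list in {text!r}")

    body = rest[1:-1]
    parameters = []
    pos = 0
    while True:
        param = PARAMETER.match(body, pos)
        if param is None:
            raise ValueError(f"invalid parameter at offset {pos} in {body!r}")
        floating, integer, string = param.groups()
        if floating is not None:
            parameters.append(float(floating))
        elif integer is not None:
            parameters.append(int(integer))
        else:
            parameters.append(string)
        pos = param.end()
        if pos == len(body):
            return identifier, parameters
        if body[pos] != ",":
            raise ValueError(f"expected ',' at offset {pos} in {body!r}")
        pos += 1


print(parse_annotation("SURVEY_RESPONSE"))
# The parameters below are illustrative; which annotations take which
# parameters is defined in the specification, not here.
print(parse_annotation('DEFINE_AS_MISSING(-99, "n/a", 0.5)'))
```

An empty parameter list such as `NAME()` is rejected, since the grammar requires at least one parameter once parentheses are present.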
JSON File Format for Storing / Exchanging, in JSON Schema [1]
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://din.de/norm/32792.schema.json",
  "title": "Semantic Data Annotation for AI-supported Data Processing (DIN SPEC 32792)",
  "description": "Schema for a single Semantic Data Annotation for AI-supported Data Processing (DIN SPEC 32792)",
  "type": "object",
  "properties": {
    "annotationID": {
      "description": "The unique identifier for an annotation",
      "type": "string"
    },
    "annotationParameter": {
      "description": "The parameter of the annotation",
      "type": "array",
      "items": {
        "anyOf": [
          { "type": "string" },
          { "type": "number" }
        ]
      },
      "minItems": 0,
      "uniqueItems": true
    }
  },
  "required": [ "annotationID" ]
}
as per Section 5 of DIN SPEC 32792
1. Alternatives exist for project-specific embedded JSON and for stand-alone processing
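As a rough illustration of what the schema above requires of a single annotation record, the following Python sketch re-implements its constraints by hand using only the standard library. It is an assumption-laden stand-in for a real JSON Schema validator, and the function name `validate_annotation` and the sample records are invented for this example.

```python
import json


def validate_annotation(record):
    """Return a list of problems; an empty list means the record conforms."""
    if not isinstance(record, dict):
        return ["record must be a JSON object"]

    problems = []

    # "required": [ "annotationID" ], and its type must be string.
    if "annotationID" not in record:
        problems.append("missing required property 'annotationID'")
    elif not isinstance(record["annotationID"], str):
        problems.append("'annotationID' must be a string")

    # "annotationParameter" is optional: an array of unique strings/numbers.
    if "annotationParameter" in record:
        params = record["annotationParameter"]
        if not isinstance(params, list):
            problems.append("'annotationParameter' must be an array")
        else:
            seen = set()
            for item in params:
                # bool is a subclass of int in Python, but not a JSON number.
                if isinstance(item, bool) or not isinstance(item, (str, int, float)):
                    problems.append(f"parameter {item!r} is neither string nor number")
                    continue
                # Keep strings and numbers apart so that "1" and 1 stay distinct.
                key = ("string", item) if isinstance(item, str) else ("number", item)
                if key in seen:
                    problems.append(f"duplicate parameter {item!r}")
                seen.add(key)
    return problems


# Illustrative records; the parameter value is made up.
good = json.loads('{"annotationID": "DEFINE_AS_MISSING", "annotationParameter": [-99]}')
bad = json.loads('{"annotationParameter": ["a", "a"]}')

print(validate_annotation(good))
print(validate_annotation(bad))
```

In practice, a general-purpose JSON Schema validator would replace this hand-written check; the sketch only makes the individual constraints visible.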
10. Independent/dependent variable-annotations enable regression analysis
Example (1/3): Automated Regression Analysis
• ‘Age Group’, ‘Party ID’, ‘Party Registration’, ‘Political ideology’, ‘Presidential vote choice’: INDEPENDENT_VARIABLE
• ‘The cost of holding the recall election is a waste of taxpayer money’: DEPENDENT_VARIABLE
Dataset: Berkeley IGS Q4 2021 Survey Data
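A minimal sketch of how such annotations could be turned into a regression specification is given below. This is an assumption about one possible approach, not the method used in the talk. The `annotations` mapping and the function name are hypothetical, the output is a formula string in the style used by patsy/statsmodels (with `Q("...")` quoting column names and `C(...)` marking categorical predictors), and treating every predictor as categorical is a simplification made here.

```python
# Hypothetical annotation map built from the slide's example columns.
annotations = {
    "Age Group": ["INDEPENDENT_VARIABLE"],
    "Party ID": ["INDEPENDENT_VARIABLE"],
    "Party Registration": ["INDEPENDENT_VARIABLE"],
    "Political ideology": ["INDEPENDENT_VARIABLE"],
    "Presidential vote choice": ["INDEPENDENT_VARIABLE"],
    "The cost of holding the recall election is a waste of taxpayer money": [
        "DEPENDENT_VARIABLE"
    ],
}


def regression_spec(annotations):
    """Derive the outcome, the predictors and a formula from annotations."""
    dependent = [c for c, ids in annotations.items() if "DEPENDENT_VARIABLE" in ids]
    independent = [c for c, ids in annotations.items() if "INDEPENDENT_VARIABLE" in ids]
    if len(dependent) != 1:
        raise ValueError("expected exactly one DEPENDENT_VARIABLE column")
    if not independent:
        raise ValueError("expected at least one INDEPENDENT_VARIABLE column")

    # Assumption: all predictors are treated as categorical.
    terms = " + ".join(f'C(Q("{col}"))' for col in independent)
    return {
        "dependent": dependent[0],
        "independent": independent,
        "formula": f'Q("{dependent[0]}") ~ {terms}',
    }


spec = regression_spec(annotations)
print(spec["formula"])
```

The resulting string could then be handed to a statistics package of choice; the point here is only that the roles of the variables no longer need to be specified by hand.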
12. Annotations on survey duration and interviewers enable quality checks
Example (2/3): Automated Survey Quality Control
• ‘dauer’: SURVEY_DURATION
• ‘intnr’: SURVEY_INTERVIEWER
Dataset: Kantar Public Quality Test Dataset (sanitized)
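The kind of check these two annotations make possible can be sketched as follows. This is an illustrative Python example under stated assumptions, not the quality checks shown in the talk: the sample rows are made up, and the rule "shorter than half the overall median duration" is a hypothetical threshold chosen only for the example.

```python
from collections import defaultdict
from statistics import median

# Column names as on the slide; the annotation map itself is hypothetical.
annotations = {"dauer": ["SURVEY_DURATION"], "intnr": ["SURVEY_INTERVIEWER"]}

# Made-up interview records (duration in seconds, interviewer number).
rows = [
    {"dauer": 600, "intnr": 1},
    {"dauer": 620, "intnr": 1},
    {"dauer": 580, "intnr": 1},
    {"dauer": 200, "intnr": 2},
    {"dauer": 640, "intnr": 2},
    {"dauer": 610, "intnr": 3},
]


def column_for(annotation_id, annotations):
    """Return the single column carrying the given annotation."""
    matches = [c for c, ids in annotations.items() if annotation_id in ids]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one {annotation_id} column")
    return matches[0]


def quality_report(rows, annotations, speeder_fraction=0.5):
    """Flag unusually short interviews and summarize durations per interviewer."""
    duration = column_for("SURVEY_DURATION", annotations)
    interviewer = column_for("SURVEY_INTERVIEWER", annotations)

    overall = median(r[duration] for r in rows)
    cutoff = speeder_fraction * overall  # hypothetical threshold
    speeders = [i for i, r in enumerate(rows) if r[duration] < cutoff]

    by_interviewer = defaultdict(list)
    for r in rows:
        by_interviewer[r[interviewer]].append(r[duration])

    return {
        "median_duration": overall,
        "speeders": speeders,
        "median_by_interviewer": {k: median(v) for k, v in by_interviewer.items()},
    }


report = quality_report(rows, annotations)
print(report)
```

Because the columns are located through their annotations, the same function works unchanged on a dataset whose columns are named differently.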
17. Natural language text-annotation enables topic extraction, coding and analysis
Example (3/3): Automated Natural Language Text Analysis
• ‘Finally, what else should SFO be doing to help travelers feel their health is being protected when using SFO?’: INDEPENDENT_VARIABLE and NATURAL_LANGUAGE_TEXT
Dataset: San Francisco Airport COVID-19 Recovery (2020)
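As a very rough stand-in for topic extraction, the sketch below counts frequent terms in every column annotated NATURAL_LANGUAGE_TEXT. Plain term counting is a deliberately simple substitute chosen for this example and is not the text analysis method presented in the talk. The column name `open_feedback`, the responses and the stop word list are all made up.

```python
import re
from collections import Counter

# Hypothetical annotation map; the column name is invented.
annotations = {
    "open_feedback": ["INDEPENDENT_VARIABLE", "NATURAL_LANGUAGE_TEXT"],
    "age_group": ["DEMOGRAPHIC_VARIABLE"],
}

# Made-up free-text responses, for illustration only.
rows = [
    {"open_feedback": "More hand sanitizer stations", "age_group": "18-29"},
    {"open_feedback": "Enforce masks", "age_group": "30-44"},
    {"open_feedback": "Hand sanitizer at gates", "age_group": "45-59"},
    {"open_feedback": "masks and sanitizer", "age_group": "60+"},
    {"open_feedback": None, "age_group": "30-44"},
]

# Tiny illustrative stop word list.
STOPWORDS = {"and", "the", "for", "more"}


def top_terms(rows, annotations, n=3):
    """Return the n most frequent terms per NATURAL_LANGUAGE_TEXT column."""
    text_columns = [
        c for c, ids in annotations.items() if "NATURAL_LANGUAGE_TEXT" in ids
    ]
    result = {}
    for column in text_columns:
        counts = Counter()
        for record in rows:
            text = (record.get(column) or "").lower()
            for word in re.findall(r"[a-z']+", text):
                # Skip stop words and very short tokens.
                if word not in STOPWORDS and len(word) > 2:
                    counts[word] += 1
        result[column] = counts.most_common(n)
    return result


terms = top_terms(rows, annotations)
print(terms)
```

The annotation tells the pipeline which columns contain free text at all, so that coding and topic analysis are applied there and not to categorical columns such as `age_group`.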
22. DIN SPEC 32792 defines 53 semantic data annotations
Annotations for Basic Data Properties
DEFAULT_VALUE, IGNORE_VALUE, MORE_IS_BETTER, LESS_IS_BETTER, HAS_SUBTOTALS, ID, NATURAL_LANGUAGE_TEXT
Annotations for Basic Statistical Properties
NOMINAL, CATEGORICAL, ORDINAL, SUMMABLE, MAXIMIZABLE, MINIMIZABLE
Annotations for Advanced Statistical Properties
DEPENDENT_VARIABLE, INDEPENDENT_VARIABLE, REFERENCE_CATEGORY, WEIGHTING_FACTOR
Annotations for Data Processing
FILTER_ON, FILTER_ON_DOMINANT_DOMAIN, FILTER_ON_DOMINANT_DOMAIN_BY_VALUE, FILTER_ON_TOP_3_BY_VALUE, FILTER_ON_TOP_10_BY_VALUE, DRILL_DOWN, DRILL_DOWN_ON_DOMINANT_DOMAIN, JOINABLE_ID_VALUES, JOINABLE, USE_AS_IS
Annotations for Process Data (use case-specific)
PROCESS_VARIABLE, PROCESS_ID, INSTANCE_VARIABLE, INSTANCE_ID, EVENT_VARIABLE, EVENT_ID, NEXT_EVENT_ID, PREVIOUS_EVENT_ID, EVENT_CATEGORY, EVENT_START_TIMESTAMP, EVENT_END_TIMESTAMP, EVENT_DURATION, EVENT_RESULT, EVENT_OWNER
Annotations for Surveys / Opinion Polling (use case-specific)
DEFINE_AS_MISSING, DEFINE_AS_NO_OPINION, DEMOGRAPHIC_VARIABLE, MULTIPLE_RESPONSE_VARIABLE, MULTI_PUNCH_VARIABLE, SURVEY_CASE_ID, SURVEY_DURATION, SURVEY_INTERVIEWER, SURVEY_META, SURVEY_MODE, SURVEY_RESPONSE, SURVEY_WAVE
Miscellaneous Annotations
ANONYMIZE
27 general-purpose and 26 use case-specific annotations
DIN SPEC 32792 also has provisions for vendor-specific extensions to support additional semantics and/or use cases
23. Fully Automate the Analysis of Your Survey Data with DIN SPEC 32792!
Survey Raw Data
DIN SPEC 32792
DIN SPEC 32792 Results by Inspirient
+
24. 23
Thank you for your attention!
Martin Rathje
martin.rathje@kantar.com
Dr. Georg Wittenburg
georg.wittenburg@inspirient.com